Module 1
Introduction: Machine learning, Examples of Machine Learning Problems, Learning versus
Designing, Training versus Testing, Characteristics of Machine learning tasks, Predictive and
descriptive tasks.
Machine learning
Arthur Samuel defined machine learning as a “field of study that gives computers the
capability to learn without being explicitly programmed”.
One of the biggest characteristics of machine learning is its ability to automate repetitive
tasks and thus increase productivity. A large number of organizations already use
machine-learning-powered paperwork and email automation. In the financial sector, for
example, a huge number of repetitive, data-heavy and predictable tasks need to be
performed, so this sector uses many different types of machine learning solutions. They
make accounting tasks faster, more insightful, and more accurate. Some aspects that
machine learning has already addressed include answering financial queries with the help
of chatbots, making predictions, managing expenses, simplifying invoicing, and
automating bank reconciliations.
4- THE ABILITY TO TAKE EFFICIENCY TO THE NEXT LEVEL WHEN MERGED WITH IOT
Thanks to the huge interest surrounding the IoT, machine learning has experienced a great
rise in popularity. Many companies designate IoT as a strategically significant area, and
many others have launched pilot projects to gauge the potential of IoT in the context of
business operations. But attaining financial benefits through IoT isn’t easy. To achieve
success, companies offering IoT consulting services and platforms need to clearly
determine the areas that will change with the implementation of IoT strategies, and many
of these businesses have failed to do so. In this scenario, machine learning is probably the
best technology for attaining higher levels of efficiency. By merging machine learning
with IoT, businesses can boost the efficiency of their entire production processes.
It’s a fact that building a positive credit score usually takes discipline, time, and lots of
financial planning for many consumers. For lenders, the consumer credit score is one of
the biggest measures of creditworthiness; it involves a number of factors including
payment history, total debt, and length of credit history. But wouldn’t it be great if there
were a simpler and better measure? With the help of machine learning, lenders can now
obtain a more comprehensive picture of the consumer. A bank can now predict whether a
customer is a low spender or a high spender and understand his or her tipping point of
spending. Apart from mortgage lending, financial institutions are using the same
techniques for other types of consumer loans.
Traditionally, data analysis has relied on trial-and-error methods, an approach that
becomes impossible when working with large and heterogeneous datasets. Machine
learning offers an effective alternative for analyzing massive volumes of data: by
developing fast, efficient algorithms and data-driven models that process data in real
time, it can generate accurate analyses and results.
7- BUSINESS INTELLIGENCE AT ITS BEST
Machine learning, when merged with big data analytics, can generate a high level of
business intelligence, which several different industries are using to drive strategic
initiatives. From retail to financial services to healthcare, and many more, machine
learning has already become one of the most effective technologies for boosting business
operations.
Descriptive analytics, or descriptive statistics, does exactly what the name implies: it
“describes”, or summarizes, raw data and turns it into something interpretable by humans.
Descriptive analytics describes the past, where the past refers to any point in time that an
event occurred, whether one minute ago or one year ago. Descriptive analytics is useful
because it allows us to learn from past behaviors and understand how they might
influence future outcomes.
The vast majority of the statistics we use fall into this category. (Think basic arithmetic
like sums, averages, percent changes.) Usually, the underlying data is a count or
aggregate of a filtered column of data to which basic math is applied. For all practical
purposes, there is an infinite number of these statistics. Descriptive statistics are useful
for showing things like total stock in inventory, average dollars spent per customer and
year-over-year change in sales. Common examples of descriptive analytics are reports
that provide historical insights regarding the company’s production, financials,
operations, sales, inventory and customers.
Use descriptive analytics when you need to understand at an aggregate level what is
going on in your company, and when you want to summarize and describe different
aspects of your business.
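As a concrete illustration of these aggregate statistics, here is a minimal sketch in Python,
assuming a hypothetical pandas DataFrame of yearly sales (all names and numbers are
illustrative):

```python
import pandas as pd

# Hypothetical sales data (illustrative only).
sales = pd.DataFrame({
    "year":      [2021, 2021, 2022, 2022],
    "revenue":   [120.0, 135.0, 150.0, 160.0],  # in thousands of dollars
    "customers": [40, 45, 48, 50],
})

# Descriptive statistics: aggregates of (filtered) columns.
total_revenue = sales["revenue"].sum()
avg_per_customer = sales["revenue"].sum() / sales["customers"].sum()

# Year-over-year change in revenue, as a percentage.
by_year = sales.groupby("year")["revenue"].sum()
yoy_change = by_year.pct_change() * 100

print(f"Total revenue: {total_revenue}")
print(f"Average (thousands of) dollars per customer: {avg_per_customer:.2f}")
print(yoy_change)
```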
Predictive analytics has its roots in the ability to “predict” what might happen; these
analytics are about understanding the future. Predictive analytics provides companies
with actionable insights based on data: estimates of the likelihood of a future outcome. It
is important to remember that no statistical algorithm can “predict” the future with 100%
certainty, because the foundation of predictive analytics is probability. Companies use
these statistics to forecast what might happen in the future.
These statistics try to take the data that you have, and fill in the missing data with best
guesses. They combine historical data found in ERP, CRM, HR and POS systems to
identify patterns in the data and apply statistical models and algorithms to capture
relationships between various data sets. Companies use predictive statistics and analytics
any time they want to look into the future. Predictive analytics can be used throughout the
organization, from forecasting customer behavior and purchasing patterns to identifying
trends in sales activities. They also help forecast demand for inputs from the supply
chain, operations and inventory.
One common application most people are familiar with is the use of predictive analytics
to produce a credit score. These scores are used by financial services to determine the
probability of customers making future credit payments on time. Typical business uses
include understanding how sales might close at the end of the year, predicting what items
customers will purchase together, or forecasting inventory levels based upon a myriad of
variables.
Use predictive analytics any time you need to know something about the future, or fill in
the information that you do not have.
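To make the credit-scoring example concrete, here is a minimal sketch using synthetic
data and hypothetical feature names (payment_history, total_debt): a logistic regression
model estimating the probability that a customer makes the next payment on time. This
is an illustrative sketch, not an actual scoring method used by lenders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic training data (hypothetical features):
#   payment_history = fraction of past bills paid on time,
#   total_debt      = outstanding debt in thousands.
X = np.column_stack([rng.uniform(0, 1, 200),     # payment_history
                     rng.uniform(0, 50, 200)])   # total_debt
# Label: 1 = paid the next credit payment on time (synthetic rule).
y = ((X[:, 0] > 0.5) & (X[:, 1] < 30)).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# The model outputs a probability, not a certainty --
# exactly the point made above about predictive analytics.
new_customer = [[0.8, 12.0]]
print(model.predict_proba(new_customer)[0, 1])  # P(on-time payment)
```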
Module 1.2
Machine learning Models: Geometric Models, Logical Models, and Probabilistic Models.
Models form the central concept in machine learning: they are what is learned from the
data in order to solve a given task.
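To make the three model families concrete, here is a minimal sketch, assuming
scikit-learn and a synthetic two-feature task: a Perceptron as a geometric model (it learns
a separating line in instance space), a decision tree as a logical model (it learns nested
if-then rules), and naive Bayes as a probabilistic model (it learns class-conditional
distributions).

```python
import numpy as np
from sklearn.linear_model import Perceptron          # geometric: separating hyperplane
from sklearn.tree import DecisionTreeClassifier      # logical: if-then rules
from sklearn.naive_bayes import GaussianNB           # probabilistic: distributions

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary task

# Fit one model from each family on the same data.
for model in (Perceptron(), DecisionTreeClassifier(max_depth=3), GaussianNB()):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))
```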
Module 1.3
Features: Feature types, Feature Construction and Transformation, Feature Selection.
Features determine much of the success of a machine learning application, because a
model is only as good as its features. A feature can be thought of as a kind of
measurement that can be easily performed on any instance.
Mathematically, they are functions that map from the instance space to some set of
feature values called the domain of the feature.
Features are the independent variables in machine learning models. What a specific
machine learning problem requires is a set of these features (independent variables), the
learned coefficients of these features, and the settings used to arrive at an appropriate
function or model, which are termed hyperparameters.
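As a minimal illustration of the definition above (a feature as a function from the instance
space to a domain of feature values), consider instances represented as records and
features as functions over them; all names here are purely illustrative.

```python
# An instance from some instance space; here, a simple record.
instance = {"title": "Cheap V1agra", "n_words": 7, "sender_known": False}

# Features are functions mapping an instance to a value in the
# feature's domain (Boolean, integer, real, ...).
def contains_spam_word(x) -> bool:   # domain: {True, False}
    return "v1agra" in x["title"].lower()

def length(x) -> int:                # domain: non-negative integers
    return x["n_words"]

print(contains_spam_word(instance), length(instance))
```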
Feature Transformation
Data preprocessing is one of the many crucial steps of any data science project. As we
know, real-life data is often very unorganized and messy, and cannot be used without
preprocessing. We first have to preprocess our data and then feed the processed data to
our models to get good performance. One part of preprocessing is feature
transformation.
Feature transformation refers to the family of algorithms that create new features from
the existing features. These new features may not have the same interpretation as the
original features, but they may have more explanatory power in the transformed space
than in the original space. Feature transformation can also be used for feature reduction.
It can be done in many ways: by linear combinations of the original features or by
applying non-linear functions. It also helps machine learning algorithms converge faster.
Some models, such as linear and logistic regression, assume that the variables follow a
normal distribution. More likely, variables in real datasets follow a skewed distribution.
By applying transformations to these skewed variables, we can map the skewed
distribution to an approximately normal one and so increase the performance of our
models.
The following transformation techniques can be applied to data sets:
1. Log Transformation: Generally, this transformation brings the data closer to a
normal distribution, though rarely exactly normal. It cannot be applied to features that
have negative values and is mostly applied to right-skewed data. It converts data from a
multiplicative scale to an additive scale, i.e., more linearly distributed data.
2. Power Transformation (Box-Cox and Yeo-Johnson): Box-Cox requires the input data
to be strictly positive (not even zero is acceptable), while Yeo-Johnson supports both
positive and negative data.
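A minimal sketch of these transformations, assuming numpy and scikit-learn: log1p
implements log(1 + x), and PowerTransformer implements both Box-Cox and
Yeo-Johnson.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(2)
# Right-skewed, strictly positive data (e.g., incomes).
x = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

# Log transformation: log1p(x) = log(1 + x), defined for x >= 0.
x_log = np.log1p(x)

# Box-Cox: requires strictly positive input.
x_bc = PowerTransformer(method="box-cox").fit_transform(x)

# Yeo-Johnson: also handles zero and negative values.
x_yj = PowerTransformer(method="yeo-johnson").fit_transform(x - 1.0)

print(x_log.mean(), x_bc.mean(), x_yj.mean())
```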
Feature Selection
Feature selection is a way of selecting the subset of the most relevant features from the
original features set by removing the redundant, irrelevant, or noisy features.
While developing a machine learning model, only a few variables in the dataset are
useful for building the model; the rest of the features are either redundant or irrelevant. If
we input the dataset with all of these redundant and irrelevant features, it may negatively
impact and reduce the overall performance and accuracy of the model. Hence, it is very
important to identify and select the most appropriate features from the data and remove
the irrelevant or less important features, which is done with the help of feature selection
in machine learning.
Feature selection is one of the important concepts of machine learning and highly
impacts the performance of the model. Machine learning works on the principle of
"garbage in, garbage out", so we always need to input the most appropriate and relevant
dataset to the model in order to get a better result.
A feature is an attribute that has an impact on a problem or is useful for the problem, and
choosing the important features for the model is known as feature selection. Each
machine learning process depends on feature engineering, which mainly contains two
processes: feature selection and feature extraction. Although feature selection and
extraction may have the same objective, they are completely different from each other.
The main difference between them is that feature selection selects a subset of the original
feature set, whereas feature extraction creates new features.
Feature selection is a way of reducing the input variables for the model by using only
relevant data, in order to reduce overfitting in the model.
So, we can define feature selection as "a process of automatically or manually selecting
the subset of the most appropriate and relevant features to be used in model building".
Feature selection is performed by either including the important features or excluding the
irrelevant features in the dataset, without changing them.
Before implementing any technique, it is important to understand the need for it, and the
same holds for feature selection. As we know, in machine learning it is necessary to
provide a pre-processed, good-quality input dataset in order to get better outcomes. We
collect a huge amount of data to train our model and help it learn better. Generally, such a
dataset consists of noisy data, irrelevant data, and some portion of useful data. Moreover,
the sheer amount of data slows down the training process of the model, and with noise
and irrelevant data the model may not predict and perform well. So, it is necessary to
remove noisy and less important data from the dataset, and feature selection techniques
are used to do this.
Selecting the best features helps the model perform well. For example, suppose we want
to create a model that automatically decides which car should be crushed for spare parts,
and we have a dataset for this. The dataset contains the model of the car, the year, the
owner's name, and the mileage. In this dataset, the name of the owner does not contribute
to the model's performance, as it does not help decide whether the car should be crushed;
so we can remove this column and select the rest of the features (columns) for model
building.
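A minimal sketch of that selection step, with purely illustrative column names:

```python
import pandas as pd

# Hypothetical car dataset (column names are illustrative).
cars = pd.DataFrame({
    "model": ["Alto", "Swift", "Nano"],
    "year": [2009, 2015, 2011],
    "owner_name": ["A. Rao", "B. Khan", "C. Das"],
    "miles": [120000, 60000, 90000],
})

# 'owner_name' carries no signal for the crush/keep decision,
# so we drop it and keep the remaining features.
X = cars.drop(columns=["owner_name"])
print(X.columns.tolist())
```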
There are mainly two types of Feature Selection techniques, which are:
● Supervised Feature Selection technique
Supervised Feature selection techniques consider the target variable and can be
used for the labelled dataset.
● Unsupervised Feature Selection technique
Unsupervised Feature selection techniques ignore the target variable and can be
used for the unlabelled dataset.
1. Wrapper Methods
In wrapper methods, feature selection is treated as a search problem: candidate subsets of
features are used to train a model, and the subset that gives the best performance is
selected.
2. Filter Methods
In filter methods, features are selected on the basis of statistical measures. These
methods do not depend on the learning algorithm and choose the features as a
pre-processing step.
Filter methods remove irrelevant features and redundant columns by ranking them with
different metrics. Their advantage is that they need little computational time and do not
overfit the data.
Some common techniques of Filter methods are as follows:
● Missing Value Ratio: The missing value ratio can be used to evaluate features
against a threshold value. It is computed as the number of missing values in a
column divided by the total number of observations. Variables whose ratio
exceeds the threshold can be dropped.
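A minimal sketch of this filter, computed with pandas (the threshold of 0.3 is an
arbitrary illustrative choice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [5.0, 6.0, 7.0, 8.0],
})

# Missing value ratio = missing values per column / total observations.
ratio = df.isna().sum() / len(df)

# Drop every column whose ratio exceeds the threshold.
threshold = 0.3
kept = df.loc[:, ratio <= threshold]
print(ratio.to_dict(), kept.columns.tolist())
```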
3. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by
considering interactions between features while keeping the computational cost low.
They are fast, like filter methods, but more accurate. These methods are also iterative:
each training iteration is evaluated, and the features that contribute most to that training
are found. Common embedded techniques are regularization (e.g., L1/LASSO, which
drives the coefficients of unimportant features to zero) and tree-based feature importance.
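A minimal sketch of the regularization idea, using scikit-learn's Lasso inside
SelectFromModel on synthetic data (the feature counts and the alpha value are
illustrative):

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
# Target depends only on features 0 and 1; the rest are noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# L1 regularization shrinks unhelpful coefficients to exactly zero,
# so selection falls out of model training itself (an embedded method).
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
print(selector.get_support())  # True for the selected features
```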
To choose an appropriate feature selection measure, we first need to identify the types of
the input and output variables. In machine learning, variables are of mainly two types:
numerical and categorical. This gives four cases:
1. Numerical Input, Numerical Output:
This is the case of regression predictive modelling. The common method used for such a
case is the correlation coefficient.
2. Numerical Input, Categorical Output:
This is the case of classification predictive modelling problems. Here too, correlation-
based techniques should be used, but ones suited to a categorical output.
3. Categorical Input, Numerical Output:
This is the case of regression predictive modelling with categorical input, a less common
kind of regression problem. We can use the same measures as in the case above, but in
reverse order.
4. Categorical Input, Categorical Output:
The commonly used technique for such a case is the Chi-Squared test. We can also use
information gain in this case.
We can summarise the above cases with appropriate measures in the table below:

Input Variable    Output Variable    Feature Selection Measure
Numerical         Numerical          Correlation coefficient
Numerical         Categorical        Correlation-based (categorical output)
Categorical       Numerical          Same measures as above, in reverse
Categorical       Categorical        Chi-Squared test, Information gain
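As an illustration of the last case, here is a minimal sketch using scikit-learn's
SelectKBest with the chi2 score on tiny synthetic count data (all values are illustrative):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# chi2 expects non-negative features (e.g., counts of categorical events).
X = np.array([[3, 0, 1],
              [2, 1, 0],
              [0, 4, 1],
              [1, 3, 0]])
y = np.array([0, 0, 1, 1])  # categorical output

# Keep the k features most associated with the class labels.
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print(selector.scores_, selector.get_support())
```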