
UNIT II

Overview
This tutorial is divided into three parts; they are:

1. What Is Model Selection


2. Considerations for Model Selection
3. Model Selection Techniques
What Is Model Selection
Model selection is the process of selecting one final machine learning
model from among a collection of candidate machine learning models for
a training dataset.
Model selection is a process that can be applied both across different
types of models (e.g. logistic regression, SVM, KNN, etc.) and across
models of the same type configured with different model hyperparameters
(e.g. different kernels in an SVM).

For example, we may have a dataset for which we are interested in developing a classification or regression predictive model. We do not know beforehand which model will perform best on this problem; this cannot be known in advance. Therefore, we fit and evaluate a suite of different models on the problem.

Model selection is the process of choosing one of the models as the final
model that addresses the problem.
Model selection is different from model assessment. We evaluate or assess candidate models in order to choose the best one; this is model selection. Once a model is chosen, it can be evaluated in order to communicate how well it is expected to perform in general; this is model assessment.

The process of evaluating a model’s performance is known as model assessment, whereas the process of selecting the proper level of flexibility for a model is known as model selection.
Considerations for Model Selection
Fitting models is relatively straightforward, although selecting among
them is the true challenge of applied machine learning.
Firstly, we need to get over the idea of a “best” model.
All models have some predictive error, given the statistical noise in the
data, the incompleteness of the data sample, and the limitations of each
different model type. Therefore, the notion of a perfect or best model is
not useful. Instead, we must seek a model that is “good enough.”

Some algorithms require specialized data preparation in order to best expose the structure of the problem to the learning algorithm. Therefore, we must go one step further and consider model selection as the process of selecting among model development pipelines.
Each pipeline takes in the same raw training dataset and outputs a model that can be evaluated in the same manner, but may require different or overlapping computational steps, such as:

 Data filtering.
 Data transformation.
 Feature selection.
 Feature engineering.
 And more…
The closer you look at the challenge of model selection, the more nuance
you will discover.

Now that we are familiar with some considerations involved in model selection, let’s review some common methods for selecting a model.

Model Selection Techniques


In the ideal situation, we would split the data into training, validation, and test sets, then fit candidate models on the training set, evaluate and select them on the validation set, and report the performance of the final model on the test set.
If we are in a data-rich situation, the best approach […] is to randomly
divide the dataset into three parts: a training set, a validation set, and a
test set. The training set is used to fit the models; the validation set is
used to estimate prediction error for model selection; the test set is used
for assessment of the generalization error of the final chosen model.

This is impractical on most predictive modeling problems given that we
rarely have sufficient data, or are able to even judge what would be
sufficient.

In many applications, however, the supply of data for training and testing
will be limited, and in order to build good models, we wish to use as much
of the available data as possible for training. However, if the validation
set is small, it will give a relatively noisy estimate of predictive
performance.

Instead, there are two main classes of techniques to approximate the
ideal case of model selection; they are:

 Probabilistic Measures: Choose a model via in-sample error and complexity.

 Resampling Methods: Choose a model via estimated out-of-sample error.


Let’s take a closer look at each in turn.

Probabilistic Measures
Probabilistic measures involve analytically scoring a candidate model
using both its performance on the training dataset and the complexity of
the model.
It is known that training error is optimistically biased, and therefore is not
a good basis for choosing a model. The performance can be penalized
based on how optimistic the training error is believed to be. This is
typically achieved using algorithm-specific methods, often linear, that
penalize the score based on the complexity of the model.

Historically various ‘information criteria’ have been proposed that attempt to correct for the bias of maximum likelihood by the addition of a penalty term to compensate for the over-fitting of more complex models.
– Page 33, Pattern Recognition and Machine Learning, 2006.
A model with fewer parameters is less complex, and because of this, is
preferred because it is likely to generalize better on average.
Four commonly used probabilistic model selection measures include:

 Akaike Information Criterion (AIC).


 Bayesian Information Criterion (BIC).
 Minimum Description Length (MDL).
 Structural Risk Minimization (SRM).
Probabilistic measures are appropriate when using simpler linear models such as linear regression or logistic regression, where the calculation of the model complexity penalty (e.g. in-sample bias) is known and tractable.
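As a minimal illustration (not part of the original tutorial), the sketch below computes AIC and BIC for an ordinary least squares fit, assuming Gaussian errors; the synthetic dataset and variable names are hypothetical.

import numpy as np

# Hypothetical data: y depends linearly on the first of three candidate features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

def aic_bic(X, y):
    """AIC and BIC for an OLS fit with Gaussian errors (a sketch, not a library API)."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])           # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)   # least-squares fit
    rss = np.sum((y - X1 @ beta) ** 2)
    k = X1.shape[1] + 1                             # coefficients plus the noise variance
    neg2ll = n * (np.log(2 * np.pi * rss / n) + 1)  # -2 * maximized Gaussian log-likelihood
    return 2 * k + neg2ll, k * np.log(n) + neg2ll

# Lower AIC/BIC is preferred: compare a 1-feature model against the full 3-feature model.
print("1 feature :", aic_bic(X[:, :1], y))
print("3 features:", aic_bic(X, y))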

Resampling Methods
Resampling methods seek to estimate the performance of a model (or
more precisely, the model development process) on out-of-sample data.
This is achieved by splitting the training dataset into train and test subsets, fitting a model on the train subset, and evaluating it on the test subset. This process may then be repeated multiple times and the mean performance across the trials is reported.

It is a type of Monte Carlo estimate of model performance on out-of-sample data, although each trial is not strictly independent: depending on the resampling method chosen, the same data may appear multiple times in different training or test datasets.
Three common resampling model selection methods include:

 Random train/test splits.


 Cross-Validation (k-fold, LOOCV, etc.).
 Bootstrap.
Most of the time probabilistic measures (described in the previous
section) are not available, therefore resampling methods are used.

By far the most popular is the cross-validation family of methods, which includes many subtypes.

Probably the simplest and most widely used method for estimating
prediction error is cross-validation.
An example is the widely used k-fold cross-validation that splits the
training dataset into k folds where each example appears in a test set
only once.

Another is leave-one-out cross-validation (LOOCV), where the test set is comprised of a single sample and each sample is given an opportunity to be the test set, requiring N (the number of samples in the training set) models to be constructed and evaluated.
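The sketch below (not from the original text) shows one way to compare candidate models with k-fold and leave-one-out cross-validation using scikit-learn; the synthetic dataset and the two candidate models are hypothetical.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Hypothetical dataset and two candidate models.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for name, model in candidates.items():
    kfold_scores = cross_val_score(model, X, y, cv=kfold)        # 10 train/test trials
    loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())  # N trials, one sample per test set
    print(name, kfold_scores.mean(), loo_scores.mean())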

TRAINING MODEL
Training a machine learning (ML) model is a process in which a machine learning algorithm is
fed with training data from which it can learn. ML models can be trained to benefit businesses in
numerous ways, by quickly processing huge volumes of data, identifying patterns, finding
anomalies or testing correlations that would be difficult for a human to do unaided.

What is Model Training?


Model training is at the heart of the data science development lifecycle, where the data science team works to fit the best weights and biases to an algorithm to minimize the loss function over the prediction range. Loss functions define how the ML algorithm is optimized. A data science team may use different types of loss functions depending on the project objectives, the type of data used and the type of algorithm.

When a supervised learning technique is used, model training creates a mathematical representation of the relationship between the data features and a target label. In unsupervised learning, it creates a mathematical representation among the data features themselves.

Importance of Model Training

Model training is the primary step in machine learning, resulting in a working model that can
then be validated, tested and deployed. The model’s performance during training will eventually
determine how well it will work when it is eventually put into an application for the end-users.

Both the quality of the training data and the choice of the algorithm are central to the model training phase. In most cases, the available data is split into separate sets for training and then for validation and testing.

The selection of the algorithm is primarily determined by the end-use case. However, there are
always additional factors that need to be considered, such as algorithm-model complexity,
performance, interpretability, computer resource requirements, and speed. Balancing out these
various requirements can make selecting algorithms an involved and complicated process.

How To Train a Machine Learning Model


Training a model requires a systematic, repeatable process that maximizes your utilization of
your available training data and the time of your data science team. Before you begin the training
phase, you need to first determine your problem statement, access your data set and clean the
data to be presented to the model.

In addition to this, you need to determine which algorithms you will use and what parameters
(hyperparameters) they will run with. With all of this done, you can split your dataset into a
training set and a testing set, then prepare your model algorithms for training.

Split the Dataset

Your initial training data is a limited resource that needs to be allocated carefully. Some of it can
be used to train your model, and some of it can be used to test your model – but you can’t use the
same data for each step. You can’t properly test a model unless you have given it a new data set
that it hasn’t encountered before. Splitting the training data into two or more sets allows you to
train and then validate the model using a single source of data. This allows you to see if the
model is overfit, meaning that it performs well with the training data but poorly with the test
data.
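A minimal sketch of such a split with scikit-learn (the dataset and split sizes are hypothetical):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical dataset; 80% of the rows are kept for training, 20% held out for testing.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify keeps class proportions
)
print(X_train.shape, X_test.shape)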

A common way of splitting the training data is to use cross-validation. In 10-fold cross-
validation, for example, the data is split into ten sets, allowing you to train and test the data ten
times. To do this:

1. Split the data into ten equal parts or folds.


2. Designate one fold as the hold-out fold.
3. Train the model on the other nine folds.
4. Test the model on the hold-out fold.

Repeat this process ten times, each time selecting a different fold to be the hold-out fold. The
average performance across the ten hold-out folds is your performance estimate, called the cross-
validated score.
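A brief sketch of these steps using scikit-learn's KFold (the dataset and model are hypothetical placeholders):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

scores = []
for train_idx, holdout_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                        # train on the other nine folds
    scores.append(model.score(X[holdout_idx], y[holdout_idx]))   # test on the hold-out fold

print("cross-validated score:", np.mean(scores))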

Select Algorithms to Test

In machine learning, there are thousands of algorithms to choose from, and there is no sure way
to determine which will be the best for any specific model. In most cases, you will likely try
dozens, if not hundreds, of algorithms in order to find the one that results in an accurate working
model. Selecting candidate algorithms will often depend on:
 Size of the training data.
 Accuracy and interpretability of the required output.
 Training speed required, which often trades off against accuracy.
 Linearity of the training data.
 Number of features in the data set.

Tune the Hyperparameters

Hyperparameters are the high-level attributes set by the data science team before the model is assembled and trained. While many model parameters can be learned from the training data, models cannot learn their own hyperparameters.

As an example, if you are using a regression algorithm, the model can determine the regression
coefficients itself by analyzing the data. However, it cannot dictate the strength of the penalty it
should use to regularize an overabundance of variables. As another example, a model using the
random forest technique can determine where decision trees will be split, but the number of trees
to be used needs to be tuned beforehand.
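A hedged sketch of hyperparameter tuning with a grid search; the grid values and the dataset are illustrative only.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hyperparameters such as the number of trees cannot be learned from the data,
# so we search over a small, hypothetical grid using cross-validation.
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)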

Fit and Tune Models

Now that the data is prepared and the model’s hyperparameters have been determined, it’s time
to start training the models. The process is essentially to loop through the different algorithms
using each set of hyperparameter values you’ve decided to explore. To do this:

1. Split the data.


2. Select an algorithm.
3. Tune the hyperparameter values.
4. Train the model.
5. Select another algorithm and repeat steps 3 and 4.

Next, select another set of hyperparameter values you want to try for the same algorithm, cross-
validate it again and calculate the new score. Once you have tried each hyperparameter value,
you can repeat these same steps for additional algorithms.

Think of these trials as track and field heats. Each algorithm has demonstrated what it can do
with the different hyperparameter values. Now you can select the best version from each
algorithm and send them on to the final competition.

Choose the Best Model

Now it’s time to test the best versions of each algorithm to determine which gives you the best
model overall.

1. Run each finalist model on the test data to make predictions.
2. Determine the ground truth for your target variable in the test data.
3. Calculate the performance metrics from your predictions and the ground truth target variable.
Once the testing is done, you can compare their performance to determine which are the better
models. The overall winner should have performed well (if not the best) in training as well as in
testing. It should also perform well on your other performance metrics (like speed and empirical
loss), and – ultimately – it should adequately solve or answer the question posed in your problem
statement.
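A minimal sketch of this final comparison, with two hypothetical finalist models standing in for the tuned winners of each algorithm:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Hypothetical finalists: the best version of each algorithm from the tuning step.
finalists = {
    "logistic_regression": LogisticRegression(max_iter=1000).fit(X_train, y_train),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train),
}
for name, model in finalists.items():
    preds = model.predict(X_test)                  # predictions on the held-out test data
    print(name, accuracy_score(y_test, preds))     # compared against the ground truth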

Systematic Approach to Model Training


Using a systematic and repeatable model training process is of paramount importance for any organization planning to build successful machine learning models at scale. Central to this is having all of your resources, tools, libraries and documentation in a single enterprise platform that fosters collaboration instead of hindering it.

Interpretability in Machine Learning


Why we need to understand how our models make predictions

Should we always trust a model that performs well? A model could reject your
application for a mortgage or diagnose you with cancer. The consequences of these
decisions are serious and, even if they are correct, we would expect an explanation.
A human would be able to tell you that your income is too low for a mortgage or
that a specific cluster of cells is likely malignant. A model that provided similar
explanations would be more useful than one that just provided predictions.

By obtaining these explanations, we say we are interpreting a machine learning
model. In the rest of this article, we’ll explain in more detail what is meant by
interpretability. We’ll then move on to the importance and benefits of being able to
interpret your models. There are, however, still some downsides. We’ll end off by
discussing these and why, in some cases, you may prefer a less interpretable model.

What do we mean by interpretability?

In a previous article, I discuss the concept of model interpretability and how it relates to interpretable and explainable machine learning. To summarise,
interpretability is the degree to which a model can be understood in human terms.
Model A is more interpretable than model B if it is easier for a human to understand
how model A makes predictions. For example, a Convolutional Neural Network is
less interpretable than a Random Forest which is less interpretable than a Decision
Tree.

With this in mind, we say a model is an interpretable model if it can be understood without any other aids/techniques. Interpretable models are highly interpretable. In
comparison, explainable models are too complicated to be understood without the
help of additional techniques. We say these models have low interpretability. We
can see how these concepts are related in Figure 1. Generally, models can be
classified as either interpretable or explainable but there are grey areas where
people would disagree.
Figure 1: The Interpretability Spectrum (Source: Author)

Why is interpretability important?

As mentioned, we require additional techniques, such as feature importance or LIME, to understand how explainable models work. Implementing
these techniques can be a lot of effort and, importantly, they only provide
approximations for how a model works. So, we cannot be completely certain that
we understand an explainable model. We can have a similar situation when
comparing interpretable models.

For example, consider logistic regression and decision trees. Neither requires additional techniques, but logistic regression may still require more effort to interpret. We would need an understanding of the sigmoid function and how coefficients are related to odds/probability. This complexity may also lead to errors in our interpretations. In general, the more interpretable a model, the easier it is to understand and the more certain we can be that our understanding is correct.
Interpretability is important because of the many benefits that flow from this.
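As a small, hypothetical illustration of what interpreting a logistic regression coefficient involves (the feature name and coefficient value are invented for this example):

import numpy as np

# Hypothetical fitted coefficient for a binary "smoker" feature in a logistic regression.
beta_smoker = 0.9
odds_ratio = np.exp(beta_smoker)  # logistic regression coefficients act on the log-odds
print(f"Smoking multiplies the odds of the positive class by about {odds_ratio:.2f}")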

Easier to explain

Our first benefit is that interpretable models are easier to explain to other people.
For any topic, the better we understand it the easier it is to explain. We should also
be able to explain it in simple terms (i.e. without mentioning the technical details).
In industry, there are many people who may expect simple explanations for how
your model works. These people will not necessarily have technical backgrounds or
experience with machine learning.

For example, suppose we have created a model that predicts whether someone will
make a life insurance claim. We want to use this model to automate life insurance
underwriting at our company. To sign off on the model, our boss would require a
detailed explanation of how it works. A disgruntled customer may rightly demand
an explanation for why they were not approved for life cover. A regulator could
even require such an explanation by law.

Trying to explain to these people how a neural network makes predictions may
cause a lot of confusion. Due to the uncertainty, they may not even accept the
explanation. In comparison, interpretable models like logistic regression can be
understood in human terms. This means they can be explained in human terms. For
example, we could explain precisely how much the customer’s smoking habit has
increased their probability of dying.

Easier to sense check and fix errors

The relationship described above is causal (i.e. smoking causes cancer/death). In general, machine learning models only care about associations. For example, a model could use someone’s country of origin to predict whether they had skin cancer. However, unlike smoking, we cannot say that someone’s country causes cancer. Skin cancer is caused by sunshine, and some countries are simply sunnier than others. So we can only say skin cancer is associated with certain countries.

Figure 2: Wolf vs husky experiment (Source: M. Tulio Ribeiro, S. Singh & C. Guestrin)

A good example of where associations can go wrong comes from an experiment performed by researchers at the University of Washington. The researchers trained an image recognition model to classify animals as a husky or a wolf. Using LIME, they tried to understand how their model made predictions. In Figure 2, we can see that the model was basing its predictions on image backgrounds. If the background had snow, the animal was always classified as a wolf. They had essentially built a model that detects snow.

The issue is that wolves are associated with snow. Wolves will usually be found in
the snow whereas huskies are not. What this example shows us is that models can,
not only make incorrect predictions, but they can also make correct predictions in
the wrong way. As data scientists, we need to sense check our models to make sure
they are not making predictions in this way. The more interpretable your model the
easier this is to do.


Easier to determine future performance

As time goes on, a model’s prediction power may deteriorate. This is because
relationships between model features and the target variable can change. For
example, due to the wage gap, income may currently be a good predictor of gender.
As society becomes more equal, income would lose its predictive strength. We need
to be aware of these potential changes and their impact on our models. This is
harder to do for explainable models. As it is less clear how features are being used,
even if we know the impact on individual features, we may not be able to tell the
impact on the model as a whole.
Easier to learn from the model

It is human nature to try to find meaning in the unknown. Machine learning can
help us discover patterns in our data we didn’t know existed. However, we cannot
identify these patterns by just looking at the model’s predictions. Any lessons are
lost if we can not interpret our model. Ultimately, the less interpretable a model the
harder it is to learn from it.


Algorithm Fairness

It is important that your models make unbiased decisions so that they do not
perpetuate any historical injustices. Identifying sources of bias can be difficult. It
often comes from associations between model features and protected variables (e.g.
race or gender). For example, due to a history of forced segregation in South Africa,
race is highly associated with someone’s location or neighbourhood. Location can act
as a proxy for race. A model that uses location may be biased towards a certain
race.

Using an interpretable model will not necessarily mean that you will have an
unbiased model. It also does not mean that it will be easier to determine if the
model is fair or not. This is because most measures of fairness (e.g. false positive
rate, disparate impact) are model agnostic. They are just as easy to calculate for any
model. What using an interpretable model does do is make it easier to identify and
correct the source of bias. We know what features are being used and we can check
which of these are associated with the protected variables.

Downsides to interpretability

Okay, we get it… interpretable models are great. They are easier to understand,
explain and learn from. They also allow us to better sense check current
performance, future performance and model fairness. There are however downsides
to interpretability and situations where we would prefer an explainable model.

Open to manipulation

Systems based on ML are open to manipulation or fraud. For example, suppose we have a system that automatically gives out car loans. An important feature could be the number of credit cards: the more cards a customer has, the riskier she is. If a customer knew this, she could temporarily cancel all her cards, take out a car loan and then reapply for all the credit cards.


The probability of the customer repaying the loan does not change when she cancels her cards. The customer has manipulated the model into making an incorrect prediction. The more interpretable a model, the more transparent and easier it is to manipulate. This is the case even if the inner workings of the model are kept secret, because the relationships between the features and the target variable are usually simpler and therefore easier to guess.

Less to learn

We mentioned that interpretable models are easier to learn from. The flip side is
that they are less likely to teach us something new. An explainable model like a
neural network can automatically model interactions and non-linear relationships in
data. By interpreting these models we can uncover these relationships that we never
knew existed.

In comparison, algorithms like linear regression can only model linear relationships. To model non-linear relationships, we would have to use feature engineering to include the relevant variables in our dataset. This would require prior knowledge of those relationships, defeating the purpose of interpreting the model.

Domain knowledge/ expertise requirement

Building interpretable models can require significant domain knowledge and expertise. Generally, interpretable models, like regression, can only model linear relationships in your data. To model non-linear relationships we have to perform
relationships in your data. To model non-linear relationships we have to perform
feature engineering. For example, for a medical diagnosis model, we may want to
calculate BMI using height and weight. Knowing what features will be predictive
and, therefore, what features to create requires domain knowledge in a particular
field.
Your team may not have this knowledge. Alternatively, you could use an
explainable model which will automatically model non-linear relationships in your
data. This removes the need to create any new features; essentially leaving the
thinking up to the computer. The downside, as we’ve discussed thoroughly above,
is a poorer understanding of how the features are being used to make predictions.

Complexity-Accuracy Trade-off

What we can see from the above is that, generally, the less complicated a model, the more interpretable it is. So, for higher interpretability, there can be a trade-off of lower accuracy, because in some cases simpler models make less accurate predictions. This really depends on the problem you are trying to solve. For instance, you would get poor results using logistic regression to do image recognition.

For many problems, an interpretable model would perform as well as an explainable model. In the article below, we compare an interpretable model, Logistic Regression, to an explainable model, a Neural Network. We show that by putting a bit of thought into our problem and creating new features, we can achieve similar accuracy with an interpretable model. It is a good practical take on some of the concepts we’ve discussed in this article.

Various ways to evaluate a machine learning model’s performance
Because finding accuracy is not enough.

In this blog, we will discuss the various ways to check the performance of our
machine learning or deep learning model and why to use one in place of the
other. We will discuss terms like:

1. Confusion matrix

2. Accuracy

3. Precision

4. Recall

5. Specificity

6. F1 score

7. Precision-Recall or PR curve

8. ROC (Receiver Operating Characteristics) curve

9. PR vs ROC curve.

For simplicity, we will mostly discuss things in terms of a binary classification problem, where, let’s say, we have to find whether an image is of a cat or a dog, or whether a patient has cancer (positive) or is healthy (negative). Some common terms to be clear with are:

True positives (TP): Predicted positive and are actually positive.

False positives (FP): Predicted positive and are actually negative.

True negatives (TN): Predicted negative and are actually negative.

False negatives (FN): Predicted negative and are actually positive.

So let's get started!

Confusion matrix

It’s just a representation of the above parameters in a matrix format. Better visualization is always good :)
Accuracy

Accuracy is the most commonly used metric to judge a model, yet it is actually not a clear indicator of performance. The worst case occurs when classes are imbalanced.

Take for example a cancer detection model. The chances of actually having cancer are very low. Let’s say that out of 100 patients, 90 don’t have cancer and the remaining 10 actually have it. We don’t want to miss a patient who has cancer but goes undetected (a false negative). Yet predicting that nobody has cancer gives an accuracy of 90% straight away. The model did nothing here but output “cancer free” for all 100 predictions.
We surely need better alternatives.
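A tiny sketch of this failure mode (the numbers mirror the hypothetical 90/10 example above):

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 90 + [1] * 10)   # 90 healthy patients, 10 with cancer
y_pred = np.zeros(100, dtype=int)        # a "model" that always predicts "no cancer"

print(accuracy_score(y_true, y_pred))    # 0.9, despite missing every cancer case
print(recall_score(y_true, y_pred))      # 0.0, which exposes the failure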

Precision

Precision is the percentage of positive instances out of the total predicted positive instances:

Precision = TP / (TP + FP)

Here the denominator is everything the model predicted as positive over the whole dataset. Think of it as answering: “how often is the model right when it says it is right?”

Recall/Sensitivity/True Positive Rate

Recall is the percentage of positive instances out of the total actual positive instances:

Recall = TP / (TP + FN)

The denominator (TP + FN) is the actual number of positive instances present in the dataset. Think of it as answering: “how many of the actual positives did the model find?”

Specificity

Specificity is the percentage of negative instances out of the total actual negative instances:

Specificity = TN / (TN + FP)

The denominator (TN + FP) is the actual number of negative instances present in the dataset. It is similar to recall, but the focus shifts to the negative instances, like finding out how many healthy patients were correctly told they don’t have cancer. It is a rough measure of how well separated the classes are.

F1 score

It is the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

This takes the contribution of both, so the higher the F1 score, the better. Because of the product in the numerator, if either precision or recall is low, the final F1 score drops significantly. So a model does well on F1 score if the predicted positives are actually positives (precision) and it doesn’t miss positives by predicting them negative (recall).

One drawback is that both precision and recall are given equal importance, whereas, depending on the application, we may need one to be higher than the other, so the F1 score may not be the exact metric we want. In that case, either a weighted F1 score or inspecting the PR or ROC curve can help.
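A short sketch computing these metrics with scikit-learn on a made-up set of labels and predictions:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

# Hypothetical ground truth and predictions for a binary problem.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))  # rows = actual, columns = predicted: [[TN FP], [FN TP]]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))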

PR curve

It is the curve between precision and recall for various threshold values. A typical plot shows several predictors with their respective precision-recall curves across threshold values. The top right part of the graph is the ideal space, where we get both high precision and high recall. Based on our application, we can choose the predictor and the threshold value. PR AUC is simply the area under the curve; the higher its numerical value, the better.


ROC curve

ROC stands for receiver operating characteristic; the curve plots TPR against FPR for various threshold values. As TPR increases, FPR also increases. We generally want the threshold value that brings the curve closest to the top left corner. Comparing different predictors on a given dataset also becomes easy, and one can choose the threshold according to the application at hand. ROC AUC is simply the area under the curve; the higher its numerical value, the better.
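A hedged sketch of how these curves and their AUC summaries can be computed with scikit-learn (the imbalanced synthetic dataset is illustrative only):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (precision_recall_curve, roc_curve,
                             average_precision_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Imbalanced, hypothetical dataset: roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

precision, recall, pr_thresholds = precision_recall_curve(y_test, probs)
fpr, tpr, roc_thresholds = roc_curve(y_test, probs)

print("PR AUC (average precision):", average_precision_score(y_test, probs))
print("ROC AUC:", roc_auc_score(y_test, probs))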

PR vs ROC curve

Both metrics are widely used to judge a model’s performance.

Which one to use PR or ROC?

The answer lies in TRUE NEGATIVES.

Due to the absence of TN in the precision-recall equations, PR curves are useful for imbalanced classes, particularly when the negative class is in the majority. The metric does not take into account the large number of TRUE NEGATIVES of the majority negative class, which gives it better resistance to the imbalance. This matters when detecting the positive class is what is most important.

For example, detecting cancer patients involves high class imbalance, because very few of all the people diagnosed actually have it. We certainly don’t want a person with cancer to go undetected (recall), and we want to be sure that a detected person actually has it (precision).

Due to the consideration of TN, or the negative class, in the ROC equation, the ROC curve is useful when both classes are important to us, such as distinguishing cats from dogs. The importance of true negatives makes sure that both classes are given weight, for example when the output of a CNN model determines whether an image is of a cat or a dog.

Conclusion

The evaluation metric to use depends heavily on the task at hand. For a long time, accuracy was the only measure I used, which is really a vague option. I hope this blog has been useful for you. That’s all from my side. Feel free to suggest corrections and improvements.

5 Ways to Improve Performance of ML Models


1. Choosing the Right Algorithms
Algorithms are the key factor used to train ML models. The training data is fed into the algorithm, which learns from it and makes predictions. Hence, choosing the right algorithm is important to ensure the performance of your machine learning model.

Linear Regression, Logistic Regression, Decision Trees, SVM, Naive Bayes, kNN, K-Means, Random Forest, Dimensionality Reduction Algorithms and Gradient Boosting are among the leading ML algorithms you can choose from, depending on your problem.

2. Use the Right Quantity of Data


The next important factor to consider while developing a machine learning model is the quantity of the data sets. Multiple factors play a role here, and for deep learning-based ML models a huge quantity of data is required by the algorithms.

Depending on the complexity of the problem and of the learning algorithm, the model skill, an evaluation of the data size and the use of statistical heuristic rules are the leading factors that determine the quantity and type of training data sets, which in turn help in improving the performance of the model.

3. Quality of Training Data Sets


Just like quantity, the quality of the machine learning training data set is another key factor to keep in mind while developing an ML model. If the quality of the training data is poor or inaccurate, your model will never give accurate results, and its overall performance will not be suitable for real-life use.

There are different methods to measure the quality of the training data set. Standard quality-assurance methods and detailed in-depth quality assessments are the two popular approaches you can use. Quality of data is important for getting unbiased decisions from ML models, so make sure to use training data of the right quality to improve the performance of your ML model.

4. Supervised or Unsupervised ML
Beyond the choice of algorithm discussed above, the performance of an ML model is also affected by the learning method or process: supervised, unsupervised or reinforcement learning. In supervised learning, the algorithm works with a target/outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables).

In unsupervised machine learning, a model is not given any target or outcome variable to predict/estimate. It is used for clustering a population into different groups, which is widely used for segmenting customers into groups for specific interventions. For supervised ML, labeled or annotated data is required, while for unsupervised ML the approach is different.

Similarly, reinforcement learning is another important paradigm, used to train a model to make specific decisions. In this training process, the machine learns from previous experiences and tries to retain the most suitable knowledge for making the right decisions.

5. Model Validation and Testing


Building a machine learning model is not enough to get the right predictions; you also have to check its accuracy and validate it to ensure that the results are precise. Validating the model in this way will improve the performance of the ML model.

FEATURE TRANSFORMATION
Feature transformation (FT) refers to a family of algorithms that create new features using the existing features. These new features may not have the same interpretation as the original features, but they may have more discriminatory power in a different space than the original space. Feature transformation can also be used for feature reduction. It may happen in many ways: by simple/linear combinations of the original features or by using non-linear functions. Some common techniques for FT are:

 Scaling or normalizing features within a range, say between 0 to 1.


 Principal Component Analysis and its variants.
 Random Projection.
 Neural Networks.
 SVM also transforms features internally.
 Transforming categorical features to numerical.

What Are The Feature Transformation Techniques?

According to a study by Gartner, 87% of data science projects never go into production. Other studies show that preparing the dataset for the model is the most time-consuming task (around 80% of the effort). It seems that the most crucial and challenging part of a project is the preparation part. A lot of effort and time is spent here, but the result is often not good enough to go into production. According to Pedro Domingos, the most important thing in a machine learning model is the features. Therefore, we have to be very careful when dealing with features so that we can succeed in the preprocessing stage and in the project in general.

Feature transformation is a very useful and therefore very important part of the preprocessing stage. It allows us to get the maximum benefit from the features in the dataset for the success of the model. From this perspective, in some cases, it may even be a more critical process than the chosen machine learning or evaluation method. Feature transformation is also part of feature engineering.

By applying mathematical techniques to existing features, we obtain new features from them (from one form to another, or from one dimension to higher dimensions) or perform feature reduction in the opposite direction. In this way, we modify the existing data and try to increase the information we have (we add background experience), or at least increase the contribution to the model’s success (accuracy) while keeping the information constant.

Why transform?

 It is mandatory to digitize categorical features for models to work properly. This is called encoding.

 The scale between features in the dataset can be very different from each
other (or they may have different units). For example, while the “Age” feature
varies between 0–100, “Car Price” can vary between 0–1000000. Some
machine learning methods are affected by these scale differences, so
normalizing the difference will contribute to the success of the model.

 Models like KNN and SVM are distance-based algorithms, meaning that the distance between points is used to obtain clusters or to find similarities. Distances computed on unscaled features will be dominated by the features with the largest ranges and will mislead the model. In addition, models using gradient descent optimization, such as linear regression, logistic regression, or neural networks, are also negatively affected by unscaled data, and the coefficients of linear models are affected by unscaled features. (In general, we don’t need to scale features for tree-based ensemble methods, as the tree depth will probably not change.)

 Transformation decreases the effects of outliers since the variability is reduced by scaling.

 It improves model performance regarding the non-linear relationship between the target feature and the independent feature.

 Some machine learning models are based on the assumption that the features
are normally distributed. However, in real-life problems, the data is usually not
normally distributed. In this case, we apply transformations to approximate
these skewed data to the normal distribution so that the models can yield
better results.

Scaling

Min Max Scaler

It scales the values to between 0 and 1 (by default), so the minimum value in the series becomes zero. If we don’t want the range to be [0, 1], we can define another range, too.

x_scaled = (x-x_min) / (x_max-x_min)


from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x)

Normalizing to the range [0, 1] is important if the ranges of the features are very different from one another, and it does not require assuming that the feature distribution is normal. It may, however, increase overfitting.

Standard Scaler

Also called Z-score normalization. It centers each feature by subtracting the mean and scales it by the standard deviation, so the transformed feature has a mean of 0 and unit variance. This method assumes that the feature has a normal distribution. It helps to reduce overfitting when used properly.

x_scaled = (x - mean) / std
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

MaxAbs Scaler

All values in the feature are divided by the absolute maximum value of that
feature. Thus, the range becomes [-1,1]. If we want to keep 0 values as 0, we can
use this method.
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
df_scaled = scaler.fit_transform(df)

Robust Scaler

When we perform the scaling operation with statistics such as the mean or the maximum value, the operation becomes vulnerable to outliers. If there are a lot of outliers in the data, scaling with the IQR provides a more robust result. It centers the values on the median (which becomes 0) and scales them by the interquartile range.

x_scaled = (x - median) / IQR, where IQR = Q3 - Q1


from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
x_scaled = scaler.fit_transform(x)

Quantile Transformer Scaler

It maps the feature distribution to a target distribution (uniform by default, which scales values into the range 0 to 1, or normal if requested) using the cumulative distribution function. In doing so, it also takes care of outliers. It is better to use this scaler for non-linear relationships, because it can break linear relationships while transforming the distribution.
from sklearn.preprocessing import QuantileTransformer
scaler = QuantileTransformer()
x_scaled = scaler.fit_transform(x)

Gaussian Transformations

Many machine learning models work best if the features have Gaussian
distributions. There are numerous methods to transform features to normal
distribution.

I will use the helper function sketched below to plot a histogram and a Q-Q plot of a feature.
Q-Q plots are used to compare two distributions. If the two distributions were
identical then this plot would be a line directly on y=x. In our case, we can say that
if all points are on a straight line, they are perfectly normally distributed.
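The original helper code is not reproduced in this document, so here is a minimal sketch of a plot_gauss function under the assumption that matplotlib and scipy are available (importing scipy.stats as stat also matches the stat.boxcox call used later):

import matplotlib.pyplot as plt
import scipy.stats as stat

def plot_gauss(x):
    """Plot a histogram and a Q-Q plot (against the normal distribution) for a feature."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].hist(x, bins=30)                      # distribution shape
    axes[0].set_title("Histogram")
    stat.probplot(x, dist="norm", plot=axes[1])   # Q-Q plot vs. the normal distribution
    plt.show()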


Log Transformation

Log transformation is one of the most used Gaussian transformation methods. The log of each value in the feature is taken, which is a nice way to deal with large numbers (the base-10 log of 1,000,000 is only 6). Thus, it reduces the impact of both high and low values in features.

It is used to approximate non-normally distributed data to the normal distribution and is generally used for right-skewed features. Since it is logarithmic, it cannot be used for features that have zero or negative values.
x_log = np.log(x)

The feature shown in the figure was slightly right-skewed (upper charts). After the logarithmic transformation, it is much closer to a Gaussian distribution.
(Figure: histogram and Q-Q plot before and after the log transformation.)

Reciprocal Transformation

It cannot be applied to the value of zero. It reverses the order between data with
the same sign, i.e., larger values become small and vice versa (or like rewinding the
time). It can be applied to right-skewed data.

Let’s apply the reciprocal transformation to the above feature.

x_recip = 1 / data["sepal width (cm)"]
plot_gauss(x_recip)
(Figure: before and after of the reciprocal transformation.)

Square Transformation

It is generally applied to left-skewed data.

Let’s generate left-skewed data to explore (a sketch of the generation code follows).
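The generation code itself was shown as an image in the source; a minimal sketch that produces comparable left-skewed data, assuming scipy's skewnorm (the skewness parameter a is negative for left skew), is:

import scipy.stats as stat

# Left-skewed sample: a negative skewness parameter `a` produces a long left tail.
x = stat.skewnorm.rvs(a=-10, loc=10, scale=2, size=1000, random_state=0)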


plot_gauss(x)
plot_gauss(x**(2))

(Figure: before and after of the square transformation.)


Square Root Transformation

It is only valid for non-negative values and is generally applied to right-skewed data.

Let’s generate right-skewed data for exploration (just set the skewness parameter to a positive value in the data generation code above).
plot_gauss(x)
plot_gauss(x**(1/2))

(Figure: before and after of the square root transformation.)

Power Transformations
Power transformations can be used when there is heteroskedasticity, which is a concern in linear relationships. Heteroskedasticity means that the variance of the residuals isn’t constant (so there is a pattern in the variance).

(Figure: heteroskedasticity vs. homoscedasticity. Source: Wikipedia)


Power transformation makes data more Gaussian-like. These methods use a
parameter called lambda. This parameter helps to reduce skewness and stabilize
the variance.

Box-Cox Transformation

This is included in the concept of power transformations. The data must be positive. Lambda (λ) is typically searched in the range -5 to 5, and an optimal lambda value should be selected (like hyperparameter tuning).

T(Y) = (Y^λ - 1) / λ for λ ≠ 0, and T(Y) = ln(Y) for λ = 0

We can also use scipy.stats to find the optimal lambda.

import scipy.stats as stat
x_boxcox, lda = stat.boxcox(x)
print(lda)
# 0.7964531473656952

(Figure: before and after of the Box-Cox transformation.)


Yeo-Johnson Transformation

This is the power transformation technique to use if the data contains zero or negative values.
from sklearn.preprocessing import PowerTransformer
transformer = PowerTransformer(method='yeo-johnson')
data = transformer.fit_transform(data)

Unit Vector Normalizer or Scaler

Interestingly, this method normalizes or scales the rows (samples) rather than the columns (features): each row is rescaled to have unit norm.
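A short sketch with scikit-learn's Normalizer (the toy matrix is invented for illustration):

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 2.0]])
X_unit = Normalizer(norm="l2").fit_transform(X)  # each ROW is scaled to unit L2 norm
print(X_unit)  # [[0.6, 0.8], [0.447..., 0.894...]]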

Binning

Simply put, binning is a grouping operation; it can be applied to both numeric and categorical features.
# Numerical feature
df['Age_bin'] = pd.cut(df['Age'], bins=[0, 30, 70, 100], labels=["Young", "Mid", "Old"])
# Out:
# 0      Young
# 1      Mid
# 2      Young
# 3      Mid
# 4      Mid
#        ...
# 886    Young
# 887    Young
# 888    Young
# 889    Young
# 890    Mid
# Name: Age_bin, Length: 891, dtype: category
# Categories (3, object): ['Young' < 'Mid' < 'Old']

# Categorical feature
conditions = [
    data['Item'].str.contains('Bmw'),
    data['Item'].str.contains('Hp'),
    data['Item'].str.contains('Iphone'),
    data['Item'].str.contains('Audi')]
choices = ['Car', 'Computer', 'Phone', 'Car']
data['Item_Group'] = np.select(conditions, choices, default='Other')

Binning can reduce overfitting, but some information is sacrificed. When grouping categorical data, it is usually better to combine the groups with a very low share of the overall total (e.g. below 0.01%) into a separate “Other” group.

Encoders

We cannot feed categorical features as string types directly into models. Encoding methods are used to convert these features into numerical values.

Integer / Label Encoding

Each unique group in the feature is assigned a number. From an information point of view, there is no relationship between the assigned numbers and the groups.
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit_transform(["Bmw","Mercedes","Bmw","Audi"])#Out: array([1, 2, 1, 0])

One Hot Encoder

Each unique group is converted into its own binary column, so there are as many binary columns as there are groups. Each sample takes the value 1 in the column of its own group and 0 in the other columns.
(Figure: one-hot encoding example.)
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
encoder.fit_transform(x).toarray()#Out:
array([[0., 1.],
[1., 0.],
[1., 0.],
...,
[1., 0.],
[0., 1.],
[0., 1.]])

It is wise to be careful when using One Hot Encoding. Because it creates multiple features from a single feature, if there are too many groups it can create too many new features, and these new features can be correlated with each other.

Count and Frequency Encoding

Category group names are replaced by their count or frequency in the feature. Frequency encoding can be used if the target variable and the feature have a frequency-related relationship. It works well with tree algorithms.

We can use the feature_engine library for this purpose.


from feature_engine.encoding import CountFrequencyEncoder

encoder = CountFrequencyEncoder(encoding_method='frequency', variables=['Cabin'])
encoder.fit(df)
df_encoded = encoder.transform(df)

Mean Encoding

It takes the mean value of the target feature for all category groups within a
feature and assigns it to each category.
from feature_engine.encoding import MeanEncoder
encoder = MeanEncoder(variables=['cabin'])
# fit the encoder
encoder.fit(X, y)
# transform the data
X_encoded = encoder.transform(X)
encoder.encoder_dict_
#Out:
{'cabin': {'A': 0.5294117647058824,
'B': 0.7619047619047619,
'C': 0.5633802816901409,
'D': 0.71875,
'E': 0.71875,
'F': 0.6666666666666666,
'G': 0.5,
'T': 0.0,
'n': 0.30484330484330485}}

Ordinal Encoding

If we can establish an ordering relationship between categorical values, then we can perform an ordinal encoding. For example, consider a survey question whose answer options are “never”, “sometimes”, and “always”. In the range [0, 2], we can assign 0 for “never”, 1 for “sometimes” and 2 for “always”.
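A minimal sketch with scikit-learn's OrdinalEncoder, using the survey example above (the explicit category order is what makes the encoding ordinal):

from sklearn.preprocessing import OrdinalEncoder

answers = [["never"], ["always"], ["sometimes"], ["never"]]
encoder = OrdinalEncoder(categories=[["never", "sometimes", "always"]])  # explicit order
print(encoder.fit_transform(answers))  # [[0.], [2.], [1.], [0.]]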

Weight of Evidence Encoder


We can explain WoE in terms of an event. An event indicates whether an example belongs to the positive class (for example, a person is “Survived”; the non-event is “Not-Survived”). We then calculate WoE by taking the natural logarithm of the ratio of the event rate to the non-event rate.

WOE = ln(event% / non-event%)

It can be used with Logistic Regression since the approach is similar; in fact, it evolved from Logistic Regression.
import category_encoders as ce

columns = [col for col in data.columns]
encoder = ce.WOEEncoder(cols=columns)
x_encoded = encoder.fit_transform(data[columns], y).add_suffix('_woe')

Probability Ratio Encoding

It is similar to the Weight of Evidence method, but this time we only use the probability ratio (without the natural logarithm). It is best suited to binary classification problems.
from sunbird.categorical_encoding import probability_ratio_encoding
probability_ratio_encoding(data, 'Cabin', 'Survived')

Hashing Encoding

In this method, categorical variables are hashed into a set of numeric columns. Since the total number of dimensions can be kept lower than with the One Hot Encoding method, the hashing method can be tried in case of a high-cardinality problem.
We can limit the number of dimensions by setting the n_components argument. However, some information is obviously sacrificed by limiting the dimension space. The Hashing Encoder uses the md5 hashing algorithm by default.
import category_encoders as ce
encoder=ce.HashingEncoder(cols='Cabin',n_components=4)

Conclusion

Data preprocessing is the longest and most critical phase of a machine learning project, and feature transformation is one of the most critical parts of that phase. Whether it is encoding, scaling, or transformation, there are many methods in the literature and in practice. It is a very important skill to know these methods and to be able to sense where and which one might work.

Feature Selection Techniques in Machine Learning
Feature selection is a way of selecting the subset of the most relevant features from the original feature set by removing the redundant, irrelevant, or noisy features.

While developing the machine learning model, only a few variables in the dataset are useful for building the model, and the rest of the features are either redundant or irrelevant. If we input the dataset with all these redundant and irrelevant features, it may negatively impact and reduce the overall performance and accuracy of the model. Hence it is very important to identify and select the most appropriate features from the data and remove the irrelevant or less important features, which is done with the help of feature selection in machine learning.
Feature selection is one of the important concepts of machine learning, which highly
impacts the performance of the model. As machine learning works on the concept of
"Garbage In Garbage Out", so we always need to input the most appropriate and
relevant dataset to the model in order to get a better result.

In this topic, we will discuss different feature selection techniques for machine learning.
But before that, let's first understand some basics of feature selection.

o What is Feature Selection?


o Need for Feature Selection
o Feature Selection Methods/Techniques
o Feature Selection statistics

What is Feature Selection?


A feature is an attribute that has an impact on a problem or is useful for the problem,
and choosing the important features for the model is known as feature selection. Each
machine learning process depends on feature engineering, which mainly contains two
processes; which are Feature Selection and Feature Extraction. Although feature
selection and extraction processes may have the same objective, both are completely
different from each other. The main difference between them is that feature selection is
about selecting the subset of the original feature set, whereas feature extraction creates
new features. Feature selection is a way of reducing the input variable for the model by
using only relevant data in order to reduce overfitting in the model.


So, we can define feature Selection as, "It is a process of automatically or manually
selecting the subset of most appropriate and relevant features to be used in model
building." Feature selection is performed by either including the important features or
excluding the irrelevant features in the dataset without changing them.

Need for Feature Selection


Before implementing any technique, it is really important to understand the need for it, and the same goes for feature selection. As we know, in machine learning it is necessary to provide a pre-processed and good input dataset in order to get better outcomes. We collect a huge amount of data to train our model and help it to learn better. Generally, the dataset consists of noisy data, irrelevant data, and some useful data. Moreover, the huge amount of data also slows down the training process of the model, and with noise and irrelevant data, the model may not predict and perform well. So it is very necessary to remove such noise and less important data from the dataset, and to do this, feature selection techniques are used.

Selecting the best features helps the model to perform well. For example, suppose we want to create a model that automatically decides which car should be crushed for spare parts, and to do this, we have a dataset. This dataset contains the Model of the car, the Year, the Owner's name and the Miles driven. In this dataset, the name of the owner does not contribute to the model performance, as it does not decide whether the car should be crushed or not, so we can remove this column and select the rest of the features (columns) for model building.
Below are some benefits of using feature selection in machine learning:

o It helps in avoiding the curse of dimensionality.


o It helps in the simplification of the model so that it can be easily interpreted by the
researchers.
o It reduces the training time.
o It reduces overfitting and hence enhances generalization.

Feature Selection Techniques


There are mainly two types of Feature Selection techniques, which are:

o Supervised Feature Selection technique


Supervised Feature selection techniques consider the target variable and can be used for
the labelled dataset.
o Unsupervised Feature Selection technique
Unsupervised Feature selection techniques ignore the target variable and can be used
for the unlabelled dataset.
There are mainly three techniques under supervised feature Selection:

1. Wrapper Methods
In wrapper methodology, selection of features is done by considering it as a search
problem, in which different combinations are made, evaluated, and compared with other
combinations. It trains the algorithm by using the subset of features iteratively.

On the basis of the output of the model, features are added or removed, and the model is trained again with the new feature set.

Some techniques of wrapper methods are:

o Forward selection - Forward selection is an iterative process which begins with an empty
set of features. In each iteration, it adds one feature and evaluates the performance to
check whether it improves. The process continues until adding a new variable/feature no
longer improves the performance of the model.
o Backward elimination - Backward elimination is also an iterative approach, but it is the
opposite of forward selection. This technique begins by considering all the features and
removes the least significant feature. This elimination process continues until removing a
feature no longer improves the performance of the model.
o Exhaustive Feature Selection - Exhaustive feature selection is one of the most thorough
feature selection methods, as it evaluates every feature set by brute force. It tries every
possible combination of features and returns the best-performing feature set.
o Recursive Feature Elimination - Recursive feature elimination is a greedy optimization
approach in which features are selected by recursively considering smaller and smaller
subsets of features. An estimator is trained on each set of features, and the importance of
each feature is determined using the coef_ attribute or the feature_importances_ attribute
(see the sketch after this list).
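As a rough illustration of these wrapper methods, the sketch below runs forward selection and recursive feature elimination with scikit-learn. The synthetic dataset and the choice of four features are assumptions made only for this example, and SequentialFeatureSelector requires scikit-learn 0.24 or later.

# A minimal sketch of two wrapper-style selectors in scikit-learn.
# The synthetic data and "keep 4 features" setting are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector, RFE

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
estimator = LogisticRegression(max_iter=1000)

# Forward selection: start from an empty set and greedily add features.
forward = SequentialFeatureSelector(estimator, n_features_to_select=4, direction="forward")
forward.fit(X, y)
print("Forward selection mask:", forward.get_support())

# Recursive Feature Elimination: fit, drop the weakest feature, and repeat.
rfe = RFE(estimator, n_features_to_select=4)
rfe.fit(X, y)
print("RFE ranking (1 = selected):", rfe.ranking_)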

2. Filter Methods
In the filter method, features are selected on the basis of statistical measures. This method
does not depend on the learning algorithm and chooses the features as a pre-processing
step.

The filter method filters out irrelevant features and redundant columns from the model by
ranking them with different metrics.

The advantage of filter methods is that they need little computational time and do not
overfit the data.
Some common techniques of Filter methods are as follows:

o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio

Information Gain: Information gain measures the reduction in entropy obtained by splitting
the dataset on a variable. It can be used as a feature selection technique by calculating
the information gain of each variable with respect to the target variable.
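As a small, hedged example, the snippet below scores features by mutual information (a close relative of information gain) using scikit-learn's mutual_info_classif; the synthetic data is purely illustrative.

# A minimal sketch of scoring features by information gain (mutual information)
# for a classification target; the synthetic data is only an illustration.
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)

scores = mutual_info_classif(X, y, random_state=0)
for i, score in enumerate(scores):
    print(f"Feature {i}: information gain = {score:.3f}")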

Chi-square Test: The chi-square test is a technique for determining the relationship
between categorical variables. The chi-square statistic is calculated between each feature
and the target variable, and the desired number of features with the best chi-square scores
is selected.
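A minimal sketch of chi-square based selection with scikit-learn's SelectKBest follows; note that chi2 expects non-negative feature values (such as counts), and the data here is an illustrative assumption.

# A minimal sketch of chi-square feature selection with SelectKBest.
# chi2 requires non-negative features, so count-like synthetic data is used.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 6))        # non-negative, count-like features
y = rng.integers(0, 2, size=200)              # binary target

selector = SelectKBest(score_func=chi2, k=3)  # keep the 3 best-scoring features
X_new = selector.fit_transform(X, y)
print("Chi-square scores:", selector.scores_)
print("Selected shape:", X_new.shape)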

Fisher's Score:
Fisher's score is one of the popular supervised techniques for feature selection. It ranks
the variables by Fisher's criterion in descending order; we can then select the variables
with the largest Fisher's scores.
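A rough NumPy sketch of Fisher's score is given below, assuming the standard definition (between-class variance divided by within-class variance for each feature); the data and the discriminative shift added to feature 0 are purely illustrative.

# A rough sketch of Fisher's score per feature, assuming the standard
# between-class / within-class variance ratio definition.
import numpy as np

def fisher_score(X, y):
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    numerator = np.zeros(X.shape[1])
    denominator = np.zeros(X.shape[1])
    for c in classes:
        X_c = X[y == c]
        n_c = X_c.shape[0]
        numerator += n_c * (X_c.mean(axis=0) - overall_mean) ** 2   # between-class spread
        denominator += n_c * X_c.var(axis=0)                        # within-class spread
    return numerator / denominator

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)
X[y == 1, 0] += 2.0                  # make feature 0 clearly discriminative
print(fisher_score(X, y))            # feature 0 should receive the largest score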

Missing Value Ratio:

The missing value ratio can be used to evaluate each feature against a threshold value. The
ratio is obtained by dividing the number of missing values in a column by the total number
of observations. Variables whose ratio exceeds the threshold can be dropped.
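A minimal sketch of this idea with pandas follows; the DataFrame contents and the 50% threshold are assumptions made for illustration only.

# A minimal sketch of dropping features by missing value ratio with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 30, np.nan, 40, np.nan],
    "salary": [50000, np.nan, np.nan, np.nan, 65000],
    "city":   ["NY", "LA", "NY", np.nan, "SF"],
})

threshold = 0.5                                # drop columns with > 50% missing values
missing_ratio = df.isnull().sum() / len(df)    # missing values per column / observations
print(missing_ratio)

kept_columns = missing_ratio[missing_ratio <= threshold].index
df_reduced = df[kept_columns]
print(df_reduced.columns.tolist())             # "salary" (60% missing) is dropped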

3. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by considering
the interaction of features while keeping the computational cost low. They are fast, like
filter methods, but generally more accurate.
These methods are also iterative: the model is trained and evaluated, and in each iteration
the features that contribute the most to that training run are identified as the most
important ones.
Some techniques of embedded methods are:

o Regularization - Regularization adds a penalty term to the parameters (coefficients) of
the machine learning model to avoid overfitting. This penalty can shrink some coefficients
to zero, and features with zero coefficients can be removed from the dataset. Common
regularization techniques of this kind are L1 regularization (Lasso) and Elastic Net (a
combination of L1 and L2 regularization); see the sketch after this list.
o Random Forest Importance - Tree-based methods provide feature importances, which give a
way of selecting features. Here, feature importance indicates which features matter more in
model building or have a greater impact on the target variable. Random Forest is such a
tree-based method: a bagging algorithm that aggregates many decision trees. It
automatically ranks the nodes by their decrease in impurity (Gini impurity) over all the
trees. Arranging the nodes by their impurity values allows the trees to be pruned below a
specific node, and the remaining nodes correspond to a subset of the most important
features.
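As a rough sketch of both ideas, the example below uses scikit-learn's SelectFromModel, first with a Lasso (L1) estimator and then with a random forest; the synthetic data, the alpha value, and the "median" threshold are assumptions made only for illustration.

# A minimal sketch of embedded selection: L1 (Lasso) regularization and
# random forest feature importances, both via SelectFromModel.
from sklearn.datasets import make_regression, make_classification
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# L1 regularization: coefficients shrunk to zero mark removable features.
X_reg, y_reg = make_regression(n_samples=300, n_features=10, n_informative=4, random_state=0)
lasso_selector = SelectFromModel(Lasso(alpha=1.0))
lasso_selector.fit(X_reg, y_reg)
print("Kept by Lasso (L1):", lasso_selector.get_support())

# Random forest importance: rank features by impurity decrease across trees.
X_clf, y_clf = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)
rf_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median",            # keep features above the median importance
)
rf_selector.fit(X_clf, y_clf)
print("Random forest importances:", rf_selector.estimator_.feature_importances_)
print("Kept by random forest:", rf_selector.get_support())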

How to choose a Feature Selection Method?


For machine learning engineers, it is very important to understand which feature selection
method will work well for their model. The better we know the data types of the variables,
the easier it is to choose an appropriate statistical measure for feature selection.
To know this, we first need to identify the types of the input and output variables. In
machine learning, variables are mainly of two types:

o Numerical Variables: Variables with continuous values, such as integers and floats.
o Categorical Variables: Variables with categorical values, such as Boolean, ordinal, and
nominal variables.

Below are some univariate statistical measures, which can be used for filter-based
feature selection:

1. Numerical Input, Numerical Output:

This is the case of regression predictive modelling with numerical input variables. The
common measures for this case are correlation coefficients (a small sketch follows the list
below):

o Pearson's correlation coefficient (for linear correlation).
o Spearman's rank coefficient (for non-linear correlation).
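A minimal sketch of both measures with SciPy, using illustrative synthetic arrays:

# A minimal sketch of scoring a numerical feature against a numerical target
# with Pearson and Spearman correlation; the arrays are illustrative only.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
feature = rng.normal(size=200)
target = 3 * feature + rng.normal(scale=0.5, size=200)   # roughly linear relation

pearson_corr, pearson_p = pearsonr(feature, target)
spearman_corr, spearman_p = spearmanr(feature, target)
print(f"Pearson r = {pearson_corr:.3f}, Spearman rho = {spearman_corr:.3f}")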

2. Numerical Input, Categorical Output:

Numerical input with a categorical output is the case of classification predictive
modelling. Here too, correlation-based measures are used, but adapted to a categorical
output (a sketch of the ANOVA case follows the list below):

o ANOVA correlation coefficient (linear).
o Kendall's rank coefficient (nonlinear).
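A minimal sketch of ANOVA F-test scoring for numerical inputs and a categorical (class) output, using scikit-learn's f_classif; the data and the "keep 2" setting are assumptions for the example.

# A minimal sketch of ANOVA-based selection with SelectKBest and f_classif.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=6, n_informative=2, random_state=0)

selector = SelectKBest(score_func=f_classif, k=2)   # keep the 2 best features
X_new = selector.fit_transform(X, y)
print("ANOVA F-scores:", selector.scores_)
print("Selected shape:", X_new.shape)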

3. Categorical Input, Numerical Output:

This is the case of regression predictive modelling with categorical input variables. It is
a less common type of regression problem. We can use the same measures as in the previous
case, but applied in the reverse direction.

4. Categorical Input, Categorical Output:

This is the case of classification predictive modelling with categorical input variables.

The most commonly used technique for this case is the chi-squared test. We can also use
information gain (mutual information) in this case.

We can summarise the above cases with appropriate measures in the below table:

Input Variable    Output Variable    Feature Selection Technique
Numerical         Numerical          Pearson's correlation coefficient (linear); Spearman's rank coefficient (non-linear)
Numerical         Categorical        ANOVA correlation coefficient (linear); Kendall's rank coefficient (non-linear)
Categorical       Numerical          ANOVA correlation coefficient (linear); Kendall's rank coefficient (non-linear)
Categorical       Categorical        Chi-squared test (contingency tables); Mutual information

Conclusion
Feature selection is a very broad and complicated field of machine learning, and many
studies have already been conducted to discover the best methods. There is no fixed rule
for picking the best feature selection method; the choice depends on the machine learning
engineer, who can combine and adapt approaches to find what works best for a specific
problem. One should try a variety of model fits on different subsets of features selected
through different statistical measures.
