Unit II ML
Overview
This tutorial covers what model selection is, the considerations involved, and the main model selection techniques.
Model selection is the process of choosing one of the models as the final
model that addresses the problem.
Model selection is different from model assessment.
To be clear: we evaluate or assess candidate models in order to choose the
best one, and this is model selection. Once a model is chosen, it can then be
evaluated in order to communicate how well it is expected to perform in
general; this is model assessment.
Model selection also encompasses more than just choosing among learning
algorithms; it covers the data preparation pipeline as well, such as:
Data filtering.
Data transformation.
Feature selection.
Feature engineering.
And more…
The closer you look at the challenge of model selection, the more nuance
you will discover.
In the ideal case, we would have enough data to hold out a large, representative
sample on which to evaluate and compare candidate models directly. This is
impractical on most predictive modeling problems given that we rarely have
sufficient data, or are able to even judge what would be sufficient.
In many applications, however, the supply of data for training and testing
will be limited, and in order to build good models, we wish to use as much
of the available data as possible for training. However, if the validation
set is small, it will give a relatively noisy estimate of predictive
performance.
Instead, there are two main classes of techniques to approximate the
ideal case of model selection; they are:
Probabilistic Measures
Probabilistic measures involve analytically scoring a candidate model
using both its performance on the training dataset and the complexity of
the model.
It is known that training error is optimistically biased, and therefore is not
a good basis for choosing a model. The performance can be penalized
based on how optimistic the training error is believed to be. This is
typically achieved using algorithm-specific methods, often linear, that
penalize the score based on the complexity of the model.
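As a concrete illustration (not named above, but among the most widely used probabilistic measures), the Akaike Information Criterion scores a model as AIC = 2k - 2 ln(L), where k is the number of parameters and L is the maximized likelihood; lower is better. A minimal sketch in Python, assuming the log-likelihood has already been computed, with the BIC variant shown alongside:

import numpy as np

def aic(log_likelihood, n_params):
    # Akaike Information Criterion: lower values indicate a better
    # trade-off between fit and model complexity.
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    # Bayesian Information Criterion: penalizes complexity more
    # heavily as the sample size grows.
    return n_params * np.log(n_samples) - 2 * log_likelihood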
Resampling Methods
Resampling methods seek to estimate the performance of a model (or
more precisely, the model development process) on out-of-sample data.
This is achieved by splitting the training dataset into training and test
subsets, fitting a model on the training subset, and evaluating it on the test subset.
This process may then be repeated multiple times and the mean
performance across each trial is reported.
Probably the simplest and most widely used method for estimating
prediction error is cross-validation.
An example is the widely used k-fold cross-validation that splits the
training dataset into k folds where each example appears in a test set
only once.
Another is leave-one-out cross-validation (LOOCV), where the test set consists of a
single sample and each sample in turn is given an opportunity to be the test set,
requiring N models (where N is the number of samples in the training set) to be
constructed and evaluated.
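A hedged sketch of both schemes using scikit-learn; the dataset and model here are placeholders, not ones referenced above:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut

# Placeholder data and model for illustration only
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# k-fold cross-validation: every example appears in a test fold exactly once
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("10-fold mean accuracy:", kfold_scores.mean())

# Leave-one-out: N models, each tested on a single held-out sample
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV mean accuracy:", loo_scores.mean())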
MODEL TRAINING
Training a machine learning (ML) model is a process in which a machine learning algorithm is
fed with training data from which it can learn. ML models can be trained to benefit businesses in
numerous ways, by quickly processing huge volumes of data, identifying patterns, finding
anomalies or testing correlations that would be difficult for a human to do unaided.
Model training is the primary step in machine learning, resulting in a working model that can
then be validated, tested and deployed. The model’s performance during training will eventually
determine how well it will work when it is eventually put into an application for the end-users.
Both the quality of the training data and the choice of the algorithm are central to the model
training phase. In most cases, the available data is split into separate sets: one for
training and one or more for validation and testing.
The selection of the algorithm is primarily determined by the end-use case. However, there are
always additional factors that need to be considered, such as algorithm-model complexity,
performance, interpretability, computer resource requirements, and speed. Balancing out these
various requirements can make selecting algorithms an involved and complicated process.
In addition to this, you need to determine which algorithms you will use and what parameters
(hyperparameters) they will run with. With all of this done, you can split your dataset into a
training set and a testing set, then prepare your model algorithms for training.
Your initial training data is a limited resource that needs to be allocated carefully. Some of it can
be used to train your model, and some of it can be used to test your model – but you can’t use the
same data for each step. You can’t properly test a model unless you have given it a new data set
that it hasn’t encountered before. Splitting the training data into two or more sets allows you to
train and then validate the model using a single source of data. This allows you to see if the
model is overfit, meaning that it performs well with the training data but poorly with the test
data.
A common way of splitting the training data is to use cross-validation. In 10-fold cross-
validation, for example, the data is split into ten sets, allowing you to train and test the data ten
times. To do this, set aside one fold as the hold-out (test) set, train the model on the
remaining nine folds, and then evaluate it on the hold-out fold.
Repeat this process ten times, each time selecting a different fold to be the hold-out fold. The
average performance across the ten hold-out folds is your performance estimate, called the cross-
validated score.
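A minimal sketch of this procedure with scikit-learn's KFold; the data and model are hypothetical placeholders:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Placeholder data; substitute your own feature matrix X and labels y
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

kf = KFold(n_splits=10, shuffle=True, random_state=42)
fold_scores = []
for train_idx, holdout_idx in kf.split(X):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])            # train on the nine remaining folds
    fold_scores.append(model.score(X[holdout_idx], y[holdout_idx]))  # test on the hold-out fold

cv_score = np.mean(fold_scores)   # the cross-validated score
print(cv_score)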
In machine learning, there are thousands of algorithms to choose from, and there is no sure way
to determine which will be the best for any specific model. In most cases, you will likely try
dozens, if not hundreds, of algorithms in order to find the one that results in an accurate working
model. Selecting candidate algorithms will often depend on:
Size of the training data.
Accuracy and interpretability of the required output.
Training speed required, which often trades off against accuracy.
Linearity of the training data.
Number of features in the data set.
Hyperparameters are the high-level attributes set by the data science team before the model is
assembled and trained. While many of a model's parameters can be learned from the training
data, a model cannot learn its own hyperparameters.
As an example, if you are using a regression algorithm, the model can determine the regression
coefficients itself by analyzing the data. However, it cannot dictate the strength of the penalty it
should use to regularize an overabundance of variables. As another example, a model using the
random forest technique can determine where decision trees will be split, but the number of trees
to be used needs to be tuned beforehand.
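A brief sketch of the distinction (hypothetical data; these scikit-learn estimators are used only as examples): the number of trees and the regularization strength are set by hand before training, while coefficients and split points are learned from the data.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hyperparameters: chosen by the data science team before training
forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
logreg = LogisticRegression(C=0.1, max_iter=1000)   # C controls the penalty strength

# Parameters: learned from the training data during fit()
forest.fit(X, y)        # tree structures / split points are learned here
logreg.fit(X, y)
print(logreg.coef_)     # learned regression coefficients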
Now that the data is prepared and the model’s hyperparameters have been determined, it’s time
to start training the models. The process is essentially to loop through the different algorithms
using each set of hyperparameter values you’ve decided to explore. To do this, start by cross-
validating one algorithm with its first set of hyperparameter values and calculating its score.
Next, select another set of hyperparameter values you want to try for the same algorithm, cross-
validate it again and calculate the new score. Once you have tried each hyperparameter value,
you can repeat these same steps for additional algorithms.
Think of these trials as track and field heats. Each algorithm has demonstrated what it can do
with the different hyperparameter values. Now you can select the best version from each
algorithm and send them on to the final competition.
Now it’s time to test the best versions of each algorithm to determine which gives you the best
model overall.
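A hedged end-to-end sketch of this workflow; the candidate algorithms, hyperparameter grids, and data below are all illustrative placeholders:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each algorithm with the hyperparameter values to explore ("the heats")
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
    "forest": (RandomForestClassifier(random_state=0), {"n_estimators": [100, 300]}),
}

finalists = {}
for name, (estimator, grid) in candidates.items():
    # Cross-validate every hyperparameter combination on the training set
    search = GridSearchCV(estimator, grid, cv=10)
    search.fit(X_train, y_train)
    finalists[name] = search.best_estimator_

# "The final competition": compare the best version of each algorithm on the test set
for name, model in finalists.items():
    print(name, model.score(X_test, y_test))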
Should we always trust a model that performs well? A model could reject your
application for a mortgage or diagnose you with cancer. The consequences of these
decisions are serious and, even if they are correct, we would expect an explanation.
A human would be able to tell you that your income is too low for a mortgage or
that a specific cluster of cells is likely malignant. A model that provided similar
explanations would be more useful than one that just provided predictions.
By obtaining these explanations, we say we are interpreting a machine learning
model. In the rest of this article, we’ll explain in more detail what is meant by
interpretability. We’ll then move on to the importance and benefits of being able to
interpret your models. There are, however, still some downsides. We’ll end off by
discussing these and why, in some cases, you may prefer a less interpretable model.
Some models are intrinsically interpretable; logistic regression and decision trees, for
example. Neither of these requires additional techniques, but logistic regression may
still require more effort to interpret. We would need an understanding of the sigmoid
function and how coefficients are related to odds/probability. This complexity may also
lead to errors in our interpretations. In general, the more interpretable a model, the
easier it is to understand and the more certain we can be that our understanding is correct.
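For instance, a short sketch (with a hypothetical fitted model) of how logistic regression coefficients are read: exponentiating a coefficient gives the multiplicative change in the odds for a one-unit increase in that feature.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) = odds ratio for a one-unit increase in the feature
odds_ratios = np.exp(model.coef_[0])
for i, ratio in enumerate(odds_ratios):
    print(f"feature {i}: odds multiplied by {ratio:.2f} per unit increase")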
Interpretability is important because of the many benefits that flow from this.
Easier to explain
Our first benefit is that interpretable models are easier to explain to other people.
For any topic, the better we understand it the easier it is to explain. We should also
be able to explain it in simple terms (i.e. without mentioning the technical details).
In industry, there are many people who may expect simple explanations for how
your model works. These people will not necessarily have technical backgrounds or
experience with machine learning.
For example, suppose we have created a model that predicts whether someone will
make a life insurance claim. We want to use this model to automate life insurance
underwriting at our company. To sign off on the model, our boss would require a
detailed explanation of how it works. A disgruntled customer may rightly demand
an explanation for why they were not approved for life cover. A regulator could
even require such an explanation by law.
Trying to explain to these people how a neural network makes predictions may
cause a lot of confusion. Due to the uncertainty, they may not even accept the
explanation. In comparison, interpretable models like logistic regression can be
understood, and therefore explained, in human terms. For
example, we could explain precisely how much the customer’s smoking habit has
increased their probability of dying.
A well-known example is an image classifier trained to distinguish huskies from wolves
that turned out to rely mainly on snow in the background. The issue is that wolves are
associated with snow: wolves will usually be found in the snow whereas huskies are
not. What this example shows us is that models can
not only make incorrect predictions, but they can also make correct predictions in
the wrong way. As data scientists, we need to sense check our models to make sure
they are not making predictions in this way. The more interpretable your model the
easier this is to do.
As time goes on, a model’s prediction power may deteriorate. This is because
relationships between model features and the target variable can change. For
example, due to the wage gap, income may currently be a good predictor of gender.
As society becomes more equal, income would lose its predictive strength. We need
to be aware of these potential changes and their impact on our models. This is
harder to do for less interpretable models. As it is less clear how features are being used,
even if we know how the relationship with an individual feature has changed, we may not
be able to tell the impact on the model as a whole.
Easier to learn from the model
It is human nature to try to find meaning in the unknown. Machine learning can
help us discover patterns in our data we didn’t know existed. However, we cannot
identify these patterns by just looking at the model’s predictions. Any lessons are
lost if we can not interpret our model. Ultimately, the less interpretable a model the
harder it is to learn from it.
Algorithm Fairness
It is important that your models make unbiased decisions so that they do not
perpetuate any historical injustices. Identifying sources of bias can be difficult. It
often comes from associations between model features and protected variables (e.g.
race or gender). For example, due to a history of forced segregation in South Africa,
race is highly associated with someone's location/neighbourhood. Location can act
as a proxy for race, and a model that uses location may be biased towards a certain
race.
Using an interpretable model will not necessarily mean that you will have an
unbiased model. It also does not mean that it will be easier to determine if the
model is fair or not. This is because most measures of fairness (e.g. false positive
rate, disparate impact) are model agnostic. They are just as easy to calculate for any
model. What using an interpretable model does do is make it easier to identify and
correct the source of bias. We know what features are being used and we can check
which of these are associated with the protected variables.
Downsides to interpretability
Okay, we get it… interpretable models are great. They are easier to understand,
explain and learn from. They also allow us to better sense check current
performance, future performance and model fairness. There are however downsides
to interpretability and situations where we would prefer an explainable model.
Open to manipulation
Suppose, for example, that a loan model treats the number of credit cards a customer
holds as a risk factor, so cancelling cards improves her predicted score. The probability
of the customer repaying the loan does not change when she cancels her cards; the
customer has manipulated the model into making an incorrect prediction. The more
interpretable a model, the more transparent it is and the easier it is to manipulate. This
is the case even if the inner workings of the model are kept secret, since the relationships
between features and the target variable are usually simpler and therefore easier to guess.
Less to learn
We mentioned that interpretable models are easier to learn from. The flip side is
that they are less likely to teach us something new. An explainable model like a
neural network can automatically model interactions and non-linear relationships in
data. By interpreting these models we can uncover these relationships that we never
knew existed.
In comparison, algorithms like linear regression can only model linear relationships.
To model non-linear relationships, we would have to use feature engineering to
include the relevant variables in our dataset. This would require prior knowledge of
the relationships, defeating the purpose of interpreting the model.
Complexity-Accuracy Trade-off
What we can see from the above is that, generally, the less complicated a model is, the
more interpretable it is. So, for higher interpretability, there can be a trade-off of lower
accuracy. This is because, in some cases, simpler models can make less accurate
predictions. This really depends on the problem you are trying to solve. For
instance, you would get poor results using logistic regression to do image
recognition.
In this blog, we will discuss the various ways to check the performance of our
machine learning or deep learning model and why to use one in place of the
other. We will discuss terms like:
1. Confusion matrix
2. Accuracy
3. Precision
4. Recall
5. Specificity
6. F1 score
7. Precision-Recall or PR curve
8. ROC curve
9. PR vs ROC curve.
Confusion matrix
A confusion matrix summarizes a classifier's predictions against the true labels as
four counts: true positives (TP), false positives (FP), true negatives (TN) and false
negatives (FN). Most of the metrics below are built from these counts.
Accuracy
Accuracy, the fraction of predictions that are correct, (TP + TN) / (TP + TN + FP + FN),
is the most commonly used metric to judge a model, yet it is actually not a clear
indicator of performance. The worst happens when classes are imbalanced.
Take for example a cancer detection model. The chances of actually having
cancer are very low. Let’s say out of 100, 90 of the patients don’t have cancer
and the remaining 10 actually have it. We don't want to miss a patient who has
cancer but goes undetected (a false negative). Detecting everyone as not having
cancer gives an accuracy of 90% straight away. The model did nothing here; it just
predicted cancer-free for all 100 patients.
We surely need better alternatives.
Precision
Precision is the fraction of predicted positives that are actually positive: TP / (TP + FP).
Recall
Recall (also called sensitivity or true positive rate, TPR) is the fraction of actual
positives that the model catches: TP / (TP + FN). In the cancer example, recall is what
we care about most, since a missed cancer case is a false negative.
Specificity
Specificity is the recall of the negative class: TN / (TN + FP), the fraction of actual
negatives that are correctly identified.
F1 score
It is the harmonic mean of precision and recall:
F1 = 2 x (precision x recall) / (precision + recall).
This takes the contribution of both, so the higher the F1 score, the better. Because of
the product in the numerator, if either precision or recall goes low, the final F1 score
drops significantly. So a model does well on F1 if its predicted positives are actually
positive (precision) and it does not miss actual positives by predicting them as
negative (recall).
One drawback is that precision and recall are given equal importance, whereas
depending on our application we may need one to be weighted more than the other,
so the F1 score may not be the exact metric for it. In that case, either a weighted F1
score or inspecting the PR or ROC curve can help.
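A quick sketch of these metrics on a hypothetical imbalanced prediction vector (the labels below are made up for illustration), using scikit-learn:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical ground truth and predictions (1 = has cancer), heavily imbalanced
y_true = [0]*90 + [1]*10
y_pred = [0]*88 + [1]*2 + [0]*6 + [1]*4   # 2 false positives, 6 false negatives

print(confusion_matrix(y_true, y_pred))        # [[TN, FP], [FN, TP]]
print("accuracy :", accuracy_score(y_true, y_pred))   # high despite 6 missed cancers
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1 score :", f1_score(y_true, y_pred))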
PR curve
It is the curve of precision against recall for various threshold values. (The original
figure, omitted here, compared the precision-recall curves of six predictors.) The top
right part of the graph is the ideal region, where we get both high precision and high
recall. Based on our application we can choose the predictor and the threshold value.
PR AUC is just the area under the curve; the higher its numerical value, the better.
ROC curve
ROC stands for receiver operating characteristic; the curve plots the true positive
rate (TPR) against the false positive rate (FPR) for various threshold values. As TPR
increases, FPR also increases. (The original figures, omitted here, showed the four
outcome categories and a comparison of three predictors.) We want the threshold
value that brings us closest to the top-left corner of the plot, and comparing different
predictors on a given dataset also becomes easy; one can choose the threshold
according to the application at hand. ROC AUC is just the area under the curve; the
higher its numerical value, the better.
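A hedged sketch of both curves and their AUC values; the model, data and probability scores below are illustrative placeholders:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Imbalanced placeholder data (about 10% positives)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = model.predict_proba(X_test)[:, 1]    # probability of the positive class

fpr, tpr, roc_thresholds = roc_curve(y_test, y_score)
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_score)

print("ROC AUC:", roc_auc_score(y_test, y_score))
print("PR AUC (average precision):", average_precision_score(y_test, y_score))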
PR vs ROC curve
The PR curve is preferred over the ROC curve when the classes are highly imbalanced,
because precision and recall do not involve true negatives and so focus on the minority
(positive) class. Detecting cancer patients is such a case: very few of all those diagnosed
actually have it. We certainly don't want to miss a person who has cancer and leave them
undetected (recall), and we want to be sure that a detected case really has it (precision).
Conclusion
The evaluation metric to use depends heavily on the task at hand. For a long
time, accuracy was the only measure I used, which is really a vague option. I
hope this blog would have been useful for you. That's all from my side. Feel
free to suggest corrections and improvements.
Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, kNN, K-
Means, Random Forest, Dimensionality Reduction Algorithms and Gradient Boosting
are among the leading ML algorithms you can choose from, depending on which suits your
ML problem.
The complexity of the problem and of the learning algorithm, the model skill required, an
evaluation of the data size, and the use of statistical heuristic rules are the leading factors
that determine the quantity and types of training data sets that help in improving the
performance of the model.
Actually, there are different methods to measure the quality of the training data set.
Standard quality-assurance methods and detailed, in-depth quality assessment are the two
most popular methods you can use to ensure the quality of data sets. Quality of
data is important to get unbiased decisions from the ML models, so you need to make
sure to use the right quality of training data sets to improve the performance of your ML
model.
4. Supervised or Unsupervised ML
Beyond the ML algorithms discussed above, the performance of such AI-based models is
also affected by the method or process of machine learning: supervised, unsupervised or
reinforcement learning. Supervised learning algorithms work with a target/outcome variable
(or dependent variable) which is to be predicted from a given set of predictors (independent
variables), while unsupervised learning algorithms have no target variable and instead look
for structure, such as clusters, in the data.
Similarly, reinforcement learning is another important paradigm, used to train a model
to make specific decisions. In this training process, the machine learns from previous
experience and tries to retain the most suitable knowledge for making the right predictions.
FEATURE TRANSFORMATION
Feature transformation (FT) refers to a family of algorithms that create new features from the
existing features. These new features may not have the same interpretation as the original
features, but they may have more discriminatory power in a different space than the original
space. FT can also be used for feature reduction. It may happen in many ways: by
simple/linear combinations of the original features or by using non-linear functions. Some
common techniques for FT (scaling, Gaussian transformations, binning and encoding) are
discussed below.
Why transform?
The scale between features in the dataset can be very different from each
other (or they may have different units). For example, while the “Age” feature
varies between 0–100, “Car Price” can vary between 0–1000000. Some
machine learning methods are affected by these scale differences, so
normalizing the difference will contribute to the success of the model.
Models like KNN and SVM are distance-based algorithms, meaning that the
distance between points is used to obtain clusters or to find similarities.
Distances computed on unscaled features are dominated by the features with
the largest ranges and will mislead the model. In addition, models using gradient
descent optimization, such as linear regression, logistic regression, or neural
networks, are also negatively affected by unscaled data, and the coefficients of
linear models are affected by unscaled features. (In general, we don't need to
scale features for tree-based ensemble methods, since splits depend only on the
ordering of values, not their scale.)
Some machine learning models are based on the assumption that the features
are normally distributed. However, in real-life problems, the data is usually not
normally distributed. In this case, we apply transformations to approximate
these skewed data to the normal distribution so that the models can yield
better results.
Scaling
Min-Max Scaler
It scales the values between 0 and 1 (by default). In the default case, the minimum
value in the series becomes zero. If we don't want the range to be [0, 1], we can
define another range, too.
Normalizing to the range [0, 1] is important if the ranges of the features are very
different from one another, and it does not assume that the distribution of the
feature is normal. It may increase overfitting.
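A minimal sketch with scikit-learn's MinMaxScaler; here x stands for any numeric feature array (as in the snippets below), and the feature_range argument covers the case where a range other than [0, 1] is wanted:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                       # default range [0, 1]
x_scaled = scaler.fit_transform(x)

# a different target range can be given explicitly
scaler_0_5 = MinMaxScaler(feature_range=(0, 5))
x_scaled_0_5 = scaler_0_5.fit_transform(x)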
Standard Scaler
Also called Z-score normalization. It centers each feature on a mean of 0 and rescales
it to a standard deviation of 1, so most values fall within a few standard deviations of
zero. This method assumes that the feature has a normal distribution. It can help to
reduce overfitting when used properly.
x_scaled = (x - mean) / std
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
MaxAbs Scaler
All values in the feature are divided by the absolute maximum value of that
feature. Thus, the range becomes [-1,1]. If we want to keep 0 values as 0, we can
use this method.
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
df_scaled = scaler.fit_transform(df)
Robust Scaler
When we perform the scaling operation with statistics such as the mean or the
maximum value, the operation is vulnerable to outliers. If there are a lot of outliers
in the data, scaling with the IQR provides a more robust result: it subtracts the median
(so the median becomes 0) and divides by the interquartile range.
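A minimal sketch with scikit-learn's RobustScaler (x is any numeric feature array, as in the snippets above):

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()          # centers on the median, scales by the IQR
x_scaled = scaler.fit_transform(x)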
Gaussian Transformations
Many machine learning models work best if the features have Gaussian
distributions. There are numerous methods to transform features to normal
distribution.
I will use the below function to plot histogram and Q-Q distributions.
Plot function
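The original plotting helper is not reproduced in these notes; below is a minimal sketch of such a function, assuming matplotlib and scipy are available and matching the plot_gauss name used later:

import matplotlib.pyplot as plt
import scipy.stats as stats

def plot_gauss(x):
    # Left panel: histogram of the feature; right panel: Q-Q plot vs. the normal distribution
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].hist(x, bins=30)
    axes[0].set_title("Histogram")
    stats.probplot(x, dist="norm", plot=axes[1])
    plt.show()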
Q-Q plots are used to compare two distributions. If the two distributions were
identical then this plot would be a line directly on y=x. In our case, we can say that
if all points are on a straight line, they are perfectly normally distributed.
Log Transformation
Log transformation is one of the most used Gaussian transformation methods. The
log of each value is taken in feature, a nice way to deal with large numbers (Log of
1,000,000 is only 6). Thus, it reduces the impact of both high and low values in
features.
A slightly right-skewed feature becomes noticeably closer to a Gaussian distribution
after the logarithmic transformation. (Figure omitted: histogram and Q-Q plot before
and after the log transformation.)
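Assuming the feature x contains only positive values, a one-line sketch using NumPy and the helper above:

import numpy as np

plot_gauss(np.log(x))       # use np.log1p(x) instead if x can contain zeros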
Reciprocal Transformation
It cannot be applied to the value of zero. It reverses the order between data with
the same sign, i.e., larger values become small and vice versa (or like rewinding the
time). It can be applied to right-skewed data.
Square Root Transformation
It is only valid for non-negative numbers and is generally applied to right-skewed data.
Let’s generate right-skewed data for exploration (just set the skewness value to
positive in the data generation code above).
plot_gauss(x)
plot_gauss(x**(1/2))
Power Transformations
Power transformations can be used when there is heteroskedasticity, which is a
concern in linear relationships: the variance of the residuals isn't constant (so there
is a pattern in the variance).
Box-Cox Transformation
T(Y) = (Y^λ - 1) / λ, where λ is chosen so that the transformed data is as close to a
normal distribution as possible. Box-Cox requires strictly positive values.
Yeo-Johnson Transformation
This is the power transformation technique to use if the data contains zero or negative values.
from sklearn.preprocessing import PowerTransformer
transformer = PowerTransformer(method='yeo-johnson')
data = transformer.fit_transform(data)
Binning
Simply put, binning is the name given to the grouping operation; it can be applied to
both numeric and categorical features.
#Numerical Feature
df['Age_bin'] = pd.cut(df['Age'], bins=[0, 30, 70, 100],
                       labels=["Young", "Mid", "Old"])
#Out:
0      Young
1        Mid
2      Young
3        Mid
4        Mid
       ...
886    Young
887    Young
888    Young
889    Young
890      Mid
Name: Age_bin, Length: 891, dtype: category
Categories (3, object): ['Young' < 'Mid' < 'Old']

#Categorical Feature
conditions = [
    data['Item'].str.contains('Bmw'),
    data['Item'].str.contains('Hp'),
    data['Item'].str.contains('Iphone'),
    data['Item'].str.contains('Audi')]
choices = ['Car', 'Computer', 'Phone', 'Car']
data['Item_Group'] = np.select(conditions, choices, default='Other')
Encoders
We cannot feed categorical features as string types directly into models. Encoding
methods are used to convert these features into numerical form.
One Hot Encoding
Each unique group in the feature becomes its own binary column, so there will be as
many binary columns as there are groups. Each row takes the value 1 in the column of
its own group and 0 in the other columns.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
encoder.fit_transform(x).toarray()
#Out:
array([[0., 1.],
[1., 0.],
[1., 0.],
...,
[1., 0.],
[0., 1.],
[0., 1.]])
Care is needed when using One Hot Encoding: because it creates more than one
feature from a single feature, having too many groups can create too many new
features, and these new features can be correlated with each other.
Count / Frequency Encoding
Category group names are replaced by their count or frequency in the feature.
Frequency encoding can be used if the target variable and the feature have a
frequency-related relationship. It works well with tree algorithms.
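A minimal pandas sketch of frequency encoding; df and the 'Cabin' column are hypothetical:

# Map each category to its relative frequency in the column
freq = df['Cabin'].value_counts(normalize=True)
df['Cabin_freq'] = df['Cabin'].map(freq)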
Mean Encoding
It takes the mean value of the target feature for all category groups within a
feature and assigns it to each category.
from feature_engine.encoding import MeanEncoder
encoder = MeanEncoder(variables=['cabin'])
# fit the encoder
encoder.fit(X, y)
# transform the data
X_encoded = encoder.transform(X)
encoder.encoder_dict_
#Out:
{'cabin': {'A': 0.5294117647058824,
'B': 0.7619047619047619,
'C': 0.5633802816901409,
'D': 0.71875,
'E': 0.71875,
'F': 0.6666666666666666,
'G': 0.5,
'T': 0.0,
'n': 0.30484330484330485}}
Ordinal Encoding
Each category is replaced with an integer according to a meaningful order (e.g.,
'Young' < 'Mid' < 'Old'), so it suits features whose groups have a natural ranking.
Probability Ratio Encoding
It is similar to the Weight of Evidence method, but this time we only use the
probability ratio (we don't take the natural logarithm). It is best used in binary
classification problems.
from sunbird.categorical_encoding import probability_ratio_encoding
probability_ratio_encoding(data, 'Cabin', 'Survived')
Hashing Encoding
A hash function maps each category to one of a fixed number of output columns, so
the number of new features stays constant even for very high-cardinality features
(at the cost of possible collisions).
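A minimal sketch, assuming the category_encoders package is installed; the data and 'Item' column are hypothetical:

import category_encoders as ce

# Hash each category into a fixed number of output columns
encoder = ce.HashingEncoder(cols=['Item'], n_components=8)
data_encoded = encoder.fit_transform(data)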
Conclusion
Data preprocessing is the longest and most critical phase of a machine learning
project. Feature transformation work is also one of the most critical stages of the
preprocessing stage. Whether encoding, scaling, or transformation, there are many
methods and a lot of research in the literature and in practice. I think it is a very
important skill to have knowledge of these methods and to be able to sense where and
which one might work.
While developing a machine learning model, only a few variables in the dataset are
useful for building the model; the rest of the features are either redundant or irrelevant. If
we input the dataset with all these redundant and irrelevant features, it may negatively
impact and reduce the overall performance and accuracy of the model. Hence it is very
important to identify and select the most appropriate features from the data and
remove the irrelevant or less important features, which is done with the help of feature
selection in machine learning.
Feature selection is one of the important concepts of machine learning, which highly
impacts the performance of the model. As machine learning works on the concept of
"Garbage In Garbage Out", so we always need to input the most appropriate and
relevant dataset to the model in order to get a better result.
In this topic, we will discuss different feature selection techniques for machine learning.
But before that, let's first understand some basics of feature selection.
So, we can define feature selection as "a process of automatically or manually
selecting the subset of the most appropriate and relevant features to be used in model
building." Feature selection is performed by either including the important features or
excluding the irrelevant features from the dataset without changing them.
Selecting the best features helps the model to perform well. For example, Suppose we
want to create a model that automatically decides which car should be crushed for a
spare part, and to do this, we have a dataset. This dataset contains the car's model,
year, owner's name, and miles driven. In this dataset, the name of the owner does not
contribute to the model's performance, as it does not decide whether the car should be
crushed, so we can remove this column and select the rest of the features (columns) for
model building.
Below are some benefits of using feature selection in machine learning: it helps to avoid
overfitting, improves accuracy by removing misleading or redundant features, reduces
training time, and makes the model simpler and easier to interpret.
1. Wrapper Methods
In wrapper methodology, selection of features is done by considering it as a search
problem, in which different combinations are made, evaluated, and compared with other
combinations. It trains the algorithm by using the subset of features iteratively.
On the basis of the model's output, features are added or removed, and the model is
trained again with the new feature set; one such technique is sketched below.
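A hedged sketch of one wrapper technique, recursive feature elimination (RFE), with scikit-learn; the data and estimator are placeholders:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

# Repeatedly fit the estimator and drop the weakest features
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)
print(selector.support_)    # boolean mask of the selected features
print(selector.ranking_)    # rank 1 = selected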
2. Filter Methods
In Filter Method, features are selected on the basis of statistics measures. This method
does not depend on the learning algorithm and chooses the features as a pre-
processing step.
The filter method filters out the irrelevant feature and redundant columns from the
model by using different metrics through ranking.
The advantage of using filter methods is that they need little computational time and
do not overfit the data.
Some common techniques of Filter methods are as follows:
o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio
Fisher's Score:
Fisher's score is one of the popular supervised techniques for feature selection. It ranks
the variables by Fisher's criterion in descending order, and we can then select the
variables with a large Fisher's score.
Missing Value Ratio:
The missing value ratio can be used to evaluate a feature against a threshold value. It
is obtained by dividing the number of missing values in a column by the total number
of observations. A variable whose ratio is higher than the threshold value can be dropped.
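A short pandas sketch of the missing value ratio; df and the 20% threshold are illustrative:

# Fraction of missing values per column
missing_ratio = df.isnull().mean()

# Drop every column whose missing value ratio exceeds the threshold
threshold = 0.2
df_reduced = df.drop(columns=missing_ratio[missing_ratio > threshold].index)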
3. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by
considering the interaction of features while keeping the computational cost low. They
are fast, like filter methods, but more accurate.
These methods are also iterative: each training iteration is evaluated, and the features
that contribute most to that iteration are identified.
Some techniques of embedded methods are regularization (such as LASSO, which shrinks
the coefficients of unimportant features to zero) and tree-based feature importance (such
as random forest importance), as sketched below.
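A hedged sketch of these two embedded techniques with scikit-learn; the regression data below is a synthetic placeholder:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=300, n_features=10, n_informative=4, random_state=0)

# L1 regularization: coefficients of unimportant features are shrunk towards exactly zero
lasso = Lasso(alpha=1.0).fit(X, y)
print("kept features:", [i for i, c in enumerate(lasso.coef_) if c != 0])

# Tree-based importance: features can be ranked and the weakest dropped
forest = RandomForestRegressor(random_state=0).fit(X, y)
print("importances:", forest.feature_importances_)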
Below are some univariate statistical measures, which can be used for filter-based
feature selection:
Numerical Input, Numerical Output: This is the case of regression predictive modelling
with numerical input variables. The common methods for this case are correlation
coefficients:
o Pearson's correlation coefficient (linear).
o Spearman's rank coefficient (nonlinear).
Numerical Input, Categorical Output: This is the case for classification predictive
modelling problems. Here, too, correlation-based techniques are used, but with a
categorical output:
o ANOVA correlation coefficient (linear).
o Kendall's rank coefficient (nonlinear).
Categorical Input, Numerical Output: This is regression predictive modelling with
categorical input, an unusual kind of regression problem. We can use the same measures
as in the previous case, but in reverse order.
Categorical Input, Categorical Output: The commonly used technique for this case is
the Chi-Squared test. We can also use information gain (mutual information) in this case.
We can summarise the above cases with appropriate measures in the below table:
Input: Numerical, Output: Numerical
o Pearson's correlation coefficient (for linear correlation).
o Spearman's rank coefficient (for non-linear correlation).
Input: Numerical, Output: Categorical
o ANOVA correlation coefficient (linear).
o Kendall's rank coefficient (nonlinear).
Input: Categorical, Output: Numerical
o Kendall's rank coefficient (linear).
o ANOVA correlation coefficient (nonlinear).
Input: Categorical, Output: Categorical
o Chi-Squared test (contingency tables).
o Mutual Information.
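A hedged sketch of applying one of these measures with scikit-learn's SelectKBest; since the placeholder data has numerical input and categorical output, the ANOVA F-statistic (f_classif) is used:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

# Keep the k features with the strongest ANOVA F-statistic against the target
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())      # which columns were kept

# mutual_info_classif (or chi2 for non-negative data) can be swapped in for the other cases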
Conclusion
Feature selection is a very complicated and vast field of machine learning, and many
studies have already been made to discover the best methods. There is no fixed rule for
choosing the best feature selection method; rather, the choice depends on the machine
learning engineer, who can combine and innovate approaches to find the best method
for a specific problem. One should try a variety of model fits on different subsets of
features selected through different statistical measures.