Supervised Classification Notes
Accuracy is often not the right metric for a binary classification problem, so when thinking about errors in classification a confusion matrix is more commonly used.
In the confusion matrix, a false positive is a Type 1 error and a false negative is a Type 2 error.
Accuracy = (TP + TN) / (TP + TN + FP + FN). It is thrown off by skewed data.
Recall / sensitivity is the ability to identify the actual positive instances = TP / (TP + FN).
Precision is, out of all we predicted positive, how many are actually correct = TP / (TP + FP).
Specificity is about avoiding false alarms = TN / (FP + TN).
F1-score = 2 * (Precision * Recall) / (Precision + Recall). This is the harmonic mean of precision and recall.
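A minimal sketch of these metrics in code, assuming scikit-learn is available (the y_true / y_pred arrays below are made-up toy values):

# Confusion-matrix metrics with scikit-learn (toy labels for illustration).
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy   :", accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+TN+FP+FN)
print("Recall     :", recall_score(y_true, y_pred))     # TP/(TP+FN)
print("Precision  :", precision_score(y_true, y_pred))  # TP/(TP+FP)
print("Specificity:", tn / (fp + tn))                   # TN/(FP+TN)
print("F1         :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall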
K-Nearest Neighbors:
A simplified way to interpret K Nearest Neighbors is by thinking of the output of this
method as a decision boundary which is then used to classify new points.
K Nearest Neighbors with a poorly chosen k may not generalize well to new data: very small values of k tend to overfit, while very large values underfit. A best practice is to use the elbow method, picking the smallest k beyond which the error stops decreasing substantially.
When building a KNN classifier for a variable with 2 classes, it is advantageous to set the neighbor count k to an odd number. An odd neighbor count works as a tie breaker: it ensures the two classes cannot receive the same number of votes among the k nearest neighbors.
The Euclidean distance between two points is never greater than the Manhattan distance (the two are equal only when the points differ along a single coordinate).
KNN is easy to interpret, adapts well to new training data, and is simple to implement since it does not require parameter estimation.
Regression can be done with KNeighborsRegressor.
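A minimal sketch, assuming scikit-learn and synthetic data (not from the notes):

# KNN for classification (and, with KNeighborsRegressor, regression).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Odd k acts as a tie breaker for a 2-class problem.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))

# Same idea for regression: the prediction is the average of the k neighbors' targets.
reg = KNeighborsRegressor(n_neighbors=5)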
Regularization in SVM:
Regularization is applied to avoid overfitting. Here 1/C = lambda, the regularization strength parameter: a smaller C means more penalization and a simpler model.
Stochastic Gradient Descent (SGD) can be combined with a Nystroem or Radial Basis Function (RBF) sampler, which converts the original dataset into a higher-dimensional feature space.
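A minimal sketch of this idea, assuming scikit-learn and synthetic data (Nystroem could be swapped for RBFSampler):

# Approximate an RBF kernel with Nystroem, then train a linear SVM with SGD.
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = make_pipeline(
    Nystroem(kernel="rbf", gamma=0.1, n_components=100, random_state=0),  # map to higher dimensions
    SGDClassifier(loss="hinge", alpha=1e-4, random_state=0),              # hinge loss ~ linear SVM
)
model.fit(X, y)
print("training accuracy:", model.score(X, y))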
Model selection:
The main idea behind support vector machines is to find a hyperplane that separates classes by determining decision boundaries that maximize the margin between the classes.
When comparing logistic regression and SVMs, one of the main differences is that the logistic regression cost function decreases toward zero but rarely reaches zero, whereas SVMs use the hinge loss as a cost function to penalize misclassification. This tends to lead to better accuracy, at the cost of less sensitivity in the predicted probabilities.
By using Gaussian kernels, you transform your data space vectors into a different coordinate system and may have a better chance of finding a hyperplane that classifies your data well. SVMs with RBF kernels are slow to train on data sets that are large or have many features.
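A minimal sketch of an SVM with a Gaussian (RBF) kernel, assuming scikit-learn and synthetic data; for large data sets the kernel approximation sketched above is a common alternative:

# SVM with an RBF kernel; smaller C = stronger regularization = simpler boundary.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm_rbf.fit(X_train, y_train)
print("test accuracy:", svm_rbf.score(X_test, y_test))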
For KNN: fast fitting, slow prediction (lots of distances to measure); the decision boundary is flexible.
Logistic regression: learns parameter values, so fitting may be slow (it must find the best parameters); prediction is fast; the decision boundary is simple and less flexible.
Support vector machines: either linear classifiers, with fairly simple, linear boundaries that are fast to compute, or they require the kernel trick to come up with a nonlinear classification, which takes a lot longer to actually fit.
Decision Trees:
Decision Tree models are non-linear and are considered a greedy algorithm.
They segment data based on features to predict results.
They split nodes into leaves.
They can be used for either classification or regression.
They are very visual and easy to interpret.
With classification error as the impurity measure, the final (weighted) average classification error of the children can be identical to that of the parent node. (In the slide's example, 4/12 = (2+2)/(8+4) and 8/12 = (6+2)/(8+4).)
Entropy allows the weighted average information of the children to be less than the parent's, which yields information gain and lets splitting continue.
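A minimal worked calculation of entropy and information gain (the split counts below are made up):

# Entropy and information gain for a toy split.
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

parent = [8, 4]                # 8 of class A, 4 of class B
children = [[6, 1], [2, 3]]    # class counts in the two child nodes

n = sum(parent)
weighted_child_entropy = sum(sum(c) / n * entropy(c) for c in children)
info_gain = entropy(parent) - weighted_child_entropy
print("parent entropy  :", round(entropy(parent), 3))   # ~0.918
print("information gain:", round(info_gain, 3))         # ~0.17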
o Other types of pruning: we can also set a threshold on information gain, so that a split must achieve a certain amount of information gain to continue; or we can require a minimum number of rows in a subset, below which we no longer allow further splits.
For regression trees, the outcome is continuous.
Decision trees split your data using impurity measures. They are a greedy algorithm and
are not based on statistical assumptions.
The most common splitting impurity measures are entropy and the Gini index. Decision trees tend to overfit and to be very sensitive to different data.
Cross validation and pruning sometimes help with some of this.
Great advantages of decision trees are that they are really easy to interpret and require
no data preprocessing.
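A minimal sketch of a decision tree with simple pruning controls, assuming scikit-learn and synthetic data:

# Decision tree with pre-pruning controls; the printed rules are easy to interpret.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(
    criterion="entropy",          # or "gini"
    max_depth=4,                  # limit tree depth
    min_samples_leaf=10,          # minimum rows allowed in a leaf
    min_impurity_decrease=0.01,   # require a minimum gain to keep splitting
    random_state=0,
).fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))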
Improvement: create a lot of different trees and combine the different predictions of those trees (an ensemble method).
Each tree in the ensemble casts a vote, and the majority-vote class becomes the final predicted class for every row. Getting this majority class is meta-classification. When the trees are fit on bootstrapped samples and their votes are aggregated, the process is called bagging, or bootstrap aggregating.
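A minimal sketch of bagging, assuming scikit-learn and synthetic data (BaggingClassifier uses a decision tree as its default base estimator):

# Bagging (bootstrap aggregating) of decision trees; the majority vote is the prediction.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

bag = BaggingClassifier(
    n_estimators=100,     # number of bootstrapped trees that vote
    oob_score=True,       # out-of-bag estimate of accuracy
    n_jobs=-1,            # bagging can be computed in parallel
    random_state=0,
).fit(X, y)
print("out-of-bag score:", bag.oob_score_)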
Random Forest:
Random forests reduce variance more than bagging.
Why random forest? In reality the trees are not independent: since we are sampling with replacement, they are likely to be very highly correlated. The solution is to further de-correlate the trees, i.e. increase randomness. To achieve this, we restrict the number of features the trees are allowed to be built from, so each tree is built from a random subset not just of rows, but of columns as well.
The main difference between random forests and bagging is that random forests introduce more randomness by using only subsets of features, not only subsets of observations. In general, they tend to have better out-of-sample accuracy.
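A minimal sketch, assuming scikit-learn and synthetic data; max_features is what distinguishes this from plain bagging:

# Random forest: each split considers only a random subset of the features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # random subset of columns at each split de-correlates the trees
    oob_score=True,        # out-of-bag error as a built-in validation estimate
    n_jobs=-1,
    random_state=0,
).fit(X, y)
print("out-of-bag score:", rf.oob_score_)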
Use StratifiedShuffleSplit when classes are imbalanced (e.g., class 0 has 10 values and class 1 has 67 values).
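A minimal sketch, assuming scikit-learn; the feature matrix is random toy data:

# StratifiedShuffleSplit keeps the class ratio the same in every train/test split.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.random.rand(77, 3)
y = np.array([0] * 10 + [1] * 67)   # imbalanced classes as in the note

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    print("test class counts:", np.bincount(y[test_idx]))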
Boosting and Stacking:
Boosting:
Boosting acts as a meta-classifier: it combines many weak learners into one strong classifier.
A decision tree with just one split is called a decision stump. This first stump splits the universe of outcomes into two different values.
The intuition here is that we build further small decision stumps from our original decision stump in order to improve the original decision boundary. Each of these stumps is called a weak learner.
With boosting, we create a new way to decide on and stack together many weak learners intelligently, to ultimately come up with a strong classification algorithm.
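A minimal sketch, assuming scikit-learn and synthetic data (AdaBoostClassifier's default base learner is a depth-1 tree, i.e. a decision stump):

# AdaBoost: many decision stumps combined into a strong classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X, y)
print("training accuracy:", ada.score(X, y))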
Bagging vs Boosting:
In boosting, as we increase the number of trees, we improve on the mistakes made by prior trees. So at a certain point we do risk overfitting, because we keep trying to improve on the errors from past trees.
The learning rate needs to be optimized in order to properly regularize the model.
Because boosting keeps focusing on the observations it previously got wrong, it can produce good results on rare events, but it also tends to be sensitive to outliers, since it keeps trying to fit them.
Boosting algorithms create trees iteratively (successively), not independently, by boosting observations with high residuals from the previous tree model. They use the entire data set, not only bootstrapped samples. In other words, boosting is an ensemble model that does not use bootstrapped samples to fit the base trees, takes residuals into account, and fits the base trees iteratively.
Random (Extra) Trees, Random Forest, Bagging: the order from most randomness to least randomness.
Random Forest is the only one of these ensemble methods that uses a subset of the features for each tree.
Random Forest is an ensemble model for which you should look at the out-of-bag error.
Tuning the number of trees as a hyperparameter to be optimized is the best way to choose the number of trees to build in a bagging ensemble.
The type of ensemble modeling approach that is NOT a special case of model averaging is boosting.
Stacking:
In the VotingClassifier (VC) we have one parameter, voting. Voting is of two types, hard and soft. With hard voting, the class predicted by the majority of the estimators is taken as the output, whereas with soft voting we take the average of the predicted class probabilities. We can also increase the weight of a specific estimator if we want its vote or probabilities to count more.
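A minimal sketch, assuming scikit-learn and synthetic data; the weights below are illustrative:

# VotingClassifier: hard voting on labels vs. soft voting on averaged probabilities.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("tree", DecisionTreeClassifier(max_depth=4)),
]
hard_vc = VotingClassifier(estimators, voting="hard")                     # majority of predicted labels
soft_vc = VotingClassifier(estimators, voting="soft", weights=[2, 1, 1])  # weighted average of probabilities
print(hard_vc.fit(X, y).score(X, y), soft_vc.fit(X, y).score(X, y))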
Tree ensembles have been found to generalize well when scoring new data. Some useful and popular tree ensembles are bagging, boosting, and random forests. Bagging combines decision trees by using bootstrap-aggregated samples. An advantage specific to bagging is that this method can be multithreaded or computed in parallel. Most of these ensembles are assessed using out-of-bag error.
Random Forest
Random forest is a tree ensemble that has a similar approach to bagging. Its main characteristic is that it adds randomness by only using a subset of features at each split of the trees it trains. Extra Random Trees is an implementation that adds randomness by creating splits at random, instead of using a greedy search to find split variables and split points.
Boosting
Boosting methods are additive in the sense that they sequentially retrain decision trees
using the observations with the highest residuals on the previous tree. To do so,
observations with a high residual are assigned a higher weight.
Gradient Boosting
o 0-1 loss function, which ignores observations that were correctly classified. The shape of this loss function makes it difficult to optimize.
o Adaptive boosting (AdaBoost) loss function, which has an exponential nature. The shape of this function makes it more sensitive to outliers.
o Gradient boosting loss function. The most common gradient boosting implementation uses a binomial log-likelihood loss function called deviance. It tends to be more robust to outliers than AdaBoost.
The additive nature of gradient boosting makes it prone to overfitting. This can be addressed using cross validation or by fine-tuning the number of boosting iterations. Other hyperparameters to fine-tune include the learning rate, the size (depth) of the base trees, and the subsample fraction.
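A minimal sketch of tuning these, assuming scikit-learn and synthetic data; the grid values are illustrative:

# Gradient boosting with the usual hyperparameters tuned by cross-validated grid search.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {
    "n_estimators": [100, 300],     # number of boosting iterations
    "learning_rate": [0.05, 0.1],   # shrinkage; regularizes the additive model
    "max_depth": [2, 3],            # size of the base trees
    "subsample": [0.8, 1.0],        # stochastic gradient boosting
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print("best params:", search.best_params_)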
Stacking
Stacking is an ensemble method that combines any type of model by combining the
predicted probabilities of classes. In that sense, it is a generalized case of bagging. The
two most common ways to combine the predicted probabilities in stacking are: using a
majority vote or using weights for each predicted probability.
In general, a logistic regression is used as the second-level (meta) classifier. A voting classifier is the type of ensemble that you would use to combine several classifiers.
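A minimal sketch of stacking, assuming scikit-learn and synthetic data:

# Stacking: base models' predicted probabilities are combined by a logistic regression.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),   # the second-level (meta) classifier
    cv=3,
)
stack.fit(X, y)
print("training accuracy:", stack.score(X, y))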
Resampling = upsampling + downsampling:
o Do a stratified train-test-split.
o Up or down sample the dataset.
o Build models.
With an unbalanced dataset, the data is often not easily separable, and we must choose to make sacrifices to one class or the other.
For every minority-class data point identified as such, we might wrongly label a few majority-class points as minority-class. So, as recall goes up, precision will likely go down.
Upsampling mitigates some of the excessive weight on the minority class. Recall is typically still higher than precision, but with a smaller gap. This may result in overfitting, however, since we repeat a few minority rows to make the classes balanced.
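A minimal sketch of upsampling, assuming scikit-learn's resample utility and toy data; in practice only the training split (after a stratified split) should be upsampled:

# Upsample the minority class by sampling its rows with replacement until balanced.
import numpy as np
from sklearn.utils import resample

X = np.random.rand(77, 3)
y = np.array([0] * 10 + [1] * 67)   # class 0 is the minority

X_min, X_maj = X[y == 0], X[y == 1]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_min_up, X_maj])
y_bal = np.array([0] * len(X_min_up) + [1] * len(X_maj))
print("balanced class counts:", np.bincount(y_bal))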
Modeling Approaches:
1. Weighting and Stratified Sampling:
o Weighted sampling
o Stratified sampling
2. Random oversampling:
3. Synthetic oversampling:
o SMOTE
o ADASYN
SMOTE:
For both Borderline and SVM SMOTE, a neighborhood is defined using the m_neighbors parameter, which sets the number of neighbors used to decide whether a sample is in danger, safe, or an outlier.
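A minimal sketch, assuming the third-party imbalanced-learn package is installed; the data is toy data:

# SMOTE generates synthetic minority samples by interpolating between a minority
# point and one of its k nearest minority neighbors (BorderlineSMOTE / SVMSMOTE are variants).
import numpy as np
from imblearn.over_sampling import SMOTE

X = np.random.rand(77, 3)
y = np.array([0] * 10 + [1] * 67)

X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("class counts after SMOTE:", np.bincount(y_res))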
ADASYN: similar to SMOTE, but it adaptively generates more synthetic samples for minority points that are harder to learn (those whose neighborhoods are dominated by the majority class).
4. Blagging (Ensemble): balanced bagging, where each bootstrapped sample is downsampled to balance the classes before a base model is fit.
Synthetic Upsampling generates observations that were not part of the original data.
If the training set is small, high bias / low variance models (e.g. Naive Bayes) tend to perform better because they are less likely to overfit.
If the training set is large, low bias / high variance models (e.g. Logistic Regression) tend to perform better because they can reflect more complex relationships.