Supervised Classification Notes

SUPERVISED CLASSIFICATION

What is needed for classification?


 Features that can be quantified.
 Labels that are known.
Classification methods:
 Logistic regression – an extension of linear regression for classification.
 K-Nearest Neighbors – a simple, nonlinear approach that classifies a new point according to the labels of the nearest past examples in the feature space.
 Support Vector Machine – a linear classifier that can leverage the kernel trick to allow for complex decision boundaries.
 Neural Networks – combine linear and non-linear intermediate steps to come up with a complex decision boundary.
 Decision trees – use intermediate decision boundaries that are nonlinear to come up with a more complex final decision boundary.
 Random forests, boosting, and ensemble methods – build off decision trees and
other classifiers to show how we can leverage multiple classifiers to help reduce
both variance and bias in a final model.
Each of the above models can be used both for classification and regression.
Logistic Regression:
 The logistic (sigmoid) function is sigmoid(x) = 1 / (1 + e^(-x)), where x is the original linear function (a linear combination of the features).
 Because the sigmoid squashes extreme inputs, the algorithm is not skewed by extreme samples.

 The value of this function will always be between 0 and 1.

 Number of classes = 2 → separate the classes based on the decision boundary.

 Number of classes > 2 → use the one-vs-all method. For example, if we have 3 classes A, B, and C: first treat A as one class and B & C together as the other class to learn a classifier for A; then treat B as one class and A & C as the other to learn a classifier for B; finally treat C as one class and A & B as the other to learn a classifier for C.
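
To make the one-vs-all idea concrete, here is a minimal sketch (not from the notes) using scikit-learn's OneVsRestClassifier around a logistic regression, on a synthetic 3-class dataset generated with make_classification:

```python
# One-vs-all (one-vs-rest) logistic regression: a minimal sketch on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)

# OneVsRestClassifier fits one binary logistic regression per class
# (A vs. not-A, B vs. not-B, C vs. not-C) and picks the class with the highest score.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:5]))
```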
Applications:
 Customer spending
 Customer engagement – Eg: Customer most likely to engage in 6 months
 E-commerce - predict which transactions are fraudulent, using customer
characteristics such as location, IP address, etc.
 And many others

Confusion Matrix, Accuracy, Specificity, Precision, and Recall:

 Accuracy is often not the right metric for a binary classification problem. So, when thinking about errors in classification, the confusion matrix is more commonly used.

 In the confusion matrix, false negatives are Type 2 errors and false positives are Type 1 errors.

Accuracy = (TP + TN) / (TP + TN + FP + FN). It can be thrown off by skewed (imbalanced) data.

Recall / Sensitivity is the ability to identify the actual positive instances = TP / (TP + FN).

Precision is, out of all instances we predicted positive, how many are actually correct = TP / (TP + FP).

Specificity is the ability to avoid false alarms = TN / (FP + TN).

F1-score = 2 * (Precision * Recall) / (Precision + Recall). This is the harmonic mean of precision and recall.
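
A minimal sketch of computing these metrics with scikit-learn, assuming small hypothetical label arrays y_true and y_pred:

```python
# Confusion-matrix-based metrics: a minimal sketch with made-up labels.
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+TN+FP+FN)
print("Recall   :", recall_score(y_true, y_pred))     # TP/(TP+FN)
print("Precision:", precision_score(y_true, y_pred))  # TP/(TP+FP)
print("F1-score :", f1_score(y_true, y_pred))         # harmonic mean of the two
```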

Receiver Operator Characteristic or ROC curve:


 The ROC curve plots the True Positive Rate (sensitivity / recall) against the False Positive Rate at different probability thresholds.

False Positive Rate = (1 – specificity)

AUC is the area under the curve.

Which approach best works for choosing a classifier?


 ROC Curve – Better for balanced classes.
 Precision-Recall curve – Better for imbalanced classes.
The right curve depends on tying results (TPs, TNs etc.) to outcomes (relative cost
of FP or FN)
The curves compare classifiers generally (across possible decision thresholds),
which may be less relevant to business objectives.
Accuracy, precision, recall, and F1-score are computed from predicted labels, while the ROC curve uses the predicted probability of the positive class.
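
A minimal sketch contrasting the two: the model's predict_proba output feeds the ROC and precision-recall curves, while the label-based metrics would use predict. The synthetic, mildly imbalanced dataset is only for illustration:

```python
# ROC and precision-recall curves from predicted probabilities: a minimal sketch.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]          # probability of the positive class

fpr, tpr, _ = roc_curve(y_te, proba)           # ROC uses probabilities, not labels
prec, rec, _ = precision_recall_curve(y_te, proba)
print("AUC:", roc_auc_score(y_te, proba))
```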

K-Nearest Neighbors:
A simplified way to interpret K Nearest Neighbors is by thinking of the output of this
method as a decision boundary which is then used to classify new points.
 K Nearest Neighbors with very low values of k is likely to overfit and not generalize well to new data, while very high values of k underfit. A best practice is to use the elbow method to find a model with a low k beyond which the error no longer decreases much.
 When building a KNN classifier for a variable with 2 classes, it is advantageous to set the neighbor count k to an odd number. An odd neighbor count works as a tie breaker: it ensures there cannot be a tie between the two classes among the k nearest neighbors.
 The Euclidean distance between two points is never longer than the Manhattan distance.
 KNN is easy to interpret, adapts well to new training data, and is simple to implement, as it does not require parameter estimation.

 K-nearest neighbor methods are useful for classification. The elbow method is frequently used to identify a model with a low K and a low error rate.
 These methods are popular due to their easy computation and interpretability, although scoring new observations can be slow, the method provides no fitted parameters (estimators) to inspect, and it might not be suited for large data sets.
 Here, choose k = (a multiple of the number of classes) + 1, so that ties are avoided.
 Both classification and regression can be performed using KNN.

 K=20: when k equals the number of points, a single value is predicted, i.e., the mean of all 20 points.
 K=3: when k is larger than 1, KNN regression acts as a smoothing function, here just the rolling average of the closest three points (k = 3).
 K=1: the prediction simply connects each one of the points exactly.
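
A minimal sketch of the elbow approach for choosing k, assuming a synthetic dataset and a simple held-out validation split (an illustration, not the notes' exact recipe):

```python
# Elbow method for k in KNN classification: a minimal sketch on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=1)

# Track validation error for odd k values and look for the "elbow"
# where the error stops dropping sharply.
errors = {}
for k in range(1, 22, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errors[k] = 1 - knn.score(X_val, y_val)
print(errors)
```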

KNN Pros and Cons:

As the number of features increases, the dimensionality of the feature space increases, distances become less meaningful, and KNN suffers from the curse of dimensionality.


KNN vs Linear Regression:

Regression can be done with KNeighborsRegressor.
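
A tiny illustrative sketch of KNN regression with KNeighborsRegressor on made-up one-dimensional data:

```python
# KNN regression: a minimal sketch on toy data.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.arange(20).reshape(-1, 1)   # single feature
y = np.sin(X).ravel()              # continuous target

# k=3 averages the three closest points (a rolling-average-like smoother).
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y)
print(reg.predict([[4.5]]))
```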

Support Vector Machine:


SVMs return labels such as 1 or 0, and those labels are decided by which side of a certain decision boundary the points fall on.
 Support Vector Machines do not return predicted probabilities.
 Support Vector Machines use decision boundaries for classification.
 The algorithm behind Support Vector Machines calculates hyperplanes that
minimize misclassification error.
For support vector machines, we depend on the hinge loss, which does not penalize values outside of our margin (assuming we predicted them correctly), but more heavily penalizes those values that are further and further away from that margin.

Regularization in SVM:
 Regularization is applied to avoid overfitting.
 Here 1/C = lambda, which is the regularization strength parameter.
 Smaller C → stronger penalty → simpler model.
 For regression with a linear SVM, use LinearSVR (LinearSVC is the linear classifier).
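
A minimal sketch of a linear SVM classifier in scikit-learn showing the effect of C (smaller C, stronger regularization); the dataset and the C values are arbitrary examples:

```python
# Linear SVM with varying regularization strength C: a minimal sketch.
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, random_state=0)

for C in (0.01, 1.0, 100.0):
    # Smaller C = more regularization = simpler model.
    model = make_pipeline(StandardScaler(), LinearSVC(C=C, max_iter=5000))
    model.fit(X, y)
    print(f"C={C}: training accuracy = {model.score(X, y):.3f}")
```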
Support Vector Machine: Kernel
Any linear model can be turned into a non-linear model by applying a kernel to the
model.

 With gamma we control the reach of the Gaussian (RBF) kernel.
 The higher the value of gamma, the less regularization we will have.
 For both gamma and C, lower values mean more regularization, and higher values mean less regularization and more complex models.

 To scale kernel SVMs, use the Nystroem or Radial Basis Function (RBF) sampler to convert the original dataset into a higher-dimensional feature space, and then train a linear model with Stochastic Gradient Descent (SGD) on the transformed features.
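
A minimal sketch of this idea: approximate the RBF kernel with Nystroem, then fit a linear SGD classifier on the transformed features. The dataset and parameter values are assumptions for illustration:

```python
# Kernel approximation (Nystroem) + linear SGD classifier: a minimal sketch.
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=0)

model = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", gamma=0.2, n_components=100, random_state=0),
    SGDClassifier(loss="hinge", random_state=0),   # a linear SVM trained with SGD
)
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```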
Model selection:

The main idea behind support vector machines is to find a hyperplane that separates
classes by determining decision boundaries that maximize the distance between
classes.

When comparing logistic regression and SVMs, one of the main differences is that the logistic regression cost function decreases toward zero but rarely reaches zero, whereas SVMs use the hinge loss function as a cost function to penalize misclassification. This tends to lead to better accuracy at the cost of having less sensitivity on the predicted probabilities.

Regularization can help SVMs generalize better with future data.

By using Gaussian kernels, you transform your data space vectors into a different coordinate system, and may have better chances of finding a hyperplane that classifies your data well. SVMs with RBF kernels are slow to train on data sets that are large or have many features.

Characteristics of different classifiers:

 For KNN  Fast fitting, slow prediction (lots of distances to measure), decision
boundary is flexible.
 Logistic Regression  Learns parameter values, fitting may be slow (must find
best parameters), prediction is fast, decision boundary is simple and less flexible.
 Support vector machines are either simple linear classifiers, with fairly simple linear boundaries that are fast to compute, or they require the kernel trick to come up with a nonlinear classification, which takes a lot longer to actually fit.
Decision Trees:
 Decision Tree models are non-linear and are considered a greedy algorithm.
 They segment data based on features to predict results.
 They split nodes into leaves.
 They can be used for either classification or regression.
 They are very visual and easy to interpret.

Trees that predict categorical results are decision (classification) trees.

Trees that predict quantities or continuous values are regression trees.

 In regression trees, the values at the leaves are the averages of their members.


 By increasing the depth of a tree, one can allow for more possible values. The
bigger the depth of the tree, the more different average and different subsets you
are going to be working with. But as depth increases, we may overfit.

Building a decision tree:


 Select a feature and split the data into a binary tree. Continue splitting with the available features. Splitting until each leaf is pure (only one class remains) overfits the training set; stopping when a maximum depth is reached or a predefined performance metric is achieved helps avoid overfitting.
How to find the right splits?
o We need a way of evaluating all the possible splits. Once we have that, we can use greedy search to find the best split. Greedy search here means that at every step, we find the best split regardless of what happened in prior steps or what would happen in future steps.
What defines best split?
 One that maximizes information gain from the split.

Splitting based on classification error:


Max[p(i|t)] is the accuracy at node t, so the classification error is 1 - max[p(i|t)].

In the worked example, 4/12 is (2 + 2) / (8 + 4) and 8/12 is (6 + 2) / (8 + 4).

Note that not all leaves will be homogeneous (a leaf containing only "play" or only "not play" is homogeneous).

Splitting based on entropy:

 Entropy measures the impurity (uncertainty, or expected information) at a node: H(t) = -Σ p(i|t) log2 p(i|t).

 Splitting based on entropy allows further splits to occur and can eventually reach the goal of homogeneous nodes.
Why do we reach homogeneous nodes with entropy but not with classification
error?

 Classification error is a flat (piecewise-linear) function with its maximum at the center, where the center represents ambiguity (a 50-50 split). Splitting metrics favor results that are furthest away from the center.

 Entropy has the same maximum, but it is curved. The curvature allows splitting to continue until nodes are pure.

In classification error, the final average classification error can be identical to parent
node.
Entropy allows average information of children to be less than parent and thus, results
in information gain and continuous splitting.

 In practice, the Gini index is often used for splitting: G(t) = 1 - Σ p(i|t)².
 The function is similar to entropy – it has a bulge (curvature).
 It does not contain a logarithm term, so it is cheaper to compute.

 Completely homogeneous nodes (pure nodes) lead to overfitting.
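
A small sketch comparing the three impurity measures for a two-class node, written with NumPy (p is the fraction of one class at the node):

```python
# Impurity measures for a binary node: classification error (flat),
# entropy and Gini (both curved). A minimal sketch.
import numpy as np

def classification_error(p):
    return 1 - max(p, 1 - p)

def entropy(p):
    probs = np.array([p, 1 - p])
    probs = probs[probs > 0]                 # avoid log(0)
    return -np.sum(probs * np.log2(probs))

def gini(p):
    return 1 - (p ** 2 + (1 - p) ** 2)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, classification_error(p), round(entropy(p), 3), gini(p))
```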


Decision trees and high variance:
Problem – Decision trees tend to overfit which means small changes in data greatly
affect prediction.
Solution – prune trees (eliminate or reduce few leaf nodes)
How to decide what leaves to prune?
o Prune based on classification error threshold.

o Other types of pruning: we can also decide on a certain threshold of information gain, so that a certain amount of information gain is required to continue splitting; or we can require a minimum number of rows in a subset, below which we no longer allow further splits.
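
A minimal sketch of how these stopping and pruning ideas map onto scikit-learn's DecisionTreeClassifier; the specific parameter values are arbitrary examples:

```python
# Limiting tree growth in scikit-learn: a minimal sketch on synthetic data.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",            # or "entropy"
    max_depth=4,                 # stop at a maximum depth
    min_samples_leaf=10,         # minimum number of rows allowed in a leaf
    min_impurity_decrease=0.01,  # require a minimum information gain to split
    random_state=0,
).fit(X, y)
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```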

Strengths of decision tree:


 Easy to interpret and implement.
o If-then-else logic
 Handle any data category.
o Binary, ordinal, continuous
 No preprocessing or scaling required.

The outcome can also be continuous (regression trees).

Decision trees split your data using impurity measures. They are a greedy algorithm and
are not based on statistical assumptions.

The most common splitting impurity measures are Entropy and the Gini index. Decision trees tend to overfit and to be very sensitive to different data.
Cross validation and pruning sometimes help with some of this.

Great advantages of decision trees are that they are really easy to interpret and require
no data preprocessing.

Ensemble based methods and bagging:


 Pruning alone is often not enough for the model to generalize well.

Improvement: create a lot of different trees and combine the predictions of those trees → an ensemble method.

Each tree in the ensemble votes, and the majority-vote class is our final predicted class for every row. Getting this majority class is meta-classification. When the trees are fit on bootstrap samples, this process of majority voting is called bagging, or bootstrap aggregating.

How many trees to fit in bagging?

 Bagging performance improves as more trees are added → the bigger the number of trees, the less overfit our ensemble of decision trees will be.
 Bagging trees are easy to implement and interpret.
 Heterogeneous input data (different data types) is allowed with no preprocessing
required.
 Less variability than decision tree  less overfit.
 Can grow trees in parallel where each tree is independent of each other as it is
specific to its own dataset. This is more efficient than boosting as boosting will
not grow trees in parallel.
A model that averages the predictions of multiple models reduces the variance of
a single model and has high chances to generalize well when scoring new data.
Bagging is a tree ensemble that combines the prediction of several trees that
were trained on bootstrap samples of the data.
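
A minimal sketch of bagging with scikit-learn's BaggingClassifier on a synthetic dataset; the settings are illustrative:

```python
# Bagging decision trees: a minimal sketch.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=600, random_state=0)

# By default BaggingClassifier bags decision trees on bootstrap samples;
# n_jobs=-1 grows the independent trees in parallel.
bag = BaggingClassifier(n_estimators=100, oob_score=True,
                        n_jobs=-1, random_state=0).fit(X, y)
print("out-of-bag score:", bag.oob_score_)
```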

Random Forest:
 Reduces variance further than bagging.
Why random forest? In reality the trees are not independent: since we are sampling with replacement, they are likely to be very highly correlated.

Solution – further de-correlate the trees, i.e., increase randomness. To achieve this, we restrict the number of features the trees are allowed to be built from, so each tree is built from a random subset not just of rows, but of columns as well.

 Here m is the number of features sampled at each split.

This algorithm is called random forest. In it, the subsets of rows and columns are both random.

More trees are needed to get better out-of-sample accuracy.

Errors are further reduced in random forest relative to bagging.
Grow enough trees until the error settles down; beyond a certain point, additional trees won't improve results.
What if random forest does not reduce variance?
If random forest does not reduce variance, introduce even more randomness: select features randomly and create splits randomly, rather than choosing them greedily. Trees grown this way are called Extra Random Trees (extremely randomized trees).

The main difference between random forests and bagging, is that random forests
introduce more randomness by using only subsets of features, not only subsets of
observations. And in general, they tend to have better out of sample accuracy.

Use StratifiedShuffleSplit when classes are imbalanced → e.g., if class 0 has 10 values and class 1 has 67 values, use StratifiedShuffleSplit so every split preserves the class proportions.
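
A minimal sketch combining a random forest with StratifiedShuffleSplit on an imbalanced synthetic dataset (illustrative settings only):

```python
# Random forest evaluated with stratified shuffle splits: a minimal sketch.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.87, 0.13], random_state=0)

# max_features controls the random subset of columns used at each split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            oob_score=True, random_state=0)

cv = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
print("CV accuracy:", cross_val_score(rf, X, y, cv=cv).mean())
rf.fit(X, y)
print("out-of-bag score:", rf.oob_score_)
```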
Boosting and Stacking:
Boosting:
 Boosting is a meta-classifier.
 A decision tree with just one split is called a decision stump. The first stump splits the universe of outcomes into two different values.
 The intuition here is that we build on our original decision stump with further small decision stumps, in order to improve our original decision boundary. Each one of these stumps is called a weak learner.
 With boosting, we decide on and stack together many weak learners intelligently, to ultimately come up with a strong classification algorithm.

How boosting works?


o Create an initial decision stump.
o Fit the data and calculate the residuals.
o Adjust the weights of the points that are falsely classified. These falsely classified points are weighted more heavily in the next weak learner, which is rewarded more for getting the previously misclassified points right.
o So, we lower the weights of the records that the first model got right and increase the weights of the ones that are wrongly classified.

Combine all the decision boundaries of all the weak learners.

Successive classifiers are weighted by the learning rate lambda. More trees with a smaller lambda means correcting errors at a slower pace; on the other hand, if lambda is too high, then we can easily overfit, by allowing each successive tree to have too much influence on our final decision. So, a smaller learning rate means less overfitting, hence a higher bias and a lower variance.
Adaboost and Gradient Boosting Overview:

Bagging vs Boosting:
 In boosting, as we increase the number of trees, we
improve on mistakes that we made by prior trees.
 So, at a certain point, we do risk that danger of overfitting, because we keep trying to
improve and improve off of the errors from past trees.
 Learning rate needs to be optimized in order to
properly regularize the model.

We can use subsampling to add randomness and reduce overfitting. By using a subsample, our base learners do not train on the entire data set. This alone allows for faster optimization, as well as a bit of regularization, since each tree will not perfectly fit the entire data set.
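
A minimal sketch of gradient boosting with a small learning rate and row subsampling, using scikit-learn's GradientBoostingClassifier with illustrative settings:

```python
# Gradient boosting with shrinkage and subsampling: a minimal sketch.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=800, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=300,     # number of boosting iterations (trees)
    learning_rate=0.05,   # smaller lambda: slower error correction, less overfitting
    subsample=0.8,        # each tree trains on 80% of the rows
    max_depth=3,          # shallow trees as weak learners
    random_state=0,
).fit(X, y)
print("training accuracy:", gbm.score(X, y))
```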

The nature of boosting algorithms tends to produce good results in the presence of
outliers and rare events.
Boosting algorithms create trees iteratively (successively, not independently), boosting observations with high residuals from the previous tree model. They use the entire data set, not only bootstrapped samples.
Boosting is an ensemble model that does not use bootstrapped samples to fit the base
trees, takes residuals into account, and fits the base trees iteratively.
Extra Random Trees, Random Forest, Bagging – in order from most randomness to least randomness.
Random Forest is the only ensemble method that uses a subset of the features for each
tree.
Random Forest is an ensemble model that needs you to look at out of bag error.
The best way to choose the number of trees to build in a Bagging ensemble is to tune the number of trees as a hyperparameter that needs to be optimized.
Boosting is the type of ensemble modeling approach that is NOT a special case of model averaging.

Stacking:
In the VotingClassifier (VC) we have one parameter, voting, which can be of two types: hard and soft. Hard voting means that whatever class the majority of estimators predict is taken as the output, whereas in soft voting we take the average of the predicted class probabilities. We can increase the weights of particular estimators if we want to favor a specific class like 0 or 1.
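
A minimal sketch of hard vs. soft voting with scikit-learn's VotingClassifier; the three base models and the weights are arbitrary examples:

```python
# Hard vs. soft voting: a minimal sketch on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, random_state=0)
estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

hard = VotingClassifier(estimators, voting="hard").fit(X, y)
soft = VotingClassifier(estimators, voting="soft",
                        weights=[2, 1, 1]).fit(X, y)   # weight the averaged probabilities
print(hard.score(X, y), soft.score(X, y))
```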

Ensemble Based Methods and Bagging

Tree ensembles have been found to generalize well when scoring new data. Some useful and popular tree ensembles are bagging, boosting, and random forests. Bagging combines decision trees by using bootstrap aggregated samples. An advantage specific to bagging is that this method can be multithreaded or computed in parallel. Most of these ensembles are assessed using out-of-bag error.

Random Forest

Random forest is a tree ensemble that has a similar approach to bagging. Their main
characteristic is that they add randomness by only using a subset of features to train
each split of the trees it trains. Extra Random Trees is an implementation that adds
randomness by creating splits at random, instead of using a greedy search to find split
variables and split points.

Boosting

Boosting methods are additive in the sense that they sequentially retrain decision trees
using the observations with the highest residuals on the previous tree. To do so,
observations with a high residual are assigned a higher weight.

Gradient Boosting

The main loss functions for boosting algorithms are:

 0-1 loss function, which ignores observations that were correctly classified.
The shape of this loss function makes it difficult to optimize.
 Adaptive boosting loss function, which has an exponential nature. The shape
of this function is more sensitive to outliers.
 Gradient boosting loss function. The most common gradient boosting
implementation uses a binomial log-likelihood loss function called deviance.
It tends to be more robust to outliers than AdaBoost.
The additive nature of gradient boosting makes it prone to overfitting. This can be
addressed using cross validation or fine tuning the number of boosting iterations. Other
hyperparameters to fine tune are:

 learning rate (shrinkage)


 subsample
 number of features.

Stacking

Stacking is an ensemble method that combines any type of model by combining the
predicted probabilities of classes. In that sense, it is a generalized case of bagging. The
two most common ways to combine the predicted probabilities in stacking are: using a
majority vote or using weights for each predicted probability.

In general, logistic regressions are used as classifiers. A voting classifier is the type of
ensemble that you would use to combine several classifiers.
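
A minimal sketch of stacking with scikit-learn's StackingClassifier, using a logistic regression as the final (meta) estimator; the base models are arbitrary examples:

```python
# Stacking several classifiers behind a logistic regression meta-model: a minimal sketch.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svc", LinearSVC(max_iter=5000)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
).fit(X, y)
print("training accuracy:", stack.score(X, y))
```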

Modeling unbalanced classes:


 Classifiers are usually built to optimize accuracy and hence will perform poorly on unbalanced classes.
 For unbalanced datasets, we can balance the size of the classes by either downsampling the larger class or upsampling the smaller one.
In downsampling, we select from the majority class only as many points as there are in the minority class. In upsampling, we duplicate points from the minority class.

Resample = upsampling +
downsampling

Steps for unbalanced dataset:

o Do a stratified train-test-split.
o Up or down sample the dataset.
o Build models.

 With an unbalanced dataset, the data is often not easily separable. We must choose to make sacrifices to one class or the other.
 For every minor-class data point identified as such, we might wrongly label a few major-class points as minor-class. So, as recall goes up, precision will likely go down.

Downsampling adds tremendous importance to the minor class, typically shooting up recall and bringing down precision.

Upsampling mitigates some of the excessive weight on the minor class. Recall is typically still higher than precision, but with a smaller gap. However, this may result in overfitting, because we are repeating a few rows to upsample the minority class and balance the data.

Cross-validation works for any global model-making choice, including sampling.
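
A minimal sketch of random up- and down-sampling of a training set with sklearn.utils.resample, assuming a small hypothetical pandas DataFrame with a 10-vs-67 class split:

```python
# Random upsampling and downsampling with sklearn.utils.resample: a minimal sketch.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(77),
                   "label": [0] * 10 + [1] * 67})   # 10 vs. 67: imbalanced

minority = df[df.label == 0]
majority = df[df.label == 1]

# Upsample: duplicate minority rows (with replacement) until the classes match.
up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
upsampled = pd.concat([majority, up])

# Downsample: keep only as many majority rows as there are minority rows.
down = resample(majority, replace=False, n_samples=len(minority), random_state=0)
downsampled = pd.concat([down, minority])
print(upsampled.label.value_counts(), downsampled.label.value_counts())
```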

Modeling Approaches:
1. Weighting and Stratified Sampling:

Weighted sampling:

 Many models allow weighted observations.

o Adjust the weights so that the total weight is equal across the classes.

 Easy to do when it is available.


 No need to sacrifice data.
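
A minimal sketch of the weighted-observations idea using scikit-learn's class_weight option (here "balanced", which makes the total weight equal across classes) on a synthetic imbalanced dataset:

```python
# Class weighting instead of resampling: a minimal sketch.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# class_weight="balanced" reweights observations inversely to class frequency,
# so no data has to be thrown away or duplicated.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
print(clf.score(X, y))
```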

Stratified sampling:

2. Random and Synthetic Oversampling:

Random oversampling:

 Simplest oversampling approach.


 Resample with replacement from the minority class.
 No concerns about geometry of feature space.
 Good for categorical data.

Synthetic oversampling:

 Start with a point in the minority class.

 Choose one of its K nearest neighbors.
 Add a new point at random along the segment connecting the two points.

Two main approaches:

o SMOTE
o ADASYN

SMOTE:

SMOTE stands for Synthetic minority oversampling technique.


Regular: Connect minority class points to any neighbors (even
other classes) as long as they are nearest neighbors.

Borderline: Classify minority points as outliers, safe, or in-danger. Outlier means all neighbors belong to a different class. Safe means all neighbors are from the same class. In-danger means at least half of the neighbors are from a different class, but not all of them.

a. Borderline-1: connect minority in-danger points only to minority points.

b. Borderline-2: connect minority in-danger points to whatever points are nearby, regardless of class.

SVM: Use minority support vectors to generate new points.

For both Borderline and SVM SMOTE, a neighborhood is defined using a parameter that sets the number of neighbors to use when deciding whether a sample is in-danger, safe, or an outlier.

ADASYN:

Adaptive synthetic sampling works very similarly to SMOTE, but it generates more synthetic points for minority samples that are harder to learn (those with more majority-class neighbors).
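
A minimal sketch of these oversamplers using the imbalanced-learn (imblearn) package, which is assumed to be installed; the dataset is synthetic:

```python
# SMOTE variants and ADASYN with imbalanced-learn: a minimal sketch.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

for sampler in (SMOTE(random_state=0), BorderlineSMOTE(random_state=0),
                SVMSMOTE(random_state=0), ADASYN(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```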

3. Near neighbor methods:

 Undersampling implementations include Cluster Centroids, NearMiss, Tomek Links, and Edited Nearest Neighbors.

We are generally trying to keep points that are near our decision boundaries.

NearMiss-1 can easily be skewed by the presence of outliers, which may cause clusters to stick together far from the boundary.

NearMiss-2 selects the positive samples for which the average distance to the farthest samples of the negative class is the smallest.

So NearMiss-2 will not be as affected by outliers, since it does not focus on minimizing the distance to the nearest samples, but rather on minimizing the distance to the farthest samples. We are still minimizing a distance, but now the distance from the farthest samples, which helps reduce the effects of noise. It can still be affected by marginal outliers.

NearMiss-3 is a two-step algorithm. First, for each negative sample, we find the K-nearest neighbors of the positive class. Then the positive samples selected are the ones for which the average distance to those N nearest neighbors is the largest. NearMiss-3 is probably the version least affected by noise, due to this first-step sample selection, so it won't be as affected by outliers.
A Tomek link exists if two samples from different classes are each other's nearest neighbors. We can then either remove only the majority-class sample of the pair, or remove both samples.

The point is that this removes points that are too close together and creates more distinct classes.

Essentially, all we do here is run K-nearest neighbors with K equal to 1. Then, if a point in one of the majority classes is misclassified, that point is removed. We again end up with clusters that are more distinct, and the decision boundary is re-learned on the new, undersampled (downsampled) data.
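
A minimal sketch of the undersampling methods above using imbalanced-learn (imblearn); parameter choices and the synthetic data are illustrative:

```python
# Nearest-neighbor-based undersampling with imbalanced-learn: a minimal sketch.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import (NearMiss, TomekLinks,
                                     EditedNearestNeighbours, ClusterCentroids)

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

for sampler in (NearMiss(version=1), NearMiss(version=2), NearMiss(version=3),
                TomekLinks(), EditedNearestNeighbours(),
                ClusterCentroids(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```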

4. Blagging (Ensemble):

Combination of over- and under-sampling:

 SMOTE + Tomek links
 SMOTE + Edited Nearest Neighbors

In blagging (balanced bagging), we continuously downsample each of our bootstrap samples.

We take our bootstrap samples, downsample the majority class in each of them, and then use these now-balanced samples to learn each one of our individual decision trees. This again allows more weight to be attributed to the minority class, ensuring that a more balanced decision is being made.
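
A minimal sketch of combined over/under-sampling and balanced bagging with imbalanced-learn (imblearn); all settings are illustrative:

```python
# SMOTE + Tomek links, SMOTE + ENN, and balanced bagging: a minimal sketch.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek, SMOTEENN
from imblearn.ensemble import BalancedBaggingClassifier

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Combined over- and under-sampling.
for sampler in (SMOTETomek(random_state=0), SMOTEENN(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))

# "Blagging": each bootstrap sample is rebalanced before a tree is fit.
blag = BalancedBaggingClassifier(n_estimators=50, random_state=0).fit(X, y)
print("training accuracy:", blag.score(X, y))
```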
Cohen's Kappa is best if you're working with a team. It is a measure of agreement between two different raters (or two different models), where each rater classifies n items into mutually exclusive categories (i.e., performs classification). The goal is to come up with a ratio of the observed agreement between the two models compared to the probability of their agreeing just by chance.
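
A tiny sketch of Cohen's kappa with scikit-learn's cohen_kappa_score, on hypothetical predictions from two models:

```python
# Cohen's kappa between two sets of predicted labels: a minimal sketch.
from sklearn.metrics import cohen_kappa_score

model_a = [1, 0, 1, 1, 0, 1, 0, 0]
model_b = [1, 0, 1, 0, 0, 1, 1, 0]
print("Cohen's kappa:", cohen_kappa_score(model_a, model_b))
```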

A best practice is to do a stratified train/test split first, then use an upsampling or downsampling technique, and finally build a predictive model.

Random Upsampling preserves all original observations.

Synthetic Upsampling generates observations that were not part of the original data.

If training set is small, high bias / low variance models (e.g. Naive Bayes) tend
to perform better because they are less likely to be overfit.

If training set is large, low bias / high variance models (e.g. Logistic
Regression) tend to perform better because they can reflect more complex
relationships.
