
Data Imbalance Problem

Is accuracy the correct way to measure the quality of a model?
Why does this happen?
• Fraud Detection
• Anomaly Detection
• Healthcare
Confusion Matrix

The trick of Accuracy

Accuracy: 98%
F1 Score: 0

A model that always predicts the majority class on a highly imbalanced dataset can reach 98% accuracy while never identifying a single minority case, which is why its F1 score is 0.

Image: Analytics Vidhya
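A minimal sketch of this trick on synthetic 98:2 labels (not the slide's dataset): a classifier that always predicts the majority class.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 98 + [1] * 2)   # 98% majority, 2% minority
y_pred = np.zeros_like(y_true)          # always predict the majority class

print(accuracy_score(y_true, y_pred))   # 0.98
print(f1_score(y_true, y_pred))         # 0.0 -- no minority case is ever found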


How to measure the quality of a model?
ROC Curve and ROC AUC Score
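A sketch of scoring with ROC AUC, which ranks the model's scores rather than hard labels and so stays informative under imbalance (synthetic data, illustrative settings):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve

X, y = make_classification(n_samples=5000, weights=[0.98], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # probability of the minority class

fpr, tpr, thresholds = roc_curve(y_te, scores)  # points along the ROC curve
print("ROC AUC:", roc_auc_score(y_te, scores))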
Methods to Overcome the Data Imbalance Problem
• Class weight
• Oversampling
  • Random oversampling
  • Synthetic Minority Over-sampling Technique (SMOTE)
  • ADASYN
• Undersampling
  • Random undersampling
  • Cluster centroids
  • Near miss
  • Tomek links
To the very core
Class weight
• Provide a weight for each class which places more emphasis on the minority classes:

w_j = n_samples / (n_classes * n_samples_j)

Here,
• w_j is the weight for class j
• n_samples is the total number of samples (rows) in the dataset
• n_classes is the total number of unique classes in the target
• n_samples_j is the total number of rows belonging to class j

You modify the loss function to account for the imbalanced data.

Image: Analytics Vidhya
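scikit-learn implements exactly this weighting rule as class_weight='balanced'; a minimal sketch on toy 98:2 labels:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.linear_model import LogisticRegression

y = np.array([0] * 98 + [1] * 2)                     # toy 98:2 labels
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(y), y=y)
print(weights)   # [~0.51, 25.0] -> 100 / (2 * 98) and 100 / (2 * 2)

# The weights rescale each sample's contribution to the loss:
clf = LogisticRegression(class_weight='balanced')    # or an explicit {0: 0.51, 1: 25.0}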


After Experimenting

Before using class weights, the F1 score on the testing data: 0.10098851188885921

Image: Analytics Vidhya


More Experimenting

With class weights, the F1 score on the testing data: 0.1579371474617244

Image: Analytics Vidhya


Oversampling

• Oversample the minority classes to increase the number of minority observations until we've reached a balanced dataset.

• Random Oversampling
  • Randomly sample the minority classes and simply duplicate the sampled observations.
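A minimal sketch using the imbalanced-learn package, which provides the samplers named on these slides (synthetic data for illustration):

from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
print(Counter(y))                        # e.g. {0: 950, 1: 50}

ros = RandomOverSampler(random_state=0)  # duplicates minority rows at random
X_res, y_res = ros.fit_resample(X, y)
print(Counter(y_res))                    # classes now balanced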
Synthetic Minority Over-sampling Technique (SMOTE)

• It generates new observations by interpolating between observations in the original dataset.
• For a given minority observation x_i, a new (synthetic) observation is generated by interpolating between x_i and one of its k nearest neighbors, x_zi.
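A minimal SMOTE sketch with imbalanced-learn (k_neighbors=5 is the library default; data is synthetic):

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
smote = SMOTE(k_neighbors=5, random_state=0)  # interpolate between minority neighbors
X_res, y_res = smote.fit_resample(X, y)       # new points are synthetic, not duplicates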
Problems with SMOTE
Introducing ADASYN
Procedure
Explanation
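The procedure and explanation figures from these slides are not reproduced here. As background: ADASYN works like SMOTE but adaptively generates more synthetic points for minority samples whose neighborhoods contain more majority samples, i.e. the harder-to-learn ones. A minimal imbalanced-learn sketch (synthetic data, illustrative parameters):

from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
ada = ADASYN(n_neighbors=5, random_state=0)  # neighborhood density decides where to synthesize
X_res, y_res = ada.fit_resample(X, y)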
Under Sampling

• Throwing away data to make it easier to learn characteristics about the minority classes.

• Random under sampling
  • Simply sample the majority class at random until reaching a similar number of observations as the minority classes.
Random Undersampling
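A minimal random-undersampling sketch with imbalanced-learn (synthetic data):

from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
rus = RandomUnderSampler(random_state=0)  # drops majority rows at random
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y_res))                     # both classes at the minority size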
Replacement with cluster centroids
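The slide's illustration is not reproduced. In imbalanced-learn, ClusterCentroids undersamples by fitting KMeans to the majority class and replacing it with the cluster centroids; a minimal sketch on synthetic data:

from imblearn.under_sampling import ClusterCentroids
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
cc = ClusterCentroids(random_state=0)  # majority class -> KMeans centroids
X_res, y_res = cc.fit_resample(X, y)   # centroids are synthetic rows, not original samples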
Tomek Links

Tomek, Ivan. "Two modifications of CNN." IEEE Transactions on Systems, Man, and Cybernetics SMC-6 (1976): 769-772.
Tomek Link Removal

Let d(x_i, x_j) denote the Euclidean distance between x_i and x_j, where x_i denotes a sample belonging to the minority class and x_j a sample belonging to the majority class. If no sample x_k satisfies either of the following conditions:

d(x_i, x_k) < d(x_i, x_j), or
d(x_j, x_k) < d(x_i, x_j)

then the pair (x_i, x_j) is a Tomek link.

Source: https://hersanyagci.medium.com/under-sampling-methods-for-imbalanced-data-clustercentroids-randomundersampler-nearmiss-eae0eadcc145
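A minimal Tomek-link removal sketch with imbalanced-learn; sampling_strategy='majority' drops only the majority member of each link (synthetic data):

from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
tl = TomekLinks(sampling_strategy='majority')  # remove majority samples in Tomek links
X_res, y_res = tl.fit_resample(X, y)           # cleans boundary/noisy pairs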

Near miss-1

NearMiss-1 selects samples from the majority class for which the average distance to the N closest samples of the minority class is smallest.

Source: https://hersanyagci.medium.com/under-sampling-methods-for-imbalanced-data-clustercentroids-randomundersampler-nearmiss-eae0eadcc145

Near miss-2

NearMiss-2 selects samples from the majority class for which the average distance to the N farthest samples of the minority class is smallest.

Source: https://hersanyagci.medium.com/under-sampling-methods-for-imbalanced-data-clustercentroids-randomundersampler-nearmiss-eae0eadcc145

Near miss-3

NearMiss-3 is a 2-step algorithm. First, for each negative sample, its m nearest neighbors are kept. Then, the positive samples selected are the ones for which the average distance to the k nearest neighbors is the largest.
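A minimal NearMiss sketch with imbalanced-learn; the version parameter selects among the three variants described above (parameter values are illustrative):

from imblearn.under_sampling import NearMiss
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

nm1 = NearMiss(version=1, n_neighbors=3)       # NearMiss-1 (version=2 for NearMiss-2)
X_res, y_res = nm1.fit_resample(X, y)

nm3 = NearMiss(version=3, n_neighbors_ver3=3)  # the 2-step NearMiss-3 variant
X_res3, y_res3 = nm3.fit_resample(X, y)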
Performance Comparison

Zeng, Min, et al. "Effective prediction of three common diseases by combining SMOTE with Tomek links technique
for imbalanced medical data." 2016 IEEE International Conference of Online Analysis and Computing Science
(ICOACS). IEEE, 2016.
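The cited study combines SMOTE with Tomek link removal; imbalanced-learn ships this combination as SMOTETomek. A minimal sketch (synthetic data, not the paper's):

from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
smt = SMOTETomek(random_state=0)      # SMOTE first, then Tomek link cleaning
X_res, y_res = smt.fit_resample(X, y)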
Performance Continued
Image Data Augmentation

• Process of artificially increasing the amount of data by generating new data points from existing data.
• Done by adding minor alterations to data to generate new data points in the latent space of the original data to amplify the dataset.

• Synthetic data: when data is generated artificially without using real-world images; often produced by GANs.
• Augmented data: derived from original images with some sort of minor geometric transformations (such as flipping, translation, rotation, or the addition of noise) in order to increase the diversity of the training set.
Image Data Augmentation

• In real-world cases, we might have a dataset of photos captured under a specific set of conditions, while our test data may exist in a number of variations, such as varied orientations, locations, scales, and brightness. We can accommodate such cases by training deep neural networks with synthetically manipulated data.

• Popular in computer vision.


Image Data Augmentation

• Advantages:
  • Brings diversity to the data
  • Deals with the limited-data problem
  • Improves model prediction
  • Reduces the cost of collecting and labeling data
  • Helps resolve the class imbalance problem
  • Increases the generalization of the model
  • Reduces overfitting
Image Data Augmentation

Image processing activities for data augmentation:

• Padding
• Rotation
• Scaling
• Flipping (vertical and horizontal)
• Translation (image is moved along the X or Y direction)
• Cropping
• Darkening & brightening
• Gray scaling
• Changing contrast
• Adding noise
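A sketch of several of these operations using Keras' ImageDataGenerator (parameter values are illustrative, not tuned; cropping, gray scaling, contrast changes, and noise would need a custom preprocessing_function):

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,            # rotation
    width_shift_range=0.1,        # translation along X
    height_shift_range=0.1,       # translation along Y
    zoom_range=0.2,               # scaling
    horizontal_flip=True,         # flipping
    vertical_flip=True,
    brightness_range=(0.7, 1.3),  # darkening & brightening
    fill_mode='nearest',          # how empty border pixels are filled
)

images = np.random.rand(8, 64, 64, 3)             # stand-in image batch
batch = next(datagen.flow(images, batch_size=8))  # yields augmented copies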
Image Data Augmentation

Flip

Original · Horizontally flipped · Vertically flipped


Image Data Augmentation

Rotate
Image Data Augmentation

Scale

Original · Scaled out by 10% · Scaled out by 20%


Image Data Augmentation

Crop
Image Data Augmentation

Adding noise

Original · Gaussian noise · Salt-and-pepper noise


Image Data Augmentation

Translation

Original · Translated to the right · Translated upwards


Regularization: Early Stopping

• Overfitting also happens when a model is trained for too long.
• Solution: stop training before we have a chance to overfit.
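A minimal Keras sketch of early stopping (the monitored metric and patience are illustrative choices; the model and data are placeholders):

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',          # watch validation loss
    patience=5,                  # allow 5 stagnant epochs before stopping
    restore_best_weights=True,   # roll back to the best epoch's weights
)

# model.fit(X_train, y_train, validation_split=0.2,
#           epochs=100, callbacks=[early_stop])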

Normalization
• Recall the loss function
Some of the common Norms
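The loss function and norm figures from these slides are not reproduced. As a reminder (standard notation, an assumption rather than the slide's), the common norms used as regularization penalties are:

\|w\|_1 = \sum_i |w_i| \quad (\text{L1, lasso}), \qquad
\|w\|_2 = \sqrt{\textstyle\sum_i w_i^2} \quad (\text{L2, ridge}), \qquad
\|w\|_p = \Big(\textstyle\sum_i |w_i|^p\Big)^{1/p}

and the regularized objective adds a weighted penalty to the loss:

L_{\text{reg}}(\theta) = L(\theta) + \lambda\,\Omega(\theta), \qquad \Omega(\theta) = \|\theta\|_1 \ \text{or}\ \tfrac{1}{2}\|\theta\|_2^2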
Curve Fitting

Source: Medium
Dropout
Training:

➢ Each time, before updating the parameters:
  ⚫ Each neuron has probability p% of being dropped out.
Dropout
Training:

➢ Each time, before updating the parameters:
  ⚫ Each neuron has probability p% of being dropped out, so the structure of the network is changed and becomes thinner!
  ⚫ The new, thinner network is used for training.
➢ For each mini-batch, we resample the dropout neurons.
Dropout

Training of Dropout (assume the dropout rate is 50%):

z = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4

Testing of Dropout (no dropout, weights rescaled):

z' = 0.5 w_1 x_1 + 0.5 w_2 x_2 + 0.5 w_3 x_3 + 0.5 w_4 x_4

In general, if the dropout rate is p%, all weights are multiplied by (1 − p%) at testing time, so that z' matches the expected value of z seen during training.
Dropout is a kind of ensemble.

Training: the training set is sampled into Set 1, Set 2, Set 3, and Set 4, which are used to train Network 1, Network 2, Network 3, and Network 4 respectively.

Train a bunch of networks with different structures.
Dropout is a kind of ensemble.

Ensemble testing: feed the testing data x to Network 1, Network 2, Network 3, and Network 4, obtaining outputs y1, y2, y3, and y4, then average them.
Dropout is a kind of ensemble.

Training of Dropout: mini-batch 1, mini-batch 2, mini-batch 3, mini-batch 4, ... With M neurons, there are 2^M possible networks.

➢ Using one mini-batch to train one network.
➢ Some parameters in the network are shared.
Dropout is a kind of ensemble.

Testing of Dropout: feed the testing data x to all the possible networks and average their outputs y1, y2, y3, ... This average ≈ y, the output of the single full network with all the weights multiplied by (1 − p)%.
Demo

……
model.add(Dense(500))    # 500-unit layer (Dense is an assumption; the slide shows only "500")
model.add(Dropout(0.8))  # "Dropout" capitalized to match the Keras API
……
model.add(Dense(500))    # 500
model.add(Dropout(0.8))
model.add(Dense(10, activation='softmax'))  # Softmax outputs y1, y2, …… y10
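A self-contained, runnable version of the demo (the input shape, ReLU activations, and compile settings are assumptions; the slide fixes only the 500-unit layers, the 0.8 dropout rate, and the 10-way softmax). Keras applies inverted dropout, so the test-time weight rescaling from the earlier slides is handled automatically:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(500, activation='relu', input_shape=(784,)),  # assumed input size
    Dropout(0.8),                                       # each neuron dropped with p = 80%
    Dense(500, activation='relu'),
    Dropout(0.8),
    Dense(10, activation='softmax'),                    # y1 ... y10
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])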
Some Results

Paper: Srivastava, Nitish, et al. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." Journal of Machine Learning Research 15 (2014): 1929-1958.
