Data Imbalance Problem
Is accuracy the correct way to measure the quality of a model?
Why does this happen?
• Fraud Detection
• Anomaly Detection
• Healthcare
Confusion Matrix
The trick of Accuracy
Accuracy: 98%
F1 Score: 0
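The trick is easy to reproduce: on a 98:2 class split, a model that always predicts the majority class scores 98% accuracy while its F1 score is 0. A minimal sketch in plain Python (the toy labels are illustrative):

```python
# Toy labels: 98 negatives, 2 positives; a degenerate model that
# always predicts the majority class (0).
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(accuracy)  # 0.98
print(f1)        # 0.0
```

Accuracy looks excellent even though the model never detects a single positive case, which is exactly why F1 (or the full confusion matrix) is preferred on imbalanced data.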
• Random Oversampling
• Randomly sample the minority class and simply duplicate the sampled observations
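A minimal sketch of random oversampling (the helper name and toy data are illustrative, not from a library):

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly sampled minority observations until the
    classes are balanced. Illustrative helper, not a library API."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    majority = [(x, t) for x, t in zip(X, y) if t != minority_label]
    # Sample (with replacement) enough minority copies to match the majority.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    combined = majority + minority + extra
    return [x for x, _ in combined], [t for _, t in combined]

X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y, minority_label=1)
print(y_bal.count(0), y_bal.count(1))  # 4 4
```

Note the drawback this illustrates: the oversampled set contains exact duplicates of the same minority points, which can encourage overfitting.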
Synthetic Minority Over-sampling Technique (SMOTE)
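Rather than duplicating points, SMOTE synthesizes new minority samples by interpolating between a minority point and one of its k nearest minority neighbors. A simplified sketch (function name and toy data are illustrative; real use would go through a library such as imbalanced-learn):

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Simplified SMOTE: each synthetic point lies on the segment between
    a minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)

    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic

minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
new_pts = smote_sketch(minority, n_new=4)
print(len(new_pts))  # 4
```

Because the new points are interpolations rather than copies, SMOTE adds variety to the minority class instead of exact duplicates.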
Tomek, Ivan. "Two modifications of CNN." IEEE Transactions on Systems, Man, and Cybernetics SMC-6 (1976): 769-772.
Tomek Link Removal
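A Tomek link is a pair of opposite-class samples that are each other's nearest neighbors; removal then drops the majority-class member of each pair to clean the class boundary. A toy sketch (helper names and data are illustrative):

```python
def tomek_links(X, y):
    """Find Tomek links: index pairs of opposite-class points that are
    each other's nearest neighbors. Illustrative, not a library API."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: dist(X[i], X[j]))

    links = []
    for i in range(len(X)):
        j = nearest(i)
        if i < j and y[i] != y[j] and nearest(j) == i:
            links.append((i, j))
    return links

# Toy 1-D data: the majority point at 1.0 and the minority point at 1.05
# are mutual nearest neighbors of opposite classes, hence a Tomek link.
X = [[0.0], [0.1], [1.0], [1.05], [2.0]]
y = [0, 0, 0, 1, 0]
links = tomek_links(X, y)
print(links)  # [(2, 3)]

# Removal: drop the majority-class member of each link.
drop = {i if y[j] == 1 else j for i, j in links}
X_clean = [x for k, x in enumerate(X) if k not in drop]
```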
NearMiss-1
NearMiss-1 selects samples from the majority class for which the average distance to the N closest samples of the minority class is smallest.
Source: https://hersanyagci.medium.com/under-sampling-methods-for-imbalanced-data-clustercentroids-randomundersampler-nearmiss-eae0eadcc145
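The NearMiss-1 rule can be sketched directly from its definition (illustrative helper, not the imbalanced-learn API):

```python
def nearmiss1(majority, minority, n_keep, n_closest=2):
    """NearMiss-1 sketch: keep the majority samples whose average
    distance to their n_closest minority neighbors is smallest."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    def avg_dist_to_minority(m):
        d = sorted(dist(m, p) for p in minority)[:n_closest]
        return sum(d) / len(d)

    return sorted(majority, key=avg_dist_to_minority)[:n_keep]

# Toy 1-D data: majority points near the minority cluster are kept,
# the far-away ones (5.0, 6.0) are discarded.
majority = [[0.0], [0.5], [5.0], [6.0]]
minority = [[0.2], [0.4]]
kept = nearmiss1(majority, minority, n_keep=2)
print(sorted(kept))  # [[0.0], [0.5]]
```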
NearMiss-2
NearMiss-2 selects samples from the majority class for which the average distance to the N farthest samples of the minority class is smallest.
NearMiss-3
NearMiss-3 keeps, for each minority sample, a given number of its closest majority-class neighbors.
Zeng, Min, et al. "Effective prediction of three common diseases by combining SMOTE with Tomek links technique
for imbalanced medical data." 2016 IEEE International Conference of Online Analysis and Computing Science
(ICOACS). IEEE, 2016.
Performance Continued
Image Data Augmentation
• Advantages:
• Brings diversity to the data
• Deals with the limited-data problem
• Improves model prediction
• Reduces the cost of collecting and labeling data
• Helps resolve the class imbalance problem
• Increases generalization of the model
• Reduces overfitting
Image Data Augmentation
• Padding
• Rotation
• Scaling
• Flipping (vertical and horizontal)
• Translation (image is moved along the X, Y directions)
• Cropping
• Darkening & brightening
• Gray scaling
• Changing contrast
• Adding noise
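Several of these operations can be sketched in a few lines of NumPy on a toy 4×4 "image" (values and parameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
img = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "image"

flipped_h = img[:, ::-1]                       # horizontal flip
flipped_v = img[::-1, :]                       # vertical flip
rotated = np.rot90(img)                        # 90-degree rotation
translated = np.roll(img, shift=1, axis=1)     # translation along X (wraps)
brightened = np.clip(img + 2.0, 0.0, 15.0)     # brightening (clipped)
noisy = img + rng.normal(0.0, 0.1, img.shape)  # additive Gaussian noise
cropped = img[1:3, 1:3]                        # 2x2 crop

print(cropped.shape)  # (2, 2)
```

Each transform yields a new training sample with the same label, which is how augmentation multiplies a limited dataset.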
Image Data Augmentation
[Figure: example images for flip, rotate, scale, crop, adding noise, and translation. Source: Medium]
Dropout
• Training: each neuron may be dropped, giving a thinner network that computes z from the surviving weights w1, w2, w3, w4.
• Testing: no neurons are dropped; with a 50% dropout rate, each weight is halved, so z' is computed with 0.5 × w1, 0.5 × w2, 0.5 × w3, 0.5 × w4.
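The training/testing asymmetry can be sketched in plain Python (hypothetical helper names; many real frameworks instead use "inverted" dropout, which rescales at training time):

```python
import random

def dropout_train(x, p_drop, rng):
    """Training: zero each activation with probability p_drop,
    producing a thinner network on every pass."""
    return [0.0 if rng.random() < p_drop else xi for xi in x]

def dropout_test(w, p_drop):
    """Testing: keep all units, but scale every weight by the keep
    probability (e.g. 0.5 * w_i for a 50% dropout rate)."""
    return [(1.0 - p_drop) * wi for wi in w]

w = [1.0, 2.0, 3.0, 4.0]
print(dropout_test(w, 0.5))  # [0.5, 1.0, 1.5, 2.0]
```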
Dropout is a kind of ensemble.
Classic ensembling: train several networks on the training set, then average their outputs y1, y2, y3, y4.
Dropout is a kind of ensemble.
Each minibatch trains one of the 2^M possible thinned networks (M = number of neurons), and all of these networks share their weights. At test time no units are dropped; instead, all the weights are multiplied by the keep probability, and the resulting single forward pass approximates the ensemble average: average(y1, y2, y3, …) ≈ y.
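For a single linear unit the approximation is exact: averaging z = w · x over all 2^M dropout masks, weighted by their probabilities, equals using the full weights scaled by the keep probability. A toy check with M = 3 (the weight and input values are illustrative):

```python
from itertools import product

w = [1.0, 2.0, 3.0]
x = [0.5, -1.0, 2.0]
keep = 0.5  # keep probability for a 50% dropout rate

# Probability-weighted average over all 2^3 thinned networks.
ensemble_avg = 0.0
for mask in product([0, 1], repeat=len(w)):
    prob = 1.0
    for m in mask:
        prob *= keep if m == 1 else (1.0 - keep)
    z = sum(m * wi * xi for m, wi, xi in zip(mask, w, x))
    ensemble_avg += prob * z

# Single forward pass with weights scaled by the keep probability.
scaled_full = sum(keep * wi * xi for wi, xi in zip(w, x))
print(ensemble_avg, scaled_full)  # the two agree (up to float rounding)
```

With nonlinear activations the equality no longer holds exactly, which is why the weight-scaling rule is only an approximation of the ensemble average in practice.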
Demo
A network with two hidden layers of 500 units, each followed by a dropout layer (Keras-style):
model.add( Dense(500) )
model.add( Dropout(0.8) )
model.add( Dense(500) )
model.add( Dropout(0.8) )
Softmax output: y1, y2, …, y10
Some Results
Source: Srivastava, Nitish, et al. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." Journal of Machine Learning Research 15 (2014): 1929-1958.