
Data Imbalance Problem

Is accuracy the correct way to measure the quality of a model?
Why does this happen?
• Fraud Detection
• Anomaly Detection
• Healthcare
Confusion Matrix

The trick of Accuracy

Accuracy: 98%
F1 Score: 0

A model that always predicts the majority class on a highly imbalanced dataset can reach 98% accuracy while never identifying a single minority case, which is why its F1 score is 0.

Image: Analytics Vidhya
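A minimal sketch of this trick on synthetic 98:2 labels (not the slide's dataset): a classifier that always predicts the majority class.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 98 + [1] * 2)   # 98% majority, 2% minority
y_pred = np.zeros_like(y_true)          # always predict the majority class

print(accuracy_score(y_true, y_pred))   # 0.98
print(f1_score(y_true, y_pred))         # 0.0 -- no minority case is ever found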


How to measure the quality of a model?
ROC Curve and ROC AUC Score
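A sketch of scoring with ROC AUC, which ranks the model's scores rather than hard labels and so stays informative under imbalance (synthetic data, illustrative settings):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve

X, y = make_classification(n_samples=5000, weights=[0.98], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # probability of the minority class

fpr, tpr, thresholds = roc_curve(y_te, scores)  # points along the ROC curve
print("ROC AUC:", roc_auc_score(y_te, scores))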
Methods to Overcome the Data Imbalance Problem
• Class weight
• Oversampling
  • Random oversampling
  • Synthetic Minority Over-sampling Technique (SMOTE)
  • ADASYN
• Undersampling
  • Random undersampling
  • Cluster centroids
  • Near miss
  • Tomek links
To the very core
Class weight
• Provide a weight for each class which places more emphasis on the minority classes:

w_j = n_samples / (n_classes * n_samples_j)

Here,
• w_j is the weight for class j
• n_samples is the total number of samples (rows) in the dataset
• n_classes is the total number of unique classes in the target
• n_samples_j is the total number of rows belonging to class j

You modify the loss function to account for the imbalanced data.

Image: Analytics Vidhya
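scikit-learn implements exactly this weighting rule as class_weight='balanced'; a minimal sketch on toy 98:2 labels:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from sklearn.linear_model import LogisticRegression

y = np.array([0] * 98 + [1] * 2)                     # toy 98:2 labels
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(y), y=y)
print(weights)   # [~0.51, 25.0] -> 100 / (2 * 98) and 100 / (2 * 2)

# The weights rescale each sample's contribution to the loss:
clf = LogisticRegression(class_weight='balanced')    # or an explicit {0: 0.51, 1: 25.0}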


After Experimenting

Before using class weights, the F1 score on the testing data: 0.10098851188885921

Image: Analytics Vidhya


More Experimenting

With class weights, the F1 score on the testing data: 0.1579371474617244

Image: Analytics Vidhya


Oversampling

• Oversample the minority classes to increase the number of minority observations until we've reached a balanced dataset.

• Random Oversampling
  • Randomly sample the minority classes and simply duplicate the sampled observations.
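A minimal sketch using the imbalanced-learn package, which provides the samplers named on these slides (synthetic data for illustration):

from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
print(Counter(y))                        # e.g. {0: 950, 1: 50}

ros = RandomOverSampler(random_state=0)  # duplicates minority rows at random
X_res, y_res = ros.fit_resample(X, y)
print(Counter(y_res))                    # classes now balanced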
Synthetic Minority Over-sampling Technique (SMOTE)

• It generates new observations by interpolating between observations in the original dataset.
• For a given minority observation x_i, a new (synthetic) observation is generated by interpolating between x_i and one of its k nearest neighbors, x_zi.
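A minimal SMOTE sketch with imbalanced-learn (k_neighbors=5 is the library default; data is synthetic):

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
smote = SMOTE(k_neighbors=5, random_state=0)  # interpolate between minority neighbors
X_res, y_res = smote.fit_resample(X, y)       # new points are synthetic, not duplicates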
Problems with SMOTE
Introducing ADASYN
Procedure
Explanation
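The procedure and explanation figures from these slides are not reproduced here. As background: ADASYN works like SMOTE but adaptively generates more synthetic points for minority samples whose neighborhoods contain more majority samples, i.e. the harder-to-learn ones. A minimal imbalanced-learn sketch (synthetic data, illustrative parameters):

from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
ada = ADASYN(n_neighbors=5, random_state=0)  # neighborhood density decides where to synthesize
X_res, y_res = ada.fit_resample(X, y)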
Under Sampling

• Throwing away data to make it easier to learn characteristics about the minority classes.

• Random under sampling
  • Simply sample the majority class at random until reaching a similar number of observations as the minority classes.
Random Undersampling
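A minimal random-undersampling sketch with imbalanced-learn (synthetic data):

from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
rus = RandomUnderSampler(random_state=0)  # drops majority rows at random
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y_res))                     # both classes at the minority size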
Replacement with cluster centroids
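The slide's illustration is not reproduced. In imbalanced-learn, ClusterCentroids undersamples by fitting KMeans to the majority class and replacing it with the cluster centroids; a minimal sketch on synthetic data:

from imblearn.under_sampling import ClusterCentroids
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
cc = ClusterCentroids(random_state=0)  # majority class -> KMeans centroids
X_res, y_res = cc.fit_resample(X, y)   # centroids are synthetic rows, not original samples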
Tomek Links

Tomek, Ivan. "Two modifications of CNN." IEEE Transactions on Systems, Man, and Cybernetics SMC-6 (1976): 769-772.
Tomek Link Removal

Let d(x_i, x_j) denote the Euclidean distance between x_i and x_j, where x_i denotes a sample belonging to the minority class and x_j a sample belonging to the majority class. If no sample x_k satisfies either of the following conditions:

d(x_i, x_k) < d(x_i, x_j), or
d(x_j, x_k) < d(x_i, x_j)

then the pair (x_i, x_j) is a Tomek link.

Source: https://hersanyagci.medium.com/under-sampling-methods-for-imbalanced-data-clustercentroids-randomundersampler-nearmiss-eae0eadcc145
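A minimal Tomek-link removal sketch with imbalanced-learn; sampling_strategy='majority' drops only the majority member of each link (synthetic data):

from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
tl = TomekLinks(sampling_strategy='majority')  # remove majority samples in Tomek links
X_res, y_res = tl.fit_resample(X, y)           # cleans boundary/noisy pairs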

Near miss-1

NearMiss-1 selects samples from the majority class for which the average distance to the N closest samples of the minority class is smallest.

Source: https://hersanyagci.medium.com/under-sampling-methods-for-imbalanced-data-clustercentroids-randomundersampler-nearmiss-eae0eadcc145

Near miss-2

NearMiss-2 selects samples from the majority class for which the average distance to the N farthest samples of the minority class is smallest.

Source: https://hersanyagci.medium.com/under-sampling-methods-for-imbalanced-data-clustercentroids-randomundersampler-nearmiss-eae0eadcc145

Near miss-3

NearMiss-3 is a 2-step algorithm. First, for each negative sample, its m nearest neighbors are kept. Then, the positive samples selected are the ones for which the average distance to the k nearest neighbors is the largest.
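A minimal NearMiss sketch with imbalanced-learn; the version parameter selects among the three variants described above (parameter values are illustrative):

from imblearn.under_sampling import NearMiss
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

nm1 = NearMiss(version=1, n_neighbors=3)       # NearMiss-1 (version=2 for NearMiss-2)
X_res, y_res = nm1.fit_resample(X, y)

nm3 = NearMiss(version=3, n_neighbors_ver3=3)  # the 2-step NearMiss-3 variant
X_res3, y_res3 = nm3.fit_resample(X, y)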
Performance Comparison

Zeng, Min, et al. "Effective prediction of three common diseases by combining SMOTE with Tomek links technique
for imbalanced medical data." 2016 IEEE International Conference of Online Analysis and Computing Science
(ICOACS). IEEE, 2016.
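The cited study combines SMOTE with Tomek link removal; imbalanced-learn ships this combination as SMOTETomek. A minimal sketch (synthetic data, not the paper's):

from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
smt = SMOTETomek(random_state=0)      # SMOTE first, then Tomek link cleaning
X_res, y_res = smt.fit_resample(X, y)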
Performance Continued
Image Data Augmentation

• Process of artificially increasing the amount of data by generating new data points from existing data.
• Done by adding minor alterations to data to generate new data points in the latent space of the original data to amplify the dataset.

• Synthetic data: when data is generated artificially without using real-world images; often produced by GANs.
• Augmented data: derived from original images with some sort of minor geometric transformations (such as flipping, translation, rotation, or the addition of noise) in order to increase the diversity of the training set.
Image Data Augmentation

• In real-world cases, we might have a dataset of photos captured under a specific set of conditions, while our test data may exist in a number of variations, such as varied orientations, locations, scales, and brightness. We can accommodate such cases by training deep neural networks with synthetically manipulated data.

• Popular in computer vision.


Image Data Augmentation

• Advantages:
  • Brings diversity to the data
  • Deals with the limited-data problem
  • Improves model prediction
  • Reduces the cost of collecting and labeling data
  • Helps resolve the class imbalance problem
  • Increases the generalization of the model
  • Reduces overfitting
Image Data Augmentation

Image processing activities for data augmentation:

• Padding
• Rotation
• Scaling
• Flipping (vertical and horizontal)
• Translation (image is moved along the X or Y direction)
• Cropping
• Darkening & brightening
• Gray scaling
• Changing contrast
• Adding noise
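A sketch of several of these operations using Keras' ImageDataGenerator (parameter values are illustrative, not tuned; cropping, gray scaling, contrast changes, and noise would need a custom preprocessing_function):

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,            # rotation
    width_shift_range=0.1,        # translation along X
    height_shift_range=0.1,       # translation along Y
    zoom_range=0.2,               # scaling
    horizontal_flip=True,         # flipping
    vertical_flip=True,
    brightness_range=(0.7, 1.3),  # darkening & brightening
    fill_mode='nearest',          # how empty border pixels are filled
)

images = np.random.rand(8, 64, 64, 3)             # stand-in image batch
batch = next(datagen.flow(images, batch_size=8))  # yields augmented copies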
Image Data Augmentation

Flip

Original · Horizontally flipped · Vertically flipped


Image Data Augmentation

Rotate
Image Data Augmentation

Scale

Original · Scaled out by 10% · Scaled out by 20%


Image Data Augmentation

Crop
Image Data Augmentation

Adding noise

Original · Gaussian noise · Salt-and-pepper noise


Image Data Augmentation

Translation

Original · Translated to the right · Translated upwards


Regularization: Early Stopping

• Overfitting also happens when a model is trained for too long.
• Solution: stop training before we have a chance to overfit.
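A minimal Keras sketch of early stopping (the monitored metric and patience are illustrative choices; the model and data are placeholders):

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',          # watch validation loss
    patience=5,                  # allow 5 stagnant epochs before stopping
    restore_best_weights=True,   # roll back to the best epoch's weights
)

# model.fit(X_train, y_train, validation_split=0.2,
#           epochs=100, callbacks=[early_stop])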

Normalization
• Recall the loss function
Some of the common Norms
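The loss function and norm figures from these slides are not reproduced. As a reminder (standard notation, an assumption rather than the slide's), the common norms used as regularization penalties are:

\|w\|_1 = \sum_i |w_i| \quad (\text{L1, lasso}), \qquad
\|w\|_2 = \sqrt{\textstyle\sum_i w_i^2} \quad (\text{L2, ridge}), \qquad
\|w\|_p = \Big(\textstyle\sum_i |w_i|^p\Big)^{1/p}

and the regularized objective adds a weighted penalty to the loss:

L_{\text{reg}}(\theta) = L(\theta) + \lambda\,\Omega(\theta), \qquad \Omega(\theta) = \|\theta\|_1 \ \text{or}\ \tfrac{1}{2}\|\theta\|_2^2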
Curve Fitting

Source: Medium
Dropout
Training:

➢ Each time, before updating the parameters:
  ⚫ Each neuron has probability p% of being dropped out.
Dropout
Training:

➢ Each time, before updating the parameters:
  ⚫ Each neuron has probability p% of being dropped out, so the structure of the network is changed and becomes thinner!
  ⚫ The new, thinner network is used for training.
➢ For each mini-batch, we resample the dropout neurons.
Dropout

Training of Dropout (assume the dropout rate is 50%):

z = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4

Testing of Dropout (no dropout, weights rescaled):

z' = 0.5 w_1 x_1 + 0.5 w_2 x_2 + 0.5 w_3 x_3 + 0.5 w_4 x_4

In general, if the dropout rate is p%, all weights are multiplied by (1 − p%) at testing time, so that z' matches the expected value of z seen during training.
Dropout is a kind of ensemble.

Training: the training set is sampled into Set 1, Set 2, Set 3, and Set 4, which are used to train Network 1, Network 2, Network 3, and Network 4 respectively.

Train a bunch of networks with different structures.
Dropout is a kind of ensemble.

Ensemble testing: feed the testing data x to Network 1, Network 2, Network 3, and Network 4, obtaining outputs y1, y2, y3, and y4, then average them.
Dropout is a kind of ensemble.

Training of Dropout: mini-batch 1, mini-batch 2, mini-batch 3, mini-batch 4, ... With M neurons, there are 2^M possible networks.

➢ Using one mini-batch to train one network.
➢ Some parameters in the network are shared.
Dropout is a kind of ensemble.

Testing of Dropout: feed the testing data x to all the possible networks and average their outputs y1, y2, y3, ... This average ≈ y, the output of the single full network with all the weights multiplied by (1 − p)%.
Demo

……
model.add(Dense(500))    # 500-unit layer (Dense is an assumption; the slide shows only "500")
model.add(Dropout(0.8))  # "Dropout" capitalized to match the Keras API
……
model.add(Dense(500))    # 500
model.add(Dropout(0.8))
model.add(Dense(10, activation='softmax'))  # Softmax outputs y1, y2, …… y10
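A self-contained, runnable version of the demo (the input shape, ReLU activations, and compile settings are assumptions; the slide fixes only the 500-unit layers, the 0.8 dropout rate, and the 10-way softmax). Keras applies inverted dropout, so the test-time weight rescaling from the earlier slides is handled automatically:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(500, activation='relu', input_shape=(784,)),  # assumed input size
    Dropout(0.8),                                       # each neuron dropped with p = 80%
    Dense(500, activation='relu'),
    Dropout(0.8),
    Dense(10, activation='softmax'),                    # y1 ... y10
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])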
Some Results

Paper: Srivastava, Nitish, et al. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." Journal of Machine Learning Research 15 (2014): 1929-1958.
