3b. Data Pre-Processing
Machine learning
Acknowledgements:
Dr Vijay Kumar, IT Dept, NIT Jalandhar
Topics
• Need for data pre-processing
• What is data pre-processing
• Data Pre-processing tasks
Data Cleaning (Missing values)
Missing numeric values can be replaced with a measure of central tendency of the feature, such as the mean or the median.
Exception:
• The feature vector can be divided into a number of sub-parts, each corresponding to one class.
• Each sub feature vector is then used to determine the central-tendency values that replace the missing values of that feature for that particular class.
• e.g. an employee salary vector can be divided into male and female, or by job designation, etc.
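The class-wise replacement described above can be sketched in a few lines of pandas. The column names and data here are hypothetical, chosen only to illustrate the idea:

```python
import pandas as pd

# Hypothetical employee data; some salaries are missing (NaN).
df = pd.DataFrame({
    "designation": ["clerk", "clerk", "manager", "manager", "clerk"],
    "salary": [30000.0, None, 90000.0, None, 34000.0],
})

# Replace each missing salary with the mean of its own class
# (the sub feature vector per designation), not the global mean.
df["salary"] = df.groupby("designation")["salary"].transform(
    lambda s: s.fillna(s.mean())
)

print(df["salary"].tolist())
# [30000.0, 32000.0, 90000.0, 90000.0, 34000.0]
```

The missing clerk salary becomes 32000 (mean of 30000 and 34000) rather than the global mean, which would be pulled upward by the manager salaries.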
Data Cleaning (Noisy data)
Noise is defined as a random variance in a measured variable. For numeric values, boxplots and scatter plots can be used to identify outliers. Binning smooths a sorted data set by distributing the values into equal-frequency bins.
Example: sorted data: 4, 8, 9, 21, 21, 24, 26, 28, 29
Bin size = (max − min)/n = (29 − 4)/9 = 2.777, i.e. roughly 3 values per bin.
Smoothing by bin means: replace every value in a bin by the bin mean.
Bin 1: 7, 7, 7
Bin 2: 22, 22, 22
Bin 3: 27, 27, 27
Smoothing by bin boundaries: replace each value in a bin by the closest boundary value (the minimum or maximum) of the corresponding bin. Boundary values themselves remain unchanged in the boundary method.
Bin 1: 4, 9, 9
Bin 2: 21, 21, 24
Bin 3: 26, 29, 29
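Both smoothing schemes can be reproduced with a short script, assuming the sorted data 4, 8, 9, 21, 21, 24, 26, 28, 29 (recovered from the bin means and boundaries shown above):

```python
# Sorted data partitioned into equal-frequency bins of 3 values each.
data = [4, 8, 9, 21, 21, 24, 26, 28, 29]
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

# Smoothing by bin means: every value becomes the (integer) bin mean.
by_means = [[sum(b) // len(b) for _ in b] for b in bins]

# Smoothing by bin boundaries: every value moves to the closer of the
# bin's min/max; boundary values themselves stay unchanged.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[7, 7, 7], [22, 22, 22], [27, 27, 27]]
print(by_bounds)  # [[4, 9, 9], [21, 21, 24], [26, 29, 29]]
```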
Data Cleaning (Inconsistent data)
To resolve inconsistencies:
• Manual correction using external references
• Semi-automatic tools
• To detect violations of known functional dependencies and data constraints
• To correct redundant data
Example of inconsistency: two customer tables recording Name, Age and DoB, where the key attribute is named Cust.id in one table and Cust.no in the other.
Data Transformation is applied before Data Reduction in the pre-processing pipeline, but its details are discussed after Data Reduction.
Data Reduction
Data Reduction is the process of constructing a condensed representation of the data set that is smaller in volume while maintaining the integrity of the original. The quality of the results should not degrade after data reduction.
Some facts about data reduction in machine learning:
• There exists an optimal number of features in a feature set for a given machine-learning task.
• Adding more features than the optimal (strictly necessary) ones degrades performance, because of the added noise.
Finding this optimal number of features is a challenging task.
Benefits of data reduction
• Accuracy improvement.
• Reduced risk of overfitting (the model fits the training data but not the validation data).
• Faster training.
• Improved data visualization.
• Increased explainability of the model.
• Increased storage efficiency.
• Reduced storage cost.
Major techniques of data reduction are:
• Attribute subset selection
• Low variance filter
• High correlation filter
• Numerosity reduction
• Dimensionality reduction
Attribute subset selection: only the highly relevant attributes should be retained; the rest can be discarded.
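A minimal sketch of two of the listed filters, the low variance filter and the high correlation filter, on a hypothetical feature matrix. The thresholds (1e-3 and 0.95) are illustrative assumptions, not fixed rules:

```python
import pandas as pd

# Hypothetical feature matrix.
df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],    # informative feature
    "f2": [5.0, 5.0, 5.0, 5.0, 5.0],    # near-zero variance
    "f3": [2.1, 4.2, 5.9, 8.1, 10.0],   # almost a scaled copy of f1
})

# Low variance filter: drop features whose variance is below a threshold.
low_var = [c for c in df.columns if df[c].var() < 1e-3]

# High correlation filter: flag one feature of every highly correlated pair.
corr = df.drop(columns=low_var).corr().abs()
high_corr = [corr.columns[j]
             for i in range(len(corr.columns))
             for j in range(i + 1, len(corr.columns))
             if corr.iloc[i, j] > 0.95]

print(sorted(set(low_var + high_corr)))  # ['f2', 'f3']
```

Here f2 is dropped for carrying no information, and f3 is dropped because it is almost perfectly correlated with f1.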
Data Reduction (PCA)
Example: for two features X and Y, the covariance matrix is
C = | 0.6165  0.6154 |
    | 0.6154  0.7165 |
Find the eigenvalues λ1 and λ2 by equating the characteristic determinant to zero:
|C − λI| = (0.6165 − λ)(0.7165 − λ) − (0.6154 × 0.6154) = 0
Data Reduction (PCA)
Example (contd.): solving the quadratic gives λ1 = 1.284 and λ2 = 0.049. Compute the eigenvectors from (C − λI)v = 0:
For λ1 = 1.284: C − λ1I = | −0.6675   0.6154 |
                          |  0.6154  −0.5675 |
For λ2 = 0.049: C − λ2I = |  0.5674   0.6154 |
                          |  0.6154   0.6674 |
PCA: Eigenvector computation
From the first row of (C − λ1I)v = 0: −0.6675·x1 + 0.6154·x2 = 0
Dividing throughout by (0.6675 × 0.6154): x1/0.6154 = x2/0.6675 = t, say.
Taking t = 1 gives V1 = (0.6154, 0.6675). Obtain the unit eigenvector by dividing by the norm:
||V1|| = sqrt(0.6154² + 0.6675²) = sqrt(0.8243) = 0.908
V1 = (1/0.908)·(0.6154, 0.6675) = (0.6779, 0.7352)
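The hand computation above can be checked with NumPy, using the covariance matrix from the example:

```python
import numpy as np

# Covariance matrix from the worked example.
C = np.array([[0.6165, 0.6154],
              [0.6154, 0.7165]])

# Eigen-decomposition; for a symmetric matrix, eigh returns eigenvalues
# in ascending order with unit-length eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(C)

print(eigvals)                 # approx [0.0491, 1.2840]
# The sign of an eigenvector is arbitrary, so compare absolute values.
print(np.abs(eigvecs[:, 1]))   # approx [0.6779, 0.7352]
```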
Data Reduction (PCA)
Example (contd.): repeating the computation for λ2 gives the second unit eigenvector, and together they form the feature-vector matrix
| 0.67787  −0.73517 |
| 0.73517   0.67787 |
Projecting the data onto the first column (the eigenvector of the larger eigenvalue) reduces the two features to one principal component.

Data Transformation (Normalization)
Min-max normalization: for a target range [0, 1],
v' = (v − min_X) / (max_X − min_X)
where v' is the mapped value, v is the data value to be mapped into the specific range, and min_X, max_X are the minimum and maximum values of the feature vector corresponding to v.
X      min-max normalized
2      0
47     0.512
90     1
18     0.18
5      0.034
Data Transformation (Normalization)
Mean normalization:
v' = (v − mean_X) / (max_X − min_X)
where v' is the mapped value, v is the data value to be mapped into the specific range, mean_X is the mean of the feature vector corresponding to v, and min_X, max_X are its minimum and maximum values.
X      mean normalized
2      −0.345
47     0.166
90     0.655
18     −0.164
5      −0.311
Data Transformation (Normalization)
Z-score method:
v' = (v − μ_X) / σ_X
where v' is the mapped value, v is the data value to be mapped into the specific range, and μ_X and σ_X are the mean and (sample) standard deviation of the feature vector corresponding to v.
X      z-score normalized
2      −0.826
47     0.397
90     1.566
18     −0.391
5      −0.745
Data Transformation (Normalization)
Decimal scaling method:
v' = v / 10^j
where v' is the mapped value, v is the data value to be mapped into the specific range, and j is the maximum of the count of digits in the minimum and maximum values of the feature vector corresponding to v (here max = 90 has two digits, so j = 2 and each value is divided by 100).
X      decimal scaled
2      0.02
47     0.47
90     0.9
18     0.18
5      0.05
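The four normalization slides can be reproduced in a few lines. The values match the worked column X = 2, 47, 90, 18, 5; note that the z-score values on the slide correspond to the sample standard deviation (dividing by n − 1):

```python
import math

x = [2, 47, 90, 18, 5]
lo, hi = min(x), max(x)
mean = sum(x) / len(x)                                           # 32.4
std = math.sqrt(sum((v - mean) ** 2 for v in x) / (len(x) - 1))  # sample std

min_max = [(v - lo) / (hi - lo) for v in x]      # [0, 0.511, 1, 0.182, 0.034]
mean_norm = [(v - mean) / (hi - lo) for v in x]  # [-0.345, 0.166, 0.655, ...]
z_score = [(v - mean) / std for v in x]          # [-0.826, 0.397, 1.566, ...]

# Decimal scaling: j = digit count of the larger of |min| and |max|.
j = max(len(str(abs(lo))), len(str(abs(hi))))    # 2, so divide by 100
decimal = [v / 10 ** j for v in x]               # [0.02, 0.47, 0.9, 0.18, 0.05]
```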
Data Transformation (Aggregation)
Aggregation: take aggregated values in order to put the data in a better perspective.
Benefits of aggregation
• Reduces the memory needed to store large numbers of data records.
• Provides a more stable view than individual data objects.
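Aggregation is typically a group-by over a coarser key. A small hypothetical example, rolling monthly records up to yearly totals:

```python
import pandas as pd

# Hypothetical monthly sales records.
sales = pd.DataFrame({
    "year": [2023, 2023, 2023, 2024, 2024, 2024],
    "amount": [100, 120, 80, 150, 130, 170],
})

# Aggregate monthly rows into one yearly total: fewer records to store,
# and a more stable view than the individual data objects.
yearly = sales.groupby("year", as_index=False)["amount"].sum()

print(yearly["amount"].tolist())  # [300, 450]
```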