
Data Pre-processing in Machine Learning

Acknowledgements:
Dr Vijay Kumar, IT Dept, NIT Jalandhar
Topics
• Need for data pre-processing
• What is data pre-processing
• Data Pre-processing tasks
Need for data Pre-processing

“Data, data, everywhere …”


Need for data Pre-processing?
Need for data Pre-processing
 Data in the real world is “quite messy”
• incomplete: missing feature values, absence of a certain crucial feature, or containing only aggregate data
 e.g. Height = “ ”
• noisy: containing errors or outliers
 e.g. Weight = “5000” or “-60”
• inconsistent: containing discrepancies in feature values
 e.g. Age = “20” and DoB = “12 July 1990”
 e.g. contradictions between duplicate records
Need for data Pre-processing

“No quality data, less accurate results!”

“The less accurate the model, the higher the probability of a wrong decision.”

“Right decisions require more quality data.”


What is data Pre-processing
Data Pre-processing: the phase of any machine learning process that transforms, or encodes, the data to bring it to a state where it can be easily interpreted by the learning algorithm.

“Data pre-processing is not a single standalone entity but a collection of multiple interrelated tasks.”

“Collectively, data pre-processing constitutes the majority of the effort in a machine learning process (approx. 90%).”
Data pre-processing tasks

 Major data pre-processing tasks


• Data cleaning
• Data integration
• Data transformation
• Data reduction
Data Cleaning
Data cleaning: a procedure to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving data inconsistencies.

 Data cleaning tasks


• Fill missing values
• Noise smoothing and outlier detection
• Resolving inconsistencies
Data Cleaning
Missing values: data values that are not available.
e.g. many data entities have no value for a certain feature, such as a missing BMI value for a person in a diabetes dataset.

Probable reasons for missing values:

• faulty measuring equipment
• reluctance of a person to share certain details
• negligence on the part of the data entry operator
• the feature was considered unimportant at the time of data collection
Data Cleaning (Missing values)
 Missing data handling techniques
• Removing the data entity
• Manually filling the values
• Replacing the missing value by central tendency (mean,
median, mode) for a feature vector
• Replacing the missing value by central tendency belonging
to same class for a feature vector.

“Technique selection is specific to the user’s preference, the dataset or feature type, or the problem at hand.”

“A feature is an individual measurable property or characteristic of a phenomenon being observed.”
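A minimal pandas sketch of these handling techniques, assuming a small dataframe in the style of the forest-fires sample shown on the next slide (column names and values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "month": ["mar", "oct", "oct", "mar", "aug"],
    "temp":  [8.2, 18.0, np.nan, 8.3, np.nan],
    "wind":  [6.7, 0.9, np.nan, 4.0, np.nan],
})

# 1. Removing the data entity: drop every row that contains a missing value
dropped = df.dropna()

# 2. Replacing missing values with a central tendency of the feature vector
mean_filled   = df.fillna(df.mean(numeric_only=True))    # mean
median_filled = df.fillna(df.median(numeric_only=True))  # median
mode_filled   = df.fillna(df.mode().iloc[0])             # mode (first modal value per column)
```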
Data Cleaning (Missing values)
Sample dataset related to forest fires
Month FFMC DC temp RH wind
mar 86.2 94.3 8.2 51 6.7
oct 90.6 669.1 18 33 0.9
oct 90.6 686.9 NaN 33 NaN
mar NaN 77.5 8.3 97 4
mar 89.3 102.2 11.4 99 1.8
aug 92.3 NaN 22.2 NaN NaN
aug NaN 495.6 24.1 27 NaN
aug 91.5 608.2 8 86 2.2
sep 91 692.6 NaN 63 5.4
sep 92.5 698.6 22.8 40 4
Data Cleaning (Missing values)
Removing the data entity: the easiest way, but usually discouraged because it leads to loss of data; the removed entities or feature values could have added value to the data set as well.
Before (with missing values):
Month FFMC DC temp RH wind
mar 86.2 94.3 8.2 51 6.7
oct 90.6 669.1 18 33 0.9
oct 90.6 686.9 NaN 33 NaN
mar NaN 77.5 8.3 97 4
mar 89.3 102.2 11.4 99 1.8
aug 92.3 NaN 22.2 NaN NaN
aug NaN 495.6 24.1 27 NaN
aug 91.5 608.2 8 86 2.2
sep 91 692.6 NaN 63 5.4
sep 92.5 698.6 22.8 40 4

After removing the entities with missing values:
Month FFMC DC temp RH wind
mar 86.2 94.3 8.2 51 6.7
oct 90.6 669.1 18 33 0.9
mar 89.3 102.2 11.4 99 1.8
aug 91.5 608.2 8 86 2.2
sep 92.5 698.6 22.8 40 4
Data Cleaning (Missing values)
Manually filling in the values: this approach is time-consuming and not recommended for huge data sets.
Before: the sample dataset with missing values (as above).

After manually filling in the missing values:
Month FFMC DC temp RH wind
mar 86.2 94.3 8.2 51 6.7
oct 90.6 669.1 18 33 0.9
oct 90.6 686.9 17 33 0.8
mar 91.6 77.5 8.3 97 4
mar 89.3 102.2 11.4 99 1.8
aug 92.3 380 22.2 92 1.8
aug 90 495.6 24.1 27 2
aug 91.5 608.2 8 86 2.2
sep 91 692.6 22 63 5.4
sep 92.5 698.6 22.8 40 4
Data Cleaning (Missing values)
Central tendency technique: the mean, median, or mode of the feature is used to calculate the value that replaces the missing values.

Mean: the average of the non-missing values of the feature.

Median: the middle value of the sorted non-missing values of the feature.

Mode: the most frequent value corresponding to a certain feature in a given data set.
Data Cleaning (Missing values)
Replacing with mean value:
Before: the sample dataset with missing values (as above).

After replacing missing values with the feature mean:
Month FFMC DC temp RH wind
mar 86.2 94.3 8.2 51 6.7
oct 90.6 669.1 18 33 0.9
oct 90.6 686.9 15.3 33 3.57
mar 90.5 77.5 8.3 97 4
mar 89.3 102.2 11.4 99 1.8
aug 92.3 458.3 22.2 58.7 3.57
aug 90.5 495.6 24.1 27 3.57
aug 91.5 608.2 8 86 2.2
sep 91 692.6 15.3 63 5.4
sep 92.5 698.6 22.8 40 4
Data Cleaning (Missing values)

Now the problem of missing values is solved!!

“Replacing by the mean value is not a suitable method if the data set has many outliers.”

For example, weights of humans: 67, 78, 900, -56, 389, -1, etc. (contains outliers)
Mean is 229.5
Data Cleaning (Missing values)
Replacing with median value:
Before: the sample dataset with missing values (as above).

After replacing missing values with the feature median:
Month FFMC DC temp RH wind
mar 86.2 94.3 8.2 51 6.7
oct 90.6 669.1 18 33 0.9
oct 90.6 686.9 14.7 33 2.2
mar 90.8 77.5 8.3 97 4
mar 89.3 102.2 11.4 99 1.8
aug 92.3 495.6 22.2 40 2.2
aug 90.8 495.6 24.1 27 2.2
aug 91.5 608.2 8 86 2.2
sep 91 692.6 14.7 63 5.4
sep 92.5 698.6 22.8 40 4
Data Cleaning (Missing values)
Replacing with mode value:
Before: the sample dataset with missing values (as above).

After replacing missing values with the feature mode:
Month FFMC DC temp RH wind
mar 86.2 94.3 8.2 51 6.7
oct 90.6 669.1 18 33 0.9
oct 90.6 686.9 NaN 33 4
mar 90.6 77.5 8.3 97 4
mar 89.3 102.2 11.4 99 1.8
aug 92.3 NaN 22.2 33 4
aug 90.6 495.6 24.1 27 4
aug 91.5 608.2 8 86 2.2
sep 91 692.6 NaN 63 5.4
sep 92.5 698.6 22.8 40 4
(temp and DC have no repeated value, so no mode exists and those entries remain NaN)

“Mode is a good option for missing values in the case of categorical variables.”
Data Cleaning (Missing values)
Replacing with central tendency values corresponding to the class, for a certain feature.

The technique is similar to the one discussed above, except that:
• The feature vector is divided into a number of subparts, each corresponding to one class.
• Each sub feature vector is used to determine the central tendency value that replaces the missing values of that feature for that particular class.
• e.g. an employee salary vector can be divided by gender (male/female) or by job designation, etc. (a code sketch follows)
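A sketch of class-wise imputation with pandas, assuming hypothetical gender and salary columns in line with the example above:

```python
import numpy as np
import pandas as pd

emp = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F"],
    "salary": [52000, 48000, np.nan, 51000, 61000, np.nan],
})

# Replace each missing salary with the mean salary of the same gender (class)
emp["salary"] = emp.groupby("gender")["salary"].transform(
    lambda s: s.fillna(s.mean())
)
```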
Data Cleaning (Noisy data)
Noise is defined as a random variance in a measured variable. For
numeric values, boxplots and scatter plots can be used to identify
outliers.

(Figures: a boxplot and a scatter plot used to spot outliers.)
Data Cleaning (Noisy data)
Popular reasons for random variations are:
• Malfunctioning of data collection instruments.
• Data entry problems.
• Data transmission problems.

To deal with these anomalous values, data smoothing techniques are applied; some popular ones are:
• Binning method
• Regression
• Outlier analysis
Data Cleaning (Noisy data)
Binning method: performs the task of data smoothing.
Steps to be followed under the binning method:
Step 1: Sort the data into ascending order.
Step 2: Calculate the bin size (i.e. the number of bins).
Step 3: Partition or distribute the data equally among the bins, starting with the first element of the sorted data.
Step 4: Perform data smoothing using bin means, bin boundaries, or bin medians.
The last bin can have one element less or more!!
Data Cleaning (Noisy data)
Example: 9, 21, 29, 28, 4, 21, 8, 24, 26

Step 1: sorted data: 4, 8, 9, 21, 21, 24, 26, 28, 29

Step 2: bin size calculation

Bin size = (max value - min value) / number of data values = (29 - 4) / 9 = 2.777

But we need to take the ceiling value, so the bin size is 3 here.
Data Cleaning (Noisy data)
Step 3 : Bin partitioning (equi-size)
Bin 1: 4, 8, 9
Bin 2: 21, 21, 24
Bin 3: 26, 28, 29

Step 4 : data smoothening


 Using mean value : replace the bin values by bin average

Bin 1: 7, 7, 7
Bin 2: 22, 22, 22
Bin 3: 27, 27, 27
Data Cleaning (Noisy data)
 Using boundary values : replace each bin value by the closest boundary value of the corresponding bin.
Bin 1: 4, 9, 9
Bin 2: 21, 21, 24
Bin 3: 26, 29, 29
“Boundary values remain unchanged in the boundary method”

 Using median values : replace each bin value by the bin median.

Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28
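A minimal numpy sketch of the three smoothing variants for the worked example above (equal-frequency bins of size 3):

```python
import numpy as np

data = np.array([9, 21, 29, 28, 4, 21, 8, 24, 26])
data.sort()                                 # step 1: 4 8 9 21 21 24 26 28 29
bins = data.reshape(3, 3)                   # step 3: three equal-size bins

smooth_mean   = np.repeat(bins.mean(axis=1), 3).reshape(3, 3)        # 7, 22, 27.67 (the slide rounds to 27)
smooth_median = np.repeat(np.median(bins, axis=1), 3).reshape(3, 3)  # 8, 21, 28

# smoothing by bin boundaries: replace each value by the nearer bin boundary
lo, hi = bins[:, [0]], bins[:, [-1]]
smooth_boundary = np.where(bins - lo <= hi - bins, lo, hi)           # 4 9 9 / 21 21 24 / 26 29 29
```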
Data Cleaning (Noisy data)
Regression method : Linear regression and multiple linear
regression can be used to smooth the data, where the values are
conformed to a function.
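A minimal sketch of regression-based smoothing with scikit-learn; x and y are illustrative noisy measurements, not data from the slides:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * x.ravel() + 5.0 + rng.normal(scale=4.0, size=20)  # noisy linear signal

# fit a linear function and conform the values to it
model = LinearRegression().fit(x, y)
y_smoothed = model.predict(x)
```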
Data Cleaning (Noisy data)
Outlier analysis: performs the task of data refinement by tracing the outliers with the help of clustering and then dealing with them.
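A sketch of clustering-based outlier tracing with k-means; the synthetic data, cluster count, and distance threshold are illustrative choices:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0, 1, (50, 2)),      # one normal cluster
    rng.normal(10, 1, (50, 2)),     # another normal cluster
    np.array([[5.0, 20.0]]),        # an obvious outlier
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# flag points whose distance to their own centroid is unusually large
outliers = X[dist > dist.mean() + 3 * dist.std()]
```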
Data Cleaning (Inconsistent data)
Inconsistent Data: discrepancies between different data
items.
e.g. the “Address” field contains the “Phone number”

To resolve inconsistencies
 Manual correction using external references
 Semi-automatic tools
• To detect violation of known functional dependencies and data
constraints
• To correct redundant data
Data Cleaning (Inconsistent data)

“To avoid inconsistencies, perform data assessment: know what the data type of each feature should be and whether it is the same for all the data objects.”
Data Integration
Data Integration: It is the process of merging the data
from multiple sources into a coherent data store.
e.g. Collection of banking data from different banks at data stores of RBI

Issues in data integration

• Schema integration and feature matching
• Redundant features
• Detection and resolution of data value conflicts
Data Integration
Schema integration and feature matching:

Source schemas: (Cust.id, Name, Age, DoB), (Cust.no, Name, Age, DoB), (Cust.id, Name, Age, DoB)
Integrated schema: (Cust.id, Name, Age, DoB) — Cust.id and Cust.no must be matched as the same feature.

“Carefully analyse the metadata”
Data Integration
Redundant features: they are unwanted features.

 To deal with redundant features, correlation analysis is performed; the correlation coefficient is denoted by r.
e.g. in the customer schema (Cust.id, Name, Age, DoB), Age is redundant because it can be derived from DoB.

(Figure: scatter plots showing r positive, r negative, and r zero.)
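A pandas sketch of a correlation check between two candidate-redundant numeric features; the column names and values are illustrative:

```python
import pandas as pd

cust = pd.DataFrame({
    "age":        [34, 29, 45, 52, 23],
    "birth_year": [1991, 1996, 1980, 1973, 2002],
})

# Pearson's r; here it is -1, so one of the two features can be dropped
r = cust["age"].corr(cust["birth_year"])
print(f"r = {r:.3f}")
```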


Data Integration
Detection and resolution of data value conflicts:

e.g. research data for the prices of essential products collected from different sources:
Source 1: (Product, Year, Price in $)
Source 2: (Product, Year, Price in Rs)
Source 3: (Product, Year, Price in £)
Integrated: (Product, Year, Price) — the price values must be resolved to a common unit.

“Carefully analyse the metadata”
Data Transformation

This step is done before Data Reduction, but its details will be discussed after Data Reduction (Data Reduction needs about 50 consecutive minutes).
Data Reduction
Data Reduction: the process of constructing a condensed representation of the data set that is smaller in volume while maintaining the integrity of the original one. The quality of the results should not degrade with data reduction.
Some facts about data reduction in machine learning:
• There exists an optimal number of features in a feature set for the corresponding machine learning task.
• Adding more features than the optimal (strictly necessary) ones results in performance degradation (because of added noise).
Data Reduction

“ Challenging task”
Data Reduction
Benefits of data reduction
• Accuracy improvement.
• Reduced risk of overfitting (where the model fits the training data but not the validation data).
• Faster training.
• Improved data visualization.
• Increased explainability of the model.
• Increased storage efficiency.
• Reduced storage cost.
Data Reduction
Major techniques of data reduction are:
• Attribute subset selection
• Low variance filter
• High correlation filter
• Numerosity reduction
• Dimensionality reduction
Data Reduction
Attribute subset selection: only the highly relevant attributes should be used; the rest can be discarded.

Techniques of attribute selection:

• Brute force approach: try all possible feature subsets as input to the machine learning algorithm and analyze the results.
• Statistical approach: statistical testing is applied to select the most significant features from the original feature set.
Data Reduction (Attribute subset)
Brute force technique:
F1 F2 F3 F4 F5 F6 F7 F8 F9 Target

Step 1: Construct the power set corresponding to the feature set.
Step 2: Select an element (feature subset) from the power set.
Step 3: Measure the accuracy of the learning model corresponding to the selected features.
Step 4: Repeat Step 2 and Step 3 until the desired accuracy is achieved.

“To decrease the number of iterations, expert knowledge of the features is used”
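A sketch of the brute-force search over feature subsets; the dataset, model, and cross-validated scoring are illustrative stand-ins:

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

best_score, best_subset = 0.0, None
# steps 1-3: iterate over all non-empty feature subsets and score each one
for k in range(1, n_features + 1):
    for subset in combinations(range(n_features), k):
        cols = list(subset)
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, cols], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, round(best_score, 3))
```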
Data Reduction (Attribute subset)
Statistical technique: chi-square test
• A chi-square test is used to test the
independence of predictor and the
target.
• In feature reduction, aim is to select
highly dependent features as per
target.
• Independent features are removed.

“The higher the chi-square value, the higher the level of dependence.”
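A sketch of chi-square feature selection with scikit-learn; the dataset and k are illustrative choices, and chi2 requires non-negative feature values:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)             # all features are non-negative

selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 most target-dependent features
X_reduced = selector.fit_transform(X, y)

print(selector.scores_)                       # higher chi-square => higher dependence
```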


Data Reduction
Low variance filter: normalized features whose variance (spread) is less than a threshold are removed, since little change in the data means little information.
F1 F2 F3
0.113 0.272 0.318
0.803 0.383 0.027
0.197 0.630 0.319
0.210 0.987 0.310
0.305 0.984 0.390
0.464 0.031 0.314
0.954 0.008 0.399
0.943 0.324 0.278
0.997 0.418 0.133
0.270 0.525 0.373

Variance(F1) = 0.115298
Variance(F2) = 0.103589
Variance(F3) = 0.012525  <- very low variance, so F3 is removed
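A sketch of the low-variance filter applied to the table above with scikit-learn; the 0.05 threshold is an illustrative choice:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [0.113, 0.272, 0.318], [0.803, 0.383, 0.027], [0.197, 0.630, 0.319],
    [0.210, 0.987, 0.310], [0.305, 0.984, 0.390], [0.464, 0.031, 0.314],
    [0.954, 0.008, 0.399], [0.943, 0.324, 0.278], [0.997, 0.418, 0.133],
    [0.270, 0.525, 0.373],
])

vt = VarianceThreshold(threshold=0.05)   # drop features with variance below 0.05
X_reduced = vt.fit_transform(X)          # F3 is removed
print(vt.variances_)                     # ~[0.115, 0.104, 0.013]
```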
Data Reduction
High correlation filter: normalized attributes whose correlation coefficient with another attribute is above a threshold are also removed, since similar trends mean that similar information is carried. The correlation coefficient is usually calculated using statistical measures such as Pearson's correlation coefficient.
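A pandas sketch of a high-correlation filter; the synthetic features and the 0.9 threshold are illustrative choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"f1": rng.normal(size=100)})
df["f2"] = df["f1"] * 2 + rng.normal(scale=0.01, size=100)  # nearly duplicates f1
df["f3"] = rng.normal(size=100)                             # independent feature

corr = df.corr().abs()
# keep only the upper triangle so each feature pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)    # drops f2, keeps f1 and f3
```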
Data Reduction
Numerosity Reduction: this enables storing a model of the data instead of the whole data.
e.g. regression models.
Data Reduction

Dimensionality Reduction: the technique of reducing the number of dimensions, not simply by selecting a feature subset but by going much further than that.
 Dimensions refers to the number of geometric planes the dataset lies in, which could be so high that it cannot be visualized with pen and paper.

“The more the number of planes, the more complex the dataset”
Data Reduction
Principal Component Analysis (PCA): a dimensionality reduction technique that reduces a higher-dimensional feature space to a lower-dimensional feature space. It also helps to make visualization of the dataset simpler.
Data Reduction (PCA)
Some of the major facts about PCA are:
 Principal components are new features constructed as linear combinations or mixtures of the initial feature set.
 These combinations are constructed in such a manner that all the newly constructed principal components are uncorrelated.
 Together with the reduction task, PCA also preserves as much information as possible from the original data set.
Data Reduction (PCA)
Some of the major facts about PCA are:
 Principal components are usually denoted by PCi, where i can be 1, 2, 3, …, n (depending on the number of features in the original data set).
 The major proportion of the information about the original feature set is explained by the first principal component alone, i.e. PC1.
 The remaining information is captured by the other principal components, in decreasing proportion as i increases.
Data Reduction (PCA)
 Geometrically, principal components can be seen as lines pointing in the directions that capture the maximum amount of information about the data.

Simply put, principal components are new axes that give better data visibility, with clearer differences between observations.
Data Reduction (PCA)
Stepwise working of PCA
Step 1: Construct the covariance matrix, A.

Step 2: Compute the eigenvalues of the covariance matrix.

Eigenvector: the direction of a line through the data, while the eigenvalue is a number that tells us how spread out the data set is along that line. Eigenvectors show us the directions of the main axes (principal components) of the data. The greater the eigenvalue, the greater the variation along that axis.

Step 3: Compute the eigenvector corresponding to every eigenvalue obtained in Step 2.
Data Reduction (PCA)
Stepwise working of PCA

Step 4: Sort the eigenvectors in decreasing order of their eigenvalues and choose the k eigenvectors with the largest eigenvalues.
Step 5: Transform the data along the principal component axes.

“Seems difficult!!”

No

“Just need some mathematical skills”
Data Reduction (PCA)
Example:
Original dataset:
X Y
2.5 2.4
0.5 0.7
2.2 2.9
1.9 2.2
3.1 3
2.3 2.7
2 1.6
1 1.1
1.5 1.6
1.1 0.9

Compute the covariance matrix A:
A = | Cov(X,X)  Cov(X,Y) |
    | Cov(Y,X)  Cov(Y,Y) |
Data Reduction (PCA)
Example:
Means: x̄ = 1.81, ȳ = 1.91

X    Y    (x-x̄)  (y-ȳ)  (x-x̄)²  (y-ȳ)²  (x-x̄)(y-ȳ)
2.5  2.4   0.69   0.49  0.4761  0.2401   0.3381
0.5  0.7  -1.31  -1.21  1.7161  1.4641   1.5851
2.2  2.9   0.39   0.99  0.1521  0.9801   0.3861
1.9  2.2   0.09   0.29  0.0081  0.0841   0.0261
3.1  3     1.29   1.09  1.6641  1.1881   1.4061
2.3  2.7   0.49   0.79  0.2401  0.6241   0.3871
2    1.6   0.19  -0.31  0.0361  0.0961  -0.0589
1    1.1  -0.81  -0.81  0.6561  0.6561   0.6561
1.5  1.6  -0.31  -0.31  0.0961  0.0961   0.0961
1.1  0.9  -0.71  -1.01  0.5041  1.0201   0.7171

Cov(X,X) = 0.6165, Cov(Y,Y) = 0.7165, Cov(X,Y) = Cov(Y,X) = 0.6154
Data Reduction (PCA)
Example:

A = | 0.6165  0.6154 |        I = | 1  0 |
    | 0.6154  0.7165 |            | 0  1 |

Compute the eigenvalues: set the determinant of (A - λI) to zero and solve for λ1 and λ2:

| 0.6165 - λ   0.6154     |
| 0.6154       0.7165 - λ |

How: (0.6165 - λ)(0.7165 - λ) - (0.6154 × 0.6154) = 0, giving λ1 = 1.284 and λ2 = 0.049
Data Reduction (PCA)
Example:
Compute the eigenvectors by substituting each eigenvalue into (A - λI):

For λ1 = 1.284:
| 0.6165 - λ1   0.6154       |   | -0.6675   0.6154 |
| 0.6154        0.7165 - λ1  | = |  0.6154  -0.5675 |

For λ2 = 0.049:
| 0.6165 - λ2   0.6154       |   | 0.5674   0.6154 |
| 0.6154        0.7165 - λ2  | = | 0.6154   0.6674 |

PCA: Eigenvector computation (for λ1)
-0.6675·x1 + 0.6154·x2 = 0
Divide by (0.6675 × 0.6154) throughout  =>  x1/0.6154 = x2/0.6675 = t (say)

V1 = (0.6154, 0.6675)

Unit eigenvector: divide by ||V1|| = sqrt(0.6154² + 0.6675²) = sqrt(0.8243) = 0.908

V1 = (1/0.908) × (0.6154, 0.6675) = (0.67787, 0.73517)
Data Reduction (PCA)
Example:

First Principal Component (PC1) = (0.67787, 0.73517)
Second Principal Component (PC2) = (-0.73517, 0.67787)

“The eigenvector corresponding to the highest eigenvalue is taken as PC1, followed by the other components as per their eigenvalues.”

 To calculate the percentage of information explained by PC1 and PC2, divide each eigenvalue by the sum of the eigenvalues (1.284 + 0.049 = 1.333), recalling that λ1 = 1.284 and λ2 = 0.049:
PC1 = 1.284/1.333 ≈ 96%    PC2 = 0.049/1.333 ≈ 4%
Data Reduction (PCA)
Step 4 helps to reduce the dimensions by discarding the components with a very small percentage of information in a multi-dimensional space. The remaining ones form a matrix of vectors known as the feature vector; each column corresponds to one principal component.
Step 5: transform the data along the principal components using
PCA_data = EigenVectors^T × MeanAdjustedData^T
where MeanAdjustedData is the original data with the respective feature means subtracted from each point.
Extra step (recover the original data from the PCA data):
MeanAdjustedData^T = EigenVectors × PCA_data (the eigenvector matrix is orthogonal, so its inverse is its transpose)
Original data: add the means back to the respective dimensional values of the MeanAdjustedData.
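A numpy sketch that reproduces the worked example above (eigenvalues ≈ 1.284 and 0.049, and PC1 ≈ (0.678, 0.735) up to sign):

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

X_adj = X - X.mean(axis=0)               # mean-adjusted data
A = np.cov(X_adj, rowvar=False)          # step 1: covariance matrix

eigvals, eigvecs = np.linalg.eigh(A)     # steps 2-3: eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]        # step 4: sort in decreasing eigenvalue order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)                           # ~[1.284, 0.049]
print(eigvals / eigvals.sum())           # ~[0.96, 0.04] information explained

pca_data = eigvecs.T @ X_adj.T           # step 5: transform along the PCs
recovered = (eigvecs @ pca_data).T + X.mean(axis=0)   # extra step: recover the original data
```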
Data Transformation (usually done before Dimensionality Reduction)
Data Transformation: the process of transforming or consolidating the data into a form suitable for machine learning algorithms.

Major techniques of data transformation are :-


• Normalization
• Aggregation
• Generalization
• Feature construction
Data Transformation
Normalization: the technique of mapping numerical feature values from any range into a specific smaller range, e.g. 0 to 1 or -1 to 1.

Popular methods of Normalization are:-


• Min-Max method
• Mean normalization
• Z score method
• Decimal scaling method
Data Transformation (Normalization)
Min-Max method:

x' = (x - min(X)) / (max(X) - min(X))

where x' is the mapped value, x is the data value to be mapped into the specific range, and min(X), max(X) are the minimum and maximum values of the feature vector corresponding to x.

X    Min-max normalized
2    0
47   0.512
90   1
18   0.18
5    0.034
Data Transformation (Normalization)
Mean normalization:

x' = (x - mean(X)) / (max(X) - min(X))

where x' is the mapped value, x is the data value to be mapped into the specific range, mean(X) is the mean of the feature vector corresponding to x, and min(X), max(X) are its minimum and maximum values.

X    Mean normalized
2    -0.345
47   0.166
90   0.655
18   -0.164
5    -0.311
Data Transformation (Normalization)
Z Score method:

x' = (x - mean(X)) / std(X)

where x' is the mapped value, x is the data value to be mapped into the specific range, and mean(X) and std(X) are the mean and standard deviation of the feature vector corresponding to x.

X    Z-score normalized
2    -0.826
47   0.397
90   1.566
18   -0.391
5    -0.745
Data Transformation (Normalization)
Decimal scaling method:

x' = x / 10^j

where x' is the mapped value, x is the data value to be mapped into the specific range, and j is the maximum of the count of digits in the minimum and maximum values of the feature vector corresponding to x (here j = 2).

X    Decimal-scaled
2    0.02
47   0.47
90   0.9
18   0.18
5    0.05
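A numpy sketch of the four methods on the example vector X = [2, 47, 90, 18, 5], reproducing the values above (the Z score uses the sample standard deviation):

```python
import numpy as np

x = np.array([2.0, 47.0, 90.0, 18.0, 5.0])

min_max  = (x - x.min()) / (x.max() - x.min())    # [0, 0.512, 1, 0.182, 0.034]
mean_nrm = (x - x.mean()) / (x.max() - x.min())   # [-0.345, 0.166, 0.655, -0.164, -0.311]
z_score  = (x - x.mean()) / x.std(ddof=1)         # [-0.826, 0.397, 1.566, -0.391, -0.745]

j = len(str(int(abs(x).max())))                   # digit count of the largest magnitude (2)
decimal  = x / 10 ** j                            # [0.02, 0.47, 0.9, 0.18, 0.05]
```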
Data Transformation
Aggregation: take aggregated values in order to put the data in a better perspective.

e.g. in the case of transactional data, the day-to-day sales of a product at various store locations can be aggregated store-wise over months or years in order to be analyzed for decision making.

Benefits of aggregation
• Reduces the memory consumed to store large data records.
• Provides a more stable viewpoint than individual data objects.
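A pandas sketch of store-wise aggregation of daily sales; the dataframe is illustrative:

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2", "S1", "S2"],
    "year":  [2022, 2022, 2022, 2023, 2023, 2023],
    "sales": [120, 150, 90, 110, 160, 130],
})

# aggregate day-to-day sales store-wise and year-wise
summary = sales.groupby(["store", "year"], as_index=False)["sales"].sum()
```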
Data Transformation (Aggregation)

(Figure: an aggregation example. Image source: a Towards Data Science blog.)
Data Transformation
Generalization: the data is generalized from low-level to higher-order concepts using concept hierarchies.
e.g. a categorical attribute like street can be generalized to higher-order concepts like city or country.
“The choice of generalization level depends on the problem statement”

Feature construction: new attributes are constructed from the given set of attributes.
e.g. features like mobile number and landline number can be combined under a new feature, contact number.
Feature Scaling
Feature Scaling: a method used to normalize the range of independent features of the data. Basically, it is the process of scaling down the feature magnitudes to bring them into a common range.

Need for feature scaling: features differ in magnitude and units, e.g. Tax = func(Salary, Age):

Salary   Age
40000    35
500000   42
25000    26
30000    30
Feature Scaling
 A large difference in the magnitude of features makes the process of model training difficult.
 Many models need to calculate Euclidean or Manhattan distances, which come out very large when there is a huge difference in magnitude values.
 Moreover, visualization is also difficult, as the data points lie at very far-off locations.
 KNN and K-means are some popularly known algorithms where feature scaling is an important role player.
Feature Scaling
Feature Scaling Techniques
 Normalization (discussed under data transformation)
 Standardization (a.k.a. Z-score normalization)
• It standardizes the data to follow a standard normal distribution with mean 0 and standard deviation 1.

(Figure: the standard normal distribution.)
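A scikit-learn sketch of both feature scaling techniques on the salary/age example above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[40000, 35], [500000, 42], [25000, 26], [30000, 30]], dtype=float)

X_minmax = MinMaxScaler().fit_transform(X)     # normalization: each feature mapped to [0, 1]
X_std    = StandardScaler().fit_transform(X)   # standardization: mean 0, std 1 per feature
```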


Dataset Splitting
“After performing the various pre-processing tasks on the dataset, it is now ready to be utilized by existing machine learning algorithms.”

“Wait!! Before modeling the dataset for analysis and decision making by machine learning algorithms, it is advisable to split it into training, validation, and testing sets.”
Dataset Splitting

Training data: This is the part on which machine learning


algorithms are actually trained to build a model. The model
tries to learn the dataset and its various characteristics.

Validation data : This is the part of the dataset which is


used to validate our various model fits. In simpler words, it
is used to tune the hyperparameters of the algorithm for
better performance.
“Validation!! Only tuning, no learning”
Dataset Splitting
Test data : This part of the dataset is used to test our
model and quantify the accuracy measures to depict its
performance when deployed on real-world data.

Data splitting techniques:


• Simple random sampling (SRS)
• Systematic sampling
• Stratified sampling
Dataset Splitting
Simple random sampling :
In a simple random
sample each observation
in the data set has an
equal chance of being
selected.

“May result in bias toward one category if the distribution of categories is not proper.”
Dataset Splitting
Systematic sampling: It involves selecting items from an
ordered population using a skip or sampling interval. That
means that every "nth" data sample is chosen in a large
data set.

“Not a good choice when


data exhibits some
patterns”
Dataset Splitting
Stratified sampling: this is accomplished by selecting samples at random within each class. This approach ensures that the frequency distribution of the outcome is approximately equal within the training and test sets. It is mostly used in classification problems or wherever data categories are available.
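A scikit-learn sketch of a stratified train/test split; the dataset and the 80/20 ratio are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class frequency distribution similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```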
Dataset Splitting

Split Ratio: the ratio in which the dataset is split into training, validation, and testing sets. It is highly dependent on the type of model to be trained and on the dataset itself.
 Larger training ratio (the most usual case): if our dataset and model are such that a lot of training is required, then we use a larger chunk of the data just for training purposes.
e.g. training on textual data, image data, or video data where thousands of features are involved.
 Larger validation ratio: if the model has a lot of hyperparameters that can be tuned, then keeping a higher percentage of data for the validation set is advisable.
Dataset Splitting
Commonly used split ratios are:
70% train, 15% val, 15% test.
80% train, 10% val, 10% test.
60% train, 20% val, 20% test.
Many people split into 2 sets instead of 3:
(i) Training (ii) Validation/Testing
Common ratios: 70/30 and 80/20
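A sketch of a 70/15/15 train/validation/test split done with two successive calls to train_test_split; the dataset and ratios are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# first split off the 70% training portion
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)
# then split the remaining 30% equally into validation and test (15% each)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0
)
```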
