
Data Pre-processing in Machine Learning

Acknowledgements:
Dr Vijay Kumar, IT Dept, NIT Jalandhar
Topics
• Need for data pre-processing
• What is data pre-processing
• Data Pre-processing tasks
Need for data Pre-processing

“Data, data, everywhere …”


Need for data Pre-processing?
Need for data Pre-processing
 Data in the real world is “quite messy”
• incomplete: missing feature values, absence of a certain crucial feature, or containing only aggregate data
 e.g. Height = “ ”
• noisy: containing errors or outliers
 e.g. Weight = “5000” or “-60”
• inconsistent: containing discrepancies in feature values
 e.g. Age = “20” and DoB = “12 July 1990”
 e.g. contradictions between duplicate records
Need for data Pre-processing

“No quality data, less accurate results!”

“The less accurate the model, the higher the probability of a wrong decision.”

“Right decisions require more quality data.”


What is data Pre-processing
Data Pre-processing: the phase of any machine learning process that transforms, or encodes, the data to bring it to a state where it can be easily interpreted by the learning algorithm.

“Data pre-processing is not a single standalone entity but a collection of multiple interrelated tasks.”

“Collectively, data pre-processing constitutes the majority of the effort in a machine learning process (approx. 90%).”
Data pre-processing tasks

 Major data pre-processing tasks


• Data cleaning
• Data integration
• Data transformation
• Data reduction
Data Cleaning
Data cleaning: a procedure to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving data inconsistencies.

 Data cleaning tasks


• Fill missing values
• Noise smoothing and outlier detection
• Resolving inconsistencies
Data Cleaning
Missing values: data values that are not available.
e.g. many data entities have no value for a certain feature, such as a missing BMI value for a person in a diabetes dataset.

Probable reasons for missing values:

• faulty measuring equipment
• reluctance of a person to share certain details
• negligence on the part of the data entry operator
• the feature was considered unimportant at the time of data collection
Data Cleaning (Missing values)
 Missing data handling techniques
• Removing the data entity
• Manually filling the values
• Replacing the missing value by central tendency (mean,
median, mode) for a feature vector
• Replacing the missing value by central tendency belonging
to same class for a feature vector.

“Technique selection is specific to the user’s preference, the dataset or feature type, or the problem at hand.”

“A feature is an individual measurable property or characteristic of a phenomenon being observed.”
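A minimal pandas sketch of these handling techniques, assuming a small dataframe in the style of the forest-fires sample shown on the next slide (column names and values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "month": ["mar", "oct", "oct", "mar", "aug"],
    "temp":  [8.2, 18.0, np.nan, 8.3, np.nan],
    "wind":  [6.7, 0.9, np.nan, 4.0, np.nan],
})

# 1. Removing the data entity: drop every row that contains a missing value
dropped = df.dropna()

# 2. Replacing missing values with a central tendency of the feature vector
mean_filled   = df.fillna(df.mean(numeric_only=True))    # mean
median_filled = df.fillna(df.median(numeric_only=True))  # median
mode_filled   = df.fillna(df.mode().iloc[0])             # mode (first modal value per column)
```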
Data Cleaning (Missing values)
Sample dataset related to forest fires
Month FFMC DC temp RH wind
mar 86.2 94.3 8.2 51 6.7
oct 90.6 669.1 18 33 0.9
oct 90.6 686.9 NaN 33 NaN
mar NaN 77.5 8.3 97 4
mar 89.3 102.2 11.4 99 1.8
aug 92.3 NaN 22.2 NaN NaN
aug NaN 495.6 24.1 27 NaN
aug 91.5 608.2 8 86 2.2
sep 91 692.6 NaN 63 5.4
sep 92.5 698.6 22.8 40 4
Data Cleaning (Missing values)
Removing the data entity: the easiest way, but usually discouraged because it leads to loss of data; the removed entities or feature values could have added value to the data set as well.
Before (with missing values):
Month FFMC DC temp RH wind
mar 86.2 94.3 8.2 51 6.7
oct 90.6 669.1 18 33 0.9
oct 90.6 686.9 NaN 33 NaN
mar NaN 77.5 8.3 97 4
mar 89.3 102.2 11.4 99 1.8
aug 92.3 NaN 22.2 NaN NaN
aug NaN 495.6 24.1 27 NaN
aug 91.5 608.2 8 86 2.2
sep 91 692.6 NaN 63 5.4
sep 92.5 698.6 22.8 40 4

After removing the entities with missing values:
Month FFMC DC temp RH wind
mar 86.2 94.3 8.2 51 6.7
oct 90.6 669.1 18 33 0.9
mar 89.3 102.2 11.4 99 1.8
aug 91.5 608.2 8 86 2.2
sep 92.5 698.6 22.8 40 4
Data Cleaning (Missing values)
Manually filling in the values: this approach is time-consuming and not recommended for huge data sets.
Before: the sample dataset with missing values (as above).

After manually filling in the missing values:
Month FFMC DC temp RH wind
mar 86.2 94.3 8.2 51 6.7
oct 90.6 669.1 18 33 0.9
oct 90.6 686.9 17 33 0.8
mar 91.6 77.5 8.3 97 4
mar 89.3 102.2 11.4 99 1.8
aug 92.3 380 22.2 92 1.8
aug 90 495.6 24.1 27 2
aug 91.5 608.2 8 86 2.2
sep 91 692.6 22 63 5.4
sep 92.5 698.6 22.8 40 4
Data Cleaning (Missing values)
Central tendency technique: the mean, median, or mode of the feature is used to calculate the value that replaces the missing values.

Mean: the average of the non-missing values of the feature.

Median: the middle value of the sorted non-missing values of the feature.

Mode: the most frequent value corresponding to a certain feature in a given data set.
Data Cleaning (Missing values)
Replacing with mean value:
Before: the sample dataset with missing values (as above).

After replacing missing values with the feature mean:
Month FFMC DC temp RH wind
mar 86.2 94.3 8.2 51 6.7
oct 90.6 669.1 18 33 0.9
oct 90.6 686.9 15.3 33 3.57
mar 90.5 77.5 8.3 97 4
mar 89.3 102.2 11.4 99 1.8
aug 92.3 458.3 22.2 58.7 3.57
aug 90.5 495.6 24.1 27 3.57
aug 91.5 608.2 8 86 2.2
sep 91 692.6 15.3 63 5.4
sep 92.5 698.6 22.8 40 4
Data Cleaning (Missing values)

Now the problem of missing values is solved!!

“Replacing by the mean value is not a suitable method if the data set has many outliers.”

For example, weights of humans: 67, 78, 900, -56, 389, -1, etc. (contains outliers)
Mean is 229.5
Data Cleaning (Missing values)
Replacing with median value:
Before: the sample dataset with missing values (as above).

After replacing missing values with the feature median:
Month FFMC DC temp RH wind
mar 86.2 94.3 8.2 51 6.7
oct 90.6 669.1 18 33 0.9
oct 90.6 686.9 14.7 33 2.2
mar 90.8 77.5 8.3 97 4
mar 89.3 102.2 11.4 99 1.8
aug 92.3 495.6 22.2 40 2.2
aug 90.8 495.6 24.1 27 2.2
aug 91.5 608.2 8 86 2.2
sep 91 692.6 14.7 63 5.4
sep 92.5 698.6 22.8 40 4
Data Cleaning (Missing values)
Replacing with mode value:
Before: the sample dataset with missing values (as above).

After replacing missing values with the feature mode:
Month FFMC DC temp RH wind
mar 86.2 94.3 8.2 51 6.7
oct 90.6 669.1 18 33 0.9
oct 90.6 686.9 NaN 33 4
mar 90.6 77.5 8.3 97 4
mar 89.3 102.2 11.4 99 1.8
aug 92.3 NaN 22.2 33 4
aug 90.6 495.6 24.1 27 4
aug 91.5 608.2 8 86 2.2
sep 91 692.6 NaN 63 5.4
sep 92.5 698.6 22.8 40 4
(temp and DC have no repeated value, so no mode exists and those entries remain NaN)

“Mode is a good option for missing values in the case of categorical variables.”
Data Cleaning (Missing values)
Replacing with central tendency values corresponding to the class, for a certain feature.

The technique is similar to the one discussed above, except that:
• The feature vector is divided into a number of subparts, each corresponding to one class.
• Each sub feature vector is used to determine the central tendency value that replaces the missing values of that feature for that particular class.
• e.g. an employee salary vector can be divided by gender (male/female) or by job designation, etc. (a code sketch follows)
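A sketch of class-wise imputation with pandas, assuming hypothetical gender and salary columns in line with the example above:

```python
import numpy as np
import pandas as pd

emp = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F"],
    "salary": [52000, 48000, np.nan, 51000, 61000, np.nan],
})

# Replace each missing salary with the mean salary of the same gender (class)
emp["salary"] = emp.groupby("gender")["salary"].transform(
    lambda s: s.fillna(s.mean())
)
```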
Data Cleaning (Noisy data)
Noise is defined as a random variance in a measured variable. For
numeric values, boxplots and scatter plots can be used to identify
outliers.

(Figures: a boxplot and a scatter plot used to spot outliers.)
Data Cleaning (Noisy data)
Popular reasons for random variations are:
• Malfunctioning of data collection instruments.
• Data entry problems.
• Data transmission problems.

To deal with these anomalous values, data smoothing techniques are applied; some popular ones are:
• Binning method
• Regression
• Outlier analysis
Data Cleaning (Noisy data)
Binning method: performs the task of data smoothing.
Steps to be followed under the binning method:
Step 1: Sort the data into ascending order.
Step 2: Calculate the bin size (i.e. the number of bins).
Step 3: Partition or distribute the data equally among the bins, starting with the first element of the sorted data.
Step 4: Perform data smoothing using bin means, bin boundaries, or bin medians.
The last bin can have one element less or more!!
Data Cleaning (Noisy data)
Example: 9, 21, 29, 28, 4, 21, 8, 24, 26

Step 1: sorted data: 4, 8, 9, 21, 21, 24, 26, 28, 29

Step 2: bin size calculation

Bin size = (max value - min value) / number of data values = (29 - 4) / 9 = 2.777

But we need to take the ceiling value, so the bin size is 3 here.
Data Cleaning (Noisy data)
Step 3 : Bin partitioning (equi-size)
Bin 1: 4, 8, 9
Bin 2: 21, 21, 24
Bin 3: 26, 28, 29

Step 4 : data smoothening


 Using mean value : replace the bin values by bin average

Bin 1: 7, 7, 7
Bin 2: 22, 22, 22
Bin 3: 27, 27, 27
Data Cleaning (Noisy data)
 Using boundary values : replace each bin value by the closest boundary value of the corresponding bin.
Bin 1: 4, 9, 9
Bin 2: 21, 21, 24
Bin 3: 26, 29, 29
“Boundary values remain unchanged in the boundary method”

 Using median values : replace each bin value by the bin median.

Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28
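A minimal numpy sketch of the three smoothing variants for the worked example above (equal-frequency bins of size 3):

```python
import numpy as np

data = np.array([9, 21, 29, 28, 4, 21, 8, 24, 26])
data.sort()                                 # step 1: 4 8 9 21 21 24 26 28 29
bins = data.reshape(3, 3)                   # step 3: three equal-size bins

smooth_mean   = np.repeat(bins.mean(axis=1), 3).reshape(3, 3)        # 7, 22, 27.67 (the slide rounds to 27)
smooth_median = np.repeat(np.median(bins, axis=1), 3).reshape(3, 3)  # 8, 21, 28

# smoothing by bin boundaries: replace each value by the nearer bin boundary
lo, hi = bins[:, [0]], bins[:, [-1]]
smooth_boundary = np.where(bins - lo <= hi - bins, lo, hi)           # 4 9 9 / 21 21 24 / 26 29 29
```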
Data Cleaning (Noisy data)
Regression method : Linear regression and multiple linear
regression can be used to smooth the data, where the values are
conformed to a function.
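A minimal sketch of regression-based smoothing with scikit-learn; x and y are illustrative noisy measurements, not data from the slides:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * x.ravel() + 5.0 + rng.normal(scale=4.0, size=20)  # noisy linear signal

# fit a linear function and conform the values to it
model = LinearRegression().fit(x, y)
y_smoothed = model.predict(x)
```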
Data Cleaning (Noisy data)
Outlier analysis: performs the task of data refinement by tracing the outliers with the help of clustering and then dealing with them.
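A sketch of clustering-based outlier tracing with k-means; the synthetic data, cluster count, and distance threshold are illustrative choices:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0, 1, (50, 2)),      # one normal cluster
    rng.normal(10, 1, (50, 2)),     # another normal cluster
    np.array([[5.0, 20.0]]),        # an obvious outlier
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# flag points whose distance to their own centroid is unusually large
outliers = X[dist > dist.mean() + 3 * dist.std()]
```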
Data Cleaning (Inconsistent data)
Inconsistent Data: discrepancies between different data
items.
e.g. the “Address” field contains the “Phone number”

To resolve inconsistencies
 Manual correction using external references
 Semi-automatic tools
• To detect violation of known functional dependencies and data
constraints
• To correct redundant data
Data Cleaning (Inconsistent data)

“To avoid inconsistencies, perform data assessment: know what the data type of each feature should be and whether it is the same for all the data objects.”
Data Integration
Data Integration: It is the process of merging the data
from multiple sources into a coherent data store.
e.g. Collection of banking data from different banks at data stores of RBI

Issues in data integration

• Schema integration and feature matching
• Redundant features
• Detection and resolution of data value conflicts
Data Integration
Schema integration and feature matching:

Source schemas: (Cust.id, Name, Age, DoB), (Cust.no, Name, Age, DoB), (Cust.id, Name, Age, DoB)
Integrated schema: (Cust.id, Name, Age, DoB) — Cust.id and Cust.no must be matched as the same feature.

“Carefully analyse the metadata”
Data Integration
Redundant features: they are unwanted features.

 To deal with redundant features, correlation analysis is performed; the correlation coefficient is denoted by r.
e.g. in the customer schema (Cust.id, Name, Age, DoB), Age is redundant because it can be derived from DoB.

(Figure: scatter plots showing r positive, r negative, and r zero.)
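A pandas sketch of a correlation check between two candidate-redundant numeric features; the column names and values are illustrative:

```python
import pandas as pd

cust = pd.DataFrame({
    "age":        [34, 29, 45, 52, 23],
    "birth_year": [1991, 1996, 1980, 1973, 2002],
})

# Pearson's r; here it is -1, so one of the two features can be dropped
r = cust["age"].corr(cust["birth_year"])
print(f"r = {r:.3f}")
```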


Data Integration
Detection and resolution of data value conflicts:

e.g. research data for the prices of essential products collected from different sources:
Source 1: (Product, Year, Price in $)
Source 2: (Product, Year, Price in Rs)
Source 3: (Product, Year, Price in £)
Integrated: (Product, Year, Price) — the price values must be resolved to a common unit.

“Carefully analyse the metadata”
Data Transformation

This step is done before Data Reduction, but its details will be discussed after Data Reduction (Data Reduction needs about 50 consecutive minutes).
Data Reduction
Data Reduction: the process of constructing a condensed representation of the data set that is smaller in volume while maintaining the integrity of the original one. The quality of the results should not degrade with data reduction.
Some facts about data reduction in machine learning:
• There exists an optimal number of features in a feature set for the corresponding machine learning task.
• Adding more features than the optimal (strictly necessary) ones results in performance degradation (because of added noise).
Data Reduction

“ Challenging task”
Data Reduction
Benefits of data reduction
• Accuracy improvement.
• Reduced risk of overfitting (where the model fits the training data but not the validation data).
• Faster training.
• Improved data visualization.
• Increased explainability of the model.
• Increased storage efficiency.
• Reduced storage cost.
Data Reduction
Major techniques of data reduction are:
• Attribute subset selection
• Low variance filter
• High correlation filter
• Numerosity reduction
• Dimensionality reduction
Data Reduction
Attribute subset selection: only the highly relevant attributes should be used; the rest can be discarded.

Techniques of attribute selection:

• Brute force approach: try all possible feature subsets as input to the machine learning algorithm and analyze the results.
• Statistical approach: statistical testing is applied to select the most significant features from the original feature set.
Data Reduction (Attribute subset)
Brute force technique:
F1 F2 F3 F4 F5 F6 F7 F8 F9 Target

Step 1: Construct the power set corresponding to the feature set.
Step 2: Select an element (feature subset) from the power set.
Step 3: Measure the accuracy of the learning model corresponding to the selected features.
Step 4: Repeat Step 2 and Step 3 until the desired accuracy is achieved.

“To decrease the number of iterations, expert knowledge of the features is used”
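A sketch of the brute-force search over feature subsets; the dataset, model, and cross-validated scoring are illustrative stand-ins:

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

best_score, best_subset = 0.0, None
# steps 1-3: iterate over all non-empty feature subsets and score each one
for k in range(1, n_features + 1):
    for subset in combinations(range(n_features), k):
        cols = list(subset)
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, cols], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, round(best_score, 3))
```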
Data Reduction (Attribute subset)
Statistical technique: chi-square test
• A chi-square test is used to test the
independence of predictor and the
target.
• In feature reduction, aim is to select
highly dependent features as per
target.
• Independent features are removed.

“The higher the chi-square value, the higher the level of dependence.”
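A sketch of chi-square feature selection with scikit-learn; the dataset and k are illustrative choices, and chi2 requires non-negative feature values:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)             # all features are non-negative

selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 most target-dependent features
X_reduced = selector.fit_transform(X, y)

print(selector.scores_)                       # higher chi-square => higher dependence
```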


Data Reduction
Low variance filter: normalized features whose variance (spread) is less than a threshold are removed, since little change in the data means little information.
F1 F2 F3
0.113 0.272 0.318
0.803 0.383 0.027
0.197 0.630 0.319
0.210 0.987 0.310
0.305 0.984 0.390
0.464 0.031 0.314
0.954 0.008 0.399
0.943 0.324 0.278
0.997 0.418 0.133
0.270 0.525 0.373

Variance(F1) = 0.115298
Variance(F2) = 0.103589
Variance(F3) = 0.012525  <- very low variance, so F3 is removed
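A sketch of the low-variance filter applied to the table above with scikit-learn; the 0.05 threshold is an illustrative choice:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [0.113, 0.272, 0.318], [0.803, 0.383, 0.027], [0.197, 0.630, 0.319],
    [0.210, 0.987, 0.310], [0.305, 0.984, 0.390], [0.464, 0.031, 0.314],
    [0.954, 0.008, 0.399], [0.943, 0.324, 0.278], [0.997, 0.418, 0.133],
    [0.270, 0.525, 0.373],
])

vt = VarianceThreshold(threshold=0.05)   # drop features with variance below 0.05
X_reduced = vt.fit_transform(X)          # F3 is removed
print(vt.variances_)                     # ~[0.115, 0.104, 0.013]
```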
Data Reduction
High correlation filter: normalized attributes whose correlation coefficient with another attribute is above a threshold are also removed, since similar trends mean that similar information is carried. The correlation coefficient is usually calculated using statistical measures such as Pearson's correlation coefficient.
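A pandas sketch of a high-correlation filter; the synthetic features and the 0.9 threshold are illustrative choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"f1": rng.normal(size=100)})
df["f2"] = df["f1"] * 2 + rng.normal(scale=0.01, size=100)  # nearly duplicates f1
df["f3"] = rng.normal(size=100)                             # independent feature

corr = df.corr().abs()
# keep only the upper triangle so each feature pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)    # drops f2, keeps f1 and f3
```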
Data Reduction
Numerosity Reduction: this enables storing a model of the data instead of the whole data.
e.g. regression models.
Data Reduction

Dimensionality Reduction: the technique of reducing the number of dimensions, not simply by selecting a feature subset but by going much further than that.
 Dimensions refers to the number of geometric planes the dataset lies in, which could be so high that it cannot be visualized with pen and paper.

“The more the number of planes, the more complex the dataset”
Data Reduction
Principal Component Analysis (PCA): a dimensionality reduction technique that reduces a higher-dimensional feature space to a lower-dimensional feature space. It also helps to make visualization of the dataset simpler.
Data Reduction (PCA)
Some of the major facts about PCA are:
 Principal components are new features constructed as linear combinations or mixtures of the initial feature set.
 These combinations are constructed in such a manner that all the newly constructed principal components are uncorrelated.
 Together with the reduction task, PCA also preserves as much information as possible from the original data set.
Data Reduction (PCA)
Some of the major facts about PCA are:
 Principal components are usually denoted by PCi, where i can be 1, 2, 3, …, n (depending on the number of features in the original data set).
 The major proportion of the information about the original feature set is explained by the first principal component alone, i.e. PC1.
 The remaining information is captured by the other principal components, in decreasing proportion as i increases.
Data Reduction (PCA)
 Geometrically, principal components can be seen as lines pointing in the directions that capture the maximum amount of information about the data.

Simply put, principal components are new axes that give better data visibility, with clearer differences between observations.
Data Reduction (PCA)
Stepwise working of PCA
Step 1: Construct the covariance matrix, A.

Step 2: Compute the eigenvalues of the covariance matrix.

Eigenvector: the direction of a line through the data, while the eigenvalue is a number that tells us how spread out the data set is along that line. Eigenvectors show us the directions of the main axes (principal components) of the data. The greater the eigenvalue, the greater the variation along that axis.

Step 3: Compute the eigenvector corresponding to every eigenvalue obtained in Step 2.
Data Reduction (PCA)
Stepwise working of PCA

Step 4: Sort the eigenvectors in decreasing order of their eigenvalues and choose the k eigenvectors with the largest eigenvalues.
Step 5: Transform the data along the principal component axes.

“Seems difficult!!”

No

“Just need some mathematical skills”
Data Reduction (PCA)
Example:
Original dataset:
X Y
2.5 2.4
0.5 0.7
2.2 2.9
1.9 2.2
3.1 3
2.3 2.7
2 1.6
1 1.1
1.5 1.6
1.1 0.9

Compute the covariance matrix A:
A = | Cov(X,X)  Cov(X,Y) |
    | Cov(Y,X)  Cov(Y,Y) |
Data Reduction (PCA)
Example:
Means: x̄ = 1.81, ȳ = 1.91

X    Y    (x-x̄)  (y-ȳ)  (x-x̄)²  (y-ȳ)²  (x-x̄)(y-ȳ)
2.5  2.4   0.69   0.49  0.4761  0.2401   0.3381
0.5  0.7  -1.31  -1.21  1.7161  1.4641   1.5851
2.2  2.9   0.39   0.99  0.1521  0.9801   0.3861
1.9  2.2   0.09   0.29  0.0081  0.0841   0.0261
3.1  3     1.29   1.09  1.6641  1.1881   1.4061
2.3  2.7   0.49   0.79  0.2401  0.6241   0.3871
2    1.6   0.19  -0.31  0.0361  0.0961  -0.0589
1    1.1  -0.81  -0.81  0.6561  0.6561   0.6561
1.5  1.6  -0.31  -0.31  0.0961  0.0961   0.0961
1.1  0.9  -0.71  -1.01  0.5041  1.0201   0.7171

Cov(X,X) = 0.6165, Cov(Y,Y) = 0.7165, Cov(X,Y) = Cov(Y,X) = 0.6154
Data Reduction (PCA)
Example:

A = | 0.6165  0.6154 |        I = | 1  0 |
    | 0.6154  0.7165 |            | 0  1 |

Compute the eigenvalues: set the determinant of (A - λI) to zero and solve for λ1 and λ2:

| 0.6165 - λ   0.6154     |
| 0.6154       0.7165 - λ |

How: (0.6165 - λ)(0.7165 - λ) - (0.6154 × 0.6154) = 0, giving λ1 = 1.284 and λ2 = 0.049
Data Reduction (PCA)
Example:
Compute the eigenvectors by substituting each eigenvalue into (A - λI):

For λ1 = 1.284:
| 0.6165 - λ1   0.6154       |   | -0.6675   0.6154 |
| 0.6154        0.7165 - λ1  | = |  0.6154  -0.5675 |

For λ2 = 0.049:
| 0.6165 - λ2   0.6154       |   | 0.5674   0.6154 |
| 0.6154        0.7165 - λ2  | = | 0.6154   0.6674 |

PCA: Eigenvector computation (for λ1)
-0.6675·x1 + 0.6154·x2 = 0
Divide by (0.6675 × 0.6154) throughout  =>  x1/0.6154 = x2/0.6675 = t (say)

V1 = (0.6154, 0.6675)

Unit eigenvector: divide by ||V1|| = sqrt(0.6154² + 0.6675²) = sqrt(0.8243) = 0.908

V1 = (1/0.908) × (0.6154, 0.6675) = (0.67787, 0.73517)
Data Reduction (PCA)
Example:

First Principal Component (PC1) = (0.67787, 0.73517)
Second Principal Component (PC2) = (-0.73517, 0.67787)

“The eigenvector corresponding to the highest eigenvalue is taken as PC1, followed by the other components as per their eigenvalues.”

 To calculate the percentage of information explained by PC1 and PC2, divide each eigenvalue by the sum of the eigenvalues (1.284 + 0.049 = 1.333), recalling that λ1 = 1.284 and λ2 = 0.049:
PC1 = 1.284/1.333 ≈ 96%    PC2 = 0.049/1.333 ≈ 4%
Data Reduction (PCA)
Step 4 helps to reduce the dimensions by discarding the components with a very small percentage of information in a multi-dimensional space. The remaining ones form a matrix of vectors known as the feature vector; each column corresponds to one principal component.
Step 5: transform the data along the principal components using
PCA_data = EigenVectors^T × MeanAdjustedData^T
where MeanAdjustedData is the original data with the respective feature means subtracted from each point.
Extra step (recover the original data from the PCA data):
MeanAdjustedData^T = EigenVectors × PCA_data (the eigenvector matrix is orthogonal, so its inverse is its transpose)
Original data: add the means back to the respective dimensional values of the MeanAdjustedData.
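A numpy sketch that reproduces the worked example above (eigenvalues ≈ 1.284 and 0.049, and PC1 ≈ (0.678, 0.735) up to sign):

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

X_adj = X - X.mean(axis=0)               # mean-adjusted data
A = np.cov(X_adj, rowvar=False)          # step 1: covariance matrix

eigvals, eigvecs = np.linalg.eigh(A)     # steps 2-3: eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]        # step 4: sort in decreasing eigenvalue order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)                           # ~[1.284, 0.049]
print(eigvals / eigvals.sum())           # ~[0.96, 0.04] information explained

pca_data = eigvecs.T @ X_adj.T           # step 5: transform along the PCs
recovered = (eigvecs @ pca_data).T + X.mean(axis=0)   # extra step: recover the original data
```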
Data Transformation (usually done before Dimensionality Reduction)
Data Transformation: the process of transforming or consolidating the data into a form suitable for machine learning algorithms.

Major techniques of data transformation are :-


• Normalization
• Aggregation
• Generalization
• Feature construction
Data Transformation
Normalization: the technique of mapping numerical feature values from any range into a specific smaller range, e.g. 0 to 1 or -1 to 1.

Popular methods of Normalization are:-


• Min-Max method
• Mean normalization
• Z score method
• Decimal scaling method
Data Transformation (Normalization)
Min-Max method:

x' = (x - min(X)) / (max(X) - min(X))

where x' is the mapped value, x is the data value to be mapped into the specific range, and min(X), max(X) are the minimum and maximum values of the feature vector corresponding to x.

X    Min-max normalized
2    0
47   0.512
90   1
18   0.18
5    0.034
Data Transformation (Normalization)
Mean normalization:

x' = (x - mean(X)) / (max(X) - min(X))

where x' is the mapped value, x is the data value to be mapped into the specific range, mean(X) is the mean of the feature vector corresponding to x, and min(X), max(X) are its minimum and maximum values.

X    Mean normalized
2    -0.345
47   0.166
90   0.655
18   -0.164
5    -0.311
Data Transformation (Normalization)
Z Score method:

x' = (x - mean(X)) / std(X)

where x' is the mapped value, x is the data value to be mapped into the specific range, and mean(X) and std(X) are the mean and standard deviation of the feature vector corresponding to x.

X    Z-score normalized
2    -0.826
47   0.397
90   1.566
18   -0.391
5    -0.745
Data Transformation (Normalization)
Decimal scaling method:

x' = x / 10^j

where x' is the mapped value, x is the data value to be mapped into the specific range, and j is the maximum of the count of digits in the minimum and maximum values of the feature vector corresponding to x (here j = 2).

X    Decimal-scaled
2    0.02
47   0.47
90   0.9
18   0.18
5    0.05
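A numpy sketch of the four methods on the example vector X = [2, 47, 90, 18, 5], reproducing the values above (the Z score uses the sample standard deviation):

```python
import numpy as np

x = np.array([2.0, 47.0, 90.0, 18.0, 5.0])

min_max  = (x - x.min()) / (x.max() - x.min())    # [0, 0.512, 1, 0.182, 0.034]
mean_nrm = (x - x.mean()) / (x.max() - x.min())   # [-0.345, 0.166, 0.655, -0.164, -0.311]
z_score  = (x - x.mean()) / x.std(ddof=1)         # [-0.826, 0.397, 1.566, -0.391, -0.745]

j = len(str(int(abs(x).max())))                   # digit count of the largest magnitude (2)
decimal  = x / 10 ** j                            # [0.02, 0.47, 0.9, 0.18, 0.05]
```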
Data Transformation
Aggregation: take aggregated values in order to put the data in a better perspective.

e.g. in the case of transactional data, the day-to-day sales of a product at various store locations can be aggregated store-wise over months or years in order to be analyzed for decision making.

Benefits of aggregation
• Reduces the memory consumed to store large data records.
• Provides a more stable viewpoint than individual data objects.
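A pandas sketch of store-wise aggregation of daily sales; the dataframe is illustrative:

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2", "S1", "S2"],
    "year":  [2022, 2022, 2022, 2023, 2023, 2023],
    "sales": [120, 150, 90, 110, 160, 130],
})

# aggregate day-to-day sales store-wise and year-wise
summary = sales.groupby(["store", "year"], as_index=False)["sales"].sum()
```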
Data Transformation (Aggregation)

(Figure: an aggregation example. Image source: a Towards Data Science blog.)
Data Transformation
Generalization: the data is generalized from low-level to higher-order concepts using concept hierarchies.
e.g. a categorical attribute like street can be generalized to higher-order concepts like city or country.
“The choice of generalization level depends on the problem statement”

Feature construction: new attributes are constructed from the given set of attributes.
e.g. features like mobile number and landline number can be combined under a new feature, contact number.
Feature Scaling
Feature Scaling: a method used to normalize the range of independent features of the data. Basically, it is the process of scaling down the feature magnitudes to bring them into a common range.

Need for feature scaling: features differ in magnitude and units, e.g. Tax = func(Salary, Age):

Salary   Age
40000    35
500000   42
25000    26
30000    30
Feature Scaling
 A large difference in the magnitude of features makes the process of model training difficult.
 Many models need to calculate Euclidean or Manhattan distances, which come out very large when there is a huge difference in magnitude values.
 Moreover, visualization is also difficult, as the data points lie at very far-off locations.
 KNN and K-means are some popularly known algorithms where feature scaling is an important role player.
Feature Scaling
Feature Scaling Techniques
 Normalization (discussed under data transformation)
 Standardization (a.k.a. Z-score normalization)
• It standardizes the data to follow a standard normal distribution with mean 0 and standard deviation 1.

(Figure: the standard normal distribution.)
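A scikit-learn sketch of both feature scaling techniques on the salary/age example above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[40000, 35], [500000, 42], [25000, 26], [30000, 30]], dtype=float)

X_minmax = MinMaxScaler().fit_transform(X)     # normalization: each feature mapped to [0, 1]
X_std    = StandardScaler().fit_transform(X)   # standardization: mean 0, std 1 per feature
```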


Dataset Splitting
“After performing the various pre-processing tasks on the dataset, it is now ready to be utilized by existing machine learning algorithms.”

“Wait!! Before modeling the dataset for analysis and decision making by machine learning algorithms, it is advisable to split it into training, validation, and testing sets.”
Dataset Splitting

Training data: This is the part on which machine learning


algorithms are actually trained to build a model. The model
tries to learn the dataset and its various characteristics.

Validation data : This is the part of the dataset which is


used to validate our various model fits. In simpler words, it
is used to tune the hyperparameters of the algorithm for
better performance.
“Validation!! Only tuning, no learning”
Dataset Splitting
Test data : This part of the dataset is used to test our
model and quantify the accuracy measures to depict its
performance when deployed on real-world data.

Data splitting techniques:


• Simple random sampling (SRS)
• Systematic sampling
• Stratified sampling
Dataset Splitting
Simple random sampling :
In a simple random
sample each observation
in the data set has an
equal chance of being
selected.

“May result in bias toward one category if the distribution of categories is not proper.”
Dataset Splitting
Systematic sampling: It involves selecting items from an
ordered population using a skip or sampling interval. That
means that every "nth" data sample is chosen in a large
data set.

“Not a good choice when


data exhibits some
patterns”
Dataset Splitting
Stratified sampling: this is accomplished by selecting samples at random within each class. This approach ensures that the frequency distribution of the outcome is approximately equal within the training and test sets. It is mostly used in classification problems or wherever data categories are available.
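A scikit-learn sketch of a stratified train/test split; the dataset and the 80/20 ratio are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class frequency distribution similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```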
Dataset Splitting

Split Ratio: the ratio in which the dataset is split into training, validation, and testing sets. It is highly dependent on the type of model to be trained and on the dataset itself.
 Larger training ratio (the most usual case): if our dataset and model are such that a lot of training is required, then we use a larger chunk of the data just for training purposes.
e.g. training on textual data, image data, or video data where thousands of features are involved.
 Larger validation ratio: if the model has a lot of hyperparameters that can be tuned, then keeping a higher percentage of data for the validation set is advisable.
Dataset Splitting
Commonly used split ratios are:
70% train, 15% val, 15% test.
80% train, 10% val, 10% test.
60% train, 20% val, 20% test.
Many people split into 2 sets instead of 3:
(i) Training (ii) Validation/Testing
Common ratios: 70/30 and 80/20
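A sketch of a 70/15/15 train/validation/test split done with two successive calls to train_test_split; the dataset and ratios are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# first split off the 70% training portion
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)
# then split the remaining 30% equally into validation and test (15% each)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0
)
```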
