Concepts (PPT) - Data Preprocessing
Data Preprocessing
Data Mining & Methodology
Data Preprocessing
• The data preprocessing phase builds on data understanding, which guides the preparation tasks.
• It involves transforming raw data into a clean and consistent format suitable for analysis.
• It is a crucial phase: the data quality it ensures affects the accuracy and effectiveness of subsequent data mining tasks.
Why Do We Need To Pre-process the Data?
Before Data Preparation
Missing Data Treatment
Methods for handling missing values:
• Deletion
• If an attribute contains many missing values, consider removing the attribute.
• If only a few examples contain missing values, consider removing those cases/rows.
• Imputation
• For a categorical attribute with missing values, we can introduce a new category, e.g. “unknown”.
• Mean/Mode/Median imputation
• Prediction model
• A more sophisticated method for handling missing data: we build a predictive model to estimate the values that will substitute for the missing data.
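The three families of methods above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the records, the attribute names (`age`, `income`, `city`), and the helper functions are all hypothetical, and the prediction model is assumed to be a simple one-predictor least-squares line.

```python
from statistics import mean, median, mode

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 25, "income": 3000.0, "city": "Penang"},
    {"age": 32, "income": None, "city": "Ipoh"},
    {"age": 47, "income": 5200.0, "city": None},
    {"age": 51, "income": 5600.0, "city": "Ipoh"},
]

def drop_incomplete(records):
    """Deletion: keep only the rows that have no missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def impute(records, attr, strategy):
    """Imputation: fill missing values of `attr` with a summary statistic,
    or with the new category "unknown" for categorical attributes."""
    observed = [r[attr] for r in records if r[attr] is not None]
    if strategy == "unknown":
        fill = "unknown"
    else:
        fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in records]

def impute_by_regression(records, target, predictor):
    """Prediction model (assumed here to be a least-squares line):
    fit target = a + b * predictor on complete rows, then predict the gaps."""
    known = [(r[predictor], r[target]) for r in records if r[target] is not None]
    mx = mean(x for x, _ in known)
    my = mean(y for _, y in known)
    b = sum((x - mx) * (y - my) for x, y in known) / sum((x - mx) ** 2 for x, _ in known)
    a = my - b * mx
    return [
        {**r, target: a + b * r[predictor] if r[target] is None else r[target]}
        for r in records
    ]

complete = drop_incomplete(rows)
median_filled = impute(rows, "income", "median")
city_filled = impute(rows, "city", "unknown")
predicted = impute_by_regression(rows, "income", "age")
```

Each helper returns new records and leaves `rows` untouched, so the strategies can be compared side by side before one is chosen.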
Outliers Treatment
• Deleting observations:
• Delete outlier values if they are due to data entry or data processing errors, or if the outlier observations are very few in number. Trimming at both ends can also be used to remove outliers.
• Transforming variables can also eliminate outliers.
• Taking the natural log of a value reduces the variation caused by extreme values.
• Binning is also a form of variable transformation. Decision tree algorithms deal with outliers well because they bin an attribute’s values.
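As a rough sketch of two of the treatments above, the following plain-Python example trims both ends of a variable and applies a natural log transform. The salary figures, the 10% trimming fraction, and the function names are illustrative assumptions.

```python
import math

def trim(values, fraction):
    """Trimming: drop the lowest and highest `fraction` of the sorted values."""
    k = int(len(values) * fraction)
    ordered = sorted(values)
    return ordered[k:len(ordered) - k]

def log_transform(values):
    """Natural log compresses large values; every input must be positive."""
    return [math.log(v) for v in values]

def spread(values):
    """Ratio of the largest to the smallest value, a crude measure of variation."""
    return max(values) / min(values)

# Hypothetical salaries with one extreme value.
salaries = [2800, 3000, 3100, 3300, 3500, 3600, 3900, 4200, 4500, 250000]

trimmed = trim(salaries, 0.10)    # removes one value at each end
logged = log_transform(salaries)  # keeps every row but shrinks the extreme
```

Trimming discards observations, whereas the log transform keeps all rows and only reduces the influence of the extreme value, which may matter when the data set is small.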
Data Transformation
• Data transformation comprises several approaches and can significantly improve modelling performance.
• Common approaches:
• Data Generalisation
• Aggregation, Binning (Discretization/Binarization)
• Data Normalisation
• Range Transformation
• Z-Transformation
• Log Transformations
• Square Root
• Square
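The generalisation approaches listed above (binning, binarization, aggregation) can be illustrated with a short plain-Python sketch. The bin edges, the pass threshold of 50, the sales records, and all function names are assumptions chosen for the example.

```python
from collections import defaultdict
from statistics import mean

def bin_value(value, edges, labels):
    """Discretization: return the label of the first bin whose inclusive
    upper edge is greater than or equal to the value."""
    for upper, label in zip(edges, labels):
        if value <= upper:
            return label
    raise ValueError(f"{value} exceeds the last bin edge")

def binarize(value, threshold):
    """Binarization: 1 if the value reaches the threshold, else 0."""
    return 1 if value >= threshold else 0

def aggregate(records, key, attr):
    """Aggregation: mean of `attr` for each distinct value of `key`."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r[attr])
    return {k: mean(v) for k, v in groups.items()}

# Hypothetical sales records rolled up to monthly means.
sales = [
    {"month": "Jan", "amount": 120},
    {"month": "Jan", "amount": 80},
    {"month": "Feb", "amount": 150},
]

age_groups = [bin_value(a, [25, 30, 120], ["18-25", "26-30", "31+"]) for a in (19, 26, 44)]
passed = [binarize(m, 50) for m in (35, 50, 72)]
monthly = aggregate(sales, "month", "amount")
```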
Data Transformation - Aggregation
Range & Z Transformation
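Range transformation: $x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$
Z-transformation: $x'_i = \frac{x_i - \operatorname{mean}(x)}{\operatorname{stdev}(x)}$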
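A minimal sketch of the two formulas in plain Python follows. The marks are made-up values, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the slide does not say whether the sample or population version is intended.

```python
from statistics import mean, stdev

def range_transform(xs):
    """Range transformation: rescale values to the interval [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_transform(xs):
    """Z-transformation: centre on the mean and divide by the standard
    deviation (sample standard deviation assumed here)."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

# Hypothetical marks.
marks = [40, 55, 70, 85, 100]

scaled = range_transform(marks)     # [0.0, 0.25, 0.5, 0.75, 1.0]
standardised = z_transform(marks)   # mean 0, standard deviation 1
```

Both functions divide by a measure of spread, so a constant attribute (where max equals min, or the standard deviation is zero) would need to be handled separately.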
[Figure: Redundancy and Irrelevancy, illustrated with attributes x1, x2, x3 and x4]
Attribute Creation
• A process that generates new attributes based on existing attribute(s).
• For example, given a date (dd-mm-yy) as an input variable in a data set, we can generate new variables such as day, month, year, week, and weekday that may have a better relationship with the target variable. This step is used to highlight hidden relationships in a variable.
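The date example can be sketched with Python's standard `datetime` module. The function name and the sample date are hypothetical, and the week number is assumed to follow the ISO 8601 calendar.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def date_attributes(text):
    """Derive day, month, year, week and weekday from a dd-mm-yy string."""
    d = datetime.strptime(text, "%d-%m-%y").date()
    return {
        "day": d.day,
        "month": d.month,
        "year": d.year,
        "week": d.isocalendar()[1],        # ISO week number (assumption)
        "weekday": WEEKDAYS[d.weekday()],  # weekday() returns 0 for Monday
    }

features = date_attributes("15-03-24")
```

Each derived value becomes its own column, so a model can pick up, for instance, a weekday effect that is invisible in the raw date string.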
Attribute Creation Methods
• Creating derived attributes:
• This refers to creating new attributes from existing attribute(s) using a set of functions or other methods.
• Methods such as taking the log of attribute values, binning attributes, and other transformations can also be used to create new attributes.
• Creating dummy attributes:
• The most common application of dummy attributes is converting a categorical variable into numerical variables.
• Dummy attributes are also called indicator variables.
• They make it possible to use a categorical variable as a predictor in statistical models. Each dummy variable takes the value 0 or 1.
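Dummy attribute creation can be sketched as follows. The helper name and the `v_` prefix are illustrative, and the sketch assumes the usual convention that 1 marks the presence of a category and 0 its absence.

```python
def dummy_attributes(values, prefix="v"):
    """Create one 0/1 indicator per distinct category.
    Assumed convention: 1 means the row has that category, 0 means it does not."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{c.lower()}": 1 if v == c else 0 for c in categories}
        for v in values
    ]

# Hypothetical categorical attribute.
colours = ["Green", "Red", "Yellow", "Red"]
dummies = dummy_attributes(colours)
```

Exactly one indicator is 1 in every row. Some statistical models drop one of the indicators to avoid perfectly collinear predictors; that step is left out of this sketch.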
Data Creation and Transformation
Existing Data Type | New Data Type | Methods | Example
Nominal (Categorical) | Numerical | Dummy attribute creation | If the existing variable is a non-multi-value attribute, replace the existing value with a number (NOTE: this might give a misleading meaning to the model). If the existing variable is a multi-value attribute, dummy variable creation is required. E.g. {"Green", "Red", "Yellow"} to dummy variables: v_green: if Green is true, then 1 else 0. v_red: if Red is true, then 1 else 0. v_yellow: if Yellow is true, then 1 else 0.
Ordinal (Categorical) | Numerical | Derived attribute creation | {"Poor", "Average", "Good"} to derived variable values {1, 2, 3} based on their rank
Numerical | Numerical | Binning/Aggregation/Normalization | Performance marks {0-100} to CGPA points {0-4}; transform yearly salary using log
Numerical | Nominal (Ordinal) | Binning/Aggregation | Age numbers grouped into a derived variable with age ranges, e.g. "18-25", "26-30". Performance score {1, 2, 3, 4, 5} discretized into three groups {"Poor", "Average", "Good"}
Numerical | Nominal (Categorical) | Derived attribute creation | Acceptance choice {0, 1} to {"Yes", "No"}
Nominal (Categorical) | Ordinal (Categorical) | | NOTE: This transformation is rare because it does not produce meaningful or useful derived values.
Ordinal (Categorical) | Ordinal / Nominal (Categorical) | Binning | Workload level {"L1", "L2", "L3", "L4", "L5"} discretized into three groups {"Light", "Average", "Heavy"}
Nominal (Categorical) | Nominal (Categorical) | Binning/Aggregation | {"Light Blue", "Blue", "Dark Blue", "Light Red", "Red", "Dark Red"} to derived variable value {"Blue", "Red"}
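A few of the conversions in the table can be sketched in plain Python. The function names are hypothetical, and the cut points used to group the performance scores (1-2, 3, 4-5) are an assumption, since the table does not state them.

```python
# Ordinal -> numerical: map each label to its rank.
RANKS = {"Poor": 1, "Average": 2, "Good": 3}

def to_rank(label):
    return RANKS[label]

def score_group(score):
    """Numerical -> ordinal: discretize a score in {1..5} into three groups.
    The cut points are assumed for illustration."""
    if score <= 2:
        return "Poor"
    if score == 3:
        return "Average"
    return "Good"

def base_colour(shade):
    """Nominal -> nominal: drop the "Light"/"Dark" qualifier from a shade."""
    return shade.split()[-1]

ranks = [to_rank(x) for x in ("Poor", "Average", "Good")]
groups = [score_group(s) for s in (1, 2, 3, 4, 5)]
colours = sorted({base_colour(s) for s in
                  ("Light Blue", "Blue", "Dark Blue", "Light Red", "Red", "Dark Red")})
```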
Summary of Data Preparation Methods
• Missing Values treatment (treatment to avoid data exclusion or bias)
1. Deletion
2. Imputation
3. Prediction Model
• Outliers (treatment to avoid scale problem)
1. Deletion
2. Transformation (Generalization/Normalization)
• Selection of attributes (another way to reduce dimensionality of data to minimize bias)
1. Delete irrelevant/duplicate data
2. Select useful attributes for modelling
• Attribute/Data Creation (new attributes that can capture the important information in a data set
much more efficiently than the original attributes)
1. Derived attributes
2. Dummy attributes