Concepts (PPT) - Data Preprocessing

The document discusses various techniques for data preprocessing including data cleaning, integration, transformation, reduction, and attribute creation. It covers handling missing data, outliers, and different types of data transformation such as binning, normalization, aggregation, and dummy attribute creation.

Uploaded by

mtemp7489


BIS2216/DBS1214

Data Mining & Knowledge Discovery

Data Preprocessing
Data Mining & Methodology

• Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions.
• A generic data mining process methodology: [figure not reproduced]

Data Preprocessing
• The data preprocessing phase requires data understanding for preparation tasks.
• It involves transforming raw data into a clean and consistent format suitable for analysis.
• It is a crucial phase ensuring data quality that impacts the accuracy and effectiveness of subsequent data mining tasks.

Data Preprocessing

• Data pre-processing requires data understanding for data preparation tasks.
• Data preparation includes:
  • Data cleaning (e.g., noisy data, inconsistent data formats)
  • Data integration (e.g., combine multiple data sources)
  • Data transformation (e.g., convert data into suitable formats)
  • Data reduction (e.g., selecting relevant attributes)
  • Derived/dummy data creation

Why Do We Need To Pre-process the Data?

• Much of the raw data contained in databases is unprocessed, incomplete, and noisy.
• For example, the databases may contain:
  • Attributes that are obsolete, redundant, or no longer relevant
  • Missing values
  • Outliers
  • Data in a form not suitable for data mining models
  • Values that are not consistent with policy or common sense

Before Data Preparation

It is essential to understand the following before starting data preparation:
• Types of data
• Noisy data
• Data sampling
• Data statistics
• Modelling techniques to be used

Missing Data Treatment
Methods to handle missing values:
• Deletion
  • If an attribute contains many missing values, consider removing the attribute.
  • If only a few examples contain missing values, consider removing those cases/rows.
• Imputation
  • In a categorical attribute with missing values, we can introduce a new category, e.g. “unknown”.
  • Mean/mode/median imputation
• Prediction model
  • A more sophisticated method for handling missing data: we create a predictive model to estimate values that will substitute for the missing data.

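The deletion and imputation options above can be sketched in plain Python, with `None` standing for a missing value. The helper names (`drop_incomplete_rows`, `impute`, `fill_unknown`) and the sample data are illustrative assumptions, not part of the slides; the prediction-model approach is not shown.

```python
from statistics import mean, median, mode

def drop_incomplete_rows(rows):
    """Deletion: keep only the rows (dicts) that contain no missing value."""
    return [row for row in rows if None not in row.values()]

def impute(values, strategy="mean"):
    """Imputation: replace None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

def fill_unknown(values, label="unknown"):
    """Imputation for a categorical attribute: introduce a new category."""
    return [label if v is None else v for v in values]

# Hypothetical sample data.
ages = [23, None, 31, 27, None, 35]
colours = ["red", None, "blue", "red"]
rows = [{"age": 23, "colour": "red"}, {"age": None, "colour": "blue"}]

print(impute(ages, "mean"))
print(impute(colours, "mode"))
print(fill_unknown(colours))
print(drop_incomplete_rows(rows))
```

Mean and median imputation only make sense for numerical attributes; for a categorical attribute, the mode or a new "unknown" category is the usual choice.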
Outliers Treatment

• Deleting observations:
  • Delete outlier values if they are due to a data entry error or a data processing error, or if the outlier observations are very few in number.
  • Trimming at both ends can also be used to remove outliers.
• Transforming variables can also eliminate outliers:
  • Taking the natural log of a value reduces the variation caused by extreme values.
  • Binning is also a form of variable transformation. Decision tree algorithms deal with outliers well because they bin the attribute’s values.

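As a rough illustration of two of the treatments above, the sketch below trims a fixed fraction of values at both ends and applies a natural log. The function names, the 10% trimming fraction, and the salary figures are assumptions made for the example.

```python
import math

def trim(values, fraction=0.1):
    """Remove the lowest and highest `fraction` of the values (trimming at both ends)."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values dropped at each end
    return ordered[k:len(ordered) - k] if k else ordered

def log_transform(values):
    """Natural log of each value; only defined for strictly positive values."""
    return [math.log(v) for v in values]

# Hypothetical salaries (in thousands) with one extreme value.
salaries = [30, 32, 35, 36, 38, 40, 41, 43, 45, 900]

trimmed = trim(salaries, 0.1)
logged = log_transform(salaries)

print(trimmed)
print(max(salaries) / min(salaries), max(logged) / min(logged))
```

The raw values span a factor of 30 between smallest and largest, while the logged values span a factor of about 2, which is the sense in which the log reduces the variation caused by extreme values.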
Data Transformation
• Data transformation, which consists of several approaches, has been shown to significantly improve modelling performance.
• Common approaches:
• Data Generalisation
• Aggregation, Binning (Discretization/Binarization)
• Data Normalisation
• Range Transformation
• Z-Transformation
• Log Transformations
• Square Root
• Square
Data Transformation - Aggregation

• Generalization at the attribute level
• Combining two or more attributes into a single attribute
• Purpose
  • Data reduction
    • Reduce the number of attributes or objects
  • Change of scale
    • Cities aggregated into regions, states, countries, etc.
  • More “stable” data
    • Aggregated data tends to have less variability (e.g. age versus birthdate with date-month-year)
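The change-of-scale idea can be sketched as follows: city-level records are rolled up to region level, which also reduces the number of objects. The city names, the region mapping, and the sales amounts are hypothetical.

```python
from collections import defaultdict

# Hypothetical lookup from city to region.
city_to_region = {"City A": "North", "City B": "North", "City C": "South"}

# Hypothetical (city, sales) records.
city_sales = [("City A", 120), ("City B", 80), ("City C", 50), ("City A", 30)]

def aggregate(records, mapping):
    """Sum the amounts per region, replacing the city-level detail."""
    totals = defaultdict(int)
    for city, amount in records:
        totals[mapping[city]] += amount
    return dict(totals)

region_sales = aggregate(city_sales, city_to_region)
print(region_sales)
```

Four city-level records become two region-level totals.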
Data Transformation - Binning (Grouping)

• Generalization at the value level
• Some algorithms need data in categorical or binary form, so it is necessary to transform a continuous attribute into a categorical attribute:
  • Discretization
    • Transform a continuous attribute into a categorical attribute
  • Binarization
    • Transform continuous and discrete attributes into binary attributes
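A small sketch of both operations, assuming bins are defined by inclusive upper bounds and binarization by a single threshold. The bin edges, labels, and pass mark below are made up for illustration.

```python
from bisect import bisect_left

def discretize(value, upper_bounds, labels):
    """Return the label of the bin that contains `value`.

    `upper_bounds` are ascending, inclusive upper edges; `labels` has one more
    entry than `upper_bounds`, the last label catching everything above.
    """
    return labels[bisect_left(upper_bounds, value)]

def binarize(value, threshold):
    """Return 1 if `value` reaches the threshold, else 0."""
    return 1 if value >= threshold else 0

ages = [22, 25, 26, 30, 41]
age_groups = [discretize(a, [25, 30], ["<=25", "26-30", ">30"]) for a in ages]

marks = [45, 50, 72]
passed = [binarize(m, 50) for m in marks]

print(age_groups)
print(passed)
```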
Data Transformation

Which point has the larger distance from point A?
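One way to see why this question depends on scale is the sketch below, which compares Euclidean distances before and after rescaling each attribute to [0, 1]. The points (age, income) and the attribute ranges are invented for the example and are not the ones in the slide's figure.

```python
from math import dist

def range_scale(point, lows, highs):
    """Rescale each coordinate to [0, 1] using the given attribute ranges."""
    return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(point, lows, highs))

# Hypothetical points: (age in years, income per year).
a, b, c = (25, 50_000), (65, 51_000), (26, 53_000)
# Assumed attribute ranges used for rescaling.
lows, highs = (18, 20_000), (70, 120_000)

raw_ab, raw_ac = dist(a, b), dist(a, c)
sa, sb, sc = (range_scale(p, lows, highs) for p in (a, b, c))
scaled_ab, scaled_ac = dist(sa, sb), dist(sa, sc)

print(raw_ab, raw_ac)        # income dominates: C looks farther from A
print(scaled_ab, scaled_ac)  # after rescaling: B is farther from A
```

On the raw values the income attribute dominates the distance simply because its numbers are larger; after rescaling, the 40-year age gap between A and B becomes the larger difference.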

Range & Z Transformation

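Range transformation: $x'_i = \dfrac{x_i - \min(x)}{\max(x) - \min(x)}$

Z-transformation: $x'_i = \dfrac{x_i - \operatorname{mean}(x)}{\operatorname{stdev}(x)}$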

Figure: Range Transformation and Z-Transformation
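The two formulas can be written directly in Python. This is a minimal sketch: the function names are illustrative, and `stdev` here is the sample standard deviation from the standard library, which is an assumption since the slide does not say whether the sample or population form is meant.

```python
from statistics import mean, stdev

def range_transform(xs):
    """Map values to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        raise ValueError("range transformation is undefined for a constant attribute")
    return [(x - lo) / (hi - lo) for x in xs]

def z_transform(xs):
    """Centre on the mean and divide by the sample standard deviation."""
    mu, sigma = mean(xs), stdev(xs)
    if sigma == 0:
        raise ValueError("z-transformation is undefined for a constant attribute")
    return [(x - mu) / sigma for x in xs]

xs = [2, 4, 6, 8, 10]
ranged = range_transform(xs)
zs = z_transform(xs)

print(ranged)
print(zs)
```

After the range transformation every value lies between 0 and 1; after the z-transformation the values have mean 0 and standard deviation 1.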

Input Reduction – Redundancy and Irrelevancy

[Figure: two panels, "Redundancy" (inputs x1 and x2) and "Irrelevancy" (inputs x3 and x4)]
• Redundancy: input x2 has the same information as input x1.
• Irrelevancy: input x3 has information that is irrelevant to input x4.
Selection of Attributes (Variable Selection)
• Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant.
• Although it may be possible for a domain expert to pick out some of the useful attributes, this can be a difficult and time-consuming task.
• Leaving out relevant attributes or keeping irrelevant attributes may be detrimental, causing confusion/bias for the mining algorithm employed.
• A large volume of irrelevant or redundant attributes can slow down the mining process.
• Variable selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions).

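One simple heuristic for the redundancy case, offered here as an assumption rather than the method used in the slides, is to drop any attribute that is almost perfectly correlated with one already kept. The 0.95 threshold and the toy columns are illustrative, and the sketch assumes no column is constant. It does not address irrelevancy, which has to be judged against the target of the mining task.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length, non-constant lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_redundant(columns, threshold=0.95):
    """Keep an attribute only if it is not highly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy data: x2 is exactly 2 * x1, so it carries the same information.
data = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],
    "x3": [5, 1, 4, 2, 3],
}

selected = drop_redundant(data)
print(selected)
```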
Attribute Creation
• A process that generates new attributes based on existing attribute(s).
• For example, given a date (dd-mm-yy) as an input variable in a data set, we can generate new variables such as day, month, year, week, and weekday that may have a better relationship with the target variable. This step is used to highlight hidden relationships in a variable.

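The date example can be sketched with the standard library. The function name `date_parts` and the sample date are assumptions; "week" is taken here to mean the ISO week number, which is one possible reading of the slide.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def date_parts(text, fmt="%d-%m-%y"):
    """Derive day, month, year, ISO week, and weekday from a dd-mm-yy string."""
    d = datetime.strptime(text, fmt)
    return {
        "day": d.day,
        "month": d.month,
        "year": d.year,
        "week": d.isocalendar()[1],        # ISO week number
        "weekday": WEEKDAYS[d.weekday()],  # Monday is index 0
    }

parts = date_parts("15-03-24")
print(parts)
```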
Attribute Creation Methods
• Creating derived attributes:
  • This refers to creating new attributes from existing attribute(s) using a set of functions or different methods.
  • Methods such as taking the log of attribute values, binning attributes, and other transformation methods can also be used to create new attributes.
• Creating dummy attributes:
  • The most common application of dummy attributes is to convert a categorical variable into numerical variables.
  • Dummy attributes are also called indicator variables.
  • They make it possible to use a categorical variable as a predictor in statistical models. Each dummy variable takes the value 0 or 1.

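A minimal sketch of dummy attribute creation, assuming one indicator per category that is 1 when the category is present and 0 otherwise. The function name, the `v_` prefix, and the sample values are illustrative.

```python
def dummy_encode(values, prefix="v"):
    """Create one 0/1 indicator per distinct category, for every value."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{c.lower()}": int(v == c) for c in categories}
        for v in values
    ]

colours = ["Green", "Red", "Yellow", "Red"]
encoded = dummy_encode(colours)

for original, row in zip(colours, encoded):
    print(original, row)
```

Each row has exactly one indicator set to 1, so the three dummy variables together carry the same information as the original categorical attribute.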
Data Creation and Transformation
| Existing Data Type | New Data Type | Methods | Example |
|---|---|---|---|
| Nominal (Categorical) | Numerical | Dummy attribute creation | If the existing variable is a non-multi-value attribute, replace the existing value with a number (NOTE: this might give a misleading meaning to the modelling). If the existing variable is a multi-value attribute, dummy variable creation is required. E.g. {"Green", "Red", "Yellow"} to dummy variables: v_green: if Green is true, then 1 else 0; v_red: if Red is true, then 1 else 0; v_yellow: if Yellow is true, then 1 else 0. |
| Ordinal (Categorical) | Numerical | Derived attribute creation | {"Poor", "Average", "Good"} to derived variable values {1, 2, 3} based on their rank |
| Numerical | Numerical | Binning/Aggregation/Normalization | Performance marks {0-100} to CGPA points {0-4}; transform yearly salary using log |
| Numerical | Ordinal (Categorical) | Binning/Aggregation | Age numbers grouped into a derived variable with age ranges, e.g. "18-25", "26-30"; performance score {1, 2, 3, 4, 5} discretized into three groups {"Poor", "Average", "Good"} |
| Numerical | Nominal (Categorical) | Derived attribute creation | Acceptance choice {0, 1} to {"Yes", "No"} |
| Nominal (Categorical) | Ordinal (Categorical) | | NOTE: This transformation is rarely performed because it does not yield meaningful or useful derived values. |
| Ordinal (Categorical) | Ordinal / Nominal (Categorical) | Binning | Workload level {"L1", "L2", "L3", "L4", "L5"} discretized into three groups {"Light", "Average", "Heavy"} |
| Nominal (Categorical) | Nominal (Categorical) | Binning/Aggregation | {"Light Blue", "Blue", "Dark Blue", "Light Red", "Red", "Dark Red"} to derived variable value {"Blue", "Red"} |
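Three of the conversions listed in the table can be sketched with simple lookups. The rank values follow the table; the way the five workload levels are split into three groups is an assumption, since the table does not give the cut points, and the helper names are illustrative.

```python
# Ordinal -> numerical: ranks taken from the table's example.
RANK = {"Poor": 1, "Average": 2, "Good": 3}

# Ordinal -> ordinal with fewer levels: this particular grouping is assumed.
WORKLOAD_GROUP = {"L1": "Light", "L2": "Light", "L3": "Average",
                  "L4": "Heavy", "L5": "Heavy"}

def encode_ordinal(values, ranks=RANK):
    """Replace each ordinal label with its rank."""
    return [ranks[v] for v in values]

def group_workload(values, groups=WORKLOAD_GROUP):
    """Collapse five workload levels into three broader groups."""
    return [groups[v] for v in values]

def base_colour(shade):
    """Nominal -> nominal: 'Light Blue' and 'Dark Blue' both become 'Blue'."""
    return shade.split()[-1]

ratings = encode_ordinal(["Good", "Poor", "Average"])
workloads = group_workload(["L1", "L3", "L5"])
colours = [base_colour(s) for s in ["Light Blue", "Blue", "Dark Red"]]

print(ratings, workloads, colours)
```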
Summary of Data Preparation Methods
• Missing values treatment (treatment to avoid data exclusion or bias)
  1. Deletion
  2. Imputation
  3. Prediction model
• Outliers (treatment to avoid scale problems)
  1. Deletion
  2. Transformation (Generalization/Normalization)
• Selection of attributes (another way to reduce the dimensionality of data to minimize bias)
  1. Delete irrelevant/duplicate data
  2. Select useful attributes for modelling
• Attribute/data creation (new attributes that can capture the important information in a data set much more efficiently than the original attributes)
  1. Derived attributes
  2. Dummy attributes
