Data Pre-Processing Techniques Explained
Data reduction strategies include data cube aggregation, attribute subset selection, and Principal Component Analysis (PCA). Data cube aggregation reduces dataset size through multi-level aggregation steps, making large datasets easier to manage. Attribute subset selection involves choosing a minimal set of relevant features, simplifying the dataset while maintaining its analytical power. PCA reduces dimensionality by transforming data into fewer orthogonal components that retain most of the original variance. Together, these approaches decrease data volume while preserving the integrity of analysis, making it feasible to handle large datasets efficiently.
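A multi-level roll-up of this kind can be sketched in plain Python. The sales records, field names, and aggregation levels below are hypothetical, chosen only to illustrate how each aggregation step shrinks the number of stored tuples:

```python
from collections import defaultdict

# Hypothetical detail-level records: (year, quarter, amount)
sales = [
    (2023, "Q1", 100.0), (2023, "Q1", 150.0),
    (2023, "Q2", 200.0), (2024, "Q1", 120.0),
]

def roll_up(records):
    """Aggregate detail records up to quarterly and then yearly totals,
    reducing the number of stored tuples at each level."""
    by_quarter = defaultdict(float)
    for year, quarter, amount in records:
        by_quarter[(year, quarter)] += amount
    by_year = defaultdict(float)
    for (year, _), total in by_quarter.items():
        by_year[year] += total
    return dict(by_quarter), dict(by_year)

quarterly, yearly = roll_up(sales)
# Four detail tuples roll up to three quarterly and two yearly tuples.
```

Each level of the cube answers coarser queries from fewer stored values, which is what makes aggregated data "easier to manage" in practice.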
Data cleaning addresses missing data by ignoring tuples when the class label is missing, or by filling in values using strategies such as a global constant, the attribute mean, or inference-based methods like Bayesian formulas. Noisy data is tackled through binning, regression, clustering, and combined computer-human inspection. Cleaning is a major challenge because real-world data is often incomplete and noisy; both Ralph Kimball and a DCI survey cite data cleaning as one of the biggest problems in data warehousing, given its complexity and the extent to which it affects the quality of data analysis.
The main forms of data preprocessing include data cleaning, data integration, data transformation, and data reduction. Data cleaning involves filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies, all of which enhance data quality and reliability. Data integration combines data from multiple sources, ensuring consistency and comprehensive datasets. Data transformation involves normalization and aggregation to standardize and summarize data, respectively, facilitating comparison and analysis. Lastly, data reduction decreases data volume while maintaining analytical results, making the analysis more efficient and manageable. Together, these processes prepare raw data for more accurate and meaningful analysis.
PCA facilitates dimensionality reduction by transforming data into a set of orthogonal components ordered by their variance. By selecting only the components with the highest variance, PCA reduces the dimensionality of the dataset while retaining as much variability as possible. It works well for numeric data and is beneficial when dealing with high-dimensional spaces. However, it cannot be applied directly to non-numeric data and may lose interpretability, since the derived components often lack a clear real-world meaning. Additionally, PCA assumes linear relationships, which may not be suitable for all types of data analysis.
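A minimal PCA sketch using NumPy, via eigendecomposition of the covariance matrix; the toy dataset (two correlated features plus one independent one) and the choice of two retained components are illustrative assumptions, not part of the original text:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy numeric data: 100 samples, 3 features; the first two are correlated
base = rng.normal(size=(100, 1))
X = np.hstack([base,
               2 * base + 0.1 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])

# Center the data, then eigendecompose the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]        # re-sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                    # keep the top-k components
X_reduced = Xc @ eigvecs[:, :k]          # project onto principal axes
explained = eigvals[:k].sum() / eigvals.sum()
# X_reduced has shape (100, 2); `explained` is the fraction of
# total variance retained by the two kept components.
```

Because the first two features are nearly collinear, a single component absorbs most of their joint variance, so two components retain almost all of the variability in three dimensions.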
Clustering helps in handling noisy data by grouping data points into clusters and detecting outliers as points that do not fit well into any cluster, which can then be removed or re-evaluated. In contrast, binning sorts the data and smooths it by replacing values with bin means, bin medians, or bin boundaries, while regression fits the data to a function (such as a linear model) and smooths noise by replacing observed values with the model's predictions. Unlike binning and regression, clustering relies on the natural structure and relationships in the data for noise reduction, which may offer more flexibility in identifying and managing anomalies.
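Smoothing by bin means can be sketched in a few lines; the data values and the bin size of three are hypothetical, chosen to keep the arithmetic visible:

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, partition them into equal-frequency bins,
    and replace each value with its bin's mean (smoothing)."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_vals = ordered[i:i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(data, 3))
# → [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Replacing each value with its bin mean dampens local fluctuations while preserving the overall trend; using bin medians or bin boundaries instead only changes the replacement rule inside each bin.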
Missing data can be handled automatically using methods such as filling with a global constant, the attribute mean, the attribute mean for specific classes, or the most probable value inferred via Bayesian formulas or decision trees. While these methods can efficiently address missing values, they have potential drawbacks such as introducing bias (e.g., mean imputation can distort statistical relationships) and masking underlying issues (e.g., consistently missing values might indicate a systematic problem). Moreover, inferred values may not reflect the true distribution, potentially leading to inaccurate analyses.
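A minimal sketch of mean imputation, assuming missing entries are represented as None; the ages list is hypothetical:

```python
def impute_mean(values):
    """Fill missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [25, None, 30, 35, None]
print(impute_mean(ages))  # → [25, 30.0, 30, 35, 30.0]
```

Note the bias the paragraph warns about: every imputed entry equals the mean, so the filled-in column has artificially reduced variance compared with the true (unknown) values.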
Human inspection in data cleaning leverages human judgment and intuition to validate suspicious data values, particularly in noisy regions that automated systems might mislabel or overlook. It allows the detection of subtle complexities and nuances that automated techniques can miss. However, it poses significant challenges: it is time-consuming, financially costly, and prone to human error or bias. It also does not scale well to large datasets, so it must be combined with automated methods to be effective on substantial data collections.
Data reduction is particularly necessary when dealing with extremely large datasets that make complete analysis either infeasible or excessively time-consuming, such as databases storing terabytes of data. Using a data cube offers benefits such as the capability to answer queries more efficiently by utilizing aggregated information already structured at multiple levels. This allows quicker data retrieval and smarter storage management, ultimately speeding up the analysis process while maintaining the essential analytical results.
Data discretization is a crucial part of data reduction that focuses specifically on numerical data. It involves segmenting continuous data into intervals, making the data easier to analyze by reducing its complexity. This simplification allows the use of more straightforward analytical methods and improves performance in tasks like classification and regression by transforming continuous attributes into categorical form. It is important for numerical data because it can reveal trends and patterns not easily apparent in raw continuous values.
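Equal-width binning is one simple discretization scheme; here is a sketch, with hypothetical temperature readings and an illustrative choice of three intervals:

```python
def equal_width_bins(values, n_bins):
    """Discretize continuous values into n_bins equal-width intervals,
    returning the interval index (0..n_bins-1) for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    indices = []
    for v in values:
        # The maximum value belongs to the last interval, not a new one
        idx = int((v - lo) / width) if v < hi else n_bins - 1
        indices.append(idx)
    return indices

temps = [12.0, 15.5, 18.0, 22.4, 29.9, 30.0]
print(equal_width_bins(temps, 3))  # → [0, 0, 1, 1, 2, 2]
```

The continuous readings become three categorical labels (intervals [12, 18), [18, 24), and [24, 30]), which simpler analytical methods can consume directly.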
Normalization in data transformation standardizes data ranges, which makes different datasets more easily comparable and improves the efficiency of data analysis by reducing skewness and making computation less intensive. Aggregation summarizes attribute data, which reduces the dataset's complexity and size without compromising analytical quality, thereby speeding up processing time and improving the efficiency of data analysis tasks. These methods streamline datasets that would otherwise be cumbersome due to size and variability.
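Min-max normalization is one common way to standardize ranges; a sketch with hypothetical income values, rescaled into [0, 1]:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]
    (min-max normalization)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [new_min + (v - lo) / span * (new_max - new_min) for v in values]

incomes = [30_000, 45_000, 60_000, 90_000]
print(min_max_normalize(incomes))  # → [0.0, 0.25, 0.5, 1.0]
```

After normalization, attributes measured on very different scales (e.g., income versus age) contribute comparably to distance-based computations.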