03 Preprocessing
03 Preprocessing
— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
1
Chapter 3: Data Preprocessing
3
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
4
Chapter 3: Data Preprocessing
technology limitation
incomplete data
inconsistent data
9
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
Clustering
detect and remove outliers
10
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
11
Data Preprocessing
Wavelet transforms
Data compression
16
Data Reduction 1: Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
17
Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
Duplicate much or all of the information contained in
one or more other attributes
E.g., purchase price of a product and the amount of
sales tax paid
Irrelevant attributes
Contain no information that is useful for the data
mining task at hand
E.g., students' ID is often irrelevant to the task of
predicting students' GPA
18