Data Mining: Concepts and Techniques
(3rd ed.)
— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 3: Data Preprocessing
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Form of Data Preprocessing
Chapter 3: Data Preprocessing
Data Cleaning
Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission errors
incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
Data is not always available
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus
deleted
data not entered due to misunderstanding
certain data may not be considered important at the
time of entry
history or changes of the data not registered
Missing data may need to be inferred
How to Handle Missing Data?
Methods:
Ignore the tuple
Fill in manually
Use a global constant
Use attribute mean/median
Use the most probable value
Ignore the Tuple
Scenario: done when the class label is missing or when missing values are minimal.
Not effective if the percentage of missing values is high.
Fill Manually
Scenario:
Feasible for small datasets or critical missing
values.
Requires domain expertise.
Fill with a Global Constant
Scenario: the true value is unknown and no better estimate is available.
Solution: replace missing values with a fixed constant, e.g., "Unknown".
Fill with a Global Constant
Pros: Simple and uniform.
Cons: May reduce data variability and accuracy.
Fill with Mean/Median
Replace missing values with the column mean or median.
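A minimal sketch of this in Python, assuming a small illustrative income list in which None marks a missing value (the data and the choice of the median as fill value are assumptions, not from the slides):

from statistics import mean, median

incomes = [30000, 45000, None, 52000, None, 61000]   # None marks a missing value
observed = [v for v in incomes if v is not None]

fill_mean = mean(observed)        # 47000
fill_median = median(observed)    # 48500
cleaned = [v if v is not None else fill_median for v in incomes]
print(cleaned)                    # [30000, 45000, 48500, 52000, 48500, 61000]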
Noisy Data
Noise: random error or variance in a measured
variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin medians, or smooth by bin boundaries
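A minimal sketch of smoothing by bin means, using an illustrative list of prices split into three equal-frequency bins (the data and bin count are assumptions):

prices = [15, 4, 8, 21, 21, 24, 25, 28, 34]
prices.sort()                                   # [4, 8, 15, 21, 21, 24, 25, 28, 34]

n_bins = 3
size = len(prices) // n_bins                    # 3 values per equal-frequency bin
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]
smoothed = [[sum(b) / len(b)] * len(b) for b in bins]   # replace each value by its bin mean
print(smoothed)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]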
How to Handle Noisy Data?
Regression
smooth by fitting the data into regression
functions
Regression is used to fit a function to the data and reduce
noise by predicting and smoothing the values. Linear
regression, polynomial regression, or other regression
techniques can model the relationship between variables
and help reduce variability.
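A minimal sketch of regression-based smoothing with NumPy: fit a straight line by least squares and replace the noisy y-values with the fitted values (the x/y data are assumptions):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])   # roughly y = 2x, with noise

slope, intercept = np.polyfit(x, y, deg=1)        # least-squares line fit
y_smoothed = slope * x + intercept                # fitted (smoothed) values
print(np.round(y_smoothed, 2))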
How to Handle Noisy Data?
Clustering
Clustering groups similar data points together, and data points
that don’t belong to any cluster or lie far from cluster centers
are treated as outliers. Techniques like K-means or DBSCAN are
commonly used.
Example: a value such as 120 that lies far from every cluster center would be flagged as an outlier.
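A minimal sketch, assuming scikit-learn is available: DBSCAN labels points that do not belong to any dense cluster as noise (label −1), and those points are treated as outliers (the data, eps, and min_samples are illustrative assumptions):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1], [2], [3], [25], [26], [27], [120]], dtype=float)
labels = DBSCAN(eps=5, min_samples=2).fit_predict(X)   # -1 marks noise points
print(X[labels == -1].ravel())                         # [120.] lies far from every cluster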
How to Handle Noisy Data?
Combined computer and human inspection
detect suspicious values and check by human
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Data scrubbing: use simple domain knowledge (e.g.,
postal code, spell-check) to detect errors and make
corrections
Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., use correlation and clustering to find outliers)
Data migration and integration
Data migration tools: allow transformations to be specified
Chapter 3: Data Preprocessing
Chi-Square Calculation: An Example
Step 1: Hypotheses
Null Hypothesis (H₀): Gender and preference for online shopping are
independent (no association).
Alternative Hypothesis (H₁): Gender and preference for online
shopping are not independent (there is an association).
Step 2: The expected frequency for each cell is calculated using the formula:
$e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$
Step 3: Chi-Square Formula
$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
Step 4: Sum the Chi-Square Values
Step 7: Interpret the Result
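A hedged worked sketch of Steps 2 through 4, assuming an illustrative 2×2 contingency table of gender versus online-shopping preference (the counts are assumptions, not from the slides):

observed = [[250, 200],    # male:   prefers online, does not prefer
            [300, 250]]    # female: prefers online, does not prefer

row_totals = [sum(r) for r in observed]          # [450, 550]
col_totals = [sum(c) for c in zip(*observed)]    # [550, 450]
grand = sum(row_totals)                          # 1000

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / grand      # Step 2
        chi2 += (observed[i][j] - expected) ** 2 / expected   # Steps 3-4
print(round(chi2, 3))   # compare with the chi-square critical value for 1 degree of freedom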
Visually Evaluating Correlation
Scatter plots showing correlation coefficients (similarity) ranging from –1 to 1.
Covariance (Numeric Data)
Covariance is similar to correlation.
Covariance: $Cov(A,B) = E[(A-\bar{A})(B-\bar{B})] = \frac{1}{n}\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})$
Correlation coefficient: $r_{A,B} = \frac{Cov(A,B)}{\sigma_A\,\sigma_B}$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means (expected values) of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B.
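A minimal NumPy sketch of both quantities for two illustrative numeric attributes (the values are assumptions):

import numpy as np

A = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
B = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

cov = ((A - A.mean()) * (B - B.mean())).mean()   # Cov(A, B), population form
corr = cov / (A.std() * B.std())                 # correlation coefficient r_{A,B}
print(round(cov, 2), round(corr, 2))             # 7.0 0.87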
Wavelet Transformation
(Figure: Haar-2 and Daubechies-4 wavelet basis functions.)
Discrete wavelet transform (DWT) for linear signal
processing, multi-resolution analysis
Compressed approximation: store only a small fraction of
the strongest of the wavelet coefficients
Similar to discrete Fourier transform (DFT), but better
lossy compression, localized in space
Method:
Length, L, must be an integer power of 2 (padding with 0’s, when
necessary)
Each transform has 2 functions: smoothing, difference
Applies to pairs of data, resulting in two sets of data of length L/2
Applies the two functions recursively, until it reaches the desired length
Wavelet Decomposition
Wavelets: A math tool for space-efficient
hierarchical decomposition of functions
S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to Ŝ = [2¾, −1¼, ½, 0, 0, −1, −1, 0]
Compression: many small detail coefficients can
be replaced by 0’s, and only the significant
coefficients are retained
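A minimal sketch of the pairwise averaging/differencing scheme behind this example; it reproduces the transformed vector above (2¾, −1¼, ½, 0, 0, −1, −1, 0):

def haar_decompose(signal):
    # Repeatedly replace each pair by its average (smoothing) and
    # half-difference (detail coefficient), keeping finer details to the right.
    data, details = list(signal), []
    while len(data) > 1:
        pairs = list(zip(data[::2], data[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details
        data = [(a + b) / 2 for a, b in pairs]
    return data + details

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]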
Principal Component Analysis (PCA)
Find a projection that captures the largest amount of variation in the data. The original data are projected onto a much smaller space, resulting in dimensionality reduction.
(Figure: 2-D data points with original axes x1 and x2 and their principal component directions.)
Principal Component Analysis (Steps)
Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
Normalize input data: Each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components
Each input data (vector) is a linear combination of the k principal
component vectors
The principal components are sorted in order of decreasing
“significance” or strength
Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
Works for numeric data only (can be applied to sparse and skewed data); see the sketch below
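A minimal sketch of these steps using scikit-learn (an assumption; the 2-D sample points are also illustrative), keeping only the strongest component:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

pca = PCA(n_components=1)              # keep k = 1 of the n = 2 components
Z = pca.fit_transform(X)               # centered data projected onto the 1st component
X_approx = pca.inverse_transform(Z)    # reconstruct an approximation of the original data
print(pca.explained_variance_ratio_)   # fraction of variance captured by that component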
Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
Duplicate much or all of the information
contained in one or more other attributes
E.g., purchase price of a product and the
amount of sales tax paid
Irrelevant attributes
Contain no information that is useful for the
data mining task at hand
E.g., students' ID is often irrelevant to the task
of predicting students' GPA
Heuristic Search in Attribute Selection
There are 2^d possible attribute combinations of d attributes
Typical heuristic attribute selection methods:
Step-wise forward selection (see the sketch after this list):
The best single attribute is picked first (under the attribute independence assumption)
Then the next best attribute conditioned on the first, ...
Step-wise backward elimination:
Repeatedly eliminate the worst attribute
Combination of forward selection and backward elimination
Decision tree induction
Optimal branch and bound: use attribute elimination and backtracking
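A minimal sketch of step-wise forward selection; score is a hypothetical evaluator (e.g., cross-validated accuracy of a model trained on the candidate subset), not something defined in the slides:

def forward_select(attributes, score, k):
    # Greedily add, one at a time, the attribute that most improves the score.
    selected = []
    while len(selected) < k:
        best = max((a for a in attributes if a not in selected),
                   key=lambda a: score(selected + [a]))
        selected.append(best)
    return selected

Backward elimination is the mirror image: start with all attributes and repeatedly drop the one whose removal hurts the score least.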
Attribute Subset Selection
Attribute Creation (Feature Generation)
Create new attributes (features) that can capture
the important information in a data set more
effectively than the original ones
Three general methodologies
Attribute extraction
Domain-specific
Mapping data to new space (see: data
reduction)
E.g., Fourier transformation, wavelet
transformation, manifold approaches (not
covered)
Attribute construction
E.g., height × width gives area
Data Reduction 2: Numerosity Reduction
Reduce data volume by choosing alternative,
smaller forms of data representation
Parametric methods (e.g., regression)
Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling, …
Parametric vs. Non-Parametric Methods
Linear regression
Data are modeled to fit a straight line, often using the least-squares method
Multiple regression
Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
Log-linear model
Approximates discrete multidimensional probability distributions
Regression Analysis
Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (aka. explanatory variables or predictors)
The parameters are estimated so as to give a "best fit" of the data
Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used
Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
(Figure: scatter of points such as (X1, Y1) around a fitted line y = x + 1.)
Linear Regression (Parametric)
• The data are modeled as a straight line Y = b0 + b1·X. Instead of storing all data points, we store only the coefficients b0 and b1; Y values can then be estimated from X, reducing data storage.
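A minimal NumPy sketch of this idea on illustrative data: fit the line once, keep only b0 and b1, and regenerate approximate Y values when needed:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

b1, b0 = np.polyfit(x, y, deg=1)   # slope b1 and intercept b0 from least squares
y_hat = b0 + b1 * x                # approximate Y regenerated from just two numbers
print(round(b0, 2), round(b1, 2))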
Multiple Regression and Log-Linear Models (Parametric)
Example (multiple regression):
Price = 50000 + 200×(Size) + 10000×(Bedrooms) + 5000×(Distance)
Example (log-linear model):
Consider a dataset of customers classified by income level (low, medium, high) and purchase behavior (buys, doesn't buy). Instead of storing the entire dataset, a log-linear model estimates the cell frequencies (e.g., the expected count of high-income buyers) from lower-dimensional marginal counts.
Histogram Analysis
Types of Histograms:
1. Equal-Width Histogram:
The data range is divided into bins of equal width; each bin may contain a different number of data points.
Example: if we have 1000 data points and 10 bins, each bin spans one-tenth of the overall value range, though the bins may hold very different numbers of points.
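A minimal NumPy sketch contrasting equal-width bin edges with equal-frequency (quantile-based) bin edges on illustrative data (the 1000 uniform values are an assumption):

import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(0, 100, size=1000)

counts, width_edges = np.histogram(data, bins=10)          # 10 equal-width bins
freq_edges = np.quantile(data, np.linspace(0, 1, 11))      # 10 equal-frequency bins
print(counts)                    # per-bin counts vary around 100 for uniform data
print(np.round(freq_edges, 1))   # edges chosen so each bin holds ~100 points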
Data Compression
String compression
There are extensive theories and well-tuned algorithms
Typically lossless, but only limited manipulation is possible without expansion
Audio/video compression
Typically lossy compression, with progressive refinement
Data Compression
(Figure: lossless compression maps the original data to compressed data and back exactly; lossy compression yields only an approximation of the original data.)
Data Transformation
A function that maps the entire set of values of a given attribute to a new
set of replacement values s.t. each old value can be identified with one of
the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing
Normalization
Min-max normalization: to [new_minA, new_maxA]
$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
Z-score normalization (μ: mean, σ: standard deviation):
$v' = \frac{v - \mu_A}{\sigma_A}$
Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
Normalization by decimal scaling:
$v' = \frac{v}{10^j}$, where j is the smallest integer such that Max(|v'|) < 1
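A minimal sketch checking the example numbers above in Python:

v, min_a, max_a = 73_600, 12_000, 98_000
mu, sigma = 54_000, 16_000

min_max = (v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0   # -> 0.716
z_score = (v - mu) / sigma                                     # -> 1.225
decimal_scaled = v / 10 ** 5      # j = 5 is the smallest integer with |v'| < 1
print(round(min_max, 3), round(z_score, 3), decimal_scaled)    # 0.716 1.225 0.736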
Discretization
Three types of attributes
Nominal—values from an unordered set, e.g., color, profession
Ordinal—values from an ordered set, e.g., military or academic
rank
Numeric—e.g., integer or real numbers
Discretization: Divide the range of a continuous attribute into
intervals
Interval labels can then be used to replace actual data values
Reduce data size by discretization
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Prepare for further analysis, e.g., classification
Data Discretization Methods
Typical methods: All the methods can be applied recursively
Binning
Top-down split, unsupervised
Histogram analysis
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or bottom-
up merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
Simple Discretization: Binning