Data Mining: Concepts and Techniques
— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 3: Data Preprocessing
Why preprocess the data? Real-world data is dirty: incomplete data (e.g., due to technology limitations), noisy data, and inconsistent data.
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
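As a concrete illustration, here is a minimal Python sketch of equal-frequency binning with smoothing by bin means (NumPy assumed; the price values echo the textbook's running example):

```python
import numpy as np

# Sorted data for price (in dollars), partitioned into 3 equal-frequency bins.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = np.array_split(prices, 3)  # [4, 8, 15], [21, 21, 24], [25, 28, 34]

# Smoothing by bin means: replace every value by the mean of its bin.
smoothed = [float(np.mean(b)) for b in bins for _ in b]
print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```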
The p-value is 0.043.
In other words, because 0.043 < 0.05, we conclude that Gender is linked to Pet Preference (Men and Women have different preferences for Cats and Dogs).
Correlation Analysis (Nominal Data)
To calculate this p-value, we use the Chi-Square Test!
Calculate "Expected Value" for each entry: Multiply each row total by
each column total and divide by the overall total:
        Cat            Dog            Total
Men     489×438/962    489×524/962    489
Women   473×438/962    473×524/962    473
Total   438            524            962
Correlation Analysis (Nominal Data)
Which gives us:
        Cat       Dog       Total
Men     222.64    266.36    489
Women   215.36    257.64    473
Total   438       524       962
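To reproduce this in code: the observed counts below are an assumption (they are not shown above, but they are consistent with the row/column totals and yield the quoted p-value of 0.043); SciPy's chi2_contingency recomputes the expected table and the p-value:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Assumed observed counts (rows: Men, Women; columns: Cat, Dog);
# they sum to the slide's totals (489/473 and 438/524).
observed = np.array([[207, 282],
                     [231, 242]])

# correction=False: plain chi-square, without Yates' continuity correction.
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)     # [[222.64 266.36] [215.36 257.64]], as in the table
print(round(p, 3))  # 0.043 -> reject independence at the 0.05 level
```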
Correlation Analysis (Numeric Data)
Correlation coefficient (Pearson's product-moment coefficient):

r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A\,\sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A\,\sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, and \sigma_A and \sigma_B are their respective standard deviations.
[Figure: scatter plots showing correlations from -1 to 1.]
Correlation (viewed as linear relationship)
Correlation measures the linear relationship between objects
To compute correlation, we standardize data objects, A and B, and then take their dot product
Correlation coefficient: r_{A,B} = Cov(A,B) / (σ_A σ_B)
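A minimal sketch of "standardize, then take the dot product" (the data values reuse the worked covariance example below; the sample (n-1) convention is assumed):

```python
import numpy as np

A = np.array([2.1, 2.5, 3.6, 4.0])
B = np.array([8.0, 10.0, 12.0, 14.0])

# Standardize each object to z-scores (sample std, ddof=1)...
a = (A - A.mean()) / A.std(ddof=1)
b = (B - B.mean()) / B.std(ddof=1)

# ...then correlation is their dot product divided by (n - 1).
r = a @ b / (len(A) - 1)
print(round(r, 4))                        # ~0.9794
print(round(np.corrcoef(A, B)[0, 1], 4))  # same value from NumPy
```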
Cov(X,Y) = [(2.1-3.1)(8-11) + (2.5-3.1)(10-11) + (3.6-3.1)(12-11) + (4.0-3.1)(14-11)] / (4-1)
= [(-1)(-3) + (-0.6)(-1) + (0.5)(1) + (0.9)(3)] / 3
= (3 + 0.6 + 0.5 + 2.7) / 3 = 6.8 / 3 ≈ 2.267
The result is positive, meaning that the variables are positively related.
Co-Variance: An Example
Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
Cov(A,B) = E(A·B) − E(A)·E(B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
Thus, A and B rise together since Cov(A, B) > 0.
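The same computation in NumPy (a sketch; note the slide uses the population convention, dividing by n):

```python
import numpy as np

# Weekly prices of stocks A and B from the example.
A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Population covariance: Cov(A,B) = E(A·B) - E(A)E(B).
cov = (A * B).mean() - A.mean() * B.mean()
print(cov)  # 4.0 > 0: the stocks tend to rise and fall together

# Cross-check with NumPy's covariance matrix (ddof=0 divides by n).
print(np.cov(A, B, ddof=0)[0, 1])  # 4.0
```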
Chapter 3: Data Preprocessing
Data reduction strategies: dimensionality reduction (e.g., wavelet transforms, PCA), numerosity reduction, and data compression
Data Reduction 1: Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and outlier
analysis, become less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
What Is Wavelet Transform?
Decomposes a signal into different frequency subbands
Applicable to n-dimensional signals
Data are transformed to preserve relative distance between objects at different levels of resolution
Allows natural clusters to become more distinguishable
Used for image compression
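A small sketch of one Haar transform step, assuming the PyWavelets package (pywt) is available; the signal values are illustrative:

```python
import numpy as np
import pywt  # PyWavelets, assumed installed

# Toy 1-D signal of length 8 (a power of 2).
x = np.array([2, 2, 0, 2, 3, 5, 4, 4], dtype=float)

# One Haar decomposition level: low-frequency approximation (cA)
# and high-frequency detail (cD) subbands, each half the length.
cA, cD = pywt.dwt(x, 'haar')

# Data reduction: keep only cA, then reconstruct an approximation.
x_hat = pywt.idwt(cA, np.zeros_like(cD), 'haar')
print(cA)     # reduced representation (length 4)
print(x_hat)  # pairwise-smoothed approximation of x
```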
Why Wavelet Transform?
Use hat-shape filters
Emphasize region where points cluster
Multi-resolution
Detect arbitrary shaped clusters at different scales
Efficient
Complexity O(N)
[Figure: 2-D data plotted on axes x1 and x2]
Principal Component Analysis (Steps)
Given N data vectors from n dimensions, find k ≤ n orthogonal vectors
(principal components) that can best be used to represent the data
Normalize input data: each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components
Each input data (vector) is a linear combination of the k principal
component vectors
The principal components are sorted in order of decreasing
“significance” or strength
Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance; using the strongest principal components, it is
possible to reconstruct a good approximation of the original data
Works for numeric data only
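The steps above, sketched with a plain SVD in NumPy (the data matrix is illustrative; centering stands in for the normalization step):

```python
import numpy as np

# N=6 data vectors in n=3 dimensions (illustrative values).
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.8],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 0.4],
              [2.3, 2.7, 0.7]])

Xc = X - X.mean(axis=0)  # normalize: center each attribute

# Rows of Vt are orthonormal principal components, already sorted
# by decreasing "significance" (the singular values S).
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                # keep the strongest components
Z = Xc @ Vt[:k].T                    # reduced N x k representation
X_hat = Z @ Vt[:k] + X.mean(axis=0)  # approximate reconstruction
print(np.abs(X - X_hat).max())       # small reconstruction error
```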
Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
Duplicate much or all of the information contained in
one or more other attributes
E.g., purchase price of a product and the amount of
sales tax paid
Irrelevant attributes
Contain no information that is useful for the data
mining task at hand
E.g., students' ID is often irrelevant to the task of
predicting students' GPA
Heuristic Search in Attribute Selection
There are 2^d possible attribute combinations of d attributes
Typical heuristic attribute selection methods:
Best single attribute under the attribute independence assumption: choose by significance tests
Best step-wise feature selection: pick the best single attribute first, then the next best attribute conditioned on the first, ...
Best step-wise attribute elimination: repeatedly eliminate the worst attribute
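A sketch of the "best single attributes" heuristic using scikit-learn's univariate significance scores (synthetic data; the library and the F-test scorer are assumptions, not from the slides):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # five candidate attributes
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)  # only 0 and 2 are relevant

# Score each attribute independently with an ANOVA F-test, then keep
# the k best: the attribute-independence heuristic.
selector = SelectKBest(f_classif, k=2).fit(X, y)
print(selector.get_support(indices=True))      # expected: [0 2]
```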
Attribute Creation (Feature Generation)
Create new attributes (features) that capture the important information in a data set more effectively than the original ones
Attribute extraction: domain-specific
Attribute construction: combining features (see: discriminative frequent patterns in Chapter 7)
Data discretization
Data Reduction 2: Numerosity Reduction
Reduce data volume by choosing alternative, smaller
forms of data representation
Parametric methods (e.g., regression)
Assume the data fits some model; estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
Ex.: Multiple regression allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
Non-parametric methods
Do not assume models; major families: histograms, clustering, sampling, ...
Regression Analysis
Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (aka. explanatory variables or predictors)
The parameters are estimated so as to give a "best fit" of the data
Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used
Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
[Figure: data points fitted by the regression line y = x + 1; Y1 is an observed value and Y1' its fitted value at X1]
Regression Analysis and Log-Linear Models
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
Using the least squares criterion on the known values of Y1, Y2, …, X1, X2, …
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above
Log-linear models:
Approximate discrete multidimensional probability distributions
Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset
of dimensional combinations
Useful for dimensionality reduction and data smoothing
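A least-squares fit of Y = w X + b in NumPy (a sketch; the data points are illustrative):

```python
import numpy as np

X = np.array([2.1, 2.5, 3.6, 4.0])
Y = np.array([8.0, 10.0, 12.0, 14.0])

# polyfit with deg=1 returns [w, b] minimizing the squared error.
w, b = np.polyfit(X, Y, deg=1)
print(round(w, 3), round(b, 3))  # slope and intercept of the best fit

# Store only (w, b) instead of the data, then predict as needed.
print(w * 3.0 + b)               # estimated Y at X = 3.0
```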
Histogram Analysis
Divide data into buckets and store the average (sum) for each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth)
[Figure: example histogram over values from 10,000 to 100,000]
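Both partitioning rules in a short NumPy sketch (synthetic prices; equal-frequency edges come from quantiles):

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.integers(10_000, 100_000, size=1_000)  # synthetic data

# Equal-width: 9 buckets spanning equal value ranges.
w_counts, w_edges = np.histogram(prices, bins=9)

# Equal-frequency (equal-depth): bucket edges at quantiles.
d_edges = np.quantile(prices, np.linspace(0, 1, 10))
d_counts, _ = np.histogram(prices, bins=d_edges)

print(w_counts)  # counts vary from bucket to bucket
print(d_counts)  # roughly 1000/9 ≈ 111 per bucket
```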
Clustering
Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
Can be very effective if data is clustered but not if data
is “smeared”
Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms
Cluster analysis will be studied in depth in Chapter 10
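A sketch of cluster-based reduction with k-means (scikit-learn assumed; the blob data is synthetic): store only the centroids plus a per-cluster spread instead of all the points:

```python
import numpy as np
from sklearn.cluster import KMeans  # assumed available

rng = np.random.default_rng(1)
# Three well-separated blobs: clustered, not "smeared", data.
data = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in (0, 5, 10)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
centroids = km.cluster_centers_  # 3 representatives replace 300 points
spreads = [data[km.labels_ == i].std() for i in range(3)]  # diameter proxy
print(centroids)
print(spreads)
```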
Sampling
Convenience sampling
Judgmental or purposive sampling
Snowball sampling
Quota sampling
Sampling
Simple random sampling: there is an equal probability of selecting any particular item
Sampling without replacement
Once an object is selected, it is removed from the population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling:
Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
Used in conjunction with skewed data
[Figure: SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement) drawn from the raw data]
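SRSWOR, SRSWR, and stratified sampling in a few NumPy lines (a sketch over a toy population):

```python
import numpy as np

rng = np.random.default_rng(42)
population = np.arange(100)  # toy raw data

srswor = rng.choice(population, size=10, replace=False)  # without replacement
srswr = rng.choice(population, size=10, replace=True)    # items may repeat

# Stratified: partition into 4 strata, draw ~10% from each stratum.
strata = np.array_split(population, 4)
stratified = np.concatenate(
    [rng.choice(s, size=len(s) // 10, replace=False) for s in strata])
print(srswor, srswr, stratified, sep="\n")
```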
Sampling: Cluster or Stratified Sampling
[Figure: raw data reduced by cluster or stratified sampling]
Data Compression
[Figure: lossless compression reconstructs the original data exactly; lossy compression yields only an approximation of the original data]
Chapter 3: Data Preprocessing
Data Transformation: Normalization
Z-score normalization (μ: mean, σ: standard deviation): v' = (v - μ_A) / σ_A
Ex. Let μ = 54,000, σ = 16,000. Then v = 73,600 is normalized to (73,600 - 54,000) / 16,000 = 1.225
Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
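Both normalizations in a short sketch (the values beyond 73,600 are made up):

```python
import numpy as np

values = np.array([73_600.0, 54_000.0, 42_000.0, 61_000.0])

# Z-score normalization with the slide's mu and sigma.
mu, sigma = 54_000, 16_000
print((values - mu) / sigma)  # first entry: 1.225

# Decimal scaling: divide by 10**j, smallest j with max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(values).max() + 1)))
print(values / 10**j)         # all entries now within (-1, 1)
```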
Discretization
Three types of attributes
Nominal—values from an unordered set, e.g., color, profession
Ordinal—values from an ordered set, e.g., military or academic
rank
Numeric—integer or real numbers
Discretization: Divide the range of a continuous attribute into intervals
Interval labels can then be used to replace actual data values
Reduce data size by discretization
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Prepare for further analysis, e.g., classification
Data Discretization Methods
Typical methods: All the methods can be applied
recursively
Binning
Top-down split, unsupervised
Histogram analysis
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or
bottom-up merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
Simple Discretization: Binning
Equal-width (distance) partitioning: divides the range into N intervals of equal size; if A and B are the lowest and highest values of the attribute, the interval width is W = (B - A)/N
Equal-depth (frequency) partitioning: divides the range into N intervals, each containing approximately the same number of samples
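Equal-width and equal-depth binning via pandas (a sketch; pandas is an assumption, and the prices reuse the earlier binning example):

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-width: 3 intervals of equal range, W = (34 - 4) / 3 = 10.
equal_width = pd.cut(prices, bins=3)

# Equal-depth: 3 intervals with roughly equal numbers of samples.
equal_depth = pd.qcut(prices, q=3)

# Interval labels can replace the raw values, reducing data size.
print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```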
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
The attribute with the most distinct values is placed at the lowest level of the hierarchy (exceptions, e.g., weekday, month, quarter, year)
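A sketch of this distinct-value heuristic with pandas (the column names and values are made up):

```python
import pandas as pd

# Hypothetical location attributes.
df = pd.DataFrame({
    "country":  ["CA", "CA", "US", "US", "US"],
    "province": ["BC", "BC", "NY", "NY", "WA"],
    "city":     ["Vancouver", "Victoria", "NYC", "Albany", "Seattle"],
})

# Fewer distinct values -> higher (more general) hierarchy level.
levels = df.nunique().sort_values().index.tolist()
print(levels)  # ['country', 'province', 'city'], top to bottom
```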
Summary
Data integration from multiple sources:
Remove redundancies
Detect inconsistencies
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression