Chapter 3: Data Preprocessing
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Chapter 3: Data Preprocessing
Incomplete (missing) data may arise from causes such as technology limitations.
How to Handle Noisy Data?
Binning
first sort the data and partition it into a number of equal-frequency buckets (also called bins)
then smooth by bin means, bin medians, or bin boundaries, etc. (see the sketch after this list)
Regression
smooth by fitting the data to regression functions, e.g., y = b + mx
Clustering
group similar values into clusters; values falling outside the clusters can be
detected as outliers and removed
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with
possible outliers)
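The binning idea above can be made concrete with a small sketch. This is a minimal Python illustration, assuming plain lists of numeric values; the sample prices and the choice of three bins are only illustrative:

# Equal-frequency binning with smoothing by bin means (illustrative sketch).
def smooth_by_bin_means(values, n_bins):
    data = sorted(values)                      # first sort the data
    bin_size = len(data) // n_bins             # roughly equal-frequency bins
    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_vals = data[i:i + bin_size]
        mean = sum(bin_vals) / len(bin_vals)   # replace every value by its bin mean
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

# Hypothetical sorted prices partitioned into 3 bins.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))          # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]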
Data Cleaning as a Process
Various steps are involved in the data cleaning process.
Data discrepancy detection
Discrepancies may arise from manual errors, system errors, field overloading, poorly designed data structures, etc.
Discrepancies can be detected using metadata (data about data), which describes each attribute's domain and data type, value range, dependencies, and distribution.
Data should also be examined against the following rules:
Unique rule: each value of the attribute must differ from all other values of that attribute
Consecutive rule: there can be no missing values between the lowest and highest values of the attribute, and all existing values must be unique
Null rule: special characters or strings are used to indicate the null condition
Use commercial tools
Data scrubbing tools: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
Chapter 3: Data Preprocessing
Correlation Analysis (Nominal Data)
Χ² (chi-square) test:
Χ² = Σ (Observed − Expected)² / Expected
(the sum is taken over all cells of the contingency table)
The larger the Χ2 value, the more likely the variables are
related
The cells that contribute the most to the Χ2 value are
those whose actual count is very different from the
expected count
Correlation does not imply causality
# of hospitals and # of car-theft in a city are correlated
Both are causally linked to the third variable: population
Chi-Square Calculation: An Example
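The worked contingency table for this example is not reproduced in the text, so the short Python sketch below simply applies the Χ² formula above to a hypothetical 2x2 table of observed counts (all numbers are illustrative, not the original example):

# Chi-square test of independence for a 2x2 contingency table (illustrative counts).
observed = [[250, 200],
            [50, 1000]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total   # expected count under independence
        chi2 += (obs - expected) ** 2 / expected

print(chi2)   # larger values suggest the two attributes are more likely related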
Correlation Analysis (Numeric Data)
Correlation coefficient (Pearson's product-moment coefficient):
r(A, B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σ_A σ_B) = (Σᵢ aᵢ·bᵢ − n·Ā·B̄) / ((n − 1) σ_A σ_B)
where n is the number of tuples, Ā and B̄ are the means of A and B, and σ_A and σ_B are their standard deviations
Scatter plots can be used to visually evaluate correlation, showing similarity ranging from –1 to +1.
Correlation (viewed as linear relationship)
Correlation measures the linear relationship
between objects
To compute correlation, we standardize data
objects, A and B, and then take their dot product
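A minimal Python sketch of that computation: standardize the two attributes and take their dot product, divided by (n − 1). The sample values reuse the stock prices from the covariance example further below:

# Correlation as the dot product of standardized (z-scored) objects, divided by (n - 1).
import math

def correlation(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # sample standard deviations (divide by n - 1)
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    std_b = math.sqrt(sum((x - mean_b) ** 2 for x in b) / (n - 1))
    a_std = [(x - mean_a) / std_a for x in a]   # standardize A
    b_std = [(x - mean_b) / std_b for x in b]   # standardize B
    return sum(x * y for x, y in zip(a_std, b_std)) / (n - 1)   # dot product / (n - 1)

print(correlation([2, 3, 5, 4, 6], [5, 8, 10, 11, 14]))   # close to +1: strongly positively correlated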
Covariance (Numeric Data)
Covariance is similar to correlation:
Cov(A, B) = E[(A − Ā)(B − B̄)] = E[A·B] − Ā·B̄
Correlation coefficient: r(A, B) = Cov(A, B) / (σ_A σ_B)
Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
Thus, A and B rise together since Cov(A, B) > 0.
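The same calculation can be written as a short Python sketch using the stock values from the example:

# Covariance of two attributes: E[A*B] - mean(A) * mean(B).
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]

n = len(A)
mean_a = sum(A) / n                       # E(A) = 4
mean_b = sum(B) / n                       # E(B) = 9.6
cov = sum(a * b for a, b in zip(A, B)) / n - mean_a * mean_b

print(cov)   # approximately 4; Cov > 0, so A and B tend to rise together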
Chapter 3: Data Preprocessing
Wavelet transforms
Data Compression
Data Cube Aggregation
Aggregation operations are applied to the data in the construction of a data cube. It is a process in which information is gathered and expressed in summary form, for purposes such as statistical analysis.
Example: half-yearly sales can be aggregated into annual sales.
Year 2016: H1 Rs. 5000, H2 Rs. 3000 (annual total Rs. 8000)
Year 2018: H1 Rs. 8000, H2 Rs. 5000 (annual total Rs. 13000)
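A tiny Python sketch of the same roll-up, assuming the half-yearly figures are kept as (year, half, sales) tuples:

# Aggregate half-yearly sales up to annual sales (the roll-up from the example).
sales = [(2016, "H1", 5000), (2016, "H2", 3000),
         (2018, "H1", 8000), (2018, "H2", 5000)]

annual = {}
for year, _half, amount in sales:
    annual[year] = annual.get(year, 0) + amount

print(annual)   # {2016: 8000, 2018: 13000}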
Attribute Subset Selection
Irrelevant, weakly relevant, or redundant attributes or dimensions may be
detected and removed.
Redundant attributes
Duplicate much or all of the information contained in one or more other
attributes
E.g., purchase price of a product and the amount of sales tax paid
Irrelevant attributes
Contain no information that is useful for the data mining task at hand
E.g., students' ID is often irrelevant to the task of predicting students'
GPA
Heuristic Search in Attribute Selection
For n attributes, there exist 2^n possible subsets.
Typical heuristic attribute selection methods:
Stepwise forward selection
Stepwise backward elimination
Combination of forward selection and backward elimination
Decision tree induction
Stepwise forward selection
The procedure starts with an empty reduced set; at each successive iteration the best of the remaining original attributes is added to the set.
Forward Selection
Initial Attribute Set: {A1, A2, A3, A4, A5, A6, A7, A8}
Initial Reduced Set: { }
=> {A2}
=> {A2, A5}
=> {A2, A5, A6}
=> Reduced Attribute Set: {A2, A5, A6, A8}
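A greedy forward-selection loop can be sketched as below; the scoring function and the target subset size are assumptions, standing in for whatever relevance measure (e.g., cross-validated accuracy) is actually used:

# Greedy stepwise forward selection (sketch). `score` is a hypothetical callable that
# evaluates how well a candidate attribute subset predicts the target; higher is better.
def forward_selection(attributes, score, k):
    selected = []                                  # initial reduced set is empty
    remaining = list(attributes)
    while remaining and len(selected) < k:
        # pick the attribute whose addition gives the best score
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Usage sketch (the score function and k = 4 are assumptions):
# reduced = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"], my_score, k=4)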
Stepwise backward elimination
Initially, the procedure starts with the full set of attributes in the reduced set.
Then, with each successive iteration the worst attribute is removed from
the attribute set.
Backward Elimination
Initial Attribute Set: { A1,A2,A3,A4,A5,A6,A7,A8}
=> {A1,A2,A3,A4,A5,A6,A8}
=> {A1,A2,A3,A5,A6,A8}
=> {A1,A2,A5,A6,A8}
=>{A2,A5,A6,A8}
Combination of forward selection and backward elimination
With each iteration, this technique selects the best attribute from the original attributes and at the same time removes the worst attribute from among the remaining attributes.
Decision Tree Induction
This technique constructs a tree-like structure on the basis of the available data
The tree consists of internal (nonleaf) nodes, each denoting a test on an attribute, branches, each representing an outcome of the test, and external (leaf) nodes, each denoting a predicted class.
At each node, the algorithm chooses the best attribute for partitioning the data into individual classes.
Initial Attribute Set: {A1, A2, A3, A4, A5, A6, A7, A8}
A5?
  Y → A2?
    Y → A6?
      Y → class 1
      N → class 2
    N → class 1
  N → A8?
    Y → class 1
    N → class 2
=> Reduced Attribute Set: {A2, A5, A6, A8}
Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data become increasingly sparse
Density and the distance between points, which are critical to clustering and
outlier analysis, become less meaningful
The number of possible combinations of subspaces grows exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis (PCA)
Dimensionality Reduction
Dimensionality reduction represents the original data in the compressed or reduced form
by applying data encoding or transformations on it.
If the original data can be reconstructed from the compressed data without losing any
information, the data reduction is said to be lossless.
If one can reconstruct only an approximation of the original data, the data reduction is
said to be lossy.
The two most effective and popular methods of lossy dimensionality reduction are
Wavelet transforms
Principal Component Analysis(PCA)
What Is Wavelet Transform?
Decomposes a signal into different frequency subbands
Applicable to n-dimensional signals
Data are transformed to preserve the relative distance between objects at different levels of resolution
Allows natural clusters to become more distinguishable
Used for image compression
Wavelet Transformation
This lossy dimensionality reduction method works by using the Discrete Wavelet Transform (DWT)
DWT is a linear signal processing technique used for multi-resolution analysis
Compressed approximation: store only a small fraction of the strongest wavelet coefficients
Similar to the Discrete Fourier Transform (DFT), it is a signal processing technique, but it gives better lossy compression and is localized in space
Lossy compression by wavelets gives better results than JPEG compression
Some of the popular wavelet transforms are the Haar-2, Daubechies-4, and Daubechies-6 transforms
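A minimal sketch of a single level of the Haar transform, the simplest DWT, using pairwise averaging and differencing; the input signal is hypothetical and assumed to have even length:

# One level of the Haar discrete wavelet transform: pairwise averaging and differencing.
import math

def haar_step(signal):
    approx, detail = [], []
    for i in range(0, len(signal), 2):
        a, b = signal[i], signal[i + 1]
        approx.append((a + b) / math.sqrt(2))   # low-frequency (smooth) coefficients
        detail.append((a - b) / math.sqrt(2))   # high-frequency (detail) coefficients
    return approx, detail

approx, detail = haar_step([2, 2, 0, 2, 3, 5, 4, 4])
# Keeping only the strongest coefficients (and zeroing the rest) yields a compressed approximation.
print(approx, detail)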
Principal Component Analysis (PCA)
Find a projection that captures the largest amount of variation in data
The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space
Principal Component Analysis (Steps)
This method searches for k n-dimensional orthogonal vectors (principal components) that can best be used to represent the data, where k ≤ n and n is the number of attributes (dimensions) of the data to be reduced.
Normalize input data: Each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components
Each input data (vector) is a linear combination of the k principal component vectors
The principal components are sorted in order of decreasing “significance” or strength
Since the components are sorted, the size of the data can be reduced by eliminating
the weak components, i.e., those with low variance (i.e., using the strongest principal
components, it is possible to reconstruct a good approximation of the original data)
Works for numeric data only
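A minimal NumPy sketch of these steps (center the data, compute the covariance matrix, take its eigenvectors, and project onto the k strongest components); the sample data are hypothetical:

# PCA sketch with NumPy: project data onto the top-k principal components.
import numpy as np

def pca(X, k):
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)                       # normalize: center each attribute
    cov = np.cov(X, rowvar=False)                # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvectors define the new space
    order = np.argsort(eigvals)[::-1]            # sort components by decreasing variance
    components = eigvecs[:, order[:k]]           # keep the k strongest components
    return X @ components                        # projected (reduced) data

# Hypothetical tuples with 3 numeric attributes, reduced to 2 dimensions.
data = [[2.5, 2.4, 0.5], [0.5, 0.7, 1.9], [2.2, 2.9, 0.4], [1.9, 2.2, 0.8], [3.1, 3.0, 0.2]]
print(pca(data, k=2))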
Numerosity Reduction
It reduces data volume by choosing alternative smaller forms of data
representation. Such representation can be achieved by two methods
1. Parametric methods
Here, only parameters of data and outliers are stored instead of the
actual data. (Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the data (except
possible outliers))
Ex.: Regression and Log-linear models
2. Non-parametric methods
Used to store data in reduced forms such as Histograms, Clustering,
Sampling, …
Do not assume models
Regression Models
Linear regression (Y = m X + b)
Data are modeled to fit a straight line: the method uses the equation of a straight line
(Y = m X + b) and determines appropriate values for m and b (the regression
coefficients) to predict the value of Y (the response variable) from a given value of X
(the predictor variable)
Often uses the least-squares method to fit the line
The two regression coefficients, m and b, specify the line and are estimated from the
data at hand by applying the least-squares criterion to the known values Y1, Y2, …,
X1, X2, ….
Multiple regression (Y = b0 + b1 X1 + b2 X2)
Allows a response variable Y to be modeled as a linear function of a
multidimensional feature vector
Many nonlinear functions can be transformed into the above
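A short Python sketch of the least-squares fit for Y = m X + b; the data points are hypothetical, and only the two coefficients need to be stored:

# Least-squares fit of Y = m*X + b (simple linear regression).
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # slope m minimizes the sum of squared residuals
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - m * mean_x
    return m, b

# Hypothetical data; store only (m, b) instead of the raw points.
m, b = fit_line([1, 2, 3, 4, 5], [2.1, 4.2, 5.8, 8.1, 9.9])
print(m, b)   # the fitted line can then be used to predict y for new x values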
Regression Analysis
[Figure: data points with a fitted line y = x + 1; Y1 is an observed value and Y1' the value the line predicts at X1]
Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (aka. explanatory variables or predictors)
The parameters are estimated so as to give a "best fit" of the data
Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used
Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
Log-Linear Models
Log-linear models:
Approximate discrete multidimensional probability distributions
Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset
of dimensional combinations
Useful for dimensionality reduction and data smoothing
Histogram
[Figure: example histogram with buckets ranging from 20,000 to 100,000]
Partitioning rules:
Equal-width: equal bucket range
In this, the width of each bucket range is uniform
Equal-frequency (or equal-depth)
buckets are created such that the frequency of each bucket is roughly
constant, i.e., each bucket holds roughly the same number of
contiguous data samples
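A small NumPy sketch contrasting the two partitioning rules on hypothetical values:

# Equal-width vs. equal-frequency bucketing of a numeric attribute (sketch).
import numpy as np

values = np.array([5, 7, 8, 12, 15, 18, 22, 30, 35, 50])   # hypothetical data

# Equal-width: every bucket covers the same range of values.
counts, edges = np.histogram(values, bins=3)
print(edges, counts)          # bucket boundaries and how many values fall in each

# Equal-frequency (equal-depth): bucket boundaries at quantiles,
# so each bucket holds roughly the same number of values.
quantile_edges = np.quantile(values, [0, 1/3, 2/3, 1])
print(quantile_edges)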
Clustering
Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
The quality of clusters is determined by two factors, namely cluster
diameter and centroid distance
Cluster diameter is defined as the maximum distance between any two
objects in the cluster
Centroid distance is the average distance of each cluster object from the
cluster centroid
Can be very effective if data is clustered but not if data is “smeared”
Can have hierarchical clustering and be stored in multi-dimensional index
tree structures
There are many choices of clustering definitions and clustering algorithms
Cluster analysis will be studied in depth in Chapter 10
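A sketch of this idea using scikit-learn's KMeans (assuming scikit-learn is available); the data are randomly generated, and only each cluster's centroid and diameter are retained:

# Numerosity reduction by clustering: keep only each cluster's centroid and diameter.
import numpy as np
from sklearn.cluster import KMeans   # assumes scikit-learn is installed

X = np.random.rand(200, 2)                       # hypothetical 2-D data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

summary = []
for c in range(3):
    members = X[km.labels_ == c]
    centroid = members.mean(axis=0)
    # diameter: maximum distance between any two objects in the cluster
    diameter = max(np.linalg.norm(a - b) for a in members for b in members)
    summary.append((centroid, diameter))

print(summary)   # these compact cluster summaries stand in for the 200 raw points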
Sampling
Sampling is a data reduction technique that represents a large data set by
a much smaller random sample (or subset) of the data.
Sampling: obtaining a small sample s to represent the whole data set N
Allow a mining algorithm to run in complexity that is potentially sub-linear to
the size of the data
Key principle: Choose a representative subset of the data
Simple random sampling may have very poor performance in the
presence of skew
Develop adaptive sampling methods, e.g., stratified sampling:
Note: Sampling may not reduce database I/Os (page at a time)
Types of Sampling
Simple random sampling (with or without replacement), cluster sampling, and stratified sampling, illustrated below.
Sampling: With or without Replacement
SRSWOR (simple random sample without replacement): once an object is drawn from the raw data, it is removed from the data set and cannot be drawn again
SRSWR (simple random sample with replacement): a drawn object is placed back into the raw data, so the same object may be drawn more than once
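A short Python sketch of both sampling schemes, plus a proportional stratified sample; the data and strata are hypothetical:

# Simple random sampling with and without replacement, plus a stratified-sampling sketch.
import random

data = list(range(100))                            # hypothetical raw data (100 tuples)

srswor = random.sample(data, k=10)                 # without replacement: no duplicates
srswr = random.choices(data, k=10)                 # with replacement: duplicates possible

# Stratified sampling: draw from each stratum in proportion to its size (strata are hypothetical).
strata = {"young": list(range(60)), "senior": list(range(60, 100))}
stratified = []
for members in strata.values():
    n = round(10 * len(members) / len(data))       # proportional allocation
    stratified.extend(random.sample(members, n))

print(srswor, srswr, stratified)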
Sampling: Cluster or Stratified Sampling
Cluster sampling: the data set is divided into groups (clusters), and a simple random sample of whole clusters is drawn
Stratified sampling: the data set is divided into mutually exclusive strata, and a simple random sample is drawn from each stratum (e.g., in proportion to its size), which helps when the data are skewed
Data Reduction 3: Data Compression
String compression
There are extensive theories and well-tuned algorithms
[Figure: original data vs. its (lossy) approximated version]
If the original data can be reconstructed from the
compressed data without losing any information, the
data reduction is said to be lossless.
Data Transformation: Normalization
Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A): v' = (v − μ_A) / σ_A
Ex. Let μ = 54,000 and σ = 16,000; then v = 73,600 is normalized to (73,600 − 54,000) / 16,000 = 1.225
Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
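A small Python sketch of the two normalizations above; the example values are illustrative, and the decimal-scaling helper assumes the largest magnitude is at least 1:

# Z-score and decimal-scaling normalization of a numeric attribute (sketch).
def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(values):
    # j is the smallest integer such that max(|v'|) < 1
    j = len(str(int(max(abs(v) for v in values))))
    return [v / (10 ** j) for v in values]

print(z_score(73600, mean=54000, std=16000))   # 1.225, as in the example above
print(decimal_scaling([917, -986, 45]))        # scaled by 10^3 -> [0.917, -0.986, 0.045]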
Discretization
Three types of attributes
Nominal—values from an unordered set, e.g., color, profession
Ordinal—values from an ordered set, e.g., military or academic
rank
Numeric—quantitative values, e.g., integer or real numbers
Discretization: Divide the range of a continuous attribute into intervals
Interval labels can then be used to replace actual data values
Reduce data size by discretization
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Prepare for further analysis, e.g., classification
Data Discretization Methods
Typical methods: All the methods can be applied recursively
Binning
Top-down split, unsupervised
Histogram analysis
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or
bottom-up merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
Simple Discretization: Binning
Equal-width (distance) partitioning divides the range of the attribute into N intervals of equal size; equal-depth (frequency) partitioning divides the range into N intervals, each containing approximately the same number of samples.
Concept Hierarchy Generation
Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
Concept hierarchies facilitate drilling and rolling in data warehouses to
view data in multiple granularity
Concept hierarchy formation: Recursively reduce the data by collecting
and replacing low level concepts (such as numeric values for age) by
higher level concepts (such as youth, adult, or senior)
Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.
Concept Hierarchy Generation for Nominal Data
Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit
data grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
The attribute with the most distinct values is placed at the lowest level of the hierarchy
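A minimal Python sketch of this heuristic, counting distinct values per attribute over a few hypothetical records and ordering the attributes accordingly:

# Order nominal attributes by their number of distinct values to suggest a hierarchy.
# Attribute names and tuples are hypothetical.
records = [
    {"street": "Main St", "city": "Urbana", "state": "Illinois", "country": "USA"},
    {"street": "Oak Ave", "city": "Chicago", "state": "Illinois", "country": "USA"},
    {"street": "Elm St", "city": "Vancouver", "state": "BC", "country": "Canada"},
]

attrs = ["street", "city", "state", "country"]
distinct = {a: len({r[a] for r in records}) for a in attrs}

# Most distinct values -> lowest level of the hierarchy; fewest -> highest level.
hierarchy = sorted(attrs, key=lambda a: distinct[a], reverse=True)
print(" < ".join(hierarchy))   # e.g. street < city < state < country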
Summary
Remove redundancies
Detect inconsistencies
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression