Chapter 3: Data Preprocessing
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction – data compression techniques like PCA, attribute subset
selection, attribute construction
– Numerosity reduction – smaller representations using parametric models (regression) or
nonparametric models (histograms, clusters, sampling or data aggregation)
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
Forms of data preprocessing
Data Cleaning
• Data in the Real World Is Dirty: Why? Lots of potentially incorrect data, e.g.,
faulty instruments, human or computer error, transmission errors
– incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• e.g., Occupation=" " (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary="−10" (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age="42", Birthday="03/07/2010"
• was rating "1, 2, 3", now rating "A, B, C"
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding (left Blank)
– certain data may not be considered important at the time of
entry (left Blank)
– failure to register history or changes of the data
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when
doing classification) - not effective, unless the tuple contains several
attributes with missing values
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
– a global constant: e.g., "unknown"
– the attribute mean (Central Tendency: Mean, Median, Mode)
– the attribute mean for all samples belonging to the same class
– the most probable value, inference-based such as Bayesian
formula or decision tree
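A minimal sketch of these automatic fill-in strategies, assuming a hypothetical pandas DataFrame with an income attribute and a class label (the column names and values are illustrative, not from the slides):

import numpy as np
import pandas as pd

# Hypothetical toy data: 'income' has missing values, 'class' is the class label.
df = pd.DataFrame({
    "income": [30000, np.nan, 52000, np.nan, 41000, 75000],
    "class":  ["low", "low", "mid", "mid", "mid", "high"],
})

# 1) Fill with a global constant (a sentinel standing in for "unknown").
df["income_const"] = df["income"].fillna(-1)

# 2) Fill with the overall attribute mean (median or mode work the same way).
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3) Fill with the attribute mean of all samples belonging to the same class.
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(df)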
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal with
possible outliers)
How Is Binning Done?
• Equal‐width(distance) partitioning
– Divides the range into N intervals of equal size: uniform grid
– if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B − A)/N.
– The most straightforward, but outliers may dominate presentation
– Skewed data is not handled well
Equal-width (distance) partitioning
• Sorted data for price (in dollars):
– 4, 8, 15, 21, 21, 24, 25, 28, 34
• W = (B − A)/N = (34 − 4) / 3 = 10
– Bin 1: 4-14, Bin 2: 15-24, Bin 3: 25-34
• Equal-width (distance) partitioning:
– Bin 1: 4, 8
– Bin 2: 15, 21, 21, 24
– Bin 3: 25, 28, 34
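The binning above can be sketched in a few lines of Python (bin edges and values taken from the slide; smoothing by bin means is shown as one of the possible smoothing choices):

# Equal-width bins from the slide: 4-14, 15-24, 25-34 (width W = (34 - 4) / 3 = 10).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_edges = [(4, 14), (15, 24), (25, 34)]

bins = [[v for v in prices if lo <= v <= hi] for lo, hi in bin_edges]
# bins -> [[4, 8], [15, 21, 21, 24], [25, 28, 34]]

# Smooth by bin means: every value in a bin is replaced by the bin's mean.
means = [sum(b) / len(b) for b in bins]            # [6.0, 20.25, 29.0]
smoothed = [m for b, m in zip(bins, means) for _ in b]
print(smoothed)   # [6.0, 6.0, 20.25, 20.25, 20.25, 20.25, 29.0, 29.0, 29.0]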
Regression
Clustering
Figure: A 2-D plot of customer data with respect to customer locations in a city, showing
three data clusters. Each cluster centroid is marked with a "+", representing the average
point in space for that cluster. Outliers may be detected as values that fall outside of the
sets of clusters.
Data Integration
• Data integration: Combines data from multiple sources into a coherent store
• [1] Schema integration: e.g., A.cust-id ≡ B.cust-#
– Solution: use metadata about each attribute to resolve such errors when
integrating data from different sources.
• Entity identification problem:
– Identify real world entities from multiple data sources
• [2] Detecting and resolving data value conflicts (Solution: Sec. 3.2.3 Book)
– For the same real-world entity, attribute values from different sources may
differ.
– Possible reasons: different representations, different scales, encoding.
– Eg1: A weight attribute may be stored in metric units in one system and
British imperial units in another.
– Eg2: For a hotel chain, the price of rooms in different cities may involve not
only different currencies but also different services (e.g., free breakfast) and
taxes.
Handling Redundancy in Data Integration
• [3] Redundancy and correlation analysis: Redundant data often occurs
when multiple databases are integrated
– Object identification: The same attribute or object may have
different names in different databases
– Derivable data: One attribute may be a "derived" attribute in
another table
• Redundant attributes may be detected by correlation analysis and
covariance analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
Correlation Analysis (Categorical Data)
• χ2 (chi-square) test:
χ2 = Σ (Observed − Expected)² / Expected
• The χ2 test evaluates the hypothesis that A and B are independent, that is, that there is
no correlation between them.
• The test is based on a significance level, with (r – 1) x (c – 1) degrees of freedom.
• If the hypothesis can be rejected, then we say that A and B are statistically correlated.
• Note:
– The larger the χ2 value, the more likely the variables are related.
– The cells that contribute the most to the χ2 value are those whose actual count is very
different from the expected count.
Chi-Square Calculation: An Example
              male        female       Sum (row)
fiction       250 (90)    200 (360)      450
non-fiction    50 (210)  1000 (840)     1050
Sum (col.)    300        1200           1500
• Note: Are gender and preferred reading correlated?
• χ2 calculation (numbers in parentheses are expected counts):
χ2 = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840
   = 284.44 + 121.90 + 71.11 + 30.48 ≈ 507.93
• Since 507.93 far exceeds the critical χ2 value for (2 − 1)(2 − 1) = 1 degree of freedom
(10.828 at the 0.001 significance level), the independence hypothesis is rejected:
gender and preferred reading are strongly correlated.
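A short Python sketch of the same χ2 computation, where each expected count is row total × column total / grand total:

# Observed counts from the table: rows = (fiction, non-fiction), cols = (male, female).
observed = [[250, 200],
            [50, 1000]]

row_totals = [sum(row) for row in observed]          # [450, 1050]
col_totals = [sum(col) for col in zip(*observed)]    # [300, 1200]
grand_total = sum(row_totals)                        # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 2))   # 507.94 -> far above the 0.001 critical value for 1 d.o.f.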
Correlation Analysis (Numerical Data)
• rA,B = Σ(ai − Ā)(bi − B̄) / [(n − 1) σA σB] = [Σ(ai bi) − n Ā B̄] / [(n − 1) σA σB]
where n is the number of tuples, Ā and B̄ are the respective means of A and
B, σA and σB are the respective standard deviations of A and B, and Σ(ai bi) is
the sum of the AB cross-product.
• If rA,B > 0, A and B are positively correlated (A's values increase as B's do).
The higher the value, the stronger the correlation. A high value suggests that A (or B)
may be removed as redundant.
• If rA,B = 0, there is no linear correlation between A and B (though they are not
necessarily independent);
• If rA,B < 0, A and B are negatively correlated.
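A brief numpy sketch of this coefficient; the attribute values below are toy data, not from the slides:

import numpy as np

# Toy attribute values (illustrative only).
A = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
B = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

n = len(A)
# Sample correlation: sum((a_i - mean_A)(b_i - mean_B)) / ((n - 1) * std_A * std_B)
r = np.sum((A - A.mean()) * (B - B.mean())) / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))

print(round(r, 3), np.corrcoef(A, B)[0, 1])   # both give ~0.775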
Visually Evaluating Correlation
Correlation vs. Causality
Covariance (Numeric Data)
• Covariance: Cov(A, B) = E[(A − Ā)(B − B̄)] = Σ(ai − Ā)(bi − B̄) / n
• Correlation coefficient: rA,B = Cov(A, B) / (σA σB)
where n is the number of tuples, Ā and B̄ are the respective mean or expected
values of A and B, and σA and σB are the respective standard deviations of A and
B.
• Positive covariance: If CovA,B > 0, then A and B both tend to be larger than
their expected values.
• Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B
is likely to be smaller than its expected value.
• Independence: If A and B are independent, then CovA,B = 0, but the converse is not true.
Covariance: An Example
• Eg: Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their
prices rise or fall together?
– E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
– E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
– Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
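The same calculation as a small numpy sketch:

import numpy as np

# Stock prices of A and B over the five days from the example.
A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Cov(A, B) = E[A*B] - E[A] * E[B]
cov = np.mean(A * B) - A.mean() * B.mean()
print(cov)   # 4.0 -> positive, so A and B tend to rise together

# Note: np.cov divides by (n - 1) by default; ddof=0 matches the slide's formula.
print(np.cov(A, B, ddof=0)[0, 1])   # also 4.0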
Data Reduction
• Data reduction: Obtain a reduced representation of the data set that is much smaller
in volume but yet produces the same (or almost the same) analytical results.
• Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data set.
• Data reduction strategies
– Dimensionality reduction
• remove attributes that are the same or similar to other attributes
– Numerosity reduction
• represent or aggregate the data, sometimes with precision loss
– Data compression
• generalized techniques to decrease the number of bytes needed to
store data
– Data cube aggregation
Data Reduction 1: Dimensionality Reduction
• Dimensionality reduction techniques
– Wavelet transforms
– Principal Component Analysis
– Attribute subset selection (e.g., feature selection)
Principal Component Analysis (PCA)
• Find a projection that captures the largest amount of variation in data.
• The original data are projected onto a much smaller space, resulting in
dimensionality reduction. We find the eigenvectors of the covariance matrix, and
these eigenvectors define the new space.
(Figure: data points plotted on the original axes x1 and x2.)
Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal
components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k ortho-normal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing "significance"
or strength
– Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e., using the
strongest principal components, it is possible to reconstruct a good
approximation of the original data)
• Works for numeric data only
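A compact numpy sketch of these steps (center the data, take the eigenvectors of the covariance matrix, keep the k strongest components); the toy data below is illustrative only:

import numpy as np

def pca(X, k):
    """Project n-dimensional rows of X onto the k strongest principal components."""
    # Step 1: normalize the input so every attribute is centered (zero mean).
    Xc = X - X.mean(axis=0)
    # Step 2: the eigenvectors of the covariance matrix define the new space.
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    # Step 3: sort components by decreasing "significance" (variance).
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]
    # Step 4: eliminate the weak components by projecting onto the top k.
    return Xc @ components

# Toy data: 10 points in 3-D reduced to 2-D.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
print(pca(X, 2).shape)   # (10, 2)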
Attribute Subset Selection
• Redundant attributes
– Duplicate much or all of the information contained in one or more
other attributes
– E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
– Contain no information that is useful for the data mining task at
hand
– E.g., students' ID is often irrelevant to the task of predicting
students' GPA
Heuristic Search in Attribute Selection
• There are 2^d possible attribute combinations of d attributes.
– Best single attribute under the attribute independence assumption: choose by
significance tests.
• Typical heuristic attribute selection methods:
1. Stepwise forward selection (best step-wise feature selection):
• The best single-attribute is picked first
• Then next best attribute is added, ...
2. Stepwise backward elimination (step-wise attribute elimination):
• Repeatedly eliminate the worst attribute
3. Best combined attribute selection and elimination
4. Decision tree induction:
• Tree is constructed from given data. At each node, the algorithm
chooses the "best" attribute to partition the data into individual
classes
Greedy (heuristic) methods for attribute subset selection
Data Reduction 2: Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of data
representation
• Parametric methods
– Assume the data fits some model, estimate model parameters,
store only the parameters, and discard the data (except
possible outliers)
– Ex: Regression, Log-linear models
• Non-parametric methods
– histograms, clustering, sampling, and data cube aggregation
Parametric Data Reduction: Regression and Log-Linear Models
• Linear regression
– Data modeled to fit a straight line (e.g., y = x + 1)
– Often uses the least-squares method to fit the line
– Y = w X + b, where w and b are regression coefficients
• Multiple regression
– Allows a response variable Y to be modeled as a linear function of a
multidimensional feature vector
– Y = b0 + b1 X1 + b2 X2
Note: see Section 3.4.5 for more details.
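A short numpy sketch of fitting such a multiple regression model by least squares, so that only the coefficients b0, b1, b2 (plus any outliers) need to be stored; the data below is made up for illustration:

import numpy as np

# Toy data: Y depends roughly linearly on two features X1 and X2 (Y ~ 2 + X1 + X2).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
Y = np.array([ 5.1,        4.9,        9.2,        9.0,       12.1])

# Add a column of ones so the intercept b0 is estimated as well.
X1 = np.column_stack([np.ones(len(X)), X])

# Least-squares fit: store only b0, b1, b2 instead of the raw data.
coeffs, *_ = np.linalg.lstsq(X1, Y, rcond=None)
b0, b1, b2 = coeffs
print(b0, b1, b2)

# Predict Y for a new point using the stored parameters only.
print(b0 + b1 * 2.5 + b2 * 2.5)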
Histogram Analysis
• Divide data into bins (buckets)
and store average (sum) for each
bin.
• Partitioning rules:
– Equal-width histogram
– Equal-frequency histogram
Note: Histograms are highly effective at
approximating both sparse and dense data,
as well as highly skewed and uniform data.
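A small numpy sketch contrasting the two partitioning rules above (equal-width vs. equal-frequency, here approximated with quantile-based edges), reusing the price values from the earlier binning example:

import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-width histogram: bucket boundaries evenly spaced over the value range.
width_counts, width_edges = np.histogram(prices, bins=3)

# Equal-frequency histogram: boundaries chosen so each bucket holds roughly
# the same number of values.
freq_edges = np.quantile(prices, [0.0, 1/3, 2/3, 1.0])
freq_counts, _ = np.histogram(prices, bins=freq_edges)

print(width_counts, width_edges)   # uneven counts, even widths
print(freq_counts, freq_edges)     # ~3 values per bucket, uneven widths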
Clustering
• Partition data set into clusters
based on similarity, and
store cluster representation
(e.g., centroid and diameter)
only.
Types of Sampling
• Simple random sampling (SRS)
– There is an equal probability of selecting any particular item
• SRS without replacement (SRSWOR)
– Once an object is selected, it is removed from the population
• SRS with replacement (SRSWR)
– A selected object is not removed from the population
• Stratified sampling:
– Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the
data)
– Used in conjunction with skewed data
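A minimal sketch of these sampling variants using Python's random module; the strata names, population, and sampling rate are illustrative, not from the slides:

import random

data = list(range(1, 101))       # a toy population of 100 items
random.seed(42)

# SRS without replacement (SRSWOR): a selected object is removed from the pool.
srswor = random.sample(data, k=10)

# SRS with replacement (SRSWR): the same object may be drawn more than once.
srswr = [random.choice(data) for _ in range(10)]

# Stratified sampling: draw roughly the same percentage from each partition.
strata = {"young": list(range(1, 31)), "adult": list(range(31, 91)),
          "senior": list(range(91, 101))}
rate = 0.2
stratified = {name: random.sample(group, k=max(1, int(len(group) * rate)))
              for name, group in strata.items()}

print(len(srswor), len(srswr), {k: len(v) for k, v in stratified.items()})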
Types of Sampling
Sampling: Cluster or Stratified Sampling
Data Reduction 3: Data Cube Aggregation
Fig: Sales data for a given branch for the years 2008 through 2010. On the left,
the sales are shown per quarter. On the right, the data are aggregated to
provide the annual sales.
Data Cube Aggregation
• Data cubes store
multidimensional aggregated
information.
• Figure shows a data cube for
multidimensional analysis of sales
data with respect to annual sales
per item type for each branch.
• Each cell holds an aggregate data
value.
• Advantage: provides fast access to pre-computed and summarized data.
• The cube created at the lowest abstraction level is referred to as base cuboid.
• A cube at the highest level of abstraction is the apex cuboid.
Data Reduction 4: Data Compression
• String compression
– There are extensive theories and well-tuned algorithms.
– Typically lossless, but only limited manipulation is possible
without expansion.
• Audio/video compression
– Typically lossy compression, with progressive refinement.
– Sometimes small fragments of signal can be reconstructed
without reconstructing the whole.
• Dimensionality and numerosity reduction may also be
considered as forms of data compression.
Data Compression
(Figure: original data compared with its approximated, lossy reconstruction after compression.)
Data Transformation
Data Transformation Handling Methods
– Smoothing: Remove noise from data
– Attribute/feature construction: New attributes constructed from the
given ones
– Aggregation: Summarization and data cube construction
– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization, z-score normalization,
normalization by decimal scaling
– Discretization: The raw values of a numeric attribute (e.g., age) are
replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels
(e.g., youth, adult, senior).
– Concept hierarchy generation for nominal data: Attributes such as
street can be generalized to higher-level concepts, like city or country
Data Transformation by Normalization
Normalization methods
• Let A be a numeric attribute with n observed values, v1, v2, …, vn.
• [1] Min-max normalization: to [new_minA, new_maxA]
v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
– Ex: Let income range 12,000 to 98,000 be normalized to [0.0, 1.0].
Then 73,600 is mapped to
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
Normalization methods
• [3] Normalization by decimal scaling: normalizes by moving the decimal
point of values of attribute A.
– The number of decimal points moved depends on the maximum absolute
value of A.
– A value v of A is normalized to v’ by computing
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
– Ex: Suppose that the recorded values of A range from -986 to 917.
The maximum absolute value of A is 986.
To normalize by decimal scaling, divide each value by 1000 (i.e., j = 3) so that
-986 normalizes to -0.986 and 917 normalizes to 0.917.
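A combined sketch of the normalization methods named on the Data Transformation Handling Methods slide: min-max, z-score (whose usual definition, v' = (v − mean)/σ, is assumed here since only its name appears above), and decimal scaling. The income values and the [−986, 917] range come from these slides.

import math

def min_max(values, new_min=0.0, new_max=1.0):
    # v' = (v - min) / (max - min) * (new_max - new_min) + new_min
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    # Standard z-score: (v - mean) / standard deviation.
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    # Divide by 10^j, where j is the smallest integer making all |v'| < 1.
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

incomes = [12_000, 73_600, 98_000]
print(min_max(incomes))              # 73,600 -> 0.716 (min 12,000, max 98,000)
print(z_score(incomes))
print(decimal_scaling([-986, 917]))  # [-0.986, 0.917]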