02 Data_preprocessing -4,5,6
What is Data?
A data object is also known as a record, point, case, sample, entity, or instance
Data Preprocessing
z A well-accepted multidimensional view of data quality:
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Value added
– Interpretability
– Accessibility
Major Tasks in Data Preprocessing
z Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
z Data integration
– Integration of multiple databases or files
z Data transformation
– Normalization and aggregation (see the normalization sketch after this list)
z Data reduction
– Obtains a reduced representation in volume but produces the same or
similar analytical results
z Data discretization
– Part of data reduction but with particular importance, especially for
numerical data
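As a small, hedged illustration of the normalization mentioned under data transformation above, the sketch below applies min-max and z-score normalization to a toy attribute; the column name `income` and its values are made up for the example.

```python
import numpy as np

# Toy attribute values; the numbers are illustrative only.
income = np.array([12_000.0, 35_000.0, 47_000.0, 98_000.0, 54_000.0])

# Min-max normalization to the range [0, 1].
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score (standard score) normalization using the sample standard deviation.
z_score = (income - income.mean()) / income.std(ddof=1)

print("min-max:", np.round(min_max, 3))
print("z-score:", np.round(z_score, 3))
```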
Forms of Data Preprocessing
Data Preprocessing
z Importance
– garbage in garbage out principle (GIGO)
– Fill in missing values with the attribute mean for all data points
belonging to the same class: a smarter strategy (see the sketch below)
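A minimal sketch of that class-conditional fill strategy, assuming a pandas DataFrame with a hypothetical numeric attribute `income` and a class column `class`; the names and values are illustrative only.

```python
import pandas as pd
import numpy as np

# Toy data with missing values; column names are hypothetical.
df = pd.DataFrame({
    "class":  ["A", "A", "A", "B", "B", "B"],
    "income": [30.0, np.nan, 34.0, 80.0, 78.0, np.nan],
})

# Fill each missing value with the mean of its own class,
# rather than the global attribute mean.
df["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```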
z Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
z Regression
– smooth by fitting the data into regression functions (see the sketch after this list)
z Clustering
– detect and remove outliers
z Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal with
possible outliers)
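A small sketch of the regression-based smoothing named in the list above: fit a simple regression function to the data and replace the noisy values with the fitted values. The data, the degree-1 fit, and the noise level are assumptions made for illustration.

```python
import numpy as np

# Noisy observations y measured at positions x (illustrative values).
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + np.random.normal(scale=1.5, size=x.size)

# Fit a degree-1 regression function and smooth y with the fitted values.
coeffs = np.polyfit(x, y, deg=1)
y_smoothed = np.polyval(coeffs, x)

print(np.round(y_smoothed, 2))
```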
Simple Discretization Methods: Binning
– if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B − A)/N
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
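The example above can be reproduced with a short script; this is one possible way to code equal-frequency binning with smoothing by bin means and by bin boundaries, not the only one.

```python
import numpy as np

# Sorted prices from the example above.
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)

# Equal-frequency (equi-depth) partitioning into 3 bins of 4 values each.
bins = np.array_split(np.sort(prices), 3)

for i, b in enumerate(bins, start=1):
    # Smoothing by bin means: replace every value with the (rounded) bin mean.
    by_mean = np.full_like(b, np.round(b.mean()))
    # Smoothing by bin boundaries: snap each value to the closer of min/max.
    by_boundary = np.where(b - b.min() <= b.max() - b, b.min(), b.max())
    print(f"Bin {i}: means -> {by_mean}, boundaries -> {by_boundary}")
```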
Cluster Analysis as Binning
Data Cleaning as a Process
z Data integration:
– Combines data from multiple sources into a coherent store
z Schema integration: e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources
z Entity identification problem:
– Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
z Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different
sources are different
– Possible reasons: different representations, different scales, e.g.,
metric vs. British units
Handling Redundancy in Data Integration
r_{A,B} = \frac{\sum (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\sigma_A \sigma_B} = \frac{\sum (a_i b_i) - n\bar{A}\bar{B}}{(n-1)\sigma_A \sigma_B}

where n is the number of tuples, Ā and B̄ are the respective means of A and B,
σ_A and σ_B are the respective standard deviations of A and B, and
Σ(a_i b_i) is the sum of the AB cross-product.
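A quick numerical check of the formula above (a sketch with made-up values for A and B), computing it directly and cross-checking against SciPy:

```python
import numpy as np
from scipy import stats

# Illustrative paired attribute values.
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.5, 3.9, 6.2, 7.8, 10.1])

# Direct implementation of the formula (sample standard deviations, ddof=1).
n = len(A)
r_manual = ((A * B).sum() - n * A.mean() * B.mean()) / (
    (n - 1) * A.std(ddof=1) * B.std(ddof=1)
)

# Cross-check against SciPy's Pearson correlation.
r_scipy, _ = stats.pearsonr(A, B)
print(round(r_manual, 4), round(r_scipy, 4))
```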
z Χ2 (chi-square) test
\chi^2_{n-1} = \sum_{i=1}^{n} \frac{(\text{Observed}_i - \text{Expected}_i)^2}{\text{Expected}_i}
z n is the number of possible values
z The larger the Χ2 value, the more likely the variables are related
z The cells that contribute the most to the Χ2 value are those whose
actual count is very different from the expected count
z Correlation does not imply causality
– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population
Chi-Square Calculation: An Example
[Contingency table; column sums: 300, 1200; grand total: 1500]
If science fiction and chess playing are independent attributes, then the
probability to like SciFi AND play chess is the product of the two marginal probabilities: 0.06
That means, we expect 0.06 · 1500 = 90 such cases (if they are independent)
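The chi-square statistic for such a table can be computed with SciPy. The 2x2 cell counts below are hypothetical, chosen only so that the column sums match 300 and 1200 from the example; the slide itself does not give the individual cells.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table (rows: likes SciFi / doesn't;
# columns: plays chess / doesn't). Counts are illustrative only.
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print("expected counts:\n", expected)
print("chi-square:", round(chi2, 2), "p-value:", p_value, "dof:", dof)
```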
[Figure: Standard Deviation of Average Monthly Precipitation vs. Standard Deviation of Average Yearly Precipitation]
Attribute Transformation
z String compression
– There are extensive theories and well-tuned algorithms
– Typically lossless
– But only limited manipulation is possible without expansion
z Audio/video compression
– Typically lossy compression, with progressive refinement
– Sometimes small fragments of signal can be reconstructed without
reconstructing the whole
z Time sequence is not audio
– Typically short and varies slowly with time
Data Compression
[Figure: original data vs. its lossy approximation]
Data Compression (via PCA)
[Figure: PCA-based data compression, approximating 206-dimensional data with 160, 120, 80, 40, and 10 dimensions]
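A hedged sketch of this kind of compression using scikit-learn's PCA; the original dimensionality (206) follows the figure, but the data itself is random and the choice of 10 retained components is an assumption for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 206))   # toy data with 206 dimensions, as in the figure

# Project onto the top 10 principal components, then map back to the
# original space to obtain the approximated (lossy-compressed) data.
pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)                  # compressed representation (500 x 10)
X_approx = pca.inverse_transform(X_reduced)   # approximation in the original space

print("explained variance ratio:", pca.explained_variance_ratio_.sum().round(3))
print("reconstruction error:", np.linalg.norm(X - X_approx).round(2))
```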
Data Reduction Method: Sampling
z Sampling: obtaining a small sample s to represent
the whole data set N
z Allows a mining algorithm to run in complexity that
is potentially sub-linear in the size of the data
z Choose a representative subset of the data
– Simple random sampling may have very poor
performance in the presence of skew
z Develop adaptive sampling methods
– Stratified sampling:
Approximate the percentage of each class (or
subpopulation of interest) in the overall database
Used in conjunction with skewed data
Types of Sampling
z Stratified sampling
– Split the data into several partitions; then draw random samples
from each partition
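A minimal stratified-sampling sketch with pandas, assuming a hypothetical class column `label`; each stratum contributes the same fraction of rows, so the rare class keeps its share in the sample.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
# Skewed toy data: class "rare" is heavily under-represented.
df = pd.DataFrame({
    "label": ["common"] * 950 + ["rare"] * 50,
    "value": rng.normal(size=1000),
})

# Stratified sampling: draw 10% from each class partition separately.
sample = df.groupby("label").sample(frac=0.10, random_state=0)
print(sample["label"].value_counts())
```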
Sampling: with or without Replacement
Sample Size
z Redundant features
– duplicate much or all of the information contained in
one or more other attributes
– Example: purchase price of a product and the amount
of sales tax paid
z Irrelevant features
– contain no information that is useful for the data
mining task at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA
Feature Subset Selection
z Techniques:
– Brute-force approach:
Try all possible feature subsets as input to data mining algorithm
– Embedded approaches:
Feature selection occurs naturally as part of the data mining
algorithm
– Filter approaches:
Features are selected before data mining algorithm is run
– Wrapper approaches:
Use the data mining algorithm as a black box to find best subset
of attributes
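As one concrete illustration of the filter approach from the list above (a sketch, not the only possibility), features can be ranked by their absolute correlation with the target before any mining algorithm runs; the column names, threshold, and data are made up.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "useful":     rng.normal(size=n),
    "irrelevant": rng.normal(size=n),
})
df["redundant"] = df["useful"] + rng.normal(scale=0.01, size=n)  # near-copy
target = 3.0 * df["useful"] + rng.normal(scale=0.5, size=n)

# Filter approach: keep features whose absolute correlation with the
# target exceeds a threshold, chosen before the mining algorithm is run.
correlations = df.corrwith(target).abs().sort_values(ascending=False)
selected = correlations[correlations > 0.3].index.tolist()
print(correlations.round(3))
print("selected features:", selected)
```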
Feature Creation
z Methodologies:
– Mapping Data to New Space
Feature construction by combining features
Data Preprocessing
z Discretization:
– Divide the range of a continuous attribute into intervals
[Figure: data partitioned using equal interval width]
z Discretization
– Reduce the number of values for a given continuous attribute by dividing
the range of the attribute into intervals
– Interval labels can then be used to replace actual data values
– Supervised vs. unsupervised (use class or don’t use class variable)
– Split (top-down) vs. merge (bottom-up)
– Clustering analysis (covered earlier and in more detail later)
– Entropy-based discretization: supervised, top-down split
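A compact sketch of a single entropy-based split (supervised, top-down), assuming a numeric attribute and a binary class label; a full discretizer would recurse on the two halves. The attribute values and labels below are illustrative.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(values, labels):
    """Return the boundary that minimizes the weighted class entropy."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_t, best_e = None, np.inf
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        t = (values[i] + values[i - 1]) / 2          # candidate boundary
        left, right = labels[:i], labels[i:]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

# Illustrative data: low values are class 0, high values are class 1.
vals = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0])
cls  = np.array([0,   0,   0,   1,   1,    1,    1,    1])
print(best_split(vals, cls))   # perfectly separating boundary: 3.5
```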
country: 15 distinct values
province_or_state: 365 distinct values