Data Preprocessing
⯈ Data cleaning
⯈ Data reduction
How to Handle Missing Data?
⯈ Use a global constant to fill in the missing value: e.g., “unknown”, or a new class
⯈ Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter (see the sketch below)
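As an illustration of the class-conditional mean strategy, here is a minimal pandas sketch; the DataFrame, its column names, and the numbers are illustrative, not from the slides:

```python
# Sketch: fill missing numeric values with the attribute mean of the
# sample's own class, as described above.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "A", "B", "B", "B"],
    "income": [30000, np.nan, 34000, 70000, 68000, np.nan],
})

# Global-constant strategy would replace NaN with a sentinel such as "unknown"
# (only sensible for categorical attributes). The smarter class-conditional
# mean strategy fills each gap with the mean of its own class:
df["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```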
[Figure (handling noisy data by regression): data points smoothed by a fitted line y = x + 1; the value Y1 at X1 is replaced by the value Y1’ on the fitted line]
Data Cleaning as a Process
⮚ Data discrepancy detection
⯈ Use metadata (e.g., domain, range, dependency, distribution)
⯈ Check field overloading
⯈ Check uniqueness rule, consecutive rule and null rule
⯈ Use commercial tools (Talend Data Quality Tool, Sept. 2008)
⯈ Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors
and make corrections
⯈ Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., use correlation and clustering to find outliers)
⮚ Data migration and integration
⯈ Data migration tools: allow transformations to be specified
⯈ ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a
graphical user interface
⮚ Integration of the two processes
Handle Noisy Data: Cluster Analysis
Data Integration and Transformation
⮚ Data integration
⯈ Combines data from multiple sources into a coherent store
⮚ Schema integration: e.g., A.cust-id ≡ B.cust-#
⯈ Integrate metadata from different sources
⮚ Entity identification problem
⯈ Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
⮚ Detecting and resolving data value conflicts
⯈ For the same real world entity, attribute values from different sources are different
⯈ Possible reasons: different representations, different scales, e.g., metric vs. British
units
Handling Redundancy in Data Integration
⮚ Redundant data often occur when integrating multiple databases
⯈ Object identification: the same attribute or object may have different names in different databases
⯈ Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
⮚ Redundant attributes may be detected by correlation analysis
⮚ Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Correlation Analysis (Numerical Data)
⯈ Correlation coefficient (also called Pearson’s product moment coefficient)
⯈ χ² (chi-square) example for categorical data: with expected frequencies of 90, 210, 360, and 840, the computed χ² of about 507 shows that like_science_fiction and play_chess are correlated in the group, since 507 greatly exceeds the critical value of roughly 10 at the chosen significance level (a code sketch follows)
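The following Python sketch illustrates both measures. The numeric arrays are invented; the contingency-table counts are an assumption chosen so that they reproduce the expected frequencies (90, 360, 210, 840) and the χ² ≈ 507 quoted above:

```python
import numpy as np

# Pearson's product moment coefficient between two numeric attributes
# (sample values are made up for illustration).
a = np.array([12.0, 15.0, 21.0, 30.0, 33.0])
b = np.array([1.1, 1.4, 2.0, 2.9, 3.4])
r = ((a - a.mean()) * (b - b.mean())).sum() / (
    (len(a) - 1) * a.std(ddof=1) * b.std(ddof=1)
)
print("Pearson r =", r)           # close to +1 -> positively correlated

# Chi-square statistic for a 2x2 contingency table of two categorical
# attributes (rows: like_science_fiction yes/no, cols: play_chess yes/no).
observed = np.array([[250.0, 200.0],
                     [50.0, 1000.0]])          # assumed observed counts
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
chi2 = ((observed - expected) ** 2 / expected).sum()
print(expected)                   # [[90, 360], [210, 840]] as in the slide
print("chi-square =", chi2)       # ~507.9, compare against the critical value
```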
Data Transformation
⮚ Smoothing: remove noise from data
⮚ Aggregation: summarization, data cube construction
⮚ Generalization: concept hierarchy climbing
⮚ Normalization: scaled to fall within a small, specified
range
⯈ min-max normalization
⯈ z-score normalization
⯈ normalization by decimal scaling
⮚ Attribute/feature construction
⯈ New attributes constructed from the given ones
Data Transformation: Normalization
⮚ Min-max normalization: to [new_min_A, new_max_A]
v’ = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
⯈ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
⮚ Z-score normalization (μ: mean, σ: standard deviation):
v’ = (v − μ_A) / σ_A
⯈ Ex. Let μ = 54,000 and σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
⮚ Normalization by decimal scaling:
v’ = v / 10^j, where j is the smallest integer such that max(|v’|) < 1
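A minimal Python sketch of the three normalization methods, reproducing the slide’s income example (the decimal-scaling input values are illustrative):

```python
import numpy as np

def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Min-max normalization to [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Z-score normalization."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer s.t. max(|v'|) < 1."""
    j = int(np.floor(np.log10(np.max(np.abs(values))))) + 1
    return np.asarray(values) / 10 ** j

print(min_max(73_600, 12_000, 98_000))   # ~0.716
print(z_score(73_600, 54_000, 16_000))   # 1.225
print(decimal_scaling([-986, 917]))      # [-0.986, 0.917]
```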
Data Reduction
⮚ Why Data Reduction?
⯈ A database/data warehouse may store terabytes of data
⯈ Complex data analysis/mining may take a very long time to run on the complete data set
⮚ Data reduction
⯈ Obtain a reduced representation of the data set that is much smaller in volume but yet produce the
same (or almost the same) analytical results
Data Compression
⮚ Audio/video compression
⯈ Typically lossy compression, with progressive refinement
⯈ Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
[Figure: original data can be compressed losslessly (fully recoverable) or lossily (the original data is only approximated)]
Regression
⮚ Predict a value of a given continuous valued variable based on the
values of other variables, assuming a linear or nonlinear model of
dependency.
⮚ Extensively studied in statistics and in the neural-network field.
⮚ Examples:
⯈Predicting sales amounts of new product based on
advertising expenditure.
⯈Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.
⯈ Time series prediction of stock market indices.
Data Reduction Method (1): Regression
⮚ Linear regression: Data are modeled to fit a straight line
⯈ Often uses the least-square method to fit the line
Y = w X + b
⯈ Two regression coefficients, w and b, specify the line and are estimated from the data at hand by applying the least-squares criterion to the known values Y1, Y2, …, X1, X2, …
⮚ Multiple regression: Allows a response variable Y to be modeled as a linear function of a
multidimensional feature vector
Y = b0 + b1 X1 + b2 X2.
⮚ Many nonlinear functions can be transformed into the above
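A short numpy sketch of least-squares fitting, for the single-variable case and the multiple-regression case Y = b0 + b1 X1 + b2 X2; the data arrays are invented for illustration:

```python
import numpy as np

# Linear regression Y = w X + b fitted by least squares.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # e.g., advertising spend
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])          # e.g., sales amount
w, b = np.polyfit(X, Y, deg=1)                    # slope and intercept
print(f"Y = {w:.2f} X + {b:.2f}")

# Multiple regression via least squares on a design matrix [1, X1, X2].
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([0.5, 0.1, 0.8, 0.3, 0.9])
A = np.column_stack([np.ones_like(X1), X1, X2])
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)
print("b0, b1, b2 =", coeffs)
```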
Data Reduction Method (2): Histograms
⮚ Divide data into buckets and store the average (or sum) for each bucket
⮚ Partitioning rules:
⯈ Equal-width: equal bucket range
⯈ MaxDiff: set a bucket boundary between each pair of adjacent values for the pairs having the β–1 largest differences
[Figure: example equal-width histogram over the value range 10,000–100,000]
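A small numpy sketch of equal-width histogram reduction; the prices are randomly generated, and the 10,000-wide buckets mirror the value range shown in the original figure:

```python
# Each bucket is summarized by its count and mean instead of the raw values.
import numpy as np

prices = np.random.default_rng(0).integers(10_000, 100_000, size=1_000)
edges = np.arange(10_000, 110_000, 10_000)        # equal-width bucket edges

bucket_ids = np.digitize(prices, edges) - 1
for i in range(len(edges) - 1):
    in_bucket = prices[bucket_ids == i]
    print(f"[{edges[i]:>6}, {edges[i+1]:>6}): "
          f"count={len(in_bucket):4d}, mean={in_bucket.mean():.0f}")
```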
Data Reduction Method (4): Sampling
⮚ Sampling: Obtaining a small sample s to represent the whole data set N
⮚ Allow a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
⮚ Choose a representative subset of the data
⯈ Simple random sampling may have very poor performance in the presence of skew
⮚ Develop adaptive sampling methods
⯈ Stratified sampling:
⯈ Approximate the percentage of each class (or subpopulation of interest) in the overall
database
⯈ Used in conjunction with skewed data
⮚ Note: Sampling may not reduce database I/Os (page at a time)
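A pandas sketch contrasting simple random sampling with stratified sampling on skewed class data; the DataFrame and the 10% sampling fraction are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "cls":   rng.choice(["common", "rare"], size=1_000, p=[0.95, 0.05]),
    "value": rng.normal(size=1_000),
})

# Simple random sampling (without replacement) -- may under-represent "rare".
srs = df.sample(frac=0.10, random_state=42)

# Stratified sampling: take ~10% of each class, preserving class proportions.
stratified = df.groupby("cls").sample(frac=0.10, random_state=42)

print(srs["cls"].value_counts(), stratified["cls"].value_counts(), sep="\n")
```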
Sampling: with or without Replacement
[Figure: a simple random sample drawn from the raw data, with or without replacement]
Sampling: Cluster or Stratified Sampling
[Figure: raw data vs. a cluster/stratified sample]
Discretization
⮚ Three types of attributes: nominal, ordinal, and continuous
⮚ Discretization: divide the range of a continuous attribute into intervals
Entropy-Based Discretization
⮚ Entropy of a partition S1, where p_i is the probability of class i in S1:
Entropy(S1) = − Σ_{i=1}^{m} p_i log2(p_i)
⮚ The boundary that minimizes the entropy function over all possible boundaries is selected as
a binary discretization
⮚ The process is recursively applied to partitions obtained until some stopping criterion is met
⮚ Such a boundary may reduce data size and improve classification accuracy
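A minimal Python sketch of entropy-based binary discretization as described above: every candidate boundary between distinct values is evaluated and the one with the lowest weighted entropy is kept (the values and class labels are invented):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(values, labels):
    """Return the boundary with minimum weighted entropy and that entropy."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    n = len(v)
    best_t, best_e = None, np.inf
    for i in range(1, n):
        if v[i] == v[i - 1]:
            continue                      # boundaries only between distinct values
        t = (v[i] + v[i - 1]) / 2
        e = (i / n) * entropy(y[:i]) + ((n - i) / n) * entropy(y[i:])
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

values = [1, 2, 3, 10, 11, 12, 30, 31]
labels = ["low", "low", "low", "mid", "mid", "mid", "high", "high"]
print(best_split(values, labels))         # boundary 6.5, weighted entropy ~0.61
```

Applying the same split recursively to each resulting partition, until a stopping criterion is met, gives the full discretization described above.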
Interval Merge by χ² Analysis
⯈ Adjacent intervals with the lowest χ² values are merged together, since low χ² values for a pair indicate similar class distributions
⯈ This merge process proceeds recursively until a predefined stopping criterion is met
(such as significance level, max-interval, max inconsistency, etc.)
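A sketch of a single merge step in this spirit: compute χ² for each pair of adjacent intervals (rows of per-class counts) and merge the most similar pair; repeating this until the stopping criterion is met gives the full procedure. The interval counts below are illustrative:

```python
import numpy as np

def chi2_adjacent(a, b):
    """Chi-square statistic for two adjacent intervals' class-count rows."""
    obs = np.array([a, b], dtype=float)
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    mask = expected > 0
    return (((obs - expected) ** 2)[mask] / expected[mask]).sum()

# Per-interval class counts: [class0, class1]
intervals = [[8, 2], [7, 3], [1, 9], [0, 10]]
scores = [chi2_adjacent(intervals[i], intervals[i + 1])
          for i in range(len(intervals) - 1)]
i = int(np.argmin(scores))                          # most similar adjacent pair
merged = (np.array(intervals[i]) + np.array(intervals[i + 1])).tolist()
intervals[i:i + 2] = [merged]                       # merge; repeat until stopping
print(scores, intervals)                            # criterion is met
```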
Segmentation by Natural Partitioning
⯈ A simple 3-4-5 rule can be used to segment numeric data into relatively uniform,
“natural” intervals.
⯈ If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
⯈ If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
⯈ If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
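A sketch of the rule’s first step, choosing the number of equi-width intervals from the count of distinct values at the most significant digit; the recursive refinement of the resulting intervals is omitted:

```python
def intervals_345(msd_distinct_values: int) -> int:
    """Number of equi-width intervals prescribed by the 3-4-5 rule."""
    if msd_distinct_values in (3, 6, 7, 9):
        return 3
    if msd_distinct_values in (2, 4, 8):
        return 4
    if msd_distinct_values in (1, 5, 10):
        return 5
    raise ValueError("value outside the 3-4-5 rule's cases")

print(intervals_345(8))   # 4 intervals
print(intervals_345(5))   # 5 intervals
```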
Concept Hierarchy Generation for Categorical Data
⮚ Specification of a partial/total ordering of attributes explicitly at the schema level
by users or experts
⯈ street < city < state < country
⮚ Specification of a hierarchy for a set of values by explicit data grouping
⯈ {Urbana, Champaign, Chicago} < Illinois
⮚ Specification of only a partial set of attributes
⯈ E.g., only street < city, not others
⮚ Automatic generation of hierarchies (or attribute levels) by the analysis of the number
of distinct values
⯈ E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
⮚ Some hierarchies can be automatically generated based on the analysis of the number
of distinct values per attribute in the data set
⯈ The attribute with the most distinct values is placed at the lowest level of the
hierarchy
⯈ Exceptions, e.g., weekday, month, quarter, year
⯈ E.g., country has only 15 distinct values and is therefore placed at the top (most general) level of the hierarchy
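A tiny Python sketch of this ordering heuristic; only the count for country (15) appears in the slide, and the other distinct-value counts are illustrative assumptions:

```python
# Order attributes by their number of distinct values; the attribute with the
# most distinct values goes to the lowest (most specific) hierarchy level.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3_567,
    "street": 674_339,
}
hierarchy = sorted(distinct_counts, key=distinct_counts.get)   # top -> bottom
print(" < ".join(reversed(hierarchy)))
# street < city < province_or_state < country
```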