Data Preprocessing
Lecture 3
Outline
• Why preprocess the data?
• Data cleaning
• Data integration
• Data transformation
• Data reduction
Handling Missing Data
• Fill in the missing value with the attribute mean for all samples belonging to the same class: smarter than using the global mean (a sketch follows below)
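A minimal sketch of class-conditional mean imputation, assuming a pandas DataFrame; the column names income and class and the toy values are made up for illustration:

```python
import pandas as pd
import numpy as np

# Toy data: 'income' has missing values; 'class' is the class label.
df = pd.DataFrame({
    "class":  ["A", "A", "A", "B", "B", "B"],
    "income": [30_000, np.nan, 34_000, 70_000, 74_000, np.nan],
})

# Fill each missing value with the mean of its own class,
# rather than the global mean over all samples.
df["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)

print(df)  # class A's NaN -> 32,000; class B's NaN -> 72,000
```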
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to:
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
How to Handle Noisy Data?
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc. (see the sketch after this list)
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with possible outliers)
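A minimal sketch of equal-frequency binning with smoothing by bin means and by bin boundaries, using only numpy; the price values are a toy example:

```python
import numpy as np

# Sorted toy data (price values).
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition into 3 equal-frequency bins of 3 values each.
bins = np.array_split(np.sort(prices), 3)

# Smooth by bin means: replace every value with its bin's mean.
smoothed_means = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed_means)   # bins become 9, 22, 29

# Smooth by bin boundaries: snap each value to its nearest bin edge.
smoothed_bounds = np.concatenate([
    np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins
])
print(smoothed_bounds)  # [4 4 15 21 21 24 25 25 34]
```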
Simple Discretization Methods: Binning
• Equal-width (distance) partitioning
• divides the range into N intervals of equal size (a uniform grid)
• if A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A) / N
• e.g., for A = 0, B = 100, and N = 5, each interval has width W = (100 − 0)/5 = 20
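A minimal sketch of equal-width partitioning with numpy; the attribute values and the choice N = 3 are made-up inputs:

```python
import numpy as np

values = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
N = 3

# Width of each interval: W = (B - A) / N.
A, B = values.min(), values.max()
W = (B - A) / N                       # (34 - 4) / 3 = 10.0

# Interior bin edges A+W, A+2W, and the bin index of each value.
edges = A + W * np.arange(1, N)       # [14, 24]
bin_ids = np.digitize(values, edges)  # 0, 1, or 2
print(W, bin_ids)                     # 10.0 [0 0 1 1 1 2 2 2 2]
```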
• Data cleaning
• Data integration
• Data transformation
• Data reduction
Data Integration
• Combines data from multiple sources into a coherent data store.
• Challenges:
• Entity identification problem:
• How to match schemas and objects from different sources ?
• Redundancy and correlation analysis: are any attributes correlated?
Entity identification problem
• Identifying real-world entities from multiple data sources is tricky
• e.g., schema matching: does cust_id in one source denote the same attribute as cust_number in another?
Correlation Analysis (Numeric Attributes)
• Correlation coefficient (Pearson's product-moment coefficient):

$$ r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B} $$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, and $\sigma_A$ and $\sigma_B$ are their standard deviations
• $r_{A,B} > 0$: A and B are positively correlated; $r_{A,B} = 0$: independent; $r_{A,B} < 0$: negatively correlated
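A minimal sketch of the formula above in plain numpy, checked against numpy's built-in corrcoef; the two attribute vectors are made up:

```python
import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 5.0, 9.0, 11.0])
n = len(A)

# r_{A,B} = sum((a_i - mean_A)(b_i - mean_B)) / ((n-1) * sd_A * sd_B),
# using the sample standard deviation (ddof=1) to match the (n-1) term.
r = ((A - A.mean()) * (B - B.mean())).sum() / (
    (n - 1) * A.std(ddof=1) * B.std(ddof=1)
)

print(r)                        # ~0.991: strong positive correlation
print(np.corrcoef(A, B)[0, 1])  # same value from the built-in
```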
• Data cleaning
• Data integration
• Data transformation
• Data reduction
Data Transformation
• Data are transformed or consolidated into forms
appropriate for mining.
• The resulting mining process may be more efficient, and
the patterns found may be easier to understand
• Attribute Transformation: A function that maps
the entire set of values of a given attribute to a
new set of replacement values.
Data Transformation
• Data transformation strategies :
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• e.g.,daily sales data may be aggregated to monthly or annual sales
• Normalization: scale attribute values to fall within a smaller, specified range
• min-max normalization maps a value $v$ of attribute $A$ to $v'$ in the new range $[new\_min_A, new\_max_A]$:

$$ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A $$

• z-score normalization uses the mean $\mu_A$ and standard deviation $\sigma_A$ of $A$:

$$ v' = \frac{v - \mu_A}{\sigma_A} $$

• Ex. Let $\mu = 54{,}000$ and $\sigma = 16{,}000$. Then $v = 73{,}600$ normalizes to $v' = \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
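A minimal sketch of both normalizations, verifying the z-score example above; the min-max input range [12,000, 98,000] is made up:

```python
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    """Min-max normalization: map v from [vmin, vmax] to [new_min, new_max]."""
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mu) / sigma

# The slide's example: mu = 54,000, sigma = 16,000, v = 73,600.
print(z_score(73_600, 54_000, 16_000))  # 1.225

# Min-max: map incomes from [12,000, 98,000] onto [0, 1].
print(min_max(73_600, 12_000, 98_000))  # ~0.716
```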
• Data cleaning
• Data integration
• Data transformation
• Data reduction
Data Reduction Strategies
• Why data reduction?
• A database/data warehouse may store terabytes of data
• Complex data analysis/mining may take a very long time to run on
the complete data set
• Data reduction
• Obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
• Data reduction strategies
• Data cube aggregation:
• Dimensionality reduction — e.g., remove unimportant attributes
• Sampling
• Data Compression
Data Reduction: Dimensionality Reduction
• Curse of dimensionality
• when dimensionality increases, data becomes increasingly sparse
• density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
• the possible combinations of subspaces grow exponentially
• Dimensionality reduction
• avoids the curse of dimensionality
• helps eliminate irrelevant features and reduce noise
• reduces the time and space required in data mining
• allows easier visualization
Data Reduction: Dimensionality Reduction
• Dimensionality reduction is the process of reducing the number of attributes under consideration, including:
• data compression techniques (see the sketch after this list)
• attribute subset selection, which removes irrelevant, weakly relevant, or redundant attributes or dimensions
• Why?
• to improve the quality and efficiency of the mining process
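As one concrete example (PCA is not named on the slide, but it is a standard compression-style dimensionality reduction), a minimal numpy sketch that projects made-up 3-attribute data onto its top 2 principal components:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 100 samples, 3 attributes, with one dominant direction.
X = rng.normal(size=(100, 1)) @ np.array([[3.0, 2.0, 1.0]]) \
    + 0.1 * rng.normal(size=(100, 3))

# Center the data, then take the eigenvectors of the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

# Keep the 2 components with the largest variance (last 2 columns).
components = eigvecs[:, -2:]
X_reduced = Xc @ components

print(X_reduced.shape)  # (100, 2): 3 attributes reduced to 2
```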
Data Reduction: Sampling
• Sampling: obtaining a small sample s to represent the whole data set N
• Choose a representative subset of the data
• Stratified sampling (see the sketch below)
• split the data into several groups (strata); then draw random samples from each group
• ensures that every group is represented, even small ones
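A minimal sketch of stratified sampling with pandas; the DataFrame and the 10% sampling fraction are made-up inputs:

```python
import pandas as pd

# Made-up data set with an imbalanced 'group' attribute.
df = pd.DataFrame({
    "group": ["A"] * 90 + ["B"] * 10,
    "value": range(100),
})

# Simple random sampling could miss group B entirely; stratified
# sampling draws 10% from each group, so B is always represented.
sample = df.groupby("group").sample(frac=0.1, random_state=42)

print(sample["group"].value_counts())  # A: 9, B: 1
```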
Sample Size
[figure: point clouds illustrating that smaller samples preserve less of the data's structure]
Data Compression
[figure: original data vs. compressed data; lossy compression reconstructs only an approximation of the original data]
Any questions?