Data_Preprocessing-1-19
◼ Data Quality
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Summary
Data Quality: Why Preprocess the Data?
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation
Chapter 3: Data Preprocessing
Data Cleaning
◼ Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., from faulty instruments, human or computer error, or transmission error
◼ incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
◼ e.g., Occupation=“ ” (missing data)
◼ noisy: containing noise, errors, or outliers
◼ e.g., Salary=“−10” (an error)
◼ inconsistent: containing discrepancies in codes or names, e.g.,
◼ Age=“42”, Birthday=“03/07/2010”
◼ Was rating “1, 2, 3”, now rating “A, B, C”
◼ discrepancy between duplicate records
◼ Intentional (e.g., disguised missing data)
◼ Jan. 1 as everyone’s birthday?
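The categories above can be turned into simple record-level checks. The following is a minimal sketch, not a definitive cleaning routine: the field names (`occupation`, `salary`, `age`, `birthday`), the helper names, and the reference date are all assumptions made for illustration, and the birthday "03/07/2010" is read here as 3 July 2010.

```python
from datetime import date

def age_on(birthday, today):
    """Whole years elapsed between birthday and today."""
    years = today.year - birthday.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years

def find_problems(record, today):
    """Return a list of human-readable data quality problems (hypothetical rules)."""
    problems = []
    if not record.get("occupation", "").strip():
        problems.append("incomplete: occupation is missing")
    if record["salary"] < 0:
        problems.append("noisy: salary is negative")
    if age_on(record["birthday"], today) != record["age"]:
        problems.append("inconsistent: age does not match birthday")
    if (record["birthday"].month, record["birthday"].day) == (1, 1):
        problems.append("suspicious: Jan. 1 birthday may be a disguised missing value")
    return problems

# Example record built from the slide's values; the date format is an assumption.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(2010, 7, 3)}
for problem in find_problems(record, today=date(2025, 2, 12)):
    print(problem)
```

A Jan. 1 birthday on one record proves nothing by itself; in practice the stronger signal is an implausibly high share of records with the same default value.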
Incomplete (Missing) Data
◼ technology limitation
◼ incomplete data
◼ inconsistent data
How to Handle Noisy Data?
◼ Binning
◼ first sort data and partition into (equal-frequency) bins
◼ Clustering
◼ detect and remove outliers
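As an illustration of the binning step, here is a minimal sketch that sorts the data, partitions it into equal-frequency bins, and then replaces each value with its bin mean. Smoothing by bin means is one common choice and is an assumption here, as are the function name and the sample prices.

```python
def smooth_by_bin_means(values, n_bins):
    """Equal-frequency binning followed by smoothing by bin means."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    smoothed, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value when the split is uneven.
        end = start + size + (1 if i < extra else 0)
        bin_values = ordered[start:end]
        if bin_values:
            mean = sum(bin_values) / len(bin_values)
            smoothed.extend([mean] * len(bin_values))
        start = end
    return smoothed

# Hypothetical sample data, split into three bins of four values each.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(prices, 3))
```

With these prices the three bins have means 9.0, 22.75, and 29.25, so every value in a bin is replaced by the corresponding mean.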
Data Cleaning as a Process
◼ Data discrepancy detection
◼ Use metadata (e.g., domain, range, dependency, distribution)
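Metadata-driven discrepancy detection can be sketched as a table of per-attribute rules that each record is checked against. The rule table, attribute names, and limits below are hypothetical and only cover domain (type and allowed values) and range; dependency and distribution checks are not shown.

```python
# Hypothetical metadata: per-attribute type plus a permitted range or value set.
METADATA = {
    "age": {"type": int, "min": 0, "max": 120},
    "salary": {"type": float, "min": 0.0, "max": None},
    "rating": {"type": str, "allowed": {"A", "B", "C"}},
}

def detect_discrepancies(record, metadata):
    """Return (attribute, problem) pairs for values that violate the metadata."""
    issues = []
    for field, rule in metadata.items():
        value = record.get(field)
        if value is None:
            issues.append((field, "missing"))
            continue
        if not isinstance(value, rule["type"]):
            issues.append((field, "wrong type"))
            continue
        if rule.get("min") is not None and value < rule["min"]:
            issues.append((field, "below range"))
        if rule.get("max") is not None and value > rule["max"]:
            issues.append((field, "above range"))
        if "allowed" in rule and value not in rule["allowed"]:
            issues.append((field, "outside domain"))
    return issues

print(detect_discrepancies({"age": 42, "salary": -10.0, "rating": "1"}, METADATA))
```

Keeping the rules in data rather than in code means new attributes can be covered by extending the table instead of writing another check.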
Data Integration
◼ Data integration:
◼ Combines data from multiple sources into a coherent store
◼ Schema integration: e.g., A.cust-id ≡ B.cust-#
◼ Integrate metadata from different sources
◼ Entity identification problem:
◼ Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
◼ Detecting and resolving data value conflicts
◼ For the same real world entity, attribute values from different
sources are different
◼ Possible reasons: different representations, different scales, e.g.,
metric vs. British units
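The integration steps above can be sketched on two tiny in-memory sources. This is an illustration under stated assumptions, not a general integration tool: the `weight_kg` and `weight_lb` attributes, the function names, and the conflict tolerance are invented for the example, while the key mapping (cust-id and cust-#) and the names come from the slide.

```python
LB_TO_KG = 0.45359237  # one avoirdupois pound in kilograms

# Two hypothetical sources describing the same customer with different schemas.
source_a = [{"cust-id": 1, "name": "William Clinton", "weight_kg": 80.0}]
source_b = [{"cust-#": 1, "name": "Bill Clinton", "weight_lb": 176.0}]

def normalize_a(row):
    """Map source A onto the shared schema."""
    return {"cust_id": row["cust-id"], "name": row["name"], "weight_kg": row["weight_kg"]}

def normalize_b(row):
    """Map source B onto the shared schema, converting pounds to kilograms."""
    return {"cust_id": row["cust-#"], "name": row["name"],
            "weight_kg": row["weight_lb"] * LB_TO_KG}

def integrate(rows_a, rows_b, tolerance_kg=0.5):
    """Merge on the shared key and report value conflicts instead of hiding them."""
    merged, conflicts = {}, []
    for row in map(normalize_a, rows_a):
        merged[row["cust_id"]] = row
    for row in map(normalize_b, rows_b):
        existing = merged.get(row["cust_id"])
        if existing is None:
            merged[row["cust_id"]] = row
            continue
        if existing["name"] != row["name"]:
            conflicts.append((row["cust_id"], "name", existing["name"], row["name"]))
        if abs(existing["weight_kg"] - row["weight_kg"]) > tolerance_kg:
            conflicts.append((row["cust_id"], "weight_kg",
                              existing["weight_kg"], row["weight_kg"]))
    return merged, conflicts

merged, conflicts = integrate(source_a, source_b)
print(conflicts)
```

After unit conversion the two weights agree within the tolerance, so only the name difference is reported; deciding that "Bill Clinton" and "William Clinton" are the same entity still needs a separate matching rule or human review.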
Handling Redundancy in Data Integration
2/12/2025 Data Mining: Concepts and Techniques 15
Visually Evaluating Correlation
Scatter plots showing the similarity from –1 to 1.
Correlation (viewed as linear relationship)
◼ Correlation measures the linear relationship
between objects
◼ To compute correlation, we standardize data
objects, A and B, and then take their dot product
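The standardize-then-dot-product view can be sketched as follows. Two details are assumptions, since the slide does not state them: the population standard deviation is used, and the dot product is divided by n so that the result lies between –1 and 1. The function names are illustrative.

```python
from math import sqrt

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    # A constant attribute has std == 0 and cannot be standardized.
    return [(v - mean) / std for v in values]

def correlation(a, b):
    """Pearson correlation as a scaled dot product of standardized vectors."""
    za, zb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)

print(correlation([1, 2, 3], [2, 4, 6]))   # perfectly linear, increasing
print(correlation([1, 2, 3], [6, 4, 2]))   # perfectly linear, decreasing
```

The first call yields a value of about 1 and the second about –1, up to floating-point rounding.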
Covariance (Numeric Data)
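◼ Covariance is similar to correlation: $Cov(A, B) = E[(A - \bar{A})(B - \bar{B})] = \frac{1}{n} \sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})$
◼ Correlation coefficient: $r_{A,B} = \frac{Cov(A, B)}{\sigma_A \sigma_B}$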
◼ Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
◼ Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
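The stock question can be checked numerically. The sketch below assumes the population form of covariance (dividing by n rather than n – 1) and reads each pair as (price of A, price of B); the function names are illustrative.

```python
from math import sqrt

# Weekly values from the slide, split into one series per stock.
stock_a = [2, 3, 5, 4, 6]
stock_b = [5, 8, 10, 11, 14]

def covariance(a, b):
    """Population covariance: mean product of deviations from the means."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

def correlation_coefficient(a, b):
    """Covariance divided by the product of the population standard deviations."""
    return covariance(a, b) / sqrt(covariance(a, a) * covariance(b, b))

print(covariance(stock_a, stock_b))
print(correlation_coefficient(stock_a, stock_b))
```

With means of 4 for A and 9.6 for B, the covariance comes out at about 4 and the correlation coefficient at about 0.94. Because the covariance is positive, the two prices tend to rise together.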