Data Mining
Integration
Lecture 4
Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems
2023 - 2024
Outline
Data Reduction
Strategies
Data Reduction
Attribute Subset Selection
Find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
An exhaustive search can be prohibitively expensive
Heuristic (greedy) search (a minimal sketch of forward selection follows below):
◦Stepwise forward selection: start with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration, the best of the remaining attributes is added to the set.
◦Stepwise backward elimination: start with the full set of attributes. At each step, remove the worst attribute remaining in the set.
◦Combination of forward selection and backward elimination.
◦Decision tree induction: the attributes that appear in the constructed tree form the reduced subset.
Attribute construction: e.g., create an area attribute from the height and width attributes.
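As a rough illustration of stepwise forward selection, the sketch below greedily adds whichever attribute most improves a cross-validated score. The synthetic dataset, the score_subset helper, and the use of scikit-learn's decision tree with cross-validation as the "goodness" measure are illustrative assumptions, not part of the lecture.

```python
# Minimal sketch of stepwise forward selection (greedy attribute subset selection).
# Assumptions: scikit-learn is available; a decision tree + 5-fold cross-validation
# serves as the attribute-subset evaluation measure; the data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

def score_subset(cols):
    """Evaluate a candidate attribute subset by cross-validated accuracy."""
    return cross_val_score(DecisionTreeClassifier(random_state=0), X[:, cols], y, cv=5).mean()

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    # Pick the attribute whose addition gives the largest improvement.
    scores = {a: score_subset(selected + [a]) for a in remaining}
    best_attr, new_score = max(scores.items(), key=lambda kv: kv[1])
    if new_score <= best_score:          # stop when no remaining attribute helps
        break
    selected.append(best_attr)
    remaining.remove(best_attr)
    best_score = new_score

print("reduced attribute set:", selected, "score:", round(best_score, 3))
```

Stepwise backward elimination would run the same loop in reverse: start from the full attribute set and repeatedly drop the attribute whose removal hurts the score least.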
Data Reduction - Numerosity Reduction
Regression
Data Reduction
Regression
X Y
1.00 1.00
2.00 2.00
3.00 1.30
4.00 3.75
5.00 2.25
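A minimal sketch of how regression achieves numerosity reduction on the table above: the five (X, Y) pairs are replaced by just two fitted model parameters, a slope and an intercept (numpy is assumed).

```python
# Sketch: parametric numerosity reduction by linear regression.
# The five (X, Y) points from the slide are summarized by a fitted line.
import numpy as np

x = np.array([1.00, 2.00, 3.00, 4.00, 5.00])
y = np.array([1.00, 2.00, 1.30, 3.75, 2.25])

slope, intercept = np.polyfit(x, y, deg=1)       # least-squares fit of y = slope*x + intercept
print(f"y = {slope:.3f} x + {intercept:.3f}")    # roughly y = 0.425 x + 0.785
```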
Data Reduction
Histograms
Equal-width: the width of each bucket range is uniform (e.g., each bucket covers a $10 range of prices).
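A brief sketch of building an equal-width histogram with $10-wide buckets; the price values and bucket edges below are illustrative assumptions, and each bucket is then stored as a range plus a count instead of the raw values.

```python
# Sketch: equal-width histogram as numerosity reduction.
# The price list and the $10-wide bucket edges are assumed, for illustration only.
import numpy as np

prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 15, 15, 18, 18,
                   18, 20, 20, 21, 21, 25, 25, 28, 30, 30])

counts, edges = np.histogram(prices, bins=[0, 10, 20, 30])   # $10-wide buckets
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"${lo}-${hi}: {c} items")
```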
Data Reduction
Sampling
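As a rough sketch of sampling for numerosity reduction, simple random sampling without replacement keeps a small random subset in place of the full table; pandas, the transactions DataFrame, and the 1% sample size below are illustrative assumptions.

```python
# Sketch: simple random sampling without replacement (SRSWOR) as numerosity reduction.
# The DataFrame contents and the sample size are illustrative assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
transactions = pd.DataFrame({
    "id": np.arange(10_000),
    "amount": rng.normal(50, 15, size=10_000).round(2),
})

# Keep a 1% simple random sample instead of the full table.
sample = transactions.sample(n=100, replace=False, random_state=0)
print(sample.shape)                                            # (100, 2)
print(sample["amount"].mean(), transactions["amount"].mean())  # sample mean approximates the full mean
```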
Transformation and Discretization
Transformation Strategies
Attribute construction: new attributes are constructed from the given ones (e.g., an area attribute from height and width).
Aggregation: summary or aggregation operations are applied to the data (e.g., daily sales aggregated into monthly totals).
Discretization: raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20) or conceptual labels (e.g., youth, adult, senior); see the sketch below.
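A minimal sketch of discretizing age into interval labels and conceptual labels with pandas; the age values come from the quiz at the end, while the youth/adult/senior boundaries are assumptions chosen only for illustration.

```python
# Sketch: discretization of age into interval labels and conceptual labels.
# pandas is assumed; the concept-label boundaries are illustrative assumptions.
import pandas as pd

ages = pd.Series([13, 15, 22, 25, 33, 35, 45, 52, 70])

# Interval labels such as "0-10", "11-20", ...
interval_labels = pd.cut(ages, bins=[0, 10, 20, 30, 40, 50, 60, 70],
                         labels=["0-10", "11-20", "21-30", "31-40", "41-50", "51-60", "61-70"])

# Conceptual labels such as youth / adult / senior (boundaries assumed).
concept_labels = pd.cut(ages, bins=[0, 20, 60, 120], labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "interval": interval_labels, "concept": concept_labels}))
```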
Transformation and Discretization
Transformation by Normalization
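For reference, the standard min-max and z-score normalization formulas (the same definitions applied in the quiz at the end), where A is the attribute being normalized and v is one of its values:

```latex
% Min-max normalization: map v from [min_A, max_A] onto a new range [new_min_A, new_max_A]
v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A

% Z-score normalization: center on the mean \bar{A} and scale by the standard deviation \sigma_A
v' = \frac{v - \bar{A}}{\sigma_A}
```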
Transformation and Discretization
Concept Hierarchy
Summary
Cleaning: binning, regression, outlier analysis
Integration: correlation analysis
Reduction: regression, histograms, clustering, attribute construction, wavelet transforms, PCA, attribute subset selection, sampling
Transformation/Discretization: binning, regression, correlation analysis, histogram analysis, clustering, attribute construction, aggregation, normalization, concept hierarchy
Quiz
• You have this data for the attribute age: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
• Use smoothing by bin means to smooth these data, using a bin depth of 3.
• Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
• Use z-score normalization to transform the value 35 for age, where the standard deviation of age is 12.94 years.
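A short sketch for checking your quiz answers with numpy; the bin-means, min-max, and z-score steps follow the standard definitions, using the given standard deviation of 12.94 for the z-score step.

```python
# Sketch for checking the quiz answers (numpy assumed).
import numpy as np

age = np.array([13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
                30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70])

# Smoothing by bin means with a bin depth of 3: each group of 3 sorted
# values is replaced by that group's mean.
bins = np.sort(age).reshape(-1, 3)
smoothed = np.repeat(bins.mean(axis=1), 3)
print("bin means:", bins.mean(axis=1))
print("smoothed data:", smoothed)

# Min-max normalization of the value 35 onto [0.0, 1.0].
v = 35
minmax = (v - age.min()) / (age.max() - age.min())
print("min-max:", round(minmax, 3))      # roughly 0.386

# Z-score normalization of 35, using the given standard deviation of 12.94.
zscore = (v - age.mean()) / 12.94
print("z-score:", round(zscore, 3))      # roughly 0.39
```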