Data Mining

This document discusses various techniques for data reduction and transformation in data mining. It describes strategies like dimensionality reduction using PCA and attribute subset selection to reduce the number of attributes. For numerosity reduction, it covers parametric methods like regression and non-parametric methods such as histograms, clustering, and sampling to replace the original data with a smaller representation. Common transformation techniques discussed include normalization, binning, concept hierarchies, and aggregation.

Data Mining and Business Intelligence

Data Pre-processing: Integration, Reduction, Transformation
Part 2

By
Dr. Nora Shoaip

Lecture 4

Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems

2023 - 2024
Outline

Data Reduction:
• Wavelet transforms
• PCA
• Attribute subset selection
• Regression
• Histograms
• Clustering
• Sampling

Data Transformation:
• Normalization
• Binning
• Histogram analysis
• Cluster/Decision trees/Correlation analyses
• Concept hierarchy
Data Reduction
Strategies

• Dimensionality reduction → reduce the number of attributes
◦ Wavelet transforms, PCA, Attribute subset selection
• Numerosity reduction → replace the original data volume by a smaller data representation
◦ Parametric → a model is used to estimate the data; only the model parameters are stored
  Regression
◦ Nonparametric → store reduced representations of the data
  Histograms, clustering, sampling
• Compression → transformations are applied to obtain a “compressed” representation of the original data
◦ Lossless, Lossy
Data Reduction
Attribute Subset Selection

• Find a minimal set of attributes such that the resulting probability distribution of the data is as close as possible to the original distribution using all attributes
• An exhaustive search can be prohibitively expensive
• Heuristic (greedy) search:
◦ Stepwise forward selection: start with an empty set of attributes as the reduced set. The best of the attributes is determined and added to the reduced set. At each subsequent iteration, the best of the remaining attributes is added to the set.
◦ Stepwise backward elimination: start with the full set of attributes. At each step, remove the worst attribute remaining in the set.
◦ Combination of forward selection and backward elimination
◦ Decision tree induction
• Attribute construction → e.g. an area attribute constructed from height and width attributes
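Stepwise forward selection is a greedy loop: repeatedly add the attribute that most improves some subset-quality score. A minimal sketch, in which the scoring function is a hypothetical stand-in for whatever measure is used in practice (e.g. a classifier's validation accuracy on that attribute subset):

```python
# Greedy stepwise forward selection over a set of attribute names.
# 'evaluate' is any caller-supplied scorer mapping a subset to a number.
def forward_selection(attributes, evaluate, k):
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        # Add the attribute that most improves the score of the current set.
        best = max(remaining, key=lambda a: evaluate(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy scorer: pretend each attribute has a known standalone worth (made up).
worth = {"age": 0.9, "income": 0.7, "zip": 0.1, "name": 0.0}
score = lambda subset: sum(worth[a] for a in subset)
print(forward_selection(list(worth), score, 2))  # → ['age', 'income']
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.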
Data Reduction- Numerosity reduction
Regression

• Data is modeled to fit a straight line
• A random variable y (the response variable) is modeled as a linear function of another random variable x (the predictor variable)
Regression line equation → y = wx + b
• w and b are regression coefficients → they specify the slope of the line and the y-intercept
• Solved for by the method of least squares → minimizes the error between the actual data points and the estimated line (the best-fitting line)
Data Reduction
Regression

X Y

1.00 1.00
2.00 2.00
3.00 1.30
4.00 3.75
5.00 2.25
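As a check on the least-squares method from the previous slide, fitting the table's five points with the closed-form formulas gives the coefficients directly:

```python
# Least-squares fit of y = w*x + b to the five (x, y) points in the table.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.0, 1.3, 3.75, 2.25]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares coefficients:
# w = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2),  b = y_mean - w*x_mean
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - w * mean_x

print(round(w, 3), round(b, 3))  # → 0.425 0.785
```

Storing just the two parameters (w, b) in place of the five tuples is exactly the "parametric" numerosity reduction described earlier.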

Data Reduction
Histograms

• A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, referred to as buckets or bins.
• A bucket representing a single attribute–value/frequency pair is a singleton bucket.
• Often, buckets represent continuous ranges for the given attribute.
• Equal-width: the width of each bucket range is uniform (e.g., a width of $10 for the buckets).
• Equal-frequency (or equal-depth): the frequency of each bucket is roughly constant (i.e., each bucket contains roughly the same number of contiguous data samples).
Data Reduction
Histograms

The following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar). The numbers have been sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
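An equal-width histogram of this price list, using the $10 bucket width mentioned on the previous slide, can be built with a short sketch:

```python
# Equal-width histogram (bucket width $10) for the sorted price list above.
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14,
          14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18,
          18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21,
          21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

width = 10
counts = {}
for p in prices:
    lo = (p // width) * width      # bucket lower bound: 0, 10, 20, 30, ...
    counts[lo] = counts.get(lo, 0) + 1

for lo in sorted(counts):
    print(f"[{lo}-{lo + width}): {counts[lo]}")
```

The 52 prices collapse to four (range, frequency) pairs, which is the reduced representation the histogram stores.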

Data Reduction
Sampling

• A large data set is represented by a smaller random data sample
• Simple random sample without replacement (SRSWOR) of size s → draw s of the N tuples (s < N)
◦ All tuples are equally likely to be sampled
• Simple random sample with replacement (SRSWR) of size s → similar to SRSWOR, but each time a tuple is drawn, it is recorded and then placed back, so it may be drawn again
• Cluster sample → if the tuples are grouped into M “clusters,” an SRS of s clusters can be obtained
• Stratified sample → if the tuples are divided into strata, a stratified sample is generated by obtaining an SRS at each stratum
◦ e.g. a stratum is created for each customer age group
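The four schemes can be sketched with Python's random module; the tuples here are plain dicts, and the age_group field used as the stratum key is a made-up example:

```python
import random

# Toy data set: 30 tuples with a hypothetical 'age_group' attribute.
data = [{"id": i, "age_group": g}
        for i, g in enumerate(["youth", "adult", "senior"] * 10)]
s = 6

srswor = random.sample(data, s)                  # without replacement
srswr = [random.choice(data) for _ in range(s)]  # with replacement

# Cluster sample: group the tuples into M clusters, then take an SRS of clusters.
clusters = [data[i:i + 5] for i in range(0, len(data), 5)]  # M = 6 clusters of 5
cluster_sample = [t for c in random.sample(clusters, 2) for t in c]

# Stratified sample: an SRS (here of size 2) within each stratum.
strata = {}
for t in data:
    strata.setdefault(t["age_group"], []).append(t)
stratified = [t for g in strata.values() for t in random.sample(g, 2)]

print(len(srswor), len(srswr), len(cluster_sample), len(stratified))  # → 6 6 10 6
```

Note that stratified sampling guarantees every age group appears in the sample, which a plain SRS of the same size does not.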
Transformation and Discretization
Transformation Strategies

• Smoothing → binning, regression
• Attribute construction
• Aggregation
• Normalization → attribute values scaled so as to fall within a smaller, common range, e.g. [0.0, 1.0]
• Discretization → raw values of a numeric attribute (e.g. age) replaced by interval labels (e.g. 0–10, 11–20) or conceptual labels (e.g. youth, adult, senior)
• Concept hierarchy generation → e.g. street generalized to higher-level concepts (city or country)
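Smoothing by binning replaces each value with a summary of its bin; a minimal sketch of smoothing by bin means, assuming sorted input and a fixed bin depth:

```python
# Smoothing by bin means: partition sorted values into bins of 'depth'
# values each, then replace every value by its bin's mean.
def smooth_by_bin_means(values, depth):
    out = []
    for i in range(0, len(values), depth):
        bin_ = values[i:i + depth]
        mean = sum(bin_) / len(bin_)
        out.extend([round(mean, 2)] * len(bin_))
    return out

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# → [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin medians or bin boundaries works the same way, with the mean swapped for the median or the nearer boundary value.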

Transformation and Discretization
Transformation by Normalization

• To help avoid dependence on the choice of measurement units
• Gives all attributes equal weight
• Methods:
◦ Min-max normalization
◦ Z-score normalization
Transformation and Discretization
Transformation by Normalization

Min-max normalization → maps a value v of attribute A onto a new range [new_min, new_max]:
v' = ((v - min_A) / (max_A - min_A)) × (new_max - new_min) + new_min

Z-score normalization → uses the mean and standard deviation of A:
v' = (v - mean_A) / std_A
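Both normalizations are one-line functions. A sketch, where the income figures are illustrative assumptions rather than data from these slides:

```python
# Min-max and z-score normalization sketches.
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    # Map v from [vmin, vmax] onto [new_min, new_max].
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    # Express v as the number of standard deviations from the mean.
    return (v - mean) / std

# Illustrative income attribute (assumed: min 12,000, max 98,000,
# mean 54,000, standard deviation 16,000).
print(round(min_max(73600, 12000, 98000), 3))  # → 0.716
print(round(z_score(73600, 54000, 16000), 3))  # → 1.225
```

Min-max normalization is sensitive to out-of-range future values (a new minimum or maximum breaks the mapping), whereas z-score normalization handles them gracefully, at the cost of needing the mean and standard deviation.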
Transformation and Discretization
Concept Hierarchy

• A concept hierarchy organizes concepts (i.e., attribute values) hierarchically
• Concept hierarchies facilitate drilling and rolling, to view data at multiple granularities
• Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (e.g. age values) with higher-level concepts (e.g. age groups: youth, adult, or senior)
• Concept hierarchies can be explicitly specified by domain experts
• Concept hierarchies can be automatically formed for both numeric and nominal data → discretization
Transformation and Discretization
Concept Hierarchy

For nominal data:
• Specification of a partial ordering of attributes explicitly at the schema level by users or experts
◦ street, city, province or state, country → street < city < province or state < country
• Specification of a set of attributes, but not of their partial ordering → the order is automatically generated by the system
◦ e.g. Location → country contains a smaller number of distinct values than street
◦ Automatically generate the concept hierarchy based on the number of distinct values per attribute in the given attribute set
◦ This does not hold for all concepts! Time → year (20), month (12), day of week (7)
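The distinct-value heuristic can be sketched over a hypothetical location table: count distinct values per attribute, then order the attributes so the attribute with the fewest distinct values sits at the top of the hierarchy.

```python
# Order attributes into a concept hierarchy by their number of distinct
# values (hypothetical location table; fewer distinct values = higher level).
rows = [
    {"street": "1 Oak St",  "city": "Damanhour", "country": "Egypt"},
    {"street": "5 Elm St",  "city": "Damanhour", "country": "Egypt"},
    {"street": "9 Pine St", "city": "Cairo",     "country": "Egypt"},
]

attrs = ["street", "city", "country"]
distinct = {a: len({r[a] for r in rows}) for a in attrs}
hierarchy = sorted(attrs, key=lambda a: distinct[a])  # top level first

print(" < ".join(reversed(hierarchy)))  # → street < city < country
```

The Time example above is exactly where this heuristic fails: day of week has only 7 distinct values but is not a higher-level concept than month.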

Summary
Cleaning → Binning, Regression, Outlier analysis
Integration → Correlation analysis
Reduction → Regression, Histograms, Clustering, Attribute construction, Wavelet transforms, PCA, Attribute subset selection, Sampling
Transformation/Discretization → Binning, Regression, Correlation analysis, Histogram analysis, Clustering, Attribute construction, Aggregation, Normalization, Concept hierarchy
Quiz
• You have this data for the attribute age: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
• Use smoothing by bin means to smooth these data, using a bin depth of 3.
• Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
• Use z-score normalization to transform the value 35 for age, where the standard deviation of age is 12.94 years.
