Data Mining

This document discusses various techniques for data reduction and transformation in data mining. It describes strategies like dimensionality reduction using PCA and attribute subset selection to reduce the number of attributes. For numerosity reduction, it covers parametric methods like regression and non-parametric methods such as histograms, clustering, and sampling to replace the original data with a smaller representation. Common transformation techniques discussed include normalization, binning, concept hierarchies, and aggregation.

Data Mining and Business Intelligence

Data Pre-processing: Integration, Reduction, Transformation
Part 2

By
Dr. Nora Shoaip

Lecture 4

Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems

2023 - 2024
Outline

Data Reduction:
• Wavelet transforms
• PCA
• Attribute subset selection
• Regression
• Histograms
• Clustering
• Sampling

Data Transformation:
• Normalization
• Binning
• Histogram analysis
• Cluster/Decision trees/Correlation analyses
• Concept hierarchy
Data Reduction
Strategies

• Dimensionality reduction → reduce the number of attributes
◦ Wavelet transforms, PCA, Attribute subset selection
• Numerosity reduction → replace the original data volume by a smaller data representation
◦ Parametric → a model is used to estimate the data; only the model parameters are stored
  Regression
◦ Nonparametric → store reduced representations of the data
  Histograms, clustering, sampling
• Compression → transformations are applied to obtain a “compressed” representation of the original data
◦ Lossless, Lossy
Data Reduction
Attribute Subset Selection

• Find a minimal set of attributes such that the resulting probability distribution of the data is as close as possible to the original distribution using all attributes
• An exhaustive search can be prohibitively expensive
• Heuristic (greedy) search:
◦ Stepwise forward selection: start with an empty set of attributes as the reduced set. The best of the attributes is determined and added to the reduced set. At each subsequent iteration, the best of the remaining attributes is added to the set.
◦ Stepwise backward elimination: start with the full set of attributes. At each step, remove the worst attribute remaining in the set.
◦ Combination of forward selection and backward elimination
◦ Decision tree induction
• Attribute construction → e.g. an area attribute constructed from height and width attributes
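Stepwise forward selection is a greedy loop: repeatedly add the attribute that most improves some subset-quality score. A minimal sketch, in which the scoring function is a hypothetical stand-in for whatever measure is used in practice (e.g. a classifier's validation accuracy on that attribute subset):

```python
# Greedy stepwise forward selection over a set of attribute names.
# 'evaluate' is any caller-supplied scorer mapping a subset to a number.
def forward_selection(attributes, evaluate, k):
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        # Add the attribute that most improves the score of the current set.
        best = max(remaining, key=lambda a: evaluate(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy scorer: pretend each attribute has a known standalone worth (made up).
worth = {"age": 0.9, "income": 0.7, "zip": 0.1, "name": 0.0}
score = lambda subset: sum(worth[a] for a in subset)
print(forward_selection(list(worth), score, 2))  # → ['age', 'income']
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.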
Data Reduction- Numerosity reduction
Regression

• Data is modeled to fit a straight line
• A random variable y (the response variable) is modeled as a linear function of another random variable x (the predictor variable)
Regression line equation → y = wx + b
• w and b are regression coefficients → they specify the slope of the line and the y-intercept
• Solved for by the method of least squares → minimizes the error between the actual data points and the estimated line (the best-fitting line)
Data Reduction
Regression

X Y

1.00 1.00
2.00 2.00
3.00 1.30
4.00 3.75
5.00 2.25
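As a check on the least-squares method from the previous slide, fitting the table's five points with the closed-form formulas gives the coefficients directly:

```python
# Least-squares fit of y = w*x + b to the five (x, y) points in the table.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.0, 1.3, 3.75, 2.25]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares coefficients:
# w = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2),  b = y_mean - w*x_mean
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - w * mean_x

print(round(w, 3), round(b, 3))  # → 0.425 0.785
```

Storing just the two parameters (w, b) in place of the five tuples is exactly the "parametric" numerosity reduction described earlier.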

Data Reduction
Histograms

• A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, referred to as buckets or bins.
• A bucket representing a single attribute–value/frequency pair is a singleton bucket.
• Often, buckets represent continuous ranges for the given attribute.
• Equal-width: the width of each bucket range is uniform (e.g., a width of $10 for the buckets).
• Equal-frequency (or equal-depth): the frequency of each bucket is roughly constant (i.e., each bucket contains roughly the same number of contiguous data samples).
Data Reduction
Histograms

The following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar). The numbers have been sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
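An equal-width histogram of this price list, using the $10 bucket width mentioned on the previous slide, can be built with a short sketch:

```python
# Equal-width histogram (bucket width $10) for the sorted price list above.
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14,
          14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18,
          18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21,
          21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

width = 10
counts = {}
for p in prices:
    lo = (p // width) * width      # bucket lower bound: 0, 10, 20, 30, ...
    counts[lo] = counts.get(lo, 0) + 1

for lo in sorted(counts):
    print(f"[{lo}-{lo + width}): {counts[lo]}")
```

The 52 prices collapse to four (range, frequency) pairs, which is the reduced representation the histogram stores.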

Data Reduction
Sampling

• A large data set is represented by a smaller random data sample
• Simple random sample without replacement (SRSWOR) of size s → draw s of the N tuples (s < N)
◦ All tuples are equally likely to be sampled
• Simple random sample with replacement (SRSWR) of size s → similar to SRSWOR, but each time a tuple is drawn, it is recorded and then placed back, so it may be drawn again
• Cluster sample → if the tuples are grouped into M “clusters,” an SRS of s clusters can be obtained
• Stratified sample → if the tuples are divided into strata, a stratified sample is generated by obtaining an SRS at each stratum
◦ e.g. a stratum is created for each customer age group
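The four schemes can be sketched with Python's random module; the tuples here are plain dicts, and the age_group field used as the stratum key is a made-up example:

```python
import random

# Toy data set: 30 tuples with a hypothetical 'age_group' attribute.
data = [{"id": i, "age_group": g}
        for i, g in enumerate(["youth", "adult", "senior"] * 10)]
s = 6

srswor = random.sample(data, s)                  # without replacement
srswr = [random.choice(data) for _ in range(s)]  # with replacement

# Cluster sample: group the tuples into M clusters, then take an SRS of clusters.
clusters = [data[i:i + 5] for i in range(0, len(data), 5)]  # M = 6 clusters of 5
cluster_sample = [t for c in random.sample(clusters, 2) for t in c]

# Stratified sample: an SRS (here of size 2) within each stratum.
strata = {}
for t in data:
    strata.setdefault(t["age_group"], []).append(t)
stratified = [t for g in strata.values() for t in random.sample(g, 2)]

print(len(srswor), len(srswr), len(cluster_sample), len(stratified))  # → 6 6 10 6
```

Note that stratified sampling guarantees every age group appears in the sample, which a plain SRS of the same size does not.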
Transformation and Discretization
Transformation Strategies

• Smoothing → binning, regression
• Attribute construction
• Aggregation
• Normalization → attribute values scaled so as to fall within a smaller, common range, e.g. [0.0, 1.0]
• Discretization → raw values of a numeric attribute (e.g. age) replaced by interval labels (e.g. 0–10, 11–20) or conceptual labels (e.g. youth, adult, senior)
• Concept hierarchy generation → e.g. street generalized to higher-level concepts (city or country)
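Smoothing by binning replaces each value with a summary of its bin; a minimal sketch of smoothing by bin means, assuming sorted input and a fixed bin depth:

```python
# Smoothing by bin means: partition sorted values into bins of 'depth'
# values each, then replace every value by its bin's mean.
def smooth_by_bin_means(values, depth):
    out = []
    for i in range(0, len(values), depth):
        bin_ = values[i:i + depth]
        mean = sum(bin_) / len(bin_)
        out.extend([round(mean, 2)] * len(bin_))
    return out

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# → [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin medians or bin boundaries works the same way, with the mean swapped for the median or the nearer boundary value.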

Transformation and Discretization
Transformation by Normalization

• To help avoid dependence on the choice of measurement units
• Gives all attributes equal weight
• Methods:
◦ Min-max normalization
◦ Z-score normalization
Transformation and Discretization
Transformation by Normalization

Min-max normalization → maps a value v of attribute A onto a new range [new_min, new_max]:
v' = ((v - min_A) / (max_A - min_A)) × (new_max - new_min) + new_min

Z-score normalization → uses the mean and standard deviation of A:
v' = (v - mean_A) / std_A
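Both normalizations are one-line functions. A sketch, where the income figures are illustrative assumptions rather than data from these slides:

```python
# Min-max and z-score normalization sketches.
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    # Map v from [vmin, vmax] onto [new_min, new_max].
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    # Express v as the number of standard deviations from the mean.
    return (v - mean) / std

# Illustrative income attribute (assumed: min 12,000, max 98,000,
# mean 54,000, standard deviation 16,000).
print(round(min_max(73600, 12000, 98000), 3))  # → 0.716
print(round(z_score(73600, 54000, 16000), 3))  # → 1.225
```

Min-max normalization is sensitive to out-of-range future values (a new minimum or maximum breaks the mapping), whereas z-score normalization handles them gracefully, at the cost of needing the mean and standard deviation.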
Transformation and Discretization
Concept Hierarchy

• A concept hierarchy organizes concepts (i.e., attribute values) hierarchically
• Concept hierarchies facilitate drilling and rolling, to view data at multiple granularities
• Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (e.g. age values) with higher-level concepts (e.g. age groups: youth, adult, or senior)
• Concept hierarchies can be explicitly specified by domain experts
• Concept hierarchies can be automatically formed for both numeric and nominal data → discretization
Transformation and Discretization
Concept Hierarchy

For nominal data:
• Specification of a partial ordering of attributes explicitly at the schema level by users or experts
◦ street, city, province or state, country → street < city < province or state < country
• Specification of a set of attributes, but not of their partial ordering → the order is automatically generated by the system
◦ e.g. Location → country contains a smaller number of distinct values than street
◦ Automatically generate the concept hierarchy based on the number of distinct values per attribute in the given attribute set
◦ This does not hold for all concepts! Time → year (20), month (12), day of week (7)
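The distinct-value heuristic can be sketched over a hypothetical location table: count distinct values per attribute, then order the attributes so the attribute with the fewest distinct values sits at the top of the hierarchy.

```python
# Order attributes into a concept hierarchy by their number of distinct
# values (hypothetical location table; fewer distinct values = higher level).
rows = [
    {"street": "1 Oak St",  "city": "Damanhour", "country": "Egypt"},
    {"street": "5 Elm St",  "city": "Damanhour", "country": "Egypt"},
    {"street": "9 Pine St", "city": "Cairo",     "country": "Egypt"},
]

attrs = ["street", "city", "country"]
distinct = {a: len({r[a] for r in rows}) for a in attrs}
hierarchy = sorted(attrs, key=lambda a: distinct[a])  # top level first

print(" < ".join(reversed(hierarchy)))  # → street < city < country
```

The Time example above is exactly where this heuristic fails: day of week has only 7 distinct values but is not a higher-level concept than month.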

Summary
Cleaning → Binning, Regression, Outlier analysis
Integration → Correlation analysis
Reduction → Regression, Histograms, Clustering, Attribute construction, Wavelet transforms, PCA, Attribute subset selection, Sampling
Transformation/Discretization → Binning, Regression, Correlation analysis, Histogram analysis, Clustering, Attribute construction, Aggregation, Normalization, Concept hierarchy
Quiz
• You have this data for the attribute age: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
• Use smoothing by bin means to smooth these data, using a bin depth of 3.
• Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
• Use z-score normalization to transform the value 35 for age, where the standard deviation of age is 12.94 years.
