Topic 4

Uploaded by

applehead0203

ASET

Amity School of Engineering & Technology

Module 2: Topic 4: Inconsistent Data, Data Integration and Transformation
B.Tech CSE, 7th Semester
Data Mining & Business Intelligence

Learning Objective and Outcomes

Objective:

To learn techniques to handle inconsistent data

Outcomes:
Apply data integration and transformation techniques
Outline

• Inconsistent Data
• Data Integration
• Data Transformation
Inconsistent Data

Data Integration

• Data integration:
– Combines data from multiple sources into a coherent store

• Schema integration: e.g., A.cust-id ≡ B.cust-#

– Integrate metadata from different sources

• Entity identification problem:

– Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton

• Detecting and resolving data value conflicts

– For the same real world entity, attribute values from different sources are different
– Possible reasons: different representations, different scales, e.g., metric vs. British units
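
As a rough illustration of schema integration and value-conflict resolution, the sketch below merges records from two hypothetical sources. The field names (`cust_id`, `cust_no`, `weight_kg`, `weight_lb`, `city`), the sample records, and the conflict policy are all assumptions made for this example and do not come from the slides; only the pound-to-kilogram factor is a fixed definition.

```python
# Hypothetical example: unify customer records from two sources whose
# schemas name the key differently and store weight in different units.

LB_TO_KG = 0.45359237  # exact definition of the avoirdupois pound in kg

# Assumed schema mapping: source-B attribute -> unified attribute.
SCHEMA_MAP_B = {"cust_no": "cust_id", "weight_lb": "weight_kg"}


def unify_source_b(record):
    """Rename source-B attributes and convert pounds to kilograms."""
    unified = {}
    for name, value in record.items():
        target = SCHEMA_MAP_B.get(name, name)
        if name == "weight_lb":
            value = round(value * LB_TO_KG, 2)
        unified[target] = value
    return unified


def integrate(source_a, source_b):
    """Merge both sources into one store keyed by cust_id.

    Source A is assumed to already use the unified schema. When both
    sources describe the same customer, source A's values are kept and
    source B only fills in missing attributes (an arbitrary policy).
    """
    store = {rec["cust_id"]: dict(rec) for rec in source_a}
    for rec in map(unify_source_b, source_b):
        merged = store.setdefault(rec["cust_id"], {})
        for name, value in rec.items():
            merged.setdefault(name, value)
    return store


source_a = [{"cust_id": 1, "weight_kg": 70.0}]
source_b = [{"cust_no": 1, "weight_lb": 154.0, "city": "Noida"},
            {"cust_no": 2, "weight_lb": 200.0}]
store = integrate(source_a, source_b)
print(store)
```

Keeping source A's value on conflict is only one possible policy; a real integration job would more likely log the disagreement for review.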

Handling Redundancy in Data Integration

• Redundant data often occur when multiple databases are integrated

– Object identification: The same attribute or object may have different names in different
databases
– Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual
revenue

• Redundant attributes can often be detected by correlation analysis and covariance analysis

• Careful integration of the data from multiple sources may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality

Correlation Analysis (Nominal Data)

• χ² (chi-square) test

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

• The larger the χ² value, the more likely the variables are related

• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count

• Correlation does not imply causality

– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population

Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

• χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$

• It shows that like_science_fiction and play_chess are correlated in the group
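
The calculation can be reproduced with a short sketch. The function names below are illustrative choices, not from the slides; the expected count of each cell is taken as row total × column total / grand total, which is how the parenthesized values in the table arise.

```python
# Chi-square statistic for a contingency table of observed counts.

def expected_counts(observed):
    """Expected count for each cell: row_total * col_total / grand_total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_square(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Rows: like / not like science fiction; columns: play / not play chess.
observed = [[250, 200],
            [50, 1000]]

print(expected_counts(observed))
print(round(chi_square(observed), 2))
```

Computed without intermediate rounding, the statistic is about 507.94; the 507.93 quoted in the example results from rounding each of the four terms to two decimals before summing.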

Correlation Analysis (Numeric Data)

• Correlation coefficient (also called Pearson's product-moment coefficient)

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\,\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(a_i b_i) is the sum of the AB cross-product.
• If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
• rA,B = 0: no linear correlation (this alone does not establish independence); rA,B < 0: negatively correlated
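
A minimal sketch of the coefficient, assuming σA and σB are sample standard deviations (divisor n − 1) so that they match the (n − 1) factor in the formula. The function name `pearson_r` and the sample data are hypothetical.

```python
import math


def pearson_r(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Numerator: sum of cross-products of deviations from the means.
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    # Sample standard deviations (divisor n - 1).
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    return cross / ((n - 1) * std_a * std_b)


hours = [1, 2, 3, 4, 5]          # hypothetical attribute A
score = [52, 55, 61, 64, 70]     # hypothetical attribute B
print(round(pearson_r(hours, score), 3))                # close to +1
print(round(pearson_r(hours, [-s for s in score]), 3))  # close to -1
```

The coefficient is undefined for a constant attribute (standard deviation 0), in which case this sketch raises `ZeroDivisionError`.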

Visually Evaluating Correlation

[Figure: scatter plots showing the similarity from –1 to 1]

Correlation (viewed as linear relationship)
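
• Correlation measures the linear relationship between objects

• To compute correlation, we standardize data objects, A and B, and then take their dot product (divided by n − 1 when std is the sample standard deviation)

$$a'_k = \frac{a_k - \mathrm{mean}(A)}{\mathrm{std}(A)}$$

$$b'_k = \frac{b_k - \mathrm{mean}(B)}{\mathrm{std}(B)}$$

$$\mathrm{correlation}(A, B) = \frac{A' \cdot B'}{n - 1}$$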
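
The standardize-then-dot-product view can be sketched as follows. One assumption is worth stating: `statistics.stdev` is the sample standard deviation, so the raw dot product equals (n − 1) times the coefficient, and the sketch divides by n − 1 to land in [−1, 1]. The function names and the data are illustrative.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / std for v in values]


def correlation(a, b):
    """Dot product of the standardized vectors, scaled by 1 / (n - 1)."""
    a_std = standardize(a)
    b_std = standardize(b)
    dot = sum(x * y for x, y in zip(a_std, b_std))
    return dot / (len(a) - 1)


a = [1.0, 2.0, 3.0, 4.0]      # hypothetical object A
b = [2.0, 4.0, 6.0, 8.0]      # exact increasing linear function of A
c = [4.0, 3.0, 2.0, 1.0]      # A reversed

print(correlation(a, b))      # approximately 1
print(correlation(a, c))      # approximately -1
```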

Covariance (Numeric Data)

• Covariance is similar to correlation

$$\mathrm{Cov}(A,B) = E\big((A-\bar{A})(B-\bar{B})\big) = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{n}$$

Correlation coefficient:

$$r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means or expected values of A and B, and σA and σB are the respective standard deviations of A and B (computed with the same divisor as the covariance).
• Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected values.
• Negative covariance: If CovA,B < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
• Independence: If A and B are independent, then CovA,B = 0, but the converse is not true:
– Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence
Covariance: An Example

• It can be simplified in computation as

$$\mathrm{Cov}(A,B) = E(A \cdot B) - \bar{A}\,\bar{B}$$

• Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
– E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
– E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
– Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
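
The stock example can be checked with a few lines of code. This is a sketch using the divisor n, as in the worked arithmetic; the function names are illustrative. It computes the covariance both from the shortcut E(A·B) − E(A)E(B) and from the mean product of deviations, which should agree.

```python
def covariance(a, b):
    """Shortcut form: Cov(A, B) = E(A*B) - E(A)*E(B), with divisor n."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    mean_ab = sum(x * y for x, y in zip(a, b)) / n
    return mean_ab - mean_a * mean_b


def covariance_by_definition(a, b):
    """Definition: mean of the products of deviations from the means."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


# Weekly values of stocks A and B from the example.
pairs = [(2, 5), (3, 8), (5, 10), (4, 11), (6, 14)]
stock_a = [p[0] for p in pairs]
stock_b = [p[1] for p in pairs]

print(round(covariance(stock_a, stock_b), 6))
print(round(covariance_by_definition(stock_a, stock_b), 6))
```

Both forms give 4 up to floating-point rounding, so the positive sign, and with it the conclusion that the two stocks tend to rise together, does not depend on which form is used.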
Data Transformation

• A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
• Methods
– Smoothing: Remove noise from data
– Attribute/feature construction
• New attributes constructed from the given ones

– Aggregation: Summarization, data cube construction

– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization

• z-score normalization

• normalization by decimal scaling

– Discretization: Concept hierarchy climbing

Normalization

• Min-max normalization: to [new_minA, new_maxA]

$$v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$

– Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$

• Z-score normalization (μ: mean, σ: standard deviation):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

– Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$

• Normalization by decimal scaling

$$v' = \frac{v}{10^j}$$

where j is the smallest integer such that max(|v'|) < 1
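
The three normalization methods can be sketched as plain functions. The function names are illustrative, the decimal-scaling sample values are made up for the demonstration, and the sketch assumes j is restricted to non-negative integers (the slide only says "smallest integer").

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest j >= 0 giving max |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


# Income examples from the slide.
print(round(min_max(73_600, 12_000, 98_000), 3))
print(round(z_score(73_600, 54_000, 16_000), 3))

# Illustrative values for decimal scaling.
scaled, j = decimal_scaling([-986, 917])
print(j, scaled)
```

Min-max normalization fails with a division by zero when max_A equals min_A, and z-score normalization does the same when σ is 0, so a constant attribute needs separate handling.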
Summary

Data integration from multiple sources:

Entity identification problem
Remove redundancies
Detect inconsistencies
Data transformation
Normalization
