
Topic 4

Uploaded by

applehead0203

ASET

Amity School of Engineering & Technology


Module 2, Topic 4: Inconsistent Data, Data Integration and Transformation
B.Tech CSE, 7th Semester
Data Mining & Business Intelligence

Learning Objective and Outcomes

Objective:

To learn techniques to handle inconsistent data


Outcomes:
Apply data integration and transformation techniques
Outline

• Inconsistent Data
• Data Integration
• Data Transformation
Inconsistent Data
Data Integration

• Data integration:
– Combines data from multiple sources into a coherent store

• Schema integration: e.g., A.cust-id ≡ B.cust-#


– Integrate metadata from different sources

• Entity identification problem:


– Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton

• Detecting and resolving data value conflicts


– For the same real world entity, attribute values from different sources are different
– Possible reasons: different representations, different scales, e.g., metric vs. British units
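The integration steps above can be sketched in a few lines of Python. This is only an illustration: the record layouts, the field names (`cust_id`, `cust_no`, `height_cm`, `height_in`), and the values are hypothetical and do not come from the slides. The sketch maps source B onto source A's schema, converts British units to metric, and then reports which attribute values still disagree.

```python
# Hypothetical records; field names, units, and values are illustrative only.
source_a = [{"cust_id": 1, "name": "William Clinton", "height_cm": 188.0}]
source_b = [{"cust_no": 1, "name": "Bill Clinton", "height_in": 74.0}]

CM_PER_INCH = 2.54  # exact by definition of the inch


def to_schema_a(record_b):
    """Rename source-B attributes and convert units to match source A."""
    return {
        "cust_id": record_b["cust_no"],                    # schema integration
        "name": record_b["name"],
        "height_cm": record_b["height_in"] * CM_PER_INCH,  # common scale
    }


def value_conflicts(rec_a, rec_b, tolerance_cm=1.0):
    """Return the attributes whose values still disagree after mapping."""
    conflicts = []
    if rec_a["name"] != rec_b["name"]:
        conflicts.append("name")
    if abs(rec_a["height_cm"] - rec_b["height_cm"]) > tolerance_cm:
        conflicts.append("height_cm")
    return conflicts


by_id = {rec["cust_id"]: rec for rec in source_a}
for raw in source_b:
    mapped = to_schema_a(raw)
    match = by_id.get(mapped["cust_id"])
    if match is not None:
        print(mapped["cust_id"], value_conflicts(match, mapped))
```

After unit conversion the heights agree within the tolerance, so only the name variant is reported. Deciding that the two names refer to the same person is the entity identification problem, which this sketch does not attempt to solve.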

Handling Redundancy in Data Integration

• Redundant data often occur when multiple databases are integrated


– Object identification: The same attribute or object may have different names in different
databases
– Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual
revenue

• Redundant attributes can often be detected by correlation analysis and covariance analysis

• Careful integration of the data from multiple sources may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality

Correlation Analysis (Nominal Data)

• Χ2 (chi-square) test

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

• The larger the Χ2 value, the more likely the variables are related

• The cells that contribute the most to the Χ2 value are those whose actual count is very different
from the expected count

• Correlation does not imply causality


– The number of hospitals and the number of car thefts in a city are correlated
– Both are causally linked to a third variable: population

Chi-Square Calculation: An Example

                           Play chess    Not play chess    Sum (row)
Like science fiction       250 (90)      200 (360)         450
Not like science fiction   50 (210)      1000 (840)        1050
Sum (col.)                 300           1200              1500

• Χ2 (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)

$$\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93$$

• It shows that like_science_fiction and play_chess are correlated in the group
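The worked example can be checked with a short script. The sketch below assumes the observed counts are given as a list of rows. The function name `chi_square` is a made-up helper, not a library call. Each expected count is derived as (row total × column total) / grand total.

```python
def chi_square(observed):
    """Return (chi-square statistic, expected counts) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, obs in enumerate(row):
            # Expected count if the two attributes were unrelated.
            exp = row_totals[i] * col_totals[j] / total
            expected_row.append(exp)
            chi2 += (obs - exp) ** 2 / exp
        expected.append(expected_row)
    return chi2, expected


# Rows: like / not like science fiction; columns: play / not play chess.
observed = [[250, 200], [50, 1000]]
chi2, expected = chi_square(observed)
print(expected)  # [[90.0, 360.0], [210.0, 840.0]]
print(chi2)
```

The unrounded statistic is about 507.94. The 507.93 on the slide results from rounding each of the four terms to two decimals before adding them.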

Correlation Analysis (Numeric Data)

• Correlation coefficient (also called Pearson's product-moment coefficient)

$$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n - 1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n \bar{A} \bar{B}}{(n - 1)\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(aibi) is the sum of the AB cross-product.
• If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
• rA,B = 0: no linear correlation (this alone does not establish independence); rA,B < 0: negatively correlated
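A minimal sketch of the cross-product form of the coefficient follows. It assumes σA and σB are sample standard deviations (dividing by n − 1), which matches the (n − 1) factor in the formula. The function name `pearson_r` and the data values are illustrative and not taken from the slides.

```python
import math


def pearson_r(a, b):
    """Pearson correlation using the cross-product form with sample std devs."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (divide by n - 1), matching the (n - 1) factor.
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum(x * y for x, y in zip(a, b))
    return (cross - n * mean_a * mean_b) / ((n - 1) * std_a * std_b)


# Illustrative values, not taken from the slides.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
errors = [9, 7, 6, 4, 2]

print(pearson_r(hours, score) > 0)   # True: positively correlated
print(pearson_r(hours, errors) < 0)  # True: negatively correlated
```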

Visually Evaluating Correlation

[Figure: scatter plots showing the similarity from –1 to 1.]

Correlation (viewed as linear relationship)

• Correlation measures the linear relationship between objects

• To compute correlation, we standardize data objects, A and B, and then take their dot product, scaled by 1/(n − 1) when std is the sample standard deviation

$$a'_k = \frac{a_k - \text{mean}(A)}{\text{std}(A)}$$

$$b'_k = \frac{b_k - \text{mean}(B)}{\text{std}(B)}$$

$$\text{correlation}(A, B) = \frac{A' \cdot B'}{n - 1}$$
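The standardize-then-dot-product view can be sketched as below. The sketch assumes `std` means the sample standard deviation (`statistics.stdev`). Under that assumption the raw dot product equals (n − 1) times the correlation, so the code divides by n − 1 to land in [–1, 1]. The function names and data are illustrative.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std dev (n - 1 in the denominator)
    return [(v - mean) / std for v in values]


def correlation(a, b):
    """Dot product of the standardized objects, scaled by 1 / (n - 1)."""
    a_std = standardize(a)
    b_std = standardize(b)
    dot = sum(x * y for x, y in zip(a_std, b_std))
    return dot / (len(a) - 1)


# Illustrative data: a perfectly linear decreasing relationship.
print(round(correlation([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]), 6))  # -1.0
```

Because only the linear relationship is measured, a symmetric nonlinear pattern such as b = a² over values centered on zero yields a correlation of 0.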


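Covariance (Numeric Data)

• Covariance is similar to correlation

$$Cov(A, B) = E\big((A - \bar{A})(B - \bar{B})\big) = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n}$$

Correlation coefficient:

$$r_{A,B} = \frac{Cov(A, B)}{\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means or expected values of A and B, and σA and σB are the respective standard deviations of A and B.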
• Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected values.
• Negative covariance: If CovA,B < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
• Independence: If A and B are independent, then CovA,B = 0, but the converse is not true:
– Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence
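A small counterexample makes the last point concrete. The sketch below computes covariance directly from its definition (mean of the products of deviations, dividing by n). The data are made up for illustration: B is fully determined by A, yet the covariance is 0.

```python
def covariance(a, b):
    """Covariance from the definition: mean product of deviations (divide by n)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


# Illustrative data: b depends entirely on a (b = a squared).
a = [-2, -1, 0, 1, 2]
b = [x * x for x in a]

print(covariance(a, b))  # 0.0
```

The positive and negative deviations of A are paired with the same deviations of B, so their products cancel even though the two variables are clearly dependent.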
Covariance: An Example

• It can be simplified in computation as

$$Cov(A, B) = E(A \cdot B) - \bar{A}\bar{B}$$

• Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
– E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
– E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
– Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
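The stock example can be reproduced with the shortcut form, Cov(A, B) = E(A·B) − E(A)·E(B). The sketch below uses the five (A, B) pairs from the slide. The function name `covariance_shortcut` is an illustrative choice.

```python
def covariance_shortcut(a, b):
    """Covariance as E(A*B) - E(A)*E(B), dividing by n as in the example."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    mean_ab = sum(x * y for x, y in zip(a, b)) / n
    return mean_ab - mean_a * mean_b


# Weekly values of stocks A and B from the example.
stock_a = [2, 3, 5, 4, 6]
stock_b = [5, 8, 10, 11, 14]

cov = covariance_shortcut(stock_a, stock_b)
print(round(cov, 6))  # 4.0
```

The result is rounded before printing because 42.4 − 38.4 is not exactly 4 in floating-point arithmetic.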
Data Transformation

• A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
• Methods
– Smoothing: Remove noise from data
– Attribute/feature construction
• New attributes constructed from the given ones

– Aggregation: Summarization, data cube construction


– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization

• z-score normalization

• normalization by decimal scaling

– Discretization: Concept hierarchy climbing

Normalization

• Min-max normalization: to [new_minA, new_maxA]

$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

– Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$

• Z-score normalization (μ: mean, σ: standard deviation):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

– Ex. Let μ = 54,000, σ = 16,000. Then

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$

• Normalization by decimal scaling

$$v' = \frac{v}{10^j}$$

where j is the smallest integer such that max(|v'|) < 1
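The three normalization methods can be sketched as plain Python functions. The function names are illustrative. The min-max and z-score calls reuse the income figures from the examples above. The decimal-scaling input (values between −986 and 917) is an assumed range chosen only for demonstration.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))  # 1.225

# Assumed illustrative range, not taken from the slides.
scaled, j = decimal_scaling([-986, 917])
print(j, scaled)  # 3 [-0.986, 0.917]
```

Min-max normalization needs the attribute's minimum and maximum in advance, so a later value outside that range falls outside the target interval. Z-score normalization avoids this dependence on the extremes.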
Summary

• Data integration from multiple sources:
– Entity identification problem
– Remove redundancies
– Detect inconsistencies
• Data transformation
– Normalization
