Topic 4
Learning Objective and Outcomes
Objective:
• Inconsistent Data
• Data Integration
• Data Transformation
Inconsistent Data
Data Integration
• Data integration:
– Combines data from multiple sources into a coherent store
Handling Redundancy in Data Integration
• Redundant attributes can often be detected by correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
Correlation Analysis (Nominal Data)
• Χ2 (chi-square) test:
$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
• The larger the Χ2 value, the more likely the variables are related
• The cells that contribute the most to the Χ2 value are those whose actual count is very different
from the expected count
Chi-Square Calculation: An Example
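As a minimal sketch of the Χ2 calculation, the Python below computes each cell's expected count from the row and column totals and sums (Observed − Expected)² / Expected over all cells. The function name `chi_square` and the 2×2 counts are illustrative assumptions, not values taken from this slide.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    return chi2


# Hypothetical counts (assumed for illustration only):
# rows = values of one nominal attribute, columns = values of the other.
table = [[250, 200],
         [50, 1000]]
chi2_value = chi_square(table)
print(round(chi2_value, 2))
```

A large statistic relative to the chi-square critical value for (rows − 1) × (columns − 1) degrees of freedom suggests the two attributes are related.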
Correlation Analysis (Numeric Data)
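$$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(aibi) is the sum of the AB cross-product.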
• If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
• rA,B = 0: uncorrelated (no linear relationship, which does not by itself imply independence); rA,B < 0: negatively correlated
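The two forms of the correlation coefficient can be checked numerically. The sketch below assumes the sample standard deviation (dividing by n − 1), which is what makes the (n − 1) factor in the denominator consistent. The function name `pearson_r` and the input values are hypothetical.

```python
from math import sqrt


def pearson_r(a, b):
    """Return r computed two ways: centered form and cross-product form."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (n - 1), matching the (n - 1) in the formula.
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    denom = (n - 1) * sd_a * sd_b
    centered = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    cross_product = sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b
    return centered / denom, cross_product / denom


# Hypothetical attribute values, assumed for illustration only.
a_values = [1, 2, 3, 4, 5]
b_values = [2, 4, 5, 4, 5]
r_centered, r_cross = pearson_r(a_values, b_values)
print(r_centered, r_cross)
```

Both numbers agree up to floating-point rounding, and the result always lies in [−1, 1].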
Visually Evaluating Correlation
Scatter plots showing the similarity from –1 to 1.
Correlation (viewed as linear relationship)
Correlation coefficient: $r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A \sigma_B}$
• Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
– E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
– E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
– Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
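The arithmetic above can be reproduced with a short Python sketch using the shortcut Cov(A, B) = E(A·B) − E(A)·E(B). The helper name `covariance` is an assumption for illustration; the five (A, B) pairs are the ones listed in the example.

```python
def covariance(a, b):
    """Population covariance via E(A*B) - E(A)*E(B)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    mean_ab = sum(x * y for x, y in zip(a, b)) / n
    return mean_ab - mean_a * mean_b


# Stock values from the example: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
stock_a = [2, 3, 5, 4, 6]
stock_b = [5, 8, 10, 11, 14]
cov_ab = covariance(stock_a, stock_b)
print(round(cov_ab, 6))
```

A positive result indicates that the two series tend to move in the same direction.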
Data Transformation
• A function that maps the entire set of values of a given attribute to a new set of replacement
values such that each old value can be identified with one of the new values
• Methods
– Smoothing: Remove noise from data
– Attribute/feature construction
• New attributes constructed from the given ones
• z-score normalization
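Two of the methods above can be sketched briefly in Python. The slide does not specify a smoothing technique, so the code assumes smoothing by bin means over equal-size bins of sorted values. The attribute construction step assumes a hypothetical `area` attribute derived from `width` and `height`. All names and values here are illustrative.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, split into bins of bin_size, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


# Hypothetical noisy values, assumed for illustration only.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, 3)
print(smoothed_prices)

# Attribute construction: derive a new attribute from the given ones.
records = [{"width": 2.0, "height": 3.0}, {"width": 4.0, "height": 1.5}]
for record in records:
    record["area"] = record["width"] * record["height"]
print(records)
```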
Normalization
– Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
• Normalization by decimal scaling
$$v' = \frac{v}{10^{j}}$$
where j is the smallest integer such that Max(|ν'|) < 1
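Both normalizations can be sketched in a few lines of Python. The z-score call reuses the figures from the example (73,600 with μ = 54,000 and σ = 16,000). The function names and the input to the decimal scaling step are hypothetical.

```python
def z_score(v, mean, std):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean) / std


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


z = z_score(73_600, 54_000, 16_000)
print(z)

# Hypothetical attribute values, assumed for illustration only.
j, scaled = decimal_scaling([-986, 917])
print(j, scaled)
```

With the hypothetical values −986 and 917, the largest absolute value is 986, so j = 3 and every scaled value falls strictly between −1 and 1.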
Summary