03preprocessing Part2

This chapter discusses data preprocessing techniques for data mining. It covers data cleaning, integration, reduction, transformation, and discretization. Specific techniques covered include schema integration to combine data from multiple sources, resolving conflicts when integrating data, and using correlation and covariance analysis to detect redundant attributes and evaluate relationships between numeric and nominal attributes. These preprocessing steps aim to improve data quality and prepare the data for mining models.

Uploaded by

baigsalman251
Copyright
© © All Rights Reserved

Data Mining

Dr. Shahid Mahmood Awan

http://turing.cs.pub.ro/mas_11
curs.cs.pub.ro
[email protected]
University of Management and Technology

Fall 2017
Data Mining:
Concepts and Techniques
(3rd ed.)

— Chapter 3 —

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 3: Data Preprocessing

 Data Preprocessing: An Overview

 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary
Data Integration
 Data integration:
 Combines data from multiple sources into a coherent store
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
 Entity identification problem:
 Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different
sources are different
 Possible reasons: different representations, different scales, e.g.,
metric vs. British units
Handling Redundancy in Data Integration

 Redundant data often occur when multiple databases are integrated
 Object identification: The same attribute or object
may have different names in different databases
 Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
 Redundant attributes can often be detected by
correlation analysis and covariance analysis
 Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
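
As a rough illustration of detecting a derivable attribute by correlation analysis, the sketch below flags attribute pairs whose Pearson correlation is close to ±1. The column names, the sample values, and the 0.95 threshold are all assumptions made up for this example; they do not come from the slides.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def redundant_pairs(columns, threshold=0.95):
    """Return attribute pairs whose |r| meets the (assumed) threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                pairs.append((a, b))
    return pairs


# Hypothetical integrated table: annual_revenue is derived from monthly_revenue.
monthly_revenue = [10.0, 12.5, 8.0, 15.0, 11.0]
columns = {
    "monthly_revenue": monthly_revenue,
    "annual_revenue": [12 * m for m in monthly_revenue],
    "store_age_years": [3, 7, 6, 2, 9],
}

flagged = redundant_pairs(columns)
print(flagged)
```

A flagged pair is only a candidate for removal; whether to drop one of the attributes is still a judgment call for the analyst.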
Correlation Analysis (Nominal Data)
 χ² (chi-square) test

$$\chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}$$
 The larger the Χ2 value, the more likely the variables are
related
 The cells that contribute the most to the Χ2 value are
those whose actual count is very different from the
expected count
 Correlation does not imply causality
 The number of hospitals and the number of car thefts in a city are correlated
 Both are causally linked to a third variable: population

Chi-Square Calculation: An Example

                          Play chess   Not play chess   Sum (row)
Like science fiction      250 (90)     200 (360)        450
Not like science fiction  50 (210)     1000 (840)       1050
Sum (col.)                300          1200             1500

 χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$
 It shows that like_science_fiction and play_chess are
correlated in the group
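
The calculation above can be sketched in a few lines of Python. The expected count of each cell is taken as (row total × column total) / grand total, which reproduces the parenthesised values in the table. The function name `chi_square` is just a label chosen for this example.

```python
def chi_square(observed):
    """Chi-square statistic and expected counts for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, o in enumerate(row):
            # Expected count if the two attributes were independent.
            e = row_totals[i] * col_totals[j] / total
            expected_row.append(e)
            chi2 += (o - e) ** 2 / e
        expected.append(expected_row)
    return chi2, expected


# Rows: like / not like science fiction; columns: play / not play chess.
observed = [[250, 200],
            [50, 1000]]

chi2, expected = chi_square(observed)
print(expected)        # expected counts: 90, 360, 210, 840
print(round(chi2, 2))  # about 507.94; the slide reports 507.93 from summing rounded terms
```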
Chi-square Table
Suppose that the ratio of male to female students in the Science Faculty
is exactly 1:1, but in the Pharmacology Honours class over the past ten
years there have been 80 females and 40 males. Is this a significant
departure from expectation?

                      Female   Male    Total
Observed numbers (O)  80       40      120
Expected numbers (E)  60       60      120
O − E                 20       −20     0
(O − E)²              400      400
(O − E)² / E          6.67     6.67    13.34 = χ²
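
A minimal sketch of this goodness-of-fit calculation follows. The unrounded statistic is about 13.33; the 13.34 in the table comes from adding the two rounded cells. The comparison against 3.841, the standard χ² critical value for 1 degree of freedom at the 0.05 level, is an addition for illustration and is not stated on the slide.

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [80, 40]                  # females, males
total = sum(observed)
expected = [total / 2, total / 2]    # a 1:1 ratio is the hypothesis being tested

chi2 = chi_square_gof(observed, expected)
degrees_of_freedom = len(observed) - 1   # two categories -> 1 degree of freedom

# Assumed significance level 0.05; 3.841 is the tabulated critical value for df = 1.
CRITICAL_VALUE_DF1_ALPHA_05 = 3.841
significant = chi2 > CRITICAL_VALUE_DF1_ALPHA_05

print(round(chi2, 2), degrees_of_freedom, significant)
```

Under these assumptions the statistic is well above the critical value, so the 80:40 split would count as a significant departure from 1:1.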
Degree of Freedom

Chi-square Table
Correlation Analysis (Numeric Data)

 Correlation coefficient (also called Pearson's product moment coefficient)

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum(a_i b_i)$ is the sum of the AB cross-product.
 If $r_{A,B} > 0$, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
 $r_{A,B} = 0$: no linear correlation (this alone does not establish independence); $r_{A,B} < 0$: negatively correlated
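
Both forms of the Pearson formula can be checked numerically. The sketch below uses the sample standard deviation (dividing by n − 1), which is what makes the (n − 1) factor in the denominator cancel correctly. The five data points are simply illustrative values chosen for this example.

```python
from math import sqrt


def sample_std(xs):
    """Sample standard deviation (divides by n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))


def pearson_r(a, b):
    """First form: sum of products of deviations from the means."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return numerator / ((n - 1) * sample_std(a) * sample_std(b))


def pearson_r_shortcut(a, b):
    """Second form: sum of cross-products minus n * mean(A) * mean(B)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    numerator = sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b
    return numerator / ((n - 1) * sample_std(a) * sample_std(b))


A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]

r = pearson_r(A, B)
print(round(r, 4), round(pearson_r_shortcut(A, B), 4))  # both about 0.94
```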

Visually Evaluating Correlation

[Figure: scatter plots showing the similarity from –1 to 1.]

Correlation (viewed as linear relationship)
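 Correlation measures the linear relationship between objects
 To compute correlation, we standardize the data objects A and B, take their dot product, and divide by n − 1 (with std taken as the sample standard deviation)

$$a'_k = \frac{a_k - \mathrm{mean}(A)}{\mathrm{std}(A)}$$

$$b'_k = \frac{b_k - \mathrm{mean}(B)}{\mathrm{std}(B)}$$

$$\mathrm{correlation}(A, B) = \frac{A' \cdot B'}{n - 1}$$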
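
The standardize-then-dot-product view can be sketched as follows. One detail worth making explicit: when std is the sample standard deviation, the raw dot product of the standardized vectors equals (n − 1) times the Pearson coefficient, so the sketch divides by n − 1. The data values are illustrative only.

```python
from statistics import mean, stdev


def standardize(xs):
    """Subtract the mean and divide by the sample standard deviation."""
    m = mean(xs)
    s = stdev(xs)
    return [(x - m) / s for x in xs]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]

A_std = standardize(A)
B_std = standardize(B)

# Dividing by n - 1 turns the dot product into the correlation coefficient.
correlation = dot(A_std, B_std) / (len(A) - 1)
print(round(correlation, 4))  # about 0.94
```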

Covariance (Numeric Data)
 Covariance is similar to correlation

$$\mathrm{Cov}(A, B) = E\big((A - \bar{A})(B - \bar{B})\big) = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}$$

Correlation coefficient:

$$r_{A,B} = \frac{\mathrm{Cov}(A, B)}{\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means or expected values of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B.
 Positive covariance: If Cov(A, B) > 0, then A and B both tend to be larger than their expected values at the same time.
 Negative covariance: If Cov(A, B) < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
 Independence: If A and B are independent, then Cov(A, B) = 0, but the converse is not true:
 Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
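
A small numeric illustration of why zero covariance does not imply independence: in the sketch below Y is completely determined by X (Y = X²), yet the covariance is 0 because the relationship is symmetric rather than linear. The data are made up for this example.

```python
def covariance(a, b):
    """Population covariance: mean of the products of deviations."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


X = [-2, -1, 0, 1, 2]
Y = [x ** 2 for x in X]   # Y depends entirely on X

cov_xy = covariance(X, Y)
print(cov_xy)  # 0.0, even though Y is a function of X
```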
Co-Variance: An Example

 It can be simplified in computation as

$$\mathrm{Cov}(A, B) = E(A \cdot B) - \bar{A}\bar{B}$$

 Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
 Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
 E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
 E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
 Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
 Thus, A and B rise together since Cov(A, B) > 0.
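
The stock example can be reproduced with a short sketch that computes the covariance both from the definition (mean product of deviations) and from the simplified form E(A·B) − E(A)·E(B). The function names are chosen for this example only.

```python
def covariance_definition(a, b):
    """Mean of (a_i - mean(A)) * (b_i - mean(B)), dividing by n."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def covariance_shortcut(a, b):
    """E(A*B) - E(A) * E(B), the simplified computational form."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b


# Weekly values of stocks A and B from the example.
prices = [(2, 5), (3, 8), (5, 10), (4, 11), (6, 14)]
A = [a for a, _ in prices]
B = [b for _, b in prices]

cov_def = covariance_definition(A, B)
cov_short = covariance_shortcut(A, B)
print(round(cov_def, 6), round(cov_short, 6))  # both 4.0
```

Both routes give the same positive value, which matches the conclusion that the two stocks tend to rise together.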
