03preprocessing Part2

This chapter discusses data preprocessing techniques for data mining. It covers data cleaning, integration, reduction, transformation, and discretization. Specific techniques covered include schema integration to combine data from multiple sources, resolving conflicts when integrating data, and using correlation and covariance analysis to detect redundant attributes and evaluate relationships between numeric and nominal attributes. These preprocessing steps aim to improve data quality and prepare the data for mining models.

Uploaded by

baigsalman251

Data Mining

Dr. Shahid Mahmood Awan

http://turing.cs.pub.ro/mas_11
curs.cs.pub.ro
[email protected]
University of Management and Technology

Fall 2017
Data Mining:
Concepts and Techniques
(3rd ed.)

— Chapter 3 —

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 3: Data Preprocessing

• Data Preprocessing: An Overview
  - Data Quality
  - Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary
Data Integration
• Data integration:
  - Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
  - Integrate metadata from different sources
• Entity identification problem:
  - Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
  - For the same real-world entity, attribute values from different sources are different
  - Possible reasons: different representations, different scales, e.g., metric vs. British units
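The integration steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources, the field names `cust_id` / `cust_#`, and the weight attribute are hypothetical stand-ins chosen to show schema mapping, a unit conversion (pounds to kilograms), and detection of value conflicts for the same entity.

```python
# Minimal sketch of schema integration and value-conflict detection.
# All records and field names are hypothetical examples.

LB_TO_KG = 0.45359237  # exact definition of the avoirdupois pound in kg

source_a = [{"cust_id": 1, "name": "Bill Clinton", "weight_kg": 80.0}]
source_b = [{"cust_#": 1, "name": "William Clinton", "weight_lb": 176.37}]


def normalize_b(record):
    """Map source B's schema and units onto source A's conventions."""
    return {
        "cust_id": record["cust_#"],                   # schema integration
        "name": record["name"],
        "weight_kg": record["weight_lb"] * LB_TO_KG,   # scale conversion
    }


def find_conflicts(a_records, b_records, tol=0.1):
    """Return (cust_id, attribute, value_a, value_b) for disagreeing values."""
    b_by_id = {r["cust_id"]: r for r in map(normalize_b, b_records)}
    conflicts = []
    for a in a_records:
        b = b_by_id.get(a["cust_id"])
        if b is None:
            continue  # entity only present in source A
        if a["name"] != b["name"]:
            conflicts.append((a["cust_id"], "name", a["name"], b["name"]))
        if abs(a["weight_kg"] - b["weight_kg"]) > tol:
            conflicts.append(
                (a["cust_id"], "weight_kg", a["weight_kg"], b["weight_kg"])
            )
    return conflicts


conflicts = find_conflicts(source_a, source_b)
print(conflicts)
```

After unit conversion the weights agree within the tolerance, so only the name mismatch is reported; deciding that "Bill Clinton" and "William Clinton" denote the same person is the entity identification problem and still needs a separate rule or human review.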
Handling Redundancy in Data Integration

• Redundant data often occur when multiple databases are integrated
  - Object identification: The same attribute or object may have different names in different databases
  - Derivable data: One attribute may be a "derived" attribute in another table, e.g., annual revenue
• Redundant attributes can often be detected by correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
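As a rough illustration of detecting redundant attributes by correlation analysis, the sketch below flags pairs of numeric columns whose Pearson correlation is close to ±1. The column names, the sample values, and the 0.95 threshold are assumptions made for this example, not values from the slides.

```python
from itertools import combinations
from statistics import mean, stdev


def pearson_r(a, b):
    """Pearson correlation using sample standard deviations (n - 1)."""
    n = len(a)
    mean_a, mean_b = mean(a), mean(b)
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * stdev(a) * stdev(b))


def redundant_pairs(columns, threshold=0.95):
    """Return attribute pairs whose |r| meets the (assumed) threshold."""
    return [
        (x, y)
        for x, y in combinations(columns, 2)
        if abs(pearson_r(columns[x], columns[y])) >= threshold
    ]


# Hypothetical integrated table: annual_revenue is derivable (12 x monthly).
table = {
    "monthly_revenue": [10.0, 12.0, 9.0, 15.0],
    "annual_revenue": [120.0, 144.0, 108.0, 180.0],
    "num_employees": [5, 3, 8, 4],
}

pairs = redundant_pairs(table)
print(pairs)
```

A flagged pair is only a candidate for removal; whether one attribute is truly derivable from the other still has to be checked against the schema.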
Correlation Analysis (Nominal Data)
• χ² (chi-square) test

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
  - The number of hospitals and the number of car thefts in a city are correlated
  - Both are causally linked to a third variable: population
Chi-Square Calculation: An Example

| | Play chess | Not play chess | Sum (row) |
|---|---|---|---|
| Like science fiction | 250 (90) | 200 (360) | 450 |
| Not like science fiction | 50 (210) | 1000 (840) | 1050 |
| Sum (col.) | 300 | 1200 | 1500 |

• χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$

• It shows that like_science_fiction and play_chess are correlated in the group
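The calculation in this example can be reproduced with a short Python sketch. Each expected count is taken as (row total × column total) / grand total, which yields the values shown in parentheses in the table. The function name `chi_square` is just a label chosen for this illustration.

```python
def chi_square(observed):
    """Chi-square statistic and expected counts for a contingency table.

    `observed` is a list of rows; each expected count is
    row_total * column_total / grand_total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / total
            expected_row.append(exp)
            chi2 += (obs - exp) ** 2 / exp
        expected.append(expected_row)
    return chi2, expected


# Rows: like / not like science fiction; columns: play / not play chess.
observed = [[250, 200], [50, 1000]]
chi2, expected = chi_square(observed)
print(expected)
print(chi2)
```

The statistic comes out at about 507.94 at full precision; the slide reports it as 507.93.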
Chi-square Table
Suppose that the ratio of male to female students in the Science Faculty
is exactly 1:1, but in the Pharmacology Honours class over the past ten
years there have been 80 females and 40 males. Is this a significant
departure from expectation?

| | Female | Male | Total |
|---|---|---|---|
| Observed numbers (O) | 80 | 40 | 120 |
| Expected numbers (E) | 60 | 60 | 120 |
| O - E | 20 | -20 | 0 |
| (O - E)² | 400 | 400 | |
| (O - E)² / E | 6.67 | 6.67 | 13.34 = χ² |
Degree of Freedom
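To answer the question in the Pharmacology example, the statistic is compared against a critical value for the appropriate degrees of freedom. The sketch below assumes a 0.05 significance level (the slides do not state one) and uses the standard tabulated critical value of 3.841 for one degree of freedom. With k categories and fixed totals, the degrees of freedom are k - 1.

```python
# Goodness-of-fit check for the 1:1 female-to-male expectation.
observed = {"female": 80, "male": 40}
expected_ratio = {"female": 0.5, "male": 0.5}

n = sum(observed.values())
expected = {k: n * p for k, p in expected_ratio.items()}

chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
df = len(observed) - 1  # two categories -> 1 degree of freedom

# Tabulated chi-square critical value for df = 1 at the (assumed) 0.05 level.
CRITICAL_0_05_DF1 = 3.841
significant = chi2 > CRITICAL_0_05_DF1

print(round(chi2, 2), df, significant)
```

At full precision the statistic is about 13.33 (the table's 13.34 is the sum of the two rounded terms). Either way it is well above 3.841, so under these assumptions the departure from a 1:1 ratio is significant.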

11/13/2023 Data Mining: Concepts and Techniques 9

Chi-square Table
Correlation Analysis (Numeric Data)

• Correlation coefficient (also called Pearson's product-moment coefficient)

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum(a_i b_i)$ is the sum of the AB cross-product.
• If $r_{A,B} > 0$, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
• $r_{A,B} = 0$: no linear correlation (independence implies this, but not the reverse); $r_{A,B} < 0$: negatively correlated
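Both forms of the correlation coefficient can be checked numerically. The sketch below is a plain-Python illustration that uses the sample standard deviation (`statistics.stdev`, which divides by n - 1) to match the (n - 1) factor in the formula. The data are the five stock pairs from the deck's covariance example, and the function names are chosen only for this example.

```python
from statistics import mean, stdev


def pearson_r(a, b):
    """First form: sum of products of deviations from the means."""
    n = len(a)
    mean_a, mean_b = mean(a), mean(b)
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * stdev(a) * stdev(b))


def pearson_r_shortcut(a, b):
    """Second form: sum of cross-products minus n * mean(A) * mean(B)."""
    n = len(a)
    cross = sum(x * y for x, y in zip(a, b)) - n * mean(a) * mean(b)
    return cross / ((n - 1) * stdev(a) * stdev(b))


a = [2, 3, 5, 4, 6]
b = [5, 8, 10, 11, 14]

r = pearson_r(a, b)
r_short = pearson_r_shortcut(a, b)
print(round(r, 4), round(r_short, 4))
```

Both forms give r ≈ 0.94, a strong positive correlation.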
Visually Evaluating Correlation

[Figure: scatter plots showing the similarity from -1 to 1.]
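Correlation (viewed as linear relationship)
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize the data objects, A and B, and then take their dot product

$$a'_k = \frac{a_k - \text{mean}(A)}{\text{std}(A)}$$

$$b'_k = \frac{b_k - \text{mean}(B)}{\text{std}(B)}$$

$$\text{correlation}(A, B) = \frac{A' \cdot B'}{n - 1}$$

Here n is the number of values and std is the sample standard deviation, so the dot product is divided by n - 1 to agree with $r_{A,B}$.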
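The standardize-then-dot-product view can be sketched as follows. This assumes the sample standard deviation, so the dot product is scaled by 1 / (n - 1) to land in the range [-1, 1]; the data values are made up for illustration.

```python
from statistics import mean, stdev


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def correlation(a, b):
    """Dot product of the standardized vectors, scaled by 1 / (n - 1)."""
    a_std, b_std = standardize(a), standardize(b)
    dot = sum(x * y for x, y in zip(a_std, b_std))
    return dot / (len(a) - 1)


# Hypothetical data: b falls by 2 whenever a rises by 1.
a = [1, 2, 3, 4, 5]
b = [10, 8, 6, 4, 2]

r_neg = correlation(a, b)
r_self = correlation(a, a)
print(r_neg, r_self)
```

A perfectly decreasing linear relationship gives a value of -1 (up to floating-point rounding), and any attribute correlated with itself gives 1.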
Covariance (Numeric Data)
• Covariance is similar to correlation

$$Cov(A,B) = E\big((A-\bar{A})(B-\bar{B})\big) = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{n}$$

Correlation coefficient:

$$r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means or expected values of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B.
• Positive covariance: If Cov(A,B) > 0, then A and B both tend to be larger than their expected values.
• Negative covariance: If Cov(A,B) < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
• Independence: If A and B are independent, then Cov(A,B) = 0, but the converse is not true:
  - Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
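The point that zero covariance does not imply independence can be illustrated with a small constructed example (not from the slides): let X take the values -1, 0, 1 and let Y = X². Y is completely determined by X, yet their covariance is 0.

```python
def covariance(a, b):
    """Population covariance: mean of the products of deviations."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


x = [-1, 0, 1]
y = [v * v for v in x]  # y depends entirely on x

cov_xy = covariance(x, y)
print(cov_xy)
```

The deviations of X cancel symmetrically, so the covariance vanishes even though knowing X fixes Y exactly.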
Co-Variance: An Example

• It can be simplified in computation as

$$Cov(A,B) = E(A \cdot B) - \bar{A}\bar{B}$$

• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
  - E(A) = (2 + 3 + 5 + 4 + 6) / 5 = 20/5 = 4
  - E(B) = (5 + 8 + 10 + 11 + 14) / 5 = 48/5 = 9.6
  - Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14) / 5 - 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
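The stock example can be verified with a few lines of Python. The sketch computes the covariance both from the definition (mean product of deviations) and from the shortcut E(A·B) - E(A)E(B); the helper name `expectation` is an arbitrary choice for this illustration.

```python
def expectation(values):
    """Arithmetic mean, used here as the expected value."""
    return sum(values) / len(values)


a = [2, 3, 5, 4, 6]
b = [5, 8, 10, 11, 14]

e_a = expectation(a)
e_b = expectation(b)

# Definition: E[(A - E(A)) * (B - E(B))]
cov_definition = expectation([(x - e_a) * (y - e_b) for x, y in zip(a, b)])

# Shortcut: E(A * B) - E(A) * E(B)
cov_shortcut = expectation([x * y for x, y in zip(a, b)]) - e_a * e_b

print(e_a, e_b, round(cov_definition, 6), round(cov_shortcut, 6))
```

Both routes give a covariance of 4 (up to floating-point rounding), matching the hand calculation.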
