Chapter 2: Data
What is Data?
Collection of data objects and their attributes
– An attribute is a property or characteristic of an object
– A collection of attributes describes an object
Attribute Values
– Attribute values are numbers or symbols assigned to an attribute
Measurement of Length
The way you measure an attribute may not match the attribute's properties.
[Figure: five objects of increasing length mapped onto two number scales, {1, 2, 3, 4, 5} and {5, 7, 8, 10, 15}; one mapping captures only the ordering of the lengths, while the other also captures their ratios.]
Types of Attributes
Nominal
– Description: the values of a nominal attribute are just different names; nominal attributes provide only enough information to distinguish one object from another (=, ≠)
– Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
– Operations: mode, entropy, contingency correlation, χ² test

Ratio
– Description: for ratio variables, both differences and ratios are meaningful (*, /)
– Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
– Operations: geometric mean, harmonic mean, percent variation
Discrete and Continuous Attributes
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of
documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.
Types of Data Sets
Record
– Data Matrix
– Document Data
– Transaction Data
Graph
– World Wide Web
– Molecular Structures
Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Important Characteristics of Structured Data
– Dimensionality
Curse of Dimensionality
– Sparsity
Only presence counts
– Resolution
Patterns depend on the scale
Record Data
– Data that consists of a collection of records, each of which consists of a fixed set of attributes

Data Matrix
– If data objects have the same fixed set of numeric attributes, they can be thought of as points in a multi-dimensional space, and the data set can be represented by an m-by-n matrix (m rows, one per object; n columns, one per attribute)
Document Data
– Each document becomes a term vector; each term is a component (attribute) of the vector, and its value is the number of times the term occurs in the document
[Table: document-term matrix; surviving term columns include team, coach, play, ball, score, game, win, lost, season.]
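A minimal sketch of how such term vectors can be built; the tokenizer, vocabulary, and sample document below are illustrative assumptions, not part of the original slides.

from collections import Counter

def term_vector(document, vocabulary):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

vocab = ["team", "coach", "play", "ball", "score", "game", "win", "lost", "season"]
doc = "The team lost the game although the coach praised the team"
print(term_vector(doc, vocab))  # [2, 1, 0, 0, 0, 1, 0, 1, 0]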
Transaction Data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
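One common in-memory representation treats each transaction as a set of items. The sketch below (my own, not from the slides) stores the table above that way and computes each item's support, the fraction of transactions containing it.

from collections import Counter

transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Support of an item = fraction of transactions that contain it
item_counts = Counter(item for items in transactions.values() for item in items)
for item, count in sorted(item_counts.items()):
    print(f"{item}: support = {count / len(transactions):.1f}")
# Milk appears in 4 of the 5 transactions, so its support is 0.8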
Graph Data
Example: generic graph and HTML links
[Figure: a small directed graph whose nodes correspond to the linked pages below.]
<a href="papers/papers.html#bbbb">Data Mining</a>
<li>
<a href="papers/papers.html#aaaa">Graph Partitioning</a>
<li>
<a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations</a>
<li>
<a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers</a>
Chemical Data
[Figure: benzene molecule, C6H6, shown as a ball-and-stick model.]
Ordered Data
Sequences of transactions
[Figure: a sequence of transactions; each element of the sequence is a set of items/events.]
Ordered Data
Genomic sequence data:
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
Spatio-temporal data
[Figure: average monthly temperature of land and ocean.]
Data Quality
Examples of data quality problems: noise and outliers, missing values, duplicate data.

Noise
– Noise refers to modification of original values

Outliers
– Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set

Missing Values
– Reasons: information was not collected, or an attribute is not applicable to all cases
– Handling: eliminate objects, estimate the value, or ignore it during analysis
Duplicate Data
– Data set may include data objects that are duplicates, or almost duplicates, of one another; a major issue when merging data from heterogeneous sources
Examples:
– Same person with multiple email addresses
Data cleaning
– Process of dealing with duplicate data issues
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
Aggregation
Combining two or more attributes (or objects) into a single attribute (or object)
Purpose
– Data reduction
Reduce the number of attributes or objects
– Change of scale
Cities aggregated into regions, states, countries, etc. (a small sketch follows below)
– More "stable" data
Aggregated data tends to have less variability
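As an illustration of the change-of-scale idea, a hypothetical pandas sketch (the column names and values are invented for the example): city-level records aggregated to state-level averages.

import pandas as pd

df = pd.DataFrame({
    "state": ["MN", "MN", "CA", "CA"],
    "month": ["Jan", "Jan", "Jan", "Feb"],
    "city":  ["Minneapolis", "St. Paul", "Fresno", "Fresno"],
    "precipitation": [0.9, 1.1, 2.0, 1.5],
})

# Aggregating cities into states reduces the number of objects,
# and the averaged values tend to be less variable ("more stable").
monthly = df.groupby(["state", "month"])["precipitation"].mean()
print(monthly)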
Aggregation
[Figure: variation of precipitation in Australia, comparing the standard deviation of average monthly precipitation with the standard deviation of average yearly precipitation.]
Sampling
Sampling is the main technique employed for data selection.
– It is often used for both the preliminary investigation of the data
and the final data analysis.
Sampling …
The key principle for effective sampling: using a sample will work almost as well as using the entire data set if the sample is representative. A sample is representative if it has approximately the same properties (of interest) as the original set of data.
Types of Sampling
Stratified sampling
– Split the data into several partitions; then draw random samples from each partition (a minimal sketch follows below)
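The sketch below is my own illustration of stratified sampling; the group labels, sizes, and the group_of accessor are assumptions made for the example.

import random

def stratified_sample(objects, group_of, n_per_group, seed=0):
    """Draw n_per_group random objects from each partition (stratum)."""
    rng = random.Random(seed)
    strata = {}
    for obj in objects:
        strata.setdefault(group_of(obj), []).append(obj)
    sample = []
    for group, members in strata.items():
        # Guard against strata smaller than the requested sample size
        sample.extend(rng.sample(members, min(n_per_group, len(members))))
    return sample

data = [("A", i) for i in range(100)] + [("B", i) for i in range(10)]
print(stratified_sample(data, group_of=lambda obj: obj[0], n_per_group=3))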
Sample Size
[Figure: the effect of sample size, showing the same data set sampled at 8000, 2000, and 500 points; the structure visible in the full data gradually disappears as the sample shrinks.]
Sample Size
What sample size is necessary to get at least one object from each of 10 groups?
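Assuming the 10 groups are equal-sized and objects are drawn uniformly at random (the standard setup for this question, and my assumption here), the probability can be computed by inclusion-exclusion. The function below is a sketch of that calculation.

from math import comb

def prob_all_groups(s, g=10):
    """P(a random sample of size s, drawn with replacement from g
    equal-sized groups, contains at least one member of every group),
    computed by inclusion-exclusion."""
    return sum((-1) ** i * comb(g, i) * (1 - i / g) ** s for i in range(g + 1))

for s in (10, 20, 40, 60):
    print(s, round(prob_all_groups(s), 3))
# The probability approaches 1 only when the sample is several
# times larger than the number of groups.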
Curse of Dimensionality
When dimensionality increases, data becomes increasingly sparse in the space that it occupies
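A small simulation of this effect (the setup, 500 points uniform in the unit hypercube, is my own choice): as dimensionality grows, the relative gap between the nearest and farthest point, (max_dist - min_dist) / min_dist, shrinks.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 50, 200):
    points = rng.random((500, d))
    # Distances from the first point to all the others
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    print(d, (dists.max() - dists.min()) / dists.min())
# The ratio drops toward 0: in high dimensions all points look
# almost equally far away, so distances lose their contrast.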
Dimensionality Reduction
Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce
noise
Techniques
– Principal Component Analysis (a sketch follows below)
– Singular Value Decomposition
– Others: supervised and non-linear techniques
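A minimal PCA sketch via the eigenvectors of the covariance matrix; the data here are random and the choice of k is arbitrary, purely for illustration.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))

# Center the data, then take eigenvectors of the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

# Project onto the k eigenvectors with the largest eigenvalues:
# these directions capture the largest amount of variation.
k = 2
components = eigvecs[:, -k:]
X_reduced = Xc @ components
print(X_reduced.shape)  # (200, 2)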
Dimensionality Reduction: PCA
Goal is to find a projection that captures the largest amount of variation in the data
[Figure: data points in the (x1, x2) plane with the principal eigenvector drawn along the direction of greatest variance.]
Dimensionality Reduction: PCA
[Figure: data reconstructed after reducing to different numbers of dimensions: 206 (original), 160, 120, 80, 40, and 10.]
Feature Subset Selection
Another way to reduce the dimensionality of data
Redundant features
– duplicate much or all of the information contained in
one or more other attributes
– Example: purchase price of a product and the amount
of sales tax paid
Irrelevant features
– contain no information that is useful for the data
mining task at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA
Feature Subset Selection
Techniques:
– Brute-force approach:
Try all possible feature subsets as input to data mining algorithm
– Embedded approaches:
Feature selection occurs naturally as part of the data mining
algorithm
– Filter approaches:
Features are selected before the data mining algorithm is run (a sketch of this idea follows below)
– Wrapper approaches:
Use the data mining algorithm as a black box to find best subset
of attributes
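As one example of a filter approach (the scoring criterion here, absolute correlation with the target, is a common choice but only one of many): features are ranked and selected before any data mining algorithm runs.

import numpy as np

def filter_select(X, y, k):
    """Keep the k features most correlated (in absolute value) with y."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    keep = np.argsort(scores)[-k:]
    return X[:, keep], keep

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))
y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(size=100)
_, kept = filter_select(X, y, k=2)
print(sorted(kept.tolist()))  # likely [0, 4], the two informative features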
Feature Creation
Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
Three general methodologies:
– Feature extraction (domain-specific)
– Mapping data to a new space
– Feature construction (combining features)
Mapping Data to a New Space
Fourier transform (a sketch follows below)
Wavelet transform
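A sketch of the Fourier-transform idea (the signal parameters are invented): two sine waves plus noise look irregular in the time domain, but their frequencies stand out in the magnitude spectrum.

import numpy as np

t = np.linspace(0, 1, 512, endpoint=False)
signal = np.sin(2 * np.pi * 7 * t) + np.sin(2 * np.pi * 17 * t)
noisy = signal + np.random.default_rng(3).normal(scale=0.5, size=t.size)

# Magnitude spectrum: peaks appear at the underlying frequencies.
spectrum = np.abs(np.fft.rfft(noisy))
freqs = np.fft.rfftfreq(t.size, d=1 / 512)
top = freqs[np.argsort(spectrum)[-2:]]
print(sorted(top))  # approximately [7.0, 17.0]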
Discretization Without Using Class Labels
[Figure: data with four natural groups discretized into four bins by equal interval width, equal frequency, and K-means.]
Attribute Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values
– Simple functions: x^k, log(x), e^x, |x|
– Standardization and normalization
Similarity and Dissimilarity
Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to a similarity or dissimilarity
Euclidean Distance
$$\text{dist} = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$
where $n$ is the number of dimensions (attributes) and $p_k$ and $q_k$ are the kth attributes (components) of data objects $p$ and $q$.
Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Distance Matrix
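A quick pure-Python check (my own sketch) that reproduces the distance matrix above from the four points.

from math import dist  # Euclidean distance, available since Python 3.8

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
for a, pa in points.items():
    row = [round(dist(pa, pb), 3) for pb in points.values()]
    print(a, row)
# p1 [0.0, 2.828, 3.162, 5.099] ... matching the matrix above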
Minkowski Distance
Minkowski distance is a generalization of Euclidean distance:
$$\text{dist} = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$$
where r is a parameter:
– r = 1: city block (Manhattan, L1 norm) distance
– r = 2: Euclidean (L2) distance
– r → ∞: supremum (Lmax or L∞ norm) distance, the maximum difference between any attribute of the objects
Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0

Distance Matrix
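The same four points run through scipy's pairwise-distance routine; scipy.spatial.distance.cdist is one standard way to get these matrices (the cityblock, euclidean, and chebyshev metrics correspond to r = 1, 2, and ∞).

import numpy as np
from scipy.spatial.distance import cdist

pts = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])
for name, metric in [("L1", "cityblock"), ("L2", "euclidean"), ("Linf", "chebyshev")]:
    print(name)
    print(np.round(cdist(pts, pts, metric=metric), 3))
# The three printed matrices match the L1, L2, and L-infinity tables above.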
Mahalanobis Distance
$$\text{mahalanobis}(p, q) = (p - q)\, \Sigma^{-1} (p - q)^T$$
where $\Sigma$ is the covariance matrix of the input data $X$:
$$\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)$$
Mahalanobis Distance
Covariance Matrix:
$$\Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}$$
A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)
Mahal(A, B) = 5
Mahal(A, C) = 4
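A short numpy check of the example above. Note that, as written on the slide, this is the squared form of the Mahalanobis distance (no square root is taken).

import numpy as np

cov = np.array([[0.3, 0.2],
                [0.2, 0.3]])
cov_inv = np.linalg.inv(cov)

def mahal2(p, q):
    """Squared Mahalanobis distance, as defined on the slide."""
    d = np.asarray(p) - np.asarray(q)
    return d @ cov_inv @ d

A, B, C = (0.5, 0.5), (0, 1), (1.5, 1.5)
print(round(mahal2(A, B), 6))  # 5.0
print(round(mahal2(A, C), 6))  # 4.0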
Common Properties of a Similarity
Similarities also have some well-known properties. If s(p, q) is the similarity between points p and q:
1. s(p, q) = 1 (or maximum similarity) only if p = q
2. s(p, q) = s(q, p) for all p and q (symmetry)
SMC versus Jaccard: Example
p = 1000000000
q = 0000001001
M01 = 2 (attributes where p is 0 and q is 1)
M10 = 1 (attributes where p is 1 and q is 0)
M00 = 7 (attributes where both are 0)
M11 = 0 (attributes where both are 1)
SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
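A small sketch (my own) that computes both measures directly from the match counts and reproduces the values above.

def smc_and_jaccard(p, q):
    """Simple Matching Coefficient and Jaccard for binary vectors."""
    m11 = sum(a == b == 1 for a, b in zip(p, q))
    m00 = sum(a == b == 0 for a, b in zip(p, q))
    smc = (m11 + m00) / len(p)
    # Jaccard ignores the 0-0 matches entirely
    jaccard = m11 / (len(p) - m00) if len(p) > m00 else 0.0
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(p, q))  # (0.7, 0.0)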
Cosine Similarity
If d1 and d2 are two document vectors, then
$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\|\, \|d_2\|}$$
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 × 2.449) = 0.3150
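A minimal sketch that checks the worked example above.

from math import sqrt

def cosine(d1, d2):
    """Cosine of the angle between two vectors: dot / (norm * norm)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = sqrt(sum(a * a for a in d1))
    n2 = sqrt(sum(b * b for b in d2))
    return dot / (n1 * n2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))  # 0.315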
Extended Jaccard Coefficient (Tanimoto)
– A variation of Jaccard for continuous or count attributes:
$$T(p, q) = \frac{p \cdot q}{\|p\|^2 + \|q\|^2 - p \cdot q}$$
– Reduces to the Jaccard coefficient for binary attributes
Correlation
Correlation measures the linear relationship between objects:
$$\text{correlation}(p, q) = \frac{\text{covariance}(p, q)}{\text{std}(p)\, \text{std}(q)} = \frac{s_{pq}}{s_p\, s_q}$$
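A quick numerical check of the definition (the sample vectors are invented): the covariance-over-standard-deviations form agrees with numpy's built-in routine.

import numpy as np

p = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
q = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Pearson correlation = covariance / (std_p * std_q)
cov_pq = np.cov(p, q, ddof=1)[0, 1]
corr = cov_pq / (p.std(ddof=1) * q.std(ddof=1))
print(round(corr, 4))
print(np.corrcoef(p, q)[0, 1])  # same value via the library routine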
Visually Evaluating Correlation
[Figure: scatter plots showing the similarity from –1 to 1.]
Using Weights to Combine Similarities
– May not want to treat all attributes the same; one common scheme weights each attribute's similarity $s_k$ by a non-negative weight $w_k$ with $\sum_k w_k = 1$:
$$\text{similarity}(p, q) = \sum_{k} w_k\, s_k(p, q)$$
Density
Examples:
– Euclidean density
Euclidean density = number of points per unit volume
– Probability density
– Graph-based density
Euclidean Density – Cell-based
– Simplest approach is to divide the region into a number of rectangular cells of equal volume and define density as the number of points each cell contains
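A minimal sketch of the cell-based idea (the grid size and the random data are assumptions for the example): divide the region into equal-size cells and count points per cell.

import numpy as np

rng = np.random.default_rng(4)
points = rng.random((1000, 2))  # points in the unit square

# Divide the region into a grid of equal-size cells and count the
# number of points in each cell: a cell-based Euclidean density.
counts, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=7)
print(counts)  # each entry is the number of points in one cell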