Data Mining CH2
Data Mining 1
Outline
• Types of Data
• Data Quality
• Data Pre-processing
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, dimension, or feature

Example (columns are attributes, rows are objects):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
A More Complete View of Data
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute for
a particular object
Measurement of Length
• The way you measure an attribute may not match the attribute's properties.
[Figure: five line segments, A through E, measured on two scales. The first scale (5, 7, 8, 10, 15) preserves only the ordering property of length. The second scale (1, 2, 3, 4, 5) preserves the ordering and additivity properties of length.]
Types of Attributes
Attribute Type: Nominal (categorical, qualitative)
Description: Nominal attribute values only distinguish. (=, ≠)
Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
Operations: mode, entropy, contingency correlation, χ² test
Asymmetric Attributes
Critiques
• Incomplete
– Asymmetric binary
– Cyclical
– Multivariate
– Partially ordered
– Partial membership
– Relationships between the data
Critiques …
More Complicated Examples
• ID numbers
– Nominal, ordinal, or interval?
• Biased Scale
– Interval or Ratio
Key Messages for Attribute Types
• The types of operations you choose should be “meaningful” for the type of
data you have
– Distinctness, order, meaningful intervals, and meaningful ratios are only
four properties of data
– The data type you see – often numbers or strings – may not capture all the
properties or may suggest properties that are not there
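The point about "meaningful" operations can be illustrated with a small sketch. All sample values and variable names below are hypothetical, chosen only to show one summary operation per attribute type; nothing here comes from the slides beyond the four properties themselves.

```python
from statistics import mean, median, mode

# Hypothetical sample values, for illustration only.
zip_codes = [55455, 55455, 10001, 94305]   # nominal: numbers used as labels
grades = [1, 2, 2, 3, 5]                   # ordinal: only the order matters
temps_celsius = [10.0, 20.0, 30.0]         # interval: differences are meaningful
lengths_m = [2.0, 4.0, 8.0]                # ratio: ratios are meaningful as well

most_common_zip = mode(zip_codes)          # distinctness is enough for the mode
middle_grade = median(grades)              # order is enough for the median
mean_temp = mean(temps_celsius)            # meaningful intervals allow a mean
length_ratio = lengths_m[1] / lengths_m[0] # meaningful ratios allow "twice as long"

# Python computes this without complaint, but an "average zip code"
# has no interpretation: the numeric type suggests properties that are not there.
meaningless = mean(zip_codes)

print(most_common_zip, middle_grade, mean_temp, length_ratio)
```

The last computation runs fine, which is exactly the risk: the stored data type does not stop you from applying an operation the attribute type does not support.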
Types of Data Sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Important Characteristics of Data
–Sparsity
• Only presence counts
–Resolution
• Patterns depend on the scale
–Size
• Type of analysis may depend on size of data
Record Data
Data Matrix
Document Data
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
Transaction Data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
Ordered Data
• Sequences of transactions
[Figure: a sequence of transactions; each element of the sequence is a set of items/events]
Ordered Data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean]
Data Quality
Data Quality …
Noise
• Causes?
Missing Values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Missing Values …
• Missing Completely at Random (MCAR)
– Missingness of a value is independent of attributes
– Fill in values based on the attribute
– Analysis may be unbiased overall
• Missing at Random (MAR)
– Missingness is related to other variables
– Fill in values based on other values
– Almost always produces a bias in the analysis
• Missing Not at Random (MNAR)
– Missingness is related to unobserved measurements
– Informative or non-ignorable missingness
• Not possible to know the situation from the data
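As a minimal sketch of "fill in values based on the attribute", the snippet below replaces missing entries with the mean of the observed entries of the same attribute. Mean imputation is only one possible reading of that bullet, and it is an assumption here; the function name and the sample ages are hypothetical.

```python
from statistics import mean

def impute_with_attribute_mean(values):
    """Replace None entries with the mean of the observed entries.

    Assumption: missing values are encoded as None. This is only
    defensible when the values are missing completely at random.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

ages = [25, None, 35, 30, None]  # hypothetical data
filled_ages = impute_with_attribute_mean(ages)
print(filled_ages)
```

Under MAR or MNAR the same code still runs, but the filled values inherit whatever bias drives the missingness.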
Duplicate Data
• Data set may include data objects that are duplicates, or almost
duplicates of one another
– Major issue when merging data from heterogeneous sources
• Examples:
– Same person with multiple email addresses
• Data cleaning
– Process of dealing with duplicate data issues
Similarity and Dissimilarity Measures
• Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
• Dissimilarity measure
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
Euclidean Distance
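• Euclidean Distance: $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$, where n is the number of dimensions (attributes) and $x_k$ and $y_k$ are the kth attributes of x and y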
Euclidean Distance
[Figure: scatter plot of the points p1 to p4]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix
    p1     p2     p3     p4
p1  0      2.828  3.162  5.099
p2  2.828  0      1.414  3.162
p3  3.162  1.414  0      2
p4  5.099  3.162  2      0
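The distance matrix for the four points can be reproduced with a short sketch. The function and variable names are illustrative; the coordinates are the ones in the point table.

```python
import math

# Coordinates from the point table.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(a, b):
    """Square root of the summed squared attribute differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

names = list(points)
distance_matrix = [
    [euclidean(points[r], points[c]) for c in names] for r in names
]

for name, row in zip(names, distance_matrix):
    print(name, " ".join(f"{d:.3f}" for d in row))
```

The matrix is symmetric with zeros on the diagonal, so in practice only the upper triangle needs to be computed.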
Minkowski Distance
Minkowski Distance: Examples
• r = 2. Euclidean distance
Minkowski Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1  p1  p2  p3  p4
p1  0   4   4   6
p2  4   0   2   4
p3  4   2   0   2
p4  6   4   2   0

L2  p1     p2     p3     p4
p1  0      2.828  3.162  5.099
p2  2.828  0      1.414  3.162
p3  3.162  1.414  0      2
p4  5.099  3.162  2      0

L∞  p1  p2  p3  p4
p1  0   2   3   5
p2  2   0   1   3
p3  3   1   0   2
p4  5   3   2   0

Distance Matrices
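The three matrices differ only in the exponent r of the Minkowski distance. A minimal sketch, assuming the standard definition (sum the absolute differences raised to r, then take the r-th root, with r = ∞ taken as the maximum absolute difference); the function name is illustrative:

```python
def minkowski(a, b, r):
    """Minkowski distance of order r; r = float('inf') gives the L-infinity distance."""
    diffs = [abs(ai - bi) for ai, bi in zip(a, b)]
    if r == float("inf"):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)

p1, p2, p3, p4 = (0, 2), (2, 0), (3, 1), (5, 1)

l1_p1_p4 = minkowski(p1, p4, 1)               # city block distance
l2_p1_p4 = minkowski(p1, p4, 2)               # Euclidean distance
linf_p1_p4 = minkowski(p1, p4, float("inf"))  # supremum distance

print(l1_p1_p4, round(l2_p1_p4, 3), linf_p1_p4)
```

For any pair of points the values shrink (or stay equal) as r grows, which matches the L1, L2, and L∞ entries in the tables.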
Mahalanobis Distance
$\mathrm{mahalanobis}(\mathbf{x}, \mathbf{y}) = (\mathbf{x} - \mathbf{y})^{T} \Sigma^{-1} (\mathbf{x} - \mathbf{y})$

Covariance matrix:
Σ = | 0.3  0.2 |
    | 0.2  0.3 |

A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)

Mahal(A,B) = 5
Mahal(A,C) = 4
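The two values can be checked by hand-inverting the 2×2 covariance matrix. The sketch below follows the slide's formula as written, without a square root; the helper names are illustrative and the code handles only the two-dimensional case.

```python
def inverse_2x2(m):
    """Invert a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis(x, y, cov):
    """(x - y)^T * cov^-1 * (x - y) for 2-D points; no square root is taken."""
    inv = inverse_2x2(cov)
    diff = [xi - yi for xi, yi in zip(x, y)]
    return sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))

cov = [[0.3, 0.2], [0.2, 0.3]]
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)

mahal_ab = mahalanobis(A, B, cov)
mahal_ac = mahalanobis(A, C, cov)
print(round(mahal_ab, 6), round(mahal_ac, 6))
```

C is farther from A than B is in Euclidean terms, yet its Mahalanobis value is smaller, because the A to C direction lies along the positive correlation encoded in the covariance matrix.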
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well known
properties.
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 if and only if x = y. (Positive definiteness)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle Inequality)
where d(x, y) is the distance (dissimilarity) between
points (data objects), x and y.
• A distance that satisfies these properties is a metric
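The three properties can be tested empirically on a finite set of points. The sketch below is a brute-force check with illustrative names; passing it on a sample does not prove that a function is a metric, but a single failure proves that it is not.

```python
import math
from itertools import product

def is_metric_on(points, d, tol=1e-12):
    """Check positive definiteness, symmetry, and the triangle inequality on a sample."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:
            return False                   # distances must be non-negative
        if (d(x, y) == 0) != (x == y):
            return False                   # zero distance exactly when x equals y
        if abs(d(x, y) - d(y, x)) > tol:
            return False                   # symmetry
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            return False                   # triangle inequality
    return True

def squared_euclidean(a, b):
    return math.dist(a, b) ** 2

sample = [(0, 2), (2, 0), (3, 1), (5, 1)]
collinear = [(0, 0), (1, 0), (2, 0)]

euclid_ok = is_metric_on(sample, math.dist)
squared_ok = is_metric_on(collinear, squared_euclidean)
print(euclid_ok, squared_ok)
```

Squared Euclidean distance fails on the collinear points because d((0,0),(2,0)) = 4 exceeds 1 + 1, so it is a dissimilarity but not a metric.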
Common Properties of a Similarity
• Similarities also have some well-known properties.
Similarity Between Binary Vectors
• Common situation is that objects, p and q, have only binary
attributes
SMC versus Jaccard: Example
x= 1000000000
y= 0000001001
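For the two vectors above, the comparison can be worked through in a short sketch. It assumes the usual definitions, which are not spelled out in this text: SMC counts both 1-1 and 0-0 matches over all attributes, while Jaccard counts 1-1 matches over the attributes where at least one vector has a 1. Function names are illustrative.

```python
def match_counts(x, y):
    """Return (f11, f10, f01, f00) for two equal-length binary vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    f11 = f10 = f01 = f00 = 0
    for a, b in zip(x, y):
        if a and b:
            f11 += 1
        elif a:
            f10 += 1
        elif b:
            f01 += 1
        else:
            f00 += 1
    return f11, f10, f01, f00

def smc(x, y):
    f11, f10, f01, f00 = match_counts(x, y)
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def jaccard(x, y):
    f11, f10, f01, _ = match_counts(x, y)
    denom = f11 + f10 + f01
    return f11 / denom if denom else 0.0  # convention chosen here for all-zero vectors

x = [int(c) for c in "1000000000"]
y = [int(c) for c in "0000001001"]

smc_xy = smc(x, y)
jaccard_xy = jaccard(x, y)
print(smc_xy, jaccard_xy)
```

The vectors share no 1s, so Jaccard is 0, while SMC is 0.7 because the seven shared 0s dominate. This is why Jaccard is usually preferred for asymmetric binary attributes.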
Cosine Similarity
Extended Jaccard Coefficient (Tanimoto)
Correlation measures the linear relationship
between objects
Visually Evaluating Correlation
[Figure: scatter plots showing the similarity from –1 to 1.]
Drawback of Correlation
$y_i = x_i^2$
• mean(x) = 0, mean(y) = 4
• std(x) = 2.16, std(y) = 3.74
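The drawback can be made concrete with a small sketch. The x values are not given in this text; the code assumes x = (−3, −2, −1, 0, 1, 2, 3), which reproduces the stated means and sample standard deviations. The function name is illustrative.

```python
from statistics import mean, stdev

# Assumption: these x values are inferred from the stated statistics.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]  # y is fully determined by x, but not linearly

def pearson(a, b):
    """Sample Pearson correlation: covariance over the product of standard deviations."""
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)
    return cov / (stdev(a) * stdev(b))

corr_xy = pearson(x, y)
print(mean(x), mean(y), round(stdev(x), 2), round(stdev(y), 2), corr_xy)
```

The correlation is 0 even though y depends entirely on x, because correlation captures only linear relationships.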
Comparison of Proximity Measures
• Domain of application
– Similarity measures tend to be specific to the type of attribute and
data
– Record data, images, graphs, sequences, 3D-protein structure, etc.
tend to have different measures
• However, one can talk about various properties that you would like a
proximity measure to have
– Symmetry is a common one
– Tolerance to noise and outliers is another
– Ability to find more types of patterns?
– Many others possible
• The measure must be applicable to the data and produce results that
agree with domain knowledge
Information Based Measures
Information and Probability
Entropy
• For
– a variable (event), X,
– with n possible values (outcomes), x1, x2 …, xn
– each outcome having probability, p1, p2 …, pn
– the entropy of X, H(X), is given by $H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$
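A minimal sketch of this definition, using the standard Shannon entropy in bits (the negative sum of each probability times its base-2 logarithm) and the usual convention that outcomes with probability 0 contribute nothing. The function name is illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 are skipped: p * log2(p) tends to 0 as p tends to 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])
four_equal = entropy([0.25, 0.25, 0.25, 0.25])
biased_coin = entropy([0.9, 0.1])
print(fair_coin, four_equal, round(biased_coin, 4))
```

Entropy is largest when all n outcomes are equally likely, where it equals log2 n, and it drops toward 0 as one outcome becomes nearly certain.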
Entropy Examples
Entropy for Sample Data: Example
Entropy for Sample Data
• Suppose we have
– a number of observations (m) of some attribute, X, e.g., the hair
color of students in the class,
– where there are n different possible values
– and the number of observations in the ith category is $m_i$
– Then, for this sample, $H(X) = -\sum_{i=1}^{n} \frac{m_i}{m} \log_2 \frac{m_i}{m}$
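For a sample, the probabilities are replaced by the observed fractions, the count in each category divided by the total number of observations m. A short sketch with hypothetical hair-color observations; the function name and the data are illustrative only:

```python
import math
from collections import Counter

def sample_entropy(observations):
    """Entropy in bits, estimated from category counts in a sample."""
    if not observations:
        raise ValueError("need at least one observation")
    m = len(observations)
    counts = Counter(observations)  # count of observations per category
    return -sum((mi / m) * math.log2(mi / m) for mi in counts.values())

# Hypothetical class of 8 students.
hair = ["black"] * 4 + ["brown"] * 2 + ["blond"] + ["red"]
hair_entropy = sample_entropy(hair)
print(hair_entropy)
```

Categories that never occur in the sample do not appear in the counts, so they contribute nothing to the sum.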
Mutual Information
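Formally, $I(X, Y) = H(X) + H(Y) - H(X, Y)$, where $H(X, Y)$ is the joint entropy of X and Y:
$H(X, Y) = -\sum_{i} \sum_{j} p_{ij} \log_2 p_{ij}$
where $p_{ij}$ is the probability that the ith value of X and the jth value of Y occur together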
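Mutual information can be computed directly from a joint probability table as the entropy of X plus the entropy of Y minus their joint entropy. The sketch below assumes that standard identity; the two example tables are hypothetical and the function names are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X, Y) = H(X) + H(Y) - H(X, Y) for a joint table joint[i][j] = p_ij."""
    px = [sum(row) for row in joint]            # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]      # marginal distribution of Y
    pxy = [p for row in joint for p in row]     # flattened joint distribution
    return entropy(px) + entropy(py) - entropy(pxy)

# Hypothetical joint distributions over two binary variables.
independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells nothing about Y
identical = [[0.5, 0.0], [0.0, 0.5]]        # Y is always equal to X

mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
print(mi_independent, mi_identical)
```

Independent variables share 0 bits, while identical binary variables share the full 1 bit of entropy that each one carries.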
General Approach for Combining Similarities
Using Weights to Combine Similarities
Density
Euclidean Density: Grid-based Approach