Chap2 Data
Types of Data
Data Quality
Data Preprocessing
Objects
– An attribute is also known as a variable, field, characteristic, dimension, or feature
A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

4   Yes  Married   120K  No
5   No   Divorced  95K   Yes
6   No   Married   60K   No
7   Yes  Divorced  220K  No
8   No   Single    85K   Yes
9   No   Married   75K   No
10  No   Single    90K   Yes
Attribute Values
[Figure: two scales for measuring length. One scale preserves only the ordering property of length; the other preserves the ordering and additivity properties of length.]
Types of Attributes
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.
01/27/2021 Introduction to Data Mining, 2nd Edition 11
Tan, Steinbach, Karpatne, Kumar
Asymmetric Attributes
Only presence (a non-zero attribute value) is regarded as
important
Words present in documents
Items present in customer transactions
Incomplete
– Asymmetric binary
– Cyclical
– Multivariate
– Partially ordered
– Partial membership
– Relationships between the data
– The data type you see – often numbers or strings – may not
capture all the properties or may suggest properties that are not
present
– Sparsity
Only presence counts
– Resolution
Patterns depend on the scale
– Size
Type of analysis may depend on size of data
Terms (as extracted; column order not preserved): timeout, season, coach, game, score, play, team, win, ball, lost

Document 1   3 0 5 0 2 6 0 2 0 2
Document 2   0 7 0 2 1 0 0 3 0 0
Document 3   0 1 0 0 1 2 2 0 3 0
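Because only the non-zero counts matter in data like this, one common way to store it is as (column, count) pairs. The following is a minimal sketch of that idea using the three document rows above; the helper name `to_sparse` is hypothetical, and columns are referred to by position only because the term-to-column mapping is not recoverable here.

```python
# Rows of the document-term matrix above; columns are referenced by
# position only, since the term order is not preserved in the text.
doc_term = {
    "Document 1": [3, 0, 5, 0, 2, 6, 0, 2, 0, 2],
    "Document 2": [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    "Document 3": [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
}


def to_sparse(row):
    """Keep only (column index, count) pairs for the non-zero entries."""
    return [(j, v) for j, v in enumerate(row) if v != 0]


sparse = {name: to_sparse(row) for name, row in doc_term.items()}

total = sum(len(row) for row in doc_term.values())
nonzero = sum(len(pairs) for pairs in sparse.values())
density = nonzero / total  # fraction of entries that are non-zero

for name, pairs in sparse.items():
    print(name, pairs)
print("density:", density)
```

Even in this tiny example half of the entries are zero; real document collections are usually far sparser, which is why the sparse form pays off.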
[Figure: sequences of transactions, with labels marking the items/events and an element of the sequence.]
Ordered Data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Spatio-Temporal Data
Average monthly temperature of land and ocean
[Figure: three panels plotting magnitude against time (seconds): two sine waves; the observed signal (sum of the two sine waves); the observed signal with noise.]
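The three panels described above can be imitated with a short sketch. The frequencies, the sampling step, and the noise level below are assumptions chosen for illustration; the slides do not state the values used in the figure.

```python
import math
import random


def two_sines(t, f1=5.0, f2=12.0):
    """Sum of two sine waves; both frequencies are hypothetical."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)


random.seed(0)                           # reproducible noise
times = [i / 1000 for i in range(501)]   # 0 to 0.5 seconds
clean = [two_sines(t) for t in times]    # observed signal without noise
noisy = [s + random.gauss(0.0, 0.5) for s in clean]  # assumed noise std 0.5

# Root-mean-square size of the noise that was added.
rms_noise = math.sqrt(
    sum((n - c) ** 2 for n, c in zip(noisy, clean)) / len(times)
)
print(f"RMS noise: {rms_noise:.3f}")
```

Plotting `clean` and `noisy` against `times` shows how noise distorts the shape of the underlying signal.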
Causes?
Missing Values
Examples:
– Same person with multiple email addresses
Data cleaning
– Process of dealing with duplicate data issues
Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
Dissimilarity measure
– Numerical measure of how different two data objects
are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
Euclidean Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
Distance Matrix
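The distance matrix above can be reproduced from the four points. This is a small sketch in plain Python; the function name `euclidean` is just an illustrative choice.

```python
import math

# The four points from the table above.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


names = list(points)
dist = {(a, b): euclidean(points[a], points[b]) for a in names for b in names}

for a in names:
    print(a, " ".join(f"{dist[(a, b)]:.3f}" for b in names))
```

The matrix is symmetric with zeros on the diagonal, so only the upper or lower triangle actually needs to be stored.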
Minkowski Distance
r = 2. Euclidean distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1     p2     p3     p4
p1    0      4      4      6
p2    4      0      2      4
p3    4      2      0      2
p4    6      4      2      0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1     p2     p3     p4
p1    0      2      3      5
p2    2      0      1      3
p3    3      1      0      2
p4    5      3      2      0
Distance Matrix
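The three matrices above differ only in the parameter r. A minimal sketch, assuming the usual Minkowski form (sum of absolute differences raised to r, then the r-th root, with the maximum difference as the limiting case for r = ∞):

```python
import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def minkowski(a, b, r):
    """Minkowski distance; r = math.inf gives the supremum (L-infinity) form."""
    diffs = [abs(ai - bi) for ai, bi in zip(a, b)]
    if r == math.inf:
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


names = list(points)
for label, r in (("L1", 1), ("L2", 2), ("Linf", math.inf)):
    print(label)
    for a in names:
        row = [minkowski(points[a], points[b], r) for b in names]
        print(" ", a, " ".join(f"{d:.3f}" for d in row))
```

For any pair of points the L1 distance is at least the L2 distance, which is at least the L∞ distance, as the three matrices show.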
Mahalanobis Distance
Covariance matrix: $\Sigma = \begin{bmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{bmatrix}$
A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)
Mahal(A,B) = 5
Mahal(A,C) = 4
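The two values above can be checked with a short sketch. It assumes the quadratic form (x − y)ᵀ Σ⁻¹ (x − y) without a final square root, because that is the form that reproduces 5 and 4; with the square root the values would be about 2.236 and 2. The helper names are hypothetical.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis_sq(x, y, cov):
    """Quadratic form (x - y)^T cov^{-1} (x - y); no square root is taken."""
    inv = inverse_2x2(cov)
    diff = [x[0] - y[0], x[1] - y[1]]
    return sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))


cov = [[0.3, 0.2], [0.2, 0.3]]
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)

print(round(mahalanobis_sq(A, B, cov), 6))
print(round(mahalanobis_sq(A, C, cov), 6))
```

B is closer to A than C is in Euclidean terms, yet its Mahalanobis value is larger, because B lies across the direction in which the data varies most.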
x= 1000000000
y= 0000001001
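The slide text that accompanied these two binary vectors did not survive extraction. Assuming they illustrate the simple matching coefficient (SMC) and the Jaccard coefficient, which are the usual measures for binary vectors, the following sketch counts the four match types and computes both.

```python
x = [int(ch) for ch in "1000000000"]
y = [int(ch) for ch in "0000001001"]

# Count the four combinations of attribute values.
f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)

# SMC counts 0-0 matches as agreement; Jaccard ignores them.
smc = (f11 + f00) / (f11 + f10 + f01 + f00)
denom = f11 + f10 + f01
jaccard = f11 / denom if denom else 0.0  # guard for two all-zero vectors

print("f11, f10, f01, f00 =", f11, f10, f01, f00)
print("SMC =", smc, "Jaccard =", jaccard)
```

The contrast between the two results is the point: for asymmetric binary attributes the shared zeros dominate SMC, while Jaccard looks only at positions where at least one vector is non-zero.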
Note: if we add new members that are far from the mean, the standard
deviation (SD) of the new set will be larger.
Scatter plots showing the similarity from –1 to 1.
[Figure: plot of Y against X for $y_i = x_i^2$.]
mean(x) = 0, mean(y) = 4
std(x) = 2.16, std(y) = 3.74
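The statistics above are consistent with x taking the integer values −3 to 3 and y being its square; that choice of x is an assumption, since the data values themselves are not in the surviving text. Under that assumption, a short sketch shows that the correlation is zero even though y is completely determined by x.

```python
import statistics

x = [-3, -2, -1, 0, 1, 2, 3]   # assumed values; they reproduce the stated statistics
y = [xi ** 2 for xi in x]      # y_i = x_i^2

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
std_x, std_y = statistics.stdev(x), statistics.stdev(y)  # sample std (n - 1)

# Sample covariance, then Pearson correlation.
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)
corr = cov_xy / (std_x * std_y)

print(f"mean(x) = {mean_x}, mean(y) = {mean_y}")
print(f"std(x) = {std_x:.2f}, std(y) = {std_y:.2f}")
print(f"corr(x, y) = {corr:.2f}")
```

Correlation measures only linear relationships, so a perfect but non-linear dependence such as this one can go undetected.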
Domain of application
– Similarity measures tend to be specific to the type of
attribute and data
– Record data, images, graphs, sequences, 3D-protein
structure, etc. tend to have different measures
However, one can talk about various properties that
you would like a proximity measure to have
– Symmetry is a common one
– Tolerance to noise and outliers is another
– Ability to find more types of patterns?
– Many others possible
The measure must be applicable to the data and
produce results that agree with domain knowledge
Information Based Measures
For
– a variable (event), X,
– with n possible values (outcomes), $x_1, x_2, \ldots, x_n$
– each outcome having probability $p_1, p_2, \ldots, p_n$
– the entropy of X, H(X), is given by
  $H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$
Suppose we have
– a number of observations (m) of some attribute, X,
e.g., the hair color of students in the class,
– where there are n different possible values
– And the number of observations in the i-th category is $m_i$
– Then, for this sample,
  $H(X) = -\sum_{i=1}^{n} \frac{m_i}{m} \log_2 \frac{m_i}{m}$
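The sample version of entropy replaces each probability with the observed fraction of observations in that category. A minimal sketch, with hypothetical hair-color observations made up for illustration:

```python
import math
from collections import Counter


def sample_entropy(observations):
    """Entropy in bits, using the fraction m_i / m for each observed category."""
    m = len(observations)
    counts = Counter(observations)
    return -sum((mi / m) * math.log2(mi / m) for mi in counts.values())


# Hypothetical sample: 4 black, 2 brown, 2 blond.
hair = ["black"] * 4 + ["brown"] * 2 + ["blond"] * 2
print(sample_entropy(hair))
```

Categories that never occur contribute nothing, which matches the convention that 0 log 0 is treated as 0. Entropy is 0 when every observation falls in one category and is largest, at log2 of the number of categories, when the categories are equally frequent.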
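Formally, $I(X, Y) = H(X) + H(Y) - H(X, Y)$, where $H(X, Y) = -\sum_i \sum_j p_{ij} \log_2 p_{ij}$
is the joint entropy and $p_{ij}$ is the probability that the i-th value of X and the j-th value of Y
occur together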
Mutual information of Student Status and Grade = 0.9928 + 1.4406 - 2.2710 = 0.1624
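Mutual information can be computed from a joint probability table as the two marginal entropies minus the joint entropy. The table behind the Student Status and Grade figures is not reproduced in this text, so the sketch below uses two small hypothetical tables instead and only re-checks the arithmetic of the stated result.

```python
import math


def entropy(probs):
    """Entropy in bits; zero probabilities are skipped (0 log 0 taken as 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(joint):
    """joint[i][j] is p_ij; rows are values of X, columns are values of Y."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    pxy = [p for row in joint for p in row]     # flattened joint
    return entropy(px) + entropy(py) - entropy(pxy)


independent = [[0.25, 0.25], [0.25, 0.25]]   # hypothetical: X tells nothing about Y
dependent = [[0.5, 0.0], [0.0, 0.5]]         # hypothetical: X determines Y

print(mutual_information(independent))
print(mutual_information(dependent))

# Arithmetic of the stated Student Status / Grade result.
print(round(0.9928 + 1.4406 - 2.2710, 4))
```

Independent variables give a mutual information of 0, and the value grows as knowing one variable says more about the other.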
Aggregation
Sampling
Discretization and Binarization
Attribute Transformation
Dimensionality Reduction
Feature subset selection
Feature creation
Data consists of four groups of points and two outliers. Data is
one-dimensional, but a random y component is added to reduce overlap.
Net Primary Production (NPP) is a measure of plant growth used by ecosystem scientists.
When dimensionality increases, data becomes increasingly sparse in the space that it occupies
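One way to see this effect is to draw random points and compare the largest and smallest pairwise distances as the number of dimensions grows. The sketch below is an illustration under assumed settings (100 uniform random points in the unit hypercube, a fixed seed); the function name `distance_contrast` is hypothetical.

```python
import math
import random


def distance_contrast(dim, n_points=100, seed=0):
    """(max - min) / min over all pairwise Euclidean distances."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(p, q)
        for i, p in enumerate(pts)
        for q in pts[i + 1:]
    ]
    return (max(dists) - min(dists)) / min(dists)


for dim in (2, 20, 200):
    print(dim, round(distance_contrast(dim), 3))
```

As the dimensionality rises the contrast shrinks: the nearest and farthest neighbors end up at similar distances, so distance-based notions such as density and nearest neighbor become less meaningful.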
Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce
noise
Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques
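As an illustration of the first technique, PCA on two-dimensional data can be written out by hand: center the data, form the 2x2 covariance matrix, and take its leading eigenvector as the direction of greatest variance. The sketch below uses the closed-form eigenvalues of a symmetric 2x2 matrix; the function name `pca_2d` and the sample points are hypothetical, and a real analysis would normally rely on a linear-algebra library.

```python
import math


def pca_2d(points):
    """Return (first principal direction, eigenvalues, 1-D scores) for 2-D data."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(p[0] - mx, p[1] - my) for p in points]

    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(u * u for u, _ in centered) / (n - 1)
    b = sum(u * v for u, v in centered) / (n - 1)
    c = sum(v * v for _, v in centered) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mid = (a + c) / 2
    half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mid + half_gap, mid - half_gap

    # Eigenvector for the larger eigenvalue.
    if b != 0:
        vx, vy = lam1 - c, b
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)

    # Project the centered data onto the first principal direction.
    scores = [u * direction[0] + v * direction[1] for u, v in centered]
    return direction, (lam1, lam2), scores


data = [(-1.0, -2.0), (0.0, 0.0), (1.0, 2.0)]  # hypothetical points on y = 2x
direction, eigenvalues, scores = pca_2d(data)
print(direction, eigenvalues, scores)
```

Because these sample points lie exactly on a line, the second eigenvalue is zero and the one-dimensional scores retain all of the variance; with real data the ratio of the eigenvalues indicates how much is lost by dropping the second component.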
Dimensionality Reduction: PCA