Chap2 Data
● Types of Data
● Data Quality
● Data Preprocessing
Objects
● An attribute is also known as variable, field, characteristic,
dimension, or feature
● A collection of attributes
describes an object
– Object is also known as
record, point, case, sample,
entity, or instance
Attribute Values
● Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
● Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
01/27/2021 Introduction to Data Mining, 2nd Edition 11
Tan, Steinbach, Karpatne, Kumar
Asymmetric Attributes
● Incomplete
– Asymmetric binary
– Cyclical
– Multivariate
– Partially ordered
– Partial membership
– Relationships between the data
– The data type you see (often numbers or strings) may not
capture all the properties or may suggest properties that are
not present
– Sparsity
◆ Only presence counts
– Resolution
◆ Patterns depend on the scale
– Size
◆ Type of analysis may depend on size of data
[Figure: Benzene molecule, C6H6]
Ordered Data
● Sequences of transactions
[Figure labels: Items/Events; An element of the sequence]
Ordered Data
● Spatio-Temporal Data
[Figure: Average monthly temperature of land and ocean]
● Causes?
Missing Values
● Examples of duplicate data:
– Same person with multiple email addresses
● Data cleaning
– Process of dealing with duplicate data issues
● Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
● Dissimilarity measure
– Numerical measure of how different two data objects
are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
● Proximity refers to a similarity or dissimilarity
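The two kinds of measure can be related in code. The following is a minimal sketch, assuming Euclidean distance as the dissimilarity and the conversion s = 1 / (1 + d) as one possible way to obtain a similarity in (0, 1]. Both choices, and the sample points, are illustrative assumptions rather than something the slides prescribe.

```python
import math

def euclidean(p, q):
    """Dissimilarity: 0 for identical points, no fixed upper limit."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def to_similarity(d):
    """Assumed conversion: maps a dissimilarity d >= 0 to a similarity in (0, 1]."""
    return 1.0 / (1.0 + d)

# Hypothetical sample points: b is close to a, c is far from a.
a, b, c = (1.0, 2.0), (1.0, 2.5), (6.0, 9.0)

d_ab, d_ac = euclidean(a, b), euclidean(a, c)
s_ab, s_ac = to_similarity(d_ab), to_similarity(d_ac)

print(f"d(a,b) = {d_ab:.3f}, s(a,b) = {s_ab:.3f}")
print(f"d(a,c) = {d_ac:.3f}, s(a,c) = {s_ac:.3f}")
```

The more alike pair (a, b) has the lower dissimilarity and the higher similarity, which is the relationship the bullets describe.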
Similarity/Dissimilarity for Simple Attributes
● Euclidean Distance
Distance Matrix
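A distance matrix holds the Euclidean distance between every pair of points. The sketch below shows one way to build it; the four 2-D points are hypothetical sample data chosen for illustration, and `distance_matrix` is a helper name introduced here.

```python
import math

def euclidean(p, q):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_matrix(points):
    """Entry [i][j] is the distance between points[i] and points[j]."""
    return [[euclidean(p, q) for q in points] for p in points]

# Hypothetical sample points (an assumption for this example).
points = [(0, 2), (2, 0), (3, 1), (5, 1)]
matrix = distance_matrix(points)

for row in matrix:
    print("  ".join(f"{d:5.3f}" for d in row))
```

The result is symmetric with zeros on the diagonal, so only the upper or lower triangle needs to be stored in practice.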
Minkowski Distance
● r = 2. Euclidean distance
Distance Matrix
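The Minkowski distance generalizes the Euclidean distance through the parameter r. A minimal sketch follows, assuming the usual definition (the r-th root of the summed r-th powers of the absolute differences), with r = infinity treated as the maximum difference. The two points are hypothetical.

```python
import math

def minkowski(p, q, r):
    """Minkowski distance with parameter r (r >= 1, or math.inf)."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(r):
        # Limiting case: the largest difference over all attributes.
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)

# Hypothetical points (an assumption for this example).
p, q = (0, 2), (3, 1)

l1 = minkowski(p, q, 1)            # city-block distance
l2 = minkowski(p, q, 2)            # Euclidean distance
l_inf = minkowski(p, q, math.inf)  # supremum distance

print(l1, round(l2, 3), l_inf)
```

Here r is a parameter of the measure and should not be confused with the number of attributes.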
Mahalanobis Distance
Covariance matrix: [shown in figure]
A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)
Mahal(A,B) = 5
Mahal(A,C) = 4
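The values quoted above can be checked numerically. The covariance matrix itself is not given in the text, so the sketch below ASSUMES the matrix [[0.3, 0.2], [0.2, 0.3]]. With that assumption, the squared form (p - q)^T Σ^-1 (p - q), without a square root, gives values consistent with the two figures quoted. The function name `mahalanobis_sq` is introduced here for illustration.

```python
def mahalanobis_sq(p, q, cov):
    """Squared Mahalanobis distance (p - q)^T cov^-1 (p - q) for 2-D points."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = p[0] - q[0], p[1] - q[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

# ASSUMED covariance matrix; it is not stated in the text.
cov = [[0.3, 0.2], [0.2, 0.3]]

A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)

mahal_ab = mahalanobis_sq(A, B, cov)
mahal_ac = mahalanobis_sq(A, C, cov)
print(round(mahal_ab, 6), round(mahal_ac, 6))
```

Under this assumption, C is farther from A than B is in Euclidean terms, yet its Mahalanobis value is smaller because the displacement from A to C lies along the direction of greatest variance.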
x= 1000000000
y= 0000001001
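The text does not say which measure these two binary vectors illustrate. As an assumption, the sketch below compares them with the simple matching coefficient (SMC), which counts 0-0 matches, and the Jaccard coefficient, which ignores them. The function name is hypothetical.

```python
def binary_similarities(x, y):
    """Return (SMC, Jaccard) for two equal-length 0/1 vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    # SMC: all matches over all attributes.
    smc = (f11 + f00) / (f11 + f00 + f10 + f01)
    # Jaccard: 1-1 matches over attributes where at least one vector is 1.
    nonzero = f11 + f10 + f01
    jaccard = f11 / nonzero if nonzero else 0.0
    return smc, jaccard

x = [int(bit) for bit in "1000000000"]
y = [int(bit) for bit in "0000001001"]

smc, jaccard = binary_similarities(x, y)
print(f"SMC = {smc:.1f}, Jaccard = {jaccard:.1f}")
```

For these vectors the SMC is 0.7 while the Jaccard coefficient is 0, since the vectors share no 1s. This is the situation asymmetric binary attributes are meant to capture, where only presence counts.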
[Figure: Scatter plots showing the similarity from –1 to 1]
y_i = x_i^2
● mean(x) = 0, mean(y) = 4
● std(x) = 2.16, std(y) = 3.74
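The statistics above can be reproduced with a small example. The x values are not listed in the text, so the sketch ASSUMES x = (-3, -2, -1, 0, 1, 2, 3), which matches the stated means and (sample) standard deviations. It then computes the Pearson correlation between x and y = x^2.

```python
import statistics

# ASSUMED x values; chosen because they match the stated mean and std.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

def pearson(x, y):
    """Pearson correlation using the sample covariance and sample std."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

print("mean(x) =", statistics.mean(x), " mean(y) =", statistics.mean(y))
print("std(x)  =", round(statistics.stdev(x), 2),
      " std(y)  =", round(statistics.stdev(y), 2))
print("corr(x, y) =", pearson(x, y))
```

The correlation comes out as zero even though y is completely determined by x, because correlation only measures linear relationships.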
● Domain of application
– Similarity measures tend to be specific to the type of
attribute and data
– Record data, images, graphs, sequences, 3D-protein
structure, etc. tend to have different measures
● However, one can talk about various properties that
you would like a proximity measure to have
– Symmetry is a common one
– Tolerance to noise and outliers is another
– Ability to find more types of patterns?
– Many others possible
● The measure must be applicable to the data and
produce results that agree with domain knowledge
Information Based Measures
Mutual information of Student Status and Grade = 0.9928 + 1.4406 - 2.2710 = 0.1624
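The sum above follows the identity I(X, Y) = H(X) + H(Y) - H(X, Y). The contingency table behind the numbers is not included in the text, so the sketch below uses ASSUMED counts that give entropies close to the three values quoted. The helper names are introduced here for illustration.

```python
import math

# ASSUMED (status, grade) counts; not given in the text.
counts = {
    ("Undergrad", "A"): 5, ("Undergrad", "B"): 30, ("Undergrad", "C"): 10,
    ("Grad", "A"): 30, ("Grad", "B"): 20, ("Grad", "C"): 5,
}

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def marginal(counts, axis):
    """Sum the joint counts over the other variable."""
    totals = {}
    for key, n in counts.items():
        totals[key[axis]] = totals.get(key[axis], 0) + n
    return totals

total = sum(counts.values())
h_status = entropy(n / total for n in marginal(counts, 0).values())
h_grade = entropy(n / total for n in marginal(counts, 1).values())
h_joint = entropy(n / total for n in counts.values())

mutual_information = h_status + h_grade - h_joint
print(f"H(status) = {h_status:.4f}, H(grade) = {h_grade:.4f}, "
      f"H(joint) = {h_joint:.4f}, MI = {mutual_information:.4f}")
```

Small differences in the last digit can arise depending on whether the entropies are rounded before or after they are combined.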
● Aggregation
● Sampling
● Discretization and Binarization
● Attribute Transformation
● Dimensionality Reduction
● Feature subset selection
● Feature creation
Data consists of four groups of points and two outliers. Data is one-dimensional, but a random y component is added to reduce overlap.
Net Primary Production (NPP) is a measure of plant growth used by ecosystem scientists.
● When dimensionality
increases, data becomes
increasingly sparse in the
space that it occupies
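One way to see this sparsity is to fix the number of points and count how much of the space they cover as dimensions are added. The sketch below is an illustrative experiment under assumed settings (1000 uniform random points, each axis split into 2 bins); it is not a procedure taken from the slides.

```python
import random

def occupied_fraction(n_points, dim, rng, bins=2):
    """Fraction of grid cells in the unit hypercube that contain a point."""
    cells = {
        tuple(int(rng.random() * bins) for _ in range(dim))
        for _ in range(n_points)
    }
    return len(cells) / bins ** dim

rng = random.Random(0)  # fixed seed so the run is repeatable
n_points = 1000         # assumed sample size

fractions = {dim: occupied_fraction(n_points, dim, rng) for dim in (2, 10, 20)}

for dim, frac in fractions.items():
    print(f"dim = {dim:2d}: {frac:.6f} of cells occupied")
```

With 20 dimensions there are 2^20 cells, so 1000 points can occupy at most about 0.1% of them, however they are distributed.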
● Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce
noise
● Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques
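As an illustration of the first technique, the following is a minimal PCA sketch restricted to 2-D data so that it needs no linear-algebra library. It finds the direction of greatest variance from the 2x2 covariance matrix in closed form and projects the centered points onto it. The data and the function name `pca_2d` are hypothetical, and a practical implementation would normally use a numerical library instead.

```python
import math

def pca_2d(points):
    """Return (first principal direction, 1-D scores) for 2-D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Angle of the axis of maximum variance for a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))

    # Reduce each point to a single coordinate along that direction.
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, scores

# Hypothetical data lying exactly on the line y = 2x.
points = [(1, 2), (2, 4), (3, 6), (4, 8)]
direction, scores = pca_2d(points)
print(direction, scores)
```

Because the sample points lie on a line, the single retained coordinate preserves all of the variance; with real data some variance is lost in the discarded direction.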
Dimensionality Reduction: PCA
Frequency