Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Data Mining: Concepts and
Techniques
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to
the characteristics found in the data and
grouping similar data objects into clusters
Unsupervised learning: no predefined classes
Typical applications
As a stand-alone tool to get insight into data
distribution
As a preprocessing step for other algorithms
Clustering: Rich Applications and
Multidisciplinary Efforts
Pattern Recognition
Spatial Data Analysis
Create thematic maps in GIS by clustering
feature spaces
Detect spatial clusters, or use clustering for other spatial mining
tasks
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar
access patterns
Examples of Clustering
Applications
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
Land use: Identification of areas of similar land use in an
earth observation database
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
City-planning: Identifying groups of houses according to their
house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters
should be clustered along continent faults
Quality: What Is Good
Clustering?
A good clustering method will produce high quality
clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both
the similarity measure used by the method and its
implementation
The quality of a clustering method is also
measured by its ability to discover some or all of
the hidden patterns
Measure the Quality of
Clustering
Dissimilarity/Similarity metric: Similarity is
expressed in terms of a distance function, typically
metric: d(i, j)
There is a separate “quality” function that
measures the “goodness” of a cluster.
The definitions of distance functions are usually
very different for interval-scaled, boolean,
categorical, ordinal, ratio, and vector variables.
Weights should be associated with different
variables based on applications and data
semantics.
It is hard to define “similar enough” or “good
enough”; the answer is typically highly subjective.
Requirements of Clustering in Data
Mining
Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Data Structures
Data matrix (two modes):
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
Dissimilarity matrix (one mode):
\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Standardize data
Calculate the mean absolute deviation:
\[ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) \]
If q = 2, d is Euclidean distance:
\[ d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} \]
Properties
d(i, j) ≥ 0
d(i, i) = 0
d(i, j) = d(j, i)
d(i, j) ≤ d(i, k) + d(k, j)
Also, one can use weighted distance, parametric
Pearson product moment correlation, or other
dissimilarity measures
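The Euclidean distance above is the q = 2 case of the Minkowski distance. A minimal sketch (function name is mine):

```python
def minkowski(x, y, q=2):
    """Minkowski distance of order q between two numeric vectors; q = 2 is Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

# The metric properties above hold: d(i, i) = 0, symmetry, triangle inequality.
d2 = minkowski([0, 0], [3, 4])        # 5.0 (Euclidean)
d1 = minkowski([0, 0], [3, 4], q=1)   # 7.0 (Manhattan)
```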
Binary Variables
A contingency table for binary data (rows: object i, columns: object j):

        1      0      sum
 1      a      b      a+b
 0      c      d      c+d
 sum   a+c    b+d      p

Distance measure for symmetric binary variables:
\[ d(i, j) = \frac{b + c}{a + b + c + d} \]
Distance measure for asymmetric binary variables:
\[ d(i, j) = \frac{b + c}{a + b + c} \]
Jaccard coefficient (similarity measure for asymmetric binary variables):
\[ sim_{Jaccard}(i, j) = \frac{a}{a + b + c} \]
Dissimilarity between Binary
Variables
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
gender is a symmetric attribute
the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set
to 0
\[ d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33 \]
\[ d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67 \]
\[ d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75 \]
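The three distances in the example can be checked directly with the asymmetric binary formula (function name is mine; gender, the symmetric attribute, is excluded):

```python
def asym_binary_distance(i, j):
    """d(i, j) = (b + c) / (a + b + c) for asymmetric binary variables."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)  # both 1
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)  # i only
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)  # j only
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1..Test-4, with Y/P -> 1 and N -> 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

d_jack_mary = round(asym_binary_distance(jack, mary), 2)  # 0.33
d_jack_jim  = round(asym_binary_distance(jack, jim), 2)   # 0.67
d_jim_mary  = round(asym_binary_distance(jim, mary), 2)   # 0.75
```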
Nominal Variables
Method: simple matching, with m = # of matches and p = total # of variables:
\[ d(i, j) = \frac{p - m}{p} \]
Ordinal variables: compute ranks r_{if} and map each value onto [0, 1] by
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
Example
[Figure: the k-means algorithm with K = 2. Arbitrarily choose K objects as the initial cluster means; assign each object to the most similar center; update the cluster means; reassign the objects; repeat until the assignments no longer change.]
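The assign/update loop from the figure can be sketched as follows (a minimal 2-D version; function name, the fixed iteration count, and the sample points are mine):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 2-D points: assign each point to the nearest mean,
    then recompute each mean as the centroid of its cluster."""
    rng = random.Random(seed)
    means = rng.sample(points, k)          # arbitrarily choose k initial means
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                   # assignment step
            i = min(range(k),
                    key=lambda i: (p[0] - means[i][0]) ** 2 + (p[1] - means[i][1]) ** 2)
            clusters[i].append(p)
        for i, c in enumerate(clusters):   # update step
            if c:
                means[i] = (sum(p[0] for p in c) / len(c),
                            sum(p[1] for p in c) / len(c))
    return means, clusters

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
means, clusters = kmeans(pts, k=2)
```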
[Figure: the k-medoids method with K = 2. Arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid; randomly select a nonmedoid object O_random; compute the total cost of swapping (Total Cost = 26 in the example); swap O and O_random if the quality is improved; repeat until no change.]
PAM (Partitioning Around Medoids) (1987)
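The swap loop can be sketched as follows (a naive version, not the case-based cost update PAM uses; function names, the Manhattan distance choice, and the sample points are mine):

```python
def total_cost(points, medoids):
    """Total cost: sum of Manhattan distances from each point to its nearest medoid."""
    return sum(min(abs(p[0] - m[0]) + abs(p[1] - m[1]) for m in medoids)
               for p in points)

def pam(points, k=2):
    """Naive PAM: arbitrary initial medoids, then swap a medoid with a
    nonmedoid object whenever the swap lowers the total cost."""
    medoids = list(points[:k])
    improved = True
    while improved:                        # until no change
        improved = False
        for m in list(medoids):
            for o in points:               # candidate nonmedoid object
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids = candidate    # quality is improved: keep the swap
                    improved = True
                    break
            if improved:
                break
    return medoids

pts = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
medoids = pam(pts, k=2)                    # one medoid per group: (1,1) and (8,8)
```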
[Figure: the four cases for computing the reassignment cost of a non-selected object j when the current medoid i is swapped with a candidate h, relative to another medoid t.]
[Figure: scatter plot of the points (3,4), (2,6), (4,5), (4,7), (3,8), which form one subcluster; its clustering feature is CF = (N, LS, SS) = (5, (16,30), 244).]
Clustering feature:
summary of the statistics for a given subcluster: the 0-th,
1st and 2nd moments of the subcluster from the statistical
point of view.
Registers crucial measurements for computing clusters and
utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
A nonleaf node in a tree has descendants or “children”
The nonleaf nodes store sums of the CFs of their children
A CF tree has two parameters
Branching factor: specifies the maximum number of children
Threshold: max diameter of sub-clusters stored at the leaf
nodes
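A clustering feature can be held as the triple (N, LS, SS): the number of points, their linear sum, and their square sum. Because all three components are additive, a nonleaf node can store the sum of its children's CFs. A minimal 2-D sketch (class name is mine) that reproduces the CF of the five example points:

```python
class CF:
    """Clustering feature (N, LS, SS): count, linear sum, and square sum."""
    def __init__(self, point=None):
        self.n, self.ls, self.ss = 0, (0.0, 0.0), 0.0
        if point is not None:
            self.add(point)

    def add(self, p):
        """Absorb one 2-D point into the summary."""
        self.n += 1
        self.ls = (self.ls[0] + p[0], self.ls[1] + p[1])
        self.ss += p[0] ** 2 + p[1] ** 2

    def merge(self, other):
        """CFs are additive, so a nonleaf node can store the sum of its children."""
        self.n += other.n
        self.ls = (self.ls[0] + other.ls[0], self.ls[1] + other.ls[1])
        self.ss += other.ss

cf = CF()
for p in [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]:
    cf.add(p)
# cf.n == 5, cf.ls == (16, 30), cf.ss == 244
```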
The CF Tree Structure
[Figure: a CF tree with a root and non-leaf nodes; each node holds entries CF_1, CF_2, ..., each pointing to a child node child_1, child_2, ...]
Major ideas
Use links to measure similarity/proximity
Not distance-based
Computational complexity:
O(n^2 + n m_m m_a + n^2 log n), where m_m is the maximum
number of neighbors and m_a the average number of neighbors
Algorithm: sampling-based clustering
Draw random sample
Experiments
Congressional voting, mushroom data
[Figure: sampling-based clustering pipeline: data set, merged partitions, final clusters.]
p belongs to NEps(q), and the
core point condition holds:
|NEps(q)| >= MinPts (e.g., Eps = 1 cm, MinPts = 5)
Density-Reachable and Density-Connected
Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps,
MinPts if there is a chain of points p1, ..., pn, p1 = q, pn = p
such that pi+1 is directly density-reachable from pi
Density-connected:
A point p is density-connected to a point q w.r.t. Eps, MinPts
if there is a point o such that both p and q are
density-reachable from o w.r.t. Eps, MinPts
DBSCAN: Density Based Spatial
Clustering of Applications with Noise
Relies on a density-based notion of cluster: A
cluster is defined as a maximal set of density-
connected points
Discovers clusters of arbitrary shape in spatial
databases with noise
[Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5.]
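The definitions above can be turned into a compact sketch (a simplified O(n^2) version without spatial indexing; function name and sample points are mine):

```python
def dbscan(points, eps, min_pts):
    """Label each 2-D point with a cluster id; -1 marks noise (outliers)."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if (points[i][0] - q[0]) ** 2 + (points[i][1] - q[1]) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:          # core point condition fails
            labels[i] = -1                # tentatively noise
            continue
        labels[i] = cid                   # i is a core point: start a cluster
        queue = list(seeds)
        while queue:                      # collect all density-reachable points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid           # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:      # j is also a core point: expand through it
                queue.extend(nbrs)
        cid += 1
    return labels

pts = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9), (5, 0)]
labels = dbscan(pts, eps=1.5, min_pts=3)  # two clusters; (5, 0) is noise (-1)
```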
OPTICS: A Cluster-Ordering Method (SIGMOD’99)
Produces a special order of the database w.r.t. its
density-based clustering structure
Cluster-order of the objects
[Figure: Density-Based Clustering: OPTICS & Its Applications]
Influence and density functions (Gaussian):
\[ f^D_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}} \]
\[ \nabla f^D_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x)\, e^{-\frac{d(x, x_i)^2}{2\sigma^2}} \]
Major features
Solid mathematical foundation
Good for data sets with large amounts of noise
Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
Significantly faster than existing algorithms (e.g., DBSCAN)
But needs a large number of parameters
Denclue: Technical Essence
Uses grid cells but only keeps information about
grid cells that do actually contain data points and
manages these cells in a tree-based access
structure
Influence function: describes the impact of a data
point within its neighborhood
Overall density of the data space can be
calculated as the sum of the influence function of
all data points
Clusters can be determined mathematically by
identifying density attractors
Density attractors are local maxima of the overall
density function
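The overall density as a sum of Gaussian influence functions can be sketched directly from the formula above (function name and sample data are mine):

```python
import math

def density(x, data, sigma=1.0):
    """Overall density at x: sum of the Gaussian influence functions
    of all data points, per the f_Gaussian formula."""
    return sum(math.exp(-((x[0] - p[0]) ** 2 + (x[1] - p[1]) ** 2) / (2 * sigma ** 2))
               for p in data)

data = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0)]
dense = density((1.0, 1.0), data)    # near the tight group: high density
sparse = density((5.0, 5.0), data)   # near the isolated point: lower density
```

A density attractor would be found by hill-climbing along the gradient of this function; points whose hill-climbing paths end at the same attractor form one cluster.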
[Figure: Density Attractor]
Incremental update
O(K), where K is the number of grid cells at the
lowest level
Disadvantages:
All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected
WaveCluster: Clustering by Wavelet Analysis
(1998)
Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
A multi-resolution clustering approach which applies wavelet
transform to the feature space
How to apply wavelet transform to find clusters
Summarizes the data by imposing a multidimensional
grid structure onto data space
These multidimensional spatial data objects are
represented in an n-dimensional feature space
Apply wavelet transform on feature space to find the
dense regions in the feature space
Apply wavelet transform multiple times which result in
clusters at different scales from fine to coarse
Data Mining: Concepts and
Techniques
Wavelet Transform
Wavelet transform: A signal processing technique
that decomposes a signal into different frequency
sub-bands (can be applied to n-dimensional signals)
Data are transformed to preserve relative distance
between objects at different levels of resolution
Allows natural clusters to become more
distinguishable
Why is wavelet transformation useful for clustering?
Use hat-shape filters to emphasize regions where points
cluster and suppress weaker information on their boundaries
Effective removal of outliers
Major features:
Complexity O(N)
EM (Expectation maximization), AutoClass
Machine learning approach
COBWEB, CLASSIT
Neural network approach
SOM (Self-Organizing Feature Map)
EM — Expectation Maximization
EM — A popular iterative refinement algorithm
An extension to k-means
Assign each object to a cluster according to a weight (prob.
distribution)
New means are computed based on weighted measures
General idea
Starts with an initial estimate of the parameter vector
Iteratively rescores the patterns against the mixture density
produced by the parameter vector
The rescored patterns are used to update the parameter
estimates
Patterns are assigned to the same cluster if they are placed
by their scores in a particular component
The algorithm converges fast but may not reach the global
optimum
The EM (Expectation Maximization)
Algorithm
Expectation step: assign each object to a cluster with the
probability that it belongs to the cluster
Maximization step:
Estimation of model parameters
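The E- and M-steps can be sketched for a two-component 1-D Gaussian mixture (a simplification I chose: equal priors and unit variances, so only the two means are re-estimated; function name and sample data are mine):

```python
import math

def em_two_gaussians(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture, simplified to equal
    priors and unit variances, so only the two means are re-estimated."""
    mu = [min(xs), max(xs)]                       # initial parameter estimates
    for _ in range(iters):
        # E-step: membership weight of component 0 for each object
        w = []
        for x in xs:
            p0 = math.exp(-(x - mu[0]) ** 2 / 2)
            p1 = math.exp(-(x - mu[1]) ** 2 / 2)
            w.append(p0 / (p0 + p1))
        # M-step: weighted means maximize the likelihood under these weights
        mu[0] = sum(wi * x for wi, x in zip(w, xs)) / sum(w)
        mu[1] = sum((1 - wi) * x for wi, x in zip(w, xs)) / sum(1 - wi for wi in w)
    return mu

mu = em_two_gaussians([0.0, 0.2, -0.1, 5.0, 5.3, 4.9])
# mu[0] converges near 0 and mu[1] near 5
```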
Conceptual clustering produces a classification scheme for a
set of unlabeled objects
Finds characteristic description for each concept
(class)
COBWEB (Fisher’87)
A popular and simple method of incremental
conceptual learning
Creates a hierarchical clustering in the form of a
classification tree
Each node refers to a concept and contains a
probabilistic description of that concept
Competitive learning
Involves a hierarchical architecture of several units
(neurons)
Neurons compete in a “winner-takes-all” fashion for
the object currently being presented
Data Mining: Concepts and
Techniques
CLIQUE (Clustering In QUEst)
[Figure: CLIQUE example. The (age, salary) and (age, vacation) planes are partitioned into grid units; units containing at least τ = 3 points are dense, and the dense units in the two subspaces intersect to form candidate higher-dimensional clusters.]
Strength
Automatically finds subspaces of the highest dimensionality
such that high-density clusters exist in those subspaces
Bi-clustering: a submatrix is a δ-cluster if H(I, J) ≤ δ for some
δ > 0, where H(I, J) is the mean squared residue score of the
submatrix
Problems with bi-cluster
No downward closure property
p-Clustering:
Clustering by
Pattern Similarity
Given objects x, y in O and features a, b in T, the pScore of the
2 × 2 matrix is
\[ pScore\begin{pmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{pmatrix} = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})| \]
A pair (O, T) is in δ-pCluster if for any 2 × 2 matrix X in (O,
T), pScore(X) ≤ δ for some δ > 0
Properties of δ-pCluster
Downward closure
Clusters are more homogeneous than bi-cluster (thus the
name: pair-wise Cluster)
A pattern-growth algorithm has been developed for efficient
mining
For scaling patterns, one can observe that taking logarithms on
the ratio condition
\[ \frac{d_{xa} / d_{ya}}{d_{xb} / d_{yb}} \le \delta \]
leads back to the pScore form
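The pScore definition is a one-liner; a shifting pattern (one object tracking another with a constant offset) scores 0 and therefore fits any δ-pCluster (function name is mine):

```python
def pscore(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2x2 matrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))

# y tracks x with a constant offset of 10: a perfect shifting pattern
s0 = pscore(3, 5, 13, 15)   # |(3-5) - (13-15)| = 0
s1 = pscore(3, 5, 13, 16)   # |(3-5) - (13-16)| = 1, a deviation of 1
```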
Outlier Detection
What Is Outlier Discovery?
What are outliers?
The set of objects that are considerably dissimilar from
the remainder of the data
Example: Wayne Gretzky, ...
Problem: Define and find outliers in large data sets
Applications:
Credit card fraud detection
Customer segmentation
Medical analysis
Drawbacks
Most tests are for a single attribute
Algorithms for mining distance-based outliers:
Nested-loop algorithm
Cell-based algorithm
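The nested-loop idea can be sketched with the DB(p, d)-outlier notion: an object is an outlier if at least fraction p of all objects lie farther than distance d from it (function name and sample points are mine; a real implementation would block the inner loop to limit I/O):

```python
def db_outliers(points, p, d):
    """Nested-loop detection of distance-based outliers: an object is an
    outlier if at least fraction p of all objects lie farther than d from it."""
    result = []
    for o in points:                       # outer loop over candidate objects
        far = sum(1 for q in points        # inner loop counts distant objects
                  if ((o[0] - q[0]) ** 2 + (o[1] - q[1]) ** 2) ** 0.5 > d)
        if far >= p * len(points):
            result.append(o)
    return result

pts = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]
outliers = db_outliers(pts, p=0.75, d=3)   # [(10, 10)]
```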
Density-Based Local Outlier Detection (LOF)
Assumes the notion of outlier is not crisp; each point is
assigned a local outlier factor (LOF)
Ex. C1 contains 400 loosely distributed points, C2 has
100 tightly condensed points