
Chapter 7. Cluster Analysis
(Data Mining: Concepts and Techniques)
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to
the characteristics found in the data and
grouping similar data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data
distribution
 As a preprocessing step for other algorithms
Clustering: Rich Applications and Multidisciplinary Efforts
 Pattern Recognition
 Spatial Data Analysis
 Create thematic maps in GIS by clustering
feature spaces
 Detect spatial clusters or for other spatial mining
tasks
 Image Processing
 Economic Science (especially market research)
 WWW
 Document classification
 Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
 Land use: Identification of areas of similar land use in an
earth observation database
 Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
 City-planning: Identifying groups of houses according to their
house type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
Quality: What Is Good Clustering?
 A good clustering method will produce high quality
clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both
the similarity measure used by the method and its
implementation
 The quality of a clustering method is also
measured by its ability to discover some or all of
the hidden patterns
Measure the Quality of Clustering
 Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, typically a metric, d(i, j)
 There is a separate “quality” function that
measures the “goodness” of a cluster.
 The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables.
 Weights should be associated with different
variables based on applications and data
semantics.
 It is hard to define “similar enough” or “good enough”; the answer is typically highly subjective.
Requirements of Clustering in Data Mining
 Scalability
 Ability to deal with different types of attributes
 Ability to handle dynamic data
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to
determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Data Structures
 Data matrix (two modes):
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
 Dissimilarity matrix (one mode):
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
Type of data in clustering analysis

 Interval-scaled variables
 Binary variables
 Nominal, ordinal, and ratio variables
 Variables of mixed types

Interval-valued variables

 Standardize data
 Calculate the mean absolute deviation:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
where
$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$
 Calculate the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

 Using the mean absolute deviation is more robust than using the standard deviation
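A minimal NumPy sketch of this standardization (the data matrix is made up for illustration):

```python
import numpy as np

def standardize(X):
    """Standardize each variable f using the mean absolute deviation
    s_f rather than the standard deviation (more robust to outliers)."""
    m = X.mean(axis=0)               # m_f: per-variable mean
    s = np.abs(X - m).mean(axis=0)   # s_f: mean absolute deviation
    return (X - m) / s               # z_if = (x_if - m_f) / s_f

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, 40.0]])
print(standardize(X))
```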
Similarity and Dissimilarity
Between Objects

 Distances are normally used to measure the similarity or dissimilarity between two data objects
 Some popular ones include the Minkowski distance:
$$d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$$
where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two p-dimensional data objects, and q is a positive integer
 If q = 1, d is the Manhattan distance:
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
Similarity and Dissimilarity
Between Objects (Cont.)

 If q = 2, d is the Euclidean distance:
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
 Properties
 d(i, j) ≥ 0
 d(i, i) = 0
 d(i, j) = d(j, i)
 d(i, j) ≤ d(i, k) + d(k, j)
 Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
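A minimal Python sketch of the Minkowski distance (the example vectors are made up):

```python
import numpy as np

def minkowski(i, j, q=2):
    """Minkowski distance between two p-dimensional objects i and j.
    q=1 gives the Manhattan distance, q=2 the Euclidean distance."""
    i, j = np.asarray(i, float), np.asarray(j, float)
    return (np.abs(i - j) ** q).sum() ** (1.0 / q)

a, b = [1, 2, 3], [4, 6, 3]
print(minkowski(a, b, q=1))  # Manhattan: 7.0
print(minkowski(a, b, q=2))  # Euclidean: 5.0
```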
Binary Variables
 A contingency table for binary data:

              Object j
                1     0     sum
 Object i   1   a     b     a+b
            0   c     d     c+d
          sum  a+c   b+d     p

 Distance measure for symmetric binary variables:
$$d(i, j) = \frac{b + c}{a + b + c + d}$$
 Distance measure for asymmetric binary variables:
$$d(i, j) = \frac{b + c}{a + b + c}$$
 Jaccard coefficient (similarity measure for asymmetric binary variables):
$$sim_{Jaccard}(i, j) = \frac{a}{a + b + c}$$
Dissimilarity between Binary
Variables

 Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
 gender is a symmetric attribute
 the remaining attributes are asymmetric binary
 let the values Y and P be set to 1, and the value N be set to 0
$$d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
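A small Python sketch that reproduces these three dissimilarities (the 0/1 encoding follows the Y/P-to-1, N-to-0 rule above):

```python
def asym_binary_dist(x, y):
    """Asymmetric binary dissimilarity d = (b + c) / (a + b + c),
    where a = 1/1 matches, b = 1/0, c = 0/1 (0/0 matches are ignored)."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

# The six asymmetric attributes (fever .. test-4) of the example
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(asym_binary_dist(jack, mary), 2))  # 0.33
print(round(asym_binary_dist(jack, jim), 2))   # 0.67
print(round(asym_binary_dist(jim, mary), 2))   # 0.75
```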
Nominal Variables

 A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
 Method 1: Simple matching
  m: # of matches, p: total # of variables
$$d(i, j) = \frac{p - m}{p}$$

 Method 2: use a large number of binary variables
  creating a new binary variable for each of the M nominal states
Ordinal Variables

 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled
  replace x_if by its rank r_if ∈ {1, ..., M_f}
  map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
  compute the dissimilarity using methods for interval-scaled variables
Ratio-Scaled Variables

 Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^{Bt} or Ae^{-Bt}
 Methods:
 treat them like interval-scaled variables—not a
good choice! (why?—the scale can be distorted)
 apply logarithmic transformation
yif = log(xif)
 treat them as continuous ordinal data and treat their rank as interval-scaled
Variables of Mixed Types

 A database may contain all six types of variables:
  symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio


 One may use a weighted formula to combine their effects:
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
 f is binary or nominal:
  d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise
 f is interval-based: use the normalized distance
 f is ordinal or ratio-scaled:
  compute ranks r_if and z_if = (r_if - 1)/(M_f - 1), and treat z_if as interval-scaled
Vector Objects

 Vector objects: keywords in documents, gene features in micro-arrays, etc.
 Broad applications: information retrieval, biologic taxonomy, etc.
 Cosine measure

 A variant: Tanimoto coefficient

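A small Python sketch of the cosine measure and its Tanimoto variant (the document vectors are made up for illustration):

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity between two vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def tanimoto(x, y):
    """Tanimoto coefficient, a variant of the cosine measure."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return x @ y / (x @ x + y @ y - x @ y)

d1 = [5, 0, 3, 0, 2]   # e.g., keyword counts in two documents
d2 = [3, 0, 2, 0, 1]
print(cosine(d1, d2), tanimoto(d1, d2))
```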
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Major Clustering Approaches
(I)
 Partitioning approach:
 Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects)
using some criterion
 Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue

Major Clustering Approaches
(II)
 Grid-based approach:
 based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE
 Model-based:
  A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model
  Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: pCluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific
constraints
 Typical methods: COD (obstacles), constrained clustering
Typical Alternatives to Calculate the
Distance between Clusters
 Single link: smallest distance between an element in one
cluster and an element in the other, i.e., dis(Ki, Kj) = min(tip,
tjq)
 Complete link: largest distance between an element in
one cluster and an element in the other, i.e., dis(Ki, Kj) =
max(tip, tjq)
 Average: avg distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = avg(tip, tjq)
 Centroid: distance between the centroids of two clusters,
i.e., dis(Ki, Kj) = dis(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
 Centroid: the “middle” of a cluster:
$$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$$
 Radius: square root of the average distance from any point of the cluster to its centroid:
$$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$$
 Diameter: square root of the average mean squared distance between all pairs of points in the cluster:
$$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$$
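A minimal NumPy sketch of these three cluster statistics (the points are made up):

```python
import numpy as np

def centroid(pts):
    """Centroid: the mean point of the cluster."""
    return pts.mean(axis=0)

def radius(pts):
    """Square root of the average squared distance to the centroid."""
    c = centroid(pts)
    return np.sqrt(((pts - c) ** 2).sum(axis=1).mean())

def diameter(pts):
    """Square root of the average squared distance over all ordered pairs."""
    n = len(pts)
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)  # pairwise squares
    return np.sqrt(d2.sum() / (n * (n - 1)))                 # diagonal adds 0

pts = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 1.0]])
print(centroid(pts), radius(pts), diameter(pts))
```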
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Partitioning Algorithms: Basic
Concept
 Partitioning method: Construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized:
$$E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$
 Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67): Each cluster is represented by
the center of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the
objects in the cluster
The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in four steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the
clusters of the current partition (the centroid
is the center, i.e., mean point, of the cluster)
 Assign each object to the cluster with the
nearest seed point
 Go back to Step 2, stop when no more new
assignment
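As an illustration of these four steps, here is a minimal NumPy sketch (not from the original slides; it assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means following the four steps above.
    Empty clusters are not handled in this sketch."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]    # step 1: initial seeds
    for _ in range(n_iter):
        # step 3: assign each object to the nearest seed point
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        # step 2: recompute centroids (mean points) of the current partition
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):            # step 4: stop when stable
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, k=2)
```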
The K-Means Clustering Method

 Example (K = 2)
 [Figure: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and repeat until the assignment stabilizes.]
Comments on the K-Means Method

 Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
  Comparison: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
 Comment: Often terminates at a local optimum. The global
optimum may be found using techniques such as:
deterministic annealing and genetic algorithms
 Weakness
 Applicable only when mean is defined, then what about
categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable to discover clusters with non-convex shapes
Variations of the K-Means Method

 A few variants of the k-means which differ in
 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means
 Handling categorical data: k-modes (Huang’98)
 Replacing means of clusters with modes
 Using new dissimilarity measures to deal with categorical
objects
 Using a frequency-based method to update modes of
clusters
What Is the Problem of the K-Means
Method?

 The k-means algorithm is sensitive to outliers!
  Since an object with an extremely large value may substantially distort the distribution of the data
 K-Medoids: Instead of taking the mean value of the object in
a cluster as a reference point, medoids can be used, which is
the most centrally located object in a cluster.

 [Figure: two plots contrasting the mean and the medoid as cluster representatives when an extreme value is present.]
The K-Medoids Clustering Method

 Find representative objects, called medoids, in clusters
 PAM (Partitioning Around Medoids, 1987)
 starts from an initial set of medoids and iteratively
replaces one of the medoids by one of the non-medoids
if it improves the total distance of the resulting
clustering
 PAM works effectively for small data sets, but does not
scale well for large data sets
 CLARA (Kaufmann & Rousseeuw, 1990)
 CLARANS (Ng & Han, 1994): Randomized sampling
 Focusing + spatial data structure (Ester et al., 1995)
A Typical K-Medoids Algorithm (PAM)
 [Figure: PAM on a K = 2 example. Arbitrarily choose k objects as initial medoids (total cost = 20); assign each remaining object to the nearest medoid; randomly select a non-medoid object O_random and compute the total cost of swapping (total cost = 26 in the example); if the quality is improved, swap O and O_random; repeat the loop until no change.]
PAM (Partitioning Around Medoids)
(1987)

 PAM (Kaufman and Rousseeuw, 1987), built in Splus
 Use real object to represent the cluster
 Select k representative objects arbitrarily
 For each pair of non-selected object h and
selected object i, calculate the total swapping
cost TCih
 For each pair of i and h,
 If TCih < 0, i is replaced by h
 Then assign each non-selected object to the most similar representative object
PAM Clustering: Total swapping cost TC_ih = Σ_j C_jih
 [Figure: four cases for the cost contribution C_jih of a non-selected object j when medoid i is swapped with non-medoid h (t denotes another medoid):]
  C_jih = d(j, h) - d(j, i)
  C_jih = 0
  C_jih = d(j, t) - d(j, i)
  C_jih = d(j, h) - d(j, t)
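A naive Python sketch of the PAM swap loop (illustrative only: it recomputes the full cost of each candidate swap instead of accumulating the per-object C_jih terms, and it assumes a precomputed distance matrix D):

```python
import numpy as np

def pam_total_cost(D, medoids):
    """Total cost: sum of distances from each object to its nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, seed=0):
    """Try every (medoid i, non-medoid h) swap; keep any swap that
    lowers the total cost (i.e., TC_ih < 0), until no change."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, k, replace=False))
    improved = True
    while improved:
        improved = False
        for mi in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                candidate = medoids[:mi] + [h] + medoids[mi + 1:]
                if pam_total_cost(D, candidate) < pam_total_cost(D, medoids):
                    medoids = candidate
                    improved = True
    return medoids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])
D = np.linalg.norm(X[:, None] - X[None], axis=2)
print(pam(D, k=2))
```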
What Is the Problem with PAM?

 PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean
 PAM works efficiently for small data sets but does not scale well for large data sets.
  O(k(n-k)^2) for each iteration, where n is # of data points and k is # of clusters
 Sampling-based method: CLARA (Clustering LARge Applications)
CLARA (Clustering Large Applications)
(1990)

 CLARA (Kaufmann and Rousseeuw in 1990)
  Built in statistical analysis packages, such as S+
 It draws multiple samples of the data set, applies
PAM on each sample, and gives the best
clustering as the output
 Strength: deals with larger data sets than PAM
 Weakness:
 Efficiency depends on the sample size
 A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
CLARANS (“Randomized” CLARA)
(1994)

 CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han’94)
 CLARANS draws sample of neighbors dynamically
 The clustering process can be presented as
searching a graph where every node is a potential
solution, that is, a set of k medoids
 If the local optimum is found, CLARANS starts with
new randomly selected node in search for a new
local optimum
 It is more efficient and scalable than both PAM and
CLARA
 Focusing techniques and spatial access structures may further improve its performance
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Hierarchical Clustering

 Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition
 [Figure: agglomerative clustering (AGNES) proceeds from Step 0 to Step 4, merging a and b into ab, d and e into de, then c with de into cde, and finally ab with cde into abcde; divisive clustering (DIANA) proceeds in the reverse direction, from Step 4 back to Step 0.]
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g.,
Splus
 Use the Single-Link method and the dissimilarity
matrix.
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
 [Figure: three scatter plots showing successive AGNES merges.]
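A usage sketch of single-link agglomerative clustering with SciPy, in the spirit of AGNES (the two-blob data set is made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 6])
Z = linkage(X, method='single')                  # merge least-dissimilar nodes first
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram into 2 clusters
```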
Dendrogram: Shows How the Clusters are Merged

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.

DIANA (Divisive Analysis)

 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own
 [Figure: three scatter plots showing successive DIANA splits.]
Recent Hierarchical Clustering
Methods
 Major weakness of agglomerative clustering
methods
 do not scale well: time complexity of at least
O(n2), where n is the number of total objects
 can never undo what was done previously
 Integration of hierarchical with distance-based
clustering
 BIRCH (1996): uses CF-tree and incrementally
adjusts the quality of sub-clusters
 ROCK (1999): clustering categorical data by
neighbor and link analysis
 CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (1996)
 Birch: Balanced Iterative Reducing and Clustering
using Hierarchies (Zhang, Ramakrishnan & Livny,
SIGMOD’96)
 Incrementally construct a CF (Clustering Feature)
tree, a hierarchical data structure for multiphase
clustering
 Phase 1: scan DB to build an initial in-memory CF
tree (a multi-level compression of the data that
tries to preserve the inherent clustering structure
of the data)
 Phase 2: use an arbitrary clustering algorithm to
cluster the leaf nodes of the CF-tree
 Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
Clustering Feature Vector in
BIRCH

Clustering Feature: CF = (N, LS, SS)
 N: number of data points
 LS: the linear sum of the N points, Σ_{i=1}^{N} X_i
 SS: the square sum of the N points, Σ_{i=1}^{N} X_i^2
 [Figure: five points (3,4), (2,6), (4,5), (4,7), (3,8); their clustering feature is CF = (5, (16,30), (54,190)).]
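A small Python sketch that computes the CF of the five points above; CFs are additive, so two sub-clusters can be merged by adding their components:

```python
import numpy as np

def clustering_feature(points):
    """CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum."""
    pts = np.asarray(points, float)
    N = len(pts)
    LS = pts.sum(axis=0)           # linear sum
    SS = (pts ** 2).sum(axis=0)    # square sum
    return N, LS, SS

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(pts))     # (5, [16. 30.], [54. 190.])
```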
CF-Tree in BIRCH

 Clustering feature:
 summary of the statistics for a given subcluster: the 0-th,
1st and 2nd moments of the subcluster from the statistical
point of view.
 registers crucial measurements for computing clusters and utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
 A nonleaf node in a tree has descendants or “children”
 The nonleaf nodes store sums of the CFs of their children
 A CF tree has two parameters
 Branching factor: specify the maximum number of children.
 threshold: max diameter of sub-clusters stored at the leaf nodes
The CF Tree Structure
 [Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF_1 ... CF_6 with child pointers; non-leaf nodes hold entries CF_1 ... CF_5 with child pointers; leaf nodes hold CF entries and are chained together by prev/next pointers.]
Clustering Categorical Data: The ROCK
Algorithm
 ROCK: RObust Clustering using linKs
 S. Guha, R. Rastogi & K. Shim, ICDE’99

 Major ideas
 Use links to measure similarity/proximity

 Not distance-based

 Computational complexity: O(n^2 + n m_m m_a + n^2 log n)
 Algorithm: sampling-based clustering
 Draw random sample

 Cluster with links

 Label data in disk

 Experiments
 Congressional voting, mushroom data

Similarity Measure in ROCK
 Traditional measures for categorical data may not work well,
e.g., Jaccard coefficient
 Example: Two groups (clusters) of transactions
 C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c,
d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c,
d, e}
 C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
 Jaccard co-efficient may lead to wrong clustering result
 C1: 0.2 ({a, b, c}, {b, d, e}} to 0.5 ({a, b, c}, {a, b, d})
 C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
 Jaccard coefficient-based similarity function:
$$Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$$
 Ex. Let T1 = {a, b, c}, T2 = {c, d, e}:
$$Sim(T_1, T_2) = \frac{|\{c\}|}{|\{a, b, c, d, e\}|} = \frac{1}{5} = 0.2$$
Link Measure in ROCK
 Links: # of common neighbors
 C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c,
d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c,
d, e}
 C2 <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
 Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
 link(T1, T2) = 4, since they have 4 common
neighbors

{a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
 link(T1, T3) = 3, since they have 3 common neighbors
  {a, b, d}, {a, b, e}, {a, b, g}
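A small Python sketch of neighbor and link counting with a Jaccard threshold of 0.5 (each transaction is not counted as its own neighbor here), reproducing the counts above:

```python
from itertools import combinations

def jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

def links(transactions, theta=0.5):
    """Two transactions are neighbors if Jaccard similarity >= theta;
    link(Ti, Tj) = number of common neighbors."""
    n = len(transactions)
    nbrs = [{j for j in range(n) if j != i
             and jaccard(transactions[i], transactions[j]) >= theta}
            for i in range(n)]
    return {(i, j): len(nbrs[i] & nbrs[j]) for i, j in combinations(range(n), 2)}

C1 = [{'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'}, {'a','c','e'},
      {'a','d','e'}, {'b','c','d'}, {'b','c','e'}, {'b','d','e'}, {'c','d','e'}]
C2 = [{'a','b','f'}, {'a','b','g'}, {'a','f','g'}, {'b','f','g'}]
L = links(C1 + C2, theta=0.5)
print(L[(0, 9)])   # link({a,b,c}, {c,d,e}) -> 4
print(L[(0, 10)])  # link({a,b,c}, {a,b,f}) -> 3
```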
CHAMELEON: Hierarchical Clustering
Using Dynamic Modeling (1999)
 CHAMELEON: by G. Karypis, E.H. Han, and V. Kumar’99
 Measures the similarity based on a dynamic model
 Two clusters are merged only if the interconnectivity and
closeness (proximity) between two clusters are high
relative to the internal interconnectivity of the clusters and
closeness of items within the clusters
 Cure ignores information about interconnectivity of the
objects, Rock ignores information about the closeness of
two clusters
 A two-phase algorithm
1. Use a graph partitioning algorithm: cluster objects into a
large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters
Overall Framework of CHAMELEON
 [Figure: data set → construct sparse graph → partition the graph → merge partitions → final clusters.]
CHAMELEON (Clustering Complex Objects)
 [Figure: CHAMELEON correctly clustering several complex 2-D point sets.]
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Density-Based Clustering Methods
 Clustering based on density (local cluster criterion),
such as density-connected points

Major features:

Discover clusters of arbitrary shape

Handle noise

One scan

Need density parameters as termination
condition
 Several interesting studies:
 DBSCAN: Ester, et al. (KDD’96)

 OPTICS: Ankerst, et al (SIGMOD’99).

 DENCLUE: Hinneburg & D. Keim (KDD’98)

 CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
Density-Based Clustering: Basic
Concepts
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-
neighbourhood of that point
 NEps(p): {q belongs to D | dist(p,q) <= Eps}
 Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  p belongs to N_Eps(q)
  core point condition: |N_Eps(q)| >= MinPts
 [Figure: p at the edge of q's neighborhood; MinPts = 5, Eps = 1 cm.]
Density-Reachable and Density-Connected
 Density-reachable:
  A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn, p1 = q, pn = p such that p_{i+1} is directly density-reachable from p_i
 Density-connected:
  A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
DBSCAN: Density Based Spatial
Clustering of Applications with Noise
 Relies on a density-based notion of cluster: A
cluster is defined as a maximal set of density-
connected points
 Discovers clusters of arbitrary shape in spatial
databases with noise
 [Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5.]
DBSCAN: The Algorithm

 Arbitrarily select a point p


 Retrieve all points density-reachable from p w.r.t.
Eps and MinPts.
 If p is a core point, a cluster is formed.
 If p is a border point, no points are density-
reachable from p and DBSCAN visits the next
point of the database.
 Continue the process until all of the points have
been processed.
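A usage sketch with scikit-learn's DBSCAN implementation (the two clusters plus uniform noise are made up; eps and min_samples play the roles of Eps and MinPts above):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(4, 0.3, (100, 2)),
               rng.uniform(-2, 6, (10, 2))])    # some noise points

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_                             # -1 marks noise/outliers
```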
DBSCAN: Sensitive to
Parameters

 [Figure: DBSCAN clusterings obtained with different Eps and MinPts settings.]
OPTICS: A Cluster-Ordering Method
(1999)

 OPTICS: Ordering Points To Identify the Clustering Structure
  Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
 Produces a special order of the database w.r.t. its density-based clustering structure
 This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
 Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure
 Can be represented graphically or using visualization techniques
OPTICS: Some Extension from
DBSCAN

 Index-based:
  k = number of dimensions
  N = 20
  p = 75%
  M = N(1 - p) = 5
 Complexity: O(kN^2)
 Core distance of an object o: the smallest distance that makes o a core object
 Reachability distance of p from o: max(core-distance(o), d(o, p))
 [Figure: MinPts = 5, ε = 3 cm; r(p1, o) = 2.8 cm, r(p2, o) = 4 cm.]
 [Figure: reachability plot; reachability-distance (undefined where a point is not reachable) plotted in the cluster order of the objects; valleys correspond to clusters.]
Density-Based Clustering: OPTICS & Its
Applications

 [Figure: OPTICS applied to example data sets.]
DENCLUE: Using Statistical Density
Functions
 DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
 Using statistical density functions:
$$f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$$
$$f_{Gaussian}^{D}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$
$$\nabla f_{Gaussian}^{D}(x, x_i) = \sum_{i=1}^{N} (x_i - x)\, e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$
 Major features
 Solid mathematical foundation
 Good for data sets with large amounts of noise
 Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
 Significantly faster than existing algorithms (e.g., DBSCAN)
 But needs a large number of parameters
Denclue: Technical Essence
 Uses grid cells but only keeps information about
grid cells that do actually contain data points and
manages these cells in a tree-based access
structure
 Influence function: describes the impact of a data
point within its neighborhood
 Overall density of the data space can be
calculated as the sum of the influence function of
all data points
 Clusters can be determined mathematically by
identifying density attractors
 Density attractors are local maxima of the overall density function
Density Attractor
 [Figure: a density attractor as a local maximum of the overall density function.]
Center-Defined and Arbitrary Clusters
 [Figure: center-defined clusters versus arbitrary-shape clusters found by DENCLUE.]
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Grid-Based Clustering Method

 Using multi-resolution grid data structure


 Several interesting methods
 STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
 WaveCluster by Sheikholeslami, Chatterjee, and Zhang
(VLDB’98)

A multi-resolution clustering approach using wavelet
method
 CLIQUE: Agrawal, et al. (SIGMOD’98)
  On high-dimensional data (thus put in the section on clustering high-dimensional data)
STING: A Statistical Information Grid
Approach
 Wang, Yang and Muntz (VLDB’97)
 The spatial area is divided into rectangular cells
 There are several levels of cells corresponding to
different levels of resolution

 [Figure: hierarchical grid structure with cells at multiple resolution levels.]
The STING Clustering Method

 Each cell at a high level is partitioned into a number


of smaller cells in the next lower level
 Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
 Parameters of higher level cells can be easily
calculated from parameters of lower level cell

count, mean, s, min, max

type of distribution—normal, uniform, etc.
 Use a top-down approach to answer spatial data
queries

Start from a pre-selected layer—typically with a
small number of cells
 For each cell in the current level compute the confidence interval
Comments on STING
 Remove the irrelevant cells from further
consideration
 When finish examining the current layer, proceed
to the next lower level
 Repeat this process until the bottom layer is
reached
 Advantages:
 Query-independent, easy to parallelize,

incremental update
 O(K), where K is the number of grid cells at the

lowest level
 Disadvantages:
  All the cluster boundaries are either horizontal or vertical; no diagonal boundary is detected
WaveCluster: Clustering by Wavelet Analysis
(1998)
 Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
 A multi-resolution clustering approach which applies wavelet
transform to the feature space
 How to apply wavelet transform to find clusters
 Summarizes the data by imposing a multidimensional
grid structure onto data space
 These multidimensional spatial data objects are represented in an n-dimensional feature space
 Apply wavelet transform on feature space to find the
dense regions in the feature space
 Apply wavelet transform multiple times which result in
clusters at different scales from fine to coarse
Wavelet Transform
 Wavelet transform: A signal processing technique
that decomposes a signal into different frequency
sub-band (can be applied to n-dimensional signals)
 Data are transformed to preserve relative distance
between objects at different levels of resolution
 Allows natural clusters to become more
distinguishable

The WaveCluster Algorithm
 Input parameters
 # of grid cells for each dimension

 the wavelet, and the # of applications of wavelet

transform
 Why is wavelet transformation useful for clustering?
 Use hat-shape filters to emphasize regions where points cluster, but simultaneously suppress weaker information in their boundary
 Effective removal of outliers, multi-resolution, cost

effective
 Major features:
 Complexity O(N)

 Detect arbitrary shaped clusters at different scales

 Not sensitive to noise, not sensitive to input order

 Only applicable to low dimensional data


 Both grid-based and density-based
Quantization
& Transformation
 First, quantize data into m-D grid
structure, then wavelet transform
 a) scale 1: high resolution

 b) scale 2: medium resolution

 c) scale 3: low resolution

Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Model-Based Clustering

 What is model-based clustering?
  Attempt to optimize the fit between the given data and some mathematical model
  Based on the assumption: data are generated by a mixture of underlying probability distributions
 Typical methods
  Statistical approach
   EM (Expectation Maximization), AutoClass
  Machine learning approach
   COBWEB, CLASSIT
  Neural network approach
   SOM (Self-Organizing Feature Map)
EM — Expectation Maximization
 EM — A popular iterative refinement algorithm
 An extension to k-means
 Assign each object to a cluster according to a weight (prob.
distribution)
 New means are computed based on weighted measures
 General idea
 Starts with an initial estimate of the parameter vector
 Iteratively rescores the patterns against the mixture density
produced by the parameter vector
 The rescored patterns are used to update the parameter estimates
 Patterns belong to the same cluster if their scores place them in the same component
 The algorithm converges fast but may not reach the global optimum
The EM (Expectation Maximization)
Algorithm

 Initially, randomly assign k cluster centers


 Iteratively refine the clusters based on two steps
 Expectation step: assign each data point X_i to cluster C_i with the probability P(C_i | X_i), computed from the current parameter estimates
 Maximization step: estimation of the model parameters
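A minimal EM sketch for a one-dimensional Gaussian mixture (illustrative only; component collapse and empty components are not guarded against):

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=50, seed=0):
    """Iterate E-step (responsibilities) and M-step (parameter updates)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, k)                   # random initial cluster centers
    sigma, pi = np.ones(k), np.ones(k) / k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 1, 200)])
print(em_gmm_1d(x))
```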
Conceptual Clustering
 Conceptual clustering
 A form of clustering in machine learning

 Produces a classification scheme for a set of

unlabeled objects
 Finds characteristic description for each concept

(class)
 COBWEB (Fisher’87)
  A popular and simple method of incremental conceptual learning
 Creates a hierarchical clustering in the form of a

classification tree
  Each node refers to a concept and contains a probabilistic description of that concept
COBWEB Clustering Method
 [Figure: a classification tree.]
More on Conceptual Clustering
 Limitations of COBWEB
 The assumption that the attributes are independent of each
other is often too strong because correlation may exist
 Not suitable for clustering large database data – skewed
tree and expensive probability distributions
 CLASSIT
 an extension of COBWEB for incremental clustering of
continuous data
 suffers similar problems as COBWEB
 AutoClass (Cheeseman and Stutz, 1996)
 Uses Bayesian statistical analysis to estimate the number
of clusters
 Popular in industry
Neural Network Approach
 Neural network approaches
 Represent each cluster as an exemplar, acting as a “prototype” of the cluster
 New objects are distributed to the cluster whose exemplar is the most similar according to some distance measure
 Typical methods
 SOM (Self-Organizing feature Map)

 Competitive learning


Involves a hierarchical architecture of several units
(neurons)

Neurons compete in a “winner-takes-all” fashion for
the object currently being presented
CLIQUE (Clustering In QUEst)

 Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)


 Automatically identifying subspaces of a high dimensional data
space that allow better clustering than original space
 CLIQUE can be considered as both density-based and grid-
based
 It partitions each dimension into the same number of equal
length interval
 It partitions an m-dimensional data space into non-
overlapping rectangular units
 A unit is dense if the fraction of total data points contained
in the unit exceeds the input model parameter
 A cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps
 Partition the data space and find the number of
points that lie inside each cell of the partition.
 Identify the subspaces that contain clusters using
the Apriori principle
 Identify clusters
 Determine dense units in all subspaces of
interests
 Determine connected dense units in all
subspaces of interests.
 Generate minimal description for the clusters
  Determine maximal regions that cover a cluster of connected dense units for each cluster
  Determine the minimal cover for each cluster
 [Figure: CLIQUE example. Dense units are found in the (age, salary) and (age, vacation) subspaces with density threshold τ = 3; intersecting the dense regions over age 30-50 yields a candidate cluster in the (age, salary, vacation) subspace.]
Strength and Weakness of
CLIQUE

 Strength
  automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
  insensitive to the order of records in input and does not presume some canonical data distribution
  scales linearly with the size of input and has good scalability as the number of dimensions in the data increases
 Weakness
  The accuracy of the clustering result may be degraded at the expense of the simplicity of the method
Frequent Pattern-Based Approach
 Clustering high-dimensional space (e.g., clustering text
documents, microarray data)
 Projected subspace-clustering: which dimensions to be
projected on?

CLIQUE, ProClus
 Feature extraction: costly and may not be effective?
 Using frequent patterns as “features”
  “Frequent” patterns are inherent features
  Mining frequent patterns may not be so expensive
 Typical methods
 Frequent-term-based document clustering
 Clustering by pattern similarity in micro-array data
(pClustering)
Clustering by Pattern Similarity (p-
Clustering)

 Right: the micro-array “raw” data shows 3 genes and their values in a multi-dimensional space
  Difficult to find their patterns
 Bottom: some subsets of dimensions form nice shift and scaling patterns
 [Figure: raw expression curves, and the shift and scaling patterns revealed on selected subsets of dimensions.]
Why p-Clustering?
 Microarray data analysis may need to
 Clustering on thousands of dimensions (attributes)
 Discovery of both shift and scaling patterns
 Clustering with Euclidean distance measure? — cannot find shift
patterns
 Clustering on derived attribute Aij = ai – aj? — introduces N(N-1)
dimensions
 Bi-cluster using the transformed mean-squared residue score matrix H(I, J), where
$$d_{iJ} = \frac{1}{|J|} \sum_{j \in J} d_{ij}, \qquad d_{Ij} = \frac{1}{|I|} \sum_{i \in I} d_{ij}, \qquad d_{IJ} = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} d_{ij}$$
 A submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0
 Problems with bi-cluster
  No downward closure property
  Due to averaging, it may contain outliers yet still stay within the δ-threshold
p-Clustering:
Clustering by
Pattern Similarity
 Given objects x, y in O and features a, b in T, pScore is defined on a 2 × 2 matrix:
$$pScore\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|$$
 A pair (O, T) is in δ-pCluster if for any 2 by 2 matrix X in (O,
T), pScore(X) ≤ δ for some δ > 0
 Properties of δ-pCluster
 Downward closure
 Clusters are more homogeneous than bi-cluster (thus the
name: pair-wise Cluster)
 A pattern-growth algorithm has been developed for efficient mining
 For scaling patterns, taking the logarithm of
$$\frac{d_{xa} / d_{ya}}{d_{xb} / d_{yb}}$$
 leads to the pScore form
Outlier Detection
What Is Outlier Discovery?
 What are outliers?
  A set of objects that are considerably dissimilar from the remainder of the data
  Example: Sports: Michael Jordan, Wayne Gretzky, ...
 Problem: Define and find outliers in large data sets
 Applications:
 Credit card fraud detection

 Telecom fraud detection

 Customer segmentation

 Medical analysis

Outlier Discovery: Statistical Approaches

 Assume a model of the underlying distribution that generates the data set (e.g., normal distribution)
 Use discordancy tests depending on
 data distribution

 distribution parameter (e.g., mean, variance)

 number of expected outliers

 Drawbacks
 most tests are for single attribute

 In many cases, the data distribution may not be known
Outlier Discovery: Distance-Based
Approach

 Introduced to counter the main limitations imposed by statistical methods
  We need multi-dimensional analysis without knowing the data distribution
 Distance-based outlier: A DB(p, D)-outlier is an
object O in a dataset T such that at least a fraction
p of the objects in T lies at a distance greater than
D from O
 Algorithms for mining distance-based outliers
 Index-based algorithm

 Nested-loop algorithm
 Cell-based algorithm

Index-based algorithm: The index-based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for neighbors of each object o within radius D around that object. Once K (K = N(1 - p)) neighbors of object o are found, it follows that o is not an outlier. This algorithm has a worst-case complexity of O(k · n^2), where k is the dimensionality and n is the number of objects in the data set.

•Nested-loop algorithm: The nested-loop algorithm has the same computational complexity as the index-based algorithm but avoids building index structures and minimizes the amount of I/O. It splits the memory buffer in half and puts the data into several logical blocks.
•Cell-based algorithm: It avoids the O(n^2) computational complexity by developing a cell-based algorithm for memory-resident datasets. Its complexity is O(c · k + n), where c is a constant based on the number of cells and k is the dimensionality.
Density-Based Local
Outlier Detection
 Distance-based outlier detection is based on the global distance distribution
 It encounters difficulties in identifying outliers if the data are not uniformly distributed
 Ex. C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points, plus 2 outlier points o1, o2
 A distance-based method cannot identify o2 as an outlier
 Need the concept of a local outlier
 Local outlier factor (LOF)
  Assume the outlier notion is not crisp
  Each point has a LOF
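A usage sketch with scikit-learn's LOF implementation (the cluster layout mirrors the C1/C2 example above; the exact coordinates are made up):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
C1 = rng.normal(0, 2.0, (400, 2))      # loosely distributed cluster
C2 = rng.normal(8, 0.2, (100, 2))      # tightly condensed cluster
X = np.vstack([C1, C2, [[8.0, 2.0]], [[8.0, 6.5]]])  # o1, o2

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)            # -1 marks local outliers
scores = -lof.negative_outlier_factor_ # larger score = more outlying
```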
Outlier Discovery: Deviation-Based
Approach

 Identifies outliers by examining the main characteristics of objects in a group
 Objects that “deviate” from this description are
considered outliers
 Sequential exception technique
 simulates the way in which humans can
distinguish unusual objects from among a
series of supposedly like objects
 OLAP data cube technique
 uses data cubes to identify regions of anomalies in large multidimensional data
