
Chapter 7. Cluster Analysis
(Data Mining: Concepts and Techniques)
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to
the characteristics found in the data and
grouping similar data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data
distribution
 As a preprocessing step for other algorithms
Clustering: Rich Applications and Multidisciplinary Efforts
 Pattern Recognition
 Spatial Data Analysis
 Create thematic maps in GIS by clustering
feature spaces
 Detect spatial clusters or for other spatial mining
tasks
 Image Processing
 Economic Science (especially market research)
 WWW
 Document classification
 Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
 Land use: Identification of areas of similar land use in an
earth observation database
 Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
 City-planning: Identifying groups of houses according to their
house type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
Quality: What Is Good Clustering?
 A good clustering method will produce high quality
clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both
the similarity measure used by the method and its
implementation
 The quality of a clustering method is also
measured by its ability to discover some or all of
the hidden patterns
Measure the Quality of Clustering
 Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, typically a metric, d(i, j)
 There is a separate “quality” function that
measures the “goodness” of a cluster.
 The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables.
 Weights should be associated with different
variables based on applications and data
semantics.
 It is hard to define “similar enough” or “good enough”; the answer is typically highly subjective.
Requirements of Clustering in Data Mining
 Scalability
 Ability to deal with different types of attributes
 Ability to handle dynamic data
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to
determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Data Structures
 Data matrix (two modes):
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
 Dissimilarity matrix (one mode):
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
Type of data in clustering analysis

 Interval-scaled variables
 Binary variables
 Nominal, ordinal, and ratio variables
 Variables of mixed types

Interval-valued variables

 Standardize data
 Calculate the mean absolute deviation:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
where
$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$
 Calculate the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

 Using the mean absolute deviation is more robust than using the standard deviation
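A minimal NumPy sketch of this standardization (the data matrix is made up for illustration):

```python
import numpy as np

def standardize(X):
    """Standardize each variable f using the mean absolute deviation
    s_f rather than the standard deviation (more robust to outliers)."""
    m = X.mean(axis=0)               # m_f: per-variable mean
    s = np.abs(X - m).mean(axis=0)   # s_f: mean absolute deviation
    return (X - m) / s               # z_if = (x_if - m_f) / s_f

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, 40.0]])
print(standardize(X))
```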
Similarity and Dissimilarity
Between Objects

 Distances are normally used to measure the similarity or dissimilarity between two data objects
 Some popular ones include the Minkowski distance:
$$d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$$
where i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) are two p-dimensional data objects, and q is a positive integer
 If q = 1, d is the Manhattan distance:
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
Similarity and Dissimilarity
Between Objects (Cont.)

 If q = 2, d is the Euclidean distance:
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
 Properties
 d(i, j) ≥ 0
 d(i, i) = 0
 d(i, j) = d(j, i)
 d(i, j) ≤ d(i, k) + d(k, j)
 Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
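A minimal Python sketch of the Minkowski distance (the example vectors are made up):

```python
import numpy as np

def minkowski(i, j, q=2):
    """Minkowski distance between two p-dimensional objects i and j.
    q=1 gives the Manhattan distance, q=2 the Euclidean distance."""
    i, j = np.asarray(i, float), np.asarray(j, float)
    return (np.abs(i - j) ** q).sum() ** (1.0 / q)

a, b = [1, 2, 3], [4, 6, 3]
print(minkowski(a, b, q=1))  # Manhattan: 7.0
print(minkowski(a, b, q=2))  # Euclidean: 5.0
```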
Binary Variables
 A contingency table for binary data:

              Object j
                1     0     sum
 Object i   1   a     b     a+b
            0   c     d     c+d
          sum  a+c   b+d     p

 Distance measure for symmetric binary variables:
$$d(i, j) = \frac{b + c}{a + b + c + d}$$
 Distance measure for asymmetric binary variables:
$$d(i, j) = \frac{b + c}{a + b + c}$$
 Jaccard coefficient (similarity measure for asymmetric binary variables):
$$sim_{Jaccard}(i, j) = \frac{a}{a + b + c}$$
Dissimilarity between Binary
Variables

 Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
 gender is a symmetric attribute
 the remaining attributes are asymmetric binary
 let the values Y and P be set to 1, and the value N be set to 0
$$d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
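A small Python sketch that reproduces these three dissimilarities (the 0/1 encoding follows the Y/P-to-1, N-to-0 rule above):

```python
def asym_binary_dist(x, y):
    """Asymmetric binary dissimilarity d = (b + c) / (a + b + c),
    where a = 1/1 matches, b = 1/0, c = 0/1 (0/0 matches are ignored)."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

# The six asymmetric attributes (fever .. test-4) of the example
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(asym_binary_dist(jack, mary), 2))  # 0.33
print(round(asym_binary_dist(jack, jim), 2))   # 0.67
print(round(asym_binary_dist(jim, mary), 2))   # 0.75
```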
Nominal Variables

 A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
 Method 1: Simple matching
  m: # of matches, p: total # of variables
$$d(i, j) = \frac{p - m}{p}$$

 Method 2: use a large number of binary variables
  creating a new binary variable for each of the M nominal states
Ordinal Variables

 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled
  replace x_if by its rank r_if ∈ {1, ..., M_f}
  map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
  compute the dissimilarity using methods for interval-scaled variables
Ratio-Scaled Variables

 Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^{Bt} or Ae^{-Bt}
 Methods:
 treat them like interval-scaled variables—not a
good choice! (why?—the scale can be distorted)
 apply logarithmic transformation
yif = log(xif)
 treat them as continuous ordinal data and treat their rank as interval-scaled
Variables of Mixed Types

 A database may contain all six types of variables:
  symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio


 One may use a weighted formula to combine their effects:
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
 f is binary or nominal:
  d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise
 f is interval-based: use the normalized distance
 f is ordinal or ratio-scaled:
  compute ranks r_if and z_if = (r_if - 1)/(M_f - 1), and treat z_if as interval-scaled
Vector Objects

 Vector objects: keywords in documents, gene features in micro-arrays, etc.
 Broad applications: information retrieval, biologic taxonomy, etc.
 Cosine measure

 A variant: Tanimoto coefficient

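A small Python sketch of the cosine measure and its Tanimoto variant (the document vectors are made up for illustration):

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity between two vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def tanimoto(x, y):
    """Tanimoto coefficient, a variant of the cosine measure."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return x @ y / (x @ x + y @ y - x @ y)

d1 = [5, 0, 3, 0, 2]   # e.g., keyword counts in two documents
d2 = [3, 0, 2, 0, 1]
print(cosine(d1, d2), tanimoto(d1, d2))
```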
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Major Clustering Approaches
(I)
 Partitioning approach:
 Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects)
using some criterion
 Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue

Major Clustering Approaches
(II)
 Grid-based approach:
 based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE
 Model-based:
  A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model
  Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: pCluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific
constraints
 Typical methods: COD (obstacles), constrained clustering
Typical Alternatives to Calculate the
Distance between Clusters
 Single link: smallest distance between an element in one
cluster and an element in the other, i.e., dis(Ki, Kj) = min(tip,
tjq)
 Complete link: largest distance between an element in
one cluster and an element in the other, i.e., dis(Ki, Kj) =
max(tip, tjq)
 Average: avg distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = avg(tip, tjq)
 Centroid: distance between the centroids of two clusters,
i.e., dis(Ki, Kj) = dis(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
 Centroid: the “middle” of a cluster:
$$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$$
 Radius: square root of the average distance from any point of the cluster to its centroid:
$$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$$
 Diameter: square root of the average mean squared distance between all pairs of points in the cluster:
$$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$$
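A minimal NumPy sketch of these three cluster statistics (the points are made up):

```python
import numpy as np

def centroid(pts):
    """Centroid: the mean point of the cluster."""
    return pts.mean(axis=0)

def radius(pts):
    """Square root of the average squared distance to the centroid."""
    c = centroid(pts)
    return np.sqrt(((pts - c) ** 2).sum(axis=1).mean())

def diameter(pts):
    """Square root of the average squared distance over all ordered pairs."""
    n = len(pts)
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)  # pairwise squares
    return np.sqrt(d2.sum() / (n * (n - 1)))                 # diagonal adds 0

pts = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 1.0]])
print(centroid(pts), radius(pts), diameter(pts))
```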
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Partitioning Algorithms: Basic
Concept
 Partitioning method: Construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized:
$$E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$
 Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67): Each cluster is represented by
the center of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the
objects in the cluster
The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in four steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the
clusters of the current partition (the centroid
is the center, i.e., mean point, of the cluster)
 Assign each object to the cluster with the
nearest seed point
 Go back to Step 2, stop when no more new
assignment
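As an illustration of these four steps, here is a minimal NumPy sketch (not from the original slides; it assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means following the four steps above.
    Empty clusters are not handled in this sketch."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]    # step 1: initial seeds
    for _ in range(n_iter):
        # step 3: assign each object to the nearest seed point
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        # step 2: recompute centroids (mean points) of the current partition
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):            # step 4: stop when stable
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, k=2)
```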
The K-Means Clustering Method

 Example (K = 2)
 [Figure: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and repeat until the assignment stabilizes.]
Comments on the K-Means Method

 Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
  Comparison: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
 Comment: Often terminates at a local optimum. The global
optimum may be found using techniques such as:
deterministic annealing and genetic algorithms
 Weakness
 Applicable only when mean is defined, then what about
categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable to discover clusters with non-convex shapes
Variations of the K-Means Method

 A few variants of the k-means which differ in
 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means
 Handling categorical data: k-modes (Huang’98)
 Replacing means of clusters with modes
 Using new dissimilarity measures to deal with categorical
objects
 Using a frequency-based method to update modes of
clusters
What Is the Problem of the K-Means
Method?

 The k-means algorithm is sensitive to outliers!
  Since an object with an extremely large value may substantially distort the distribution of the data
 K-Medoids: Instead of taking the mean value of the object in
a cluster as a reference point, medoids can be used, which is
the most centrally located object in a cluster.

 [Figure: two plots contrasting the mean and the medoid as cluster representatives when an extreme value is present.]
The K-Medoids Clustering Method

 Find representative objects, called medoids, in clusters
 PAM (Partitioning Around Medoids, 1987)
 starts from an initial set of medoids and iteratively
replaces one of the medoids by one of the non-medoids
if it improves the total distance of the resulting
clustering
 PAM works effectively for small data sets, but does not
scale well for large data sets
 CLARA (Kaufmann & Rousseeuw, 1990)
 CLARANS (Ng & Han, 1994): Randomized sampling
 Focusing + spatial data structure (Ester et al., 1995)
A Typical K-Medoids Algorithm (PAM)
 [Figure: PAM on a K = 2 example. Arbitrarily choose k objects as initial medoids (total cost = 20); assign each remaining object to the nearest medoid; randomly select a non-medoid object O_random and compute the total cost of swapping (total cost = 26 in the example); if the quality is improved, swap O and O_random; repeat the loop until no change.]
PAM (Partitioning Around Medoids)
(1987)

 PAM (Kaufman and Rousseeuw, 1987), built in Splus
 Use real object to represent the cluster
 Select k representative objects arbitrarily
 For each pair of non-selected object h and
selected object i, calculate the total swapping
cost TCih
 For each pair of i and h,
 If TCih < 0, i is replaced by h
 Then assign each non-selected object to the most similar representative object
PAM Clustering: Total swapping cost TC_ih = Σ_j C_jih
 [Figure: four cases for the cost contribution C_jih of a non-selected object j when medoid i is swapped with non-medoid h (t denotes another medoid):]
  C_jih = d(j, h) - d(j, i)
  C_jih = 0
  C_jih = d(j, t) - d(j, i)
  C_jih = d(j, h) - d(j, t)
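A naive Python sketch of the PAM swap loop (illustrative only: it recomputes the full cost of each candidate swap instead of accumulating the per-object C_jih terms, and it assumes a precomputed distance matrix D):

```python
import numpy as np

def pam_total_cost(D, medoids):
    """Total cost: sum of distances from each object to its nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, seed=0):
    """Try every (medoid i, non-medoid h) swap; keep any swap that
    lowers the total cost (i.e., TC_ih < 0), until no change."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, k, replace=False))
    improved = True
    while improved:
        improved = False
        for mi in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                candidate = medoids[:mi] + [h] + medoids[mi + 1:]
                if pam_total_cost(D, candidate) < pam_total_cost(D, medoids):
                    medoids = candidate
                    improved = True
    return medoids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])
D = np.linalg.norm(X[:, None] - X[None], axis=2)
print(pam(D, k=2))
```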
What Is the Problem with PAM?

 PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean
 PAM works efficiently for small data sets but does not scale well for large data sets.
  O(k(n-k)^2) for each iteration, where n is # of data points and k is # of clusters
 Sampling-based method: CLARA (Clustering LARge Applications)
CLARA (Clustering Large Applications)
(1990)

 CLARA (Kaufmann and Rousseeuw in 1990)
  Built in statistical analysis packages, such as S+
 It draws multiple samples of the data set, applies
PAM on each sample, and gives the best
clustering as the output
 Strength: deals with larger data sets than PAM
 Weakness:
 Efficiency depends on the sample size
 A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
CLARANS (“Randomized” CLARA)
(1994)

 CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han’94)
 CLARANS draws sample of neighbors dynamically
 The clustering process can be presented as
searching a graph where every node is a potential
solution, that is, a set of k medoids
 If the local optimum is found, CLARANS starts with
new randomly selected node in search for a new
local optimum
 It is more efficient and scalable than both PAM and
CLARA
 Focusing techniques and spatial access structures may further improve its performance
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Hierarchical Clustering

 Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition
 [Figure: agglomerative clustering (AGNES) proceeds from Step 0 to Step 4, merging a and b into ab, d and e into de, then c with de into cde, and finally ab with cde into abcde; divisive clustering (DIANA) proceeds in the reverse direction, from Step 4 back to Step 0.]
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g.,
Splus
 Use the Single-Link method and the dissimilarity
matrix.
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
 [Figure: three scatter plots showing successive AGNES merges.]
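A usage sketch of single-link agglomerative clustering with SciPy, in the spirit of AGNES (the two-blob data set is made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 6])
Z = linkage(X, method='single')                  # merge least-dissimilar nodes first
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram into 2 clusters
```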
Dendrogram: Shows How the Clusters are Merged

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.

DIANA (Divisive Analysis)

 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own
 [Figure: three scatter plots showing successive DIANA splits.]
Recent Hierarchical Clustering
Methods
 Major weakness of agglomerative clustering
methods
 do not scale well: time complexity of at least
O(n2), where n is the number of total objects
 can never undo what was done previously
 Integration of hierarchical with distance-based
clustering
 BIRCH (1996): uses CF-tree and incrementally
adjusts the quality of sub-clusters
 ROCK (1999): clustering categorical data by
neighbor and link analysis
 CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (1996)
 Birch: Balanced Iterative Reducing and Clustering
using Hierarchies (Zhang, Ramakrishnan & Livny,
SIGMOD’96)
 Incrementally construct a CF (Clustering Feature)
tree, a hierarchical data structure for multiphase
clustering
 Phase 1: scan DB to build an initial in-memory CF
tree (a multi-level compression of the data that
tries to preserve the inherent clustering structure
of the data)
 Phase 2: use an arbitrary clustering algorithm to
cluster the leaf nodes of the CF-tree
 Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
Clustering Feature Vector in
BIRCH

Clustering Feature: CF = (N, LS, SS)
 N: number of data points
 LS: the linear sum of the N points, Σ_{i=1}^{N} X_i
 SS: the square sum of the N points, Σ_{i=1}^{N} X_i^2
 [Figure: five points (3,4), (2,6), (4,5), (4,7), (3,8); their clustering feature is CF = (5, (16,30), (54,190)).]
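A small Python sketch that computes the CF of the five points above; CFs are additive, so two sub-clusters can be merged by adding their components:

```python
import numpy as np

def clustering_feature(points):
    """CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum."""
    pts = np.asarray(points, float)
    N = len(pts)
    LS = pts.sum(axis=0)           # linear sum
    SS = (pts ** 2).sum(axis=0)    # square sum
    return N, LS, SS

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(pts))     # (5, [16. 30.], [54. 190.])
```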
CF-Tree in BIRCH

 Clustering feature:
 summary of the statistics for a given subcluster: the 0-th,
1st and 2nd moments of the subcluster from the statistical
point of view.
 registers crucial measurements for computing clusters and utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
 A nonleaf node in a tree has descendants or “children”
 The nonleaf nodes store sums of the CFs of their children
 A CF tree has two parameters
 Branching factor: specify the maximum number of children.
 threshold: max diameter of sub-clusters stored at the leaf nodes
The CF Tree Structure
 [Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF_1 ... CF_6 with child pointers; non-leaf nodes hold entries CF_1 ... CF_5 with child pointers; leaf nodes hold CF entries and are chained together by prev/next pointers.]
Clustering Categorical Data: The ROCK
Algorithm
 ROCK: RObust Clustering using linKs
 S. Guha, R. Rastogi & K. Shim, ICDE’99

 Major ideas
 Use links to measure similarity/proximity

 Not distance-based

 Computational complexity: O(n^2 + n m_m m_a + n^2 log n)
 Algorithm: sampling-based clustering
 Draw random sample

 Cluster with links

 Label data in disk

 Experiments
 Congressional voting, mushroom data

Similarity Measure in ROCK
 Traditional measures for categorical data may not work well,
e.g., Jaccard coefficient
 Example: Two groups (clusters) of transactions
 C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c,
d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c,
d, e}
 C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
 Jaccard co-efficient may lead to wrong clustering result
 C1: 0.2 ({a, b, c}, {b, d, e}} to 0.5 ({a, b, c}, {a, b, d})
 C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
 Jaccard coefficient-based similarity function:
$$Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$$
 Ex. Let T1 = {a, b, c}, T2 = {c, d, e}:
$$Sim(T_1, T_2) = \frac{|\{c\}|}{|\{a, b, c, d, e\}|} = \frac{1}{5} = 0.2$$
Link Measure in ROCK
 Links: # of common neighbors
 C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c,
d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c,
d, e}
 C2 <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
 Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
 link(T1, T2) = 4, since they have 4 common
neighbors

{a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
 link(T1, T3) = 3, since they have 3 common neighbors
  {a, b, d}, {a, b, e}, {a, b, g}
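A small Python sketch of neighbor and link counting with a Jaccard threshold of 0.5 (each transaction is not counted as its own neighbor here), reproducing the counts above:

```python
from itertools import combinations

def jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

def links(transactions, theta=0.5):
    """Two transactions are neighbors if Jaccard similarity >= theta;
    link(Ti, Tj) = number of common neighbors."""
    n = len(transactions)
    nbrs = [{j for j in range(n) if j != i
             and jaccard(transactions[i], transactions[j]) >= theta}
            for i in range(n)]
    return {(i, j): len(nbrs[i] & nbrs[j]) for i, j in combinations(range(n), 2)}

C1 = [{'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'}, {'a','c','e'},
      {'a','d','e'}, {'b','c','d'}, {'b','c','e'}, {'b','d','e'}, {'c','d','e'}]
C2 = [{'a','b','f'}, {'a','b','g'}, {'a','f','g'}, {'b','f','g'}]
L = links(C1 + C2, theta=0.5)
print(L[(0, 9)])   # link({a,b,c}, {c,d,e}) -> 4
print(L[(0, 10)])  # link({a,b,c}, {a,b,f}) -> 3
```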
CHAMELEON: Hierarchical Clustering
Using Dynamic Modeling (1999)
 CHAMELEON: by G. Karypis, E.H. Han, and V. Kumar’99
 Measures the similarity based on a dynamic model
 Two clusters are merged only if the interconnectivity and
closeness (proximity) between two clusters are high
relative to the internal interconnectivity of the clusters and
closeness of items within the clusters
 Cure ignores information about interconnectivity of the
objects, Rock ignores information about the closeness of
two clusters
 A two-phase algorithm
1. Use a graph partitioning algorithm: cluster objects into a
large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters
Overall Framework of CHAMELEON
 [Figure: data set → construct sparse graph → partition the graph → merge partitions → final clusters.]
CHAMELEON (Clustering Complex Objects)
 [Figure: CHAMELEON correctly clustering several complex 2-D point sets.]
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Density-Based Clustering Methods
 Clustering based on density (local cluster criterion),
such as density-connected points

Major features:

Discover clusters of arbitrary shape

Handle noise

One scan

Need density parameters as termination
condition
 Several interesting studies:
 DBSCAN: Ester, et al. (KDD’96)

 OPTICS: Ankerst, et al (SIGMOD’99).

 DENCLUE: Hinneburg & D. Keim (KDD’98)

 CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
Density-Based Clustering: Basic
Concepts
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-
neighbourhood of that point
 NEps(p): {q belongs to D | dist(p,q) <= Eps}
 Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  p belongs to N_Eps(q)
  core point condition: |N_Eps(q)| >= MinPts
 [Figure: p at the edge of q's neighborhood; MinPts = 5, Eps = 1 cm.]
Density-Reachable and Density-Connected
 Density-reachable:
  A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn, p1 = q, pn = p such that p_{i+1} is directly density-reachable from p_i
 Density-connected:
  A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
DBSCAN: Density Based Spatial
Clustering of Applications with Noise
 Relies on a density-based notion of cluster: A
cluster is defined as a maximal set of density-
connected points
 Discovers clusters of arbitrary shape in spatial
databases with noise
 [Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5.]
DBSCAN: The Algorithm

 Arbitrarily select a point p


 Retrieve all points density-reachable from p w.r.t.
Eps and MinPts.
 If p is a core point, a cluster is formed.
 If p is a border point, no points are density-
reachable from p and DBSCAN visits the next
point of the database.
 Continue the process until all of the points have
been processed.
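A usage sketch with scikit-learn's DBSCAN implementation (the two clusters plus uniform noise are made up; eps and min_samples play the roles of Eps and MinPts above):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(4, 0.3, (100, 2)),
               rng.uniform(-2, 6, (10, 2))])    # some noise points

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_                             # -1 marks noise/outliers
```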
DBSCAN: Sensitive to
Parameters

 [Figure: DBSCAN clusterings obtained with different Eps and MinPts settings.]
OPTICS: A Cluster-Ordering Method
(1999)

 OPTICS: Ordering Points To Identify the Clustering Structure
  Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
 Produces a special order of the database w.r.t. its density-based clustering structure
 This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
 Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure
 Can be represented graphically or using visualization techniques
OPTICS: Some Extension from
DBSCAN

 Index-based:
  k = number of dimensions
  N = 20
  p = 75%
  M = N(1 - p) = 5
 Complexity: O(kN^2)
 Core distance of an object o: the smallest distance that makes o a core object
 Reachability distance of p from o: max(core-distance(o), d(o, p))
 [Figure: MinPts = 5, ε = 3 cm; r(p1, o) = 2.8 cm, r(p2, o) = 4 cm.]
 [Figure: reachability plot; reachability-distance (undefined where a point is not reachable) plotted in the cluster order of the objects; valleys correspond to clusters.]
Density-Based Clustering: OPTICS & Its
Applications

 [Figure: OPTICS applied to example data sets.]
DENCLUE: Using Statistical Density
Functions
 DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
 Using statistical density functions:
$$f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$$
$$f_{Gaussian}^{D}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$
$$\nabla f_{Gaussian}^{D}(x, x_i) = \sum_{i=1}^{N} (x_i - x)\, e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$
 Major features
 Solid mathematical foundation
 Good for data sets with large amounts of noise
 Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
 Significantly faster than existing algorithms (e.g., DBSCAN)
 But needs a large number of parameters
Denclue: Technical Essence
 Uses grid cells but only keeps information about
grid cells that do actually contain data points and
manages these cells in a tree-based access
structure
 Influence function: describes the impact of a data
point within its neighborhood
 Overall density of the data space can be
calculated as the sum of the influence function of
all data points
 Clusters can be determined mathematically by
identifying density attractors
 Density attractors are local maxima of the overall density function
Density Attractor
 [Figure: a density attractor as a local maximum of the overall density function.]
Center-Defined and Arbitrary Clusters
 [Figure: center-defined clusters versus arbitrary-shape clusters found by DENCLUE.]
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Grid-Based Clustering Method

 Using multi-resolution grid data structure


 Several interesting methods
 STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
 WaveCluster by Sheikholeslami, Chatterjee, and Zhang
(VLDB’98)

A multi-resolution clustering approach using wavelet
method
 CLIQUE: Agrawal, et al. (SIGMOD’98)
  On high-dimensional data (thus put in the section on clustering high-dimensional data)
STING: A Statistical Information Grid
Approach
 Wang, Yang and Muntz (VLDB’97)
 The spatial area is divided into rectangular cells
 There are several levels of cells corresponding to
different levels of resolution

 [Figure: hierarchical grid structure with cells at multiple resolution levels.]
The STING Clustering Method

 Each cell at a high level is partitioned into a number


of smaller cells in the next lower level
 Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
 Parameters of higher level cells can be easily
calculated from parameters of lower level cell

count, mean, s, min, max

type of distribution—normal, uniform, etc.
 Use a top-down approach to answer spatial data
queries

Start from a pre-selected layer—typically with a
small number of cells
 For each cell in the current level compute the confidence interval
Comments on STING
 Remove the irrelevant cells from further
consideration
 When finish examining the current layer, proceed
to the next lower level
 Repeat this process until the bottom layer is
reached
 Advantages:
 Query-independent, easy to parallelize,

incremental update
 O(K), where K is the number of grid cells at the

lowest level
 Disadvantages:
  All the cluster boundaries are either horizontal or vertical; no diagonal boundary is detected
WaveCluster: Clustering by Wavelet Analysis
(1998)
 Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
 A multi-resolution clustering approach which applies wavelet
transform to the feature space
 How to apply wavelet transform to find clusters
 Summarizes the data by imposing a multidimensional
grid structure onto data space
 These multidimensional spatial data objects are represented in an n-dimensional feature space
 Apply wavelet transform on feature space to find the
dense regions in the feature space
 Apply wavelet transform multiple times which result in
clusters at different scales from fine to coarse
Wavelet Transform
 Wavelet transform: A signal processing technique
that decomposes a signal into different frequency
sub-band (can be applied to n-dimensional signals)
 Data are transformed to preserve relative distance
between objects at different levels of resolution
 Allows natural clusters to become more
distinguishable

The WaveCluster Algorithm
 Input parameters
 # of grid cells for each dimension

 the wavelet, and the # of applications of wavelet

transform
 Why is wavelet transformation useful for clustering?
 Use hat-shape filters to emphasize regions where points cluster, but simultaneously suppress weaker information in their boundary
 Effective removal of outliers, multi-resolution, cost

effective
 Major features:
 Complexity O(N)

 Detect arbitrary shaped clusters at different scales

 Not sensitive to noise, not sensitive to input order

 Only applicable to low dimensional data


 Both grid-based and density-based
Quantization
& Transformation
 First, quantize data into m-D grid
structure, then wavelet transform
 a) scale 1: high resolution

 b) scale 2: medium resolution

 c) scale 3: low resolution

Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Model-Based Clustering

 What is model-based clustering?
  Attempt to optimize the fit between the given data and some mathematical model
  Based on the assumption: data are generated by a mixture of underlying probability distributions
 Typical methods
  Statistical approach
   EM (Expectation Maximization), AutoClass
  Machine learning approach
   COBWEB, CLASSIT
  Neural network approach
   SOM (Self-Organizing Feature Map)
EM — Expectation Maximization
 EM — A popular iterative refinement algorithm
 An extension to k-means
 Assign each object to a cluster according to a weight (prob.
distribution)
 New means are computed based on weighted measures
 General idea
 Starts with an initial estimate of the parameter vector
 Iteratively rescores the patterns against the mixture density
produced by the parameter vector
 The rescored patterns are used to update the parameter estimates
 Patterns belong to the same cluster if their scores place them in the same component
 The algorithm converges fast but may not reach the global optimum
The EM (Expectation Maximization)
Algorithm

 Initially, randomly assign k cluster centers


 Iteratively refine the clusters based on two steps
 Expectation step: assign each data point X_i to cluster C_i with the probability P(C_i | X_i), computed from the current parameter estimates
 Maximization step: estimation of the model parameters
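A minimal EM sketch for a one-dimensional Gaussian mixture (illustrative only; component collapse and empty components are not guarded against):

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=50, seed=0):
    """Iterate E-step (responsibilities) and M-step (parameter updates)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, k)                   # random initial cluster centers
    sigma, pi = np.ones(k), np.ones(k) / k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 1, 200)])
print(em_gmm_1d(x))
```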
Conceptual Clustering
 Conceptual clustering
 A form of clustering in machine learning

 Produces a classification scheme for a set of

unlabeled objects
 Finds characteristic description for each concept

(class)
 COBWEB (Fisher’87)
  A popular and simple method of incremental conceptual learning
 Creates a hierarchical clustering in the form of a

classification tree
  Each node refers to a concept and contains a probabilistic description of that concept
COBWEB Clustering Method
 [Figure: a classification tree.]
More on Conceptual Clustering
 Limitations of COBWEB
 The assumption that the attributes are independent of each
other is often too strong because correlation may exist
 Not suitable for clustering large database data – skewed
tree and expensive probability distributions
 CLASSIT
 an extension of COBWEB for incremental clustering of
continuous data
 suffers similar problems as COBWEB
 AutoClass (Cheeseman and Stutz, 1996)
 Uses Bayesian statistical analysis to estimate the number
of clusters
 Popular in industry
Neural Network Approach
 Neural network approaches
 Represent each cluster as an exemplar, acting as a “prototype” of the cluster
 New objects are distributed to the cluster whose exemplar is the most similar according to some distance measure
 Typical methods
 SOM (Self-Organizing feature Map)

 Competitive learning


Involves a hierarchical architecture of several units
(neurons)

Neurons compete in a “winner-takes-all” fashion for
the object currently being presented
CLIQUE (Clustering In QUEst)

 Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)


 Automatically identifying subspaces of a high dimensional data
space that allow better clustering than original space
 CLIQUE can be considered as both density-based and grid-
based
 It partitions each dimension into the same number of equal
length interval
 It partitions an m-dimensional data space into non-
overlapping rectangular units
 A unit is dense if the fraction of total data points contained
in the unit exceeds the input model parameter
 A cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps
 Partition the data space and find the number of
points that lie inside each cell of the partition.
 Identify the subspaces that contain clusters using
the Apriori principle
 Identify clusters
 Determine dense units in all subspaces of
interests
 Determine connected dense units in all
subspaces of interests.
 Generate minimal description for the clusters
  Determine maximal regions that cover a cluster of connected dense units for each cluster
  Determine the minimal cover for each cluster
 [Figure: CLIQUE example. Dense units are found in the (age, salary) and (age, vacation) subspaces with density threshold τ = 3; intersecting the dense regions over age 30-50 yields a candidate cluster in the (age, salary, vacation) subspace.]
Strength and Weakness of
CLIQUE

 Strength
  automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
  insensitive to the order of records in input and does not presume some canonical data distribution
  scales linearly with the size of input and has good scalability as the number of dimensions in the data increases
 Weakness
  The accuracy of the clustering result may be degraded at the expense of the simplicity of the method
Frequent Pattern-Based Approach
 Clustering high-dimensional space (e.g., clustering text
documents, microarray data)
 Projected subspace-clustering: which dimensions to be
projected on?

CLIQUE, ProClus
 Feature extraction: costly and may not be effective?
 Using frequent patterns as “features”
  “Frequent” patterns are inherent features
  Mining frequent patterns may not be so expensive
 Typical methods
 Frequent-term-based document clustering
 Clustering by pattern similarity in micro-array data
(pClustering)
Clustering by Pattern Similarity (p-
Clustering)

 Right: the micro-array “raw” data shows 3 genes and their values in a multi-dimensional space
  Difficult to find their patterns
 Bottom: some subsets of dimensions form nice shift and scaling patterns
 [Figure: raw expression curves, and the shift and scaling patterns revealed on selected subsets of dimensions.]
Why p-Clustering?
 Microarray data analysis may need to
 Clustering on thousands of dimensions (attributes)
 Discovery of both shift and scaling patterns
 Clustering with Euclidean distance measure? — cannot find shift
patterns
 Clustering on derived attribute Aij = ai – aj? — introduces N(N-1)
dimensions
 Bi-cluster using the transformed mean-squared residue score matrix H(I, J), where
$$d_{iJ} = \frac{1}{|J|} \sum_{j \in J} d_{ij}, \qquad d_{Ij} = \frac{1}{|I|} \sum_{i \in I} d_{ij}, \qquad d_{IJ} = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} d_{ij}$$
 A submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0
 Problems with bi-cluster
  No downward closure property
  Due to averaging, it may contain outliers yet still stay within the δ-threshold
p-Clustering:
Clustering by
Pattern Similarity
 Given objects x, y in O and features a, b in T, pScore is defined on a 2 × 2 matrix:
$$pScore\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|$$
 A pair (O, T) is in δ-pCluster if for any 2 by 2 matrix X in (O,
T), pScore(X) ≤ δ for some δ > 0
 Properties of δ-pCluster
 Downward closure
 Clusters are more homogeneous than bi-cluster (thus the
name: pair-wise Cluster)
 A pattern-growth algorithm has been developed for efficient mining
 For scaling patterns, taking the logarithm of
$$\frac{d_{xa} / d_{ya}}{d_{xb} / d_{yb}}$$
 leads to the pScore form
Outlier Detection
What Is Outlier Discovery?
 What are outliers?
  A set of objects that are considerably dissimilar from the remainder of the data
  Example: Sports: Michael Jordan, Wayne Gretzky, ...
 Problem: Define and find outliers in large data sets
 Applications:
 Credit card fraud detection

 Telecom fraud detection

 Customer segmentation

 Medical analysis

Outlier Discovery: Statistical Approaches

 Assume a model of the underlying distribution that generates the data set (e.g., normal distribution)
 Use discordancy tests depending on
 data distribution

 distribution parameter (e.g., mean, variance)

 number of expected outliers

 Drawbacks
 most tests are for single attribute

 In many cases, the data distribution may not be known
Outlier Discovery: Distance-Based
Approach

 Introduced to counter the main limitations imposed by statistical methods
  We need multi-dimensional analysis without knowing the data distribution
 Distance-based outlier: A DB(p, D)-outlier is an
object O in a dataset T such that at least a fraction
p of the objects in T lies at a distance greater than
D from O
 Algorithms for mining distance-based outliers
 Index-based algorithm

 Nested-loop algorithm
 Cell-based algorithm

Index-based algorithm: The index-based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for neighbors of each object o within radius D around that object. Once K (K = N(1 - p)) neighbors of object o are found, it follows that o is not an outlier. This algorithm has a worst-case complexity of O(k · n^2), where k is the dimensionality and n is the number of objects in the data set.

•Nested-loop algorithm: The nested-loop algorithm has the same computational complexity as the index-based algorithm but avoids building index structures and minimizes the amount of I/O. It splits the memory buffer in half and puts the data into several logical blocks.
•Cell-based algorithm: It avoids the O(n^2) computational complexity by developing a cell-based algorithm for memory-resident datasets. Its complexity is O(c · k + n), where c is a constant based on the number of cells and k is the dimensionality.
Density-Based Local
Outlier Detection
 Distance-based outlier detection is based on the global distance distribution
 It encounters difficulties in identifying outliers if the data are not uniformly distributed
 Ex. C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points, plus 2 outlier points o1, o2
 A distance-based method cannot identify o2 as an outlier
 Need the concept of a local outlier
 Local outlier factor (LOF)
  Assume the outlier notion is not crisp
  Each point has a LOF
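A usage sketch with scikit-learn's LOF implementation (the cluster layout mirrors the C1/C2 example above; the exact coordinates are made up):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
C1 = rng.normal(0, 2.0, (400, 2))      # loosely distributed cluster
C2 = rng.normal(8, 0.2, (100, 2))      # tightly condensed cluster
X = np.vstack([C1, C2, [[8.0, 2.0]], [[8.0, 6.5]]])  # o1, o2

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)            # -1 marks local outliers
scores = -lof.negative_outlier_factor_ # larger score = more outlying
```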
Outlier Discovery: Deviation-Based
Approach

 Identifies outliers by examining the main characteristics of objects in a group
 Objects that “deviate” from this description are
considered outliers
 Sequential exception technique
 simulates the way in which humans can
distinguish unusual objects from among a
series of supposedly like objects
 OLAP data cube technique
 uses data cubes to identify regions of anomalies in large multidimensional data
