Unit II and III    Compiled by: Rashmi Bohra
• Simultaneous aggregation on multiple dimensions
• Intermediate aggregate values are re-used for computing ancestor cuboids
• Cannot do Apriori pruning: no iceberg optimization
[Figure: the cuboid lattice ABC with child cuboids AB, AC and BC, and a 3-D array over dimensions A (a0..a3), B (b0..b3), C (c0..c3) partitioned into 64 chunks numbered 1, 2, 3, ..., 64. What is the best traversing order to do multi-way aggregation?]
Multi-way Array Aggregation
for Cube Computation
[Figure: the same 64-chunk 3-D array, repeated to illustrate scanning the chunks in the order 1, 2, 3, ..., 64 during multi-way aggregation]
BUC (Bottom-Up Computation)
• Bottom-up computation
• No simultaneous aggregation
• If a partition does not satisfy min_sup, its descendants can be pruned
• If minsup = 1 ⇒ compute full CUBE!
[Figure: BUC processing tree over dimensions A, B, C, D with nodes numbered in processing order: 1 all; 2 A, 3 AB, 4 ABC, 6 ABD, 7 AC, 8 ACD, 9 AD; 10 B, 11 BC, 12 BCD, 13 BD; 14 C, 15 CD; 16 D]
[Figure: an H-tree built from the base table below, with a header table over the attribute values (Edu, Hhd, Bus; Jan, Feb, Mar; Tor, Van, Mon) connected by side-links; quant-info is kept at the nodes, e.g., Sum: 1765, Cnt: 2 for the two (Jan, Tor, Edu) tuples]

Month  City  Cust_grp  Prod     Cost  Price
Jan    Tor   Edu       Printer  500   485
Jan    Tor   Hhd       TV       800   1200
Jan    Tor   Edu       Camera   1160  1280
Feb    Mon   Bus       Laptop   1500  2500
Mar    Van   Edu       HD       540   520
H-Cubing: Computing Cells Involving
Dimension City
• From (*, *, Tor) to (*, Jan, Tor)
[Figure: the H-tree with a header table of (attr. value, quant-info, side-link) entries, e.g., Edu with Sum: 2285; side-links connect the Tor nodes so that cells involving the City dimension can be computed by following them; the (Jan, Tor) branch carries Quant-Info Sum: 1765, Cnt: 2]
Computing Cells Involving Month But
No City
1. Roll up quant-info
2. Compute cells involving month but no city
[Figure: the header table (attr. value, quant-info, side-link) now carries rolled-up quant-info per attribute value, e.g., Edu: Sum 2285; the Tor/Van/Mon leaf nodes are rolled up into their Jan/Feb/Mar parents]
• Top-k OK mark: if the Q.I. in a child passes the top-k avg threshold, so do its parents. No binning is needed!
Computing Cells Involving Only
Cust_grp
[Figure: the H-tree rooted at root, with header-table entries for Edu, Hhd and Bus only]
[Figure: Star-Cubing cuboid tree: ABCD/ABCD at the root, with child trees ABC/ABC, ABD/AB, ACD/A, BCD; below them AB/AB, AC/AC, AD/A, BC/BC, BD/B, CD; and the 1-D cuboids A/A, B/B, C/C, D/D; star-tree nodes such as root: 5, a1, a2, b*, b1, c*, c3, d*, d4 carry aggregated counts]
Aggregation on the base tree (A, B, C, D):
• The counts in the base tree are carried over to the new trees
• Lossless reduction
tid A B C D E
1 a1 b1 c1 d1 e1
2 a1 b2 c1 d2 e1
3 a1 b2 c1 d1 e2
4 a2 b1 c1 d1 e2
5 a2 b1 c1 d1 e3
• Each dimension value in a query takes one of three forms:
  1. Instantiated value
  2. Aggregate (*) function
  3. Inquire (?) function
• For example, ⟨3, ?, ?, *, 1⟩ : count returns a 2-D data cube.
Online Query Computation (2)
[Figure: dimensions A, B, C, ..., N; the instantiated base table and the online-computed cube]
Experiment: Size vs. Dimensionality
(50 and 100 cardinality)
Discovery-Driven Exploration of Data Cubes
• Hypothesis-driven
  – Exploration by user, huge search space
• Discovery-driven (Sarawagi, et al. '98)
  – Effective navigation of large OLAP data cubes
  – Pre-compute measures indicating exceptions, guide the user in the data analysis, at all levels of aggregation
  – Exception: significantly different from the value anticipated, based on a statistical model
  – Visual cues such as background color are used to reflect the degree of exception of each cell
What is Concept Description?
• Descriptive vs. predictive data mining
– Descriptive mining: describes concepts or task-relevant
data sets in concise, summarative, informative,
discriminative forms
– Predictive mining: Based on data and analysis, constructs
models for the database, and predicts the trend and
properties of unknown data
• Concept description:
– Characterization: provides a concise and succinct
summarization of the given collection of data
– Comparison: provides descriptions comparing two or
more collections of data
Data Generalization and
Summarization-based
Characterization
• Data generalization
  – A process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
  [Figure: conceptual levels 1 through 5]
– Approaches:
• Data cube approach(OLAP approach)
• Attribute-oriented induction approach
Concept Description vs.
OLAP
• Similarity:
– Data generalization
– Presentation of data summarization at multiple levels of abstraction.
– Interactive drilling, pivoting, slicing and dicing.
• Differences:
– Can handle complex data types of the attributes and their aggregations
– Automated desired level allocation.
– Dimension relevance analysis and ranking when there are many
relevant dimensions.
– Sophisticated typing on dimensions and measures.
– Analytical characterization: data dispersion analysis
Attribute-Oriented
Induction
• Proposed in 1989 (KDD ‘89 workshop)
• Not confined to categorical data nor particular measures
• How is it done?
– Collect the task-relevant data (initial relation) using a relational
database query
– Perform generalization by attribute removal or attribute
generalization
– Apply aggregation by merging identical, generalized tuples and
accumulating their respective counts
– Interactive presentation with users
            Birth_Region
Gender      Canada   Foreign   Total
M           16       14        30
F           10       22        32
Total       26       36        62
Presentation of Generalized
Results
• Generalized relation:
– Relations where some or all attributes are generalized, with counts
or other aggregation values accumulated.
• Cross tabulation:
– Mapping results into cross tabulation form (similar to contingency
tables).
• Visualization techniques:
  – Pie charts, bar charts, curves, cubes, and other visual forms.
• Quantitative characteristic rules:
  – Mapping generalized result into characteristic rules with quantitative information associated with it, e.g.,
    grad(x) ∧ male(x) ⇒ birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "foreign" [t: 47%]
Mining Class Comparisons
• Comparison: Comparing two or more classes
• Method:
– Partition the set of relevant data into the target class and the
contrasting class(es)
– Generalize both classes to the same high level concepts
– Compare tuples with the same high level descriptions
– Present for every tuple its description and two measures
• support - distribution within single class
• comparison - distribution between classes
– Highlight the tuples with strong discriminant features
• Relevance Analysis:
– Find attributes (features) which best distinguish different classes
Class Description
• Quantitative characteristic rule
  – Necessary condition of the target class:
    ∀X, target_class(X) ⇒ condition(X) [t : t_weight]
• Quantitative discriminant rule
  – Sufficient condition of the target class:
    ∀X, target_class(X) ⇐ condition(X) [d : d_weight]
• Quantitative description rule
  – Necessary and sufficient condition:
    ∀X, target_class(X) ⇔ condition_1(X) [t : w1, d : w′1] ∨ ... ∨ condition_n(X) [t : wn, d : w′n]
• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian classification
• Rule-based classification
• Classification by back propagation
• Lazy learners (or learning from your neighbors)
• Other classification methods
• Prediction
• Accuracy and error measures
• Ensemble methods
• Model selection
• Summary
[Figure: the classification process: classification algorithms learn a classifier from the training data; the classifier is then applied to testing data and to unseen data, e.g., (Jeff, Professor, 4) with the question Tenured?]

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
Supervised vs. Unsupervised Learning
• Supervised learning (classification): the training data are accompanied by labels indicating the class of the observations; new data are classified based on the training set
• Unsupervised learning (clustering): the class labels of the training data are unknown; given a set of measurements, the aim is to establish the existence of classes or clusters in the data
Issues: Data Preparation
• Data cleaning
  – Preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data
[Figure: decision tree for buys_computer: root node age? with branches <=30, 31..40 and >40, leading to leaves labeled no/yes]
• Information needed (after using A to split D into v partitions) to classify D:
    Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × I(D_j)
• C4.5 uses gain ratio to overcome the bias of information gain toward many-valued attributes:
    SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)
    GainRatio(A) = Gain(A) / SplitInfo_A(D)
• Ex. income splits the 14 tuples into partitions of size 4, 6 and 4:
    SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
  – gain_ratio(income) = 0.029 / 1.557 ≈ 0.019
• The attribute with the maximum gain ratio is selected as the splitting attribute
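A minimal sketch (mine, not from the original slides) that reproduces the SplitInfo and gain-ratio arithmetic above, assuming income partitions the 14 tuples into groups of 4, 6 and 4 and that Gain(income) = 0.029 as stated earlier:

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|)) over the partitions."""
    total = sum(partition_sizes)
    return -sum((n / total) * math.log2(n / total) for n in partition_sizes)

# income splits the 14 training tuples into |low| = 4, |medium| = 6, |high| = 4
si = split_info([4, 6, 4])            # ~1.557
gain_income = 0.029                    # information gain of income (from the slides)
gain_ratio = gain_income / si          # ~0.019
print(round(si, 3), round(gain_ratio, 3))
```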
Gini index (CART, IBM
IntelligentMiner)
• If a data set D contains examples from n classes, the gini index, gini(D), is defined as
    gini(D) = 1 − Σ_{j=1}^{n} p_j²
  where p_j is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
    gini_A(D) = (|D1| / |D|) · gini(D1) + (|D2| / |D|) · gini(D2)
• Reduction in impurity:
    Δgini(A) = gini(D) − gini_A(D)
• The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
Gini index (CART, IBM
IntelligentMiner)
• Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":
    gini(D) = 1 − (9/14)² − (5/14)² = 0.459
• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:
    gini_{income ∈ {low,medium}}(D) = (10/14) · Gini(D1) + (4/14) · Gini(D2)
  but gini_{income ∈ {medium,high}} is 0.30 and thus the best since it is the lowest
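A small sketch (mine, not the slides' code) of the Gini computations above, assuming the class split within each income group follows the 14-tuple buys_computer table used later in the Bayesian example (low/medium: 7 yes, 3 no; high: 2 yes, 2 no):

```python
def gini(counts):
    """gini(D) = 1 - sum(p_j^2), given class counts in D."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """gini_A(D): size-weighted Gini of the partitions induced by attribute A."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

g_D = gini([9, 5])                        # 0.459 for 9 "yes" / 5 "no"
# income in {low, medium}: 10 tuples (7 yes, 3 no); income in {high}: 4 tuples (2 yes, 2 no)
g_income = gini_split([[7, 3], [2, 2]])   # ~0.443
delta = g_D - g_income                     # reduction in impurity
print(round(g_D, 3), round(g_income, 3), round(delta, 3))
```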
• Bayes' theorem:
    P(C_i | X) = P(X | C_i) · P(C_i) / P(X)
• Since P(X) is constant for all classes, only
    P(X | C_i) · P(C_i)
  needs to be maximized
• For a continuous-valued attribute, P(x_k | C_i) is estimated with a Gaussian density:
    P(x_k | C_i) = g(x_k, μ_{C_i}, σ_{C_i})
Naïve Bayesian Classifier: Training Dataset

Class:  C1: buys_computer = 'yes',  C2: buys_computer = 'no'
Data sample:  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no
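A minimal sketch (mine, not the slides' code) of the naïve Bayesian computation for the sample X above, using exactly the 14 training tuples in the table:

```python
from collections import Counter

# (age, income, student, credit_rating, buys_computer): the 14 training tuples above
data = [
    ("<=30", "high",   "no",  "fair",      "no"),
    ("<=30", "high",   "no",  "excellent", "no"),
    ("31…40","high",   "no",  "fair",      "yes"),
    (">40",  "medium", "no",  "fair",      "yes"),
    (">40",  "low",    "yes", "fair",      "yes"),
    (">40",  "low",    "yes", "excellent", "no"),
    ("31…40","low",    "yes", "excellent", "yes"),
    ("<=30", "medium", "no",  "fair",      "no"),
    ("<=30", "low",    "yes", "fair",      "yes"),
    (">40",  "medium", "yes", "fair",      "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40","medium", "no",  "excellent", "yes"),
    ("31…40","high",   "yes", "fair",      "yes"),
    (">40",  "medium", "no",  "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")      # the sample to classify

classes = Counter(row[-1] for row in data)  # class counts: yes = 9, no = 5
scores = {}
for c, n_c in classes.items():
    prob = n_c / len(data)                  # prior P(Ci)
    for k, value in enumerate(X):           # naïve (class-conditional independence) assumption
        n_match = sum(1 for row in data if row[-1] == c and row[k] == value)
        prob *= n_match / n_c               # P(x_k | Ci)
    scores[c] = prob

print(scores, max(scores, key=scores.get))  # 'yes' wins (~0.028 vs ~0.007)
```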
• Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
• Rules are mutually exclusive and exhaustive
• FOIL_Prune(R) = (pos − neg) / (pos + neg)
  where pos/neg are the numbers of positive/negative tuples covered by R.
  If FOIL_Prune is higher for the pruned version of R, prune R
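A tiny illustrative helper (my own, with made-up pos/neg coverage counts) for the FOIL_Prune measure above:

```python
def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg) over the tuples covered by rule R."""
    return (pos - neg) / (pos + neg)

# hypothetical coverage counts for a rule R and a pruned version of R
r_full = foil_prune(pos=40, neg=10)     # 0.6
r_pruned = foil_prune(pos=38, neg=6)    # ~0.73
if r_pruned > r_full:                   # higher value for the pruned rule, so prune R
    print("prune R", round(r_pruned, 2))
```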
[Figure: a multi-layer feed-forward network: the input vector X feeds the input layer, which connects through weights w_ij to a hidden layer and then to the output layer that produces the output vector]
• Net input to unit j:  I_j = Σ_i w_ij · O_i + θ_j
• Output of unit j (sigmoid):  O_j = 1 / (1 + e^(−I_j))
• Error of an output-layer unit:  Err_j = O_j (1 − O_j)(T_j − O_j)
• Error of a hidden-layer unit:  Err_j = O_j (1 − O_j) Σ_k Err_k · w_jk
• Weight update:  w_ij = w_ij + (l) · Err_j · O_i
• Bias update:  θ_j = θ_j + (l) · Err_j
How Does a Multi-Layer Neural Network Work?
• The inputs to the network correspond to the attributes measured for each training tuple
• Inputs are fed simultaneously into the units making up the input layer
• They are then weighted and fed simultaneously to a hidden layer
• The number of hidden layers is arbitrary, although usually only one
• The weighted outputs of the last hidden layer are input to units making up the output layer,
which emits the network's prediction
• The network is feed-forward in that none of the weights cycles back to an input unit or to an
output unit of a previous layer
• From a statistical point of view, networks perform nonlinear regression: Given enough hidden
units and enough training samples, they can closely approximate any function
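A compact sketch (mine, not the slides' code) of one backpropagation update using the formulas listed above, for a hypothetical 2-3-1 network with learning rate l:

```python
import math, random

def sigmoid(i):
    return 1.0 / (1.0 + math.exp(-i))

random.seed(0)
n_in, n_hid, n_out, l = 2, 3, 1, 0.5
w_ih = [[random.uniform(-0.5, 0.5) for _ in range(n_hid)] for _ in range(n_in)]
w_ho = [[random.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hid)]
theta_h = [random.uniform(-0.5, 0.5) for _ in range(n_hid)]
theta_o = [random.uniform(-0.5, 0.5) for _ in range(n_out)]

x, target = [1.0, 0.0], [1.0]   # one hypothetical training tuple

# Forward pass: I_j = sum_i w_ij O_i + theta_j, O_j = 1 / (1 + e^-I_j)
o_h = [sigmoid(sum(w_ih[i][j] * x[i] for i in range(n_in)) + theta_h[j]) for j in range(n_hid)]
o_o = [sigmoid(sum(w_ho[j][k] * o_h[j] for j in range(n_hid)) + theta_o[k]) for k in range(n_out)]

# Backward pass: Err_j = O_j(1-O_j)(T_j-O_j) at the output layer,
#                Err_j = O_j(1-O_j) * sum_k Err_k w_jk at the hidden layer
err_o = [o_o[k] * (1 - o_o[k]) * (target[k] - o_o[k]) for k in range(n_out)]
err_h = [o_h[j] * (1 - o_h[j]) * sum(err_o[k] * w_ho[j][k] for k in range(n_out))
         for j in range(n_hid)]

# Updates: w_ij += l * Err_j * O_i,  theta_j += l * Err_j
for j in range(n_hid):
    for k in range(n_out):
        w_ho[j][k] += l * err_o[k] * o_h[j]
for i in range(n_in):
    for j in range(n_hid):
        w_ih[i][j] += l * err_h[j] * x[i]
theta_o = [theta_o[k] + l * err_o[k] for k in range(n_out)]
theta_h = [theta_h[j] + l * err_h[j] for j in range(n_hid)]
print(o_o, err_o)
```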
• Least-squares estimates of the regression coefficients:
    w_1 = Σ_{i=1}^{|D|} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{|D|} (x_i − x̄)²
    w_0 = ȳ − w_1 · x̄
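A short sketch (mine) of the least-squares estimates above, with a small made-up (x, y) data set:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]           # hypothetical predictor values
ys = [2.1, 3.9, 6.2, 8.0, 9.9]           # hypothetical response values

x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
w0 = y_bar - w1 * x_bar                   # w0 = y_bar - w1 * x_bar
print(w1, w0)                             # fitted line: y = w0 + w1 * x
```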
• Ensemble methods
– Use a combination of models to increase accuracy
– Combine a series of k learned models, M1, M2, …, Mk,
with the aim of creating an improved model M*
• Popular ensemble methods
– Bagging: averaging the prediction over a collection of
classifiers
– Boosting: weighted vote with a collection of classifiers
– Ensemble: combining a set of heterogeneous classifiers
Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
– Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
– A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
– Each classifier Mi returns its class prediction
– The bagged classifier M* counts the votes and assigns the class with the most
votes to X
• Prediction: can be applied to the prediction of continuous values by
taking the average value of each prediction for a given test tuple
• Accuracy
– Often significantly better than a single classifier derived from D
– For noisy data: not considerably worse, more robust
– Proven improved accuracy in prediction
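A minimal bagging sketch (my own, in plain Python, with a simple 1-nearest-neighbour stand-in for the base classifiers Mi and made-up training tuples):

```python
import random
from collections import Counter

def bootstrap(d):
    """Sample |D| tuples from D with replacement (one bootstrap training set Di)."""
    return [random.choice(d) for _ in d]

def train_1nn(d):
    """Stand-in base learner Mi: a 1-nearest-neighbour classifier over numeric tuples."""
    def predict(x):
        return min(d, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))[1]
    return predict

def bagged_predict(models, x):
    """M*: each Mi votes; the class with the most votes wins."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

random.seed(1)
# hypothetical training data: (features, class)
D = [((1.0, 1.2), "yes"), ((0.9, 0.8), "yes"), ((3.1, 3.0), "no"),
     ((2.9, 3.3), "no"), ((1.1, 0.9), "yes"), ((3.2, 2.8), "no")]
models = [train_1nn(bootstrap(D)) for _ in range(9)]    # k = 9 bootstrap rounds
print(bagged_predict(models, (1.0, 1.0)))                # -> "yes"
```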
Boosting
• Analogy: Consult several doctors, based on a combination of weighted
diagnoses—weight assigned based on the previous diagnosis accuracy
• How does boosting work?
– Weights are assigned to each training tuple
– A series of k classifiers is iteratively learned
– After a classifier Mi is learned, the weights are updated to allow the subsequent classifier,
Mi+1 , to pay more attention to the training tuples that were misclassified by Mi
– The final M* combines the votes of each individual classifier, where the weight of each
classifier's vote is a function of its accuracy
• The boosting algorithm can be extended for the prediction of continuous
values
• Comparing with bagging: boosting tends to achieve greater accuracy, but it
also risks overfitting the model to misclassified data
• Dissimilarity matrix (one mode):

      |  0                                 |
      |  d(2,1)   0                        |
      |  d(3,1)   d(3,2)   0               |
      |  :        :        :               |
      |  d(n,1)   d(n,2)   ...   ...   0   |
Type of data in clustering
analysis
• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types
• Standardize data
  – Calculate the mean absolute deviation:
      s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|)
    where
      m_f = (1/n) (x_1f + x_2f + ... + x_nf)
  – Standardized measurement:
      z_if = (x_if − m_f) / s_f
• Using mean absolute deviation is more robust than using standard deviation
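A brief sketch (mine) of the standardization above for one variable f, using a made-up list of measurements:

```python
xs = [4.0, 7.0, 9.0, 12.0, 18.0]                  # hypothetical values x_1f .. x_nf
m_f = sum(xs) / len(xs)                           # mean m_f
s_f = sum(abs(x - m_f) for x in xs) / len(xs)     # mean absolute deviation s_f
z = [(x - m_f) / s_f for x in xs]                 # standardized measurements z_if
print(m_f, s_f, z)
```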
• If q = 1, d is Manhattan distance:
    d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|
Similarity and Dissimilarity
Between Objects (Cont.)
• If q = 2, d is Euclidean distance:
    d(i, j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|²)
– Properties
• d(i,j) ≥ 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• d(i,j) ≤ d(i,k) + d(k,j)
• Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
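A small sketch (mine) of the Manhattan (q = 1), Euclidean (q = 2) and general Minkowski distances between two hypothetical objects i and j:

```python
def minkowski(xi, xj, q):
    """d(i, j) = (sum |x_ik - x_jk|^q)^(1/q); q = 1 Manhattan, q = 2 Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(xi, xj)) ** (1.0 / q)

i, j = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(i, j, 1))   # Manhattan: 7.0
print(minkowski(i, j, 2))   # Euclidean: 5.0
```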
Binary Variables
• A contingency table for binary data (object i vs. object j):

              Object j
              1       0       sum
  Object i  1 a       b       a+b
            0 c       d       c+d
        sum   a+c     b+d     p

• Distance measure for symmetric binary variables:
    d(i, j) = (b + c) / (a + b + c + d)
• Complete link: largest distance between an element in one cluster and an element
in the other, i.e., dis(Ki, Kj) = max(tip , tjq )
• Average: avg distance between an element in one cluster and an element in the
other, i.e., dis(Ki, Kj) = avg(tip , tjq )
• Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
– Medoid: one chosen, centrally located object in the cluster
• Diameter: square root of the average squared distance between all pairs of points in the cluster:
    D_m = sqrt( Σ_{i=1}^{N} Σ_{j=1}^{N} (t_i − t_j)² / (N (N − 1)) )
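A compact sketch (mine) of the inter-cluster distance measures above (complete link, average, centroid) for two small hypothetical 2-D clusters:

```python
import math

def d(p, q):
    return math.dist(p, q)

def complete_link(ki, kj):
    """Largest distance between an element of Ki and an element of Kj."""
    return max(d(p, q) for p in ki for q in kj)

def average_link(ki, kj):
    """Average distance between elements of the two clusters."""
    return sum(d(p, q) for p in ki for q in kj) / (len(ki) * len(kj))

def centroid_dist(ki, kj):
    """Distance between the centroids of the two clusters."""
    ci = [sum(c) / len(ki) for c in zip(*ki)]
    cj = [sum(c) / len(kj) for c in zip(*kj)]
    return d(ci, cj)

K1 = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
K2 = [(5.0, 5.0), (6.0, 5.0)]
print(complete_link(K1, K2), average_link(K1, K2), centroid_dist(K1, K2))
```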
[Figure: K-means on a 2-D point set with K = 2: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign the objects and repeat until no change]
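A bare-bones K-means sketch (mine, not the slides' pseudocode) following the assign/update loop illustrated above, on a made-up 2-D point set:

```python
import math, random

def kmeans(points, k, iters=100):
    random.seed(0)
    means = random.sample(points, k)                  # arbitrarily choose k initial centers
    clusters = []
    for _ in range(iters):
        # assignment step: each object goes to the most similar (closest) center
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: math.dist(p, means[c]))
            clusters[idx].append(p)
        # update step: recompute each cluster mean
        new_means = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else means[i]
                     for i, cl in enumerate(clusters)]
        if new_means == means:                        # no change, so stop
            break
        means = new_means
    return means, clusters

pts = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
print(kmeans(pts, k=2))
```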
Comments on the K-Means
Method
• Strength: Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
• Comparing: PAM: O(k(n-k)2 ), CLARA: O(ks2 + k(n-k))
• Comment: Often terminates at a local optimum. The global
optimum may be found using techniques such as: deterministic
annealing and genetic algorithms
• Weakness
– Applicable only when mean is defined, then what about categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
– Dissimilarity calculations
A Typical K-Medoids Algorithm (PAM)
[Figure: PAM with K = 2 on a 2-D point set, total cost = 26: arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid; randomly select a non-medoid object O_random; compute the total cost of swapping; swap the medoid O and O_random if the quality is improved; repeat (do loop) until no change]
[Figure: the four cases of the reassignment cost in PAM, showing how an object j (or t) is reassigned when the current medoid i is replaced by a candidate non-medoid h]
Clustering Feature Vector in BIRCH
• CF = (N, LS, SS): N = number of data points, LS = linear sum of the N points, SS = square sum of the N points
• Ex. for the five points (3,4), (2,6), (4,5), (4,7), (3,8):  CF = (5, (16, 30), (54, 190))
CF-Tree in BIRCH
• Clustering feature:
– summary of the statistics for a given subcluster: the 0-th, 1st and 2nd
moments of the subcluster from the statistical point of view.
– registers crucial measurements for computing cluster and utilizes
storage efficiently
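A small sketch (mine, not from the slides) of the clustering feature CF = (N, LS, SS) and its additivity, reproducing the five-point example shown just above:

```python
def cf(points):
    """CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum."""
    n = len(points)
    ls = tuple(sum(p[d] for p in points) for d in range(len(points[0])))
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(len(points[0])))
    return n, ls, ss

def cf_merge(cf1, cf2):
    """CFs are additive, so subclusters merge without revisiting the raw points."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(cf(pts))                               # (5, (16, 30), (54, 190))
print(cf_merge(cf(pts[:2]), cf(pts[2:])))    # the same CF, built incrementally
```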
[Figure: a CF-tree non-leaf node holding entries CF1, CF2, CF3, ..., CF5 with child pointers child1, child2, child3, ..., child5]
• Similarity measure in ROCK (Jaccard coefficient):  Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
  – Ex. Let T1 = {a, b, c}, T2 = {c, d, e}:  Sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2
Link Measure in ROCK
• Links: # of common neighbors
– C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e},
{b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
– C2 <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
[Figure: overall clustering framework: data set, graph partitioning, merging partitions, final clusters]
• Density-connected
  – A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
  [Figure: p and q are both density-reachable from o]
DBSCAN: Density Based Spatial
Clustering of Applications with Noise
• Relies on a density-based notion of cluster: A
cluster is defined as a maximal set of density-
connected points
• Discovers clusters of arbitrary shape in spatial
databases with noise
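A plain sketch (mine, not the slides' pseudocode) of the density-based cluster growth described above, in the style of DBSCAN; the point set and the Eps/MinPts values are made up:

```python
import math

def region_query(points, p, eps):
    """Eps-neighbourhood of p (including p itself)."""
    return [q for q in points if math.dist(p, q) <= eps]

def dbscan(points, eps, min_pts):
    labels = {}                                    # point -> cluster id, or -1 for noise
    cid = 0
    for p in points:
        if p in labels:
            continue
        neighbours = region_query(points, p, eps)
        if len(neighbours) < min_pts:              # not a core point: tentatively noise
            labels[p] = -1
            continue
        cid += 1
        labels[p] = cid
        seeds = [q for q in neighbours if q != p]
        while seeds:                               # grow the cluster from density-reachable points
            q = seeds.pop()
            if labels.get(q) == -1:                # noise becomes a border point of this cluster
                labels[q] = cid
            if q in labels:
                continue
            labels[q] = cid
            q_neighbours = region_query(points, q, eps)
            if len(q_neighbours) >= min_pts:       # q is a core point: expand further
                seeds.extend(q_neighbours)
    return labels

pts = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (5.0, 5.0), (5.1, 5.2), (5.2, 4.9), (9.0, 1.0)]
print(dbscan(pts, eps=0.6, min_pts=3))             # two clusters plus one noise point
```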
[Figure: core, border and outlier points of a cluster, with Eps = 1 cm and MinPts = 5]
• Reachability distance of p w.r.t. o:
    reachability-distance(p, o) = max(core-distance(o), d(o, p))
  – e.g., with MinPts = 5 and ε = 3 cm:  r(p1, o) = 2.8 cm, r(p2, o) = 4 cm
[Figure: OPTICS reachability plot: the reachability-distance (undefined for the first object) plotted against the cluster order of the objects; valleys in the plot correspond to clusters]
Density-Based Clustering: OPTICS & Its
Applications
DENCLUE: using statistical density functions
• Gaussian density function and its gradient:
    f_Gaussian^D(x) = Σ_{i=1}^{N} e^(−d(x, x_i)² / (2σ²))
    ∇f_Gaussian^D(x, x_i) = Σ_{i=1}^{N} (x_i − x) · e^(−d(x, x_i)² / (2σ²))
• Major features
– Solid mathematical foundation
– Good for data sets with large amounts of noise
– Allows a compact mathematical description of arbitrarily shaped clusters in high-
dimensional data sets
– Significantly faster than existing algorithms (e.g., DBSCAN)
– But needs a large number of parameters
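A short sketch (mine) of the Gaussian density function and gradient written above, for a made-up 1-D data set and σ, including a simple hill-climbing step toward a density attractor:

```python
import math

def gaussian_density(x, data, sigma):
    """f_D(x) = sum_i exp(-d(x, x_i)^2 / (2 sigma^2))."""
    return sum(math.exp(-((x - xi) ** 2) / (2 * sigma ** 2)) for xi in data)

def gaussian_gradient(x, data, sigma):
    """grad f_D(x) = sum_i (x_i - x) * exp(-d(x, x_i)^2 / (2 sigma^2))."""
    return sum((xi - x) * math.exp(-((x - xi) ** 2) / (2 * sigma ** 2)) for xi in data)

data, sigma = [1.0, 1.2, 1.1, 4.0, 4.2], 0.5
x = 1.05
for _ in range(50):                         # hill climbing toward a local density maximum
    x += 0.1 * gaussian_gradient(x, data, sigma)
print(round(x, 2), round(gaussian_density(x, data, sigma), 2))
```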
– Maximization step:
• Estimation of model parameters
[Figure: subspace clustering example: Salary and Vacation (week) plotted against age (20 to 60) with density threshold τ = 3; dense regions in the two 2-D subspaces intersect to form a candidate cluster in the 3-D (age, salary, vacation) space]
• Typical methods
– Frequent-term-based document clustering
– Clustering by pattern similarity in micro-array data (pClustering)
Clustering by Pattern Similarity (p-
Clustering)
• The micro-array "raw" data shows 3 genes and their values in a multi-dimensional space; it is difficult to find their patterns
• For a submatrix (I, J) of the data matrix with entries d_ij, define the means
    d_iJ = (1/|J|) Σ_{j∈J} d_ij,    d_Ij = (1/|I|) Σ_{i∈I} d_ij,    d_IJ = (1/(|I||J|)) Σ_{i∈I, j∈J} d_ij
– A submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0
• Problems with bi-cluster
– No downward closure property,
– Due to averaging, it may contain outliers but still within δ-threshold
• pScore of a 2 × 2 submatrix with entries d_xa, d_xb, d_ya, d_yb:
    pScore([[d_xa, d_xb], [d_ya, d_yb]]) = |(d_xa − d_xb) − (d_ya − d_yb)|
• A pair (O, T) is in a δ-pCluster if, for any 2 × 2 submatrix X in (O, T), pScore(X) ≤ δ for some δ > 0
• Properties of δ-pCluster
  – Downward closure
  – Clusters are more homogeneous than bi-cluster (thus the name: pair-wise Cluster)
• For scaling patterns, use the ratio condition
    (d_xa / d_ya) / (d_xb / d_yb) < δ
  Taking logarithms reduces this ratio condition to the pScore (shifting) form.
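A tiny sketch (mine, not from the slides) of the pScore check above, with made-up expression values for two objects x, y under attributes a, b:

```python
def p_score(d_xa, d_xb, d_ya, d_yb):
    """pScore of a 2x2 submatrix: |(d_xa - d_xb) - (d_ya - d_yb)|."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))

delta = 2.0
print(p_score(10.0, 13.0, 20.0, 23.5))            # 0.5 <= delta: a consistent shifting pattern
print(p_score(10.0, 13.0, 20.0, 30.0) <= delta)   # False: not in the delta-pCluster
```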
Chapter 6. Cluster
Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Why Constraint-Based Cluster
Analysis?
• Need user feedback: Users know their applications the best
• Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem: obstacles and desired clusters