
BIG DATA ANALYTICS

Lecture 10 --- Week 11


Content

 Overview of Clustering

 Some Applications of Clustering

 Uses of Clustering

 Similarity and Distance Measures

 Jaccard's coefficient and distance, simple matching coefficient and distance, and Hamming distance
Overview of Clustering

 In general, a grouping of objects such that the objects in a group
(cluster) are similar (or related) to one another and different from (or
unrelated to) the objects in other groups

[Figure: two groupings of points; intra-cluster distances are minimized,
inter-cluster distances are maximized]
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Grouping a set of data objects into clusters
 Clustering is unsupervised classification: no
predefined classes
 Clustering is used:
 As a stand-alone tool to get insight into data distribution
 Visualization of clusters may unveil important information
 As a preprocessing step for other algorithms
 Efficient indexing or compression often relies on clustering
Some Applications of Clustering

 Pattern Recognition
 Image Processing
 cluster images based on their visual content
 Bio-informatics
 WWW and IR
 document classification
 cluster Weblog data to discover groups of similar access
patterns
Uses of Clustering

 Understanding
 Group related documents for browsing, genes and proteins that have
similar functionality, stocks with similar price fluctuations, users
with the same behavior

   Discovered Clusters                                        Industry Group
 1 Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN,           Technology1-DOWN
   Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN,
   DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN,
   Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down,
   Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
 2 Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN,                  Technology2-DOWN
   ADV-Micro-Device-DOWN, Andrew-Corp-DOWN,
   Computer-Assoc-DOWN, Circuit-City-DOWN,
   Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,
   Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
 3 Fannie-Mae-DOWN, Fed-Home-Loan-DOWN,                       Financial-DOWN
   MBNA-Corp-DOWN, Morgan-Stanley-DOWN
 4 Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP,      Oil-UP
   Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP,
   Schlumberger-UP

 Summarization
 Reduce the size of large data sets

 Applications
 Recommendation systems
 Search Personalization

[Figure: clustering of precipitation in Australia]
Outliers

 Outliers are objects that do not belong to any
cluster or form clusters of very small cardinality

[Figure: a cluster of points with a few outlying points marked as outliers]

 In some applications we are interested in
discovering outliers, not clusters (outlier analysis)
Data Structures

 Data matrix (two modes): n tuples/objects (rows) by p
attributes/dimensions (columns); the "classic" data input

        | x_11  ...  x_1f  ...  x_1p |
        | ...   ...  ...   ...  ...  |
        | x_i1  ...  x_if  ...  x_ip |
        | ...   ...  ...   ...  ...  |
        | x_n1  ...  x_nf  ...  x_np |

 Dissimilarity or distance matrix (one mode): objects by objects,
assuming a symmetric distance d(i, j) = d(j, i)

        | 0                              |
        | d(2,1)  0                      |
        | d(3,1)  d(3,2)  0              |
        | :       :       :              |
        | d(n,1)  d(n,2)  ...   ...   0  |
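As a sketch of how the two structures relate, the one-mode dissimilarity matrix can be computed from the two-mode data matrix with any distance function; Euclidean distance and the variable names below are illustrative choices, not from the slides:

```python
def euclidean(x, y):
    # L2 distance between two p-dimensional tuples
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

# Data matrix: n objects (rows) by p attributes (columns)
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

# Lower-triangular dissimilarity matrix: row i holds d(i, j) for j <= i,
# matching the one-mode layout above (symmetry makes the upper half redundant)
dist = [[euclidean(data[i], data[j]) for j in range(i + 1)]
        for i in range(len(data))]

print(dist)  # [[0.0], [5.0, 0.0], [10.0, 5.0, 0.0]]
```

Storing only the lower triangle mirrors the slide's "one mode" point: with a symmetric distance, half the matrix carries all the information.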
Similarity and Distance

 For many different problems we need to quantify how close two objects
are.
 Examples:
 For an item bought by a customer, find other similar items
 Group together the customers of a site so that similar customers are shown
the same ad.
 Group together web documents so that you can separate the ones that talk
about politics and the ones that talk about sports.
 Find all the near-duplicate mirrored web documents.
 Find credit card transactions that are very different from previous transactions.
 To solve these problems we need a definition of similarity, or distance.
 The definition depends on the type of data that we have
Similarity

 Numerical measure of how alike two data objects are
 A function that maps pairs of objects to real values
 Higher when objects are more alike
 Often falls in the range [0,1], sometimes in [-1,1]

 Desirable properties for similarity
1. s(p, q) = 1 (or maximum similarity) only if p = q (identity)
2. s(p, q) = s(q, p) for all p and q (symmetry)
Similarity between sets

 Consider the following documents:

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe

 Which ones are more similar?
 How would you quantify their similarity?

Similarity: Intersection

 Number of words in common:

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe

 Sim(D1, D2) = 3, Sim(D1, D3) = Sim(D2, D3) = 2
 What about this document?

D4: Vefa releases new book with apple pie recipes

 Sim(D1, D4) = Sim(D2, D4) = 3
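The word-overlap counts above can be reproduced with plain set intersection; a minimal sketch, where the labels D1 to D4 are shorthand for the four example documents:

```python
# Intersection similarity: number of distinct words two documents share.
def intersection_sim(doc_a, doc_b):
    return len(set(doc_a.split()) & set(doc_b.split()))

d1 = "apple releases new ipod"
d2 = "apple releases new ipad"
d3 = "new apple pie recipe"
d4 = "Vefa releases new book with apple pie recipes"

print(intersection_sim(d1, d2))  # 3 (apple, releases, new)
print(intersection_sim(d1, d3))  # 2 (apple, new)
print(intersection_sim(d1, d4))  # 3 (apple, releases, new)
```

Note that the raw intersection gives the long document D4 the same score as the short D2, which is the weakness the Jaccard coefficient fixes later by also dividing by the union size.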
Measuring Similarity in Clustering
 Dissimilarity/Similarity metric:

 The dissimilarity d(i, j) between two objects i and j is


expressed in terms of a distance function, which is typically a
metric:
 d(i, j)0 (non-negativity)
 d(i, i)=0 (isolation)
 d(i, j)= d(j, i) (symmetry)
 d(i, j) ≤ d(i, h)+d(h, j) (triangular inequality)

 The definitions of distance functions are usually


different for interval-scaled, boolean, categorical,
ordinal and ratio-scaled variables.

 Weights may be associated with different variables


based on applications and data semantics.
Type of data in cluster analysis
 Interval-scaled variables
 e.g., salary, height

 Binary variables
 e.g., gender (M/F), has_cancer(T/F)

 Nominal (categorical) variables


 e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)

 Ordinal variables
 e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)

 Ratio-scaled variables
 population growth (1,10,100,1000,...)

 Variables of mixed types


 multiple attributes with various types
Similarity and Dissimilarity Between
Objects
 Distance metrics are normally used to measure
the similarity or dissimilarity between two data
objects
 The most popular conform to the Minkowski distance:

L_p(i, j) = (|x_i1 - x_j1|^p + |x_i2 - x_j2|^p + ... + |x_in - x_jn|^p)^(1/p)

where i = (x_i1, x_i2, ..., x_in) and j = (x_j1, x_j2, ..., x_jn) are two n-
dimensional data objects, and p is a positive integer

 If p = 1, L1 is the Manhattan (or city block) distance:

L_1(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_in - x_jn|
Similarity and Dissimilarity Between
Objects (Cont.)
 If p = 2, L2 is the Euclidean distance:

d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_in - x_jn|^2)

 Properties:
 d(i, j) ≥ 0
 d(i, i) = 0
 d(i, j) = d(j, i)
 d(i, j) ≤ d(i, k) + d(k, j)

 Also one can use a weighted distance:

d(i, j) = sqrt(w_1 |x_i1 - x_j1|^2 + w_2 |x_i2 - x_j2|^2 + ... + w_n |x_in - x_jn|^2)
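The Minkowski family and its weighted variant translate directly from the formulas; a sketch with illustrative function names and sample points:

```python
def minkowski(x, y, p):
    # L_p distance: (sum over features of |x_f - y_f|^p)^(1/p)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def weighted_euclidean(x, y, w):
    # Weighted L_2 distance: sqrt(sum of w_f * |x_f - y_f|^2)
    return sum(wf * (a - b) ** 2 for wf, a, b in zip(w, x, y)) ** 0.5

i, j = (1, 2, 3), (4, 6, 3)
print(minkowski(i, j, 1))  # Manhattan: |1-4| + |2-6| + |3-3| = 7.0
print(minkowski(i, j, 2))  # Euclidean: sqrt(9 + 16 + 0) = 5.0
print(weighted_euclidean(i, j, (1, 1, 1)))  # unit weights reduce to plain L2 = 5.0
```

Setting all weights to 1 recovers the unweighted Euclidean distance, which is one quick sanity check for a weighted implementation.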
Jaccard Similarity
 The Jaccard similarity (Jaccard coefficient) of two sets S1, S2
is the size of their intersection divided by the size of their
union.
 JSim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|

[Figure: two overlapping sets with 3 elements in the intersection and
8 in the union; Jaccard similarity = 3/8]

 Extreme behavior:
 JSim(X, Y) = 1 iff X = Y
 JSim(X, Y) = 0 iff X and Y have no elements in common
 JSim is symmetric
Jaccard Similarity between sets

 The Jaccard similarity for the documents

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe
D4: Vefa releases new book with apple pie recipes

 JSim(D1, D2) = 3/5
 JSim(D1, D3) = JSim(D2, D3) = 2/6
 JSim(D1, D4) = JSim(D2, D4) = 3/9
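A minimal sketch of the Jaccard computation on these documents, treating each as a set of words (the labels D1 to D4 are shorthand for the four examples):

```python
def jaccard_sim(doc_a, doc_b):
    # |A ∩ B| / |A ∪ B| for the two documents' word sets
    a, b = set(doc_a.split()), set(doc_b.split())
    return len(a & b) / len(a | b)

d1 = "apple releases new ipod"
d2 = "apple releases new ipad"
d3 = "new apple pie recipe"
d4 = "Vefa releases new book with apple pie recipes"

print(jaccard_sim(d1, d2))  # 3/5 = 0.6
print(jaccard_sim(d1, d3))  # 2/6
print(jaccard_sim(d1, d4))  # 3/9
```

Unlike the raw intersection count, the long document D4 now scores no higher than the short D2 against D1, because its extra words inflate the union in the denominator.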
Binary Variables

 A binary variable has two states: 0 (absent), 1 (present)
 A contingency table for binary data, e.g.
i = (0 0 1 1 1 0 1 0 0 1)
j = (1 0 0 1 1 0 0 1 1 0)

                      object j
                   1      0      sum
object i    1      a      b      a+b
            0      c      d      c+d
           sum    a+c    b+d      p

 Simple matching coefficient distance (invariant, if the binary
variable is symmetric):

d(i, j) = (b + c) / (a + b + c + d)

 Jaccard coefficient distance (noninvariant, if the binary variable
is asymmetric):

d(i, j) = (b + c) / (a + b + c)
Binary Variables
 Another approach is to define the similarity of two
objects and not their distance.
 In that case we have the following:
 Simple matching coefficient similarity:

s(i, j) = (a + d) / (a + b + c + d)

 Jaccard coefficient similarity:

s(i, j) = a / (a + b + c)

 Note that s(i, j) = 1 - d(i, j)

Dissimilarity between Binary
Variables
 Example (Jaccard coefficient)
Name Fever Cough Test-1 Test-2 Test-3 Test-4
Jack 1 0 1 0 0 0
Mary 1 0 1 0 1 0
Jim 1 1 0 0 0 0

 all attributes are asymmetric binary


 1 denotes presence or positive test
 0 denotes absence or negative test
0 1
d ( jack , mary )  0.33
2  0 1
11
d ( jack , jim )  0.67
111
1 2
d ( jim , mary )  0.75
11 2
A simpler definition
 Each variable is mapped to a bitmap (binary vector)
Name Fever Cough Test-1 Test-2 Test-3 Test-4
Jack 1 0 1 0 0 0
Mary 1 0 1 0 1 0
Jim 1 1 0 0 0 0

 Jack: 101000
 Mary: 101010
 Jim: 110000
 Simple match distance:

d(i, j) = (number of non-common bit positions) / (total number of bits)

 Jaccard coefficient distance:

d(i, j) = 1 - (number of 1's in i AND j) / (number of 1's in i OR j)
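On the bitmap view, both distances reduce to per-position bit comparisons; a sketch assuming bit strings of equal length, using Jack and Mary's bitmaps from the table:

```python
def simple_match_dist(i, j):
    # Fraction of positions where the two bit strings disagree.
    return sum(a != b for a, b in zip(i, j)) / len(i)

def jaccard_bit_dist(i, j):
    # 1 - (positions where both bits are 1) / (positions where either bit is 1)
    both = sum(a == "1" and b == "1" for a, b in zip(i, j))
    either = sum(a == "1" or b == "1" for a, b in zip(i, j))
    return 1 - both / either

jack, mary = "101000", "101010"
print(simple_match_dist(jack, mary))  # 1/6: they differ only at Test-3
print(jaccard_bit_dist(jack, mary))   # 1 - 2/3 = 1/3, matching 0.33 above
```

The simple match distance divides by all six bits, while the Jaccard version divides only by the three positions where at least one patient shows a 1.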
Distance

 Numerical measure of how different two data objects are


 A function that maps pairs of objects to real values
 Lower when objects are more alike
 Higher when two objects are different
 Minimum distance is 0, when comparing an object with itself.
 Upper limit varies
Distance Metric

 A distance function d is a distance metric if it is a function from pairs
of objects to real numbers such that:
1. d(x, y) ≥ 0 (non-negativity)
2. d(x, y) = 0 iff x = y (identity)
3. d(x, y) = d(y, x) (symmetry)
4. d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)
Hamming Distance

 Hamming distance is the number of positions in which bit-vectors


differ.
 Example: p1 = 10101
p2 = 10011.
 d(p1, p2) = 2 because the bit-vectors differ in the 3rd and 4th positions.
 The L1 norm for the binary vectors

 Hamming distance between two vectors of categorical attributes


is the number of positions in which they differ.
 Example: x = (married, low income, cheat),
y = (single, low income, not cheat)
d(x,y) = 2
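Both flavors of Hamming distance, on bit-vectors and on categorical tuples, are the same position-wise comparison; a minimal sketch using the two examples above:

```python
def hamming(x, y):
    # Number of positions in which two equal-length sequences differ.
    return sum(a != b for a, b in zip(x, y))

print(hamming("10101", "10011"))  # 2: positions 3 and 4 differ

x = ("married", "low income", "cheat")
y = ("single", "low income", "not cheat")
print(hamming(x, y))  # 2: marital status and cheat attributes differ
```

Because Python compares elements generically, the same function handles bits, strings, or any other categorical values.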
