Intro Data Science: Cluster Analysis
AGENDA
I. CLUSTER ANALYSIS
II. THE K-MEANS ALGORITHM
III. CHOOSING K
IV. EXAMPLE
I. CLUSTER ANALYSIS
CLUSTER ANALYSIS
              continuous           categorical
supervised    ???                  ???
unsupervised  ???                  ???
CLUSTER ANALYSIS
              continuous           categorical
supervised    regression           classification
unsupervised  dimension reduction  clustering
CLUSTER ANALYSIS
Q: What is a cluster?
http://i.huffpost.com/gen/1563531/thumbs/o-GROCERY-STORE-facebook.jpg
CLUSTER ANALYSIS
There are many kinds of clustering procedures.
For our class, we will focus on k-means
clustering, which is one of the most popular
clustering algorithms.
II. K-MEANS CLUSTERING
K-MEANS CLUSTERING
Q: How does the algorithm work?
THE BASIC K-MEANS ALGORITHM
1) choose k initial centroids (note that k is an input)
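Only the first step survives on this slide. As a rough sketch, the commonly taught iteration (often called Lloyd's algorithm) continues by assigning each point to its nearest centroid, recomputing each centroid as the mean of its points, and repeating until the assignments stop changing. The function name `kmeans`, the helper names, the toy data, and the choice to use the first k points as initial centroids are all assumptions made for illustration; they are not taken from the slides.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100):
    # Step 1: choose k initial centroids. Assumption: take the first k
    # points so the sketch is deterministic; random sampling is more usual.
    centroids = list(points[:k])
    labels = None
    for _ in range(max_iter):
        # Assign every point to the index of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return centroids, labels


# Hypothetical toy data: two well-separated groups, interleaved.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

Because k is an input, the function always returns exactly k centroids, whatever the data look like.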
III. CLUSTER VALIDATION
CLUSTER VALIDATION
In general, k-means will converge to a solution and
return a partition of k clusters, even if no natural
clusters exist in the data.
source: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf
SILHOUETTE COEFFICIENT
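One useful measure that combines the ideas of
cohesion and separation is the silhouette coefficient.
For point $x_i$, this is given by:
$$ s_i = \frac{b_i - a_i}{\max(a_i, b_i)} $$
such that:
$a_i$ = average distance from $x_i$ to the other points in its own cluster
$b_{ij}$ = average distance from $x_i$ to the points in another cluster $j$
$b_i = \min_j b_{ij}$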
SILHOUETTE COEFFICIENT
The silhouette coefficient can take values between -1
and 1.
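To make the definition concrete, here is a small sketch that computes the silhouette coefficient of each point from the average in-cluster distance and the smallest average distance to any other cluster. The function name `silhouette_values`, the toy points, and the convention of returning 0 for a cluster with a single point are assumptions for illustration. The sketch also assumes at least two clusters are present.

```python
import math


def silhouette_values(points, labels):
    """Silhouette coefficient of every point (requires >= 2 clusters)."""
    clusters = sorted(set(labels))
    values = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Cohesion: distances to the other members of the point's own cluster.
        same = [
            math.dist(p, q)
            for j, (q, lab) in enumerate(zip(points, labels))
            if lab == own and j != i
        ]
        if not same:
            values.append(0.0)  # assumed convention for singleton clusters
            continue
        a = sum(same) / len(same)
        # Separation: smallest average distance to any other cluster.
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in clusters
            if c != own
        )
        values.append((b - a) / max(a, b))
    return values


# Hypothetical toy data: two tight pairs that are far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

good = silhouette_values(points, [0, 0, 1, 1])  # pairs grouped sensibly
bad = silhouette_values(points, [0, 1, 0, 1])   # pairs split across clusters

print(sum(good) / len(good))  # about 0.80
print(sum(bad) / len(bad))    # negative
```

Averaging the per-point values gives a single score for a partition, so the same data can be clustered with several values of k and the scores compared.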
Weaknesses:
K-means is highly scale dependent and is
not well suited to data with widely varying shapes and
densities.
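The scale dependence can be seen without running k-means at all, because the algorithm relies on Euclidean distance. In the sketch below, a feature measured in large units dominates the distance until each feature is rescaled to zero mean and unit variance (z-scores), one common remedy. The `standardize` helper and the age and income figures are hypothetical and only serve as an illustration.

```python
import math
import statistics


def standardize(points):
    """Rescale each feature to zero mean and unit (population) variance."""
    columns = list(zip(*points))
    means = [statistics.fmean(col) for col in columns]
    # Guard against constant features: fall back to 1.0 to avoid dividing by 0.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(p, means, stdevs))
        for p in points
    ]


def nearest_neighbour(points, i):
    """Index of the point closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    return min(others, key=lambda j: math.dist(points[i], points[j]))


# Hypothetical data: (age in years, income in dollars).
raw = [(25, 30000), (60, 30500), (27, 33000), (58, 33500)]
scaled = standardize(raw)

# In raw units income swamps age, so the 25-year-old looks closest to the
# 60-year-old with a similar income. After rescaling, age counts too and
# the closest point becomes the 27-year-old.
print(nearest_neighbour(raw, 0))
print(nearest_neighbour(scaled, 0))
```

Since the nearest neighbour changes with the units alone, the clusters k-means finds can change as well, which is why features are often rescaled before clustering.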
EX: K-MEANS