
Intro Data Science: Cluster Analysis

This document discusses cluster analysis and the k-means clustering algorithm. It begins with an introduction to cluster analysis, defining a cluster as a group of similar data points. It then explains the basic k-means algorithm, which takes k initial centroids as input, assigns each point to the nearest centroid, recalculates the centroid positions, and repeats these steps until convergence is reached. Key aspects of k-means covered include choosing initial centroids, assessing similarity via Euclidean distance, and recomputing centroids. The goal of k-means is to partition observations into k clusters to enhance understanding of the dataset.

Uploaded by

ashishamitav123
Copyright
© All Rights Reserved

INTRO TO DATA SCIENCE

AGENDA

I. CLUSTER ANALYSIS
II. THE K-MEANS ALGORITHM
III. CHOOSING K
IV. EXAMPLE
I. CLUSTER
CLUSTER ANALYSIS

               continuous             categorical
supervised     regression             classification
unsupervised   dimension reduction    clustering
CLUSTER ANALYSIS

Q: What is a cluster?

A: A group of similar data points.

The concept of similarity is central to the definition of a cluster, and therefore to cluster analysis.

In general, greater similarity between the points within a cluster leads to better clustering.
CLUSTER ANALYSIS

Q: What is the purpose of cluster analysis?

A: To enhance our understanding of a dataset by dividing the data into groups.

Clustering provides a layer of abstraction from individual data points.

The goal is to extract and enhance the natural


CLUSTER ANALYSIS

Clustering can be useful in a wide variety of domains, including genetics, consumer internet and business.

http://i.huffpost.com/gen/1563531/thumbs/o-GROCERY-STORE-facebook.jpg
CLUSTER ANALYSIS

There are many kinds of clustering procedures. For our class, we will be focusing on K-means clustering, which is one of the most popular clustering algorithms.

K-means is an iterative method that partitions a data set into k clusters.
II. K-MEANS
K-MEANS CLUSTERING
Q: How does the algorithm work?
THE BASIC K-MEANS ALGORITHM

1) choose k initial centroids (note that k is an input)
2) for each point:
   - find distance to each centroid
   - assign point to nearest centroid
3) recalculate centroid positions
4) repeat steps 2-3 until stopping criteria met
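The four steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the random choice of initial centroids, the "no point changes cluster" stopping rule, and the toy data are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: choose k initial centroids (assumed here: k distinct random data points).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 4: stop when no point changes cluster (one possible stopping rule).
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Made-up data: two visibly separate groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster gets label 0 depends on the random initial choice.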
STEP 1 – CHOOSING INITIAL CENTROIDS

Q: How do you choose the initial centroid positions?

A: There are several options:

- randomly (but different runs may converge to different, sometimes poor, clusterings)
- perform alternative clustering task, use resulting centroids as initial k-means centroids
- start with global centroid, choose point at max distance, repeat (but might select outlier)
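The third option can be read in more than one way. The sketch below assumes one reading: use the global centroid only as a starting reference, take the data point farthest from it, then repeatedly take the point whose nearest already-chosen centroid is farthest away. The function name `farthest_first_centroids` and the data are hypothetical.

```python
import math


def farthest_first_centroids(points, k):
    """Pick k initial centroids by repeatedly taking the farthest point."""
    n = len(points)
    # Global centroid: the mean of all points, used only as the starting reference.
    global_centroid = tuple(sum(c) / n for c in zip(*points))
    # First centroid: the data point farthest from the global centroid.
    chosen = [max(points, key=lambda p: math.dist(p, global_centroid))]
    while len(chosen) < k:
        # Next centroid: the point whose nearest chosen centroid is farthest away.
        chosen.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in chosen))
        )
    return chosen


# Made-up data: two small groups plus one outlier at (100, 100).
data = [(0.0, 0.0), (0.5, 0.0), (10.0, 0.0), (10.0, 0.5), (100.0, 100.0)]
initial = farthest_first_centroids(data, k=3)
```

Here the first centroid chosen is the outlier (100.0, 100.0), which illustrates the caveat in the list: a max-distance rule is drawn towards extreme points.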
STEP 2 – ASSESS SIMILARITY

Q: How do you determine which centroid a given point is most similar to?

The similarity criterion is determined by the measure we choose.

In the case of k-means clustering, the similarity metric is the Euclidean distance (a smaller distance means greater similarity):

$$d(x_1, x_2) = \sqrt{\sum_{i=1}^{N} (x_{1i} - x_{2i})^2}$$
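As a small worked example of the distance and of the "nearest centroid" rule it drives, the sketch below uses two hypothetical helper functions, `euclidean` and `nearest_centroid`, and made-up centroids.

```python
import math


def euclidean(x1, x2):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))


def nearest_centroid(point, centroids):
    """Index of the centroid with the smallest Euclidean distance to point."""
    return min(range(len(centroids)), key=lambda j: euclidean(point, centroids[j]))


centroids = [(0.0, 0.0), (4.0, 4.0)]
d = euclidean((3.0, 4.0), (0.0, 0.0))          # 3-4-5 triangle: 5.0
idx = nearest_centroid((3.0, 4.0), centroids)  # distances 5.0 and 1.0, so index 1
```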
STEP 3 – RECOMPUTING THE CENTER

Q: How do we recompute the positions of the centers at each iteration of the algorithm?

A: By calculating the centroid (i.e., the geometric center) of the points currently assigned to each cluster, which is the mean of those points.
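Written out, the centroid of a cluster is the coordinate-wise mean of its points. The notation below (C_j for the set of points in cluster j, c_j for its centroid) and the three example points are assumptions introduced for illustration.

```latex
c_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
\qquad\text{e.g.}\qquad
C_j = \{(1,2),\,(3,4),\,(5,0)\}
\;\Rightarrow\;
c_j = \left(\tfrac{1+3+5}{3},\ \tfrac{2+4+0}{3}\right) = (3,\,2)
```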
STEP 4 – CONVERGENCE

We iterate until some stopping criteria are met; in general, suitable convergence is achieved in a small number of steps.

Stopping criteria can be based on the centroids (e.g., if positions change by no more than ε) or on the points (e.g., if no more than x% change clusters between iterations).
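Both kinds of stopping criterion are simple to express. In the sketch below the function names, the default tolerance `eps=1e-4`, and the default threshold of 1% are illustrative assumptions, not values taken from the slides.

```python
import math


def centroids_converged(old, new, eps=1e-4):
    """True if no centroid moved farther than eps."""
    return all(math.dist(a, b) <= eps for a, b in zip(old, new))


def assignments_converged(old_labels, new_labels, max_changed_pct=1.0):
    """True if at most max_changed_pct percent of points switched cluster."""
    changed = sum(a != b for a, b in zip(old_labels, new_labels))
    return 100.0 * changed / len(new_labels) <= max_changed_pct


# One centroid moved by 0.00005, which is within the tolerance.
centroids_stable = centroids_converged(
    [(0.0, 0.0), (5.0, 5.0)], [(0.0, 0.00005), (5.0, 5.0)]
)
# One of four points (25%) changed cluster, which exceeds a 10% threshold.
labels_stable = assignments_converged([0, 0, 1, 1], [0, 1, 1, 1], max_changed_pct=10.0)
```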
THE BASIC K-MEANS ALGORITHM

[Figure sequence: the four algorithm steps shown alongside plots with axes x1 and x2]
III. CLUSTER
CLUSTER VALIDATION

In general, k-means will converge to a solution and return a partition of k clusters, even if no natural clusters exist in the data.

We will look at two validation metrics useful for partitional clustering: cohesion and separation.
CLUSTER VALIDATION

Cohesion measures clustering effectiveness within a cluster: how close the points in a cluster are to one another.

Separation measures clustering effectiveness between clusters: how far apart different clusters are.

source: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf
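One common way to put numbers on these two ideas, assumed here rather than taken from the slides, is the within-cluster sum of squared errors (SSE) for cohesion and the size-weighted between-cluster sum of squares (SSB) for separation. The function names and data below are hypothetical.

```python
def mean(points):
    """Coordinate-wise mean of a list of tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cohesion_sse(clusters):
    """Within-cluster sum of squared distances to each cluster centroid."""
    return sum(sq_dist(p, mean(c)) for c in clusters for p in c)


def separation_ssb(clusters):
    """Size-weighted squared distance from each cluster centroid to the overall mean."""
    overall = mean([p for c in clusters for p in c])
    return sum(len(c) * sq_dist(mean(c), overall) for c in clusters)


# Made-up data: two clusters of two points each.
clusters = [[(1.0, 1.0), (3.0, 1.0)], [(7.0, 5.0), (9.0, 5.0)]]
sse = cohesion_sse(clusters)    # every point is 1 unit from its centroid: 4.0
ssb = separation_ssb(clusters)  # 2 * 13 + 2 * 13 = 52.0
```

With these definitions a tight, well-separated clustering has a small SSE and a large SSB, and the two always add up to the total sum of squares of the data around its overall mean (56.0 here).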
SILHOUETTE COEFFICIENT

One useful measure that combines the ideas of cohesion and separation is the silhouette coefficient. For point xi, this is given by:

$$SC_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \qquad b_i = \min_j b_{ij}$$

such that:
ai = average in-cluster distance to xi
bij = average between-cluster distance to xi (to the points of another cluster j)
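A per-point silhouette computation might look like the sketch below. It uses the standard definition (b minus a, divided by the larger of the two, with b taken as the smallest average distance to any other cluster). It assumes at least two clusters and no pair of clusters made up entirely of identical points, and it assigns 0 to points in single-point clusters, which is a convention chosen for this example.

```python
import math


def silhouette_values(points, labels):
    """Silhouette coefficient of every point, given one cluster label per point."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # convention assumed here for single-point clusters
            continue
        # a: average distance to the other points in the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest average distance to the points of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        values.append((b - a) / max(a, b))
    return values


# Made-up data: two tight pairs that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
s = silhouette_values(points, labels)
```

Every point here has an in-cluster distance of 1 and a between-cluster distance of about 10, so each value comes out near 0.9.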
SILHOUETTE COEFFICIENT

The silhouette coefficient can take values between -1 and 1.

In general, we want separation to be high and cohesion, measured here as the average in-cluster distance, to be low. This corresponds to a value of SC close to +1.

A negative silhouette coefficient means the cluster radius is larger than the space between clusters, and
SILHOUETTE COEFFICIENT

The silhouette coefficient for the cluster Ci is given by the average silhouette coefficient across all points in Ci:

$$SC(C_i) = \frac{1}{|C_i|} \sum_{x_j \in C_i} SC_j$$

where SC_j is the silhouette coefficient of point x_j.

The overall silhouette coefficient is given by the average silhouette coefficient across all points:

$$SC = \frac{1}{n} \sum_{j=1}^{n} SC_j$$

where n is the total number of points.

NOTE: This gives a summary measure of the overall clustering quality.
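The two averages are easy to confuse when clusters differ in size. The sketch below uses hypothetical per-point silhouette values and a hypothetical function name to show both.

```python
def cluster_and_overall_silhouette(values, labels):
    """Average per-point silhouette values per cluster and over all points."""
    by_cluster = {}
    for v, lab in zip(values, labels):
        by_cluster.setdefault(lab, []).append(v)
    per_cluster = {lab: sum(vs) / len(vs) for lab, vs in by_cluster.items()}
    overall = sum(values) / len(values)
    return per_cluster, overall


# Hypothetical per-point silhouette values for five points in two clusters.
values = [0.9, 0.7, 0.8, 0.2, 0.4]
labels = [0, 0, 0, 1, 1]
per_cluster, overall = cluster_and_overall_silhouette(values, labels)
```

The cluster averages are about 0.8 and 0.3 and the average over all points is about 0.6, whereas averaging the two cluster values would give 0.55. The two conventions only agree when all clusters have the same size.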
CLUSTER VALIDATION

One useful application of cluster validation is to determine the best number of clusters for your dataset.

Q: How would you do this?

A: By computing the SSE (sum of squared errors) or SC (silhouette coefficient) for different values of k.
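A sketch of the SSE variant follows: run k-means for several values of k and compare the resulting SSE. The helper names, the use of ten random restarts per k, and the data are assumptions for illustration; the same loop could record the silhouette coefficient instead.

```python
import math
import random


def kmeans_sse(points, k, seed, max_iter=100):
    """Run basic k-means once from a random start and return the final SSE."""
    centroids = random.Random(seed).sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def sse_by_k(points, k_values, restarts=10):
    """Lowest SSE found over several random restarts, for each k."""
    return {
        k: min(kmeans_sse(points, k, seed) for seed in range(restarts))
        for k in k_values
    }


# Made-up data: three tight, well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
        (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
sse = sse_by_k(data, k_values=[1, 2, 3, 4])
```

SSE tends to keep falling as k grows, so the usual reading is to look for the value of k after which the improvement becomes small, rather than for the minimum.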
CLUSTER VALIDATION

Ultimately, cluster validation and clustering in general are suggestive techniques that rely on human interpretation to be meaningful.
STRENGTHS AND WEAKNESSES

Strengths:
K-means is a popular algorithm because of its computational efficiency and simple and intuitive nature.

Weaknesses:
However, K-means is highly scale dependent (features with larger numeric ranges dominate the Euclidean distance), and is not suitable for data whose clusters have widely varying shapes and densities.
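Scale dependence is easy to see on a small example. The sketch below standardizes each feature to zero mean and unit standard deviation, which is one common remedy and an assumption of this example, not something the slides prescribe. The data and function names are made up, and the code assumes no feature is constant.

```python
import math
import statistics


def standardize(points):
    """Rescale every feature to zero mean and unit (population) standard deviation."""
    columns = list(zip(*points))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(p, means, stds))
        for p in points
    ]


def nearest_other(points, i):
    """Index of the point closest to points[i], excluding itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


# Made-up customers: (age in years, income in dollars).
raw = [(25.0, 50000.0), (27.0, 51000.0), (65.0, 50500.0), (67.0, 51500.0)]
scaled = standardize(raw)

before = nearest_other(raw, 0)    # 2: income dominates, the 40-year age gap is ignored
after = nearest_other(scaled, 0)  # 1: both features now contribute comparably
```

On the raw data the 25-year-old is "closest" to a 65-year-old with a similar income; after standardizing, the closest point is the 27-year-old.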
EX: K-MEANS
