Clustering
Agglomerative: It is a bottom-up approach in which the algorithm starts by treating every data
point as a single cluster and keeps merging clusters until only one cluster is left.
4. Explain the Agglomerative Hierarchical Clustering algorithm with the help of an example.
Step-1: In the first step, we compute the proximity of individual observations and consider all
data points as individual clusters.
Step-2: In this step, similar clusters are merged together to form a single cluster.
For our example, we consider B, C, and D, E are similar clusters that are merged in this step.
Step-3: We again compute the proximity of the new clusters and merge the similar clusters to form
new clusters, repeating until only a single cluster is left.
Explain the different linkage methods used in the Hierarchical Clustering Algorithm.
Single Linkage: For two clusters R and S, the single linkage returns the minimum distance
between two points i and j
Complete Linkage: For two clusters R and S, the complete linkage returns the maximum
distance between two points i and j
Average Linkage: For two clusters R and S, first the distance between every data point i in R
and every data point j in S is computed, and then the arithmetic mean of these distances is calculated.
Centroid-linkage: In this method, we find the centroid of cluster 1 and the centroid of cluster 2
and then calculate the distance between the two before merging.
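Written as formulas (a compact restatement of the definitions above, where d(i, j) is the chosen point-to-point distance and |R|, |S| are the cluster sizes):

\[
d_{\text{single}}(R,S)=\min_{i\in R,\; j\in S} d(i,j), \qquad
d_{\text{complete}}(R,S)=\max_{i\in R,\; j\in S} d(i,j),
\]
\[
d_{\text{average}}(R,S)=\frac{1}{|R|\,|S|}\sum_{i\in R}\sum_{j\in S} d(i,j), \qquad
d_{\text{centroid}}(R,S)=d\!\left(\bar{x}_R,\ \bar{x}_S\right),
\]

where \(\bar{x}_R\) and \(\bar{x}_S\) denote the centroids of clusters R and S.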
Pros of Single-linkage:
This approach can differentiate between non-elliptical shapes as long as the gap
between the two clusters is not small.
Cons of Single-linkage:
This approach cannot separate clusters properly if there is noise between clusters.
Pros of Complete-linkage:
This approach gives well-separating clusters if there is some kind of noise present
between clusters.
Cons of Complete-Linkage:
This approach tends to break large clusters and is biased towards globular (compact) clusters.
Ward’s method (a.k.a. Minimum variance method or Ward’s Minimum Variance Clustering
Method) is an alternative to single-link clustering. Popular in fields like linguistics, it’s liked
because it usually creates compact, even-sized clusters (Szmrecsanyi, 2012).
Like most other clustering methods, Ward's method is computationally intensive. However,
Ward's has significantly fewer computations than other methods; the drawback is that this
usually results in less-than-optimal clusters.
Like other clustering methods, Ward's method starts with n clusters, each containing a single
object. These n clusters are combined to make one cluster containing all objects. At each step,
the process makes the new cluster that minimizes the increase in variance, measured by an index
called E (the within-cluster sum-of-squares index).
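A minimal sketch of agglomerative clustering with the linkage choices discussed above, using SciPy; the toy data points and the cut into two clusters are illustrative assumptions, not taken from the text.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two loose groups (made-up values for illustration only)
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

# Build the merge tree with different linkage criteria and cut it into 2 clusters
for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)                     # (n-1) x 4 merge history
    labels = fcluster(Z, t=2, criterion="maxclust")   # flat labels after cutting the tree
    print(method, labels)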
Space complexity: the hierarchical clustering technique requires very high space when the number
of observations in our dataset is large, since we need to store the similarity (proximity) matrix,
which grows with the square of the number of observations, in RAM.
Time complexity: since we have to perform n iterations and in each iteration we need to
update the proximity matrix and store the updated matrix, the time complexity is also
very high. So, the time complexity is of the order of the cube of n.
List down some of the possible conditions for producing two different dendrograms using an
agglomerative clustering algorithm with the same dataset.
Possible reasons include a change in the proximity (distance) function used, a change in the
number of data points, or a change in the number of variables used.
To find the optimal number of clusters, Silhouette Score is considered to be one of the popular
approaches.
In K-Means clustering, the elbow method and silhouette analysis (silhouette score) are used to
find the number of clusters in a dataset. The elbow method looks for the "elbow" point, beyond
which adding another cluster does not reduce the within-cluster variation (or change cluster
membership) much. The silhouette score measures whether there are large gaps between each sample
and the samples of the nearest neighbouring cluster, i.e. how well separated the clusters are.
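A minimal sketch of both techniques with scikit-learn; the synthetic make_blobs data and the range of k values are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Elbow: plot/inspect the within-cluster sum of squares (inertia) versus k
    # Silhouette: mean gap between own-cluster and nearest-neighbour-cluster distances
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))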
How can we measure the goodness of the clusters in the Hierarchical Clustering Algorithm?
There are many measures for the goodness of clusters, but one of the most popular is the
intra-cluster diameter: the diameter of a cluster is calculated as the distance between its
two farthest points, and smaller diameters indicate tighter, better clusters.
The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging
the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between
finite sample sets, and is defined as the size of the intersection divided by the size of the union
of the sample sets:
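In symbols, for two finite sets A and B:

\[
J(A,B)=\frac{|A\cap B|}{|A\cup B|}=\frac{|A\cap B|}{|A|+|B|-|A\cap B|}, \qquad 0 \le J(A,B) \le 1.
\]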
Gaussian mixtures seem to be more robust. However, Gaussian mixtures usually tend to be slower
than K-Means because more iterations of the EM algorithm are needed to reach convergence.
Clustering with Gaussian mixture models can handle even oblong (elongated) clusters. It works
on the same principle as K-Means but has some advantages over it: it tells us which data point
belongs to which cluster along with the probability of that membership. In other words, it
performs soft (probabilistic) assignment, whereas K-Means performs hard assignment.
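A minimal sketch contrasting the two with scikit-learn (the synthetic data are an assumption for illustration): KMeans returns one label per point, while GaussianMixture also exposes per-cluster membership probabilities.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=0)

hard = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)   # one label per point
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft = gmm.predict_proba(X)   # shape (300, 3): membership probability for each cluster
print(hard[:5])
print(soft[:5].round(3))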
When using various clustering algorithms, why is the Euclidean distance metric often not
recommended, for example when clustering with 100 features? Up to how many features can it
reasonably be used?
Decision trees can also be used to find clusters in the data, but clustering often generates
natural clusters and is not dependent on any objective function.
At least a single variable is required to perform clustering analysis. Clustering analysis with a
single variable is still meaningful (for example, clustering people based on hair length alone).
For two runs of K-Means clustering, is it expected to get the same clustering results?
Not necessarily. The K-Means clustering algorithm converges to a local minimum, which might also
correspond to the global minimum in some cases but not always. Therefore, it's advised to run
K-Means multiple times with different random initializations before drawing inferences about the clusters.
While calculating Euclidean distance, variables on a small scale are suppressed by variables on a
large scale. E.g. age and salary: a change of 5 in age is a considerable change, whereas a change
of 5k in salary is a negligible change, yet the salary term dominates the unscaled distance.
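A small numeric illustration of this point (the ages and salaries are made-up values): without scaling, the salary term dominates the Euclidean distance, while dividing each feature by a rough scale lets both contribute.

import numpy as np

a = np.array([30, 50_000])   # [age in years, salary]
b = np.array([35, 55_000])

print(np.linalg.norm(a - b))                               # ~5000.0: driven almost entirely by salary
print(np.linalg.norm((a - b) / np.array([10, 20_000])))    # after rough scaling, both features matter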
All four of the following conditions can be used as possible termination criteria in K-Means
clustering (a minimal loop combining them is sketched after this list):
1. A fixed number of iterations. This condition limits the runtime of the clustering algorithm, but
in some cases the quality of the clustering will be poor because of an insufficient number of iterations.
2. Assignment of observations to clusters does not change between iterations. Except for cases with
a bad local minimum, this produces a good clustering, but runtimes may be unacceptably long.
3. Centroids do not change between iterations. This also ensures that the algorithm has converged
at a minimum.
4. Terminate when RSS (residual sum of squares) falls below a threshold. This criterion ensures that
the clustering is of a desired quality after termination. Practically, it's a good practice to combine
it with a bound on the number of iterations to guarantee termination.
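A minimal NumPy sketch of a K-Means loop that combines these termination criteria; the data, the tolerance value and the Forgy-style start are assumptions for illustration (empty clusters are not handled).

import numpy as np

def kmeans(X, k, max_iter=100, rss_threshold=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]            # Forgy-style start
    for _ in range(max_iter):                                       # criterion 1: fixed iteration cap
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        rss = ((X - new_centroids[labels]) ** 2).sum()              # residual sum of squares
        if np.allclose(new_centroids, centroids):                   # criteria 2/3: assignments/centroids stable
            break
        centroids = new_centroids
        if rss < rss_threshold:                                     # criterion 4: RSS below a chosen threshold
            break
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)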
Q9. Which of the following clustering algorithms suffers from the problem of convergence at
local optima?
Options:
A. 1 only
B. 2 and 3
C. 2 and 4
D. 1 and 3
E. 1,2 and 4
F. All of the above
Solution: (D)
Out of the options given, only the K-Means clustering algorithm and the EM clustering algorithm
have the drawback of converging at local optima.
Solution: (A)
Out of all the options, the K-Means clustering algorithm is most sensitive to outliers as it uses the
mean of the cluster data points to find the cluster center.
Q11. After performing K-Means Clustering analysis on a dataset, you observed the following
dendrogram. Which of the following conclusions can be drawn from the dendrogram?
A. There were 28 data points in clustering analysis
D. The above dendrogram interpretation is not possible for K-Means clustering analysis
Solution: (D)
A dendrogram is not possible for K-Means clustering analysis. However, one can assess the clusters
with a goodness-of-fit approach such as Dunn's index or silhouette analysis (silhouette score),
which measure whether the points within a cluster are close together and the distance between
clusters is large.
Q12. How can Clustering (Unsupervised Learning) be used to improve the accuracy of a Linear
Regression model (Supervised Learning)?
Options:
A. 1 only
B. 1 and 2
C. 1 and 4
D. 3 only
E. 2 and 4
F. All of the above
Solution: (F)
Creating an input feature for cluster ids as ordinal variable or creating an input feature for
cluster centroids as a continuous variable might not convey any relevant information to the
regression model for multidimensional data. But for clustering in a single dimension, all of the
given methods are expected to convey meaningful information to the regression model. For
example, to cluster people in two groups based on their hair length, storing clustering ID as
ordinal variable and cluster centroids as continuous variables will convey meaningful
information.
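A minimal sketch of this idea with scikit-learn (the one-dimensional synthetic data are an assumption for illustration): the cluster ID and the distance to the assigned centroid are appended as extra input features for a linear regression.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)
y = np.where(x.ravel() < 5, 2 * x.ravel(), 2 * x.ravel() + 10) + rng.normal(0, 0.5, 200)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(x)
cluster_id = km.labels_.reshape(-1, 1)                          # extra feature: cluster id
dist_to_centroid = km.transform(x).min(axis=1, keepdims=True)   # extra feature: distance to own centroid

X_aug = np.hstack([x, cluster_id, dist_to_centroid])
print(LinearRegression().fit(x, y).score(x, y))          # plain fit
print(LinearRegression().fit(X_aug, y).score(X_aug, y))  # fit with cluster-derived features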
Q13. What could be the possible reason(s) for producing two different dendrograms using an
agglomerative clustering algorithm on the same dataset?
A. Proximity function used
B. No. of data points used
C. No. of variables used
D. B and C only
E. All of the above
Solution: (E)
A change in any of the proximity function, the number of data points or the number of variables
used will lead to a different dendrogram.
A. 1
B. 2
C. 3
D. 4
Solution: (B)
Since the number of vertical lines intersecting the red horizontal line at y=2 in the dendrogram
is 2, the best choice is 2 clusters.
Q15. What is the most appropriate no. of clusters for the data points represented by the
following dendrogram:
A. 2
B. 4
C. 6
D. 8
Solution: (B)
The decision of the number of clusters that can best depict different groups can be made by
observing the dendrogram. The best choice of the number of clusters is the number of vertical
lines in the dendrogram cut by a horizontal line that can traverse the maximum distance vertically
without intersecting a cluster.
Q16. In which of the following cases will K-Means clustering fail to give good results?
Options:
A. 1 and 2
B. 2 and 3
C. 2 and 4
D. 1, 2 and 4
E. 1, 2, 3 and 4
Solution: (D)
K-Means clustering algorithm fails to give good results when the data contains outliers, the
density spread of data points across the data space is different and the data points follow non-
convex shapes.
Q17. Which of the following metrics do we have for finding dissimilarity between two clusters in
hierarchical clustering?
1. Single-link
2. Complete-link
3. Average-link
Options:
A. 1 and 2
B. 1 and 3
C. 2 and 3
D. 1, 2 and 3
Solution: (D)
All three methods, i.e. single link, complete link and average link, can be used for finding
dissimilarity between two clusters in hierarchical clustering.
Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of them
Solution: (A)
Clustering analysis is not negatively affected by heteroscedasticity, but the results are negatively
impacted by multicollinearity of the features/variables used in clustering, as the correlated
variables will carry extra weight in the distance calculation.
Which of the following clustering representations and dendrograms depicts the use of the MIN or
Single link proximity function in hierarchical clustering?
A.
B.
C.
D.
Solution: (A)
For the single link or MIN version of hierarchical clustering, the proximity of two clusters is
defined to be the minimum of the distance between any two points in the different clusters.
For instance, from the table, we see that the distance between points 3 and 6 is 0.11, and that
is the height at which they are joined into one cluster in the dendrogram. As another example,
the distance between clusters {3, 6} and {2, 5} is given by dist({3, 6}, {2, 5}) = min(dist(3, 2),
dist(6, 2), dist(3, 5), dist(6, 5)) = min(0.1483, 0.2540, 0.2843, 0.3921) = 0.1483.
A.
B.
C.
D.
Solution: (B)
For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is
defined to be the maximum of the distance between any two points in the different clusters.
Similarly, here points 3 and 6 are merged first. However, {3, 6} is merged with {4}, instead of {2,
5}. This is because the dist({3, 6}, {4}) = max(dist(3, 4), dist(6, 4)) = max(0.1513, 0.2216) =
0.2216, which is smaller than dist({3, 6}, {2, 5}) = max(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) =
max(0.1483, 0.2540, 0.2843, 0.3921) = 0.3921 and dist({3, 6}, {1}) = max(dist(3, 1), dist(6, 1)) =
max(0.2218, 0.2347) = 0.2347.
A.
B.
C.
D.
Solution: (C)
For the group average version of hierarchical clustering, the proximity of two clusters is defined
to be the average of the pairwise proximities between all pairs of points in the different
clusters. This is an intermediate approach between MIN and MAX. This is expressed by the
following equation:
Here are the distances between some of the clusters: dist({3, 6, 4}, {1}) = (0.2218 + 0.3688 + 0.2347)/(3 ∗
1) = 0.2751. dist({2, 5}, {1}) = (0.2357 + 0.3421)/(2 ∗ 1) = 0.2889. dist({3, 6, 4}, {2, 5}) = (0.1483 +
0.2843 + 0.2540 + 0.3921 + 0.2042 + 0.2932)/(6∗1) = 0.2637. Because dist({3, 6, 4}, {2, 5}) is
smaller than dist({3, 6, 4}, {1}) and dist({2, 5}, {1}), these two clusters are merged at the fourth
stage.
A.
B.
C.
D.
Solution: (D)
Ward method is a centroid method. Centroid method calculates the proximity between two
clusters by calculating the distance between the centroids of clusters. For Ward’s method, the
proximity between two clusters is defined as the increase in the squared error that results
when two clusters are merged. Applying Ward's method to the sample data set of six points gives a
clustering that is somewhat different from those produced by MIN, MAX, and group average.
Q23. What should be the best choice of no. of clusters based on the following results:
A. 1
B. 2
C. 3
D. 4
Solution: (C)
The silhouette coefficient is a measure of how similar an object is to its own cluster compared
to other clusters. The number of clusters for which the silhouette coefficient is highest represents
the best choice for the number of clusters.
Q24. Which of the following is/are valid iterative strategies for treating missing values before
clustering analysis?
Solution: (C)
All of the mentioned techniques are valid for treating missing values before clustering analysis
Q25. The K-Means algorithm has some limitations. One of its limitations is that it makes hard
assignments of points to clusters (a point either completely belongs to a cluster or does not
belong to it at all). Which of the following algorithms allow soft assignments instead?
1. Gaussian mixture models
2. Fuzzy K-means
Note: A soft assignment can be considered as the probability of being assigned to each cluster,
e.g. probabilities of 0.7, 0.2 and 0.1 over three clusters.
Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of these
Solution: (C)
Both Gaussian mixture models and Fuzzy K-Means allow soft assignments.
Q26. Assume you want to cluster 7 observations into 3 clusters using the K-Means clustering
algorithm. After the first iteration, clusters C1, C2 and C3 have the following observations:
D. None of these
Solution: (A)
Q27. Assume you want to cluster 7 observations into 3 clusters using the K-Means clustering
algorithm. After the first iteration, clusters C1, C2 and C3 have the following observations:
What will be the Manhattan distance for observation (9, 9) from cluster centroid C1 in the
second iteration?
A. 10
B. 5*sqrt(2)
C. 13*sqrt(2)
D. None of these
Solution: (A)
Manhattan distance between centroid C1, i.e. (4, 4), and (9, 9) = |9-4| + |9-4| = 10.
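The same computation as a one-liner, using SciPy's cityblock distance just to confirm the arithmetic above:

from scipy.spatial.distance import cityblock

print(cityblock([4, 4], [9, 9]))   # |9-4| + |9-4| = 10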
Q28. If two variables V1 and V2 are used for clustering, which of the following are true for
K-Means clustering with k = 3?
Options:
A. 1 only
B. 2 only
C. 1 and 2
If the correlation between the variables V1 and V2 is 1, then all the data points will be in a
straight line. Hence, all the three cluster centroids will form a straight line as well.
Q29. Feature scaling is an important step before applying the K-Means algorithm. What is the
reason behind this?
A. In distance calculation it will give the same weights for all features
B. You always get the same clusters whether or not you use feature scaling
D. None of these
Solution: (A)
Feature scaling ensures that all the features get the same weight in the clustering analysis. Consider
a scenario of clustering people based on their weights (in kg) with range 55-110 and heights (in
feet) with range 5.6 to 6.4. In this case, the clusters produced without scaling can be very
misleading, as the range of weight is much larger than that of height. Therefore, it's necessary to
bring them to the same scale so that they have equal weight in the clustering result.
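A minimal sketch of the recommended preprocessing with scikit-learn (the weight/height values are made-up illustrations): both columns are standardized before K-Means so they contribute comparably to the distance.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# columns: weight in kg (55-110), height in feet (5.6-6.4) -- made-up sample
X = np.array([[55, 5.6], [60, 6.3], [105, 5.7], [110, 6.4], [70, 6.0], [95, 5.8]])

model = make_pipeline(StandardScaler(), KMeans(n_clusters=2, n_init=10, random_state=0))
print(model.fit_predict(X))   # labels computed on standardized features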
Q30. Which of the following methods is used for finding the optimal number of clusters in the
K-Means algorithm?
A. Elbow method
B. Manhattan method
C. Euclidean method
E. None of these
Solution: (A)
Out of the given options, only elbow method is used for finding the optimal number of clusters.
The elbow method looks at the percentage of variance explained as a function of the number of
clusters: one should choose a number of clusters so that adding another cluster doesn't give
much better modeling of the data.
Options:
A. 1 and 3
B. 1 and 2
C. 2 and 3
D. 1, 2 and 3
Solution: (D)
All three of the given statements are true. K-means is extremely sensitive to cluster center
initialization. Also, bad initialization can lead to poor convergence speed as well as bad overall
clustering.
Q32. Which of the following can be applied to get good results for the K-Means algorithm?
Options:
A. 2 and 3
B. 1 and 3
C. 1 and 2
D. All of above
Solution: (D)
All of these are standard practices that are used in order to obtain good clustering results.
Q33. What should be the best choice for number of clusters based on the following results:
A. 5
B. 6
C. 14
D. Greater than 14
Solution: (B)
Based on the above results, the best choice of number of clusters using elbow method is 6.
Q34. What should be the best choice for number of clusters based on the following results:
A. 2
B. 4
C. 6
D. 8
Solution: (C)
Generally, a higher average silhouette coefficient indicates better clustering quality. In this plot,
the optimal clustering number of grid cells in the study area should be 2, at which the value of
the average silhouette coefficient is highest. However, the SSE of this clustering solution (k = 2)
is too large. At k = 6, the SSE is much lower. In addition, the value of the average silhouette
coefficient at k = 6 is also very high, which is just lower than k = 2. Thus, the best choice is k = 6.
Q35. Which of the following sequences is correct for a K-Means algorithm using Forgy
method of initialization?
Options:
A. 1, 2, 3, 5, 4
B. 1, 3, 2, 4, 5
C. 2, 1, 3, 4, 5
D. None of these
Solution: (A)
The methods used for initialization in K means are Forgy and Random Partition. The Forgy
method randomly chooses k observations from the data set and uses these as the initial means.
The Random Partition method first randomly assigns a cluster to each observation and then
proceeds to the update step, thus computing the initial mean to be the centroid of the cluster's
randomly assigned points.
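A small NumPy sketch of the two initialization schemes described above; the data and k are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
k = 3

# Forgy: pick k observations at random and use them directly as the initial means
forgy_means = X[rng.choice(len(X), k, replace=False)]

# Random Partition: assign every point to a random cluster, then take the centroid
# of each randomly assigned group as the initial mean
assign = rng.integers(0, k, size=len(X))
partition_means = np.array([X[assign == j].mean(axis=0) for j in range(k)])

print(forgy_means)
print(partition_means)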
Q36. If you are using the expectation-maximization (EM) algorithm for clustering a set of data
points into two clusters, which of the following assumptions are important?
Solution: (C)
In the EM algorithm for clustering, it's essential to choose the same number of clusters to classify
the data points into as the number of different distributions they are expected to be generated from.
Q37. Which of the following is/are not true about centroid-based K-Means clustering and
distribution-based expectation-maximization (EM) clustering algorithms?
A. 1 only
B. 5 only
C. 1 and 3
D. 6 and 7
E. 4, 6 and 7
Solution: (B)
All of the above statements are true except the 5th; instead, K-Means is a special case of the EM
algorithm in which only the centroids of the cluster distributions are calculated at each
iteration.
Q38. Which of the following is/are not true about DBSCAN clustering algorithm:
1. For data points to be in a cluster, they must be in a distance threshold to a core point
2. It has strong assumptions for the distribution of data points in dataspace
3. It has a substantially high time complexity of order O(n³)
4. It does not require prior knowledge of the no. of desired clusters
5. It is robust to outliers
Options:
A. 1 only
B. 2 only
C. 4 only
D. 2 and 3
E. 1 and 5
F. 1, 3 and 5
Solution: (D)
DBSCAN can form a cluster of any arbitrary shape and does not have strong assumptions
for the distribution of data points in the dataspace.
DBSCAN has a comparatively low time complexity, of the order of O(n log n) on average.
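A minimal DBSCAN sketch with scikit-learn; the make_moons data and the eps/min_samples values are illustrative assumptions. The number of clusters is not specified in advance, and points labelled -1 are treated as noise/outliers.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # non-convex cluster shapes

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks noise points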
Q39. Which of the following are the lower and upper bounds of the F-score?
A. [0,1]
B. (0,1)
C. [-1,1]
Solution: (A)
The lowest and highest possible values of the F-score are 0 and 1, with 1 representing that every
data point is assigned to the correct cluster and 0 representing that the precision and/or
recall of the clustering analysis is 0. In clustering analysis, a high F-score is desired.
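For reference, the F-score is the harmonic mean of precision P and recall R:

\[
F = \frac{2 P R}{P + R}, \qquad 0 \le F \le 1.
\]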
Q40. Following are the results observed for clustering 6000 data points into 3 clusters: A, B
and C:
A. 3
B. 4
C. 5
D. 6
Solution: (D)
########################################################
The inability of a machine learning method to capture the true relationship is called bias.