Analysis of test data using K-Means Clustering in Python
Last Updated: 09 Apr, 2025
In data science, K-Means clustering is one of the most popular unsupervised machine learning algorithms. It groups similar data points based on their features, which helps reveal inherent patterns in a dataset. This article demonstrates how to apply K-Means clustering to test data in Python using the OpenCV library.
What is K-Means Clustering?
K-Means clustering is an iterative algorithm that partitions data into a predefined number of clusters (K) based on feature similarity. It works by minimizing the variance within each cluster, so that data points in the same cluster are as similar as possible. The algorithm repeatedly assigns data points to the nearest centroid and recalculates the centroids until convergence, that is, until the centroids stop moving by more than a small tolerance.
Steps involved in K-Means clustering:
- Choose the number of clusters (K).
- Initialize K cluster centroids.
- Assign each data point to the nearest centroid.
- Recalculate the centroids based on the points assigned to each cluster.
- Repeat the assignment and centroid recalculation until convergence.
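The steps above can be sketched in plain NumPy. This is a minimal illustration and not OpenCV's implementation; the function name `kmeans_sketch`, its parameters and the sample points are hypothetical and chosen only for this example.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, tol=1e-4, seed=0):
    """Illustrative K-Means loop (hypothetical helper); returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct data points at random.
    start = rng.choice(len(points), size=k, replace=False)
    centroids = points[start].astype(np.float64)
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Step 5: stop once the centroids have effectively stopped moving.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Step 1: choose K. The six sample points form two well-separated groups, so K = 2.
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label numbers.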
K-Means clustering helps in test data analysis by grouping similar tests based on features like test scores, difficulty levels, or time taken to solve. By clustering tests, one can gain insights into:
- Identifying test patterns.
- Grouping similar test items.
- Finding anomalies or outliers.
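As one possible way to find outliers once clusters are known, each point's distance to its own centroid can be compared with what is typical for that cluster. The sketch below assumes hypothetical test records of the form (score, minutes taken), hand-assigned labels standing in for a clustering result, and a made-up rule (more than `n_std` standard deviations above the cluster's mean distance); none of these come from the article.

```python
import numpy as np

def flag_outliers(points, labels, centers, n_std=2.0):
    """Flag points lying unusually far from their own cluster centroid (hypothetical rule)."""
    # Distance from each point to the centroid of the cluster it belongs to.
    dist = np.linalg.norm(points - centers[labels], axis=1)
    flags = np.zeros(len(points), dtype=bool)
    for j in range(len(centers)):
        in_cluster = labels == j
        if in_cluster.sum() < 2:
            continue  # too few points to estimate a spread
        # Cutoff: cluster mean distance plus n_std standard deviations.
        cutoff = dist[in_cluster].mean() + n_std * dist[in_cluster].std()
        flags[in_cluster] = dist[in_cluster] > cutoff
    return flags

# Hypothetical (score, minutes) records; the last row of the first group is unusual.
points = np.array([
    [60, 20], [62, 20], [58, 20], [60, 22], [60, 18],
    [61, 21], [59, 19], [61, 19], [59, 21], [90, 50],
    [20, 60], [22, 60], [18, 60], [20, 62], [20, 58],
], dtype=np.float64)
labels = np.array([0] * 10 + [1] * 5)
# Centroids computed as the mean of each labelled group.
centers = np.array([points[labels == j].mean(axis=0) for j in range(2)])

flags = flag_outliers(points, labels, centers)
print(points[flags])
```

The threshold is a design choice: a lower `n_std` flags more points, and in small clusters a single extreme point inflates the standard deviation, so the rule is only a rough screen.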
Analysis of test data using K-Means Clustering
OpenCV provides an efficient implementation of the K-Means algorithm through its cv2.kmeans() function. This function clusters data points into a predefined number of groups based on their features, which makes it a convenient choice for analyzing test data. Here is the step-by-step implementation.
1. Importing Libraries
We will be using numpy, OpenCV and matplotlib for this.
Python
import numpy as np
import cv2
from matplotlib import pyplot as plt
2. Generating and Visualizing Test Data with Multiple Features
Let’s start by generating random test data and visualizing it with matplotlib. We create two sets of data points, X and Y, stack them into a single array and plot the values as a histogram.
- np.random.randint: Generates random integers in a half-open range, so the upper bound is excluded. Here it creates X with values from 10 to 34 and Y with values from 55 to 69, both with dimensions (25, 2).
- np.vstack: Stacks arrays vertically (row-wise). It combines the X and Y arrays into a single array Z of shape (50, 2).
- Z.reshape: Changes the shape of the array. Z already has shape (50, 2) after stacking, so this call only makes the expected 50×2 shape explicit.
- np.float32: Converts the array Z to 32-bit floating-point type, which cv2.kmeans() requires for its input samples.
Python
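# Two groups of 25 two-feature samples drawn from separate value ranges
X = np.random.randint(10, 35, (25, 2))
Y = np.random.randint(55, 70, (25, 2))
# Stack into one (50, 2) array; the reshape is a no-op that makes the shape explicit
Z = np.vstack((X, Y))
Z = Z.reshape((50, 2))
# cv2.kmeans expects float32 input
Z = np.float32(Z)
plt.xlabel('Test Data')
plt.ylabel('Z samples')
# One histogram series per column of Z, with 256 unit-wide bins over [0, 256]
plt.hist(Z, bins=256, range=(0, 256))
plt.show()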
Output:

Visualized test data with Multiple features
The histogram shows two distinct groups of values, with bars concentrated in the 10 to 34 and 55 to 69 ranges. Because Z has two columns, plt.hist draws one color-coded series per feature column, not per cluster. The gap between the two groups is the pattern that K-Means clustering can recover by grouping similar data points together.
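The same distribution can be checked numerically without plotting. The sketch below is an illustration that swaps in a seeded `np.random.default_rng` generator for `np.random.randint` so the result is reproducible (an assumption for this example, not the article's code), and it uses `np.histogram` with the same bins and range as the plot.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded generator, assumed here for reproducibility
X = rng.integers(10, 35, size=(25, 2))  # upper bound 35 is excluded
Y = rng.integers(55, 70, size=(25, 2))  # upper bound 70 is excluded
Z = np.float32(np.vstack((X, Y)))

print(Z.shape, Z.dtype)

per_column = []
for col in range(Z.shape[1]):
    # Bin i covers [i, i + 1), so an integer value v is counted in bin v.
    counts, edges = np.histogram(Z[:, col], bins=256, range=(0, 256))
    low = int(counts[10:35].sum())   # samples that came from X
    high = int(counts[55:70].sum())  # samples that came from Y
    per_column.append((low, high))

print(per_column)
```

Each feature column contributes 25 values to the lower range and 25 to the upper range, which is why the plot contains two series with the same two-group shape.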
3. Applying K-Means Clustering on Test Data
Now let’s apply the K-Means clustering algorithm to the test data and observe its behavior.
- cv2.TERM_CRITERIA_EPS: Stops the algorithm once the specified accuracy (epsilon) is reached, meaning the centroids move by less than epsilon between iterations.
- cv2.TERM_CRITERIA_MAX_ITER: Stops the algorithm after the specified maximum number of iterations.
- criteria: A tuple of (type, maximum iterations, epsilon). Here the algorithm stops after 10 iterations or when the accuracy reaches 1.0, whichever comes first.
- cv2.kmeans: Performs K-Means clustering on the data. It takes Z (the dataset), the number of clusters (2 in this case), an optional array of initial labels (None here), the termination criteria, the number of attempts with different initializations (10 here) and the initialization method (KMEANS_RANDOM_CENTERS). It returns the compactness, the label of each data point and the cluster centers.
- label.ravel(): Flattens the label array to a 1D array so that it can be used as a boolean mask over the rows of Z.
- Z[label.ravel() == 0]: Selects the data points assigned to cluster 0 and stores them in array A.
- Z[label.ravel() == 1]: Selects the data points assigned to cluster 1 and stores them in array B.
Python
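# Regenerate the test data (note the wider 10 to 44 range for X in this step)
X = np.random.randint(10, 45, (25, 2))
Y = np.random.randint(55, 70, (25, 2))
Z = np.vstack((X, Y))
Z = np.float32(Z)  # cv2.kmeans expects float32 input
# Stop after 10 iterations or once the accuracy (epsilon) of 1.0 is reached
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
# 2 clusters, no initial labels, 10 attempts with random initial centers
ret, label, center = cv2.kmeans(Z, 2, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)
# Split the points by their assigned cluster label
A = Z[label.ravel() == 0]
B = Z[label.ravel() == 1]
# Plot each cluster and mark the centroids as yellow squares
plt.scatter(A[:, 0], A[:, 1])
plt.scatter(B[:, 0], B[:, 1], c='r')
plt.scatter(center[:, 0], center[:, 1], s=80, c='y', marker='s')
plt.xlabel('Test Data')
plt.ylabel('Z samples')
plt.show()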
Output:

K Means Clustering
The plot clearly shows that the K-Means algorithm has successfully grouped the data points into two distinct clusters, with the centroids positioned around the center of each group.
K-Means clustering is a useful unsupervised machine learning technique, especially in applications such as test data analysis. By grouping similar test data points together, you can identify patterns and trends in the data. Although the algorithm is simple and effective, it has limitations such as sensitivity to the choice of initial centroids and the need to choose the number of clusters (K) in advance.