Analysis of test data using K-Means Clustering in Python

Last Updated : 09 Apr, 2025

In data science, K-Means clustering is one of the most popular unsupervised machine learning algorithms. It groups similar data points together based on their features, which helps uncover inherent patterns in a dataset. This article demonstrates how to apply K-Means clustering to test data in Python using the OpenCV library.

What is K-Means Clustering?

K-Means clustering is an iterative algorithm that partitions data into a predefined number of clusters (K) based on feature similarity. It works by minimizing the within-cluster variance, that is, the sum of squared distances between each point and the centroid of its cluster, so that points in the same cluster are as similar as possible. The algorithm repeatedly assigns each data point to the nearest centroid and recalculates the centroids until the assignments stop changing (convergence).

Steps involved in K-Means clustering:

  1. Choose the number of clusters (K).
  2. Initialize K cluster centroids.
  3. Assign each data point to the nearest centroid.
  4. Recalculate the centroids based on the points assigned to each cluster.
  5. Repeat the assignment and centroid recalculation until convergence.
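
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation OpenCV uses: the function name kmeans_sketch, the choice of random data points as initial centroids, and the tol and seed parameters are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, tol=1e-4, seed=0):
    """Basic K-Means loop (illustrative only); returns (labels, centers)."""
    points = np.asarray(points, dtype=np.float64)
    rng = np.random.default_rng(seed)
    # Steps 1-2: K is given; use k distinct data points as the initial centroids.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: squared distance from every point to every centroid, shape (n, k).
        sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points;
        # a cluster that received no points keeps its previous centroid.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids have essentially stopped moving.
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return labels, centers

# Two small, well-separated groups of points
pts = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
labels, centers = kmeans_sketch(pts, k=2)
print(labels)
print(centers)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself, not the label numbers, is meaningful.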

K-Means clustering helps in test data analysis by grouping similar tests based on features like test scores, difficulty levels, or time taken to solve. By clustering tests, one can gain insights into:

  • Identifying test patterns.
  • Grouping similar test items.
  • Finding anomalies or outliers.
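
One simple way to look for outliers after clustering is to flag points that lie unusually far from their own centroid. The sketch below assumes that labels and centers are already available from some clustering step; the helper name flag_outliers and the "mean plus two standard deviations" cutoff are hypothetical choices for illustration, not a standard rule.

```python
import numpy as np

def flag_outliers(points, labels, centers, n_std=2.0):
    """Return a boolean mask marking points unusually far from their own centroid."""
    points = np.asarray(points, dtype=np.float64)
    labels = np.asarray(labels).ravel()
    centers = np.asarray(centers, dtype=np.float64)
    # Euclidean distance from each point to the centroid of its assigned cluster.
    dist = np.linalg.norm(points - centers[labels], axis=1)
    mask = np.zeros(len(points), dtype=bool)
    for j in range(len(centers)):
        in_cluster = labels == j
        if not in_cluster.any():
            continue
        # Assumed cutoff: mean + n_std * std of the distances within this cluster.
        cutoff = dist[in_cluster].mean() + n_std * dist[in_cluster].std()
        mask[in_cluster] = dist[in_cluster] > cutoff
    return mask

# Eight points at distance 1 from the centroid and one point at distance 10
pts = [[0, 1], [1, 0], [0, -1], [-1, 0], [0, 1], [1, 0], [0, -1], [-1, 0], [10, 0]]
mask = flag_outliers(pts, labels=[0] * 9, centers=[[0, 0]])
print(mask)  # only the last point is flagged
```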

Analysis of test data using K-Means Clustering

OpenCV provides an efficient implementation of the K-Means algorithm through its cv2.kmeans() function. This function clusters data points into a predefined number of groups based on their features, which makes it a convenient choice for analyzing test data. Here is the step-by-step implementation.

1. Importing Libraries

We will be using numpy, OpenCV (cv2) and matplotlib for this.

Python
import numpy as np 
import cv2
from matplotlib import pyplot as plt 

2. Generating and Visualizing Test Data with Multiple Features

Let’s start by generating random test data and visualizing it using matplotlib. In this case we create two arrays of data points, X and Y, each with two feature columns, stack them into a single array Z and plot Z as a histogram.

  • np.random.randint: Generates random integers in the half-open range [low, high). Here it creates two arrays of shape (25, 2): X with integers from 10 to 34 and Y with integers from 55 to 69.
  • np.vstack: Stacks arrays vertically (row-wise). It combines the X and Y arrays into a single array Z.
  • Z.reshape: Changes the shape of the array. After np.vstack, Z already has shape (50, 2), so this call leaves it unchanged and only makes the intended 50×2 layout explicit.
  • np.float32: Converts Z to 32-bit floating point. cv2.kmeans() requires its input samples to be of type np.float32.
Python
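# Two groups of 25 samples with 2 features each (upper bound is exclusive)
X = np.random.randint(10, 35, (25, 2))
Y = np.random.randint(55, 70, (25, 2))

# Stack into one (50, 2) array; the reshape is a no-op that makes the shape explicit
Z = np.vstack((X, Y))
Z = Z.reshape((50, 2))

# cv2.kmeans() expects float32 samples
Z = np.float32(Z)

# Histogram of the values; each column of Z is drawn as its own series
plt.xlabel('Test Data')
plt.ylabel('Z samples')
plt.hist(Z, bins=256, range=(0, 256))
plt.show()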

Output:

Visualized test data with Multiple features

The histogram shows two distinct groups of values, with bars concentrated roughly between 10 and 34 and between 55 and 69. Because Z has two columns, matplotlib draws each feature column as a separate color-coded series, so the colors distinguish the two features rather than the clusters. The gap between the two groups is the pattern that K-Means clustering can pick up when it groups similar data points together.
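
The same distribution can also be checked numerically. The following sketch counts, for each feature column, how many values fall below and above 45. The helper name column_counts, the bin edges and the fixed seed are assumptions made for this example, and np.random.default_rng is used in place of np.random.randint only to make the run reproducible.

```python
import numpy as np

def column_counts(data, edges):
    """Histogram counts for each column of a 2D array, using shared bin edges."""
    return [np.histogram(data[:, col], bins=edges)[0].tolist()
            for col in range(data.shape[1])]

rng = np.random.default_rng(0)       # seed chosen only for reproducibility
X = rng.integers(10, 35, (25, 2))    # values from 10 to 34
Y = rng.integers(55, 70, (25, 2))    # values from 55 to 69
Z = np.float32(np.vstack((X, Y)))

# Two bins: [0, 45) and [45, 256]
counts = column_counts(Z, edges=[0, 45, 256])
print(counts)  # [[25, 25], [25, 25]]
```

Each feature column contributes 25 values to the low group and 25 to the high group, which matches the two groups of bars in the histogram.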

3. Applying K-Means Clustering on Test Data

Now let’s apply the K-Means clustering algorithm to the test data and observe its behavior.

  • cv2.TERM_CRITERIA_EPS: Stops the algorithm once the specified accuracy (epsilon) is reached, i.e. the centroids move by less than epsilon between iterations.
  • cv2.TERM_CRITERIA_MAX_ITER: Stops the algorithm after the specified maximum number of iterations.
  • criteria: A tuple (type, max_iter, epsilon). Combining both flags stops the run after 10 iterations or when epsilon = 1.0 is reached, whichever comes first.
  • cv2.kmeans: Performs K-Means clustering on the data. It takes Z (the dataset), the number of clusters (2 in this case), an optional array of initial labels (None here), the termination criteria, the number of attempts (10, each with a different initialization, keeping the most compact result) and the initialization method (KMEANS_RANDOM_CENTERS). It returns the compactness, the label of each sample and the cluster centers.
  • label.ravel(): Flattens the label array to 1D so it can be used as a boolean mask over the rows of Z.
  • Z[label.ravel() == 0]: Selects the data points assigned to cluster 0 and stores them in array A.
  • Z[label.ravel() == 1]: Selects the data points assigned to cluster 1 and stores them in array B.
Python
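# Regenerate the two groups of samples (the first group now spans 10 to 44)
X = np.random.randint(10, 45, (25, 2))
Y = np.random.randint(55, 70, (25, 2))
Z = np.vstack((X, Y))

# cv2.kmeans() expects float32 samples
Z = np.float32(Z)

# Stop after 10 iterations or once epsilon = 1.0 is reached
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
# K = 2, no initial labels, 10 attempts with random initial centers
ret, label, center = cv2.kmeans(Z, 2, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)

# Split the samples by assigned cluster
A = Z[label.ravel() == 0]
B = Z[label.ravel() == 1]

# Plot both clusters and mark the centroids with yellow squares
plt.scatter(A[:, 0], A[:, 1])
plt.scatter(B[:, 0], B[:, 1], c='r')
plt.scatter(center[:, 0], center[:, 1], s=80, c='y', marker='s')
plt.xlabel('Test Data')
plt.ylabel('Z samples')
plt.show()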

Output:

K Means Clustering

The plot clearly shows that the K-Means algorithm has successfully grouped the data points into two distinct clusters, with the centroids positioned around the center of each group.
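
The first value returned by cv2.kmeans() (ret in the code above) is documented as the compactness: the sum of squared distances from each sample to its assigned center. As a rough cross-check, the same quantity can be recomputed with NumPy. The helper name compactness is introduced here for illustration only, and the toy arrays below stand in for Z, label and center.

```python
import numpy as np

def compactness(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    points = np.asarray(points, dtype=np.float64)
    labels = np.asarray(labels).ravel()   # cv2.kmeans returns labels with shape (n, 1)
    centers = np.asarray(centers, dtype=np.float64)
    diffs = points - centers[labels]      # offset of each point from its own center
    return float((diffs ** 2).sum())

# Toy stand-ins: two clusters, every point exactly 1 unit from its center
toy_points = [[0, 0], [2, 0], [10, 0], [12, 0]]
toy_labels = [[0], [0], [1], [1]]
toy_centers = [[1, 0], [11, 0]]
total = compactness(toy_points, toy_labels, toy_centers)
print(total)  # 4.0
```

With the variables from the clustering step, compactness(Z, label, center) should come out close to ret, up to floating-point rounding. Lower values indicate tighter clusters for the same K.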

K-Means clustering is a useful unsupervised machine learning technique especially in applications such as test data analysis. By grouping similar test data points together you can easily identify patterns and trends that provide valuable insights. Although the algorithm is simple and effective it has limitations such as sensitivity to the choice of initial centroids and the requirement for predefining the number of clusters (K).
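
The sensitivity to initial centroids can be seen on a tiny example. The sketch below runs the assign-and-update loop from two different, hand-picked starting positions on the four corners of a 4×1 rectangle. The function name lloyd_from and the starting positions are assumptions chosen to make the effect visible; this is not how cv2.kmeans() initializes its centers.

```python
import numpy as np

def lloyd_from(points, init_centers, n_iter=20):
    """Run assign/update iterations from given centers; returns (labels, centers, cost)."""
    points = np.asarray(points, dtype=np.float64)
    centers = np.asarray(init_centers, dtype=np.float64)
    for _ in range(n_iter):
        sq_dist = ((points[:, None] - centers[None]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # An empty cluster keeps its previous center.
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
    sq_dist = ((points[:, None] - centers[None]) ** 2).sum(axis=2)
    # Cost: sum of squared distances from each point to its nearest center.
    return sq_dist.argmin(axis=1), centers, float(sq_dist.min(axis=1).sum())

corners = [[0, 0], [0, 1], [4, 0], [4, 1]]

# Start between the left and right pairs: the loop settles on a top/bottom split.
_, _, poor_cost = lloyd_from(corners, [[2, 0], [2, 1]])
# Start near the left and right pairs: the loop finds the left/right split.
_, _, good_cost = lloyd_from(corners, [[0, 0.5], [4, 0.5]])

print(poor_cost, good_cost)  # 16.0 1.0
```

Both runs converge, but the first one stops at a much worse solution. This is why cv2.kmeans() accepts an attempts argument and keeps the most compact result across several initializations.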


