SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-R

The document discusses the K-Means clustering algorithm. K-Means is an unsupervised machine learning algorithm that groups data points into K number of clusters. Each cluster is associated with a centroid, which is the mean of the data points in that cluster. Data points are assigned to the cluster with the closest centroid. The centroids and cluster assignments are recalculated in each iteration until centroids do not change or few points change clusters. One drawback is needing to specify K, or the number of clusters, beforehand. The document provides an example of using K-Means clustering on customer data to segment customers into groups based on their calling patterns.

Uploaded by

jefferyleclerc

3/14/24, 7:33 AM SAP HANA PAL – K-Means Algorithm or How to do Cust... - SAP Community

Each cluster is associated with a centroid and each point is assigned to the cluster with the closest centroid. The centroid is
the mean of the points in the cluster. The closeness can be measured using:

Manhattan Distance
Euclidean Distance (most commonly used)
Minkowski Distance
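
As a rough illustration of the three measures listed above, here is a minimal plain-Python sketch. It is not HANA PAL code; the function names and the sample point and centroid are made up for the example.

```python
import math

def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance (L2 distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def minkowski(a, b, p):
    """Generalised distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

# Hypothetical point and centroid, chosen only to make the arithmetic easy.
point, centroid = (0.0, 0.0), (3.0, 4.0)
print(manhattan(point, centroid))
print(euclidean(point, centroid))
print(minkowski(point, centroid, 3))
```

Minkowski distance generalises the other two, which is why a single parameter p is enough to switch between them.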

Every time a point is assigned to a cluster, the centroid is recalculated. This is repeated over multiple iterations until the
centroids no longer change (meaning all points have been assigned to a corresponding cluster) or until relatively few points
change clusters. Usually most of the centroid movement happens in the first few iterations.
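
The loop described above can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: it uses the common batch variant (assign all points, then recompute every centroid once per pass), Euclidean distance, and caller-supplied starting centroids. It is not how PAL is necessarily implemented, and the function name and sample points are hypothetical.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Batch K-Means: assign every point, then recompute each centroid."""
    centroids = [tuple(c) for c in centroids]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the centroids are stable
        labels = new_labels
        # Update step: each centroid becomes the mean of its points.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical two-dimensional points forming two obvious groups.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
labels, centroids = kmeans(points, centroids=[(1, 1), (9, 9)])
print(labels)
print(centroids)
```

Stopping when no label changes corresponds to the "centroids don't change anymore" condition; a looser rule that stops when only a few points move would replace the equality check with a count of changed labels.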

One of the main drawbacks of the K-Means Algorithm is that you need to specify K (the number of clusters) upfront as
an input parameter. The right value is usually very hard to know in advance, which is why it is important to run quality
measurement functions to check the quality of your clustering. We will talk about this later in this post.
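
To give a feel for what a quality measurement might look like, here is a minimal plain-Python sketch of one common measure, the within-cluster sum of squares. This is an assumption chosen for illustration and not necessarily the measure used later in this post or in PAL; the function name and data are hypothetical.

```python
def within_cluster_ss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its own centroid."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )

# Hypothetical one-dimensional data stored as 2-tuples: two tight pairs far apart.
points = [(0, 0), (2, 0), (10, 0), (12, 0)]

# K = 1: a single centroid at the overall mean.
one_cluster = within_cluster_ss(points, [0, 0, 0, 0], [(6, 0)])
# K = 2: one centroid per pair.
two_clusters = within_cluster_ss(points, [0, 0, 1, 1], [(1, 0), (11, 0)])
print(one_cluster, two_clusters)
```

A lower value means tighter clusters, but the value always falls as K grows, so it is usually compared across several values of K rather than read in isolation.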

I came across a very interesting paper that talks about segmentation in the telecommunication industry, so I thought it
would be a very nice use case to demo the K-Means algorithm in HANA (if you are interested in this topic, I very much
recommend reading this paper). These are the steps I followed:

https://community.sap.com/t5/technology-blogs-by-members/sap-hana-pal-k-means-algorithm-or-how-to-do-customer-segmentation-for-the/ba-p/12976696/page/2



So each row in this table will represent a unique customer. Now I need to fill it, but I do not have access to real data, so I
had to build my own dataset. I created 30 different customers (30 rows) that can be grouped into 3 segments:

Segment 1: Customer IDs 1 through 10. In this segment customers usually have short calls. They originate or receive
a low number of calls. These customers call more in the evening, more often during the weekend, and to mobile lines.
They send and receive a fair number of SMSs. This segment could represent personal mobile users.
Segment 2: Customer IDs 10001 through 10010. In this segment customers have an average call duration. They
originate or receive an average number of calls. They usually call during business hours and on weekdays. They
send or receive a small number of SMSs. This segment could represent small business users.
Segment 3: Customer IDs 20001 through 20010. In this segment customers usually have long-duration calls. They
usually call during business hours and on weekdays. They usually call mobile lines and they use SMSs heavily.
This segment could represent enterprise business users.
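
A synthetic dataset with this shape could be generated along the following lines. This is a hedged plain-Python sketch, not the method used in this post: only the ID ranges and the "ID" column come from the text, while the other column names, the segment labels, and all numeric ranges are invented for illustration.

```python
import random

# (segment label, first customer ID, call-minutes range, SMS-per-day range)
# The labels and numeric ranges are hypothetical; only the ID ranges follow the text.
SEGMENTS = [
    ("personal", 1, (1.0, 3.0), (10, 20)),
    ("small_business", 10001, (4.0, 7.0), (0, 5)),
    ("enterprise", 20001, (10.0, 20.0), (30, 60)),
]

def build_customers(per_segment=10, seed=42):
    """Return one dict per customer, with values drawn from the segment's ranges."""
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    rows = []
    for name, first_id, (dur_lo, dur_hi), (sms_lo, sms_hi) in SEGMENTS:
        for offset in range(per_segment):
            rows.append({
                "ID": first_id + offset,
                "SEGMENT": name,
                "AVG_CALL_MINUTES": round(rng.uniform(dur_lo, dur_hi), 2),
                "SMS_PER_DAY": rng.randint(sms_lo, sms_hi),
            })
    return rows

customers = build_customers()
print(len(customers))
print(customers[0]["ID"], customers[10]["ID"], customers[20]["ID"])
```

Keeping the segments in separate, non-overlapping value ranges makes it easy to check afterwards whether the clustering recovers the three groups.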

The resulting table looks like this:


    primary key("ID")
);

/* Table Type that will be used as the output parameter
   that will contain the centers for each cluster */
DROP TYPE PAL_KMEANS_CENTERS_TELCO;

CREATE TYPE PAL_KMEANS_CENTERS_TELCO AS TABLE(
    "CENTER_ID" INT,
    "V000" DOUBLE,
    "V001" DOUBLE,
    "V002" DOUBLE,
    "V003" DOUBLE,
    "V004" DOUBLE,
    "V005" DOUBLE,
