SAP HANA PAL - K-Means Algorithm or How To Do Cust... - SAP Community-R
Each cluster is associated with a centroid and each point is assigned to the cluster with the closest centroid. The centroid is
the mean of the points in the cluster. The closeness can be measured using:
Manhattan Distance
Euclidean Distance (most commonly used)
Minkowski Distance
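The three measures above are closely related: Manhattan and Euclidean distance are the Minkowski distance of order 1 and 2 respectively. A minimal sketch in plain Python follows; the function names are illustrative and are not part of PAL.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two points of equal dimension."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def manhattan(a, b):
    """Sum of absolute coordinate differences (Minkowski with p = 1)."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Straight-line distance (Minkowski with p = 2)."""
    return minkowski(a, b, 2)
```

For the points (0, 0) and (3, 4), the Manhattan distance is 7 and the Euclidean distance is 5.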
Once points have been assigned, each centroid is recalculated as the mean of the points in its cluster. This assign-and-update
cycle is repeated in multiple iterations until the centroids don't change anymore (meaning no point switches to a different
cluster) or until relatively few points change clusters. Usually most of the centroid movement happens in the first iterations.
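The iteration can be sketched in a few lines of Python. This is only an illustration of the idea, not how PAL implements it: it assumes Euclidean distance, takes the starting centroids as an argument, and recomputes the centroids once per pass over all points (the common batch formulation). The name `kmeans` and its parameters are hypothetical.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Batch K-Means: returns (final centroids, cluster index for each point)."""
    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop when no point changed cluster since the previous iteration.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of the points in its cluster.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment
```

With the points (0, 0), (0, 1), (10, 10), (10, 11) and starting centroids (0, 0) and (10, 10), the loop stops on the second iteration with centroids (0, 0.5) and (10, 10.5).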
One of the main drawbacks of the K-Means algorithm is that you need to specify K (the number of clusters) upfront as
an input parameter. Knowing this value in advance is usually very hard, which is why it is important to run quality
measurement functions to check the quality of your clustering. We will talk about this later in this post.
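To give an idea of what such a quality measurement can look like, here is a minimal sketch of the silhouette coefficient, one common way to score a clustering. Whether this is the exact measure used later in this post is an assumption; the function name and the use of Euclidean distance are illustrative choices.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient; values close to 1 indicate well-separated clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other points in the same cluster
        # (the point's distance to itself contributes 0 to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

One way to use it is to run the clustering for several values of K and compare the scores; a clearly higher score suggests a better choice of K.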
I came across a very interesting paper that talks about segmentation in the telecommunication industry, so I thought it
would be a very nice use case to demo the K-Means algorithm in HANA (if you are interested in this topic, I very much
recommend reading this paper). These are the steps I followed:
So each row in this table will represent a unique customer. Now I need to fill it, but I do not have access to real data, so I
had to build my own dataset. I created 30 different customers (30 rows) that can be grouped in 3 segments:
Segment 1: From Customer ID 1 thru 10. In this segment customers usually have short calls. They originate or receive
a low number of calls. These customers call more in the evening, more often during the weekend, and more often to mobile lines.
They send and receive a fair amount of SMSs. This segment could represent personal mobile users.
Segment 2: From Customer ID 10001 thru 10010. In this segment customers have an average call duration. They
originate or receive an average number of calls. They usually call during business hours and during week days. They
send or receive a small amount of SMSs. This segment could represent small business users.
Segment 3: From Customer ID 20001 thru 20010. In this segment customers usually have long calls. They
usually call during business hours and during week days. They usually call mobile lines and they use SMSs heavily.
This segment could represent enterprise business users.
primary key("ID")
);
"CENTER_ID" INT,
"V000" DOUBLE,
"V001" DOUBLE,
"V002" DOUBLE,
"V003" DOUBLE,
"V004" DOUBLE,
"V005" DOUBLE,