K-Means for Recommendation System
K-MEANS CLUSTERING IN
RECOMMENDATION SYSTEM
Presented by:
Prajwal S M
Sagar Talagatti
Sanjay Lote
OUTLINE
• A recommendation system is an artificial intelligence (AI) algorithm, usually associated with machine learning,
that draws on data science and big data to suggest or recommend additional products to consumers.
• It uses data to help predict, narrow down, and find what people are looking for among a rapidly growing
number of options.
• Recommendations can be based on various criteria, including past purchases, search history, demographic
information, and other factors. Recommender systems are highly useful because they help users discover
products and services they might not otherwise have found on their own.
• Because of their capability to predict consumer interests and desires on a highly personalized level,
recommender systems are a favorite with content and product providers. They can drive consumers to just
about any product or service that interests them, from books to videos to health classes to clothing.
2. Types of Recommendation Systems
• While there are a vast number of recommender algorithms and techniques, most fall into these broad
categories:
1. Collaborative filtering
2. Content filtering
3. Context filtering
1. Collaborative Filtering
• Collaborative filtering algorithms recommend items (this is the
filtering part) based on preference information from many users
(this is the collaborative part).
• This approach uses the similarity of user preference behavior:
given previous interactions between users and items, recommender
algorithms learn to predict future interactions.
• The idea is that if some people have made similar decisions and
purchases in the past, like a movie choice, then there is a high
probability they will agree on additional future selections.
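The idea above can be illustrated with a minimal sketch of user-based collaborative filtering. It is not taken from the slides: the rating data, the user and movie names, and the helper functions `cosine_similarity` and `predict_rating` are all hypothetical, and cosine similarity is only one of several possible similarity measures.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"movie_a": 5, "movie_b": 4, "movie_c": 1},
    "bob":   {"movie_a": 4, "movie_b": 5, "movie_d": 5},
    "carol": {"movie_a": 1, "movie_c": 5, "movie_d": 2},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict_rating(user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    weighted_sum, weight_total = 0.0, 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += sim
    return weighted_sum / weight_total if weight_total else None

# Alice has not rated movie_d; Bob (similar taste) liked it, Carol did not.
print(predict_rating("alice", "movie_d"))
```

Because Alice's past ratings agree more with Bob's than with Carol's, Bob's opinion of movie_d carries more weight in the prediction.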
2. Content Filtering
• Uses the attributes or features of an item (this is the content
part) to recommend other items similar to the user’s preferences.
• This approach is based on the similarity of item and user features:
given information about a user and the items they have interacted
with (e.g. a user’s age, the category of a restaurant’s cuisine, the
average review for a movie), it models the likelihood of a new
interaction.
• For example, if a content filtering recommender sees you liked
the movies You’ve Got Mail and Sleepless in Seattle, it might
recommend another movie to you with the same genres and/or
cast, such as Joe Versus the Volcano.
Fig. 3: Content Filtering
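As a rough sketch of content filtering, items can be described by feature sets (genres, cast) and ranked by how much they overlap with what a user already liked. The catalogue, the feature names, and the functions `jaccard` and `recommend` below are hypothetical, and Jaccard overlap is an assumed choice of similarity, not one stated in the slides.

```python
# Hypothetical catalogue: item -> set of content features (genres, cast).
catalogue = {
    "movie_a": {"romance", "comedy", "actor_x", "actor_y"},
    "movie_b": {"romance", "drama", "actor_x", "actor_y"},
    "movie_c": {"romance", "comedy", "actor_x"},
    "movie_d": {"action", "thriller", "actor_z"},
}

def jaccard(a, b):
    """Share of features two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(liked, top_n=2):
    """Rank unseen items by feature overlap with the user's liked items."""
    # The user profile is the union of the features of everything they liked.
    profile = set().union(*(catalogue[m] for m in liked))
    candidates = [m for m in catalogue if m not in liked]
    ranked = sorted(candidates,
                    key=lambda m: jaccard(profile, catalogue[m]),
                    reverse=True)
    return ranked[:top_n]

print(recommend(["movie_a", "movie_b"], top_n=1))
```

Here movie_c shares genres and a cast member with the liked movies, so it ranks ahead of movie_d, which shares nothing with them.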
3. Context Filtering
• Includes users’ contextual information in the recommendation
process.
• Netflix spoke at NVIDIA GTC about making better
recommendations by framing a recommendation as a contextual
sequence prediction. This approach uses a sequence of
contextual user actions, plus the current context, to predict the
probability of the next action.
• In the Netflix example, given one sequence for each user—the
country, device, date, and time when they watched a movie—
they trained a model to predict what to watch next.
Fig. 4: Context Filtering
3. Introduction to K-Means Clustering
• Say you are given a data set where each observed example has a set of features but no labels. Labels
are an essential ingredient of a supervised algorithm, so without them we cannot run one.
• One of the most straightforward tasks we can perform on a data set without labels is to find groups of
data points that are similar to one another; these groups are called clusters.
• K-Means is one of the most popular clustering algorithms. It stores K centroids that it uses to
define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid
than to any other centroid.
• It partitions the data points into clusters in such a way that:
• Data points in the same cluster have a high degree of similarity.
• Data points in different clusters have a high degree of dissimilarity.
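These two properties are commonly made precise through the quantity K-Means tries to minimize, the Within-Cluster Sum of Squares (WCSS). The notation below is an assumption introduced for illustration: C_j denotes the set of points in cluster j, \mu_j its centroid, and x_i a data point.

```latex
\[
\mathrm{WCSS} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
\]
```

A small WCSS means points lie close to their own centroid, which is the "high similarity within a cluster" property stated above.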
1. Choose the number of clusters K. The value of K can be chosen arbitrarily, based on observations of the
data, or by using a method such as the elbow method on the Within-Cluster Sum of Squares (WCSS).
2. Select K data points as the initial cluster centers. They may be picked at random, but preferably so
that they are as far apart from each other as possible.
3. Calculate the distance between each data point and each cluster center. The distance may be calculated
using either a given distance function or the Euclidean distance formula.
4. Assign each data point to a cluster. A data point is assigned to the cluster whose center is nearest to
that data point.
5. Re-compute the centers of the newly formed clusters. The center of a cluster is computed by taking the
mean of all the data points contained in that cluster.
6. Keep repeating Steps 3 to 5 until any of the following stopping criteria is met:
• The centers of the newly formed clusters do not change
• Data points remain in the same clusters
• The maximum number of iterations is reached
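The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a production implementation: the function name `kmeans`, its parameters, and the sample points are hypothetical, the initial centers are simply sampled at random, and Euclidean distance is used throughout.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Plain K-Means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 2: pick K distinct data points as the initial centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):  # stopping criterion: iteration cap
        # Steps 3-4: assign each point to its nearest center (Euclidean).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Stopping criterion: no point changed cluster, so centers are stable.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 5: recompute each center as the mean of its member points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

Checking that the assignments are unchanged covers the first two stopping criteria at once, since unchanged assignments imply unchanged centers.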
5. Methods to determine the value of K
Now let us see how we can apply K-Means to train a Movie Recommendation System:
1. Collect a dataset to train the model. It can be taken from Kaggle. A company like Netflix developing such a
system would already have data on its own users.
2. This data has several attributes, such as username, age, subscriptions, watch history, ratings, and reviews.
Not all of it is useful for training the model, so we extract only the attributes that are required.
3. In this case, let's say we are implementing the recommender using content-based filtering, so we use each
user’s age, ratings, and the genres of the movies they watched. Since genre is a nominal variable, assign
a numeric value to each genre and keep this mapping for later use.
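4. Next, we run the K-Means algorithm on the dataset obtained in Step 3 for values of K in, say, the range 1-20.
For each value of K, we calculate the WCSS and plot the WCSS curve. The optimal K is typically taken at the
"elbow" of the curve, where further increases in K give only small reductions in WCSS.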
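Steps 3 and 4 can be sketched as below. Everything here is an illustrative assumption rather than the presenters' code: the user records are made up, `genre_to_id` and `kmeans_wcss` are hypothetical names, features are left unscaled, and K is capped at the number of rows because the toy data set is far smaller than the 1-20 range mentioned above.

```python
import math
import random

# Hypothetical user records: (age, average rating given, favourite genre).
users = [
    (18, 4.5, "action"), (21, 4.0, "action"), (19, 4.2, "comedy"),
    (35, 3.0, "drama"), (40, 2.8, "drama"), (38, 3.2, "romance"),
    (62, 4.8, "romance"), (65, 4.6, "drama"), (60, 4.9, "romance"),
]

# Step 3: map the nominal genre to a number and keep the mapping for later.
genre_to_id = {g: i for i, g in enumerate(sorted({u[2] for u in users}))}
rows = [(float(age), rating, float(genre_to_id[genre]))
        for age, rating, genre in users]

def kmeans_wcss(points, k, max_iters=100, seed=0):
    """Run plain K-Means and return the Within-Cluster Sum of Squares."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # converged: assignments unchanged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # WCSS: squared distance of every point to its own cluster center.
    return sum(math.dist(p, centers[lab]) ** 2
               for p, lab in zip(points, labels))

# Step 4: compute the WCSS for each candidate K.
wcss_by_k = {k: kmeans_wcss(rows, k) for k in range(1, len(rows) + 1)}
for k, wcss in wcss_by_k.items():
    print(f"K={k}: WCSS={wcss:.2f}")
```

Plotting `wcss_by_k` (for example with a charting library) gives the WCSS curve. Note that integer genre codes impose an artificial ordering and that age dominates the distance when features are unscaled, so one-hot encoding and feature scaling may be worth considering in practice.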