Collaborative Filtering Techniques Explained

Collaborative filtering is a technique used in recommendation systems to predict a user's preferences based on the preferences of similar users. Collaborative filtering algorithms can predict a user-item rating or determine the top-k items or users. Neighborhood-based collaborative filtering forms neighborhoods of similar users or items to make recommendations. User-based collaborative filtering uses ratings from similar users while item-based collaborative filtering uses ratings for similar items.


UNIT – 3

Collaborative filtering is a technique used in recommendation systems to make predictions about an individual’s preferences based on the preferences of similar users. The idea behind this method is that people who had similar preferences in the past are likely to have similar preferences in the future.

Collaborative filtering algorithms, including neighborhood-based collaborative filtering algorithms, can be formulated in one of two ways:
1. Predicting the rating value of a user-item combination: This is the simplest and most primitive formulation of a recommender system. In this case, the missing rating r_uj of user u for item j is predicted.
2. Determining the top-k items or top-k users: In most practical settings, the merchant is not necessarily looking for specific rating values of user-item combinations. Rather, it is more interesting to learn the top-k most relevant items for a particular user, or the top-k most relevant users for a particular item. The problem of determining the top-k items is more common than that of finding the top-k users, because the former formulation is used to present lists of recommended items to users in Web-centric scenarios. In traditional recommender algorithms, the “top-k problem” almost always refers to the process of finding the top-k items, rather than the top-k users.

However, the latter formulation is also useful to the merchant because it can be
used to determine the best users to target with marketing efforts.
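Once a model has produced predicted ratings, the top-k items formulation reduces to picking the k highest-scoring items. A minimal sketch (the `predicted` scores below are made-up illustrative values, not from the text):

```python
# Hypothetical predicted ratings for one user's unseen items (illustrative values).
predicted = {"item_a": 4.2, "item_b": 2.9, "item_c": 4.8, "item_d": 3.5}

def top_k(scores, k):
    """Return the k highest-scoring item ids, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(top_k(predicted, 2))  # ['item_c', 'item_a']
```

The same idea applies to the top-k users formulation, with users in place of items.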
Collaborative filtering algorithms use data on the interactions of a large number of users with a particular item, such as their ratings. Collaborative filtering uses two main approaches:

1. User-based collaborative filtering

2. Item-based collaborative filtering

Neighborhood-based Collaborative Filtering

Collaborative Filtering (CF) methods collect preferences in the form of ratings or signals from many users (hence the name) and then recommend items to a user based on the item interactions of people with tastes similar to that user’s past behavior. In other words, these methods assume that if person X likes a subset of the items that person Y likes, then X is more likely to share Y’s opinion on a given item than a random person who may or may not have the same preferences.
The main idea with neighborhood-based methods is to leverage either user-user
similarity or item-item similarity to make recommendations. These methods
assume that similar users tend to have similar behaviors when rating items. We can extend this assumption to items: similar items tend to receive similar ratings from the same user.
In these methods, the interactions between users and items are generally
represented by a user-item matrix, where each row represents a user and each
column represents an item, while the cells represent the interaction between the
two, which, in most cases, are the item ratings made by users. In this context, we
can define two types of neighborhood-based methods:
 User-based Collaborative Filtering: Ratings given by users similar to a user U are used to make recommendations. More specifically, to predict U's rating for a given item I, we calculate the weighted average of the ratings r of the k users (neighbors) most similar to U, where the weights are determined by the similarity between U and each of those users.

 Item-based Collaborative Filtering: Ratings of a group of similar items are used to make recommendations. Similarly, to predict the rating a user U gives to an item I, we calculate the weighted average of U's ratings r of the k items (neighbors) most similar to I, where the weights are determined by the similarity between I and each of those items.
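Both definitions share the same core computation, a similarity-weighted average of the neighbors' ratings. A minimal sketch (function and variable names are my own):

```python
def weighted_average(neighbor_ratings, similarities):
    """Similarity-weighted average of the k neighbors' ratings.

    For user-based CF the neighbors are similar users and the ratings are
    their ratings of the target item; for item-based CF the neighbors are
    similar items and the ratings are the target user's ratings of them.
    """
    numerator = sum(s * r for s, r in zip(similarities, neighbor_ratings))
    denominator = sum(abs(s) for s in similarities)
    return numerator / denominator if denominator else 0.0

# Two neighbors rated the target 4 and 5, with similarities 0.9 and 0.3:
print(weighted_average([4, 5], [0.9, 0.3]))  # 4.25
```

Note the denominator uses absolute similarities so that negative correlations do not inflate the prediction.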
Comparison between User-based and Item-based Methods
The difference is subtle, but user-based collaborative
filtering predicts a user’s rating by using the ratings of neighboring users, while
item-based collaborative filtering leverages the user's ratings on neighboring items,
which allows for more consistent predictions because it follows the rating
behaviors of that user. In the former case, the similarity is calculated between the
rows of the user-item matrix, while the latter looks at similarities between the
columns of the matrix.
These approaches also differ in the problems they are suited to solve. It is common to use the item neighborhood to recommend a list of top-k items to a user. On the other hand, the user neighborhood is useful for retrieving the top-k users from a segment to target them in marketing campaigns.
To understand the reasoning behind a recommendation, item-based methods
provide better explanations than user-based methods. This is because item-based
recommendations can use the item neighborhood to explain the results in the form
of “you bought this, so these are the recommended items”. The item neighborhood
can also be useful for suggesting product bundles to maximize sales. On the other
hand, user-based methods’ recommendations usually cannot be explained directly
because neighbor users are anonymized for privacy reasons.
Additionally, item-based methods may only recommend items very similar to what
the user already liked, whereas user-based methods often recommend a more
diverse set of items. This can encourage users to try new items and potentially keep
their engagement and interest.
Another significant difference between these approaches is related to ratings.
Calculating the similarity between users to predict ratings may be misleading
because users may rate items in a different manner. When you present a range of
values to the user, he/she might interpret them differently. For instance, in a 5-star
rating system, a user may rate an item as 3 because it does what it is expected to do
and nothing more, while others might use 3 to rate an item that barely works. Some
users rate items highly and others rate items less favorably. To address this issue, the ratings should be mean-centered by user: each neighbor's mean rating is subtracted from their raw ratings, and the target user's mean rating is added back to the prediction.
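The mean-centering adjustment can be sketched as follows (names are my own; this is the plain weighted average corrected by each rater's mean):

```python
def predict_mean_centered(target_mean, neighbor_means, neighbor_ratings, sims):
    """Subtract each neighbor's mean from its rating, take the
    similarity-weighted average of those offsets, then add back the
    target's own mean rating."""
    offsets = sum(s * (r - m) for s, r, m in zip(sims, neighbor_ratings, neighbor_means))
    norm = sum(abs(s) for s in sims)
    return target_mean + (offsets / norm if norm else 0.0)

# A neighbor with mean 4.0 rated the item 5.0 (one point above their norm),
# so a target with mean 3.0 is predicted one point above its own mean:
print(predict_mean_centered(3.0, [4.0], [5.0], [1.0]))  # 4.0
```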
Neighborhood Models in Practice
Let’s say we have the following small sample of a user-item matrix, where items
are from a digital commerce store. Notice there are missing ratings, which means
users typically do not rate all products.

To show how the algorithm works in practice, let’s assume we have built an item-
based model. Note that the steps of the algorithm would be analogous to the user-
based model, except for the perspective changes and focus on similarities between
rows (users).
Remember that neighborhood CF algorithms rely on the ratings and similarity
between items/users, so the first step is to define which similarity metric to use.
One of the most common choices is the Pearson similarity, which measures how
correlated a pair of vectors are. The range of values scales from -1 to 1, where
those values indicate negative and positive correlations, respectively, and 0
indicates no correlation between the vectors. For item-based models, the Pearson similarity between items i and j is computed over the set U_ij of users who rated both items:

sim(i, j) = Σ_{u ∈ U_ij} (r_ui − μ_i)(r_uj − μ_j) / ( √(Σ_{u ∈ U_ij} (r_ui − μ_i)²) * √(Σ_{u ∈ U_ij} (r_uj − μ_j)²) )

where r_ui is user u's rating of item i and μ_i is item i's mean rating over the co-rated set.

During this first phase, it’s usual to precompute the similarity matrix beforehand to
obtain a good performance during inference time. In the case of item-based
models, an item-item similarity matrix is built by applying the similarity metric
between all pairs of items. Since the matrix is sparse, we only consider the set of
mutually rated pairs of items during the similarity computation. For instance, the
similarity between items from columns 1 and 4 of the image above will be
computed as the similarity between vectors [4,3,5] and [5,3,4]. It’s possible that a
pair of items may show no co-ratings by users due to the sparsity of the matrix,
resulting in an empty set. In that case, a value of 0 similarity is assigned for that
pair. To improve computational efficiency, it is common to consider only the k
nearest neighbors of an item during inference time.
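This first phase can be sketched as follows, assuming the convention that `None` marks a missing rating: the similarity is computed over only the co-rated positions, and 0 is returned when there are none.

```python
import math

def pearson_co_rated(item_x, item_y):
    """Pearson similarity between two item rating columns; None marks a
    missing rating, and only mutually rated positions are used."""
    pairs = [(a, b) for a, b in zip(item_x, item_y) if a is not None and b is not None]
    if not pairs:
        return 0.0  # no co-ratings: assign 0 similarity, as described above
    mean_x = sum(a for a, _ in pairs) / len(pairs)
    mean_y = sum(b for _, b in pairs) / len(pairs)
    num = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    den = (math.sqrt(sum((a - mean_x) ** 2 for a, _ in pairs))
           * math.sqrt(sum((b - mean_y) ** 2 for _, b in pairs)))
    return num / den if den else 0.0

# The columns-1-and-4 example from the text: co-rated vectors [4,3,5] and [5,3,4].
print(round(pearson_co_rated([4, 3, None, 5], [5, 3, None, 4]), 2))  # 0.5
```

Precomputing this for all item pairs yields the item-item similarity matrix used at inference time.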
Let’s say we want to predict how Madison rated the Animal Farm book and we defined k=2 as the number of nearest neighbors to consider during the calculation. To simplify the example, we will only manually calculate the similarities between the target item and the items from columns 2 and 4, because they are the nearest neighbors for this item. When calculating the mean rating during the similarity computation, we will consider only the set of ratings that the two items have in common.
The image below shows how the neighborhood is formed. The circle in red is the value we’re trying to predict. The squares in green are ratings from Madison that are going to be used to infer the rating for the target item. The other two ratings marked with an X are not considered because k=2. The rectangles in orange show the set of common ratings between the target item and the item from column 2, while the rectangles in blue show the same for the common ratings between the target item and the item from column 4.
These are the common ratings between the target item (item 3) and the first neighbor (item 2): [4,3,3] and [4,4,3]. The first step is to calculate the mean of each set:

Target item mean = (4 + 3 + 3) / 3 = 3.33
Item 2 mean = (4 + 4 + 3) / 3 = 3.67

The Pearson similarity formula centers the ratings by their mean, so we transform each vector and then plug the results into the equation:

Target item mean-centered vector = [(4 − 3.33), (3 − 3.33), (3 − 3.33)] = [0.67, −0.33, −0.33]
Item 2 mean-centered vector = [(4 − 3.67), (4 − 3.67), (3 − 3.67)] = [0.33, 0.33, −0.67]

To simplify the calculations, we separate the numerator and denominator:

Numerator = (0.67 * 0.33) + (−0.33 * 0.33) + (−0.33 * −0.67) = 0.33
Denominator = √(0.67² + 0.33² + 0.33²) * √(0.33² + 0.33² + 0.67²) = 0.82 * 0.82 = 0.67

Then finally the similarity between items 3 and 2 = 0.33 / 0.67 = 0.5

The same calculation is done for the similarity between items 3 and 4, whose common ratings are [4,3] and [5,4]:

Target item mean = (4 + 3) / 2 = 3.5
Item 4 mean = (5 + 4) / 2 = 4.5

Target item mean-centered vector = [(4 − 3.5), (3 − 3.5)] = [0.5, −0.5]
Item 4 mean-centered vector = [(5 − 4.5), (4 − 4.5)] = [0.5, −0.5]

Numerator = (0.5 * 0.5) + (−0.5 * −0.5) = 0.5
Denominator = √(0.5² + 0.5²) * √(0.5² + 0.5²) = 0.5

Similarity between items 3 and 4 = 0.5 / 0.5 = 1

Next, we calculate the mean for each neighbor item, considering all of the item’s ratings:

Item 2 mean = (4 + 4 + 1 + 2 + 3) / 5 = 2.8
Item 4 mean = (5 + 3 + 4) / 3 = 4

Then, we plug these values, together with Madison’s ratings for items 2 and 4 (1 and 3, respectively), into the mean-centered prediction equation:

Rating(Madison, Item 3) = 3.33 + [0.5 * (1 − 2.8) + 1 * (3 − 4)] / (0.5 + 1) = 2.07
Since ratings are discrete numbers, we round this value to 2. It’s important to note
that in a real-world setting, it’s often recommended to use neighborhood methods
only when k is above a certain threshold because, when the number of neighbors is
small, the predictions are usually not precise. An alternative would be to use
Content-based filtering when we do not have enough data about the user-item
relationship.
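The worked example above can be verified with a short script (the numbers are exactly the ones derived in the text):

```python
# Predict Madison's rating for item 3 from its two nearest neighbors (items 2 and 4).
target_mean = (4 + 3 + 3) / 3   # item 3's mean over its known ratings, ~3.33
sims = [0.5, 1.0]               # sim(item 3, item 2) and sim(item 3, item 4)
madison_ratings = [1, 3]        # Madison's ratings for items 2 and 4
neighbor_means = [2.8, 4.0]     # item 2 and item 4 means over all their ratings

offset = sum(s * (r - m) for s, r, m in zip(sims, madison_ratings, neighbor_means))
prediction = target_mean + offset / sum(sims)

print(round(prediction, 2))  # 2.07
print(round(prediction))     # 2
```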
