DM - Lecture 5

The Pearson correlation coefficient is a measure of the linear correlation between two variables. In the context of collaborative filtering recommender systems, it is used to measure the similarity between two users based on their ratings of items. Some key properties of the Pearson correlation in this context: it ranges from -1 to 1, where 1 is total positive correlation, 0 is no correlation, and -1 is total negative correlation; it considers both the rating values and how far they sit from each user's average rating, so two users can have high positive correlation even if they rate on different scales, as long as their ratings relative to their own averages are similar; and only items that both users have rated are considered.


1

RECOMMENDER SYSTEMS

Jesse Davis
Economics:
Traditional Retail vs. The Web
2

 Physical retail: Space is limited and expensive
  People are not willing to travel far for products
  Products stocked must make money
  Implication: Must focus on popular products

 Web: Storage space is cheap
  Sites cater to everyone
  Implication: Low cost and easy access make it possible to offer far more choice

Problem: Too many products; which ones interest me?

Solution: Systems that can recommend products


Sales vs. Products: The Long Tail
3

[Figure: the long-tail curve of sales vs. products (e.g., songs, books). Physical stores stock only the popular head of the curve; mixed retailers (e.g., Amazon) and online-only services (e.g., iTunes) can also serve the long tail.]


Making Recommendations
4

 Example: “Into Thin Air” and “Touching the Void” are similar books

 The book “Into Thin Air” helped make “Touching the Void” a bestseller

Wired article:
http://www.wired.com/wired/archive/12.10/tail.html
Recommendation Types
5

 Editorial
  List of favorites
  Essential items

 Aggregates
  Top 10 lists
  Most emailed articles
  Most recent posts

 Personalized user recommendations
  Amazon
  Movie sites
6 Key Challenges
Key Challenges
7

 How do we get user feedback?

 How do we evaluate predictions?

 How do we predict an unknown rating?


 Main interest: Highly rated products
Options for Obtaining Ratings
8

 Explicit ratings of products

 Implicit signals:
  Bought items
  Items on “wish lists”
  Recently clicked product pages/links
  Length of time spent on a product page
  Printed links
  Etc.
Evaluation
9

[Figure: a users × items ratings matrix (values 1-5, many entries unknown). Each user's most recent ratings are withheld (shown as ?) to serve as the evaluation set.]
Metrics: Compare with Known Rating
10

 Precision at top 10: % of the predicted top-10 items that the user actually rated highly

 Root-mean squared error:
   $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(P_i - R_i)^2}$

 Spearman’s rank correlation between the model’s and the user’s complete rankings:
   $R_s = 1 - \frac{6\sum_{i=1}^{n}(U_i - M_i)^2}{n(n^2 - 1)}$

 For 0/1 data:
  Coverage: number of items/users for which the system can make a prediction
  Precision, ROC, etc.
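As a rough sketch (not from the slides), these metrics could be computed as follows in Python; the function names, the 4-star relevance threshold, and the example numbers are illustrative assumptions, and the Spearman version ignores ties:

```python
import numpy as np

def rmse(pred, true):
    # Root-mean squared error between predicted and known ratings
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return np.sqrt(np.mean((pred - true) ** 2))

def precision_at_k(pred, true, k=10, threshold=4.0):
    # Fraction of the top-k items (ranked by predicted rating) the user actually rated highly
    top_k = np.argsort(pred)[::-1][:k]
    return float(np.mean(np.asarray(true)[top_k] >= threshold))

def spearman(pred, true):
    # R_s = 1 - 6 * sum(d^2) / (n (n^2 - 1)), d = difference in ranks (ties ignored)
    pred_ranks = np.argsort(np.argsort(pred))
    true_ranks = np.argsort(np.argsort(true))
    n = len(pred_ranks)
    d2 = np.sum((pred_ranks - true_ranks) ** 2)
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

# Example: model predictions P vs. the user's actual ratings R
P = [4.2, 3.1, 4.8, 2.0, 3.9]
R = [4.0, 3.0, 5.0, 1.0, 4.0]
print(rmse(P, R), precision_at_k(P, R, k=3), spearman(P, R))
```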
Weakness with Metrics
11

 Focusing on accuracy misses important points


 Prediction diversity
 Prediction context

 Order of prediction

 Only high ratings matter


 RMSE might penalize a method that does well for
high ratings and badly for others
Recommending Products
12

 Idea 1: This is a machine learning problem


 Collect (a lot of) ratings for the product, per user
 Define a set of features (i.e., profile for item)

 Learn a model

 Idea 2: Content-based filtering


 Define a profile for an item
 Define a profile for a user

 Compare similarity between users and items


Key Challenge: Building Item Profile
13

 Item profile: A way to describe each item


 These are usually hand-crafted
 For a movie:
 Actors

 Genre

 Director

 Etc.

 News articles:
 Words, title, author, etc.
User Profiles and
Content-Based Prediction
14

 Main idea: Recommend items to customer x that


are similar to previous items rated highly by x

 User profile possibilities:


 Weighted average of rated item profiles
 Variation: weight by difference from average
rating for item

 Prediction heuristic: Given user profile x and


item profile i, estimate
$$u(\mathbf{x}, \mathbf{i}) = \cos(\mathbf{x}, \mathbf{i}) = \frac{\mathbf{x} \cdot \mathbf{i}}{\|\mathbf{x}\| \cdot \|\mathbf{i}\|}$$
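A minimal sketch of this heuristic in Python, assuming items are described by small feature vectors; the helper names, the mean-centred weighting variant, and the toy profiles are illustrative assumptions rather than anything prescribed by the slides:

```python
import numpy as np

def build_user_profile(item_profiles, ratings):
    # Weighted average of rated item profiles, weighting each item by how far
    # its rating sits above (or below) the user's own average rating
    ratings = np.asarray(ratings, float)
    weights = ratings - ratings.mean()
    return weights @ item_profiles / (np.abs(weights).sum() + 1e-9)

def score(user_profile, item_profile):
    # u(x, i) = cos(x, i) = (x . i) / (||x|| ||i||)
    denom = np.linalg.norm(user_profile) * np.linalg.norm(item_profile) + 1e-9
    return float(user_profile @ item_profile / denom)

# Toy 3-feature item profiles (e.g., genre indicators) and one user's ratings of them
items = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0],
                  [1.0, 1.0, 0.0]])
profile = build_user_profile(items, ratings=[5, 1, 3])
print(score(profile, np.array([1.0, 0.0, 0.0])))   # estimated affinity for an unseen item
```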
Pros and Cons of
Content-Based Approaches
15

 Advantages
 Only need data about one user
 More personalized approach

 More easily recommend new/unpopular items

 Can provide context for recommendation

 Disadvantages
 Must manually construct meaningful features
 Never recommends items outside of a
user’s content profile
 Hard to build a profile for a new user
16 Netflix and Recent Advances
Present Recommendation in the
Context of Netflix Challenge
17

 While it is now older, the challenge highlights a number of important issues in data mining that are applicable to many other problems

 It has driven many of the advances in recommender systems: we are still building on these ideas
Netflix: The $1 Million Question
18

Given: A training set of 100 million ratings


Do: Build a recommendation system that
improves the root-mean squared error by
10% over Netflix’s system
Netflix Data and Evaluation
19

Each rating: a (user, movie, rating, time stamp) tuple

 Training set: 100M ratings
  99% sparse
  480,000 users
  17,700 movies

 Test set: 3M ratings known only to Netflix
  1.5M for the public leaderboard
  1.5M for determining the final winner

Score: test-set RMSE

The Competition
20

 Register on the competition site and download the anonymized data
 Submit predictions for the test set at most once per day
 Winning algorithm: software and a non-exclusive license go to Netflix; the winner must publish the algorithm
 $50k improvement prizes awarded every year
 Once the 10% threshold is met: 30-day final competition period
 Started in October 2006: more popular than anticipated (eventually around 40,000 teams!)
Pros and Cons of a Challenge?
21

 Advantages
 Expensive to develop internal system
 Publicity is good

 Awarding prize will pay for itself:


$1 million ≪ value of software to Netflix

 Disadvantages
 Privacy concerns (user backlash to data release)
 Prize won too quickly or no one wins the prize

 Time and effort to run the competition

 Results not useful in practice (e.g., too slow)


What Worked Well:
Or General Data Analysis Advice
22

 Try the obvious thing first

 Predict the right thing

 Think outside the box

 Know your data

 The more (models), the merrier


23 Try Obvious Approaches First
Classic Recommendation Approach:
Collaborative Filtering
24

 Intuition: Find users with similar tastes and


recommend products they liked

 Insight: Can do this just by looking at ratings!

 Big idea:
 Find other users whose ratings are similar to the
current user
 Propagate the (dis)likes to the current user
Pictorial Overview
25

        R1  R2  R3  R4  R5  R6
Alice    2   -   3   2   -   1
Bob      2   5   4   -   -   2
Chris    4   3   -   -   -   5
Diana    3   -   2   4   -   5

Eve      2   5   ?   ?   ?   ?   (active user)

Find the users whose ratings correlate with Eve’s and use their
ratings to fill in her missing entries, e.g., predict R3 = 4
General Algorithm
26

 Step 1: Measure the similarity between


the user of interest and all other users

 Step 2 (Optional): Select a smaller subset


consisting of the most similar users

 Step 3: Predict ratings as a weighted


combination of “nearest neighbors”

 Step 4: Return the highest-rated items


Step 1: Similarity Weighting
27

 Question: How can we compute the similarity


between two users?

 Idea 1: Jaccard similarity: $\mathrm{sim}(S_i, S_j) = \frac{|S_i \cap S_j|}{|S_i \cup S_j|}$
  Problem: Ignores the actual rating values

 Idea 2: Cosine similarity: $\mathrm{sim}(S_i, S_j) = \frac{S_i \cdot S_j}{\|S_i\| \cdot \|S_j\|}$
  Problem: Treats missing ratings as zero
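A small sketch of both similarity measures, assuming each user's ratings are stored as a NumPy vector with NaN marking unrated items (function names and example values are illustrative):

```python
import numpy as np

def jaccard(r_i, r_j):
    # Jaccard similarity over the *sets* of rated items (NaN marks a missing rating)
    rated_i, rated_j = ~np.isnan(r_i), ~np.isnan(r_j)
    union = np.sum(rated_i | rated_j)
    return np.sum(rated_i & rated_j) / union if union else 0.0

def cosine(r_i, r_j):
    # Cosine similarity, treating missing ratings as zero (the weakness noted above)
    a, b = np.nan_to_num(r_i), np.nan_to_num(r_j)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

nan = np.nan
alice = np.array([2, nan, 3, 2, nan, 1], float)
bob   = np.array([2, 5, 4, nan, nan, 2], float)
print(jaccard(alice, bob), cosine(alice, bob))
```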
Step 1: Similarity Weighting
28

Pearson Correlation

$$W_{ij} = \frac{\sum_k (R_{ik} - \bar{R}_i)(R_{jk} - \bar{R}_j)}{\sqrt{\sum_k (R_{ik} - \bar{R}_i)^2}\;\sqrt{\sum_k (R_{jk} - \bar{R}_j)^2}}$$

$$\bar{R}_i = \frac{1}{m}\sum_{e=1}^{m} R_{ie} \quad \text{(average rating given by user } i\text{)}$$

$R_{ik}$ = user i’s rating on item k

Consider only items k rated by both users!!!
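A sketch of this similarity in Python (again with NaN-marked rating vectors). One modelling choice is left open by the slide: here each user's mean is taken over all items they rated, while the sums run only over the co-rated items; some implementations take the means over the co-rated items instead:

```python
import numpy as np

def pearson(r_i, r_j):
    # W_ij: Pearson correlation between users i and j.
    # Means are each user's average over all items they rated;
    # the sums run only over items rated by *both* users.
    both = ~np.isnan(r_i) & ~np.isnan(r_j)
    if both.sum() == 0:
        return 0.0
    mean_i, mean_j = np.nanmean(r_i), np.nanmean(r_j)
    di = r_i[both] - mean_i
    dj = r_j[both] - mean_j
    denom = np.sqrt(np.sum(di ** 2)) * np.sqrt(np.sum(dj ** 2))
    return float(np.sum(di * dj) / denom) if denom else 0.0

nan = np.nan
eve = np.array([2, 5, nan, nan, nan, nan], float)
bob = np.array([2, 5, 4, nan, nan, 2], float)
print(pearson(eve, bob))
```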


What Does Pearson Correlation
Mean?
29

$$W_{ij} = \frac{\sum_k (R_{ik} - \bar{R}_i)(R_{jk} - \bar{R}_j)}{\sqrt{\sum_k (R_{ik} - \bar{R}_i)^2}\;\sqrt{\sum_k (R_{jk} - \bar{R}_j)^2}}$$

 Positive if both users’ ratings tend to fall on the same side of their respective averages
 1 = perfect linear relationship, with R_ik increasing as R_jk increases
 -1 = perfect linear relationship, with R_ik decreasing as R_jk increases
Step 2: Selecting Neighborhood
30

 Could use the whole database

 Could ignore “far away” users


 Pick “k” nearest users
 Include all users above a predetermined weight
threshold
 Note: Could use any k-NN strategy here
Step 3: Predicting Ratings
31

 Predict a rating
 Account for different rating levels by looking at
difference from a user’s average rating
 Weight each user’s contribution by similarity

$$P_{ik} = \bar{R}_i + \alpha \sum_j W_{ij}\,(R_{jk} - \bar{R}_j), \qquad \alpha = \frac{1}{\sum_j |W_{ij}|}$$

W_ij = Pearson correlation between users i and j
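A sketch of this prediction step (NaN marks missing ratings; the example matrix, neighbour set, and similarity values are made up for illustration):

```python
import numpy as np

def predict(R, i, k, weights):
    # P_ik = R̄_i + α Σ_j W_ij (R_jk - R̄_j),  α = 1 / Σ_j |W_ij|
    # R: users × items matrix with NaN for missing ratings
    # weights: dict {j: W_ij} of the selected neighbours of user i
    mean_i = np.nanmean(R[i])
    num, denom = 0.0, 0.0
    for j, w in weights.items():
        if np.isnan(R[j, k]):
            continue  # neighbour j has not rated item k
        num += w * (R[j, k] - np.nanmean(R[j]))
        denom += abs(w)
    return mean_i + num / denom if denom else mean_i

nan = np.nan
R = np.array([[2, 5, nan, nan],   # Eve (user 0, active)
              [2, 5, 4,   2  ],   # Bob
              [3, nan, 2,  5 ]],  # Chris
             float)
print(predict(R, i=0, k=2, weights={1: 0.9, 2: -0.2}))
```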
Step 4: Return Items
32

 Usually only interested in highly rated items

 Do not want to overload the user

 Therefore, usually return the top 2, 5, or 10 items

 Can give user the option to view more items


Practical Issues
33

 Potentially very large datasets


 Number of users: Millions
 Number of items: 100,000s

 Key issue: How to efficiently find similar users?


 Pairwise similarity for 1M users: ~5 days
 Pairwise similarity for 10M users: ~1 year!

 Big trick: Locality Sensitive Hashing


(See programming for big data)
Pros and Cons of
Collaborative Filtering
34

 Advantages
 Simple and intuitive approach for any item type
 No feature construction and selection
 Exploits information about other users

 Disadvantages
 Data is sparse: Hard to find similar users
 Cold start: Need enough users in the database
 First rater: Can’t recommend unrated items,
e.g., new or unique items
 Popularity bias: Favors items that lots of people like
(i.e., bad if you have unique taste)
Performance of Various Methods
Global average: 1.1296

User average: 1.0651


Movie average: 1.0533

Netflix: 0.9514
Basic Collaborative filtering: 0.94

Grand Prize: 0.8563

35
36 Predict the Right Thing
Predict the “Right Thing”
37

 Our task
 Predict(user_id, movie_id, ?)
 Minimize RMSE of predictions

Question: What will produce low RMSE?

 Two points:
 Obvious: Better results if model optimized
towards the given objective
 Subtle: To get good results, often have to derive
a new target variable to predict
What Affects a User’s Rating?
38

 Hypothesis: Multiple factors affect rating


 Goal: Isolate the portion of the rating that
captures the user-movie effect
 Two big effects:

  User bias – the user’s rating scale, and the values of other ratings the user gave recently (user’s mood, anchoring, multi-user accounts)

  Movie bias – the (recent) popularity of movie i, and selection bias (“frequency”)
Capturing Global Effects:
Better Baseline
39

$$r_{xi} = b_{xi} = \mu + b_x + b_i$$

 $\mu$: overall mean rating (average rating of all movies in the data)
 $b_x$: rating deviation of user x = (average rating of user x) − $\mu$
 $b_i$: rating deviation of movie i = (average rating of movie i) − $\mu$
Example of Baseline
40

 Mean movie rating: 3.7 stars

 The Sixth Sense is 0.5 stars above average

 Joe rates 0.2 stars below average

 Baseline estimation:
 3.7+ 0.5 + (-0.2) = 4
 Joe will rate The Sixth Sense 4 stars
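A minimal sketch of this baseline in Python, assuming a NaN-marked ratings matrix; in practice the per-user and per-movie means would be regularized (shrunk toward the overall mean), which is omitted here:

```python
import numpy as np

def baseline(R):
    # b_xi = mu + b_x + b_i, with NaN marking missing ratings
    mu = np.nanmean(R)                       # overall mean rating
    b_user = np.nanmean(R, axis=1) - mu      # per-user deviation
    b_item = np.nanmean(R, axis=0) - mu      # per-movie deviation
    return mu + b_user[:, None] + b_item[None, :]

nan = np.nan
R = np.array([[5, 4, nan],
              [3, nan, 2],
              [4, 5, 3]], float)
print(baseline(R).round(2))   # baseline estimate for every (user, movie) pair
```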
Three Problems with the
Collaborative Filtering Model
41

 Similarities are arbitrary: Many choices and


not tied to our ultimate objective
 Do not account for interactions between
neighbors: May “double count’’
 Weighted average: Ignores differences among high- and low-confidence neighbors
  Example: 3-NN with similarities 0.1, 0.05, and 0.15:
    2 + [0.1(2) + 0.05(3) + 0.15(2)] / (0.1 + 0.05 + 0.15) = 3.27
  3-NN with similarities 0.6, 0.3, and 0.9:
    2 + [0.6(2) + 0.3(3) + 0.9(2)] / (0.6 + 0.3 + 0.9) = 3.27
  Both neighborhoods give exactly the same prediction, even though the second one is far more similar
Idea: Weighted Sum
42

 Weighted sum rather than weighted avg.:


$$\hat{r}_{xi} = b_{xi} + \sum_{j \in N(i;x)} w_{ij}\,(r_{xj} - b_{xj})$$
 A few notes:
 𝒃𝒙𝒊 is the new, more complex baseline with biases
 𝑵 𝒊; 𝒙 : movies rated by user x that are similar to
movie i
 𝒘𝒊𝒋 is the interpolation weight (some real number)

 𝒘𝒊𝒋 models the interaction between pairs of movies


(it does not depend on user x)
Picking Weights: Optimization
43

 Competition objective: Sum of squared errors

$$\sum_{(i,x)\in R} \big(\hat{r}_{xi} - r_{xi}\big)^2$$

 Idea: Pick the weights to minimize this objective!

$$J(w) = \sum_{x,i} \Big( \underbrace{b_{xi} + \textstyle\sum_{j\in N(i;x)} w_{ij}\,(r_{xj} - b_{xj})}_{\text{predicted rating}} \; - \underbrace{r_{xi}}_{\text{true rating}} \Big)^2$$

Posed as an optimization problem!
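A toy sketch of this idea: learn interpolation weights by gradient descent on J(w). This is not the solver used in the competition (which handled per-item neighborhoods and regularization); the synthetic data and names are illustrative:

```python
import numpy as np

def learn_weights(neighbor_devs, residuals, lr=0.01, epochs=2000):
    # Minimize J(w) = sum_n ( w . neighbor_devs[n] - residuals[n] )^2 by gradient descent
    # neighbor_devs[n, j] = r_xj - b_xj for neighbour movie j in the n-th training rating
    # residuals[n]        = r_xi - b_xi, the part of the rating the weights must explain
    w = np.zeros(neighbor_devs.shape[1])
    for _ in range(epochs):
        err = neighbor_devs @ w - residuals     # per-rating prediction error
        w -= lr * 2 * neighbor_devs.T @ err     # dJ/dw = 2 X^T err
    return w

# Synthetic example: 5 training ratings, each explained by the same 3 neighbouring movies
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))            # neighbour deviations r_xj - b_xj
y = X @ np.array([0.5, 0.2, -0.1])     # residuals generated from "true" weights
print(learn_weights(X, y).round(2))    # should recover roughly [0.5, 0.2, -0.1]
```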
Performance of Various Methods
Global average: 1.1296

User average: 1.0651


Movie average: 1.0533

Netflix: 0.9514
Basic Collaborative filtering: 0.94

CF+Biases+learned weights: 0.91

Grand Prize: 0.8563

44
45 Think Outside the Box
Let’s Think about the Task
46

Movies  Goal: Fill in “missing entries”


1 3 ? 5
5 4 ? ?
Users

 Likely some shared hidden


2 4 1 2 structure for users and movies
2 4 5
4 3 4 2  Implication: Could represent the
1 3 3 ? data using “fewer” dimension
Latent Factor Models
47

[Figure: movies such as Sense and Sensibility, Amadeus, Braveheart, Lethal Weapon, Dumb and Dumber, Syriana, and Ocean’s 11 plotted in a two-dimensional latent space: Factor 1 runs from serious to light-hearted, Factor 2 from drama to action.]
Latent Factor Model
48

 R ≈ Q × Pᵀ
  R: the u × m (users × movies) ratings matrix, with many missing entries
  Q: a u × d (users × “topics”) matrix
  P: an m × d (movies × “topics”) matrix

 Key idea: d ≪ m, u

 “Topics”: shared hidden structure (e.g., how much each user likes each genre)
Latent Factor Models
49

[Figure: the ratings matrix R written as the product of a users × factors matrix and an items × factors matrix (the item-factor matrix should be transposed, but is shown untransposed for space). A missing entry is estimated as the dot product of the corresponding user-factor and item-factor rows, e.g.
? = (-.5)(-2) + (.6)(.3) + (.5)(2.4) = 2.38]
Optimization problem
50

Goal: Find Matrices P and Q according to:


$$\min_{P,Q} \sum_{(i,j)\in R} \big(R_{ij} - (QP^{\mathsf T})_{ij}\big)^2$$

Three things to think about


 R has many missing entries

 Objective is non-convex

 R is sparse, so worried about overfitting


Intuition of a Solution
51

[Figure: R ≈ Q × Pᵀ with both factor matrices completely unknown (all entries ?).]
Intuition of a Solution
52

[Figure: if one factor matrix is held fixed (numbers filled in), each unknown row of the other factor matrix can be found by fitting a least-squares solution to that user's known ratings, i.e., regressing the observed ratings Y on the fixed item factors (X1, X2).]
Idea: Alternating Least Squares
53

Sum over the non-missing entries:

$$\min_{P,Q} \sum_{(i,j)\in R} \big(R_{ij} - (QP^{\mathsf T})_{ij}\big)^2 + \lambda_1 \|Q\|_F^2 + \lambda_2 \|P\|_F^2$$

(Frobenius norm = the matrix equivalent of L2)

Input: R, d
Randomly initialize Q, P
for i = 0, 1, … do
  Fix Q: optimize P   (solve m d-dimensional ridge regression problems)
  Fix P: optimize Q   (solve u d-dimensional ridge regression problems)
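A compact sketch of this loop in Python (NaN marks missing entries; d, λ, the iteration count, and the toy matrix are illustrative choices, and a real implementation would regularize more carefully and parallelize the per-row solves):

```python
import numpy as np

def als(R, d=2, lam=0.1, iters=20, seed=0):
    # Alternating least squares for R ≈ Q @ P.T with NaN-marked missing entries.
    # Each inner update is a d-dimensional ridge regression.
    rng = np.random.default_rng(seed)
    u, m = R.shape
    Q = rng.normal(scale=0.1, size=(u, d))
    P = rng.normal(scale=0.1, size=(m, d))
    known = ~np.isnan(R)
    for _ in range(iters):
        for x in range(u):                      # fix P, optimize each user row of Q
            j = known[x]
            A = P[j].T @ P[j] + lam * np.eye(d)
            Q[x] = np.linalg.solve(A, P[j].T @ R[x, j])
        for i in range(m):                      # fix Q, optimize each item row of P
            x = known[:, i]
            A = Q[x].T @ Q[x] + lam * np.eye(d)
            P[i] = np.linalg.solve(A, Q[x].T @ R[x, i])
    return Q, P

nan = np.nan
R = np.array([[1, 3, nan, 5],
              [5, 4, nan, nan],
              [2, 4, 1, 2],
              [4, 3, 4, 2]], float)
Q, P = als(R)
print((Q @ P.T).round(1))   # reconstructed matrix, including the missing entries
```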
Latent Factor Models with Biases
54

[Figure: the ratings matrix R factored into a user-factor matrix Q and an item-factor matrix Pᵀ.]

Without biases:
$$\hat{r}_{xi} = q_i \cdot p_x = \sum_f q_{if}\, p_{xf}$$

With biases:
$$\hat{r}_{xi} = \mu + b_x + b_i + q_i \cdot p_x$$

where $\mu$ is the overall mean rating, $b_x$ the bias of user x, $b_i$ the bias of movie i, and $q_i \cdot p_x$ the user-movie interaction.

Koren, Bell, Volinsky, IEEE Computer, 2009
55
Notes On Latent Factors
56

 Regularization is key to avoiding overfitting

 Often want to preprocess or normalize the data


(Note: Must convert back to original scale when
making a prediction)

 Alternating least squares is easy to parallelize

 Can also solve via stochastic gradient descent
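As a sketch of the biased model from the previous slides trained with stochastic gradient descent (all hyperparameters, helper names, and the toy rating triples are illustrative assumptions):

```python
import numpy as np

def sgd_biased_mf(ratings, n_users, n_items, d=10, lr=0.01, reg=0.05, epochs=30, seed=0):
    # r̂_xi = μ + b_x + b_i + q_i · p_x, trained by stochastic gradient descent
    # ratings: list of (user, item, rating) triples
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])
    b_user, b_item = np.zeros(n_users), np.zeros(n_items)
    P = rng.normal(scale=0.1, size=(n_users, d))   # user factors p_x
    Q = rng.normal(scale=0.1, size=(n_items, d))   # item factors q_i
    for _ in range(epochs):
        for x, i, r in ratings:
            err = r - (mu + b_user[x] + b_item[i] + Q[i] @ P[x])
            b_user[x] += lr * (err - reg * b_user[x])
            b_item[i] += lr * (err - reg * b_item[i])
            P[x], Q[i] = (P[x] + lr * (err * Q[i] - reg * P[x]),
                          Q[i] + lr * (err * P[x] - reg * Q[i]))
    return mu, b_user, b_item, P, Q

# Tiny illustrative data: (user, item, rating)
data = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 4), (2, 2, 2)]
mu, bu, bi, P, Q = sgd_biased_mf(data, n_users=3, n_items=3)
print(round(mu + bu[0] + bi[2] + Q[2] @ P[0], 2))  # predicted rating of user 0 for item 2
```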


Non-Negative Matrix Factorization
57

 Often want the entries of P, Q to be


non-negative: Easier to interpret and visualize

 Easy solution: Projected gradient methods


 Normal gradient step: $w_{i+1} \leftarrow w_i - \eta\,\frac{\partial \ell(w)}{\partial w_i}$
 Projected gradient step: $w_{i+1} \leftarrow \max\!\Big(0,\; w_i - \eta\,\frac{\partial \ell(w)}{\partial w_i}\Big)$
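A toy sketch of the projected-gradient idea applied to the factorization above (the learning rate, iteration count, and small matrix are illustrative; real NMF solvers choose step sizes far more carefully):

```python
import numpy as np

def nmf_projected_gradient(R, d=2, lr=0.002, iters=2000, seed=0):
    # Non-negative factorization R ≈ Q @ P.T: take a normal gradient step on the
    # squared error over known entries, then project by clipping negatives to zero
    rng = np.random.default_rng(seed)
    known = ~np.isnan(R)
    Q = rng.uniform(0.0, 1.0, size=(R.shape[0], d))
    P = rng.uniform(0.0, 1.0, size=(R.shape[1], d))
    for _ in range(iters):
        E = np.where(known, np.nan_to_num(R) - Q @ P.T, 0.0)   # error on known entries only
        Q = np.maximum(0.0, Q + lr * E @ P)                    # projected gradient steps
        P = np.maximum(0.0, P + lr * E.T @ Q)
    return Q, P

nan = np.nan
R = np.array([[5, 3, nan],
              [4, nan, 1],
              [1, 1, 5]], float)
Q, P = nmf_projected_gradient(R)
print((Q @ P.T).round(1))   # non-negative reconstruction, including the missing cells
```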
Performance of Various Methods
Global average: 1.1296

User average: 1.0651


Movie average: 1.0533

Netflix: 0.9514
Basic Collaborative filtering: 0.94

CF+Biases+learned weights: 0.91


Latent factors: 0.90
Latent factors+Biases: 0.89

Grand Prize: 0.8563

58
59 Know Your Data
Temporal Effect: Early 2004 Sudden
Jump in Average Movie Rating
60

◼ Pre-2004 average: 3.4 stars
◼ Post-2004 average: >3.6 stars

 Did Netflix improve its matching of users to movies, leading to higher ratings?
 Did people become biased towards higher ratings, i.e., did the meaning of a rating change?
Temporal Effect: Movie Age
61

◼ High initial ratings

◼ Ratings increase with


movie age

 Are older movies rated mainly by users who are better matched to them?
 Are older movies inherently better than newer movies?
Temporal Biases Of Users
[Koren, KDD’09]
62

 Original model: $r_{xi} = \mu + b_x + b_i + q_i \cdot p_x$

 Add time dependence to the biases:
   $r_{xi} = \mu + b_x(t) + b_i(t) + q_i \cdot p_x$
  Make the parameters $b_x$ and $b_i$ depend on time:
   (1) parameterize the time dependence by linear trends
   (2) bin time, with each bin covering 10 consecutive weeks

 Add temporal dependence to the factors:
  $p_x(t)$ … user preference vector on day t
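A small sketch of how such time-dependent biases might look at prediction time; the binning granularity, the linear drift form, and all parameter values are illustrative, and Koren's actual model uses a richer parameterization:

```python
import numpy as np

def item_bias_t(b_i_bins, t, weeks_per_bin=10):
    # b_i(t): per-movie bias looked up in the time bin containing week t
    return b_i_bins[min(t // weeks_per_bin, len(b_i_bins) - 1)]

def user_bias_t(b_x, alpha_x, t, t_mean):
    # b_x(t): per-user bias with a linear drift term around the user's mean rating week
    return b_x + alpha_x * (t - t_mean)

def predict(mu, b_x, alpha_x, t_mean, b_i_bins, q_i, p_x, t):
    # r_xi(t) = mu + b_x(t) + b_i(t) + q_i . p_x
    return mu + user_bias_t(b_x, alpha_x, t, t_mean) + item_bias_t(b_i_bins, t) + q_i @ p_x

b_i_bins = np.array([-0.3, -0.1, 0.2, 0.4])          # movie bias per 10-week bin
q_i, p_x = np.array([0.2, -0.5]), np.array([1.0, 0.3])
print(predict(mu=3.6, b_x=-0.2, alpha_x=0.01, t_mean=40,
              b_i_bins=b_i_bins, q_i=q_i, p_x=p_x, t=55))
```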
Performance of Various Methods
Global average: 1.1296

User average: 1.0651


Movie average: 1.0533

Netflix: 0.9514
Basic Collaborative filtering: 0.94

CF+Biases+learned weights: 0.91


Latent factors: 0.90
Latent factors+Biases: 0.89

Latent factors+Biases+Time: 0.876


Grand Prize: 0.8563

63
64 The More (Models) the Merrier
What Now??
65

 Tried lots of things, but still have not reached the


10% threshold
 Getting desperate…
 Idea: “Kitchen Sink Approach”
 Build lots and lots of predictors
◼ Classifiers
◼ Collaborative filters
◼ Ensembles

 Come up with clever ways to blend the results
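One common way to blend (a rough sketch, not the competition's actual blending scheme): fit linear weights over the individual models' predictions on a held-out probe set. The model predictions and helper names below are made up:

```python
import numpy as np

def fit_blend(probe_preds, true_ratings):
    # Learn linear blending weights for several models' predictions on a held-out
    # probe set by ordinary least squares (the column of ones adds an intercept)
    X = np.column_stack([np.ones(len(true_ratings)), probe_preds])
    w, *_ = np.linalg.lstsq(X, true_ratings, rcond=None)
    return w

def blend(w, preds):
    return w[0] + preds @ w[1:]

# Illustrative probe-set predictions from three models and the true ratings
probe_preds = np.array([[3.9, 4.2, 4.0],
                        [2.1, 2.5, 2.3],
                        [4.8, 4.6, 4.9],
                        [3.0, 3.3, 2.8]])
w = fit_blend(probe_preds, np.array([4.0, 2.0, 5.0, 3.0]))
print(blend(w, np.array([3.5, 3.8, 3.6])))   # blended prediction for a new rating
```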


66
2009
67

 At the end of June, the leading team (BellKorPragmaticChaos) submits results that exceed the 10% threshold
 The competition enters its 30-day final period
 A new “Ensemble” team, formed as a collaboration of other teams near the top of the leaderboard, quickly beats the 10% threshold too
 The race is on…
Standing on June 26th 2009
68

June 26th submission triggers 30-day “last call”


The Final Countdown
69

 Direct competition between two teams


 Can only submit results once per day so each
team can submit once on final day
 A day before deadline, BellKorPragmaticChaos
notice that Ensemble is in the lead
 Each team prepares final results
 BellKorPragmaticChaos submits 40 mins early
 Ensemble submits 20 mins early

 And they wait…


70
September 2009: Prize Awarded
71

 The two teams’ scores are tied: BellKorPragmaticChaos wins by having submitted earlier
72 Conclusions and Perspectives
Perspectives
73

 Must account for unpredictable user behavior

 Think about what you really want to predict and


whether you need a new target

 Knowing your data is crucial

 When in doubt, combine lots of models

 Privacy and ethics are tricky but important


Summary
74

 Recommender systems are an important and active area of research and use
 Data is really messy
 Paradigms
 Content: Based on designing features and
applying classification/regression algos
 Collaborative: Based on comparing users/items
(aka nearest neighbor)
 Latent factor approaches

 Baselines important in practice


Questions
75

 A number of slides are based on the mmds.org


lectures on recommender systems
http://mmds.org/#book
