APPM 3310 Final Project
Netflix Recommender Algorithm
Straub, Ahonen, Benalcazar
1 Abstract
This report explains the different techniques used in the Netflix recommender algorithm to
model and predict ratings for movies and TV shows. It turns out that the Singular Value
Decomposition (SVD), and different variations of it, can be used to extract trends and similarities
between users and movies/shows. The SVD has become revolutionary in
collaborative filtering and recommender algorithms. Through Principal Component Analysis (PCA), the
SVD reduces the rank of a matrix and provides the most "important" or crucial latent feature trends of
the matrix, trends that would otherwise be hard to find.
This paper begins by explaining and forming the basic algorithm used to predict all
unknown ratings. We then explain how different factors can be added to this baseline model
to improve the accuracy of each predicted rating, and ultimately how numerical
methods can further reduce the error in the model. We will not go into extreme mathematical detail on
the parameters or the numerical techniques in this paper, but will provide some general
understanding. The focus is the importance and use of the SVD matrix factorization in the
recommender algorithm, for which we provide clear details. This paper is written for
individuals with a basic understanding of linear algebra and its concepts.
2 Introduction
Recommender algorithms are data-filtering systems used widely in today's technological
applications, ranging from Amazon and Netflix to news sites. The goal of the recommender is to
determine which items the user would enjoy, based on the limited amount of information available about the
user. In the long run, these recommenders reduce the amount of searching a user must do to
stumble upon what he or she may like. They transform research into discovery. For example,
Netflix offers almost 10,000 different movies and shows for a user to choose from (closer to
100,000 movies when including their mailing service), which can produce an overwhelming
experience when trying to select the perfect movie [3]. The recommender algorithm sidesteps
all the "fluff" and presents, as accurately as it can, only the items that the user will enjoy.
These algorithms not only make the search experience less hectic, but also provide a
customized and personal viewing experience. Ultimately, the main motivation for these
companies to constantly improve these algorithms is business. Netflix realizes that
it has only a short time frame to catch a viewer's attention before the viewer resorts to
another streaming service or activity; keeping viewers happy is how Netflix
maintains its subscribers. Likewise, Amazon presents items that it knows its users will
enjoy so that consumers spend more money through its website. All in all, the use of
recommender algorithms creates a win-win situation for both consumer and corporation.
To fully explain the process of the algorithm, and the SVD used within it, the math for an example
problem with a 2x3 matrix will be worked through thoroughly. In addition, Matlab code will be provided to
solve a larger, more realistic rating matrix. The data in the matrices will be arbitrary.
3 Preliminaries
In 2006, Netflix launched a competition to improve the accuracy of its system at the time,
CineMatch, by 10%. Netflix provided a data set of over 18 thousand movies, 480 thousand
users, and 100 million ratings from those users. The competitors' algorithms were then tested
against a test set, using the root mean square error (RMSE) as the error metric. One year into
this competition, a team known as BellKor won the first progress prize with their design, giving
an 8.43% improvement [5]. This method used what is known as the Singular Value Decomposition,
abbreviated SVD. To achieve the full 10% improvement, BellKor teamed up with other
competitors and blended all their algorithms, ultimately learning that the more algorithms and
predictors used, the more accurate the "guess" rating becomes. The prize-winning predictor model averaged
over 800 algorithms. These algorithms exploit all the user data they can, utilizing details
like the date of a rating, the time of day, the day of the week, how many ratings a user gave in that time
frame, etc. (people's preferences over movies change over time).
To understand the goal of the Netflix algorithm, it is helpful to visualize a chart: each row
represents a user, and each column represents a movie or TV show item. Refer to Figure 1.
[Figure 1: Rating Matrix. A users-by-movies grid in which each filled cell holds the star rating a user gave an item; blank cells are unrated.]
Every time a user rates content, the corresponding data cell for the user and the rated
media gets filled in. Because viewers typically do not rate media often, this chart contains many
blank cells. The goal of the algorithm is to fill in the whole chart with a calculated value for the rating
each user would give each item. From the data in this chart, the algorithm would then find the
items with the highest guessed ratings and present those items to the user. The process is
as follows. Refer to Figure 2.
[Figure 2: The predictor algorithm. Input data feeds the predictor algorithm, which outputs predicted ratings.]
The input data includes information such as the names of the users, the ratings of all items each user
has rated, the titles of the movies/shows, the dates of the ratings, etc. [7]. The predictor algorithm outputs
the rating that each user would give to every unwatched or unrated item.
4 Notation
r"$ = 𝜇 + 𝑏𝑢 + 𝑏𝑖 (1)
where $\hat{r}_{ui}$ is the predicted rating by user $u$ of movie/show item $i$, $\mu$ is the overall average item
rating, $b_u$ is the user bias, and $b_i$ is the item bias, which accounts for one movie being better or worse than
the average rating [7].
So, for example, to predict Joe's rating of Star Wars: Joe may be more critical than most
and rate 0.2 stars less than the average user, so $b_u = -0.2$. Now, Star Wars may be, on average,
rated 0.4 stars above the overall average $\mu = 3.7$, and hence $b_i = 0.4$. Joe's predicted rating for Star Wars
would then be 3.9 stars [7].
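Plugging these values into Equation (1) makes the arithmetic explicit:

$$\hat{r}_{ui} = \mu + b_u + b_i = 3.7 + (-0.2) + 0.4 = 3.9$$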
Because tastes drift over time, the biases can also be made time-dependent. One common form, following the BellKor solution, extends the baseline to $\hat{r}_{ui} = \mu + b_u(t_{ui}) + b_i(t_{ui})$,
where $t_{ui}$ is the number of days from the first rating date in the data set to the date user $u$
rated item $i$ [7].
To further optimize the model, user preference (feature) vectors $\mathbf{p}_u$ and item feature vectors $\mathbf{q}_i$ are
added. In the two-dimensional vector space of Figure 3, the amount of
"seriousness" or "intensity" of an item lies along the vertical axis and the amount of "chick-flick-ness"
lies along the horizontal axis. Each item can be represented by a feature
vector $\mathbf{q}_i$ and each user by a user feature vector $\mathbf{p}_u$. Namely, consider the
two vectors $\mathbf{p}_u$ and $\mathbf{q}_i$ of the same length, where each component of $\mathbf{q}_i$ is associated
with a specific genre, director, release year, etc. The value of each entry represents
how strongly the item identifies with that description. For the user preference vector $\mathbf{p}_u$, the
components store the user's preference for each respective component of $\mathbf{q}_i$, i.e. how much the user
likes the respective genre, director, release year, etc.
For example, after analyzing Joe's viewing and search history, it turns out that Joe likes
comedies, dramas, and horror films the most. If Netflix wanted to know whether Joe would like the
movie Titanic (1997), the similarity between

$$\mathbf{q}_{Titanic} = \begin{matrix} drama \\ romance \\ \vdots \\ horror \end{matrix}\begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ 0 \end{bmatrix} \qquad \text{and} \qquad \mathbf{p}_{Joe} = \begin{matrix} drama \\ romance \\ \vdots \\ horror \end{matrix}\begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_n \end{bmatrix}$$

where the entries $q_k$, $p_k$ are real, nonzero values (the zero in the horror entry indicates that Titanic contains no horror), can be analyzed using these feature vectors.
To compare user $u$ with item $i$, an inner product, in this case the Euclidean dot
product, of these two vectors can be taken. The dot product measures how much two vectors in
Figure 3 point in the same direction, or in other words, gives the similarity between the user's
preferences and the features of the movie [8]. Negative outcomes of the dot product are
considered a bad match. For example, in Figure 3, the user (red vector) matches better
with Die Hard and Dumb and Dumber than with The Notebook and Mean Girls,
based on the angles between the vectors. This dot product is essentially the rating the user would give
to that item. So, the model becomes

$$\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{q}_i^T \mathbf{p}_u \tag{4}$$
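To make Equation (4) concrete, the short Matlab sketch below scores one hypothetical user-item pair; the three-genre feature space, the vector entries, and the bias values are invented for illustration and are not taken from the Netflix data.

%Hypothetical 3-component feature space: [drama; romance; horror]
q_titanic = [0.9; 1.0; 0];    %Titanic: strong drama/romance, no horror (assumed values)
p_joe     = [0.8; -0.5; 0.7]; %Joe: likes drama and horror, dislikes romance (assumed)
mu  = 3.7;                    %overall average rating [stars]
b_u = -0.2;                   %Joe's user bias
b_i = 0.4;                    %Titanic's item bias (assumed)
%Equation (4): baseline plus the user-item interaction term q'*p
r_hat = mu + b_u + b_i + dot(q_titanic, p_joe);
fprintf('Predicted rating: %.2f stars\n', r_hat)  %prints 4.12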
In the BellKor model, the item biases are organized into bins; each bin contains all the
ratings from a specific time interval $t$ [7]. The user biases and preferences are both based on the
deviation of each rating's date from the user's mean rating date. The competing teams
experimented with different functional forms for each time-dependent factor, ultimately arriving
at a satisfactory model. A rough sketch of the binning idea is shown below.
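In this sketch, the bin size, bin count, and bias values are assumptions for illustration, not BellKor's tuned values:

%Map a rating date to a time bin and look up the binned item bias correction.
binSize = 70;                        %days per bin (10 weeks, assumed)
nBins   = 100;                       %number of bins (assumed)
binBias = zeros(nBins, 1);           %per-bin item bias corrections (learned in practice)
b_i     = 0.4;                       %static item bias from the earlier example
t_ui    = 493;                       %days since the first rating in the data set
bin     = floor(t_ui/binSize) + 1;   %1-based index of the bin containing t_ui
b_i_t   = b_i + binBias(bin);        %time-dependent item bias b_i(t)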
In general, the more terms and dimensions given to the variables, the more accurate the
predicted rating $\hat{r}_{ui}$ will be, at the expense of memory and computational limits.
The Netflix Prize-winning algorithm contained millions of parameters. The model in Equation (4) is
much simpler than the one used in the prize-winning algorithm; however, it captures the basic
idea of how the predictor algorithm is formed. The $\mathbf{q}_i$ and $\mathbf{p}_u$ vectors hold the valuable
latent information that can be extracted using the SVD factorization [8], and they allowed the BellKor
team to succeed in the Netflix challenge.
This model is then sent through various numerical techniques to ensure the greatest
amount of accuracy; the most important technique used is a Boltzmann machine. The model's parameters are found by solving a regularized least squares problem over the set of known ratings,

$$\min_{b,\,\mathbf{q},\,\mathbf{p}} \sum_{(u,i)} \left( r_{ui} - \mu - b_u - b_i - \mathbf{q}_i^T \mathbf{p}_u \right)^2 + \lambda \left( b_u^2 + b_i^2 + \lVert \mathbf{q}_i \rVert^2 + \lVert \mathbf{p}_u \rVert^2 \right) \tag{6}$$

where Equation (6) is a Tikhonov regularization technique, and $\lambda$ is a regularizing term that avoids
overfitting the data; its value is found by cross-validation [6,7].
The model is sent through a Boltzmann machine, a type of neural network typically used
to "teach" models from a training data set, which uses stochastic gradient descent, an iterative
optimization method for finding minima [1]. In this case, it finds the minimizer of
the least squares problem in Equation (6). Because Equation (6) is differentiable, stochastic gradient
descent explores the surface of the objective over the parameter space to find a minimum, by stepping
in the direction opposite the gradient. Minimizing the error in each rating will eventually "train"
the model and tune its parameters.
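A minimal Matlab sketch of one stochastic gradient descent sweep over the known ratings is shown below; the per-rating update rules follow from differentiating Equation (6), as described in [8], while the example matrix, feature dimension, learning rate, and regularizer are illustrative assumptions rather than the competition's tuned values.

%Setup (illustrative): small known-rating matrix, zeros mark unknown ratings
R = [5 0 1; 0 4 2; 3 0 0];
[m, n] = size(R);  f = 2;                    %f = number of latent features (assumed)
P = 0.1*randn(f, m);  Q = 0.1*randn(f, n);   %user and item feature vectors
b_u = zeros(m, 1);  b_i = zeros(n, 1);       %user and item biases
mu = mean(R(R > 0));                         %average of the known ratings
gamma = 0.005;  lambda = 0.02;               %learning rate and regularizer (assumed)
[users, items] = find(R > 0);                %indices of the known ratings
for epoch = 1:100                            %repeat the sweep until the error settles
    for k = 1:numel(users)
        u = users(k);  i = items(k);
        e = R(u,i) - (mu + b_u(u) + b_i(i) + Q(:,i)'*P(:,u));  %prediction error
        %step each parameter opposite its gradient, shrunk by the regularizer
        b_u(u) = b_u(u) + gamma*(e - lambda*b_u(u));
        b_i(i) = b_i(i) + gamma*(e - lambda*b_i(i));
        P(:,u) = P(:,u) + gamma*(e*Q(:,i) - lambda*P(:,u));
        Q(:,i) = Q(:,i) + gamma*(e*P(:,u) - lambda*Q(:,i));
    end
end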
Once the algorithm is trained and optimized, the singular value decomposition is used to
extract the latent information from the rating matrix.
The Singular Value Decomposition is a generalization of the spectral factorization, and can be
used on any real, rectangular matrix. Consider the user-item matrix R. Its factorization is as
follows:

$$R = P \Sigma Q^T \tag{7}$$

where $R$ is the factorable $m \times n$ matrix with rank $r$, $P$ is an $m \times r$ matrix with orthonormal
columns ($P^T P = I$) that are the left singular vectors of $R$, $\Sigma$ is an $r \times r$ diagonal matrix with the singular values of
$R$ as its diagonal entries, and $Q^T$ is an $r \times n$ matrix whose orthonormal rows contain the right singular
vectors of $R$ [2].
The main difference between the SVD and the spectral factorization is that the SVD can be
used on any real matrix, not just square matrices. If $R$ is not square, the $m \times n$ matrix can be
converted into a square, symmetric, positive semi-definite matrix $K$. Refer to Equation (8),

$$K = R^T R \tag{8}$$

where the transpose of $R$ is multiplied by $R$ [2]. The singular values of $R$, $\sigma_i = \sqrt{\lambda_i}$, are the
positive square roots of the eigenvalues of $K$, and the corresponding singular vectors of $R$ are
the eigenvectors of $K$. The singular values are placed in $\Sigma$.
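This relationship is easy to verify numerically with Matlab's built-in eig and svd functions; the matrix R below is arbitrary:

R = [2 0 1; 0 3 0];                %arbitrary 2x3 matrix
K = R'*R;                          %square, symmetric, positive semi-definite
lambda = sort(eig(K), 'descend');  %eigenvalues of K, largest first
sigma  = svd(R);                   %singular values of R, largest first
disp([sqrt(lambda(1:2)), sigma])   %the two columns agree: sigma_i = sqrt(lambda_i)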
The columns of $Q$ are the unit eigenvectors $\mathbf{q}_i$ of $K$, each corresponding to a singular
value $\sigma_i$. The columns of $P$ are given by Equation (9):

$$\mathbf{p}_i = \frac{1}{\sigma_i} R \, \mathbf{q}_i \tag{9}$$
After finding the column vectors of the matrices $P$ and $Q$, and finding the singular values
of $R$, the factorization can be assembled in the matrix multiplication form shown in Equation (7).
Geometrically, the singular value decomposition is essentially a series of stretches, reflections, and
rotations. Namely, $Q^T$ and $P$ rotate the matrix $R$, while $\Sigma$ stretches the matrix in the directions of
the singular vectors, by amounts equal to the singular values of $R$.
The SVD is most commonly used for least squares approximation; for determining the rank,
range, and null space of a matrix; and for finding the pseudoinverse of a matrix, which alone has
many applications [2].
In the recommender setting, the SVD links each user to each item through a shared set of latent characteristics used to make a
recommendation. This relationship is referred to as the 'feature' relationship between the user
and item. Comparing the 'features', or characteristics, of users is much more efficient than
comparing users item by item. This approach is known as the previously discussed Latent
Factor model [8].
Because two linearly dependent vectors only count as one toward the rank, the rank of a matrix
can often be decreased, and one of the important features of the SVD is that it makes this rank
reduction possible. This feature of the SVD is known as low-rank approximation, and it can take
any matrix A and truncate it to find the closest matrix B with a selected rank r. Because the rank
of the matrix equals the number of nonzero singular values, the smallest singular values, which correspond to the least
important eigen-directions, are set to zero. So, low-rank approximation can be
used to eliminate the less important eigen-directions [2] and extract just the important ones.
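A short Matlab sketch of the truncation, with an arbitrary matrix A and an assumed target rank r:

A = magic(6);                %arbitrary 6x6 matrix to truncate
r = 2;                       %selected target rank (assumed)
[P, S, Q] = svd(A);
S(r+1:end, r+1:end) = 0;     %zero out the smallest singular values
B = P*S*Q';                  %closest rank-r matrix to A
fprintf('rank(A) = %d, rank(B) = %d\n', rank(A), rank(B))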
8 Example Problem
As an example of singular value decomposition, we begin with a rating matrix, R,
$$R = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 3 & 0 \end{bmatrix} \tag{10}$$
which involves two users’ ratings of three movies.
Since our matrix is not square or symmetric, we need to multiply the transpose of the matrix
by the matrix itself to produce a square, symmetric matrix. Refer to Equation (11).
$$K = R^T R = \begin{bmatrix} 2 & 0 \\ 0 & 3 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} 2 & 0 & 0 \\ 0 & 3 & 0 \end{bmatrix} = \begin{bmatrix} 4 & 0 & 0 \\ 0 & 9 & 0 \\ 0 & 0 & 0 \end{bmatrix} \tag{11}$$
$$\det(K - \lambda I) = \begin{vmatrix} 4-\lambda & 0 & 0 \\ 0 & 9-\lambda & 0 \\ 0 & 0 & -\lambda \end{vmatrix} = 0 \tag{12}$$
Since $(K - \lambda I)$ is a diagonal matrix, the determinant is simply the product of the elements of the
diagonal. Thus, we get our characteristic equation:

$$\det(K - \lambda I) = -\lambda (4 - \lambda)(9 - \lambda) = 0 \tag{13}$$
After solving for the roots of the equation, $\lambda_1 = 9$, $\lambda_2 = 4$, and $\lambda_3 = 0$. Because $\lambda_3$ equals
zero, its associated eigenvector can be ignored. The eigenvalues are listed in descending
order, i.e. $\lambda_1 \geq \lambda_2$. The square roots of the nonzero eigenvalues, $\sigma_i = \sqrt{\lambda_i}$, are the singular values
of $R$, and are the entries of the diagonal matrix $\Sigma$. So, the sigma matrix is as follows:

$$\Sigma = \begin{bmatrix} 3 & 0 \\ 0 & 2 \end{bmatrix} \tag{14}$$
The eigenvalues will now be used to solve for the corresponding eigenvectors by plugging each
eigenvalue into the homogeneous equation $(K - \lambda I)\mathbf{q} = \mathbf{0}$:

$$\lambda_1 = 9: \quad K - \lambda_1 I = \begin{bmatrix} -5 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & -9 \end{bmatrix} \;\Rightarrow\; \mathbf{q}_1 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \tag{15}$$

$$\lambda_2 = 4: \quad K - \lambda_2 I = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & -4 \end{bmatrix} \;\Rightarrow\; \mathbf{q}_2 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \tag{16}$$
Normalizing these vectors will give us unit vectors in the direction of each eigenvector. In this
specific example, $\mathbf{q}_1$ and $\mathbf{q}_2$ are already unit vectors, and they make up the columns of our matrix
$Q$.
$$\mathbf{p}_1 = \frac{R\,\mathbf{q}_1}{\sigma_1} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \tag{17}$$

$$\mathbf{p}_2 = \frac{R\,\mathbf{q}_2}{\sigma_2} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \tag{18}$$

where the $\mathbf{p}_i$ are the column vectors of the matrix $P$, and the $\mathbf{q}_i$ are the unit vectors that
make up the columns of the matrix $Q$.
$$R = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 3 & 0 \end{bmatrix} = P \Sigma Q^T = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 3 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix} \tag{19}$$
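The hand computation can be checked with Matlab's built-in svd (singular vectors are unique only up to sign, so columns may differ by a factor of -1):

R = [2 0 0; 0 3 0];
[P, S, Q] = svd(R, 'econ');  %economy-size SVD matches the shapes in Equation (19)
disp(P); disp(S); disp(Q')   %compare with Equation (19), up to column signs
disp(norm(R - P*S*Q'))       %reconstruction error is zero to machine precision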
9 Matlab Code
To solve much larger and more realistic rating matrices, Matlab
(MATLAB_R2016b) can be used. This Matlab code takes a square rating matrix and outputs
predicted ratings for each user and item. The pseudocode for a user-inputted matrix R is as
follows:
9.1 Pseudocode
1. Ask the user to input the rating matrix R
2. Use Matlab's built-in svd function to extract the factorization matrices
3. Take the dot product of each pair of columns of the factorization matrices P
and Q
4. Insert the predicted values into a matrix, with all other elements set to zero
(for readability), to display the predicted ratings
The program takes the inputted matrix R, finds the SVD factorization of R, calculates the
dot product for each column pair $\mathbf{p}_u$ and $\mathbf{q}_i$, and produces a predicted rating matrix R_p. The code
can be found in the Appendix.
9.2 Example
Inputting the following arbitrary 6x6 known-rating matrix R, which contains the ratings of
six movies by six users (all zero-valued entries are unknown ratings),

$$R = \begin{bmatrix} 1 & 5 & 0 & 3 & 0 & 2 \\ 1 & 0 & 3 & 0 & 3 & 2 \\ 5 & 2 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 4 & 5 \\ 1 & 1 & 0 & 1 & 1 & 0 \\ 3 & 0 & 0 & 0 & 4 & 5 \end{bmatrix} \tag{20}$$

produces a predicted rating matrix R_p in which all zero-valued elements correspond to the known ratings.
These predicted ratings do not include any user or item bias, or temporal factors. The predicted
ratings are based on the relationship between each user's preferences and the genre of each movie. For
instance, users one and two (1st and 2nd rows) have given the same ratings to movies one and six
(1st and 6th columns), so we would expect their ratings for the other items to be similar.
If we compare their ratings for item three (3rd column), user one's predicted rating of 3.4641 is
comparable to user two's known rating of 3. These ratings are similar because both users'
preferences are similar, which is what we expected.
10 Conclusion
The Netflix algorithm starts from a baseline model that is developed and improved by adding factors
and parameters to the equation. This equation is then trained on a training data set and further
optimized using numerical techniques. The biggest contribution to the accuracy of the predicted
guess is the use of the Singular Value Decomposition to determine the underlying features of the
users and items. Comparing the feature vectors of each user and item offers another type
of similarity metric, in addition to the Neighborhood model, where users are grouped together
based on their similarity to other users. The main takeaway is how the SVD can extract these
trend directions that otherwise would not be observable. Without it, recommender algorithms
and collaborative filtering would never have reached the level they are at today.
To improve this design, modifications can be made to the algorithm and to the SVD to
find more accurate feature vectors. The more accurate the feature vectors, the more accurate
the dot product will be, and therefore the more accurate the predicted guess. Blending
more than one model will surely improve the algorithm, since averaging more than one prediction
gives a much more accurate prediction. It is apparent that consumer behavior and psychology
play a role in predictions, as we explained with the temporal factors. The power of the SVD in
the Netflix application is that it relies on only limited information about each user, yet it can still
utilize that information in tremendous ways.
Another powerful feature of the SVD is using Principal Component Analysis to take a
matrix with a large rank and find a matrix that holds nearly the same information but with a lower rank.
This minimizes computational effort and can turn an impossible-to-solve matrix into a more
reasonable problem; this is the low-rank approximation discussed earlier. An important application of PCA
is photography: taking a high-pixel photo that uses lots of memory and finding a lower-memory
version that preserves most of the resolution of the original. The possible applications of the SVD are
endless. Whenever latent directional trends must be found in big data, these techniques can
be utilized. For example, the variance and covariance of data sets can be calculated using the
SVD, which can provide a wealth of information, depending on the application.
Because the SVD relates two feature vectors, it could be used by online dating websites to
show the similarity between two people. The dot product between two "people" vectors produces
a value that serves as a metric for that similarity; if the dot product is zero, the
vectors are orthogonal, and so the two people have nothing in common. If the SVD
can be used for compressing images, then it surely could be used to compress data and
remove redundancies in a large data set, or perhaps to find linear trends and predict outcomes
of oscillations under various damping conditions, for example. These are just examples of
how these mathematical techniques can be used to solve problems that can be modeled by math. The
possibilities are endless.
11 Appendix
11.1 Matlab Code

%NOTE: this program does not include biases or any other factors besides q'p.
%Without a large data set and knowledge of data science techniques, the
%biases would all have to be inputted by the user and hard-coded, which seems
%unimportant (each bias is just another term added to the predicted rating).
%This script shows the gist of the predictor algorithm.

%Initialize variables
mu = 3.7; %const overall average rating from all users [stars]
R = input('Enter the known-rating matrix R: '); %ask the user for the rating matrix
R_p = predictRating(R, mu); %compute the predicted ratings
disp(R_p) %display the predicted-rating matrix

function R_p = predictRating(R, mu)
%Check that the entries of R are in the range 0<=r<=5. Replace all zero
%entries of R_p (unknown ratings) with the predicted rating. Set all
%nonzero entries of R_p to zero, so the only nonzero elements in R_p are
%the predicted ratings.
[P, ~, Q] = svd(R); %extract the factorization matrices (R must be square)
R_p = zeros(size(R)); %preallocate the predicted-rating matrix
for i = 1:size(R,1)
    for j = 1:size(R,2)
        if R(i,j) > 5 || R(i,j) < 0 %check that the entries satisfy 0<=r<=5
            error('Error, values must satisfy 0<=r<=5; rerun the program')
        elseif R(i,j) == 0
            %predicted rating: dot(q,p) + mu + biases (biases not included)
            R_p(i,j) = mu + dot(Q(:,j), P(:,i));
        else
            %readability: set all known values to zero so the predicted
            %values are not confused with known ratings
            R_p(i,j) = 0;
        end
    end
end
end
11.2 References
[1] A. Töscher, M. Jahrer, The BigChaos Solution to the Netflix Prize 2008, Commendo Research & Consulting, 2008.
[2] M. Vozalis, K. Margaritis, Applying SVD on Generalized Item-based Filtering, International Journal of Computer Science & Application, Vol. 3, 27-51 (2006).
[3] R. Bell, A. Töscher, M. Jahrer, The BigChaos Solution to the Netflix Grand Prize, Commendo Research & Consulting, 2009.
[4] R. Salakhutdinov, A. Mnih, G. Hinton, Restricted Boltzmann Machines for Collaborative Filtering, University of Toronto, 2007.
[5] R. Bell, Y. Koren, Lessons from the Netflix Prize Challenge, SIGKDD Explorations, Vol. 9, 75-79 (2007).
[6] S. Funk, Netflix Update: Try This at Home, http://sifter.org/~simon/journal/20061211.html (December 11, 2006).
[7] Y. Koren, The BellKor Solution to the Netflix Grand Prize, Yahoo! Research Israel, 2009.
[8] Y. Koren, R. Bell, C. Volinsky, Matrix Factorization Techniques for Recommender Systems, IEEE Computer Society, Vol. 42, 42-49 (2009).