0% found this document useful (0 votes)
31 views

Using Singular Value Decomposition Approximation For Collaborati

1) The document discusses using singular value decomposition (SVD) approximation to reduce the computational cost of collaborative filtering algorithms that use the expectation-maximization (EM) procedure. 2) It proposes a novel algorithm that incorporates SVD approximation into the EM procedure to find a low-dimension linear model that approximates user ratings data with lower computational cost than full SVD. 3) It also extends this approach to a distributed recommendation system where users maintain private rating profiles, and a server periodically collects aggregate data from online users to provide predictions.
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
31 views

Using Singular Value Decomposition Approximation For Collaborati

1) The document discusses using singular value decomposition (SVD) approximation to reduce the computational cost of collaborative filtering algorithms that use the expectation-maximization (EM) procedure. 2) It proposes a novel algorithm that incorporates SVD approximation into the EM procedure to find a low-dimension linear model that approximates user ratings data with lower computational cost than full SVD. 3) It also extends this approach to a distributed recommendation system where users maintain private rating profiles, and a server periodically collects aggregate data from online users to provide predictions.
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

Using Singular Value Decomposition Approximation for Collaborative Filtering

Sheng Zhang, Weihong Wang, James Ford, Fillia Makedon


Department of Computer Science, Dartmouth College, Hanover, NH 03755
{clap, whwang, jford, makedon}@cs.dartmouth.edu
Justin Pearlman
Departments of Medicine & Radiology, Dartmouth Medical School, Lebanon, NH 03766
[email protected]

Abstract state-of-the-art collaborative filtering algorithms: [1, 4, 6]


directly assume that the user preferences database is gen-
Singular Value Decomposition (SVD), together with the erated from a linear model, matrix factorization based col-
Expectation-Maximization (EM) procedure, can be used laborative filtering methods [3, 9, 15, 16] obtain an explicit
to find a low-dimension model that maximizes the log- linear model to approximate the original user preferences
likelihood of observed ratings in recommendation systems. matrix, and [2, 11] use the Pearson correlation coefficient,
However, the computational cost of this approach is a major which is equivalent to a linear fit.
concern, since each iteration of the EM algorithm requires If we assume that users’ ratings are generated from
a new SVD computation. We present a novel algorithm that a low-dimension linear model together with Gaussian-
incorporates SVD approximation into the EM procedure to distributed noise, the Singular Value Decomposition (SVD)
reduce the overall computational cost while maintaining ac- technique can be used to find the linear model that maxi-
curate predictions. Furthermore, we propose a new frame- mizes the log-likelihood of the rating matrix, assuming it is
work for collaborating filtering in distributed recommenda- complete. If the rating matrix is incomplete, as is the case
tion systems that allows users to maintain their own rating in real-world systems, SVD cannot be applied directly. The
profiles for privacy. A server periodically collects aggre- Expectation-Maximization (EM) procedure [4, 16] can be
gate information from those users that are online to provide used to find the model that maximizes the log-likelihood
predictions for all users. Both theoretical analysis and ex- of the available ratings, but this requires a SVD computa-
perimental results show that this framework is effective and tion of the whole matrix for each EM iteration. As the size
achieves almost the same prediction performance as that of of the rating matrix is usually huge (due to large numbers
centralized systems. of users and items in typical recommendation systems), the
Keywords: Collaborative Filtering, SVD Approximation, computational cost of SVD becomes an important concern.
EM Procedure. Deterministic SVD methods for computing all singular vec-
tors on an m-by-n matrix take O(mn2 +m2 n) time, and the
Lanczos method requires roughly O(kmn log(mn)) time to
1. Introduction approximate the top k singular vectors [10].
In this work, we present a novel algorithm based on us-
Collaborative Filtering analyzes a user preferences ing an SVD approximation technique to reduce the com-
database to predict additional products or services in which putational cost of SVD based collaborative filtering. The
a user might be interested. The goal is to predict the prefer- basic idea is that in each iteration of EM procedure, a set of
ences of a user based on the preferences of others with sim- user rating profiles (rows) are sampled in a way such that
ilar tastes. There are two general classes of collaborative the top k right singular vectors of the resulting sub-matrix
filtering algorithms. Memory based algorithms [2, 11, 14] approximate the top k right singular vectors of the original
operate over the entire database to make predictions. Model matrix. When this is done, SVD can be conducted on this
based algorithms [1, 4, 6, 12, 13] use the database to esti- sub-matrix instead of the whole matrix while still accurately
mate or learn a model for predictions. updating all elements in the matrix.
Low-dimension linear models are a popular means to de- We further extend this idea to a distributed collabora-
scribe user preferences. The following are representative tive filtering scenario, where users maintain their own rating

Proceedings of the Seventh IEEE International Conference on E-Commerce Technology (CEC’05)


1530-1354/05 $20.00 © 2005 IEEE
profiles for privacy and a server periodically collects aggre- for any matrix M with rank less than or equal to k, where
gate information from online users in order to provide rating  · F denotes the Frobenius Norm. So given a completely
predictions on demand. We show that with a high probabil- known rating matrix A, X = Uk Sk VkT can be considered as
ity, the rating profiles of a random small set of users are the most probable or the most accurate k-dimensional linear
capable of making almost the same predictions as the rating model that describes A.
profiles of all users. In real recommendation systems, the rating matrix A is
The organization of this paper is as follows. Section 2 incomplete, which means that some ratings are available
describes the linear model used for rating, explains why while others are unknown (unrated). To deal with an in-
SVD is useful, and explains how the EM procedure can complete rating matrix, one can impute missing elements
be used in practical systems. Section 3 presents our algo- using some simple methods such as using user averages or
rithm, which incorporates the SVD approximation into the item averages, and then find the linear model that fits the
EM procedure, and gives a theoretical analysis of it. Section filled-in rating matrix best. However, imputed values ob-
4 proposes our framework for collaborative filtering in dis- tained by these naive methods are not likely to be true, and
tributed recommendation systems and presents algorithms thus the linear model found may deviate significantly from
for computing aggregates and making predictions. We also the true model. A better way to deal with missing ratings,
discuss algorithm stability and survey the security schemes described in [4, 16], is to find the model that maximizes the
that can be used for preserving privacy. Section 5 presents log-likelihood of all observed ratings (denoted as Ao ) using
experimental results obtained from real data sets. Finally, an EM procedure.
Section 6 concludes the paper. The Expectation step of the tth iteration in the EM pro-
cedure updates the expected expression of the complete-
2. Background data log-likelihood with respect to the unknown data (de-
noted as Au ) given the observed data Ao and the cur-
The fundamental problem addressed by linear models is rent parameter estimates X (t−1) , which can be written as
how to efficiently represent a large matrix of user ratings. E[log Pr(Ao , Au |X)|Ao , X (t−1) ]. If Aij is observed,
Denote the rating matrix as A (m users-by-n items); then
A(i) (the ith row of A) is user i’s rating profile and Aij rep- 1
E[log Pr(Aij |X)|Ao , X (t−1) ] = − (Aij − Xij )2 + C.
resents the rating given by user i on item j. Define matrix 2σ 2
X (m-by-n and with rank at most k) as a low-dimension (t−1) (t−1)
linear model that approximates the rating matrix A. Hence, If Aij is unknown, we have E[Aij |Xij ] = Xij . It
each element Aij is equal to Xij plus an error that is from is because Aij ∼ N(Xij , σ), and the expected expression
(t−1)
a Gaussian distribution with zero mean and standard devi- of Aij given the current parameter estimate Xij is thus
ation σij . For simplicity, we assume that all σij are equal equal to that estimate. Therefore,
to σ. Therefore, Aij ∼ N(Xij , σ), and the log-likelihood of
Aij given Xij follows as E[log Pr(Aij |X)|Ao , X (t−1) ]
(t−1)
1 = E[log Pr(Aij |Xij )|Xij ]
log Pr(Aij |Xij ) = − (Aij − Xij )2 + C.
2σ 2 1 (t−1)
= − 2 E[(Aij − Xij )2 |Xij ] + C
In this and following equations, C is a constant. If we as- 2σ
sume that Pr(Aij |Xij ) are independent, the log-likelihood 1 (t−1) (t−1)
= − 2 (E[A2ij |Xij ] − 2E[Aij Xij |Xij ] + Xij 2
)+C
of the whole rating matrix A given the linear model X is 2σ
 1 (t−1) (t−1)
log Pr(A|X) = log Pr(Aij |Xij ) = − 2 ((Xij )2 + σ 2 − 2Xij Xij + Xij 2
)+C

ij 1 (t−1)
1  = − 2 (Xij − Xij )2 + C.
= − 2 (Aij − Xij )2 + C. 2σ
2σ ij
By combining these two cases,
Thus, if A is completely known, finding the X that maxi-
mizes log E[log Pr(Ao , Au |X)|Ao , X (t−1) ]
 Pr(A|X) is equivalent to finding the X that min-
imizes ij (Aij − Xij )2 , which is a least square problem = E[log(Pr(Ao |X) Pr(Au |X))|Ao , X t−1 ]
and can be solved by SVD. If A is factorized in the form 1 
A = U SV T via SVD, then the product of the top k left = − 2( (Aij − Xij )2

singular vectors Uk , the diagonal matrix of the top k sin- Aij ∈Ao
 (t−1)
gular values Sk , and the transpose of the top k right singu- + (Xij − Xij )2 ) + C. (1)
lar vectors VkT satisfies A − Uk Sk VkT F ≤ A − M F Aij ∈Au

Proceedings of the Seventh IEEE International Conference on E-Commerce Technology (CEC’05)


1530-1354/05 $20.00 © 2005 IEEE
The Maximization step of the tth iteration computes the Ak (Ak = Uk Sk VkT ) as the best rank k approximation to
updated model X (t) that maximizes the expected expres- A. It follows that for any c ≤ n and δ > 0,
sion obtained in the Expectation step. By equation (1), we 
know that X (t) should be the matrix that minimizes  k
  A−AHH T 2F ≤ A−Ak 2F +2(1+ 8 ln(2/δ)) A2F .
(t−1) βc
(Aij − Xij )2 + (Xij − Xij )2 .
(2)
Aij ∈Ao Aij ∈Au
In addition, if pi = A(i) 2 /A2F , which implies that β is
equal to one,
Denote as A(t) a filled-in matrix whose unknown entries
have been filled in from X (t−1) , i.e., 
 k
 A−AHH T 2F ≤ A−Ak 2F +2(1+ 8 ln(2/δ)) A2F .
(t) Aij if Aij is observed c
Aij = (t−1) (3)
Xij if Aij is unknown The proof of both inequalities can be found in [5].
As discussed in Section 2, the objective of the maximiza-
. Thus, the objective becomes finding the model X (t) that tion step in the tth iteration of the EM procedure is to com-
 (t)
minimizes ij (Aij − Xij )2 . As shown above, this prob- pute the best rank k approximation to the filled-in matrix
lem can be solved by performing SVD on A(t) to calculate A(t) . Using SVD approximation, the server can sample c
the best rank k approximation to it. users’ rating profiles (rows in A) with probability propor-
The EM procedure is ensured to converge [8], which tional to their length squared and form the matrix C after
means that the log-likelihood of all observed ratings given scaling. Note here that the c samples are not necessar-
the current model estimate is always nondecreasing. Be- ily from c different users, and that a user’s rating profile
cause Aij is from a Gaussian distribution with Xij as its might be sampled more than once according to the sampling
mean, given Xij Pr(Aij |Xij ) obtains its maximum when method. After computing the top k right singular vectors of
Aij is equal to Xij . Thus, Xij is the best prediction for C and obtaining the matrix H, the server can use A(t) HH T
(t)
user i’s rating on item j. to approximate Ak . Then in the expectation step of the
(t + 1)th iteration, the unknown entry Aij is calculated as
(A(t) HH T )ij .
3. SVD Approximation in Centralized Recom-
By inequality (3), the server has a high confidence that
mendation Systems given the same filled-in matrix A(t) , the rank k model
A(t) HH T obtained via SVD approximation is close to the
(t)
In this section, we discuss how to use SVD approxima- best rank k approximation Ak . Therefore, with a high
tion to reduce the computational cost of the SVD based col- probability, the EM procedure is likely to calculate more
laborative filtering in traditional centralized recommenda- and more likely true values for missing entries. This implies
tion systems where the server keeps all users’ rating pro- that although the EM procedure with SVD approximation is
files. If the server uses the EM procedure shown in Section not guaranteed to converge on an optimal solution, the log-
2, the computational cost will be l · O(mn2 + m2 n), in likelihood of observed ratings will generally increase.
which l is the number of iterations of the EM procedure and
O(mn2 +m2 n) is the computational cost of performing de- Algorithm 1 EM Procedure via SVD approximation
terministic SVD on an m(users)-by-n(items) matrix. This (0)
cost is expensive because m and n can be very large in a 1: Set initial values Xij for unknown entries Aij .
recommendation system, from thousands to millions. 2: while in the tth iteration of EM procedure do
The SVD Approximation technique of Drineas et al. [5] 3: Fill in A by replacing unknown entries Aij with
(t−1)
shows that for any matrix A (m-by-n), if c rows are sam- Xij , denote the filled-in matrix as A(t) .
(t)
pled and scaled appropriately, the top right singular vectors 4: Set pi = A(i) 2 /A(t) 2F , and pick c rows from A.
of the new matrix C (c-by-n) approximate the top right sin- (t) √
If the ith row is picked, include A(i) / cpi in C.
gular vectors of A. More formally, assume that in picking
a certain row for C, the probability that the ith row in A is 5: Compute the top k right singular vectors of C and
picked (denoted as pi ) is such that pi ≥ βA(i) 2 /A2F , form these k vectors into the matrix H.
where β is a constant and  · , denoting a vector’s length, 6: X (t) = A(t) HH T .
7: end while
is equal to the squared root of the sum of the squares of all

its elements, and suppose that if A(i) is picked, A(i) / cpi
will be included as a row in C. Denote H (n-by-k) as the Algorithm 1 shows the details of applying SVD approx-
matrix formed by the top k right singular vectors of C and imation in centralized recommendation systems. Note that

Proceedings of the Seventh IEEE International Conference on E-Commerce Technology (CEC’05)


1530-1354/05 $20.00 © 2005 IEEE
to compute the top k right singular vectors of C (line 5) di- compute Gt compute Gt+1
rectly by performing SVD on C takes O(c2 n + cn2 ) time.
A more efficient approach is to compute CC T first (O(c2 n) t t+1
time), and then perform SVD on CC T (O(c3 ) time). From
basic linear algebra, we know that the left singular vectors predict based on Gt
of C T C are the same as the left singular vectors of C, and
that the singular values of C T C are the squares of the singu- Figure 1. The framework for collaborative fil-
lar values of C. Thus, the top k right singular vectors of C tering in distributed recommendation sys-
can be computed via its top k left singular vectors and top tems.
k singular values (O(cnk) time). As generally k is much
smaller than c and c is much smaller than n, the total com-
putational cost of performing SVD on C in the above way points at which more users are online seem preferable, since
is O(c2 n). having more users generally lead to more accurate learn-
ing results. On the other hand, according to the following
4. SVD Approximation in Distributed Recom- theoretical analysis and experimental results, a small subset
mendation Systems (about 5%) of all users is often enough for good predictions.

Traditional centralized recommendation systems have 4.1 Algorithms and Theoretical Analysis
problems such as users losing their privacy, retail monopo-
lies being favored, and diffusion of innovations being ham- We first present algorithms for computing aggregates and
pered [3]. Distributed collaborative filtering systems, where generating predictions and then a theoretical analysis of
users keep their rating profiles to themselves, have the po- their performance. Assume that there are c online users at
tential to correct these problems. However, in the dis- time point t, and that their rating profiles are denoted A(1)
tributed scenario there are two new problems that need to to A(c) . Algorithm 2 shows how to generate the aggregate.
be dealt with. The first problem is how to ensure that users’
data are not revealed to the server and other users. The sec- Algorithm 2 Computing the aggregate Gt
ond problem is how to ensure that users can get as accurate 1: for each user i, to each unknown entry Aij do
predictions as they do in the centralized scenario. This pa- 2: If Aij has been predicted before, replace Aij with
per is mainly focused on the second problem, and conse- the latest prediction.
quently we rely mechanisms shown in [3, 7] to address the 3: Else replace Aij with the average of user i’s ratings.
first problem. 4: end for
Since the server cannot directly see users’ rating profiles, 5: The server securely performs SVD on the matrix C (c-
it needs to compute an aggregate (a learning result based on by-n) formed by filled-in rating profiles.
user information) for making predictions. Figure 1 shows 6: Aggregate Gt is the matrix (n-by-c) formed by the top
our framework for collaborative filtering in distributed rec- k right singular vectors of C.
ommendation systems. At a certain time point t, the server
securely computes the aggregate (denoted as Gt ) from those When a user i asks for predictions, the server generates
users who are online at that time point (denoted as Ut ); “se- predictions as follows using the aggregate Gt .
curely” here means that users’ rating profiles are not dis-
closed to the server and other users. Between time point t
Algorithm 3 Generating predictions for user i
and t + 1, when a certain user (no matter whether she is
1: For each unknown entry Aij , if Aij has been predicted
in Ut or not) needs predictions, the server computes predic-
before, replace Aij with the latest prediction.
tions based on this user’s rating profile and the aggregate
2: Else replace Aij with the average of user i’s ratings.
Gt .
3: Multiply the filled-in rating profile vector (1-by-n) by
The reason of computing aggregates periodically is that
Gt GTt to generate predictions.
users’ rating profiles are dynamic. For any given user, the
probability that he is in Ut is independent of the probability
that he is in Ut+1 , so Ut and Ut+1 would be expected to For analysis, we make the following two assumptions.
have few users in common (given sufficiently many users). Assumption 1: there exists a constant β such that for
Therefore, it is hard to find a way to combine aggregates any m and for any user i,  the filled-in rating profile vector
m
computed at different time points for predictions. A more (denoted as A∗(i) ) satisfies j=1 A∗(j) 2 /(m · A∗(i) 2 ) ≥
minor concern in this framework is how the server picks β. Recall that m is the total number of users. The soundness
time points for aggregate computations. Of course, time of this assumption is shown in Appendix A.

Proceedings of the Seventh IEEE International Conference on E-Commerce Technology (CEC’05)


1530-1354/05 $20.00 © 2005 IEEE
Ratio of change in ||VV ||F

F
Assumption 2: there is a uniform, constant probability 0.08

Ratio of change in ||VVT||


0.08

T
that any user is online at any time point. 0.06 0.06
Assuming that c users are online (sampled) at a time 0.04 0.04
point according to Assumption 2, then such a sampling
0.02 0.02
method is similar to another sampling method that picks
also c samples in the following way: to choose a sample, 0
0 0.02 0.04
0
0 0.02 0.04
the probability that any user is picked is uniform (i.e., 1/m). Ratio of new rating cases Ratio of new users

Note that picking c users at random is not the same as se-


quentially picking c users with a uniform probability. The Figure 2. The impact of new ratings or new
difference is that each user will be picked at most once in users on the ratio of change in Vk VkT , that
the first case, while a user might appear more than once in is Vk VkT − V̂k V̂kT |F /Vk VkT F , here k = 10.
the second. Taking into account the fact that the probability
of repetitions is small when c is considerably smaller than n,
for simplicity these two sampling methods can be regarded
as equivalent in practice. 4.2 Stability of Predictions
This observation, together with Assumption 1, leads to
the following conclusion. At time point t, denote A∗ as A potential problem in a rating system is that users,
the filled-in rating matrix after each user imputes unknown items, and ratings are all dynamic and subject to change at
entries and A∗k as the best rank k approximation to A∗ . As- any time. As a result of this, the rating matrix may change
sume  that there are c online users and that if user i is online, between two aggregate computations as a result of the fol-
A∗(i) / c/m is a row in C. Then by inequality (2), the Gt lowing three situations: 1) users giving new (updated) rat-
formed from the top k right singular vectors of C satisfies ings on items; 2) new users being registered in the system;
the following inequality: for any δ > 0. and 3) new items being added to the system. An altered
rating matrix typically requires a different linear model to
A∗ − A∗ Gt GTt 2F ≤ A∗ − A∗k 2F + (4) best describe it. However, the linear model should not be
 disturbed to a significant extent when the rating matrix un-
 k
2(1 + 8 ln(2/δ)) A∗ 2F . dergoes a small change—otherwise predictions made be-
βc tween two aggregate computation will probably be inaccu-
 rate. Since the computed aggregate in Algorithm 2 approx-
Since each row in C has the same scalar c/m, it can be imates the top k right singular vectors (Vk ) of the current
omitted because it will not affect the resulting singular vec- filled-in rating matrix, the core concern is an instability in
tors. Therefore, online users’ filled-in rating profiles can be Vk VkT .
directly combined to form C using Algorithm 2. We conducted two preliminary experiments on a 5000-
The above inequality says that with a high probability the by-1427 rating matrix from the EachMovie data set to as-
computed aggregate implies a highly accurate linear model sess the impact of rating matrix changes on Vk VkT . In the
behind the current filled-in rating matrix. For an unknown first experiment, 90% of the entries are randomly picked to
entry Aij , if the server makes predictions for user i between form the original rating matrix A, and the rest of the en-
time point t and t + 1, then its corresponding value A∗ij at tries are progressively added to form Â. In the second ex-
time point t+1 will be imputed based on the model estimate periment, rating profiles from 90% of the users are used to
obtained at time point t. Therefore, the aggregate compu- form the initial matrix A and the remaining users are pro-
tation can be taken as the model re-estimation process in gressively added to create Â. The case where new items
the EM procedure maximization step, and prediction gen- are added is not considered here because the size of Vk VkT
eration can be taken as the calculation of expected expres- will change as the number of items is changed, and this
sions in the expectation step. Some online users involved in will make it difficult to assess its effect. Figure 2 shows
the aggregate computation may not ask for predictions since that the difference between Vk VkT and V̂k V̂kT is in a small
the last aggregate computation, which implies that their un- range (6%) when the number of rating cases or the number
known rating entries have not been imputed based on the of users is increased within a small range (3%). In a real
updated model estimate. Nonetheless, we can require them system, it is reasonable that changes in the number of rating
to compute predictions first based on Gt before they join the cases and the number of users will be in this range between
computation of Gt+1 . Therefore, the server may sometimes two consecutive aggregate computations.
require more times of aggregate computations to obtain the Another issue related to the stability of results is how fre-
accurate model, but the final prediction accuracy will not be quently the aggregate should be computed. Note that Gt GTt
affected a lot. (the product of aggregate and its transpose) has a fixed size

Proceedings of the Seventh IEEE International Conference on E-Commerce Technology (CEC’05)


1530-1354/05 $20.00 © 2005 IEEE
(n-by-n) no matter how many online users join the aggre- 0.175
2% approximation
gate computations. The frequency of aggregate computa- 5% approximation
10% approximation
tions can thus be decided based on the disturbance of Gt GTt . no approximation
When the change is very small, the interval between aggre-
0.17
gate computations can be made longer, and vice versa.

NMAE
4.3 Preserving Privacy
0.165
This work is not focused on improvements in preserving
privacy in the distributed scenario, so to address the privacy
issue in Algorithm 2 and 3, we apply security schemes pro-
posed in [3, 7]. In Algorithm 2, a distributed secure SVD 0.16
computation is needed to ensure that users’ rating profiles 0 10 20 30 40 50
Iterations of EM procedure
are not revealed to other users or the server. Canny’s paper
[3] proposed a scheme to achieve this objective. The idea (a) Jester
is to reduce the SVD computation to an iterative calculation 0.195
2% approximation
requiring only the addition of vectors of user data, and use 5% approximation
homomorphic encryption to allow the sums of encrypted 0.19 10% approximation
no approximation
vectors to be computed and decrypted without exposing in-
dividual data. 0.185
In Algorithm 3, the multiplication of a user’s rating pro-

NMAE
file by Gt GTt should be securely computed both so that the 0.18
server cannot learn the rating profile and so that the user
cannot learn Gt GTt . Moreover, the multiplication result 0.175

should not be revealed to the server either, as in that case


the server could easily compute the user’s rating data. The 0.17

multiplications of a vector by a matrix can be considered as


a group of scalar products. There are several privacy pre- 0.165
0 10 20 30 40 50
serving schemes for calculating scalar products in the secu- Iterations of EM procedure
rity literature and one of them is presented in [7]. It also has (b) EachMovie
an asymmetric version to let only one side know the final
result. Figure 3. Comparison of Algorithm 1 (with
three approximation ratios of 2%, 5%, and
5 Results 10%) with the standard EM algorithm (“no ap-
proximation").
The objectives of our experiments were to test the per-
formance of SVD approximation in two situations: in cen-
tralized recommendation systems (Algorithm 1) and in dis-
available and the other observed ratings are regarded as test
tributed recommendation systems (Algorithm 2 and 3).
cases. Algorithms are performed on the available cases to
Two data sets, known as Jester and EachMovie, were
make predictions for the test cases. For all algorithms, the
used in experiments. Jester is a web based joke recom-
reduced linear dimension (k) is equal to 4. This value was
mendation system developed at the University of California,
selected empirically based on both prediction results and
Berkeley [9]. The data set contains 100 jokes, with user rat-
convergence rate. It is consistent with the results in [16].
ings ranging from −10 to 10. The EachMovie set is from
The experimental measure used for comparisons is Nor-
a research project at the Compag Systems Research Center.
malized Mean Absolute Error (NMAE), which is the average
User ratings for movies are recorded on a numeric six-point
of the absolute values of the difference between the real rat-
scale (0, 0.2, 0.4, 0.6, 0.8, 1.0). The density (or fraction of
ings and the predicted ratings divided by the ratings range.
the rating matrix that is filled) of the Jester set is about 50%,
while that of EachMovie is only 3%.
For each experiment, we pick a 5000-by-100 (users-by- 5.1 Experiment 1
items) rating matrix from the Jester data set and a 5000-
by-1427 rating matrix from EachMovie. For each rating Experiment 1 examines the performance of applying
matrix, 90% of the observed ratings are randomly picked as SVD approximation in the centralized recommendation sys-

Proceedings of the Seventh IEEE International Conference on E-Commerce Technology (CEC’05)


1530-1354/05 $20.00 © 2005 IEEE
tem scenario. The selected approximation ratios (the num-
0.172
ber of rows (c) in Algorithm 1 over the number of users) 2% users
are 2%, 5%, and 10%. Figure 3 displays NMAE results of 5% users
10% users
0.17
Algorithm 1 with these three approximation ratios and the All users
standard EM procedure without SVD approximation. As
0.168
predicted analytically in Section 3, the log-likelihood of observed ratings in the EM procedure with SVD approximation generally increases, although not monotonically. Thus, Figure 3 verifies that the prediction accuracy of Algorithm 1 generally increases as more iterations are used in the EM procedure. In both data sets, the NMAE of Algorithm 1 with an approximation ratio of 5% is less than 2% higher than the NMAE of the standard algorithm. Moreover, the convergence rates of the two algorithms are nearly identical, while Algorithm 1 takes only about one tenth the time per iteration compared with the standard algorithm. All these points support the conclusion that our algorithm is practical for real-world systems.

5.2 Experiment 2

In experiment 2, the performance of Algorithms 2 and 3 when applied in the distributed recommendation system scenario is compared with that of the standard EM procedure in the centralized scenario. For the algorithms in the distributed scenario, a random set (2%, 5%, and 10%) of the user rating profiles is picked for each time-step's aggregate computation. We assume that 50% of the users ask for predictions between two aggregate computations, so a random 50% of the rows are picked, in which unknown entries are imputed based on the current aggregate. In order to compare the centralized EM algorithm with the other cases, each of its iterations is regarded as one aggregate computation, and all missing entries are imputed based on the current model estimate after each iteration. Figure 4 gives the NMAE results obtained in these two scenarios. It shows that, under these conditions, our algorithms for distributed recommendation systems are comparable to the EM procedure algorithm in centralized recommendation systems. With only 5% of the users online in each time-step's aggregate computation, the distributed scenario achieves a prediction accuracy that is at most 2% worse than that of the centralized scenario.

[Figure 4. Comparison of algorithms in distributed recommendation systems (when the number of online users in each aggregate computation is 2%, 5%, or 10% of the total number of users) with the standard EM algorithm as applied in centralized recommendation systems ("All users"). Both panels plot NMAE against the number of aggregate computations: (a) Jester, (b) EachMovie.]
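For reference, the NMAE values reported above normalize the mean absolute error by the width of the rating scale, so that 0 means perfect agreement. The following is a minimal sketch of the metric; the Jester range of [-10, 10] and the sample arrays are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def nmae(predicted, actual, r_min, r_max):
    """Normalized mean absolute error: MAE divided by the rating range,
    so 0 is perfect and 1 corresponds to the worst possible error."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.mean(np.abs(predicted - actual)) / (r_max - r_min)

# Example on the Jester scale of [-10, 10]:
print(nmae([1.0, -2.0, 5.0], [3.0, -2.0, 1.0], -10, 10))  # 0.1
```

In this convention an NMAE of 0.166, as in Figure 4(a), means the average prediction error is about 16.6% of the full rating range.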
6 Conclusion

This paper presents novel collaborative filtering algorithms that incorporate the SVD approximation technique into an SVD-based EM procedure. In centralized recommendation systems, the standard SVD-based EM procedure algorithm takes O(m^2 n + mn^2) time for each SVD computation, while our new algorithm with SVD approximation takes only O(s^2 n) time. Experiments on existing data sets show that our algorithm is very promising: its prediction accuracy is almost the same as that of the standard algorithm even if a small approximation ratio is used. We also propose a new framework for collaborative filtering via SVD approximation in distributed recommendation systems, in which the server periodically calculates an aggregate from online users' rating profiles and makes predictions for all users based on it. Our experiments show that if the number of online users is at least five percent of the total number of users, the prediction accuracy in this distributed scenario is almost the same as that obtained in the centralized scenario.

In the proposed distributed framework, there is still a central server that computes aggregates and provides predictions. This framework is suitable for online collaborative filtering service providers if users are concerned about the privacy of their rating profiles. However, a recommendation system based on a central server is less suited for peer-to-peer environments. Therefore, one possible direction for future work is to extend the current server-based distributed framework to a peer-to-peer environment, where we need to consider additional issues such as the availability of peers that keep aggregates, the validity of aggregates, and limitations on peer resources.

Acknowledgements

This work was supported by the National Science Foundation under grant IDM 0308229, and also in part by FAMRI through a Center of Excellence.

References

[1] Y. Azar, A. Fiat, A. Karlin, F. McSherry, and J. Saia. Spectral analysis of data. In Proceedings of the 33rd ACM Symposium on Theory of Computing, 2001.
[2] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, 1998.
[3] J. Canny. Collaborative filtering with privacy. In Proceedings of the IEEE Symposium on Security and Privacy, 2002.
[4] J. Canny. Collaborative filtering with privacy via factor analysis. In Proceedings of the 25th ACM SIGIR Conference, 2002.
[5] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering large graphs via the singular value decomposition. Machine Learning, 56(1-3):9–33, 2004.
[6] P. Drineas, I. Kerenidis, and P. Raghavan. Competitive recommendation systems. In Proceedings of the 34th ACM Symposium on Theory of Computing, 2002.
[7] W. Du and M. Atallah. Privacy-preserving cooperative statistical analysis. In Proceedings of the 17th Annual Computer Security Applications Conference, 2001.
[8] Z. Ghahramani and M. I. Jordan. Learning from incomplete data. Technical report, MIT, 1994.
[9] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151, 2001.
[10] G. Golub and C. Van Loan. Matrix Computations (3rd edition). Johns Hopkins University Press, 1996.
[11] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd ACM SIGIR Conference, 1999.
[12] B. Marlin and R. S. Zemel. The multiple multiplicative factor model for collaborative filtering. In Proceedings of the 21st International Conference on Machine Learning, 2004.
[13] D. M. Pennock, E. Horvitz, S. Lawrence, and C. L. Giles. Collaborative filtering by personality diagnosis: A hybrid memory and model-based approach. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, 2000.
[14] P. Resnick, N. Iacovou, M. Suchak, P. Bergstorm, and J. Riedl. GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the ACM Conference on Computer Supported Cooperative Work, 1994.
[15] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Application of dimensionality reduction in recommender systems - a case study. In ACM WebKDD Web Mining for E-Commerce Workshop, 2000.
[16] N. Srebro and T. Jaakkola. Weighted low-rank approximation. In Proceedings of the 20th International Conference on Machine Learning, 2003.

Appendix

A. Soundness of Assumption 1

To verify the soundness of Assumption 1, an experiment was performed on a 5000-by-1427 rating matrix from EachMovie. Missing entries are filled in using the average of that user's available entries. Let

β* = min_i Σ_{j=1}^{m} ||A*(j)||^2 / (m · ||A*(i)||^2),

where A*(i) denotes the i-th row of the filled-in rating matrix. Table 1 displays the mean value and the standard deviation of β* (from 20 trials) when m increases from 1000 to 5000. It shows that β* is very stable as m increases.

Table 1. The mean value ("mean") and the standard deviation ("std") of β* from 20 trials when m increases.

m      1000    2000    3000    4000    5000
mean   0.406   0.407   0.407   0.407   0.407
std    0.006   0.003   0.002   0.001   0.000

Proceedings of the Seventh IEEE International Conference on E-Commerce Technology (CEC'05)
1530-1354/05 $20.00 © 2005 IEEE
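Since β* reduces to the average squared row norm of A* divided by its largest squared row norm (the min over i is attained at the row with the largest norm), its stability is easy to check numerically. The sketch below uses a uniform random matrix as a stand-in for the filled-in EachMovie matrix, and `beta_star` is a hypothetical helper name:

```python
import numpy as np

def beta_star(A):
    """beta* = min_i sum_j ||A(j)||^2 / (m * ||A(i)||^2),
    i.e. the mean squared row norm over the maximum squared row norm."""
    row_sq = np.sum(A * A, axis=1)       # ||A(i)||^2 for each row i
    return row_sq.mean() / row_sq.max()  # min over i picks the largest denominator

rng = np.random.default_rng(0)
for m in (1000, 2000, 4000):             # beta* stays roughly constant as m grows
    A = rng.uniform(0, 6, size=(m, 100)) # synthetic stand-in for a rating matrix
    print(m, round(beta_star(A), 3))
```

With i.i.d. rows, both the mean and the maximum squared row norm stabilize as m grows, which is consistent with the flat trend in Table 1.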
