A Collaborative Filtering Recommendation Algorithm Based on Item Genre and Rating Similarity
Abstract: To address the shortcomings of user-based and item-based collaborative filtering when user rating data are extremely sparse, this paper introduces a similarity measure that combines item genre and item ratings, and improves on it. The high ratings given by a group of users are also taken into account when calculating the similarities of item genre and ratings. Experiments show that the improved algorithm reduces the mean absolute error (MAE) and improves the quality of recommendation.

Keywords- collaborative filtering; recommendation systems; MAE; E-commerce

I. INTRODUCTION

The Internet has become an indispensable tool for work, daily life and entertainment. According to the relevant statistics, the total number of website pages had exceeded 80 million by January 2008. It is hard for a user to find the information he or she wants among so many network resources. Nowadays people mainly use search engines to get information. This is a very passive approach: the results may not be what the user wants, or the user may have to browse a large number of pages before finding the right information. Searching is therefore inefficient.

Generally there are three kinds of recommendation methods: personalized recommendation, which is based on an individual's past behavior; social recommendation, which is based on the past behavior of similar users; and item recommendation, which is based on the item itself. A personalized recommendation system is an active information service system. It can make up for the passive way in which search engines deliver information. Nowadays almost all E-commerce websites use recommendation systems, e.g., Amazon, CDNow, eBay, DangDang and douban. The main recommendation methods are content-based recommendation and collaborative filtering recommendation [1]; the latter has been a successful technology [2], but it has problems such as sparsity, scalability and cold start.

No user can rate every item, because the numbers of users and items are large. The ratings are therefore sparse, and errors arise in the similarity computation, which in turn affect the quality of recommendation.

II. RELATED WORK

A. Basic knowledge

Collaborative filtering recommendation obtains the similarity of users or items by analyzing the ratings of users on items, and then predicts the ratings of users on unrated items. Items with high predicted ratings can be recommended to users. The process has three steps: first, users rate items; second, the nearest neighbors are found; third, items are recommended. The process is shown in Figure 1.

        I1    I2    …    In
U1      R11   R12   …    R1n
U2      R21   R22   …    R2n
…       …     …     …    …
Um      Rm1   Rm2   …    Rmn

Ux      I1x   I2x   I3x  …    Imx

Figure 1: The process of recommendation

There is a list of m users U = {u1, u2, … , um} and a list of n items I = {i1, i2, … , in}. They can be represented as an m×n ratings matrix, shown in the first part of Figure 1. Each user ui has a list of items Iui about which the user has expressed an opinion. User ui can give item ij a rating Rij. Opinions can be given explicitly by the user as a rating score, generally within a certain numerical scale.

B. Similarity Computation

The methods of similarity computation are almost the same for the item-based and user-based approaches. There are three
$$\mathrm{sim}(i,j) = \frac{\sum_{a \in U_{ij}} (R_{a,i} - \bar{R}_a)(R_{a,j} - \bar{R}_a)}{\sqrt{\sum_{a \in U_i} (R_{a,i} - \bar{R}_a)^2}\,\sqrt{\sum_{a \in U_j} (R_{a,j} - \bar{R}_a)^2}} \tag{3}$$

III. IMPROVED ALGORITHM

A. Algorithm analysis

Generally, the traditional collaborative filtering recommendation algorithm is user-based and computes the similarity between users. The number of users and items

a) First, select a user at random and obtain the set Iunrat of items that the user has not rated, then select a target item Iaim from Iunrat.
b) From the training set, select a group of users who rated the target item highly; these users should also have rated other items highly (we take r as the rating threshold, with r >= 4). Iother is an item that the target user has rated in the training set.
c) Count the genres of every Iother and compute the similarity simattri(i,j) between the genres of Iother and of the target item.
d) Compute the similarity simrat(i,j) between the target item and Iother using the three methods of similarity computation.
e) Compute the combined similarity, denoted siminte(i,j), and select the first N neighbor items as the set NI of nearest-neighbor items for the target item. The formula is
siminte(i,j) = (1-α) simattri(i,j) + α simrat(i,j)
where α is a weighting coefficient between 0 and 1.
f) Compute the prediction P(user,Iaim) from siminte(i,j) and the user's ratings of the items that belong to NI. Here we consider two methods for predicting the rating of the target user on unrated items, with the prediction denoted P(user,Iaim):
Traditional:
Here $\bar{R}_{I_{aim}}$ and $\bar{R}_j$ are the average ratings of the Iaim-th item and the j-th item. NI is the set of the most similar items. siminte(i,j) is the similarity between the Iaim-th item and the j-th item. $R_{user,j}$ is the rating of the target user on item j.
g) Repeat steps b) to f) to compute predictions for all items that the user has not rated. Sort the predictions P(user,Iaim) and recommend the first N to the user.

IV. EXPERIMENTATION

A. Data set

In this paper we use experimental data from the MovieLens research website (http://MovieLens.umn.edu/) to evaluate the different algorithms. The data set was loaded into an Access database for the experiments. MovieLens is a web-based research recommender system on which users can rate movies; at the same time, the system recommends a set of movies to users. Up to now, the site has over 45000 users who have expressed opinions on 6600+ different movies. The data set contains 100000 ratings by 943 users on 1682 movies. There are 19 genres (unknown |Action |Adventure |Animation |Children's |Comedy |Crime |Documentary |Drama |Fantasy |Film-Noir |Horror |Musical |Mystery |Romance |Sci-Fi |Thriller |War |Western) for the movies. We converted the genre information into a movie-genre matrix that has 1682 rows (i.e., 1682 movies) and 19 columns (i.e., 19 genres). In the matrix, every movie has a number (0 or 1) for each genre: 0 denotes that the movie does not belong to the corresponding genre and 1 denotes that it does. We only considered users who rated 20 or more movies. Ratings are integers from 1 to 5; a higher number means that the user prefers the movie more.

We randomly selected 20891 ratings from the database for the experiments, covering 200 users and 368 movies. The sparsity level of this movie data set is, therefore, $1 - \frac{20891}{200 \times 368}$, which is 0.7165. Every user has rated 50 or more movies, and every movie has been rated by 30 or more users. A value of x = 0.8 indicates that 80% of the data was used as the training set and 20% of the data was used as the test set.

$$\mathrm{MAE} = \frac{\sum_{i=1}^{N} |p_i - q_i|}{N} \tag{6}$$

C. Experimental results

To validate the collaborative filtering recommendation algorithm based on improved item genre and rating similarity, we ran two groups of experiments and compared the results.

Experimentation Ⅰ

Because the weighting coefficient α can vary when computing the combined similarity between items, the value of α can affect the quality of recommendation. In this experiment, we used the collaborative filtering recommendation algorithm based on improved item genre and rating similarity to predict the ratings of unrated items. We used correlation-based similarity, cosine-based similarity and adjusted-cosine similarity to compute the similarity, and the rating predictions used the traditional and weighted-sum methods. The number of neighbors for the target item was 20. We increased the weighting coefficient α from 0 to 1 in increments of 0.1, and the sensitivity of MAE is shown in Figure 2.
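The combination in step e), the MAE of equation (6) and the sweep over α in Experimentation Ⅰ can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the use of plain dictionaries and the `evaluate` callback (standing in for a full train/predict/test run at a given α) are assumptions, and p_i and q_i are read here as the predicted and actual ratings.

```python
from typing import Callable, List, Mapping, Sequence, Tuple


def combined_similarity(sim_attri: float, sim_rat: float, alpha: float) -> float:
    """sim_inte(i,j) = (1 - alpha) * sim_attri(i,j) + alpha * sim_rat(i,j)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * sim_attri + alpha * sim_rat


def nearest_items(sim_inte: Mapping[str, float], n: int) -> List[str]:
    """Return the n items with the highest combined similarity (the set NI)."""
    ranked = sorted(sim_inte, key=lambda item: sim_inte[item], reverse=True)
    return ranked[:n]


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """MAE = sum(|p_i - q_i|) / N over the N test ratings."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - q) for p, q in zip(predicted, actual)) / len(predicted)


def sweep_alpha(evaluate: Callable[[float], float],
                steps: int = 10) -> List[Tuple[float, float]]:
    """Call evaluate(alpha) for alpha = 0, 1/steps, ..., 1.

    `evaluate` is a hypothetical callback that returns the MAE obtained
    with the given weighting coefficient.
    """
    return [(k / steps, evaluate(k / steps)) for k in range(steps + 1)]
```

With `steps=10` the sweep visits α = 0, 0.1, …, 1, matching the increment used in the experiment; α = 0 uses genre similarity only and α = 1 uses rating similarity only.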
[Figure: Sensitivity of different weighing coefficient in different similarity (traditional). Series: Correlation-based, Cosine-based, Adjusted-cosine. Horizontal axis from 0 to 1; vertical axis: MAE.]

[Figure: Sensitivity of the neighborhood size in different algorithms. Series: User-based, Item-based, Item-based using genre, improved item genre and rating similarity. Horizontal axis: No. of Neighbors (5 to 50); vertical axis: MAE.]
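For reference, the adjusted-cosine similarity of equation (3), one of the three rating similarities compared in the figures, might be computed as in the sketch below. The nested-dictionary layout `ratings[user][item]`, the function name and the choice to return 0 when a denominator is zero are assumptions made for illustration. Following the subscripts of equation (3), the numerator runs over users who rated both items and each denominator sum runs over the users who rated the respective item.

```python
import math
from typing import Mapping


def adjusted_cosine(ratings: Mapping[str, Mapping[str, float]],
                    i: str, j: str) -> float:
    """Adjusted-cosine similarity between items i and j.

    ratings[user][item] holds the rating a user gave an item; each rating
    is centred on that user's mean rating before comparison.
    """
    num = 0.0    # sum over users who rated both i and j
    den_i = 0.0  # sum over users who rated i
    den_j = 0.0  # sum over users who rated j
    for user_ratings in ratings.values():
        if not user_ratings:
            continue
        mean = sum(user_ratings.values()) / len(user_ratings)
        dev_i = user_ratings[i] - mean if i in user_ratings else None
        dev_j = user_ratings[j] - mean if j in user_ratings else None
        if dev_i is not None:
            den_i += dev_i * dev_i
        if dev_j is not None:
            den_j += dev_j * dev_j
        if dev_i is not None and dev_j is not None:
            num += dev_i * dev_j
    if den_i == 0.0 or den_j == 0.0:
        return 0.0  # assumption: treat an undefined similarity as 0
    return num / (math.sqrt(den_i) * math.sqrt(den_j))
```

Two items that every user rates identically relative to his or her own mean get a similarity close to 1, while items rated in opposite directions get a value close to -1.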