Movie Recommender System Based on Collaborative Filtering Using Apache Spark
Movie Recommender System Based on Collaborative Filtering Using Apache Spark
net/publication/326146390
CITATIONS READS
30 8,888
2 authors:
All content following this page was uploaded by Mohammed Fadhel Aljunid on 07 June 2019.
Keywords Recommender systems Collaborative filtering Alternating Least
Squares Apache Spark Big data MovieLens dataset
M. F. Aljunid (&)
Mangalore University, Mangalore, Karnataka, India
e-mail: [email protected]
D. H. Manjaiah (&)
Department of Computer Science, Mangalore University, Mangalore
Karnataka, India
e-mail: [email protected]
1 Introduction
In recent times, big data is becoming one of the newest research interests in the
areas of computer science and other related areas. With the possibility of a radical
change in companies and organizations that use the information for improving the
customer experience and transform their business models. Big data has several
features which are volume, velocity, variety, value, and veracity. Big data is facing
difficulties in managing using conventional tools, techniques, and procedures. Big
data analytics is used for handling bulk quantities of data. It is used to mine and
extract patterns, information, and knowledge from the data in an effective way. Big
data analytics become an important trend for organizations and enterprises that are
interesting in providing innovative ideas for enhancing and increasing their busi-
ness performance and decision-making. RS are a group of techniques that allow
filtering through large samples and information space in order to give suggestion to
users when needed. Currently, RS are becoming highly popular and utilized in
different areas such as movies, research articles, search queries, news, books, social
tags, and music. Furthermore, there are other essential RS basically applicable for
specialist, collaborators, funny story, restaurant and hotels, dresses, monetary ser-
vices, life insurance, passion associates which give online dating services and
several other social media such as Twitter, LinkedIn, and Facebook.
RS use a number of different technologies to filter out best suit results and
provide to users to satisfy their information need. RS are classified into three broad
groups which are content-based systems, collaborative filtering systems, and hybrid
recommender system [1]. Content-based systems which try to test the behavior of
the item which is labeled as recommended one. It works by learning the behavior of
the new users based on their information need presented in objects whereby the user
has rated. It is a keyword-specific RS where the keywords are used to illustrate the
items. Thus, in a content-based RS, models work in such a way that they recom-
mend users’ comparable items that have been liked in the past or is browsing
currently. For instance, if a MovieLen user has to browse several comedies movies,
then, the RS will classify those movies into the database as getting the most ratings
on the comedy varieties. Collaborative filtering system is based on similarity
measures between user’s information need and the items. The items recommended
to a new user are those which were liked by other similar users in previous
browsing history. Collaborative filtering algorithm uses an average rating of
objects, recognizes similarities between the users on the basis of their ratings, and
generates new recommendations based on inter-user comparisons. However, it
faces many challenges and limitation such as data sparsity whose role is to the
evaluation of large item set. Another limitation is hard to make prediction based on
nearest neighbor algorithm, third is scalability in which number of users and
number of items both increases, and the last one is cold start where poor rela-
tionship among like-minded people. To solve encounters, above mentioned, we
moved to other approaches of collaborative filtering, and we landed up on
model-based collaborative filtering [2]. Hybrid RS performs their tasks by
Movie Recommender System Based on Collaborative Filtering … 285
2 Related Work
So far, several researchers introduced and presented research in the area of building
recommendation systems. Wei et al. [4] proposed a hybrid recommender model to
address the cold start problem, which explores the item content features, learned
from a deep learning neural network and applies them to the timeSVD++ CF model.
A hybrid recommendation model is proposed which combines a time-aware model
timeSVD++ with a deep learning architecture SDAE to address the cold start
problem of collaborative filtering recommendation models. Kupisz and Unold [5]
developed and compared item-based collaborative filtering algorithm using two
cluster computing frameworks normally Hadoop’s disk-based MapReduce para-
digm and Spark’s in-memory based RDD paradigm. In order to enhance the reli-
ability, scalability, and to improve processing ability of large-scale data, Zeng et al.
[6] proposed PLGM. In their work, two matrix factorization algorithms were
considered, which are ALS and SGD. The parallel matrix factorization based on
SGD was implemented on spark and was compared with ALS in MLib for its
performance. The advantage and disadvantage of each model based on test results
were analyzed. A variety of profile aggregation approaches were studied and the
model which gives the best result was adopted. Models such as PLGM and LGM
were studied in terms of efficiency and accuracy. Dianping, Lakshmi et al. [7] used
item-based collaborative filtering techniques. In this method, they first inspect the
user item rating matrix and they categorize the relationships among different items,
and they utilize these relationships so as to figure out the recommendations for the
user. A new concept namely movie swarm mining was proposed by Halder et al. [8]
using format frequent item mining and two pruning rules. It addresses the problem
of item recommendation and thus gives an idea about the user interests and famous
movies trend. This technique can be very helpful for movie producers to manage
their new movies. In addition to this, a new algorithm was proposed to recommend
movies to a new user. A scalable method for building recommender systems based
on similarity join has been proposed by Dev et al. [9]. MapReduce framework was
used to design the system in order to work with big data applications. The
unnecessary computation overhead such as redundant comparisons in the similarity
computing phase can significantly be reduced by the system using a method called
extended prefix filtering (REF). Chen et al. [10] used co-clustering with augmented
matrices (CCAM) to design several methods including a heuristic scoring, tradi-
tional classifier, and machine learning to build a recommendation system and
integrate content-based collaborative filtering for a hybrid recommendation system.
Similarly, a collaborative filtering algorithm based on the ALS, as a powerful
Movie Recommender System Based on Collaborative Filtering … 287
This section provides the idea of the proposed system. The proposed system is a
movie recommender system based on ALS using Apache Spark. The novelty of this
work is based on the selection of parameters of ALS algorithms that can affect the
performance of building of a movie recommender system.
In this work, we apply user’s ratings from the datasets the popular website like
IMDB, Rotten Tomatoes, MovieLen, and Time Movie Ratings. This dataset is
available in many formats such as CSV file, text file, and databases. We can either
stream the data live from the websites or download and store them on our local file
system or HDFS. Spark streaming is used to stream real-time data from the various
source like Twitter, the stock market, and geographical system and perform pow-
erful analytics to businesses. It used for processing real-time streaming data. We use
collaborative filtering (CF) to predict the ratings of users for particular movies
based on their ratings for other movies. Then collaborate this with another user’s
rating for that particular movie. We train the ALS algorithm using MovieLen data
and get the results from the machine learning model. We use spark SQL’s data
frame, dataset, and SQL service to store the data. The result of the machine learning
model is stored in RDBMS so that the web application can display the recom-
mendation to a particular use. The results of the movie recommendation system are
stored in our local drive. We store the recommendation movies along with the
ratings in a text file and CSV file formats. We prefer storing the result into an
RDBMS system so as to access it directly from the web application and display
recommendation and top movies as shown in Fig. 2.
This subsection provides the steps of applying the ALS algorithm on MovieLens
datasets for train and test the selection of best parameter when building a movie
recommendation system.
288 M. F. Aljunid and D. H. Manjaiah
4 Experimental Study
This section presents the experimental setup and results in discussion and analysis.
programming languages such as Java, Python, Scala, and R, and has an engine that
supports general execution graphs. It also supports a good set of higher level tools
involving Spark SQL for structured data processing, MLlib for machine learning,
GraphX for graph processing, and Spark Streaming for real-time applications. It
was built on top of Hadoop and MapReduce and extends the MapReduce Model to
efficiently use more types of computations. Spark application runs as a separate set
of process on the cluster. All of the distributed processes are coordinated by a
SparkContext object in the drive program. SparkContext connects to one type of
cluster manager (standalone/Yarn/Mesos) for resource allocation across clusters.
Cluster manager provides executors, which are essentially JVM process to run the
logic and store application data. Then the SparkContext object sends the application
code (jar files/python scripts) to executors. Finally, the SparkContext executes tasks
in each executor.
The dataset which is used in this work is MovieLens dataset. This dataset contains
24 million ratings and 670,000 tag applications applied to 40,000 movies by
260,000 users. This dataset contains three files called ratings.csv, movies.csv and
tags.csv. ratings.csv contains tree column (userId, movieId, rating). While movies.
csv contains movieId, title, genres. The genres have the format: Genre1, Genre2,
Genre3. The tags file (tags.csv) has the format: userId, movieId, tag, timestamp and
finally, the links.csv file has the format: movieId, imdbId, tmdbId. We can split the
data into three portions which are training, validation, and test data to parse their
lines once they are loaded into RDDs. Parsing the movies and rating files yields two
RDDs: For each row in the ratings dataset, we have created a vector of (userId,
movieId, rating). During preprocessing, we have dropped the timestamp attribute
because we do not need it for this recommender. Similarly, each row in the movies
dataset, we have created a vector of (movieId, title). We have dropped the genres
attribute because we do not use it for this recommender.
In order to determine the best ALS parameters for our experiments, we need to
break up the ratings RDD dataset into three pieces as follows: a training set which
we will use 60% of the data to train models, validation set, which used 20% of the
data to choose the best model and test set, which used 20% of the data for our
experiments to randomly split the dataset into the multiple groups.
The test has been done on a machine which contains the subsequent descriptions
P. A machine with Ubuntu 14.04 LTS, 4 GB memory, and Intel® Core™ i5-2400
CPU @ 3.10 GHz 4 processor as well as a hard disk of 500 GB. In this machine,
290 M. F. Aljunid and D. H. Manjaiah
Apache Spark with version 2.1.1 is installed and is used to develop the proposed
system. The dataset which is used in research work is MovieLens dataset [13]. In
the proposed model, root mean squared error (RMSE) is used as a performance
measure. RMSE works by measuring the difference between error rate a user gives
to the system and the predicted error by the model. Equation (1) depicts how RMSE
works on movie recommender system.
vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
u n
uX xui ygi 2
RMES ¼ t ð1Þ
i¼0
n
whereby xui is the rating that user u gives to an item i in the experimental data, ygi is
a predicted rating that the movie that user u gives to an item and where n is the
number of ratings in the test data.
the following test setups. Tables 1, 2 and 3 show the performance of movie rec-
ommendation engine based on ALS under different values of lambda and iteration.
Table 1 illustrates the execution of time with the changes of lambda with iterations
parameters of ALS model, while Table 2 the rank of best-trained model with the
changes of lambda with iterations parameters of ALS algorithm, finally Table 3
indicates the RMSE with the changes of lambda with iterations parameters of ALS
model. The results presented in Table 1 indicate that when lambda is set to 0.6 and
iteration set is 10, the time value is minimum which is 1.41323 s, and rank value is
8 as shown in Table 2. Moreover, the RMSE register for this rating is 1.07424
as indicated in Table 3. On the other hand, as it is indicated in Table 1 when
lambda is set to 0.2 and iteration is 15, running time becomes 1.463743 and rank is
12 for this item as shown in Table 2. The RMSE value for this item is the mini-
mum, which is 0.9167, as presented in Table 3.
As mentioned above, the analysis for movie recommendation system is done
using three quality metrics which are RMSE, time, and rank. Using these three
metrics, two cases are achieved as shown in Table 4, case 1 with high time and low
RMSE rate while the case 2 with low time and high RMSE rate. According to
results in Table 4, the prediction for Top 25 movies is shown in Figs. 3 and 4.
In general, the lowest value of the RMSE is considered the best case for pre-
diction in building recommendation system. Therefore, we will adopt the second
case because the value of the RMSE is smaller compared to the value in the first
case as well as adopt the second case as the best case because there is no significant
difference in the amount of time execution between the two cases. Now, we can get
the top recommended movies by using the second case. Finally, we concluded that
from these results the best case is the second case which has the best value for
RMSE, which can be useful for building recommendation engines for predicting the
top 25 ranked movies.
model was trained. Two best cases are chosen based on best parameters selection
from experimental results which can lead to building god prediction rating for a
movie recommender engine. From these cases, the lowest value of the RMSE is
considered the best case for prediction in building movie recommendation system.
Therefore, the second case is recommended to be used since the value of the RMSE
is smaller compared to the value in the first case as well as adopt the second case as
the best case, because there is no significant difference in the amount of time
execution between the two cases. Finally, we concluded that from these results that
the best case is the second case which has the best value for RMSE, which can be
useful for building recommendation engines for predicting the top 25 ranked
movies. In the future work, we plan to develop and improve a new loss function
because of the shortcomings of the recommender system algorithm based on ALS
model based on the parameter of the best case which has the best value for RMSE
using Apache Spark.
References
1. Verma, J. P., Patel, B., & Patel, A. (2015). Big data analysis: Recommendation system with
Hadoop framework. In 2015 IEEE International Conference on Computational Intelligence &
Communication Technology (CICT). IEEE.
2. Katarya, R., & Verma, O. P. (2016). A collaborative recommender system enhanced with
particle swarm optimization technique. Multimedia Tools and Applications, 75(15), 9225–
9239.
3. https://docs.databricks.com/_static/notebooks/cs100x-2015-introduction-to-big-data/module-
5–machine-learning-lab.html.
4. Wei, J., et al. (2016). Collaborative filtering and deep learning based hybrid recommendation
for cold start problem. In 2016 IEEE 14th International Conference on Dependable,
Autonomic and Secure Computing, 14th International Conference on Pervasive Intelligence
and Computing, 2nd International Conference on Big Data Intelligence and Computing and
Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech). IEEE.
5. Kupisz, B., & Unold, O. (2015). Collaborative filtering recommendation algorithm based on
Hadoop and Spark. In 2015 IEEE International Conference on Industrial Technology (ICIT).
IEEE.
6. Zeng, X., et al. (2016). Parallelization of latent group model for group recommendation
algorithm. In IEEE International Conference on Data Science in Cyberspace (DSC). IEEE.
7. Ponnam, L. T., et al. (2016). Movie recommender system using item based collaborative
filtering technique. In International Conference on Emerging Trends in Engineering,
Technology, and Science (ICETETS). IEEE.
8. Halder, S., Sarkar, A. M. J., & Lee, Y.-K. (2012). Movie recommendation system based on
movie swarm. In 2012 Second International Conference on Cloud and Green Computing
(CGC). IEEE.
9. Dev, A. V., & Mohan, A. (2016). Recommendation system for big data applications based on
set similarity of user preferences. In International Conference on Next Generation Intelligent
Systems (ICNGIS). IEEE.
10. Chen, Y.-C., et al. (2016). User behavior analysis and commodity recommendation for
point-earning apps. In 2016 Conference on Technologies and Applications of Artificial
Intelligence (TAAI). IEEE.
Movie Recommender System Based on Collaborative Filtering … 295
11. Zhou, Y. H., Wilkinson, D., & Schreiber, R. (2008). Large scale parallel collaborative
filtering for the Netflix prize. In Proceedings of 4th International Conference on Algorithmic
Aspects in Information and Management (pp. 337–348). Shanghai: Springer.
12. https://spark.apache.org/docs/latest/. Accessed March 10, 2017.
13. https://grouplens.org/datasets/movielens/. Accessed May 15, 2017.
14. Delgado, J. A. (2000, February). Agent-based information filtering and recommender systems
on the internet (Ph.D. thesis). Nagoya Institute of Technology.
15. Mooney, R. J., & Roy, L. (1999). Content-based book recommendation using learning for text
categorization. In Proceedings of the Workshop on Recommender Systems: Algorithms and
Evaluation (SIGIR ‘99). Berkeley, CA, USA.