
Published by: International Journal of Engineering Research & Technology (IJERT)
http://www.ijert.org    ISSN: 2278-0181
Vol. 9 Issue 04, April-2020

Machine Learning Model for Movie Recommendation System

M. Chenna Keshava, Assistant Professor, Dept. of CSE, JNTUACE, Pulivendula, AP, India
S. Srinivasulu, Student, Dept. of CSE, JNTUACE, Pulivendula, AP, India
P. Narendra Reddy, Student, Dept. of CSE, JNTUACE, Pulivendula, AP, India
B. Dinesh Naik, Student, Dept. of CSE, JNTUACE, Pulivendula, AP, India

Abstract— The primary aim of recommendation systems is to recommend relevant items to a user based on historical data. If a movie is rated highly by a user who also watched the movie you are watching now, it is likely to show up in the recommendations. The films with the highest overall ratings are likely to be enjoyed by almost everyone. The algorithm that performs all of these functions is called CineMatch. For individual users, it also learns from the behavior of the user to better predict a movie the user is expected to be interested in. Our goal here is to improve the CineMatch algorithm by 10% using standard collaborative filtering techniques.

Keywords—Machine learning models, Movies, Ratings, Similarity matrix, Sparse matrix.

I. INTRODUCTION

A. Motivation and Scope
We are leaving the age of information and entering the age of recommendation. Like many machine learning techniques, a recommender system makes predictions based on users' historical behavior. Specifically, it predicts user preference for a set of items based on past experience.

B. Need to Study
Recommendation systems are becoming increasingly important in today's extremely busy world. People are always short on time with the myriad tasks they need to accomplish within the limited 24 hours. Recommendation systems are therefore vital, as they help users make the right choices without having to expend their cognitive resources. The purpose of a recommendation system is essentially to search for content that would be interesting to an individual. Moreover, it weighs a number of factors to create personalized lists of useful and interesting content specific to each user. Recommendation systems are Artificial Intelligence based algorithms that skim through all possible alternatives and create a customized list of items that are interesting and relevant to an individual.

C. Literature Survey/Review of Literature
• The two principal tasks addressed by collaborative filtering techniques are rating prediction and ranking. In contrast to rating prediction, ranking models leverage implicit feedback (e.g. clicks) in order to provide the user with a customized ranked list of recommended items [1].
• With the increasing need to keep data confidential while providing recommendations, privacy-preserving collaborative filtering has been receiving increasing attention. To make data owners feel more comfortable while providing predictions, various schemes have been proposed to estimate recommendations without deeply jeopardizing privacy. Such methods remove or reduce data owners' privacy, financial, and legal concerns by employing different privacy-preserving techniques [2].
• With the spread of information, how to quickly locate one's favorite film among a massive number of movies has become a very important issue. Personalized recommendation systems can play a crucial role, particularly when the user has no clear target movie [3].
• In this paper, the authors design and implement a movie recommendation system prototype combined with the actual needs of movie recommendation, using the KNN algorithm and the collaborative filtering algorithm [4].
• In this study, a privacy-preserving collaborative filtering method for binary data, referred to as a randomized response technique, is examined. A method focused on the second aspect of privacy is developed to discover fake binary ratings using auxiliary and public information [5].
• If privacy measures are provided, data owners may decide to become involved in prediction generation processes. Privacy-preserving schemes are proposed that remove e-commerce sites' privacy concerns about providing predictions on distributed data [6].
• With the development of the Internet and e-commerce, recommendation systems have been widely used. This paper studies the electronic commerce recommendation system and focuses on the collaborative filtering algorithm as applied to a personalized film recommendation system [7].


II. RESEARCH GAP
The data set provides a large amount of rating information, and the target is a prediction accuracy that is 10% better than what the CineMatch algorithm can achieve on the same training data set. (Accuracy is a measurement of how closely the predicted ratings of films match the subsequent actual ratings.) We have to predict the rating that a user would give to a movie that he or she has not yet rated, and minimize the difference between the predicted and the actual rating.

III. RESEARCH METHODOLOGY

A. User-Item Sparse Matrix
In the user-item matrix, each row represents a user, every column represents an item, and every cell holds the rating given by a user to an item. Because each user rates only a small fraction of all movies, the matrix is sparse.
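As a concrete illustration, the user-item matrix can be stored as a SciPy sparse matrix. The snippet below is a minimal sketch, assuming a pandas DataFrame df with the MovieID, CustID, and Ratings columns described later in Section V (Table I); it is not the paper's exact code.

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_user_item_matrix(df):
    """Build a user-item sparse matrix from a ratings DataFrame.

    Rows are users (CustID), columns are items (MovieID), and each stored
    cell is the rating the user gave to that movie; all other cells are
    implicit zeros, which keeps memory usage low.
    """
    n_users = df["CustID"].max() + 1
    n_items = df["MovieID"].max() + 1
    return csr_matrix(
        (df["Ratings"].values, (df["CustID"].values, df["MovieID"].values)),
        shape=(n_users, n_items),
    )
```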
B. User-User Similarity Matrix
Here, two users are similar on the basis of the similar ratings given by each of them. If any two users are similar, it means both of them have given very similar ratings to the items, because here the user vector is simply the row of the user-item matrix, which in turn contains the ratings given by that user to the items. Since cosine similarity ranges from 0 to 1, with 1 meaning the highest similarity, all the diagonal elements will be 1, because the similarity of a user with himself/herself is the highest. There is, however, one problem with user-user similarity: user preferences and tastes change over time. If a user liked some item one year ago, it is not necessarily true that he/she will still like the same item today.
C. Item-Item Similarity Matrix
Here, two items are similar on the basis of the similar ratings given to each of them by all of the users. If any two items are similar, it means both of them have been given very similar ratings by all of the users, because here the item vector is simply the column of the user-item matrix, which in turn contains the ratings given by the users to that item. Since cosine similarity ranges from 0 to 1, with 1 meaning the highest similarity, all of the diagonal elements will be 1, because the similarity of an item with itself is the highest.
D. Cold Start Problem
The cold start problem concerns personalized recommendations for users with little or no past history (new users). Providing recommendations to users with a small past history becomes a difficult problem for CF models, because their learning and predictive ability is limited.

IV. SURPRISE LIBRARY MODELS

A. XGBoost
When it comes to small-to-medium structured/tabular data, decision-tree-based algorithms are currently considered best-in-class. XGBoost and Gradient Boosting Machines (GBMs) are both ensemble tree methods that apply the principle of boosting weak learners using the gradient descent architecture.
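The sketch below shows how such a gradient-boosted regressor might be fit on hand-crafted rating features (for example, the user average and movie average ratings); the feature set and hyperparameter values are illustrative assumptions, not the paper's exact configuration.

```python
import xgboost as xgb

# X_train / X_test: DataFrames of hand-crafted features such as the
# global average, user average and movie average ratings.
# y_train / y_test: the actual ratings to be predicted.
model = xgb.XGBRegressor(
    n_estimators=100,   # number of boosted trees
    learning_rate=0.1,  # shrinkage applied to each tree
    max_depth=6,        # depth of each weak learner
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```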


B. Surprise Baseline
This algorithm predicts a baseline rating for a given user and item based on the data.

Predicted rating (baseline prediction):

    r̂_ui = μ + b_u + b_i

μ : average of all ratings in the training data.
b_u : user bias.
b_i : item bias (movie bias).
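A minimal sketch of fitting this baseline model with the Surprise library, assuming trainset and testset are the Surprise train/test splits built from the ratings DataFrame (see the data-loading sketch in Section V):

```python
from surprise import BaselineOnly, accuracy

# BaselineOnly estimates mu + b_u + b_i, fitting the biases with ALS.
algo = BaselineOnly(bsl_options={"method": "als", "n_epochs": 10})
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)  # RMSE on the held-out ratings
```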
C. Surprise KNNBaseline Predictor
This is a basic collaborative filtering algorithm that takes a baseline rating into account.

Predicted rating (based on user-user similarity):

    r̂_ui = b_ui + [ Σ_{v ∈ N_i^k(u)} sim(u, v) · (r_vi − b_vi) ] / [ Σ_{v ∈ N_i^k(u)} sim(u, v) ]

where b_ui = μ + b_u + b_i is the baseline estimate and N_i^k(u) is the set of the k users most similar to u who have also rated item i. This is exactly the same as our hand-crafted feature 'SUR' (Similar User Rating): we take the k most similar users v to user u who also rated movie i. r_vi is the rating that user v gives to item i, and b_vi is the baseline rating of user v on item i predicted by the baseline model. The similarity sim(u, v) is generally the cosine similarity or the Pearson correlation coefficient.

Predicted rating (based on item-item similarity):

    r̂_ui = b_ui + [ Σ_{j ∈ N_u^k(i)} sim(i, j) · (r_uj − b_uj) ] / [ Σ_{j ∈ N_u^k(i)} sim(i, j) ]
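A minimal sketch of this predictor using Surprise's KNNBaseline, under the same trainset/testset assumption as above; sim_options selects the similarity measure and switches between the user-based and item-based variants.

```python
from surprise import KNNBaseline, accuracy

# User-user variant: neighbours are similar users.
algo_user = KNNBaseline(
    k=40,  # number of neighbours taken into account
    sim_options={"name": "pearson_baseline", "user_based": True},
)
algo_user.fit(trainset)
accuracy.rmse(algo_user.test(testset))

# Item-item variant: neighbours are similar movies.
algo_item = KNNBaseline(
    k=40,
    sim_options={"name": "pearson_baseline", "user_based": False},
)
algo_item.fit(trainset)
accuracy.rmse(algo_item.test(testset))
```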
D. Matrix Factorization SVD
The Singular Value Decomposition, or SVD for short, is a matrix decomposition technique that reduces a matrix to its constituent parts in order to make subsequent matrix calculations simpler. SVD is used widely both in the computation of other matrix operations, such as the matrix inverse, and as a data reduction technique in machine learning.

Predicted rating:

    r̂_ui = μ + b_u + b_i + q_iᵀ · p_u

q_i : representation of the item (movie) in the latent factor space.
p_u : representation of the user in the latent factor space.
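A minimal Surprise SVD sketch under the same trainset/testset assumption; n_factors controls the dimensionality of the latent space in which q_i and p_u live, and the values shown are illustrative.

```python
from surprise import SVD, accuracy

# Matrix factorization with biases: r_hat = mu + b_u + b_i + q_i . p_u
algo = SVD(n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02)
algo.fit(trainset)
accuracy.rmse(algo.test(testset))
```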
E. Matrix Factorization SVDpp
Here, an implicit rating describes the fact that a user u rated an item j, regardless of the rating value. For every item j there is an item vector y_j that captures this implicit feedback. Implicit feedback indirectly reflects opinion by observing user behavior such as purchase history, browsing history, search patterns, or even mouse movements; it usually denotes only the presence or absence of an event.

Predicted rating:

    r̂_ui = μ + b_u + b_i + q_iᵀ · ( p_u + |I_u|^(−1/2) · Σ_{j ∈ I_u} y_j )

I_u : the set of all items rated by user u.
y_j : implicit ratings.

For example, suppose a user has just checked the details of a movie and spent some time on that page; this contributes to an implicit rating. Since our data set does not tell us how long a user spent on a movie, we consider the fact that a user rated a movie at all as evidence that he or she spent some time on it, which contributes to the implicit rating. If user u is unknown, then the bias b_u and the factors p_u are assumed to be zero. The same applies for item i with b_i, q_i, and y_i.
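A matching Surprise SVD++ sketch (same trainset/testset assumption); the model learns the extra implicit-feedback vectors y_j on top of the SVD factors. The hyperparameters shown are the library defaults, not necessarily the paper's settings.

```python
from surprise import SVDpp, accuracy

# SVD++ extends SVD with implicit feedback: every item a user has
# rated contributes a y_j vector to that user's representation.
algo = SVDpp(n_factors=20, n_epochs=20, lr_all=0.007, reg_all=0.02)
algo.fit(trainset)
accuracy.rmse(algo.test(testset))
```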
V. IMPLEMENTATION

A. Reading and Storing Data
The dataset I am working with is downloaded from Kaggle: https://www.kaggle.com/Netflix-inc/Netflix-prize-data.

It consists of four .txt files, which we convert into a single .csv file. The .csv file consists of the following attributes.

TABLE I. TOP 5 ROWS OF THE DATA SET

Index      MovieID   CustID   Ratings   Date
49557332   17064     510180   2         1999-11-11
46370047   16465     510180   3         1999-11-11
22463125   8357      510180   4         1999-11-11
35237815   14660     510180   2         1999-11-11
21262258   8079      510180   2         1999-11-11
MovieID: unique identifier for the movie.
CustID: unique identifier for the customer.
Ratings: rating between 1 and 5.
Date: date on which the customer watched the movie and gave the rating.
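The conversion from the four raw .txt files into this table can be sketched as follows. This is a hedged sketch assuming the standard Netflix Prize text layout (each block starts with a line such as "1:" giving the movie id, followed by customer_id,rating,date lines) and the file names used in the Kaggle release.

```python
import pandas as pd

def parse_netflix_file(path):
    """Parse one Netflix Prize .txt file into (MovieID, CustID, Ratings, Date) rows."""
    rows = []
    movie_id = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.endswith(":"):            # e.g. "1:" starts a new movie block
                movie_id = int(line[:-1])
            elif line:
                cust_id, rating, date = line.split(",")
                rows.append((movie_id, int(cust_id), int(rating), date))
    return pd.DataFrame(rows, columns=["MovieID", "CustID", "Ratings", "Date"])

# Concatenate all four files and store them as one .csv file.
files = ["combined_data_1.txt", "combined_data_2.txt",
         "combined_data_3.txt", "combined_data_4.txt"]
df = pd.concat([parse_netflix_file(f) for f in files], ignore_index=True)
df.to_csv("ratings.csv", index=False)
```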
Once the data loading is completed, we have to check the data set for empty values using Pandas' isnull() function. In Python, and in particular in Pandas, NumPy and Scikit-Learn, missing values are marked as NaN. Values with a NaN value are ignored by operations like sum, count, etc. We can easily mark values as NaN with the Pandas DataFrame by using the replace() function on the subset of columns we are interested in. Then we need to remove duplicates; duplicates are rows that occur more than once in the given data. Here we find the duplicates and remove them using the duplicated() function.
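A short sketch of these two cleaning steps on the ratings DataFrame built above:

```python
import pandas as pd

df = pd.read_csv("ratings.csv")

# Count missing (NaN) values per column; NaN entries are ignored by
# aggregations such as sum and count.
print(df.isnull().sum())

# Drop rows that occur more than once, keeping the first occurrence.
dup_mask = df.duplicated(["MovieID", "CustID", "Ratings"])
print("Duplicate rows:", dup_mask.sum())
df = df[~dup_mask]
```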
B. Performing Exploratory Data Analysis on the Data
In statistics, exploratory data analysis is not the same as initial data analysis (IDA), which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, and on handling missing values and making transformations of variables as needed.

Fig. 1. Distribution of ratings in the data.

The above graph shows the distribution of ratings in the data set. For example, it shows that there are about 2 million ratings with a rating value of 1, and similarly for the remaining rating values.

Fig. 2. Analysis of ratings per movie.

It clearly shows that there are some movies which are very popular and were rated by many users, compared to other movies.
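The distribution in Fig. 1 can be reproduced with a simple value count; this is a minimal sketch using matplotlib on the df defined above.

```python
import matplotlib.pyplot as plt

# Number of ratings for each rating value (1-5).
rating_counts = df["Ratings"].value_counts().sort_index()
print(rating_counts)

rating_counts.plot(kind="bar")
plt.xlabel("Rating")
plt.ylabel("Number of ratings")
plt.title("Distribution of ratings in the data")
plt.show()
```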
C. Creating a User-Item Sparse Matrix for the Data
Once the data preprocessing is completed, we have to create a user-item sparse matrix for the data. The shape of the sparse matrix depends on the highest value of the user ID and the highest value of the movie ID. Then we have to find the global average of all movie ratings, the average rating per user, and the average rating per movie. Next we have to compute the similarity matrices: there are mainly two similarity matrices, user-user and item-item, and we compute both from our data set. There is also a .csv file with the movie names for the movie IDs that are present in our data set.
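A sketch of the average computations on the sparse matrix from Section III (user_item from build_user_item_matrix); the similarity matrices themselves are computed as shown earlier with cosine_similarity.

```python
import numpy as np

# user_item: sparse user-item matrix (rows = users, columns = movies).
global_avg = user_item.data.mean()               # average of all stored ratings

sums_per_user = np.asarray(user_item.sum(axis=1)).ravel()
counts_per_user = np.diff(user_item.indptr)      # ratings per user (CSR rows)
user_avg = sums_per_user / np.maximum(counts_per_user, 1)   # users with no ratings get 0

item_csr = user_item.T.tocsr()
sums_per_movie = np.asarray(item_csr.sum(axis=1)).ravel()
counts_per_movie = np.diff(item_csr.indptr)      # ratings per movie
movie_avg = sums_per_movie / np.maximum(counts_per_movie, 1)
```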


TABLE II. TOP 5 ROWS OF THE MOVIE TITLES FILE

MovieID   Year_of_Release   Movie_Title
1         2003              Dinosaur Planet
2         2004              Isle of Man TT 2004 Review
3         1997              Character
4         1994              Paula Abdul's Get Up & Dance
5         2004              The Rise and Fall of ECW

Let us check whether movie-movie similarity works. Pick a random movie and check its top 10 most similar movies. Suppose we pick the movie with MovieID 17767. The row with this particular MovieID is looked up in the movie titles file, which gives the name of the movie. Then, using the movie-movie similarity matrix, we can find the total number of ratings given to that particular movie, and it will also show the similar movies.

For example, the movie with MovieID 17767 is American Experience. The top ten similar movies for American Experience are as follows.

TABLE III. TOP 10 SIMILAR MOVIES FOR MOVIEID 17767

MovieID   Year_of_Release   Movie_Title
9044      2002              Fidel
7707      2000              Cuba Feliz
15352     2002              Fidel: The Castro Project
6906      2004              The History Channel Presents: The War of 1812
16407     2003              Russia: Land of the Tsars
5168      2003              Lawrence of Arabia: The Battle for the Arab World
7100      2005              Auschwitz: Inside the Nazi State
7522      2003              Pornografia
7663      1985              Ken Burns' America: Huey Long
17757     2002              Ulysses S. Grant: Warrior / President: America...
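A sketch of how such a top-10 lookup might be done with the item-item similarity matrix from Section III; item_item_sim and movie_titles (a DataFrame assumed to be indexed by MovieID) are the names used in the earlier sketches, not the paper's code.

```python
import numpy as np

def top_similar_movies(movie_id, item_item_sim, movie_titles, n=10):
    """Return the titles of the n movies most similar to movie_id."""
    # Similarity of this movie to every other movie, as a dense 1-D array.
    sims = np.asarray(item_item_sim[movie_id].todense()).ravel()
    sims[movie_id] = -1                      # exclude the movie itself
    top_ids = np.argsort(sims)[::-1][:n]     # indices of the n largest similarities
    # reindex tolerates ids that are missing from the titles file.
    return movie_titles["Movie_Title"].reindex(top_ids)

print(top_similar_movies(17767, item_item_sim, movie_titles))
```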

D. Applying Machine Learning Models
Before applying the models, we have to featurize the data for the regression problem. Once that is completed, we have to transform the data for the Surprise models: we cannot give raw data (movie, user, and rating) to train a model in the Surprise library. The following are the models which we apply to the data.
1) XGBoost was the first model we applied to the featurized data. When we run the model we get the RMSE and MAPE for the train and test data.

Fig. 3. Feature importance of the XGBoost model.

Here, the user average and the movie average are the most important features. RMSE and MAPE are the two error metrics used for measuring the error rate.

TABLE IV. ERROR RATES OF THE XGBOOST MODEL

         TRAIN DATA            TEST DATA
RMSE     0.8105686945196948    1.0722769984483742
MAPE     24.1616427898407      33.160274170446975
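The two error metrics reported in these tables can be computed as in the small sketch below; MAPE is expressed as a percentage, and y_test/y_pred stand for the actual and predicted ratings (for example from the XGBoost sketch in Section IV).

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between actual and predicted ratings."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent (ratings are 1-5, so no zero division)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred) / y_true) * 100

print("RMSE:", rmse(y_test, y_pred))
print("MAPE:", mape(y_test, y_pred))
```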


2) Surprise BaselineOnly was the next model we used. Here we update the train and test data with the extra baseline-only feature. When we run the baseline model we get the following output.

Fig. 4. Feature importance of the BaselineOnly model.

From the above graph we can say that the user average and the movie average are the most important features, while the baseline-only prediction is the least important feature. The error rates for the baseline model are as follows.

TABLE V. ERROR RATES OF THE BASELINEONLY MODEL

         TRAIN DATA            TEST DATA
RMSE     0.8102119017805783    1.0688807299545566
MAPE     24.16691780090332     33.334272483120664

3) Surprise KNNBaseline was the next model we applied to the data set. Here we update our data set with the features from the previous models. When we run the model we get the following output.

Fig. 5. Feature importance of the Surprise KNNBaseline model.

From the above graph we can say that the user average and the movie average are the most important features, while baseline_user is the least important feature. The error rates for this model are as follows.

TABLE VI. ERROR RATES OF THE SURPRISE KNNBASELINE MODEL

         TRAIN DATA            TEST DATA
RMSE     0.810123471320971     1.0717424411624028
MAPE     24.16132688522339     33.18525885602669

4) Matrix Factorization SVD was the next model we used, and here again we update our data set with the new features. When we run the matrix factorization SVD we get the following output. The error rates for the model are as follows.

TABLE VII. ERROR RATES OF THE MATRIX FACTORIZATION SVD MODEL

         TRAIN DATA            TEST DATA
RMSE     0.8915292018008784    1.0676633276455576
MAPE     27.929401130209502    33.39843901285594

5) Matrix Factorization SVDpp was the final model we applied to the data set, and here we update the data set with the features from the previous models. The error rates for the model are as follows.

TABLE VIII. ERROR RATES OF THE MATRIX FACTORIZATION SVDPP MODEL

         TRAIN DATA            TEST DATA
RMSE     0.7871581815662804    1.0675020897465601
MAPE     24.062040061685462    33.39327837052172

VI. RESULT ANALYSIS
Comparison of all the models.

Fig. 6. Train and test RMSE and MAPE of all models.

The above graph shows the comparison of all models with their error values.

TABLE IX. SUMMARY OF ALL THE MODELS WITH TRAIN AND TEST RMSE VALUES

S.NO   MODEL              TRAIN RMSE   TEST RMSE
0      XGBOOST            0.810569     1.07228
1      BASELINEONLY       0.881143     1.06784
2      XGB_BSL            0.810212     1.06888
3      KNNBASELINE_USER   0.304498     1.06765
4      KNNBASELINE_ITEM   0.181651     1.06765
5      XGB_BSL_KNN        0.810123     1.07174
6      SVD                0.891529     1.06766
7      SVDPP              0.787158     1.0675
8      XGB_BSL_KNN_MF     0.810568     1.0687
9      XGB_KNN_MF         1.07269      1.07276


VII. CONCLUSION
So far, our best model is SVDpp with a test RMSE of 1.0675. We are not overly concerned about the RMSE value here, because we have not trained on the whole data set; our main intention is to learn more about recommendation systems. If we took the whole data set, we would certainly obtain a better RMSE.

VIII. FUTURE ENHANCEMENT

Tune the hyperparameters of all the XGBoost models above to improve the RMSE. Here we used 10K users and 1K movies to train the above models due to the RAM limitations of my PC. In the future, I am going to run on the entire data set using cloud resources.
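As one possible starting point for this tuning, a hedged sketch using scikit-learn's GridSearchCV over a few common XGBoost parameters; the grid values are illustrative, not the paper's, and X_train/y_train are the featurized training data from Section V.

```python
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 6, 9],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    estimator=xgb.XGBRegressor(),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",  # optimize RMSE
    cv=3,
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best CV RMSE:", -search.best_score_)
```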

REFERENCES

[1] Davidsson, C., Moritz, S.: Utilizing implicit feedback and context to recommend mobile applications from first use. In: Proc. of CaRR 2011, pp. 19-22. ACM Press, New York (2011). http://dl.acm.org/citation.cfm?id=1961639
[2] Bilge, A., Kaleli, C., Yakut, I., Gunes, I., Polat, H.: A survey of privacy-preserving collaborative filtering schemes. Int. J. Softw. Eng. Knowl. Eng. 23(08), 1085-1108 (2013).
[3] Calandrino, J.A., Kilzer, A., Narayanan, A., Felten, E.W., Shmatikov, V.: "You might also like:" privacy risks of collaborative filtering. In: Proceedings of the IEEE Symposium on Security and Privacy, pp. 231-246, Oakland, CA.
[4] Okkalioglu, M., Koc, M., Polat, H.: On the discovery of fake binary ratings. In: Proceedings of the 30th Annual ACM Symposium on Applied Computing, SAC 2015, pp. 901-907. ACM, USA (2015).
[5] Kaleli, C., Polat, H.: Privacy-preserving naive Bayesian classifier based recommendations on distributed data. Comput. Intell. 31(1), 47-68 (2015).
[6] Munoz-Organero, M., Ramírez-González, G.A., Munoz-Merino, P.J., Delgado Kloos, C.: A collaborative recommender system based on space-time similarities. IEEE Pervasive Computing (2010).
[7] Peng, X., Shao, L., Li, X.: Improved collaborative filtering algorithm in the research and application of personalized movie.
