Movie Recommender System
By Group 9
Pranjal Bhardwaj 210741
Harshvardhan Pulipati 210792
Nitin Vedwal 200650
Nikita Singh 200638
CONTENTS
Introduction
Dataset
Methodology
1. Collaborative filtering methods
2. Content based filtering methods
3. Hybrid recommendation system
Exploratory data analysis
Model
1. Score calculation
2. Cosine similarity
3. User-User comparison
4. Item-Item comparison
Model in action
Conclusion
INTRODUCTION
MORE CHOICES, LESS HAPPINESS
During the last few decades, with the rise of YouTube, Amazon, Netflix and many other
such web services, recommender systems have taken up more and more space in our lives.
From e-commerce (suggesting to buyers articles that could interest them) to online
advertising, recommender systems are algorithms that suggest relevant items to users
(items being movies to watch, text to read, products to buy or anything else, depending
on the industry).
Recommender systems are critical in some industries, as they can generate a huge
amount of income when they are efficient or be a way to stand out significantly from
competitors. As evidence of that, a few years ago, Netflix organised a challenge (the
“Netflix Prize”) where the goal was to produce a recommender system that performs
better than its own algorithm, with a prize of 1 million dollars.
Dataset
The Movies Dataset
ABOUT DATASET
The dataset contains metadata on over 45,000 movies and 26 million ratings from over
270,000 users.
The dataset consists of movies released on or before July 2017. Data points include cast,
crew, plot keywords, budget, revenue, posters, release dates, languages, production
companies, countries, TMDB vote counts and vote averages.
Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.
CONTENT
This dataset consists of the following files:
movies_metadata.csv: The main Movies Metadata file. Contains information on 45,000 movies
featured in the Full MovieLens dataset. Features include posters, backdrops, budget, revenue,
release dates, languages, production countries and companies.
keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the
form of a stringified JSON Object.
credits.csv: Consists of Cast and Crew Information for all our movies. Available in the form of
a stringified JSON Object.
links.csv: The file that contains the TMDB and IMDB IDs of all the movies featured in the Full
MovieLens dataset.
links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the
Full Dataset.
ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.
Methodology
COLLABORATIVE FILTERING METHODS
Collaborative methods for recommender systems are methods that are based solely on
the past interactions recorded between users and items in order to produce new
recommendations. These interactions are stored in the so-called “user-item interaction
matrix”.
The main idea behind collaborative methods is that these past user-item
interactions are sufficient to detect similar users and/or similar items and to make
predictions based on these similarities.
The class of collaborative filtering algorithms is divided into two sub-categories that are
generally called memory based and model based approaches. Memory based
approaches work directly with the values of recorded interactions, assuming no model, and
are essentially based on nearest neighbours search (for example, find the users closest
to a user of interest and suggest the most popular items among these neighbours).
Model based approaches assume an underlying “generative” model that explains the
user-item interactions and try to learn it in order to make new predictions.
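As a rough illustration of the interaction matrix that both families of collaborative methods start from, the sketch below builds a dense user-item matrix from rating triples. The function name, the toy triples and the choice of 0 for "no interaction" are assumptions made for this example, not details taken from the project code.

```python
import numpy as np

def build_interaction_matrix(ratings):
    """Build a dense user-item matrix from (user_id, movie_id, rating) triples.

    Cells without a rating are left at 0. Returns the matrix together with
    the sorted id lists that label its rows (users) and columns (movies).
    """
    user_ids = sorted({u for u, _, _ in ratings})
    movie_ids = sorted({m for _, m, _ in ratings})
    user_index = {u: i for i, u in enumerate(user_ids)}
    movie_index = {m: j for j, m in enumerate(movie_ids)}

    matrix = np.zeros((len(user_ids), len(movie_ids)))
    for u, m, r in ratings:
        matrix[user_index[u], movie_index[m]] = r
    return matrix, user_ids, movie_ids

# Toy triples, made up for illustration only.
toy_ratings = [
    (1, 10, 4.0), (1, 20, 5.0),
    (2, 10, 3.0), (2, 30, 2.0),
    (3, 20, 4.5),
]
matrix, user_ids, movie_ids = build_interaction_matrix(toy_ratings)
print(matrix)
```

Each row is one user's interaction vector and each column is one movie's. A dense array is only practical for a small subset; a full-size ratings file would normally call for a sparse representation.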
The main advantage of collaborative approaches is that they require no information
about users or items and, so, they can be used in many situations. Moreover, the more
users interact with items, the more accurate new recommendations become: for a fixed
set of users and items, new interactions recorded over time bring new information and
make the system more effective. However, because it relies only on past interactions,
collaborative filtering suffers from the “cold start problem”: it is impossible to
recommend anything to new users or to recommend a new item to any user, and many
users or items have too few interactions to be handled well. This can be addressed in
several ways: recommending random items to new users or new items to random users
(random strategy), or recommending popular items to new users or new items to the
most active users.
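One of the cold-start fallbacks mentioned above, recommending popular items to new users, could be sketched as follows. The function name and the toy matrix are hypothetical, and "popular" is taken here to mean "rated by the most users", which is only one possible definition.

```python
import numpy as np

def recommend_popular(matrix, k=2):
    """Return the column indices of the k most-rated items (0 = not rated)."""
    interaction_counts = (matrix > 0).sum(axis=0)
    # Stable sort on negated counts: ties keep the lower column index first.
    order = np.argsort(-interaction_counts, kind="stable")
    return order[:k].tolist()

# Toy user-item matrix, made up for illustration.
toy = np.array([
    [4.0, 5.0, 0.0],
    [3.0, 0.0, 2.0],
    [0.0, 4.5, 0.0],
    [5.0, 0.0, 0.0],
])
print(recommend_popular(toy, k=2))
```

Because the function never looks at the new user's row, it can be called before that user has rated anything.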
CONTENT BASED FILTERING METHODS
Unlike collaborative methods, which rely only on the user-item interactions, content
based approaches use additional information about users and/or items. If we
consider the example of a movie recommender system, this additional information
can be, for example, the age, the sex, the job or any other personal information for
users, as well as the category, the main actors, the duration or other characteristics
for the movies (items).
The idea of content based methods is to build a model, based on the
available “features”, that explains the observed user-item interactions. Still
considering users and movies, we will try, for example, to model the fact that young
women tend to rate some movies higher, that young men tend to rate some
other movies higher, and so on. If we manage to get such a model, then making new
predictions for a user is pretty easy: we just need to look at the profile (age, sex,
…) of this user and, based on this information, determine relevant movies to
suggest.
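To make the idea of "a model based on features" concrete, here is a minimal sketch that fits a linear model predicting the rating of one movie from user features. The features (age and a binary sex flag), the numbers and the choice of ordinary least squares are all assumptions for illustration; the project itself does not specify such a model.

```python
import numpy as np

# Hypothetical user features [age, is_female] and ratings for ONE movie.
user_features = np.array([
    [20.0, 1.0],
    [25.0, 1.0],
    [22.0, 0.0],
    [40.0, 0.0],
    [35.0, 1.0],
])
ratings_for_movie = np.array([4.5, 4.0, 2.5, 3.0, 3.5])

def fit_linear_model(features, targets):
    """Least-squares fit of targets ~ features @ w + b; returns (w, b)."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coef[:-1], coef[-1]

def predict_rating(weights, bias, features):
    """Predict a rating from a feature vector, clipped to the 1-5 scale."""
    return float(np.clip(features @ weights + bias, 1.0, 5.0))

weights, bias = fit_linear_model(user_features, ratings_for_movie)
new_user = np.array([23.0, 1.0])  # a new user, described only by features
print(round(predict_rating(weights, bias, new_user), 2))
```

The prediction for `new_user` needs no rating history at all, only the profile.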
Content based methods suffer far less from the cold start problem than
collaborative approaches: new users or items can be described by their
characteristics (content), so relevant suggestions can be made for these new
entities. Only new users or items with previously unseen features will
suffer from this drawback, but once the system is old enough, this has little to no
chance of happening.
HYBRID RECOMMENDATION SYSTEM
Netflix is a good example of a company using a hybrid recommender system on its
website. It accounts for users’ interests (collaborative filtering) and also for movie
descriptions and features (content-based filtering).
EXPLORATORY DATA ANALYSIS
First things first: there is always an EDA to give us a sense of what data we are dealing
with. It is also useful for acquiring insights and information, and even for spotting errors
in the data.
INSIGHTS
Budget and revenue only slightly influence the popularity of the movies.
Most of the movies lie above the yellow line, indicating that those movies made a profit.
The words "life", "one", "find" and "love" appear frequently.
The movie industry has grown since 1930, with significant growth over the last 50 years.
The drop in total released movies at the end of the timeline most likely reflects the
dataset's July 2017 cutoff, since later years are only sparsely covered.
For this particular dataset, English tops the list for both the original and the spoken
language of the movies.
Jr. and Cedric Gibbons are the actor and the crew member, respectively, involved in the
most movies in the list.
Warner Bros., with 1194 movies, is the top production company in the list.
Many major production companies are based in the USA, so it is no surprise that the
USA is the top production country.
Movies rated 0 or 10 are mostly those with a small number of voters.
As the vote count increases, the rating mostly falls between 5 and 8.5.
The plot also shows that popular movies receive more votes.
The movie genre with the longest runtime is drama.
Action movies have larger budgets than the other genres.
One of the action movies made a vast profit compared to the others.
Vote count, budget, and popularity are the 3 dominant features that determine the
revenue of the movies.
This concludes the EDA. We now proceed to preprocessing and model building.
Model
We first compute a score for each movie from the rating given to the movie and the
overall mean rating, using a weighted-average formula.
People watch a movie not just because they see a good rating for it, but
also because of the hype around it. So taking popularity into
consideration is a wise choice.
To compute the score, we give 40% weight to the weighted average and 60% weight to
popularity, considering that people don't want to miss a hyped movie even if its reviews
and ratings are poor. These weights can be tuned.
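The scoring step can be sketched as below. The exact weighted-average formula is not reproduced in the text, so this example assumes the commonly used IMDB-style weighted rating, WR = v/(v+m) * R + m/(v+m) * C, where v is the movie's vote count, m a minimum-votes threshold, R the movie's average rating and C the overall mean rating. Scaling both terms to [0, 1] before mixing is also an assumption, as are all function names and toy numbers; only the 40%/60% weights come from the text.

```python
import numpy as np

def weighted_rating(vote_average, vote_count, min_votes, mean_rating):
    """IMDB-style weighted rating (assumed formula): shrink each movie's
    average towards the global mean, more strongly when it has few votes."""
    v, m = vote_count, min_votes
    return (v / (v + m)) * vote_average + (m / (v + m)) * mean_rating

def min_max_scale(values):
    """Rescale to [0, 1]; a constant array maps to all zeros."""
    span = values.max() - values.min()
    return (values - values.min()) / span if span > 0 else np.zeros_like(values)

def hybrid_score(vote_average, vote_count, popularity, min_votes,
                 rating_weight=0.4, popularity_weight=0.6):
    """Blend the scaled weighted rating (40%) with scaled popularity (60%)."""
    mean_rating = vote_average.mean()
    wr = weighted_rating(vote_average, vote_count, min_votes, mean_rating)
    return (rating_weight * min_max_scale(wr)
            + popularity_weight * min_max_scale(popularity))

# Toy values for three movies, made up for illustration.
vote_average = np.array([8.0, 9.5, 6.0])
vote_count = np.array([5000.0, 10.0, 800.0])
popularity = np.array([40.0, 2.0, 90.0])
scores = hybrid_score(vote_average, vote_count, popularity, min_votes=500.0)
print(np.argsort(-scores))  # movie indices from highest to lowest score
```

In this toy data the second movie has the best raw average but only 10 votes and very low popularity, so it ends up last, which is the behaviour the weighting is meant to produce.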
Using cosine similarity, we find the movies most similar to a given movie. Cosine
similarity is applied to the bag of words and other features.
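A minimal sketch of this step, assuming each movie is described by a single string of keywords (the titles, keyword strings and helper names below are invented for the example, and a plain whitespace tokenizer stands in for whatever vectorizer the project used):

```python
import numpy as np

def bag_of_words_matrix(documents):
    """Count-vectorise whitespace-separated, lower-cased tokens."""
    vocabulary = sorted({tok for doc in documents for tok in doc.lower().split()})
    index = {tok: j for j, tok in enumerate(vocabulary)}
    counts = np.zeros((len(documents), len(vocabulary)))
    for i, doc in enumerate(documents):
        for tok in doc.lower().split():
            counts[i, index[tok]] += 1
    return counts

def cosine_similarity_matrix(counts):
    """Pairwise cosine similarity between rows; zero rows get similarity 0."""
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = counts / norms
    return unit @ unit.T

def most_similar(titles, similarity, title, k=2):
    """Titles of the k movies most similar to `title`, excluding itself."""
    i = titles.index(title)
    order = [j for j in np.argsort(-similarity[i], kind="stable") if j != i]
    return [titles[j] for j in order[:k]]

titles = ["Heist A", "Heist B", "Romance C"]
soups = [
    "crime heist getaway driver music",
    "crime heist bank robbery driver",
    "love wedding paris music",
]
similarity = cosine_similarity_matrix(bag_of_words_matrix(soups))
print(most_similar(titles, similarity, "Heist A", k=1))
```

Cosine similarity depends only on the angle between the count vectors, so a movie with a long keyword list is not favoured over one with a short list.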
This time, we use a multi-objective approach that uses both implicit signals (movie
watches) and explicit signals (ratings). In the end, we can predict which movies
the user should watch, along with the rating the user is expected to give based on
historical data.
User-user
Assume that we want to make a recommendation for a given user. First, every user
can be represented by their vector of interactions with the different items (“their row” in
the interaction matrix). Then, we can compute some kind of “similarity” between our
user of interest and every other user. That similarity measure is such that two
users with similar interactions on the same items should be considered as being
close. Once similarities to all users have been computed, we can keep the
k-nearest-neighbours of our user and then suggest the most popular items among
them (only looking at the items that our reference user has not interacted with yet).
Notice that, when computing similarity between users, the number of “common
interactions” (how many items have already been considered by both users?)
should be considered carefully. Most of the time, we want to avoid that
someone who has only one interaction in common with our reference user could
have a 100% match and be considered “closer” than someone having 100
common interactions and agreeing “only” on 98% of them. So, we consider that two
users are similar if they have interacted with a lot of common items in the same
way (similar rating, similar time hovering…).
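The user-user procedure described above can be sketched as follows. Cosine similarity over co-rated items and the damping factor n / (n + shrink) are one common way to account for the number of common interactions; both choices, the value of `shrink` and the toy matrix are assumptions for this example rather than the project's actual settings.

```python
import numpy as np

def user_similarity(a, b, shrink=10.0):
    """Cosine similarity over co-rated items, damped by n / (n + shrink)
    so that a match on very few common items counts for less."""
    common = (a > 0) & (b > 0)
    n = int(common.sum())
    if n == 0:
        return 0.0
    va, vb = a[common], b[common]
    cosine = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return cosine * n / (n + shrink)

def recommend_user_user(matrix, user, k=2, n_items=2, shrink=10.0):
    """Suggest unseen items that are most popular among the k nearest users."""
    sims = np.array([
        user_similarity(matrix[user], matrix[other], shrink)
        if other != user else -np.inf
        for other in range(matrix.shape[0])
    ])
    neighbours = np.argsort(-sims, kind="stable")[:k]
    # Popularity among neighbours = how many of them interacted with the item.
    counts = (matrix[neighbours] > 0).sum(axis=0).astype(float)
    counts[matrix[user] > 0] = -1.0  # hide items the user already knows
    order = np.argsort(-counts, kind="stable")
    return [int(j) for j in order[:n_items] if counts[j] > 0]

toy = np.array([
    [5.0, 4.0, 0.0, 0.0],  # user 0 (reference user)
    [5.0, 4.0, 3.0, 0.0],  # user 1: two items in common with user 0
    [5.0, 0.0, 0.0, 4.0],  # user 2: one item in common with user 0
    [0.0, 0.0, 2.0, 5.0],  # user 3: nothing in common with user 0
])
print(recommend_user_user(toy, user=0, k=2, n_items=2))
```

Users 1 and 2 both agree perfectly with user 0 on their common items, but user 1 ranks as closer because the agreement covers two items instead of one.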
Item-item
Notice that, in order to get more relevant recommendations, we can do this for
more than just the user’s favourite item and consider the n preferred items instead.
In this case, we can recommend items that are close to several of these preferred
items.
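A possible sketch of the item-item variant with n preferred items: each unseen item is scored by its summed similarity to the user's n highest-rated items. Cosine similarity between item columns and summation as the way to combine similarities are assumptions for this example; the function names and toy matrix are hypothetical.

```python
import numpy as np

def item_similarity_matrix(matrix):
    """Cosine similarity between the item columns of a user-item matrix."""
    norms = np.linalg.norm(matrix, axis=0, keepdims=True)
    norms[norms == 0] = 1.0
    unit = matrix / norms
    return unit.T @ unit

def recommend_item_item(matrix, user, n_preferred=2, n_items=2):
    """Score unseen items by their summed similarity to the user's
    n_preferred highest-rated items."""
    similarity = item_similarity_matrix(matrix)
    seen = matrix[user] > 0
    rated = np.flatnonzero(seen)
    top = np.argsort(-matrix[user, rated], kind="stable")[:n_preferred]
    preferred = rated[top]
    scores = similarity[preferred].sum(axis=0)
    scores[seen] = -np.inf  # never re-recommend something already rated
    order = np.argsort(-scores, kind="stable")
    return [int(j) for j in order[:n_items] if np.isfinite(scores[j])]

toy = np.array([
    [5.0, 4.0, 0.0, 0.0],  # user 0 (reference user)
    [5.0, 5.0, 4.0, 0.0],
    [4.0, 5.0, 5.0, 1.0],
    [0.0, 1.0, 0.0, 5.0],
])
print(recommend_item_item(toy, user=0, n_preferred=2, n_items=2))
```

With `n_preferred=1` this reduces to recommending the neighbours of the single favourite item.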
Evaluation of a recommender system
MODEL IN ACTION
Prediction of top 10 similar movies to 'Baby Driver'
Prediction for user 123: the top 5 movies he/she is expected to like, based on his/her
ratings of other movies
If we examine the movies recommended by the model, we can see at a glance that
User 123 watches Drama movies most of the time. He/she also gives good ratings to
that genre. In our recommendation, we give 5 more Drama movies that we expect
him/her to enjoy as much as the previously watched movies.
In our dataset, User 123 has not watched any Animation movies, so it is no surprise
that the estimated rating for Minions is quite low.
CONCLUSION
In this project, we developed a movie recommender system using a hybrid approach
that combines collaborative filtering and content-based filtering. The system was able to
generate personalized recommendations for users based on their past ratings and the
ratings of similar users, as well as the movies' genres, keywords, and other features.
We evaluated the performance of our system using the root mean square error (RMSE)
metric. The RMSE score for our system was 0.9.
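For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings, so it is expressed in rating points. A minimal sketch (the toy ratings are made up and unrelated to the reported score):

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error between two equally long rating sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy held-out ratings and predictions, for illustration only.
print(rmse([4.0, 3.0, 5.0, 2.0], [3.0, 3.0, 4.0, 4.0]))
```

Because errors are squared before averaging, one large miss weighs more than several small ones.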
Overall, our movie recommender system is a feasible and effective way to generate
personalized recommendations for users. The system is able to take into account both
the users' past preferences and the movies' features to generate recommendations that
are likely to be of interest to each user.
Here are some specific findings from our experiments:
Collaborative filtering was more effective than content-based filtering for
predicting users' ratings of movies they had not yet rated. This suggests that
users' past ratings of similar movies are a good predictor of their ratings of new
movies.
The hybrid approach that combines collaborative filtering and content-based
filtering outperformed both individual approaches. This suggests that the two
approaches complement each other and can be used to generate more accurate
recommendations.
The system was able to generate personalized recommendations for users with
different preferences. For example, the system was able to recommend different
movies to users who preferred different genres.