Movie_Recommendation_Report
ARTIFICIAL INTELLIGENCE
TOPIC: MOVIE RECOMMENDATION SYSTEM
GROUP 3
Team Members
Vu Thuong Tin 20230091 Chu Anh Duc 20230081
Tran Quang Hung 20235502 Phan Dinh Trung 20230093
HANOI - 11 - 2024
TABLE OF CONTENTS
CHAPTER 1. INTRODUCTION
CHAPTER 1. INTRODUCTION
Cold Start: When a new user enters the system, there is no interaction history for them, so it is difficult for the system to provide recommendations to that user. The same issue arises with a new item: because it has not yet been rated by any user, it cannot be recommended reliably. Both of these problems can be mitigated by implementing hybrid techniques.
Data Sparsity: The user-item rating matrix is very sparse. Most users rate only a small fraction of the items, so it is hard to find users who have rated the same items. With so little information about each user, producing reliable recommendations becomes difficult.
Scalability: Collaborative Filtering relies on massive amounts of data to make reliable predictions, which requires substantial computational resources. As the amount of information grows exponentially, processing becomes expensive, and this big-data challenge can also lead to inaccurate results.
2. User profile: A user profile is built based on their watch history, ratings, or
interactions. The user profile is represented by aggregating the content vectors
of the movies they have rated highly.
3. Finding similar movies: After creating the user profile, the algorithm searches for movies whose content vectors are similar to the user profile vector (usually using similarity measures such as cosine similarity). Movies with high similarity scores are recommended; a short code sketch of these steps follows this list.
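To make steps 2 and 3 above concrete, here is a minimal sketch (function and variable names are illustrative, not taken from the report) that assumes each movie is described by a fixed-length content vector, such as the 19 genre flags in MovieLens 100k:

```python
import numpy as np

def recommend_content_based(item_features, user_ratings, like_threshold=4, top_k=10):
    """Rank unseen movies by cosine similarity to the user's profile vector.

    item_features : (M, d) array with one content vector per movie (e.g., genre flags)
    user_ratings  : dict {movie_index: rating} for a single user
    """
    liked = [m for m, r in user_ratings.items() if r >= like_threshold]
    if not liked:
        return []
    # Step 2: user profile = aggregate of content vectors of highly rated movies
    profile = item_features[liked].mean(axis=0)
    # Step 3: cosine similarity between the profile and every movie
    denom = np.linalg.norm(item_features, axis=1) * np.linalg.norm(profile) + 1e-12
    sims = item_features @ profile / denom
    sims[list(user_ratings)] = -np.inf       # skip movies the user has already rated
    return np.argsort(-sims)[:top_k]         # indices of recommended movies
```

Aggregating by a simple mean is one choice; a rating-weighted average is a common alternative.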
Neighborhood-Based Collaborative Filtering: Neighborhood-based collaborative filtering is a widely used technique in recommendation systems that relies on the concept of similarity between users or items (movies) to make predictions. Unlike Content-Based Recommendation, which focuses on the attributes of movies, Collaborative Filtering works by analyzing user-item interactions (e.g., ratings) to identify patterns of shared preferences. There are two main approaches.
1. User-Based Collaborative Filtering: Focuses on identifying similar users based on their past interactions. Key idea: users who have rated or liked similar movies in the past are likely to have similar preferences in the future. Compute the similarity between the target user and other users (e.g., using cosine similarity or Pearson correlation on their rating vectors) and identify a set of neighboring users. Then, aggregate the ratings of the neighboring users to predict the target user's rating for a movie.
2. Item-Based Collaborative Filtering: Focuses on finding similar movies based on user ratings. Key idea: movies that are rated similarly by many users are likely to have similar appeal. Compute the similarity between movies and, for a given movie, find a set of neighboring movies. Then, predict the user's rating for a movie based on their ratings of similar movies.
Matrix Factorization: Matrix Factorization is a powerful and widely used
technique in recommendation systems that leverages linear algebra to predict missing
entries in a user-item interaction matrix (e.g., a matrix of user ratings for movies).
It is particularly effective for large, sparse datasets and forms the basis of many modern recommendation algorithms.
CHAPTER 2. LITERATURE REVIEW
CHAPTER 3. BASIC MATHEMATICS
\[
L(w) = \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - x_i^\top w \right)^2 + \frac{\lambda}{2} \left\| w \right\|_2^2
\]
Explanation
Least squares error
\[
\frac{1}{2N} \sum_{i=1}^{N} \left( y_i - x_i^\top w \right)^2
\]
+ This is the sum of squared errors between the actual value y_i and the predicted value x_i^T w.
\[
w^{*} = \arg\min_{w} L(w)
\]
\[
w^{*} = \left( X^\top X + \lambda I \right)^{-1} X^\top y
\]
where:
+ I: identity matrix.
+ λ: regularization coefficient.
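As a small numerical illustration of the closed-form solution above (a sketch with made-up data, not code from the report):

```python
import numpy as np

def ridge_closed_form(X, y, lam=0.1):
    """Compute w* = (X^T X + lambda * I)^(-1) X^T y as given above."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Made-up data: 5 samples with 3 features
rng = np.random.default_rng(0)
X = rng.random((5, 3))
y = rng.random(5)
w_star = ridge_closed_form(X, y)
```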
2. Utility matrix
Formally, a utility matrix Y is a matrix of size M × N , where:
+ M : Number of items.
+ N : Number of users.
+ ymn : The utility (e.g., rating, score, or preference) assigned by user n to
item m.
Key features of a utility matrix
(a) Sparse representation
In most real-world applications, the utility matrix is sparse because:
+ Not all users rate or interact with all items.
+ Many entries ymn are unknown or missing.
(b) Numerical entries
The entries in the matrix are typically numerical, such as:
+ Ratings (e.g., 1–5 stars in a movie-rating system).
+ Binary values (e.g., 0 for no interaction, 1 for interaction).
+ Implicit scores (e.g., frequency of interaction or purchase history).
(c) Approximation:
The goal of many machine learning or optimization methods (e.g., matrix
factorization) is to approximate the missing values in the utility matrix by
learning patterns from the observed entries.
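To make the sparse structure concrete, here is a minimal sketch using SciPy's sparse matrices with a few made-up (item, user, rating) triples; in practice these triples would come from u.data:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up (item, user, rating) triples; in practice they come from u.data
items   = np.array([0, 0, 1, 2, 2])   # item indices m
users   = np.array([0, 2, 1, 0, 3])   # user indices n
ratings = np.array([5, 3, 4, 2, 1])   # observed utilities y_mn

M, N = 3, 4                           # M items x N users
# Sparse utility matrix Y: only the observed entries are stored
Y = csr_matrix((ratings, (items, users)), shape=(M, N))
print(Y.toarray())                    # unobserved entries print as 0 (missing)
```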
3.2 Neighborhood-Based Collaborative Filtering
1. Similarity function: A similarity function is an important concept in machine learning, data processing, and problems involving the comparison of two objects. It measures the degree of similarity between two objects (e.g., vectors, data points, or sets) based on specific criteria.
2. General formula: A similarity function sim(a, b) takes two objects a and b as input and returns a value within a defined range.
Cosine similarity: Measures the angle between two vectors in a feature space.
It is calculated by:
\[
\mathrm{sim}_{\text{cosine}}(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}
\]
The value lies in the range [−1, 1].
3. Prediction rating
Formula
The general formula for the predicted rating r̂ui in a recommendation system
is:
r̂ui = f (u, i)
Where:
+ r̂ui is the predicted rating for user u on item i.
+ f (u, i) is a function that calculates the predicted rating based on user u’s
preferences and item i’s characteristics.
In the context of different methods, the formula can vary (a short code sketch of both variants is given after the formulas below):
User-User Collaborative filtering
P
v∈N (u) sim(u, v) · rvi
r̂ui = P
v∈N (u) |sim(u, v)|
Where:
+ N (u) is the set of users most similar to user u.
+ sim(u, v) is the similarity between the user u and the user v .
+ rvi is the rating that user v gave to item i.
Item-Item Collaborative filtering
P
j∈N (i) sim(i, j) · ruj
r̂ui = P
j∈N (i) |sim(i, j)|
Where:
+ N (i) is the set of items most similar to item i.
+ sim(i, j) is the similarity between item i and item j .
+ ruj is the rating that user u gave to item j .
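The following is a minimal sketch of both prediction rules (illustrative names; it assumes a dense rating matrix R with 0 marking missing ratings and non-negative cosine similarities):

```python
import numpy as np

def cosine_sim_matrix(A):
    """Pairwise cosine similarity between the rows of A."""
    U = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    return U @ U.T

def predict_user_user(R, u, i, k=10):
    """Predict r_ui from the k most similar users who rated item i."""
    sims = cosine_sim_matrix(R)[u]
    sims[u] = 0.0                                       # exclude the target user
    raters = np.where(R[:, i] > 0)[0]                   # users who rated item i
    neighbors = raters[np.argsort(-sims[raters])][:k]   # N(u)
    if neighbors.size == 0:
        return 0.0
    return np.sum(sims[neighbors] * R[neighbors, i]) / (np.sum(np.abs(sims[neighbors])) + 1e-12)

def predict_item_item(R, u, i, k=10):
    """Predict r_ui from the k items most similar to i that user u has rated."""
    sims = cosine_sim_matrix(R.T)[i]
    sims[i] = 0.0                                       # exclude the target item
    rated = np.where(R[u, :] > 0)[0]                    # items rated by user u
    neighbors = rated[np.argsort(-sims[rated])][:k]     # N(i)
    if neighbors.size == 0:
        return 0.0
    return np.sum(sims[neighbors] * R[u, neighbors]) / (np.sum(np.abs(sims[neighbors])) + 1e-12)
```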
3.3 Matrix Factorization
\[
\frac{\partial \mathrm{Loss}}{\partial p_{ik}} = -2 \sum_{j \in \kappa_i} \left( r_{ij} - p_i^\top q_j \right) q_{jk} + 2 \lambda p_{ik}
\]
Where:
+ κ_i is the set of items rated by user i
+ r_ij is the actual rating for user i and item j
+ p_i^T q_j is the predicted rating for user i and item j
+ λ is the regularization parameter
+ q_jk is the feature of item j corresponding to the latent factor k
The update rule for p_ik is:
\[
p_{ik} := p_{ik} - \eta \frac{\partial \mathrm{Loss}}{\partial p_{ik}}
\]
\[
\frac{\partial \mathrm{Loss}}{\partial q_{jk}} = -2 \sum_{i \in \kappa_j} \left( r_{ij} - p_i^\top q_j \right) p_{ik} + 2 \lambda q_{jk}
\]
Where:
+ κ_j is the set of users who rated item j
+ p_i^T q_j is the predicted rating for user i and item j
+ p_ik is the feature of user i corresponding to the latent factor k
+ λ is the regularization parameter
The update rule for qjk is:
\[
q_{jk} := q_{jk} - \eta \frac{\partial \mathrm{Loss}}{\partial q_{jk}}
\]
The predicted rating is then:
\[
\hat{r}_{ui} = p_u^\top q_i
\]
Where:
+ p_u is the latent feature vector for user u
+ q_i is the latent feature vector for item i
+ p_u^T q_i is the dot product between the feature vectors of the user and the item
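A compact stochastic-gradient sketch of the update rules and prediction above (hyperparameters and names are illustrative, not taken from the report):

```python
import numpy as np

def matrix_factorization_sgd(ratings, n_users, n_items, K=20,
                             lr=0.01, lam=0.1, epochs=20, seed=0):
    """Learn latent factors P (users) and Q (items) from (user, item, rating)
    triples using the gradient updates derived above."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, K))   # p_i: user latent vectors
    Q = rng.normal(scale=0.1, size=(n_items, K))   # q_j: item latent vectors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]                  # r_ij - p_i^T q_j
            # Gradient steps (the factor 2 is folded into the learning rate)
            P[u] += lr * (err * Q[i] - lam * P[u])
            Q[i] += lr * (err * P[u] - lam * Q[i])
    return P, Q

def predict(P, Q, u, i):
    """r_hat_ui = p_u^T q_i"""
    return P[u] @ Q[i]
```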
CHAPTER 4. OVERVIEW
4.1 Abstract
This study outlines a comprehensive methodology pipeline for building and
deploying a recommendation system using the MovieLens 100k dataset. The
pipeline is divided into three major stages: Data Processing, Model Training, and Deployment.
The recommendation algorithms used in this research are Content-based Filtering,
Collaborative Filtering, and Matrix Factorization. Each stage plays a crucial role
in ensuring the effectiveness, accuracy, and scalability of the recommendation
system. Below is a detailed description of each step in the methodology.
4.2 Data Processing
We will use the MovieLens 100k dataset: https://files.grouplens.org/datasets/movielens/ml-100k.zip
1. Data Collection
Here are brief descriptions of the data files.
• u.data – The full u data set, 100000 ratings by 943 users on 1682
items. Each user has rated at least 20 movies. Users and items are
numbered consecutively from 1. The data is randomly ordered. This
is a tab-separated list of:
user id|item id|rating|timestamp
• u.info – The number of users, items, and ratings in the u data set.
• u.item – Information about the items (movies); this is a tab-separated
list of:
movie id|movie title|release date|video release date|
IMDb URL|unknown|Action|Adventure|Animation|
Children’s|Comedy|Crime|Documentary|Drama|Fantasy|
Film-Noir|Horror|Musical|Mystery|Romance|Sci-Fi|
Thriller|War|Western
The last 19 fields are the genres, a 1 indicates the movie is of that
genre, a 0 indicates it is not; movies can be in several genres at once.
The movie ids are the ones used in the u.data file.
The test set contains 9,430 ratings. The training data is used to build and train the model, while the test data is used to evaluate its performance on unseen ratings. The data was read using Python's pandas library with tab-separated values and then converted into NumPy arrays for efficient computation. This split ensures that the model is tested for generalization rather than memorization of the training data, providing a reliable estimate of its performance on new, unseen data.
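A short sketch of this data-loading step, assuming the ua.base / ua.test files from the MovieLens 100k archive are used as the train/test split (the report does not name the exact files, so this is an assumption); u.item is delimited by the "|" characters shown in the field list above:

```python
import pandas as pd

rating_cols = ["user_id", "item_id", "rating", "timestamp"]

# Tab-separated rating files: user id | item id | rating | timestamp
train = pd.read_csv("ml-100k/ua.base", sep="\t", names=rating_cols)
test = pd.read_csv("ml-100k/ua.test", sep="\t", names=rating_cols)

# u.item: movie metadata; the last 19 columns are binary genre flags
item_cols = ["movie_id", "title", "release_date", "video_release_date",
             "imdb_url"] + [f"genre_{g}" for g in range(19)]
items = pd.read_csv("ml-100k/u.item", sep="|", names=item_cols,
                    encoding="latin-1")

# Convert to NumPy arrays for efficient computation
train_ratings = train[["user_id", "item_id", "rating"]].to_numpy()
test_ratings = test[["user_id", "item_id", "rating"]].to_numpy()
item_genres = items[[f"genre_{g}" for g in range(19)]].to_numpy()
```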
4.3 Model Training - Algorithms
Three algorithms are used:
1. Content-Based Recommendations
Problem: Suppose the number of users is N , the number of items is M , and
the utility matrix is represented by matrix Y. The entry in row m, column
n of Y is the level of interest (e.g., rating) of user n for item m, which the
system has collected. The matrix Y is sparse and contains many missing
entries corresponding to values the system needs to predict.
Additionally, let R be the rated-or-not matrix, which indicates whether a user has rated an item. Specifically, r_ij equals 1 if item i has been rated by user j, and 0 otherwise.
Linear Model
Suppose we can find a model for each user, represented by a column vector
of coefficients wn and a bias term bn , such that the level of interest of a user
for an item can be calculated by a linear function:
\[
y_{mn} = x_m w_n + b_n \tag{4.1}
\]
\[
L_n = \frac{1}{2} \sum_{m:\, r_{mn} = 1} \left( x_m w_n + b_n - y_{mn} \right)^2 + \frac{\lambda}{2} \left\| w_n \right\|_2^2 \tag{4.2}
\]
Here, the second term is the regularization term, with λ being a positive
hyperparameter. Note that regularization is usually not applied to bn . In
practice, the average error is often used, and the loss Ln can be rewritten as:
\[
L_n = \frac{1}{2 s_n} \sum_{m:\, r_{mn} = 1} \left( x_m w_n + b_n - y_{mn} \right)^2 + \frac{\lambda}{2 s_n} \left\| w_n \right\|_2^2 \tag{4.3}
\]
Where sn is the number of items that user n has rated. In other words:
\[
s_n = \sum_{m=1}^{M} r_{mn}, \tag{4.4}
\]
which is the sum of the elements in column n of the rated-or-not matrix R.
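One way to fit the per-user model in Eqs. (4.2)-(4.4) is a separate ridge regression for each user over the items that user has rated. The sketch below assumes an item feature matrix X (e.g., TF-IDF genre features), the utility matrix Y, and the rated-or-not matrix R, and uses scikit-learn's Ridge as one implementation choice (not necessarily the report's):

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_user_models(X, Y, R, lam=1.0):
    """Fit (w_n, b_n) for every user n by ridge regression on the items they rated.

    X : (M, d) item feature matrix
    Y : (M, N) utility matrix of ratings
    R : (M, N) rated-or-not matrix (1 if rated, 0 otherwise)
    """
    M, N = Y.shape
    d = X.shape[1]
    W = np.zeros((d, N))
    b = np.zeros(N)
    for n in range(N):
        rated = R[:, n] == 1
        if rated.sum() == 0:
            continue                         # no ratings: leave w_n = 0, b_n = 0
        # fit_intercept=True leaves the bias b_n unregularized, as noted above
        model = Ridge(alpha=lam, fit_intercept=True).fit(X[rated], Y[rated, n])
        W[:, n], b[n] = model.coef_, model.intercept_
    return W, b
```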
\[
y \approx x^\top w
\]
Using this formulation, the Utility Matrix Y, assuming all values are filled,
is approximated as:
\[
Y \approx
\begin{bmatrix}
x_1^\top w_1 & x_1^\top w_2 & \cdots & x_1^\top w_N \\
x_2^\top w_1 & x_2^\top w_2 & \cdots & x_2^\top w_N \\
\vdots & \vdots & \ddots & \vdots \\
x_M^\top w_1 & x_M^\top w_2 & \cdots & x_M^\top w_N
\end{bmatrix}
=
\begin{bmatrix}
x_1^\top \\ x_2^\top \\ \vdots \\ x_M^\top
\end{bmatrix}
\begin{bmatrix}
w_1 & w_2 & \cdots & w_N
\end{bmatrix}
= XW
\]
\[
L(W) = \frac{1}{2s} \sum_{n=1}^{N} \sum_{m:\, r_{mn} = 1} \left( y_{mn} - x_m^\top w_n \right)^2 + \frac{\lambda}{2} \| W \|_F^2
\]
\[
L(X) = \frac{1}{2s} \sum_{n=1}^{N} \sum_{m:\, r_{mn} = 1} \left( y_{mn} - x_m^\top w_n \right)^2 + \frac{\lambda}{2} \| X \|_F^2
\]
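The two losses above suggest alternating optimization: fix X and take a step on W, then fix W and take a step on X. A minimal gradient-descent sketch, assuming dense arrays Y (ratings) and R (binary mask of observed entries) and illustrative hyperparameters:

```python
import numpy as np

def alternating_gd(Y, R, K=10, lam=0.1, lr=0.01, epochs=50, seed=0):
    """Alternate gradient steps on X (item features) and W (user weights)
    so that Y ~= X @ W on the observed entries marked by R."""
    rng = np.random.default_rng(seed)
    M, N = Y.shape
    X = rng.normal(scale=0.1, size=(M, K))   # item feature matrix
    W = rng.normal(scale=0.1, size=(K, N))   # user weight matrix
    s = R.sum()                              # number of observed ratings
    for _ in range(epochs):
        E = (X @ W - Y) * R                  # error on observed entries only
        W -= lr * ((X.T @ E) / s + lam * W)  # step on W with X fixed
        E = (X @ W - Y) * R
        X -= lr * ((E @ W.T) / s + lam * X)  # step on X with W fixed
    return X, W
```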
CHAPTER 5. NUMERICAL RESULTS
Mean Absolute Error (MAE):
\[
\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|
\]
TF-IDF
1. TF-IDF (Term Frequency - Inverse Document Frequency) measures the importance of a term t in a document d within a corpus. It is calculated as:
\[
\mathrm{TF}(t, d) = \frac{\text{Frequency of } t \text{ in } d}{\text{Total terms in } d}
\]
\[
\mathrm{IDF}(t) = \log\!\left( \frac{N + 1}{1 + \mathrm{DF}(t)} \right) + 1
\]
Where:
+ N : Total documents in the corpus.
+ DF(t) : Number of documents containing t
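A small sketch implementing the MAE and TF-IDF formulas above directly in NumPy (the tokenized "documents" here are made-up genre lists, purely for illustration):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted ratings."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def tf_idf(docs):
    """TF-IDF matrix for a list of tokenized documents, following the formulas
    above (term frequency normalized by document length, smoothed IDF)."""
    vocab = sorted({t for d in docs for t in d})
    index = {t: j for j, t in enumerate(vocab)}
    N = len(docs)
    tf = np.zeros((N, len(vocab)))
    for i, d in enumerate(docs):
        for t in d:
            tf[i, index[t]] += 1
        tf[i] /= max(len(d), 1)              # frequency of t in d / total terms in d
    df = (tf > 0).sum(axis=0)                # DF(t): documents containing t
    idf = np.log((N + 1) / (1 + df)) + 1     # smoothed IDF
    return tf * idf, vocab

# Hypothetical genre "documents" for three movies
docs = [["Action", "Adventure"], ["Comedy"], ["Action", "Comedy", "Crime"]]
weights, vocab = tf_idf(docs)
```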
CHAPTER 6. CONCLUSIONS
6.1 Summary
In this paper, a Movie Recommendation System is developed using the MovieLens
100k dataset, which contains 100,000 movie ratings from 943 users on 1,682
movies. The system implements three key recommendation algorithms:
1. Content-based Recommendations: This algorithm recommends movies to
users based on the features of the movies and the user’s past preferences.
Features such as genre, director, and cast are used to calculate similarities
between movies.
2. Neighborhood-based Collaborative Filtering: This algorithm makes predictions
based on the preferences of similar users. It identifies users who have similar
movie preferences and recommends movies that those users have rated highly.
3. Matrix Factorization (e.g., Singular Value Decomposition - SVD): Matrix
factorization techniques break down the user-item interaction matrix into lower-
dimensional matrices representing latent features of users and movies. This
method captures hidden patterns in user preferences and movie characteristics
that are not immediately apparent from the data.
Dataset and Algorithm Challenges
While the MovieLens 100k dataset is a valuable resource, it has some inherent limitations:
+ Sparsity: The dataset is sparse, meaning many users have rated only a small fraction of available movies, which makes it difficult to make accurate predictions for unseen items.
+ Cold Start Problem: New users or new movies without sufficient ratings make it challenging to generate meaningful recommendations. This issue affects all three algorithms.
+ Bias in Data: The dataset contains bias in terms of user activity (some users rate far more movies than others), which can impact the fairness of recommendations.
6.2 Suggestion for Future Works
To enhance the Movie Recommendation System, integrating Deep Learning
models can help to better capture complex patterns in user preferences and item
relationships. Here are several deep learning approaches that can improve the
system:
1. Deep Neural Networks (DNNs): DNNs can learn latent representations of
users and items by encoding user-item interactions. This can overcome some
limitations of traditional methods like matrix factorization, particularly in