1st Harvard Project
Konstantinos Roinas
16/6/2021
INTRODUCTION
In this project we work with the MovieLens dataset (10 million ratings). We split it into two parts, edx and
validation. The first is used for training and testing; the validation part is used only at the end, to compute
the final criterion of success for our model: a low RMSE. We build our model as a continuation of the one
shown in the course lectures, adding and testing more factors (predictors), always trying to predict the
rating that a specific user gives to a specific movie.
LOADING OUR DATASET
library(tidyverse)
library(caret)
library(data.table)
dl <- tempfile()
download.file("http://files.grouplens.org/datasets/movielens/ml-10m.zip", dl)
ratings <- fread(text = gsub("::", "\t", readLines(unzip(dl, "ml-10M100K/ratings.dat"))),
                 col.names = c("userId", "movieId", "rating", "timestamp"))
movies <- str_split_fixed(readLines(unzip(dl, "ml-10M100K/movies.dat")), "\\::", 3)
colnames(movies) <- c("movieId", "title", "genres")
movies <- as.data.frame(movies) %>%
  mutate(movieId = as.numeric(levels(movieId))[movieId],
         title = as.character(title),
         genres = as.character(genres))
movielens <- left_join(ratings, movies, by = "movieId")

set.seed(1, sample.kind = "Rounding")
test_index <- createDataPartition(y = movielens$rating, times = 1, p = 0.1, list = FALSE)
edx <- movielens[-test_index,]
temp <- movielens[test_index,]

# Make sure userId and movieId in validation set are also in edx set
validation <- temp %>%
  semi_join(edx, by = "movieId") %>%
  semi_join(edx, by = "userId")

# Add rows removed from validation set back into edx set
removed <- anti_join(temp, validation)
edx <- rbind(edx, removed)
ANALYSIS
dim(edx)
## [1] 9000055 6
head(edx)
## (output truncated; only the genres column is shown)
##                           genres
## 1:                Comedy|Romance
## 2:         Action|Crime|Thriller
## 3:  Action|Drama|Sci-Fi|Thriller
## 4:       Action|Adventure|Sci-Fi
## 5: Action|Adventure|Drama|Sci-Fi
## 6:       Children|Comedy|Fantasy
The edx set has more than 9 million rows and six columns: userId (the user who rates), movieId and title
(the rated movie), genres (category), timestamp (when the rating was given) and of course rating.
We have to create train and test sets from edx. Here, departing from the Pareto 80/20 principle, we will use
a 90/10 train/test ratio in order to have a test set comparable in size to the validation set.
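The code for the split itself did not survive in this document; a minimal sketch of it, assuming the same
createDataPartition plus semi_join pattern used above for the validation split:
set.seed(1, sample.kind = "Rounding")
test_index <- createDataPartition(y = edx$rating, times = 1, p = 0.1, list = FALSE)
train_set <- edx[-test_index,]
temp <- edx[test_index,]
# keep only users and movies that also appear in the train set
test_set <- temp %>% semi_join(train_set, by = "movieId") %>% semi_join(train_set, by = "userId")
train_set <- rbind(train_set, anti_join(temp, test_set))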
CREATION OF NEW COLUMNS (VARIABLES - POSSIBLE FACTORS)
We decided to create new variables and test our model with them. Of course we apply the changes to both
sets (train and test): NRPM (number of ratings per movie), NRPU (number of ratings per user), MA (movie
age: the difference between the year of rating and the production year), MR (month of rating) and HR (hour
of rating).
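The chunks that create most of these columns were lost from this document. A sketch of how they could be
built, assuming the lubridate package for date handling (the names Rating_date_time, Rating_date and RY
match columns referenced later in the report):
library(lubridate)  # assumed; not attached by library(tidyverse) here
train_set <- train_set %>%
  mutate(Rating_date_time = as_datetime(timestamp),  # POSIX date-time of the rating
         Rating_date = as_date(Rating_date_time),
         RY = year(Rating_date),                     # rating year, needed below for MA
         MR = month(Rating_date)) %>%                # month of rating
  group_by(movieId) %>% mutate(NRPM = n()) %>% ungroup() %>%  # ratings per movie
  group_by(userId) %>% mutate(NRPU = n()) %>% ungroup()       # ratings per user
# test_set receives the same transformation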
# MY: production year, taken from the end of the title, e.g. "Boomerang (1992)"
train_set <- train_set %>% mutate(MY = substr(title, nchar(title) - 4, nchar(title) - 1))
train_set <- train_set %>% mutate(MA = RY - as.numeric(MY))  # MA: movie age at rating time
test_set <- test_set %>% mutate(MY = substr(title, nchar(title) - 4, nchar(title) - 1))
test_set <- test_set %>% mutate(MA = RY - as.numeric(MY))
To lighten our dataset, at this point we throw away the following columns, which we will not use or no
longer need: title, MY, RY.
test_set$title<-NULL
test_set$MY<-NULL
test_set$RY<-NULL
train_set$title<-NULL
train_set$MY<-NULL
train_set$RY<-NULL
So we have all our variables except the hour of rating, HR, which we will extract from the timestamp. Let us
throw away Rating_date, which is no longer needed.
train_set$Rating_date<-NULL
test_set$Rating_date<-NULL
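The HR extraction itself is not shown; a one-line sketch, again assuming lubridate:
train_set <- train_set %>% mutate(HR = hour(Rating_date_time))  # hour of rating
test_set <- test_set %>% mutate(HR = hour(Rating_date_time))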
So now we have all our variables and we can delete timestamp and Rating_date_time.
train_set$timestamp<-NULL
train_set$Rating_date_time<-NULL
test_set$timestamp<-NULL
test_set$Rating_date_time<-NULL
There is no need to use the whole edx or train_set (8-9 million rows) for these exploratory checks. Let us
use a sample.
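The sampling call does not appear in the document; one possible way to draw such a sample (the 10%
fraction is an assumption inferred from the dimensions printed below):
set.seed(1, sample.kind = "Rounding")               # assumed seed
train_set_sample <- train_set %>% sample_frac(0.1)  # ~810,000 of ~8.1 million rows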
dim(train_set_sample)
## [1] 809877 9
train_set_sample %>% group_by(NRPU) %>% summarize(AVG_RATING = mean(rating)) %>%
  ggplot(aes(NRPU, AVG_RATING)) + geom_point() + geom_smooth(method = "lm")
We see that the more ratings a user gives, the stricter the user is, which sounds plausible: from users with
few ratings averaging above 3.5, we go down to users with around 3000 ratings averaging a bit below 3.
We repeat the check for the remaining candidate factors:
train_set_sample %>% group_by(NRPM) %>% summarize(AVG_RATING = mean(rating)) %>%
  ggplot(aes(NRPM, AVG_RATING)) + geom_point() + geom_smooth(method = "lm")
train_set_sample %>% group_by(MA) %>% summarize(AVG_RATING = mean(rating)) %>%
  ggplot(aes(MA, AVG_RATING)) + geom_point() + geom_smooth(method = "lm")
train_set_sample %>% group_by(MR) %>% summarize(AVG_RATING = mean(rating)) %>%
  ggplot(aes(MR, AVG_RATING)) + geom_point() + geom_smooth(method = "lm")
train_set_sample %>% group_by(HR) %>% summarize(AVG_RATING = mean(rating)) %>%
  ggplot(aes(HR, AVG_RATING)) + geom_point() + geom_smooth(method = "lm")
Another variable we will use as a predictor is genres, since we expect the type of movie to affect the rating.
Let us find out how many unique genre combinations we have.
length(unique(edx$genres))
## [1] 797
There are almost 800 combinations over 9 million samples, so there is no reason to break genres down into
individual categories; we will use the combined string as it is.
So we are ready to build our model. Our criterion is the RMSE, which we define here:
RMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2))
}
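The chunk that builds the first, movie-effect-only prediction was lost; a minimal sketch following the
course's model (the table name movie_avgs is an assumption; the column name f_m matches later chunks):
mu <- mean(train_set$rating)                 # overall mean rating
movie_avgs <- train_set %>% group_by(movieId) %>%
  summarize(f_m = mean(rating - mu))         # per-movie deviation from mu
predicted_ratings_f_m <- test_set %>%
  left_join(movie_avgs, by = "movieId") %>%
  mutate(pred = mu + f_m) %>% .$pred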
We set to mu any possible NA in the predictions, since even one NA would destroy the RMSE calculation.
predicted_ratings_f_m[is.na(predicted_ratings_f_m)]<-mu
RMSE(predicted_ratings_f_m,test_set$rating)
## [1] 0.9441596
Adding the user effect:
user_avgs <- train_set %>% left_join(movie_avgs, by = "movieId") %>%
  group_by(userId) %>% summarize(f_u = mean(rating - mu - f_m))
predicted_ratings_f_u <- test_set %>%
  left_join(movie_avgs, by = "movieId") %>% left_join(user_avgs, by = "userId") %>%
  mutate(pred = mu + f_m + f_u) %>% .$pred
predicted_ratings_f_u[is.na(predicted_ratings_f_u)]<-mu
RMSE(predicted_ratings_f_u,test_set$rating)
## [1] 0.8659785
So here we are more or less at the level of the taught model (userId + movieId), with an RMSE of 0.8659.
Let us go further and try other factors to see whether they improve our criterion.
Adding genres
genre_avgs <- train_set %>%
  left_join(movie_avgs, by = "movieId") %>% left_join(user_avgs, by = "userId") %>%
  group_by(genres) %>% summarize(f_g = mean(rating - mu - f_m - f_u))
predicted_ratings_f_g <- test_set %>%
  left_join(movie_avgs, by = "movieId") %>% left_join(user_avgs, by = "userId") %>%
  left_join(genre_avgs, by = "genres") %>%
  mutate(pred = mu + f_m + f_u + f_g) %>% .$pred
predicted_ratings_f_g[is.na(predicted_ratings_f_g)]<-mu
RMSE(predicted_ratings_f_g,test_set$rating)
## [1] 0.8656067
Adding MA (movie age)
ma_avgs <- train_set %>%
  left_join(movie_avgs, by = "movieId") %>% left_join(user_avgs, by = "userId") %>%
  left_join(genre_avgs, by = "genres") %>%
  group_by(MA) %>% summarize(f_ma = mean(rating - mu - f_m - f_u - f_g))
predicted_ratings_f_a <- test_set %>%
  left_join(movie_avgs, by = "movieId") %>% left_join(user_avgs, by = "userId") %>%
  left_join(genre_avgs, by = "genres") %>% left_join(ma_avgs, by = "MA") %>%
  mutate(pred = mu + f_m + f_u + f_g + f_ma) %>% .$pred
predicted_ratings_f_a[is.na(predicted_ratings_f_a)]<-mu
RMSE(predicted_ratings_f_a,test_set$rating)
## [1] 0.8651742
Adding NRPU
nrpu_avgs <- train_set %>%
  left_join(movie_avgs, by = "movieId") %>% left_join(user_avgs, by = "userId") %>%
  left_join(genre_avgs, by = "genres") %>% left_join(ma_avgs, by = "MA") %>%
  group_by(NRPU) %>%
  summarize(f_nrpu = mean(rating - mu - f_m - f_u - f_g - f_ma))
predicted_ratings_f_nrpu <- test_set %>%
  left_join(movie_avgs, by = "movieId") %>% left_join(user_avgs, by = "userId") %>%
  left_join(genre_avgs, by = "genres") %>% left_join(ma_avgs, by = "MA") %>%
  left_join(nrpu_avgs, by = "NRPU") %>%
  mutate(pred = mu + f_m + f_u + f_g + f_ma + f_nrpu) %>% .$pred
predicted_ratings_f_nrpu[is.na(predicted_ratings_f_nrpu)]<-mu
RMSE(predicted_ratings_f_nrpu,test_set$rating)
## [1] 0.901913
Adding NRPU clearly worsens the RMSE, so we drop it. Let us try NRPM instead.
nrpm_avgs <- train_set %>%
  left_join(movie_avgs, by = "movieId") %>% left_join(user_avgs, by = "userId") %>%
  left_join(genre_avgs, by = "genres") %>% left_join(ma_avgs, by = "MA") %>%
  group_by(NRPM) %>%
  summarize(f_nrpm = mean(rating - mu - f_m - f_u - f_g - f_ma))
predicted_ratings_f_nrpm <- test_set %>%
  left_join(movie_avgs, by = "movieId") %>% left_join(user_avgs, by = "userId") %>%
  left_join(genre_avgs, by = "genres") %>% left_join(ma_avgs, by = "MA") %>%
  left_join(nrpm_avgs, by = "NRPM") %>%
  mutate(pred = mu + f_m + f_u + f_g + f_ma + f_nrpm) %>% .$pred
predicted_ratings_f_nrpm[is.na(predicted_ratings_f_nrpm)]<-mu
RMSE(predicted_ratings_f_nrpm,test_set$rating)
## [1] 0.8995502
Again the RMSE worsens. So we will stay with movieId, userId, genres and MA (movie age).
The only thing that remains to try is REGULARIZATION. We will apply it to all factors to suppress 'noise'.
By noise we mean, in the case of movieId for example, movies with very few ratings: regularization divides
each sum of residuals by n + lambda instead of n, so effects estimated from few ratings are shrunk toward
zero.
lambdas <- seq(0, 10, 0.25)   # penalty grid (endpoints assumed; the chosen 4.75 lies on this grid)
rmses <- sapply(lambdas, function(l){
  mu <- mean(train_set$rating)
  f_m <- train_set %>% group_by(movieId) %>%
    summarize(f_m = sum(rating - mu)/(n() + l))
  f_u <- train_set %>% left_join(f_m, by = "movieId") %>%
    group_by(userId) %>% summarize(f_u = sum(rating - mu - f_m)/(n() + l))
  f_g <- train_set %>% left_join(f_m, by = "movieId") %>% left_join(f_u, by = "userId") %>%
    group_by(genres) %>% summarize(f_g = sum(rating - mu - f_m - f_u)/(n() + l))
  f_ma <- train_set %>% left_join(f_m, by = "movieId") %>% left_join(f_u, by = "userId") %>%
    left_join(f_g, by = "genres") %>%
    group_by(MA) %>% summarize(f_ma = sum(rating - mu - f_m - f_u - f_g)/(n() + l))
  predicted_ratings <-
    test_set %>%
    left_join(f_m, by = "movieId") %>% left_join(f_u, by = "userId") %>%
    left_join(f_g, by = "genres") %>% left_join(f_ma, by = "MA") %>%
    mutate(pred = mu + f_m + f_u + f_g + f_ma) %>%
    .$pred
  predicted_ratings[is.na(predicted_ratings)] <- mu
  return(RMSE(predicted_ratings, test_set$rating))
})
qplot(lambdas, rmses)
From the plot, the further improvement is clear, bringing us to the RMSE target the project asks for.
The lambda that gives the minimum RMSE, and that minimum, are:
lambda <- lambdas[which.min(rmses)]
min(rmses)
## [1] 0.8646729
lambda
## [1] 4.75
So we reach the target RMSE of 0.86467 with lambda just below 5. This value will be used for the validation
prediction.
As we have reached our final model, it is time to prepare the validation set by bringing it to the same form
as the train and test sets. Of course there is no need to repeat the steps for NRPM and NRPU, which we do
not use in the end; the same goes for HR and MR.
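One step not shown above: RY (rating year) must first be recovered from timestamp, as was done for the
train and test sets; a sketch, assuming lubridate:
validation <- validation %>%
  mutate(Rating_date = as_date(as_datetime(timestamp)),  # calendar date of the rating
         RY = year(Rating_date))                         # rating year, needed for MA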
validation <- validation %>% mutate(MY = substr(title, nchar(title) - 4, nchar(title) - 1))
validation <- validation %>% mutate(MA = RY - as.numeric(MY))
validation$title<-NULL
validation$MY<-NULL
validation$RY<-NULL
validation$timestamp<-NULL
validation$Rating_date<-NULL
l <- lambda   # the penalty chosen on the test set
# (edx is assumed to have received the MA column in the same way as train_set)
mu <- mean(edx$rating)
f_m <- edx %>% group_by(movieId) %>%
  summarize(f_m = sum(rating - mu)/(n() + l))
f_u <- edx %>% left_join(f_m, by = "movieId") %>%
  group_by(userId) %>% summarize(f_u = sum(rating - mu - f_m)/(n() + l))
f_g <- edx %>% left_join(f_m, by = "movieId") %>% left_join(f_u, by = "userId") %>%
  group_by(genres) %>% summarize(f_g = sum(rating - mu - f_m - f_u)/(n() + l))
f_ma <- edx %>% left_join(f_m, by = "movieId") %>% left_join(f_u, by = "userId") %>%
  left_join(f_g, by = "genres") %>%
  group_by(MA) %>% summarize(f_ma = sum(rating - mu - f_m - f_u - f_g)/(n() + l))
predicted_ratings <-
  validation %>%
  left_join(f_m, by = "movieId") %>% left_join(f_u, by = "userId") %>%
  left_join(f_g, by = "genres") %>% left_join(f_ma, by = "MA") %>%
  mutate(pred = mu + f_m + f_u + f_g + f_ma) %>%
  .$pred
predicted_ratings[is.na(predicted_ratings)]<-mu
RMSE(predicted_ratings, validation$rating)
## [1] 0.8644001
One note here: we used the whole edx set for training in order to take advantage of the larger pool of
samples; there is no need to work with train_set as before.
CONCLUSIONS
We achieved an RMSE of 0.8644 on the validation set, using the known variables userId and movieId, plus
genres and a newly created variable, MA (movie age), which measures the years between production and
rating. On top of these we applied regularization to all factors, which suppresses 'noise'.
One limitation we faced is that adding more predictors is not always better; what matters is their
combination. Another was transforming the datasets to produce the experimental new variables: we had to
apply the changes separately to the train and test sets rather than to the whole edx set before splitting,
since the latter crashed R even with 8 or 12 GB of RAM. Hardware is a great limitation in machine learning,
and this points to possible future work.
Initially we tried other models from the caret package, such as knn, rpart and even random forest, but this
proved practically impossible: we sometimes waited up to three days without a result, or received memory
errors. It would be really interesting to try such models, but for a dataset this large, hardware with
8-12 GB of RAM and an i5 processor is not enough. A worthwhile challenge would be to apply these models
on a much more powerful machine and compare the results with the ones we have.