A Hybrid Approach For Movie Recommendation System Using Feature Engineering
Abstract— A recommender system recommends items and services to users, with the recommendations based on prediction. Prediction performance plays a vital role in the quality of recommendation. To improve prediction performance, this paper proposes a new hybrid method based on a naïve Bayesian classifier with Gaussian correction and feature engineering. The proposed method is evaluated on the well-known MovieLens 100K data set. The results show an improvement over existing methods.

Keywords— recommender systems, collaborative filtering, naive Bayesian

I. INTRODUCTION

A Recommender System (RS) plays a vital role on the World Wide Web (WWW). The main objective of a Recommender System is to produce accurate recommendations, from the collection of users, for items that might be liked by the user. The information can be collected in two ways: (i) implicitly and (ii) explicitly. Users' browsing behavior and their demographic and profile information are collected implicitly. The ratings that users give to various products are collected explicitly [2]. Recommender systems are used in various fields such as movies, music, shopping, television, e-commerce, books and news [2]. Websites such as Amazon and Flipkart also use RS to promote their sales. RS also helps users save shopping time and effort.

There are four approaches available in recommender systems: (i) Collaborative Filtering (CF) (ii) Content Based Filtering (CBF) (iii) Hybrid filtering (iv) Demographic filtering [2]. CF extracts similarities between users or items on the basis of either explicit or implicit information. CF can be implemented as (i) User Based CF (UBCF) and (ii) Item Based CF (IBCF) [1]. UBCF assumes that users who have similar taste rate the same items alike; based on this assumption, it recommends unseen items to the active user. IBCF recommends the items with the highest correlation, i.e. based on the ratings of the items. CBF recommends products to a user by incorporating the features of items and the individual user's purchase history. A hybrid system combines both content based and collaborative filtering. Demographic filtering works on certain common personal attributes such as sex, age and country.

RS faces many challenges, viz. cold start, scalability, synonymy, gray sheep and shilling attacks [3,4]. The cold start problem occurs when a new user or item enters the system: it is hard to find similar products or items because information is lacking. Scalability is the problem that, when the number of existing users and items grows enormously, it is difficult to find a solution with the available algorithms and computational resources. Synonymy is a problem where similar items have different names. Sometimes a user's opinion does not belong to any one category such as agree or disagree (the gray sheep problem). Another issue is the shilling attack, where people give large numbers of positive comments about their own product. Though there are many challenges in RS, scalability and cold start are considered the most significant. To address these issues, this paper proposes an enhanced hybrid technique built on a model based approach.

The rest of the paper is organized as follows: Section 2 describes the related work on the state of the art of recommender systems. The proposed hybrid approach for improving the prediction accuracy of recommender systems is presented in Section 3. Section 4 discusses the experiments and results of the proposed method. The conclusion is presented in Section 5.

II. RELATED WORK

This section describes related work on model based approaches combined with feature engineering. Generally, model based approaches use (1) probabilistic approaches, (2) Bayesian networks, (3) nearest neighbors algorithms and (4) bio-inspired algorithms (neural networks and genetic algorithms) [1,2].

Antonio Hernando et al. [5] describe a novel method for predicting users' preferences in recommendation systems. This non-negative matrix factorization collaborative filtering method, based on a Bayesian probabilistic model, is competitive with classical matrix factorization.

Jian Wei et al. [6] proposed a method based on a deep learning neural network. The proposed method addresses the complete cold start problem (where no rating records are available) and the incomplete cold start problem (where only a small number of rating records are available). This method performed better for the complete cold start problem only.
Jianjun Xie et al. presented a method to separate users' highly rated songs from unrated songs in a large Yahoo Music dataset. In this method, features are generated from the k-nearest neighbors of the users and the k-nearest neighbors of the items. The user based k-nearest neighbor features were weak compared with the item based features [12].

Motivated by the above literature, this paper proposes a hybrid method with feature engineering and naïve Bayesian with Gaussian correction. The proposed method is explained in the next section.

III. PROPOSED SYSTEM

The proposed method consists of four stages: (i) Preprocessing (ii) Feature Engineering (iii) Naïve Bayesian Model (iv) Prediction. The proposed method diagrammatically

Feature        Class priority in %
age            11.4
Zip code       21.18
gender         27.27
occupation     34.02

Based upon the class priority, four features are selected for the proposed system. Among the selected features, occupation has the highest priority and age has the lowest. New features are created for the above mentioned classes.

Fig. 2 shows the distribution of the age feature over the ratings given to items by the various users. The minimum age is 7 and the maximum age is 73, and most ratings are given by users in the age range between 20 and 50. To explain the process of feature engineering, the age attribute from the MovieLens dataset is divided into different age group ranges. A new feature is created as a subset of the existing feature.

The algorithm used in the model based collaborative process to predict the ratings is as follows. In this algorithm, the independent variables X = {x1, x2, …, xn} are used to predict the dependent variable y using the Gaussian naïve Bayesian model [12]. The proposed method follows this algorithm, taking age, sex, occupation and location as input and producing a rating as output.
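The feature engineering and prediction stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the age bin edges in `AGE_BINS`, the integer codes for gender and occupation, the toy data and all function names are assumptions made for illustration, and location is omitted for brevity. The sketch bins age into groups, fits a small Gaussian naïve Bayes model in plain Python, and scores the predictions with mean absolute error. Treating integer-coded categories as Gaussian features is a simplification.

```python
import math
from collections import defaultdict

# Hypothetical age groups; the paper does not list its exact bin edges.
AGE_BINS = [(0, 19), (20, 34), (35, 50), (51, 120)]

def age_group(age):
    """Map a raw age to the index of its (assumed) age group."""
    for idx, (lo, hi) in enumerate(AGE_BINS):
        if lo <= age <= hi:
            return idx
    raise ValueError(f"age out of range: {age}")

def fit_gnb(X, y, var_floor=1e-3):
    """Estimate the prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Floor the variance so a constant feature does not divide by zero.
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, var_floor)
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gnb(model, row):
    """Return the class with the largest log posterior (P(x) is constant)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual ratings."""
    return sum(abs(p - r) for p, r in zip(predicted, actual)) / len(actual)

# Toy rows: (age, gender code, occupation code); labels are ratings 1 to 5.
users = [(23, 0, 4), (25, 1, 4), (31, 0, 2), (47, 1, 7), (52, 0, 7), (60, 1, 9)]
ratings = [4, 4, 4, 2, 2, 2]

# Feature engineering step: replace raw age with its age group index.
X = [(age_group(a), g, o) for a, g, o in users]
model = fit_gnb(X, ratings)
predictions = [predict_gnb(model, row) for row in X]
error = mean_absolute_error(predictions, ratings)
```

On real data the same steps would be applied to the MovieLens user attributes, with the model fitted on a training split and the error measured on held-out ratings.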
$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$

where P(c|x) is the posterior probability, P(x|c) is the likelihood, P(c) is the class prior probability and P(x) is the predictor prior probability. Gaussian naïve Bayes (GNB) is one of the simplest classification algorithms. It assigns to each sample the class label that maximizes the posterior probability. The likelihood of the features is assumed to be Gaussian, i.e. to follow a normal distribution. Gaussian noise is statistical noise having a probability density function equal to that of the normal distribution, which is also known as the Gaussian distribution [13].

$$\mathrm{MAE} = \frac{1}{n} \sum_{i,j} \left| P_{i,j} - r_{i,j} \right|$$

where n is the total number of ratings over all users, $P_{i,j}$ is the predicted rating for user i on item j and $r_{i,j}$ is the actual rating. This equation calculates the mean absolute error, which is used to measure performance. The experiments and results are discussed in the next section.

IV. EXPERIMENTS AND RESULT

The proposed approach is evaluated on the MovieLens 100K dataset [15] using Python. In this dataset, there are 943 users and 1642 movies, and each user has rated at least 20 movies. The rating scale varies from 1 to 5 across the various genres. Features related to users and items are added, and unwanted features that are not required for the experiment are deleted. Table 2 gives the statistical information of the features for our experiment. The statistical