
Proceedings of the 2nd International Conference on Inventive Communication and Computational Technologies (ICICCT 2018)

IEEE Xplore Compliant - Part Number: CFP18BAC-ART; ISBN:978-1-5386-1974-2

A Hybrid Approach for Movie Recommendation System Using Feature Engineering

S. Sathiya Devi
Department of Information Technology
University College of Engineering-BIT Campus
Tiruchirapalli, India
[email protected]

G. Parthasarathy
Department of Computer Science and Engineering
TRP Engineering College
Tiruchirapalli, India
[email protected]

Abstract: A recommender system recommends items and services to users, and its recommendations are based on prediction. Prediction performance therefore plays a vital role in the quality of recommendation. To improve the prediction performance, this paper proposes a new hybrid method based on a naïve Bayesian classifier with Gaussian correction and feature engineering. The proposed method is evaluated on the well-known MovieLens 100K data set. The results are better than those of existing methods.

Keywords: recommender systems, collaborative filtering, naïve Bayesian

I. INTRODUCTION

A Recommender System (RS) plays a vital role on the World Wide Web (WWW). The main objective of a recommender system is to produce accurate recommendations of items that the user might like, based on information collected from users. The information can be collected in two ways: (i) implicitly and (ii) explicitly. Users' browsing behavior and their demographic and profile information are collected implicitly. The ratings that users give to various products are collected explicitly [2]. Recommender systems are used in various fields such as movies, music, shopping, television, e-commerce, books and news [2]. Nowadays websites such as Amazon and Flipkart also use RS to promote their sales. RS also helps users save shopping time and effort.

There are four approaches available in recommender systems: (i) Collaborative Filtering (CF), (ii) Content Based Filtering (CBF), (iii) Hybrid filtering and (iv) Demographic filtering [2]. CF extracts similarities between users or items on the basis of information collected either explicitly or implicitly. CF can be implemented as (i) User Based CF (UBCF) and (ii) Item Based CF (IBCF) [1]. UBCF assumes that users with similar taste rate the same items similarly; based on this assumption it recommends unseen items to the active user. IBCF recommends the items with the highest correlation, i.e. based on the ratings of the items. CBF recommends products to a user by incorporating the features of the items and the individual user's purchase history. A hybrid system combines content based and collaborative filtering. Demographic filtering works on certain common personal attributes such as sex, age and country.

RS faces many challenges, viz. cold start, scalability, synonymy, gray sheep and shilling attacks [3,4]. The cold start problem occurs when a new user or item enters the system: it is hard to find similar products or items because there is a lack of information. Scalability is the problem that arises when the number of existing users and items grows enormously, so that it is difficult to find a solution with the available algorithms and computational resources. Synonymy is the problem of having different names for similar items. The gray sheep problem arises when a user's opinion does not belong to any one category, such as agree or disagree. Another issue is the shilling attack, where people give large numbers of positive comments about their own product. Though there are many challenges in RS, scalability and cold start are considered the most significant. To address these issues, this paper proposes an enhanced hybrid technique built on a model based approach.

The rest of the paper is organized as follows: Section II describes the related work and the state of the art in recommender systems. The proposed hybrid approach for improving the prediction accuracy of a recommender system is presented in Section III. Section IV discusses the experiments and results of the proposed method. The conclusion is presented in Section V.

II. RELATED WORK

This section describes the related work on model based approaches and on feature engineering. Generally, model based approaches are built using (1) probabilistic approaches, (2) Bayesian networks, (3) nearest neighbors algorithms and (4) bio-inspired algorithms (neural networks and genetic algorithms) [1,2].

A novel method for predicting users' preferences in recommendation systems is described by Antonio Hernando et al. [5]. This non-negative matrix factorization for collaborative filtering, based on a Bayesian probabilistic model, is competitive with classical matrix factorization.

Jian Wei et al. [6] proposed a method based on a deep learning neural network. The proposed method addresses the complete cold start problem (where no rating record is available) and the incomplete cold start problem (where only a small number of rating records is available). This method performed better for the complete cold start problem only.

978-1-5386-1974-2/18/$31.00 ©2018 IEEE 378



A method to improve the quality of recommendation was described by Kebin Wang and Ying Tan [7]. In this improved naïve Bayesian algorithm, conditional independence is not followed strictly. The improved algorithm faces difficulty while calculating item independence.

Lifang Ren et al. [8] described a method based on support vector machines to provide the desired services to users. This collaborative service recommendation method filters out the services which are preferred by the user. It yields comparatively better recommendation and efficiency.

A novel software process model recommendation method was presented by Qinbao Song et al. [9]. It recommends suitable project models for managers to choose for their new projects. This method used different classification algorithms, including single learners and ensemble learners, for the recommendation.

Though model based recommendations are performed with different approaches, naïve Bayesian is considered one of the best approaches with respect to prediction accuracy [7].

Though model based approaches are considered best practice for recommendation systems, they suffer from the cold start problem [6]. Hence hybrid approaches are used to overcome this issue and also to increase the prediction accuracy. Hybrid approaches are built as (i) weighted, (ii) switching, (iii) mixed, (iv) feature combination, (v) cascade, (vi) feature augmentation and (vii) meta-level [10]. Hence, this paper adopts a hybrid approach that combines naïve Bayesian and feature engineering. The literature on feature engineering follows.

Chun-Liang Li et al. proposed a feature engineering based technique for the Knowledge Discovery and Data Mining (KDD) Cup Challenge 2013. In this method, various features are generated for paper-author identification. This model created features from author information, bibliography and publication time [11].

Jianjun Xie et al. presented a method to separate a user's highly rated songs from unrated songs in a large Yahoo Music dataset. In this method, features are generated from the k nearest neighbors of the users and the k nearest neighbors of the items. The user based k nearest neighbors features were weak when compared with the item based features [12].

Motivated by the above literature, this paper proposes a hybrid method with feature engineering and naïve Bayesian with Gaussian correction. The proposed method is explained in the next section.

III. PROPOSED SYSTEM

The proposed method consists of four stages: (i) Preprocessing, (ii) Feature Engineering, (iii) Naïve Bayesian Model and (iv) Prediction. The proposed hybrid method is represented diagrammatically in Fig. 1. In this method, data cleaning and preparation of the dataset for the experiment are done in the preprocessing stage. The feature engineering stage selects attributes based on importance and creates new attributes based on constraints. In the naïve Bayesian model stage, the training data is applied and the model is created. Finally, the prediction stage predicts the ratings by applying the test data. These four stages are explained below.

A. Preprocessing

Preprocessing is a technique to improve the quality of data by removing noisy, missing and inconsistent data from large amounts of data drawn from heterogeneous sources. The MovieLens 100K dataset [15] is used as the input dataset for our proposed system. In this dataset there are many files related to user and item information, which need the preprocessing step called data integration. Apart from the rating attribute we have users' personal information. For our experiment we have taken the user related information age, gender, occupation and location. These attributes are combined with the item related information user id, movie id and rating. This integrated dataset is transformed into a data frame through data transformation. The next stage gives the procedure to generate features from the existing attributes.

Fig. 1. Proposed Hybrid based recommender system model

B. Feature Engineering

A feature selection process can be used to remove terms in the training documents that are statistically uncorrelated with the class labels [17]. For our experiment we have selected the features listed in Section III-A. To create new features we have analyzed the selected features based on the class priority. The results are tabulated in Table 1.

TABLE I. CLASS PRIORITY OF DIFFERENT FEATURES

Feature       Class priority in %
age           11.4
Zip code      21.18
gender        27.27
occupation    34.02

Based upon the class priority, four features are selected for the proposed system. Among the selected features, occupation has the highest priority and age has the lowest. New features are created for the above mentioned classes.

Fig. 2 shows the distribution of the age feature over the ratings given to the items by the various users. Here the minimum age is 7 and the maximum age is 73, and most ratings are given by users in the age range between 20 and 50. To explain this process of


feature engineering, the age attribute from the MovieLens dataset is divided into different ranges of age group. A new feature is created as a subset of the existing feature.

Fig. 2. Distribution of age

Let x1 be an existing feature that has different instances. A new feature related to x1 can be created as fe1 with different ranges. The features occupation, gender and location are analyzed similarly. These generated features are used as inputs for the naïve Bayesian classifier. The working principle of the naïve Bayesian classifier model is explained in the next section.

C. Gaussian naïve Bayesian Classifier

Bayesian classification is a kind of supervised machine learning technique where rows are represented as instances and columns are represented as attributes. The attributes are divided into independent variables and a dependent variable. The dependent variable, the class label, is predicted from the observed independent variables [19]. It is a generative model, which is commonly used for classification. By treating users as instances and items as features, we can predict the values with this classifier. In collaborative filtering any feature can be treated as the dependent variable. Consider the uth user, who has given ratings for different items. We can represent the model based on Bayes' rule in probability theory [12]:

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$

where P(c|x) is the posterior probability, P(x|c) is the likelihood, P(c) is the class prior probability and P(x) is the predictor prior probability.

Gaussian naïve Bayes (GNB) is one of the simplest classification algorithms. It assigns to each sample the class label that maximizes the posterior probability. The likelihood of the features is assumed to be Gaussian, i.e. to follow a normal distribution. Gaussian noise is statistical noise having a probability density function equal to that of the normal distribution, which is also known as the Gaussian distribution [13].

The algorithm used in the model based collaborative process to predict the ratings is as follows. In this algorithm, the independent variables X = {x1, x2, …, xn} are used to predict the dependent variable {y} using the Gaussian naïve Bayesian model [12]. The proposed method follows the algorithm, which takes age, sex, occupation and location as input and produces the rating as output.

Input: {age, sex, occupation, location}
Output: {rating}
Algorithm:
Step 1: Preprocessing
Step 2: Feature Engineering
Step 3: Model Creation
Step 4: Prediction
Step 5: Performance Measure
Step 6: End

Fig. 3. Proposed Algorithm

The various steps involved in the proposed algorithm are depicted in Fig. 3. It starts with the preprocessing step, which accepts the inputs age, sex, occupation and location. Here the input is cleaned and the various features related to user and item are combined. New features are created by analyzing each attribute separately. Next, a trained model is created from the features available from the previous step. The prediction is calculated in the next step of the algorithm, and then the performance is measured. The performance measures for prediction are discussed in the next section.

D. Performance Measures

There are three important performance measures for calculating prediction quality, viz. (i) Root Mean Square Error (RMSE), (ii) Mean Absolute Error (MAE) and (iii) Normalized Mean Absolute Error (NMAE) [14][4]. Our proposed approach concentrates on the Mean Absolute Error; a lower MAE indicates a more accurate recommender system. The formula for the MAE is given in equation 2 [16].

$$\mathrm{MAE} = \frac{1}{n} \sum_{i,j} \left| p_{i,j} - r_{i,j} \right| \tag{2}$$

where n is the total number of ratings over all users, p_{i,j} is the predicted rating for user i on item j and r_{i,j} is the actual rating. The equation is used to calculate the mean absolute error to measure the performance. The experiments and results are discussed in the next section.

IV. EXPERIMENTS AND RESULT

The proposed approach is evaluated on the MovieLens 100K dataset [15] using the Python platform. In this dataset there are 943 users and 1682 movies, and each user has rated at least 20 movies. The rating scale varies from 1 to 5 across the various genres. Features related to users and items are added, and unwanted features which are not required for the experiment are deleted. Table 2 gives the statistical information of the features for our experiment. Statistical measures such as mean, median, standard deviation and count are computed.
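As a small illustration of the statistical measures named above, the following Python sketch computes the count, mean, median and standard deviation of one feature. The `ages` list is a hypothetical sample invented for this example, not values taken from the dataset, and the use of the sample standard deviation is an assumption since the paper does not say which variant was used.

```python
import statistics

# Hypothetical sample of user ages, made up for illustration only;
# the real values would come from the MovieLens 100K user file.
ages = [24, 53, 23, 24, 33, 42, 57, 36]

summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "std": statistics.stdev(ages),  # sample standard deviation (assumed variant)
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The same four measures would be computed per feature once the user and item files have been integrated as described in the preprocessing stage.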

TABLE 2. STATISTICAL MEASURES OF MOVIE LENS 100K
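As a rough illustration of the stages described in Section III (feature engineering, model creation, prediction and performance measure), the following Python sketch bins the age attribute into ranges, fits a plain Gaussian naïve Bayes model and computes the MAE. It is only a sketch under stated assumptions: the bin edges, the numeric gender code, the toy records and the variance smoothing constant `eps` are all hypothetical, and the model is the standard Gaussian naïve Bayes rather than the paper's exact "Gaussian correction", whose details are not given.

```python
import math
from collections import defaultdict

def age_group(age, edges=(20, 30, 40, 50)):
    """Map a raw age to a range index; the edges are illustrative assumptions."""
    return sum(age >= edge for edge in edges)

def fit_gnb(X, y, eps=1e-6):
    """Estimate the prior and per-feature Gaussian mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps keeps the variance positive when a class has identical values.
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gnb(model, row):
    """Return the class that maximizes the log posterior (up to a constant)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

def mean_absolute_error(predicted, actual):
    """MAE as in equation 2: average absolute gap between predicted and actual ratings."""
    return sum(abs(p - r) for p, r in zip(predicted, actual)) / len(actual)

# Hypothetical toy records: ((age, gender code), rating).
train = [((22, 0), 5), ((25, 0), 5), ((27, 1), 4),
         ((45, 1), 2), ((52, 0), 2), ((48, 1), 3)]
test = [((24, 0), 5), ((47, 0), 3)]

X_train = [(age_group(age), gender) for (age, gender), _ in train]
y_train = [rating for _, rating in train]
model = fit_gnb(X_train, y_train)

predictions = [predict_gnb(model, (age_group(age), gender)) for (age, gender), _ in test]
mae = mean_absolute_error(predictions, [rating for _, rating in test])
print(predictions, mae)
```

In this toy run the second test record is predicted as rating 2 although its actual rating is 3, so the MAE is 0.5. In practice a library implementation such as the one in scikit-learn [18] would normally replace the hand-written model.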

The input dataset is divided so that 80% of the samples, randomly selected, form the training set and the remaining samples are used for testing. This split follows 10-fold cross validation. With the existing features, the naïve Bayesian model is built using the training set. Next, the test dataset is applied to the model and the prediction is measured. To measure the prediction performance, the mean absolute error is calculated and tabulated.

The second phase of the implementation starts with feature engineering. The feature is split into different ranges by analyzing the histogram in Section III-B. By getting input from the active user, the system is able to find the subset of the selected attribute. This training input is applied to the Gaussian naïve Bayesian classifier to construct a new model. The output ratings are predicted by applying the test dataset.

TABLE 3. MAE VALUES OF PROPOSED SYSTEM

Method              MAE
Existing Features   0.8642
fe1                 0.8755
fe2                 0.9788
fe3                 0.8480
fe4                 0.9166

The test run is conducted 100 times, keeping the training and test set split at ratios of 80/20, 70/30 and 60/40 [18]. The experiment is conducted on the Python platform and the results are tabulated in Table 3. New features are created from the existing features and they are named fe1, fe2, fe3 and fe4. The prediction performance is measured using the mean absolute error and the results are tabulated. The results show that the features related to user and item data in the MovieLens dataset play a reasonable role in predicting the rating of a movie.

Fig 4. MAE Values

The generated features are tested individually with different splits of training and test set. The feature generation experiment is first conducted on each individual feature. Then we combined the features, followed the same experimental procedure and tabulated the values. The mean absolute error values for all feature combinations are recorded and shown in Fig 4. It shows that, among the available features, gender influences the rating the most.

V. CONCLUSION

The generated features are tested individually with different splits of training and test set. Apart from generating features from individual features, we combined the features, followed the same experimental procedure and tabulated the values. The mean absolute error values for all feature combinations are recorded. They show that, among the available features, gender influences the rating the most. This indicates that the newly generated features related to users' information give better results and improve the quality of recommendation.

REFERENCES

[1] Francesco Ricci, Lior Rokach, Bracha Shapira and Paul B. Kantor, “Recommender Systems Handbook”, Springer, e-ISBN 978-0-387-85820-3, 2010, pp. 1-5.
[2] J. Bobadilla, F. Ortega, A. Hernando and A. Gutierrez, “Recommender systems survey”, Journal of Knowledge-Based Systems, 2013, pp. 103-132.
[3] J. Bobadilla, F. Ortega, A. Hernando and J. Alcalá, “Improving collaborative filtering recommender system results and performance using genetic algorithms”, Knowledge-Based Systems, 24, 2011, pp. 1310–1316.
[4] Xiaoyuan Su and Taghi M. Khoshgoftaar, “A Survey of Collaborative Filtering Techniques”, Hindawi Publishing Corporation, Advances in Artificial Intelligence, Volume 2009, Article ID 421425, 19 pages.
[5] Antonio Hernando, Jesús Bobadilla and Fernando Ortega, “A non negative matrix factorization for collaborative filtering recommender systems based on a Bayesian probabilistic model”, Knowledge-Based Systems (2016), 15 pages.
[6] Jian Wei, Jianhua He, Kai Chen, Yi Zhou and Zuoyin Tang, “Collaborative Filtering and Deep Learning Based Recommendation System For Cold Start Items”, Journal of Expert Systems with Applications (2016), 32 pages.
[7] Kebin Wang and Ying Tan, “A New Collaborative Filtering Recommendation Approach Based on Naive Bayesian Method”, ICSI 2011, Part II, LNCS 6729, 2011, pp. 218–227.
[8] L. Ren and W. Wang, “An SVM-based collaborative filtering approach for Top-N web services recommendation”, Future Generation Computer Systems (2017), http://dx.doi.org/10.1016/j.future.2017.07.027
[9] Qinbao Song, Xiaoyan Zhu, Guangtao Wang, Heli Sun, He Jiang, Chenhao Xue, Baowen Xu and Wei Song, “A Machine Learning Based Software Process Model Recommendation Method”, The Journal of Systems & Software (2016), doi: 10.1016/j.jss.2016.05.002
[10] Burke, R. User Model User-Adap Inter (2002) 12: 331. https://doi.org/10.1023/A:1021240730564
[11] Chun-Liang Li et al., “Combination of Feature Engineering and Ranking Models for Paper-Author Identification in KDD Cup 2013”, Journal of Machine Learning Research 16 (2015) 2921-2947.
[12] Jianjun Xie et al., “Feature Engineering in User's Music Preference Prediction”, JMLR: Workshop and Conference Proceedings 18:183-197, 2012.
[13] Marlis Ontivero-Ortega, Agustin Lage-Castellanos, Giancarlo Valente, Rainer Goebel and Mitchell Valdes-Sosa, “Fast Gaussian Naïve Bayes for searchlight classification analysis”, Journal of NeuroImage, 2017, pp. 1-9.
[14] J. Bobadilla, F. Ortega, A. Hernando and A. Gutierrez, “Recommender systems survey”, Journal of Knowledge-Based Systems, 2013, pp. 109-132.
[15] https://grouplens.org/datasets/movielens/100k/ last accessed on 12.12.2017
[16] Nikolaos Polatidis and Christos K. Georgiadis, “A dynamic multi-level collaborative filtering method for improved recommendations”, Computer Standards and Interfaces, 2016, 14 pages.
[17] Jiawei Han and Micheline Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann Publishers, An imprint of Elsevier, 2010, pp. 626-641.
[18] Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[19] C. C. Aggarwal, “Recommender Systems: The Textbook”, Springer International Publishing Switzerland, 2016.
