JOU Classification of sentiment reviews using n-gram machine learning

The paper discusses the classification of sentiment reviews using various machine learning algorithms, specifically focusing on Naive Bayes, Maximum Entropy, Stochastic Gradient Descent, and Support Vector Machine. It highlights the importance of processing unstructured data from social media and online reviews into structured formats for effective sentiment analysis. The study evaluates the performance of these algorithms using metrics such as precision, recall, f-measure, and accuracy on the IMDb dataset, demonstrating improved accuracy compared to previous studies.


Expert Systems With Applications 57 (2016) 117–126

Contents lists available at ScienceDirect

Expert Systems With Applications


journal homepage: www.elsevier.com/locate/eswa

Classification of sentiment reviews using n-gram machine learning approach

Abinash Tripathy∗, Ankit Agrawal, Santanu Kumar Rath
Department of Computer Science and Engineering, National Institute of Technology Rourkela, India

Article info

Article history:
Received 3 March 2015
Revised 15 March 2016
Accepted 16 March 2016
Available online 24 March 2016

Keywords:
Sentiment analysis
Naive Bayes (NB)
Maximum Entropy (ME)
Stochastic Gradient Descent (SGD)
Support Vector Machine (SVM)
N-gram
IMDb dataset

Abstract

With the ever increasing number of social networking and online marketing sites, the reviews and blogs obtained from them act as an important source for further analysis and improved decision making. These reviews are mostly unstructured by nature and thus need processing such as classification or clustering to provide meaningful information for future use. These reviews and blogs may be classified into different polarity groups such as positive, negative, and neutral in order to extract information from the input dataset. Supervised machine learning methods help to classify these reviews. In this paper, four different machine learning algorithms, namely Naive Bayes (NB), Maximum Entropy (ME), Stochastic Gradient Descent (SGD), and Support Vector Machine (SVM), have been considered for classification of human sentiments. The different methods are critically examined in order to assess their performance on the basis of parameters such as precision, recall, f-measure, and accuracy.

© 2016 Elsevier Ltd. All rights reserved.

∗ Corresponding author. Tel.: +91 9437124235
E-mail addresses: [email protected] (A. Tripathy), [email protected] (A. Agrawal), [email protected] (S.K. Rath).
http://dx.doi.org/10.1016/j.eswa.2016.03.028
0957-4174/© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

Sentiment analysis, also known as opinion mining, analyzes people’s opinions as well as emotions towards entities such as products, organizations, and their associated attributes. In the present day scenario, social media play a pertinent role in providing information about any product through reviews, blogs, and comments. In order to derive meaningful information from people’s sentiments, different machine learning techniques are applied by scholars and practitioners (Liu, 2012).

Sentiment analysis is carried out at three different levels: document level, sentence level, and aspect level (Feldman, 2013). Document level classifies whether the document’s opinion is positive, negative or neutral. Sentence level determines whether the sentence expresses a negative, positive or neutral opinion. Aspect level focuses on all expressions of sentiment present within a given document and the aspect to which each refers. In this study, document level sentiment analysis has been taken into consideration.

There are mainly two types of machine learning techniques that are very often used in sentiment analysis, i.e., techniques based on supervised and unsupervised learning. In the supervised learning technique, the dataset is labeled and thus trained to obtain a reasonable output, which helps in proper decision making (Gautam & Yadav, 2014). Unlike supervised learning, unsupervised learning does not need any labeled data; hence such data cannot be processed with ease. In order to solve the problem of processing unlabeled data, clustering algorithms are used (Hastie, Tibshirani, & Friedman, 2009). This study presents the impact of supervised learning methods on labeled data.

The movie reviews are mostly in text format and unstructured in nature. Thus, the stop words and other unwanted information are removed from the reviews before further analysis. The reviews then go through a process of vectorization, in which the text data are converted into a matrix of numbers. These matrices are then given as input to different machine learning techniques for classification of the reviews. Different parameters are then used to evaluate the performance of the machine learning algorithms.

The main contribution of the paper can be stated as follows:

i. Different machine learning algorithms are proposed for the classification of movie reviews of the IMDb dataset (IMDb, 2011) using n-gram techniques, viz., unigram, bigram, trigram, combination of unigram and bigram, bigram and trigram, and unigram and bigram and trigram.
ii. Four different machine learning techniques, namely Naive Bayes (NB), Maximum Entropy (ME), Support Vector Machine (SVM), and Stochastic Gradient Descent (SGD), are used for classification using the n-gram approach.
iii. The performance of the machine learning techniques is evaluated using parameters like precision, recall, f-measure, and

accuracy. The results obtained in this paper indicate higher values of accuracy when compared with studies made by other authors.

The structure of the paper is as follows: Section 2 presents the literature survey. Section 3 describes the methodology of the classification algorithms and their details. In Section 4, the proposed approach is explained. Section 5 covers the implementation of the proposed approach. In Section 6, performance evaluation of the proposed approach is carried out. The last section, i.e., Section 7, concludes the paper and presents the scope for future work.

2. Literature survey

The literature on sentiment analysis indicates that a good amount of study has been carried out by various authors on document level sentiment classification.

2.1. Document level sentiment classification

Pang et al. have considered sentiment classification as a categorization study with positive and negative sentiments (Pang, Lee, & Vaithyanathan, 2002). They have undertaken the experiment with three different machine learning algorithms, namely NB, SVM, and ME. The classification process is undertaken using n-gram techniques like unigram, bigram, and the combination of both unigram and bigram. They have used a bag-of-words feature framework to implement the machine learning algorithms. As per their analysis, the NB algorithm shows the poorest result among the three algorithms and the SVM algorithm yields the most convincing result.

Salvetti et al. have discussed the Overall Opinion Polarity (OvOp) concept using machine learning algorithms such as NB and Markov model for classification (Salvetti, Lewis, & Reichenbach, 2004). In their paper, the hypernyms provided by WordNet and Part Of Speech (POS) tags act as lexical filters for classification. Their experiment shows that the result obtained by the WordNet filter is less accurate than that of the POS filter. In the field of OvOp, accuracy is given more importance than recall. The authors presented a system where they rank reviews based on a function of probability. According to them, their approach shows better results in the case of web data.

Beineke et al. have used the NB model for sentiment classification. They have extracted pairs of derived features which are linearly combinable to predict the sentiment (Beineke, Hastie, & Vaithyanathan, 2004). In order to improve the accuracy, they have added additional derived features to the model and used labeled data to estimate their relative influence. They have followed the approach of Turney, which effectively generates a new corpus of labeled documents from the existing documents (Turney, 2002). This idea allows the system to act as a probability model which is linear on the logistic scale. The authors have chosen five positive and five negative words as anchor words, which produce 25 possible pairs, and they used them for the coefficient estimation.

Mullen and Collier have applied the SVM algorithm for sentiment analysis, where values are assigned to a few selected words and then combined to form a model for classification (Mullen & Collier, 2004). Along with this, different classes of features having closeness to the topic are assigned favorable values, which help in classification. The authors have presented a comparison of their proposed approach with data having topic annotation and hand annotation. The proposed approach has shown better results in comparison with topic annotation, whereas the results need further improvement when compared with hand annotated data.

Dave et al. have used a tool for synthesizing reviews, then shifted them and finally sorted them using aggregation sites (Dave, Lawrence, & Pennock, 2003). These structured reviews are used for testing and training. From these reviews features are identified, and finally scoring methods are used to determine whether the reviews are positive or negative. They have used a classifier to classify the sentences obtained from web search through a search query using the product name as search condition.

Matsumoto et al. have used the syntactic relationship among words as a basis of document level sentiment analysis (Matsumoto, Takamura, & Okumura, 2005). In their paper, frequent word sub-sequences and dependency sub-trees are extracted from sentences, which act as features for the SVM algorithm. They extract unigrams, bigrams, word sub-sequences and dependency sub-trees from each sentence in the dataset. They used two different datasets for conducting the classification, i.e., the IMDb dataset (IMDb, 2011) and the Polarity dataset (Pang & Lee, 2004). In the case of the IMDb dataset, the training and testing data are provided separately, but for the Polarity dataset a 10-fold cross validation technique is considered for classification, as there is no separate data designated for testing or training.

Zhang et al. have proposed the classification of Chinese comments based on word2vec and SVMperf (Zhang, Xu, Su, & Xu, 2015). Their approach is based on two parts. In the first part, they have used the word2vec tool to cluster similar features in order to capture the semantic features in the selected domain. Then, in the second part, lexicon based and POS based feature selection approaches are adopted to generate the training data. The word2vec tool adopts the Continuous Bag-of-Words (CBOW) model and the continuous skip-gram model to learn the vector representation of words (Mikolov, Chen, Corrado, & Dean, 2013). SVMperf is an implementation of SVM for multi-variate performance measures, which follows an alternative structural formulation of the SVM optimization problem for binary classification (Joachims, 2006).

Liu and Chen have proposed different multi-label classification methods for sentiment classification (Liu & Chen, 2015). They have compared eleven multi-label classification methods on two micro-blog datasets and have also used eight different evaluation metrics for analysis. Apart from that, they have also used three different sentiment dictionaries for multi-label classification. According to the authors, multi-label classification performs the task mainly in two phases, i.e., problem transformation and algorithm adaptation (Zhang & Zhou, 2007). In the problem transformation phase, the problem is transformed into multiple single-label problems. During the training phase, the system learns from these transformed single-label data, and in the testing phase, the learned classifier makes a prediction for a single label and then translates it to multiple labels. In algorithm adaptation, the data is transformed as per the requirement of the algorithm.

Luo et al. have proposed an approach to convert the text data into a low dimension emotional space (ESM) (Luo, Zeng, & Duan, 2016). They have annotated small size words, which have definite and clear meaning. They have also used Paul Ekman’s research to classify the words into six basic categories, namely anger, fear, disgust, sadness, happiness and surprise (Ekman & Friesen, 1971). They have further considered two different approaches for assigning weight to words by emotional tags. The total weight of all emotional tags is calculated and, based on these values, the messages are classified into different groups. Although their approach yields a reasonably good result for a stock message board, the authors claim that it can be applied to any dataset or domain.

Niu et al. have proposed a Multi-View Sentiment Analysis (MVSA) dataset, including a set of image-text pairs with manual annotation collected from Twitter (Niu, Zhu, Pang, & El Saddik, 2016). Their approach to sentiment analysis can be categorized into two parts, i.e., lexicon based and statistical learning. In the case of lexicon based analysis, a set of opinion words or phrases are

considered, which have pre-defined sentiment scores, while in statistical learning, various machine learning techniques are used with dedicated textual features.

Table 1 provides a comparative study of the different approaches adopted by different authors who have contributed to sentiment classification.

Table 1
Comparison of sentiment techniques.

Author: Pang et al. (2002)
  Approach: Classify the dataset using different machine learning algorithms and n-gram model
  Algorithm used: Naive Bayes (NB), Maximum Entropy (ME), Support Vector Machine (SVM)
  Obtained result (accuracy %): Unigram: SVM (82.9), Bigram: ME (77.4), Unigram + Bigram: SVM (82.7)
  Dataset used: Internet Movie Database (IMDb)

Author: Salvetti et al. (2004)
  Approach: Assessed overall opinion polarity (OvOp) concept using machine learning algorithms
  Algorithm used: Naive Bayes (NB) and Markov Model (MM)
  Obtained result (accuracy %): NB: 79.5, MM: 80.5
  Dataset used: Internet Movie Database (IMDb)

Author: Beineke et al. (2004)
  Approach: Linearly combinable paired features are used to predict the sentiment
  Algorithm used: Naive Bayes
  Obtained result (accuracy %): NB: 65.9
  Dataset used: Internet Movie Database (IMDb)

Author: Mullen and Collier (2004)
  Approach: Values assigned to selected words, then combined to form a model for classification
  Algorithm used: Support Vector Machine (SVM)
  Obtained result (accuracy %): SVM: 86.0
  Dataset used: Internet Movie Database (IMDb)

Author: Dave et al. (2003)
  Approach: Information retrieval techniques used for feature retrieval, and results of various metrics are tested
  Algorithm used: SVMlite, machine learning using Rainbow, Naive Bayes
  Obtained result (accuracy %): Naive Bayes: 87.0
  Dataset used: Dataset from Cnet and Amazon site

Author: Matsumoto et al. (2005)
  Approach: Syntactic relationship among words used as a basis of document level sentiment analysis
  Algorithm used: Support Vector Machine (SVM)
  Obtained result (accuracy %): Unigram: 83.7, Bigram: 80.4, Unigram + Bigram: 84.6
  Dataset used: Internet Movie Database (IMDb), Polarity dataset

Author: Zhang et al. (2015)
  Approach: Use word2vec to capture similar features, then classify reviews using SVMperf
  Algorithm used: SVMperf
  Obtained result (accuracy %): Lexicon based: 89.95, POS based: 90.30
  Dataset used: Chinese comments on clothing products

Author: Liu and Chen (2015)
  Approach: Used multi-label classification with eleven state-of-art multi-label methods, two micro-blog datasets, and eight different evaluation metrics on three different sentiment dictionaries
  Algorithm used: Eight different evaluation metrics
  Obtained result (accuracy %): Average highest precision: 75.5
  Dataset used: Dalian University of Technology Sentiment Dictionary (DUTSD), National Taiwan University Sentiment Dictionary (NTUSD), Howset Dictionary (HD)

Author: Luo et al. (2016)
  Approach: Paul Ekman’s research approach (Ekman & Friesen, 1971) is used to convert the text into low dimensional emotional space (ESM), then classify them using machine learning techniques
  Algorithm used: Support Vector Machine (SVM), Naive Bayes (NB), Decision Tree (DT)
  Obtained result (accuracy %): SVM: 78.31, NB: 63.28, DT: 79.21
  Dataset used: Stock message text data (The Lion forum)

Author: Niu et al. (2016)
  Approach: Used lexicon based analysis to transform data into the required format and then statistical learning methods to classify the reviews
  Algorithm used: BOW feature with TF and TF-IDF approach
  Obtained result (accuracy %): Text: 71.9, Visual feature: 68.7, Multi-view: 75.2
  Dataset used: Manually annotated Twitter data

Author: Proposed approach
  Approach: Converting text reviews into numeric matrices using CountVectorizer and TF-IDF, which are then given as input to machine learning algorithms for classification
  Algorithm used: Support Vector Machine (SVM), Naive Bayes (NB), Maximum Entropy (ME), Stochastic Gradient Descent (SGD), N-gram
  Obtained result (accuracy %): NB: 86.23, ME: 88.48, SVM: 88.94, SGD: 85.11
  Dataset used: Internet Movie Database (IMDb)

2.2. Motivation for proposed approach

The above mentioned literature survey helps to identify some possible research areas which can be extended further. The following aspects have been considered for carrying out further research.

i. Most of the authors, apart from Pang et al. (2002) and Matsumoto et al. (2005), have used the unigram approach to classify the reviews. This approach provides comparatively better results, but fails in some cases. The comment “The item is not good,” when analyzed using the unigram approach, yields a neutral polarity for the sentence, owing to the presence of one positive polarity word ‘good’ and one negative polarity word ‘not’. But when the statement is analyzed using the bigram approach, it yields a negative polarity due to the presence of the words ‘not good’, which is correct. Therefore, when a higher level of n-gram is considered, the result is expected to be better. Thus, analyzing the research outcome of several authors, this study makes an attempt to extend the sentiment classification using unigram, bigram, trigram, and their combinations for classification of movie reviews.
ii. A number of authors have also used Part-of-Speech (POS) tags for classification. But it is observed that the POS tag for a word is not fixed and changes as per the context of its use. For example, the word ‘book’ has the POS ‘noun’ when used as reading material, whereas in the case of “ticket booking” the POS is verb. Thus, in order to avoid confusion, instead of using POS as a parameter for classification, the word as a whole may be considered for classification.
iii. Most of the machine learning algorithms work on data represented as a matrix of numbers. But the sentiment data are always in text format. Therefore, they need to be converted to a number matrix. Different authors have considered TF or TF-IDF to convert the text into a matrix of numbers. But in this paper, in order to convert the text data into a matrix of numbers, the combination of TF-IDF and CountVectorizer has been applied. Each row of the matrix represents a particular text file, whereas each column represents a word / feature present in that file, as shown in Table 3.

3. Methodology

Classification of sentiments may be categorized into two types, i.e., binary sentiment classification and multi-class sentiment classification (Tang, Tan, & Cheng, 2009). In binary classification, each document $d_i$ in $D$, where $D = \{d_1, \ldots, d_n\}$, is assigned a label from $C$, where $C = \{\text{Positive}, \text{Negative}\}$ is a predefined category set. In multi-class sentiment analysis, each document $d_i$ is assigned a label from $C^*$, where $C^* = \{\text{strong positive}, \text{positive}, \text{neutral}, \text{negative}, \text{strong negative}\}$. It is observed in the literature survey that a good number of authors have applied the binary classification method for sentiment analysis.

The movie reviews provided by the reviewers are mainly in text format; but for classification of the sentiment of the reviews using machine learning algorithms, numerical matrices are required. Thus, the conversion of the text data in reviews into numerical matrices is carried out using different methods such as

• CountVectorizer: It converts the text document collection into a matrix of integers (Garreta & Moncecchi, 2013). This method helps to generate a sparse matrix of the counts.
• Term frequency - Inverse document frequency (TF-IDF): It reflects the importance of a word in the corpus or the collection (Garreta & Moncecchi, 2013). The TF-IDF value increases with the frequency of a particular word in the document. In order to control the generality of more common words, the term frequency is offset by the frequency of the word in the corpus. Term frequency is the number of times a particular term appears in the text. Inverse document frequency measures the occurrence of any word across all documents.

In this paper, the combination of the two methods, i.e., CountVectorizer and TF-IDF, has been applied to transform the text document into a numerical vector, which is then considered as input to the supervised machine learning algorithm.

3.1. Application of machine learning algorithm

When supervised machine learning algorithms are considered for classification, the input dataset is required to be a labeled one. In this study, different supervised learning techniques, namely NB, ME, SGD, and SVM, are applied for classification together with the n-gram model.

i. Naive Bayes (NB) method: This method is used for both classification and training purposes. It is a probabilistic classifier based on Bayes’ theorem. In this paper, the multinomial Naive Bayes classification technique is used. The multinomial model considers word frequency information in the document for analysis, where a document is considered to be an ordered sequence of words obtained from vocabulary $V$. The probability of a word event is independent of the word’s context and its position in the document (McCallum, Nigam et al., 1998). Thus, each document $d_i$ is drawn from a multinomial distribution of words, with as many independent trials as the length of $d_i$. $N_{it}$ is the count of occurrences of $w_t$ in document $d_i$. The probability of a document given a class can be obtained using the following equation:

$$P(d_i \mid c_j; \theta) = P(|d_i|)\,|d_i|!\,\prod_{t=1}^{|V|} \frac{P(w_t \mid c_j; \theta)^{N_{it}}}{N_{it}!} \tag{1}$$

where $P(d_i \mid c_j; \theta)$ refers to the probability of document $d_i$ given class $c_j$, $P(|d_i|)$ is the probability of a document having length $|d_i|$, and $P(w_t \mid c_j; \theta)$ is the probability of occurrence of word $w_t$ in class $c_j$. After estimating the parameters from the training documents, classification is carried out on a text document by calculating the posterior probability of each class and selecting the most probable class.

ii. Maximum entropy (ME) method: In this method, the training data is used to set constraints on the conditional distribution (Nigam, Lafferty, & McCallum, 1999). Each constraint expresses a characteristic of the training data. The Maximum Entropy (ME) value in terms of an exponential function can be expressed as

$$P_{ME}(c \mid d) = \frac{1}{Z(d)} \exp\Big(\sum_i \lambda_{i,c}\, f_{i,c}(d, c)\Big) \tag{2}$$

where $P_{ME}(c \mid d)$ refers to the probability of document $d$ belonging to class $c$, $f_{i,c}(d, c)$ is the feature / class function for feature $f_i$ and class $c$, $\lambda_{i,c}$ is the parameter to be estimated, and $Z(d)$ is the normalizing factor.

In order to use ME, a set of features needs to be selected. For text classification, word counts are considered as features. The feature / class function can be instantiated as follows:

$$f_{i,c'}(d, c) = \begin{cases} 0 & \text{if } c \neq c' \\ \dfrac{N(d, i)}{N(d)} & \text{otherwise} \end{cases} \tag{3}$$

where $f_{i,c'}(d, c)$ refers to the feature for the word-class combination in class $c$ and document $d$, $N(d, i)$ represents the number of occurrences of feature $i$ in document $d$, and $N(d)$ is the number of words in $d$. As per the expression, if a word occurs frequently in a class, the weight of the word-class pair becomes higher in comparison to other pairs. These highest frequency word-class pairs are considered for classification.

iii. Stochastic gradient descent (SGD) method: This method is used when the training data size is large. In the SGD method, instead of computing the gradient exactly, each iteration estimates the gradient on the basis of a single randomly picked example (Bottou, 2012).

$$w_{t+1} = w_t - \gamma_t \nabla_w Q(z_t, w_t) \tag{4}$$

The stochastic process $\{w_t, t = 1, 2, \ldots\}$ depends on the example $z_t$ randomly picked at each iteration, where $Q(z_t, w_t)$ is the loss on that example, used to minimize the risk, and $\gamma_t$ is the learning rate. The convergence of SGD is affected by the noisy approximation of the gradient. If the learning rate decreases too slowly, the variance of the parameter estimate $w_t$ decreases equally slowly; but if the rate decreases too quickly, the parameter estimate $w_t$ takes a significant amount of time to reach the optimum point.

iv. Support vector machine (SVM) method: This method analyzes data and defines decision boundaries by means of hyper-planes. In a binary classification problem, the hyper-plane separates the document vectors of one class from those of the other class, where the separation between hyper-planes is desired to be kept as large as possible.
For a training set of labeled pairs $(x_i, y_i)$, $i = 1, 2, \ldots, l$, where $x_i \in R^n$ and $y \in \{1, -1\}^l$, the SVM method needs to solve the following optimization problem:

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i\,(w^T \phi(x_i) + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0 \tag{5}$$

where $w$ is the weight parameter assigned to the variables, $\xi$ is the slack or error correction added, and $C$ is the regularization factor (Hsu, Chang, & Lin, 2003). The objective of the problem is to minimize $\frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$, where the value of $y_i\,(w^T \phi(x_i) + b)$ needs to be at least $1 - \xi_i$ and the value of each $\xi_i$ is considered to be very small, i.e., nearly equal to 0. Here the training vector $x_i$ is mapped to a higher dimensional space by $\phi$.
Since SVM requires input in the form of a vector of numbers, the text reviews to be classified need to be converted to numeric values. After a text file is converted to a numeric vector, it may go through a scaling process, which helps to manage the vectors and keep them in the range [0, 1].

v. N-gram model: It is a method of checking ‘n’ continuous words or sounds from a given sequence of text or speech. This model

helps to predict the next item in a sequence. In sentiment analysis, the n-gram model helps to analyze the sentiment of the text or document. Unigram refers to an n-gram of size 1, bigram to an n-gram of size 2, and trigram to an n-gram of size 3. Higher n-grams are referred to as four-gram, five-gram, and so on. The n-gram method can be explained using the following example:
A typical example of a sentence may be considered as “The movie is not a good one”.
• Its unigrams: ‘The’, ‘movie’, ‘is’, ‘not’, ‘a’, ‘good’, ‘one’, where a single word is considered.
• Its bigrams: ‘The movie’, ‘movie is’, ‘is not’, ‘not a’, ‘a good’, ‘good one’, where a pair of words is considered.
• Its trigrams: ‘The movie is’, ‘movie is not’, ‘is not a’, ‘not a good’, ‘a good one’, where a set of three words is considered.

Fig. 1. Diagrammatic view of the proposed approach: Dataset → Preprocessing (stop word, numeric and special character removal) → Vectorization → Train using machine learning algorithm → Classification → Result.

Table 2
Confusion matrix.

                       Correct labels
                       Positive               Negative
Positive               TP (True positive)     FP (False positive)
Negative               FN (False negative)    TN (True negative)

Table 3
Matrix generated under CountVectorizer scheme.

             Feature 1   Feature 2   Feature 3   Feature 4   Feature 5
Sentence 1   1           1           1           0           0
Sentence 2   1           1           0           1           0
Sentence 3   1           1           0           0           1
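
As an illustration of the n-gram decomposition of the example sentence “The movie is not a good one” and of a count matrix of the kind shown in Table 3, the following is a minimal sketch in plain Python. The helper names `ngrams` and `count_matrix` and the three short sample sentences are assumptions introduced here for illustration only; they are not taken from the paper, and the sketch mimics the idea of a count vectorizer (lowercasing, whitespace tokenization) without reproducing any library's implementation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_matrix(documents, n_values=(1,)):
    """Build a dense document-by-feature count matrix (hypothetical helper).

    Rows correspond to documents and columns to n-gram features, in the
    spirit of Table 3. Tokenization is a plain lowercase whitespace split.
    """
    per_doc = []
    for doc in documents:
        tokens = doc.lower().split()
        features = [g for n in n_values for g in ngrams(tokens, n)]
        per_doc.append(Counter(features))
    # Sorted union of all features seen in any document.
    vocabulary = sorted(set().union(*per_doc))
    # A Counter returns 0 for features absent from a document.
    matrix = [[counts[f] for f in vocabulary] for counts in per_doc]
    return vocabulary, matrix


# The example sentence used in the text.
tokens = "The movie is not a good one".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# Illustrative (assumed) mini-corpus with unigram + bigram features.
vocab, matrix = count_matrix(
    ["the movie is good", "the movie is bad", "the movie is not good"],
    n_values=(1, 2),
)

if __name__ == "__main__":
    print(unigrams)
    print(bigrams)
    print(trigrams)
    print(vocab)
    for row in matrix:
        print(row)
```

With unigram and bigram features combined, the third sample sentence gets a non-zero count in the ‘not good’ column while the first does not, which is the distinction that a unigram-only representation cannot express.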
3.2. Performance evaluation parameters

The parameters used to evaluate the performance of a supervised machine learning algorithm are based on the elements of a matrix known as the confusion matrix or contingency table. It is used in supervised machine learning to help in assessing the performance of any algorithm. From the classification point of view, the terms “True Positive (TP)”, “False Positive (FP)”, “True Negative (TN)”, and “False Negative (FN)” are used to compare the class labels in this matrix, as shown in Table 2 (Mouthami, Devi, & Bhaskaran, 2013). True Positive represents the number of reviews that are positive and are also classified as positive by the classifier, whereas False Positive indicates reviews that are negative but are classified as positive. Similarly, True Negative represents the reviews that are negative and are also classified as negative by the classifier, whereas False Negative indicates reviews that are positive but are classified as negative.

Based on the values obtained from the confusion matrix, other parameters such as “precision”, “recall”, “f-measure”, and “accuracy” are computed for evaluating the performance of any classifier.

• Precision: It measures the exactness of the classifier result. It is the ratio of the number of examples correctly labeled as positive to the total number of examples classified as positive.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{6}$$

• Recall: It measures the completeness of the classifier result. It is the ratio of the number of examples correctly labeled as positive to the total number of examples that are truly positive.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{7}$$

• F-Measure: It is the harmonic mean of precision and recall. It is required to optimize the system towards either precision or recall, whichever has more influence on the final result.

$$\text{F-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8}$$

• Accuracy: It is the most common measure of the classification process. It can be calculated as the ratio of correctly classified examples to the total number of examples.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{9}$$

3.3. Dataset used

In this paper, the acl Internet Movie Database (IMDb) dataset is considered for sentiment analysis (IMDb, 2011). It consists of 12,500 positively labeled test reviews and 12,500 positively labeled train reviews. Similarly, there are 12,500 negatively labeled test reviews and 12,500 negatively labeled train reviews. Apart from the labeled supervised data, an unsupervised dataset is also present with 50,000 unlabeled reviews.

4. Proposed approach

The reviews of the IMDb dataset are processed to remove the stop words and unwanted information from the dataset. The textual data is then transformed into a matrix of numbers using vectorization techniques. Further, training on the dataset is carried out using machine learning algorithms. The steps of the approach are shown in Fig. 1.

Step 1: The aclIMDb dataset, consisting of 12,500 positive and 12,500 negative reviews for training and 12,500 positive and 12,500 negative reviews for testing (IMDb, 2011), is taken into consideration.
Step 2: The text reviews sometimes contain unwanted data, which need to be removed before the reviews are considered for classification. The usually identified unwanted data are:
• Stop words: They do not play any role in determining the sentiment.


  • Numeric and special characters: In the text reviews, it is often observed that different numeric (1, 2, ... 5, etc.) and special characters (@, #, $, %, etc.) are present, which do not have any effect on the analysis but often create confusion during the conversion of the text file to numeric vectors.
Step 3: After preprocessing, the text reviews need to be converted to a matrix of numeric vectors. The following methods are considered for converting the text file to numeric vectors:
  • CountVectorizer: It converts the text reviews into a matrix of token counts. It implements both tokenization and occurrence counting. The output matrix obtained after this process is a sparse matrix. The conversion may be explained using the following example:
    • Calculation of CountVectorizer matrix: An example is considered to explain the steps of calculating the elements of the matrix (Garreta and Moncecchi, 2013), which helps in improving understandability. Suppose three different documents containing the following sentences are taken for analysis:
      Sentence 1: “Movie is nice”.
      Sentence 2: “Movie is Awful”.
      Sentence 3: “Movie is fine”.
      A matrix of size 4 ∗ 6 may be formed, as there exist 3 documents and 5 distinct features. In the matrix given in Table 3, an element is assigned the value ‘1’ if the feature is present in the document and ‘0’ if it is absent.
  • TF-IDF: It indicates the importance of a word to the document and to the whole corpus. Term frequency (TF) gives the frequency of a word in a document, and inverse document frequency (IDF) reflects the frequency of that word in the whole corpus (Garreta and Moncecchi, 2013).
    • Calculation of TF-IDF value: An example may be considered to improve understandability. Suppose a movie review contains 1000 words, in which the word “Awesome” appears 10 times. The TF value for the word “Awesome” is then 10/1000 = 0.01. Again, suppose there are 1 million reviews in the corpus and the word “Awesome” appears 1000 times in the whole corpus. Then, the IDF value is calculated as log(1,000,000/1,000) = 3, with the logarithm taken to base 10. Thus, the TF-IDF value is calculated as 0.01 ∗ 3 = 0.03.
Step 4: After the text reviews are converted to matrices of numbers, these matrices are used as input for the following four supervised machine learning algorithms for classification.
  • NB method: Using a probabilistic classifier and pattern learning, the set of documents is classified (McCallum et al., 1998).
  • ME method: The training data are used to set constraints on the conditional distribution (Nigam et al., 1999). Each constraint expresses a characteristic of the training data. These constraints are then used for testing the data.
  • SGD method: The SGD method is used when the training data are large. Each iteration estimates the gradient on the basis of a single randomly picked example (Bottou, 2012).
  • SVM method: Data are analyzed and decision boundaries are defined by hyperplanes. In the two-category case, the hyperplane separates the document vectors of one class from those of the other class, where the separation is kept as large as possible (Hsu et al., 2003).
Step 5: As mentioned in Step 1, the movie reviews of the acl IMDb dataset are considered for analysis using the machine learning algorithms discussed in Step 4. Then, different variations of the n-gram method, i.e., unigram, bigram, trigram, unigram + bigram, bigram + trigram, and unigram + bigram + trigram, are applied to obtain the results shown in Section 5.
Step 6: The results obtained from this analysis are compared with the results available in the literature, as shown in Section 6.

5. Implementation

• Application of NB method: The confusion matrix and the evaluation parameters, i.e., precision, recall, f-measure, and accuracy, obtained after classification using the NB n-gram techniques are shown in Table 4.
As shown in Table 4, the accuracy value obtained using bigram is better than the values obtained using unigram and trigram. The NB method is a probabilistic method in which the features are assumed to be independent of each other. Hence, when the analysis is carried out using “single word (unigram)” and “double word (bigram)”, the accuracy value obtained is comparatively better than that obtained using trigram. But when “triple word (trigram)” is considered for the analysis of features, words are repeated a number of times; thus, it affects the probability of the document. For example, for the statement “it is not a bad movie”, the trigrams “it is not” and “is not a” show negative polarity, whereas the sentence represents positive sentiment. Thus, the accuracy of classification decreases. Again, when the trigram model is combined with unigram or bigram or unigram + bigram, the impact of trigram makes the accuracy value comparatively low.
• Application of ME method: The confusion matrix and the evaluation parameters, i.e., precision, recall, f-measure, and accuracy, obtained after classification using the ME n-gram techniques are shown in Table 5.
As represented in Table 5, the accuracy value obtained using unigram is better than that of bigram and trigram. As the ME algorithm is based on the conditional distribution and word-class pairs help to classify the review, the unigram method, which considers a single word for analysis, provides the best result in comparison with the other methods. In both the bigram and trigram methods, a negative or positive polarity word appears more than once, thus affecting the classification result. When the bigram and trigram methods are combined with unigram and between themselves, the accuracy values of the various combinations are observed to be low.
• Application of SVM method: The confusion matrix and the evaluation parameters, i.e., precision, recall, f-measure, and accuracy, obtained after classification using the SVM n-gram techniques are shown in Table 6.
As exhibited in Table 6, the accuracy value obtained using unigram is better than the values obtained using bigram and trigram. As the SVM method is a non-probabilistic linear classifier that trains a model to find a hyperplane separating the dataset, the unigram model, which analyzes single words, gives a better result. In bigram and trigram, there exist multiple word combinations which, when plotted in a particular hyperplane, confuse the classifier, and thus it provides a less accurate result in comparison with the value obtained using unigram. Thus, the less ac-

Table 4
Confusion matrix, evaluation parameters and accuracy for the Naive Bayes n-gram classifier.

Method                      Confusion matrix              Evaluation parameter          Accuracy (%)
                                      Correct labels
                                      Positive  Negative  Precision  Recall  F-Measure
Unigram                     Positive  11025     1475      0.88       0.81    0.84       83.652
                            Negative  2612      9888      0.79       0.87    0.83
Bigram                      Positive  11156     1344      0.89       0.81    0.85       84.064
                            Negative  2640      9860      0.79       0.88    0.83
Trigram                     Positive  10156     2344      0.81       0.67    0.73       70.532
                            Negative  5023      7477      0.60       0.76    0.67
Unigram + Bigram            Positive  11114     1386      0.89       0.84    0.85       86.004
                            Negative  2113      10387     0.83       0.88    0.85
Bigram + Trigram            Positive  11123     1377      0.89       0.81    0.85       83.828
                            Negative  2666      9834      0.79       0.88    0.83
Unigram + Bigram + Trigram  Positive  11088     1412      0.89       0.85    0.87       86.232
                            Negative  2030      10470     0.84       0.88    0.86
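
As a quick check, the evaluation parameters reported for the unigram row of Table 4 can be recomputed from its confusion matrix using Eqs. (6) and (7). The sketch below assumes that the rows of the matrix are the classifier's output and the columns are the correct labels, which is the reading under which the reported values are reproduced. The function name `evaluate` is hypothetical, and the F-measure and accuracy formulas are the standard definitions rather than code taken from the paper.

```python
def evaluate(tp, fp, fn, tn):
    """Return precision, recall, F-measure and accuracy for the positive class."""
    precision = tp / (tp + fp)                                  # Eq. (6)
    recall = tp / (tp + fn)                                     # Eq. (7)
    f_measure = 2 * precision * recall / (precision + recall)   # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)                  # standard definition
    return precision, recall, f_measure, accuracy


# Unigram row of Table 4, read as: rows = classifier output,
# columns = correct labels (an assumption about the table layout).
tp, fp = 11025, 1475
fn, tn = 2612, 9888

precision, recall, f_measure, accuracy = evaluate(tp, fp, fn, tn)
print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f-measure={f_measure:.2f} accuracy={100 * accuracy:.3f}%")
```

Under this reading, the rounded values agree with the precision, recall, F-measure and accuracy listed for the Positive row of the unigram entry.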

Table 5
Confusion matrix, evaluation parameters and accuracy for the Maximum Entropy n-gram classifier.

Method                      Confusion matrix              Evaluation parameter          Accuracy (%)
                                      Correct labels
                                      Positive  Negative  Precision  Recall  F-Measure
Unigram                     Positive  11011     1489      0.88       0.89    0.88       88.48
                            Negative  1391      11109     0.89       0.88    0.88
Bigram                      Positive  10330     2170      0.83       0.84    0.83       83.228
                            Negative  2023      10477     0.84       0.83    0.83
Trigram                     Positive  8404      4096      0.67       0.73    0.70       71.38
                            Negative  3059      9441      0.76       0.70    0.73
Unigram + Bigram            Positive  11018     1482      0.88       0.89    0.88       88.42
                            Negative  1413      11087     0.89       0.88    0.88
Bigram + Trigram            Positive  10304     2196      0.82       0.83    0.83       82.948
                            Negative  2067      10433     0.83       0.83    0.83
Unigram + Bigram + Trigram  Positive  11006     1494      0.88       0.89    0.88       83.36
                            Negative  2666      9834      0.78       0.87    0.82

curate bigram and trigram, when combined with unigram and with each other, also provide a less accurate result.
• Application of SGD method: The confusion matrix and the evaluation parameters, i.e., precision, recall, f-measure, and accuracy, obtained after classification using the SGD n-gram techniques are shown in Table 7.

As illustrated in Table 7, the accuracy obtained using unigram is better than that of bigram and trigram. In the SGD method, the gradient is estimated on a single randomly picked review, using a learning rate to minimize the risk. In unigram, a single word is randomly picked for analysis, but in bigram and trigram the combination of words adds noise, which reduces the value of accuracy. Thus, when the bigram and trigram models are combined with other models, their lower accuracy affects the accuracy of the total system.

6. Performance evaluation

The comparative analysis of the results obtained using the proposed approach against those of other literature using the IMDb dataset and n-gram approaches is shown in Table 8.
Pang et al. have used the machine learning algorithms NB, ME, and SVM with the n-gram approaches of unigram, bigram, and the combination of unigram and bigram. Salvetti et al. and Beineke et al. have implemented the NB method for classification, but only the unigram approach is used. Mullen and Collier have proposed the SVM method for classification, with the unigram approach only. Matsumoto et al. have also implemented SVM for classification and used unigram, bigram, and the combination of both, i.e., unigram and bigram.
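
The n-gram variants compared in Section 5 differ only in which word sequences are used as features. The following minimal sketch, in plain Python, extracts unigrams, bigrams and trigrams from the example statement used in the NB discussion and forms a combined feature list such as unigram + bigram. The helper names `ngrams` and `combined_ngrams` and the whitespace tokenization are illustrative assumptions, not the implementation used in the paper.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combined_ngrams(tokens, sizes):
    """Concatenate n-grams for several values of n, e.g. (1, 2) for unigram + bigram."""
    features = []
    for n in sizes:
        features.extend(ngrams(tokens, n))
    return features


# Naive whitespace tokenization (an assumption made for this example).
tokens = "it is not a bad movie".split()

print(ngrams(tokens, 1))                # unigrams
print(ngrams(tokens, 2))                # bigrams
print(ngrams(tokens, 3))                # trigrams, including "it is not" and "is not a"
print(combined_ngrams(tokens, (1, 2)))  # unigram + bigram
```

The trigram list makes the effect described for the NB method visible: the fragments "it is not" and "is not a" carry a negative cue even though the whole statement is positive.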

Table 6
Confusion matrix, evaluation parameters and accuracy for the Support Vector Machine n-gram classifier.

Method                      Confusion matrix              Evaluation parameter          Accuracy (%)
                                      Correct labels
                                      Positive  Negative  Precision  Recall  F-Measure
Unigram                     Positive  10993     1507      0.88       0.86    0.87       86.976
                            Negative  1749      10751     0.86       0.88    0.87
Bigram                      Positive  10584     1916      0.85       0.83    0.84       83.872
                            Negative  2116      10384     0.83       0.84    0.84
Trigram                     Positive  8410      4090      0.67       0.71    0.69       70.204
                            Negative  3359      9141      0.73       0.69    0.71
Unigram + Bigram            Positive  11161     1339      0.89       0.89    0.89       88.884
                            Negative  1440      11060     0.88       0.89    0.89
Bigram + Trigram            Positive  10548     1152      0.84       0.83    0.84       83.636
                            Negative  2139      10361     0.83       0.84    0.84
Unigram + Bigram + Trigram  Positive  11159     1341      0.89       0.89    0.89       88.944
                            Negative  1423      11077     0.89       0.89    0.89
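
All four classifiers take as input the numeric matrices described in Step 3. As a rough illustration, the sketch below rebuilds the binary document-term matrix for the three example sentences of Step 3 and repeats the TF-IDF arithmetic for the word “Awesome”. It uses plain Python rather than scikit-learn, and the lowercasing, the alphabetical feature order and the base-10 logarithm are assumptions made for the example.

```python
import math

documents = ["Movie is nice", "Movie is Awful", "Movie is fine"]

# Tokenize by whitespace and lowercase (assumption; a library tokenizer may differ).
tokenized = [doc.lower().split() for doc in documents]

# Vocabulary: the distinct features, sorted alphabetically.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Binary document-term matrix: 1 if the feature is present in the document, else 0.
matrix = [[1 if feature in tokens else 0 for feature in vocabulary]
          for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print(row)

# TF-IDF arithmetic from the worked example in Step 3.
tf = 10 / 1000                        # "Awesome" appears 10 times in a 1000-word review
idf = math.log10(1_000_000 / 1_000)   # 1 million reviews, 1000 occurrences in the corpus
tf_idf = tf * idf
print(tf, idf, round(tf_idf, 2))
```

The three documents yield five distinct features, matching the count stated in Step 3, and each row of the matrix marks which of those features occur in the corresponding sentence.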

Table 7
Confusion matrix, evaluation parameters and accuracy for the Stochastic Gradient Descent n-gram classifier.

Method                      Confusion matrix              Evaluation parameter          Accuracy (%)
                                      Correct labels
                                      Positive  Negative  Precision  Recall  F-Measure
Unigram                     Positive  9860      2640      0.79       0.90    0.84       85.116
                            Negative  1081      11419     0.91       0.81    0.86
Bigram                      Positive  12331     169       0.99       0.92    0.95       95
                            Negative  1081      11419     0.91       0.99    0.95
Trigram                     Positive  11987     513       0.96       0.55    0.70       58.408
                            Negative  9885      2615      0.21       0.84    0.33
Unigram + Bigram            Positive  9409      3091      0.75       0.90    0.82       83.36
                            Negative  1069      11431     0.91       0.79    0.85
Bigram + Trigram            Positive  12427     73        0.99       0.55    0.71       58.744
                            Negative  10241     2259      0.18       0.97    0.30
Unigram + Bigram + Trigram  Positive  9423      3077      0.75       0.90    0.82       83.336
                            Negative  1089      11411     0.91       0.79    0.85
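
Step 4 describes SGD as estimating the gradient from a single randomly picked example in each iteration. A minimal sketch of that idea for a linear classifier with logistic loss is given below. The loss function, learning rate, number of iterations, function names and toy data are all assumptions chosen for illustration and are not the settings used in the paper.

```python
import math
import random


def sgd_logistic(X, y, learning_rate=0.5, iterations=2000, seed=0):
    """Train a linear classifier, updating on one random example per iteration."""
    rng = random.Random(seed)
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(iterations):
        i = rng.randrange(len(X))                  # pick a single example at random
        score = bias + sum(w * x for w, x in zip(weights, X[i]))
        prob = 1.0 / (1.0 + math.exp(-score))      # predicted probability of "positive"
        error = prob - y[i]                        # gradient of the logistic loss w.r.t. score
        weights = [w - learning_rate * error * x for w, x in zip(weights, X[i])]
        bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (positive) if the linear score is above zero, else 0 (negative)."""
    return 1 if bias + sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


# Toy binary features over the vocabulary [awful, fine, nice] (hypothetical data).
X = [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
y = [1, 0, 1, 0, 1]

weights, bias = sgd_logistic(X, y)
print([predict(weights, bias, x) for x in X])
```

Because each update touches only one example, the cost per iteration does not grow with the size of the training set, which is why the method suits large datasets.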

In the present paper, four different algorithms, viz., NB, ME, SVM, and SGD, are applied using n-gram approaches such as unigram, bigram, trigram, unigram + bigram, bigram + trigram, and unigram + bigram + trigram. The results obtained with the present approach are observed to be better than the results available in the literature where both the IMDb dataset and the n-gram approach are used.

6.1. Managerial insights based on result

The managerial insights based on the obtained results can be explained as follows:

• It was a common practice for sellers to send questionnaires to customers asking for feedback on the product they had bought. Nowadays, people share those views through reviews or blogs.
• The reviews can be collected and given as input to the proposed approach for qualitative decisions.
• The proposed approach classifies the reviews into either positive or negative polarity; hence, it is able to guide managers by informing them about the shortcomings or good features of the product, which they need to take into account to sustain the market competition.

7. Conclusion and future work

This paper makes an attempt to classify movie reviews using various supervised machine learning algorithms, such as Naive Bayes (NB), Maximum Entropy (ME), Stochastic Gradient De-

Table 8
Comparison of the “accuracy” values (%) obtained by the proposed approach with those reported in the literature using the IMDb dataset and the n-gram approach.

Classifier                   N-gram                      Pang et al.  Salvetti et al.  Beineke et al.  Mullen & Collier  Matsumoto et al.  Proposed approach
Naive Bayes                  Unigram                     81.0         79.5             65.9            -                 -                 83.65
                             Bigram                      77.3         -                -               -                 -                 84.06
                             Trigram                     -            -                -               -                 -                 70.53
                             Unigram + Bigram            80.6         -                -               -                 -                 86
                             Bigram + Trigram            -            -                -               -                 -                 83.82
                             Unigram + Bigram + Trigram  -            -                -               -                 -                 86.23
Maximum Entropy              Unigram                     80.4         -                -               -                 -                 88.48
                             Bigram                      77.4         -                -               -                 -                 83.22
                             Trigram                     -            -                -               -                 -                 71.38
                             Unigram + Bigram            80.8         -                -               -                 -                 88.42
                             Bigram + Trigram            -            -                -               -                 -                 82.94
                             Unigram + Bigram + Trigram  -            -                -               -                 -                 83.36
Support Vector Machine       Unigram                     72.9         -                -               86.0              83.7              86.97
                             Bigram                      77.1         -                -               -                 80.4              83.87
                             Trigram                     -            -                -               -                 -                 70.16
                             Unigram + Bigram            82.7         -                -               -                 84.6              88.88
                             Bigram + Trigram            -            -                -               -                 -                 83.63
                             Unigram + Bigram + Trigram  -            -                -               -                 -                 88.94
Stochastic Gradient Descent  Unigram                     -            -                -               -                 -                 85.11
                             Bigram                      -            -                -               -                 -                 62.36
                             Trigram                     -            -                -               -                 -                 58.40
                             Unigram + Bigram            -            -                -               -                 -                 83.36
                             Bigram + Trigram            -            -                -               -                 -                 58.74
                             Unigram + Bigram + Trigram  -            -                -               -                 -                 83.36

A dash (-) indicates that the algorithm is not considered by the authors in the respective paper.

scent (SGD), and Support Vector Machine (SVM). These algorithms are further applied using the n-gram approach on the IMDb dataset. It is observed that as the value of ‘n’ in n-gram increases, the classification accuracy decreases, i.e., for unigram and bigram, the result obtained using the algorithms is remarkably better; but when trigram, four-gram, and five-gram classification are carried out, the value of accuracy decreases.

As discussed in Section 2.2, instead of using unigram and POS tag, the use of unigram, bigram, trigram, and their combinations has shown a better result. Again, the use of TF-IDF and CountVectorizer techniques in combination for converting the text into a matrix of numbers also helps to improve the accuracy when machine learning techniques are used.

The present study also has some limitations, as mentioned below:

• Twitter comments are mostly small in size. Thus, the proposed approach may have some issues while considering these reviews.
• Different reviews or comments contain pictorial symbols such as emoticons, which help in presenting the sentiment, but these, being images, are not taken into consideration in this study for analysis.
• In order to give stress on a word, it is observed that some persons often repeat the last character of the word a number of times, such as “greatttt, Fineee”. These words do not have a proper meaning, but they may be considered and further processed to identify sentiment. However, this aspect is also not considered in this paper.

In this paper, after removal of stop words, the other words are considered for classification. The list of words finally obtained is observed to be very large in a good number of cases; thus, in future, different feature selection mechanisms may be identified to select the best features from the set of features, based on which the classification process may be carried out. It may also happen that the accuracy value improves if some of the hybrid machine learning techniques are considered for classification of the sentiment. All of the above mentioned limitations may be considered for future work, in order to improve the quality of sentiment classification.

References

Beineke, P., Hastie, T., & Vaithyanathan, S. (2004). The sentimental factor: improving review classification via human-provided information. In Proceedings of the 42nd annual meeting on association for computational linguistics (p. 263). Association for Computational Linguistics.
Bottou, L. (2012). Stochastic gradient descent tricks. In Neural networks: tricks of the trade (pp. 421–436). Springer.
Dave, K., Lawrence, S., & Pennock, D. M. (2003). Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In Proceedings of the 12th international conference on World Wide Web (pp. 519–528). ACM.
Ekman, P., & Friesen, W. V. (1971). Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 17(2), 124.
Feldman, R. (2013). Techniques and applications for sentiment analysis. Communications of the ACM, 56(4), 82–89.
Garreta, R., & Moncecchi, G. (2013). Learning scikit-learn: machine learning in python. Berlin Heidelberg: Packt Publishing Ltd.
Gautam, G., & Yadav, D. (2014). Sentiment analysis of twitter data using machine learning approaches and semantic analysis. In Contemporary computing (IC3), 2014 seventh international conference on (pp. 437–442). IEEE.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). Unsupervised learning. New York: Springer.
Hsu, C. W., Chang, C. C., & Lin, C. J. (2003). A practical guide to support vector classification. Simon Fraser University, 8888 University Drive, Burnaby BC, Canada, V5A 1S6.
IMDb, Internet movie database sentiment analysis dataset (IMDB), 2011,
Joachims, T. (2006). Training linear svms in linear time. In Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 217–226). ACM.
Liu, S. M., & Chen, J.-H. (2015). A multi-label classification based approach for sentiment classification. Expert Systems with Applications, 42(3), 1083–1093.
Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1), 1–167.
Luo, B., Zeng, J., & Duan, J. (2016). Emotion space model for classifying opinions in stock message board. Expert Systems with Applications, 44, 138–146.
Matsumoto, S., Takamura, H., & Okumura, M. (2005). Sentiment classification using word sub-sequences and dependency sub-trees. In Advances in knowledge discovery and data mining (pp. 301–311). Berlin Heidelberg: Springer.
McCallum, A., Nigam, K., et al. (1998). A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization: 752 (pp. 41–48). Citeseer.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Mouthami, K., Devi, K. N., & Bhaskaran, V. M. (2013). Sentiment analysis and classification based on textual reviews. In Information communication and embedded systems (ICICES), 2013 international conference on (pp. 271–276). IEEE.

Mullen, T., & Collier, N. (2004). Sentiment analysis using support vector machines with diverse information sources. In EMNLP: 4 (pp. 412–418).
Nigam, K., Lafferty, J., & McCallum, A. (1999). Using maximum entropy for text classification. In IJCAI-99 workshop on machine learning for information filtering: 1 (pp. 61–67).
Niu, T., Zhu, S., Pang, L., & El Saddik, A. (2016). Sentiment analysis on multi-view social data. In Multimedia modeling (pp. 15–27). Springer.
Pang, B., & Lee, L. (2004). A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on association for computational linguistics (p. 271). Association for Computational Linguistics.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10 (pp. 79–86). Association for Computational Linguistics.
Salvetti, F., Lewis, S., & Reichenbach, C. (2004). Automatic opinion polarity classification of movie. Colorado research in linguistics, 17, 2.
Tang, H., Tan, S., & Cheng, X. (2009). A survey on sentiment detection of reviews. Expert Systems with Applications, 36(7), 10760–10773.
Turney, P. D. (2002). Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th annual meeting on association for computational linguistics (pp. 417–424). Association for Computational Linguistics.
Zhang, D., Xu, H., Su, Z., & Xu, Y. (2015). Chinese comments sentiment classification based on word2vec and svm perf. Expert Systems with Applications, 42(4), 1857–1863.
Zhang, M.-L., & Zhou, Z.-H. (2007). Ml-knn: a lazy learning approach to multi-label learning. Pattern recognition, 40(7), 2038–2048.
