A Study of Feature Extraction Techniques For
Sentiment Analysis
Avinash M
National Institute of Technology, Tiruchirappalli, Tanjore Main Road, National Highway 67, Near BHEL Trichy, Tiruchirappalli, Tamil Nadu, 620015. e-mail: [email protected]

Sivasankar E
National Institute of Technology, Tiruchirappalli, Tanjore Main Road, National Highway 67, Near BHEL Trichy, Tiruchirappalli, Tamil Nadu, 620015. e-mail: [email protected]

1 Introduction
Feature extraction is one of the techniques used in machine learning to map high-dimensional data onto a set of low-dimensional potential features. Extracting informative and essential features greatly enhances the performance of machine learning models and reduces computational complexity.
The growth of modern web applications like Facebook and Twitter has persuaded users to express their opinions on products, persons, and places. Millions of consumers review products on online shopping websites like Amazon and Flipkart. These reviews act as valuable sources of information that can improve the quality of the services provided. With the growth of enormous amounts of user-generated data, many efforts have been made to analyze the sentiment of consumer reviews. But analyzing this unstructured form of data and extracting sentiment from it requires a lot of natural language processing (NLP) and text mining methodologies. Sentiment analysis attempts to derive the polarities from text data using NLP and text mining techniques. The classification algorithms in machine learning require the most appropriate set of features to classify text as having positive or negative polarity. Hence, feature extraction plays a prominent role in sentiment analysis.
2 Literature Review
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan [1] conducted a study on sentiment analysis using machine learning techniques. They compared the performance of machine learning techniques with human-generated baselines and found that the machine learning techniques perform quite well in comparison. A. P. Jain and V. D. Katkar [2] performed sentiment analysis on Twitter data using data mining techniques. They analyzed the performance of various data mining techniques and proposed that data mining classifiers can be a good choice for sentiment prediction. Tim O'Keefe and Irena Koprinska [3] researched feature selection and weighting methods in sentiment analysis. In their research they combined various feature selection techniques with feature weighting methods to estimate the performance of classification algorithms. Shereen Albitar, Sebastien Fournier, and Bernard Espinasse [4] proposed an effective TF-IDF based text-to-text semantic similarity measure for text classification. Quoc Le and Tomas Mikolov [5] introduced distributed representations of sentences and documents for text classification. Parinya Sanguansat [6] performed Paragraph2Vec based sentiment analysis on social media for business in Thailand. İzzet Fatih Şentürk and Metin Bilgin [7] performed sentiment analysis on Twitter data using Doc2Vec. Analyzing all the above techniques gave us deep insight into the various methodologies used in sentiment analysis. Therefore, we propose a comparison-based study of the performance of feature extraction techniques used in sentiment analysis, and the results are described in the following sections.
The proposed study involves two different feature extraction techniques, TF-IDF and Doc2Vec. These techniques are used to extract features from the text data. These features are then used to identify the polarity of the text data using classification algorithms.
3 Methodology
3.1 Pre-processing
Two techniques are used for feature extraction: TF-IDF and Doc2Vec. Figure 2 shows the steps involved in extracting features using both techniques.
3.2 Feature Extraction
3.2.1 TF-IDF
3.2.1.1 TF
Term Frequency (TF) measures the number of times a particular term t occurred in a document d. The frequency increases when the term occurs multiple times. TF is calculated as the ratio of the frequency of term t in document d to the total number of terms in that particular document d:

TF(t, d) = (Number of times term t appears in d) / (Total number of terms in d)    (1)
3.2.1.2 IDF
TF measures only the frequency of a term t. Some terms, like stop words, occur multiple times but may not be useful. Hence, Inverse Document Frequency (IDF) is used to measure a term's importance. IDF gives more importance to rarely occurring terms. IDF is calculated as:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)    (2)
The final weight for a term t in a document d is calculated as:

TF-IDF(t, d) = TF(t, d) × IDF(t)    (3)
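As an illustration, TF, IDF, and the final TF-IDF weight (the product of TF and IDF) can be sketched in plain Python; the function names and the toy corpus below are our own, not part of the study:

```python
import math

def tf(term, doc):
    # TF: occurrences of the term divided by the total number of terms in the doc
    return doc.count(term) / len(doc)

def idf(term, docs):
    # IDF: log of (total documents / documents containing the term);
    # assumes the term appears in at least one document
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    # Final weight: TF * IDF
    return tf(term, doc) * idf(term, docs)

# Toy corpus: each document is a list of tokens
docs = [["good", "movie"], ["bad", "movie"], ["good", "plot", "good", "cast"]]
print(tfidf("good", docs[2], docs))  # "good" occurs in 2 of 3 documents
```

Rare terms receive larger IDF, so a term appearing in every document contributes a weight of zero, which is the stop-word-damping effect described above.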
3.2.2 Doc2Vec
3.3 Classification
Classifiers are trained on the training data sets using the features obtained from the feature extraction techniques TF-IDF and Doc2Vec, together with the corresponding output labels from the dataset. The test data is then evaluated with the trained classifiers to predict the sentiments, and the accuracy is measured on the test data. The classifiers Logistic Regression, K-Nearest Neighbors, Decision Tree, Bernoulli Naïve Bayes, and Support Vector Machines with linear and RBF kernels are used for sentiment analysis of the datasets.
In linear regression we try to predict the value y^(i) for the i-th training sample using a linear function y = h_θ(x) = θ^T x. This is clearly not a great solution for predicting binary-valued labels (y^(i) ∈ {0, 1}). Logistic regression uses a different hypothesis that predicts the probability that a given example belongs to class "1" and the probability that it belongs to class "0". Logistic regression is formulated as:

P(y = 1|x) = h_θ(x) = 1 / (1 + e^(−θ^T x))    (4)

P(y = 0|x) = 1 − h_θ(x)    (5)
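The hypothesis of equations (4) and (5) can be sketched as follows; the weight vector θ here is illustrative rather than learned from data:

```python
import math

def sigmoid(z):
    # Logistic function: 1 / (1 + e^(-z)), where z = theta^T x
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    # Dot product theta^T x followed by the logistic function
    z = sum(t * xi for t, xi in zip(theta, x))
    p1 = sigmoid(z)          # P(y = 1 | x), equation (4)
    return p1, 1.0 - p1      # P(y = 0 | x), equation (5)

theta = [0.5, -0.25]         # illustrative weights, not fitted here
p1, p0 = predict_proba(theta, [2.0, 4.0])
print(p1, p0)
```

Training would choose θ to maximize the likelihood of the observed labels; the sketch only shows how the hypothesis turns a feature vector into class probabilities.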
In the K-Nearest Neighbors (KNN) algorithm, each sample is represented as a point in the feature space with a class label. The training of the KNN algorithm involves storing the training samples as feature vectors with their corresponding labels. The value of k is decided by the user during classification, and an unlabeled sample is assigned to the class that is most frequent among the k training samples nearest to it. Euclidean distance is used to calculate the distance.
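A minimal sketch of this voting scheme (the toy training points and labels below are hypothetical):

```python
import math
from collections import Counter

def knn_predict(train, k, query):
    # train: list of (feature_vector, label) pairs; query: feature vector.
    # Sort stored samples by Euclidean distance to the query point.
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    # Majority vote among the k nearest neighbours decides the class.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [([0.0, 0.0], "neg"), ([0.1, 0.2], "neg"),
         ([1.0, 1.0], "pos"), ([0.9, 1.1], "pos")]
print(knn_predict(train, k=3, query=[0.2, 0.1]))  # prints "neg"
```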
Bayes' theorem forms the basis of the Naïve Bayes classifier. It naively assumes that the features are independent and calculates the probability accordingly. The multivariate Bernoulli event model is one of the Naïve Bayes classifiers used in sentiment analysis. If x_i is a boolean expressing the occurrence or absence of the i-th term of the vocabulary, then the likelihood of a document given a class C_k is given by:

P(x|C_k) = ∏_{i=1}^{n} p_{ki}^{x_i} (1 − p_{ki})^{1−x_i}    (6)
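Equation (6) can be sketched directly; the per-term probabilities below are illustrative rather than estimated from any dataset:

```python
def bernoulli_likelihood(x, p_k):
    # Equation (6): product over vocabulary terms of
    # p_ki^x_i * (1 - p_ki)^(1 - x_i), where x_i is 1 if term i occurs
    # in the document and p_ki is its occurrence probability in class C_k.
    likelihood = 1.0
    for x_i, p_ki in zip(x, p_k):
        likelihood *= p_ki ** x_i * (1.0 - p_ki) ** (1 - x_i)
    return likelihood

p_pos = [0.8, 0.1]   # illustrative estimates for a 2-term vocabulary
print(bernoulli_likelihood([1, 0], p_pos))  # 0.8 * (1 - 0.1) = 0.72
```

Note that the absence of a term (x_i = 0) also contributes a factor, which is the distinguishing property of the Bernoulli event model.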
The training samples are represented as points in the feature space. SVM performs classification by separating the points with a set of margin planes. The boundary hyperplane is chosen so as to maximize the distance to the training samples. Support vectors are the points that determine the margin planes. Data which can be separated linearly is classified using the linear kernel, and data which is not linearly separable is classified using the RBF kernel.
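The two kernels compare a pair of feature vectors: the linear kernel is the dot product K(x, z) = x · z, and the RBF kernel is K(x, z) = exp(−γ‖x − z‖²). They can be sketched as follows; the gamma value is an illustrative default, not one tuned in this study:

```python
import math

def linear_kernel(x, z):
    # K(x, z) = x . z : suitable when the data is linearly separable
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2) : implicitly maps points into a
    # higher-dimensional space for data that is not linearly separable
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = [1.0, 2.0], [2.0, 0.0]
print(linear_kernel(x, z))   # 2.0
print(rbf_kernel(x, x))      # 1.0 — identical points have maximal similarity
```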
4 Experimental Results
The first experiment is done on a small data set (700 positive and 800 negative reviews) of movie reviews taken from the Large Polarity Dataset 1.0 [8]. Secondly, the experiment is done on the Sentiment Labelled dataset [9] (2000 reviews) from the UCI data repository. The third experiment is done on a corpus (100 negative and 100 positive documents) made from the polarity dataset v2.0 [10] (1000 negative and 1000 positive documents) taken from cs.cornell.edu. The fourth experiment is done on the sentence polarity dataset v1.0 (5331 positive and 5331 negative reviews) taken from cs.cornell.edu [11]. The fifth experiment is done on the Large Movie Review dataset [8] (25000 positive and 25000 negative reviews) taken from ai.stanford.edu. Using regular expressions and string processing methods, sentences and labels are separated from the datasets. The sentences obtained are fed into the feature extraction techniques TF-IDF and Doc2Vec to generate real-valued vector features for each sentence. The split of training and testing samples is done either by the hold-out method, where 50% of the data is used for training and 50% for testing, or by 10-fold cross validation (CV), where 9 folds are used for training and 1 fold for testing. Table 1 shows the methods used for splitting the training and test datasets.
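The 10-fold splitting scheme can be sketched in plain Python; this is a simplified version without the shuffling or stratification a library implementation might apply:

```python
def k_fold_splits(samples, k=10):
    # Partition samples into k folds; each fold serves once as the test
    # set while the remaining k-1 folds form the training set.
    folds = [samples[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

data = list(range(20))
splits = list(k_fold_splits(data, k=10))
print(len(splits))  # 10 train/test pairs
```

The reported accuracy under CV is the average over the k test folds, so every sample is used exactly once for testing.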
Doc2Vec hyperparameters for the five datasets:

Parameter     D1     D2     D3     D4     D5
min count     1      1      1      1      1
window size   10     10     10     10     10
vector size   100    100    100    100    100
sample        1e-4   1e-5   1e-5   1e-5   1e-4
negative      5      0      5      5      5
workers       7      1      1      1      7
dm            1      0      0      0      1
Each training and testing data set obtained by either of the above methods is fed into the feature extraction techniques TF-IDF and Doc2Vec to generate vectors. The TF-IDF vectorizer used is tuned with the hyperparameters min_df (the minimum number of times a term t has to occur in all sentences), n_d (the number of sentences in a training or testing corpus), max_df (the maximum number of times a term t can occur in all the sentences, calculated as max_df * n_d), encoding (the text encoding used), sublinear_tf (use of sublinear term frequency scaling), use_idf (use of inverse document frequency), and stop_words (to eliminate stop words). The Doc2Vec vectorizer used is tuned with the hyperparameters min_count (the minimum number of times a word has to occur), window size (the maximum distance between the predicted word and the context words used for prediction), vector size (the size of the vector for each sentence), sample (the threshold for configuring which higher-frequency words are randomly downsampled), negative (negative sampling is used for drawing noisy words), workers (the number of workers used to extract feature vectors), dm (the type of training algorithm used: 0 for distributed bag of words and 1 for the distributed memory model), and epochs (the number of iterations for training). Tables 2 and 3 show the hyperparameters tuned for both techniques.
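One such Doc2Vec configuration can be sketched with the gensim library; the parameter values mirror one column of the hyperparameter tables, while the epochs value and the toy sentences are illustrative:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = ["a truly great movie", "a dull and boring film"]  # toy corpus
docs = [TaggedDocument(words=s.split(), tags=[i])
        for i, s in enumerate(sentences)]

# dm=1 selects the distributed memory model; dm=0 would select
# the distributed bag-of-words model.
model = Doc2Vec(documents=docs, min_count=1, window=10, vector_size=100,
                sample=1e-4, negative=5, workers=7, dm=1, epochs=20)

# Infer a 100-dimensional feature vector for an unseen sentence;
# these vectors are what the classifiers are trained on.
vec = model.infer_vector("great film".split())
```

This is a configuration sketch rather than the authors' exact code; it requires gensim to be installed.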
5 Performance Analysis
Figure 3 shows the plot of accuracy for all classifiers using TF-IDF, and Figure 4 shows the plot of accuracy for all classifiers using Doc2Vec. Based on the accuracy values from Tables 5 and 6, the following observations are made:
1. For the first dataset, Doc2Vec performed better than TF-IDF for all classifiers. The accuracy is highest for Logistic Regression in the case of TF-IDF, whereas SVM with the RBF kernel achieved the highest accuracy in the case of Doc2Vec.
2. For the second dataset, TF-IDF performed better than Doc2Vec. The accuracy is highest for SVM with the RBF kernel in the case of TF-IDF, whereas LR and SVM with the RBF and linear kernels achieved the highest accuracy in the case of Doc2Vec.
3. For the third dataset, TF-IDF and Doc2Vec achieved similar accuracies. SVM with the RBF kernel achieved the highest accuracy in the case of TF-IDF, whereas LR and SVM with the RBF and linear kernels achieved the highest accuracy in the case of Doc2Vec.
4. For the fourth dataset, Doc2Vec performed better than TF-IDF for all classifiers. The accuracy is highest for SVM with the RBF kernel in the case of TF-IDF, whereas
6 Conclusion
The purpose of this analysis is to provide deeper insight into the performance of the feature extraction techniques TF-IDF and Doc2Vec. Based on the accuracy measures for all the datasets, both Doc2Vec and TF-IDF achieved satisfactory performance on most of the datasets, but the accuracy scores of Doc2Vec are better than those of TF-IDF on most of them. This work can be extended by testing these techniques on unstructured data to analyze the performance of the features extracted by both methods.
References