Volume 4, Issue 2, February – 2019 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Twitter Sentiment Analysis


Faizan
Sharda University

Abstract:- With the evolving behavior of social networking sites such as Instagram, Twitter and Snapchat, the data posted by their users is increasing drastically, to the point that millions or even billions of items, whether text, video or audio, are posted per day. This is because each of these sites has millions of users, who share their thoughts and views on any topic of their choosing. Some of these users even post in vain. These posts are short and are therefore meant only to express a particular user's view of a particular thing. In this paper we aim to derive the feelings behind these posts. For this we have chosen Twitter as the social networking site; its posts are known as tweets. We examine methods of extracting and preprocessing Twitter data using Python, and then train and test a classifier on this data in order to derive the sentiments behind tweets.

I. INTRODUCTION

Microblogging sites have become a sea of data for analysts to draw on. This is because most individuals today are connected to some kind of microblogging site, where they voice whatever they feel about anything. It would not be wrong to say that these microblogging sites have given a voice to every individual who can access them. People from diverse parts of the world freely discuss, comment on and post their opinions about any topic of their choosing in real time. These posts are mostly either a complaint expressing a negative feeling or an appreciation expressing a positive feeling towards a topic. The topic could be a product from an organization, such as a laptop or a phone, a famous entity, or anything else. Most leading organizations today employ analysts whose job is to derive the emotions of people behind these posts. This gives them a proper review of their product or company, which helps them understand public demand and the alterations they need to make in order to build a better product in future. From the discussion above it can be concluded that these microblogging sites could become an asset to different organizations, public or private, if sentiment analysis is applied to them.

Sentiment analysis, also known as analysis of feelings, is a useful tool for analyzing sites where people post their opinions on a topic of interest. With the help of this kind of analysis, organizations can obtain the sentiments that people post as tweets, comments or reviews regarding a particular entity or product of interest to them. This is in accordance with [10], which states that almost 87% of people with an internet connection check reviews before making a purchase. The technique can be used for different purposes. Politicians, for example, could use it to analyze what kind of sentiments people from different areas hold towards them, and hence invest more in those areas. An example is the recent Trump election campaign, for which a group of analysts was hired for this specific purpose. Sentiment analysis can also be applied in the field of business marketing. With the help of this technology, business organizations capture the feelings of people regarding their own products and those of their competitors, and shape their strategies in accordance with this knowledge. Leaving market research aside, sentiment analysis could play a vital part in service industries, as it can analyze the full customer experience and reveal customer feelings, which could prove very beneficial.

II. LITERATURE SURVEY

Sentiment analysis of Twitter data is under constant study within the research community, as its applications have a huge influence on the working of different industries today. The main challenge faced by this type of analysis is the variation in language and the complex structure of the extracted data.

Aliza Sarlan, Chayanit Nadam and Shuib Basri [2] conducted experiments on Twitter data in which they simply extracted the tweets in JSON format and used a Python lexicon dictionary to assign polarity to the tweets. Mandava Geetha Bhargava and Duvvada Rajeswara Rao [3], on the other hand, went a step further and used learning methods for the same purpose, achieving better accuracy. For this they collected data regarding cryptocurrency and applied algorithms such as naïve Bayes and SVM (Support Vector Machine) to it. These experiments further concluded that the naïve Bayes classifier is more accurate than SVM.

Another study was conducted by Agarwal, Xie, Vovsha, Rambow and Passonneau [4], in which a unigram model was used as a baseline and compared with two other models, one based on features and another based on a tree kernel. The experiments revealed that the feature-based model outperformed the unigram model by a negligible margin, whereas both the unigram and the feature-based models were outperformed by the tree-kernel-based model by a significant margin.

Akshi Kumar and Teeja Mary Sebastian [5] proceeded with an approach that combined corpus-based and lexicon-based methods. This combination is very rarely found in the work that has been done in this field, as machine learning techniques are taking over. In their experiments they used adjectives and verbs as their features, and used corpus-based techniques for
finding the semantic orientation of the adjectives present in the tweets, while for the verbs they used a lexicon dictionary. A linear equation is used to express the total sentiment polarity of a tweet.

K. Arun et al. [1] gathered data on different aspects of demonetization from Twitter. They used the R language as a tool for analyzing these tweets. Not only were the tweets analyzed, but the results were also visualized using different plots such as word clouds. These plots showed that the number of people accepting demonetization was greater than the number of people rejecting it.

Vaibhavi N. Patodkar and Imran R. Shaikh [6] aimed to predict whether the emotions of the audience watching a TV show are positive or negative. For this purpose they extracted comments about some randomly chosen TV shows and used these as the data set for training and testing the model. The model they chose was a naïve Bayes classifier, and the result was displayed using a pie chart, which showed that negative tweets outnumbered positive ones.

As stated in the section above, sentiment analysis can be used in politics. Tumasjan et al. [7] recognized its benefits for elections and used it to predict the results of the 2009 German federal election. They extracted approximately 100,000 tweets about the political parties of that time and area, and analyzed them to obtain their sentiments. For this they used the software known as Linguistic Inquiry and Word Count (LIWC2007) [8], which uses textual analysis as a base to derive sentiments. The results obtained by this analysis were very similar to the actual results of the election. Another interesting study was carried out by Kaur and Kumar [9]. They applied sentiment analysis in a new way, using it to improve the response to crisis situations. They collected data from 2014 about a flood that occurred in Kashmir at that time. Their data set consisted of almost 8490 tweets, to which the naïve Bayes classification technique was applied. Their research showed that applying sentiment analysis in such crisis situations could help the government save lives.

III. DATA CHARACTERISTICS

There are a great many social networking sites available these days, but in this paper we deal with just one of them, Twitter. Twitter is currently very popular because of its specific format of writing. A few of the characteristics of tweets are given below:
• Tweet length: tweets are short messages consisting of a maximum of 140 characters.
• Tweet availability: Twitter is more popular than any other social networking site to date, so much so that approximately 1.2 billion tweets are posted daily.
• Topics discussed: tweets are posted by a wide variety of people, so the topics discussed also vary and can include almost anything, from politics to mobile products.
• Writing style or technique: tweets are written in a very informal style, i.e. spelling mistakes, slang and smileys are all common. Preprocessing therefore plays a vital role.
• Real time: tweets are very short and are therefore updated quite often.
• Emoticons: these represent the facial expression of the user in written form. They are created by combining punctuation marks with other characters.
• User mentions: the '@' character is used to mention a user, so as to direct the message towards them.
• Hash tagging: the '#' character is used to mention the topic to which the tweet relates.
• Other symbols: 'RT' is used to mark a tweet as a retweet, meaning that it has been posted again.

IV. METHODOLOGY

The proposed method for sentiment analysis in this paper can be represented in five stages, which are listed below:
A. Data Collection
B. Data Preprocessing
C. Feature Selection
D. Model Selection
E. Model Evaluation

A. Data Collection
Data collection is the first phase of the analysis, as there needs to be data for us to analyze. In our experiments we used the Python programming language as a tool. That said, data collection for this particular analysis can be carried out in two ways. The first way is to collect pre-organized data from sites such as Kaggle. On these sites the pre-organized data is uploaded by the developers of the sites themselves or posted by different researchers for free. All one needs to do to acquire this data is to create a free account on these sites. The second way is to extract data from Twitter manually using one of the APIs available for Twitter. For this we chose Tweepy. Tweepy was not compatible with the newer versions of Python (Python 3.7), so an older version of Python (Python 2.7) was needed to use it. To access tweets through the API, we first need to authenticate the console from which we are trying to access Twitter. This can be done by following the steps listed below:
• Create a Twitter account.
• Log in at the Twitter developer portal.
• Select "New App" at the developer portal.
• A form for the creation of a new app appears; fill it out.
• After this, the app for which the form was filled out goes for review by the Twitter team.
• Once the review is complete and the registered app is authorized, and only then, the user is provided with an 'API key' and an 'API secret'.
• After this, the "Access token" and "Access token secret" are given.

These keys and tokens are unique to each user, and only with their help can one access tweets directly from Twitter. For this paper we extracted a data set consisting of almost 3000 tweets. These tweets were collected using #USairlines and are thus about different US airlines. We used the TextBlob package for Python to pre-annotate the polarity of these tweets.

Data set          No. of tweets
Training data     2343
Testing data      585
Table 1:- Data Distribution

B. Data Preprocessing
Pre-processing means converting raw data into a more convenient format that can be fed to a classifier in order to improve the accuracy of the classifier. In our case the raw data extracted from Twitter through the API is initially unstructured and noisy, as useless characters are very common in it.

We therefore remove all unnecessary characters and words from this data using the Python module for regular expressions, re for short. This module uses symbolic patterns to represent the different kinds of noise in the data and therefore makes it easy to drop them. Twitter data in particular contains various common useless phrases and spelling mistakes, which need to be removed to boost the accuracy of our results. These can be summed up as follows:
• Hashtags: these are very common in tweets. A hashtag represents the topic of interest about which the tweet is written, and looks something like #topic.
• @Usernames: these represent the user mentions in a tweet. Sometimes a tweet is written and then associated with some Twitter user; these are used for that purpose.
• Retweets (RT): as the name suggests, retweets are used when a tweet is posted again by the same or a different user.
• Emoticons: these are very commonly found in tweets. Facial expressions such as a smile are formed using punctuation marks; these are known as emoticons.
• Stop words: stop words are words that are useless when it comes to sentiment analysis, such as "it", "is" and "the".

C. Feature Selection
As mentioned earlier in this paper, different researchers have used different features for the classification of tweets; in our experiments similar features are used. These features include unigrams, bigrams, n-grams, POS tags, subjective and objective features, and so on. NLTK, short for Natural Language Toolkit, is another Python module, which is also open source and can be used to extract these features.

D. Model Selection
Once the data has been pre-processed, it is fed to a classification model for further processing. These models are built on different classification algorithms. In this paper we chose the k-nearest neighbour model to perform the classification.

KNN, or the k-nearest neighbour algorithm, is a machine learning technique used for classifying a set of data into its given target values (in our case positive, neutral or negative). KNN can also be used for regression problems but is more widely used for classification problems.

Any classification model needs a target set on which the model is trained for further use. In the works mentioned in the literature survey section, most authors set these target values manually to positive, negative or null. For this paper we used a Python library known as TextBlob to set the target for each tweet automatically.

The data set is then divided into two parts, a training set and a testing set. The data set used in our experiments consisted of 2928 tweets. The training portion consisted of 2343 tweets, whereas the test set consisted of 585 tweets. The training and test sets then need to be transformed into numeric values so that they can be fed to the model, as the models do not understand any other kind of value.

For this we used another Python module known as sklearn, which contains many classification models as well as different encoders. In this paper this library is used for model selection, label encoding and model evaluation, the last of which is described in the next section.

E. Model Evaluation
One of the most common and appropriate techniques for evaluating a classifier is the confusion matrix. A generalized form of the confusion matrix is given in Table 2 below:

                  Predicted class 1      Predicted class 2
Actual class 1    True positive (tp)     False negative (fn)
Actual class 2    False positive (fp)    True negative (tn)
Table 2:- General Confusion Matrix

By applying this technique we can derive the generalized evaluation parameters. These parameters include:

• Accuracy: the accuracy of a classifier indicates how accurately the classifier has predicted the result. It can be calculated using the formula:
$$\text{accuracy}(a) = \frac{tp + tn}{tp + tn + fp + fn}$$
• Precision: precision shows how often the classifier is correct when it predicts the positive class. The formula for precision is:
$$\text{precision}(p) = \frac{tp}{tp + fp}$$

• Recall: it indicates the true positive rate of the classifier. The formula for recall is:
$$\text{recall}(r) = \frac{tp}{tp + fn}$$

• F1 score: it indicates the weighted average (harmonic mean) of precision and recall. The formula for the F1 score is:
$$F1\ \text{score} = \frac{2\,p\,r}{p + r}$$
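As a worked check of the four evaluation formulas, the following minimal Python sketch applies them class by class (each class is treated as the "positive" class in turn) to the counts reported in Table 3 of Section V. It assumes that the rows of Table 3 are the actual classes and the columns the predicted classes, which the paper does not state. The names accuracy, per_class_metrics, LABELS and TABLE_3 are illustrative only and are not taken from the original experiments, which relied on sklearn.

```python
# Sketch: accuracy, precision, recall and F1 from a multi-class confusion matrix.
# Assumption (not stated in the paper): rows are actual classes, columns are predicted classes.

LABELS = ["negative", "neutral", "positive"]

# Counts copied from Table 3 (Output Confusion Matrix).
TABLE_3 = [
    [1612, 161, 80],   # actual negative
    [418, 161, 34],    # actual neutral
    [276, 46, 140],    # actual positive
]


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f1)}, one class at a time."""
    results = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                 # actual class i, predicted as another class
        fp = sum(row[i] for row in matrix) - tp  # predicted class i, actually another class
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        results[label] = (p, r, f1)
    return results


if __name__ == "__main__":
    print(f"accuracy: {accuracy(TABLE_3):.4f}")
    for label, (p, r, f1) in per_class_metrics(TABLE_3, LABELS).items():
        print(f"{label:<9} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Under this assumption the accuracy comes out at 0.6533, in line with the reported 65.33%. The per-class values agree with Table 4 to two decimals only if the rows labelled Positive and Negative in Table 4 are exchanged, so the class labels in one of the two tables may be swapped. The counts in Table 3 also sum to 2928, the size of the whole data set rather than of the 585-tweet test set.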
V. RESULTS OF PROPOSED SYSTEM

The experiments we conducted showed a model accuracy of 65.33% for the KNN (k-nearest neighbour) classifier. The confusion matrix obtained after testing the classifier is given in Table 3 below:

            Negative    Neutral    Positive
Negative    1612        161        80
Neutral     418         161        34
Positive    276         46         140
Table 3:- Output Confusion Matrix

The other important model evaluation parameters mentioned in the previous section are given for this experiment in Table 4 below:

            Precision    Recall    F1-Score
Positive    0.70         0.87      0.78
Neutral     0.44         0.26      0.33
Negative    0.55         0.30      0.39
Table 4:- Evaluation Parameters

The positive, null and negative tweets can also be represented using word clouds. The word clouds for our data set are given below:

Fig 1:- Negative Wordcloud
Fig 2:- Positive and Null Wordcloud

VI. CONCLUSION AND FUTURE WORK

In this paper we have demonstrated a system for the analysis of textual Twitter data, in particular for the emerging field of sentiment analysis. For this we built a sentiment analysis model using the KNN algorithm with unigram, bigram and n-gram features. We then trained and tested this model on the #USairline data set, for which we attained an accuracy of 65.33%. In the near future we aim to perform a similar analysis in order to increase the accuracy of the model, and to do so without extracting any features, by using deep learning techniques such as neural networks.

REFERENCES

[1]. Arun K., Sinagesh A. & Ramesh M., "Twitter sentiment analysis on demonetization tweets in India using R language", International Journal of Computer Engineering in Research Trends, vol. 4, no. 6, (2017), pp. 252-258.
[2]. Aliza Sarlan, Chayanit Nadam, Shuib Basri, "Twitter sentiment analysis", 2014 International Conference on Information Technology and Multimedia (ICIMU), November 18 – 20, 2014, Putrajaya, Malaysia, 978-1-4799-5423-0/14/$31.00 ©2014 IEEE.
[3]. Mandava Geetha Bhargava, Duvvada Rajeswara Rao, "Sentiment analysis on social media data using R", International Journal of Engineering & Technology, 7 (2.31) (2018) 80-84.
[4]. Agarwal, A., Xie, B., Vovsha, I., Rambow, O., and Passonneau, R., "Sentiment analysis of Twitter data", In Proceedings of the ACL 2011 Workshop on Languages in Social Media, pp. 30–38, 2011.
[5]. IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 4, No 3, July 2012, ISSN (Online): 1694-0814.
[6]. International Journal of Innovations & Advancement in Computer Science (IJIACS), ISSN 2347 – 8616, Volume 6, Issue 7, July 2017.
[7]. Tumasjan, A., Sprenger, T., Sandner, P., and Welpe, I., "Predicting elections with Twitter: What 140 characters reveal about political sentiment", In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, pp. 178–185, 2010.
[8]. Pennebaker, J., Booth, R., and Francis, M., "LIWC2007: Linguistic inquiry and word count", Computer software, Austin, TX: LIWC.net, 2007.
[9]. Harvinder Jeet Kaur, Rajiv Kumar, "Sentiment analysis from social media in crisis situations", International Conference on Computing, Communication & Automation, 2015.
[10]. A. K. Jose, N. Bhatia, and S. Krishna, "Twitter Sentiment Analysis", National Institute of Technology Calicut, 2010.

IJISRT19FB242 www.ijisrt.com 181
