Mehta 2021
Abstract—The stock market is a highly dynamic market where nothing is as stable as a rock, but as technology advances there are many methods one can use to learn these dynamic changes and be prepared accordingly. This paper focuses on such methods of dynamically learning the market and its trends. We have used three different models and have also performed sentiment analysis on tweets regarding the company or the stock; the model with the least error is the ideal and most preferred method for prediction. The results of this classification give a clear and insightful idea about the random ups and downs of the market and offer investors a new approach for deciding where to put their money. The ARIMA model gives the best accuracy for every stock.

Keywords—Sentiment Analysis, ARIMA, LSTM, Linear Regression, Naïve Bayes, Stock Market Prediction, Tweets

I. INTRODUCTION

The history of the world's stock markets goes back to the 17th century, when the Dutch East India Company was listed on an official stock exchange in 1611. Ever since, investors have been looking for techniques to gather knowledge about listed companies and find ways to improve their investment returns. Initially, investors simply relied upon their past experience to identify patterns and predict stock prices. However, with the field growing at such a high pace, with over 30 million trading accounts across India, the traditional techniques have become irrelevant.

With ever-growing technology, investors are moving towards Intelligent Trading Systems rather than fundamental analysis, which also helps them make better investment decisions. It may seem an unachievable task to match the experience and professionalism of a trader who has been in the field for decades, but with the amount of data available and recent technological advancements, it is quite possible to come up with algorithms that predict stock prices. The field of Machine Learning has seen significant development in financial time series prediction. Eval Becker et al. reported positive results from collecting tweets regarding a particular company [1]. In this paper, we have collected tweets using Tweepy, a Python library for accessing the Twitter API. Following this comes the challenge of classifying the dataset by company name and by the sentiments shown in the tweets.

On the qualitative side, the news feed has a significant impact on how a stock behaves over a period of time, which suggests that media and stock market trends are highly interconnected [4]. Twitter is probably the most dependable and the fastest way of consuming media. On the quantitative side, a lot of data is available for training models. The model is trained on data from the Yahoo Finance API, which it also uses to make predictions.

II. LITERATURE SURVEY

Major stock price prediction methods have been built around the basic premise of fundamental and technical analysis, although recent studies have shown that stock prices have a strong correlation with the news articles about the company [7].

Financial analysts used hourly stock prices of 30 different stocks and their corresponding news articles and tweets regarding the company from the NASDAQ website. They also collected the tweets related to all 30 stocks for a span of six months. Song et al. [2] collected six years of data from the Hong Kong stock market [17]. They gathered the financials of the companies and stocks together with the corresponding news articles and tweets for the same time period in order to draw a correlation between news articles and stock trends. For each trading day, they also recorded the open, close, high, and low prices of each company's stock.

Dogra et al. [8] performed a detailed study of the efficacy of many classifiers, such as KNN, Random Forest, SVM (Support Vector Machine), and Naive Bayes [11][12][16], in predicting stock trends. We can hence state that although the Naive Bayes model has substantially greater accuracy, the f-measures of SVM [10] and of certain other algorithms such as Random Forest indicate that they perform better overall. Deep learning models are extremely costly. They also do not give better insight into stock prices than simpler methods, so they are not a suitable choice for an ensemble [15]. Moreover, simple traditional indicators such as the opening, closing, high, and low prices and the moving average are considered more helpful in estimating the future of a particular stock.

Current trends in the stock market show a correlation with the previous sequence of stock values over a given time period. The Naive Bayes approach [16] only aims to predict the closing price of a particular stock based on that day's other market values, such as the opening price. In such a case, calculating the changes in the price over a duration of time can critically improve the accuracy of the model. Bo et al. [9] put forward an attribute-measuring technique using genetic algorithms and SVM for forecasting. The stock trends used are based on elements of the valuation index such as Working Capital, P/E Ratio, Price to Book Ratio,
Authorized licensed use limited to: University of Glasgow. Downloaded on August 17,2021 at 11:29:03 UTC from IEEE Xplore. Restrictions apply.
Price to Sales Ratio, Total Market Value and Cash Flow Ratio. This method of characteristic sorting with the help of the genetic algorithm [14] can be used as a preprocessing step to point out the chief technical measures, which in turn are more helpful to traders and can be calculated simply from the closing price.

Currently, Deep Learning has had a significant impact on the models used for sentiment analysis [3]. The most advanced architectures today are attention models and their modifications [16]. Aspect-based sentiment analysis (ABSA), which was previously not used often because it could identify only one aspect and required a large amount of labelled data [3], has witnessed a spurt in its development. Wang et al. proposed learning an embedding vector for each aspect and appending the aspect embedding to each word input vector in order to take better advantage of aspect information [13].

III. PROPOSED SYSTEM

A. System Overview

The scope of this paper is to classify the stocks according to their sectors and capture the sentiment from the tweets regarding a particular stock or a company as a whole. The flowchart in Fig. 1 shows the overall system architecture. The system can be divided into three main components: a) collection of stock data such as the opening, closing, high, and low prices, and preprocessing of tweets for sentiment analysis; b) classification of the stock ticker using different models; and c) analysis of the labels obtained after classification.

These steps are explained further in the next section.

Fig. 1. System Flow Diagram

B. The Proposed Method

1) Collection of Stock Ticker

The dataset is built from the tickers of different stocks obtained from the Yahoo Finance API, and the tweets for sentiment analysis are extracted from the Twitter API using Tweepy. The advantage of this method is the ability to specify keywords that the tweets must contain by defining a string of rules [7]. For example, the string can be defined with specific words such as company names or stock names, and certain words such as fall, depreciate, open, and closes can be highlighted. In the Indian stock market, the Nifty index tracks the 50 best-performing stocks, so the trends for all stocks in that list, such as the Volume, Opening, and Closing prices, are loaded into the model. Retweets have not been considered, as the information in them might not be accurate and could therefore alter the accuracy of the system during prediction. The latest tweets are fetched daily as the stock market opens. To test the model, the code for collecting the tweets was run on 16th March, 2021 and 23rd March, 2021.

2) Data preprocessing

In this paper, only tweets made by news portals have been included, since they are the only reliable source of news, though it is possible that these agencies are governed by certain companies, which could lead to inside information. Users with very few tweets and tweets not related to the market have also been removed. Following the removal of such tweets, text preprocessing was done using regular expressions [16]. The presence of hashtags and mentions in the text increases the length of the sentence, thereby decreasing the ability of the model to classify the sentiments [3]. Hence, they have been removed. In addition, the tweets have been converted to lowercase, and runs of multiple whitespace characters have been replaced with a single occurrence.

After preprocessing the dataset, extra-trees feature selection is used to choose the features that are most effective in predicting the closing price of the stock. Scikit-learn implements an estimator that fits a number of randomized decision trees on various sub-samples of the dataset and averages their results. This increases the prediction accuracy of the model and controls overfitting. For our experiments, the model extracts all tweets that mention the company name, the stock name, or the name of any current board member of the company.
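The tweet clean-up steps described under Data preprocessing (removing hashtags and mentions, lowercasing, and collapsing whitespace) can be sketched with Python's standard `re` module. This is a minimal illustration, not the authors' code: the paper does not list its exact regular expressions, so the patterns and the helper names `clean_tweet` and `is_retweet` are assumptions, as is the choice to drop the whole hashtag or mention token rather than only the leading symbol.

```python
import re

# Assumed patterns: the paper does not give its actual expressions.
MENTION_OR_HASHTAG = re.compile(r"[@#]\w+")  # e.g. "@ETMarkets", "#Nifty"
WHITESPACE_RUN = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Remove mentions and hashtags, lowercase, and collapse whitespace."""
    text = MENTION_OR_HASHTAG.sub(" ", text)
    text = text.lower()
    return WHITESPACE_RUN.sub(" ", text).strip()


def is_retweet(text: str) -> bool:
    """Assumed convention: a classic retweet starts with 'RT @'."""
    return text.startswith("RT @")


if __name__ == "__main__":
    raw = [
        "SBI  shares CLOSE higher  #Nifty @ETMarkets today",
        "RT @someone: SBI falls",
    ]
    # Retweets are skipped, as in the collection step; the rest are cleaned.
    print([clean_tweet(t) for t in raw if not is_retweet(t)])
```

Stripping whole tokens keeps the sentences short, which is the stated motivation for the step; keeping the hashtag word and removing only the `#` would be an equally plausible reading of the text.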
3) Testing the accuracy of different models

Stock market prices are sequential data, i.e., every point can be mapped to a particular occurrence at a given time, and the points can work conjointly to predict future data. A recurrent neural network (RNN) has self-loops in its hidden layers that allow it to use the previous set of hidden neurons when learning the current state from the given input. Since RNNs suffer from the vanishing gradient problem, a plain RNN is not a suitable fit for the model. As the prices of different stocks differ by a significant amount, the data first has to be normalized to a range of 0 to 1 before it is given to the model for training. The model we used was trained for 25 epochs, with the size of the layers varied for further tuning. After training, all the models were tested, and the model with the least RMSE score is considered the ideal model for prediction. The first model is LSTM [6], a special type of RNN introduced by Hochreiter and Schmidhuber [5] in 1997. In this model the hidden layers are replaced by LSTM cells [16], which consist of various gates that control the flow of input data. The second model is ARIMA, a linear forecasting model that is very accurate; it can handle extreme values, although outliers remain difficult for ARIMA to forecast because they tend to lie outside the general trend captured by the model. The main purpose of using these different models is to compare them and gain a clear idea of their performance and of which one gives better accuracy.

4) Analysing the results

The experiment has been carried out for several different models, one of them a deep learning model and the others ordinary linear models. From Table 1 it is evident that the ARIMA model gives more precise results than the other two models. This is because the ARIMA model does not use any past or previous information for prediction; it only uses the ongoing window. This allows the model to dynamically predict and comprehend the changes in a particular stock and the patterns emerging from them, whereas the other two models use previous lags for their prediction and hence cannot dynamically adjust to changes in the stock market. This causes learning failures for those models, and hence they show poor accuracy.

IV. RESULTS AND EVALUATION

The accuracies of the Long Short-Term Memory (LSTM), ARIMA, and Linear Regression models obtained after classifying the sample containing the stock trends and their corresponding tweets are summarized in Table 1.

The results clearly show that the ARIMA model has the least error. The ARIMA model gives the best results for short-term forecasting, whereas the LSTM model is better for long-term prediction. For stock market prediction we need the short-term prediction to be more accurate, as the user has to run the model daily for each stock. In this scenario long-term accuracy matters less, and hence the ARIMA model gives the best RMSE score.

Fig. 2. Actual vs Predicted graph of State Bank Of India using ARIMA

Fig. 3. Actual vs Predicted graph of State Bank Of India using LSTM

V. CONCLUSION

In this paper, we have analyzed and compared the accuracy of three algorithms, namely ARIMA, LSTM, and Linear Regression, for predicting stock prices. We have used Tweepy, a Python library for accessing the Twitter API, to perform sentiment analysis of tweets. The app forecasts stock prices for the next seven days for any stock listed on NASDAQ or NSE. The sentiment analysis of tweets, combined with the predicted prices, is used to recommend whether the user should buy or sell a particular stock.
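As a rough illustration of the evaluation procedure in Section III-B.3 (scaling prices to the range 0 to 1 and preferring the model with the least RMSE), the following standard-library sketch may help. The function names `min_max_scale`, `rmse`, and `best_model` are hypothetical, the paper does not publish its code, and the numbers in the usage lines are toy values rather than results from the paper.

```python
import math


def min_max_scale(prices):
    """Scale a list of prices linearly into the range [0, 1]."""
    lo, hi = min(prices), max(prices)
    if hi == lo:
        # A constant series has no spread; map every point to 0.0.
        return [0.0 for _ in prices]
    return [(p - lo) / (hi - lo) for p in prices]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def best_model(actual, forecasts):
    """Return the name of the forecast with the least RMSE.

    `forecasts` maps a model name to its list of predictions.
    """
    return min(forecasts, key=lambda name: rmse(actual, forecasts[name]))


if __name__ == "__main__":
    # Toy values only, not data from the paper.
    actual = [1.0, 2.0, 3.0]
    forecasts = {"model_a": [1.0, 2.0, 3.5], "model_b": [2.0, 3.0, 4.0]}
    print(min_max_scale([10, 20, 30]))
    print(best_model(actual, forecasts))
```

Scaling each stock separately keeps high-priced and low-priced stocks on a comparable footing during training; note that RMSE values are only comparable between models when they are computed on the same scale.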
[4] W. Zhao et al., "Weakly-Supervised Deep Embedding for Product Review Sentiment Analysis," in IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 1, pp. 185-197, 1 Jan. 2018, doi: 10.1109/TKDE.2017.2756658.
[5] Hochreiter, Sepp & Schmidhuber, Jürgen. (1997). Long Short-term Memory. Neural computation. 9. 1735-80. 10.1162/neco.1997.9.8.1735.
[6] D. Li and J. Qian, "Text sentiment analysis based on long short-term memory," 2016 First IEEE International Conference on Computer Communication and the Internet (ICCCI), Wuhan, China, 2016, pp. 471-475, doi: 10.1109/CCI.2016.7778967.
[7] Mohan, S., Mullapudi, S., Sammeta, S., Vijayvergia, P., & Anastasiu, D. C. (2019). Stock Price Prediction Using News Sentiment Analysis. 2019 IEEE Fifth International Conference on Big Data Computing Service and Applications (BigDataService).
[8] I. Kumar, K. Dogra, C. Utreja and P. Yadav, "A Comparative Study of Supervised Machine Learning Algorithms for Stock Market Trend Prediction," 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), Coimbatore, India, 2018, pp. 1003-1007, doi: 10.1109/ICICCT.2018.8473214.
[9] Yang Wendong, Lou Zhengzheng and Ji Bo, "A multi-factor analysis model of quantitative investment based on GA and SVM," 2017 2nd International Conference on Image, Vision and Computing (ICIVC), Chengdu, China, 2017, pp. 1152-1155, doi: 10.1109/ICIVC.2017.7984734.
[10] Zainuddin, Nurulhuda & Selamat, Ali. (2014). Sentiment analysis using Support Vector Machine. I4CT 2014 - 1st International Conference on Computer, Communications, and Control Technology, Proceedings. 333-337. 10.1109/I4CT.2014.6914200.
[11] Amit Kumar Sirohi, Pradeep Kumar Mahato and V. Attar, "Multiple Kernel Learning for stock price direction prediction," 2014 International Conference on Advances in Engineering & Technology Research (ICAETR - 2014), Unnao, India, 2014, pp. 1-4, doi: 10.1109/ICAETR.2014.7012901.
[12] S. Selvin, R. Vinayakumar, E. A. Gopalakrishnan, V. K. Menon and K. P. Soman, "Stock price prediction using LSTM, RNN and CNN-sliding window model," 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Udupi, India, 2017, pp. 1643-1647, doi: 10.1109/ICACCI.2017.8126078.
[13] Rezwanul, Mohammad & Ali, Ahmad & Rahman, Anika. (2017). Sentiment Analysis on Twitter Data using KNN and SVM. International Journal of Advanced Computer Science and Applications. 8. 10.14569/IJACSA.2017.080603.
[14] Wang, Yequan & Huang, Minlie & Zhu, Xiaoyan & Zhao, Li. (2016). Attention-based LSTM for Aspect-level Sentiment Classification. 606-615. 10.18653/v1/D16-1058.
[15] U. Pasupulety, A. Abdullah Anees, S. Anmol and B. R. Mohan, "Predicting Stock Prices using Ensemble Learning and Sentiment Analysis," 2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), Sardinia, Italy, 2019, pp. 215-222, doi: 10.1109/AIKE.2019.00045.
[16] A. Sarkar, A. K. Sahoo, S. Sah and C. Pradhan, "LSTMSA: A Novel Approach for Stock Market Prediction Using LSTM and Sentiment Analysis," 2020 International Conference on Computer Science, Engineering and Applications (ICCSEA), Gunupur, India, 2020, pp. 1-6, doi: 10.1109/ICCSEA49143.2020.9132928.
[17] Xiaodong Li, Pangjing Wu, Wenpeng Wang, Incorporating stock prices and news sentiments for stock market prediction: A case of Hong Kong, Information Processing & Management, Volume 57, Issue 5, 2020, 102212, ISSN 0306-4573.
[18] R. Gupta and M. Chen, "Sentiment Analysis for Stock Price Prediction," 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), Shenzhen, China, 2020, pp. 213-218, doi: 10.1109/MIPR49039.2020.00051.