Report
Sentiment Analysis of
r/wallstreetbets During the
GameStop Saga
Alexander Murphy (MSc Computing & Information Systems) - 190614555
Franklin Antony (MSc Computer Science) - 200626510
Group size: 2
1. Introduction
   1.1 Overview
   1.2 Hypothesis
   1.3 Project Objectives
2. Literature Review
7. Data Cleaning
   7.1 Drop Redundant Columns
   7.2 Dealing with Missing Values
   7.3 Cleaning Text
8. Modelling
   8.1 Logistic Regression
   8.2 Linear SVC
   8.3 Multinomial Naïve Bayes
   8.4 Comparing Models
‘r/wallstreetbets’ (WSB) was the epicentre of the movement – an anarchic forum that focuses on risky, speculative
investment strategies and related memes. Threads on $GME alone were exceeding 200,000 comments per day at
the height of interest and membership grew from 1.5 to 9 million over January and February. Contributors would
often frame posts in a highly emotive manner, depicting a kind of class warfare between hedge funds and the
individual investor – a rich seam for sentiment analysis!
1.2 Hypothesis
We believe that related sentiment data being generated on social media contributed to the volatility of $GME,
particularly as the value of the stock far exceeded (up to 12x) fundamental analysis estimates, before fluctuating
wildly as more interested parties became involved. When compelling stories of market speculation gain traction on
social media, a fertile breeding ground for strong narratives, it seems intuitive that the effect of that volatility will
be amplified.
1.3 Project Objectives
1 Source: https://www.cnbc.com/2019/10/01/charles-schwab-is-eliminating-online-commissions-for-trading-in-us-stocks-and-etfs.html
2 Source: https://www.cnbc.com/2021/01/29/robinhood-investment-apps-dominate-app-store-rankings.html
into categories of efficiency, delineated by the actual levels of information available, and the ‘weak’ EMH
remains popular to this day.
The common occurrence of ‘bubbles’, where assets become highly overpriced, seems to challenge the EMH’s core
principle of investors behaving rationally regarding available information. Teeter and Sandberg (2017) argue that
bubbles are social phenomena, with seemingly irrational prices reflecting social trends and narratives.
As a means of identifying links between social media activity and the $GME bubble, SA can provide quantitative
and qualitative assessment of the volume and nature of opinion. SA’s modern, web-based incarnation has soared in
popularity: 99% of papers on the subject were published after 2004, and output grew roughly 100-fold between
2005 and 2016 (Mäntylä et al., 2016).
While SA was first used mainly as a means of aggregating review data on sites such as Amazon, attention has turned
to extracting public sentiment on social media, particularly Twitter – 3 of the 20 top-cited papers in data analytics
centred on working with data from Twitter (Mäntylä et al., 2016).
Early papers in this field focused on a simple binary sentiment/non-sentiment classification of tweets. As
interest in the subject grew, more sophisticated machine learning models were developed to identify polarity –
classifying statements as positive, negative or neutral (Pang et al., 2002).
When dealing with sentiment surrounding the stock market, looking at just polarity is suboptimal. The type of
positive or negative statements is important to consider (Read, 2005). For example, excitement about purchasing
shares at a low price has a different real-world impact to happiness about selling shares at a high price, even
though both emotions are positive. Acheampong et al. (2020) investigate Emotion Detection as a finer-grained
classification approach, surveying the recent advances made in the field.
3. Timeline Investigation
Due to WSB’s subscriber count being in the millions, there is an unwieldy amount of data that can be explored – as
mentioned in the introduction, threads on $GME alone were exceeding 200,000 comments per day in some cases.
To focus the scope of the project down to a manageable scale, we examined the timeline of developments to find
events that triggered peak levels of engagement. This ensures we get a selection of sentiment from the widest
possible pool of users, hopefully providing a picture as close to the reality of the situation as possible.
Using our own knowledge (having both bought shares and actively engaged with WSB), we felt the most significant
development in the context of wider engagement with the GME phenomenon was Elon Musk tweeting ‘Gamestonk!’
on January 26. As of 9:50 am the next morning the price had increased by 105%³ and we suspect that Musk’s
endorsement was the primary trigger for this dramatic rise in engagement and, consequently, the subsequent price
volatility over the next week. Because we want to capture the ups and downs of the saga, we selected a time frame
a couple of days after Musk’s tweet to counter the innate bias towards positive sentiment at that specific juncture.
4. Data Retrieval
4.1 Reddit Data
Reddit is an extremely popular social network and the rise of WSB’s visibility in mainstream media, given the real-
world implications of the GameStop saga, has only increased engagement. Combined with Reddit’s straightforward
API, relevant and substantial datasets were easy to find. All datasets were imported as CSV files and then extracted
into pandas data frames.
• For prediction, we are using reddit_wsb.csv, a substantial set from Kaggle containing nearly 45,000 posts
drawn from the week after Musk’s tweet. Attributes are:
title, score, id, url, comms_num, created, body, timestamp
3 Source: https://markets.businessinsider.com/news/stocks/gamestop-stock-price-elon-musk-gamestonk-tweet-extends-trading-rally-2021-1-1030009065
4.2 Labelled Reddit Data
• For training and testing our models, we will be using Reddit_wsb_labelled.csv, which provides a
large set of posts with associated polarity. Attributes are:
clean_comment, category
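Both CSV files are loaded into pandas data frames; the sketch below simulates this with an in-memory extract, since the rows shown are illustrative placeholders rather than real posts:

```python
import io
import pandas as pd

# Simulated extract of reddit_wsb.csv with the attributes listed above.
# In the project itself this would simply be pd.read_csv('reddit_wsb.csv').
raw_csv = io.StringIO(
    "title,score,id,url,comms_num,created,body,timestamp\n"
    "GME to the moon,120,abc1,http://example.com/1,45,1611849600,,2021-01-28 16:00:00\n"
    "Holding my shares,98,abc2,http://example.com/2,12,1611936000,still holding,2021-01-29 16:00:00\n"
)

df = pd.read_csv(raw_csv)
print(df.shape)           # (2, 8): two placeholder posts, eight attributes
print(list(df.columns))
```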
5. Libraries Used
• Pandas: A popular package combining simple-to-use data extraction utilities with powerful analytical tools
and structures.
• Numpy: An important reason for Python’s ubiquity in the scientific community. Provides support for large
multidimensional arrays and the mathematical computing power of more fundamental languages like C
and Fortran.
• Matplotlib: A data visualisation module that allows easily digestible plotting of Pandas Data Frames.
• Plotly: Another data visualisation package that utilises JSON to plot interactive graphs online.
• Seaborn: An extension of Matplotlib for advanced visualisation tools, such as combining graphs to
produce an aggregate display of the data in the context of this report.
• Scikit-learn: This package provides the machine learning models we will be using to classify posts as
positive or negative. The library interacts closely with Numpy.
• Re: Python’s regular expressions library, used to clean the text and perform other needed pre-processing.
• Text2emotion: This package is used for classifying posts into 5 emotional states for our data exploration:
Happiness, Sadness, Surprise, Anger and Fear.
6. Data Exploration
Before modelling began, it was important to examine and visualise the data to provide context for our analysis
and support our claims. All code is documented in the appendices.
6.1 Engagement by day of the week
Figure 1: Number of posts, by weekday
Figure 1 shows an extraordinary leap in engagement on Fridays. At first glance this appears odd – the stock
experienced white-knuckle volatility on all weekdays, so one would expect engagement to be distributed more
evenly across them. One explanation could be that there is a greater level of positivity (and intoxication!) on
Fridays that encourages more engagement. Another interesting theory relates to options (i.e., calls and puts)
expiring on the third Friday of every month: the spike could result from bot farms being used to influence
investors and protect assets, although this assertion is little more than a conspiracy theory and should be
verified in further work.
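The count behind Figure 1 can be sketched with pandas; the timestamps below are hypothetical stand-ins for the dataset's timestamp attribute:

```python
import pandas as pd

# A handful of hypothetical post timestamps (the real set has ~45,000 rows).
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-01-29 10:00",  # Friday
        "2021-01-29 14:30",  # Friday
        "2021-02-01 09:15",  # Monday
    ])
})

# Number of posts per weekday name, as plotted in Figure 1.
counts = df["timestamp"].dt.day_name().value_counts()
print(counts)
```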
6.2 Commonly used words
This frequency analysis chart shows just how dominant the subject of $GME is in the WSB community. This one
word had over twice the frequency of ‘buy’, the ubiquitous investment term for any stock.
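A word-frequency count of this kind can be sketched with Python's collections.Counter; the titles below are invented examples, not posts from the dataset:

```python
from collections import Counter

# Hypothetical cleaned post titles.
titles = [
    "gme to the moon",
    "buy gme and hold",
    "why i buy more gme",
]

# Tally every word across all titles.
freq = Counter(word for title in titles for word in title.split())
print(freq["gme"], freq["buy"])  # gme dominates, then buy
```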
7. Data Cleaning
Machine Learning requires that the data be cleaned of any extraneous details to perform properly. To render the data
appropriate for feeding into our chosen models we must perform the following tasks on the Pandas data frame:
7.1 Drop Redundant Columns
Firstly, we examine our datasets and select the attributes that are of importance. For the purposes of this investigation,
we only need title and timestamp from the Reddit data, and the price point was stored as a variable.
Reddit_wsb_labelled.csv is a curated dataset that had already been cleaned (title, datetime
and polarity with no null values, in the correct format), so these steps apply only to reddit_wsb.csv:
df = df.drop(columns=['score', 'id', 'url', 'comms_num', 'created', 'body'])
7.2 Dealing with Missing Values
Once the frame has been stripped of unnecessary attributes, any tuples that do not contain a title are dropped.
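This step is a one-liner in pandas; a minimal sketch with placeholder rows:

```python
import pandas as pd

# Placeholder frame: one row has a missing title.
df = pd.DataFrame({
    "title": ["GME update", None, "Diamond hands"],
    "timestamp": ["2021-01-28", "2021-01-29", "2021-01-30"],
})

# Drop any tuple (row) that does not contain a title.
df = df.dropna(subset=["title"]).reset_index(drop=True)
print(len(df))  # 2 rows remain
```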
7.3 Cleaning Text
Using Python’s lower() function, all capital letters in the posts are transformed to lowercase for consistency.
Using Python’s regular expression module, handles (@mentions), URLs, single characters and extra spaces are all
replaced with an empty string using the re.sub() command. Special characters are dealt with via the
re.findall() command.
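A minimal sketch of this cleaning pipeline follows; the specific regular expressions and their ordering are our illustration (the report's own code also uses re.findall() for special characters, which this sketch folds into a single substitution):

```python
import re

def clean_post(text):
    """Lowercase a post and strip handles, URLs, special and single characters."""
    text = text.lower()                       # lowercase for consistency
    text = re.sub(r"@\w+", "", text)          # handles (@mentions)
    text = re.sub(r"https?://\S+", "", text)  # URLs
    text = re.sub(r"[^a-z\s]", "", text)      # special characters
    text = re.sub(r"\b[a-z]\b", "", text)     # single characters
    text = re.sub(r"\s+", " ", text).strip()  # extra spaces
    return text

print(clean_post("GME To The MOON!!! @elonmusk https://t.co/x"))
# -> "gme to the moon"
```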
8. Modelling
For the purposes of this report, we are interested in the binary classification of either positive or negative sentiment,
as our hypothesis revolves around the core concept of volatility being a result of heightened emotional reaction,
hence we have removed neutral labels during data preprocessing. Our three supervised learning models have been
optimised to predict a post being of one of these two categories. All models were drawn from the Scikit-learn
(sklearn) library.
Although we are working with text, we need to transform the data into information that the algorithms can manipulate
and draw tangible results from. For feature extraction, we used CountVectorizer from the sklearn library to
transform words into numerical token counts, since numbers are what the algorithms work with.
8.1 Logistic Regression
Logistic Regression in Data Analysis is a means of streamlining data into binary values, informed by a given
boundary set between 0 and 1. In this case 0 and 1 represent negative and positive sentiment respectively in the
predictor space.
8.1.1 Justification
For stock market analysis applications, logistic regression is a computationally inexpensive means of gaining a
high-level overview of the sentiment data. Although the scope is limited to binary values, it can be used as a means
of ongoing analysis due to its quick processing time, triggering alerts for a finer-grained (i.e., emotion detection)
approach if there is a substantive swing from positive to negative sentiment.
8.1.2 Model Implementation
The model was implemented using the LogisticRegression classifier from the sklearn library:
from sklearn.linear_model import LogisticRegression
The data was split into two, 80% for training and 20% for testing:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
The model was fitted to the data using:
lr.fit(X_train_res, y_train_res)
The fit function applies the needed pre-processing and regularisation for better accuracy. We get a fit score of ~87%.
After this the model is executed.
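The fragments above assemble roughly as follows. This is a runnable sketch with a tiny synthetic corpus in place of the labelled WSB data, and without the resampling step implied by the _res variable names:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Tiny synthetic corpus standing in for the labelled posts
# (1 = positive, 0 = negative); the real set is far larger.
texts = ["buy gme now", "gme to the moon", "selling everything",
         "this is a disaster", "diamond hands forever", "massive losses today"] * 10
labels = [1, 1, 0, 0, 1, 0] * 10

# Vectorise, then split 80% / 20% as in the report.
X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

lr = LogisticRegression()
lr.fit(X_train, y_train)
print(lr.score(X_test, y_test))  # accuracy on held-out toy data
```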
8.2 Linear SVC
LinearSVC is a form of Support Vector Model (SVM) that maps data points that are not linearly separable to a space
where they are.
8.2.1 Justification
Within the scope of the report, this means that we should be able to differentiate between different polarities even if
the posts contain words that might suggest a different sentiment. With WSB post data, this is a particularly useful
tool as posts are often shrouded in irony or sarcasm.
8.2.2 Model Implementation
The LinearSVC classifier from the sklearn library was used to implement the model:
from sklearn.svm import LinearSVC
The data was split into two, 80% for training and 20% for testing:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
The model was fitted to the data using:
lsvc.fit(X_train_res, y_train_res)
The fit function applies the needed pre-processing and regularisation for better accuracy. We get a fit score of ~87%.
After this the model is executed.
Figure 5: (From left to right): Confusion matrices for Multinomial NB, LinearSVC and Logistic Regression
respectively
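Confusion matrices like those in Figure 5 can be computed with sklearn; the true and predicted labels below are hypothetical:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true vs predicted polarity labels (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[3 1], [1 3]]: 3 true negatives, 1 false positive, 1 false negative, 3 true positives
```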
9. Results, Visualisation and Analysis
In order to produce an easily understandable time series that would highlight the polarity differences between models
clearly, we split the mean values by date from Jan 28th to Feb 5th.
For each model, we input the separated dates to predict the polarity of each discrete period. We can see that y_pred
contains the prediction of every row of the input dataset. To simplify plotting, we take the mean of that array using
numpy.
np.mean(y_pred)
This process is repeated for all three models. The end-of-day (EOD) share price is taken from the merged
historical_prices dataset we obtained using the TiingoClient API and plotted on a time series. On this graph
we also plot the mean values output by each model for each day.
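The per-day aggregation can be sketched as follows, with hypothetical predictions and dates in place of the model output:

```python
import pandas as pd

# Hypothetical per-post predictions tagged with their post dates.
preds = pd.DataFrame({
    "date": ["2021-01-28", "2021-01-28", "2021-01-29", "2021-01-29"],
    "y_pred": [1, 0, 1, 1],
})

# Mean polarity per day, which is plotted alongside the EOD share price.
daily_mean = preds.groupby("date")["y_pred"].mean()
print(daily_mean)
```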
Figure 6: Combined
From the graph we can see that the Multinomial NB model was more positive than the LR during the last day (Feb 5th)
even though it was more negative at the beginning (Jan 28th). As we anticipated, the LSVC and LR had almost
identical curves and their averages tended toward zero (neutral).
12. References
1. Cowles 3rd, A. (1933) “Can Stock Market Forecasters Forecast?” Econometrica: Journal of the
Econometric Society, 1: 309-324.
2. Cowles 3rd, A. (1944) “Stock Market Forecasting.” Econometrica, 12: 206-214.
3. Fama, E. F. (1970) “Efficient Capital Markets: A Review of Theory and Empirical Work.” The
Journal of Finance, 25, 2: 383-417.
4. Teeter, P.; Sandberg, J. (2017) “Cracking the Enigma of Asset Bubbles with Narratives.”
Strategic Organization, 15, 1: 91-99.
5. Mäntylä, M.; Graziotin, D.; Kuutila, M. (2016) “The Evolution of Sentiment Analysis – A
Review of Research Topics, Venues, and Top Cited Papers.” Computer Science Review, 27.
doi:10.1016/j.cosrev.2017.10.002.
6. Pang, B.; Lee, L.; Vaithyanathan, S. (2002) “Thumbs up? Sentiment Classification using Machine
Learning Techniques.”
7. Read, J. (2005) “Using Emotions to Reduce Dependency in Machine Learning Techniques for
Sentiment Classification.”
8. Acheampong, F. A.; Wenyu, C.; Nunoo-Mensah, H. (2020) “Text-based emotion detection:
Advances, challenges, and opportunities.”
13. Appendices
13.1 Notebooks
Data Exploration - https://colab.research.google.com/drive/1UcK-N9yj0B9mnYOspEVVCnPM5X3aL3ou?usp=sharing
ML Model and Analysis - https://colab.research.google.com/drive/1e7eecMTrn_IENQ_oxw3d6GQMPJubjkB9?usp=sharing
Cross Validation
The accuracy and F1 Scores
The data from reddit_wsb.csv shown as a table using pandas
The pre-processed and cleaned data given as input for the model
The combination of historical stock price and the post title made