
ISSN (online) 2583-455X

BOHR International Journal of Computer Science


2022, Vol. 2, No. 1, pp. 78–93
https://doi.org/10.54646/bijcs.014
www.bohrpub.com

Associating Fundamental Features with Technical Indicators for Analyzing Quarterly Stock Market Trends Using Machine Learning Algorithms
Nicholas Moore and Sikha Bagui∗

Department of Computer Science, University of West Florida, Pensacola, FL, United States
∗ Corresponding author: [email protected]

Abstract. The stock market is the primary entity driving every major economy across the globe, with each investment designed to capitalize on profit while decreasing its associated risks. As a result of the stock market's importance, there have been innumerable studies conducted with the goal of predicting the stock market through data analysis techniques including machine learning, neural networks, and time series analysis. This paper uses machine learning algorithms to classify stock market indices using fundamental data, with the classes defined by technical indicators. The data were derived from Yahoo Finance for the top 100 indices in the NASDAQ stock market from January 2000 to December 2020.
Keywords: Stock Market, Machine Learning, Technical Indicators, Fundamental Analysis, NASDAQ.

INTRODUCTION AND RELATED WORKS

The health of every economy in the world, both major and growing, hinges on its market's stock prices, and predicting these stock prices is a growing area of interest for world governments, professional investors, and private citizens. Despite efforts to develop new techniques and strategies toward this goal, market volatility along with the nonlinear, highly heteroscedastic nature of market data makes for a model that is problematic to forecast [10]. There are three main approaches to analyzing the stock market: technical, fundamental, and sentimental. Technical analysis attempts to determine future price change patterns using technical indicators, which include the opening price (open), daily highest price (high), daily lowest price (low), closing price (close), adjusted closing price (adjusted close), and the total volume (volume). Technical indicators are detailed in daily stock market reports and represent data efficiently for time series analysis [19].

Fundamental analysis uses the economic standing of a firm's yearly or quarterly reports to predict future stock value [18]. Fundamental analysis is the focus of this paper. Fundamental company reports vary depending on the nature of the business. Examples of fundamental features include total revenue, gross profit, total assets, total debt, operating cash flow, and capital expenditure [18].

Sentimental analysis relates to the public's general feeling or attitude toward specific stocks as it relates to their success or failure within a given market [2]. The goal of each of these methods is to predict market trends, giving investors the information necessary to productively place their money where it will increase their overall investment [7].

In a survey of the types of analysis performed across over 300 samples, 66% of papers focused on technical analysis, 23% on fundamental analysis, and 11% on some combination of the two or some form of sentimental analysis [19]. Given the vast amount of research available, this paper will serve as a foundation for applying machine learning techniques to fundamental data analysis.

The uniqueness of this paper lies in its combination of fundamental data classified, based on the high and low technical indicators, into three distinct classes: buy, sell, or hold. Along with the classification system, a broad array of algorithms is applied in their most basic form, with the intended purpose of providing a benchmark performance for each classifier. The purpose of this is to gain useful insights into how these algorithms could be modified and expanded on for future use.


This paper presents the results of collecting fundamental data on the top one hundred stocks in the NASDAQ stock market and applying eight different machine learning algorithms to predict whether a stock should be bought, sold, or held in any given quarter over the 20 years from 2000 to 2020.

Most research focusing on technical analysis deals with, at its smallest, minute-to-minute prediction models [5] and, at its largest, day-to-day models [7]. While useful, we intended to explore longer-term investment options that would be more useful to private citizens and long-term investors who wish to avoid the risk associated with day trading.

Given the large amount of research done on technical analysis and the generally positive results gained from that research, we began by drawing inspiration from Wang et al. [20], who attempted to train deep learning networks to analyze the Singapore Stock Exchange, straying from conventional trend studies to have their algorithms produce trading decisions directly. Their algorithms provided a buy, sell, or hold decision on a stock based on indicators gathered from a random forest algorithm. Thakur and Kumar [12] later repeated this method, expanding on the use of random forest algorithms to determine the rules used to classify each index as a buy/sell/hold index.

The purpose of this research is to give non-investors a platform to study and enter the market, streamlining the results directly into a decision stating whether a stock index should be bought, sold, or held. Discretizing the large number of fundamental features into a smaller number is a secondary focus of this study.

Hence, this study focuses on using fundamental values to produce decisions based on the technical indicators. By associating the fundamental features with a decision based on the technical indicators, we have combined two methods of study, namely, technical and fundamental: we study the fundamentals to predict classes based on the technicals.

Given those articles and their influence on the work performed, it is prudent to note how this work differs from them. While many of the studies mentioned used machine learning algorithms [15–17], none used them on fundamental data to predict long-term results, which for the purpose of this paper are defined as results in increments of greater than 30 days. This project attempts to forecast the decision in 90-day increments four times a year, over a 20-year period, allowing personal non-day-trading investors to use this information to invest responsibly and reliably in a volatile market environment.

By collecting quarterly data from 100 different stocks over a 20-year period, this work aims to relate fundamental data to a predicted classification on the rise and fall of technical indicators and then produce a decision for the user to buy, sell, or hold a stock. The report also explores which fundamental features gathered from the quarterly report correlate the most with the decision classification, exploring two different methods of correlation and then producing two sets of features to be utilized in each of the machine learning algorithms. This project also serves the purpose of forming a benchmark foundation for continued studies in this field.

The remainder of this paper is organized as follows: Section "Data and Preprocessing" provides insight on the datasets utilized and the preprocessing performed to establish the final dataframe; Section "Algorithms" gives a high-level summary of the algorithms studied along with the parameters used in the experimentation; Section "Results" summarizes each algorithm's best parameters along with their results; and finally, Section "Conclusion and Future Works" states the conclusions and posits future work to consider.

DATA AND PREPROCESSING

The data for this project consist of all the data on the companies in the NASDAQ-100 stock market from January 1, 1999, to January 1, 2020, located in the Yahoo Finance database. The fundamental data are a collection of three separate reports pulled from the database. These dataframes (a more technical term for files) and their feature counts were as follows: quarterly balance sheet (92), quarterly cash flow (72), and quarterly financials (52). A fourth dataframe on the historical daily values of each stock (technical indicators) was also pulled: historical price indices (7). The combined original feature count was 223. The data were manually collected from Yahoo Finance.
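The collection itself was manual, but for readers who wish to reproduce it programmatically, a rough equivalent using the community yfinance package (our assumption; it was not used in this study) might look as follows:

```python
# Sketch only: the study collected these reports manually from Yahoo
# Finance; yfinance is a community package exposing the same data.
import yfinance as yf

ticker = yf.Ticker("AAPL")  # hypothetical example symbol

balance_sheet = ticker.quarterly_balance_sheet  # quarterly balance sheet
cash_flow = ticker.quarterly_cashflow           # quarterly cash flow
financials = ticker.quarterly_financials        # quarterly financials

# Historical daily technical indicators (open, high, low, close, volume)
history = ticker.history(start="1999-01-01", end="2020-01-01")
```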
To prepare the data, a series of preprocessing steps was taken. First, extraneous features were removed from the dataframes: many features were not reported continuously across all dataframes for each stock. Also, Yahoo Finance organizes features into individual sections and subsections, allowing for the generalization of several features, and many of the subsections contained no values. Once the original dataframes were feature filtered, they were combined into a single quarterly report, with the data associated around the dates corresponding to the ends of the quarters of the fiscal calendar across all 20 years of data. This left us with a combined dataframe of the fundamental and technical values. The preprocessed dataframe consisted of 62 features and 8,498 indices of reported stock figures.

Next, the data were categorized. Each quarterly report was assigned one of three classes: buy, sell, or hold (if neither buy nor sell). The exact class criteria were as follows:

(1) Sell – high and low decrease by 5% or more in the next quarter.

(2) Buy – high and low increase by 5% or more in the next quarter.

(3) Hold – neither the buy nor the sell condition is met.

Each index represents a quarterly report and the high and low values associated with that quarter.
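Read literally, these rules reduce to a small labeling function. The sketch below is ours, not the authors' code, and assumes columns named high and low holding each quarter's values:

```python
import pandas as pd

def label_quarter(curr: pd.Series, nxt: pd.Series) -> str:
    """Label a quarter buy/sell/hold from the next quarter's high and low."""
    high_chg = (nxt["high"] - curr["high"]) / curr["high"]
    low_chg = (nxt["low"] - curr["low"]) / curr["low"]
    if high_chg <= -0.05 and low_chg <= -0.05:
        return "sell"  # high and low both fall by 5% or more
    if high_chg >= 0.05 and low_chg >= 0.05:
        return "buy"   # high and low both rise by 5% or more
    return "hold"      # neither condition met
```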

Figure 1. Buy, sell, or hold classification process.

These indices were classified based on the stated rules. Once the classifications were added to each company's quarterly reports, the rest of the data could be transformed as follows; this methodology is demonstrated in Figure 1.

With the categorization decided upon, the values in the reports were altered to measure the percent change from the previous quarterly report (QR) to the current QR so that the new classifications could be added. A feature was added to the dataframe for each existing feature, measuring the change in that feature from one quarter to the next. For example, say that in the previous quarter a company was valued at $1,000, and in the next quarter it was valued at $1,100; this represents an increase of 10%. The new feature replaced the price value of 1,100 with 10% for the current quarterly report. This process was repeated for every quarterly report except the first, as no previous data existed with which to modify it. This left every quarterly report with a percent change for each fundamental value.
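A minimal sketch of this transformation, assuming a dataframe indexed by company symbol and quarter date with only numeric fundamental columns (our assumption, not the published code):

```python
import pandas as pd

def to_percent_change(quarterly: pd.DataFrame) -> pd.DataFrame:
    # Within each company, replace each value with its quarter-over-quarter
    # percent change: $1,000 -> $1,100 becomes 0.10 (i.e., 10%).
    pct = quarterly.groupby(level="symbol").pct_change()
    # The first quarter of each company has no predecessor, so drop it.
    return pct.dropna(how="all")
```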
In summary:

• All the data files were collected into a single dataframe for the purposes of preprocessing and exploring the data.

• To train and test the classifier, the price-related features were separated into two different dataframes with a similar index value. This was done to prevent data leakage during the training and testing of the model.

• The date indices were replaced with simple numeric indices, as dates were no longer needed.

Feature Selection Using Correlation Values and Tree Classifiers

The fundamental data were collected, preprocessed, and reformatted, and the final files were exported. The data exploration consisted of two separate but similar steps. Given the large number of features (59, after removing the name, symbol, and date columns), we wanted to condense the features into a more discrete number. The original features are presented in Appendix A.

To perform this feature selection, we used two methodologies: correlation values and tree classifiers. For the first method, a correlation matrix was created, and the top 10 features correlated with our decision classifications were located. The preprocessed dataframe was then spliced to include only those features along with the decision classifications, and this dataframe was exported for use in our models. Next, using SciKit, a decision tree was implemented to determine a second set of top ten features to study. Once identified, these were also spliced out of the preprocessed dataframe and moved into a new dataframe along with the corresponding classification labels. Table 1 shows the features selected by both methods of feature discretization and used for the remainder of this experiment.

Table 1. Top 10 features selected from correlation and decision tree methodologies.

Correlation Method                        Decision Tree Method
Basic Average Shares                      Capital Expenditure
Diluted Average Shares                    Total Assets
Tax Effect of Unusual Items               End Cash Position
Other Income Expenses                     Ordinary Shares Number
Total Liabilities Net Minority Interest   Total Liabilities Net Minority Interest
Total Unusual Items Excluding Goodwill    Total Expenses
Total Unusual Items                       Reconciled Depreciation
Working Capital                           Gross Profit
Total Revenue                             Cost of Revenue
Operating Expenses                        Operating Expenses

Using two different methods to determine which features to use allows us to compare how feature selection affects the accuracy of the models.
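A sketch of both selection routes, assuming a feature matrix X and a label series y; the class encoding and the exact SciKit calls are our assumptions, not the authors' code:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Method 1: top 10 features by absolute correlation with the encoded label.
y_code = y.map({"sell": -1, "hold": 0, "buy": 1})  # assumed encoding
corr_top10 = X.corrwith(y_code).abs().nlargest(10).index

# Method 2: top 10 features by impurity-based importance of a fitted tree.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
importances = pd.Series(tree.feature_importances_, index=X.columns)
tree_top10 = importances.nlargest(10).index

# Splice the preprocessed dataframe down to each selected feature set.
X_corr, X_tree = X[corr_top10], X[tree_top10]
```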

Figure 2. Data processing and analysis framework.

Now that the data were cleaned and formatted and the discretized feature sets were selected, we finally began classifying the stocks using their quarterly reports. To do this, multiple machine learning classifiers were used. The data were tested using eight different classifiers along with a dummy model to provide a benchmark to compare our results against:

• Ada Boost
• Decision Tree
• Extreme Gradient Boost
• K Nearest Neighbor
• Logistic Regression
• Naïve Bayes
• Random Forest
• Support Vector Machine Model
• Dummy (Benchmark)

Each of these models was trained and fitted to the dataset to determine the best performing model. Along with running the default algorithms, we also performed a grid search over several different parameters to try to identify the best results for each algorithm. Figure 2 lays out this entire process.

Results are presented in terms of four different statistical metrics: accuracy, precision, recall, and F1-score. Our preprocessing methodology produced a slightly imbalanced dataset; hence, this study places more importance on precision, since it accounts for the number of false positives. When studying stocks, we have chosen a conservative investing policy: focusing on precision allows us to avoid investing in the wrong stock, whereas focusing on recall would concern us with missed opportunities. Each of the metrics has its importance, but because we do not want to invest in a stock that is a sell, we focus on precision. The confusion matrix for each of the algorithms is also included to picture how each of our models predicted versus the actual breakdown of how each index was classified. The complete results and diagrams for each set of features can be found in Tables 10 and 11 and Figures 3–18.

ALGORITHMS

Tables 2–9 give a short overview of the algorithms used in this study and their advantages and disadvantages, as well as the parameter sets for each model. A total of eight algorithms were chosen based on their use in previous studies on stock market data.

Each algorithm was tested several times, using all of the default methodologies as well as altering specific parameters using a grid search technique. This was done to ensure that we were locating the optimal settings for each model so that we could in turn find the optimal model. This may increase the processing time, because each model must be run for each combination of parameters, but it eventually produces better results.

The parameters tested are listed along with each algorithm's synopsis and a short description of what each parameter affects. Each of the two discretized top-feature sets was tested in this manner, choosing the best set of parameters for each model, which in turn identifies the best algorithm for each set of features. Where more than one value is listed for a parameter, every listed value was tested for that model, and the combination of parameters that produced the best results across all attempts was retained. The definitions for each parameter were taken from Pedregosa et al. [21] and the SciKit-learn documentation. The dummy algorithm was run using Scikit-learn's default classifier.
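As an illustration, such a search for one of the eight models, using the AdaBoost parameter grid from Table 2 below, could be written as follows; the split proportions and the scoring choice are our assumptions:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumed split; the paper does not publish its exact train/test protocol.
X_train, X_test, y_train, y_test = train_test_split(
    X_tree, y, test_size=0.25, stratify=y, random_state=0)

param_grid = {
    "n_estimators": [50, 100, 200, 500],  # values from Table 2
    "learning_rate": [1, 0.1, 0.01],
}

# Macro-averaged precision mirrors the study's emphasis on precision.
search = GridSearchCV(AdaBoostClassifier(), param_grid,
                      scoring="precision_macro")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```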

Algorithm Synopsis Tables

Table 2. Ada boost algorithm synopsis.


Ada Boost – an estimator that initially fits on the original dataframe and then fits additional copies on the same dataframe, with the weights of incorrectly classified instances adjusted so that more difficult instances become the focus.
Parameters
– n_estimators – The maximum number of estimators at which boosting is terminated. In case of a perfect fit, the learning procedure is stopped early.
◦ 50, 100, 200, 500
– learning_rate – Weight applied to each classifier at each boosting iteration. A higher learning rate increases the contribution of each classifier. There is a trade-off between the learning_rate and n_estimators parameters.
◦ 1, 0.1, 0.01

Advantages
– Less prone to overfitting data
– Input parameters are not jointly optimized.
Disadvantages
– Requires a quality dataset void of noisy and outlier data.
– Statistically slower compared to other algorithms

Table 3. Decision tree algorithm synopsis.


Decision Tree – Uses a tree data structure to predict the results of a particular classification. A highly useful classification model.
Parameters
– criterion – defines the function used to measure the quality of a split.
◦ 'gini' and 'entropy'
– max_depth – defines the max depth of the tree. If None, nodes are expanded until all leaves are pure.
◦ None, 2, 3, 4, 5, 6
– min_samples_split – defines the min number of samples required to split a node
◦ 2, 5, 10
– min_samples_leaf – defines the min number of samples required at a leaf node for a split to be considered.
◦ 1, 2, 3, 4, 5, 6

Advantages
– Easy to understand and implement
– Insensitive to missing values
– Uncorrelated features can be processed with positive results.
Disadvantages
– Multiclassification problems increase error rates
– Underperforms when multiple features are highly correlated.

Table 4. Gradient boost algorithm synopsis.


Extreme Gradient Boost – In each stage, n-class regression trees are fit to the negative gradient of a multinomial deviance loss function, which allows for the optimization of arbitrary differentiable loss functions. Essentially, each model is trained on the failures of the previous model.
Parameters
– booster – which booster to use.
◦ "gbtree", "gblinear", "dart"
– eta – step size shrinkage value used to prevent overfitting.
◦ 0.1, 0.5, 0.9
– gamma – Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the gamma, the more conservative the algorithm will be.
◦ 0, 1, 3
– n_estimators – number of trees in the forest
◦ 50, 100, 200
– max_depth – The maximum depth of the tree. If None, nodes are expanded until all leaves are pure.
◦ 1, 3, 6

Advantages
– Efficient classification model
– Historically more accurate than random forest
– Can handle mixed feature types.
Disadvantages
– Sensitive to outliers due to the carry-through of errors from previous iterations
– Difficult to scale up because of its reliance on previous iterations.

Table 5. K nearest neighbor algorithm synopsis.


K Nearest Neighbor – A supervised machine learning algorithm that finds the distances between an example and all the examples in the data, selecting the K closest. Chosen due to the high relation between two close data points in our data set.
Parameters
– n_neighbors – number of neighbors to use.
◦ 50, 100, 200
– weights – the weight function used in prediction.
◦ 'uniform' – all points in each neighborhood are equally weighted.
◦ 'distance' – closer neighbors of a query point have more influence.
– p – the parameter for the Minkowski metric
◦ 1 – equivalent to Manhattan distance
◦ 2 – uses the Euclidean metric

Advantages
– Versatile algorithm that can be used for classification, regression, and search
Disadvantages
– Speed is directly related to the size of the data, making this classifier hard to scale up.

Table 6. Logistic regression algorithm synopsis.


Logistic Regression – Used to assign observations to a discrete set of classes using a predictive analysis algorithm based on probability calculated using a sigmoid cost function.
Parameters
– penalty – specifies the norm of the penalty
◦ l1 – add an l1 penalty term
◦ l2 – add an l2 penalty term
– fit_intercept – specifies whether a constant should be added to the decision function
◦ True, False
– intercept_scaling – used only when the 'liblinear' solver is selected and fit_intercept is True; it lessens the effect of regularization on the synthetic feature weight.
◦ 1, 10, 50
– solver – chooses the algorithm used in the optimization problem
◦ 'liblinear' – one-vs-rest scheme
◦ 'saga' – used for larger dataframes to handle multinomial loss

Advantages
– Performs well with continuous or categorical data.
– Easy to use and interpret the results
– Feature scaling not needed
Disadvantages
– Data intensive
– Sensitive to multicollinearity
– Performs poorly with nonlinear data
– Prone to overfitting the data

Table 7. Naive Bayes algorithm synopsis.


Naïve Bayes – A supervised learning algorithm used for classification by features, assuming each feature is independent of the others with no correlation.
Parameters
– var_smoothing – Portion of the largest variance of all features that is added to variances for calculation stability.
◦ 1.5**-i for i in range(−20, 20, 2)

Advantages
– Fast algorithm that can be used in real time
– Scalable to larger datasets
– Good performance with high-dimensional data
Disadvantages
– Assumes each feature makes an equal contribution, weighing each feature equally.
– Requires each classification to be well represented.

Table 8. Random forest algorithm synopsis.


Random Forest – Uses many individual decision trees, each of which returns a class prediction; the class with the most votes becomes the model's prediction.
Parameters
– n_estimators – number of trees in the forest
◦ 10, 50, 100, 200
– criterion – the function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.
◦ 'gini' and 'entropy'
– max_depth – The maximum depth of the tree. If None, nodes are expanded until all leaves are pure.
◦ None, 2, 5, 10
– min_samples_split – The minimum number of samples required to split an internal node
◦ 5, 10
– min_samples_leaf – The minimum number of samples required at a leaf node for a split to be considered.
◦ 1, 2, 5

Advantages
– Works well with unbalanced data.
– Excellent nonlinear classifier.
– Maintains high accuracy when used with data that has missing values.
Disadvantages
– Smaller data frames and low-dimensional data are prone to inaccurate classifications.
– Setting parameters is difficult and sometimes randomized.

Table 9. Support vector machine model algorithm synopsis.


Support Vector Machine Model – An extension of the maximal margin classifier modified for general use cases, especially nonlinear features.
Parameters
– C – Regularization parameter. The strength of the regularization is inversely proportional to C.
◦ 0.01, 0.1, 1
– kernel – specifies the kernel type used by the algorithm.
◦ 'rbf', 'sigmoid', 'linear'

Advantages
– Works well in high-dimensional spaces where the number of dimensions is greater than the number of samples.
– Avoids overfitting the data due to outliers.
Disadvantages
– Better suited for binary classifications.
– Performs slower on larger datasets.
– Selecting the right kernel function is difficult and can seem arbitrary.
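One plausible way to assemble the classifiers and the grids from Tables 2–9 for the search loop sketched earlier (three of the nine entries shown; the dictionary structure is ours, not the authors'):

```python
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Estimator/grid pairs; the remaining models follow the same pattern.
models = {
    "AdaBoost": (AdaBoostClassifier(),
                 {"n_estimators": [50, 100, 200, 500],
                  "learning_rate": [1, 0.1, 0.01]}),
    "K Nearest Neighbor": (KNeighborsClassifier(),
                           {"n_neighbors": [50, 100, 200],
                            "weights": ["uniform", "distance"],
                            "p": [1, 2]}),
    "Random Forest": (RandomForestClassifier(),
                      {"n_estimators": [10, 50, 100, 200],
                       "criterion": ["gini", "entropy"],
                       "max_depth": [None, 2, 5, 10],
                       "min_samples_split": [5, 10],
                       "min_samples_leaf": [1, 2, 5]}),
    "Dummy (Benchmark)": (DummyClassifier(), {}),  # untuned baseline
}
```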

RESULTS

The results displayed in Tables 10 and 11 show what each algorithm returned using the best tuned parameters found through the grid search phase of the experiment. The results are reported on four different values: precision, recall, and F1-score for each classification, and the overall accuracy of each model. Each metric is reported for each of the three classifications (buy, sell, and hold). Due to the imbalanced nature of the classifications, we prioritize precision over accuracy when determining the efficacy of the classifiers. Along with the results in both tables, the confusion matrix for each model and discretization method combination is presented in Figures 3–18. The confusion matrix allows us to visualize in depth how each model performed by displaying the number of indices that were misclassified for each of the three classifications. By breaking down each classification into true classifications and false classifications, and also labeling how each false classification was mislabeled, we can gain a deeper understanding of how the algorithms performed and form a foundation for improvement.

When viewing the confusion matrices for each discretization method and focusing on the top performing algorithms, we can see that for both methodologies the results were best at predicting true holds, then true buys, and rarely correctly predicted true sells. This indicates that the weight was placed on being more conservative, leaning toward holding over buying and selling. While these decisions are not straightforward, the most important aspect was for the algorithms to correctly classify buy and hold indices over incorrectly sold ones, as buying and holding are more directly related to money lost (i.e., buying a stock that is going to lose value or holding a stock that will lose value both cost you money you already have, whereas selling a stock that would have gained value costs you only potential income you have not yet gained).

Top 10 Features Based on a Decision Tree Classifier

The top 10 features are presented in Table 1.

Figure 3. Ada boost confusion matrix.
Figure 4. Decision tree confusion matrix.
Figure 5. Extreme gradient boost confusion matrix.
Figure 6. K nearest neighbor confusion matrix.
Figure 7. Logistic regression confusion matrix.
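The per-class values in Tables 10 and 11 are the standard outputs of scikit-learn's reporting utilities; a sketch of producing them for one fitted model, reusing the search and held-out split assumed in the earlier sketch:

```python
from sklearn.metrics import classification_report, confusion_matrix

classes = ["sell", "buy", "hold"]
y_pred = search.best_estimator_.predict(X_test)

# Precision, recall, and F1 per class, plus overall accuracy.
print(classification_report(y_test, y_pred, labels=classes))
# Rows: true class; columns: predicted class.
print(confusion_matrix(y_test, y_pred, labels=classes))
```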

Table 10. Results of top features based on a tree classifier.

Decision Tree Algorithms Results

Algorithm                 Class   Precision   Recall   F1-score   Accuracy
AdaBoost                  Sell    0.28        0.10     0.15       0.43
                          Buy     0.43        0.45     0.44
                          Hold    0.45        0.56     0.50
  Parameters: learning_rate = 1, n_estimators = 500
Decision Tree             Sell    0.21        0.23     0.22       0.38
                          Buy     0.41        0.40     0.40
                          Hold    0.45        0.43     0.44
  Parameters: criterion = 'entropy', max_depth = None, min_samples_leaf = 1, min_samples_split = 5
Extreme Gradient Boost    Sell    0.31        0.03     0.06       0.42
                          Buy     0.40        0.47     0.43
                          Hold    0.44        0.56     0.50
  Parameters: booster = gbtree, eta = 0.1, gamma = 0, grow_policy = depthwise, max_depth = 6, n_estimators = 50
K Nearest Neighbor        Sell    0.29        0.01     0.01       0.44
                          Buy     0.42        0.45     0.44
                          Hold    0.45        0.62     0.52
  Parameters: n_neighbors = 50, p = 2, weights = 'distance'
Logistic Regression       Sell    0.22        0.03     0.05       0.43
                          Buy     0.43        0.33     0.37
                          Hold    0.43        0.70     0.54
  Parameters: C = 1.0, fit_intercept = False, intercept_scaling = 1, penalty = 'l2', solver = liblinear
Naïve Bayes               Sell    0.24        0.06     0.10       0.42
                          Buy     0.35        0.02     0.04
                          Hold    0.43        0.93     0.59
  Parameters: var_smoothing = 0.0015224388403474447
Random Forest             Sell    0.34        0.08     0.14       0.47
                          Buy     0.42        0.48     0.45
                          Hold    0.46        0.54     0.50
  Parameters: criterion = 'gini', max_depth = None, min_samples_leaf = 1, min_samples_split = 5, n_estimators = 10
Support Vector Machine    Sell    0.21        0.20     0.20       0.39
                          Buy     0.43        0.37     0.40
                          Hold    0.42        0.48     0.45
  Parameters: C = 1, kernel = sigmoid
Benchmark Model           Sell    0.15        0.15     0.15       0.34
                          Buy     0.26        0.26     0.27
                          Hold    0.27        0.24     0.29
  Parameters: N/A

Figure 8. Naive Bayes confusion matrix.
Figure 9. Random forest confusion matrix.
Figure 10. Support vector machine confusion matrix.



Top 10 Features Based on Correlation Values

Table 11. Results of top features based on a correlation model.

Correlation Method Algorithms Results

Algorithm                 Class   Precision   Recall   F1-score   Accuracy
AdaBoost                  Sell    0.27        0.06     0.10       0.41
                          Buy     0.40        0.46     0.43
                          Hold    0.43        0.55     0.48
  Parameters: learning_rate = 1, n_estimators = 500
Decision Tree             Sell    0.16        0.17     0.16       0.37
                          Buy     0.41        0.42     0.42
                          Hold    0.44        0.42     0.43
  Parameters: criterion = 'entropy', max_depth = None, min_samples_leaf = 6, min_samples_split = 5
Extreme Gradient Boost    Sell    0.28        0.22     0.22       0.41
                          Buy     0.43        0.46     0.44
                          Hold    0.44        0.51     0.46
  Parameters: booster = 'gbtree', eta = 0.1, gamma = 0, grow_policy = depthwise, max_depth = 6, n_estimators = 200
K Nearest Neighbor        Sell    0.32        0.04     0.07       0.44
                          Buy     0.43        0.45     0.44
                          Hold    0.42        0.57     0.48
  Parameters: n_neighbors = 50, p = 2, weights = 'distance'
Logistic Regression       Sell    0.22        0.03     0.05       0.43
                          Buy     0.43        0.33     0.37
                          Hold    0.43        0.70     0.54
  Parameters: C = 11.390625, fit_intercept = False, intercept_scaling = 1, penalty = 'l1', solver = 'liblinear'
Naïve Bayes               Sell    0.08        0.00     0.01       0.40
                          Buy     0.40        0.97     0.57
                          Hold    0.52        0.02     0.05
  Parameters: var_smoothing = 0.0077073466292589396
Random Forest             Sell    0.27        0.10     0.15       0.42
                          Buy     0.45        0.46     0.45
                          Hold    0.42        0.52     0.46
  Parameters: criterion = 'entropy', max_depth = None, min_samples_leaf = 2, min_samples_split = 5, n_estimators = 10
Support Vector Machine    Sell    0.21        0.10     0.14       0.40
                          Buy     0.43        0.37     0.40
                          Hold    0.41        0.57     0.48
  Parameters: C = 1, kernel = 'sigmoid'
Benchmark Model           Sell    0.15        0.15     0.15       0.34
                          Buy     0.26        0.26     0.27
                          Hold    0.27        0.24     0.29
  Parameters: N/A

Figure 11. AdaBoost confusion matrix.
Figure 12. Decision tree confusion matrix.
Figure 13. Extreme gradient boost confusion matrix.



Figure 14. K nearest neighbor confusion matrix.
Figure 15. Logistic regression confusion matrix.
Figure 16. Naive Bayes confusion matrix.
Figure 17. Random forest confusion matrix.
Figure 18. Support vector machine confusion matrix.

Summary of Top Performing Algorithms From Each Methodology

Table 12 displays the best results from both discretization methods. As can be seen below, the random forest model shows an increase in accuracy of 13 percentage points over the benchmark, while the K nearest neighbor shows an increase of 10 percentage points. Along with the dramatic increase in accuracy, there is also a dramatic increase in precision across all of the classifications: the sell precision nearly doubles, and the buy and hold precisions each improve by more than 15 percentage points.



Table 12. Results from the most optimal runs of both discretization models.

Best Performing Algorithms from Both Methodologies

Algorithm                                   Class   Precision   Recall   F1-score   Accuracy
K Nearest Neighbor (Correlation Method)     Sell    0.32        0.04     0.07       0.44
                                            Buy     0.43        0.45     0.44
                                            Hold    0.42        0.57     0.48
  Parameters: n_neighbors = 50, p = 2, weights = 'distance'
Random Forest (Tree Method)                 Sell    0.34        0.08     0.14       0.47
                                            Buy     0.42        0.48     0.45
                                            Hold    0.46        0.54     0.50
  Parameters: criterion = 'gini', max_depth = None, min_samples_leaf = 1, min_samples_split = 5, n_estimators = 10
Benchmark Model                             Sell    0.15        0.15     0.15       0.34
                                            Buy     0.26        0.26     0.27
                                            Hold    0.27        0.24     0.29
  Parameters: N/A

CONCLUSION AND FUTURE WORKS

By collecting quarterly data from 100 different stocks over a 20-year period, this work relates fundamental data to a predicted classification on the rise and fall of technical indicators and then produces a decision for the user to buy, sell, or hold a stock. The report also explores which fundamental features gathered from the quarterly report correlate the most with the decision classification, exploring two different methods of correlation and then producing two sets of features to be utilized in each of the machine learning algorithms. This project also serves the purpose of forming a benchmark foundation for continued studies in this field.

Based on this work, it can be concluded that any efforts to continue this line of study should focus on the K nearest neighbor and random forest algorithms, as they showed the best improvement against the benchmark model. It should also be noted that, while the percentages could be considered low given the nature of our study, the ability of our classifiers to reach the highest reported precision of 46% and accuracy of 47% should be considered a significant improvement. Given the unforgiving nature of the study, owing to the volatile and unpredictable nature of the data, more work needs to be done in this area, but this study shows that fundamental analysis at this stage forms a foundation for future studies. It can also be noted that, on average, the decision tree based results were better than the correlation-based results, and that, on average, the precision of the sell class was lower than the precision of the buy or hold classes.

For future work, we are thinking along the lines of: (i) first and foremost, expanding our dataset to all the available stock indices in the Yahoo Finance database and forming a data pipeline to potentially allow our data to be used indefinitely as new data are produced and posted; (ii) combining the best performing algorithms to increase the performance of our models; and finally, (iii) exploring the effect of modifying the features by creating interactive features using domain knowledge.

CONFLICT OF INTEREST

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

AUTHOR CONTRIBUTIONS

Nicholas Moore was responsible for the initial research and study, including the collection of related works, performing the study of machine learning algorithms, and the initial draft of the paper. He also contributed to the final submission. Sikha Bagui was responsible for guiding Nicholas Moore as he conducted his research and for advising on the research topic and formation. She also coauthored and edited the final submission.

APPENDIX A: ORIGINAL FEATURES

Unless noted otherwise, each feature below takes dollar values in the range −9,999,999,999 to 9,999,999,999.

Date – Date on which each value was reported. Range: dates from 12/31/1999 to 01/01/2020.
Name – Name of each stock as reported in the NASDAQ 100. Range: string values of varying lengths.
Symbol – Symbol used to associate each stock with its name within the stock market. Range: string values of three to four characters.

TotalRevenue∗ – Sum of both operating and non-operating revenues of the company as reported for any given quarter.
CostOfRevenue+ – Cost of manufacturing and delivering the product or service.
GrossProfit+ – Profit after deducting costs associated with making and selling products and/or providing services.
OperatingExpense∗+ – Expense a business incurs through its normal business operations.
OperatingIncome – Profit realized from operations after deducting operating expenses.
NetNonOperatingInterestIncomeExpense – Expense unrelated to core operations; interest charged on the loss of an asset; does not include day-to-day expenses.
OtherIncomeExpense∗ – Income that does not relate directly to business operations.
PretaxIncome – Net sales minus cost of goods sold minus operating expenses.
TaxProvision – Estimated income tax the company is legally expected to pay.
NetIncomeCommonStockholders – Bottom-line profit belonging to common stockholders.
DilutedNIAvailtoComStockholders – Diluted net income; net income adjusted for not paying out any interest expense or preferred dividends.
BasicEPS – Net income minus preferred dividends, divided by the weighted average of common shares outstanding.
DilutedEPS – Value used to gauge the quality of earnings per share of stock.
BasicAverageShares∗ – Average number of shares investors held at any point in the period.
DilutedAverageShares∗ – Shares outstanding after all conversion possibilities are implemented.
TotalOperatingIncomeAsReported – Sum total of profit after subtracting regular, recurring costs and expenses.
TotalExpenses+ – Sum of cost of sales and operating expenses.
NetIncomeFromContinuingAndDiscontinuedOperation – After-tax earnings generated.
NormalizedIncome – Income clearing the impact of non-recurring items.
InterestIncome – Taxable income.
InterestExpense – Cost of borrowing money from banks, bond investors, and other sources.
NetInterestIncome – Difference between revenue from interest-bearing assets and expenses on interest-bearing liabilities.
EBIT – Earnings before interest and taxes.
EBITDA – Earnings before interest, taxes, depreciation, and amortization.
ReconciledCostOfRevenue – The act of reconciling all sales.
ReconciledDepreciation+ – Fixed asset reconciliation statement.
NetIncomeFromContinuingOperationNetMinorityInterest – Net income obtained net of minority shareholders.
TotalUnusualItemsExcludingGoodwill∗ – Non-recurring gain or loss not considered part of normal business.
TotalUnusualItems∗ – Non-recurring gains or losses not considered part of normal business.
NormalizedEBITDA – Net income from continuing operations before interest, income taxes, depreciation, and amortization, excluding any non-recurring items and/or non-cash equity compensation expense.

TaxRateForCalcs – Effective federal tax rate.
TaxEffectOfUnusualItems∗ – Net value of taxable unusual items.
TotalAssets+ – Combined value of the total liabilities and shareholders' equity.
TotalLiabilitiesNetMinorityInterest∗+ – Share of equity ownership not owned or controlled by the parent corporation.
TotalEquityGrossMinorityInterest – Minority interests divided by the total equity.
TotalCapitalization – Sum of the long-term debt and all other equities, including common stock and preferred stock.
CommonStockEquity – Stock held by founders and employees, not included in stock owned by the parent company.
NetTangibleAssets – Total assets of the company minus any intangible assets.
WorkingCapital∗ – Capital used in day-to-day trading operations.
InvestedCapital – Money raised by issuing securities, both stock equity to shareholders and debt to bond holders.
TangibleBookValue – Book value.
TotalDebt – Sum of short- and long-term debt.
ShareIssued – Authorized shares sold to and held by shareholders of the company.
OrdinarySharesNumber+ – Stocks sold on a public exchange.
OperatingCashFlow – Cash generated by normal business operations.
InvestingCashFlow – Cash generated (or spent) on non-current assets.
FinancingCashFlow – Cash flow generated to pay back loans.
EndCashPosition+ – Cash on the books at a specific point in time.
CapitalExpenditure+ – Funds used to undertake new projects or investments.
IssuanceOfCapitalStock – Amount of money generated when the company initially sold its common stock on the open market.
RepaymentOfDebt – After all long-term debt instrument obligations are repaid, the balance sheet will reflect a canceling of principal and liability expenses for the total amount of interest.
RepurchaseOfCapitalStock – When a company buys back its shares from the marketplace.
FreeCashFlow – Cash generated after accounting for cash outflows.
Open – Price at which a financial security opens in the market. Range: values from 0 to 100.
High – Highest price of a financial security on the market for the day. Range: values from 0 to 100.
Low – Lowest price of a financial security. Range: values from 0 to 100.
Close – Closing price of a financial security. Range: values from 0 to 100.
Adj Close – Amends a stock's closing price. Range: values from 0 to 100.
Volume – Amount of an asset or security that changes hands. Range: values from 0 to 999,999,999.

∗ = Feature derived from the Correlation Method.
+ = Feature derived from the Decision Tree Method.
∗+ = Feature derived from both methodologies.
Definitions of each feature were pulled from Investopedia or Yahoo Finance.

REFERENCES

[1] Picasso, Merello, S., Ma, Y., Oneto, L., and Cambria, E. (2019). Technical analysis and sentiment embeddings for market trend prediction. Expert Systems with Applications, 135, 60–70. https://doi.org/10.1016/j.eswa.2019.06.014
[2] Mizuno, Ohnishi, T., and Watanabe, T. (2017). Novel and topical business news and their impact on stock market activity. EPJ Data Science, 6(1), 1–14. https://doi.org/10.1140/epjds/s13688-017-0123-7
[3] Dai, Y., and Shang, Y. (2013). Machine learning in stock price trend forecasting. Stanford University, Stanford.
[4] Sakshi, and A, V. (2020). An ARIMA-LSTM Hybrid Model for Stock Market Prediction Using Live Data. Journal of Engineering Science and Technology Review, 13(4), 117–123. https://doi.org/10.25103/jestr.134.11
[5] Lanbouri, and Achchab, S. (2020). Stock Market prediction on High frequency data using Long-Short Term Memory. Procedia Computer Science, 175, 603–608. https://doi.org/10.1016/j.procs.2020.07.087

[6] Jahufer. (2021). Choosing the Best Performing Garch Model for Sri Lanka Stock Market by Non-Parametric Specification Test. Journal of Data Science, 13(3), 457–472. https://doi.org/10.6339/JDS.201507_13(3).0003
[7] Chong, Han, C., and Park, F. C. (2017). Deep learning networks for stock market analysis and prediction: Methodology, data representations, and case studies. Expert Systems with Applications, 83, 187–205. https://doi.org/10.1016/j.eswa.2017.04.030
[8] Silvija Vlah Jerić. (2020). Rule extraction from random forest for intra-day trading using CROBEX data. Proceedings of FEB Zagreb International Odyssey Conference on Economics and Business, 2(1), 411–419.
[9] Liu, Shen, W.-K., and Zhu, J.-M. (2021). Research on Risk Identification System Based on Random Forest Algorithm-High-Order Moment Model. Complexity (New York, N.Y.), 2021. https://doi.org/10.1155/2021/5588018
[10] Ayala, García-Torres, M., Noguera, J. L. V., Gómez-Vela, F., and Divina, F. (2021). Technical analysis strategy optimization using a machine learning approach in stock market indices. Knowledge-Based Systems, 225, 107119. https://doi.org/10.1016/j.knosys.2021.107119
[11] Chen, Zhang, Z., Shen, J., Deng, Z., He, J., and Huang, S. (2020). A Quantitative Investment Model Based on Random Forest and Sentiment Analysis. Journal of Physics: Conference Series, 1575(1), 012083. https://doi.org/10.1088/1742-6596/1575/1/012083
[12] Thakur, and Kumar, D. (2018). A hybrid financial trading support system using multi-category classifiers and random forest. Applied Soft Computing, 67, 337–349. https://doi.org/10.1016/j.asoc.2018.03.006
[13] Ciner. (2019). Do industry returns predict the stock market? A reprise using the random forest. The Quarterly Review of Economics and Finance, 72, 152–158. https://doi.org/10.1016/j.qref.2018.11.001
[14] Patel, Shah, S., Thakkar, P., and Kotecha, K. (2015). Predicting stock and stock price index movement using Trend Deterministic Data Preparation and machine learning techniques. Expert Systems with Applications, 42(1), 259–268. https://doi.org/10.1016/j.eswa.2014.07.040
[15] Patel, Shah, S., Thakkar, P., and Kotecha, K. (2015). Predicting stock market index using fusion of machine learning techniques. Expert Systems with Applications, 42(4), 2162–2172. https://doi.org/10.1016/j.eswa.2014.10.031
[16] Basak, Kar, S., Saha, S., Khaidem, L., and Dey, S. R. (2019). Predicting the direction of stock market prices using tree-based classifiers. The North American Journal of Economics and Finance, 47, 552–567. https://doi.org/10.1016/j.najef.2018.06.013
[17] Vijh, Chandola, D., Tikkiwal, V. A., and Kumar, A. (2020). Stock Closing Price Prediction using Machine Learning Techniques. Procedia Computer Science, 167, 599–606. https://doi.org/10.1016/j.procs.2020.03.326
[18] Nti, Adekoya, A. F., and Weyori, B. A. (2019). A systematic review of fundamental and technical analysis of stock market predictions. The Artificial Intelligence Review, 53(4), 3007–3057. https://doi.org/10.1007/s10462-019-09754-z
[19] Gil Cohen, Andrey Kudryavtsev, and Shlomit Hon-Snir. (2011). Stock Market Analysis in Practice: Is It Technical or Fundamental? Journal of Applied Finance and Banking, 1(3), 125–138.
[20] Qing-Guo Wang, Jin Li, Qin Qin, and Shuzhi Sam Ge. (2011). Linear, adaptive and nonlinear trading models for Singapore stock market with random forests. 2011 9th IEEE International Conference on Control and Automation (ICCA), 726–731. https://doi.org/10.1109/ICCA.2011.6137897
[21] Pedregosa, F., Duchesnay, E., Perrot, M., Brucher, M., Cournapeau, D., Passos, A., Vanderplas, J., Dubourg, V., Weiss, R., Prettenhofer, P., Blondel, M., Grisel, O., Thirion, B., Michel, V., Gramfort, A., and Varoquaux, G. (2011). Scikit-learn: Machine Learning in Python. Retrieved February 21, 2022, from https://scikit-learn.org/stable/about.html
[22] Yahoo! (n.d.). Yahoo Finance – Stock Market Live, Quotes, Business & Finance News. Retrieved April 15, 2022, from https://finance.yahoo.com/
