Stock Price Prediction LSTM Final Report
PROJECT REPORT
Submitted by
BACHELOR OF TECHNOLOGY
IN
INFORMATION TECHNOLOGY
BONAFIDE CERTIFICATE
Certified further, that to the best of my knowledge the work reported herein
does not form part of any other project report or dissertation on the basis of
which a degree or award was conferred on an earlier occasion on this or any other candidate.
Certified that the candidate was examined in the Viva-Voce Examination held on
__/04/2021.
ACKNOWLEDGEMENT
We give all the glory and thanks to our almighty GOD for showering upon us the
necessary wisdom and grace for accomplishing this project. We express our
gratitude and thanks to our parents for giving us the health and sound mind
needed to complete this project.
First of all, we would like to express our deep gratitude to our beloved and
respectable Chairman Thiru M.V. Muthuramalingam and our Chief
Executive Officer Thiru M.V.M. Velmurugan for their kind
encouragement.
Our thanks also go to all other faculty and non-teaching staff members of our
department for their support, and to our peers for having stood by us and helped us
to complete this project.
ABSTRACT
TABLE OF CONTENTS
3. METHODOLOGY
   3.1 PROPOSED SYSTEM
       3.1.1 ADVANTAGES
   3.2 GENETIC ALGORITHM
   3.3 DEEP LEARNING
       3.3.1 RNN
       3.3.2 LSTM
       3.3.3 GA-LSTM TWO-STAGE STOCK PRICE PREDICTION MODEL
   3.4 MODULES
       3.4.1 IMPORTING LIBRARIES
       3.4.2 LOADING DATA
       3.4.3 READING DATA
       3.4.4 FEATURE EXTRACTION
       3.4.5 MODEL BUILDING
       3.4.6 RESULTS VISUALIZATION
   3.5 EXPERIMENTS
       3.5.1 DATASET DESCRIPTION
       3.5.2 EVALUATION PARAMETERS
4. RESULTS AND DISCUSSION
   4.1 ACCURACY
   4.2 ACCURACY WITH EACH EPOCH
   4.3 ANALYSIS OF TRAINING OF MODEL
   4.4 FINAL RESULTS
5. CONCLUSION AND FUTURE WORK
   5.1 CONCLUSION
   5.2 FUTURE WORK
6. REFERENCES
7. APPENDIX
LIST OF FIGURES
2  GENETIC ALGORITHM LIFECYCLE
12 VISUALIZATION BLOCK
13 DATASET
14 CONFUSION MATRIX
15 ACCURACY RESULT
16 ITERATIONS OF EPOCH
17 HISTORICAL VIEW OF CLOSING PRICE
19 ADDITIONAL MOVING AVERAGE OF STOCK
20 GRAPH PLOT BETWEEN MOVING AVERAGE STOCK IN 10, 20, 50 DAYS
26 HISTOGRAM OF PAIRPLOT OF DAILY RETURN
29 TRAINING PHASE
30 OUTPUT (CLOSE VS PREDICTED)
LIST OF ABBREVIATIONS
1. GA Genetic Algorithm
CHAPTER 1
INTRODUCTION
1.1 OVERVIEW
With the rapid development of the social economy, the number of
listed companies is increasing, and stocks have become one of the hot
topics in the financial field. The changing trend of a stock often affects the
direction of many economic behaviors to a certain extent, so stock price
prediction has received more and more attention from scholars. Stock
market data are non-linear, noisy, complex and time-dependent, so scholars
have done a great deal of research on stock prediction methods. The
traditional approach is to build a linear prediction model based on historical
stock data; for example, Bowden et al. proposed using the ARIMA method
to build an autoregressive model to predict stock prices.
based on rough sets. This method combines the advantages of rough sets
and decision trees, but it is prone to overfitting when dealing with data sets
that contain a large amount of noise, which affects the predicted stock
trend.
classical models to verify the effectiveness of the convolution model in
stock prediction. However, because stock data are sequential, the
convolutional neural network is not the most suitable neural network
model for stock prediction. Selvin et al. proposed three stock prediction
models based on CNN, recurrent neural network (RNN) and LSTM deep
learning networks respectively, and compared the performance of the three
models by predicting the stock prices of listed companies. They concluded
that the LSTM neural network is the most suitable for forecasting the
time-series behaviour of the stock market owing to its long-term memory.
To capture how influencing information changes across different time stages,
Zheng et al. designed a specific attention network and successfully learned the
dynamic influence of the changes of multiple non-predictive time series on the
target series over time.
a data-mining method to examine the linkage between a firm’s
communication data and its share price. As Enron Corporation’s e-mail
messages constitute the only corpus available to the public, we make use of
Enron’s email corpus as the training and testing data for our proposed
algorithm.
1.2.1 LIMITATIONS
algorithm to determine the data patterns on its own. Reinforcement
learning is when you present the algorithm with examples that lack labels,
as in unsupervised learning, but accompany each example with positive or
negative feedback. Machine learning is used in internet search engines,
email filters that sort out spam, websites that make personalised
recommendations, banking software that detects unusual transactions, and
many apps on our phones, such as voice recognition. Machine learning
involves computers discovering how they can perform tasks without being
explicitly programmed to do so; they learn from the data provided so that
they can carry out certain tasks. For simple tasks assigned to computers, it
is possible to program algorithms telling the machine how to execute all
the steps required to solve the problem at hand; on the computer's part, no
learning is needed. For more advanced tasks, it can be challenging for a
human to manually create the needed algorithms. In practice, it can turn
out to be more effective to help the machine develop its own algorithm
rather than having human programmers specify every needed step. The
discipline of machine learning employs various approaches to teach
computers to accomplish tasks where no fully satisfactory algorithm is
available. These approaches are commonly grouped as follows:
● Supervised Learning
● Unsupervised Learning
● Reinforcement Learning
Other approaches have been developed which do not fit neatly into this
three-fold categorisation, and sometimes more than one is used by the
same machine learning system, for example topic modelling,
dimensionality reduction or meta-learning.
1.4.1 Deep Learning vs. Machine Learning
One of the most common AI techniques used for processing big data
is machine learning, a self-adaptive approach whose analysis and pattern
recognition improve with experience or with newly added data.
CHAPTER 2
LITERATURE SURVEY
an overall accuracy of 75.4464% in the training set and of 61.7925% in the test
set. Further, the PCA-SVM stock selection model enables the annual
earnings of the stock portfolio to significantly outperform those of the
A-share index of the Shanghai Stock Exchange.
2.4 STOCK PRICE FORECASTING ON INDONESIAN STOCKS
2.6 THE ANALYSIS AND FORECASTING OF STOCK WITH DEEP LEARNING
CHAPTER 3
METHODOLOGY
3.1 PROPOSED SYSTEM
The proposed system combines the optimal factors selected by a Genetic
Algorithm with the LSTM model for stock prediction.
FIG.1 PROPOSED SYSTEM- WORKFLOW
3.1.1 ADVANTAGES
3.2 GENETIC ALGORITHM
The best individuals in the final generation of the population can be used as the
approximate optimal solution to the problem.
{α1, α2, ..., αn} represents the original feature set. First, a binary
encoding is designed for each chromosome β that represents a potential solution
to the problem, i.e., the binary encoding of each chromosome represents one
feature combination. In the initialization phase, the population size is set
and a random original population {β1, β2, ..., βn} is generated. Then the
fitness of each chromosome is calculated according to the preset fitness
function. The fitness function is an evaluation index used to assess
chromosome performance, and its definition is a key factor affecting the
performance of GA. The fitness calculation is used to retain the excellent
solutions for further reproduction: high-performing chromosomes are more
likely to be selected multiple times, while low-performing ones are more
likely to be eliminated. After several rounds of selection, crossover and
mutation operations, we obtain the optimal chromosome β̂. In this paper,
we adopt the R² determination coefficient as the fitness function of GA.
The determination coefficient reflects what percentage of the fluctuation of
Y can be described by the fluctuation of X, i.e., the degree to which the
characteristic variable X explains the target value Y. The determination
coefficient can be defined as follows:

R² = 1 − Σ (yi − ŷi)² / Σ (yi − ȳ)²

where yi is the observed value of the target, ŷi is the predicted value and ȳ is the
mean of the observed values.
significance to GA. It is beneficial to increase the genetic diversity of the
population by exchanging the corresponding parts of chromosome chains and
changing the gene combination to produce new offspring.
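To illustrate the feature-selection procedure described above, the following is a
minimal Python sketch of a binary-encoded GA with an R²-based fitness function.
It is only a sketch under assumed settings (population size 20, crossover
probability 0.7, mutation probability 0.003, a simple linear model for the fitness
evaluation, and synthetic data); it is not the exact implementation used in this work.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

def fitness(chromosome, X, y):
    """R^2 of a linear model fitted on the features selected by the chromosome."""
    if chromosome.sum() == 0:          # empty feature set gets the worst score
        return -1.0
    X_sel = X[:, chromosome.astype(bool)]
    model = LinearRegression().fit(X_sel, y)
    return r2_score(y, model.predict(X_sel))

def select(population, scores):
    """Roulette-wheel selection: better chromosomes are picked more often."""
    shifted = scores - scores.min() + 1e-9
    probs = shifted / shifted.sum()
    idx = rng.choice(len(population), size=len(population), p=probs)
    return population[idx]

def crossover(population, p_cross=0.7):
    """Single-point crossover applied to consecutive pairs with probability p_cross."""
    offspring = population.copy()
    for i in range(0, len(population) - 1, 2):
        if rng.random() < p_cross:
            point = rng.integers(1, population.shape[1])
            offspring[i, point:], offspring[i + 1, point:] = (
                population[i + 1, point:].copy(), population[i, point:].copy())
    return offspring

def mutate(population, p_mut=0.003):
    """Basic bit mutation: flip each gene with a small probability."""
    mask = rng.random(population.shape) < p_mut
    return np.where(mask, 1 - population, population)

def run_ga(X, y, pop_size=20, n_generations=100):
    n_features = X.shape[1]
    population = rng.integers(0, 2, size=(pop_size, n_features))
    for _ in range(n_generations):
        scores = np.array([fitness(c, X, y) for c in population])
        population = mutate(crossover(select(population, scores)))
    scores = np.array([fitness(c, X, y) for c in population])
    return population[scores.argmax()]

# Toy usage on synthetic data (40 candidate factors, 300 samples)
X = rng.normal(size=(300, 40))
y = X[:, 0] * 2 + X[:, 5] - X[:, 12] + rng.normal(scale=0.1, size=300)
best = run_ga(X, y)
print("Selected factor indices:", np.flatnonzero(best))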
3.3.1 RNN
“memory” which captures information about what has been calculated so
far.
3.3.2 LSTM
FIG 3. LSTM MODEL DIAGRAM
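For reference, in the standard LSTM formulation the gate computations leading up
to Eq. (6) take the following form, where W and b denote the weight matrices and
biases of each gate and σ is the sigmoid function:

ft = σ(Wf · [ht−1, xt] + bf) - (1)
it = σ(Wi · [ht−1, xt] + bi) - (2)
C̃t = tanh(Wc · [ht−1, xt] + bc) - (3)
Ct = ft × Ct−1 + it × C̃t - (4)
ot = σ(Wo · [ht−1, xt] + bo) - (5)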
ht = ot × tanh(Ct) - (6)

where ft, it, C̃t and ot denote the forget gate, input gate, update (candidate cell
state) and output gate respectively, and bf, bi, bc and bo are respectively the biases
of the forget gate, input gate, update gate and output gate. Finally, the output at the
current moment and the updated cell state at the current moment are calculated.
3.3.3 GA-LSTM TWO-STAGE STOCK PRICE PREDICTION MODEL
a random probability. If the random probability is less than the crossover
probability, the exchange is carried out; otherwise, there is no exchange.
(iv) The basic bit mutation method is used to carry out the mutation operation:
in the current generation of individuals, a gene is altered with a small
probability. The mutation probability is set to 0.003. The algorithm produces
one random probability at a time; if it is less than the mutation probability,
the mutation is carried out, otherwise it is not. Steps (ii) through (iv) are
looped until 100 iterations are reached. At the end of the algorithm, we obtain
an optimal population close to the optimal solution. In this paper, the total
number of occurrences of each factor in the population is used to rank factor
importance: the more times a factor appears, the more important it is. Table 1
shows the specific model parameters of this stage.

Table 1

The second stage of this study is the feature-selection optimization of the
LSTM stock prediction model. Based on the factor importance ranking
obtained in the previous stage, the top 40, 30, 20, 10 and 5 factors are taken as
input features of the LSTM model. By comparing the prediction results, the
optimal factor combination is determined, and the optimal model is compared
with the baseline models to verify the superiority of the proposed
optimization model in improving the model accuracy. Table 2 shows the
specific model parameters.

Table 2

There are three network layers in the model, namely the input layer, the
hidden layer and the output layer. The numbers of neurons in the hidden layer
and the output layer are 128 and 1 respectively, and the dropout parameter is
set to 0.2 to randomly remove a portion of the neurons and avoid overfitting.
The LSTM network time step is set to 5, i.e., this paper takes the historical
data of the previous five days as input to predict the stock price of the next
day. The model's gradient-descent optimizer is Adam, and the number of
model iterations is 100. In this paper, the data are divided into a training set
and a test set on a scale of 8:2: the first 80% of the data is used for training,
while the remaining 20% is used to evaluate the model. The model adopts the
mean square error (MSE) as the evaluation index, and the formula is as
follows:

MSE = (1/m) Σ (yi − ŷi)²
where m is the number of samples, yi is the actual stock price, and ŷi is the
stock price forecast by the model. Fig. 3 shows the experimental process of this
study. We obtain 5 multi-factor combinations t40, t30, t20, t10 and t5 by
ranking the importance of the original feature set {α1, α2, ..., αn}. These
multi-factor combinations are used as the input features of the LSTM to
predict the stock price ŷt at time t.
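To make the architecture described above concrete, the following is a minimal
Keras sketch of the LSTM configuration as the text describes it (a hidden LSTM
layer of 128 neurons, dropout 0.2, one output neuron, time step of 5, Adam
optimizer, MSE loss, 100 iterations, 80:20 split). The number of input features,
the windowing helper and the synthetic data are assumptions for illustration only.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

TIME_STEP = 5          # use the previous five days to predict the next day
N_FEATURES = 10        # e.g. the top-10 factors selected by the GA (assumed)

def make_windows(features, target, time_step=TIME_STEP):
    """Slice the factor matrix into (samples, time_step, n_features) windows."""
    x, y = [], []
    for i in range(time_step, len(features)):
        x.append(features[i - time_step:i])
        y.append(target[i])
    return np.array(x), np.array(y)

# Synthetic placeholder data: rows = trading days, columns = selected factors
factors = np.random.rand(500, N_FEATURES)
prices = np.random.rand(500)

x, y = make_windows(factors, prices)
split = int(len(x) * 0.8)                      # 80:20 train/test split
x_train, x_test = x[:split], x[split:]
y_train, y_test = y[:split], y[split:]

# Input layer -> hidden LSTM layer (128 units, dropout 0.2) -> output layer (1 unit)
model = Sequential([
    LSTM(128, input_shape=(TIME_STEP, N_FEATURES)),
    Dropout(0.2),
    Dense(1),
])
model.compile(optimizer='adam', loss='mean_squared_error')  # Adam + MSE, as in the text

model.fit(x_train, y_train, epochs=100, batch_size=32, verbose=0)
mse = model.evaluate(x_test, y_test, verbose=0)
print("Test MSE:", mse)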
3.4 MODULES
FIG. 5 IMPORT LIBRARY
If you are not using Google Colab, you can place the data file in the
same folder as the code.
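A typical import cell for this project would look like the sketch below; the library
choices (TensorFlow/Keras, pandas-datareader, scikit-learn, seaborn) are inferred
from the code listed in the appendix.

# Libraries used throughout the project (inferred from the appendix code)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from datetime import datetime
from pandas_datareader.data import DataReader   # historical prices from Yahoo Finance

from sklearn.preprocessing import MinMaxScaler  # scale prices into [0, 1]
from tensorflow.keras.models import Sequential  # LSTM model building blocks
from tensorflow.keras.layers import Dense, LSTM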
3.4.4 FEATURE EXTRACTION
Now we are going to create our training and testing data by calling
our function for each one:
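As an illustration of this step, a helper of the following shape could build the
training and testing sequences from the scaled closing prices; the 60-day lookback
and the function name are assumed values for illustration only.

import numpy as np

def create_sequences(scaled_data, train_len, lookback=60):
    """Build (x, y) windows of `lookback` past prices for training and testing."""
    x_train, y_train = [], []
    for i in range(lookback, train_len):
        x_train.append(scaled_data[i - lookback:i, 0])
        y_train.append(scaled_data[i, 0])

    x_test, y_test = [], []
    for i in range(train_len, len(scaled_data)):
        x_test.append(scaled_data[i - lookback:i, 0])
        y_test.append(scaled_data[i, 0])

    return (np.array(x_train), np.array(y_train),
            np.array(x_test), np.array(y_test))

# Example usage with the variables defined in the appendix:
# x_train, y_train, x_test, y_test = create_sequences(scaled_data, training_data_len)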
FIG.10 MODEL - LSTM BLOCK
After that, we want to reshape our features for the LSTM layer, because the
Sequential model's LSTM layer expects 3-dimensional input of shape
(samples, time steps, features), not 2-dimensional input:
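A minimal sketch of that reshape, assuming the x_train and x_test arrays produced
by the windowing helper sketched above:

import numpy as np

# Add a trailing "features" dimension: (samples, time steps) -> (samples, time steps, 1)
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1))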
3.4.6 RESULTS VISUALIZATION
3.5 EXPERIMENTS
3.5.1 DATASET DESCRIPTION
If you take a look at the dataset, you need to know that the “open”
column represents the opening price of the stock on that “date”, and the
“close” column is the closing price on that day. The “High” column
represents the highest price reached that day, and the “Low” column
represents the lowest price.

FIG.13 DATASET
3.5.2 EVALUATION PARAMETERS
The following metrics are used to evaluate the model (a computation sketch
follows the figure below):
● Accuracy
● Classification Report
● Confusion Matrix

FIG.14 CONFUSION MATRIX
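Accuracy, the classification report and the confusion matrix are classification
metrics, so they are typically computed on the direction of the predicted price
movement (up or down) rather than on the raw prices. The sketch below shows one
way this could be done; the placeholder arrays stand in for the model's actual and
predicted prices and are assumptions for illustration.

import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Placeholder arrays keep the sketch runnable; in this project they would be the
# actual closing prices (y_test) and the LSTM's predicted prices (predictions).
rng = np.random.default_rng(0)
y_test = 100 + rng.random(100) * 10
predictions = y_test + rng.normal(scale=0.5, size=100)

# Convert prices to movement direction: 1 = price went up, 0 = price went down
actual_direction = (np.diff(y_test) > 0).astype(int)
predicted_direction = (np.diff(predictions) > 0).astype(int)

print("Accuracy:", accuracy_score(actual_direction, predicted_direction))
print(classification_report(actual_direction, predicted_direction))
print(confusion_matrix(actual_direction, predicted_direction))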
CHAPTER 4
RESULTS AND DISCUSSION
4.1 ACCURACY
4.2 ACCURACY WITH EACH EPOCH
4.3 ANALYSIS OF TRAINING OF MODEL
FIG.18 TOTAL VOLUME OF STOCK TRADE PER DAY
FIG.19 ADDITIONAL MOVING AVERAGE OF STOCK
FIG.20 GRAPH PLOT BETWEEN MOVING AVERAGE STOCK IN 10, 20, 50 DAYS
FIG.22 HISTOGRAM FOR DAILY RETURN USING SEABORN
FIG.23 COMPARING GOOGLE WITH ITSELF FOR LINEAR RELATION
FIG.25 LINEAR REGRESSION WITH EACH OTHER
FIG.26 HISTOGRAM OF PAIRPLOT OF DAILY RETURN
FIG.27 CORRELATION WITH EACH OTHER USING SEABORN FOR DAILY RETURN
4.4 FINAL RESULTS
FIG.29 TRAINING PHASE
FIG.31 PREDICTED WITH CLOSING PRICE
CHAPTER 5
CONCLUSION AND FUTURE WORK
5.1 CONCLUSION
With the help of the Genetic Algorithm and LSTM, we were able to
predict the stock price of a company with an accuracy of over 90%, which
makes this approach one of the better-performing methods for predicting a
company's stock price.
CHAPTER 6
REFERENCES
[3] Y. Kim, J.-H. Roh, and H. Kim, ‘‘Early forecasting of rice blast disease
using long short-term memory recurrent neural networks,’’ Sustainability,
vol. 10, no. 2, p. 34, Dec. 2017.
[7] J. Hu and W. Zheng, ‘‘Multistage attention network for multivariate
time series prediction,’’ Neurocomputing, vol. 18, no. 383, pp. 122–137,
Mar. 2020.
[8] U. F. Siddiqi, S. M. Sait, and O. Kaynak, ‘‘Genetic algorithm for the
mutual information-based feature selection in univariate time series data,’’
IEEE Access, vol. 8, pp. 9597–9609, 2020.
CHAPTER 7
APPENDIX
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from datetime import datetime
from pandas_datareader.data import DataReader

from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
# Ticker list and date range are not shown in the extracted listing; the values
# below are assumptions (AAPL and GOOG appear explicitly in the report).
tech_list = ['AAPL', 'GOOG', 'MSFT', 'AMZN']
company_name = ['APPLE', 'GOOGLE', 'MICROSOFT', 'AMAZON']
end = datetime.now()
start = datetime(end.year - 1, end.month, end.day)

# For loop for grabbing Yahoo Finance data and setting it as a DataFrame
for stock in tech_list:
    # Set DataFrame as the stock ticker
    globals()[stock] = DataReader(stock, 'yahoo', start, end)

# Tag each frame with its company name so df can be grouped later
company_list = [AAPL, GOOG, MSFT, AMZN]
for company, com_name in zip(company_list, company_name):
    company['company_name'] = com_name

df = pd.concat(company_list, axis=0)
df

# Summary stats
AAPL.describe()
# General info
AAPL.info()
plt.figure(figsize=(12, 8))
plt.subplots_adjust(top=1.25, bottom=1.2)
ma_day = [10, 20, 50]

# Compute simple moving averages of the adjusted close over 10, 20 and 50 days
for ma in ma_day:
    for company in company_list:
        column_name = f"MA for {ma} days"
        company[column_name] = company['Adj Close'].rolling(ma).mean()

print(GOOG.columns)

df.groupby("company_name").hist(figsize=(12, 12));
axes[0,1].set_title('GOOGLE')
fig.tight_layout()
axes[0,1].set_title('GOOGLE')
fig.tight_layout()
# Grab all the closing prices for the tech stock list into one DataFrame
closing_df = DataReader(tech_list, 'yahoo', start, end)['Adj Close']
# Daily percentage returns for each ticker (used by the plots below)
tech_rets = closing_df.pct_change()

sns.pairplot(tech_rets, kind='reg')
# Pair grid with histogram plots of the daily return on the diagonal
returns_fig = sns.PairGrid(tech_rets.dropna())
returns_fig.map_diag(plt.hist, bins=30)

# Correlation heatmap of the closing prices
sns.heatmap(closing_df.corr(), annot=True, cmap='summer')
# Risk vs. expected return scatter plot of the daily returns
rets = tech_rets.dropna()
area = np.pi * 20

plt.figure(figsize=(12, 10))
plt.scatter(rets.mean(), rets.std(), s=area)
plt.xlabel('Expected return')
plt.ylabel('Risk')
# Get the stock quote
df = DataReader('AAPL', data_source='yahoo', start='2012-01-01', end=datetime.now())
# Show the data
df

plt.figure(figsize=(16, 8))
plt.title('Close Price History')
plt.plot(df['Close'])
plt.xlabel('Date', fontsize=18)
plt.ylabel('Close Price USD ($)', fontsize=18)
plt.show()

# The cell that defines `dataset` and `training_data_len` is missing from the
# extracted listing; the lines below reconstruct it following the report's
# 80:20 train/test split.
data = df.filter(['Close'])
dataset = data.values
training_data_len = int(np.ceil(len(dataset) * 0.8))
training_data_len

# Scale the closing prices into the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(dataset)
scaled_data
# Build the LSTM model
# (x_train comes from the sliding-window step described in Section 3.4.4)
model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape=(x_train.shape[1], 1)))
model.add(LSTM(64, return_sequences=False))
model.add(Dense(25))
model.add(Dense(1))

# Compile with the Adam optimizer and MSE loss, as described in the methodology
model.compile(optimizer='adam', loss='mean_squared_error')
# Reshape the data for the LSTM layer
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1))
valid