STOCK MARKET ANALYSIS
AND PREDICTION
A PROJECT REPORT
Submitted by
KANAGARAJ P (17BIT008)
BOOPATHY K (17BIT028)
PRASANTH G (17BIT201)
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
May 2021
KUMARAGURU COLLEGE OF TECHNOLOGY
COIMBATORE-641 049
(An Autonomous Institution Affiliated to Anna University, Chennai)
BONAFIDE CERTIFICATE
Certified that this project report “Stock Market Analysis and Prediction” is
the bonafide work of “Kanagaraj P (17BIT008), Boopathy K (17BIT028),
Prasanth G (17BIT201)”, who carried out the project work under my
supervision.
SIGNATURE
Dr. M. Alamelu
Head of the Department
Associate Professor
Information Technology

SIGNATURE
Dr. P. C. Thirumal
Supervisor
Associate Professor
Information Technology
DECLARATION
We affirm that the project work titled “Stock Market Analysis and Prediction”,
being submitted in partial fulfillment for the award of B.Tech Information
Technology, is the original work carried out by us. It has not formed part of
any other project work submitted for the award of any degree or diploma, either
in this or any other University.
Kanagaraj P (17BIT008)
Boopathy K (17BIT028)
Prasanth G (17BIT201)
ACKNOWLEDGEMENT
We extend our gratitude to our Principal, Dr. D. Saravanan, for providing us with
the necessary facilities to pursue the project.
Our sincere and hearty thanks go to the staff members of the Department of
Information Technology of Kumaraguru College of Technology for their good wishes,
timely help and support rendered to us during our project. We are greatly indebted
to our family, relatives and friends, without whom our lives would not have been
shaped to this level.
-Kanagaraj
-Boopathy
-Prasanth
TABLE OF CONTENTS
3. ALGORITHMS
3.1 SVM (Support Vector Machine for Regression)
3.2 LSTM (Long Short-Term Memory)
3.3 ARIMA (Auto Regressive Integrated Moving Average)
3.4 RANDOM FOREST
3.5 LINEAR REGRESSION
6. RESULTS
6.1 GRAPH
6.2 EVALUATION
6.3 RMSE COMPARISON
7. DISCUSSIONS
7.1 DECISION MADE
7.2 DIFFICULTIES FACED
7.3 THINGS THAT WORKED AND DIDN’T WORK WELL
8. CONCLUSION
9. FUTURE SCOPE
REFERENCES
10. APPENDIX
CHAPTER-1
INTRODUCTION
1.1 DESCRIPTION
CHAPTER-2
TECHNOLOGIES & TOOLS
Language and libraries: Python, SciPy, NumPy, Pandas, scikit-learn, Keras.
Keras is required to implement the LSTM model. The other libraries are required
to process data and implement the machine learning algorithms. Pandas made data
pre-processing relatively easy.
Tool: Google Colab, which is convenient to use and fast.
CHAPTER-3
ALGORITHMS
3.1 SVM (Support Vector Machine for Regression):
SVM is considered one of the most important breakthroughs in the machine
learning field and can be applied to both classification and regression. In this
project, Support Vector Regression (SVR) is used to solve the regression problem,
as it avoids the difficulties of fitting a purely linear function to the data.
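As a minimal sketch of SVR applied to a price series, the following fits scikit-learn's `SVR` with an RBF kernel to predict each value from the previous day's value. The synthetic series, the kernel choice and the hyperparameters (`C`, `epsilon`) are illustrative assumptions, not the settings used in this project.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic, smoothly varying "closing prices" (assumed data for illustration).
prices = 100 + 10 * np.sin(np.linspace(0, 6, 60))

# Predict today's price (y) from yesterday's price (X).
X = prices[:-1].reshape(-1, 1)
y = prices[1:]

# The RBF kernel lets SVR fit a non-linear relation; C and epsilon are assumed values.
model = SVR(kernel="rbf", C=100.0, epsilon=0.1)
model.fit(X, y)

predicted = model.predict(X)
print(predicted[:3])
```

In practice the model would be fitted on the training portion of the series only and evaluated on the held-out portion.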
3.2 LSTM (Long Short-Term Memory):
LSTM is a recurrent neural network (RNN) architecture that learns from values
observed over time intervals. It keeps track of past values and uses their
changes to predict future values. In our project we have a stock value for each
day, which can be treated as a sequence of values. Because of its ability to act
as a memory unit, LSTM can be treated as one of the best algorithms for
time-series analysis problems.
For example, let Y be the present value and X the value one day earlier. LSTM
links X and Y to predict the future value.
X    Y
22   35
35   48
48   52
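The X/Y pairing in the table can be sketched in a few lines of plain Python. The helper name `make_lagged_pairs` is hypothetical and only illustrates how a series is turned into (past, present) pairs; it is not the function used in the appendix.

```python
def make_lagged_pairs(series, lag=1):
    """Pair each value (X) with the value `lag` steps later (Y)."""
    xs = series[:-lag]
    ys = series[lag:]
    return list(zip(xs, ys))

# The four closing values behind the table above.
closing = [22, 35, 48, 52]
pairs = make_lagged_pairs(closing)
for x, y in pairs:
    print(x, y)
```

A four-value series yields three pairs, because the first value has no predecessor to pair with.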
Fig.2: workflow
CHAPTER-5
EXPERIMENTS / PROOF OF CONCEPT EVALUATION
5.1 DATASET
In this project, we chose a dataset of National Stock Exchange (NSE) stocks. The
dataset contains Indian stocks, and its index covers a diverse set of sectors
featuring many Indian companies. Our aim was to build a general and unbiased
model that works in every type of scenario, irrespective of company or financial
sector. This helps to validate our predictive algorithms and provide more
accurate stock predictions.
Our dataset includes eight features: company Index, Date, Time, Open, Close,
High and Low values, and Volume of trading (prices are in INR). The dataset
covers 440 companies at one-minute intervals since 2015. We chose this dataset
because it is quite large (~2 GB) and can be used to evaluate several companies
with our algorithms. With the primary dataset prepared, we applied
pre-processing methods to carry out the individual experiments.
5.2 Data pre-processing
Pre-processing refers to the transformations applied to the data before
feeding it to the algorithm. Selecting and pre-processing the data are crucial
steps in any modelling effort, particularly for building a predictive model that
generalizes. Our dataset has some limitations: it contains invalid values, null
values and missing records. We applied the following techniques to pre-process
our data and make the predictions more accurate.
1. Data cleaning:
In the real world, data tends to be incomplete and inconsistent. The
purpose of data cleaning is therefore to fill in missing values and correct
inconsistencies in the data. The Index, Date, Time and closing price fields
of the NSE dataset are used as input. There were some missing values due to
public holidays. We removed null values and invalid indexes. There were also
a few irrelevant columns in the dataset which were not used as input, so we
eliminated those columns to reduce the complexity of our prediction model.
2. Data Transformation:
Our dataset contains minute-wise stock prices, but our models need daily prices,
so we grouped the data by day and took the mean of all the rows for each day.
We also applied min-max scaling for a few algorithms to get more accurate
predictions.
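The cleaning and transformation steps above can be sketched with pandas as follows. The tiny data frame, its column names (`Code`, `Date`, `Close`) and the added `CloseScaled` column are assumptions for illustration; the real dataset has more columns and far more rows.

```python
import pandas as pd

# Tiny stand-in for the minute-level data; column names are assumed.
raw = pd.DataFrame({
    "Code": ["ABAN", "ABAN", "ABAN", "ABAN", "ABAN"],
    "Date": ["2015-01-01", "2015-01-01", "2015-01-02", "2015-01-02", "2015-01-02"],
    "Close": [100.0, 102.0, None, 110.0, 114.0],
})

# 1. Data cleaning: drop rows whose closing price is missing.
clean = raw.dropna(subset=["Close"])

# 2. Data transformation: average the minute-level prices per company and day.
daily = clean.groupby(["Code", "Date"], as_index=False)["Close"].mean()

# Min-max scaling maps the daily prices into the range [0, 1].
low, high = daily["Close"].min(), daily["Close"].max()
daily["CloseScaled"] = (daily["Close"] - low) / (high - low)
print(daily)
```

The scaling formula here is written out by hand to keep the sketch short; it applies the same mapping as scikit-learn's `MinMaxScaler` with its default range.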
CHAPTER-6
RESULTS
In this project we performed a time-series analysis, which does not use n-fold
cross-validation because the data is sequential and its order must be preserved.
We split the dataset into training and test data: the first 80 percent of the
data is used for training and the remaining 20 percent for testing.
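A chronological 80/20 split can be sketched as below. The function name `chronological_split` and the sample prices are hypothetical; the point is only that the series is cut at one position and never shuffled.

```python
def chronological_split(series, train_fraction=0.8):
    """Split an ordered series into train and test parts without shuffling."""
    split_at = int(len(series) * train_fraction)
    return series[:split_at], series[split_at:]

# Ten assumed daily prices, oldest first.
daily_prices = [101, 103, 102, 105, 107, 110, 108, 111, 115, 117]
train, test = chronological_split(daily_prices)
print(len(train), len(test))
```

Keeping the most recent values for testing mirrors how the model would be used: trained on the past and evaluated on data it has not yet seen.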
6.1 Graphs
Below are graphs for a few companies, with predicted values (red) plotted
against actual values (blue) for the different algorithms. We can see that LSTM
and ARIMA perform better than random forest and linear regression.
Fig 3: Actual closing price index and its predicted value from LR, RF, LSTM models
Fig 4: Graph Comparison for five companies (Left to right) Infotech, 8kmiles, Aban, Bosch Ltd,
NTPC
6.2 Evaluation
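The accuracy of prediction is referred to as “goodness of fit”. In this project,
the root mean squared error (RMSE), one of the most popular statistical accuracy
measures, is used to compare the different algorithms on the same dataset; a
lower RMSE indicates a better fit. It is defined as:
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
where y_i is the actual value, \hat{y}_i is the predicted value and n is the
number of test observations.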
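As a minimal sketch, RMSE (the square root of the mean of the squared differences between actual and predicted values) can be computed as follows. The function name `rmse` and the sample values are hypothetical and are not results from this project.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Assumed example values, for illustration only.
actual = [35.0, 48.0, 52.0]
predicted = [33.0, 50.0, 51.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.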
6.3 RMSE Comparison
Model 3IINFOTECH 8KMILES ABAN
CHAPTER-7
DISCUSSION
CHAPTER-8
CONCLUSION
In this project, we proposed the use of different algorithms to predict the
future stock prices of almost twenty companies. Although the comparison is shown
for only five randomly selected companies in this report due to space
constraints, the behaviour for any other company can be obtained by running the
same code. The Long Short-Term Memory algorithm worked best for forecasting.
Ranked from best to worst for stock market forecasting, the algorithms were:
1. LSTM
2. ARIMA
3. RF
4. LR & SVR
CHAPTER-9
FUTURE SCOPE
In future, we will extend the project to other effective methods that might
result in better performance. Our algorithms could be used to help maximize
investors' profit, but they have to be improved for real-time conditions.
REFERENCES
http://markdunne.github.io/public/mark-dunne-stock-market-prediction.pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.278.6139&rep=rep1&type=pdf
https://www.kaggle.com/ramamet4/nse-company-stocks
https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/
https://ec.europa.eu/eurostat/sa-elearning/arima-model
CHAPTER-10
APPENDIX
import math
import time
from subprocess import check_output

import numpy as np
import scipy as sp
import pandas as pd
import sklearn.preprocessing as prep
import matplotlib.pyplot as plt
from matplotlib import rcParams
# All four layers are exported from keras.layers; the keras.layers.core and
# keras.layers.recurrent paths may not exist in recent Keras releases.
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, LSTM

# Notebook-only magic for inline plots; uncomment in Colab/Jupyter.
# %matplotlib inline
# Load the price data and average it per company (Code) and day (Date).
df = pd.read_csv('groupeddf.csv')
uniqueVals = df["Code"].unique()
# A single groupby replaces the per-company loop; DataFrame.append was
# removed in pandas 2.0. reset_index() turns Code and Date back into columns.
grouped_df = df.groupby(['Code', 'Date']).mean(numeric_only=True).reset_index()
print(grouped_df.head())
print(uniqueVals[:10])
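def create_dataset(dataset, past=5):
    # Build (window, next value) pairs: each X holds `past` consecutive
    # values and Y is the value that immediately follows them.
    dataX, dataY = [], []
    # The extra -1 leaves the last possible window unused.
    for i in range(len(dataset) - past - 1):
        j = dataset[i:(i + past), 0]
        dataX.append(j)
        dataY.append(dataset[i + past, 0])
    return np.array(dataX), np.array(dataY)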
from sklearn.preprocessing import MinMaxScaler

def testandtrain(prices):
    # Scale the closing prices to [0, 1] and build train/test windows.
    prices = np.asarray(prices, dtype=float).reshape(-1, 1)
    scaler = MinMaxScaler(feature_range=(0, 1))
    prices = scaler.fit_transform(prices)
    # Chronological split: first 80% for training, the rest for testing.
    trainsize = int(len(prices) * 0.80)
    train, test = prices[:trainsize, :], prices[trainsize:, :]
    print(len(train), len(test))
    # Windows of length 1: predict each day from the previous day.
    x_train, y_train = create_dataset(train, 1)
    x_test, y_test = create_dataset(test, 1)
    # Reshape to (samples, time steps, features), as the LSTM layer expects.
    x_train = np.reshape(x_train, (x_train.shape[0], 1, x_train.shape[1]))
    x_test = np.reshape(x_test, (x_test.shape[0], 1, x_test.shape[1]))
    return x_train, y_train, x_test, y_test
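# The header of the function enclosing the next lines is missing from the
# report; the name and signature below are assumed.
def predict_and_plot(model, testX, testY):
    testPredict = model.predict(testX)
    # Predicted values in red and actual values in blue, as described in 6.1.
    plt.plot(testPredict, color="red", label="predicted")
    plt.plot(testY, color="blue", label="actual")
    plt.legend()
    plt.show()
    print(" --------end for the company------")
    return testPredict

# `prices`, `model` and `trainingmodel` are defined in parts of the notebook
# that are not reproduced in this report.
trainX, trainY, testX, testY = testandtrain(prices)
model = trainingmodel(model, trainX, trainY)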