
STOCK PRICE PREDICTION USING REINFORCEMENT LEARNING

Jae Won Lee

School of Computer Science and Engineering


Sungshin Women's University
Seoul, 136-742, South Korea
ABSTRACT

Recently, numerous investigations into stock price prediction and portfolio management using machine learning have tried to develop efficient mechanical trading systems. But these systems have a limitation in that they are mainly based on supervised learning, which is not so adequate for learning problems with long-term goals and delayed rewards. This paper proposes a method of applying reinforcement learning, suitable for modeling and learning various kinds of interactions in real situations, to the problem of stock price prediction. The stock price prediction problem is considered as a Markov process which can be optimized by a reinforcement learning based algorithm. TD(0), a reinforcement learning algorithm which learns only from experience, is adopted, and function approximation by an artificial neural network is performed to learn the values of states, each of which corresponds to a stock price trend at a given time. An experimental result based on the Korean stock market is presented to evaluate the performance of the proposed method.

1. INTRODUCTION

There has been much work done on ways to predict stock prices, for the sake of providing investors with a basis for optimal decision-making in trading. Various technical indicators such as moving averages [1] have been developed by researchers in the economics area, and nowadays numerous investigations in computer science have intended to develop different decision support systems for stock price prediction. Especially, some recent systems using machine learning methods such as artificial neural networks achieved better performance than those using only conventional indicators [2][3][4]. But these systems have a limitation in that they are mainly based on supervised learning. This is an important kind of learning, but alone it is not adequate for learning from interactions with long-term goals.

In the learning task of stock prediction, it is more natural and effective to represent target values by the successive relative changes in price since the previous time point than by the absolute prices after a fixed time horizon, which are generally adopted in conventional time series approaches based on supervised learning methods. In this sense, reinforcement learning can be a promising alternative approach for stock price prediction, representing and learning the delayed rewards, as well as the immediate rewards, from interactive processes more effectively [5].

This paper applies reinforcement learning to the problem of stock price prediction, regarding the process of stock price changes as a Markov process. For this, the process of stock price changes is modeled by the elements of reinforcement learning such as state, action, reward, policy, etc., and the TD(0) algorithm [6], a kind of reinforcement learning method, is used. TD (temporal-difference) algorithms can learn directly from raw experience without a complete model of the environment's dynamics. Like DP (dynamic programming), TD algorithms update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap) [7].

Though the ultimate purpose of reinforcement learning is to get an optimal policy, that is, to resolve the control problem, the use of the TD algorithm in this paper is confined to the prediction problem, because the policy in the stock market is assumed to be determined by each investor and to be beyond the scope of learning here. In this paper, each stock price trend, the trace of successive price changes at a given time, is mapped to a state of reinforcement learning that is represented by the combination of some numerical features. The total number of possible states defined in this way is not finite, and thus a generalization method is needed to approximate the value of the expected future cumulative reward for a state. Here a multi-layer neural network is used for this purpose.

2. REINFORCEMENT LEARNING

Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision-making [8]. It is distinguished from other computational approaches by its emphasis on learning by the individual from direct interaction with its environment, as shown in Figure 1.



Figure 1: The agent-environment interaction in reinforcement learning.

At each discrete time step t, the agent senses the current state s_t, chooses a current action a_t, and performs it. The environment responds by giving the agent a reward r_{t+1} = r(s_t, a_t) and by producing the succeeding state s_{t+1} = d(s_t, a_t). Here the functions d and r are part of the environment and are not necessarily known to the agent. In an MDP (Markov decision process), the functions d(s_t, a_t) and r(s_t, a_t) depend only on the current state and action, not on earlier states or actions. The task of the agent is to learn a policy, π: S -> A, where S is the set of states and A is the set of actions, for selecting its next action based on the current observed state s_t; that is, π(s_t) = a_t. An optimal policy is a policy that can maximize the possible reward from a state, called its value, V^π(s_t), for all states. Equation (1) is a typical definition of this:

    V^π(s_t) = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}    (1)

Here γ (0 < γ < 1) is a constant that determines the relative value of delayed versus immediate rewards. V^π(s) means the possible cumulative reward achieved by following an arbitrary policy π from an arbitrary initial state s. Then an optimal policy, π*, is defined as follows:

    π* = argmax_π V^π(s), for all s in S    (2)

An iterative process shown in Figure 2, called Generalized Policy Iteration (GPI), is necessary to get π*.

Figure 2: Generalized policy iteration (alternating evaluation and improvement steps).

This paper applies the learning only to the prediction problem, by utilizing the result of the evaluation step in Figure 2.
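As a concrete illustration of equation (1), the following Python sketch computes the discounted cumulative reward of a finite reward sequence; the function name, the reward values, and the discount factor are illustrative assumptions, not data from the paper.

def discounted_return(rewards, gamma):
    # V(s_t) = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..., computed backwards
    # as V_t = r_{t+1} + gamma * V_{t+1}, truncated at the end of the sequence.
    value = 0.0
    for r in reversed(rewards):
        value = r + gamma * value
    return value

# Illustrative daily rewards (relative price changes) and discount factor.
print(discounted_return([0.10, -0.02, 0.03, -0.01], gamma=0.8))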

3. TD ALGORITHM

Various reinforcement learning algorithms (e.g. dynamic programming, temporal-difference, Monte Carlo, etc.) can be classified in view of GPI, and the differences among the algorithms lie primarily in their approaches to the prediction problem. Like Monte Carlo algorithms, TD algorithms can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD algorithms are bootstrapping algorithms, that is, they base their update in part on an existing estimate. After all, TD algorithms combine the sampling of Monte Carlo algorithms with the bootstrapping of DP algorithms.

TD algorithms have an advantage over DP algorithms in that they do not require a model of the environment, of its reward and next-state probability distributions [9]. TD algorithms are different from Monte Carlo algorithms in that they are suitable for continuous (not episodic) tasks, such as stock trading, or tasks with very long episodes(1) [10]. At time step t+1 they immediately form a target and make a useful update using the observed reward and the estimate V(s_{t+1}). The simplest TD algorithm, known as TD(0), is

    V(s_t) <- V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) - V(s_t) ]    (3)

In effect, the target for the TD update is r_{t+1} + γ V(s_{t+1}). TD(0) in complete procedural form is as follows:

    Initialize V(s) arbitrarily, π to the policy to be evaluated
    Repeat (for each episode):
        Initialize s
        Repeat (for each step of episode):
            a <- action given by π for s
            Take action a; observe reward, r, and next state, s'
            V(s) <- V(s) + α [ r + γ V(s') - V(s) ]
            s <- s'
        Until s is terminal

(1) This is a unit for separating interactions between the agent and the environment into subsequences.
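The procedure above can be turned into a short program. The following Python sketch is a minimal tabular TD(0) prediction loop under a fixed policy; the toy random-walk environment, the function names, the step size α = 0.1, and the discount factor γ = 0.8 are assumptions introduced only for illustration.

import random
from collections import defaultdict

def td0_prediction(env_step, policy, episodes, alpha=0.1, gamma=0.8):
    # Tabular TD(0): estimate V(s) for a fixed policy from sampled episodes.
    # env_step(s, a) must return (reward, next_state, done); policy(s) returns an action.
    V = defaultdict(float)              # V(s) initialized arbitrarily (here: 0.0)
    for _ in range(episodes):
        s = 0                           # initial state of each toy episode
        done = False
        while not done:
            a = policy(s)
            r, s_next, done = env_step(s, a)
            # TD(0) update: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V

# Toy random-walk "market": states 0..4, the episode ends when state 4 is reached.
def env_step(s, a):
    s_next = s + 1
    r = random.gauss(0.0, 0.01)         # stand-in for a daily relative price change
    return r, s_next, s_next == 4

V = td0_prediction(env_step, policy=lambda s: 0, episodes=1000)
print(dict(V))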



4. STOCK PRICE CHANGES IN TD VIEW

It is a highly difficult problem to build a model that can clearly describe stock price changes in terms of the actions or policies of investors, because investors adopt many different kinds of policies and actions. For example, one investor may decide to sell or buy a stock based only on its current price trend, while another relies heavily on fundamental analysis. But for the purpose of this paper it is sufficient to define only the state and the reward in the process of stock price change for TD learning, assuming that each agent's policy and actions are implicitly reflected in the stock data, such as a chart showing the series of stock price changes.

4.1. State

The next thing to be considered for modeling stock price changes in the TD view is the state of a stock at a given time. By the definition of the policy in Section 2, a state is the unique input parameter of a policy function. That is, an action is selected depending only on a given state, if we assume that the overall policies of investors rarely change. Thus any kind of data that is consulted every day by investors can be considered in defining the state of a stock.

The generally available daily data for each stock is represented by the following time series:

    y_O : the price of the first trade of the day (open price).
    y_H : the highest traded price during the day.
    y_L : the lowest traded price during the day.
    y_C : the last price at which the security traded during the day (close price).
    v   : the number of shares (or contracts) traded during the day.

Most market experts also consult the conventional indicators of technical analysis. Therefore various kinds of technical indicators, as well as simple kinds of derived data such as return, volume increase, etc., should also be considered. Finally, the state of a stock in this paper is defined as a vector with a fixed number of real-valued components, each of which corresponds to one of the raw daily data or the derived data enumerated above. In the rest of this paper, this vector is called the state vector and is used as the input part of the prediction task.

4.2. Reward

The output part z of most inductive learning algorithms for a general prediction task based on the time series approach has the following form:

    z(t + h) = y(t + h).    (5)

That is, they aim to predict the time series value after h, a fixed prediction time horizon, time steps. But in the task of stock prediction, representing the output (or target) as in (5) has a drawback. The price y normally varies greatly over a longer period of time, and y for different stocks may differ over several decades, making it difficult to create a valid model. Therefore the returns, defined as the relative change in price over the preceding h time steps, are often used instead [11]:

    R_h(t) = (y(t) - y(t - h)) / y(t - h)    (6)

From this, using one day for the time horizon h, the immediate reward at time step t, r_t, is defined as follows:

    r_t = (y_C(t) - y_C(t - 1)) / y_C(t - 1)    (7)

where y_C(t) is the close price of a stock at time t. With t substituted by t + h, equation (6) corresponds to the h-step truncated return with γ = 1 in the reinforcement learning framework, which is used for computing the values of states defined as in (1) in Section 2.

By the definition of the reward and of the target term in equation (3), the current estimated value of a state, that is, of the current price trend of a stock, can be computed at every time step. The estimates may result in different values according to the discounting factor γ. If γ = 0, only the immediate reward, the price rate of change of the next day, is reflected in the values, so γ close to 0 is expected to be suitable for short-term prediction. On the other hand, if γ = 1, then all the rewards after the state are treated equally regardless of their temporal distance from that state, and thus γ close to 1 seems to be suitable for long-term prediction.
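As a rough sketch of how the reward of equation (7) and a state vector could be computed from daily data, consider the following Python fragment; the particular derived features, the function names, and the two days of example prices are illustrative assumptions and do not correspond to the exact indicator set used in the experiment.

def reward(close, t):
    # Immediate reward r_t: relative change of the close price over one day (equation (7)).
    return (close[t] - close[t - 1]) / close[t - 1]

def state_vector(open_, high, low, close, volume, t):
    # A small example state vector built from raw daily data and simple derived data.
    return [
        reward(close, t),                               # daily return
        (volume[t] - volume[t - 1]) / volume[t - 1],    # volume increase
        (high[t] - low[t]) / close[t],                  # intraday range relative to close
        (close[t] - open_[t]) / open_[t],               # open-to-close change
    ]

# Two days of illustrative data: open, high, low, close prices and volume.
open_  = [1000, 1010]
high   = [1020, 1055]
low    = [ 990, 1005]
close  = [1010, 1050]
volume = [50000, 65000]
print(reward(close, 1))
print(state_vector(open_, high, low, close, volume, 1))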



Figure 3: An example of stock price changes. The data plotted in the figure:

    t    Price of A    Price of B    Daily return of A (%)    Daily return of B (%)
    1    1100          900           10.0                     -10.0
    2    1150          950           4.55                     5.56
    3    1300          1100          13.04                    15.79
    4    1200          1200          -7.69                    9.09

In stock price prediction, the intermediate prices between the price at the current time step and the one after the given time horizon of interest are also meaningful. In this sense, the discounting factor makes it possible to represent the target terms more effectively. For example, consider the price changes of stocks A and B in Figure 3. According to the value function (1), the target value of A at time step 0 is greater than that of B when γ < 1 and r_{A,t} = r_{B,t} = 0 for t > 4 are assumed, which reflects the price differences at the intermediate time steps. But the returns of both stocks calculated by (6), setting the time horizon h to 4, have the same value, 20.0.

5. FUNCTION APPROXIMATION BY NEURAL NETWORK

According to the definition of the previous section, the state space of a stock is continuous. This means that most states encountered in training examples will never have been experienced exactly before. So it is necessary to generalize from previously experienced states to ones that have never been seen. In gradient-descent methods, the approximate value function at time t, V_t, is represented as a parameterized functional form with a parameter vector, which is a column vector with a fixed number of real-valued components, θ_t = (θ_t(1), θ_t(2), ..., θ_t(n))^T (the T here denotes transpose). In the TD case, this vector is updated using each example by:

    θ_{t+1} = θ_t + α [ r_{t+1} + γ V_t(s_{t+1}) - V_t(s_t) ] ∇_{θ_t} V_t(s_t)    (8)

If the value function, V_t, is a linear function of the parameter vector, as in:

    V_t(s_t) = θ_t^T s_t    (9)

the gradient term in equation (8) is reduced to a simple form, just the state vector itself at time t:

    ∇_{θ_t} V_t(s_t) = s_t    (10)

But the linear method is not suitable for the stock price prediction problem, because any strong relationships between inputs and outputs in the stock market are likely to be highly nonlinear. A widely used nonlinear method for gradient-based function approximation in reinforcement learning is the multilayer artificial neural network trained with the error backpropagation algorithm [12][13]. This maps immediately onto equation (8), where the backpropagation process is the way of computing the gradients.

6. EXPERIMENT

In order to test the method presented, data collected from the Korean stock market, shown in Table 1, is used.

Table 1: The experiment data.

              Training Data    Test Data
    Stocks    100              50
    Period    2 years          1 year

Table 2 is the list of the indicators constituting the state vector in this experiment. Some of them are themselves conventional indicators in technical analysis and the others are primitive indicators commonly referred to in computing several complex indicators. All these input indicators are normalized between 0 and 1 based on their values between a predetermined minimum and maximum.

Training proceeds by on-line updating, in which updates are done during an episode as soon as a new target value is computed. To compute the target value for the TD update, r_{t+1} + γ V(s_{t+1}), of an input example, the weights of the net learned from the previous examples are used to approximate the term V(s_{t+1}).
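The following Python sketch (using numpy) illustrates this kind of on-line, semi-gradient TD(0) update of equation (8) with a small one-hidden-layer network in place of the linear form (9). The class and parameter names, the network size, the learning rate, and the randomly generated data are assumptions made only for illustration; they do not reproduce the 60-hidden-node configuration or the indicator set of the experiment.

import numpy as np

rng = np.random.default_rng(0)

class MLPValue:
    # One-hidden-layer network V(s) with tanh units, trained by semi-gradient TD(0).
    def __init__(self, n_inputs, n_hidden=10):
        self.W1 = rng.normal(0, 0.1, (n_hidden, n_inputs))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0, 0.1, n_hidden)
        self.b2 = 0.0

    def value(self, s):
        h = np.tanh(self.W1 @ s + self.b1)
        return self.w2 @ h + self.b2, h

    def td_update(self, s, r, s_next, gamma=0.8, alpha=0.01, terminal=False):
        v, h = self.value(s)
        v_next = 0.0 if terminal else self.value(s_next)[0]
        delta = r + gamma * v_next - v            # TD error (target minus estimate)
        # Gradients of V(s) with respect to the parameters (backpropagation for this net).
        grad_pre = self.w2 * (1.0 - h ** 2)       # dV / d(hidden pre-activation)
        self.w2 += alpha * delta * h              # theta <- theta + alpha * delta * grad V
        self.b2 += alpha * delta * 1.0
        self.W1 += alpha * delta * np.outer(grad_pre, s)
        self.b1 += alpha * delta * grad_pre
        return delta

# On-line training over one illustrative episode of random state vectors and rewards.
net = MLPValue(n_inputs=4)
states = rng.uniform(0, 1, (100, 4))              # indicators normalized to [0, 1]
rewards = rng.normal(0, 0.01, 100)                # daily relative price changes
for t in range(99):
    net.td_update(states[t], rewards[t], states[t + 1], terminal=(t == 98))
print(net.value(states[0])[0])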



Table 2: The indicators of the state vector (the legible entries): 20-day moving average, 20-day disparity, 60-day moving average, 60-day disparity.

Though the stock price change is a continuous task, in this experiment the total period, 2 years, of subsequent price changes of each stock is regarded as an episode. Therefore the training is performed with 100 episodes, and each episode corresponds to a stock included in Table 1.

Figure 4 shows the result of the test experiment after training for 3000 epochs with 60 hidden nodes, using γ = 0.8. The predicted values for the test input patterns are classified into 16 distinct grades, each of which corresponds to a predefined range of the value. For this, a heuristic function which determines the upper and lower limits of each range is used. This function considers the maximum and minimum of the actual returns, over a period, found in the training examples. The average actual returns of all the examples that are predicted to belong to each grade are plotted. R(1), R(5), R(10), and R(20) are the average returns truncated after 1, 5, 10 and 20 days respectively and are computed from the n-step truncated return r(n) of each example:

    r(n) = r_{t+1} + γ r_{t+2} + ... + γ^{n-1} r_{t+n}

with γ = 0.8. According to this result, as expected, higher grades result in higher returns overall, and vice versa. But as shown in Figure 5, the majority of the test samples cannot be predicted to have extremely positive or negative values.

Figure 4: The experiment result.

Figure 5: The grade distribution of the test examples.

Table 3 shows the average root mean-squared (RMS) errors between the predicted grades and the grades of the actual returns. Here the grade of each return is also computed by the same function mentioned above. The performance in terms of R(1) and R(20) was not as satisfactory as that of the others, showing that making the period of prediction extremely shorter or longer results in a decay of performance.

Table 3: RMS errors between value grades and return grades.


7. CONCLUSION

This paper presents an approach for applying reinforcement learning to the problem of stock market prediction. An alternative way of representing target values for the patterns of stock price changes is also presented. The experimental result shows that the proposed method might be utilized as a more useful indicator for stock trading. The state values calculated by the method are different from the conventional indicators in that they are intended to represent directly the expected cumulative reward, the price rate of change, rather than auxiliary information for making buying or selling signals. The proposed approach provides a basis for applying reinforcement learning to stock market prediction, and it might be possible to extend it by applying other reinforcement learning algorithms such as TD(λ), instead of TD(0), where the definition of the reward is slightly different from that of this paper.

REFERENCES

[1] S.M. Kendall and K. Ord, Time Series, Oxford University Press, New York, 1997.

[2] R.J. Kuo, "A decision support system for the stock market through integration of fuzzy neural networks and fuzzy Delphi," Applied Artificial Intelligence, 6:501-520, 1998.

[3] N. Baba and M. Kozaki, "An intelligent forecasting system of stock price using neural networks," Proc. IJCNN, Baltimore, Maryland, 652-657, 1992.

[4] S. Cheng, A Neural Network Approach for Forecasting and Analyzing the Price-Volume Relationship in the Taiwan Stock Market, Master's thesis, National Jow-Tung University, 1994.

[5] R.S. Sutton and A.G. Barto, Reinforcement Learning, The MIT Press, 1998.

[6] R.S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, 3:9-44, 1988.

[7] P.J. Werbos, "Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research," IEEE Transactions on Systems, Man, and Cybernetics, 17:7-20, 1987.

[8] T.M. Mitchell, Machine Learning, The McGraw-Hill Companies, Inc., 1997.

[9] C.J. Watkins, Learning from Delayed Rewards, Ph.D. thesis, Cambridge University, 1989.

[10] M.H. Kalos and P.A. Whitlock, Monte Carlo Methods, Wiley, New York, 1998.

[11] T. Hellstroem, A Random Walk through the Stock Market, Ph.D. thesis, University of Umea, Sweden, 1998.

[12] C.W. Anderson, Learning and Problem Solving with Multilayer Connectionist Systems, Ph.D. thesis, University of Massachusetts, Amherst, 1986.

[13] S.E. Hampson, Connectionist Problem Solving: Computational Aspects of Biological Learning, Birkhauser, Boston, 1989.

