Stock Price Prediction Using Reinforcement Learning
Figure: The agent-environment interaction (state s_t, action a_t) and the policy evaluation/improvement cycle.
At each discrete time step t, the agent senses the current state s_t, chooses a current action a_t, and performs it. The environment responds by giving the agent a reward r_{t+1} = r(s_t, a_t) and by producing the succeeding state s_{t+1} = δ(s_t, a_t). Here the functions δ and r are part of the environment and are not necessarily known to the agent. In an MDP (Markov decision process), the functions δ(s_t, a_t) and r(s_t, a_t) depend only on the current state and action, not on earlier states or actions. The task of the agent is to learn a policy, π : S → A, where S is the set of states and A is the set of actions, for selecting its next action based on the current observed state s_t; that is, π(s_t) = a_t. An optimal policy is a policy that maximizes the possible cumulative reward from a state, called its value, V^π(s), for all states. A typical definition is (1):

$$V^{\pi}(s_t) = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \qquad (1)$$

Here γ (0 ≤ γ < 1) is a constant that determines the relative value of delayed versus immediate rewards. V^π(s) means the possible cumulative reward achieved by following an arbitrary policy π from an arbitrary initial state. Then an optimal policy, π*, is defined as follows:

$$\pi^{*} \equiv \arg\max_{\pi} V^{\pi}(s), \quad \forall s \qquad (2)$$

3. TD ALGORITHM

Various reinforcement learning algorithms (e.g. dynamic programming, temporal-difference, Monte Carlo, etc.) can be classified in view of GPI, and the differences among the algorithms lie primarily in their approaches to the prediction problem. Like Monte Carlo algorithms, TD algorithms can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD algorithms are bootstrapping algorithms, that is, they base their update in part on an existing estimate. In short, TD algorithms combine the sampling of Monte Carlo algorithms with the bootstrapping of DP algorithms.

TD algorithms have an advantage over DP algorithms in that they do not require a model of the environment, of its reward and next-state probability distributions [9]. TD algorithms differ from Monte Carlo algorithms in that they are suitable for continuing (not episodic) tasks, such as stock trading, or for tasks with very long episodes [10]. At time step t+1 they immediately form a target and make a useful update using the observed reward r_{t+1} and the estimate V(s_{t+1}). The simplest TD algorithm, known as TD(0), is

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right] \qquad (3)$$

In effect, the target for the TD update is r_{t+1} + γV(s_{t+1}). TD(0) in complete procedural form is as follows:
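A minimal Python sketch of this tabular TD(0) procedure is given below. The episodic environment interface (env.reset(), env.step(a)) and the fixed policy passed in by the caller are illustrative assumptions, not part of the original text.

```python
from collections import defaultdict

def td0(env, policy, alpha=0.1, gamma=0.8, episodes=100):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V = defaultdict(float)                    # value estimates, initialized to zero
    for _ in range(episodes):
        s = env.reset()                       # initial state of the episode (assumed interface)
        done = False
        while not done:
            a = policy(s)                     # action chosen by the policy being evaluated
            s_next, r, done = env.step(a)     # observed reward r_{t+1} and next state s_{t+1}
            target = r + gamma * (0.0 if done else V[s_next])  # TD target
            V[s] += alpha * (target - V[s])   # TD(0) update, equation (3)
            s = s_next
    return V
```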
y_O : The price of the first trade of the day (the open price).

y_H : Highest traded price during the day.

y_L : Lowest traded price during the day.

y_C : Last price that the security traded at during the day (the close price).

v : The number of shares (or contracts) that were traded during the day.

Most market experts refer to the conventional indicators of the technical analysis area. Therefore various kinds of technical indicators, as well as simple kinds of derived data such as return, volume increase, etc., should also be considered. Finally, the state of a stock in this paper is defined as a vector with a fixed number of real-valued components.

The reward at each step is defined in terms of the close price, where y_C(t) is the close price of a stock at time t. With t substituted by t+h, equation (6) corresponds to the h-step truncated return with γ = 1 in the reinforcement learning framework, which is used for computing the values of states defined as (1) in section 2.

By the definition of the reward and the target term in equation (3), the current estimated value of a state, that is, the current price trend of a stock, can be computed at every time step. The estimates may result in different values according to the discounting factor γ. If γ = 0, only the immediate reward, the price rate of change of the next day, is reflected in the values. So γ close to 0 is expected to be suitable for short-term prediction. On the other hand, if γ = 1, then all the rewards after a state are treated equally regardless of their temporal distance from that state, and thus γ close to 1 seems to be suitable for long-term prediction.
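As a small illustration of how γ trades off short-term against long-term price movements, the sketch below computes daily price-rate-of-change rewards from a series of close prices and their discounted sum. The percentage form of the rate of change and the example prices are assumptions made only for illustration.

```python
def daily_rewards(close):
    """Immediate rewards: the price rate of change from day t to day t+1 (assumed percentage form)."""
    return [100.0 * (close[t + 1] - close[t]) / close[t] for t in range(len(close) - 1)]

def discounted_value(rewards, gamma, horizon=None):
    """Discounted sum of rewards as in equation (1); horizon=h with gamma=1 gives an h-step truncated return."""
    horizon = len(rewards) if horizon is None else min(horizon, len(rewards))
    return sum((gamma ** k) * rewards[k] for k in range(horizon))

close = [1000, 1020, 990, 1050, 1200]             # hypothetical close prices y_C(t)
r = daily_rewards(close)
print(discounted_value(r, gamma=0.1))             # gamma near 0: dominated by the next day's change
print(discounted_value(r, gamma=0.9))             # gamma near 1: later changes matter almost equally
print(discounted_value(r, gamma=1.0, horizon=4))  # 4-step truncated return (gamma = 1)
```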
Figure 3: An example of stock price changes (stocks A and B over time steps 0 to 4).

In stock price prediction, the intermediate prices between the price at the current time step and the one after the given time horizon of interest are also meaningful. In this sense, the discounting factor makes it possible to represent the target terms more effectively. For example, consider the price changes of stocks A and B in Figure 3. According to the value function (1), the target value of A at time step 0 is greater than that of B, assuming γ < 1 and r_{A,t} = r_{B,t} = 0 for t > 4, which reflects the price differences at the intermediate time steps. But the returns of both stocks calculated by (6), setting the time horizon h to 4, have the same value, 20.0.

5. FUNCTION APPROXIMATION BY NEURAL NETWORK

According to the definition of the previous section, the state space of a stock is continuous. This means that most states encountered in training examples will never have been experienced exactly before. So it is necessary to generalize from previously experienced states to ones that have never been seen. In gradient-descent methods, the approximate value function at time t, V_t, is represented as a parameterized functional form with the parameter vector, which is a column vector with a fixed number of real-valued components, θ_t = (θ_t(1), θ_t(2), ..., θ_t(n))^T (the T here denotes transpose). In the TD case, this vector is updated using each example by:

$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \left[ r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t) \right] \nabla_{\vec{\theta}_t} V_t(s_t) \qquad (8)$$

In the linear case, the gradient term in equation (8) is reduced to a simple form, just the state vector itself at time t:

$$\nabla_{\vec{\theta}_t} V_t(s_t) = \vec{s}_t \qquad (10)$$

But the linear method is not suitable for the stock price prediction problem, because any strong relationships between inputs and outputs in the stock market are likely to be highly nonlinear. A widely used nonlinear method for gradient-based function approximation in reinforcement learning is the multilayer artificial neural network using the error backpropagation algorithm [12][13]. This maps immediately onto equation (8), where the backpropagation process is the way of computing the gradients.

6. EXPERIMENT

In order to experiment with the presented method, data collected from the Korean stock market, shown in Table 1, is used.

Table 1: The experiment data.

            Training Data    Test Data
  Stocks    100              50
  Period    2 years          1 year

Table 2 is the list of the indicators constituting the state vector in this experiment. Some of them are themselves conventional indicators in technical analysis, and the others are primitive indicators referred to commonly in computing several complex indicators. All these input indicators are normalized between 0 and 1 based on their values between a predetermined minimum and maximum.

Training proceeds by on-line updating, in which updates are done during an episode as soon as a new target value is computed. To compute the target value for the TD update, r_{t+1} + γV(s_{t+1}), of an input example, the weights of the net learned from the previous examples are used to approximate the term V(s_{t+1}).
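A sketch of one such on-line update for a small value network is given below. The one-hidden-layer architecture, tanh activation, and step size are assumptions for illustration; the gradient of V(s) is computed by backpropagation and plugged into an update of the form of equation (8).

```python
import numpy as np

class ValueNet:
    """A small one-hidden-layer network approximating V(s); sizes and activation are illustrative."""
    def __init__(self, n_inputs, n_hidden=60, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_inputs))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, n_hidden)
        self.b2 = 0.0

    def value(self, s):
        h = np.tanh(self.W1 @ s + self.b1)      # hidden activations
        return self.W2 @ h + self.b2, h         # V(s) and cached hidden layer

    def td_update(self, s, r, s_next, alpha=0.01, gamma=0.8, terminal=False):
        """Semi-gradient TD(0): move V(s) toward the target r + gamma * V(s')."""
        v_next = 0.0 if terminal else self.value(s_next)[0]
        v, h = self.value(s)
        delta = r + gamma * v_next - v          # TD error; target uses the current weights
        # Gradients of V(s) with respect to the parameters (backpropagation for this small net).
        grad_pre = self.W2 * (1.0 - h ** 2)     # dV / d(pre-activation of hidden layer)
        # Parameter update: theta <- theta + alpha * delta * grad V(s)   (cf. equation (8))
        self.W2 += alpha * delta * h
        self.b2 += alpha * delta * 1.0
        self.W1 += alpha * delta * np.outer(grad_pre, s)
        self.b1 += alpha * delta * grad_pre
        return delta
```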
Table 2: The indicators constituting the state vector.

    ...
    20-day moving average    20-day disparity
    60-day moving average    60-day disparity
    ...

Figure 5: The grade distribution of the test examples (grades 1 to 16 on the horizontal axis).

Though the stock price change is a continuous task, in this experiment the total period, 2 years, of subsequent price changes of each stock is regarded as an episode. Therefore the training is performed with 100 episodes, and each episode corresponds to a stock included in Table 1.
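Under this episode definition, the on-line training loop can be sketched as follows, reusing the td_update method from the previous sketch. Here build_state is a hypothetical helper that assembles the normalized indicator vector for day t, and the reward form is the assumed daily rate of change.

```python
def train(net, stocks, epochs=3000, alpha=0.01, gamma=0.8):
    """On-line TD training: each stock's 2-year price history is treated as one episode."""
    for _ in range(epochs):
        for close, indicators in stocks:                 # 100 training stocks = 100 episodes
            for t in range(len(close) - 1):
                s = build_state(indicators, t)           # hypothetical helper: normalized state vector s_t
                s_next = build_state(indicators, t + 1)
                r = 100.0 * (close[t + 1] - close[t]) / close[t]   # assumed reward: next day's rate of change
                terminal = (t == len(close) - 2)         # last transition of the episode
                net.td_update(s, r, s_next, alpha=alpha, gamma=gamma, terminal=terminal)
    return net
```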
Figure 4 shows the result of the test experiment after training for 3000 epochs with 60 hidden nodes, using γ = 0.8. The predicted values for the test input patterns are classified into 16 distinct grades, each of which corresponds to a predefined range of the value. For this, a heuristic function which determines the upper and lower limits of each range is used. This function considers the maximum and minimum of actual returns, over a period, found in the training examples. The average actual returns of all the examples that are predicted to belong to each grade are plotted. R(1), R(5), R(10), and R(20) are the average returns truncated after 1, 5, 10 and 20 days respectively, and are computed from the n-step truncated return r(n) of each example.

Table 3 shows the average root mean-squared (RMS) errors between the predicted grades and the grades of actual returns. Here the grade of each return is also computed by the same function mentioned above. The performance in terms of R(1) and R(20) was not as satisfactory as that of the others, showing that making the period of prediction extremely shorter or longer results in decay of performance.

Table 3: RMS errors between value grades and return grades.
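The grade mapping and the RMS comparison can be sketched as follows. The equal-width ranges between the minimum and maximum training return are an assumed stand-in for the heuristic limit function described above, which is not specified in detail.

```python
import numpy as np

def to_grade(value, r_min, r_max, n_grades=16):
    """Map a value to one of n_grades ranges between r_min and r_max.
    Equal-width ranges are an illustrative assumption."""
    width = (r_max - r_min) / n_grades
    g = int((value - r_min) // width) + 1
    return min(max(g, 1), n_grades)              # clip to grades 1..16

def rms_grade_error(predicted_values, actual_returns, r_min, r_max):
    """RMS error between grades of predicted values and grades of actual returns."""
    diffs = [to_grade(v, r_min, r_max) - to_grade(r, r_min, r_max)
             for v, r in zip(predicted_values, actual_returns)]
    return float(np.sqrt(np.mean(np.square(diffs))))
```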