Deep Reinforcement Learning For Algorithmic Trading
AGENT
The agent is an MLP (Multi-Layer Perceptron) multi-class classifier neural network that takes two inputs from the environment, the prices of A and B, and chooses among three actions: (0) Long A, Short B; (1) Short A, Long B; (2) Do nothing, with the objective of maximizing the overall reward at every step. After every action, it receives the next observation (state) and the reward associated with its previous action. Since the environment is stochastic in nature, the agent operates within an MDP (Markov Decision Process), i.e. the next action depends only on the current state and not on the history of prices/states/actions, and it discounts future rewards by a discount factor (gamma). The score is calculated at every step and saved in the agent's memory along with the action, the current state and the next state. The cumulative reward per episode is the sum of all the individual scores over the lifetime of an episode, and it is ultimately what judges the performance of the agent over its training. The complete workflow diagram is shown below:
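In code, a minimal sketch of this interaction loop might look like the following; the `environment` and `agent` objects, their method names, and the epsilon-greedy exploration details are hypothetical stand-ins, not the actual implementation:

```python
import random
import numpy as np

# Hypothetical interaction loop: the agent observes the prices of A and B,
# picks one of the three actions, receives a reward, and stores the
# transition (state, action, reward, next_state) in its memory.
ACTIONS = [0, 1, 2]   # 0: long A / short B, 1: short A / long B, 2: do nothing
GAMMA = 0.95          # discount factor for future rewards
EPSILON = 0.1         # exploration rate

def run_episode(environment, agent, n_steps=500):
    state = environment.reset()              # state = (price_A, price_B)
    episode_reward = 0.0
    for _ in range(n_steps):
        # Explore with probability epsilon, otherwise exploit the learned Q-values.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = int(np.argmax(agent.predict(state)))
        next_state, reward = environment.step(action)
        agent.remember((state, action, reward, next_state))
        episode_reward += reward             # cumulative reward per episode
        state = next_state
    return episode_reward
```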
Why should this approach even work? Because the spread of the two co-integrated processes is stationary, i.e. it has a constant mean and variance over time, and can be thought of as (approximately) normally distributed. The agent can exploit this statistical behavior by buying and selling A and B simultaneously based on their price spread (= Price_A - Price_B). For example, if the spread is negative, A is cheap and B is expensive, so the agent will figure out that the best action is to go long A and short B to attain the higher reward. The agent approximates this through the Q(s, a) function, where 's' is the state and 'a' is the optimal action associated with that state to maximize its returns over the lifetime of the episode. The policy for the next action is determined using the Bellman equation, as described below:
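In its standard Q-learning form, with learning rate α and discount factor γ, this Bellman update can be written as:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where r is the immediate reward and s' is the next state; the max term captures the best value achievable from the following state.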
Through this mechanism, the agent also weighs long-term prospects rather than just immediate rewards by assigning a different Q value to each action. This is the crux of Reinforcement Learning. Since the input space can be massively large, we will use a Deep Neural Network to approximate the Q(s, a) function through backpropagation. Over many iterations, the Q(s, a) function converges towards the optimal action for every state the agent has explored.
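As a rough sketch of how that approximation could be trained, the following hypothetical experience-replay step samples past transitions from the memory, builds Bellman targets, and fits the network by backpropagation; the `brain` object is assumed to expose Keras-style `predict` and `fit` methods, and the stored tuples are simplified here to (state, action, reward, next_state):

```python
import random
import numpy as np

GAMMA = 0.95        # discount factor
BATCH_SIZE = 32     # assumed minibatch size

def replay(memory, brain):
    """Sample past experiences and nudge Q(s, a) towards the Bellman target."""
    batch = random.sample(list(memory), min(BATCH_SIZE, len(memory)))
    states = np.array([s for s, a, r, s_next in batch])
    next_states = np.array([s_next for s, a, r, s_next in batch])

    q_current = brain.predict(states, verbose=0)    # current Q(s, .) estimates
    q_next = brain.predict(next_states, verbose=0)  # Q(s', .) for the targets

    for i, (state, action, reward, next_state) in enumerate(batch):
        # Bellman target: immediate reward plus discounted best future value.
        q_current[i][action] = reward + GAMMA * np.max(q_next[i])

    # One gradient step via backpropagation towards the updated targets.
    brain.fit(states, q_current, epochs=1, verbose=0)
```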
Speaking of the internal details, the agent has two major components:
Memory: It is a list of events in which the agent stores information through iterations of exploration and exploitation. Each entry has the format (state, action, reward, next_state, message); a minimal sketch is given right after this list.
Brain: This is the fully connected, feed-forward neural net which trains on the memory, i.e. past experiences. Given the current state as input, it predicts the next optimal action.
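A minimal way to realize such a memory is a bounded deque; the capacity below and the empty default message are illustrative assumptions:

```python
from collections import deque

class Memory:
    """Stores (state, action, reward, next_state, message) events for replay."""
    def __init__(self, capacity=10_000):          # assumed capacity
        self.events = deque(maxlen=capacity)      # oldest events are dropped first

    def remember(self, state, action, reward, next_state, message=""):
        self.events.append((state, action, reward, next_state, message))

    def __len__(self):
        return len(self.events)
```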
To train the agent, we need to build our neural network, which will learn to classify actions based on the inputs it receives. (A simplified diagram is shown below; of course, the real neural net will be more complicated than this.)
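As a sketch, such a "brain" could be a small Keras MLP mapping the two price inputs to one Q-value per action; the layer sizes, activations and optimizer below are illustrative choices, not the exact architecture behind the diagram:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def build_brain(state_size=2, action_size=3):
    """Feed-forward net: (price_A, price_B) -> Q-values for the three actions."""
    model = Sequential([
        Dense(24, activation="relu", input_shape=(state_size,)),
        Dense(24, activation="relu"),
        Dense(action_size, activation="linear"),  # one Q-value per action
    ])
    model.compile(optimizer="adam", loss="mse")   # MSE against the Bellman targets
    return model
```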
In the above graph, you can see three different plots, each representing an entire training run of 500 episodes with 500 steps per episode. At every step, the agent performs an action and receives its reward. As you can see, in the beginning the agent has no preconception of the consequences of its actions, so it takes randomized actions and observes the rewards associated with them. Hence the cumulative reward per episode fluctuates a lot between episodes 0 and 300; beyond episode 300, however, the agent starts learning from its training, and by the 400th episode it has almost converged in each of the training scenarios, as it discovers the long-short pattern and starts to fully exploit it.
There are still many challenges here, and engineering both the agent and the environment remains ongoing research. My aim was not to show a 'backtested profitable trading strategy' but to describe how to apply advanced Machine Learning concepts such as Deep Q-Learning and Neural Networks to the field of Algorithmic Trading. It is an extremely complicated process and hard to explain in a single blog post, but I have tried my best to simplify things. I am going to put a much more detailed analysis and code on GitHub, so please watch this space if you are interested.
Furthermore, this approach can be extended to a large portfolio of stocks and bonds, and the agent can be trained under a diverse range of stochastic environments. Additionally, the agent's behavior can be constrained by various risk parameters such as sizing, hedging, etc. One can also have multiple agents training under different suitability criteria for the desired risk/return profiles. These approximations can be made more accurate using larger data sets and distributed computing power.
Eventually, the question is: can AI do everything? Probably not. Can we effectively train it to do anything? Possibly yes; with real intelligence, artificial intelligence can surely thrive.
Thanks for reading. Please feel free to share your ideas.
Hope you enjoyed the post!
DISCLAIMER:
1. Opinions expressed are solely my own and do not express the views or opinions of any of my
employers.
2. The information from the Site is based on financial models, and trading signals are generated
mathematically. All of the calculations, signals, timing systems, and forecasts are the result of
back testing, and are therefore merely hypothetical. Trading signals or forecasts used to produce
our results were derived from equations which were developed through hypothetical reasoning
based on a variety of factors. Theoretical buy and sell methods were tested against the past to
prove the profitability of those methods in the past. Performance generated through back testing
has many and possibly serious limitations. We do not claim that the historical performance,
signals or forecasts will be indicative of future results. There will be substantial and possibly
extreme differences between historical performance and future performance. Past performance
is no guarantee of future performance. There is no guarantee that out-of-sample performance
will match that of prior in-sample performance. The website does not claim or warrant that its
timing systems, signals, forecasts, opinions or analyses are consistent, logical or free from
hindsight or other bias or that the data used to generate signals in the back tests was available
to investors on the dates for which theoretical signals were generated.
Gaurav S., Software Engineer at MediaMath
30 comments
A nice article. I would add a few comments: 1. Q-learning (or RL in general) is NOT a form of semi-supervised learning. 2. I would be interested to see any empirical support for keeping the word 'Deep' in the Deep Reinforcement Learning that you use. Did you benchmark it against a simple linear (or bi-linear, in your case) set of basis functions and a simple linear architecture? 3. I am not sure I agree with this statement: "Furthermore, this approach can be extended to a large portfolio of stocks and bonds, and the agent can be trained under a diverse range of stochastic environments." I think this particular approach of Q-learning cannot be extended to a multi-asset portfolio, but its generalization, called G-learning, can do that. You can find some details on this approach in my paper on RL in a multi-asset setting here: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3174498.
Gaurav S.:
Igor, there have been known examples of linear architectures (linear regression models) used to find a constant 'fitted' spread. But as I have mentioned, "It has always been a huge challenge to pick the right data sample for that universal spread measure through regression since the mean is derived from a randomly chosen sample of historical data". The real reason to choose a NN, or Reinforcement Learning for that matter, is to treat the spread not just as a measure but rather as a function which we are trying to approximate, and it's anything but linear.
Gerardo Lemus - I believe it is always better to start with the simplest model first (a linear architecture), and THEN see what changes when you move to neural networks. Every time I violated this rule, I ended up regretting it.
Gaurav S.:
That's great. Thanks for sharing. I will review it.
Gerardo Lemus (Quantitative Finance Practitioner):
One possibility is to estimate the mean-reverting parameters from history (see https://medium.com/@gjlr2000/mean-reversion-in-finance-definitions-45242f19526f for a Python example) and then use them as parameters for the Monte Carlo simulation for the reinforcement learning training. The solution should be similar to the dynamic programming optimal solution http://folk.uio.no/kennethk/articles/art53_journal.pdf (but the advantage of reinforcement learning is that you can add complexity to the simulation, like adding seasonality as mentioned above).
Gaurav S.:
By unbiased, I mean that the data should have no relationship with the actual historical data. The important point to note here is that you are training the agent based on the market behavior, not on historical data. We would only need the long-running mean and volatility of the index (say SP500 or FTSE100), and then you can generate/simulate 5000 possible paths to train/test your agent/strategy fairly.
Gaurav S.:
Gerardo, great questions. The policy depends on the rewards, and the discount factor can be adjusted to any environment based on whether one wants a short-term view (low discount factor) or a very long-term view (very high discount factor). The reward function can include all the parameters you mentioned plus more, for example the current state of the portfolio, market volatility, etc. You are completely right in the sense that if one changes the simulation (i.e. the behavior of the environment), the agent will adjust its behavior based on the (new) reward policy.
Gaurav S.:
Thanks, Andrew. I am glad you liked it.
Gaurav S.:
Thanks for taking the time to review, and great questions. Here are my answers: 1. If the actions are based on individual stocks, then we can do the experiment on just a single stock and not a pair. The input to the model is in fact just the spread (even though I show the prices of A and B). It is necessary that if we go long on A, we simultaneously go short on B, and vice versa, to keep a market-neutral position. 2. I haven't really specified the size of a time step. It can range from a microsecond to a month based on the instrument type. Some assets are very liquid, like FX derivatives, while others don't trade for months, like municipal bonds. 3. No. This is a fairly new concept and an evolving idea. I think there is more to this before I can put it to use in real trading. 4. I believe so. It can be added as a third input (in addition to the prices) to account for the classification of actions.