L35-ReinforcementLearning 2
Shyamanta M Hazarika
Mechanical Engineering
Indian Institute of Technology Guwahati
[email protected]
http://www.iitg.ac.in/s.m.hazarika/
Reinforcement Learning
In reinforcement learning, the information available for training is intermediate between supervised and unsupervised learning. Instead of training examples that indicate the correct output for a given input, the training data are assumed to provide only an indication as to whether an action is correct or not.
Reinforcement learning: a type of machine learning where the data are in the form of sequences of actions, observations, and rewards, and the learner learns how to take actions to interact in a specific environment so as to maximise the specified rewards.
Image Source: Data Demystified - Machine Learning; http://towardsdatascience.com
Reinforcement Learning
More general than supervised or unsupervised learning. Learn from interaction with the environment to achieve a goal.
Reinforcement Learning: Concerned with the problem of finding suitable actions to take in a given situation in order to maximize a reward.
In the standard reinforcement learning model an agent interacts with its environment.
[Diagram: the Agent receives state st and reward rt from the Environment, takes action at, and the Environment returns the next state st+1.]
This interaction takes the form of the agent sensing the environment, and based on this sensory input choosing an action to perform in the environment.
Components of a RL Agent
[Diagram: the Agent receives state st and reward rt from the Environment, takes action at, and the Environment returns the next state st+1.]
Elements of RL Problem
1. The Environment
o Every RL system learns a mapping from situations to actions
by trial-and-error interactions with a dynamic environment.
o This environment must at least be partially observable; the
observations may come in the form of sensor readings,
symbolic descriptions, or possibly “mental” situations.
o If the RL system can observe perfectly all the information in
the environment that might influence the choice of action to
perform, then the RL system chooses actions based on true
“states” of the environment.
o This ideal case is the best possible basis for reinforcement
learning and, in fact, is a necessary condition for much of
the associated theory.
Elements of RL Problem
2. The Reinforcement Function
RL systems learn a mapping from situations to actions by trial-and-error interactions with a dynamic environment. After each action, the environment returns a scalar reinforcement (reward); the reinforcement function defines the agent's goal by specifying this reward for each state or state transition, as the examples that follow illustrate.
Elements of RL Problem
3. The Value Function
Having the environment and the reinforcement function defined, the question now is how the agent learns to choose "good" actions.
Example of a pure delayed reward reinforcement function: Standard cart-pole or inverted pendulum problem. A cart supporting a
hinged, inverted pendulum is placed on a finite track. The goal of the RL agent is to learn to balance the pendulum in an upright
position without hitting the end of the track. The situation (state) is the dynamic state of the cart pole system. Two actions are
available to the agent in each state: move the cart left, or move the cart right. The reinforcement function is zero everywhere except
for the states in which the pole falls or the cart hits the end of the track, in which case the agent receives a -1 reinforcement.
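A minimal sketch of this pure delayed reward function in Python; the track and angle limits below are illustrative assumptions, not values given in the slides:

```python
# Sketch of the cart-pole "pure delayed reward" reinforcement function:
# 0 everywhere except failure states, which return -1.
TRACK_LIMIT = 2.4    # metres from centre before the cart hits the track end (assumed)
ANGLE_LIMIT = 0.21   # radians (about 12 degrees) before the pole counts as fallen (assumed)

def cart_pole_reward(cart_position, pole_angle):
    """Return -1 if the pole has fallen or the cart left the track, else 0."""
    failed = abs(cart_position) > TRACK_LIMIT or abs(pole_angle) > ANGLE_LIMIT
    return -1.0 if failed else 0.0
```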
An example is one commonly known as the "Car on the hill" problem: a stationary car is positioned between two steep inclines. The goal of the RL agent is to successfully drive up the incline on the right to reach a goal state at the top of the hill. The state of the environment is the car's position and velocity. Three actions are available: forward thrust, backward thrust, or no thrust at all. The agent must learn to use momentum to gain enough velocity to climb the hill. The reinforcement function is -1 for ALL state transitions except the transition to the goal state, in which case a zero reinforcement is returned.
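A matching sketch of this "minimum time to goal" style reinforcement function; the goal position below is a hypothetical placeholder:

```python
# Sketch of the car-on-the-hill reinforcement function: -1 on every
# transition except the one that reaches the goal state, which returns 0.
GOAL_POSITION = 0.5   # hypothetical position of the hilltop goal state

def car_on_hill_reward(next_position):
    """Reward for the transition that produced next_position."""
    return 0.0 if next_position >= GOAL_POSITION else -1.0
```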
3. Games
o An alternative reinforcement function would be used in the context of a game environment: two or more players with opposing goals.
o The RL system can learn to generate optimal behavior for the players by finding the minimax, or saddle point, of the reinforcement function.
o The agent would evaluate the state for each player and would choose an action independent of the other player's action.
o Because actions are chosen independently and executed simultaneously, the RL agent learns to choose actions for each player that would generate the best outcome for the given player in a "worst case" scenario.
Reinforcement Learning
o In our discussion of Markov Decision Problems, we
assumed that we knew the agent’s reward function, R,
and a model of how the world works, expressed as the
transition probability distribution.
o In reinforcement learning, we would like an agent to learn to behave well in an MDP world, but without knowing anything about the reward function or the transition probability distribution when it starts out.
Parameter Estimation: you can estimate the next-state distribution P(s'|s, a) by counting the number of times the agent has taken action a in state s and looking at the proportion of the time that s' has been the next state. Similarly, you can estimate R(s) just by averaging all the rewards you've received when you were in state s.
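A minimal sketch of this counting-based parameter estimation, assuming finite state and action sets; the dictionary layout and function names are illustrative:

```python
# Estimate P(s'|s,a) and R(s) from experience by counting.
from collections import defaultdict

transition_counts = defaultdict(int)   # (s, a, s') -> number of times observed
visit_counts = defaultdict(int)        # (s, a)     -> number of times a taken in s
reward_sums = defaultdict(float)       # s -> total reward received in s
reward_visits = defaultdict(int)       # s -> number of visits to s

def record(s, a, r, s_next):
    """Record one piece of experience (s, a, r, s')."""
    transition_counts[(s, a, s_next)] += 1
    visit_counts[(s, a)] += 1
    reward_sums[s] += r
    reward_visits[s] += 1

def estimate_P(s, a, s_next):
    """Estimated P(s'|s,a): proportion of times action a in state s led to s'."""
    n = visit_counts[(s, a)]
    return transition_counts[(s, a, s_next)] / n if n else 0.0

def estimate_R(s):
    """Estimated R(s): average of all rewards received in state s."""
    n = reward_visits[s]
    return reward_sums[s] / n if n else 0.0
```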
Q-Learning
o Deterministic Markov decision process - the state
transitions are deterministic
n An action performed in state xt always transitions to the same successor state xt+1.
n In a nondeterministic Markov decision process, a probability
distribution function defines a set of potential successor
states for a given action in a given state.
o If the MDP is non-deterministic, then value iteration requires that we find the action that returns the maximum expected value:
n the sum of the reinforcement and the expectation (an integral, or a sum for discrete states) over all possible successor states for the given action, as written out below.
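In symbols, for discrete successor states and a discount factor γ (the discount factor is an assumption here; the slides do not state it explicitly), this backup can be written as:

\[ V(x_t) \leftarrow \max_{a} \Big[ r(x_t, a) + \gamma \sum_{x_{t+1}} P(x_{t+1} \mid x_t, a)\, V(x_{t+1}) \Big] \]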
Q-Learning
o Theoretically, value iteration is possible in the context of
non-deterministic MDPs.
n It is computationally impossible to calculate the necessary integrals without added knowledge or some degree of modification.
o Q-learning solves the problem of having to take the max
over a set of integrals.
o Rather than finding a mapping from states to state
values (as in value iteration), Q-learning finds a mapping
from state/action pairs to values (called Q-values).
n Q-value is the sum of the reinforcements received when
performing the associated action and then following the given
policy thereafter.
Q-Function
o Q*(s,a) is the expected discounted future reward for starting in state s, taking action a, and continuing optimally thereafter.
Assuming we have some way of choosing actions, we now focus on finding a way to estimate the value function directly.
o If you know Q*, then it's really easy to compute the optimal action in a state:
n take the action that gives the largest Q value in that state.
o When using V*, computing the optimal action required knowing the transition probabilities, so this is considerably simpler.
o And it will be effective when the model is not explicitly known.
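A minimal sketch of this greedy action choice from a tabular Q function; the dict-based Q table and the actions argument are assumptions for illustration:

```python
# Pick the action with the largest Q value in the given state.
def greedy_action(Q, state, actions):
    """Q is assumed to be a dict mapping (state, action) -> value."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```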
Q-Learning
o Q-learning estimates the Q* function directly, without estimating the transition probabilities.
n Once we have Q*, finding the best way to behave is easy.
Q-Learning
o A piece of experience in the world is (s,a,r,s′).
As in value iteration, we begin by initializing the Q function arbitrarily. Zero is usually a reasonable starting point.
n Initialize Q(s,a) arbitrarily.
n After each experience, update Q. The basic form of the update is
Q(s,a) ← (1 - α) Q(s,a) + α [ r + γ max_a′ Q(s′,a′) ]
The parameter α is a learning rate; usually it's something like 0.1. So we're updating our estimate of Q(s,a) to be mostly like our old value of Q(s,a), but adding in a new term that depends on r and the new state.
The quantity r + γ max_a′ Q(s′,a′) is a sample of the value of taking action a in state s. The actual reward, r, is a sample of the expected reward R(s). The actual next state, s′, is a sample from the next-state distribution. And the value of that state s′ is the value of the best action we can take in it, which is the max over a′ of Q(s′,a′).
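A minimal sketch of this tabular update in Python, assuming finite states and actions stored in a dict; the discount factor value is an assumption:

```python
# One step of tabular Q-learning for an experience (s, a, r, s').
from collections import defaultdict

ALPHA = 0.1   # learning rate, as suggested in the slides
GAMMA = 0.9   # discount factor (assumed value; not given in the slides)

Q = defaultdict(float)   # (state, action) -> Q value, initialized to zero

def q_update(s, a, r, s_next, actions):
    """Move Q(s,a) toward the sampled target r + GAMMA * max_a' Q(s',a')."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)
```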
Q-Learning
o Requires that the states and actions be drawn from a small enough set that we can store the Q function in a table.
n Large or even continuous state spaces make the direct representation approach impossible.
n We can try to use a function approximator, such as a neural network, to store the Q function, rather than a table.
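As one small illustration of the idea (using a linear function approximator over hand-built features rather than a neural network; the feature map, action set, and step size below are all assumptions):

```python
# Approximate Q(s,a) as a dot product of per-action weights with features phi(s),
# and update the weights toward the same sampled target as tabular Q-learning.
import numpy as np

GAMMA = 0.9        # discount factor (assumed)
STEP = 0.01        # gradient step size (assumed)
N_FEATURES = 4     # assumed feature dimension
ACTIONS = (0, 1)   # assumed discrete action set

weights = {a: np.zeros(N_FEATURES) for a in ACTIONS}

def phi(state):
    """Hypothetical feature map; here the state is already a length-4 vector."""
    return np.asarray(state, dtype=float)

def q_value(state, action):
    return float(weights[action] @ phi(state))

def q_update(s, a, r, s_next):
    """One gradient-style Q-learning update of the weights for action a."""
    target = r + GAMMA * max(q_value(s_next, a2) for a2 in ACTIONS)
    error = target - q_value(s, a)
    weights[a] += STEP * error * phi(s)
```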
Reinforcement Learning
o Reinforcement learning is appealing because of its
generality.
n Any problem domain that can be cast as a Markov decision
process can potentially benefit from this technique
o Reinforcement learning is a promising technology, but refinements will have to be made before it has truly widespread application.
n Reinforcement learning is an extension of classical dynamic programming in that it greatly enlarges the set of problems that can practically be solved.
n By combining dynamic programming with neural networks, many are optimistic about solving a large class of problems.