
10/7/19

Fundamentals of Artificial Intelligence


Reinforcement Learning

Shyamanta M Hazarika
Mechanical Engineering
Indian Institute of Technology Guwahati
[email protected]

http://www.iitg.ac.in/s.m.hazarika/

Reinforcement Learning
In reinforcement learning, the information available for training is intermediate
between supervised and unsupervised learning. Instead of training examples
that indicate the correct output for a given input, the training data are assumed
to provide only an indication as to whether an action is correct or not.

Reinforcement learning: a type of machine learning where the data are in the
form of sequences of actions, observations, and rewards, and the learner learns
how to take actions to interact with a specific environment so as to maximise
the specified rewards.

Learn actions to maximize payoff.

[Image source: Data Demystified — Machine Learning; http://towardsdatascience.com]




Reinforcement Learning
More general than supervised or unsupervised learning. Learn from
interaction with the environment to achieve a goal.

Reinforcement Learning: Concerned with the problem of finding suitable
actions to take in a given situation in order to maximize a reward.

In the standard reinforcement learning model an agent interacts with its
environment.

[Figure: the agent-environment loop. The agent receives the state st (and, after
acting, the reward rt and next state st+1) from the environment and sends back
an action at.]

This interaction takes the form of the agent sensing the environment and, based
on this sensory input, choosing an action to perform in the environment. The
action changes the environment in some manner, and this change is communicated
to the agent through a scalar reinforcement signal.

Learning is based on the reward hypothesis: all goals can be described by the
maximization of expected cumulative reward.
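To make this loop concrete, here is a minimal Python sketch using a made-up
one-dimensional toy environment and a random policy purely for illustration;
the environment, its dynamics, and all names are assumptions, not part of the
lecture.

import random

class ToyEnvironment:
    """Illustrative environment: the agent walks on positions 0..4; position 4 is the goal."""
    def reset(self):
        self.position = 0
        return self.position                     # initial state s_0

    def step(self, action):
        # action is -1 (move left) or +1 (move right); the position is clipped to the track
        self.position = max(0, min(4, self.position + action))
        done = (self.position == 4)              # terminal state reached
        reward = 1.0 if done else 0.0            # scalar reinforcement signal r_t
        return self.position, reward, done       # next state s_{t+1}, reward r_t, done flag

def random_policy(state):
    return random.choice([-1, +1])               # choose an action a_t given the sensed state s_t

env = ToyEnvironment()
state = env.reset()
for t in range(100):                             # the sensing / acting / reward loop
    action = random_policy(state)
    state, reward, done = env.step(action)
    if done:
        break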


Components of a RL Agent

[Figure: the agent-environment loop, with the agent receiving state st, reward rt
and next state st+1 from the environment and emitting action at.]


Elements of RL Problem
1. The Environment
o Every RL system learns a mapping from situations to actions
by trial-and-error interactions with a dynamic environment.
o This environment must at least be partially observable; the
observations may come in the form of sensor readings,
symbolic descriptions, or possibly “mental” situations.
o If the RL system can observe perfectly all the information in
the environment that might influence the choice of action to
perform, then the RL system chooses actions based on true
“states” of the environment.
o This ideal case is the best possible basis for reinforcement
learning and, in fact, is a necessary condition for much of
the associated theory.
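The observability distinction above can be sketched as follows; the noisy
sensor model, the Gaussian noise level, and all names are illustrative
assumptions rather than anything from the lecture.

import random

class NoisySensorEnvironment:
    """Illustrative environment whose true state is a position on a line;
    the agent may see either the true state or only a noisy sensor reading of it."""
    def __init__(self, fully_observable=False):
        self.fully_observable = fully_observable
        self.position = 0.0

    def observe(self):
        if self.fully_observable:
            return self.position                       # the true "state" of the environment
        return self.position + random.gauss(0.0, 0.5)  # only a noisy sensor reading

    def step(self, action):
        self.position += action                        # the action changes the environment
        reward = -abs(self.position)                   # scalar reinforcement signal
        return self.observe(), reward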

Elements of RL Problem
2. The Reinforcement Function
RL systems learn a mapping from situations to actions by trial-and-error
interactions with a dynamic environment.

o The “goal” of the RL system is defined using the concept of


a reinforcement function, which is the exact function of
future reinforcements the agent seeks to maximize.
o There exists a mapping from state/action pairs to
reinforcements; after performing an action in a given state
the RL agent will receive some reinforcement (reward) in
the form of a scalar value.
o The RL agent learns to perform actions that will maximize
the sum of the reinforcements received when starting from
some initial state and proceeding to a terminal state.
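As a small illustration, the quantity the agent tries to maximize can be
computed from a recorded episode. The optional discount factor gamma is an
assumption commonly added; the slide itself speaks only of the plain sum.

def episode_return(rewards, gamma=1.0):
    """Sum of the reinforcements received from the initial state to the terminal state.
    With gamma < 1 this becomes the discounted return used later for Q-values."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# A pure delayed reward episode: all zeros, then +1 at the terminal state.
print(episode_return([0, 0, 0, 1]))         # -> 1.0
print(episode_return([0, 0, 0, 1], 0.9))    # -> 0.729 (up to floating point)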


Elements of RL Problem
3. The Value Function
Having the environment and the reinforcement function defined, the question
now is how the agent learns to choose "good" actions.

o A policy determines which action should be performed in


each state; a policy is a mapping from states to actions.
o The value of a state is defined as the sum of the
reinforcements received when starting in that state and
following some fixed policy to a terminal state.
o The optimal policy would be the mapping from states to
actions that maximizes the sum of the reinforcements. The
value of a state is dependent upon the policy.
o The value function is a mapping from states to state values
and can be approximated using any type of function
approximator (e.g., multilayered perceptron, memory based
system, radial basis functions, look-up table, etc.).
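One simple way to fill a look-up-table value function is to average, for each
visited state, the sum of reinforcements observed from that state onward under
a fixed policy. This Monte Carlo averaging is just one illustrative choice of
estimator, not something the slide prescribes.

from collections import defaultdict

def estimate_state_values(episodes):
    """episodes: list of trajectories, each a list of (state, reward) pairs
    generated by following one fixed policy to a terminal state."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for trajectory in episodes:
        rewards = [r for (_, r) in trajectory]
        for t, (state, _) in enumerate(trajectory):
            totals[state] += sum(rewards[t:])   # sum of reinforcements from this state on
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}   # the look-up-table value function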

The Reinforcement Function


o There are at least three noteworthy classes often used
to construct reinforcement functions that properly
define the desired goals.

1. Pure Delayed Reward and Avoidance Problems


o In Pure Delayed Reward class of functions the reinforcements are all
zero except at the terminal state.
o The sign of the scalar reinforcement at the terminal state indicates
whether the terminal state is a goal state (a reward) or a state that
should be avoided (a penalty).



Example of a pure delayed reward reinforcement function: the standard cart-pole
(inverted pendulum) problem. A cart supporting a hinged, inverted pendulum is
placed on a finite track. The goal of the RL agent is to learn to balance the
pendulum in an upright position without hitting the end of the track. The
situation (state) is the dynamic state of the cart-pole system. Two actions are
available to the agent in each state: move the cart left, or move the cart right.
The reinforcement function is zero everywhere except for the states in which the
pole falls or the cart hits the end of the track, in which case the agent
receives a -1 reinforcement.
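The cart-pole reinforcement function described above can be written directly;
the two failure predicates are assumed to be supplied by the simulator and are
not defined in the lecture.

def cartpole_reinforcement(pole_fallen, cart_off_track):
    """Pure delayed reward: zero everywhere, -1 when the pole falls or the cart
    hits the end of the track (the terminal states to be avoided)."""
    if pole_fallen or cart_off_track:
        return -1.0
    return 0.0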

2. Minimum Time to Goal


o Reinforcement functions in this class cause an agent to perform
actions that generate the shortest path or trajectory to a goal state.



An example is the problem commonly known as the "car on the hill." A stationary
car is positioned between two steep inclines. The goal of the RL agent is to
successfully drive up the incline on the right to reach a goal state at the top
of the hill. The state of the environment is the car's position and velocity.
Three actions are available: forward thrust, backward thrust, or no thrust at
all. The agent must learn to use momentum to gain enough velocity to climb the
hill. The reinforcement function is -1 for ALL state transitions except the
transition to the goal state, in which case a zero reinforcement is returned.
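The corresponding minimum-time-to-goal reinforcement function for the
car-on-the-hill problem is just as short; reached_goal is a hypothetical
predicate provided by the simulation.

def car_on_hill_reinforcement(reached_goal):
    """-1 for every state transition except the transition into the goal state,
    so maximizing the return is the same as minimizing the number of steps."""
    return 0.0 if reached_goal else -1.0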

3. Games
o An alternative reinforcement function would be used in the context of a
game environment: two or more players with opposing goals.
o An RL system can learn to generate optimal behavior for the players by
finding the minimax, or saddle point, of the reinforcement function.
o The agent would evaluate the state for each player and would choose an
action independently of the other players' actions.
o Because actions are chosen independently and executed simultaneously, the RL
agent learns to choose actions for each player that would generate the best
outcome for the given player in a "worst case" scenario.


Reinforcement Learning
o In our discussion of Markov Decision Problems, we
assumed that we knew the agent’s reward function, R,
and a model of how the world works, expressed as the
transition probability distribution.
o In reinforcement learning, we would like an agent to
learn to behave well in an MDP world, but without
knowing anything about reward function or the
transition probability distribution when it starts out.
Parameter Estimation -­ you can estimate the next-­state distribution P(s’|s, a) by counting the number
of times the agent has taken action a in state s and looking at the proportion of the time that s’ has
been the next state. Similarly, you can estimate R(s) just by averaging all the rewards you’ve
received when you were in state s.
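The parameter-estimation note above amounts to counting and averaging; a minimal
sketch, with data structures and names chosen purely for illustration.

from collections import defaultdict

class ModelEstimator:
    """Estimate P(s'|s,a) from transition counts and R(s) from average observed rewards."""
    def __init__(self):
        self.transition_counts = defaultdict(int)   # counts of (s, a, s')
        self.action_counts = defaultdict(int)       # counts of (s, a)
        self.reward_sums = defaultdict(float)       # total reward observed in s
        self.visit_counts = defaultdict(int)        # number of visits to s

    def record(self, s, a, r, s_next):
        self.transition_counts[(s, a, s_next)] += 1
        self.action_counts[(s, a)] += 1
        self.reward_sums[s] += r
        self.visit_counts[s] += 1

    def P(self, s_next, s, a):
        n = self.action_counts[(s, a)]
        return self.transition_counts[(s, a, s_next)] / n if n else 0.0

    def R(self, s):
        n = self.visit_counts[s]
        return self.reward_sums[s] / n if n else 0.0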

Approximating the Value Function


o Reinforcement learning is a difficult problem because
the learning system may perform an action and not be
told whether that action was good or bad.
o Initially, the approximation of the optimal value
function is poor. In other words, the mapping from
states to state values is not valid.
o The primary objective of learning is to find the correct
mapping. Once this is completed, the optimal policy
can easily be extracted.
n Q-learning - finds a mapping from state/action pairs to
values; one of the most successful approaches to RL.


Q-Learning
o Deterministic Markov decision process - the state
transitions are deterministic
n An action performed in state xt always transitions to the
same successor state xt+1 .
n In a nondeterministic Markov decision process, a probability
distribution function defines a set of potential successor
states for a given action in a given state.
o If the MDP is non-deterministic, then value iteration
requires that we find the action that returns the
maximum expected value
n The sum of the reinforcement and the integral over all
possible successor states for the given action.
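In symbols, the quantity being maximized in the non-deterministic case can be
written as below; the discount factor gamma and this particular form are a
standard reconstruction, not copied from the slide.

V(x_t) = \max_{a} \Big[ r(x_t, a) + \gamma \int P(x_{t+1} \mid x_t, a)\, V(x_{t+1})\, dx_{t+1} \Big]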

Q-Learning
o Theoretically, value iteration is possible in the context of
non-deterministic MDPs.
n Computationally impossible to calculate the necessary integrals
without added knowledge or some degree of modification.
o Q-learning solves the problem of having to take the max
over a set of integrals.
o Rather than finding a mapping from states to state
values (as in value iteration), Q-learning finds a mapping
from state/action pairs to values (called Q-values).
n Q-value is the sum of the reinforcements received when
performing the associated action and then following the given
policy thereafter.

12
10/7/19

Q-Function
o Q*(s,a) is the expected discounted future reward for
starting in state s, taking action a, and continuing
optimally thereafter. Assuming  we  have  some  way  of  choosing  actions,  now  we’re  going  
to  focus  on  finding  a  way  to  estimate  the  value  function  directly.

The Q value of being in state s and taking action a is the


immediate reward, R(s), plus the discounted expected value of the
future. We get the expected value of the future by taking an
expectation over all possible next states, s’. In each state s’, we
need to know the value of behaving optimally. We can get that by
choosing, in each s’, the action a’ that maximizes Q*(s’,a’).
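Written as an equation, the paragraph above says (a standard reconstruction of
the slide's missing formula, with discount factor gamma):

Q^*(s,a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q^*(s', a')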

o If you know Q*, then it’s really easy to compute the optimal
action in a state.
n Take the action that gives the largest Q value in that state.
o When using V*, it required knowing the transition probabilities
to compute the optimal action, so this is considerably simpler.
o And it will be effective when the model is not explicitly known.
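The comparison can be written compactly; these two lines are a standard
reconstruction rather than the slide's own notation:

\pi^*(s) = \arg\max_{a} Q^*(s,a)
\quad\text{versus}\quad
\pi^*(s) = \arg\max_{a} \Big[ R(s) + \gamma \sum_{s'} P(s' \mid s,a)\, V^*(s') \Big]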


Q-Learning
o Q learning, which estimates the Q* function directly,
without estimating the transition probabilities.
n Once we have Q*, finding the best way to behave is easy.

o The learning algorithm deals with individual “pieces” of


experience with the world.
n One piece of experience is a set of the current state
s, the chosen action a, the reward r, and the next
state s’.
n Each piece of experience will be folded into the Q
values, and then thrown away!

Q-Learning
o A piece of experience in the world is (s, a, r, s').

As in value iteration, we start by initializing the Q function arbitrarily;
zero is usually a reasonable starting point.

  - Initialize Q(s,a) arbitrarily.
  - After each experience, update Q. The basic form of the update uses a
learning rate alpha, usually something like 0.1: we update our estimate of
Q(s,a) to be mostly like our old value of Q(s,a), but add in a new term that
depends on r and the new state.

The quantity r + gamma max_a' Q(s',a') is a sample of the value of taking
action a in state s: the actual reward, r, is a sample of the expected reward
R(s); the actual next state, s', is a sample from the next-state distribution;
and the value of that state s' is the value of the best action we can take in
it, which is the max over a' of Q(s',a').

o Guaranteed to converge to the optimal Q if the world is really an MDP.
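Written out, the update described above is the standard Q-learning rule (a
reconstruction of the slide's missing equation, with learning rate alpha and
discount factor gamma):

Q(s,a) \leftarrow Q(s,a) + \alpha \big( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \big)

A compact tabular sketch of the whole procedure follows, assuming a Gym-style
environment with reset() and step() methods and a finite action list; all names
here are illustrative.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: fold each piece of experience (s, a, r, s') into Q, then discard it."""
    Q = defaultdict(float)                           # Q(s, a), initialized to zero
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection: one simple way of choosing actions.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # r + gamma * max_a' Q(s', a') is a sample of the value of taking a in s.
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q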


Q-Learning
o Requires that the states and actions be drawn from a
small enough set that we can store the Q function in a
table.
n Large or even continuous state spaces, make the direct
representation approach impossible.
n We can try to use a function approximator, such as a neural
network, to store the Q function, rather than a table.

o Q-Learning can sometimes be very slow to converge.


More advanced techniques in reinforcement learning
are aimed at addressing this problem.

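One minimal way to replace the table with a function approximator is a linear
model over features of (s, a). This is only an illustrative sketch; the
numpy-based model and its update rule are assumptions, and the lecture itself
mentions neural networks as one option.

import numpy as np

class LinearQ:
    """Approximate Q(s, a) as a dot product of a weight vector with features of (s, a)."""
    def __init__(self, n_features, alpha=0.01):
        self.w = np.zeros(n_features)
        self.alpha = alpha

    def value(self, features):
        return float(np.dot(self.w, features))

    def update(self, features, target):
        # Move the prediction for this (s, a) toward the Q-learning target.
        error = target - self.value(features)
        self.w += self.alpha * error * features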

Reinforcement Learning
o Reinforcement learning is appealing because of its
generality.
n Any problem domain that can be cast as a Markov decision
process can potentially benefit from this technique
o Reinforcement learning is a promising technology; but
possible refinements that will have to be made before it
has truly widespread application.
n Reinforcement learning is an extension of classical dynamic
programming in that it greatly enlarges the set of problems
that can practically be solved.
n Combining dynamic programming with neural networks, many
are optimistic of solving a large class of problems.
