Reinforcement Learning: Russell and Norvig: CH 21
Nifty applets:
for blackjack
for robot motion
for a pendulum controller
Formalization
Given:
a state space S
a set of actions a1, ..., ak
reward value at the end of each trial (may be positive or negative)
Output:
a mapping from states to actions
Example: ALVINN (driving agent)
state: configuration of the car
learn a steering action for each state
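As a concrete illustration (the state and action names below are placeholders, not from the slides), the learned output can be represented as a simple table from states to actions:

# Hypothetical illustration of the input/output of the learning problem
states = ["s1", "s2", "s3"]                      # state space S (placeholder names)
actions = ["a1", "a2"]                           # actions a1, ..., ak
policy = {"s1": "a1", "s2": "a2", "s3": "a1"}    # output: a mapping state -> action
print(policy["s2"])                              # the action to take in state s2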
Reactive Agent Algorithm
(the state is assumed accessible or observable)
Repeat:
s ← sensed state
If s is terminal then exit
a ← choose action (given s)
Perform a
Policy (Reactive/Closed-Loop Strategy)
[Figure: 4x3 grid world with terminal rewards +1 and -1]
Repeat:
s ← sensed state
If s is terminal then exit
a ← Π(s)
Perform a
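A minimal Python sketch of this reactive loop, assuming hypothetical sense_state(), is_terminal(), and perform() interfaces to the environment:

def run_policy(policy, sense_state, is_terminal, perform):
    # Closed-loop execution of a fixed policy: sense, act, repeat.
    while True:
        s = sense_state()       # s <- sensed state
        if is_terminal(s):      # if s is terminal then exit
            return
        a = policy[s]           # a <- Pi(s)
        perform(a)              # perform a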
Approaches
Learn a policy directly: a function mapping from states to actions
Learn utility values for states (i.e., the value function)
Value Function
The agent knows what state it is in
The agent has a number of actions it can perform in
each state.
Initially, it doesn't know the value of any of the states
If the outcome of performing an action at a state is
deterministic, then the agent can update the utility
value U() of states:
U(oldstate) = reward + U(newstate)
The agent learns the utility values of states as it
works its way through the state space
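A minimal sketch of this deterministic update, assuming an episode is given as a list of (state, reward) pairs in the order visited; sweeping the episode backward is one way to apply the rule, and the episode itself is made up for illustration:

def update_utilities(episode, U):
    # Walk the episode backward so each state uses the already-updated
    # utility of its successor: U(oldstate) = reward + U(newstate)
    next_value = 0.0
    for state, reward in reversed(episode):
        U[state] = reward + next_value
        next_value = U[state]
    return U

U = {}
episode = [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)]   # +1 reward at the end of the trial
print(update_utilities(episode, U))                  # every state on the path gets value 1.0

Note that without a discount factor every state on the path ends up with the same value; the discount factor discussed next makes states closer to the reward worth more.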
Exploration
The agent may occasionally choose to explore
suboptimal moves in the hopes of finding better
outcomes
Only by visiting all the states frequently enough can we
guarantee learning the true values of all the states
A discount factor is often introduced to prevent utility
values from diverging and to promote the use of
shorter (more efficient) sequences of actions to
attain rewards
The update equation using a discount factor is:
U(oldstate) = reward + γ * U(newstate)
Normally, γ is set between 0 and 1
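A quick numeric check of both points (γ = 0.9 is an arbitrary choice): with γ < 1 the discounted sum of even an endless reward stream stays bounded at r / (1 - γ), and a reward reached in fewer steps is discounted less, so shorter action sequences are preferred.

gamma, r = 0.9, 1.0
bounded = sum(gamma**t * r for t in range(1000))
print(bounded, r / (1 - gamma))      # both approximately 10.0: no divergence
print(gamma**1 * r, gamma**3 * r)    # 0.9 vs 0.729: the closer reward is worth more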
Q-Learning
Q-learning augments value iteration by
maintaining an estimated utility value
Q(s,a) for every action at every state
The utility of a state U(s), or Q(s), is
simply the maximum Q value over all
the possible actions at that state
Learns utilities of actions (not states)
model-free learning
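In code, the relation U(s) = max over a of Q(s,a) is a one-line lookup over the Q table; representing the table as a dict keyed by (state, action) is an assumption reused in the sketch below.

def utility(Q, s, actions):
    # U(s) = max over all actions a of Q(s, a); Q is a table keyed by (state, action)
    return max(Q[(s, a)] for a in actions)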
Q-Learning
for each state s
  for each action a
    Q(s,a) = 0
s = current state
do forever
  a = select an action
  do action a
  r = reward from doing a
  t = resulting state from doing a
  Q(s,a) = (1 - α) Q(s,a) + α (r + γ Q(t))
  s = t
The learning coefficient, α, determines how quickly our estimates are updated
Normally, α is set to a small positive constant less than 1
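A runnable Python sketch of the loop above, assuming a hypothetical tabular environment with reset(), actions(s), and step(s, a) methods; the epsilon-greedy rule used for "select an action" and the parameter values are illustrative choices, not prescribed by the slides.

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Q(s,a) = 0 for every state and action (defaultdict returns 0.0 for unseen keys)
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # select an action: mostly exploit current estimates, sometimes explore
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])
            r, t, done = env.step(s, a)        # do action a, observe reward and next state
            # Q(t) is the max Q value over the actions available in t (0 if terminal)
            q_t = 0.0 if done else max(Q[(t, a2)] for a2 in env.actions(t))
            # Q(s,a) = (1 - alpha) Q(s,a) + alpha (r + gamma Q(t))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * q_t)
            s = t
    return Q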
Selecting an Action
Simply choose the action with the highest (current) expected utility?
(always exploiting in this way can leave the agent stuck in a rut)
Problem: each action has two effects
yields a reward (or penalty) on the current sequence
information is received and used in learning for future sequences
Trade-off: immediate good for long-term well-being
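One simple way to make this trade-off concrete is the epsilon-greedy rule already used in the Q-learning sketch above; epsilon-greedy is an illustrative choice, the slides do not commit to a particular selection scheme.

import random

def select_action(Q, s, actions, epsilon=0.1):
    # With probability epsilon explore a random action so learning keeps
    # gathering information; otherwise exploit the highest current estimate.
    if random.random() < epsilon:
        return random.choice(actions)                  # long-term well-being
    return max(actions, key=lambda a: Q[(s, a)])       # immediate good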