
Fundamentals of Reinforcement Learning

December 9, 2013 - Techniques of AI
Yann-Michaël De Hauwere - [email protected]


Course material
Slides online

T. Mitchell
Machine Learning, chapter 13
McGraw Hill, 1997

Richard S. Sutton and Andrew G. Barto


Reinforcement Learning: An Introduction
MIT Press, 1998

Available on-line for free!

Reinforcement Learning - 2/33


Why reinforcement learning?

Based on ideas from psychology

- Edward Thorndike’s law of effect
  - Satisfaction strengthens behavior, discomfort weakens it
- B.F. Skinner’s principle of reinforcement
  - Skinner Box: train animals by providing (positive) feedback

Learning by interacting with the environment

Reinforcement Learning - 3/33


Why reinforcement learning?

Control learning
- Robot learning to dock on a battery charger
- Learning to choose actions to optimize factory output
- Learning to play Backgammon and other games

Reinforcement Learning - 4/33


The RL setting

- Learning from interactions
- Learning what to do - how to map situations to actions - so as to maximize a numerical reward signal

Reinforcement Learning - 5/33


Key features of RL

- Learner is not told which action to take
- Trial-and-error approach
- Possibility of delayed reward
- Sacrifice short-term gains for greater long-term gains
- Need to balance exploration and exploitation
- Possible that states are only partially observable
- Possibly needs to learn multiple tasks with the same sensors
- In between supervised and unsupervised learning

Reinforcement Learning - 6/33


The agent-environment interface

Agent interacts at discrete time steps t = 0, 1, 2, . . .

- Observes state s_t ∈ S
- Selects action a_t ∈ A(s_t)
- Obtains immediate reward r_{t+1} ∈ R
- Observes resulting state s_{t+1}

[Figure: the agent-environment loop - the agent sends action a_t to the environment, which returns reward r_{t+1} and next state s_{t+1}, producing the trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, . . .]

Reinforcement Learning - 7/33


Elements of RL

- Time steps need not refer to fixed intervals of real time
- Actions can be
  - low level (voltage to motors)
  - high level (go left, go right)
  - "mental" (shift focus of attention)
- States can be
  - low level "sensations" (temperature, (x, y) coordinates)
  - high level abstractions, symbolic
  - subjective, internal ("surprised", "lost")
- The environment is not necessarily known to the agent

Reinforcement Learning - 8/33


Elements of RL

- State transitions are
  - changes to the internal state of the agent
  - changes in the environment as a result of the agent’s action
  - can be nondeterministic
- Rewards are
  - goals, subgoals
  - duration
  - ...

Reinforcement Learning - 9/33


Learning how to behave

- The agent’s policy π at time t is
  - a mapping from states to action probabilities
  - π_t(s, a) = P(a_t = a | s_t = s)
- Reinforcement learning methods specify how the agent changes its policy as a result of experience
- Roughly, the agent’s goal is to get as much reward as it can over the long run

Reinforcement Learning - 10/33


The objective

- Use discounted return instead of total reward

  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + . . . = Σ_{k=0}^{∞} γ^k r_{t+k+1}

  where γ ∈ [0, 1] is the discount factor, such that

  shortsighted  0 ← γ → 1  farsighted
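As a quick illustration, a minimal Python sketch that computes the discounted return for a finite list of future rewards (the function name and the example values are illustrative, not from the slides):

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1}, where rewards = [r_{t+1}, r_{t+2}, ...]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: three future rewards with gamma = 0.9
print(discounted_return([1.0, 0.0, 10.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*10.0 = 9.1
```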

Reinforcement Learning - 11/33


Example: backgammon

- Learn to play backgammon
- Immediate reward:
  - +100 if win
  - -100 if lose
  - 0 for all other states

Trained by playing 1.5 million games against itself.

Now approximately equal to the best human player.

Reinforcement Learning - 12/33


Example: pole balancing

- A continuing task with discounted return:
  - reward = −1 upon failure
  - return = −γ^k, for k steps before failure

Return is maximized by avoiding failure for as long as possible:

  R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}

Reinforcement Learning - 13/33


Example: pole balancing (movie)

Reinforcement Learning - 14/33


Markov decision processes

- It is often useful to assume that all relevant information is present in the current state: the Markov property

  P(s_{t+1}, r_{t+1} | s_t, a_t) = P(s_{t+1}, r_{t+1} | s_t, a_t, r_t, s_{t−1}, a_{t−1}, . . . , r_1, s_0, a_0)

- If a reinforcement learning task has the Markov property, it is basically a Markov Decision Process (MDP)
- Assuming finite state and action spaces, it is a finite MDP

Reinforcement Learning - 15/33


Markov decision processes

An MDP is defined by
- State and action sets
- A transition function

  P^a_{ss'} = P(s_{t+1} = s' | s_t = s, a_t = a)

- A reward function

  R^a_{ss'} = E(r_{t+1} | s_t = s, a_t = a, s_{t+1} = s')

[Figure: the agent-environment interaction diagram, as on slide 7]
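To make this concrete, here is a minimal Python sketch of a finite MDP stored as lookup tables; the two-state example and all names are illustrative assumptions, not taken from the slides:

```python
# A made-up two-state MDP: transition probabilities P[s][a][s'] and
# expected rewards R[s][a][s'].
states = ["s0", "s1"]
actions = ["left", "right"]

P = {
    "s0": {"left": {"s0": 1.0},            "right": {"s1": 1.0}},
    "s1": {"left": {"s0": 0.9, "s1": 0.1}, "right": {"s1": 1.0}},
}
R = {
    "s0": {"left": {"s0": 0.0},            "right": {"s1": 1.0}},
    "s1": {"left": {"s0": 0.0, "s1": 0.0}, "right": {"s1": 2.0}},
}
```

These tables are reused in the policy evaluation sketch a few slides below.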

Reinforcement Learning - 16/33


Value functions
- Goal: learn π : S → A, given ⟨⟨s, a⟩, r⟩
- When following a fixed policy π we can define the value of a state s under that policy as

  V^π(s) = E_π(R_t | s_t = s) = E_π(Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s)

- Similarly we can define the value of taking action a in state s as

  Q^π(s, a) = E_π(R_t | s_t = s, a_t = a)

- Optimal policy: π* = argmax_π V^π(s)

Reinforcement Learning - 17/33


Reinforcement Learning - 18/33
Value functions

- The value function has a particular recursive relationship, expressed by the Bellman equation

  V^π(s) = Σ_{a∈A(s)} π(s, a) Σ_{s'∈S} P^a_{ss'} [R^a_{ss'} + γ V^π(s')]

- The equation expresses the recursive relation between the value of a state and its successor states, and averages over all possibilities, weighting each by its probability of occurring
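The Bellman equation can be turned directly into an update rule. Below is a minimal iterative policy evaluation sketch in Python for a uniformly random policy, reusing the illustrative P and R tables from the MDP sketch above; it is an example of mine, not material from the slides:

```python
def policy_evaluation(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman equation for a uniform random policy
    until the value estimates stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a in actions:
                pi_sa = 1.0 / len(actions)  # pi(s, a): uniform random policy
                for s_next, p in P[s][a].items():
                    v_new += pi_sa * p * (R[s][a][s_next] + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# V = policy_evaluation(states, actions, P, R)
```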

Reinforcement Learning - 19/33


Learning an optimal policy online

- Often transition and reward functions are unknown
- Using temporal difference (TD) methods is one way of overcoming this problem
  - Learn directly from raw experience
  - No model of the environment required (model-free)
  - E.g.: Q-learning
- Update predicted state values based on new observations of immediate rewards and successor states

Reinforcement Learning - 20/33


Q-function

Q(s, a) = r(s, a) + γ V*(δ(s, a)),  with s_{t+1} = δ(s_t, a_t)

- If we know Q, we do not have to know δ:

  π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

  π*(s) = argmax_a Q(s, a)

Reinforcement Learning - 21/33


Training rule to learn Q
- Q and V* are closely related:

  V*(s) = max_{a'} Q(s, a')

- which allows us to write Q as:

  Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))

  Q(s_t, a_t) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

- So if Q̂ represents the learner’s current approximation of Q:

  Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

Reinforcement Learning - 22/33


Q-learning

- Q-learning updates state-action values based on the immediate reward and the optimal expected return

  Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

- Directly learns the optimal value function independent of the policy being followed
- Proven to converge to the optimal policy given "sufficient" updates for each state-action pair, and decreasing learning rate α [Watkins92, Tsitsiklis94]
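To make the update rule concrete, here is a minimal tabular Q-learning sketch in Python. The environment interface (env.reset(), env.step()) and the hyperparameter values are assumptions of this example, not part of the slides:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.
    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done)."""
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```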

Reinforcement Learning - 23/33


Q-learning

Reinforcement Learning - 24/33


Action selection

- How to select an action based on the values of the states or state-action pairs?
- Success of RL depends on a trade-off
  - Exploration
  - Exploitation
- Exploration is needed to prevent getting stuck in local optima
- To ensure convergence you need to exploit

Reinforcement Learning - 25/33


Action selection

Two common choices

- ε-greedy
  - Choose the best action with probability 1 − ε
  - Choose a random action with probability ε
- Boltzmann exploration (softmax) uses a temperature parameter τ to balance exploration and exploitation

  π_t(s, a) = e^{Q_t(s,a)/τ} / Σ_{a'∈A} e^{Q_t(s,a')/τ}

  pure exploitation  0 ← τ → ∞  pure exploration
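Both rules are easy to write down; here is a small Python sketch over the Q-values of a single state (function names and example values are illustrative only):

```python
import math
import random

def epsilon_greedy(q_values, epsilon):
    """q_values: dict mapping action -> Q(s, action) for the current state."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit

def boltzmann(q_values, tau):
    """Softmax selection with temperature tau (tau -> 0: greedy, tau -> inf: uniform)."""
    prefs = {a: math.exp(q / tau) for a, q in q_values.items()}
    total = sum(prefs.values())
    r, cumulative = random.random(), 0.0
    for a, p in prefs.items():
        cumulative += p / total
        if r <= cumulative:
            return a
    return a  # numerical safety net

# Example: epsilon_greedy({"left": 0.2, "right": 0.5}, epsilon=0.1)
```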

Reinforcement Learning - 26/33


Updating Q: in practice

Reinforcement Learning - 27/33


Convergence of deterministic Q-learning

Q̂ converges to Q when each ⟨s, a⟩ is visited infinitely often

Proof:
- Let a full interval be an interval during which each ⟨s, a⟩ is visited
- Let Q̂_n be the Q-table after n updates
- Δ_n is the maximum error in Q̂_n:

  Δ_n = max_{s,a} |Q̂_n(s, a) − Q(s, a)|

Reinforcement Learning - 28/33


Convergence of deterministic Q-learning
For any table entry Q̂_n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂_{n+1}(s, a) is

|Q̂_{n+1}(s, a) − Q(s, a)| = |(r + γ max_{a'} Q̂_n(s', a')) − (r + γ max_{a'} Q(s', a'))|
                          = |γ max_{a'} Q̂_n(s', a') − γ max_{a'} Q(s', a')|
                          ≤ γ max_{a'} |Q̂_n(s', a') − Q(s', a')|
                          ≤ γ max_{s'',a'} |Q̂_n(s'', a') − Q(s'', a')|

|Q̂_{n+1}(s, a) − Q(s, a)| ≤ γ Δ_n < Δ_n

Reinforcement Learning - 29/33


Extensions
- Multi-step TD
  - Instead of observing one immediate reward, use n consecutive rewards for the value update (see the sketch below)
  - Intuition: your current choice of action may have implications for the future
- Eligibility traces
  - State-action pairs are eligible for future rewards, with more recent states getting more credit
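A minimal sketch of the n-step TD target described above (my own illustration; the slides give no formula for it):

```python
def n_step_target(rewards, v_last, gamma):
    """n-step TD target: r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n}
    + gamma^n * V(s_{t+n}). `rewards` holds the n observed rewards and
    `v_last` is the current value estimate of the state reached after n steps."""
    target = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return target + (gamma ** len(rewards)) * v_last
```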

Reinforcement Learning - 30/33


Extensions

- Reward shaping
  - Incorporate domain knowledge to provide additional rewards during an episode
  - Guide the agent to learn faster
  - (Optimal) policies preserved given a potential-based shaping function [Ng99] (see the sketch after this list)
- Function approximation
  - So far we have used a tabular notation for value functions
  - For large state and action spaces this approach becomes intractable
  - Function approximators can be used to generalize over large or even continuous state and action spaces
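A short sketch of the potential-based form F(s, s') = γΦ(s') − Φ(s) from [Ng99]; the distance-to-goal potential is an assumed example, not something given in the slides:

```python
def shaped_reward(r, s, s_next, phi, gamma):
    """Add a potential-based shaping term F(s, s') = gamma*phi(s') - phi(s),
    which preserves the (optimal) policies of the original MDP [Ng99]."""
    return r + gamma * phi(s_next) - phi(s)

# Example potential: grid states closer to a goal cell get higher phi.
goal = (5, 5)
phi = lambda s: -(abs(s[0] - goal[0]) + abs(s[1] - goal[1]))
```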

Reinforcement Learning - 31/33


Demo

http://wilma.vub.ac.be:3000

Reinforcement Learning - 32/33


Questions?

Reinforcement Learning - 33/33
