
Fundamentals of Reinforcement Learning

December 9, 2013 - Techniques of AI
Yann-Michaël De Hauwere - [email protected]


Course material
Slides online

T. Mitchell
Machine Learning, chapter 13
McGraw Hill, 1997

Richard S. Sutton and Andrew G. Barto


Reinforcement Learning: An Introduction
MIT Press, 1998

Available on-line for free!

Reinforcement Learning - 2/33


Why reinforcement learning?

Based on ideas from psychology

- Edward Thorndike’s law of effect
  - Satisfaction strengthens behavior, discomfort weakens it
- B.F. Skinner’s principle of reinforcement
  - Skinner Box: train animals by providing (positive) feedback

Learning by interacting with the environment

Reinforcement Learning - 3/33


Why reinforcement learning?

Control learning
- Robot learning to dock on a battery charger
- Learning to choose actions to optimize factory output
- Learning to play Backgammon and other games

Reinforcement Learning - 4/33


The RL setting

- Learning from interactions
- Learning what to do - how to map situations to actions - so as to maximize a numerical reward signal

Reinforcement Learning - 5/33


Key features of RL

- Learner is not told which action to take
- Trial-and-error approach
- Possibility of delayed reward
- Sacrifice short-term gains for greater long-term gains
- Need to balance exploration and exploitation
- Possible that states are only partially observable
- Possibly needs to learn multiple tasks with the same sensors
- In between supervised and unsupervised learning

Reinforcement Learning - 6/33


The agent-environment interface

Agent interacts at discrete time steps t = 0, 1, 2, . . .

- Observes state s_t ∈ S
- Selects action a_t ∈ A(s_t)
- Obtains immediate reward r_{t+1} ∈ R
- Observes resulting state s_{t+1}

[Figure: the agent-environment loop - the agent sends action a_t to the environment, which returns reward r_{t+1} and next state s_{t+1}, producing the trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, . . .]

Reinforcement Learning - 7/33


Elements of RL

- Time steps need not refer to fixed intervals of real time
- Actions can be
  - low level (voltage to motors)
  - high level (go left, go right)
  - "mental" (shift focus of attention)
- States can be
  - low level "sensations" (temperature, (x, y) coordinates)
  - high level abstractions, symbolic
  - subjective, internal ("surprised", "lost")
- The environment is not necessarily known to the agent

Reinforcement Learning - 8/33


Elements of RL

- State transitions are
  - changes to the internal state of the agent
  - changes in the environment as a result of the agent’s action
  - can be nondeterministic
- Rewards are
  - goals, subgoals
  - duration
  - ...

Reinforcement Learning - 9/33


Learning how to behave

- The agent’s policy π at time t is
  - a mapping from states to action probabilities
  - π_t(s, a) = P(a_t = a | s_t = s)
- Reinforcement learning methods specify how the agent changes its policy as a result of experience
- Roughly, the agent’s goal is to get as much reward as it can over the long run

Reinforcement Learning - 10/33


The objective

- Use discounted return instead of total reward

  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + . . . = Σ_{k=0}^{∞} γ^k r_{t+k+1}

  where γ ∈ [0, 1] is the discount factor, such that

  shortsighted  0 ← γ → 1  farsighted
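As a quick illustration, a minimal Python sketch that computes the discounted return for a finite list of future rewards (the function name and the example values are illustrative, not from the slides):

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1}, where rewards = [r_{t+1}, r_{t+2}, ...]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: three future rewards with gamma = 0.9
print(discounted_return([1.0, 0.0, 10.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*10.0 = 9.1
```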

Reinforcement Learning - 11/33


Example: backgammon

- Learn to play backgammon
- Immediate reward:
  - +100 if win
  - -100 if lose
  - 0 for all other states

Trained by playing 1.5 million games against itself.

Now approximately equal to the best human player.

Reinforcement Learning - 12/33


Example: pole balancing

- A continuing task with discounted return:
  - reward = −1 upon failure
  - return = −γ^k, for k steps before failure

Return is maximized by avoiding failure for as long as possible:

  R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}

Reinforcement Learning - 13/33


Example: pole balancing (movie)

Reinforcement Learning - 14/33


Markov decision processes

- It is often useful to assume that all relevant information is present in the current state: the Markov property

  P(s_{t+1}, r_{t+1} | s_t, a_t) = P(s_{t+1}, r_{t+1} | s_t, a_t, r_t, s_{t−1}, a_{t−1}, . . . , r_1, s_0, a_0)

- If a reinforcement learning task has the Markov property, it is basically a Markov Decision Process (MDP)
- Assuming finite state and action spaces, it is a finite MDP

Reinforcement Learning - 15/33


Markov decision processes

An MDP is defined by
- State and action sets
- A transition function

  P^a_{ss'} = P(s_{t+1} = s' | s_t = s, a_t = a)

- A reward function

  R^a_{ss'} = E(r_{t+1} | s_t = s, a_t = a, s_{t+1} = s')

[Figure: the agent-environment interaction diagram, as on slide 7]
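To make this concrete, here is a minimal Python sketch of a finite MDP stored as lookup tables; the two-state example and all names are illustrative assumptions, not taken from the slides:

```python
# A made-up two-state MDP: transition probabilities P[s][a][s'] and
# expected rewards R[s][a][s'].
states = ["s0", "s1"]
actions = ["left", "right"]

P = {
    "s0": {"left": {"s0": 1.0},            "right": {"s1": 1.0}},
    "s1": {"left": {"s0": 0.9, "s1": 0.1}, "right": {"s1": 1.0}},
}
R = {
    "s0": {"left": {"s0": 0.0},            "right": {"s1": 1.0}},
    "s1": {"left": {"s0": 0.0, "s1": 0.0}, "right": {"s1": 2.0}},
}
```

These tables are reused in the policy evaluation sketch a few slides below.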

Reinforcement Learning - 16/33


Value functions
- Goal: learn π : S → A, given ⟨⟨s, a⟩, r⟩
- When following a fixed policy π we can define the value of a state s under that policy as

  V^π(s) = E_π(R_t | s_t = s) = E_π(Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s)

- Similarly we can define the value of taking action a in state s as

  Q^π(s, a) = E_π(R_t | s_t = s, a_t = a)

- Optimal policy: π* = argmax_π V^π(s)

Reinforcement Learning - 17/33


Reinforcement Learning - 18/33
Value functions

- The value function has a particular recursive relationship, expressed by the Bellman equation

  V^π(s) = Σ_{a∈A(s)} π(s, a) Σ_{s'∈S} P^a_{ss'} [R^a_{ss'} + γ V^π(s')]

- The equation expresses the recursive relation between the value of a state and its successor states, and averages over all possibilities, weighting each by its probability of occurring
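The Bellman equation can be turned directly into an update rule. Below is a minimal iterative policy evaluation sketch in Python for a uniformly random policy, reusing the illustrative P and R tables from the MDP sketch above; it is an example of mine, not material from the slides:

```python
def policy_evaluation(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman equation for a uniform random policy
    until the value estimates stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a in actions:
                pi_sa = 1.0 / len(actions)  # pi(s, a): uniform random policy
                for s_next, p in P[s][a].items():
                    v_new += pi_sa * p * (R[s][a][s_next] + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# V = policy_evaluation(states, actions, P, R)
```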

Reinforcement Learning - 19/33


Learning an optimal policy online

- Often transition and reward functions are unknown
- Using temporal difference (TD) methods is one way of overcoming this problem
  - Learn directly from raw experience
  - No model of the environment required (model-free)
  - E.g.: Q-learning
- Update predicted state values based on new observations of immediate rewards and successor states

Reinforcement Learning - 20/33


Q-function

Q(s, a) = r(s, a) + γ V*(δ(s, a)),  with s_{t+1} = δ(s_t, a_t)

- If we know Q, we do not have to know δ:

  π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

  π*(s) = argmax_a Q(s, a)

Reinforcement Learning - 21/33


Training rule to learn Q
- Q and V* are closely related:

  V*(s) = max_{a'} Q(s, a')

- which allows us to write Q as:

  Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))

  Q(s_t, a_t) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

- So if Q̂ represents the learner’s current approximation of Q:

  Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

Reinforcement Learning - 22/33


Q-learning

- Q-learning updates state-action values based on the immediate reward and the optimal expected return

  Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

- Directly learns the optimal value function independent of the policy being followed
- Proven to converge to the optimal policy given "sufficient" updates for each state-action pair, and decreasing learning rate α [Watkins92, Tsitsiklis94]
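To make the update rule concrete, here is a minimal tabular Q-learning sketch in Python. The environment interface (env.reset(), env.step()) and the hyperparameter values are assumptions of this example, not part of the slides:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.
    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done)."""
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```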

Reinforcement Learning - 23/33


Q-learning

Reinforcement Learning - 24/33


Action selection

- How to select an action based on the values of the states or state-action pairs?
- Success of RL depends on a trade-off
  - Exploration
  - Exploitation
- Exploration is needed to prevent getting stuck in local optima
- To ensure convergence you need to exploit

Reinforcement Learning - 25/33


Action selection

Two common choices

- ε-greedy
  - Choose the best action with probability 1 − ε
  - Choose a random action with probability ε
- Boltzmann exploration (softmax) uses a temperature parameter τ to balance exploration and exploitation

  π_t(s, a) = e^{Q_t(s,a)/τ} / Σ_{a'∈A} e^{Q_t(s,a')/τ}

  pure exploitation  0 ← τ → ∞  pure exploration
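Both rules are easy to write down; here is a small Python sketch over the Q-values of a single state (function names and example values are illustrative only):

```python
import math
import random

def epsilon_greedy(q_values, epsilon):
    """q_values: dict mapping action -> Q(s, action) for the current state."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit

def boltzmann(q_values, tau):
    """Softmax selection with temperature tau (tau -> 0: greedy, tau -> inf: uniform)."""
    prefs = {a: math.exp(q / tau) for a, q in q_values.items()}
    total = sum(prefs.values())
    r, cumulative = random.random(), 0.0
    for a, p in prefs.items():
        cumulative += p / total
        if r <= cumulative:
            return a
    return a  # numerical safety net

# Example: epsilon_greedy({"left": 0.2, "right": 0.5}, epsilon=0.1)
```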

Reinforcement Learning - 26/33


Updating Q: in practice

Reinforcement Learning - 27/33


Convergence of deterministic Q-learning

Q̂ converges to Q when each ⟨s, a⟩ is visited infinitely often

Proof:
- Let a full interval be an interval during which each ⟨s, a⟩ is visited
- Let Q̂_n be the Q-table after n updates
- Δ_n is the maximum error in Q̂_n:

  Δ_n = max_{s,a} |Q̂_n(s, a) − Q(s, a)|

Reinforcement Learning - 28/33


Convergence of deterministic Q-learning
For any table entry Q̂_n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂_{n+1}(s, a) is

|Q̂_{n+1}(s, a) − Q(s, a)| = |(r + γ max_{a'} Q̂_n(s', a')) − (r + γ max_{a'} Q(s', a'))|
                          = |γ max_{a'} Q̂_n(s', a') − γ max_{a'} Q(s', a')|
                          ≤ γ max_{a'} |Q̂_n(s', a') − Q(s', a')|
                          ≤ γ max_{s'',a'} |Q̂_n(s'', a') − Q(s'', a')|

|Q̂_{n+1}(s, a) − Q(s, a)| ≤ γ Δ_n < Δ_n

Reinforcement Learning - 29/33


Extensions
- Multi-step TD
  - Instead of observing one immediate reward, use n consecutive rewards for the value update (see the sketch below)
  - Intuition: your current choice of action may have implications for the future
- Eligibility traces
  - State-action pairs are eligible for future rewards, with more recent states getting more credit
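A minimal sketch of the n-step TD target described above (my own illustration; the slides give no formula for it):

```python
def n_step_target(rewards, v_last, gamma):
    """n-step TD target: r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n}
    + gamma^n * V(s_{t+n}). `rewards` holds the n observed rewards and
    `v_last` is the current value estimate of the state reached after n steps."""
    target = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return target + (gamma ** len(rewards)) * v_last
```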

Reinforcement Learning - 30/33


Extensions

- Reward shaping
  - Incorporate domain knowledge to provide additional rewards during an episode
  - Guide the agent to learn faster
  - (Optimal) policies preserved given a potential-based shaping function [Ng99] (see the sketch after this list)
- Function approximation
  - So far we have used a tabular notation for value functions
  - For large state and action spaces this approach becomes intractable
  - Function approximators can be used to generalize over large or even continuous state and action spaces
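A short sketch of the potential-based form F(s, s') = γΦ(s') − Φ(s) from [Ng99]; the distance-to-goal potential is an assumed example, not something given in the slides:

```python
def shaped_reward(r, s, s_next, phi, gamma):
    """Add a potential-based shaping term F(s, s') = gamma*phi(s') - phi(s),
    which preserves the (optimal) policies of the original MDP [Ng99]."""
    return r + gamma * phi(s_next) - phi(s)

# Example potential: grid states closer to a goal cell get higher phi.
goal = (5, 5)
phi = lambda s: -(abs(s[0] - goal[0]) + abs(s[1] - goal[1]))
```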

Reinforcement Learning - 31/33


Demo

http://wilma.vub.ac.be:3000

Reinforcement Learning - 32/33


Questions?

Reinforcement Learning - 33/33
