Reinforcement Learning

This document discusses reinforcement learning and related topics. It provides an overview of Markov decision processes and techniques like value iteration, policy iteration, and modified policy iteration. The document then discusses challenges with reinforcement learning when models are unknown and introduces ideas like passive reinforcement learning, adaptive dynamic programming, and temporal-difference learning.


Reinforcement Learning

CSE 573

Ever Feel Like Pavlov’s Poor Dog?


© Daniel S. Weld 1
Logistics
• Reading for Wed
  AIMA Ch. 20 through 20.3
• Teams for Project 2
• Midterm
  Typical problems – see AIMA exercises
  In class / take-home
  Open book / closed book
  Length

© Daniel S. Weld 2
573 Topics

[Topic stack: Supervised Learning, Reinforcement Learning, Planning; Knowledge Representation & Inference (logic-based, probabilistic); Search; Problem Spaces; Agency]

© Daniel S. Weld 3
Pole Demo

© Daniel S. Weld 4
Review: MDPs
S = set of states (|S| = n)

A = set of actions (|A| = m)

Pr = transition function Pr(s,a,s’)
  represented by a set of m n×n stochastic matrices (factored into DBNs)
  each defines a distribution over S×S

R(s) = bounded, real-valued reward function
  represented by an n-vector

(A sketch of this representation appears below.)

© Daniel S. Weld 5
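A minimal sketch, in Python with numpy, of the tabular representation described above. The class and field names are illustrative assumptions, not course code.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class MDP:
        """Tabular MDP: m row-stochastic n x n matrices plus an n-vector of rewards."""
        P: np.ndarray        # shape (m, n, n); P[a, s, s'] = Pr(s' | s, a)
        R: np.ndarray        # shape (n,); bounded, real-valued reward R(s)
        gamma: float = 0.9   # discount factor

        def __post_init__(self):
            # each action's transition matrix must define a distribution over successors
            assert np.allclose(self.P.sum(axis=2), 1.0)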
Goal for an MDP
• Find a policy which:
maximizes expected discounted reward
over an infinite horizon
for a fully observable
Markov decision process.

© Daniel S. Weld 6
Bellman Backup
[Diagram: Bellman backup at state s — Qn+1(s,a) is computed for each action a1, a2, a3 by averaging Vn over the successor states, and Vn+1(s) takes the max over actions. This improves the estimate of the value function.]

Vt+1(s) = R(s) + maxa∈A { c(a) + γ Σs’∈S Pr(s’|a,s) Vt(s’) }

(the bracketed term is the expected future reward, averaged over destination states; see the sketch below)
© Daniel S. Weld 7
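A minimal sketch of a single Bellman backup over the MDP structure above; for simplicity the per-action cost c(a) is folded into R, which is an assumption of this sketch.

    def bellman_backup(mdp, V, s):
        """Return R(s) + gamma * max_a sum_s' Pr(s'|a,s) * V(s')."""
        q = mdp.P[:, s, :] @ V          # expected future value, one entry per action
        return mdp.R[s] + mdp.gamma * q.max()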
Value Iteration
• Assign arbitrary values to each state
  (or use an admissible heuristic).

• Iterate over all states,
  improving the value function via Bellman backups.

• Stop the iteration when it converges
  (Vt approaches V* as t → ∞).

• Dynamic programming (sketch below).
© Daniel S. Weld 8
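A minimal sketch of value iteration built on bellman_backup above; the stopping threshold eps is an illustrative parameter.

    def value_iteration(mdp, eps=1e-6):
        n = mdp.R.shape[0]
        V = np.zeros(n)                            # arbitrary initial values
        while True:
            V_new = np.array([bellman_backup(mdp, V, s) for s in range(n)])
            if np.abs(V_new - V).max() < eps:      # stop when the update is tiny
                return V_new
            V = V_new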
Note on Value Iteration
• Order in which one applies Bellman Backups
Irrelevant!

• Some orders more efficient than others

[Grid figure: states in a row, with the goal state worth 10.]

Action cost = -1. Discount factor γ = 1.


© Daniel S. Weld 9
Expand figure
• Initialize all value functions to 0
• First backup sets goal to 10

• Use animation

© Daniel S. Weld 10
Policy evaluation
• Given a policy Π: S → A, find the value of each
  state under this policy.
• VΠ(s) = R(s) + c(Π(s)) + γ Σs’∈S Pr(s’|Π(s),s) VΠ(s’)
• This is a system of linear equations
  in |S| variables (solved directly in the sketch below).

© Daniel S. Weld 11
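A minimal sketch that solves the linear system directly with numpy; the per-action cost c is folded into R, and pi is an integer array mapping each state to its action (both assumptions of the sketch).

    def policy_evaluation(mdp, pi):
        """Solve V = R + gamma * P_pi @ V exactly."""
        n = mdp.R.shape[0]
        P_pi = mdp.P[pi, np.arange(n), :]    # row s is Pr(. | s, pi(s))
        # (I - gamma * P_pi) V = R
        return np.linalg.solve(np.eye(n) - mdp.gamma * P_pi, mdp.R)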
Policy iteration
• Start with any policy (Π0).
• Iterate:
  Policy evaluation: for each state, find VΠi(s).
  Policy improvement: for each state s, find the action
  a* that maximizes QΠi(a,s).
    If QΠi(a*,s) > VΠi(s), let Πi+1(s) = a*;
    else let Πi+1(s) = Πi(s).
• Stop when Πi+1 = Πi.
• Converges in fewer iterations than value iteration, but the
  policy evaluation step is more expensive (sketch below).

© Daniel S. Weld 12
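A minimal sketch built on policy_evaluation above; the greedy tie-breaking is an implementation detail of this sketch.

    def policy_iteration(mdp):
        m, n, _ = mdp.P.shape
        pi = np.zeros(n, dtype=int)                        # start with an arbitrary policy
        while True:
            V = policy_evaluation(mdp, pi)                 # policy evaluation
            Q = mdp.R[:, None] + mdp.gamma * (mdp.P @ V).T # Q[s, a]
            pi_new = Q.argmax(axis=1)                      # policy improvement
            if np.array_equal(pi_new, pi):
                return pi, V
            pi = pi_new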
Modified Policy iteration
• Instead of evaluating the actual value of the
  policy by solving a system of linear equations, …

• Approximate it:
  a few sweeps of value iteration with the policy held fixed (sketch below).

© Daniel S. Weld 13
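A minimal sketch of the approximate evaluation step; the number of sweeps k is an illustrative parameter.

    def approx_policy_evaluation(mdp, pi, V, k=10):
        """k sweeps of value iteration with the policy held fixed."""
        n = mdp.R.shape[0]
        P_pi = mdp.P[pi, np.arange(n), :]
        for _ in range(k):
            V = mdp.R + mdp.gamma * P_pi @ V
        return V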
Excuse Me…
• MDPs are great, IF…
We know the state transition function P(s,a,s’)
We know the reward function R(s)

• But what if we don’t?


Like when we were babies…
And like our dog…

© Daniel S. Weld 14
How is learning to act possible when…
• Actions have non-deterministic effects
Which are initially unknown

• Rewards / punishments are infrequent


Often at the end of long sequences of actions

• Learner must decide what actions to take

• World is large and complex

© Daniel S. Weld 15
Naïve Approach
1. Act Randomly for a while
(Or systematically explore all possible actions)
2. Learn
Transition function
Reward function
3. Use value iteration, policy iteration, …

Problems?

© Daniel S. Weld 16
RL Techniques
1. Passive RL

2. Adaptive Dynamic Programming

3. Temporal-difference learning
Learns a utility function on states
• treats the difference between expected / actual
reward as an error signal, that is propagated
backward in time

© Daniel S. Weld 17
Concepts
• Exploration functions
Balance exploration / exploitation

• Function approximation
Compress a large state space into a small one
Linear function approximation, neural nets, …
Generalization

© Daniel S. Weld 18
Example:
• Suppose we are given a policy
• Want to determine how good it is

© Daniel S. Weld 19
Objective: Value Function

© Daniel S. Weld 20
Just Like Policy Evaluation
• Except…?

© Daniel S. Weld 21
Passive RL
• Given policy π,
  estimate Uπ(s)
• Not given
transition matrix, nor
reward function!
• Epochs: training sequences
(1,1)(1,2)(1,3)(1,2)(1,3)(1,2)(1,1)(1,2)(2,2)(3,2) –1
(1,1)(1,2)(1,3)(2,3)(2,2)(2,3)(3,3) +1
(1,1)(1,2)(1,1)(1,2)(1,1)(2,1)(2,2)(2,3)(3,3) +1
(1,1)(1,2)(2,2)(1,2)(1,3)(2,3)(1,3)(2,3)(3,3) +1
(1,1)(2,1)(2,2)(2,1)(1,1)(1,2)(1,3)(2,3)(2,2)(3,2) -1
(1,1)(2,1)(1,1)(1,2)(2,2)(3,2) -1

© Daniel S. Weld 22
Approach 1
• Direct estimation
  Estimate U(s) as the average total reward of epochs
  containing s (calculated from s to the end of the epoch)
• Pros / Cons? (sketch below)
  Requires a huge amount of data
  Doesn’t exploit the Bellman constraints!
    Expected utility of a state =
      its own reward +
      expected utility of its successors

© Daniel S. Weld 23
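A minimal sketch of direct utility estimation; the epoch format, a list of (state, reward) pairs per training sequence, is an assumption for illustration.

    from collections import defaultdict

    def direct_estimation(epochs, gamma=1.0):
        """Average the observed reward-to-go over every visit to each state."""
        totals, counts = defaultdict(float), defaultdict(int)
        for epoch in epochs:                       # epoch: list of (state, reward) pairs
            reward_to_go = 0.0
            for state, reward in reversed(epoch):  # accumulate from the end backwards
                reward_to_go = reward + gamma * reward_to_go
                totals[state] += reward_to_go
                counts[state] += 1
        return {s: totals[s] / counts[s] for s in totals}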
Approach 2
Adaptive Dynamic Programming
  Requires a fully observable environment
  Estimate the transition model M from training data
  Solve the Bellman equation with modified policy iteration:

  U(s) = R(s) + γ Σs’ Ms,s’ U(s’)

Pros / Cons: (sketch below)
  Requires complete observations
  Don’t usually need the value of all states

© Daniel S. Weld 24
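A minimal sketch of passive ADP: estimate M and R by counting over the training epochs, then evaluate the fixed policy by iterating the Bellman equation above. The data format and parameters are illustrative assumptions.

    def adp_estimate(epochs, n_states, gamma=0.9, k=100):
        """epochs: lists of (state, reward) pairs generated by one fixed policy."""
        counts = np.zeros((n_states, n_states))
        R_sum, visits = np.zeros(n_states), np.zeros(n_states)
        for epoch in epochs:
            for (s, _), (s2, _) in zip(epoch, epoch[1:]):
                counts[s, s2] += 1                  # count observed transitions
            for s, r in epoch:
                R_sum[s] += r
                visits[s] += 1
        R = np.divide(R_sum, visits, out=np.zeros(n_states), where=visits > 0)
        row_sums = counts.sum(axis=1, keepdims=True)
        M = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
        U = np.zeros(n_states)
        for _ in range(k):                          # value iteration with the policy fixed
            U = R + gamma * M @ U
        return U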
Approach 3
• Temporal Difference Learning
Do backups on a per-action basis
Don’t try to estimate entire transition function!
For each transition from s to s’, update (sketch below):

  Uπ(s) ← Uπ(s) + α ( R(s) + γ Uπ(s’) − Uπ(s) )

  where α = learning rate and γ = discount rate

© Daniel S. Weld 25
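A minimal sketch of the TD update applied along each training epoch; U is a dict of utility estimates and the epoch format matches the earlier sketches (assumptions).

    def td_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
        """Move U(s) toward the observed one-step target r + gamma * U(s')."""
        U[s] = U.get(s, 0.0) + alpha * (r + gamma * U.get(s_next, 0.0) - U.get(s, 0.0))

    def passive_td(epochs, alpha=0.1, gamma=0.9):
        U = {}
        for epoch in epochs:
            for (s, r), (s2, _) in zip(epoch, epoch[1:]):
                td_update(U, s, r, s2, alpha, gamma)
        return U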
Notes
• Once Uπ is learned, the updates become 0:

  0 = α ( R(s) + γ Uπ(s’) − Uπ(s) )   when   Uπ(s) = R(s) + γ Uπ(s’)

• Similar to ADP
Adjusts state to ‘agree’ with observed successor
• Not all possible successors

Doesn’t require M, model of transition function

Intermediate approach: use M to generate


“Pseudo experience”

© Daniel S. Weld 26
Notes II

• “TD(0)”
One step lookahead

  Uπ(s) ← Uπ(s) + α ( R(s) + γ Uπ(s’) − Uπ(s) )
Can do 2 step, 3 step…

© Daniel S. Weld 27
TD()
• Or, … take it to the limit!
• Compute a weighted average of all future states:

  Uπ(st) ← Uπ(st) + α ( R(st) + γ Uπ(st+1) − Uπ(st) )

  becomes

  Uπ(st) ← Uπ(st) + α ( R(st) + γ (1−λ) Σi≥0 λ^i Uπ(st+i+1) − Uπ(st) )

  i.e. a λ-weighted average of the multi-step lookahead estimates
• Implementation (sketch below)
  Propagate the current weighted TD error onto past states
  Must memorize states visited since the start of the epoch

© Daniel S. Weld 28
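A minimal sketch of one common incremental implementation (the backward view with eligibility traces), which realizes "propagate the TD error onto past states"; the trace representation and parameters are assumptions of this sketch.

    from collections import defaultdict

    def td_lambda_episode(U, epoch, alpha=0.1, gamma=0.9, lam=0.8):
        """Backward-view TD(lambda): each TD error also updates recently visited states."""
        trace = defaultdict(float)                   # eligibility of each visited state
        for (s, r), (s2, _) in zip(epoch, epoch[1:]):
            delta = r + gamma * U.get(s2, 0.0) - U.get(s, 0.0)
            trace[s] += 1.0                          # mark s as just visited
            for state in list(trace):
                U[state] = U.get(state, 0.0) + alpha * delta * trace[state]
                trace[state] *= gamma * lam          # decay older states' eligibility
        return U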
Notes III
• Online: update immediately after actions
Works even if epochs are infinitely long

• Offline: wait until the end of an epoch


Can perform updates in any order
E.g. backwards in time
Converges faster if rewards come at epoch end
Why?!?

• ADP + prioritized-sweeping heuristic
  Bound the # of value-iteration steps (skip updates with small Δ)
  Only update states whose successors have changed by a large Δ
  Sample complexity ~ ADP
  Speed ~ TD

© Daniel S. Weld 29
Q-Learning
• Version of TD-learning where,
  instead of learning a value function on states,
  we learn a function on [state, action] pairs:

  Uπ(s) ← Uπ(s) + α ( R(s) + γ Uπ(s’) − Uπ(s) )

  becomes

  Q(a,s) ← Q(a,s) + α ( R(s) + γ maxa’ Q(a’,s’) − Q(a,s) )

  (Sketch below.)

• [Helpful for model-free policy learning]


© Daniel S. Weld 30
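A minimal sketch of the tabular Q-learning update plus a simple ε-greedy action choice (the fixed-probability random exploration mentioned later); Q is a dict keyed by (action, state), an assumption of the sketch.

    import random

    def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        best_next = max(Q.get((a2, s_next), 0.0) for a2 in actions)
        Q[(a, s)] = Q.get((a, s), 0.0) + alpha * (r + gamma * best_next - Q.get((a, s), 0.0))

    def epsilon_greedy(Q, s, actions, eps=0.1):
        if random.random() < eps:
            return random.choice(actions)                      # explore
        return max(actions, key=lambda a: Q.get((a, s), 0.0))  # exploit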
Baseball

CMU Robotics

Puma arm learning to throw


training involves 100 throws
(video is lame; learning is good)
Part II
• So far, we’ve assumed the agent had a policy

• Now, suppose the agent must learn it
  while acting in an uncertain world

© Daniel S. Weld 32
Active Reinforcement Learning
Suppose agent must make policy while learning
First approach:
Start with arbitrary policy
Apply Q-Learning
New policy:
In state s,
Choose action a that maximizes Q(a,s)
Problem?

© Daniel S. Weld 33
Utility of Exploration
• Too easily stuck in non-optimal space
“Exploration versus exploitation tradeoff”

• Solution 1
With fixed probability perform a random action
• Solution 2
  Increase the estimated expected value of infrequently tried states:

  U+(s) ← R(s) + γ maxa f( Σs’ P(s’|a,s) U+(s’), N(a,s) )

  Properties of f(u, n)?
    If n > Ne: u, i.e. the normal utility
    Else: R+, i.e. the max possible reward
  (Sketch below.)
© Daniel S. Weld 34
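A minimal sketch of the optimistic exploration function and the U+ backup over the tabular MDP from earlier; R_plus and N_e are illustrative parameters, and N[a, s] is assumed to count how often action a has been tried in state s.

    def exploration_f(u, n, R_plus=2.0, N_e=5):
        """Pretend rarely tried actions lead to the best possible reward."""
        return u if n > N_e else R_plus

    def optimistic_backup(mdp, U_plus, N, s):
        m = mdp.P.shape[0]
        vals = [exploration_f(mdp.P[a, s, :] @ U_plus, N[a, s]) for a in range(m)]
        return mdp.R[s] + mdp.gamma * max(vals)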
Part III
• The problem of large state spaces remains
  Never enough training data!
  Learning takes too long

• What to do??

© Daniel S. Weld 35
Function Approximation
• Never enough training data!
  Must generalize what is learned to new situations

• Idea:
  Replace the large state table by a smaller,
  parameterized function
  Updating the value of one state then changes the values
  assigned to many other, similar states

© Daniel S. Weld 36
Linear Function Approximation
• Represent U(s) as a weighted sum of
  features (basis functions) of s:

  Û(s) = θ1 f1(s) + θ2 f2(s) + … + θn fn(s)

• Update each parameter separately, e.g.:

  θi ← θi + α ( R(s) + γ Û(s’) − Û(s) ) ∂Û(s)/∂θi

  (Sketch below.)

© Daniel S. Weld 37
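A minimal sketch of the TD update with linear function approximation; theta and the feature vectors are numpy arrays, and the feature function is supplied by the caller (assumptions of the sketch).

    def linear_td_update(theta, features, s, r, s_next, alpha=0.01, gamma=0.9):
        """U_hat(s) = theta . features(s); features(s) returns a numpy vector."""
        f_s = features(s)
        delta = r + gamma * theta @ features(s_next) - theta @ f_s   # TD error
        return theta + alpha * delta * f_s   # gradient of U_hat(s) w.r.t. theta is f(s)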
Example
• Û(s) = θ0 + θ1 x + θ2 y
• Learns a good approximation

[Grid figure with goal reward 10]

© Daniel S. Weld 38
But What If…
• Û(s) = θ0 + θ1 x + θ2 y + θ3 z

• Computed features:
  z = √( (xg − x)² + (yg − y)² )

[Grid figure with goal reward 10]

© Daniel S. Weld 39
Neural Nets
• Can create powerful function approximators
Nonlinear
Possibly unstable
• For TD-learning, apply difference signal to
neural net output and perform back-
propagation

© Daniel S. Weld 40
Policy Search
• Represent policy in terms of Q functions
• Gradient search
Requires differentiability
Stochastic policies; softmax
• Hillclimbing
Tweak policy and evaluate by running
• Replaying experience

© Daniel S. Weld 41
Walking Demo

UT AustinVilla

Aibo walking – before & after 1000


training instances (across field)
… yields fastest known gait!
~World’s Best Player

• Neural network with 80 hidden units


Used computed features
• 300,000 games against self
© Daniel S. Weld 43
Applications to the Web
Focused Crawling

• Limited resources
Fetch most important pages first
• Topic specific search engines
Only want pages which are relevant to topic
• Minimize stale pages
Efficient re-fetch to keep index timely
How to track the rate of change of pages?

© Daniel S. Weld 44
Standard Web Search Engine Architecture
[Diagram: crawl the web → store documents, check for duplicates, extract links → DocIds → create an inverted index → inverted index / index servers; user query → search engine → show results to user.]

Slide adapted from Marty Hearst / UC Berkeley


© Daniel S. Weld 45
Performance
Rennie & McCallum (ICML-99)

© Daniel S. Weld 46
Methods
• Agent Types
Utility-based
Action-value based (Q function)
Reflex
• Passive Learning
Direct utility estimation
Adaptive dynamic programming
Temporal difference learning
• Active Learning
Choose a random action 1/n of the time
Exploration function added to the utility estimate (U+)
Q-learning (learn the action-value function directly – model-free)
• Generalization
Function approximation (linear function or neural networks)
• Policy Search
Stochastic policy repr / Softmax
Reusing past experience

© Daniel S. Weld 47
Summary
• Use reinforcement learning when
Model of world is unknown and/or rewards are delayed
• Temporal difference learning
Simple and efficient training rule
• Q-learning eliminates need for explicit T model
• Large state spaces can (sometimes!) be handled
Function approximation, using linear functions
Or neural nets

© Daniel S. Weld 48
