S18 Reinforcement Learning 2
The slides for INFOF311 are slightly modified versions of the slides of
the spring and summer CS188 sessions in 2021 and 2022
Recap: Reinforcement Learning
[Diagram: agent-environment loop: the agent observes state s and reward r, and sends actions a to the environment]
▪ Basic idea:
▪ Receive feedback in the form of rewards
▪ Agent’s utility is defined by the reward function
▪ Must (learn to) act so as to maximize expected rewards
▪ All learning is based on observed samples of outcomes!
Recap: Reinforcement Learning
▪ Still assume a Markov decision process (MDP):
▪ A set of states s ∈ S
▪ A set of actions (per state) A(s)
▪ A transition model T(s,a,s’)
▪ A reward function R(s,a,s’)
▪ Still looking for a policy π(s)
▪ New twist: T and R are unknown, so the agent must try actions and learn from experience (offline solution via planning vs. online solution via learning)
Recap: Passive vs Active RL
▪ Q-learning converges to the optimal policy even if you are acting suboptimally along the way (off-policy learning)
▪ Caveats:
▪ You have to explore enough
▪ You have to eventually make the learning rate small enough
▪ … but not decrease it too quickly (see the sketch below)
▪ Basically, in the limit, it doesn’t matter how you select actions (!)
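To make the caveats concrete, here is a minimal sketch of a tabular Q-learning update with a learning rate that decays per state-action pair; all names (q_values, visit_counts, alpha0) are illustrative rather than from the slides.

```python
# Minimal tabular Q-learning update with a decaying learning rate.
from collections import defaultdict

q_values = defaultdict(float)      # Q(s, a), defaults to 0
visit_counts = defaultdict(int)    # n(s, a), used to decay the learning rate
gamma = 0.9                        # discount factor
alpha0 = 0.5                       # initial learning rate

def q_update(s, a, r, s_next, actions):
    """One Q-learning update after observing the transition (s, a, r, s')."""
    visit_counts[(s, a)] += 1
    # Decay alpha slowly enough that learning still converges
    # (e.g. alpha ~ 1/n satisfies the usual stochastic-approximation conditions).
    alpha = alpha0 / visit_counts[(s, a)]
    target = r + gamma * max(q_values[(s_next, a2)] for a2 in actions)
    q_values[(s, a)] = (1 - alpha) * q_values[(s, a)] + alpha * target
```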
Exploration vs. Exploitation
▪ Exploration: try new things
▪ Exploitation: do what’s best given what you’ve learned so far
▪ Key point: pure exploitation often gets stuck in a rut and never
finds an optimal policy!
Exploration method 1: ε-greedy
▪ ε-greedy exploration
▪ Every time step, flip a biased coin
▪ With (small) probability ε, act randomly
▪ With (large) probability 1-ε, act on current policy
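A minimal sketch of ε-greedy action selection, assuming Q-values are stored in a dictionary keyed by (state, action); the function name and default ε are illustrative.

```python
import random

def epsilon_greedy_action(state, actions, q_values, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act on the current Q-values."""
    if random.random() < epsilon:
        return random.choice(actions)                                  # explore
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))   # exploit
```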
Exploration method 2: exploration functions
▪ Idea: instead of the raw Q-value, use an optimistic estimate f(q,n) that adds a bonus for rarely tried state-action pairs, e.g. f(q,n) = q + k/n where n(s,a) counts visits
▪ Regular Q-update:
▪ Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s’) + γ max_a’ Q(s’,a’)]
▪ Modified Q-update:
▪ Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s’) + γ max_a’ f(Q(s’,a’), n(s’,a’))]
▪ Note: this propagates the “bonus” back to states that lead to unknown states as well! (sketched below)
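A minimal sketch of the modified update, assuming an exploration function of the form f(q,n) = q + k/(n+1); the constant k, the count table, and all names are illustrative.

```python
from collections import defaultdict

q_values = defaultdict(float)    # Q(s, a)
counts = defaultdict(int)        # n(s, a): how often (s, a) has been tried
gamma, alpha, k = 0.9, 0.5, 1.0  # k is an illustrative exploration constant

def f(q, n):
    """Exploration function: an optimistic value for rarely tried state-action pairs."""
    return q + k / (n + 1)

def q_update_with_exploration(s, a, r, s_next, actions):
    """Modified Q-update: the target uses f(Q, n) instead of raw Q-values,
    so the exploration bonus propagates back to states leading to unknown ones."""
    counts[(s, a)] += 1
    target = r + gamma * max(f(q_values[(s_next, a2)], counts[(s_next, a2)])
                             for a2 in actions)
    q_values[(s, a)] = (1 - alpha) * q_values[(s, a)] + alpha * target
```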
Demo Q-learning – Exploration Function – Crawler
Approximate Q-Learning
Generalizing Across States
▪ Basic Q-Learning keeps a table of all Q-values
▪ In realistic problems there are far too many states to visit them all in training, or to hold a Q-table in memory
▪ Instead, we want to generalize: learn about a small number of training states from experience and transfer that knowledge to new, similar situations
[demo – RL pacman]
Example: Pacman
▪ Let’s say we discover through experience that this state is bad:
▪ In naïve Q-learning, we know nothing about this state:
▪ Or even this one!
Demo Q-Learning Pacman – Tiny – Watch All
Demo Q-Learning Pacman – Tiny – Silent Train
Demo Q-Learning Pacman – Tricky – Watch All
Feature-Based Representations
▪ Solution: describe a state using a vector of
features
▪ Features are functions from states to real
numbers (often 0/1) that capture important
properties of the state
▪ Example features:
▪ Distance to closest ghost (f_GST)
▪ Distance to closest dot
▪ Number of ghosts
▪ 1 / (distance to closest dot) (f_DOT)
▪ Is Pacman in a tunnel? (0/1)
▪ … etc.
▪ Can also describe a q-state (s, a) with features
(e.g., action moves closer to food)
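As an illustration, a hypothetical feature extractor for a Pacman-like q-state; the state layout (a dict of grid positions) and the action encoding are assumptions made only for this sketch.

```python
def manhattan(p, q):
    """Grid (Manhattan) distance between two (x, y) positions."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def extract_features(state, action):
    """Map a q-state (s, a) to a few real-valued features.
    `state` is assumed to be a dict with Pacman's position, ghost positions
    and food positions on a grid; `action` is a move vector such as (0, 1) for NORTH."""
    x, y = state["pacman"]
    dx, dy = action
    nxt = (x + dx, y + dy)                      # position after taking the action
    ghost_dist = min(manhattan(nxt, g) for g in state["ghosts"])
    dot_dist = min(manhattan(nxt, d) for d in state["food"])
    return {
        "f_GST": float(ghost_dist),             # distance to closest ghost
        "f_DOT": 1.0 / (1.0 + dot_dist),        # 1 / (distance to closest dot)
        "num_ghosts": float(len(state["ghosts"])),
    }
```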
Linear Value Functions
▪ Write the value functions as weighted linear combinations of features:
▪ V(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s)
▪ Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + … + wn fn(s,a)
▪ Advantage: our experience is summed up in a few powerful numbers
▪ Disadvantage: states may share features but have very different expected utility!
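A minimal sketch of evaluating such a linear Q-function as a dot product of weights and features; the example weights anticipate the Q-Pacman example below.

```python
def linear_q_value(weights, features):
    """Q_w(s,a) = sum_i w_i * f_i(s,a): a dot product of weights and feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

# Example with the two Pacman features:
weights = {"f_DOT": 4.0, "f_GST": -1.0}
print(linear_q_value(weights, {"f_DOT": 0.5, "f_GST": 1.0}))   # 4.0*0.5 - 1.0*1.0 = 1.0
```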
Updating a linear value function
▪ Original Q-learning rule tries to reduce prediction error at s,a:
▪ Q(s,a) ← Q(s,a) + α [R(s,a,s’) + γ max_a’ Q(s’,a’) - Q(s,a)]
▪ Instead, we update the weights to try to reduce the error at s,a:
▪ wi ← wi + α [R(s,a,s’) + γ max_a’ Q(s’,a’) - Q(s,a)] ∂Qw(s,a)/∂wi
= wi + α [R(s,a,s’) + γ max_a’ Q(s’,a’) - Q(s,a)] fi(s,a)
▪ Intuitive interpretation:
▪ Adjust weights of active features
▪ If something bad happens, blame the features that were active and decrease the value of states with those features; if something good happens, increase it!
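A minimal sketch of this weight update on a feature dictionary; the default α and γ are illustrative placeholders.

```python
def approx_q_update(weights, features, r, max_q_next, q_sa, alpha=0.01, gamma=1.0):
    """One approximate Q-learning step:
    w_i <- w_i + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)] * f_i(s,a)."""
    difference = r + gamma * max_q_next - q_sa    # error of the current prediction
    for name, value in features.items():          # adjust only the active features
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return weights
```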
Example: Q-Pacman
▪ Current weights: Q(s,a) = 4.0 f_DOT(s,a) – 1.0 f_GST(s,a)
▪ Observed transition: from state s, Pacman takes a = NORTH, receives r = –500, and lands in s’
▪ Features: f_DOT(s,NORTH) = 0.5, f_GST(s,NORTH) = 1.0
▪ Prediction: Q(s,NORTH) = 4.0·0.5 – 1.0·1.0 = +1
▪ Target: r + γ max_a’ Q(s’,a’) = –500 + 0 (all Q(s’,·) = 0)
▪ Difference: –500 – (+1) = –501
▪ Weight updates:
▪ w_DOT ← 4.0 + α [–501] 0.5
▪ w_GST ← –1.0 + α [–501] 1.0
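Plugging the slide’s numbers into the update rule as a sanity check; the concrete learning rate (0.004) is purely illustrative, since the slide leaves α symbolic.

```python
# Worked arithmetic for the Q-Pacman example (alpha = 0.004 is illustrative).
alpha = 0.004
w_dot, w_gst = 4.0, -1.0
f_dot, f_gst = 0.5, 1.0
q_sa = w_dot * f_dot + w_gst * f_gst     # Q(s, NORTH) = +1.0
difference = (-500 + 0.0) - q_sa         # target - prediction = -501.0
w_dot += alpha * difference * f_dot      # 4.0 - 1.002  =  2.998
w_gst += alpha * difference * f_gst      # -1.0 - 2.004 = -3.004
```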
▪ Problem: the feature-based policies that work best (win games, maximize utility) are often not the ones whose weights approximate V or Q best
▪ Solution: learn policies that maximize rewards, not the values that predict them
▪ Policy search: start with an ok solution (e.g. Q-learning), then fine-tune by hill climbing (or gradient ascent!) on the feature weights
Policy Search
▪ Simplest policy search:
▪ Start with an initial linear value function or Q-function
▪ Nudge each feature weight up and down and see if your policy is better than before
▪ Pros:
▪ Works well for partial observability / stochastic policies
▪ Cons:
▪ How do we tell whether the policy got better?
▪ Need to run many sample episodes!
▪ If there are a lot of features, this can be impractical (see the sketch below)
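A minimal sketch of this nudge-and-evaluate search; run_episode is an assumed environment hook that returns one episode’s total reward, and the step size, episode count, and iteration count are illustrative.

```python
import random

def evaluate_policy(weights, run_episode, num_episodes=50):
    """Average return of the policy induced by `weights` over sample episodes."""
    return sum(run_episode(weights) for _ in range(num_episodes)) / num_episodes

def hill_climb(weights, run_episode, step=0.1, iterations=100):
    """Nudge feature weights up and down; keep any change that improves the
    estimated return. Noisy evaluations are why many episodes are needed."""
    best = evaluate_policy(weights, run_episode)
    for _ in range(iterations):
        name = random.choice(list(weights))
        for delta in (+step, -step):
            candidate = dict(weights, **{name: weights[name] + delta})
            score = evaluate_policy(candidate, run_episode)
            if score > best:
                weights, best = candidate, score
                break
    return weights
```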