10. Learning Task
CO-3
AIM
To familiarize students with the concepts of unsupervised machine learning, hierarchical clustering,
distance functions, and data standardization
INSTRUCTIONAL OBJECTIVES
LEARNING OUTCOMES
$V^*(s) = \max_a Q(s,a)$
Example II
Check that
$\pi^*(s) = \operatorname{argmax}_a Q(s,a)$
$V^*(s) = \max_a Q(s,a)$
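To make these two definitions concrete, here is a minimal sketch (my own illustration, not from the slides) that recovers $V^*(s)$ and the greedy policy from a small tabular Q-function in Python; the state and action names and the values are made up:

# Illustrative only: tabular Q-values for a single state (made-up numbers).
Q = {
    ("s1", "left"): 66.0,
    ("s1", "up"): 81.0,
    ("s1", "right"): 100.0,
}
actions = ["left", "up", "right"]

def value(state, actions, Q):
    # V*(s) = max_a Q(s, a)
    return max(Q[(state, a)] for a in actions)

def greedy_policy(state, actions, Q):
    # pi*(s) = argmax_a Q(s, a)
    return max(actions, key=lambda a: Q[(state, a)])

print(value("s1", actions, Q))          # 100.0
print(greedy_policy("s1", actions, Q))  # right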
Q-Learning
• However, imagine the robot is exploring its environment, trying new actions as it
goes.
• At every step it takes an action a, receives some reward r, and observes the
environment change into a new state s', where $s' = s_{t+1}$.
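The quantity being learned here is the one-step Q-learning estimate; in its common textbook form (for a deterministic world with learning rate 1, as in the example below) the update is

$Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s',a')$

i.e., the new estimate for (s, a) combines the reward just received with the discounted best estimate currently available at the next state s'.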
• Note that s’ is closer to goal, and hence more “reliable”, but still an estimate itself.
• We are learning useful things about explored state-action pairs. These are typically most
useful because they are likely to be encountered again.
• Under suitable conditions, these updates can actually be proved to converge to the real
answer.
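As a concrete sketch of that update (my own Python illustration, assuming a simple deterministic environment and a tabular Q):

from collections import defaultdict

GAMMA = 0.9              # discount factor (as in the worked example below)
Q = defaultdict(float)   # Q[(state, action)] -> estimate, defaults to 0

def q_update(state, action, reward, next_state, next_actions):
    # Q(s, a) <- r + gamma * max_a' Q(s', a')
    best_next = max(Q[(next_state, a)] for a in next_actions)
    Q[(state, action)] = reward + GAMMA * best_next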
Example Q-Learning
$0 + 0.9 \cdot \max\{66, 81, 100\} = 90$
Q-learning propagates Q-estimates 1-step backwards
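Using the q_update sketch above reproduces this number (the state and action names are made up for illustration):

Q[("s2", "a1")], Q[("s2", "a2")], Q[("s2", "a3")] = 66.0, 81.0, 100.0
q_update("s1", "right", reward=0.0, next_state="s2",
         next_actions=["a1", "a2", "a3"])
print(Q[("s1", "right")])   # 90.0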
Exploration / Exploitation
• It is very important that the agent does not simply follow the current
policy when learning Q (off-policy learning). The reason is that you may
get stuck in a suboptimal solution, i.e., there may be other solutions
out there that you have never seen. One common remedy, sketched after this
list, is to mostly act greedily but occasionally take a random exploratory action.
• One can actively search for state-action pairs for which Q(s,a)
is expected to change a lot (prioritized sweeping).
• One can do updates along the sampled path much further back
than just one step (TD($\lambda$) learning).
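A minimal sketch of the exploration idea mentioned above, using ε-greedy action selection (a common choice; the constant and helper name are my own):

import random

EPSILON = 0.1   # exploration probability (assumed value)

def epsilon_greedy(state, actions, Q, epsilon=EPSILON):
    # Explore with probability epsilon, otherwise exploit the current estimates.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])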
Extensions
• Often the state space is too large to deal with all states explicitly. In this
case we need to learn a function approximation: $Q(s,a) \approx f_\theta(s,a)$
• For instance, TD-Gammon is a backgammon program that plays at expert level. Its
state space is very large; it was trained by playing against itself, uses a neural
network to approximate the value function, and uses TD($\lambda$) for learning.
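As the simplest illustration of such an approximator (a linear one, rather than TD-Gammon's neural network; the feature choices here are placeholders of my own):

import numpy as np

def features(state, action):
    # phi(s, a): fixed, hand-crafted measurements of the state/action pair.
    # Placeholder values; real features are domain-specific (e.g. stone counts).
    return np.array([1.0, float(state), float(action)])

def q_approx(theta, state, action):
    # Q(s, a) ~ f_theta(s, a) = theta . phi(s, a)
    return theta @ features(state, action)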
More on Function Approximation
• The features $\phi$ are fixed measurements of the state (e.g., the number of
stones on the board).
• We only learn the parameters $\theta$.
• Update rule (start in state s, take action a, observe reward r, and end
up in state s'):
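In its common textbook form for a linear approximator (an assumption on my part; the slides' exact rule may differ in details), the update is

$\theta \leftarrow \theta + \alpha\,\big(r + \gamma \max_{a'} Q_\theta(s',a') - Q_\theta(s,a)\big)\,\phi(s,a)$

and a matching sketch, building on the q_approx helper above:

GAMMA = 0.9           # discount factor
ALPHA = 0.1           # learning rate (assumed value)
theta = np.zeros(3)   # parameter vector, same length as the feature vector

def td_update(theta, s, a, r, s_next, actions):
    # theta <- theta + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)) * phi(s, a)
    target = r + GAMMA * max(q_approx(theta, s_next, ap) for ap in actions)
    td_error = target - q_approx(theta, s, a)
    return theta + ALPHA * td_error * features(s, a)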
Conclusion
TEAM ML