10. Learning Task

Q-LEARNING

CO-3
AIM

To familiarize students with the concepts of reinforcement learning, the Q-function, Q-learning, and exploration versus exploitation

INSTRUCTIONAL OBJECTIVES

This session is designed to introduce:

1. The Q-function and its relation to the optimal policy
2. The Q-learning update rule, exploration, and function approximation

LEARNING OUTCOMES

At the end of this session, you should be able to:


1. Describe the Q-function and its relation to the optimal policy
2. Apply the Q-learning update rule to simple examples
3. Explain the exploration/exploitation trade-off in learning Q
4. Outline extensions such as caching transitions and function approximation
Q-Function

• One approach to RL is to try to estimate V*(s).

• However, this approach requires you to know r(s, a) and δ(s, a).
• This is unrealistic in many real problems. What is the reward if a robot
exploring Mars decides to take a right turn?
• Fortunately, we can circumvent this problem by exploring and
experiencing how the world reacts to our actions; we learn about r and δ
from that experience.
• We want a function that directly learns good state-action pairs, i.e., what
action should I take in this state. We call this Q(s, a).
• Given Q(s, a) it is now trivial to execute the optimal policy, without
knowing r(s, a) and δ(s, a). We have:
π*(s) = argmax_a Q(s, a)

V*(s) = max_a Q(s, a)
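As a small illustrative sketch (not from the slides), extracting the greedy policy and the value from a learned Q-table is exactly the argmax/max above; the Q-table entries here are made-up numbers.

import numpy as np

# Hypothetical Q-table: Q[s, a] for 3 states and 2 actions (made-up values).
Q = np.array([[ 72.9,  81.0],
              [ 81.0,  90.0],
              [ 90.0, 100.0]])

def optimal_action(Q, s):
    # pi*(s) = argmax_a Q(s, a)
    return int(np.argmax(Q[s]))

def optimal_value(Q, s):
    # V*(s) = max_a Q(s, a)
    return float(np.max(Q[s]))

print(optimal_action(Q, 0), optimal_value(Q, 0))  # -> 1 81.0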
Example II

Check that
π*(s) = argmax_a Q(s, a)

V*(s) = max_a Q(s, a)
Q-Learning

• This still depends on r(s, a) and δ(s, a).

• However, imagine the robot is exploring its environment, trying new actions as it
goes.

• At every step it receives some reward r and observes the environment change
into a new state s' for action a.

• How can we use these observations (s, a, s', r) to learn a model? The Q-learning update is:

Q̂(s, a) ← r + γ max_a' Q̂(s', a'),   where s' = s_{t+1} is the next state
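A minimal sketch of this update for a tabular Q̂, using the deterministic form from the slides (the dictionary layout and names are our own):

GAMMA = 0.9  # discount factor, matching the worked example below

def q_update(Q_hat, s, a, r, s_next, actions):
    # Q_hat[(s, a)] <- r + gamma * max_a' Q_hat[(s_next, a')]
    best_next = max(Q_hat.get((s_next, a2), 0.0) for a2 in actions)
    Q_hat[(s, a)] = r + GAMMA * best_next

# Usage: after observing (s, a, s_next, r) while exploring
Q_hat = {}
q_update(Q_hat, s="s1", a="right", r=0.0, s_next="s2", actions=["up", "right", "down"])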
Q-Learning

• This equation continually estimates Q at state s consistent with an estimate
of Q at state s', one step in the future: temporal difference (TD) learning.

• Note that s' is closer to the goal, and hence more "reliable", but it is still an estimate itself.

• Updating estimates based on other estimates is called bootstrapping.

• We do an update after each state-action pair. I.e., we are learning online!

• We are learning useful things about explored state-action pairs. These are typically most
useful because they are likely to be encountered again.

• Under suitable conditions, these updates can actually be proved to converge to the real
answer.
Example Q-Learning

Q̂(s1, a_right) ← r + γ max_a' Q̂(s2, a')
              = 0 + 0.9 × max{66, 81, 100}
              = 90
Q-learning propagates Q-estimates 1-step backwards
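The arithmetic of this example can be checked directly; the action labels for state s2 are placeholders, and only the three Q-values 66, 81, 100 come from the slide.

GAMMA = 0.9
r = 0.0
q_s2 = {"a1": 66.0, "a2": 81.0, "a3": 100.0}  # current estimates Q̂(s2, a')
q_s1_right = r + GAMMA * max(q_s2.values())
print(q_s1_right)  # 90.0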
Exploration / Exploitation

• It is very important that the agent does not simply follow the current
policy when learning Q (off-policy learning). The reason is that you may
get stuck in a suboptimal solution, i.e., there may be other solutions
out there that you have never seen.

• Hence it is good to try new things every now and then, e.g., pick actions
with Boltzmann (softmax) probabilities:

P(a | s) ∝ e^(Q̂(s, a) / T)

• If T is large there is lots of exploring; if T is small the agent mostly follows
the current policy. One can decrease T over time.
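A sketch of this Boltzmann (softmax) action selection; the Q-values passed in are placeholders:

import numpy as np

def boltzmann_action(q_values, T, rng=None):
    # P(a|s) proportional to exp(Q̂(s,a)/T); subtract the max for numerical stability
    rng = rng or np.random.default_rng()
    q = np.asarray(q_values, dtype=float)
    prefs = np.exp((q - q.max()) / T)
    return rng.choice(len(q), p=prefs / prefs.sum())

# Large T: close to uniform exploration; small T: close to greedy exploitation.
print(boltzmann_action([66.0, 81.0, 100.0], T=50.0))
print(boltzmann_action([66.0, 81.0, 100.0], T=0.1))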
Improvements

• One can trade off memory and computation by caching observed transitions
(s, a, s', r). After a while, as Q̂(s', a') has changed, you can "replay" the
update on the stored transitions.

• One can actively search for state-action pairs for which Q(s, a)
is expected to change a lot (prioritized sweeping).

• One can do updates along the sampled path much further back
than just one step (TD(λ) learning).
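A rough sketch of the caching/replay idea; the buffer structure and names are our own, and this is not a prioritized-sweeping or TD(λ) implementation.

import random

GAMMA = 0.9
buffer = []  # cached transitions (s, a, s_next, r)

def store(s, a, s_next, r):
    buffer.append((s, a, s_next, r))

def replay(Q_hat, actions, n=10):
    # Re-apply the Q-learning update to randomly chosen cached transitions,
    # using the current Q_hat, which may have changed since the transition was observed.
    for s, a, s_next, r in random.sample(buffer, min(n, len(buffer))):
        best_next = max(Q_hat.get((s_next, a2), 0.0) for a2 in actions)
        Q_hat[(s, a)] = r + GAMMA * best_next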
Extensions

• To deal with stochastic environments, we need to maximize the
expected future discounted reward:

Q(s, a) = E[ r(s, a) + γ max_a' Q(s', a') ]

• Often the state space is too large to deal with all states. In this case we
need to learn a function approximation: Q(s, a) ≈ f_θ(s, a)

• Neural networks trained with back-propagation have been quite successful.

• For instance, TD-Gammon is a backgammon program that plays at expert level. Its state
space is very large; it was trained by playing against itself, uses a neural network to
approximate the value function, and uses TD(λ) for learning.
More on Function Approximation

• For instance, a linear function: Q_θ(s, a) = Σ_k θ_k φ_k(s, a)

• The features φ_k are fixed measurements of the state (e.g., the number of stones
on the board).
• We only learn the parameters θ.

• Update rule (start in state s, take action a, observe reward r and end
up in state s'):

θ_k ← θ_k + α [ r + γ max_a' Q_θ(s', a') − Q_θ(s, a) ] φ_k(s, a)
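A sketch of this linear-approximation update; phi is a hypothetical feature function returning the vector φ(s, a), and the particular step size ALPHA = 0.1 is an arbitrary choice.

import numpy as np

ALPHA, GAMMA = 0.1, 0.9

def q_value(theta, phi, s, a):
    # Q_theta(s, a) = sum_k theta_k * phi_k(s, a)
    return float(theta @ phi(s, a))

def linear_q_update(theta, phi, s, a, r, s_next, actions):
    # theta_k <- theta_k + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)] * phi_k(s,a)
    target = r + GAMMA * max(q_value(theta, phi, s_next, a2) for a2 in actions)
    td_error = target - q_value(theta, phi, s, a)
    return theta + ALPHA * td_error * phi(s, a)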
Conclusion

• Reinforcement learning addresses a very broad and relevant question:
how can we learn to survive in our environment?

• We have looked at Q-learning, which simply learns from experience.
• No model of the world is needed.
• We made simplifying assumptions: e.g., the state of the world only
depends on the last state and action. This is the Markov assumption. The
resulting model is called a Markov Decision Process (MDP).
• We assumed deterministic dynamics and a deterministic reward function, but the
world really is stochastic.
• There are many extensions to speed up learning.
• There have been many successful real-world applications.
Applications of
Reinforcement Learning

• Robotics for industrial automation
• Business strategy planning
• Machine learning and data processing
• Training systems that provide customized instruction and materials
according to the requirements of students
• Aircraft control and robot motion control
• Traffic light control
• A robot cleaning a room and recharging its battery
• Robot soccer
• Deciding how to invest in shares
• Modeling the economy through rational agents
• Learning how to fly a helicopter
• Scheduling planes to their destinations
THANK YOU

TEAM ML
