7. Reinforcement Learning: Introduction, The Learning Task, Q-Learning

Reinforcement Learning

What is learning?
Learning takes place as a result of interaction between an agent and the world. The idea behind learning is that
◦ Percepts received by an agent should be used not only for acting, but also for improving the agent's ability to behave optimally in the future to achieve its goal.
Overview
• Supervised Learning: Immediate feedback (labels provided for every input).

• Unsupervised Learning: No feedback (no labels provided).

• Reinforcement Learning: Delayed scalar feedback (a number called reward).

• RL deals with agents that must sense & act upon their environment.
This combines classical AI and machine learning techniques.
It is the most comprehensive problem setting.
• Examples:
• A robot cleaning my room and recharging its battery
• Robot-soccer
• How to invest in shares
• Modeling the economy through rational agents
• Learning how to fly a helicopter
• Scheduling planes to their destinations
• and so on
Learning types
◦ Supervised learning:
a situation in which sample (input, output) pairs of the function to be learned can be perceived or are given
◦ You can think of it as if there were a kind teacher
◦ Reinforcement learning:
when the agent acts on its environment, it receives some evaluation of its action (a reinforcement), but is not told which action is the correct one to achieve its goal
Reinforcement learning
It is about taking suitable actions to maximize reward in a particular situation.
It is employed by various software and machines to find the best possible behavior or path to take in a specific situation.
Reinforcement learning differs from supervised learning in that supervised training data comes with an answer key, so the model is trained with the correct answers, whereas in reinforcement learning there is no answer key and the reinforcement agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its experience.
Consider an example with a robot, a diamond, and fire.
The goal of the robot is to reach the reward, the diamond, while avoiding the hurdles, the fire.
The robot learns by trying all the possible paths and then choosing the path that gives it the reward with the fewest hurdles.
Each right step gives the robot a reward and each wrong step subtracts from the robot's reward.
The total reward is calculated when it reaches the final reward, the diamond.
Various practical applications of Reinforcement Learning:
RL can be used in robotics for industrial automation.
RL can be used in machine learning and data processing.
RL can be used to create training systems that provide custom instruction and materials according to the requirements of students.
Reinforcement learning is studied in many disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, and statistics. In the operations research and control literature, reinforcement learning is called approximate dynamic programming, or neuro-dynamic programming.
The environment is typically stated in the form of a Markov decision
process (MDP), because many reinforcement learning algorithms for
this context use dynamic programming techniques.
The main difference between the classical dynamic programming
methods and reinforcement learning algorithms is that the latter do not
assume knowledge of an exact mathematical model of the MDP and
they target large MDPs where exact methods become infeasible.
Types of Algorithm
Reinforcement learning
Task
Learn how to behave successfully to achieve a goal while
interacting with an external environment
◦ Learn via experiences!
Examples
◦ Game playing: the player knows whether it wins or loses, but does not know how to move at each step
◦ Control: a traffic system can measure the delay of cars, but does not know how to decrease it.
RL is learning from interaction
RL model
◦ Each percept (e) is enough to determine the State (the state is accessible)
◦ The agent can decompose the Reward component from a percept.
◦ The agent's task: to find an optimal policy, mapping states to actions, that maximizes a long-run measure of the reinforcement
◦ Think of reinforcement as reward
◦ Can be modeled as an MDP!
Review of MDP model
An MDP model is a tuple <S, A, T, R>:
• S: the set of states
• A: the set of actions
• T(s,a,s') = P(s'|s,a): the probability of transitioning from s to s' given action a
• R(s,a): the expected reward for taking action a in state s

R(s,a) = Σ_{s'} P(s'|s,a) r(s,a,s') = Σ_{s'} T(s,a,s') r(s,a,s')

[Figure: the agent/environment interaction loop (the agent observes State and Reward and emits an Action), and a sample trajectory s0 --a0/r0--> s1 --a1/r1--> s2 --a2/r2--> s3]
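As a quick illustration of the expected-reward formula above, here is a minimal Python sketch; the transition and reward dictionaries are hypothetical example data, not part of the slides.

```python
# Minimal sketch of R(s,a) = sum over s' of T(s,a,s') * r(s,a,s').
T = {("s0", "a0"): {"s1": 0.8, "s0": 0.2}}                 # T[(s,a)][s'] = P(s'|s,a)  (hypothetical)
r = {("s0", "a0", "s1"): 1.0, ("s0", "a0", "s0"): 0.0}     # r[(s,a,s')] = reward      (hypothetical)

def expected_reward(s, a):
    """R(s,a) = sum_{s'} T(s,a,s') * r(s,a,s')."""
    return sum(p * r[(s, a, s_next)] for s_next, p in T[(s, a)].items())

print(expected_reward("s0", "a0"))  # 0.8 * 1.0 + 0.2 * 0.0 = 0.8
```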
Model-based vs. model-free approaches
But we don't know anything about the environment model, i.e. the transition function T(s,a,s').
There are two approaches:
◦ Model-based RL:
learn the model, and use it to derive the optimal policy,
e.g. the adaptive dynamic programming (ADP) approach (a counting-based sketch follows below)
◦ Model-free RL:
derive the optimal policy without learning the model,
e.g. the LMS and temporal-difference (TD) approaches
Which one is better?
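To make the model-based idea concrete, below is a minimal sketch (my own illustration, not code from the slides) of how an ADP-style agent could estimate T(s,a,s') and R(s,a) from observed transitions by counting; the learned model would then be fed to value iteration or policy iteration.

```python
from collections import defaultdict

# Counts of observed transitions and accumulated rewards (hypothetical helper structures).
N_sa = defaultdict(int)      # N_sa[(s, a)]       - times action a was taken in state s
N_sas = defaultdict(int)     # N_sas[(s, a, s')]  - times that transition led to s'
R_sum = defaultdict(float)   # R_sum[(s, a)]      - total reward received for (s, a)

def record(s, a, reward, s_next):
    """Update the empirical model after observing one transition (s, a, reward, s')."""
    N_sa[(s, a)] += 1
    N_sas[(s, a, s_next)] += 1
    R_sum[(s, a)] += reward

def T_hat(s, a, s_next):
    """Estimated transition probability T(s,a,s')."""
    return N_sas[(s, a, s_next)] / N_sa[(s, a)] if N_sa[(s, a)] else 0.0

def R_hat(s, a):
    """Estimated expected reward R(s,a)."""
    return R_sum[(s, a)] / N_sa[(s, a)] if N_sa[(s, a)] else 0.0
```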


Passive learning vs. active learning
Passive learning
◦ The agent simply watches the world going by and tries to learn the utilities of being in various states
Active learning
◦ The agent does not simply watch, but also acts
Example environment
Passive learning scenario
The agent sees sequences of state transitions and associated rewards
◦ The environment generates state transitions and the agent perceives them
e.g. (1,1) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3)[+1]
(1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (1,2) → (1,1) → (2,1) → (3,1) → (4,1) → (4,2)[-1]
Key idea: update the utility value using the given training sequences (an averaging sketch follows below).
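One simple way to do this update, in the spirit of the LMS (direct utility estimation) approach mentioned earlier, is to average the observed reward-to-go for each state over many training sequences. The sketch below is my own illustration under that assumption; the trajectories are the two example sequences above, with a hypothetical reward of 0 in every non-terminal state.

```python
from collections import defaultdict

def estimate_utilities(trajectories, gamma=1.0):
    """Direct (LMS-style) utility estimation: U(s) is the average
    discounted reward-to-go observed from s across all trajectories."""
    totals, counts = defaultdict(float), defaultdict(int)
    for traj in trajectories:                    # traj = [(state, reward), ...]
        # Compute reward-to-go backwards through the trajectory.
        G = 0.0
        returns = []
        for state, reward in reversed(traj):
            G = reward + gamma * G
            returns.append((state, G))
        for state, G in returns:
            totals[state] += G
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# The two example sequences, with reward 0 everywhere except the terminal states.
t1 = [((1,1),0), ((1,2),0), ((1,3),0), ((2,3),0), ((3,3),0), ((4,3),+1)]
t2 = [((1,1),0), ((1,2),0), ((1,3),0), ((1,2),0), ((1,3),0), ((1,2),0),
      ((1,1),0), ((2,1),0), ((3,1),0), ((4,1),0), ((4,2),-1)]
print(estimate_utilities([t1, t2]))
```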
The Task
• To learn an optimal policy that maps states of the world to actions of the agent.
I.e., if this patch of room is dirty, I clean it; if my battery is empty, I recharge it.

π : S → A

• What is it that the agent tries to optimize?

Answer: the total future discounted reward:

V^π(s_t) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + ...
         = Σ_{i=0}^{∞} γ^i r_{t+i},   0 ≤ γ < 1

Note: immediate reward is worth more than future reward.

What would happen to a mouse in a maze with γ = 0?
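As a sanity check on the discounted-return formula, here is a tiny Python sketch that computes V^π(s_t) for a finite list of rewards; the reward sequence is made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Total discounted reward: sum of gamma**i * r_{t+i} for i = 0, 1, 2, ..."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 10.0]          # hypothetical reward sequence r_t, r_{t+1}, ...
print(discounted_return(rewards, 0.9))   # 1 + 0 + 0 + 0.9**3 * 10 = 8.29
print(discounted_return(rewards, 0.0))   # with gamma = 0 only the immediate reward counts
```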
Value Function

• Let's say we have access to the optimal value function V*(s) that computes the total future discounted reward.

• What would be the optimal policy π*(s)?

• Answer: we choose the action that maximizes:

π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ]

• We assume that we know what the reward will be if we perform action "a" in state "s": r(s,a)

• We also assume we know what the next state of the world will be if we perform action "a" in state "s": s_{t+1} = δ(s_t, a)
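A minimal sketch of this one-step lookahead, assuming we really do know the reward function r(s,a) and the deterministic transition function δ(s,a), passed in here as hypothetical Python callables:

```python
def optimal_action(s, actions, r, delta, V_star, gamma=0.9):
    """pi*(s) = argmax over a of [ r(s,a) + gamma * V*(delta(s,a)) ]."""
    return max(actions, key=lambda a: r(s, a) + gamma * V_star(delta(s, a)))
```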
Example II
Find your way to the goal.
Passive learning scenario
Q-Function

• One approach to RL is then to try to estimate V*(s) via the Bellman equation:

V*(s) = max_a [ r(s,a) + γ V*(δ(s,a)) ]

• However, this approach requires you to know r(s,a) and δ(s,a).

• This is unrealistic in many real problems. What is the reward if a robot is exploring Mars and decides to take a right turn?

• Fortunately, we can circumvent this problem by exploring and experiencing how the world reacts to our actions, without having to learn r and δ explicitly.

• We want a function that directly learns good state-action pairs, i.e. what action should I take in this state. We call this Q(s,a).

• Given Q(s,a) it is now trivial to execute the optimal policy, without knowing r(s,a) and δ(s,a). We have:

π*(s) = argmax_a Q(s,a)

V*(s) = max_a Q(s,a)
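For a tabular Q, extracting the greedy policy and the value function is a one-liner each; a minimal sketch, where the dictionary-of-dictionaries layout for Q is my own assumption:

```python
# Q[s][a] holds the current estimate of Q(s, a) in a tabular representation.
def greedy_action(Q, s):
    """pi*(s) = argmax_a Q(s, a)."""
    return max(Q[s], key=Q[s].get)

def state_value(Q, s):
    """V*(s) = max_a Q(s, a)."""
    return max(Q[s].values())
```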
Q-Learning

Q(s,a) ≡ r(s,a) + γ V*(δ(s,a))
       = r(s,a) + γ max_{a'} Q(δ(s,a), a')

• This still depends on r(s,a) and δ(s,a).

• However, imagine the robot is exploring its environment, trying new actions as it goes.

• At every step it receives some reward "r", and it observes the environment change into a new state s' for action a.
How can we use these observations (s, a, s', r) to learn a model?

Q̂(s,a) ← r + γ max_{a'} Q̂(s', a'),   where s' = s_{t+1}
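A minimal sketch of this update rule for a deterministic environment, using a table of Q̂ values stored in a Python dictionary; the helper names, the default value of 0, and the discount factor are my own assumptions.

```python
from collections import defaultdict

Q_hat = defaultdict(float)   # Q_hat[(s, a)] starts at 0 for every state-action pair
GAMMA = 0.9                  # discount factor (assumed value)

def q_update(s, a, r, s_next, actions):
    """Deterministic Q-learning update: Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    best_next = max(Q_hat[(s_next, a_next)] for a_next in actions)
    Q_hat[(s, a)] = r + GAMMA * best_next
```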
Another model-free method: TD-Q learning

Define the Q-value function:

U(s) = max_a Q(s,a)

Q-value function updating rule:

U(s) ← max_a ( R(s,a) + γ Σ_{s'} T(s,a,s') U(s') )

Q(s,a) = R(s,a) + γ Σ_{s'} T(s,a,s') U(s')

Q(s,a) = R(s,a) + γ Σ_{s'} T(s,a,s') max_{a'} Q(s',a')    <*>

Key idea of TD-Q learning
◦ Combine <*> with the temporal difference approach
◦ The updating rule:

Q(s,a) ← Q(s,a) + α ( r + γ max_{a'} Q(s',a') - Q(s,a) )

◦ Action selection: a = argmax_a Q(s,a)
TD-Q learning agent algorithm

For each pair (s, a), initialize Q(s,a)
Observe the current state s
Loop forever
{
    Select an action a = argmax_a Q(s,a) and execute it
    Receive the immediate reward r and observe the new state s'
    Update Q(s,a):
        Q(s,a) ← Q(s,a) + α ( r + γ max_{a'} Q(s',a') - Q(s,a) )
    s = s'
}
