7. Reinforcement Learning: Introduction, The Learning Task, Q-Learning

Reinforcement Learning

What is learning?
Learning takes place as a result of interaction between an agent and the world. The idea behind learning is that
◦ Percepts received by an agent should be used not only for acting, but also for improving the agent's ability to behave optimally in the future to achieve its goal.
Overview
• Supervised Learning: Immediate feedback (labels provided for every input).

• Unsupervised Learning: No feedback (no labels provided).

• Reinforcement Learning: Delayed scalar feedback (a number called reward).

• RL deals with agents that must sense & act upon their environment.
This combines classical AI and machine learning techniques.
It is the most comprehensive problem setting.
• Examples:
• A robot cleaning my room and recharging its battery
• Robot-soccer
• How to invest in shares
• Modeling the economy through rational agents
• Learning how to fly a helicopter
• Scheduling planes to their destinations
• and so on
Learning types
◦ Supervised learning:
a situation in which sample (input, output) pairs of the function to be learned can be perceived or are given
◦ You can think of it as if there were a kind teacher
◦ Reinforcement learning:
when the agent acts on its environment, it receives some evaluation of its action (a reinforcement), but is not told which action is the correct one to achieve its goal
Reinforcement learning
It is about taking suitable actions to maximize reward in a particular situation.
It is employed by various software and machines to find the best possible behavior or path to take in a specific situation.
Reinforcement learning differs from supervised learning in that supervised training data comes with an answer key, so the model is trained with the correct answers, whereas in reinforcement learning there is no answer key and the reinforcement agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its experience.
Consider an example with a robot, a diamond, and fire.
The goal of the robot is to reach the reward, the diamond, while avoiding the hurdles, the fire.
The robot learns by trying all the possible paths and then choosing the path that gives it the reward with the fewest hurdles.
Each right step gives the robot a reward and each wrong step subtracts from the robot's reward.
The total reward is calculated when it reaches the final reward, the diamond.
Various practical applications of Reinforcement Learning:
RL can be used in robotics for industrial automation.
RL can be used in machine learning and data processing.
RL can be used to create training systems that provide custom instruction and materials according to the requirements of students.
Reinforcement learning is studied in many disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, and statistics. In the operations research and control literature, reinforcement learning is called approximate dynamic programming, or neuro-dynamic programming.
The environment is typically stated in the form of a Markov decision
process (MDP), because many reinforcement learning algorithms for
this context use dynamic programming techniques.
The main difference between the classical dynamic programming
methods and reinforcement learning algorithms is that the latter do not
assume knowledge of an exact mathematical model of the MDP and
they target large MDPs where exact methods become infeasible.
Types of Algorithm
Reinforcement learning
Task
Learn how to behave successfully to achieve a goal while
interacting with an external environment
◦ Learn via experiences!
Examples
◦ Game playing: the player knows whether it wins or loses, but does not know how to move at each step
◦ Control: a traffic system can measure the delay of cars, but does not know how to decrease it.
RL is learning from interaction
RL model
◦ Each percept (e) is enough to determine the State (the state is accessible)
◦ The agent can decompose the Reward component from a percept.
◦ The agent's task: to find an optimal policy, mapping states to actions, that maximizes a long-run measure of the reinforcement
◦ Think of reinforcement as reward
◦ Can be modeled as an MDP!
Review of MDP model
An MDP model is a tuple <S, A, T, R>:
• S: the set of states
• A: the set of actions
• T(s,a,s') = P(s'|s,a): the probability of transitioning from s to s' given action a
• R(s,a): the expected reward for taking action a in state s

R(s,a) = Σ_{s'} P(s'|s,a) r(s,a,s') = Σ_{s'} T(s,a,s') r(s,a,s')

[Figure: the agent/environment interaction loop (the agent observes State and Reward and emits an Action), and a sample trajectory s0 --a0/r0--> s1 --a1/r1--> s2 --a2/r2--> s3]
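As a quick illustration of the expected-reward formula above, here is a minimal Python sketch; the transition and reward dictionaries are hypothetical example data, not part of the slides.

```python
# Minimal sketch of R(s,a) = sum over s' of T(s,a,s') * r(s,a,s').
T = {("s0", "a0"): {"s1": 0.8, "s0": 0.2}}                 # T[(s,a)][s'] = P(s'|s,a)  (hypothetical)
r = {("s0", "a0", "s1"): 1.0, ("s0", "a0", "s0"): 0.0}     # r[(s,a,s')] = reward      (hypothetical)

def expected_reward(s, a):
    """R(s,a) = sum_{s'} T(s,a,s') * r(s,a,s')."""
    return sum(p * r[(s, a, s_next)] for s_next, p in T[(s, a)].items())

print(expected_reward("s0", "a0"))  # 0.8 * 1.0 + 0.2 * 0.0 = 0.8
```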
Model-based vs. model-free approaches
But we don't know anything about the environment model, i.e. the transition function T(s,a,s').
There are two approaches:
◦ Model-based RL:
learn the model, and use it to derive the optimal policy,
e.g. the adaptive dynamic programming (ADP) approach (a counting-based sketch follows below)
◦ Model-free RL:
derive the optimal policy without learning the model,
e.g. the LMS and temporal-difference (TD) approaches
Which one is better?
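To make the model-based idea concrete, below is a minimal sketch (my own illustration, not code from the slides) of how an ADP-style agent could estimate T(s,a,s') and R(s,a) from observed transitions by counting; the learned model would then be fed to value iteration or policy iteration.

```python
from collections import defaultdict

# Counts of observed transitions and accumulated rewards (hypothetical helper structures).
N_sa = defaultdict(int)      # N_sa[(s, a)]       - times action a was taken in state s
N_sas = defaultdict(int)     # N_sas[(s, a, s')]  - times that transition led to s'
R_sum = defaultdict(float)   # R_sum[(s, a)]      - total reward received for (s, a)

def record(s, a, reward, s_next):
    """Update the empirical model after observing one transition (s, a, reward, s')."""
    N_sa[(s, a)] += 1
    N_sas[(s, a, s_next)] += 1
    R_sum[(s, a)] += reward

def T_hat(s, a, s_next):
    """Estimated transition probability T(s,a,s')."""
    return N_sas[(s, a, s_next)] / N_sa[(s, a)] if N_sa[(s, a)] else 0.0

def R_hat(s, a):
    """Estimated expected reward R(s,a)."""
    return R_sum[(s, a)] / N_sa[(s, a)] if N_sa[(s, a)] else 0.0
```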


Passive learning vs. active learning
Passive learning
◦ The agent simply watches the world going by and tries to learn the utilities of being in various states
Active learning
◦ The agent does not simply watch, but also acts
Example environment
Passive learning scenario
The agent sees sequences of state transitions and associated rewards
◦ The environment generates state transitions and the agent perceives them
e.g. (1,1) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3)[+1]
(1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (1,2) → (1,1) → (2,1) → (3,1) → (4,1) → (4,2)[-1]
Key idea: update the utility value using the given training sequences (an averaging sketch follows below).
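One simple way to do this update, in the spirit of the LMS (direct utility estimation) approach mentioned earlier, is to average the observed reward-to-go for each state over many training sequences. The sketch below is my own illustration under that assumption; the trajectories are the two example sequences above, with a hypothetical reward of 0 in every non-terminal state.

```python
from collections import defaultdict

def estimate_utilities(trajectories, gamma=1.0):
    """Direct (LMS-style) utility estimation: U(s) is the average
    discounted reward-to-go observed from s across all trajectories."""
    totals, counts = defaultdict(float), defaultdict(int)
    for traj in trajectories:                    # traj = [(state, reward), ...]
        # Compute reward-to-go backwards through the trajectory.
        G = 0.0
        returns = []
        for state, reward in reversed(traj):
            G = reward + gamma * G
            returns.append((state, G))
        for state, G in returns:
            totals[state] += G
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# The two example sequences, with reward 0 everywhere except the terminal states.
t1 = [((1,1),0), ((1,2),0), ((1,3),0), ((2,3),0), ((3,3),0), ((4,3),+1)]
t2 = [((1,1),0), ((1,2),0), ((1,3),0), ((1,2),0), ((1,3),0), ((1,2),0),
      ((1,1),0), ((2,1),0), ((3,1),0), ((4,1),0), ((4,2),-1)]
print(estimate_utilities([t1, t2]))
```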
The Task
• To learn an optimal policy that maps states of the world to actions of the agent.
I.e., if this patch of room is dirty, I clean it; if my battery is empty, I recharge it.

π : S → A

• What is it that the agent tries to optimize?

Answer: the total future discounted reward:

V^π(s_t) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + ...
         = Σ_{i=0}^{∞} γ^i r_{t+i},   0 ≤ γ < 1

Note: immediate reward is worth more than future reward.

What would happen to a mouse in a maze with γ = 0?
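As a sanity check on the discounted-return formula, here is a tiny Python sketch that computes V^π(s_t) for a finite list of rewards; the reward sequence is made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Total discounted reward: sum of gamma**i * r_{t+i} for i = 0, 1, 2, ..."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 10.0]          # hypothetical reward sequence r_t, r_{t+1}, ...
print(discounted_return(rewards, 0.9))   # 1 + 0 + 0 + 0.9**3 * 10 = 8.29
print(discounted_return(rewards, 0.0))   # with gamma = 0 only the immediate reward counts
```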
Value Function

• Let's say we have access to the optimal value function V*(s) that computes the total future discounted reward.

• What would be the optimal policy π*(s)?

• Answer: we choose the action that maximizes:

π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ]

• We assume that we know what the reward will be if we perform action "a" in state "s": r(s,a)

• We also assume we know what the next state of the world will be if we perform action "a" in state "s": s_{t+1} = δ(s_t, a)
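A minimal sketch of this one-step lookahead, assuming we really do know the reward function r(s,a) and the deterministic transition function δ(s,a), passed in here as hypothetical Python callables:

```python
def optimal_action(s, actions, r, delta, V_star, gamma=0.9):
    """pi*(s) = argmax over a of [ r(s,a) + gamma * V*(delta(s,a)) ]."""
    return max(actions, key=lambda a: r(s, a) + gamma * V_star(delta(s, a)))
```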
Example II
Find your way to the goal.
Passive learning scenario
Q-Function

• One approach to RL is then to try to estimate V*(s) via the Bellman equation:

V*(s) = max_a [ r(s,a) + γ V*(δ(s,a)) ]

• However, this approach requires you to know r(s,a) and δ(s,a).

• This is unrealistic in many real problems. What is the reward if a robot is exploring Mars and decides to take a right turn?

• Fortunately, we can circumvent this problem by exploring and experiencing how the world reacts to our actions, without having to learn r and δ explicitly.

• We want a function that directly learns good state-action pairs, i.e. what action should I take in this state. We call this Q(s,a).

• Given Q(s,a) it is now trivial to execute the optimal policy, without knowing r(s,a) and δ(s,a). We have:

π*(s) = argmax_a Q(s,a)

V*(s) = max_a Q(s,a)
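For a tabular Q, extracting the greedy policy and the value function is a one-liner each; a minimal sketch, where the dictionary-of-dictionaries layout for Q is my own assumption:

```python
# Q[s][a] holds the current estimate of Q(s, a) in a tabular representation.
def greedy_action(Q, s):
    """pi*(s) = argmax_a Q(s, a)."""
    return max(Q[s], key=Q[s].get)

def state_value(Q, s):
    """V*(s) = max_a Q(s, a)."""
    return max(Q[s].values())
```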
Q-Learning

Q(s,a) ≡ r(s,a) + γ V*(δ(s,a))
       = r(s,a) + γ max_{a'} Q(δ(s,a), a')

• This still depends on r(s,a) and δ(s,a).

• However, imagine the robot is exploring its environment, trying new actions as it goes.

• At every step it receives some reward "r", and it observes the environment change into a new state s' for action a.
How can we use these observations (s, a, s', r) to learn a model?

Q̂(s,a) ← r + γ max_{a'} Q̂(s', a'),   where s' = s_{t+1}
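A minimal sketch of this update rule for a deterministic environment, using a table of Q̂ values stored in a Python dictionary; the helper names, the default value of 0, and the discount factor are my own assumptions.

```python
from collections import defaultdict

Q_hat = defaultdict(float)   # Q_hat[(s, a)] starts at 0 for every state-action pair
GAMMA = 0.9                  # discount factor (assumed value)

def q_update(s, a, r, s_next, actions):
    """Deterministic Q-learning update: Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    best_next = max(Q_hat[(s_next, a_next)] for a_next in actions)
    Q_hat[(s, a)] = r + GAMMA * best_next
```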
Another model-free method: TD-Q learning

Define the Q-value function:

U(s) = max_a Q(s,a)

Q-value function updating rule:

U(s) ← max_a ( R(s,a) + γ Σ_{s'} T(s,a,s') U(s') )

Q(s,a) = R(s,a) + γ Σ_{s'} T(s,a,s') U(s')

Q(s,a) = R(s,a) + γ Σ_{s'} T(s,a,s') max_{a'} Q(s',a')    <*>

Key idea of TD-Q learning
◦ Combine <*> with the temporal difference approach
◦ The updating rule:

Q(s,a) ← Q(s,a) + α ( r + γ max_{a'} Q(s',a') - Q(s,a) )

◦ Action selection: a = argmax_a Q(s,a)
TD-Q learning agent algorithm

For each pair (s, a), initialize Q(s,a)
Observe the current state s
Loop forever
{
    Select an action a = argmax_a Q(s,a) and execute it
    Receive the immediate reward r and observe the new state s'
    Update Q(s,a):
        Q(s,a) ← Q(s,a) + α ( r + γ max_{a'} Q(s',a') - Q(s,a) )
    s = s'
}
