AI (IT) UNIT-5
Reinforcement Learning
Reinforcement learning is an area of Machine Learning. It is
about taking suitable actions to maximize reward in a particular
situation.
It is employed by various software systems and machines to find the best
possible behavior or path to take in a specific situation.
Reinforcement learning differs from supervised learning: in
supervised learning the training data comes with the answer
key, so the model is trained with the correct answers,
whereas in reinforcement learning there is no answer key and the
reinforcement agent decides what to do to perform the given
task.
In the absence of a training dataset, the agent is bound to learn from its
own experience.
Learning from rewards
In Reinforcement Learning (RL), agents are trained on
a reward and punishment mechanism. The agent is rewarded
for correct moves and punished for the wrong ones. In doing so,
the agent tries to minimize wrong moves and maximize the right
ones.
Example: We have an agent and a reward, with a
hurdle at (2,2) in between. The agent is supposed to find the best possible path to
reach the reward.
The agent learns by trying all the possible paths and then choosing the path that
reaches the reward with the fewest hurdles. Each right step gives the agent a
reward and each wrong step subtracts from the agent's reward. The total reward
is calculated when the agent reaches the final state, where it
gets a +1 reward.
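A minimal sketch of such a grid world is given below. The grid size, start cell, goal cell, and penalty values are assumptions chosen for illustration; only the hurdle at (2, 2) and the +1 final reward come from the example above.

```python
# Illustrative sketch of the grid example (assumed: 3x3 grid with cells
# (1,1)..(3,3), start at (1,1), goal at (3,3); hurdle at (2,2) as in the text).
HURDLE = (2, 2)
GOAL = (3, 3)

def reward(state):
    """Reward received on entering a state."""
    if state == GOAL:
        return +1.0        # the final reward from the example
    if state == HURDLE:
        return -1.0        # stepping into the hurdle is penalized (assumed value)
    return -0.25           # step cost (assumed) so shorter paths score higher

# Total reward of a path = sum of rewards collected along it.
path = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3)]     # one path around the hurdle
print(sum(reward(s) for s in path[1:]))             # 1 - 3 * 0.25 = 0.25
```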
Steps in Reinforcement Learning
Input: The input should be an initial state from which the
model will start.
Output: There are many possible outputs, as there are a variety
of solutions to a particular problem.
Training: The training is based upon the input. The model
returns a state, and the user decides whether to reward or
punish the model based on its output.
The model keeps learning continuously.
The best solution is decided based on the maximum reward.
Policy: A mapping from every possible state in the
system to an action; it tells the agent which action
to take in each state.
Optimal Policy: A policy which maximizes
the long-term reward.
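For a quick illustration, a policy can be written as a simple lookup from state to action; the state and action names below are hypothetical.

```python
# A policy maps every state to an action (hypothetical states and actions).
policy = {
    "s0": "right",
    "s1": "right",
    "s2": "up",
    "s3": "up",
}

def act(state):
    return policy[state]   # the agent simply follows the policy

print(act("s2"))   # prints "up"
```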
Active and Passive Reinforcement Learning
Both active and passive reinforcement learning are types of
Reinforcement Learning.
In the case of passive reinforcement learning, the agent's policy
is fixed, which means that it is told what to do.
In contrast to this, in active reinforcement learning, an
agent needs to decide what to do as there’s no fixed policy
that it can act on.
Therefore, the goal of a passive reinforcement learning
agent is to execute a fixed policy (sequence of actions) and
evaluate it while that of an active reinforcement learning
agent is to act and learn an optimal policy.
Passive Reinforcement Learning
Assume fully observable environment.
Passive learning:
Policy is fixed (behavior does not change).
The agent learns how good each state is.
Similar to policy evaluation, but:
Transition function and reward function are unknown.
Why is it useful?
For future policy revisions.
Passive Reinforcement Learning Techniques
In this kind of RL, we assume that the agent's policy
π(s) is fixed.
The agent is therefore bound to do what the policy dictates,
although the outcomes of its actions are probabilistic.
The agent can observe what is happening, so it
knows which states it reaches and what rewards
it gets there.
Techniques:
1. Direct utility estimation
2. Adaptive dynamic programming
3. Temporal difference learning
Passive RL
Suppose we are given a policy and want to determine how good it is.
Follow the policy for many epochs (training sequences).
Approach 1: Direct Utility Estimation (model free)
We're still operating in a stochastic environment, so a
particular action executed in a particular state does not always
lead to the same next state. If we want to learn the utilities of
these states under a fixed policy, we can imagine a fairly
straightforward way to do it:
Execute the policy many times.
At the end of every run, calculate the utility for each state in
the sequence (remember, the utility of a state is the sum of
rewards for that state and all subsequent states).
Update the average utility for each of the states we observed
with our new data points.
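A minimal sketch of this procedure is shown below. It assumes each trial is recorded as a list of (state, reward) pairs and ignores discounting; the states and rewards in the example are made up.

```python
from collections import defaultdict

def direct_utility_estimation(trials):
    """Average the observed reward-to-go for each state (illustrative sketch).

    trials: list of trials; each trial is a list of (state, reward) pairs
    observed while following the fixed policy.
    """
    totals = defaultdict(float)   # sum of observed returns per state
    counts = defaultdict(int)     # number of observations per state

    for trial in trials:
        reward_to_go = 0.0
        # Utility sample for a state = sum of rewards from that state onward.
        for state, reward in reversed(trial):
            reward_to_go += reward
            totals[state] += reward_to_go
            counts[state] += 1

    return {s: totals[s] / counts[s] for s in totals}

# Example with made-up states and rewards:
trials = [[("A", -0.25), ("B", -0.25), ("C", +1.0)]]
print(direct_utility_estimation(trials))   # {'C': 1.0, 'B': 0.75, 'A': 0.5}
```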
[Slide figure: example transition diagram with states OLD (U = -0.8) and NEW (U = ?), transition probabilities P = 0.9 and P = 0.1, and rewards -1 and +1.]
Approach 2: Adaptive Dynamic Programming (model based)
Learn the transition model from the observed trials, then compute the
state utilities by solving the policy-evaluation equations, which sum
over all possible successors of each state.
Approach 3: Temporal Difference Learning (model free)
Instead of doing this sum over all successors, only adjust the
utility of the state based on the successor observed in the
trial.
(No need to estimate the transition model!)
Temporal Difference Learning: Example
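As an example of the TD update in code, the sketch below assumes a learning rate alpha, a discount factor gamma, and hypothetical state names; it nudges the stored utility toward the sample reward + gamma * U(next_state).

```python
def td_update(U, state, next_state, reward, alpha=0.1, gamma=1.0):
    """One temporal-difference update of the utility table U (a dict)."""
    U.setdefault(state, 0.0)
    U.setdefault(next_state, 0.0)
    sample = reward + gamma * U[next_state]        # observed one-step sample
    U[state] += alpha * (sample - U[state])        # move U[state] toward the sample
    return U

# After observing a transition A -> B with reward -0.25 (made-up values):
U = {"A": 0.0, "B": 0.5}
td_update(U, "A", "B", -0.25)
print(U["A"])   # 0.1 * (-0.25 + 0.5 - 0.0) = 0.025
```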
ADP and TD comparison
Advantages of ADP:
Converges to the true utilities faster
Utility estimates don’t vary as much from the true
utilities
Advantages of TD:
Simpler, less computation per observation
Crude but efficient first approximation to ADP
TD does not need to build a transition model in order to
perform its updates (this is important because we can
interleave computation with exploration rather than
having to wait for the whole model to be built first)
Active Reinforcement Learning Technique: Q-Learning
It is an off-policy reinforcement learning algorithm that
seeks to find the best action to take given the current state.
S (given state) → best action
It is model-free reinforcement learning: the agent
only knows the set of possible states and actions
and can observe the environment.
Apart from the current state, the agent does not know about rewards or
transitions between states.
So the agent has to actively learn through the experience of
interacting with the environment.
The agent discovers which actions are good and bad by
trial and error.
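One way to picture this trial-and-error interaction is sketched below; the step function, states, actions, and rewards are all assumptions made up for illustration, standing in for an environment the agent cannot see inside.

```python
import random

# Hypothetical environment: the agent can only observe the outcome of an
# action, not the underlying transition or reward model.
def step(state, action):
    """Return (next_state, reward) for taking `action` in `state` (made up)."""
    transitions = {
        ("s0", "right"): ("s1", -1),
        ("s0", "down"):  ("s2", -1),
        ("s1", "right"): ("goal", +100),
        ("s2", "right"): ("mine", -100),
    }
    return transitions.get((state, action), (state, -1))

# Trial and error: pick random actions and observe what happens.
state = "s0"
for _ in range(5):
    action = random.choice(["right", "down"])
    next_state, observed_reward = step(state, action)
    print(state, action, "->", next_state, "reward:", observed_reward)
    state = "s0" if next_state in ("goal", "mine") else next_state
```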
Q-learning
Q-learning is considered off-policy because the Q-learning
function learns from actions that are outside the current
policy, such as random actions, and therefore a
behavior policy is not needed.
Q-learning seeks to learn a policy that maximizes the total
reward.
Q-learning is a value-based RL algorithm which is used to
find the optimal action-selection policy using a Q function.
Our main goal is to maximize the value function Q.
It uses a Q-table; Q stands for the quality of an action.
Quality represents how useful a given action is in gaining
future rewards.
So the Q-table helps us to find the best action for each state.
It helps us to maximize the expected reward by selecting the best
of all possible actions.
When Q-learning is performed, a Q-table or matrix is created that
contains:
No. of rows = No. of states
No. of columns = No. of possible actions
The Q-table is initialized with zero values.
Q(state, action) returns the expected future reward of taking that action in
that state.
This function can be estimated using Q-learning, which iteratively
updates Q(s, a) using the Bellman equation.
Initially we explore the environment and update the Q-table.
When the Q-table is ready, the agent starts to exploit the
environment and takes better actions.
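A minimal sketch of this iterative update is given below, assuming a learning rate alpha and discount factor gamma, with the Q-table stored as a dict keyed by (state, action) pairs.

```python
def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update based on the Bellman equation:

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return Q

# Example with made-up states, actions, and reward:
Q = {}
q_update(Q, "s0", "right", -1, "s1", actions=["up", "down", "left", "right"])
print(Q[("s0", "right")])   # 0.1 * (-1 + 0.9 * 0 - 0) = -0.1
```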
Q-Learning — a simplistic overview
Let’s say that a robot has to cross a maze and reach the end
point. There are mines, and the robot can only move one tile at
a time. If the robot steps onto a mine, the robot is dead. The
robot has to reach the end point in the shortest time possible.
The scoring/reward system is as below:
The robot loses 1 point at each step. This is done so that the
robot takes the shortest path and reaches the goal as fast as
possible.
If the robot steps on a mine, the point loss is 100 and the game
ends.
If the robot gets power, it gains 1 point.
If the robot reaches the end goal, the robot gets 100 points.
Now, the obvious question is: How do we train a robot to
reach the end goal with the shortest path without
stepping on a mine?
In our robot example, we have four actions (a=4) and five states
(s=5). So we will build a table with four columns and five rows.
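Such a table can be initialized with zeros, for instance with NumPy (used here purely for illustration):

```python
import numpy as np

n_states, n_actions = 5, 4                  # five states, four actions
q_table = np.zeros((n_states, n_actions))   # rows = states, columns = actions
print(q_table.shape)                        # (5, 4)
```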
Q-Learning algorithm process
Steps 2 and 3: choose and perform an action
This combination of steps is repeated for an undefined amount of
time: it runs until we stop the training, or until the
training loop terminates as defined in the code.
We will choose an action (a) in the state (s) based on the Q-table.
But, as mentioned earlier, when the episode initially starts,
every Q-value is 0.
We'll use something called the epsilon-greedy strategy.
In the beginning, the epsilon rate is higher. The robot
explores the environment and randomly chooses actions. The logic
behind this is that the robot does not know anything about the
environment.
As the robot explores the environment, the epsilon rate decreases
and the robot starts to exploit the environment.
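A minimal sketch of epsilon-greedy action selection with a decaying epsilon is shown below; the decay schedule and minimum epsilon are assumptions.

```python
import random
import numpy as np

def epsilon_greedy(q_table, state, epsilon):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.randrange(q_table.shape[1])   # random action (explore)
    return int(np.argmax(q_table[state]))           # best action so far (exploit)

q_table = np.zeros((5, 4))
epsilon = 1.0                                       # start fully exploratory
action = epsilon_greedy(q_table, state=0, epsilon=epsilon)
epsilon = max(0.05, epsilon * 0.99)                 # decay toward exploitation (assumed schedule)
```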
During the process of exploration, the robot progressively becomes
more confident in estimating the Q-values.
For the robot example, there are four actions to choose from:
up, down, left, and right. We are starting the training now — our
robot knows nothing about the environment. So the robot chooses a
random action, say right.
We can now update the Q-values for being at the start and moving
right using the Bellman equation.
Steps 4 and 5: evaluate
Now we have taken an action and observed an outcome and
reward. We need to update the function Q(s, a).
In the case of the robot game, to reiterate, the scoring/reward structure is:
power = +1
mine = -100
end = +100
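Plugging these rewards into the Bellman-style update sketched earlier, a single update might look like this; the state indices, action index, alpha, and gamma are assumed for illustration.

```python
import numpy as np

alpha, gamma = 0.1, 0.9                     # assumed learning rate and discount factor
q_table = np.zeros((5, 4))

# Suppose the robot was in state 0, moved right (assumed action index 3),
# received the step penalty of -1, and landed in state 1.
s, a, r, s_next = 0, 3, -1, 1
q_table[s, a] += alpha * (r + gamma * np.max(q_table[s_next]) - q_table[s, a])
print(q_table[s, a])   # 0.1 * (-1 + 0.9 * 0 - 0) = -0.1
```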
We repeat this again and again until the learning is stopped.
In this way the Q-table is updated.
Applications of Reinforcement Learning
Robotics for industrial automation
Business strategy planning
Machine learning and data processing
It helps you to create training systems that provide custom
instruction and materials according to the requirements of
students.
Aircraft control and robot motion control