AI (IT) UNIT-5

The document provides an overview of Reinforcement Learning (RL), a subset of Machine Learning focused on maximizing rewards through trial and error. It discusses key concepts such as learning from rewards, active vs passive reinforcement learning, and various techniques including Q-learning and Temporal Difference Learning. Additionally, it highlights applications of RL in fields like robotics, business strategy, and machine learning.

Department of Information Technology

Department of Computer Science


Artificial Intelligence (PE 511 IT)
V SEM

Faculty Name: MOHAMMED IRSHAD


UNIT-5

Reinforcement Learning
 Reinforcement learning is an area of Machine Learning. It is
about taking suitable actions to maximize the reward in a particular
situation.
 It is employed by various software systems and machines to find the best
possible behavior or path to take in a specific situation.
 Reinforcement learning differs from supervised learning: in
supervised learning the training data comes with the answer key,
so the model is trained on the correct answers themselves,
whereas in reinforcement learning there is no answer key and the
reinforcement agent decides what to do to perform the given
task.
 In the absence of a training dataset, it is bound to learn from its
experience.
Learning from rewards
 In Reinforcement Learning (RL), agents are trained on
a reward and punishment mechanism. The agent is rewarded
for correct moves and punished for the wrong ones. In doing so,
the agent tries to minimize wrong moves and maximize the right
ones.
Learning from rewards
Example: The problem is as follows: we have an agent and a reward, with a
hurdle at (2,2) in between. The agent is supposed to find the best possible path to
reach the reward.

The agent learns by trying all the possible paths and then choosing the path that
gives it the reward with the fewest hurdles. Each right step gives the agent a
reward and each wrong step subtracts from the agent's reward. The total reward
is calculated when the agent reaches the final state, where it
gets a +1 reward.
Steps in Reinforcement Learning
 Input: The input should be an initial state from which the
model will start.
 Output: There are many possible outputs, as there are a variety
of solutions to a particular problem.
 Training: The training is based on the input; the model
will return a state and the user will decide whether to reward or
punish the model based on its output.
 The model continues to learn.
 The best solution is decided based on the maximum reward.
 Policy: A mapping that assigns an action to every
possible state in the system (sequence of
states).
 Optimal Policy: A policy which maximizes
the long-term reward.
Active and Passive Reinforcement
Learning
 Both active and passive reinforcement learning are types of
Reinforcement Learning.
 In the case of passive reinforcement learning, the agent’s policy
is fixed, which means that it is told what to do.
 In contrast to this, in active reinforcement learning, an
agent needs to decide what to do as there’s no fixed policy
that it can act on.
 Therefore, the goal of a passive reinforcement learning
agent is to execute a fixed policy (sequence of actions) and
evaluate it while that of an active reinforcement learning
agent is to act and learn an optimal policy.
Passive Reinforcement Learning
 Assume fully observable environment.
 Passive learning:
 Policy is fixed (behavior does not change).
 The agent learns how good each state is.
 Similar to policy evaluation, but:
 Transition function and reward function are unknown.
 Why is it useful?
 For future policy revisions.
Passive Reinforcement Learning
Techniques
 In this kind of RL, we assume that the agent’s policy
π(s) is fixed.
 The agent is therefore bound to do what the policy dictates,
although the outcomes of its actions are probabilistic.
 The agent can observe what is happening, so it
knows which states it reaches and what rewards
it gets there.
Techniques:
1. Direct utility estimation
2. Adaptive dynamic programming
3. Temporal difference learning
Passive RL
 Suppose we are given a policy
 Want to determine how good it is
Passive RL
 Follow the policy for many epochs (training sequences)
Approach 1:
Direct Utility Estimation
We're still operating under a stochastic environment, so a
particular action executed in a particular state does not always
lead to the same next state. If we want to learn the utilities of
these states under a fixed policy, then, we can imagine a fairly
straightforward way to do it:
 Execute the policy a bunch of times.
 At the end of every run, calculate the utility for each state in
the sequence (remember, utility of a state is the sum of
rewards for that state and all subsequent states).
 Update the average utility for each of the states we observed
with our new data points.
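
A minimal sketch of this procedure, assuming each trial is recorded as a list of (state, reward) pairs; the function name and data layout are illustrative, not from the slides:

```python
# Direct utility estimation: average the observed reward-to-go ("utility")
# of every state over many trials of the fixed policy.
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    totals = defaultdict(float)   # sum of observed returns per state
    visits = defaultdict(int)     # number of visits per state
    for trial in trials:          # trial = [(state, reward), ...]
        g = 0.0
        returns = []
        for state, reward in reversed(trial):
            g = reward + gamma * g          # reward-to-go from this state
            returns.append((state, g))
        for state, g in returns:
            totals[state] += g
            visits[state] += 1
    return {s: totals[s] / visits[s] for s in totals}
```

With gamma = 1 this matches the definition above of a state's utility as the sum of rewards for that state and all subsequent states.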
Approach 1: (model free)

Direct Utility Estimation


Approach 1:
Direct Utility Estimation
 Follow the policy for many epochs (training sequences)

 For each state the agent ever visits:


 For each time the agent visits the state:
 Keep track of the accumulated rewards from the visit
onwards.
Direct Utility Estimation

Example observed state sequence:


Direct Utility Estimation
 As the number of trials goes to infinity, the sample average
converges to the true utility
 Weakness:
 Converges very slowly!
 Why?
 Ignores correlations between utilities of neighboring
states.
Direct Utility Estimation
 Since the less-likely outcomes will happen less often, they will
affect our estimates less, which means that we don't need to
know a transition model for this to work.
 In fact, this will (provably) eventually converge to the true
utilities, but it's slow, because it doesn't take advantage of the
Markov property of the problem.
 Remember, since we calculate the utility of a state as the
reward-to-go, the utility of each state can be written strictly in
terms of the utilities of its immediate neighbors.
Utilities of states are not independent!

[Diagram: a NEW state with unknown utility (U = ?) next to an OLD state with known utility U = -0.8; transitions with P = 0.9 toward the -1 outcome and P = 0.1 toward the +1 outcome.]
Approach 2: (model based)

Adaptive Dynamic Programming


U_{i+1}(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U_i(s')
Adaptive Dynamic Programming

ADP Algorithm
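
A minimal model-based sketch, assuming experience arrives as (state, action, reward, next_state) tuples: the agent estimates the transition model and rewards from these observations and then iterates the Bellman update shown above on the learned model. All names are illustrative.

```python
# ADP sketch (model-based): estimate T and R from experience, then iterate
# U_{i+1}(s) = R(s) + gamma * max_a sum_{s'} T(s,a,s') * U_i(s') on that model.
from collections import defaultdict

def adp_utilities(transitions, gamma=0.9, iterations=100):
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s']
    rewards = {}                                     # reward observed on entering s
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        rewards[s] = r

    # Maximum-likelihood transition model: T(s, a, s') = count / total.
    T = {sa: {s2: n / sum(nc.values()) for s2, n in nc.items()}
         for sa, nc in counts.items()}

    states = set(rewards) | {s2 for nc in counts.values() for s2 in nc}
    U = {s: 0.0 for s in states}
    for _ in range(iterations):
        U_new = {}
        for s in states:
            actions = [a for (s0, a) in T if s0 == s]
            best = max((sum(p * U[s2] for s2, p in T[(s, a)].items())
                        for a in actions), default=0.0)
            U_new[s] = rewards.get(s, 0.0) + gamma * best
        U = U_new
    return U
```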
Approach 3: (model free)

Temporal Difference Learning


 Instead of calculating the exact utility for a state can we
approximate it and possibly make it less computationally
expensive?

 The idea behind Temporal Difference (TD) learning:

U_{i+1}(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U_i(s')

Instead of doing this sum over all successors, only adjust the
utility of the state based on the successor observed in the
trial.
(No need to estimate the transition model!)
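
Concretely, the TD(0) rule nudges U(s) toward the one-sample target R(s) + γ·U(s') with a learning rate α, rather than averaging over all successors. A minimal sketch, assuming utilities are kept in a dictionary and each observed transition is fed in as (s, r, s'); names are illustrative:

```python
# TD(0) update: adjust U(s) toward the utility suggested by the single
# successor actually observed in the trial -- no transition model needed.
def td_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
    u_s = U.get(s, 0.0)
    target = r + gamma * U.get(s_next, 0.0)
    U[s] = u_s + alpha * (target - u_s)
    return U
```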
TD Learning Example

Temporal Difference Learning

ADP and TD comparison
 Advantages of ADP:
 Converges to the true utilities faster
 Utility estimates don’t vary as much from the true
utilities
 Advantages of TD:
 Simpler, less computation per observation
 Crude but efficient first approximation to ADP
 Don’t need to build a transition model in order to
perform its updates (this is important because we can
interleave computation with exploration rather than
having to wait for the whole model to be built first)
Active Reinforcement Learning
Technique- Q Learning
 It is an off-policy reinforcement learning algorithm that
seeks to find the best action to take given the current state:
S (given state) → best action
 It is a model-free reinforcement learning method: the agent
only knows the set of possible states and actions
and can observe the environment.
 Beyond the current state, the agent does not know the rewards or the
transitions between states.
 So the agent has to actively learn through the experience of
interactions with the environment.
 The agent will discover what are good and bad actions by
trial and error.
Q-learning
 Q-learning is considered off-policy because the Q-function
learns from actions that are outside the current
policy, such as random actions, and therefore a fixed policy is
not needed.
 Q-learning seeks to learn a policy that maximizes the total
reward.
 Q-learning is a value based RL algorithm which is used to
find the optimal action selection policy using a Q function.
 Our main goal is to maximize the value function Q.
 It uses a Q-table; Q stands for the quality of the action.
 Quality represents how useful a given action is in gaining
future rewards.
Q-learning
 So Q-table helps us to find the best action for each state.
 It helps us to maximize the expected reward by selecting the best
of all possible actions
 When Q-learning is performed, a Q-table or matrix is created that
contains:
No. of Rows = No. of states
No. of Columns = No. of possible actions
 Q-table is initialized with zero values.
 Q(state, action) returns the expected future reward of that action at
that state.
 This function can be estimated using Q-Learning, which iteratively
updates Q(s,a) using the Bellman equation.
 Initially we explore the environment and update the Q-Table.
When the Q-Table is ready, the agent will start to exploit the
environment and start taking better actions.
Q-Learning — a simplistic overview
 Let’s say that a robot has to cross a maze and reach the end
point. There are mines, and the robot can only move one tile at
a time. If the robot steps onto a mine, the robot is dead. The
robot has to reach the end point in the shortest time possible.
 The scoring/reward system is as below:
 The robot loses 1 point at each step. This is done so that the
robot takes the shortest path and reaches the goal as fast as
possible.
 If the robot steps on a mine, the point loss is 100 and the game
ends.
 If the robot gets power, it gains 1 point.
 If the robot reaches the end goal, the robot gets 100 points.
Q-Learning — a simplistic overview
 Now, the obvious question is: How do we train a robot to
reach the end goal with the shortest path without
stepping on a mine?

So, how do we solve this?


Introducing the Q-Table
A Q-Table is just a fancy name for a simple lookup table where
we calculate the maximum expected future reward for each action
at each state. Basically, this table will guide us to the best
action at each state.

There are four possible actions at each non-edge tile: when the
robot is at a state, it can move up, down, right, or left.
So, let’s model this environment in our Q-Table.
Q-Table
 In the Q-Table, the columns are the actions and the rows are the
states.
 Each Q-table score will be the maximum expected future reward
that the robot will get if it takes that action at that state. This is an
iterative process, as we need to improve the Q-Table at each
iteration.
 But the questions are:
 How do we calculate the values of the
Q-table?
 Are the values available or predefined?
 To learn each value of the Q-table,
we use the Q-Learning algorithm.
Q-Learning Algorithm
 Q-function
 The Q-function uses the Bellman equation and takes two
inputs: state (s) and action (a).
 Using this function (shown below), we get the values of Q for the
cells in the table.
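
In its commonly used form, this Bellman-based update, with learning rate α and discount factor γ, is:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

The term in brackets is the temporal difference between the new estimate R(s, a) + γ max_{a'} Q(s', a') and the old estimate Q(s, a).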
Q-Learning Algorithm
 When we start, all the values in the Q-table are zeros.
 There is an iterative process of updating the values. As we
start to explore the environment, the Q-function gives us
better and better approximations by continuously updating
the Q-values in the table.
 Now, let’s understand how the updating takes place.
Q-Learning algorithm process

Each of the colored boxes is one step. Let’s understand each of
these steps in detail.
Q-Learning algorithm process
Step 1: initialize the Q-Table
We will first build a Q-table. There are n columns, where n=
number of actions. There are m rows, where m= number of states.
We will initialize the values at 0.

In our robot example, we have four actions (a=4) and five states
(s=5). So we will build a table with four columns and five rows.
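
A minimal sketch of this step, assuming the 4-action, 5-state robot example and integer state indices (names are illustrative):

```python
# Step 1: initialize the Q-table with zeros (rows = states, columns = actions).
import numpy as np

n_states, n_actions = 5, 4
q_table = np.zeros((n_states, n_actions))
```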
Q-Learning algorithm process
Steps 2 and 3: choose and perform an action
 This combination of steps is repeated for an undefined amount of
time: it runs until we stop the training, or until the training
loop stops as defined in the code.
 We will choose an action (a) in the state (s) based on the Q-
Table. But, as mentioned earlier, when the episode initially starts,
every Q-value is 0.
 We’ll use something called the epsilon greedy strategy.
 In the beginning, the epsilon rates will be higher. The robot will
explore the environment and randomly choose actions. The logic
behind this is that the robot does not know anything about the
environment.
 As the robot explores the environment, the epsilon rate decreases
and the robot starts to exploit the environment.
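
A minimal sketch of the epsilon-greedy choice, assuming the q_table above and an integer state index; decaying epsilon over episodes is left out (names are illustrative):

```python
# Steps 2-3: choose an action with the epsilon-greedy strategy.
import random
import numpy as np

def choose_action(q_table, state, epsilon):
    if random.random() < epsilon:
        return random.randrange(q_table.shape[1])   # explore: random action
    return int(np.argmax(q_table[state]))           # exploit: best known action
```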
Q-Learning algorithm process
Steps 2 and 3: choose and perform an action
 During the process of exploration, the robot progressively becomes
more confident in estimating the Q-values.
 For the robot example, there are four actions to choose from:
up, down, left, and right. We are starting the training now — our
robot knows nothing about the environment. So the robot chooses a
random action, say right.

We can now update the Q-values for being at the start and moving
right using the Bellman equation.
Q-Learning algorithm process
Steps 4 and 5: evaluate
 Now we have taken an action and observed an outcome and a
reward. We need to update the function Q(s,a).

In the case of the robot game, to reiterate, the scoring/reward structure is:
power = +1
mine = -100
end = +100
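
A minimal sketch of the update, assuming the q_table above; alpha is the learning rate and gamma the discount factor (names are illustrative):

```python
# Steps 4-5: update Q(s, a) toward the target r + gamma * max_a' Q(s', a').
import numpy as np

def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    best_next = np.max(q_table[next_state])
    q_table[state, action] += alpha * (reward + gamma * best_next
                                       - q_table[state, action])
```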
Q-Learning algorithm process

We will repeat this again and again until the learning is stopped.
In this way the Q-Table will be updated.
Applications of Reinforcement
Learning
 Robotics for industrial automation.
 Business strategy planning
 Machine learning and data processing
 Training systems that provide custom instruction and materials
according to the requirements of students.
 Aircraft control and robot motion control
