AI (IT) UNIT-5
Reinforcement Learning
Reinforcement learning is an area of Machine Learning. It is
about taking suitable actions to maximize reward in a particular
situation.
It is employed by various software systems and machines to find the best
possible behavior or path to take in a specific situation.
Reinforcement learning differs from supervised learning: in
supervised learning the training data comes with the answer
key, so the model is trained with the correct answers,
whereas in reinforcement learning there is no answer key and the
reinforcement agent decides what to do to perform the given
task.
In the absence of a training dataset, the agent is bound to learn from its
own experience.
Learning from rewards
In Reinforcement Learning (RL), agents are trained on
a reward and punishment mechanism. The agent is rewarded
for correct moves and punished for the wrong ones. In doing so,
the agent tries to minimize wrong moves and maximize the right
ones.
Example: We have an agent and a reward, with a
hurdle at (2,2) in between. The agent is supposed to find the best possible path to
reach the reward.
The agent learns by trying all the possible paths and then choosing the path that
reaches the reward with the fewest hurdles. Each right step gives the agent a
reward and each wrong step subtracts from the agent's reward. The total reward
is calculated when the agent reaches the final state, where it
gets a +1 reward.
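A minimal sketch of such a grid world is given below. The grid size, start cell, goal cell, and penalty values are assumptions chosen for illustration; only the hurdle at (2, 2) and the +1 final reward come from the example above.

```python
# Illustrative sketch of the grid example (assumed: 3x3 grid with cells
# (1,1)..(3,3), start at (1,1), goal at (3,3); hurdle at (2,2) as in the text).
HURDLE = (2, 2)
GOAL = (3, 3)

def reward(state):
    """Reward received on entering a state."""
    if state == GOAL:
        return +1.0        # the final reward from the example
    if state == HURDLE:
        return -1.0        # stepping into the hurdle is penalized (assumed value)
    return -0.25           # step cost (assumed) so shorter paths score higher

# Total reward of a path = sum of rewards collected along it.
path = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3)]     # one path around the hurdle
print(sum(reward(s) for s in path[1:]))             # 1 - 3 * 0.25 = 0.25
```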
Steps in Reinforcement Learning
Input: The input should be an initial state from which the
model will start.
Output: There are many possible outputs, as there are a variety
of solutions to a particular problem.
Training: The training is based upon the input. The model
returns a state, and the user decides whether to reward or
punish the model based on its output.
The model keeps learning continuously.
The best solution is decided based on the maximum reward.
Policy: A mapping from every possible state in the
system to an action; it tells the agent which action
to take in each state.
Optimal Policy: A policy which maximizes
the long-term reward.
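For a quick illustration, a policy can be written as a simple lookup from state to action; the state and action names below are hypothetical.

```python
# A policy maps every state to an action (hypothetical states and actions).
policy = {
    "s0": "right",
    "s1": "right",
    "s2": "up",
    "s3": "up",
}

def act(state):
    return policy[state]   # the agent simply follows the policy

print(act("s2"))   # prints "up"
```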
Active and Passive Reinforcement Learning
Both active and passive reinforcement learning are types of
Reinforcement Learning.
In the case of passive reinforcement learning, the agent's policy
is fixed, which means that it is told what to do.
In contrast to this, in active reinforcement learning, an
agent needs to decide what to do as there’s no fixed policy
that it can act on.
Therefore, the goal of a passive reinforcement learning
agent is to execute a fixed policy (sequence of actions) and
evaluate it while that of an active reinforcement learning
agent is to act and learn an optimal policy.
Passive Reinforcement Learning
Assume fully observable environment.
Passive learning:
Policy is fixed (behavior does not change).
The agent learns how good each state is.
Similar to policy evaluation, but:
Transition function and reward function are unknown.
Why is it useful?
For future policy revisions.
Passive Reinforcement Learning Techniques
In this kind of RL, we assume that the agent's policy
π(s) is fixed.
The agent is therefore bound to do what the policy dictates,
although the outcomes of its actions are probabilistic.
The agent can observe what is happening, so it
knows which states it reaches and what rewards
it gets there.
Techniques:
1. Direct utility estimation
2. Adaptive dynamic programming
3. Temporal difference learning
Passive RL
Suppose we are given a policy and want to determine how good it is.
Follow the policy for many epochs (training sequences).
Approach 1: Direct Utility Estimation (model free)
We're still operating in a stochastic environment, so a
particular action executed in a particular state does not always
lead to the same next state. If we want to learn the utilities of
these states under a fixed policy, we can imagine a fairly
straightforward way to do it:
Execute the policy many times.
At the end of every run, calculate the utility for each state in
the sequence (remember, the utility of a state is the sum of
rewards for that state and all subsequent states).
Update the average utility for each of the states we observed
with our new data points.
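A minimal sketch of this procedure is shown below. It assumes each trial is recorded as a list of (state, reward) pairs and ignores discounting; the states and rewards in the example are made up.

```python
from collections import defaultdict

def direct_utility_estimation(trials):
    """Average the observed reward-to-go for each state (illustrative sketch).

    trials: list of trials; each trial is a list of (state, reward) pairs
    observed while following the fixed policy.
    """
    totals = defaultdict(float)   # sum of observed returns per state
    counts = defaultdict(int)     # number of observations per state

    for trial in trials:
        reward_to_go = 0.0
        # Utility sample for a state = sum of rewards from that state onward.
        for state, reward in reversed(trial):
            reward_to_go += reward
            totals[state] += reward_to_go
            counts[state] += 1

    return {s: totals[s] / counts[s] for s in totals}

# Example with made-up states and rewards:
trials = [[("A", -0.25), ("B", -0.25), ("C", +1.0)]]
print(direct_utility_estimation(trials))   # {'C': 1.0, 'B': 0.75, 'A': 0.5}
```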
[Slide figure: example transition diagram with states OLD (U = -0.8) and NEW (U = ?), transition probabilities P = 0.9 and P = 0.1, and rewards -1 and +1.]
Approach 2: Adaptive Dynamic Programming (model based)
Learn the transition model from the observed trials, then compute the
state utilities by solving the policy-evaluation equations, which sum
over all possible successors of each state.
Approach 3: Temporal Difference Learning (model free)
Instead of doing this sum over all successors, only adjust the
utility of the state based on the successor observed in the
trial.
(No need to estimate the transition model!)
Temporal Difference Learning: Example
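As an example of the TD update in code, the sketch below assumes a learning rate alpha, a discount factor gamma, and hypothetical state names; it nudges the stored utility toward the sample reward + gamma * U(next_state).

```python
def td_update(U, state, next_state, reward, alpha=0.1, gamma=1.0):
    """One temporal-difference update of the utility table U (a dict)."""
    U.setdefault(state, 0.0)
    U.setdefault(next_state, 0.0)
    sample = reward + gamma * U[next_state]        # observed one-step sample
    U[state] += alpha * (sample - U[state])        # move U[state] toward the sample
    return U

# After observing a transition A -> B with reward -0.25 (made-up values):
U = {"A": 0.0, "B": 0.5}
td_update(U, "A", "B", -0.25)
print(U["A"])   # 0.1 * (-0.25 + 0.5 - 0.0) = 0.025
```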
ADP and TD comparison
Advantages of ADP:
Converges to the true utilities faster
Utility estimates don’t vary as much from the true
utilities
Advantages of TD:
Simpler, less computation per observation
Crude but efficient first approximation to ADP
TD does not need to build a transition model in order to
perform its updates (this is important because we can
interleave computation with exploration rather than
having to wait for the whole model to be built first)
Active Reinforcement Learning Technique: Q-Learning
It is an off-policy reinforcement learning algorithm that
seeks to find the best action to take given the current state.
S (given state) → best action
It is model-free reinforcement learning: the agent
only knows the set of possible states and actions
and can observe the environment.
Apart from the current state, the agent does not know about rewards or
transitions between states.
So the agent has to actively learn through the experience of
interacting with the environment.
The agent discovers which actions are good and bad by
trial and error.
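One way to picture this trial-and-error interaction is sketched below; the step function, states, actions, and rewards are all assumptions made up for illustration, standing in for an environment the agent cannot see inside.

```python
import random

# Hypothetical environment: the agent can only observe the outcome of an
# action, not the underlying transition or reward model.
def step(state, action):
    """Return (next_state, reward) for taking `action` in `state` (made up)."""
    transitions = {
        ("s0", "right"): ("s1", -1),
        ("s0", "down"):  ("s2", -1),
        ("s1", "right"): ("goal", +100),
        ("s2", "right"): ("mine", -100),
    }
    return transitions.get((state, action), (state, -1))

# Trial and error: pick random actions and observe what happens.
state = "s0"
for _ in range(5):
    action = random.choice(["right", "down"])
    next_state, observed_reward = step(state, action)
    print(state, action, "->", next_state, "reward:", observed_reward)
    state = "s0" if next_state in ("goal", "mine") else next_state
```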
Q-learning
Q-learning is considered off-policy because the Q-learning
function learns from actions that are outside the current
policy, such as random actions, and therefore a
behavior policy is not needed.
Q-learning seeks to learn a policy that maximizes the total
reward.
Q-learning is a value-based RL algorithm which is used to
find the optimal action-selection policy using a Q function.
Our main goal is to maximize the value function Q.
It uses a Q-table; Q stands for the quality of an action.
Quality represents how useful a given action is in gaining
future rewards.
So the Q-table helps us to find the best action for each state.
It helps us to maximize the expected reward by selecting the best
of all possible actions.
When Q-learning is performed, a Q-table or matrix is created that
contains:
No. of rows = No. of states
No. of columns = No. of possible actions
The Q-table is initialized with zero values.
Q(state, action) returns the expected future reward of taking that action in
that state.
This function can be estimated using Q-learning, which iteratively
updates Q(s, a) using the Bellman equation.
Initially we explore the environment and update the Q-table.
When the Q-table is ready, the agent starts to exploit the
environment and takes better actions.
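A minimal sketch of this iterative update is given below, assuming a learning rate alpha and discount factor gamma, with the Q-table stored as a dict keyed by (state, action) pairs.

```python
def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update based on the Bellman equation:

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return Q

# Example with made-up states, actions, and reward:
Q = {}
q_update(Q, "s0", "right", -1, "s1", actions=["up", "down", "left", "right"])
print(Q[("s0", "right")])   # 0.1 * (-1 + 0.9 * 0 - 0) = -0.1
```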
Q-Learning — a simplistic overview
Let’s say that a robot has to cross a maze and reach the end
point. There are mines, and the robot can only move one tile at
a time. If the robot steps onto a mine, the robot is dead. The
robot has to reach the end point in the shortest time possible.
The scoring/reward system is as below:
The robot loses 1 point at each step. This is done so that the
robot takes the shortest path and reaches the goal as fast as
possible.
If the robot steps on a mine, the point loss is 100 and the game
ends.
If the robot gets power, it gains 1 point.
If the robot reaches the end goal, the robot gets 100 points.
Now, the obvious question is: How do we train a robot to
reach the end goal with the shortest path without
stepping on a mine?
In our robot example, we have four actions (a=4) and five states
(s=5). So we will build a table with four columns and five rows.
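Such a table can be initialized with zeros, for instance with NumPy (used here purely for illustration):

```python
import numpy as np

n_states, n_actions = 5, 4                  # five states, four actions
q_table = np.zeros((n_states, n_actions))   # rows = states, columns = actions
print(q_table.shape)                        # (5, 4)
```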
Q-Learning algorithm process
Steps 2 and 3: choose and perform an action
This combination of steps is repeated for an undefined amount of
time: it runs until we stop the training, or until the
training loop terminates as defined in the code.
We will choose an action (a) in the state (s) based on the Q-table.
But, as mentioned earlier, when the episode initially starts,
every Q-value is 0.
We'll use something called the epsilon-greedy strategy.
In the beginning, the epsilon rate is higher. The robot
explores the environment and randomly chooses actions. The logic
behind this is that the robot does not know anything about the
environment.
As the robot explores the environment, the epsilon rate decreases
and the robot starts to exploit the environment.
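A minimal sketch of epsilon-greedy action selection with a decaying epsilon is shown below; the decay schedule and minimum epsilon are assumptions.

```python
import random
import numpy as np

def epsilon_greedy(q_table, state, epsilon):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.randrange(q_table.shape[1])   # random action (explore)
    return int(np.argmax(q_table[state]))           # best action so far (exploit)

q_table = np.zeros((5, 4))
epsilon = 1.0                                       # start fully exploratory
action = epsilon_greedy(q_table, state=0, epsilon=epsilon)
epsilon = max(0.05, epsilon * 0.99)                 # decay toward exploitation (assumed schedule)
```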
During the process of exploration, the robot progressively becomes
more confident in estimating the Q-values.
For the robot example, there are four actions to choose from:
up, down, left, and right. We are starting the training now — our
robot knows nothing about the environment. So the robot chooses a
random action, say right.
We can now update the Q-values for being at the start and moving
right using the Bellman equation.
Steps 4 and 5: evaluate
Now we have taken an action and observed an outcome and
reward. We need to update the function Q(s, a).
In the case of the robot game, to reiterate, the scoring/reward structure is:
power = +1
mine = -100
end = +100
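Plugging these rewards into the Bellman-style update sketched earlier, a single update might look like this; the state indices, action index, alpha, and gamma are assumed for illustration.

```python
import numpy as np

alpha, gamma = 0.1, 0.9                     # assumed learning rate and discount factor
q_table = np.zeros((5, 4))

# Suppose the robot was in state 0, moved right (assumed action index 3),
# received the step penalty of -1, and landed in state 1.
s, a, r, s_next = 0, 3, -1, 1
q_table[s, a] += alpha * (r + gamma * np.max(q_table[s_next]) - q_table[s, a])
print(q_table[s, a])   # 0.1 * (-1 + 0.9 * 0 - 0) = -0.1
```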
We repeat this again and again until the learning is stopped.
In this way the Q-table is updated.
Applications of Reinforcement Learning
Robotics for industrial automation
Business strategy planning
Machine learning and data processing
It helps you to create training systems that provide custom
instruction and materials according to the requirements of
students.
Aircraft control and robot motion control