ML Unit 4

What is reinforcement learning?

Reinforcement learning (RL) is a machine learning (ML) technique that trains software to make decisions that achieve optimal results. It mimics the trial-and-error learning process that humans use to achieve their goals. Software actions that work towards your goal are reinforced, while actions that detract from the goal are ignored.

RL algorithms use a reward-and-punishment paradigm as they process data. They learn from the feedback of each action and self-discover the best processing paths to achieve final outcomes. The algorithms are also capable of delayed gratification: the best overall strategy may require short-term sacrifices, so the best approach they discover may include some punishments or backtracking along the way. RL is a powerful method to help artificial intelligence (AI) systems achieve optimal outcomes in unseen environments.

There are many different algorithms that tackle this problem. In fact, Reinforcement Learning is defined by a specific type of problem, and all its solutions are classed as Reinforcement Learning algorithms. In the problem, an agent is supposed to decide the best action to select based on its current state. When this step is repeated, the problem is known as a Markov Decision Process.
A Markov Decision Process (MDP) model contains:

 A set of possible world states S.
 A set of Models.
 A set of possible actions A.
 A real-valued reward function R(s,a).
 A policy π, the solution of the Markov Decision Process.
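To make these components concrete, here is a minimal sketch (in Python) of how such an MDP could be represented with plain data structures; the states, actions, transition probabilities and rewards below are illustrative assumptions, not values taken from these notes.

```python
# A minimal sketch of an MDP as plain Python data structures.
# All names and numbers here are illustrative, not values from these notes.

states = ["s0", "s1"]                      # S: set of possible world states
actions = {"s0": ["stay", "go"],           # A(s): actions available in each state
           "s1": ["stay"]}

# Transition model T(s, a, s'): probability of reaching s' from s by taking a.
T = {
    ("s0", "stay", "s0"): 1.0,
    ("s0", "go",   "s1"): 0.8,             # a noisy action: usually succeeds...
    ("s0", "go",   "s0"): 0.2,             # ...but sometimes leaves us in place
    ("s1", "stay", "s1"): 1.0,
}

# Reward function R(s, a): real-valued reward for taking action a in state s.
R = {("s0", "stay"): 0.0, ("s0", "go"): -0.04, ("s1", "stay"): 1.0}

# A policy maps each state to an action (here written down directly).
policy = {"s0": "go", "s1": "stay"}
```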

What is a State?

A State is a set of tokens that represent every state that the agent can
be in.

What is a Model?

A Model (sometimes called a Transition Model) gives an action's effect in a state. In particular, T(S, a, S') defines a transition T where being in state S and taking an action 'a' takes us to state S' (S and S' may be the same). For stochastic actions (noisy, non-deterministic) we also define a probability P(S'|S,a), which represents the probability of reaching a state S' if action 'a' is taken in state S. Note that the Markov property states that the effects of an action taken in a state depend only on that state and not on the prior history.
What are Actions?

An Action set A is the set of all possible actions. A(s) defines the set of actions that can be taken in state S.

What is a Reward?

A Reward is a real-valued reward function. R(s) indicates the reward for simply being in the state S. R(S,a) indicates the reward for being in a state S and taking an action 'a'. R(S,a,S') indicates the reward for being in a state S, taking an action 'a' and ending up in a state S'.

What is a Policy?

A Policy is a solution to the Markov Decision Process. A policy is a mapping from states S to actions a. It indicates the action 'a' to be taken while in state S.
Let us take the example of a grid world:
An agent lives in a 3×4 grid. The grid has a START state (grid no 1,1). The purpose of the agent is to wander around the grid to finally reach the Blue Diamond (grid no 4,3). Under all circumstances, the agent should avoid the Fire grid (orange color, grid no 4,2). Also, grid no 2,2 is a blocked grid: it acts as a wall, hence the agent cannot enter it.
The agent can take any one of these actions: UP, DOWN, LEFT, RIGHT.
Walls block the agent's path, i.e., if there is a wall in the direction the agent would have moved, the agent stays in the same place. So, for example, if the agent chooses LEFT in the START grid, it would stay put in the START grid.
First Aim: To find the shortest sequence getting from START to the Diamond. Two such sequences can be found:

 RIGHT RIGHT UP UP RIGHT
 UP UP RIGHT RIGHT RIGHT
Let us take the second one (UP UP RIGHT RIGHT RIGHT) for the subsequent discussion.
The move is now noisy. 80% of the time the intended action works correctly. 20% of the time the action the agent takes causes it to move at right angles to the intended direction. For example, if the agent says UP, the probability of going UP is 0.8, whereas the probability of going LEFT is 0.1 and the probability of going RIGHT is 0.1 (since LEFT and RIGHT are at right angles to UP). A small sketch of this noisy transition model appears after the list below.
The agent receives rewards at each time step:-

 A small reward each step (this can be negative, in which case it acts as a punishment; in the above example, entering the Fire grid gives a reward of -1).
 Big rewards come at the end (good or bad).
 The goal is to maximize the sum of rewards.
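To illustrate the noisy moves described above, the following small Python sketch (a hypothetical helper, not part of the original notes) returns the distribution over actual moves for an intended action using the 0.8/0.1/0.1 split.

```python
# Sketch of the noisy move model described above: the intended action succeeds
# with probability 0.8; with probability 0.1 each, the agent slips to one of the
# two directions at right angles to the intended one.

RIGHT_ANGLES = {
    "UP":    ("LEFT", "RIGHT"),
    "DOWN":  ("LEFT", "RIGHT"),
    "LEFT":  ("UP", "DOWN"),
    "RIGHT": ("UP", "DOWN"),
}

def noisy_action_distribution(intended):
    """Return a dict mapping actual moves to their probabilities."""
    side_a, side_b = RIGHT_ANGLES[intended]
    return {intended: 0.8, side_a: 0.1, side_b: 0.1}

print(noisy_action_distribution("UP"))   # {'UP': 0.8, 'LEFT': 0.1, 'RIGHT': 0.1}
```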

Bellman Equation:-

According to the Bellman Equation, the long-term reward for a given action is equal to the reward from the current action combined with the expected reward from the future actions taken at the following time steps. Let's try to understand this first.

Let's take an example:

Here we have a maze, which is our environment, and the sole goal of our agent is to reach the trophy state (R = 1), i.e., to get a good reward, and to avoid the fire state (R = -1), because ending there is a failure, i.e., a bad reward.

What happens without Bellman Equation?

Initially, we will give our agent some time to explore the environment and let it figure out a path to the goal. As soon as it reaches its goal, it will trace its steps back to its starting position and mark the values of all the states that eventually lead towards the goal as V = 1.
The agent will face no problem until we change its starting position; then it will not be able to find a path towards the trophy state, since the value of all the marked states is equal to 1. So, to solve this problem we should use the Bellman Equation:

V(s) = max_a [ R(s,a) + γ V(s′) ]

State(s): the current state where the agent is in the environment.
Next State(s′): after taking action (a) at state (s), the agent reaches s′.
Value(V): a numeric representation of a state which helps the agent to find its path. V(s) here means the value of the state s.
Reward(R): the feedback which the agent gets after performing an action (a).
 R(s): reward for being in the state s
 R(s,a): reward for being in the state s and performing an action a
 R(s,a,s′): reward for being in a state s, taking an action a and ending up in s′
e.g. a good reward can be +1, a bad reward can be -1, no reward can be 0.
Action(a): the set of possible actions that can be taken by the agent in the state (s), e.g. (LEFT, RIGHT, UP, DOWN).
Discount factor(γ): determines how much the agent cares about rewards in the distant future relative to those in the immediate future. It has a value between 0 and 1. A lower value encourages short-term rewards, while a higher value emphasizes long-term rewards.

The max denotes the most optimal action among all the actions that the agent can take in a particular state, i.e., the one that leads to the highest expected reward when this process is repeated at every consecutive step.
For example:
 The state to the left of the fire state (V = 0.9) can go UP, DOWN or RIGHT, but NOT LEFT, because that direction is a wall (not accessible). Among all the available actions, the maximum value for that state comes from the UP action.
 From the current starting state, our agent can choose either UP or RIGHT at random, since both lead towards the reward in the same number of steps.
By using the Bellman equation our agent will calculate the value of every state except for the trophy and the fire state (V = 0); these terminal states do not get values of their own since they are the end of the maze.
So, after making such a plan our agent can easily accomplish its goal by just following the increasing values.
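To show how these values arise, here is a single Bellman backup for one state in Python; the neighbour values, rewards and γ = 0.9 are illustrative assumptions chosen so the result matches the V = 0.9 state discussed above.

```python
# One Bellman backup for a single state, using V(s) = max_a [ R(s,a) + gamma * V(s') ]
# and treating the moves as deterministic for simplicity. LEFT is omitted because
# that direction is a wall for this state. The numbers are illustrative.

gamma = 0.9

outcomes = {   # for each action: immediate reward and value of the state reached
    "UP":    {"reward": 0.0, "next_value": 1.0},   # e.g. moving towards the trophy state
    "RIGHT": {"reward": 0.0, "next_value": 0.0},
    "DOWN":  {"reward": 0.0, "next_value": 0.0},
}

V_s = max(o["reward"] + gamma * o["next_value"] for o in outcomes.values())
best_action = max(outcomes, key=lambda a: outcomes[a]["reward"] + gamma * outcomes[a]["next_value"])

print(V_s, best_action)   # 0.9 UP  -- matches the V = 0.9 state discussed above
```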

What is Q-learning?:-

Q-learning is a machine learning approach that enables a model to iteratively learn and improve over time by taking the correct actions. Q-learning is a type of reinforcement learning.

With reinforcement learning, a machine learning model is trained to mimic the way animals or children learn. Good actions are rewarded or reinforced, while bad actions are discouraged and penalized.

With the state-action-reward-state-action form of reinforcement learning, the training regimen follows the agent's current policy to take the right actions. Q-learning provides a model-free approach to reinforcement learning: there is no model of the environment to guide the reinforcement learning process. The agent -- which is the AI component that acts in the environment -- iteratively learns and makes predictions about the environment on its own.

Q-learning also takes an off-policy approach to reinforcement learning. A Q-learning approach aims to determine the optimal action based on its current state. It can accomplish this by either developing its own set of rules or deviating from the prescribed policy. Because Q-learning may deviate from the given policy, a defined policy is not needed.

How does Q-learning work?


Q-learning models operate in an iterative process that involves multiple components working together to train a model: the agent learns by exploring the environment and updating the model as the exploration continues. The components of Q-learning include the following:
 Agents. The agent is the entity that acts and operates within an environment.
 States. The state is a variable that identifies the agent's current position in the environment.
 Actions. The action is the agent's operation when it is in a specific state.
 Rewards. A foundational concept within reinforcement learning is the concept of providing either a positive or a negative response for the agent's actions.
 Episodes. An episode ends when an agent reaches a terminal state and can no longer take a new action.
 Q-values. The Q-value is the metric used to measure the value of taking an action in a particular state.
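The heart of Q-learning is the update Q(s,a) ← Q(s,a) + α [ r + γ max_a′ Q(s′,a′) − Q(s,a) ]. Below is a minimal Python sketch of that update with epsilon-greedy action selection; the parameter values and the idea of states and actions as hashable keys are assumptions for illustration, not part of the notes.

```python
import random
from collections import defaultdict

# Minimal sketch of the Q-learning update rule.

alpha, gamma, epsilon = 0.1, 0.9, 0.1        # learning rate, discount factor, exploration rate
Q = defaultdict(float)                       # Q[(state, action)] -> current value estimate

def choose_action(state, actions):
    """Epsilon-greedy: mostly exploit the best known action, occasionally explore."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, next_actions):
    """Off-policy update: bootstrap from the best action available in the next state."""
    best_next = max(Q[(next_state, a)] for a in next_actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```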

Advantages of Q-learning:-
The Q-learning approach to reinforcement learning can potentially be advantageous for several reasons, including the following:

 Model-free. The model-free approach is the foundation of Q-learning and one of its biggest potential advantages for some uses. Rather than requiring prior knowledge about an environment, the Q-learning agent can learn about the environment as it trains. The model-free approach is particularly beneficial for scenarios where the underlying dynamics of an environment are difficult to model or completely unknown.
 Off-policy optimization. The model can optimize to get the best possible result without being strictly tethered to a policy that might not enable the same degree of optimization.
 Flexibility. The model-free, off-policy approach gives Q-learning the flexibility to work across a variety of problems and environments.
 Offline training. A Q-learning model can be trained on pre-collected, offline data sets.
Disadvantages of Q-learning:-
The Q-learning approach to reinforcement learning also has some disadvantages, such as the following:

 Exploration vs. exploitation tradeoff. It can be hard for a Q-learning model to find the right balance between trying new actions and sticking with what's already known. It's a dilemma that is commonly referred to as the exploration vs. exploitation tradeoff in reinforcement learning.
 Curse of dimensionality. Q-learning can potentially face a machine learning risk known as the curse of dimensionality. The curse of dimensionality is a problem with high-dimensional data where the amount of data required to represent the distribution increases exponentially. This can lead to computational challenges and decreased accuracy.
 Overestimation. A Q-learning model can sometimes be too optimistic and overestimate how good a particular action or strategy is.
 Performance. A Q-learning model can take a long time to figure out the best method if there are several ways to approach a problem.

Value iteration vs policy iteration:-

In reinforcement learning, Markov decision processes (MDPs) help in decision-making problems. Such problems include finding an optimal policy, where states are mapped to actions to maximize the overall reward over time. To tackle this reward-optimization problem, there are two approaches:

 Value iteration
 Policy iteration

In this section, we will discuss the algorithms mentioned above and delve into their differences as well.

Value iteration:-
Value iteration is a dynamic programming algorithm in which an agent interacts with its surroundings through actions to maximize long-term reward. It uses the values of neighbouring states to refine the estimate of each state's value. Value iteration starts with arbitrary initial estimates and improves them until they converge to the optimal values.

V(s) = max_a Σ_{s′} T(s,a,s′) ( R(s,a,s′) + γ V(s′) )


 Here V(s) is the value of state s.
 max_a selects the best action to find the optimal solution.
 T(s,a,s′) is the probability of the agent moving from state s to s′ by taking an action a.
 R(s,a,s′) is the reward of the agent when it moves from state s to s′.
 γ represents the discount factor that determines the significance of long-term rewards as compared to short-term rewards.
 V(s′) is the value of the next state s′.
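A minimal Python sketch of value iteration based on the formula above; it assumes the MDP is supplied as dictionaries T and R keyed by (s, a, s′) and that every state has at least one action. The names and default parameters are illustrative.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Iterate V(s) = max_a sum_s' T(s,a,s') * (R(s,a,s') + gamma * V(s')) until convergence."""
    V = {s: 0.0 for s in states}                      # arbitrary initial estimates
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(T.get((s, a, s2), 0.0) * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2 in states)
                for a in actions[s]                   # assumes at least one action per state
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:                               # stop once values stop changing
            return V
```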
Policy iteration:-

Policy iteration is an iterative method that alternates between evaluating and improving a policy until an optimal policy is found.

Mathematical intuition

There are two parts of policy iteration, which are:

 Policy evaluation
 Policy improvement

In policy evaluation, we evaluate V(s) for the current policy π(s) until the values converge for that policy.

V(s) = Σ_{s′} T(s,π(s),s′) ( R(s,π(s),s′) + γ V(s′) )
 T(s,π(s),s′) is the probability of transition from state s to state s′ when
π(s) is given.
 R(s,π(s),s′) is the short-term or immediate reward from the state s to
s′, given that action is described by π(s).
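A compact Python sketch of policy iteration following the two steps above (policy evaluation, then policy improvement), under the same assumed dictionary-based MDP interface; names and defaults are illustrative.

```python
def policy_iteration(states, actions, T, R, gamma=0.9, eval_tol=1e-6):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    policy = {s: actions[s][0] for s in states}       # arbitrary initial policy
    V = {s: 0.0 for s in states}

    def q(s, a):
        # Expected return of taking action a in state s under the current value estimates.
        return sum(T.get((s, a, s2), 0.0) * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                   for s2 in states)

    while True:
        # Policy evaluation: compute V for the current policy until it stops changing.
        while True:
            delta = 0.0
            for s in states:
                v_new = q(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < eval_tol:
                break
        # Policy improvement: act greedily with respect to the evaluated values.
        stable = True
        for s in states:
            best_a = max(actions[s], key=lambda a: q(s, a))
            if best_a != policy[s]:
                policy[s], stable = best_a, False
        if stable:
            return policy, V
```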

Difference:-
The difference table between policy iteration and value iteration is given
below:

Value Iteration vs. Policy Iteration:

 Approach: Value iteration directly computes the optimal values; policy iteration alternates between policy evaluation and policy improvement.
 Steps: Value iteration uses the Bellman optimality equation; policy iteration first evaluates the current policy and then improves it.
 Intermediate policies: Value iteration does not explicitly generate intermediate policies; policy iteration generates an intermediate policy in each iteration.
 Convergence criteria: Value iteration stops when the values converge to their optimal values; policy iteration stops when the policy no longer changes between iterations.
 Computational efficiency: Value iteration may require more iterations, but each iteration is cheaper; policy iteration typically converges in fewer iterations, but each iteration includes a full policy evaluation.
 Application: Value iteration is suitable when policy evaluation is expensive; policy iteration is generally more computationally efficient in practice.

Definition of SARSA:-
SARSA is a reinforcement learning algorithm that teaches computers how
to make good decisions by interacting with an environment. SARSA stands
for State-Action-Reward-State-Action, which represents the algorithm's
sequence of steps. It helps computers learn from their experiences to
determine the best actions.

Explanation of SARSA:-

Assume you're teaching a robot to navigate a maze. The robot begins at a specific location (the "State" - where it is), and you want it to discover the best path to the maze's finish. The robot can proceed in numerous directions at each step (these are the "Actions" - what it does). As it travels, the robot receives feedback through rewards - positive or negative numbers indicating its performance.

The amazing thing about SARSA is that it doesn't need a map of the maze
or explicit instructions on what to do. It learns by trial and error, discovering
which actions work best in different situations. This way, SARSA helps
computers learn to make decisions in various scenarios, from games to
driving cars to managing resources efficiently.
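SARSA's update uses the action actually chosen in the next state: Q(s,a) ← Q(s,a) + α [ r + γ Q(s′,a′) − Q(s,a) ]. The sketch below shows one episode of that on-policy loop in Python; the env interface (reset/step/actions) and the parameter values are assumptions for illustration, not part of the notes.

```python
import random
from collections import defaultdict

# SARSA sketch: the update bootstraps from the action a' actually chosen in the
# next state (on-policy), unlike Q-learning's max over next actions.
# `env` (with reset(), step(action) -> (next_state, reward, done), and a list
# of actions) is a hypothetical interface assumed for illustration.

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = defaultdict(float)

def epsilon_greedy(state, actions):
    if random.random() < epsilon:
        return random.choice(actions)                     # explore
    return max(actions, key=lambda a: Q[(state, a)])      # exploit

def run_episode(env):
    state = env.reset()
    action = epsilon_greedy(state, env.actions)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        next_action = epsilon_greedy(next_state, env.actions)
        # State-Action-Reward-State-Action update:
        Q[(state, action)] += alpha * (reward + gamma * Q[(next_state, next_action)]
                                       - Q[(state, action)])
        state, action = next_state, next_action
```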
Applications of SARSA:-

Game Playing:

o SARSA can train agents to play games effectively by learning optimal strategies. In board games like chess, it can explore different move sequences and adapt its decisions based on rewards (winning, drawing, losing).
o SARSA can control game characters in video games, making them
learn to navigate complex levels, avoid obstacles, and interact with
other in-game entities.

Robotics:
o SARSA is invaluable for robotic systems. Robots can learn how to move, interact with objects, and perform tasks through interactions with their environment.
o SARSA can guide a robot in exploring and mapping unknown environments, enabling efficient exploration and mapping strategies.

Autonomous Vehicles:

o Self-driving cars can use SARSA to learn safe and efficient driving
behaviors. The algorithm helps them navigate various traffic
scenarios, such as lane changes, merging, and negotiating
intersections.
o SARSA can optimize real-time decision-making based on sensor
inputs, traffic patterns, and road conditions.

Resource Management:

o In energy management, SARSA can control the charging and discharging of batteries in a renewable energy system to maximize energy utilization while considering varying demand and supply conditions.
o It can optimize the allocation of resources in manufacturing
processes, ensuring efficient utilization of machines, materials, and
labor.

Finance and Trading:

o SARSA can be applied in algorithmic trading to learn optimal buying and selling strategies in response to market data.
o The algorithm can adapt trading decisions based on historical market
trends, news sentiment, and other financial indicators.

Healthcare:

o In personalized medicine, SARSA could optimize treatment plans for individual patients by learning from historical patient data and adjusting treatment parameters.
o SARSA can aid in resource allocation, such as hospital bed
scheduling, to minimize patient wait times and optimize resource
utilization.

Network Routing:

o Telecommunication networks can benefit from SARSA for dynamic routing decisions, minimizing latency and congestion.
o SARSA can adapt routing strategies to optimize data transmission
paths based on changing network conditions.

Benefits of SARSA:-

The SARSA (State-Action-Reward-State-Action) reinforcement learning algorithm has several distinct advantages, making it a valuable tool for solving sequential decision-making problems in various domains. Here are some of its key advantages:

On-Policy Learning:
SARSA is an on-policy learning algorithm, which means it updates its Q-
values based on the policy it is currently following. This has several
advantages:

o Stability: SARSA's on-policy nature often leads to more stable learning. Since it learns from experiences generated by its own policy, the updates align with the agent's actions, resulting in smoother and more consistent learning curves.
o Real-Time Adaptation: On-policy algorithms like SARSA are well-suited for online learning scenarios where agents interact with the environment in real time. This adaptability is crucial in applications such as robotics or autonomous vehicles, where decisions must be made on the fly while the agent is in motion.

Balanced Exploration and Exploitation:

SARSA employs exploration strategies, such as epsilon-greedy or softmax policies, to balance the exploration of new actions and the exploitation of known actions (a short sketch of epsilon-greedy selection follows this list):

o Exploration: SARSA explores different actions to discover their consequences and learn the best strategies. This is essential for learning about uncertain or unexplored aspects of the environment.
o Exploitation: The algorithm uses its current policy to exploit actions leading to higher rewards. This ensures that the agent leverages its existing knowledge to make optimal decisions.
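As a concrete illustration of this balance, here is a minimal epsilon-greedy sketch in Python with a simple decay schedule; the function names and numbers are illustrative assumptions, not part of the notes.

```python
import random

# Epsilon-greedy sketch: with probability epsilon take a random action (explore),
# otherwise take the action with the highest current Q-value (exploit).

def epsilon_greedy(Q, state, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit

# An illustrative decay schedule: start exploratory, become greedier over episodes.
def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    return max(end, start * (decay ** episode))
```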

Convergence to Stable Policies:

The combination of on-policy learning and balanced exploration contributes to SARSA's convergence to stable policies.

Disadvantages of SARSA:-
While SARSA (State-Action-Reward-State-Action) has many advantages, it
also has limitations and disadvantages. Let's explore some of these
drawbacks:
1. On-Policy Learning Limitation:
o While advantageous in some scenarios, SARSA's on-policy
learning approach can also be a limitation. It means that the
algorithm updates its Q-values based on its current policy. This
can slow down learning, especially in situations where
exploration is challenging or when there's a need to explore
more diverse actions.
2. Exploration Challenges:
o Like many reinforcement learning algorithms, SARSA can
struggle with exploration in environments where rewards are
sparse or delayed. It might get stuck in suboptimal policies if it
does not explore sufficiently to discover better strategies.
3. Convergence Speed:
o SARSA's convergence speed might be slower compared to off-
policy algorithms like Q-learning. Since SARSA learns from its
current policy, exploring and finding optimal policies might take
longer, especially in complex environments.
4. Bias in Value Estimation:
o SARSA can be sensitive to initial conditions and early
experiences, leading to potential bias in the estimation of Q-
values. Biased initial Q-values can influence the learning
process and impact the quality of the learned policy.
5. Efficiency in Large State Spaces:
o SARSA's learning process might become computationally
expensive and time-consuming in environments with large state
spaces. The agent must explore a substantial portion of the
state space to learn effective policies.
6. Optimality of Policy:
o SARSA may fail to converge to the optimal policy, particularly
when exploration is limited or when the optimal policy is
complex and difficult to approximate.
7. Difficulty in High-Dimensional Inputs:
o SARSA's tabular representation of Q-values might be less
effective when dealing with high-dimensional or continuous
state and action spaces. Function approximation techniques
would be needed to handle such scenarios.
8. Trade-off Between Exploration and Exploitation:
o SARSA's exploration strategy, like epsilon-greedy, requires
tuning of hyperparameters, such as the exploration rate. Finding
the right balance between exploration and exploitation can be
challenging and impact the algorithm's performance.
9. Sensitivity to Hyperparameters:
o SARSA's performance can be sensitive to the choice of
hyperparameters, including the learning rate, discount factor,
and exploration parameters. Fine-tuning these parameters can
be time-consuming.
10. Limited for Off-Policy Tasks:
o SARSA is inherently an on-policy algorithm and might not be the
best choice for tasks where off-policy learning is more suitable,
such as scenarios where learning from historical data is
essential.

Despite these limitations, SARSA remains a valuable reinforcement learning algorithm in various contexts. Its disadvantages are often addressed by combining it with other techniques or by selecting appropriate algorithms based on the specific characteristics of the problem at hand.
