ML unit 4
What is a State?
A State is a set of tokens that represent every state that the agent can
be in.
What is a Model?
A Model describes the effect of an action taken in a state: which state the agent ends up in and, in stochastic environments, with what probability.
What is an Action?
An Action A is the set of all possible actions. A(s) defines the set of actions that can be taken in state S.
What is a Reward?
A Reward is the feedback the agent receives for its actions:
A small reward is given at each step (it can be negative, in which case it can also be termed a punishment; in the above example, entering the Fire state can have a reward of -1).
Big rewards come at the end (good or bad).
The goal is to maximize the sum of rewards.
What is a Policy?
A Policy is the strategy the agent follows: a mapping that tells the agent which action to take in each state.
Bellman Equation:-
Initially, we will give our agent some time to explore the environment and let it figure out a path to the goal. As soon as it reaches its goal, it will backtrace its steps to its starting position and mark the value of every state that eventually leads towards the goal as V = 1.
The agent faces no problem until we change its starting position: then it cannot find a path towards the trophy state, because the value of all the marked states is equal to 1 and nothing distinguishes one from another. To solve this problem, we use the Bellman Equation:
V(s) = max_a ( R(s, a) + γ V(s′) )
The max denotes the most optimal action among all the actions the agent can take in a particular state, i.e. the action that leads towards the reward when this process is repeated at every consecutive step.
For example:
The state to the left of the fire state (V = 0.9) can go UP, DOWN, or RIGHT, but NOT LEFT, because that is a wall (not accessible). Among all the available actions, the maximum value for that state is obtained by the UP action.
From its current starting state, our agent can choose either action, UP or RIGHT, since both lead towards the reward in the same number of steps.
By using the Bellman equation, our agent calculates the value of every state except the trophy and the fire state (V = 0); these cannot have values, since they are the ends of the maze. After building such a plan, our agent can easily accomplish its goal by simply following the increasing values.
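To make this concrete, here is a minimal sketch (not taken from these notes) that repeatedly applies the Bellman backup on a hypothetical 3x3 grid with a trophy ('T', reward +1) and a fire state ('F', reward -1); the layout, the rewards, and γ = 0.9 are assumptions chosen to mirror the example above.

```python
import numpy as np

# Hypothetical 3x3 maze: 'T' = trophy (terminal, +1), 'F' = fire (terminal, -1),
# '.' = ordinary state. Cells outside the grid act as walls.
grid = [['.', '.', 'T'],
        ['.', '.', 'F'],
        ['.', '.', '.']]
gamma = 0.9                                   # discount factor (assumed)
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # UP, DOWN, LEFT, RIGHT

V = np.zeros((3, 3))                          # terminal states keep V = 0, as in the notes

def reward(r, c):
    """Immediate reward for stepping into cell (r, c)."""
    return {'T': 1.0, 'F': -1.0}.get(grid[r][c], 0.0)

# Repeatedly apply the Bellman backup V(s) = max_a ( R(s, a) + gamma * V(s') ).
for _ in range(50):
    new_V = V.copy()
    for r in range(3):
        for c in range(3):
            if grid[r][c] in ('T', 'F'):      # terminal states are not updated
                continue
            values = []
            for dr, dc in moves:
                nr, nc = r + dr, c + dc
                if 0 <= nr < 3 and 0 <= nc < 3:   # walls are not accessible
                    values.append(reward(nr, nc) + gamma * V[nr, nc])
            new_V[r, c] = max(values)
    V = new_V

print(np.round(V, 2))   # the agent follows increasing values toward the trophy
```

On this assumed layout, the state next to the trophy converges to V = 1 and the state to the left of the fire to V = 0.9, matching the walkthrough above, while the trophy and fire cells stay at V = 0.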
What is Q-learning?:-
Q-learning is a model-free, off-policy reinforcement learning algorithm that learns a value Q(s, a) for every state-action pair from the agent's experience; once the values are learned, the agent simply picks the action with the highest Q-value in each state.
Advantages of Q-learning:-
The Q-learning approach to reinforcement learning can potentially be advantageous for several reasons: it is model-free (no model of the environment is required), and because it is off-policy it can learn the optimal policy even while following an exploratory behavior policy.
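As a rough illustration (not part of the original notes), the core of tabular Q-learning is a single update applied after every step; the defaultdict table, the epsilon-greedy helper, and the hyperparameter values below are assumptions.

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration rate (assumed)
Q = defaultdict(float)                  # Q[(state, action)] -> estimated value

def choose_action(state, actions):
    """Epsilon-greedy: mostly exploit the best known action, sometimes explore."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(state, action, reward, next_state, actions):
    """Off-policy update: bootstrap from the best action in the next state."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```

After each interaction with the environment, the agent would call q_learning_update with the observed transition; the greedy policy is then read off by taking the highest-valued action in each state.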
Value iteration and policy iteration:-
Two classical dynamic programming approaches for computing state values and policies are:
Value iteration
Policy iteration
In this section, we will discuss the two algorithms mentioned above and delve into their differences as well.
Value iteration:-
Value iteration is a dynamic programming algorithm in which an agent interacts with its surroundings through actions to maximize long-term reward. It uses the values of neighboring states to refine its estimates of future state values. Value iteration starts with initial random estimates and improves them until it converges to the optimal values.
Mathematical intuition
Each sweep of value iteration updates the value of every state from the values of its possible successor states:
V(s) = max_a Σ_{s′} T(s, a, s′) ( R(s, a, s′) + γ V(s′) )
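As a rough sketch of this update in code (the tiny two-state MDP below is invented purely for illustration), value iteration repeats the sweep until the values stop changing:

```python
# Hypothetical MDP: T[s][a] is a list of (probability, next_state, reward) tuples.
T = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 0.0)]},
}
gamma, theta = 0.9, 1e-6          # discount factor and convergence threshold (assumed)
V = {s: 0.0 for s in T}           # start from arbitrary (zero) estimates

while True:
    delta = 0.0
    for s in T:
        # V(s) = max_a sum_{s'} T(s, a, s') * (R(s, a, s') + gamma * V(s'))
        best = max(sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a]) for a in T[s])
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:
        break

print(V)   # converged estimates of the optimal state values
```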
Policy iteration:-
Policy iteration is a related dynamic programming algorithm that alternates between two steps until the policy stops changing:
Policy evaluation
Policy improvement
In policy evaluation, we compute V(s) for the current policy π(s), iterating until the values converge:
V(s) = Σ_{s′} T(s, π(s), s′) ( R(s, π(s), s′) + γ V(s′) )
T(s, π(s), s′) is the probability of transitioning from state s to state s′ when the action π(s) is taken.
R(s, π(s), s′) is the short-term or immediate reward for moving from state s to s′, given that the action taken is π(s).
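Under the same assumptions as before (an invented two-state MDP), a minimal sketch of policy iteration alternates this evaluation step with greedy policy improvement:

```python
# Hypothetical MDP: T[s][a] is a list of (probability, next_state, reward) tuples.
T = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 0.0)]},
}
gamma, theta = 0.9, 1e-6
policy = {s: 0 for s in T}        # start from an arbitrary policy
V = {s: 0.0 for s in T}

def q_value(s, a):
    """Expected return of taking action a in state s under the current V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])

while True:
    # Policy evaluation: V(s) = sum_{s'} T(s, pi(s), s') * (R(s, pi(s), s') + gamma * V(s'))
    while True:
        delta = 0.0
        for s in T:
            v = q_value(s, policy[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # Policy improvement: act greedily with respect to the evaluated values
    stable = True
    for s in T:
        best = max(T[s], key=lambda a: q_value(s, a))
        if best != policy[s]:
            policy[s], stable = best, False
    if stable:
        break

print(policy, V)   # the improved policy and its state values
```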
Difference:-
The key differences between policy iteration and value iteration are:
Value iteration folds policy improvement into the value update itself (through the max over actions), whereas policy iteration alternates a full policy evaluation step with a separate improvement step.
Policy iteration typically converges in fewer iterations, but each iteration is more expensive than a single value iteration sweep.
Definition of SARSA:-
SARSA is a reinforcement learning algorithm that teaches computers how
to make good decisions by interacting with an environment. SARSA stands
for State-Action-Reward-State-Action, which represents the algorithm's
sequence of steps. It helps computers learn from their experiences to
determine the best actions.
Explanation of SARSA:-
The amazing thing about SARSA is that it doesn't need a map of the maze
or explicit instructions on what to do. It learns by trial and error, discovering
which actions work best in different situations. This way, SARSA helps
computers learn to make decisions in various scenarios, from games to
driving cars to managing resources efficiently.
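To make the State-Action-Reward-State-Action loop concrete, here is a minimal sketch of the tabular SARSA update (the epsilon-greedy helper and the hyperparameter values are assumptions, not something these notes prescribe):

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration rate (assumed)
Q = defaultdict(float)                  # Q[(state, action)] -> estimated value

def epsilon_greedy(state, actions):
    """Pick a random action with probability epsilon, otherwise the best known one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(state, action, reward, next_state, next_action):
    """On-policy update: bootstrap from the action the agent will actually take next."""
    target = reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```

The only difference from the Q-learning update sketched earlier is the target: SARSA uses the value of the action actually chosen next (on-policy), whereas Q-learning uses the maximum over next actions (off-policy).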
Applications of SARSA:-
Game Playing:
Robotics:
SARSA is invaluable for robotic systems. Robots can learn how to
move, interact with objects, and perform tasks through interactions
with their environment.
Autonomous Vehicles:
o Self-driving cars can use SARSA to learn safe and efficient driving
behaviors. The algorithm helps them navigate various traffic
scenarios, such as lane changes, merging, and negotiating
intersections.
o SARSA can optimize real-time decision-making based on sensor
inputs, traffic patterns, and road conditions.
Resource Management:
Healthcare:
Network Routing:
Benefits of SARSA:-
On-Policy Learning:
SARSA is an on-policy learning algorithm, which means it updates its Q-values based on the policy it is currently following. This has several advantages: the learned values stay consistent with the agent's actual (exploratory) behavior, which tends to make learning safer in environments where bad exploratory moves are costly.
Disadvantages of SARSA:-
While SARSA (State-Action-Reward-State-Action) has many advantages, it
also has limitations and disadvantages. Let's explore some of these
drawbacks:
1. On-Policy Learning Limitation:
o While advantageous in some scenarios, SARSA's on-policy
learning approach can also be a limitation. It means that the
algorithm updates its Q-values based on its current policy. This
can slow down learning, especially in situations where
exploration is challenging or when there's a need to explore
more diverse actions.
2. Exploration Challenges:
o Like many reinforcement learning algorithms, SARSA can struggle with exploration in environments where rewards are sparse or delayed. It might get stuck in suboptimal policies if it fails to explore sufficiently to discover better strategies.
3. Convergence Speed:
o SARSA's convergence speed might be slower compared to off-
policy algorithms like Q-learning. Since SARSA learns from its
current policy, exploring and finding optimal policies might take
longer, especially in complex environments.
4. Bias in Value Estimation:
o SARSA can be sensitive to initial conditions and early
experiences, leading to potential bias in the estimation of Q-
values. Biased initial Q-values can influence the learning
process and impact the quality of the learned policy.
5. Efficiency in Large State Spaces:
o SARSA's learning process might become computationally
expensive and time-consuming in environments with large state
spaces. The agent must explore a substantial portion of the
state space to learn effective policies.
6. Optimality of Policy:
o SARSA does not always converge to the optimal policy, particularly when exploration is limited or when the optimal policy is complex and difficult to approximate.
7. Difficulty in High-Dimensional Inputs:
o SARSA's tabular representation of Q-values might be less
effective when dealing with high-dimensional or continuous
state and action spaces. Function approximation techniques
would be needed to handle such scenarios.
8. Trade-off Between Exploration and Exploitation:
o SARSA's exploration strategy, like epsilon-greedy, requires
tuning of hyperparameters, such as the exploration rate. Finding
the right balance between exploration and exploitation can be
challenging and impact the algorithm's performance.
9. Sensitivity to Hyperparameters:
o SARSA's performance can be sensitive to the choice of
hyperparameters, including the learning rate, discount factor,
and exploration parameters. Fine-tuning these parameters can
be time-consuming.
10. Limited for Off-Policy Tasks:
o SARSA is inherently an on-policy algorithm and might not be the
best choice for tasks where off-policy learning is more suitable,
such as scenarios where learning from historical data is
essential.