ML Unit 4

What is reinforcement learning?

Reinforcement learning (RL) is a machine learning (ML) technique that trains software to make decisions that achieve optimal results. It mimics the trial-and-error learning process that humans use to achieve their goals. Software actions that work towards your goal are reinforced, while actions that detract from the goal are ignored.

RL algorithms use a reward-and-punishment paradigm as they process data. They learn from the feedback of each action and self-discover the best processing paths to achieve final outcomes. The algorithms are also capable of delayed gratification: the best overall strategy may require short-term sacrifices, so the best approach they discover may include some punishments or backtracking along the way. RL is a powerful method to help artificial intelligence (AI) systems achieve optimal outcomes in unseen environments.

There are many different algorithms that tackle this problem. In fact, Reinforcement Learning is defined by a specific type of problem, and all its solutions are classed as Reinforcement Learning algorithms. In the problem, an agent is supposed to decide the best action to select based on its current state. When this step is repeated, the problem is known as a Markov Decision Process.
A Markov Decision Process (MDP) model contains:

 A set of possible world states S.
 A set of Models.
 A set of possible actions A.
 A real-valued reward function R(s,a).
 A policy π, the solution of the Markov Decision Process.
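To make these components concrete, here is a minimal sketch (in Python) of how such an MDP could be represented with plain data structures; the states, actions, transition probabilities and rewards below are illustrative assumptions, not values taken from these notes.

```python
# A minimal sketch of an MDP as plain Python data structures.
# All names and numbers here are illustrative, not values from these notes.

states = ["s0", "s1"]                      # S: set of possible world states
actions = {"s0": ["stay", "go"],           # A(s): actions available in each state
           "s1": ["stay"]}

# Transition model T(s, a, s'): probability of reaching s' from s by taking a.
T = {
    ("s0", "stay", "s0"): 1.0,
    ("s0", "go",   "s1"): 0.8,             # a noisy action: usually succeeds...
    ("s0", "go",   "s0"): 0.2,             # ...but sometimes leaves us in place
    ("s1", "stay", "s1"): 1.0,
}

# Reward function R(s, a): real-valued reward for taking action a in state s.
R = {("s0", "stay"): 0.0, ("s0", "go"): -0.04, ("s1", "stay"): 1.0}

# A policy maps each state to an action (here written down directly).
policy = {"s0": "go", "s1": "stay"}
```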

What is a State?

A State is a set of tokens that represent every state that the agent can
be in.

What is a Model?

A Model (sometimes called a Transition Model) gives an action's effect in a state. In particular, T(S, a, S') defines a transition T where being in state S and taking an action 'a' takes us to state S' (S and S' may be the same). For stochastic actions (noisy, non-deterministic) we also define a probability P(S'|S,a), which represents the probability of reaching a state S' if action 'a' is taken in state S. Note that the Markov property states that the effects of an action taken in a state depend only on that state and not on the prior history.
What are Actions?

An Action set A is the set of all possible actions. A(s) defines the set of actions that can be taken in state S.

What is a Reward?

A Reward is a real-valued reward function. R(s) indicates the reward for simply being in the state S. R(S,a) indicates the reward for being in a state S and taking an action 'a'. R(S,a,S') indicates the reward for being in a state S, taking an action 'a' and ending up in a state S'.

What is a Policy?

A Policy is a solution to the Markov Decision Process. A policy is a mapping from states S to actions a. It indicates the action 'a' to be taken while in state S.
Let us take the example of a grid world:
An agent lives in a 3×4 grid. The grid has a START state (grid no 1,1). The purpose of the agent is to wander around the grid to finally reach the Blue Diamond (grid no 4,3). Under all circumstances, the agent should avoid the Fire grid (orange color, grid no 4,2). Also, grid no 2,2 is a blocked grid: it acts as a wall, hence the agent cannot enter it.
The agent can take any one of these actions: UP, DOWN, LEFT, RIGHT.
Walls block the agent's path, i.e., if there is a wall in the direction the agent would have moved, the agent stays in the same place. So, for example, if the agent chooses LEFT in the START grid, it would stay put in the START grid.
First Aim: To find the shortest sequence getting from START to the Diamond. Two such sequences can be found:

 RIGHT RIGHT UP UP RIGHT
 UP UP RIGHT RIGHT RIGHT
Let us take the second one (UP UP RIGHT RIGHT RIGHT) for the subsequent discussion.
The move is now noisy. 80% of the time the intended action works correctly. 20% of the time the action the agent takes causes it to move at right angles to the intended direction. For example, if the agent says UP, the probability of going UP is 0.8, whereas the probability of going LEFT is 0.1 and the probability of going RIGHT is 0.1 (since LEFT and RIGHT are at right angles to UP). A small sketch of this noisy transition model appears after the list below.
The agent receives rewards at each time step:-

 A small reward each step (this can be negative, in which case it acts as a punishment; in the above example, entering the Fire grid gives a reward of -1).
 Big rewards come at the end (good or bad).
 The goal is to maximize the sum of rewards.
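To illustrate the noisy moves described above, the following small Python sketch (a hypothetical helper, not part of the original notes) returns the distribution over actual moves for an intended action using the 0.8/0.1/0.1 split.

```python
# Sketch of the noisy move model described above: the intended action succeeds
# with probability 0.8; with probability 0.1 each, the agent slips to one of the
# two directions at right angles to the intended one.

RIGHT_ANGLES = {
    "UP":    ("LEFT", "RIGHT"),
    "DOWN":  ("LEFT", "RIGHT"),
    "LEFT":  ("UP", "DOWN"),
    "RIGHT": ("UP", "DOWN"),
}

def noisy_action_distribution(intended):
    """Return a dict mapping actual moves to their probabilities."""
    side_a, side_b = RIGHT_ANGLES[intended]
    return {intended: 0.8, side_a: 0.1, side_b: 0.1}

print(noisy_action_distribution("UP"))   # {'UP': 0.8, 'LEFT': 0.1, 'RIGHT': 0.1}
```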

Bellman Equation:-

According to the Bellman Equation, the long-term reward for a given action is equal to the reward from the current action combined with the expected reward from the future actions taken at the following time steps. Let's try to understand this first.

Let's take an example:

Here we have a maze, which is our environment, and the sole goal of our agent is to reach the trophy state (R = 1), i.e., to get a good reward, and to avoid the fire state (R = -1), because ending there is a failure, i.e., a bad reward.

What happens without Bellman Equation?

Initially, we will give our agent some time to explore the environment and let it figure out a path to the goal. As soon as it reaches its goal, it will trace its steps back to its starting position and mark the values of all the states that eventually lead towards the goal as V = 1.
The agent will face no problem until we change its starting position; then it will not be able to find a path towards the trophy state, since the value of all the marked states is equal to 1. So, to solve this problem we should use the Bellman Equation:

V(s) = max_a [ R(s,a) + γ V(s′) ]

State(s): the current state where the agent is in the environment.
Next State(s′): after taking action (a) at state (s), the agent reaches s′.
Value(V): a numeric representation of a state which helps the agent to find its path. V(s) here means the value of the state s.
Reward(R): the feedback which the agent gets after performing an action (a).
 R(s): reward for being in the state s
 R(s,a): reward for being in the state s and performing an action a
 R(s,a,s′): reward for being in a state s, taking an action a and ending up in s′
e.g. a good reward can be +1, a bad reward can be -1, no reward can be 0.
Action(a): the set of possible actions that can be taken by the agent in the state (s), e.g. (LEFT, RIGHT, UP, DOWN).
Discount factor(γ): determines how much the agent cares about rewards in the distant future relative to those in the immediate future. It has a value between 0 and 1. A lower value encourages short-term rewards, while a higher value emphasizes long-term rewards.

The max denotes the most optimal action among all the actions that the agent can take in a particular state, i.e., the one that leads to the highest expected reward when this process is repeated at every consecutive step.
For example:
 The state to the left of the fire state (V = 0.9) can go UP, DOWN or RIGHT, but NOT LEFT, because that direction is a wall (not accessible). Among all the available actions, the maximum value for that state comes from the UP action.
 From the current starting state, our agent can choose either UP or RIGHT at random, since both lead towards the reward in the same number of steps.
By using the Bellman equation our agent will calculate the value of every state except for the trophy and the fire state (V = 0); these terminal states do not get values of their own since they are the end of the maze.
So, after making such a plan our agent can easily accomplish its goal by just following the increasing values.
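To show how these values arise, here is a single Bellman backup for one state in Python; the neighbour values, rewards and γ = 0.9 are illustrative assumptions chosen so the result matches the V = 0.9 state discussed above.

```python
# One Bellman backup for a single state, using V(s) = max_a [ R(s,a) + gamma * V(s') ]
# and treating the moves as deterministic for simplicity. LEFT is omitted because
# that direction is a wall for this state. The numbers are illustrative.

gamma = 0.9

outcomes = {   # for each action: immediate reward and value of the state reached
    "UP":    {"reward": 0.0, "next_value": 1.0},   # e.g. moving towards the trophy state
    "RIGHT": {"reward": 0.0, "next_value": 0.0},
    "DOWN":  {"reward": 0.0, "next_value": 0.0},
}

V_s = max(o["reward"] + gamma * o["next_value"] for o in outcomes.values())
best_action = max(outcomes, key=lambda a: outcomes[a]["reward"] + gamma * outcomes[a]["next_value"])

print(V_s, best_action)   # 0.9 UP  -- matches the V = 0.9 state discussed above
```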

What is Q-learning?:-

Q-learning is a machine learning approach that enables a model to iteratively learn and improve over time by taking the correct actions. Q-learning is a type of reinforcement learning.

With reinforcement learning, a machine learning model is trained to mimic the way animals or children learn. Good actions are rewarded or reinforced, while bad actions are discouraged and penalized.

With the state-action-reward-state-action form of reinforcement learning, the training regimen follows the agent's current policy to take the right actions. Q-learning provides a model-free approach to reinforcement learning: there is no model of the environment to guide the reinforcement learning process. The agent -- which is the AI component that acts in the environment -- iteratively learns and makes predictions about the environment on its own.

Q-learning also takes an off-policy approach to reinforcement learning. A Q-learning approach aims to determine the optimal action based on its current state. It can accomplish this by either developing its own set of rules or deviating from the prescribed policy. Because Q-learning may deviate from the given policy, a defined policy is not needed.

How does Q-learning work?


Q-learning models operate in an iterative process that involves multiple components working together to train a model: the agent learns by exploring the environment and updating the model as the exploration continues. The components of Q-learning include the following:
 Agents. The agent is the entity that acts and operates within an environment.
 States. The state is a variable that identifies the agent's current position in the environment.
 Actions. The action is the agent's operation when it is in a specific state.
 Rewards. A foundational concept within reinforcement learning is the concept of providing either a positive or a negative response for the agent's actions.
 Episodes. An episode ends when an agent reaches a terminal state and can no longer take a new action.
 Q-values. The Q-value is the metric used to measure the value of taking an action in a particular state.
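The heart of Q-learning is the update Q(s,a) ← Q(s,a) + α [ r + γ max_a′ Q(s′,a′) − Q(s,a) ]. Below is a minimal Python sketch of that update with epsilon-greedy action selection; the parameter values and the idea of states and actions as hashable keys are assumptions for illustration, not part of the notes.

```python
import random
from collections import defaultdict

# Minimal sketch of the Q-learning update rule.

alpha, gamma, epsilon = 0.1, 0.9, 0.1        # learning rate, discount factor, exploration rate
Q = defaultdict(float)                       # Q[(state, action)] -> current value estimate

def choose_action(state, actions):
    """Epsilon-greedy: mostly exploit the best known action, occasionally explore."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, next_actions):
    """Off-policy update: bootstrap from the best action available in the next state."""
    best_next = max(Q[(next_state, a)] for a in next_actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```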

Advantages of Q-learning:-
The Q-learning approach to reinforcement learning can potentially be advantageous for several reasons, including the following:

 Model-free. The model-free approach is the foundation of Q-learning and one of its biggest potential advantages for some uses. Rather than requiring prior knowledge about an environment, the Q-learning agent can learn about the environment as it trains. The model-free approach is particularly beneficial for scenarios where the underlying dynamics of an environment are difficult to model or completely unknown.
 Off-policy optimization. The model can optimize to get the best possible result without being strictly tethered to a policy that might not enable the same degree of optimization.
 Flexibility. The model-free, off-policy approach gives Q-learning the flexibility to work across a variety of problems and environments.
 Offline training. A Q-learning model can be trained on pre-collected, offline data sets.
Disadvantages of Q-learning:-
The Q-learning approach to reinforcement learning also has some disadvantages, such as the following:

 Exploration vs. exploitation tradeoff. It can be hard for a Q-learning model to find the right balance between trying new actions and sticking with what's already known. It's a dilemma that is commonly referred to as the exploration vs. exploitation tradeoff in reinforcement learning.
 Curse of dimensionality. Q-learning can potentially face a machine learning risk known as the curse of dimensionality. The curse of dimensionality is a problem with high-dimensional data where the amount of data required to represent the distribution increases exponentially. This can lead to computational challenges and decreased accuracy.
 Overestimation. A Q-learning model can sometimes be too optimistic and overestimate how good a particular action or strategy is.
 Performance. A Q-learning model can take a long time to figure out the best method if there are several ways to approach a problem.

Value iteration vs policy iteration:-

In reinforcement learning, Markov decision processes (MDPs) help in decision-making problems. Such problems include finding an optimal policy, where states are mapped to actions to maximize the overall reward over time. To tackle this reward-optimization problem, there are two approaches:

 Value iteration
 Policy iteration

In this section, we will discuss the algorithms mentioned above and delve into their differences as well.

Value iteration:-
Value iteration is a dynamic programming algorithm in which an agent interacts with its surroundings through actions to maximize long-term reward. It uses the values of neighbouring states to refine the estimate of each state's value. Value iteration starts with arbitrary initial estimates and improves them until they converge to the optimal values.

V(s) = max_a Σ_{s′} T(s,a,s′) ( R(s,a,s′) + γ V(s′) )


 Here V(s) is the value of state s.
 max_a selects the best action to find the optimal solution.
 T(s,a,s′) is the probability of the agent moving from state s to s′ by taking an action a.
 R(s,a,s′) is the reward of the agent when it moves from state s to s′.
 γ represents the discount factor that determines the significance of long-term rewards as compared to short-term rewards.
 V(s′) is the value of the next state s′.
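A minimal Python sketch of value iteration based on the formula above; it assumes the MDP is supplied as dictionaries T and R keyed by (s, a, s′) and that every state has at least one action. The names and default parameters are illustrative.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Iterate V(s) = max_a sum_s' T(s,a,s') * (R(s,a,s') + gamma * V(s')) until convergence."""
    V = {s: 0.0 for s in states}                      # arbitrary initial estimates
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(T.get((s, a, s2), 0.0) * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2 in states)
                for a in actions[s]                   # assumes at least one action per state
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:                               # stop once values stop changing
            return V
```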
Policy iteration:-

Policy iteration is an iterative method that alternates between evaluating and improving a policy until an optimal policy is found.

Mathematical intuition

There are two parts of policy iteration, which are:

 Policy evaluation
 Policy improvement

In policy evaluation, we evaluate V(s) for the current policy π(s) until the values converge for that policy.

V(s) = Σ_{s′} T(s,π(s),s′) ( R(s,π(s),s′) + γ V(s′) )
 T(s,π(s),s′) is the probability of transition from state s to state s′ when
π(s) is given.
 R(s,π(s),s′) is the short-term or immediate reward from the state s to
s′, given that action is described by π(s).
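A compact Python sketch of policy iteration following the two steps above (policy evaluation, then policy improvement), under the same assumed dictionary-based MDP interface; names and defaults are illustrative.

```python
def policy_iteration(states, actions, T, R, gamma=0.9, eval_tol=1e-6):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    policy = {s: actions[s][0] for s in states}       # arbitrary initial policy
    V = {s: 0.0 for s in states}

    def q(s, a):
        # Expected return of taking action a in state s under the current value estimates.
        return sum(T.get((s, a, s2), 0.0) * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                   for s2 in states)

    while True:
        # Policy evaluation: compute V for the current policy until it stops changing.
        while True:
            delta = 0.0
            for s in states:
                v_new = q(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < eval_tol:
                break
        # Policy improvement: act greedily with respect to the evaluated values.
        stable = True
        for s in states:
            best_a = max(actions[s], key=lambda a: q(s, a))
            if best_a != policy[s]:
                policy[s], stable = best_a, False
        if stable:
            return policy, V
```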

Difference:-
The difference table between policy iteration and value iteration is given
below:

Value Iteration vs. Policy Iteration:

 Approach: Value iteration directly computes the optimal values; policy iteration alternates between policy evaluation and policy improvement.
 Steps: Value iteration uses the Bellman optimality equation; policy iteration first evaluates the current policy and then improves it.
 Intermediate policies: Value iteration does not explicitly generate intermediate policies; policy iteration generates an intermediate policy in each iteration.
 Convergence criteria: Value iteration stops when the values converge to their optimal values; policy iteration stops when the policy no longer changes between iterations.
 Computational efficiency: Value iteration may require more iterations, but each iteration is cheaper; policy iteration typically converges in fewer iterations, but each iteration includes a full policy evaluation.
 Application: Value iteration is suitable when policy evaluation is expensive; policy iteration is generally more computationally efficient in practice.

Definition of SARSA:-
SARSA is a reinforcement learning algorithm that teaches computers how
to make good decisions by interacting with an environment. SARSA stands
for State-Action-Reward-State-Action, which represents the algorithm's
sequence of steps. It helps computers learn from their experiences to
determine the best actions.

Explanation of SARSA:-

Assume you're teaching a robot to navigate a maze. The robot begins at a specific location (the "State" - where it is), and you want it to discover the best path to the maze's finish. The robot can proceed in numerous directions at each step (these are the "Actions" - what it does). As it travels, the robot receives feedback through rewards - positive or negative numbers indicating its performance.

The amazing thing about SARSA is that it doesn't need a map of the maze
or explicit instructions on what to do. It learns by trial and error, discovering
which actions work best in different situations. This way, SARSA helps
computers learn to make decisions in various scenarios, from games to
driving cars to managing resources efficiently.
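SARSA's update uses the action actually chosen in the next state: Q(s,a) ← Q(s,a) + α [ r + γ Q(s′,a′) − Q(s,a) ]. The sketch below shows one episode of that on-policy loop in Python; the env interface (reset/step/actions) and the parameter values are assumptions for illustration, not part of the notes.

```python
import random
from collections import defaultdict

# SARSA sketch: the update bootstraps from the action a' actually chosen in the
# next state (on-policy), unlike Q-learning's max over next actions.
# `env` (with reset(), step(action) -> (next_state, reward, done), and a list
# of actions) is a hypothetical interface assumed for illustration.

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = defaultdict(float)

def epsilon_greedy(state, actions):
    if random.random() < epsilon:
        return random.choice(actions)                     # explore
    return max(actions, key=lambda a: Q[(state, a)])      # exploit

def run_episode(env):
    state = env.reset()
    action = epsilon_greedy(state, env.actions)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        next_action = epsilon_greedy(next_state, env.actions)
        # State-Action-Reward-State-Action update:
        Q[(state, action)] += alpha * (reward + gamma * Q[(next_state, next_action)]
                                       - Q[(state, action)])
        state, action = next_state, next_action
```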
Applications of SARSA:-

Game Playing:

o SARSA can train agents to play games effectively by learning optimal strategies. In board games like chess, it can explore different move sequences and adapt its decisions based on rewards (winning, drawing, losing).
o SARSA can control game characters in video games, making them
learn to navigate complex levels, avoid obstacles, and interact with
other in-game entities.

Robotics:
o SARSA is invaluable for robotic systems. Robots can learn how to move, interact with objects, and perform tasks through interactions with their environment.
o SARSA can guide a robot in exploring and mapping unknown environments, enabling efficient exploration and mapping strategies.

Autonomous Vehicles:

o Self-driving cars can use SARSA to learn safe and efficient driving
behaviors. The algorithm helps them navigate various traffic
scenarios, such as lane changes, merging, and negotiating
intersections.
o SARSA can optimize real-time decision-making based on sensor
inputs, traffic patterns, and road conditions.

Resource Management:

o In energy management, SARSA can control the charging and discharging of batteries in a renewable energy system to maximize energy utilization while considering varying demand and supply conditions.
o It can optimize the allocation of resources in manufacturing
processes, ensuring efficient utilization of machines, materials, and
labor.

Finance and Trading:

o SARSA can be applied in algorithmic trading to learn optimal buying and selling strategies in response to market data.
o The algorithm can adapt trading decisions based on historical market
trends, news sentiment, and other financial indicators.

Healthcare:

o In personalized medicine, SARSA could optimize treatment plans for individual patients by learning from historical patient data and adjusting treatment parameters.
o SARSA can aid in resource allocation, such as hospital bed
scheduling, to minimize patient wait times and optimize resource
utilization.

Network Routing:

o Telecommunication networks can benefit from SARSA for dynamic routing decisions, minimizing latency and congestion.
o SARSA can adapt routing strategies to optimize data transmission
paths based on changing network conditions.

Benefits of SARSA:-

The SARSA (State-Action-Reward-State-Action) reinforcement learning algorithm has several distinct advantages, making it a valuable tool for solving sequential decision-making problems in various domains. Here are some of its key advantages:

On-Policy Learning:
SARSA is an on-policy learning algorithm, which means it updates its Q-
values based on the policy it is currently following. This has several
advantages:

o Stability: SARSA's on-policy nature often leads to more stable learning. Since it learns from experiences generated by its own policy, the updates align with the agent's actions, resulting in smoother and more consistent learning curves.
o Real-Time Adaptation: On-policy algorithms like SARSA are well-suited for online learning scenarios where agents interact with the environment in real time. This adaptability is crucial in applications such as robotics or autonomous vehicles, where decisions must be made on the fly while the agent is in motion.

Balanced Exploration and Exploitation:

SARSA employs exploration strategies, such as epsilon-greedy or softmax policies, to balance the exploration of new actions and the exploitation of known actions (a short sketch of epsilon-greedy selection follows this list):

o Exploration: SARSA explores different actions to discover their consequences and learn the best strategies. This is essential for learning about uncertain or unexplored aspects of the environment.
o Exploitation: The algorithm uses its current policy to exploit actions leading to higher rewards. This ensures that the agent leverages its existing knowledge to make optimal decisions.
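As a concrete illustration of this balance, here is a minimal epsilon-greedy sketch in Python with a simple decay schedule; the function names and numbers are illustrative assumptions, not part of the notes.

```python
import random

# Epsilon-greedy sketch: with probability epsilon take a random action (explore),
# otherwise take the action with the highest current Q-value (exploit).

def epsilon_greedy(Q, state, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit

# An illustrative decay schedule: start exploratory, become greedier over episodes.
def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    return max(end, start * (decay ** episode))
```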

Convergence to Stable Policies:

The combination of on-policy learning and balanced exploration contributes to SARSA's convergence to stable policies.

Disadvantages of SARSA:-
While SARSA (State-Action-Reward-State-Action) has many advantages, it
also has limitations and disadvantages. Let's explore some of these
drawbacks:
1. On-Policy Learning Limitation:
o While advantageous in some scenarios, SARSA's on-policy
learning approach can also be a limitation. It means that the
algorithm updates its Q-values based on its current policy. This
can slow down learning, especially in situations where
exploration is challenging or when there's a need to explore
more diverse actions.
2. Exploration Challenges:
o Like many reinforcement learning algorithms, SARSA can
struggle with exploration in environments where rewards are
sparse or delayed. It might get stuck in suboptimal policies if it
does not explore sufficiently to discover better strategies.
3. Convergence Speed:
o SARSA's convergence speed might be slower compared to off-
policy algorithms like Q-learning. Since SARSA learns from its
current policy, exploring and finding optimal policies might take
longer, especially in complex environments.
4. Bias in Value Estimation:
o SARSA can be sensitive to initial conditions and early
experiences, leading to potential bias in the estimation of Q-
values. Biased initial Q-values can influence the learning
process and impact the quality of the learned policy.
5. Efficiency in Large State Spaces:
o SARSA's learning process might become computationally
expensive and time-consuming in environments with large state
spaces. The agent must explore a substantial portion of the
state space to learn effective policies.
6. Optimality of Policy:
o SARSA may fail to converge to the optimal policy, particularly
when exploration is limited or when the optimal policy is
complex and difficult to approximate.
7. Difficulty in High-Dimensional Inputs:
o SARSA's tabular representation of Q-values might be less
effective when dealing with high-dimensional or continuous
state and action spaces. Function approximation techniques
would be needed to handle such scenarios.
8. Trade-off Between Exploration and Exploitation:
o SARSA's exploration strategy, like epsilon-greedy, requires
tuning of hyperparameters, such as the exploration rate. Finding
the right balance between exploration and exploitation can be
challenging and impact the algorithm's performance.
9. Sensitivity to Hyperparameters:
o SARSA's performance can be sensitive to the choice of
hyperparameters, including the learning rate, discount factor,
and exploration parameters. Fine-tuning these parameters can
be time-consuming.
10. Limited for Off-Policy Tasks:
o SARSA is inherently an on-policy algorithm and might not be the
best choice for tasks where off-policy learning is more suitable,
such as scenarios where learning from historical data is
essential.

Despite these limitations, SARSA remains a valuable reinforcement learning algorithm in various contexts. Its disadvantages are often addressed by combining it with other techniques or by selecting appropriate algorithms based on the specific characteristics of the problem at hand.
