Reinforcement Learning
Reinforcement Learning (RL) is a machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences.
Unlike supervised learning, where the feedback provided to the agent is the correct set of actions for performing a task, reinforcement learning uses rewards and punishments as signals for positive and negative behavior. Reinforcement learning also differs from unsupervised learning in its goal: while the goal in unsupervised learning is to find similarities and differences between data points, the goal in reinforcement learning is to find a suitable action model that maximizes the total cumulative reward of the agent.
An RL problem is best explained through games. Take the game of PacMan, where the goal of the agent (PacMan) is to eat the food in the grid while avoiding the ghosts on its way. In this case, the grid world is the interactive environment in which the agent acts. The agent receives a reward for eating food and a punishment if it gets killed by a ghost (loses the game). The states are the locations of the agent in the grid world, and the total cumulative reward is the agent winning the game.
There are three broad approaches to implementing reinforcement learning:
Value-Based – The main goal of this method is to maximize a value function. The agent, acting through a policy, expects a long-term return from the current states.
Policy-Based – In policy-based methods, you come up with a strategy that helps gain the maximum reward in the future through the possible actions performed in each state. Policy-based methods can be deterministic or stochastic.
Model-Based – In this method, we create a virtual model of the environment, and the agent learns to perform within that specific environment.
Q-learning is a commonly used model-free approach that can be used to build a self-playing PacMan agent. It revolves around the notion of updating Q-values, where Q(s, a) denotes the value of performing action a in state s. The following value update rule is the core of the Q-learning algorithm.
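In its standard form, the update is

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

where α is the learning rate and γ is the discount factor.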
Q-learning and SARSA (State-Action-Reward-State-Action) are two commonly used model-free RL algorithms. They differ in their exploration strategies, while their exploitation strategies are similar. Q-learning is an off-policy method in which the agent learns the value based on an action a* derived from another policy (such as the greedy policy), whereas SARSA is an on-policy method that learns the value based on its current action a, derived from its current policy. These two methods are simple to implement but lack generality, as they do not have the ability to estimate values for unseen states.
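The difference shows up directly in the two update rules. The sketch below is illustrative rather than a reference implementation: the names q, alpha, gamma, and a_next are assumptions for the example, and q is assumed to be a NumPy array indexed by (state, action).

```python
import numpy as np

def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    # Off-policy: bootstrap from the greedy (max-valued) action in the next
    # state, regardless of which action the behavior policy actually takes.
    target = r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (target - q[s, a])

def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: bootstrap from the action a_next actually chosen by the
    # current policy in the next state.
    target = r + gamma * q[s_next, a_next]
    q[s, a] += alpha * (target - q[s, a])
```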
Positive –
Positive reinforcement occurs when an event, occurring as a result of a particular behavior, increases the strength and frequency of that behavior. In other words, it has a positive effect on behavior.
Advantages of positive reinforcement:
Maximizes performance
Sustains change for a long period of time
A drawback is that too much reinforcement can lead to an overload of states, which can diminish the results.
Negative –
Negative reinforcement is the strengthening of a behavior because a negative condition is stopped or avoided.
Advantages of negative reinforcement:
Increases behavior
Helps enforce a minimum standard of performance
A drawback is that it only provides enough to meet the minimum behavior.
Q-Learning
We build an agent that interacts with the environment through a trial-and-error process. At each time step t, the agent is in a certain state s_t and chooses an action a_t to perform. The environment runs the selected action and returns a reward to the agent. The higher the reward, the better the action. The environment also tells the agent whether the episode is done or not. So an episode can be represented as a sequence of state-action-reward transitions.
In the Q-Learning algorithm, the goal is to iteratively learn the optimal Q-value function using the Bellman optimality equation. To do so, we store all the Q-values in a table that we update at each time step using the Q-Learning iteration:
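The loop below is a minimal sketch of this procedure, not a reference implementation: it assumes a hypothetical Gym-style environment exposing reset() and step(action) that returns (next_state, reward, done), and the hyperparameters alpha, gamma, and epsilon are illustrative.

```python
import numpy as np

def train_q_table(env, n_states, n_actions, episodes=500,
                  alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning against an environment with reset()/step()."""
    q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection: explore with probability epsilon.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(q[s]))
            s_next, r, done = env.step(a)
            # Q-Learning iteration: move Q(s, a) toward the Bellman target.
            target = r + gamma * np.max(q[s_next]) * (not done)
            q[s, a] += alpha * (target - q[s, a])
            s = s_next
    return q
```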
Markov Decision Process (MDP)
A Markov decision process (MDP) is a stochastic decision-making process that uses a mathematical framework to model the decision-making of a dynamic system. It is used in scenarios where the outcomes are either random or controlled by a decision maker who makes sequential decisions over time. MDPs evaluate which actions the decision maker should take given the current state and environment of the system.
The MDP model relies on the Markov property, which states that the future can be determined only from the present state, because the present state encapsulates all the necessary information from the past. The Markov property can be expressed with the following equation:
P[St+1 | St] = P[St+1 | S1, S2, S3, …, St]
According to this equation, the probability of the next state St+1 given only the present state St is equal to the probability of St+1 given the entire history of states S1, S2, S3, …, St. This implies that an MDP uses only the present/current state to evaluate the next actions, without any dependency on previous states or actions.
A Markov process is defined by (S, P), where S is the set of states and P is the state-transition probability. It consists of a sequence of random states S₁, S₂, … in which every state obeys the Markov property. The state-transition probability P_ss' is the probability of jumping to a state s' from the current state s.
A Markov reward process (MRP) is defined by (S, P, R, γ), where S is the set of states, P is the state-transition probability, R_s is the reward, and γ is the discount factor.
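Using this notation, the state-transition probability is

P_ss' = P[St+1 = s' | St = s]

and the Return G_t (the quantity the discount factor acts on) is the total discounted reward collected from time step t onward:

G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + …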
The variable γ ∈ [0, 1] is the discount factor. The intuition behind using a discount is that there is no certainty about future rewards. While it is important to consider future rewards to increase the Return, it is equally important to limit their contribution to the Return (since you can't be 100 percent certain of the future).
The policy (Π) determines the agent's optimal action given the current state, so that it gains the maximum reward. In simple words, it maps states to actions:
Π: S → A
To determine the best policy, it is essential to define the Return, which captures the agent's rewards at every state. A fixed-horizon approach is not preferred, as it forces a focus on either short-term or long-term rewards. Instead, the discount factor (γ) is used. The rule is that if γ is closer to zero, immediate rewards are prioritized; if γ is closer to one, the focus shifts to long-term rewards. For example, with γ = 0.9, a reward received k steps in the future is weighted by 0.9^k, so it still counts, but less than an immediate reward. Hence, the discounted infinite-horizon method is key to finding the best policy.
The state-value function v(s) is the expected Return starting from state s. The value function can be divided into two components: the immediate reward of the current state and the discounted value of the next state. This decomposition yields Bellman's equation, as shown below:
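In standard notation, the decomposition reads

v(s) = E[ R_{t+1} + γ·v(St+1) | St = s ]

where the expectation is taken over the possible next states reached from s.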
Here, it is worth noting that the agent’s actions and rewards vary based on the policy.
This implies that the value function is specific to a policy.
Consider a problem where we need to decide whether a tribe should go deer hunting in a nearby forest to ensure long-term returns. Each deer generates a fixed return. However, if the tribe hunts beyond a limit, it can result in a lower yield next year. Hence, we need to determine the optimum portion of deer that can be caught while maximizing the return over a longer period.
The problem statement can be simplified in this case: whether or not to hunt a certain portion of deer. In the context of an MDP, the problem can be expressed as follows:
States: The number of deer available in the forest in the year under consideration.
The four states include empty, low, medium, and high, which are defined as follows:
• Empty: No deer available to hunt
• Low: Available deer count is below a threshold t_1
• Medium: Available deer count is between t_1 and t_2
• High: Available deer count is above a threshold t_2
Actions: Actions include go_hunt and no_hunting, where go_hunt implies catching
certain proportions of deer. It is important to note that for the empty state, the only
possible action is no_hunting.
Rewards: Hunting in each state generates rewards. The rewards for hunting in the low, medium, and high states may be $5K, $50K, and $100K, respectively. Moreover, if the action results in an empty state, the reward is -$200K, because of the re-breeding of new deer that is then required, which involves time and money.
State transitions: Hunting in a state causes a transition to a state with fewer deer. Conversely, the action no_hunting causes a transition to a state with more deer, except in the 'high' state.
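As an illustration only, the sketch below encodes this deer-hunting MDP and solves it with value iteration. The transition probabilities, the recovery dynamics under no_hunting, and the way the -$200K penalty is combined with the hunting reward are assumptions made for the example; the reward figures follow the text above (values are in thousands of dollars).

```python
# States and actions of the (hypothetical) deer-hunting MDP.
STATES = ["empty", "low", "medium", "high"]

# transitions[state][action] = list of (probability, next_state, reward).
transitions = {
    "empty": {
        "no_hunting": [(1.0, "low", 0.0)],            # population slowly recovers
    },
    "low": {
        "no_hunting": [(1.0, "medium", 0.0)],
        "go_hunt":    [(0.5, "low", 5.0), (0.5, "empty", 5.0 - 200.0)],
    },
    "medium": {
        "no_hunting": [(1.0, "high", 0.0)],
        "go_hunt":    [(0.8, "low", 50.0), (0.2, "empty", 50.0 - 200.0)],
    },
    "high": {
        "no_hunting": [(1.0, "high", 0.0)],           # already at carrying capacity
        "go_hunt":    [(1.0, "medium", 100.0)],
    },
}

def value_iteration(gamma=0.9, tol=1e-6):
    """Compute state values and a greedy policy for the MDP above."""
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            best = max(
                sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
                for outcomes in transitions[s].values()
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    policy = {
        s: max(transitions[s], key=lambda a: sum(
            p * (r + gamma * v[s2]) for p, s2, r in transitions[s][a]))
        for s in STATES
    }
    return v, policy

if __name__ == "__main__":
    values, policy = value_iteration()
    print(values)   # expected discounted return from each state
    print(policy)   # e.g. whether go_hunt is preferred in 'medium' and 'high'
```

Running value iteration on such a model gives the discounted value of each population level and a policy that balances the hunting rewards against the -$200K penalty for emptying the forest.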