Reinforcement Learning

Reinforcement Learning (RL) is a machine learning field focused on how agents learn to make decisions through trial and error, drawing inspiration from behavioral psychology. The document outlines the history of RL, including foundational concepts like dynamic programming, Markov Decision Processes, and key algorithms such as Q-learning, Value Iteration, and Policy Iteration. It also discusses the evolution of RL into modern applications, including deep reinforcement learning and its use in various real-world scenarios.

Origin and History of Reinforcement Learning Research

Reinforcement Learning (RL) is a part of machine learning that is inspired by behavioral psychology. It studies how a computer program or "agent" can learn to make the best decisions by trying different actions and seeing what works. Over time, this field has grown by combining ideas from psychology, control systems, operations research, and artificial intelligence (AI).

1. Early Foundations

Behaviorist Psychology (1900s–1950s)

● Thorndike's Law of Effect (1905)


Edward Thorndike did experiments with animals, like cats in puzzle boxes, and
found that actions that lead to good outcomes are repeated more often. This idea
—called the Law of Effect—was one of the first ideas behind learning through
trial and error.

● B.F. Skinner (1930s–1950s)


Building on Thorndike’s work, Skinner introduced operant conditioning, which
means behavior can be shaped by rewards (reinforcements) and punishments.
He showed that animals (and humans) can learn complex tasks by receiving
small rewards at the right time.

Bellman and Dynamic Programming (1950s)

● Richard Bellman created a method called dynamic programming, which helps solve problems where each decision leads to a new situation, and the next decision depends on that outcome.

● His Bellman Equation became a key idea in RL, showing how to calculate the
value of different actions in a step-by-step process.

Markov Decision Processes (MDPs, 1950s)

● Markov Decision Processes (MDPs) were formalized by Richard Bellman and Ronald Howard to describe problems where outcomes depend only on the current situation (not the full history).

● MDPs give a clear mathematical way to describe environments in RL, using states, actions, rewards, and transitions.

2. Computational Roots (1960s–1980s)

Artificial Intelligence (1960s–1970s)

● Early AI systems began exploring how machines could learn from trial and error,
much like animals and humans do.

● Arthur Samuel (1959) developed one of the first programs that could learn to
play checkers better over time by adjusting its strategy, a very early form of RL.

Temporal Difference Learning (1980s)

● In the 1980s, Andrew Barto, Richard Sutton, and others introduced Temporal
Difference (TD) learning, which combines ideas from two earlier methods—
Monte Carlo (learning from full episodes) and dynamic programming (using
known models).

● TD learning allows an agent to learn from its experience step by step, even when
the final result is not yet known.

Q-Learning (1989)

● Chris Watkins introduced Q-learning, which lets agents learn the value of doing
a certain action in a certain situation, even when they don’t know how the
environment works.
● This method is model-free, meaning the agent doesn't need a plan or map of the
environment—it learns just by interacting with it.

3. Reinforcement Learning as a Field (1990s)

● The book "Reinforcement Learning: An Introduction" (1998) by Sutton and Barto brought together many of the ideas and made RL a clear and unified field of study.

● During this time, researchers also began using function approximation and
neural networks, allowing RL to handle bigger problems where listing every
state or action is not possible.

4. Modern Era: Deep Reinforcement Learning (2010s–Present)

Deep Learning Integration

● Deep Q-Networks (DQN, 2015):


In 2015, researchers at DeepMind created a system that played Atari video games
using deep neural networks to understand screen images. This was a big step
because it showed that RL could work on complex tasks with just visual input.

● Policy Gradient Methods:


These methods, like REINFORCE and Actor-Critic, help agents learn directly
how to act, especially in environments with many possible actions (like moving
a robot arm). These are better suited for continuous and more complex
environments.

Applications and Achievements

● AlphaGo (2016): Used RL to beat world champions in the board game Go, which
was previously thought to be too complex for machines.

● AlphaZero and AlphaFold (2018–2020): Showed how RL and related deep learning methods can be applied beyond games. AlphaZero learned to play multiple games from scratch, and AlphaFold made major advances in biology by predicting protein structures.
5. Emerging Trends

● RL is now used in many real-world areas like self-driving cars, robots, personalized medicine, and smart trading systems.

● New research areas include:

○ Multi-Agent RL: Learning when many agents interact.

○ Offline RL: Learning from past data without needing live experiments.

○ Combining RL with Natural Language Processing (NLP): For better understanding and communication.

○ Causal RL: Learning how actions cause outcomes, not just patterns.

6. Key Concepts Influencing RL

● Exploration vs. Exploitation:


Should the agent try something new or stick with what worked before? This
balance is critical for learning effectively.

● Model-Free vs. Model-Based RL:

○ Model-Free: Learns by experience only—like trial and error without a map.

○ Model-Based: Builds a map of the environment to make better, smarter plans.

Conclusion

Reinforcement Learning has grown from simple ideas about behavior into a powerful
tool used in cutting-edge technology. It teaches machines to learn from their
experiences just like humans or animals. As it continues to evolve, RL is helping create
intelligent systems that can adapt, improve, and make better decisions in complex,
real-world environments.
Describe the Concepts of Value Iteration and
Policy Iteration in Reinforcement Learning
1. Introduction to Reinforcement Learning (RL)

Reinforcement Learning is a branch of machine learning where an agent learns how to act in an environment to get maximum rewards over time. The agent explores the environment, takes actions, receives rewards (or penalties), and gradually learns which actions are good or bad.

This learning process is based on trial and error, and is widely used in robotics, game
AI, finance, healthcare, and more.

The environment is often modeled as a Markov Decision Process (MDP), which provides a formal framework for decision-making.

2. What is a Markov Decision Process (MDP)?

An MDP (Markov Decision Process) is a mathematical framework used for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.

● States (S): All possible situations the agent can be in.

● Actions (A): All possible moves the agent can take.


● Rewards (R): Numerical feedback after an action.

● Transition probabilities (P): Chance of moving to a new state after an action.

● Discount factor (γ): Determines how much future rewards matter.

The goal of the agent is to find an optimal policy (π*) that tells it what to do in each state to maximize total rewards over time.

3. The Role of Dynamic Programming in MDPs

Dynamic Programming (DP) is used to solve MDPs efficiently. It relies on breaking big
problems into smaller ones and solving them recursively.

Two of the most important DP methods in reinforcement learning are:

● Value Iteration

● Policy Iteration

These methods help compute the optimal value function and optimal policy, allowing
the agent to behave in the best possible way.

4. Concept of Value Iteration

🔍 Definition:
Value Iteration is a foundational algorithm in Reinforcement Learning (RL) and
Dynamic Programming (DP) used to compute the optimal policy and the optimal value
function for a given Markov Decision Process (MDP). It is an iterative approach that
refines estimates of the value function until they converge to the optimal values.
📚 Example:

Imagine a robot in a grid. It can move up, down, left, or right. Some cells have rewards
(like +10 or -5). Using value iteration, the robot calculates the value of each cell and
learns the best path to reach the high-reward cells while avoiding bad ones.
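
The grid example can be made concrete with a short Python sketch of value iteration. This is a minimal illustration, not a production implementation; the tiny three-state chain, its rewards, the discount factor, and the convergence threshold are all assumptions chosen only for this example.

```python
# Minimal value-iteration sketch on an assumed toy MDP.
# P[state][action] = list of (probability, next_state, reward) tuples.
GAMMA = 0.9      # assumed discount factor
THETA = 1e-6     # assumed convergence threshold

P = {
    "s0": {"right": [(1.0, "s1", 0.0)], "left": [(1.0, "s0", 0.0)]},
    "s1": {"right": [(1.0, "s2", 10.0)], "left": [(1.0, "s0", 0.0)]},
    "s2": {"right": [(1.0, "s2", 0.0)], "left": [(1.0, "s2", 0.0)]},  # absorbing state
}

V = {s: 0.0 for s in P}
while True:
    delta = 0.0
    for s in P:
        # Bellman optimality backup: V(s) = max_a sum_s' p * (r + gamma * V(s'))
        best = max(
            sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < THETA:
        break

# Extract a greedy policy from the converged values.
policy = {
    s: max(P[s], key=lambda a: sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a]))
    for s in P
}
print(V, policy)
```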

✅ Advantages of Value Iteration:

● Simple to implement.

● Works well for smaller environments.

● Efficient when only approximate solutions are acceptable.

5. Concept of Policy Iteration

🔍 Definition:
Policy Iteration is a method that starts with a random policy, evaluates how good it is,
and then improves it step-by-step. It repeats this process until the policy becomes
stable and optimal.

📚 Example:

In the same grid-world as before, the robot starts with a random policy (e.g., always
move right). It calculates how good that policy is (by evaluating total expected rewards).
Then it updates its policy to something better (e.g., move up if reward is higher). This
continues until it finds the best path.
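
A matching policy-iteration sketch is shown below, reusing the same assumed toy MDP layout as in the value-iteration example; it alternates policy evaluation and greedy policy improvement until the policy stops changing.

```python
# Minimal policy-iteration sketch on the same assumed toy MDP layout:
# P[state][action] = list of (probability, next_state, reward) tuples.
GAMMA = 0.9

P = {
    "s0": {"right": [(1.0, "s1", 0.0)], "left": [(1.0, "s0", 0.0)]},
    "s1": {"right": [(1.0, "s2", 10.0)], "left": [(1.0, "s0", 0.0)]},
    "s2": {"right": [(1.0, "s2", 0.0)], "left": [(1.0, "s2", 0.0)]},
}

policy = {s: "right" for s in P}   # start from an arbitrary policy
V = {s: 0.0 for s in P}

stable = False
while not stable:
    # 1. Policy evaluation: V^pi(s) = sum_s' p * (r + gamma * V^pi(s'))
    for _ in range(1000):
        delta = 0.0
        for s in P:
            v = sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < 1e-8:
            break
    # 2. Policy improvement: act greedily with respect to V^pi
    stable = True
    for s in P:
        best_a = max(P[s], key=lambda a: sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a]))
        if best_a != policy[s]:
            policy[s] = best_a
            stable = False

print(policy, V)
```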
✅ Advantages of Policy Iteration:

● Usually converges in fewer steps than value iteration.

● Gives the exact optimal policy.

● More efficient in problems where evaluation is fast.

6. Comparison Between Value Iteration and Policy Iteration

| Feature | Value Iteration | Policy Iteration |
| --- | --- | --- |
| Focus | Value function first | Policy directly |
| Steps | One combined loop | Two separate steps |
| Speed | Faster per step | Fewer total steps |
| Output | Value → Policy | Direct optimal policy |
| Efficiency | Good for small/medium problems | Better for larger problems |

7. Key Similarities

● Both are based on Bellman equations.

● Both assume complete knowledge of the MDP (model-based).

● Both converge to the optimal policy.

● Both are part of Dynamic Programming.

8. Real-Life Applications

● Robotics: Helping robots find the best way to clean a room or reach a goal.

● Gaming AI: Creating smart agents that can plan ahead and make winning moves.

● Healthcare: Making treatment plans over time.

● Logistics & Navigation: Finding the best delivery or travel routes.


9. Conclusion

Value Iteration and Policy Iteration are powerful tools in reinforcement learning. They
both help agents learn optimal behaviors in complex environments by analyzing
rewards, transitions, and future consequences.

● Value Iteration is simple and fast for approximate solutions.

● Policy Iteration is more structured and gives exact results.

Together, they form the foundation of many modern reinforcement learning


algorithms, and are essential for anyone looking to understand intelligent decision-
making systems.

Q-learning: An In-Depth Explanation and Comparison with Value Iteration and Policy Iteration

1. Introduction to Q-learning

Reinforcement Learning (RL) focuses on training agents to make decisions by interacting with an environment to maximize long-term rewards. Among the different approaches to RL, Q-learning is one of the most widely used model-free methods, meaning that it does not require knowledge of the environment’s transition model or reward function.

In Q-learning, an agent learns an optimal policy by estimating the Q-values (action-value function) for each state-action pair. Over time, the agent updates these Q-values based on its experiences (rewards and transitions), and through this process, it learns the best actions to take to maximize future rewards.

2. What is Q-learning?

Definition:
Q-learning is a type of off-policy reinforcement learning algorithm. In this algorithm,
the agent tries to learn the optimal policy without directly following it, meaning it
explores the environment while also aiming to improve its policy. The goal of Q-
learning is to learn the Q-function, which tells the agent how good it is to take a specific
action in a particular state.

The Q-function or action-value function Q(s, a) represents the expected cumulative reward of taking action a in state s and following the optimal policy thereafter.

The Bellman equation for Q-learning (its update rule) is given by:

Q(s, a) ← Q(s, a) + α [ R(s, a) + γ max_{a′} Q(s′, a′) − Q(s, a) ]

Where:

● Q(s,a) is the current Q-value for state s and action a.

● R(s,a) is the immediate reward after taking action a from state s.

● γ is the discount factor, which determines the importance of future rewards.

● α is the learning rate, controlling how much new information overrides old
information.

● max_{a′} Q(s′, a′) is the maximum Q-value for the next state s′, representing the best possible action from the next state.
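
A minimal sketch of this update rule as code is shown below; the action set, hyperparameter values, and the dictionary-based Q-table are assumptions made for illustration.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9          # assumed learning rate and discount factor
ACTIONS = ["left", "right"]      # assumed action set

Q = defaultdict(float)           # Q[(state, action)] -> value, defaults to 0.0

def q_update(state, action, reward, next_state):
    """One Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (R + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    td_target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (td_target - Q[(state, action)])
```

Each call nudges the stored Q-value toward the temporal-difference target, which is exactly the update written above.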
3. Key Features of Q-learning

1. Model-free:
Q-learning is model-free, meaning it doesn't require a model of the
environment (i.e., transition probabilities and rewards). The agent learns directly
from experience.

2. Off-policy:
Q-learning is an off-policy algorithm, which means the agent can learn the optimal
policy while following a different exploratory policy (such as ϵ-greedy).

3. Convergence:
If the learning rate α decays appropriately over time and all state-action pairs are visited
infinitely often, Q-learning will converge to the optimal Q-values, Q*(s, a), and the
corresponding optimal policy.

4. Exploration vs Exploitation:
The agent faces a dilemma between exploration (trying new actions to discover
better rewards) and exploitation (choosing the known best action). This is often
handled with an ϵ-greedy approach, where the agent chooses a random action with
probability ϵ (exploration), and the best action with probability 1 − ϵ (exploitation).
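
A short sketch of that ϵ-greedy choice, reusing the kind of defaultdict Q-table shown in the update sketch earlier; the default ϵ value is an assumption.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action (exploration);
    otherwise pick the action with the highest known Q-value (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```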

4. Comparison of Q-learning with Value Iteration and Policy Iteration

While Q-learning, value iteration, and policy iteration all aim to solve reinforcement
learning problems and find an optimal policy, they differ in key aspects. Let's break
down these differences:

4.1 Value Iteration

Value Iteration is a foundational algorithm in Reinforcement Learning (RL) and Dynamic Programming (DP) used to compute the optimal policy and the optimal value function for a given Markov Decision Process (MDP). It is an iterative approach that refines estimates of the value function until they converge to the optimal values.
📚 Example:

Imagine a robot in a grid. It can move up, down, left, or right. Some cells have rewards
(like +10 or -5). Using value iteration, the robot calculates the value of each cell and
learns the best path to reach the high-reward cells while avoiding bad ones.

✅ Advantages of Value Iteration:

● Simple to implement.

● Works well for smaller environments.

● Efficient when only approximate solutions are acceptable.

4.2. Concept of Policy Iteration


🔍 Definition:

Policy Iteration is a method that starts with a random policy, evaluates how good it is,
and then improves it step-by-step. It repeats this process until the policy becomes
stable and optimal.

📚 Example:

In the same grid-world as before, the robot starts with a random policy (e.g., always
move right). It calculates how good that policy is (by evaluating total expected rewards).
Then it updates its policy to something better (e.g., move up if reward is higher). This
continues until it finds the best path.

✅ Advantages of Policy Iteration:

● Usually converges in fewer steps than value iteration.

● Gives the exact optimal policy.

● More efficient in problems where evaluation is fast.


5. Advantages of Q-learning

● Flexibility:
Q-learning is highly flexible because it does not require a model of the
environment, making it applicable in real-world scenarios where the
environment is unknown or changing.

● Simplicity:
Q-learning is relatively simple to implement, especially for environments where
the state space is not extremely large.

● Efficiency in Complex Environments:


Q-learning works well in large or continuous environments where using
dynamic programming methods like value iteration or policy iteration would be
too slow or impractical.

● Off-policy Learning:
Q-learning's off-policy nature allows it to learn optimal behavior while
following a different exploratory strategy.

6. Challenges and Limitations of Q-learning

● Large State and Action Spaces:


As the size of the state and action spaces grows, the Q-table becomes very large,
which leads to high memory usage and computational complexity.

● Slow Convergence:
In large environments, Q-learning may take a long time to converge to the
optimal Q-values due to the need for extensive exploration.

● Sensitive to Hyperparameters:
The convergence and performance of Q-learning are sensitive to the choice of
hyperparameters, such as the learning rate α and the discount factor γ.

7. Conclusion

Q-learning is a powerful model-free reinforcement learning algorithm that is widely used in situations where the environment’s model is unknown or too complex to compute. Unlike value iteration and policy iteration, Q-learning does not require a model of the environment and is off-policy, allowing for exploration during the learning process.

While Q-learning has many advantages, such as flexibility and simplicity, it also has
limitations, including slow convergence and high memory usage in large state spaces.
Nevertheless, Q-learning remains an essential algorithm in the field of reinforcement
learning, particularly for real-world applications.

By comparing Q-learning with value iteration and policy iteration, we can see that Q-
learning offers a more flexible and efficient solution in many real-world environments
where the environment model is either unavailable or too complicated to work with.
Axioms of Probability and Their Role in
Reinforcement Learning

1. Introduction

Reinforcement Learning (RL) is a branch of Machine Learning where an agent interacts with an environment to learn optimal actions by trial and error. This learning process is uncertain and probabilistic in nature. Therefore, probability theory plays a foundational role in defining the behavior of RL algorithms.

Before understanding how probability applies to RL, it’s important to learn the basic
axioms (rules) that govern probability. These axioms help formalize how probabilities
are assigned and manipulated, ensuring that RL models behave consistently.

2. What Are Axioms of Probability?

The axioms of probability are fundamental principles proposed by Andrey Kolmogorov in 1933. They are the building blocks of probability theory and are used in every field where uncertainty is modeled, including RL.

Let’s denote:

● S = Sample Space (the set of all possible outcomes)

● A,B = Events (subsets of S)

● P(A)= Probability of event A

The three Kolmogorov Axioms of Probability are:

3. Axiom 1: Non-Negativity

P(A)≥0

This axiom states that the probability of any event is never negative. The lowest
possible value for a probability is 0 (impossible event), and the highest is 1 (certain
event).

Example in RL:
The probability of moving from state st to state st+1 after taking an action at must be a non-negative number:

P(st+1 ∣ st, at) ≥ 0

This ensures that every transition has a logical, valid probability assigned.

4. Axiom 2: Total Probability of Sample Space is 1

P(S)=1

The probability of the entire sample space is 1. This means that something in the
sample space must happen. All possible events must add up to 100% probability.

Example in RL:

If an agent takes action a in state s, the sum of probabilities of transitioning to all possible next states must be equal to 1:

Σ_{s′} P(s′ ∣ s, a) = 1

This ensures that the agent will definitely end up in some state after an action — the
future is uncertain, but one outcome must happen.

5. Axiom 3: Additivity for Disjoint Events

If two events A and B cannot occur together (i.e., they are mutually exclusive), then the probability that either one occurs is simply the sum of their individual probabilities:

P(A ∪ B) = P(A) + P(B), whenever A ∩ B = ∅

Example in RL:

Suppose an agent in state s can move to either state s1 or s2, but not both at once (they are disjoint). Then:

P(reach s1 or s2 ∣ s, a) = P(s1 ∣ s, a) + P(s2 ∣ s, a)

This axiom is used in planning and estimating total probability for multiple exclusive transitions.

6. Role of Probability Axioms in Reinforcement Learning

Now, let’s understand how these axioms are applied in Reinforcement Learning, and
why they are so important:

a. Defining MDPs (Markov Decision Processes)

RL problems are often modeled as Markov Decision Processes (MDPs), which are
defined as a 5-tuple:

(S,A,P,R,γ)

Where:

● S = Set of states

● A = Set of actions

● P(s′∣s,a) = Probability of moving to state s′ from s after taking action a

● R(s,a)= Reward

● γ = Discount factor

The transition function P(s′∣s,a) is entirely based on probability, and must obey
the axioms:

● All probabilities ≥0 (Axiom 1)

● Sum of probabilities over all s′ must equal 1 (Axiom 2)

These axioms ensure that the agent’s transitions are mathematically valid and
interpretable.
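
As a small illustration of how Axioms 1 and 2 become concrete checks on an MDP’s transition function, the sketch below validates a toy transition table; the states, actions, and probability values are made up for the example.

```python
# P[(s, a)] maps each next state s' to P(s' | s, a); the numbers are assumed.
P = {
    ("s0", "right"): {"s1": 0.8, "s0": 0.2},
    ("s0", "left"):  {"s0": 1.0},
}

def is_valid_transition_model(P, tol=1e-9):
    for (s, a), dist in P.items():
        if any(p < 0 for p in dist.values()):     # Axiom 1: non-negativity
            return False
        if abs(sum(dist.values()) - 1.0) > tol:   # Axiom 2: probabilities sum to 1
            return False
    return True

print(is_valid_transition_model(P))   # True for the toy table above
```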

b. Bellman Equation
The Bellman Equation is the foundation of many RL algorithms like Value Iteration and
Q-Learning. It depends on expected values of future states, which require probability
distributions that follow the axioms.

Example Bellman Expectation Equation:

V^π(s) = Σ_a π(a ∣ s) Σ_{s′} P(s′ ∣ s, a) [ R(s, a) + γ V^π(s′) ]

In this equation:

● π(a∣s) = policy (probability of taking action a in state s)

● P(s′∣s,a) = transition probability (must follow axioms)

Without the axioms, we could get invalid or misleading results from the Bellman
equation.

c. Designing Policies

In RL, a policy is a mapping from states to a probability distribution over actions:

π(a∣s)

This function must follow the axioms of probability:

π(a ∣ s) ≥ 0 for every action a, and Σ_a π(a ∣ s) = 1 for every state s

These ensure that the policy makes valid choices — it always selects one action, even if probabilistically.
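
One common way to construct a policy that automatically satisfies both conditions is to pass arbitrary action preferences through a softmax. The sketch below is illustrative, and the preference numbers are assumptions.

```python
import math

def softmax_policy(preferences):
    """Turn real-valued action preferences into a valid distribution:
    every probability is >= 0 and the probabilities sum to 1."""
    exps = {a: math.exp(v) for a, v in preferences.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

pi = softmax_policy({"up": 2.0, "down": 0.5, "left": -1.0})
print(pi, sum(pi.values()))   # the probabilities sum to 1.0
```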

d. Value Functions and Expectation

Value functions represent the expected reward an agent will receive. To calculate
expectation, we multiply possible outcomes by their probabilities, so again, valid
probabilities (following axioms) are critical.
For example:

E[R] = Σ_r r · P(r)

This only makes sense if the P(r) values are valid probabilities (i.e., they are non-negative and sum to 1).

e. Sampling and Exploration

In real-world environments, agents often sample actions and transitions to learn. These
samples are based on probabilistic functions that follow axioms:

● When choosing an action randomly from a policy

● When observing the next state based on environment dynamics

● When estimating expected return using Monte Carlo methods

In each case, the core mechanism is probability, and valid behavior requires adherence
to the axioms.

7. Summary

The axioms of probability provide a mathematical foundation for modeling and working with uncertainty. In Reinforcement Learning, where the environment is often stochastic and unpredictable, these axioms ensure that:

● Transitions between states are valid

● Rewards are interpreted correctly

● Actions follow meaningful policies

● Value estimates are based on legitimate expectations

Without these axioms, the entire framework of RL would collapse, as algorithms would
operate on undefined or invalid probability distributions.
8. Final Thoughts

To summarize, the axioms of probability (non-negativity, total probability = 1, and additivity) are not just theoretical concepts — they are practically essential in Reinforcement Learning. From defining transitions and policies to computing expected rewards and updating value functions, everything in RL is built upon these basic rules.

Understanding these axioms deeply will help you master advanced RL algorithms like
Q-learning, Policy Gradient, Actor-Critic, and Deep RL models, all of which rely on valid
probability models.

Understanding Probability Concepts in Reinforcement Learning

Reinforcement Learning (RL) involves an agent interacting with an environment,
aiming to maximize rewards. Many key concepts in RL, such as transitions, rewards,
and states, have inherent uncertainty. To model these uncertainties, probability theory
plays a crucial role. In this context, concepts like Probability Mass Function (PMF),
Probability Density Function (PDF), Cumulative Distribution Function (CDF), and
Expectation are vital for understanding and working with uncertain elements in RL.

1. Introduction to Probability in Reinforcement Learning

In Reinforcement Learning, the agent interacts with an environment that can be uncertain, non-deterministic, or stochastic. Probabilistic models help formalize this uncertainty, especially when predicting or estimating rewards, state transitions, and decisions. To understand these models, we must first review some key probability concepts and how they apply to RL.

The Environment in RL:

The environment in RL is typically represented as a Markov Decision Process (MDP). In an MDP, an agent observes the current state st, selects an action at, and transitions to a new state st+1, based on the environment’s dynamics. The reward rt is associated with the transition from state st to st+1 after taking action at. This entire process is probabilistic since the outcomes of actions in a given state often vary.
For instance:

● State Transition Probability: P(st+1∣st,at)

● Reward Distribution: P(rt∣st,at)

These probabilistic models form the foundation of understanding how the agent learns
and improves its behavior.

2. Probability Mass Function (PMF)

Definition:

A Probability Mass Function (PMF) is a function that defines the probability of discrete
outcomes in a probability space. It applies when the random variable takes discrete
values. The PMF maps each possible outcome to its probability.

For example, consider the agent’s action a in a specific state s. If the environment is
stochastic, the next state st+1 could be one of several possible states, each with its own
probability. The PMF for the next state st+1 given the current state st and action at can
be expressed as:

P(st+1∣st,at)=p(st+1)

This function assigns a probability to each possible next state.

In RL:

The PMF is used to model the stochastic nature of state transitions. For example, if we
consider the probability distribution of next states, it is crucial for the agent to learn the
likelihood of reaching certain states after taking an action. It helps estimate the
expected outcomes of actions, making it essential for RL algorithms like Q-learning and
value iteration.

Example in RL:

If we have a discrete set of states s1, s2, s3, the PMF P(s2 ∣ s1, a) could represent the probability that taking action a in state s1 leads to state s2. If P(s2 ∣ s1, a) = 0.5, it means there is a 50% chance that the agent will end up in state s2 after performing action a.
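
A tiny sketch of how an agent (or a simulator) can sample the next state from such a PMF. The 0.5 value comes from the example above, while the other states and probabilities are assumptions added so the distribution sums to 1.

```python
import random

# PMF over next states given (s1, a): P(s2|s1,a) = 0.5 from the example;
# the remaining probability mass is split over two assumed states.
pmf = {"s2": 0.5, "s1": 0.3, "s3": 0.2}

next_state = random.choices(list(pmf.keys()), weights=list(pmf.values()), k=1)[0]
print(next_state)   # "s2" about half of the time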

3. Probability Density Function (PDF)


Definition:

A Probability Density Function (PDF) is used when the random variable is continuous.
It describes the likelihood of a random variable taking a particular value. Unlike the
PMF, which sums to 1 over all discrete outcomes, the PDF integrates to 1 over a
continuous range of outcomes.

For a continuous random variable X, the PDF is denoted as f_X(x), and the probability that X lies in a certain range [a, b] is given by the integral of the PDF over that range:

P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx

In RL:

In RL, the PDF is often used when modeling continuous random variables like
continuous rewards or state variables. For instance, if the reward rt is not discrete, but
instead a continuous variable, we use the PDF to model the probability distribution of
rewards.

The continuous nature of the PDF allows for finer granularity in modeling
environments where states and rewards can take a range of values (e.g., in real-world
problems with continuous variables like position, velocity, or temperature).

Example : Robot’s Position After Movement

Imagine an agent (like a robot) trying to move forward.

● Because of small errors (like wheel slips), its position after moving isn't exact.

● The final position can be modeled as a continuous variable with a PDF.

📌 The PDF helps model the probability of the robot ending up between 1.8
m and 2.2 m forward, rather than at exactly 2.0 m.

This helps the robot learn more accurately where it might end up — and adjust its
movement accordingly.
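
As a sketch, if we assume the robot’s final position is roughly Normal with mean 2.0 m and standard deviation 0.15 m (both numbers are assumptions), the probability of ending up between 1.8 m and 2.2 m is the PDF integrated over that interval, which can be read off the distribution’s CDF:

```python
from scipy.stats import norm

# Assumed model: position ~ Normal(mean=2.0 m, std=0.15 m)
position = norm(loc=2.0, scale=0.15)

# P(1.8 <= X <= 2.2) = F(2.2) - F(1.8), i.e. the PDF integrated over [1.8, 2.2]
p = position.cdf(2.2) - position.cdf(1.8)
print(round(p, 3))   # roughly 0.82 for these assumed parameters
```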

4. Cumulative Distribution Function (CDF)

Definition:
The Cumulative Distribution Function (CDF) of a random variable X gives the
probability that X will take a value less than or equal to x. The CDF is derived from the
PDF (or PMF) and provides the cumulative probability up to a certain point.

Mathematically, for a continuous random variable:

F_X(x) = P(X ≤ x) = ∫_{−∞}^{x} f_X(t) dt

For a discrete random variable, it is the sum of the probabilities:

F_X(x) = P(X ≤ x) = Σ_{xi ≤ x} P(X = xi)

In RL:

In reinforcement learning, the CDF can be used to represent the cumulative probability
distribution of rewards or state transitions. It helps the agent understand the likelihood
of receiving a certain amount of reward or ending up in a particular state within a given
range.

Example of CDF in RL

Imagine a robot delivery agent that gets a reward based on delivery time:

● If it delivers fast, it gets a high reward.

● If it's slow, the reward is lower.

Now let’s say the reward is not always the same — sometimes it gets 8, 9, or 10 points, depending on traffic. These rewards roughly follow a normal distribution (most of the time around 9, but they can vary). The CDF then answers cumulative questions such as: what is the probability that the reward is at most 8.5 points? That view helps the agent judge how likely it is to fall short of a reward target.
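
A small sketch of that cumulative question, assuming the reward is approximately Normal with mean 9 and standard deviation 0.5 (the standard deviation is an assumed value):

```python
from scipy.stats import norm

reward = norm(loc=9.0, scale=0.5)     # assumed reward distribution

# CDF: probability that the reward is at most 8.5 points
print(round(reward.cdf(8.5), 3))      # about 0.159 for these assumed parameters
```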

5. Expectation

Definition:

The expectation (or expected value) of a random variable is a measure of the center of
its distribution. It represents the average or mean value that we expect from a random
variable after repeated trials. Mathematically, for a discrete random variable X with
probability mass function P(X), the expectation E[X] is given by:

E[X] = Σ_x x · P(X = x)

For a continuous random variable, the expectation is calculated using the PDF:

E[X] = ∫ x · f_X(x) dx (integrated over all values x can take)

In RL:

In reinforcement learning, the expectation is critical in evaluating the long-term rewards of actions. The agent typically aims to maximize the expected reward, which is crucial in decision-making. The expected reward is computed using the reward distribution, and the expected future value is a central component of the Bellman equation.

In RL, the Q-value for a state-action pair is essentially the expected cumulative reward
starting from that state and taking a certain action.

Example in RL:

If the agent takes action at in state st, the expected reward R can be calculated as:

E[R ∣ st, at] = Σ_{s′} P(s′ ∣ st, at) · R(st, at, s′)

This represents the expected reward the agent will get, based on the probabilities of transitioning to different states s′ and the associated rewards.
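
A one-line version of this expectation for an assumed two-state transition table and reward function:

```python
# Assumed transition probabilities and rewards for one (state, action) pair
P_next = {"s1": 0.7, "s2": 0.3}     # P(s' | s, a)
R = {"s1": 5.0, "s2": -1.0}         # R(s, a, s')

expected_reward = sum(P_next[s2] * R[s2] for s2 in P_next)
print(expected_reward)              # 0.7*5.0 + 0.3*(-1.0) = 3.2
```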

6. Conclusion

In Reinforcement Learning, probability plays a vital role in modeling uncertainty in state transitions and rewards. Concepts like PMF, PDF, CDF, and expectation are used to represent and compute the uncertainties faced by an agent in a probabilistic environment.

● PMF and PDF help model discrete and continuous distributions for state
transitions and rewards.
● The CDF gives cumulative probabilities and is useful in analyzing and
interpreting the likelihood of certain outcomes.

● Expectation provides an average value that an agent can use to evaluate the
potential rewards of different actions and decisions.

Together, these concepts are foundational to the development of effective RL algorithms, allowing agents to make well-informed decisions in uncertain environments. Understanding these probability concepts is key to designing and improving RL models and algorithms.
