Reinforcement Learning
1. Early Foundations
● Richard Bellman (1950s) developed dynamic programming; his Bellman Equation became a
key idea in RL, showing how to calculate the value of different actions in a step-by-step process.
● Early AI systems began exploring how machines could learn from trial and error,
much like animals and humans do.
● Arthur Samuel (1959) developed one of the first programs that could learn to
play checkers better over time by adjusting its strategy, a very early form of RL.
● In the 1980s, Andrew Barto, Richard Sutton, and others introduced Temporal
Difference (TD) learning, which combines ideas from two earlier methods—
Monte Carlo (learning from full episodes) and dynamic programming (using
known models).
● TD learning allows an agent to learn from its experience step by step, even when
the final result is not yet known.
Q-Learning (1989)
● Chris Watkins introduced Q-learning, which lets agents learn the value of doing
a certain action in a certain situation, even when they don’t know how the
environment works.
● This method is model-free, meaning the agent doesn't need a plan or map of the
environment—it learns just by interacting with it.
● During this time, researchers also began using function approximation and
neural networks, allowing RL to handle bigger problems where listing every
state or action is not possible.
● AlphaGo (2016): Used RL to beat world champions in the board game Go, which
was previously thought to be too complex for machines.
● Recent research directions include:
○ Offline RL: Learning from past data without needing live experiments.
○ Causal RL: Learning how actions cause outcomes, not just patterns.
Conclusion
Reinforcement Learning has grown from simple ideas about behavior into a powerful
tool used in cutting-edge technology. It teaches machines to learn from their
experiences just like humans or animals. As it continues to evolve, RL is helping create
intelligent systems that can adapt, improve, and make better decisions in complex,
real-world environments.
Describe the Concepts of Value Iteration and
Policy Iteration in Reinforcement Learning
1. Introduction to Reinforcement Learning (RL)
Reinforcement Learning (RL) is a type of machine learning in which an agent learns by
interacting with an environment and receiving rewards or penalties for its actions. This
learning process is based on trial and error, and is widely used in robotics, game AI, finance,
healthcare, and more.
The goal of the agent is to find an optimal policy π* that tells it what to do in each state to
maximize total rewards over time. The environment is usually modeled as a Markov Decision
Process (MDP).
Dynamic Programming (DP) is used to solve MDPs efficiently. It relies on breaking big
problems into smaller ones and solving them recursively. Two classic DP methods used in RL are:
● Value Iteration
● Policy Iteration
These methods help compute the optimal value function and optimal policy, allowing
the agent to behave in the best possible way.
2. Value Iteration
🔍 Definition:
Value Iteration is a foundational algorithm in Reinforcement Learning (RL) and
Dynamic Programming (DP) used to compute the optimal policy and the optimal value
function for a given Markov Decision Process (MDP). It is an iterative approach that
refines estimates of the value function until they converge to the optimal values.
📚 Example:
Imagine a robot in a grid. It can move up, down, left, or right. Some cells have rewards
(like +10 or -5). Using value iteration, the robot calculates the value of each cell and
learns the best path to reach the high-reward cells while avoiding bad ones.
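A minimal Python sketch of value iteration on a grid-like world of this kind (the layout, the +10 and -5 rewards, and the discount factor below are illustrative assumptions, not values from the text):

```python
# Minimal value iteration on an assumed 1-D grid of 5 cells:
# terminal reward +10 at the right end, -5 at the left end.
GAMMA = 0.9          # discount factor (assumed)
THETA = 1e-6         # convergence threshold
states = range(5)
terminal = {0: -5, 4: +10}          # terminal cells and their rewards
actions = [-1, +1]                  # move left / move right

def step(s, a):
    """Deterministic transition: next state and reward (assumed dynamics)."""
    if s in terminal:
        return s, 0.0               # terminal states absorb with no further reward
    s_next = min(max(s + a, 0), 4)
    return s_next, terminal.get(s_next, 0.0)

V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        if s in terminal:
            continue
        # Bellman optimality backup: best action value given the current V
        best = max(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in actions))
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < THETA:
        break

# Greedy policy extracted from the converged values
policy = {s: max(actions, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
          for s in states if s not in terminal}
print(V, policy)
```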
✅ Advantages of Value Iteration:
● Simple to implement.
3. Policy Iteration
🔍 Definition:
Policy Iteration is a method that starts with a random policy, evaluates how good it is,
and then improves it step-by-step. It repeats this process until the policy becomes
stable and optimal.
📚 Example:
In the same grid-world as before, the robot starts with a random policy (e.g., always
move right). It calculates how good that policy is (by evaluating total expected rewards).
Then it updates its policy to something better (e.g., move up if reward is higher). This
continues until it finds the best path.
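A compact sketch of policy iteration on the same assumed world, showing the evaluate-then-improve loop (layout, rewards, and discount factor are again assumptions):

```python
# Policy iteration on the same assumed 1-D grid world as the value-iteration sketch.
GAMMA, THETA = 0.9, 1e-6
states = range(5)
terminal = {0: -5, 4: +10}
actions = [-1, +1]

def step(s, a):
    if s in terminal:
        return s, 0.0
    s2 = min(max(s + a, 0), 4)
    return s2, terminal.get(s2, 0.0)

policy = {s: +1 for s in states if s not in terminal}   # start: always move right
V = {s: 0.0 for s in states}

while True:
    # 1) Policy evaluation: compute V for the current policy
    while True:
        delta = 0.0
        for s in policy:
            s2, r = step(s, policy[s])
            v_new = r + GAMMA * V[s2]
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < THETA:
            break
    # 2) Policy improvement: act greedily with respect to V
    stable = True
    for s in policy:
        best_a = max(actions, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
        if best_a != policy[s]:
            policy[s], stable = best_a, False
    if stable:       # stop when the policy no longer changes
        break

print(policy, V)
```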
✅ Advantages of Policy Iteration:
● Often converges in fewer iterations than value iteration, because each evaluation step is
followed by a full policy improvement step.
7. Key Similarities
● Both are dynamic programming methods that work on a Markov Decision Process (MDP).
● Both use the Bellman equation and repeated, iterative updates to compute the optimal value
function and the optimal policy.
8. Real-Life Applications
● Robotics: Helping robots find the best way to clean a room or reach a goal.
● Gaming AI: Creating smart agents that can plan ahead and make winning moves.
9. Conclusion
Value Iteration and Policy Iteration are powerful tools in reinforcement learning. They
both help agents learn optimal behaviors in complex environments by analyzing
rewards, transitions, and future consequences.
1. Introduction to Q-learning
Q-learning is a core model-free reinforcement learning algorithm. This section explains what
it is, its key features, and how it compares with value iteration and policy iteration.
2. What is Q-learning?
Definition:
Q-learning is a type of off-policy reinforcement learning algorithm. In this algorithm,
the agent tries to learn the optimal policy without directly following it, meaning it
explores the environment while also aiming to improve its policy. The goal of Q-learning is to
learn the Q-function, which tells the agent how good it is to take a specific action in a
particular state.
The Q-function is updated after each step using the rule:
Q(st, at) ← Q(st, at) + α [rt+1 + γ max_a′ Q(st+1, a′) − Q(st, at)]
Where:
● α is the learning rate, controlling how much new information overrides old
information.
● γ is the discount factor, weighting how much future rewards count compared to
immediate ones.
● max_a′ Q(st+1, a′) is the maximum Q-value for the next state st+1, representing the
best possible action from the next state.
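A minimal sketch of this tabular Q-learning update; the state names, reward, and hyperparameter values are assumed for illustration:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9                 # assumed learning rate and discount factor
actions = ["left", "right"]
Q = defaultdict(float)                  # Q[(state, action)] -> value, defaults to 0

def q_update(s, a, reward, s_next):
    """One Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_target = reward + GAMMA * best_next
    Q[(s, a)] += ALPHA * (td_target - Q[(s, a)])

# A single hypothetical experience tuple (state 2, action "right", reward 1.0, next state 3):
q_update(s=2, a="right", reward=1.0, s_next=3)
print(Q[(2, "right")])   # 0.1 after one update from a zero-initialized table
```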
3. Key Features of Q-learning
1. Model-free:
Q-learning is model-free, meaning it doesn't require a model of the
environment (i.e., transition probabilities and rewards). The agent learns directly
from experience.
2. Off-policy:
Q-learning is an off-policy algorithm, which means the agent can learn the optimal
policy while following a different exploratory policy (such as ϵ-greedy).
3. Convergence:
If the learning rate α decays appropriately over time and all state-action pairs are visited
infinitely often, Q-learning will converge to the optimal Q-values, Q*(s, a), and the
corresponding optimal policy.
4. Exploration vs Exploitation:
The agent faces a dilemma between exploration (trying new actions to discover
better rewards) and exploitation (choosing the known best action). This is often
handled with an ϵ-greedy approach, where the agent chooses a random action with
probability ϵ (exploration), and the best action with probability 1−ϵ (exploitation).
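A small sketch of ϵ-greedy action selection as described above; the Q-table contents and ϵ value are assumptions:

```python
import random

EPSILON = 0.1   # assumed exploration rate

def epsilon_greedy(Q, state, actions, epsilon=EPSILON):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest Q-value (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Hypothetical usage with a tiny Q-table:
Q = {(0, "left"): 0.2, (0, "right"): 0.5}
print(epsilon_greedy(Q, state=0, actions=["left", "right"]))  # usually "right"
```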
4. Q-learning vs Value Iteration and Policy Iteration
While Q-learning, value iteration, and policy iteration all aim to solve reinforcement
learning problems and find an optimal policy, they differ in key aspects. Let's break
down these differences:
● Model requirement: Value iteration and policy iteration are dynamic programming methods
that need a complete model of the environment (transition probabilities and rewards), while
Q-learning is model-free and learns directly from experience.
● Update style: Value iteration and policy iteration sweep over all states using the Bellman
equation, whereas Q-learning updates one state-action pair at a time from sampled transitions.
● Policy handling: Policy iteration alternates between evaluating and improving an explicit
policy, value iteration works directly on the value function, and Q-learning is off-policy,
learning the optimal policy while following an exploratory one (e.g., ϵ-greedy).
5. Advantages of Q-learning
● Flexibility:
Q-learning is highly flexible because it does not require a model of the
environment, making it applicable in real-world scenarios where the
environment is unknown or changing.
● Simplicity:
Q-learning is relatively simple to implement, especially for environments where
the state space is not extremely large.
● Off-policy Learning:
Q-learning's off-policy nature allows it to learn optimal behavior while
following a different exploratory strategy.
6. Limitations of Q-learning
● Slow Convergence:
In large environments, Q-learning may take a long time to converge to the
optimal Q-values due to the need for extensive exploration.
● Sensitive to Hyperparameters:
The convergence and performance of Q-learning are sensitive to the choice of
hyperparameters, such as the learning rate α and the discount factor γ.
● High Memory Usage:
A tabular Q-function must store a value for every state-action pair, which becomes
impractical in very large state spaces.
7. Conclusion
While Q-learning has many advantages, such as flexibility and simplicity, it also has
limitations, including slow convergence and high memory usage in large state spaces.
Nevertheless, Q-learning remains an essential algorithm in the field of reinforcement
learning, particularly for real-world applications.
By comparing Q-learning with value iteration and policy iteration, we can see that
Q-learning offers a more flexible and efficient solution in many real-world environments
where the environment model is either unavailable or too complicated to work with.
Axioms of Probability and Their Role in
Reinforcement Learning
1. Introduction
Before understanding how probability applies to RL, it’s important to learn the basic
axioms (rules) that govern probability. These axioms help formalize how probabilities
are assigned and manipulated, ensuring that RL models behave consistently.
2. Basic Notation
Let’s denote:
● S as the sample space (the set of all possible outcomes),
● A as an event (a subset of S),
● P(A) as the probability assigned to event A.
3. Axiom 1: Non-Negativity
P(A)≥0
This axiom states that the probability of any event is never negative. The lowest
possible value for a probability is 0 (impossible event), and the highest is 1 (certain
event).
Example in RL:
The probability of moving from state st to state st+1 after taking an action at must be a
non-negative number:
P(st+1 ∣ st, at) ≥ 0
This ensures that every transition has a logical, valid probability assigned.
4. Axiom 2: Normalization
P(S) = 1
The probability of the entire sample space is 1. This means that something in the sample
space must happen, and all possible outcomes together account for 100% of the probability.
Example in RL:
For any state st and action at, the probabilities of all possible next states must sum to 1:
∑s′ P(s′ ∣ st, at) = 1
This ensures that the agent will definitely end up in some state after an action — the
future is uncertain, but one outcome must happen.
5. Axiom 3: Additivity for Mutually Exclusive Events
If two events A and B cannot occur together (i.e., they are mutually exclusive), then the
probability that either one occurs is the sum of their individual probabilities:
P(A ∪ B) = P(A) + P(B)
Example in RL:
Suppose an agent in state s can move to either state s1 or s2, but not both at once (they
are disjoint). Then:
P(move to s1 or s2) = P(s1) + P(s2)
This axiom is used in planning and estimating total probability for multiple exclusive
transitions.
6. How the Axioms Are Applied in Reinforcement Learning
Now, let’s understand how these axioms are applied in Reinforcement Learning, and
why they are so important:
a. Markov Decision Processes (MDPs)
RL problems are often modeled as Markov Decision Processes (MDPs), which are
defined as a 5-tuple:
(S,A,P,R,γ)
Where:
● S = Set of states
● A = Set of actions
● P(s′∣s,a) = Transition probability of reaching state s′ from state s after taking action a
● R(s,a) = Reward
● γ = Discount factor
The transition function P(s′∣s,a) is entirely based on probability, and must obey
the axioms:
● P(s′∣s,a) ≥ 0 for every s, a, s′ (non-negativity)
● ∑s′ P(s′∣s,a) = 1 for every s, a (normalization)
These axioms ensure that the agent’s transitions are mathematically valid and
interpretable.
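A short sketch of storing a transition function as a table and checking it against the axioms; the states, actions, and probabilities are made up for illustration:

```python
# Transition table P[(s, a)] = {s_next: probability}; values are illustrative.
P = {
    ("s0", "go"): {"s1": 0.7, "s2": 0.3},
    ("s1", "go"): {"s1": 1.0},
}

def check_axioms(P, tol=1e-9):
    for (s, a), dist in P.items():
        # Axiom 1: every transition probability is non-negative
        assert all(p >= 0 for p in dist.values()), f"negative probability at {(s, a)}"
        # Axiom 2: the probabilities over all next states sum to 1
        assert abs(sum(dist.values()) - 1.0) < tol, f"probabilities at {(s, a)} do not sum to 1"
    # Axiom 3 (additivity): P(s1 or s2) = P(s1) + P(s2) for disjoint outcomes,
    # which holds automatically because each next state is a distinct key.
    return True

print(check_axioms(P))   # True: the table is a valid transition function
```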
b. Bellman Equation
The Bellman Equation is the foundation of many RL algorithms like Value Iteration and
Q-Learning. It depends on expected values of future states, which require probability
distributions that follow the axioms.
For example, the Bellman optimality equation for the state-value function is:
V*(s) = max_a ∑s′ P(s′∣s,a) [R(s,a) + γ V*(s′)]
In this equation:
● P(s′∣s,a) must be a valid probability distribution over next states, and
● the weighted sum over s′ is an expectation, which is only meaningful when those
probabilities obey the axioms.
Without the axioms, we could get invalid or misleading results from the Bellman
equation.
c. Designing Policies
A policy can be stochastic: π(a∣s) gives the probability of choosing action a in state s. A
valid policy must also obey the axioms:
● π(a∣s) ≥ 0 for every action a
● ∑a π(a∣s) = 1 for every state s
These constraints ensure that the policy makes valid choices — it always selects one action,
even if probabilistically.
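A brief sketch of turning arbitrary action preferences into a stochastic policy π(a∣s) that satisfies both constraints (a softmax over assumed preference scores):

```python
import math

def softmax_policy(preferences):
    """Map raw action preferences to probabilities that are non-negative and sum to 1."""
    exps = {a: math.exp(p) for a, p in preferences.items()}
    total = sum(exps.values())
    return {a: v / total for a, v in exps.items()}

# Hypothetical preference scores for the actions available in one state:
pi = softmax_policy({"up": 2.0, "down": 0.5, "left": 0.0})
print(pi, sum(pi.values()))   # the probabilities sum to 1.0
```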
d. Value Functions and Expected Rewards
Value functions represent the expected reward an agent will receive. To calculate an
expectation, we multiply possible outcomes by their probabilities, so again, valid
probabilities (following the axioms) are critical.
For example, the expected immediate reward can be written as:
E[R] = ∑r r · P(r)
This only makes sense if P(r) values are valid probabilities (i.e., they are non-negative
and sum to 1).
e. Sampling and Learning from Experience
In real-world environments, agents often sample actions and transitions to learn. These
samples are drawn from probabilistic functions that follow the axioms:
● actions are sampled from the policy π(a∣s),
● next states are sampled from the transition function P(s′∣s,a),
● rewards are drawn from the reward distribution P(r).
In each case, the core mechanism is probability, and valid behavior requires adherence
to the axioms.
7. Summary
Without these axioms, the entire framework of RL would collapse, as algorithms would
operate on undefined or invalid probability distributions.
8. Final Thoughts
Understanding these axioms deeply will help you master advanced RL algorithms like
Q-learning, Policy Gradient, Actor-Critic, and Deep RL models, all of which rely on valid
probability models.
These probabilistic models form the foundation of understanding how the agent learns
and improves its behavior.
2. Probability Mass Function (PMF)
Definition:
A Probability Mass Function (PMF) is a function that defines the probability of discrete
outcomes in a probability space. It applies when the random variable takes discrete
values. The PMF maps each possible outcome to its probability.
For example, consider the agent’s action a in a specific state s. If the environment is
stochastic, the next state st+1 could be one of several possible states, each with its own
probability. The PMF for the next state st+1 given the current state st and action at can
be expressed as:
P(st+1∣st,at)=p(st+1)
In RL:
The PMF is used to model the stochastic nature of state transitions. For example, if we
consider the probability distribution of next states, it is crucial for the agent to learn the
likelihood of reaching certain states after taking an action. It helps estimate the
expected outcomes of actions, making it essential for RL algorithms like Q-learning and
value iteration.
Example in RL:
Suppose an agent in state st takes action at and the environment can send it to one of three
next states with probabilities 0.7, 0.2, and 0.1. The PMF assigns these probabilities to the
discrete outcomes, and they sum to 1 as required.
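A tiny sketch of representing such a PMF over next states and sampling from it; the probabilities are assumed for illustration:

```python
import random

# PMF over next states for one (state, action) pair; values are illustrative.
pmf_next_state = {"s1": 0.7, "s2": 0.2, "s3": 0.1}

# Sample a next state according to the PMF.
next_state = random.choices(list(pmf_next_state),
                            weights=list(pmf_next_state.values()), k=1)[0]
print(next_state)   # "s1" about 70% of the time
```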
3. Probability Density Function (PDF)
Definition:
A Probability Density Function (PDF) is used when the random variable is continuous.
It describes the relative likelihood of the random variable taking a particular value. Unlike
the PMF, which sums to 1 over all discrete outcomes, the PDF integrates to 1 over a
continuous range of outcomes.
For a continuous random variable X, the PDF is denoted as fX(x), and the probability
that X lies in a certain range [a, b] is given by the integral of the PDF over that range:
P(a ≤ X ≤ b) = ∫_a^b fX(x) dx
In RL:
In RL, the PDF is often used when modeling continuous random variables like
continuous rewards or state variables. For instance, if the reward rt is not discrete, but
instead a continuous variable, we use the PDF to model the probability distribution of
rewards.
The continuous nature of the PDF allows for finer granularity in modeling
environments where states and rewards can take a range of values (e.g., in real-world
problems with continuous variables like position, velocity, or temperature).
📚 Example:
Imagine a robot that is commanded to move exactly 2.0 m forward.
● Because of small errors (like wheel slips), its position after moving isn't exact.
📌 The PDF helps model the probability of the robot ending up between 1.8
m and 2.2 m forward, rather than at exactly 2.0 m.
This helps the robot learn more accurately where it might end up — and adjust its
movement accordingly.
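A short sketch of modeling that continuous outcome with a normal distribution; the 2.0 m target and 0.1 m standard deviation are assumed values:

```python
import random

TARGET, NOISE_STD = 2.0, 0.1   # intended move and assumed noise (metres)

# Sample where the robot actually ends up after commanding a 2.0 m move.
samples = [random.gauss(TARGET, NOISE_STD) for _ in range(10_000)]

# Empirical probability of landing between 1.8 m and 2.2 m, i.e. the
# integral of the PDF over [1.8, 2.2] estimated by sampling.
p_in_range = sum(1.8 <= x <= 2.2 for x in samples) / len(samples)
print(round(p_in_range, 3))    # close to 0.954 for a normal distribution (±2 std)
```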
4. Cumulative Distribution Function (CDF)
Definition:
The Cumulative Distribution Function (CDF) of a random variable X gives the probability
that X will take a value less than or equal to x:
FX(x) = P(X ≤ x)
The CDF is derived from the PDF (or PMF) and provides the cumulative probability up to a
certain point.
In RL:
In reinforcement learning, the CDF can be used to represent the cumulative probability
distribution of rewards or state transitions. It helps the agent understand the likelihood
of receiving a certain amount of reward or ending up in a particular state within a given
range.
Example of CDF in RL
Imagine a robot delivery agent that gets a reward based on how quickly it completes a
delivery. The reward is not always the same — sometimes it gets 8, 9, or 10 points,
depending on traffic. These rewards follow a normal distribution (most of the time around 9,
but they can vary). The CDF tells the agent the probability of earning at most a certain
reward, for example P(reward ≤ 9).
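A small sketch of computing that cumulative probability with the normal CDF; the mean of 9 points and standard deviation of 0.5 are assumed:

```python
import math

MU, SIGMA = 9.0, 0.5   # assumed mean and spread of the delivery reward

def normal_cdf(x, mu=MU, sigma=SIGMA):
    """P(reward <= x) for a normally distributed reward."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

print(normal_cdf(9.0))    # 0.5: half the deliveries earn at most the mean reward
print(normal_cdf(10.0))   # ~0.977: almost all deliveries earn at most 10 points
```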
5. Expectation
Definition:
The expectation (or expected value) of a random variable is a measure of the center of
its distribution. It represents the average or mean value that we expect from a random
variable after repeated trials. Mathematically, for a discrete random variable X with
probability mass function P(X), the expectation E[X] is given by:
E[X] = ∑x x · P(X = x)
For a continuous random variable, the expectation is calculated using the PDF:
E[X] = ∫ x · fX(x) dx
In RL:
In RL, the Q-value for a state-action pair is essentially the expected cumulative reward
starting from that state and taking a certain action.
Example in RL:
If the agent takes action at in state st, the expected reward R can be calculated as:
E[R ∣ st, at] = ∑s′ P(s′ ∣ st, at) · R(st, at, s′)
This represents the expected reward the agent will get, based on the probabilities of
transitioning to different states s′ and the associated rewards.
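A minimal sketch of computing this expectation from an assumed transition distribution and reward table:

```python
# Illustrative transition probabilities and rewards for one (state, action) pair.
P_next = {"s1": 0.6, "s2": 0.3, "s3": 0.1}     # P(s' | st, at)
R_next = {"s1": 5.0, "s2": -1.0, "s3": 10.0}   # R(st, at, s')

# Expected reward: sum over next states of probability times reward.
expected_reward = sum(P_next[s] * R_next[s] for s in P_next)
print(expected_reward)   # 0.6*5 + 0.3*(-1) + 0.1*10 = 3.7
```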
6. Conclusion
● PMF and PDF help model discrete and continuous distributions for state
transitions and rewards.
● The CDF gives cumulative probabilities and is useful in analyzing and
interpreting the likelihood of certain outcomes.
● Expectation provides an average value that an agent can use to evaluate the
potential rewards of different actions and decisions.