ML U5 Notes

Types of Reinforcement Learning

Key Terms to Understand RL:


1. Agent: The learner or decision-maker in RL, such as a robot or software
program.
2. Environment: The world in which the agent operates and interacts.
3. State: The current situation or context of the agent in the environment.
o Example: A car’s current position and speed in a driving simulation.
4. Action: The decision the agent makes, like moving left, right, accelerating,
or braking.
5. Policy: The agent’s strategy, or the mapping from states to actions.
o Example: If the agent sees a red light (state), it decides to stop
(action).
6. Reward: A signal from the environment that tells the agent how good or bad
an action was.
o Example: Getting points for completing a task or a penalty for failing.
7. Value Function: A prediction of the future reward the agent will receive
from a given state or action.

Types of RL Algorithms:
1. Value-Based Algorithms
These algorithms focus on estimating how good (valuable) a state or action is by
learning a value function.
• How it works: The agent tries different actions and updates its
understanding of which actions give the best rewards.
Examples:
• Q-Learning:
o The agent learns the quality (Q-value) of each action in each state.
o It's called "model-free" because it doesn't need to know how the
environment works.
o Example: Learning the best moves in a maze.
• SARSA:
o Similar to Q-Learning but updates values based on the agent’s current
policy (its current way of acting).
o Example: A robot following its own learning path to improve.
• Deep Q-Networks (DQN):
o Uses deep neural networks to estimate Q-values when the state
space is very large (like pixels in a video game).
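To make the value-based idea concrete, here is a minimal sketch of tabular Q-Learning. The environment interface (reset()/step() returning next state, reward, and a done flag) and all parameter values are assumptions chosen for illustration, not a specific library.

```python
import numpy as np

# Minimal sketch of tabular Q-Learning. `env` is an assumed toy interface with
# reset() and step(action) -> (next_state, reward, done); it is not a real library.
def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))          # table of Q-values
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # off-policy update: bootstrap from the best next action
            target = reward + gamma * np.max(Q[next_state])
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```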

2. Policy-Based Algorithms
These directly learn the policy (the agent’s strategy) without estimating the value
of states or actions.
• How it works: The agent improves its strategy by continuously trying to
maximize rewards.
Examples:
• REINFORCE:
o A method that uses the full history of actions and rewards to improve
the policy.
o Example: Teaching a robot to walk by letting it explore different
movements.
• Actor-Only Methods:
o Focus only on improving the agent’s policy using gradient-based
techniques (math that helps in optimization).
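For the policy-based approach, the sketch below shows a bare-bones REINFORCE update with a softmax policy over a table of preferences. The environment interface and hyperparameters are again illustrative assumptions.

```python
import numpy as np

# Illustrative REINFORCE sketch for a small discrete problem. The environment
# interface (reset()/step()) is the same assumed toy interface as above.
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce(env, n_states, n_actions, episodes=500, alpha=0.01, gamma=0.99):
    theta = np.zeros((n_states, n_actions))      # policy parameters
    for _ in range(episodes):
        # 1. Run one full episode with the current policy
        trajectory, state, done = [], env.reset(), False
        while not done:
            probs = softmax(theta[state])
            action = np.random.choice(n_actions, p=probs)
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state
        # 2. Compute the return from each step and nudge the policy towards it
        G = 0.0
        for state, action, reward in reversed(trajectory):
            G = reward + gamma * G
            probs = softmax(theta[state])
            grad = -probs
            grad[action] += 1.0                  # gradient of log pi(a|s) w.r.t. theta[s]
            theta[state] += alpha * G * grad
    return theta
```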

3. Actor-Critic Methods
These combine value-based and policy-based approaches.
• How it works:
o The Actor: Learns the policy (decides actions).
o The Critic: Evaluates how good the actions are using a value
function.
Examples:
• A3C (Asynchronous Advantage Actor-Critic):
o Multiple agents (learners) work in parallel to speed up training.
o Example: Multiple robots exploring a warehouse at the same time.
• PPO (Proximal Policy Optimization):
o Makes actor-critic methods more stable by using small updates to the
policy.
o Example: Teaching a robot arm to pick up objects.
• DDPG (Deep Deterministic Policy Gradient):
o Extends actor-critic methods to handle continuous actions like turning
a car steering wheel.
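A minimal tabular actor-critic sketch, combining the two ideas above: the critic keeps a table of state values, and its TD error tells the actor how to adjust the policy. The interface and constants are assumptions for illustration.

```python
import numpy as np

# Minimal one-step actor-critic sketch (tabular, illustrative only). The
# environment interface is the same assumed toy reset()/step() as above.
def actor_critic(env, n_states, n_actions, episodes=500,
                 alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    theta = np.zeros((n_states, n_actions))      # actor: policy preferences
    V = np.zeros(n_states)                       # critic: state values
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            probs = np.exp(theta[state] - theta[state].max())
            probs /= probs.sum()
            action = np.random.choice(n_actions, p=probs)
            next_state, reward, done = env.step(action)
            # critic: the TD error measures how much better/worse things went than expected
            td_error = reward + gamma * V[next_state] * (not done) - V[state]
            V[state] += alpha_critic * td_error
            # actor: push the taken action's probability up or down by the TD error
            grad = -probs
            grad[action] += 1.0
            theta[state] += alpha_actor * td_error * grad
            state = next_state
    return theta, V
```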

4. Off-Policy vs. On-Policy Algorithms


These define how algorithms learn from data:
• Off-Policy: Learns from old data or data generated by another policy.
o Example: Learning to drive by watching recordings of others driving
(e.g., Q-Learning).
• On-Policy: Learns from data generated by its current policy.
o Example: A robot learning by following its current rules and
improving them (e.g., SARSA).
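The practical difference shows up in the update target. The snippet below contrasts the two; all variable names and values are toy assumptions.

```python
import numpy as np

# Illustrative contrast between the off-policy (Q-Learning) and on-policy
# (SARSA) targets; the table and the transition values are toy assumptions.
Q = np.zeros((6, 4))                 # 6 states, 4 actions
alpha, gamma = 0.1, 0.9
state, action, reward, next_state, next_action = 0, 1, -1.0, 2, 3

# Off-policy: bootstrap from the best next action, whatever the agent does next.
q_target = reward + gamma * Q[next_state].max()

# On-policy: bootstrap from the action the current policy actually took next.
sarsa_target = reward + gamma * Q[next_state, next_action]

# Both updates then move the estimate toward their target:
Q[state, action] += alpha * (q_target - Q[state, action])
```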

5. Evolutionary and Genetic Algorithms


These use concepts from natural selection to evolve policies over time.
• How it works:
o Start with a population of random policies (strategies).
o The best policies are selected, and variations are introduced to create
new policies.
o Repeat until the policy improves.
Examples:
• Genetic Algorithms (GA):
o Policies are represented like genes (strings of numbers).
o Example: Evolving strategies for playing chess.
• Neuroevolution:
o Uses evolution to train neural networks for policies.
o Example: Developing strategies for a video game without
backpropagation.
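A toy genetic-algorithm sketch for evolving a tabular policy (one action index per state). The fitness function, which should run the policy in the environment and return its total reward, and all the evolution parameters are assumptions for illustration.

```python
import numpy as np

# Toy genetic-algorithm sketch: evolve "gene strings" that assign an action to
# every state. `fitness(individual)` is an assumed user-supplied function.
def evolve_policy(fitness, n_states, n_actions,
                  pop_size=50, generations=100, mutation_rate=0.1):
    population = np.random.randint(n_actions, size=(pop_size, n_states))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in population])
        # selection: keep the best half of the population
        top = population[np.argsort(scores)[-pop_size // 2:]]
        # crossover: splice pairs of parents to make children
        children = []
        for _ in range(pop_size - len(top)):
            a, b = top[np.random.randint(len(top), size=2)]
            cut = np.random.randint(1, n_states)
            children.append(np.concatenate([a[:cut], b[cut:]]))
        population = np.vstack([top] + children)
        # mutation: randomly flip a few actions
        mask = np.random.rand(*population.shape) < mutation_rate
        population[mask] = np.random.randint(n_actions, size=mask.sum())
    scores = np.array([fitness(ind) for ind in population])
    return population[np.argmax(scores)]
```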

Summary of Approaches:
• Value-Based: Focus on estimating how good actions/states are.
• Policy-Based: Directly learn the strategy (policy).
• Actor-Critic: Combine the above two for better learning.
• Off-Policy vs. On-Policy: How data is used during learning.
• Evolutionary: Evolve policies using concepts from biology.

Getting Lost (just a story)


Scenario Setup:
1. Lost in a City:
o You’re in a foreign city at 3 a.m., lost, and looking for your
backpacker’s hostel.
o You don’t remember the name of the hostel but believe you’ll
recognize it when you see it.
o The area consists of several old squares, making it difficult to
navigate.
2. Constraints:
o You’re sure you’ve stayed within the old part of the city.
o You look at a map of the old town squares (Figure 11.3 in the book).
o You buy some food (potato chips) but decide to save it as a reward
for finding the hostel.
3. Reward System:
o You decide that eating chips is the ultimate reward but will only allow
yourself to eat them when you reach the hostel.
o You give yourself no rewards for intermediate actions, except for the
final goal.

RL Application in This Scenario:


1. Goal:
o Your goal is to reach the backpacker’s hostel, which you believe is in
square F on the map.
2. Actions and Rewards:
o Staying Still:
▪ You assign a penalty of −5 because it’s unproductive (like
pinching yourself to stay awake).
o Moving Between Squares:
▪ Moving is neutral; no rewards are assigned unless you reach the
target square (state F).
o Reaching the Hostel (State F):
▪ Once you reach F, you stay there, and it’s considered an
absorbing state (the end of the journey).
▪ At this point, you get the full reward (eating chips).
3. Reward Matrix:
o The reward matrix (R) describes the rewards for moving between
different squares.
o “−” in the matrix means no direct path exists between those squares.

Concepts Illustrated:
1. Delayed Reward:
o The reward is given only when the final goal is achieved, teaching the
agent to focus on the end result.
2. Absorbing State:
o A state where the agent remains once it reaches its goal (like staying
in the hostel in square F).
3. Exploration:
o Since you don’t know the reward structure initially, you must explore
the environment to discover it.

Next Steps in RL:


1. You (the agent) must learn the environment by exploring it (even though
you’re tired).
2. You will apply a reinforcement learning algorithm (like Q-Learning) to
discover the best actions to reach state F (the hostel).
11.2.1 State and Action Spaces
1. What is the State Space?
o It’s the collection of all possible situations (states) the learner (agent)
can experience.
o Example: In the lost traveler example, each square (A, B, C, etc.) on
the map is a state.
2. What is the Action Space?
o It’s the collection of all possible actions the agent can take.
o Example: Moving from one square to another or staying in the same
square.
3. Why Reduce the Size?
o A larger state or action space means more time is needed to learn and
find solutions.
o Example: Searching through all squares repeatedly to reach the goal
takes a long time.
4. Dimensionality Challenge:
o Example: If there are 5 inputs, each ranging from 0 to 100, the state
space has on the order of 100^5 = 10,000,000,000 (ten billion) states
(huge!).
o Solution: Simplify the data, e.g., reduce each input to just 2 categories
(values < 50 = Class 1, values ≥ 50 = Class 2). This reduces the state
space to 2^5 = 32.
5. Trade-off:
o Reducing state/action space too much might lose important details,
affecting the solution’s accuracy.
o Balance is key: make spaces smaller but preserve critical details.
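A quick arithmetic check of the two state-space sizes discussed above:

```python
# Quick check of the state-space sizes from the example (illustrative only).
n_inputs = 5

full_space = 100 ** n_inputs    # ~100 values per input -> 10_000_000_000 states
coarse_space = 2 ** n_inputs    # two classes per input  -> 32 states

print(full_space, coarse_space)
```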

11.2.2 Carrots and Sticks: The Reward Function


1. What is the Reward Function?
o It assigns a numerical reward (positive or negative) to each action the
agent takes in a given state.
o Example:
▪ Staying still in square A = reward of −5 (punishment).
▪ Reaching the hostel (square F) = big positive reward (e.g., 50).
2. Purpose of the Reward Function:
o Guides the agent toward its goal by maximizing rewards.
o Tells what to achieve (goal) but not how to achieve it.
3. Why is it External?
o The reward comes from the environment (external system), not the
agent itself.
o Example: The world doesn’t tell you what’s good or bad unless
there’s a mechanism (like your brain or a computer) interpreting it.
4. Designing the Reward Function:
o Must be carefully crafted to encourage desired behavior.
o Example:
▪ Reward 50 only when reaching the maze center → Agent learns
to find the center.
▪ Reward −1 per move + 50 at the maze center → Agent learns to
find the center efficiently (shortest path).
5. Episodic vs. Continual Tasks:
o Episodic Tasks:
▪ Have a clear endpoint (e.g., reaching the maze center or the
hostel).
▪ Rewards can be propagated back through the sequence of
actions to improve learning.
o Continual Tasks:
▪ No defined endpoint (e.g., learning to walk).
▪ Rewards must be predicted far into the future, which is harder.
6. Immediate vs. Total Reward:
o Immediate Reward:
▪ Small rewards/punishments for each action taken.
▪ Example: −1 per move.
o Total Reward:
▪ Combined reward from now until the task ends (or infinitely
into the future for continual tasks).
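As a small illustration of reward-function design, here is roughly what the maze reward described above could look like in code; the state names and the exact numbers are assumptions for the sketch.

```python
# Illustrative reward function for the maze example: -1 per move plus a large
# bonus at the goal encourages finding the shortest path. "centre" is an
# assumed name for the goal state.
GOAL = "centre"

def reward(state, action, next_state):
    step_cost = -1                            # small punishment for every move
    bonus = 50 if next_state == GOAL else 0   # big reward only at the goal
    return step_cost + bonus
```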

Key Takeaways:
• Smaller state and action spaces make learning faster but require careful
design to retain important details.
• The reward function is critical in defining the agent’s goal and guiding its
behavior.
• Balancing immediate and total rewards helps improve learning efficiency,
especially for tasks with different types of endpoints.

Discounting is like valuing a reward today more than the same reward in the future
because the future is uncertain.

Why Do We Need Discounting?


1. Uncertainty of the Future:
o We cannot be sure what will happen in the future.
o For example, in a game, taking an action now has a more predictable
outcome than one far ahead.
2. Immediate Rewards Are Reliable:
o Rewards that we receive soon are more certain.
o Farther rewards depend on multiple steps, and many things could
change, making them less reliable.
3. Practical Example:
Imagine you’re saving for a vacation.
o Someone offers you ₹500 today or ₹500 next year.
o You’ll likely take ₹500 today because you can use it now, and next
year’s promise may not happen.

The Concept of Discounting


In reinforcement learning, we calculate the total reward not just by adding future
rewards but by discounting them to reflect their reduced importance as they are
farther in the future.

Discounting Equation

The discounted return adds up future rewards, multiplying each one by an extra factor of the discount rate γ (with 0 ≤ γ < 1) for every step it lies in the future:

R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}
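A tiny sketch of this calculation; the reward values are made up for illustration.

```python
# Minimal sketch: computing a discounted return from a list of future rewards.
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# The same +50 reward is worth less the further away it is:
print(discounted_return([0, 0, 50]))   # 40.5: the 50 arrives at step 3, scaled by gamma**2
print(discounted_return([50]))         # 50.0: the 50 arrives immediately
```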

Why Prioritize Sooner Rewards?


1. More Certainty:
Rewards closer to the present are more predictable.
Example: If you’re learning to walk, avoiding a fall in the next step is more
certain than avoiding one 100 steps later.
2. Faster Feedback:
Immediate rewards provide quicker feedback to improve learning.
Example: A child gets praised immediately after saying “thank you,”
reinforcing good manners faster than waiting weeks for praise.
3. Real-World Analogy:
o Immediate Gain: Investing ₹100 today earns quicker returns than
waiting for 10 years.
o Similarly, a learning agent values a nearby small reward more than a
large, distant one that might never happen.


Summary:
• Discounting makes future rewards less valuable because of uncertainty.
• It helps the agent focus more on reliable, immediate rewards while still
considering future rewards.
• γ controls how much we care about the future:
o Smaller γ: Focus on the present.
o Larger γ: Consider the future more.

To put it simply (continuing the same story): suppose that before you reach the hostel, you take chips out of your bag and eat some at every step or square along the way. If you keep snacking like that, by the time you reach the hostel your appetite is gone.
That is why the final reward without discounting is greater than the final reward with discounting.

11.2.4 Action Selection


1. What is Action Selection?
o Choosing the next action based on expected rewards.
2. How to Estimate Action Value?
o Use Q_{s,t}(a): the average reward expected from taking
action a in state s.
o Over time, Q_{s,t}(a) converges to the true reward.
3. Methods of Action Selection:
o Greedy:
▪ Always choose the action with the highest Q_{s,t}(a).
▪ Pro: Exploits current knowledge.
▪ Con: May miss better solutions due to lack of exploration.
o ε-Greedy:
▪ Most of the time, choose the greedy action.
▪ Occasionally (with probability ε), pick a random action to
explore.
▪ Pro: Balances exploration and exploitation.
o Soft-max:
▪ Assign probabilities to actions based on their
Q_{s,t}(a) values.
▪ Formula:
P(Q_{s,t}(a)) = \frac{\exp(Q_{s,t}(a)/\tau)}{\sum_b \exp(Q_{s,t}(b)/\tau)}
▪ Parameter τ (Temperature):
▪ Large τ: Actions have similar probabilities → more
exploration.
▪ Small τ: Focus on actions with higher rewards →
more exploitation.
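A short sketch of both selection rules, assuming we already have the Q-values for one state; the values used in the demo call are made up.

```python
import numpy as np

# Illustrative action-selection rules for a single state, given its Q-values.
def epsilon_greedy(q_values, epsilon=0.1):
    # with probability epsilon explore, otherwise exploit the best-known action
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def softmax_action(q_values, tau=1.0):
    # temperature tau controls exploration: large tau -> nearly uniform,
    # small tau -> almost greedy
    prefs = np.exp((q_values - np.max(q_values)) / tau)
    probs = prefs / prefs.sum()
    return int(np.random.choice(len(q_values), p=probs))

q = np.array([1.0, 2.0, 0.5])          # toy Q-values for one state
print(epsilon_greedy(q), softmax_action(q, tau=0.5))
```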

11.2.5 Policy
1. What is a Policy (π)?
o A strategy for selecting the best action in each state to maximize
rewards.
o Example: A learned policy might dictate moving "right" in state A and
"down" in state B.
2. Why is the Policy Important?
o It determines how the agent behaves in every possible state.
o The goal is to find the optimal policy that gives the maximum total
reward.
3. Learning a Policy:
o The agent learns π(s), mapping states (s) to actions (a).
o The challenge is to:
▪ Maximize rewards while still exploring.
▪ Handle states that depend on the sequence of previous actions.

Summary:
1. Discounting helps prioritize immediate rewards over uncertain future
rewards using a discount factor γ.
2. Action selection balances:
o Exploitation (using current knowledge) via greedy methods.
o Exploration (discovering new strategies) via ε-greedy or soft-
max.
3. The policy guides the agent in choosing actions to maximize total rewards,
and the goal of reinforcement learning is to learn an optimal policy.

Markov Chain in Beginner Terms


Random Sampling
Random sampling is a method to select a subset of data or elements from a larger
set, where every element has an equal chance of being chosen. It ensures unbiased
selection and represents the overall dataset accurately.

Practical Example of Random Sampling


Scenario: You want to estimate the average height of students in a school with
1,000 students.
• Instead of measuring everyone's height (time-consuming), you randomly
select 100 students and calculate their average height.
• This random sample is used as an estimate of the true average height of all
1,000 students.
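A quick illustrative version of this example with synthetic heights:

```python
import numpy as np

# Illustrative version of the height example: estimate the school-wide average
# from a random sample of 100 students (the heights here are synthetic).
rng = np.random.default_rng(0)
heights = rng.normal(loc=160, scale=10, size=1000)    # all 1,000 students (cm)

sample = rng.choice(heights, size=100, replace=False)
print(heights.mean(), sample.mean())                  # true average vs sampled estimate
```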
What is Monte Carlo?
Monte Carlo (MC) methods are a class of techniques that solve problems using
random sampling to estimate numerical results. Instead of calculating exact
results, MC uses randomness to approximate solutions, especially for problems
that are too complex for deterministic methods.

How Does Monte Carlo Use Random Sampling?


Monte Carlo relies on random samples to simulate scenarios or estimate values. It
repeatedly generates random inputs, evaluates the system, and averages the results
to make predictions or calculations.
What is MCMC (Monte Carlo Markov Chain)?
MCMC combines Monte Carlo's random sampling with Markov Chains to
sample efficiently from complex probability distributions.
• Monte Carlo: Uses randomness to approximate results.
• Markov Chain: Samples in a way where the next sample depends only on
the current one (Markov Property).

Why MCMC?
In many problems (e.g., RL), directly sampling or calculating probabilities in large
spaces is infeasible. MCMC helps explore important regions of the space
efficiently without exhaustive calculations.

Practical Example: Monte Carlo in RL


Reinforcement Learning Problem
Imagine a robot navigating a grid world. The robot starts at one corner and wants
to reach the goal at the opposite corner while avoiding obstacles.
• State Space: All grid positions.
• Action Space: Move up, down, left, right.
• Rewards: +10 for reaching the goal, -1 for hitting an obstacle.
Using MCMC in RL
1. State Transition (Markov Chain):
o The robot uses a Markov Chain to decide its next state based on its
current state and action probabilities.
o For example:
▪ Current state: (1, 1).
▪ Possible transitions: (2, 1), (1, 2), or stay in (1, 1).
2. Monte Carlo Sampling:
o The robot simulates many random episodes of movements.
o Each episode consists of random transitions until the robot reaches the
goal or exceeds a time limit.
3. Estimate Rewards:
o For each episode, the robot calculates the total reward R_t by
summing immediate and discounted future rewards.
o Average the rewards over multiple episodes to estimate the value of
each state, V(s), or action, Q(s, a).
4. Policy Improvement:
o Based on estimated values, the robot updates its policy to choose
actions that maximize future rewards.
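A sketch of the Monte Carlo part of this loop (every-visit value estimation). Here `sample_episode` stands in for a helper that rolls out one episode under the current policy; it is an assumption for the sketch, not a real library call.

```python
import numpy as np

# Sketch of every-visit Monte Carlo value estimation on a toy grid world.
# `sample_episode()` is an assumed helper returning a list of (state, reward)
# pairs for one episode under the current (random) policy.
def monte_carlo_values(sample_episode, episodes=1000, gamma=0.9):
    returns = {}                               # state -> list of sampled returns
    for _ in range(episodes):
        G = 0.0
        for state, reward in reversed(sample_episode()):
            G = reward + gamma * G             # discounted return from this state onward
            returns.setdefault(state, []).append(G)
    # V(s) is the average return observed from each state
    return {s: float(np.mean(rs)) for s, rs in returns.items()}
```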

Detailed Example: MCMC in RL


Grid World with Obstacles
• Grid: 3 × 3, goal at (3, 3).
• Obstacles at (2, 2).
• Actions: Move up, down, left, right.
• Transition Probabilities:
o 0.8: Desired direction.
o 0.1 each: Slipping to the left or right of the desired direction.
Steps:
1. Initialize State and Policy:
o Start at (1, 1).
o Initial policy is random.
2. Simulate Random Episodes (Monte Carlo):
o The robot takes random actions using the transition probabilities.
o Records states, actions, and rewards during each episode.
3. Update Value Estimates (Markov Chain):
o Use the Markov Property to update the value of each state based on
sampled transitions: V(s) ← expected reward from s.
4. Converge to Optimal Policy:
o After enough episodes, the robot identifies the best policy by selecting
actions with the highest estimated rewards.
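A sketch of the noisy transition model used in these steps. Treating the obstacle at (2, 2) and the grid edges as "bounce back" (the robot stays where it is) is an assumption made only for this illustration.

```python
import numpy as np

# Sketch of the noisy transition model: the intended move succeeds with
# probability 0.8 and slips to either side with probability 0.1 each.
rng = np.random.default_rng(0)

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
SIDES = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}

def sample_next_state(state, action):
    roll = rng.random()
    if roll < 0.8:
        move = MOVES[action]                    # desired direction
    elif roll < 0.9:
        move = MOVES[SIDES[action][0]]          # slip one way
    else:
        move = MOVES[SIDES[action][1]]          # slip the other way
    x, y = state[0] + move[0], state[1] + move[1]
    if not (1 <= x <= 3 and 1 <= y <= 3) or (x, y) == (2, 2):
        return state                            # blocked by a wall or the obstacle
    return (x, y)

print(sample_next_state((1, 1), "right"))
```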

Advantages of MCMC in RL
1. Efficient Sampling:
o Focuses on frequently visited or important regions of the state space.
2. Handles Large Spaces:
o Useful for problems with many states where exhaustive exploration is
impossible.
3. Probabilistic Modeling:
o Captures uncertainties in transitions and rewards.
Proposal Distribution
In statistics and machine learning, proposal distribution is a key concept often
used in Monte Carlo methods, particularly in Markov Chain Monte Carlo
(MCMC) algorithms like Metropolis-Hastings or Gibbs Sampling.
To understand proposal distribution, let’s break it down step-by-step.

What is a Proposal Distribution?


A proposal distribution is a way of suggesting new candidate points (values) for
exploration during a sampling process. These candidate points are used to
approximate the desired probability distribution when we can't sample directly
from it.
Think of it like this:
• You’re exploring a new city but don’t have a map.
• To explore, you propose the next spot to visit based on some strategy (like
moving randomly or toward the nearest landmark).
• This strategy is your proposal distribution. It guides where you look next.

Why Do We Need a Proposal Distribution?


In many real-world problems, we want to work with a target distribution
P(x), but:
• It is complex or hard to calculate directly.
• We cannot easily draw samples from it.
To overcome this, we:
1. Use a simpler, easy-to-sample proposal distribution Q(x).
2. Use this to propose new points.
3. Gradually move toward the actual target distribution P(x) using
acceptance rules.
Key Characteristics of a Proposal Distribution
1. Simple to Sample From: Q(x) is often chosen to make sampling
easy (e.g., Gaussian or uniform distributions).
2. Similar to the Target: Ideally, Q(x) should resemble the shape of
the target distribution P(x). A poor match can result in inefficiency.
3. Flexible: The proposal distribution can adapt to the problem. For example:
o Fixed proposal distributions (e.g., Gaussian with a fixed variance).
o Adaptive proposal distributions that change over time.
Write out these two algorithms (Metropolis-Hastings and Gibbs Sampling).
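As a concrete illustration of how a proposal distribution is used, here is a minimal sketch of the first of the two, Metropolis-Hastings; the target density and all parameter values below are toy assumptions.

```python
import numpy as np

# Minimal Metropolis-Hastings sketch: a symmetric Gaussian proposal Q suggests
# candidate points, which are accepted or rejected so that the chain eventually
# samples from the target P. `target` is an unnormalised toy density.
def target(x):
    return np.exp(-0.5 * (x - 3.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 3.0) ** 2)

def metropolis_hastings(n_samples=10_000, proposal_std=1.0, seed=0):
    rng = np.random.default_rng(seed)
    samples, x = [], 0.0
    for _ in range(n_samples):
        candidate = x + rng.normal(0.0, proposal_std)     # propose from Q
        accept_prob = min(1.0, target(candidate) / target(x))
        if rng.random() < accept_prob:                    # accept, or keep the old x
            x = candidate
        samples.append(x)
    return np.array(samples)

print(metropolis_hastings()[:5])
```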

HMM (Hidden Markov Model) Basics


Before diving into the algorithms, let’s quickly understand Hidden Markov
Models (HMMs):
1. Hidden States: These are the states we can't directly observe. For example,
weather conditions (sunny, rainy) if we only see someone's behavior
(carrying an umbrella or not).
2. Observations: These are the data we can observe, like seeing someone with
an umbrella.
3. Goals of HMM Algorithms:
o Compute probabilities for observed sequences.
o Find the most likely sequence of hidden states.
o Train HMMs to learn probabilities.

1. Forward Algorithm (Finding the Probability of an Observation Sequence)


What is it?
The Forward Algorithm calculates the probability of a given observation sequence
(e.g., the probability of seeing a series of observations like "umbrella, no umbrella,
umbrella").
Why Use It?
To answer: “How likely is it to observe this sequence, given the HMM?”

How It Works (Step-by-Step)


1. Initialization:
o At time t = 1, calculate the probability of starting in each hidden
state and observing the first observation.
o Formula: \alpha_1(i) = \pi_i \cdot b_i(o_1)
2. Recursion:
o For each time step t, calculate the probability of reaching each state i
while considering all possible paths leading to i.
o Formula: \alpha_{t+1}(i) = \left[ \sum_j \alpha_t(j) \, a_{ji} \right] b_i(o_{t+1})
3. Termination:
o Add up the probabilities of all possible paths that lead to the final
observation.
o Formula: P(o_1, \dots, o_T) = \sum_i \alpha_T(i)
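A compact sketch of the Forward Algorithm; the initial distribution pi, transition matrix A, emission matrix B, and observation sequence are toy values chosen for illustration.

```python
import numpy as np

# Sketch of the Forward Algorithm: pi is the initial state distribution, A the
# state-transition matrix, B the emission matrix, obs a list of observation indices.
def forward(pi, A, B, obs):
    n_states = len(pi)
    alpha = np.zeros((len(obs), n_states))
    alpha[0] = pi * B[:, obs[0]]                       # initialization
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # recursion
    return alpha[-1].sum()                             # termination: P(observations)

pi = np.array([0.6, 0.4])                 # toy values, e.g. [rainy, sunny]
A = np.array([[0.7, 0.3], [0.4, 0.6]])    # hidden-state transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])    # P(umbrella / no umbrella | state)
print(forward(pi, A, B, obs=[0, 1, 0]))
```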

2. Viterbi Algorithm (Finding the Most Likely Sequence of Hidden States)


What is it?
The Viterbi Algorithm finds the most likely sequence of hidden states (e.g.,
"rainy, sunny, rainy") that could have produced the observed sequence.
Why Use It?
To answer: “What’s the most likely sequence of hidden states that explains these
observations?”
How It Works (Step-by-Step)
1. Initialization:
o At t = 1, calculate the probability of starting in each hidden state
and observing the first observation, just like in the Forward
Algorithm.
o Formula: \delta_1(i) = \pi_i \cdot b_i(o_1)
2. Recursion:
o At each time step, keep only the single most probable path into each
state (and remember which previous state it came from):
\delta_t(i) = \max_j \left[ \delta_{t-1}(j) \, a_{ji} \right] b_i(o_t)
3. Backtracking:
o Trace back through the states that maximized the probability to get the
most likely sequence.
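A matching Viterbi sketch, using the same toy pi, A, B, and observation convention as the forward sketch above.

```python
import numpy as np

# Sketch of the Viterbi Algorithm: returns the most likely hidden-state sequence.
def viterbi(pi, A, B, obs):
    T, n_states = len(obs), len(pi)
    delta = np.zeros((T, n_states))               # best path probability so far
    psi = np.zeros((T, n_states), dtype=int)      # best predecessor for each state
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # scores[i, j]: come from i into j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # backtracking: follow the stored predecessors from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path))

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, obs=[0, 1, 0]))
```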
3. Forward-Backward Algorithm (Finding Probabilities of States)
What is it?
The Forward-Backward Algorithm calculates the posterior probability of each
hidden state at each time step, given the entire observation sequence.
Why Use It?
To answer: “What’s the probability of being in a particular state at a specific time,
given the observations?”
How It Works (Step-by-Step)
1. Forward Pass:
o Use the Forward Algorithm to calculate the probabilities of observing
the sequence up to time t and ending in a particular state.
2. Backward Pass:
o Calculate the probabilities of observing the sequence from time
t + 1 to the end, starting in a particular state.
o Formula for the backward step: \beta_t(i) = \sum_j a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j), with \beta_T(i) = 1.
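A sketch of the backward pass, using the same toy A, B, and observation convention as the forward sketch above.

```python
import numpy as np

# Sketch of the backward pass: beta[t, i] is the probability of the
# observations after time t, given hidden state i at time t.
def backward(A, B, obs):
    T, n_states = len(obs), A.shape[0]
    beta = np.ones((T, n_states))                 # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(backward(A, B, obs=[0, 1, 0]))
```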

Particle Filtering Algorithm


The Particle Filtering Algorithm (also known as Sequential Monte Carlo) is used
to estimate the state of a system when measurements are noisy or uncertain, and
the system may follow non-linear or non-Gaussian models. It is widely used in
robotics, tracking, and signal processing.

What is Particle Filtering?


Particle filtering is a Bayesian filtering technique that represents the probability
distribution of a system's state using a set of weighted random samples, called
particles.

Why Use Particle Filtering?


1. Non-Linear Models: Unlike Kalman Filters, it works well with non-linear
systems.
2. Non-Gaussian Noise: Handles measurement noise that doesn't follow a
Gaussian (normal) distribution.
3. Approximation via Particles: Approximates the probability distribution
instead of assuming specific distributions.

Key Concepts
1. State: The variable being estimated (e.g., position of a robot, velocity, etc.).
o Represented as x_k at time k.
2. Particles: A collection of samples (points) that represent the possible states
of the system.
3. Weights: Each particle is assigned a weight based on how well it matches
the measurements.
4. Prediction: Uses a model to estimate the next state of the particles.
5. Update: Adjusts particle weights based on observed measurements.
6. Resampling: Particles with higher weights are kept, while those with lower
weights are discarded or replaced.
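A minimal 1-D bootstrap particle filter sketch showing the predict / update / resample loop; the motion model, noise levels, and observation values are all toy assumptions.

```python
import numpy as np

# Minimal 1-D particle filter sketch (bootstrap filter). The constant-velocity
# motion model, the Gaussian measurement noise, and the observations are toy
# assumptions chosen only to show the predict / update / resample loop.
def particle_filter(observations, n_particles=1000,
                    process_std=1.0, meas_std=2.0, seed=0):
    rng = np.random.default_rng(seed)
    particles = rng.normal(0.0, 5.0, n_particles)     # initial guess of the state
    estimates = []
    for z in observations:
        # 1. Predict: move each particle through the (noisy) motion model
        particles = particles + 1.0 + rng.normal(0.0, process_std, n_particles)
        # 2. Update: weight particles by how well they explain the measurement z
        weights = np.exp(-0.5 * ((z - particles) / meas_std) ** 2)
        weights /= weights.sum()
        # 3. Estimate: weighted mean of the particles
        estimates.append(float(np.sum(weights * particles)))
        # 4. Resample: keep high-weight particles, drop low-weight ones
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[idx]
    return estimates

print(particle_filter(observations=[1.0, 2.2, 2.9, 4.1]))
```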
