ML U5 Notes
Types of RL Algorithms:
1. Value-Based Algorithms
These algorithms focus on estimating how good (valuable) a state or action is by
learning a value function.
• How it works: The agent tries different actions and updates its
understanding of which actions give the best rewards.
Examples:
• Q-Learning:
o The agent learns the quality (Q-value) of each action in each state.
o It's called "model-free" because it doesn't need to know how the
environment works.
o Example: Learning the best moves in a maze (a small update-rule sketch follows this list).
• SARSA:
o Similar to Q-Learning but updates values based on the agent’s current
policy (its current way of acting).
o Example: A robot following its own learning path to improve.
• Deep Q-Networks (DQN):
o Uses deep neural networks to estimate Q-values when the state
space is very large (like pixels in a video game).
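A minimal sketch of the tabular Q-Learning update described above (plain Python; the dictionary-based Q-table, learning rate, and discount factor are illustrative choices, not from the notes):

def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    # Q is a dict keyed by (state, action) pairs.
    # Move Q(s, a) toward the observed reward plus the discounted best next value.
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return Q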
2. Policy-Based Algorithms
These directly learn the policy (the agent’s strategy) without estimating the value
of states or actions.
• How it works: The agent improves its strategy by continuously trying to
maximize rewards.
Examples:
• REINFORCE:
o A method that uses the full history of actions and rewards to improve
the policy.
o Example: Teaching a robot to walk by letting it explore different movements (a small return-computation sketch follows this list).
• Actor-Only Methods:
o Focus only on improving the agent’s policy using gradient-based techniques (optimization that follows the gradient of expected reward).
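A minimal sketch (plain Python) of the per-episode return computation REINFORCE relies on; in a full implementation each return G_t would multiply the gradient of log π(a_t | s_t) to update the policy parameters, which is omitted here:

def reinforce_returns(rewards, gamma=0.99):
    # Compute the return G_t for every time step of one finished episode.
    # REINFORCE uses the full history: every action is credited with all the
    # (discounted) reward that followed it.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))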
3. Actor-Critic Methods
These combine value-based and policy-based approaches.
• How it works:
o The Actor: Learns the policy (decides actions).
o The Critic: Evaluates how good the actions are using a value
function.
Examples:
• A3C (Asynchronous Advantage Actor-Critic):
o Multiple agents (learners) work in parallel to speed up training.
o Example: Multiple robots exploring a warehouse at the same time.
• PPO (Proximal Policy Optimization):
o Makes actor-critic methods more stable by using small updates to the
policy.
o Example: Teaching a robot arm to pick up objects.
• DDPG (Deep Deterministic Policy Gradient):
o Extends actor-critic methods to handle continuous actions like turning
a car steering wheel.
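A minimal tabular sketch of the actor-critic interplay described above (plain Python; the dictionary-based value table and action-preference table are illustrative simplifications): the critic computes a TD error, and the actor uses that error to adjust how strongly it prefers the action it just took.

def actor_critic_step(V, prefs, state, action, reward, next_state,
                      alpha_critic=0.1, alpha_actor=0.01, gamma=0.9):
    # Critic: TD error = how much better or worse the outcome was than expected.
    td_error = reward + gamma * V.get(next_state, 0.0) - V.get(state, 0.0)
    # Critic update: move the value estimate toward the observed target.
    V[state] = V.get(state, 0.0) + alpha_critic * td_error
    # Actor update: make the taken action more (or less) preferred in this state.
    prefs[(state, action)] = prefs.get((state, action), 0.0) + alpha_actor * td_error
    return V, prefs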
Summary of Approaches:
• Value-Based: Focus on estimating how good actions/states are.
• Policy-Based: Directly learn the strategy (policy).
• Actor-Critic: Combine the above two for better learning.
• Off-Policy vs. On-Policy: Whether the agent learns from data generated by its own current policy (on-policy, e.g., SARSA) or by a different or older policy (off-policy, e.g., Q-Learning).
• Evolutionary: Evolve policies using ideas from biological evolution, such as selection and mutation.
Concepts Illustrated:
1. Delayed Reward:
o The reward is given only when the final goal is achieved, teaching the
agent to focus on the end result.
2. Absorbing State:
o A state where the agent remains once it reaches its goal (like staying
in the hostel in square F).
3. Exploration:
o Since you don’t know the reward structure initially, you must explore
the environment to discover it.
Key Takeaways:
• Smaller state and action spaces make learning faster but require careful
design to retain important details.
• The reward function is critical in defining the agent’s goal and guiding its
behavior.
• Balancing immediate and total rewards helps improve learning efficiency,
especially for tasks with different types of endpoints.
Discounting is like valuing a reward today more than the same reward in the future
because the future is uncertain.
Discounting Equation
The discounted return from time t is G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ... = Σ_{k=0}^{∞} γ^k · r_{t+k+1}, where 0 ≤ γ ≤ 1 is the discount factor.
Example:
With γ = 0.9 and a reward of 1 at each of three steps, the discounted return is 1 + 0.9 + 0.81 = 2.71; the undiscounted sum (γ = 1) would be 3.
Summary:
• Discounting makes future rewards less valuable because of uncertainty.
• It helps the agent focus more on reliable, immediate rewards while still
considering future rewards.
• γ controls how much we care about the future:
o Smaller γ: Focus on the present.
o Larger γ: Consider the future more.
To put it simply (continuing the same story): on the way to the hostel, you take chips out of your bag and eat some at every step (square). If you keep snacking like that, by the time you reach the hostel your appetite has dropped.
That is why the final reward without discounting is greater than the final reward with discounting.
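A minimal sketch (plain Python; the reward list and γ value are illustrative) showing how the discounted return above can be computed:

def discounted_return(rewards, gamma=0.9):
    # Sum r_1 + gamma*r_2 + gamma^2*r_3 + ...
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# The same rewards are worth less once discounted:
# discounted_return([1, 1, 1], gamma=1.0) -> 3.0
# discounted_return([1, 1, 1], gamma=0.9) -> 2.71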
11.2.5 Policy
1. What is a Policy (π)?
o A strategy for selecting the best action in each state to maximize
rewards.
o Example: A learned policy might dictate moving "right" in state A and
"down" in state B.
2. Why is the Policy Important?
o It determines how the agent behaves in every possible state.
o The goal is to find the optimal policy that gives the maximum total
reward.
3. Learning a Policy:
o The agent learns π(s), mapping states (s) to actions (a).
o The challenge is to:
▪ Maximize rewards while still exploring.
▪ Handle states that depend on the sequence of previous actions.
Summary:
1. Discounting helps prioritize immediate rewards over uncertain future rewards using a discount factor γ.
2. Action selection balances:
o Exploitation (using current knowledge) via greedy methods.
o Exploration (discovering new strategies) via ϵ\epsilonϵ-greedy or soft-
max.
3. The policy guides the agent in choosing actions to maximize total rewards,
and the goal of reinforcement learning is to learn an optimal policy.
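A minimal sketch (plain Python; function and variable names are illustrative) of the ε-greedy action selection mentioned in the summary:

import random

def epsilon_greedy(q_values, epsilon=0.1):
    # q_values: dict mapping action -> estimated Q-value for the current state.
    # With probability epsilon, explore (random action); otherwise exploit (best action).
    if random.random() < epsilon:
        return random.choice(list(q_values.keys()))
    return max(q_values, key=q_values.get)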
Why MCMC?
In many problems (e.g., RL), directly sampling or calculating probabilities in large
spaces is infeasible. MCMC helps explore important regions of the space
efficiently without exhaustive calculations.
Advantages of MCMC in RL
1. Efficient Sampling:
o Focuses on frequently visited or important regions of the state space.
2. Handles Large Spaces:
o Useful for problems with many states where exhaustive exploration is
impossible.
3. Probabilistic Modeling:
o Captures uncertainties in transitions and rewards.
Proposal Distribution
In statistics and machine learning, proposal distribution is a key concept often
used in Monte Carlo methods, particularly in Markov Chain Monte Carlo
(MCMC) algorithms like Metropolis-Hastings or Gibbs Sampling.
To understand proposal distribution, let’s break it down step-by-step.
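A minimal sketch (plain Python; the Gaussian proposal width and the example target density are illustrative assumptions) showing where the proposal distribution sits inside a Metropolis-Hastings sampler:

import math
import random

def metropolis_hastings(target_pdf, x0, steps=1000, proposal_std=1.0):
    # target_pdf: unnormalized density we want samples from.
    # Proposal distribution: a Gaussian centred on the current sample.
    samples, x = [], x0
    for _ in range(steps):
        candidate = random.gauss(x, proposal_std)  # draw from the proposal
        # Acceptance ratio; proposal terms cancel because the Gaussian is symmetric.
        accept_prob = min(1.0, target_pdf(candidate) / target_pdf(x))
        if random.random() < accept_prob:
            x = candidate
        samples.append(x)
    return samples

# Example: sample from an (unnormalized) standard normal density.
# samples = metropolis_hastings(lambda v: math.exp(-v * v / 2), x0=0.0)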
Termination:
• Add up the probabilities of all possible paths that lead to the final
observation.
• Formula: P(O | λ) = Σ_i α_T(i), where α_T(i) is the forward probability of observing the whole sequence and ending in state i at the final time step T.
1. Backtracking:
o Trace back through the states that maximized the probability to get the
most likely sequence.
3. Forward-Backward Algorithm (Finding Probabilities of States)
What is it?
The Forward-Backward Algorithm calculates the posterior probability of each
hidden state at each time step, given the entire observation sequence.
Why Use It?
To answer: “What’s the probability of being in a particular state at a specific time,
given the observations?”
How It Works (Step-by-Step)
1. Forward Pass:
o Use the Forward Algorithm to calculate the probabilities of observing the sequence up to time t and ending in a particular state.
2. Backward Pass:
o Calculate the probabilities of observing the sequence from time t+1 to the end, starting in a particular state.
o Formula for the backward step: β_t(i) = Σ_j a_ij · b_j(o_{t+1}) · β_{t+1}(j), where a_ij is the transition probability from state i to state j and b_j(o_{t+1}) is the probability of emitting observation o_{t+1} in state j (a compact sketch follows this list).
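A compact sketch of the forward-backward computation above (plain Python; A is the transition matrix, B the emission matrix, pi the initial state distribution; all names are illustrative):

def forward_backward(obs, A, B, pi):
    # A[i][j]: transition prob i -> j; B[i][o]: prob of emitting observation o in state i.
    N, T = len(A), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[0.0] * N for _ in range(T)]
    # Forward pass: probability of the observations up to t, ending in each state.
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(alpha[t-1][i] * A[i][j] for i in range(N))
    # Backward pass: probability of the observations after t, starting from each state.
    for i in range(N):
        beta[T-1][i] = 1.0
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j] for j in range(N))
    # Posterior probability of being in state i at time t, given all observations.
    gamma = []
    for t in range(T):
        norm = sum(alpha[t][i] * beta[t][i] for i in range(N))
        gamma.append([alpha[t][i] * beta[t][i] / norm for i in range(N)])
    return gamma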
Key Concepts
1. State: The variable being estimated (e.g., position of a robot, velocity, etc.).
o Represented as x_k at time k.
2. Particles: A collection of samples (points) that represent the possible states
of the system.
3. Weights: Each particle is assigned a weight based on how well it matches
the measurements.
4. Prediction: Uses a model to estimate the next state of the particles.
5. Update: Adjusts particle weights based on observed measurements.
6. Resampling: Particles with higher weights are kept, while those with lower
weights are discarded or replaced.
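A minimal 1-D particle filter step (plain Python; the Gaussian motion and measurement models and their noise levels are illustrative assumptions) showing the prediction, update, and resampling stages listed above:

import math
import random

def particle_filter_step(particles, control, measurement,
                         motion_noise=0.1, meas_noise=0.5):
    # Prediction: push each particle through the assumed motion model.
    particles = [p + control + random.gauss(0, motion_noise) for p in particles]
    # Update: weight each particle by how well it explains the measurement.
    weights = [math.exp(-((p - measurement) ** 2) / (2 * meas_noise ** 2))
               for p in particles]
    total = sum(weights) or 1.0
    weights = [w / total for w in weights]
    # Resampling: keep particles in proportion to their weights.
    return random.choices(particles, weights=weights, k=len(particles))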