
Reinforcement Learning (RL):

Reinforcement Learning (RL) is a machine learning technique in which an agent learns how to make decisions by interacting with an environment to achieve a goal. Unlike supervised learning, RL does not rely on labeled data; instead, it learns from feedback in the form of rewards or penalties.

Key Concepts of RL

1. Components of RL

1. Agent: The learner or decision-maker (e.g., a robot, software).

2. Environment: The external system with which the agent interacts.

3. State (S): The current situation or condition of the agent in the environment.

4. Action (A): The choices or moves the agent can take in a given state.

5. Reward (R): Feedback from the environment after an action.

o Positive reward: Encourages the action.

o Negative reward: Discourages the action.

6. Policy (π): A strategy or mapping from states to actions.

o Determines what action to take in a given state.

7. Value Function: Predicts the long-term cumulative reward for a state or state-action pair.

2. Goal of RL

The goal is to learn a policy that maximizes the cumulative reward over
time, considering both immediate and future rewards.

Example:
Training a robot to walk:

• Agent: Robot.

• Environment: Ground or track.

• State: Robot's position and posture.

• Action: Move forward, backward, or stay still.

• Reward: +1 for moving forward, -1 for falling.
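To make the loop concrete, here is a minimal Python sketch of the agent-environment interaction for this robot example. The WalkingRobotEnv class, its methods, and the reward values are illustrative assumptions, not a real library.

import random

# Hypothetical environment for the walking-robot example; class name, states,
# and reward values are made up for illustration.
class WalkingRobotEnv:
    def reset(self):
        return "standing"                          # initial state

    def step(self, action):
        """Return (next_state, reward, done) for the chosen action."""
        if action == "forward":
            if random.random() < 0.1:              # small chance of falling
                return "fallen", -1, True
            return "walking", +1, False
        return "standing", 0, False                # "backward" or "stay"

env = WalkingRobotEnv()
actions = ["forward", "backward", "stay"]
state = env.reset()
total_reward, steps, done = 0, 0, False

# The RL loop: observe the state, take an action, receive a reward, repeat.
while not done and steps < 100:
    action = random.choice(actions)                # a learned policy would replace this
    state, reward, done = env.step(action)
    total_reward += reward
    steps += 1

print("cumulative reward:", total_reward)

Replacing random.choice with a policy that improves from the observed rewards is exactly what the rest of these notes describe.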

Reinforcement Learning Framework:

The Reinforcement Learning (RL) framework is the structured process used to model and solve decision-making problems where an agent learns by interacting with its environment. It consists of key elements that define how the agent learns, explores, and optimizes its actions to achieve a goal.

Components of the RL Framework

1. Agent:
The decision-maker or learner in the system.

o Example: A robot, a chess-playing AI, or an automated trading algorithm.

2. Environment:
Everything outside the agent that it interacts with. The environment provides feedback to the agent.

o Example: A chessboard for a chess-playing AI or the stock market for a trading bot.

3. State (S):
A representation of the current situation or condition of the agent in the environment.

o Example: The positions of pieces on a chessboard.

4. Action (A):
The possible moves or decisions the agent can make in a given state.

o Example: Moving a pawn in chess or buying stocks.

5. Reward (R):
Feedback given by the environment to the agent for performing an action in a state.

o Positive reward: Encourages the action.

o Negative reward: Discourages the action.

o Example: +10 for winning a game, -1 for a wrong move.

6. Policy (π):
A strategy that maps states to actions, telling the agent what to do in each state.

o Example: Always move the highest-valued piece in chess.

7. Value Function (V):
Estimates the long-term reward an agent can expect from a state by following a policy.

o Example: If being in the middle of the chessboard often leads to winning, that state has a high value.

8. Action-Value Function (Q):
Estimates the long-term reward for taking a specific action in a given state.

o Example: In chess, moving the queen to the center may have a higher value than moving a pawn.

9. Discount Factor (γ):
A value between 0 and 1 that determines how much importance is given to future rewards.

o γ = 1: Long-term rewards are as important as immediate rewards.

o γ < 1: Future rewards are less important.
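As a quick illustration of how γ weights future rewards, the short sketch below computes the discounted return G = R1 + γ·R2 + γ²·R3 + … for a made-up reward sequence:

def discounted_return(rewards, gamma):
    """Sum of rewards weighted by increasing powers of the discount factor."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1, 1, 1, 10]                  # made-up reward sequence
print(discounted_return(rewards, 1.0))   # 13.0 -> future rewards count in full
print(discounted_return(rewards, 0.5))   # 3.0  -> future rewards are down-weighted
print(discounted_return(rewards, 0.0))   # 1.0  -> only the immediate reward matters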

RL Framework Example: Training a Dog

1. Agent: The dog.

2. Environment: The room where the dog is trained.

3. State (S): The dog is sitting or standing.

4. Action (A): Commands like "sit," "stand," or "fetch."

5. Reward (R): A treat when the dog performs correctly; no reward otherwise.

6. Policy (π): The dog learns which actions lead to rewards.

7. Value Function (V): Over time, the dog understands the value of following a command.

Acronym: Use "SPARQ" for State, Policy, Action, Reward, Q-value to remember key terms.
Visualization: Draw a flowchart showing the RL loop (State → Action →
Reward → Update Policy).
Markov Decision Process (MDP)
A Markov Decision Process (MDP) is a mathematical framework used to
model decision-making problems where outcomes are partly random and
partly under the control of a decision-maker (agent). MDPs are widely used in
Reinforcement Learning (RL) to formalize the environment and guide the
agent’s learning process.

Key Characteristics of an MDP

1. Markov Property:
The future state depends only on the current state and the action taken, not on the history of past states.

Components of an MDP
An MDP is defined by five elements: (S,A,P,R,γ).
1. States (S)
The set of all possible situations the agent can be in.
• Example:
In a grid-based game, the states might be all grid positions the agent
can occupy.
2. Actions (A)
The set of all possible actions the agent can take in each state.
• Example:
In the grid-based game, actions might be move up, down, left, or
right.
3. Transition Probability (P)
The probability of moving from one state to another given a specific action.
• Formally: P(s′∣s,a)
o s: Current state.
o a: Action taken.
o s′: Next state.
4. Reward Function (R)
The immediate feedback the agent receives after transitioning from one state to another by taking an action.
• Formally: R(s,a,s′)
o R: Scalar value representing reward.
5. Discount Factor (γ)
Determines how much the agent values future rewards compared to immediate rewards.
• 0 ≤ γ ≤ 1
o γ=0: Only immediate rewards matter.
o γ=1: Future rewards are valued equally to immediate rewards.
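As an illustration, the five elements above can be written down directly as plain Python data structures for a toy two-state MDP. The states, actions, and numbers are made up for the example, and for simplicity the reward is written as R(s,a) rather than R(s,a,s′):

# Toy MDP (S, A, P, R, γ) with two states and two actions; all values are illustrative.
states = ["s0", "s1"]
actions = ["left", "right"]

# Transition probabilities: P[s][a] = {next_state: probability}
P = {
    "s0": {"left": {"s0": 1.0},            "right": {"s1": 0.9, "s0": 0.1}},
    "s1": {"left": {"s0": 0.8, "s1": 0.2}, "right": {"s1": 1.0}},
}

# Reward function: R[s][a] = immediate reward for taking action a in state s
R = {
    "s0": {"left": 0.0, "right": 1.0},
    "s1": {"left": 0.0, "right": 2.0},
}

gamma = 0.9  # discount factor

The same dictionary layout is used again in the Bellman, Value Iteration, and Policy Iteration sketches further below.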
Applications of MDP
1. Robotics: Navigating a robot in uncertain environments.
2. Gaming: Designing AI for board games or video games.
3. Finance: Making optimal investment decisions.
4. Healthcare: Optimizing treatment plans over time.
Bellman Equations:
The Bellman Equations are mathematical formulas used in
Reinforcement Learning (RL) and Markov Decision Processes (MDPs)
to describe the relationship between the value of a state and the values of
subsequent states. They serve as the foundation for finding optimal policies
in decision-making problems.

Why Do We Need Bellman Equations?


In RL, the goal is to maximize the cumulative reward (future rewards) from
each state. The Bellman equations break down this goal into smaller,
recursive relationships:
• The value of the current state depends on the rewards from the current state and the value of the next states.

Types of Bellman Equations


1. Bellman Equation for the State Value Function (V(s)):
The state value function V(s) gives the expected cumulative reward starting from state s and following a policy π. It can be written as:

V(s) = Σ_a π(a∣s) [ R(s,a) + γ Σ_{s′} P(s′∣s,a) V(s′) ]

Breaking It Down:
• s: Current state.
• a: Action taken in the current state.
• s′: Next state reached from s after taking a.
• R(s,a): Immediate reward for taking action a in state s.
• γ: Discount factor (how much future rewards matter compared to immediate ones).
• P(s′∣s,a): Probability of transitioning to s′ from s after taking action a.
• V(s′): Value of the next state.

2. Bellman Equation for the Action Value Function (Q(s,a)):
The action value function Q(s,a) gives the expected cumulative reward starting from state s, taking action a, and then following the policy. In its optimality form (where the agent subsequently acts greedily), it can be written as:

Q(s,a) = R(s,a) + γ Σ_{s′} P(s′∣s,a) max_{a′} Q(s′,a′)

Breaking It Down:
• Q(s,a): The quality of taking action a in state s.
• R(s,a): Immediate reward for the action.
• γ: Discount factor.
• P(s′∣s,a): Probability of transitioning to s′.
• max_{a′} Q(s′,a′): The best possible reward from the next state s′.

How Are They Used?


The Bellman equations are used iteratively to compute the values of states
or actions:
1. Start with initial estimates of V(s) or Q(s,a) (usually zeros).
2. Update the values using the equations until they converge to the
optimal solution.
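These two relationships translate directly into update rules. Below is a minimal sketch of the corresponding one-step backups in Python, assuming the same dictionary layout as the toy MDP shown earlier (P[s][a] maps next states to probabilities, R[s][a] gives the immediate reward); the tiny example MDP at the end is purely illustrative.

# One-step Bellman backups, assuming P[s][a] = {next_state: prob} and R[s][a] = reward.

def v_backup(s, policy, V, P, R, gamma):
    """Bellman expectation backup for V(s) under a stochastic policy[s][a]."""
    return sum(
        policy[s][a] * (R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
        for a in P[s]
    )

def q_backup(s, a, Q, P, R, gamma):
    """Bellman optimality backup for Q(s, a)."""
    return R[s][a] + gamma * sum(p * max(Q[s2].values()) for s2, p in P[s][a].items())

# Tiny usage example (one state, one action; purely illustrative):
P = {"s0": {"stay": {"s0": 1.0}}}
R = {"s0": {"stay": 1.0}}
policy = {"s0": {"stay": 1.0}}
print(v_backup("s0", policy, {"s0": 0.0}, P, R, 0.9))            # 1.0
print(q_backup("s0", "stay", {"s0": {"stay": 0.0}}, P, R, 0.9))  # 1.0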
Value Iteration
Value Iteration is an algorithm that computes the optimal state value
function V∗(s) directly by iteratively applying the Bellman Optimality
Equation. Once V∗(s) is computed, the optimal policy π∗ can be derived.
Steps of Value Iteration
1. Initialization: Set V(s) to an arbitrary value (commonly 0) for every state.
2. Bellman Update: For each state, apply the Bellman Optimality Equation:
   V(s) ← max_a [ R(s,a) + γ Σ_{s′} P(s′∣s,a) V(s′) ]
3. Repeat Until Convergence: Stop when the largest change in V(s) across all states falls below a small threshold.
4. Derive Optimal Policy: For each state, pick the action that maximizes the right-hand side of the update:
   π∗(s) = argmax_a [ R(s,a) + γ Σ_{s′} P(s′∣s,a) V∗(s′) ]
Key Characteristics of Value Iteration
• Focuses on finding V∗(s).
• The policy is derived only after V∗(s) converges.
• Simple, but may take many iterations to converge.
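A minimal sketch of Value Iteration on a tiny illustrative MDP (same dictionary layout as before; all numbers are hypothetical):

# Value Iteration on a tiny illustrative MDP: P[s][a] = {next_state: prob}, R[s][a] = reward.
P = {
    "s0": {"left": {"s0": 1.0},            "right": {"s1": 0.9, "s0": 0.1}},
    "s1": {"left": {"s0": 0.8, "s1": 0.2}, "right": {"s1": 1.0}},
}
R = {"s0": {"left": 0.0, "right": 1.0}, "s1": {"left": 0.0, "right": 2.0}}
gamma, theta = 0.9, 1e-6

V = {s: 0.0 for s in P}                      # 1. Initialization
while True:
    delta = 0.0
    for s in P:                              # 2. Bellman optimality update for each state
        best = max(
            R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
            for a in P[s]
        )
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:                        # 3. Stop once the values have converged
        break

# 4. Derive the optimal (greedy) policy from V*
policy = {
    s: max(P[s], key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
    for s in P
}
print(V, policy)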

Policy Iteration
Overview
Policy Iteration alternates between evaluating a policy and improving it.
Instead of directly optimizing V(s), it starts with a random policy and iteratively improves it until the optimal policy π∗ is found.
Steps of Policy Iteration
1. Policy Evaluation: Compute the value function Vπ(s) of the current policy π (by solving or repeatedly iterating the Bellman expectation equations).
2. Policy Improvement: Update the policy to act greedily with respect to Vπ:
   π(s) ← argmax_a [ R(s,a) + γ Σ_{s′} P(s′∣s,a) Vπ(s′) ]
3. Repeat Until Convergence: Alternate evaluation and improvement until the policy no longer changes.
Key Characteristics of Policy Iteration
• Alternates between evaluating and improving the policy.
• Converges in fewer iterations than Value Iteration.
• Requires solving a system of equations in the evaluation step, which can be computationally expensive.
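A corresponding sketch of Policy Iteration on the same illustrative MDP layout; for simplicity, the evaluation step here uses repeated sweeps instead of solving the linear system exactly:

# Policy Iteration on the same tiny illustrative MDP layout.
P = {
    "s0": {"left": {"s0": 1.0},            "right": {"s1": 0.9, "s0": 0.1}},
    "s1": {"left": {"s0": 0.8, "s1": 0.2}, "right": {"s1": 1.0}},
}
R = {"s0": {"left": 0.0, "right": 1.0}, "s1": {"left": 0.0, "right": 2.0}}
gamma, theta = 0.9, 1e-6

policy = {s: "left" for s in P}              # start from an arbitrary policy
V = {s: 0.0 for s in P}

while True:
    # 1. Policy Evaluation: compute V^pi by repeated expectation backups
    while True:
        delta = 0.0
        for s in P:
            a = policy[s]
            v_new = R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break

    # 2. Policy Improvement: act greedily with respect to V^pi
    stable = True
    for s in P:
        best_a = max(P[s], key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
        if best_a != policy[s]:
            policy[s] = best_a
            stable = False

    if stable:                               # 3. Stop when the policy no longer changes
        break

print(V, policy)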
Actor-Critic Model: Simplified Explanation
The Actor-Critic Model is a framework in Reinforcement Learning (RL)
that combines the advantages of two key components:
• The Actor: Decides which action to take based on a policy.
• The Critic: Evaluates the action by estimating the value function and provides feedback to the Actor.
This division helps stabilize training and improve efficiency in learning
optimal behaviors.

Components of the Actor-Critic Model


1. Actor (Policy Network)
• Role: Chooses actions based on the current policy.
• Output: A probability distribution over actions for each state.
• Objective: Update the policy to maximize cumulative rewards.
• Representation: π_θ(a∣s), where:
o π_θ: Policy parameterized by θ.
o a: Action chosen by the policy.
o s: Current state.
2. Critic (Value Network)
• Role: Evaluates how good the action taken by the Actor is by estimating the value function.
• Output: A scalar value representing the value of a state (V(s)) or an action (Q(s,a)).
• Objective: Minimize the error in value estimation using Temporal Difference (TD) learning.
• Representation: V_ϕ(s), where:
o V_ϕ: Value function parameterized by ϕ.
o s: Current state.

How Actor-Critic Works


1. Actor's Policy Update

• The Actor updates its policy based on feedback from the Critic.

• The policy gradient is computed as:
  ∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(a∣s) · A(s,a) ]
  Where:
o A(s,a): Advantage function (how much better action a is compared to the average action in state s).
2. Critic's Value Update
• The Critic updates its value estimates using the Bellman Equation.
• Value function loss:
  L(ϕ) = E[ ( R_t + γ V_ϕ(s_{t+1}) − V_ϕ(s_t) )² ]
  Where:
o R_t: Reward at time t.
o γ: Discount factor.
o V_ϕ(s_t): Estimated value of state s_t.

Algorithm Steps
1. Initialize Networks: Start with random weights for both the Actor and
Critic.
2. Collect Data:
o Use the Actor to select an action a_t in state s_t.
o Execute a_t, observe reward R_t and next state s_{t+1}.
3. Critic Update:
o Calculate the Temporal Difference (TD) error:
  δ_t = R_t + γ V_ϕ(s_{t+1}) − V_ϕ(s_t)
o Update the Critic's weights to minimize δ_t².
4. Actor Update:
o Use the TD error as the advantage: A(s_t, a_t) ≈ δ_t.
o Update the Actor's weights using the policy gradient.
5. Repeat: Continue until convergence or a stopping criterion is met.
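A heavily simplified, tabular sketch of this loop (lookup tables instead of neural networks) on a hypothetical five-state corridor environment; the environment, learning rates, and episode count are illustrative assumptions:

import math, random

# Hypothetical corridor: states 0..4, start in state 2, reward +1 for reaching state 4.
ACTIONS = [-1, +1]                                        # move left / move right

def env_step(state, action):
    nxt = max(0, min(4, state + action))
    return nxt, (1.0 if nxt == 4 else 0.0), nxt == 4       # next state, reward, done

theta = {(s, a): 0.0 for s in range(5) for a in ACTIONS}   # Actor: action preferences
V = [0.0] * 5                                              # Critic: state values
alpha_actor, alpha_critic, gamma = 0.1, 0.2, 0.95

def policy_probs(s):
    """Softmax over preferences: the Actor's distribution pi(a|s)."""
    exps = {a: math.exp(theta[(s, a)]) for a in ACTIONS}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

for episode in range(500):
    s, done = 2, False
    while not done:
        probs = policy_probs(s)
        a = random.choices(ACTIONS, weights=[probs[act] for act in ACTIONS])[0]
        s2, reward, done = env_step(s, a)

        # Critic: TD error and value update
        target = reward + (0.0 if done else gamma * V[s2])
        td_error = target - V[s]
        V[s] += alpha_critic * td_error

        # Actor: policy-gradient update, using the TD error as the advantage
        for a2 in ACTIONS:
            grad_log = (1.0 if a2 == a else 0.0) - probs[a2]
            theta[(s, a2)] += alpha_actor * td_error * grad_log
        s = s2

In practice the preference table and value table would be replaced by the parameterized networks π_θ and V_ϕ described above, but the structure of the updates is the same.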

Key Advantages of Actor-Critic


1. Lower Variance: The Critic reduces the high variance associated with
policy-based methods by providing a stable learning signal.
2. Continuous Action Spaces: Actor-Critic methods work well with continuous action spaces, unlike purely value-based methods like Q-learning.
3. Efficient Learning: The Actor learns directly from the Critic’s
feedback, making the updates more targeted.

Challenges
1. Instability: Training both the Actor and Critic simultaneously can lead
to instability if not carefully managed.
2. Hyperparameters: Choosing the right learning rates for the Actor and
Critic is crucial for convergence.
3. Bias-Variance Tradeoff: The Critic can introduce bias, while the
Actor's updates aim to reduce variance.

Variants of Actor-Critic
1. Deep Deterministic Policy Gradient (DDPG): Extends Actor-Critic
to handle continuous action spaces with deep neural networks.
2. Asynchronous Advantage Actor-Critic (A3C): Uses multiple
parallel agents to improve the efficiency of Actor-Critic training.
3. Proximal Policy Optimization (PPO): Stabilizes Actor updates by
limiting large policy changes.

Example: Simplified Actor-Critic Flow


1. Initial State: The agent starts in a grid world.
2. Actor Chooses Action:
o Actor suggests moving right.
o Action is taken, and a reward is observed.
3. Critic Evaluates:
o Critic estimates that moving right was slightly better than
average.
4. Update:
o Critic updates its value function.
o Actor adjusts its policy to favor moving right more in similar
situations.
