Reinforcement Learning
Key Concepts of RL
1. Components of RL
The main components are the agent, the environment, states, actions,
rewards, the policy, and the value function; each is described in detail below.
2. Goal of RL
The goal is to learn a policy that maximizes the cumulative reward over
time, considering both immediate and future rewards.
Example:
Training a robot to walk, where the agent is the robot and the environment is
the terrain it interacts with.
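To make the idea of cumulative reward concrete, here is a minimal sketch in Python; the reward sequence and the discount value of 0.9 are made-up numbers for illustration (the discount factor itself is introduced formally in the MDP section below).

```python
# Minimal sketch: computing a discounted cumulative reward (return) from a
# sequence of rewards. The reward values and gamma = 0.9 are illustrative
# assumptions, not taken from a specific problem.
rewards = [1.0, 0.0, 0.5, 1.0]   # rewards received at successive time steps
gamma = 0.9                      # discount: future rewards count a bit less

cumulative = sum(gamma ** t * r for t, r in enumerate(rewards))
print(cumulative)  # 1.0 + 0.9*0.0 + 0.81*0.5 + 0.729*1.0 = 2.134
```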
1. Agent:
The decision-maker or learner in the system.
2. Environment:
Everything outside the agent that it interacts with. The environment
provides feedback to the agent.
3. State (S):
A representation of the current situation or condition of the agent in
the environment.
4. Action (A):
The possible moves or decisions the agent can make in a given state.
5. Reward (R):
Feedback given by the environment to the agent for performing an
action in a state.
6. Policy (π):
The strategy the agent uses to decide which action to take in each state.
7. Value Function (V):
An estimate of the cumulative reward the agent can expect starting from a
state. For example, over time a dog being trained comes to understand the
value of following a command.
Acronym: Use "SPARQ" for State, Policy, Action, Reward, Q-value to remember key terms.
Visualization: Draw a flowchart showing the RL loop (State → Action →
Reward → Update Policy).
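The loop in that flowchart can also be sketched in code. The snippet below is a minimal, self-contained illustration, not a specific library API: a toy corridor environment, a small table of Q-values (the "Q" in SPARQ), and a loop that observes a state, picks an action, receives a reward, and updates its estimates.

```python
import random

# Minimal sketch of the RL loop: State -> Action -> Reward -> Update Policy.
# The corridor environment and the tabular update rule are illustrative
# assumptions, not part of any particular RL library.

class CorridorEnvironment:
    """A toy 1-D corridor: states 0..4, reward +1 for reaching state 4."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):                      # action: -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

env = CorridorEnvironment()
q_table = {(s, a): 0.0 for s in range(5) for a in (-1, +1)}   # value estimates
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for episode in range(200):
    state = env.reset()
    for _ in range(100):                          # cap the episode length
        # State -> Action: explore sometimes, otherwise follow current estimates
        if random.random() < epsilon:
            action = random.choice((-1, +1))
        else:
            action = max((-1, +1), key=lambda a: q_table[(state, a)])
        # Action -> Reward (and next state)
        next_state, reward, done = env.step(action)
        # Reward -> Update Policy (here, a simple tabular value update)
        best_next = max(q_table[(next_state, a)] for a in (-1, +1))
        q_table[(state, action)] += alpha * (reward + gamma * best_next
                                             - q_table[(state, action)])
        state = next_state
        if done:
            break
```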
Markov Decision Process (MDP)
A Markov Decision Process (MDP) is a mathematical framework used to
model decision-making problems where outcomes are partly random and
partly under the control of a decision-maker (agent). MDPs are widely used in
Reinforcement Learning (RL) to formalize the environment and guide the
agent’s learning process.
This rests on the Markov property: the future state depends only on the
current state and the action taken, not on the history of past states.
Components of an MDP
An MDP is defined by five elements: (S,A,P,R,γ).
1. States (S)
The set of all possible situations the agent can be in.
Example:
In a grid-based game, the states might be all grid positions the agent
can occupy.
2. Actions (A)
The set of all possible actions the agent can take in each state.
Example:
In the grid-based game, actions might be move up, down, left, or
right.
3. Transition Probability (P)
The probability of moving from one state to another given a specific action.
Formally: P(s′∣s,a)
o s: Current state.
o a: Action taken.
o s′: Next state.
4. Reward Function (R)
The immediate feedback the agent receives after transitioning from one
state to another by taking an action.
Formally: R(s,a,s′)
o R: Scalar value representing reward.
5. Discount Factor (γ)
Determines how much the agent values future rewards compared to
immediate rewards.
0≤γ≤1
o γ=0: Only immediate rewards matter.
o γ=1: Future rewards are valued equally to immediate rewards.
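To make the five elements tangible, the tuple (S, A, P, R, γ) of a tiny two-state MDP can be written out directly as data. The states, probabilities, and rewards below are made-up values for illustration only.

```python
# Illustrative encoding of a tiny MDP (S, A, P, R, gamma).
# The two-state world and all of its numbers are assumptions for the example.

S = ["left", "right"]                  # states
A = ["stay", "move"]                   # actions
gamma = 0.9                            # discount factor

# Transition probabilities: P[(s, a)] = {next_state: probability}
P = {
    ("left", "stay"):  {"left": 1.0},
    ("left", "move"):  {"right": 0.8, "left": 0.2},   # moving sometimes fails
    ("right", "stay"): {"right": 1.0},
    ("right", "move"): {"left": 0.8, "right": 0.2},
}

# Reward function R(s, a, s'): reaching "right" by moving is rewarded
R = {("left", "move", "right"): 1.0}

def reward(s, a, s_next):
    return R.get((s, a, s_next), 0.0)
```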
Applications of MDP
1. Robotics: Navigating a robot in uncertain environments.
2. Gaming: Designing AI for board games or video games.
3. Finance: Making optimal investment decisions.
4. Healthcare: Optimizing treatment plans over time.
Bellman Equations:
The Bellman Equations are mathematical formulas used in
Reinforcement Learning (RL) and Markov Decision Processes (MDPs)
to describe the relationship between the value of a state and the values of
subsequent states. They serve as the foundation for finding optimal policies
in decision-making problems.
In its optimality form, the Bellman equation for the state-value function is:
V(s) = max_a [ R(s,a) + γ Σ_s′ P(s′∣s,a) V(s′) ]
Breaking It Down:
s: Current state.
a: Action taken in the current state.
s′: Next state reached from s after taking a.
R(s,a): Immediate reward for taking action a in state s.
γ: Discount factor (how much future rewards matter compared to
immediate ones).
P(s′∣s,a): Probability of transitioning to s′ from s after taking action a.
V(s′): Value of the next state.
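Plugging the toy two-state MDP from the previous section into this equation gives a quick worked example of a single backup for V("left"); the current value estimates used below are made-up starting points.

```python
# One Bellman backup for V("left") in the toy MDP above, assuming the current
# (made-up) estimates V("left") = 0.0 and V("right") = 1.0.
gamma = 0.9
V = {"left": 0.0, "right": 1.0}

# Each action's backed-up value: sum over s' of P(s'|s,a) * (R(s,a,s') + gamma*V(s'))
stay = 1.0 * (0.0 + gamma * V["left"])                                   # stays put, no reward
move = 0.8 * (1.0 + gamma * V["right"]) + 0.2 * (0.0 + gamma * V["left"])

V["left"] = max(stay, move)      # take the best action's value
print(V["left"])                 # max(0.0, 0.8 * 1.9) = 1.52
```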
The corresponding Bellman equation for the action-value (Q) function is:
Q(s,a) = R(s,a) + γ Σ_s′ P(s′∣s,a) max_a′ Q(s′,a′)
Breaking It Down:
Q(s,a): The quality of taking action a in state s.
R(s,a): Immediate reward for the action.
γ: Discount factor.
P(s′∣s,a): Probability of transitioning to s′.
max_a′ Q(s′,a′): The best value achievable from the next state s′, taken over
all possible next actions a′.
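The Q-form can be sketched in the same way: each Q(s,a) bootstraps on the best action available in the next state. The current Q estimates below are again made-up starting values for the toy MDP.

```python
# One backup for Q("left", "move") in the toy MDP above, assuming the
# (made-up) current estimates below.
gamma = 0.9
Q = {("left", "stay"): 0.0, ("left", "move"): 0.0,
     ("right", "stay"): 1.0, ("right", "move"): 1.0}

best_right = max(Q[("right", a)] for a in ("stay", "move"))   # max_a' Q(s', a')
best_left = max(Q[("left", a)] for a in ("stay", "move"))
Q[("left", "move")] = (0.8 * (1.0 + gamma * best_right)       # move succeeds
                       + 0.2 * (0.0 + gamma * best_left))     # move fails, stay "left"
print(Q[("left", "move")])   # 0.8 * 1.9 + 0.2 * 0.0 = 1.52
```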
Policy Iteration
Overview
Policy Iteration alternates between evaluating a policy and improving it.
Instead of directly optimizing V(s), it starts with a random policy and
iteratively improves it until the optimal policy π* is found.
Steps of Policy Iteration
1. Policy Evaluation: compute the value function V^π(s) of the current policy π.
2. Policy Improvement: update the policy to act greedily with respect to the
computed values.
3. Repeat Until Convergence: stop when the policy no longer changes; the
resulting policy is the optimal policy π* (see the sketch below).
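A compact sketch of these three steps on the illustrative two-state MDP from the MDP section is shown below; as a simplification, the evaluation step uses repeated sweeps instead of solving the linear system exactly.

```python
# Minimal policy iteration sketch on the toy two-state MDP (made-up numbers).
S = ["left", "right"]
A = ["stay", "move"]
gamma = 0.9
P = {("left", "stay"):  {"left": 1.0},
     ("left", "move"):  {"right": 0.8, "left": 0.2},
     ("right", "stay"): {"right": 1.0},
     ("right", "move"): {"left": 0.8, "right": 0.2}}
R = {("left", "move", "right"): 1.0}

def q_value(s, a, V):
    # Expected one-step return of taking action a in state s under values V
    return sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
               for s2, p in P[(s, a)].items())

policy = {s: "stay" for s in S}                  # start from an arbitrary policy
while True:
    # 1. Policy Evaluation: estimate V for the current policy (iterative sweeps)
    V = {s: 0.0 for s in S}
    for _ in range(100):
        V = {s: q_value(s, policy[s], V) for s in S}
    # 2. Policy Improvement: act greedily with respect to the estimated values
    new_policy = {s: max(A, key=lambda a: q_value(s, a, V)) for s in S}
    # 3. Repeat Until Convergence: stop when the policy no longer changes
    if new_policy == policy:
        break
    policy = new_policy

print(policy)   # {'left': 'move', 'right': 'move'} for these toy numbers
```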
Key Characteristics of Policy Iteration
Alternates between evaluating and improving the policy.
Converges faster in terms of iterations compared to Value Iteration.
Requires solving a system of equations in the evaluation step, which
can be computationally expensive.
Actor-Critic Model: Simplified Explanation
The Actor-Critic Model is a framework in Reinforcement Learning (RL)
that combines the advantages of two key components:
The Actor: Decides which action to take based on a policy.
The Critic: Evaluates the action by estimating the value function and
provides feedback to the Actor.
This division helps stabilize training and improve efficiency in learning
optimal behaviors.
Algorithm Steps
1. Initialize Networks: Start with random weights for both the Actor and
Critic.
2. Collect Data:
o Use the Actor to select an action a_t in state s_t.
o Execute a_t, observe the reward R_t and the next state s_{t+1}.
3. Critic Update:
o Calculate the Temporal Difference (TD) error:
δ_t = R_t + γV_ϕ(s_{t+1}) − V_ϕ(s_t)
o Update the Critic's weights to minimize δ_t².
4. Actor Update:
o Use the TD error as the advantage: A(s_t, a_t) ≈ δ_t.
o Update the Actor's weights using the policy gradient.
5. Repeat: Continue until convergence or a stopping criterion is met.
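A minimal sketch of this loop, using small lookup tables in place of the Actor and Critic networks (a simplification for illustration; real implementations use function approximators, and the toy corridor below is a made-up environment):

```python
import math, random

# Minimal tabular Actor-Critic sketch on a toy corridor: states 0..4,
# reward +1 for reaching state 4. Tables stand in for the Actor and Critic
# networks, which is a simplification for illustration.
states, actions = range(5), (-1, +1)
actor = {(s, a): 0.0 for s in states for a in actions}    # action preferences (policy)
critic = {s: 0.0 for s in states}                         # state-value estimates V(s)
alpha_actor, alpha_critic, gamma = 0.1, 0.2, 0.9

def action_probs(s):
    # Softmax (Gibbs) policy over the Actor's preferences
    prefs = [math.exp(actor[(s, a)]) for a in actions]
    total = sum(prefs)
    return {a: p / total for a, p in zip(actions, prefs)}

for episode in range(300):
    s = 0
    for _ in range(100):                                  # cap the episode length
        probs = action_probs(s)
        a = random.choices(actions, weights=[probs[b] for b in actions])[0]
        s_next = max(0, min(4, s + a))
        reward = 1.0 if s_next == 4 else 0.0
        done = s_next == 4

        # Critic update: TD error, then move V(s) toward the TD target
        td_error = reward + gamma * (0.0 if done else critic[s_next]) - critic[s]
        critic[s] += alpha_critic * td_error

        # Actor update: policy-gradient step with the TD error as the advantage
        for b in actions:
            grad_log = (1.0 if b == a else 0.0) - probs[b]
            actor[(s, b)] += alpha_actor * td_error * grad_log

        s = s_next
        if done:
            break
```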
Challenges
1. Instability: Training both the Actor and Critic simultaneously can lead
to instability if not carefully managed.
2. Hyperparameters: Choosing the right learning rates for the Actor and
Critic is crucial for convergence.
3. Bias-Variance Tradeoff: The Critic's bootstrapped value estimates
introduce bias, but they reduce the variance of the Actor's policy-gradient
updates.
Variants of Actor-Critic
1. Deep Deterministic Policy Gradient (DDPG): Extends Actor-Critic
to handle continuous action spaces with deep neural networks.
2. Asynchronous Advantage Actor-Critic (A3C): Uses multiple
parallel agents to improve the efficiency of Actor-Critic training.
3. Proximal Policy Optimization (PPO): Stabilizes Actor updates by
limiting large policy changes.