22TCSE532: AML (Learning with TensorFlow)
Lecture 3.1
Introduction to Reinforcement Learning
Definition: Reinforcement Learning (RL) is a type of machine learning where an agent interacts with its environment by taking actions, receiving feedback in the form of rewards, and learning to make better decisions over time.
● Key Elements:
○ Agent: Learner or decision-maker.
○ Environment: Everything the agent interacts with.
○ State: Representation of the environment at a given time.
○ Action: Decision taken by the agent.
○ Reward: Feedback from the environment, indicating the success of an action.
Exploration vs. Exploitation: The agent needs to balance exploring new actions
(exploration) and leveraging known information to maximize rewards (exploitation).
Policy: Determines the best action to take in each state based on the value function.
● Deterministic Policy: Always selects the same action for a given state.
● Stochastic Policy: Selects actions based on a probability distribution over possible actions.
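A minimal sketch contrasting the two policy types and the exploration/exploitation balance described above. The Q-values here are random placeholders for a single state; the action count and epsilon value are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
q_values = rng.normal(size=n_actions)        # placeholder action-value estimates for one state

def greedy_policy(q):
    """Deterministic policy: always pick the highest-valued action (pure exploitation)."""
    return int(np.argmax(q))

def epsilon_greedy_policy(q, epsilon=0.1):
    """Stochastic policy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))      # exploration: random action
    return int(np.argmax(q))                  # exploitation: best known action

print("greedy action:", greedy_policy(q_values))
print("epsilon-greedy actions:", [epsilon_greedy_policy(q_values) for _ in range(10)])
```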
Value Function (V): Estimates the expected cumulative reward from a state under a policy.
Action-Value Function (Q): Estimates the expected reward from taking a specific action in a state:
Q*(s, a) = E[ r + γ · max_a′ Q*(s′, a′) | s, a ]
This equation represents the Q-value function in reinforcement learning: it expresses how much expected reward an agent can obtain by taking an action a in a state s and then continuing optimally afterward.
Value-based Methods for RL
Overview:
● Value-based methods focus on learning the value of different states or state-action pairs to help an agent make better decisions.
● The goal is to use this information to derive the best policy (a strategy for choosing actions).
Key Concepts:
1. Policy Evaluation:
○ What it means: You have a policy (a way of choosing actions) and you want to know how good it is.
○ How to do it: Estimate the value function (for states) or the Q-function (for state-action pairs).
○ Why it's important: It helps you know the expected rewards for following the policy, so you can measure if it's a good
strategy.
2. Policy Improvement:
○ What it means: Now that you have an estimate of how good different actions are, update your policy by choosing actions
that maximize value.
○ Why it's important: This helps the agent keep improving its decisions, eventually leading to the best strategy.
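Policy evaluation and policy improvement alternate in policy iteration. Below is a minimal sketch on a tiny, made-up two-state MDP; the transition probabilities, rewards, and discount factor are illustrative assumptions, not taken from the lecture:

```python
import numpy as np

# Toy MDP: 2 states, 2 actions. P[s, a, s'] are transition probabilities, R[s, a] immediate rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
n_states, n_actions = R.shape

policy = np.zeros(n_states, dtype=int)           # start from an arbitrary deterministic policy

for _ in range(50):                               # policy iteration loop
    # 1) Policy evaluation: iterate V(s) <- R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) * V(s')
    V = np.zeros(n_states)
    for _ in range(200):
        V = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ V
                      for s in range(n_states)])
    # 2) Policy improvement: act greedily with respect to the evaluated values
    Q = R + gamma * P @ V                         # Q[s, a] = R[s, a] + gamma * E[V(s')]
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):        # stop once the policy is stable
        break
    policy = new_policy

print("state values:", np.round(V, 2), "policy:", policy)
```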
Real-World Examples of Value-Based Methods
● Self-Driving Cars
Q-Learning
Q-learning is a value-based reinforcement learning algorithm that helps an agent learn the optimal Q-values (action-values) for each state-action pair, without needing to know the exact model of the environment (such as the probabilities of transitioning between states).
How it works:
● The agent interacts with the environment and learns the Q-values by updating them based on the rewards it
receives and the future rewards it expects.
● At each step, the agent updates the Q-value for a state-action pair using the Bellman equation for Q-values.
● It doesn't require a model of the environment (i.e., transition probabilities or reward structure).
● It updates Q-values step by step through trial and error, allowing the agent to improve its policy over time.
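A minimal sketch of this trial-and-error loop with tabular Q-learning. The five-cell "corridor" environment, the learning rate, discount factor, and epsilon are toy assumptions made up for illustration:

```python
import numpy as np

class Corridor:
    """Toy environment: 5 cells in a row; reward 1 for reaching the rightmost cell."""
    n_states, n_actions = 5, 2                   # actions: 0 = left, 1 = right

    def reset(self):
        self.s = 0
        return self.s

    def step(self, action):
        self.s = max(0, self.s - 1) if action == 0 else min(self.n_states - 1, self.s + 1)
        done = self.s == self.n_states - 1
        return self.s, (1.0 if done else 0.0), done

env = Corridor()
Q = np.zeros((env.n_states, env.n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.1
rng = np.random.default_rng(0)

for episode in range(500):
    s, done = env.reset(), False
    while not done:
        # epsilon-greedy action selection
        a = int(rng.integers(env.n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s_next, r, done = env.step(a)
        # Q-learning update (Bellman equation for Q-values):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        target = r + gamma * (0.0 if done else Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print(np.round(Q, 2))   # learned action-values: "right" should dominate in every state
```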
Examples of Q-Learning in Real Life:
1. Video Games:
○ How it works: In a video game, an AI player learns to maximize its score (reward) by exploring different
actions (e.g., attacking, defending, moving) in different game states.
○ Q-learning: The AI uses Q-learning to figure out which actions lead to higher scores based on its experiences
in the game. It updates its strategy by evaluating which actions result in the best outcomes over time.
2. Recommendation Systems:
○ How it works: Platforms like Netflix or Spotify use Q-learning to improve recommendation systems. When a
user watches a movie or listens to a song, the system recommends new content (action) and tracks how the
user responds.
○ Q-learning: The algorithm updates the Q-values of recommendations based on user engagement (clicks,
likes), learning which types of content keep users most engaged over time.
Fitted Q-Learning
Fitted Q-learning is an extension of the traditional Q-learning algorithm designed to handle large or continuous state
spaces by using function approximation techniques, such as regression models or other function approximators. It's
particularly useful in scenarios where the state space is too large to represent each state-action pair explicitly, making
traditional Q-learning impractical.
Function Approximation:
● Instead of maintaining a table of Q-values for each state-action pair, Fitted Q-learning uses a function approximator
(e.g., a regression model or neural network) to estimate the Q-values.
● This function Q(s,a;θ) is parameterized by θ, and it generalizes the Q-values across similar states and actions,
allowing the agent to handle large or continuous state-action spaces.
Batch Update:
● In Fitted Q-learning, the Q-values are updated in batches rather than one step at a time, as in traditional Q-learning.
● A batch of experiences is collected, and the Q-function is updated to fit all these experiences simultaneously. This can
lead to more stable learning, as the update is based on a larger sample of experiences.
Procedure for Fitted Q-Learning:
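In outline, fitted Q-learning repeats three steps: collect a batch of experiences, compute Bellman targets with the current approximator, and refit the approximator to those targets. Below is a minimal sketch of that loop, assuming a small tf.keras network as the function approximator and a synthetic batch standing in for collected experience; sizes and hyperparameters are illustrative:

```python
import numpy as np
import tensorflow as tf

# Hypothetical batch of stored experiences (placeholders for data an agent would collect):
n, state_dim, n_actions, gamma = 512, 4, 3, 0.99
rng = np.random.default_rng(0)
states      = rng.normal(size=(n, state_dim)).astype("float32")
actions     = rng.integers(0, n_actions, size=n)
rewards     = rng.normal(size=n).astype("float32")
next_states = rng.normal(size=(n, state_dim)).astype("float32")
dones       = rng.integers(0, 2, size=n).astype("float32")

# Function approximator Q(s, a; theta): a small neural network with one output per action.
q_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(state_dim,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(n_actions),
])
q_model.compile(optimizer="adam", loss="mse")

for iteration in range(5):                       # fitted Q iterations over the same batch
    # 1) Compute Bellman targets from the current approximator.
    max_next_q = q_model.predict(next_states, verbose=0).max(axis=1)
    targets = rewards + gamma * (1.0 - dones) * max_next_q
    # 2) Build regression labels: keep current predictions, overwrite the taken action's value.
    y = q_model.predict(states, verbose=0)
    y[np.arange(n), actions] = targets
    # 3) Refit the approximator to the whole batch at once (the "batch update").
    q_model.fit(states, y, epochs=3, batch_size=64, verbose=0)
```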
Advantages:
● Generalization: By using function approximators, Fitted Q-learning can generalize across states and actions, making it
possible to learn effectively in high-dimensional or continuous spaces.
● Stability: Batch updates using multiple experiences reduce variance and lead to more stable learning compared to
single-step updates.
Challenges:
● Function Approximation Error: Using a function approximator introduces approximation errors, which can destabilize
learning if not managed properly.
● Data Efficiency: Fitted Q-learning typically requires more data and computational resources than traditional Q-learning
due to the use of batch updates and function approximators.
Real-World Applications:
1. Robotics:
○ Robots often operate in high-dimensional or continuous state spaces, such as positions and velocities. Fitted
Q-learning can help them learn effective policies for complex tasks like navigation or manipulation without
needing an exhaustive state-action table.
2. Autonomous Vehicles:
○ In self-driving scenarios, the state space (e.g., road conditions, traffic) and action space (e.g., steering angles,
acceleration) are continuous. Fitted Q-learning allows for efficient policy learning in such environments.
3. Energy Management:
○ In smart grids, managing energy distribution involves continuous variables like power generation and
consumption rates. Fitted Q-learning can help optimize energy usage policies by approximating Q-values
across these continuous states.
Deep Q-Networks (DQN)
Deep Q-Networks (DQN) use neural networks to approximate the Q-values for high-dimensional and complex environments. The main innovations in DQN include:
○ Q-Network: A neural network that estimates the Q-values for each action in a given state.
○ Target Network: A copy of the Q-network used for calculating stable target values.
○ Replay Buffer: A data structure to store experiences (state, action, reward, next state).
Training Procedure:
1. Initialize: Create the Q-network, the target network, and the replay buffer.
2. Collect Experiences:
○ The agent interacts with the environment and stores experiences (s,a,r,s′) in the replay buffer.
○ This step is similar to the experience collection in Fitted Q-learning.
3. Sample Mini-Batch:
○ A mini-batch of experiences is sampled from the replay buffer; target values are computed with the target network and used to update the Q-network (a training-step sketch follows).
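A minimal TensorFlow sketch of the training step just outlined: a Q-network, a target network, a replay buffer, and one gradient update on sampled transitions. Network sizes, hyperparameters, and the synthetic transitions that fill the buffer are illustrative assumptions; the environment-interaction loop is omitted:

```python
import random
from collections import deque
import numpy as np
import tensorflow as tf

state_dim, n_actions, gamma = 4, 2, 0.99

def build_q_network():
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_actions),
    ])

q_net = build_q_network()                        # online Q-network
target_net = build_q_network()                   # target network for stable target values
target_net.set_weights(q_net.get_weights())
optimizer = tf.keras.optimizers.Adam(1e-3)
replay_buffer = deque(maxlen=10_000)             # stores (s, a, r, s', done)

def train_step(batch_size=32):
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s2, d = [np.array(x) for x in zip(*batch)]
    s, s2 = s.astype("float32"), s2.astype("float32")
    r, d = r.astype("float32"), d.astype("float32")
    # Target: r + gamma * max_a' Q_target(s', a'), with the future term zeroed at terminal states.
    targets = r + gamma * (1.0 - d) * tf.reduce_max(target_net(s2), axis=1)
    with tf.GradientTape() as tape:
        q = q_net(s)
        chosen_q = tf.reduce_sum(q * tf.one_hot(a, n_actions), axis=1)
        loss = tf.reduce_mean(tf.square(targets - chosen_q))
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return float(loss)

# Fill the buffer with placeholder transitions so the step can run end to end.
rng = np.random.default_rng(0)
for _ in range(200):
    replay_buffer.append((rng.normal(size=state_dim), int(rng.integers(n_actions)),
                          float(rng.normal()), rng.normal(size=state_dim), bool(rng.integers(2))))
print("loss:", train_step())
target_net.set_weights(q_net.get_weights())      # periodically sync the target network
```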
Real-World Applications:
Games:
● DQN was famously used by DeepMind to achieve human-level performance in many Atari games. The neural network could learn to play the games directly from pixel inputs and game scores.
Robotics:
● DQN can be used for robotic control tasks, where the robot learns to perform actions (like moving arms or navigating)
based on visual inputs.
Finance:
● DQN can be applied to trading algorithms where the state is represented by market indicators and the actions are
different trading decisions.
Double Deep Q-Network (Double DQN)
Motivation: Traditional DQN has a tendency to overestimate Q-values, which can lead to suboptimal policies, especially in
environments with noisy or complex rewards. Double DQN addresses this by decoupling the action selection from the action
evaluation, thus reducing this overestimation bias.
How Double DQN Works:
● The online Q-network selects the best next action, arg max_a′ Q(s′, a′; θ), while the target network evaluates that action, giving the target r + γ · Q(s′, arg max_a′ Q(s′, a′; θ); θ⁻).
● Because action selection and action evaluation use different parameters, the maximization bias of standard DQN is reduced (a sketch of the two targets follows).
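A minimal sketch comparing the standard DQN target with the Double DQN target. The networks are randomly initialized and the batch is synthetic, so the printed numbers are meaningless; the point is which network selects the next action and which evaluates it:

```python
import numpy as np
import tensorflow as tf

state_dim, n_actions, gamma = 4, 3, 0.99

def make_net():
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(n_actions),
    ])

online_net, target_net = make_net(), make_net()

rng = np.random.default_rng(0)
next_states = rng.normal(size=(8, state_dim)).astype("float32")
rewards = rng.normal(size=8).astype("float32")

# Standard DQN target: the target network both selects and evaluates the next action.
dqn_target = rewards + gamma * tf.reduce_max(target_net(next_states), axis=1)

# Double DQN target: the online network selects the action, the target network evaluates it.
best_actions = tf.argmax(online_net(next_states), axis=1)
double_q = tf.reduce_sum(target_net(next_states) * tf.one_hot(best_actions, n_actions), axis=1)
double_target = rewards + gamma * double_q

print(np.round(dqn_target.numpy(), 2))
print(np.round(double_target.numpy(), 2))
```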
Dueling Network Architecture
Why Transition from Double DQN to Dueling Network Architecture?
● Double DQN: As we discussed, Double DQN was developed to address the overestimation problem in traditional
DQN by decoupling the action selection and evaluation. This improved stability and performance, especially in
noisy environments.
● Limitation of Double DQN: While it effectively reduces overestimation, Double DQN still struggles in situations
where:
○ There are many actions with similar outcomes.
○ The Q-values do not vary much with respect to actions in certain states, making it hard to distinguish
which states are inherently valuable.
● In many real-world scenarios, like deciding between multiple minor actions in a video game (e.g., small
movements), the action choice might not drastically change the state’s value. It’s more crucial to know whether
being in a state is beneficial overall, regardless of the precise action.
Solution: The Dueling Network Architecture was introduced to separately learn the state value (how good is being
in a state overall) and the advantage of each action (how much better or worse each action is relative to others). This
separation allows for better generalization and more efficient learning, especially when many actions lead to similar
rewards.
In the Dueling Network Architecture, the neural network is divided into two separate streams after a shared set of
convolutional or dense layers:
1. Shared Layers:
○ A common set of convolutional or dense layers extracts features from the input state before the split.
2. Value Stream:
○ One stream outputs the state value function, V(s).
○ This stream estimates the value of being in a particular state, regardless of the action taken.
3. Advantage Stream:
○ The other stream outputs the advantage function, A(s,a).
○ This stream estimates the relative advantage of each action compared to other actions in the same state.
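The two streams are recombined as Q(s, a) = V(s) + A(s, a) - mean_a A(s, a); subtracting the mean advantage keeps V and A identifiable. A minimal TensorFlow sketch of this head, with illustrative layer sizes and a dense (non-convolutional) shared layer:

```python
import numpy as np
import tensorflow as tf

state_dim, n_actions = 4, 6

class DuelingQNetwork(tf.keras.Model):
    """Minimal dueling architecture: shared layers, then separate value and advantage streams."""
    def __init__(self):
        super().__init__()
        self.shared = tf.keras.layers.Dense(64, activation="relu")   # shared feature layers
        self.value = tf.keras.layers.Dense(1)                        # V(s) stream
        self.advantage = tf.keras.layers.Dense(n_actions)            # A(s, a) stream

    def call(self, states):
        h = self.shared(states)
        v = self.value(h)                        # shape (batch, 1)
        a = self.advantage(h)                    # shape (batch, n_actions)
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - tf.reduce_mean(a, axis=1, keepdims=True)

net = DuelingQNetwork()
q = net(np.random.default_rng(0).normal(size=(2, state_dim)).astype("float32"))
print(q.numpy().round(2))
```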
In complex environments where rewards are noisy or unpredictable, capturing this distribution is crucial. By using
Distributional DQN, we can leverage the strengths of Double DQN and Dueling Network Architecture while gaining
additional insights into the uncertainty around decisions. This transition shows how the field progresses from reducing bias (Double DQN) and improving state-action representation (Dueling Networks) to managing uncertainty and risk (Distributional DQN).
Distributional DQN
Purpose: Distributional DQN is an enhancement over the traditional DQN architecture. Instead of predicting
a single expected Q-value for each state-action pair, it predicts a full distribution over possible Q-values.
This allows the agent to capture the uncertainty and variability in the rewards, which is crucial in
environments where outcomes can be highly uncertain or noisy.
Key Concepts:
● Traditional DQN: Predicts a single Q-value for each state-action pair, representing the expected future reward.
● Distributional DQN: Instead of predicting one Q-value, it predicts a probability distribution over many possible future
rewards. This enables the agent to not only consider the expected reward but also the spread or variance in possible
outcomes.
● In Distributional DQN, the Bellman equation is adapted to work with distributions rather than scalar values. Instead of
computing the expectation of future rewards, the agent learns to approximate the entire distribution of possible rewards.
● This allows the network to provide richer information about the possible outcomes of an action.
Q-Value Distribution:
In a traditional DQN, the Q-value for a state-action pair, Q(s,a), is a single number representing the expected reward. In
Distributional DQN, the Q-value is represented as a distribution of possible rewards. For example, instead of saying
“the expected reward is 5,” the model might say, “there is a 30% chance of getting a reward of 4, a 50% chance of
getting 5, and a 20% chance of getting 6.”
Similar to traditional DQNs, the network begins with a shared set of layers to extract features from the input state. However,
instead of directly outputting a single Q-value, the final layer outputs multiple values representing the probability distribution of
possible rewards for each action.
● Shared Layers: The initial layers extract features from the state.
● Output Layer: Instead of a single Q-value, the output layer predicts a set of values that represent a distribution of
potential Q-values for each action.
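A minimal sketch of such a distributional output head, in the style of a categorical (C51-type) network. The number of atoms and the support bounds are illustrative assumptions, and the full distributional training update (categorical projection of the Bellman target) is omitted:

```python
import numpy as np
import tensorflow as tf

state_dim, n_actions = 4, 3
n_atoms, v_min, v_max = 51, -10.0, 10.0                # illustrative support for the return distribution
support = tf.linspace(v_min, v_max, n_atoms)           # fixed set of possible return values ("atoms")

class DistributionalQNetwork(tf.keras.Model):
    """Outputs a probability distribution over the support for every action."""
    def __init__(self):
        super().__init__()
        self.shared = tf.keras.layers.Dense(64, activation="relu")
        self.logits = tf.keras.layers.Dense(n_actions * n_atoms)

    def call(self, states):
        h = self.shared(states)
        logits = tf.reshape(self.logits(h), (-1, n_actions, n_atoms))
        return tf.nn.softmax(logits, axis=-1)           # P(return = atom | s, a)

net = DistributionalQNetwork()
probs = net(np.random.default_rng(0).normal(size=(2, state_dim)).astype("float32"))

# The expected Q-value is recovered by averaging the support under each action's distribution.
q_values = tf.reduce_sum(probs * support, axis=-1)      # shape (batch, n_actions)
print(q_values.numpy().round(2))
```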
Multi-step Learning
Purpose: The goal of Multi-step Learning is to improve the efficiency of learning by considering rewards collected
over multiple time steps. This method enables agents to better capture long-term dependencies and delayed
rewards, which are crucial for more informed decision-making in complex environments.
Key Concepts:
● One-step Return: Traditional Q-learning relies on a one-step return, where only the immediate reward (plus a bootstrapped estimate of the next state's value) is considered before updating the Q-value.
● Multi-step Return: In Multi-step Learning, the agent considers a sequence of rewards over multiple
steps. This gives a more accurate and long-term estimate of the total reward for a given action.
How Multi-step Learning Enhances Learning:
Rollout Over Multiple Steps: Instead of updating the Q-value after just one step, the agent collects rewards over
multiple steps, allowing it to consider a more extended sequence of actions and their resulting rewards.
Handling Delayed Rewards: Multi-step Learning is especially useful in environments where rewards are delayed. For
example, in games or robotics tasks, the agent might need to take a series of actions before receiving a reward, and
Multi-step Learning helps capture these long-term effects.
Better Reward Propagation: In tasks with delayed rewards, it ensures that future rewards are propagated more
quickly back to earlier states, allowing the agent to learn more efficiently.
Balances Short- and Long-term Returns: It helps the agent find a balance between immediate and long-term
returns, making it more capable of handling tasks where the optimal solution requires considering delayed
gratification.
Simplified Explanation:
● Traditional Q-learning: Looks only at the immediate reward and updates the Q-value based on
a single step, which can be inefficient when rewards are delayed.
● Multi-step Learning: Looks at multiple future rewards over a sequence of steps, allowing the
agent to better capture long-term dependencies and make more informed decisions.
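A minimal sketch of computing an n-step target: the first n observed rewards are discounted and summed, and a bootstrapped value estimate stands in for everything after step n. The reward numbers, discount factor, and bootstrap value below are illustrative:

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99, n=3):
    """Multi-step target: r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * bootstrap.

    `rewards` are the first n rewards observed after the action;
    `bootstrap_value` is the estimate max_a Q(s_{t+n}, a) standing in for the rest.
    """
    g = 0.0
    for k, r in enumerate(rewards[:n]):
        g += (gamma ** k) * r
    return g + (gamma ** n) * bootstrap_value

# Illustrative numbers: a delayed reward arrives only at the third step.
print(n_step_return([0.0, 0.0, 1.0], bootstrap_value=2.0, gamma=0.9, n=3))
# One-step target for comparison: only the immediate reward plus a one-step bootstrap.
print(n_step_return([0.0], bootstrap_value=2.0, gamma=0.9, n=1))
```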
Generalization in Reinforcement Learning (RL)
Feature Selection: Choosing which aspects of the state the agent learns from strongly affects how well it generalizes.
● Techniques:
○ Dimensionality Reduction: Using methods such as PCA (Principal Component Analysis) or autoencoders can
help reduce the complexity of the input space while retaining the most important information.
○ Domain Knowledge: Leveraging expert knowledge to select meaningful features can guide the agent toward
better generalization by focusing on the relevant parts of the environment.
● Impact: Proper feature selection helps the agent generalize better by ensuring that it learns from the most important
aspects of the environment, reducing the likelihood of overfitting.
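A minimal sketch of the dimensionality-reduction idea, assuming scikit-learn is available and using random data as a stand-in for high-dimensional state observations; the dimensions and component count are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical high-dimensional state observations (e.g., many noisy sensor readings).
rng = np.random.default_rng(0)
raw_states = rng.normal(size=(1000, 50)).astype("float32")

# Keep the components that explain most of the variance; the agent learns from these instead.
pca = PCA(n_components=8)
compact_states = pca.fit_transform(raw_states)

print(raw_states.shape, "->", compact_states.shape)
print("variance explained:", round(float(pca.explained_variance_ratio_.sum()), 3))
```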
Regularization
Purpose: Regularization techniques are used to prevent the agent from overfitting to the training data, which
can harm its generalization ability.
● Dropout: This technique randomly drops neurons in a neural network during training, preventing the
network from becoming overly reliant on specific features.
● Weight Regularization (L1 or L2): These techniques penalize large weights in the network,
encouraging the model to maintain smaller, more general weights that do not overfit to particular training
examples.
Impact: Regularization helps the agent generalize better by preventing it from learning highly specific patterns
in the training data that do not generalize well to unseen states.
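A minimal tf.keras sketch combining both techniques in a Q-network; the layer sizes, dropout rate, and L2 coefficient are illustrative assumptions:

```python
import tensorflow as tf

state_dim, n_actions = 8, 4
l2 = tf.keras.regularizers.l2(1e-4)       # weight regularization: penalizes large weights

q_net = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(state_dim,)),
    tf.keras.layers.Dense(64, activation="relu", kernel_regularizer=l2),
    tf.keras.layers.Dropout(0.2),         # randomly drops 20% of units during training
    tf.keras.layers.Dense(64, activation="relu", kernel_regularizer=l2),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(n_actions, kernel_regularizer=l2),
])
q_net.compile(optimizer="adam", loss="mse")
q_net.summary()
```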
Generalization Challenges in RL
Data Augmentation: Similar to computer vision, data augmentation techniques can introduce variations in
the environment or state representation to prevent overfitting to specific features.
Multi-task Learning: Training the agent to perform multiple tasks can improve generalization, as it
encourages the agent to learn more abstract representations that apply across different tasks and
environments.
Modifying the Objective Function in RL
Reinforcement Learning traditionally focuses on maximizing cumulative rewards, but this approach often needs to be
adapted for more complex, real-world problems. The objective function can be modified to meet additional constraints, such
as risk sensitivity, fairness, or robustness.
Key Considerations:
● Risk Sensitivity: Standard RL maximizes the expected reward but does not account for the variability in returns.
Risk-sensitive RL introduces penalties for high variance to make the agent risk-averse. This is essential in
applications like financial trading or resource management.
○ Approach: Add a term to the objective function to penalize high variance, or introduce utility functions such as the exponential utility function to account for risk (see the variance-penalty sketch after this list).
● Fairness: In applications that affect people, maximizing reward alone can produce outcomes that systematically favor some groups over others.
○ Approach: Incorporate fairness constraints in the reward structure or add regularization terms that account for biased outcomes.
○ Example: Penalize decisions that lead to unequal opportunities between demographic groups.
Robustness: In uncertain environments or adversarial conditions, an RL agent needs to adapt to possible model
misspecifications or environmental noise.
● Approach: Design robust MDPs where the objective function considers worst-case scenarios or
perturbations. Techniques like minimax optimization are used to deal with adversarial settings.
● Example: Robust RL minimizes the worst-case expected loss, ensuring stability even when the model is
slightly incorrect.
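Returning to the risk-sensitivity item above, here is a minimal sketch of a variance-penalized objective. The return samples and penalty weight are made up for illustration; the two policies have the same average return but very different variability:

```python
import numpy as np

def risk_sensitive_objective(returns, risk_weight=0.5):
    """Mean return minus a penalty on return variance: higher risk_weight -> more risk-averse."""
    returns = np.asarray(returns, dtype=float)
    return returns.mean() - risk_weight * returns.var()

# Two hypothetical policies with the same mean return (5.0) but different spread:
steady  = [5.0, 5.2, 4.8, 5.1, 4.9]
erratic = [0.0, 10.0, 9.5, 0.5, 5.0]
print("steady :", round(risk_sensitive_objective(steady), 3))    # close to 5: low variance
print("erratic:", round(risk_sensitive_objective(erratic), 3))   # heavily penalized variance
```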
Hierarchical Reinforcement Learning (HRL)
Hierarchical Reinforcement Learning aims to decompose complex tasks into smaller, more manageable sub-tasks by
introducing multiple levels of abstraction. This allows agents to operate more efficiently in environments with long-term
dependencies or multi-stage problems.
Key Concepts:
● Hierarchical Task Decomposition: In HRL, the overall goal is broken down into a hierarchy of subgoals or
subtasks. These subtasks can either be defined manually or learned autonomously by the agent.
○ Example: In a robot navigation task, the high-level task might be "reach the destination," while the sub-tasks
could be "move to the door" or "avoid obstacles."
● Options Framework: A popular framework for HRL is the options framework, where an agent learns policies over a
set of temporally extended actions (options). Each option has a termination condition, and the agent chooses when
to switch between them.
○ Options: Higher-level actions that encapsulate a sequence of low-level actions. For example, "grabbing a
cup" might involve several motor actions but can be treated as a single option.
MAXQ Decomposition: This method decomposes the value function into a hierarchy of value functions
corresponding to different subgoals. It breaks down a complex RL task into a hierarchy where each subtask
has its own reward and transition model.
● Advantages: Increases the efficiency of learning by solving smaller tasks and generalizing solutions
to new problems.
Meta-Controller: In HRL, the agent often has a "meta-controller" that decides which subtask or option to
focus on. The meta-controller operates at a higher level, making decisions based on more abstract states or
goals.
● Example: In a robot control scenario, the meta-controller decides between different high-level
strategies (e.g., "walk" vs. "jump") depending on the environment.
Bias-Overfitting Tradeoff in RL
The bias-overfitting tradeoff is a core concept in machine learning, including RL, where the agent needs to generalize
well without being over-optimistic (bias) or overly tailored to specific experiences (overfitting).
Key Concepts:
● Bias: Bias occurs when the model makes overly simplistic assumptions, resulting in systematic errors. In RL,
high bias can manifest when an agent is too conservative or has limited representation power, leading to
suboptimal policies.
○ Example: A model that assumes linear value functions may fail to capture complex reward structures,
leading to poor generalization in complex environments.
● Overfitting: Overfitting occurs when the model learns patterns that are specific to the training data but do not
generalize well to new situations. In RL, this can happen when the agent learns specific nuances of a particular
environment that do not hold in general.
○ Example: An agent trained to play a specific level of a game may overfit to that level’s structure and fail
to adapt to new levels.
Tradeoff: There is a balance between bias and overfitting:
● High bias often results in underfitting, where the model is too simple to capture the true environment dynamics.
● High variance (overfitting) leads to poor performance on unseen tasks, as the model is too finely tuned to specific
experiences.
Techniques to Manage the Tradeoff:
● Feature Selection: Limiting the number of features the agent uses, which helps it generalize better.
● Regularization: Add penalties for complex models (e.g., L2 regularization) to reduce overfitting.
● Cross-validation: Techniques like cross-validation help estimate how well the learned policy generalizes to new
data.
Generalization in RL: Ideally, an RL agent should generalize well to unseen states or environments. Strategies like domain
randomization, where the agent is exposed to varied environments during training, help improve robustness and prevent
overfitting.
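A minimal sketch of domain randomization: new environment parameters are sampled for every training episode so the agent never overfits to one setting. The parameter names, ranges, and the make_env factory are illustrative assumptions, not a real API:

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_env_params():
    """Sample new environment parameters each episode (names and ranges are illustrative)."""
    return {
        "friction": float(rng.uniform(0.5, 1.5)),
        "sensor_noise": float(rng.uniform(0.0, 0.1)),
        "traffic_density": int(rng.integers(5, 50)),
    }

for episode in range(3):
    params = randomized_env_params()
    # env = make_env(**params)   # hypothetical environment factory; the training loop would go here
    print(f"episode {episode}: {params}")
```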
Practical Example:
In environments like autonomous driving, overfitting to a
specific city’s road layout could cause the agent to perform
poorly in a different city. Introducing bias through regularization
(e.g., domain randomization) helps create a more
generalizable agent capable of driving in varied scenarios.