
Function Approximation

Dr. D. John Pradeep


Associate Professor
VIT-AP University
Function Approximation
Need for Function Approximation
• In many real-life learning situations, maintaining explicit
representations of value functions or policies is impractical, due
to the large and continuous state and action spaces.
• The curse of dimensionality poses a significant challenge,
leading to increased memory and processing demands.
• Function approximation addresses this problem by enabling
agents to make informed decisions in unfamiliar conditions by
generalizing previously learned information.
Function Approximation
Types of Function Approximation in RL
Linear Function Approximation: Uses a weighted sum of features,

V̂(s, w) = wᵀφ(s) = Σᵢ wᵢ φᵢ(s)

where w is a vector of weights and φ(s) is a feature vector for state s.
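As a rough illustration, here is a minimal NumPy sketch of a linear value estimate with a semi-gradient TD(0) update; the feature function, state range, and step sizes are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def phi(state, n_features=8):
    """Illustrative feature map for a scalar state in [0, 1]: radial basis features."""
    centers = np.linspace(0.0, 1.0, n_features)
    return np.exp(-((state - centers) ** 2) / 0.05)

w = np.zeros(8)  # weight vector

def v_hat(state):
    """Linear value estimate: V_hat(s, w) = w . phi(s)."""
    return w @ phi(state)

def td0_update(s, r, s_next, alpha=0.1, gamma=0.99):
    """Semi-gradient TD(0) update for one observed transition (s, r, s_next)."""
    global w
    td_error = r + gamma * v_hat(s_next) - v_hat(s)
    w += alpha * td_error * phi(s)  # gradient of w . phi(s) with respect to w is phi(s)
```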


Function approximation - RL

• Instead of storing V/Q values explicitly, update the parameters θ so that they satisfy the approximation relations

V̂(s; θ) ≈ Vπ(s) and Q̂(s, a; θ) ≈ Qπ(s, a)

Function approximation - RL
Policy Improvement
Least Squares Method
• The least-squares method is a statistical method used to find the line of best fit, of the form y = mx + b, for a given set of data. The fitted line is called the regression line.
• The main objective of this method is to minimize the sum of the squared errors, which is why it is called the least-squares method.
Procedure
• Let us assume that the given points of data are (x1, y1), (x2, y2),
(x3, y3), …, (xn, yn) in which all x’s are independent variables,
while all y’s are dependent ones.
• This method fits a line of the form y = mx + b, where y and x are variables, m is the slope, and b is the y-intercept.
• The formulas for the slope m and the intercept b are:
m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
b = (∑y − m∑x) / n
Here, n is the number of data points.
LSM - Example

m = 65/50 = 1.3
b = 5.5/5 = 1.1

y = mx + b = 1.3x + 1.1
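For illustration, a short Python sketch of the slope and intercept formulas above, applied to made-up data points (not the data from the slide):

```python
import numpy as np

# Hypothetical data points (x_i, y_i) for illustration; not the data from the slide.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
n = len(x)

# Closed-form least-squares slope and intercept, using the formulas above.
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
b = (np.sum(y) - m * np.sum(x)) / n

print(f"y = {m:.2f}x + {b:.2f}")  # the fitted regression line
```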
Function Approximation
Types of Function Approximation in RL
Nonlinear Function Approximation
• Uses models like neural networks to approximate value
functions or policies.
• Deep RL methods rely on nonlinear approximators:
• Deep Q-Networks (DQN) – Uses deep learning for Q-value
approximation.
• Actor-Critic Methods – Uses neural networks for both policy and
value function approximation.
Function Approximation - Challenges
Overfitting
• During training, agents may effectively memorize particular states, which hinders their ability to generalize to new scenarios. This can be addressed by careful construction of the approximation model.
Exploration-Exploitation Tradeoff
• To find the best policies, agents must balance exploiting information that is already known against exploring unfamiliar states and actions. Achieving this balance requires carefully designed reward structures and exploration strategies.
Tile Coding
• Used to shrink the feature vector and thereby improve computational efficiency: we cover more states with fewer features.
• With one-hot encoding, the indicated point is represented as (0,0,0,0,1,0,0,0,0).
• Instead, take four 2x2 boxes and shift each of them slightly.
• Now we cover 10 states with only 4 dimensions, or 4 inputs: the red box, green box, blue box, and purple box.
• The same middle point can now be represented as (1,1,1,1).
• This generalizes better. Before, gradient descent would update only the parameters of the middle point; now, since a point is influenced by a combination of features, the parameters of all of those features are updated, which also allows faster learning. A code sketch of this idea follows below.
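The sketch below generalizes the example slightly: instead of four single shifted boxes, each tiling is a small shifted grid, and a point activates exactly one tile per tiling. The grid sizes and offsets are illustrative assumptions.

```python
import numpy as np

def tile_features(state, n_tilings=4, tiles_per_dim=2):
    """Return one active tile index per tiling for a 2-D state in [0, 1)^2.

    Each tiling is a coarse tiles_per_dim x tiles_per_dim grid, shifted by a
    small offset, so a point activates one tile in every tiling: a few-hot
    code instead of a one-hot code over every fine-grained cell."""
    state = np.asarray(state, dtype=float)
    active = []
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_dim)  # shift each tiling a little
        coords = np.floor((state + offset) * tiles_per_dim).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim - 1)
        # flatten this tiling's (row, col) into a single feature index
        active.append(t * tiles_per_dim ** 2 + coords[0] * tiles_per_dim + coords[1])
    return active

print(tile_features([0.5, 0.5]))  # one active feature index per tiling
```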
What is Deep Q-Learning?

• Deep Q-Learning is a reinforcement learning technique that combines Q-Learning, an algorithm for learning optimal actions in an environment, with deep neural networks.
• It aims to enable agents to learn optimal actions in complex, high-dimensional environments.
• By using a neural network to approximate the Q-function, which estimates the expected cumulative reward for each action in a given state, Deep Q-Learning can handle environments with large state spaces.
What is Deep Q-Learning?
• The network is updated iteratively through episodes, using a combination of exploration and exploitation strategies.
• However, care must be taken to mitigate instability caused by non-stationarity and divergence issues, typically addressed by experience replay and target networks.
• Deep Q-Learning has proven effective in training agents for various tasks, including video games and robotic control.
Why ‘Deep’ Q-Learning?
• Q-learning is a simple yet quite powerful algorithm to create a cheat sheet for
our agent. This helps the agent figure out exactly which action to perform.
• But what if this cheat sheet is too long? Imagine an environment with 10,000
states and 1,000 actions per state. This would create a table of 10 million
cells. Things will quickly get out of control!
• It is pretty clear that we can’t infer the Q-value of new states from already
explored states. This presents two problems:
• First, the amount of memory required to save and update that table would
increase as the number of states increases
• Second, the amount of time required to explore each state to create the
required Q-table would be unrealistic
• Here’s a thought – what if we approximate these Q-values with machine
learning models such as a neural network? Well, this was the idea behind
DeepMind’s algorithm that led to its acquisition by Google for 500 million
dollars!
Deep Q-Networks

• In deep Q-learning, we use a neural network to approximate the Q-value function.
• The state is given as the input and the Q-values of all possible actions are generated as the output. The comparison between Q-learning and deep Q-learning is illustrated below:
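A minimal PyTorch sketch of such a network, assuming a 4-dimensional state and 2 discrete actions; the layer sizes are illustrative, not from the slides.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""

    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one output per action
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
state = torch.rand(1, 4)                       # a dummy state
q_values = q_net(state)                        # shape (1, n_actions)
greedy_action = q_values.argmax(dim=1).item()  # action with the largest Q-value
```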
Deep Q-Learning
• Observe that in the update equation

target = R(s, a, s′) + γ max_{a′} Q_k(s′, a′)

the term γ max_{a′} Q_k(s′, a′) is itself a variable quantity.
• Therefore, in this process the target for the neural network is variable, unlike typical deep learning problems where the target is stationary.
• This problem is overcome by having two neural networks instead of one. One network is used to adjust the parameters, and the other, which has the same architecture as the first but with frozen parameters, is used to compute the target.
• After every x iterations of the primary network, its parameters are copied to the target network.
Deep Q-Learning
Challenges in Deep RL as Compared to Deep Learning
• So far, this all looks great. We understood how neural networks can
help the agent learn the best actions. However, there is a challenge
when we compare deep RL to deep learning (DL):
Challenges in Deep RL as Compared to Deep Learning
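The code shown on the original slide is not reproduced here; the sketch below illustrates the same point under assumed names (a q_net like the one sketched earlier and tensors for a sampled batch): the regression target is computed from the very network being trained, so it shifts after every gradient step.

```python
import torch

def td_target(rewards, next_states, dones, q_net, gamma=0.99):
    """Q-learning target: r + gamma * max_a' Q(s', a').

    The target is computed with q_net itself, so every gradient step that
    changes q_net also changes the target used at the next step."""
    with torch.no_grad():
        next_q = q_net(next_states).max(dim=1).values  # max over actions
    return rewards + gamma * (1.0 - dones) * next_q    # no bootstrapping past terminal states
```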

• As you can see in the code sketch above, the target is continuously changing with each iteration. In deep learning, the target variable does not change and hence the training is stable, which is just not true for RL.
• To summarise, in reinforcement learning we depend on the policy or value function to sample actions, but these keep changing as we continuously learn what to explore. As we play out the game, we learn more about the ground-truth values of states and actions, so the outputs change as well.
• So, we are trying to learn a mapping whose inputs and outputs are constantly changing. What, then, is the solution?
1. Target Network
• Since the same network is calculating the predicted value and
the target value, there could be a lot of divergence between
these two. So, instead of using one neural network for learning,
we can use two.

• We could use a separate network to estimate the target. This target network has the same architecture as the function approximator but with frozen parameters. For every C iterations (a hyperparameter), the parameters from the prediction network are copied to the target network. This leads to more stable training because it keeps the target function fixed (for a while):
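A minimal sketch of this scheme; the hard copy every C steps follows the description above, while the network definition and variable names are illustrative assumptions.

```python
import copy
import torch.nn as nn

def make_q_net(state_dim=4, n_actions=2):
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net = make_q_net()               # prediction network, trained at every step
target_net = copy.deepcopy(q_net)  # same architecture, parameters kept frozen
target_net.requires_grad_(False)

C = 1_000                          # copy interval (a hyperparameter)

def maybe_sync_target(step):
    """Every C training steps, copy the prediction weights into the target network."""
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())

# The TD target is then computed with target_net instead of the prediction network:
#   target = r + gamma * target_net(s_next).max(dim=1).values
```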
2. Experience Replay

• To perform experience replay, we store the agent's experiences as (state, action, reward, next state) tuples in a replay buffer and sample random mini-batches from it to train the network, rather than learning only from the most recent transition.
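A minimal sketch of such a replay buffer, assuming experiences are stored as (state, action, reward, next state, done) tuples; the capacity and batch size are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of past transitions, sampled uniformly at random."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are discarded first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)  # random sampling breaks temporal correlation
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```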
Putting it all Together

• The concepts we have learned so far? They all combine to make the deep Q-learning algorithm that was used to achieve human-level performance in Atari games (using just the video frames of the game).
Summary - Deep Q-Learning
• Deep Q-Learning is a type of reinforcement learning algorithm
that uses a deep neural network to approximate the Q-
function, which is used to determine the optimal action to take
in a given state. The Q-function represents the expected
cumulative reward of taking a certain action in a certain state
and following a certain policy. In Q-Learning, the Q-function is
updated iteratively as the agent interacts with the environment.
Deep Q-Learning is used in various applications such as game
playing, robotics and autonomous vehicles.
• Deep Q-Learning is a variant of Q-Learning that uses a deep neural network to represent the Q-function, rather than a simple table of values. This allows the algorithm to handle environments with a large number of states and actions, as well as to learn from high-dimensional inputs such as images or raw sensor data.
Summary - Deep Q-Learning
• Experience replay is a technique where the agent stores a subset of its experiences (state, action, reward, next state) in a memory buffer and samples from this buffer to update the Q-function. This helps to decorrelate the data and make the learning process more stable.
• Target networks, on the other hand, are used to stabilize the Q-function updates. In this technique, a separate network is used to compute the target Q-values, which are then used to update the Q-function network.
• Deep Q-Learning has been applied to a wide range of problems, including game playing, robotics, and autonomous vehicles.
Fitted Q-iteration
• Instead of learning a value function V(s), FQI learns an approximation of the optimal action-value function Q*(s, a) from a batch of data.
• It is off-policy, model-free, and sample-efficient, making it ideal
for scenarios where data collection is expensive.
Fitted Q-iteration
Intuition:
At each iteration,
• Use your current Q-function estimate to compute targets for
each sample in your batch.
• Train a function approximator to map from state-action pairs
(s,a) to those target values.
• Iterate until convergence (a code sketch of this loop follows below).
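This is a minimal sketch of that loop, using a scikit-learn tree ensemble as the function approximator (in the spirit of tree-based FQI); the batch layout, regressor choice, and hyperparameters are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(batch, n_actions, n_iters=50, gamma=0.99):
    """batch: list of (s, a, r, s_next) tuples, with s, s_next as 1-D arrays and a an integer."""
    states = np.array([s for s, a, r, s2 in batch])
    actions = np.array([a for s, a, r, s2 in batch])
    rewards = np.array([r for s, a, r, s2 in batch])
    next_states = np.array([s2 for s, a, r, s2 in batch])

    X = np.column_stack([states, actions])  # regress Q on (s, a) pairs
    q_model = None
    for _ in range(n_iters):
        if q_model is None:
            targets = rewards  # first iteration: Q_0 is taken as zero, so the target is just r
        else:
            # max over actions of the current Q estimate at the next state
            next_q = np.column_stack([
                q_model.predict(np.column_stack([next_states, np.full(len(batch), a)]))
                for a in range(n_actions)
            ])
            targets = rewards + gamma * next_q.max(axis=1)
        q_model = ExtraTreesRegressor(n_estimators=50).fit(X, targets)  # refit on the new targets
    return q_model
```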
Actor – Critic Methods
• Actor-Critic methods are a class of reinforcement learning (RL) algorithms that combine both value-based and policy-based approaches. They are widely used and effective, especially in continuous action spaces or environments with large state spaces.
Actor – Critic Methods
How it works?

At each step:
1. The Actor picks an action a based on the current policy π(s).
2. The environment returns reward r and next state s′.
3. The Critic estimates the advantage or TD error: δ = r + γV(s′) − V(s)
4. The Critic updates its value function V(s).
5. The Actor updates its policy to improve actions with positive advantage δ.

(A code sketch of one such update step follows.)
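The sketch below is a minimal PyTorch illustration of steps 3 to 5 for a single transition (s, a, r, s′, done); the network architectures, learning rates, and squared-error critic loss are illustrative assumptions, not from the slides.

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def actor_critic_step(s, a, r, s_next, done, gamma=0.99):
    """One update for a single transition; steps 1-2 (acting in the environment) happen outside."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    with torch.no_grad():
        v_next = torch.zeros(1) if done else critic(s_next)
        td_target = r + gamma * v_next
    v_s = critic(s)
    delta = (td_target - v_s).detach()  # 3. TD error / advantage
    # 4. Critic update: move V(s) toward the TD target
    critic_loss = (td_target - v_s).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    # 5. Actor update: raise the log-probability of the taken action, weighted by delta
    log_prob = torch.log_softmax(actor(s), dim=-1)[a]
    actor_loss = -(delta * log_prob).sum()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```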
