CH5_Function Approximation (1)
m = 65/50 = 13/10 = 1.3
b = 5.5/5 = 1.1
y = mx + b = 1.3x + 1.1
Function Approximation
Types of Function Approximation in RL
Nonlinear Function Approximation
• Uses models such as neural networks to approximate value functions or policies.
• Deep RL methods rely on nonlinear approximators:
• Deep Q-Networks (DQN) – use a deep network for Q-value approximation.
• Actor-Critic methods – use neural networks for both the policy and the value function (a minimal sketch of such an approximator follows below).
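To make the idea concrete, here is a minimal sketch of a nonlinear state-value approximator, assuming PyTorch; the state dimension and layer widths are illustrative assumptions, not prescribed by the slides.

```python
import torch
import torch.nn as nn

# Minimal sketch: a small MLP that maps a state vector to a scalar value estimate V(s).
# state_dim and the hidden sizes are illustrative assumptions.
state_dim = 4

value_net = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 1),                  # scalar value estimate V(s)
)

state = torch.randn(1, state_dim)      # dummy state for illustration
print(value_net(state).item())         # approximate V(s)
```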
Function Approximation - Challenges
Overfitting
• During training, an agent can effectively memorize specific states, which hinders its ability to generalize to new scenarios. This can be mitigated through careful design of the approximation model (for example, keeping it simple and regularizing it).
Exploration-Exploitation Tradeoff
• To find the best policies, agents must balance exploiting what they already know with exploring unfamiliar states and actions. Striking this balance requires carefully designed reward structures and exploration strategies, such as the ε-greedy rule sketched below.
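A minimal sketch of ε-greedy action selection, a common exploration strategy; the function name and the use of a plain list of Q-values are illustrative assumptions, not part of the slides.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon (explore),
    otherwise pick the action with the highest estimated Q-value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploit

# Example: Q-estimates for 3 actions in some state
print(epsilon_greedy([0.2, 0.8, 0.5], epsilon=0.1))
```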
Tile Coding
• Tile coding is used to shrink the feature vector, thereby improving computational efficiency.
• We cover more states with fewer features.
• With one-hot encoding, the indicated point (the middle cell of a 3×3 grid) is represented as (0,0,0,0,1,0,0,0,0).
• Instead, take four 2×2 boxes (tilings) and shift each one slightly.
• Now you cover 10 states with only 4 dimensions, or 4 inputs: the red box, green box, blue box, and purple box.
• The same middle point can now be represented as (1,1,1,1).
• This means you can generalize better. Before, gradient descent would only affect the parameters of the middle point; now, since a point is influenced by a combination of features, the parameters of all of those features are updated, which also allows for faster learning (a minimal coding sketch follows below).
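As a concrete illustration, here is a minimal tile-coding sketch in Python (numpy assumed). The grid size, offsets, and the wrap-around at the boundary are simplifying assumptions for illustration, not the exact scheme on the slide.

```python
import numpy as np

def tile_features(x, y, n_tilings=4, tiles_per_dim=2, tile_width=1.0):
    """Minimal tile-coding sketch for a 2-D point (x, y).

    Each tiling is a coarse grid of tiles_per_dim x tiles_per_dim tiles,
    offset by a fraction of the tile width. The point activates exactly
    one tile per tiling, so the binary feature vector has n_tilings ones.
    """
    features = []
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings              # shift each tiling slightly
        one_hot = np.zeros(tiles_per_dim * tiles_per_dim)
        col = int((x + offset) // tile_width) % tiles_per_dim
        row = int((y + offset) // tile_width) % tiles_per_dim
        one_hot[row * tiles_per_dim + col] = 1.0          # the single active tile
        features.append(one_hot)
    return np.concatenate(features)

# Example: a point activates one tile in each of the four tilings.
print(tile_features(0.9, 0.9))
```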
What is Deep Q-Learning?
• The state is given as the input, and the Q-values of all possible actions are generated as the output. The key difference from tabular Q-learning is that the Q-table is replaced by a neural network mapping each state to the Q-values of every action (a minimal network sketch follows below).
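A minimal sketch of such a Q-network, assuming PyTorch; the state dimension, number of actions, and layer widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2                # illustrative assumptions

# The network takes a state vector and outputs one Q-value per action.
q_net = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)

state = torch.randn(1, state_dim)          # dummy state
q_values = q_net(state)                    # Q(s, a) for every action a
greedy_action = q_values.argmax(dim=1)     # action with the highest Q-value
print(q_values, greedy_action)
```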
Deep Q-Learning
• Observe that in the update equation target = R(s,a,s′) + γ max_{a′} Q_k(s′,a′), the term γ max_{a′} Q_k(s′,a′) depends on the network's current estimates, so it changes as the parameters change.
• Therefore the target for the neural network is non-stationary, unlike typical deep learning problems, where the target is fixed.
• This problem is overcome by using two neural networks instead of one. One network (the primary, or online, network) has its parameters adjusted by training; the other (the target network) is used for computing the target. It has the same architecture as the first network, but its parameters are frozen.
• After every x iterations of training the primary network, its parameters are copied to the target network (a minimal sketch of this setup follows below).
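A minimal sketch of the two-network setup, assuming PyTorch; the sync interval, network sizes, and variable names are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99   # illustrative assumptions

q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)          # same architecture, frozen copy
for p in target_net.parameters():
    p.requires_grad = False

def td_target(reward, next_state, done):
    """Compute target = r + gamma * max_a' Q_target(s', a') with the frozen network."""
    with torch.no_grad():
        next_q = target_net(next_state).max(dim=1).values
    return reward + gamma * next_q * (1.0 - done)

def sync_target(step, x=1000):
    """Every x training steps, copy the primary network's parameters into the target network."""
    if step % x == 0:
        target_net.load_state_dict(q_net.state_dict())

# Example usage with dummy tensors:
r, s2, d = torch.tensor([1.0]), torch.randn(1, state_dim), torch.tensor([0.0])
print(td_target(r, s2, d))
```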
Challenges in Deep RL as Compared to Deep Learning
• So far, this all looks great: we have seen how neural networks can help the agent learn the best actions. However, there is a challenge when we compare deep RL to deep learning (DL): in supervised DL the targets are fixed labels, whereas in deep RL the target depends on the very network being trained, and the training data comes from the agent's own correlated interactions.
Challenges in Deep RL as Compared to Deep Learning
• The concepts we have learned so far combine to form the deep Q-learning algorithm, which was used to achieve human-level performance in Atari games (using just the video frames of the game).
Summary - Deep Q-Learning
• Deep Q-Learning is a type of reinforcement learning algorithm
that uses a deep neural network to approximate the Q-
function, which is used to determine the optimal action to take
in a given state. The Q-function represents the expected
cumulative reward of taking a certain action in a certain state
and following a certain policy. In Q-Learning, the Q-function is
updated iteratively as the agent interacts with the environment.
Deep Q-Learning is used in various applications such as game
playing, robotics and autonomous vehicles.
• Deep Q-Learning is a variant of Q-Learning that uses a deep neural network to represent the Q-function, rather than a simple table of values. This allows the algorithm to handle environments with a large number of states and actions, as well as to learn from high-dimensional inputs such as images.
Summary - Deep Q-Learning
• Experience replay is a technique where the agent stores a
subset of its experiences (state, action, reward, next state) in a
memory buffer and samples from this buffer to update the Q-
function. This helps to decorrelate the data and make the
learning process more stable.
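A minimal sketch of such a replay buffer in Python; the buffer capacity and batch size are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) tuples and samples
    random mini-batches, which decorrelates the training data."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        # Regroup into tuples of states, actions, rewards, next_states, dones.
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)

# Example usage with dummy transitions:
buf = ReplayBuffer()
for i in range(100):
    buf.push(state=i, action=0, reward=1.0, next_state=i + 1, done=False)
states, actions, rewards, next_states, dones = buf.sample(batch_size=4)
print(states, rewards)
```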
At each step:
1. The Actor picks an action a based on the current policy π(s).
2. The environment returns reward r and next state s′.
3. The Critic estimates the advantage, or TD error: δ = r + γV(s′) − V(s).
4. The Critic updates its value function V(s).
5. The Actor updates its policy to make actions with positive advantage δ more likely.
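A minimal one-step actor-critic update sketch, assuming PyTorch and a discrete action space; the network sizes, learning rates, and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99     # illustrative assumptions

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def pick_action(s):
    """Step 1: the Actor samples an action a from the current policy pi(.|s)."""
    probs = torch.softmax(actor(s), dim=1)
    return torch.multinomial(probs, 1)

def update(s, a, r, s_next, done):
    """Steps 3-5 for one transition (the environment supplies r and s_next in step 2)."""
    # Step 3: Critic estimates the TD error delta = r + gamma*V(s') - V(s).
    v_s = critic(s)
    with torch.no_grad():
        v_next = critic(s_next) * (1.0 - done)
    delta = r + gamma * v_next - v_s

    # Step 4: Critic update -- minimize the squared TD error.
    critic_opt.zero_grad()
    (delta ** 2).mean().backward()
    critic_opt.step()

    # Step 5: Actor update -- raise the log-probability of actions with positive delta.
    log_prob = torch.log(torch.softmax(actor(s), dim=1).gather(1, a))
    actor_opt.zero_grad()
    (-log_prob * delta.detach()).mean().backward()
    actor_opt.step()

# Example usage with dummy tensors standing in for one environment transition:
s, s_next = torch.randn(1, state_dim), torch.randn(1, state_dim)
a = pick_action(s)
update(s, a, torch.tensor([[1.0]]), s_next, torch.tensor([[0.0]]))
```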