
Policy Approximation and Policy Gradient Theorem

Policy Approximation
In policy gradient methods, the policy can be parameterized in a variety of ways, as long as π(a|s,θ) is differentiable with respect to its parameters θ. This allows the policy to be improved directly with gradient-based optimization techniques.

The goal is to learn a good policy that maximizes long-term rewards. Policy approximation
refers to approximating the policy (the action-selection rule) using a function that can be
easily adjusted and improved over time.

Common Parameterization of Policies


A common method for discrete action spaces is to use action preferences for each state-
action pair, denoted as h(s,a,θ), where θ represents the policy parameters. These
preferences are converted into probabilities using a soft-max function:

π(a|s,θ) = e^h(s,a,θ) / Σ_b e^h(s,b,θ)

This soft-max function ensures that actions with higher preferences have higher
probabilities of being chosen, while all probabilities sum to 1.
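
As a concrete illustration, here is a minimal Python sketch of a soft-max policy built on linear action preferences h(s,a,θ) = θ^T x(s,a). The feature matrix x and the numbers used below are assumptions for illustration only, not part of the original text.

    import numpy as np

    def action_preferences(theta, state_action_features):
        # h(s, a, theta) = theta^T x(s, a) for every action a in the given state;
        # state_action_features has shape (num_actions, num_features).
        return state_action_features @ theta

    def softmax_policy(theta, state_action_features):
        # pi(a|s,theta) = e^h(s,a,theta) / sum_b e^h(s,b,theta)
        h = action_preferences(theta, state_action_features)
        h = h - h.max()              # subtract the max for numerical stability
        exp_h = np.exp(h)
        return exp_h / exp_h.sum()

    # Tiny example: one state, two actions, made-up features and parameters.
    theta = np.array([0.5, -0.2])
    x = np.array([[1.0, 0.0],        # features x(s, left)
                  [0.0, 1.0]])       # features x(s, right)
    probs = softmax_policy(theta, x)
    print(probs, probs.sum())        # higher preference -> higher probability; sums to 1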

Advantages of Using Soft-Max Policy Approximation


1. Approaching Deterministic Policies: With a soft-max parameterization, the action preferences can grow without bound, so the policy can approach a deterministic policy over time (unlike ε-greedy action selection, which always retains an ε probability of acting randomly).

2. Selection of Actions with Arbitrary Probabilities: Stochastic policies can be naturally learned, which is especially useful in games like Poker.

3. Easier to Approximate in Some Problems: Policies may be easier to model than action-
value functions in certain environments.

4. Injecting Prior Knowledge: Parameterizing the policy can incorporate domain knowledge
into the learning process.

Example 13.1: Stochastic Policy in a Simple Corridor Gridworld


In this example (the short-corridor gridworld of Sutton and Barto), the agent must navigate a corridor with two actions, move left or move right, and in one of the states the effect of the two actions is reversed. Because the states look identical to the agent, the best achievable policy is stochastic, choosing right with some intermediate probability. Action-value methods with ε-greedy selection cannot represent such a policy, while policy gradient methods can learn a stochastic policy that maximizes the expected reward.
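
The original example does not spell out the corridor's dynamics, so the Monte Carlo sketch below assumes the commonly used short-corridor layout (three non-terminal states, the action effects reversed in the middle state, a reward of -1 per step, and the goal at the right end). Under these assumed dynamics it estimates the expected return of a policy that moves right with a fixed probability p_right, showing that an intermediate probability does better than a near-deterministic one.

    import random

    def run_episode(p_right, max_steps=1000):
        # Assumed dynamics: states 0, 1, 2 are non-terminal, state 3 is the goal,
        # reward is -1 per step, and in state 1 the action effects are reversed.
        state, ret = 0, 0
        for _ in range(max_steps):
            go_right = random.random() < p_right
            if state == 1:
                go_right = not go_right          # the "reversed" state
            state = min(state + 1, 3) if go_right else max(state - 1, 0)
            ret -= 1
            if state == 3:
                break
        return ret

    def estimate_return(p_right, episodes=5000):
        return sum(run_episode(p_right) for _ in range(episodes)) / episodes

    for p in (0.05, 0.5, 0.95):                  # near-left, balanced, near-right
        print(p, round(estimate_return(p), 2))   # the intermediate value wins

Under these assumptions the best probability of moving right lies strictly between 0 and 1, which is exactly the kind of stochastic policy that ε-greedy action selection cannot settle on but a soft-max policy can.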

Key Takeaways
• Policy parameterization via soft-max allows for creating deterministic or stochastic
policies.

• Policy-based methods excel when stochastic behavior is optimal.


• In problems where the policy is a simpler function to approximate than the action-value function, policy-based methods can learn faster.

• Prior knowledge about optimal policies can improve learning efficiency.

The Policy Gradient Theorem


The Policy Gradient Theorem supports optimizing a policy by giving an expression for the gradient of a performance measure (the expected return) with respect to the policy parameters. This is especially useful in high-dimensional problems.

Reinforcement Learning Setup


In RL, an agent interacts with the environment and learns to take actions based on
observations. The policy π(a|s,θ) defines action probabilities, and the goal is to maximize
the expected return.
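
The sketch below illustrates this interaction loop. The TwoStateEnv environment and its reward scheme are purely hypothetical placeholders (they are not from the original text); the point is only the shape of the loop, in which each step samples an action from the policy's probabilities π(·|s,θ).

    import random

    class TwoStateEnv:
        # A hypothetical dummy environment with a gym-like reset/step interface.
        def reset(self):
            self.t = 0
            return 0                                   # a single, constant state
        def step(self, action):
            self.t += 1
            reward = 1.0 if action == 1 else 0.0       # made-up reward scheme
            done = self.t >= 10                        # fixed episode length
            return 0, reward, done

    def collect_episode(env, policy, theta):
        # Roll out one episode, sampling each action from pi(.|state, theta).
        episode = []
        state, done = env.reset(), False
        while not done:
            probs = policy(theta, state)               # action probabilities
            action = random.choices(range(len(probs)), weights=probs)[0]
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        return episode

    # Usage with a fixed 50/50 policy, just to exercise the loop.
    episode = collect_episode(TwoStateEnv(), lambda theta, s: [0.5, 0.5], theta=None)
    print(len(episode), sum(r for _, _, r in episode))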

Objective of Policy Gradient Methods


To improve the policy πθ, we calculate the gradient of the performance measure J(θ) with respect to θ and update the parameters in the direction of that gradient (gradient ascent).
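
In the episodic case, a common way to make this precise (following Sutton and Barto) is to define the performance measure as the value of the start state under the parameterized policy,

J(θ) = v_πθ(s0) = E_πθ[G0] = E_πθ[R1 + γR2 + γ²R3 + ...],

where G0 is the return from the start state s0 and γ is the discount factor. The parameters are then adjusted by stochastic gradient ascent:

θ_{t+1} = θ_t + α ∇θ J(θ_t),

where α is a step-size parameter.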

Deriving the Policy Gradient Theorem


The theorem gives an exact expression for the gradient ∇θJ(θ) in terms of the policy's action probabilities and the corresponding action values, without requiring the gradient of the state distribution, which depends on the unknown environment dynamics.
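
For reference, the theorem is usually stated as

∇θ J(θ) ∝ Σ_s μ(s) Σ_a q_π(s,a) ∇θ π(a|s,θ),

where μ(s) is the on-policy state distribution and q_π(s,a) is the action-value function. Sampling this expression leads to the REINFORCE update θ ← θ + α γ^t G_t ∇θ ln π(A_t|S_t,θ). The Python sketch below applies that update to the linear soft-max policy from the earlier sketch; the episode format (a list of (features, action, reward) tuples) is an assumption made here for illustration.

    import numpy as np

    def grad_log_softmax_policy(theta, state_action_features, action):
        # For a linear soft-max policy the eligibility vector is
        # grad_theta ln pi(a|s,theta) = x(s,a) - sum_b pi(b|s,theta) x(s,b).
        h = state_action_features @ theta
        probs = np.exp(h - h.max())
        probs /= probs.sum()
        return state_action_features[action] - probs @ state_action_features

    def reinforce_update(theta, episode, alpha=0.1, gamma=1.0):
        # episode: list of (state_action_features, action, reward) tuples.
        # Compute the return G_t following each step, then apply
        # theta <- theta + alpha * gamma^t * G_t * grad ln pi(A_t|S_t, theta).
        G, returns = 0.0, []
        for _, _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()
        for t, ((features, action, _), G_t) in enumerate(zip(episode, returns)):
            theta = theta + alpha * (gamma ** t) * G_t * \
                    grad_log_softmax_policy(theta, features, action)
        return theta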
