Policy Approximation
In policy gradient methods, the policy can be parameterized in a variety of ways, as long as it
is differentiable with respect to its parameters. This allows the policy to be updated with
gradient-based optimization techniques.
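As a minimal sketch of such an update, assume a single state, three discrete actions, and one
preference parameter per action (these setup details are illustrative assumptions, not from the
text): the policy is a soft-max over the preferences, and a REINFORCE-style step moves the
parameters toward actions that earned higher return.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)   # policy parameters: one preference per action (assumed setup)
alpha = 0.1           # step size

def pi(theta):
    """Action probabilities from preferences via soft-max."""
    z = np.exp(theta - theta.max())   # shift by max for numerical stability
    return z / z.sum()

def grad_log_pi(theta, a):
    """Gradient of log pi(a) w.r.t. theta for the soft-max parameterization."""
    g = -pi(theta)
    g[a] += 1.0
    return g

# One REINFORCE-style update: sample an action, observe a return G,
# and move theta in the direction that makes high-return actions more likely.
a = rng.choice(len(theta), p=pi(theta))
G = 1.0 if a == 2 else 0.0   # toy reward signal: action 2 is best (invented for illustration)
theta += alpha * G * grad_log_pi(theta, a)
```

Repeating this sampled update over many episodes gradually raises the preference, and hence
the probability, of the high-return action.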
The goal is to learn a policy that maximizes long-term reward. Policy approximation
refers to approximating the policy (the action-selection rule) with a parameterized function
that can be adjusted and improved over time.
A common choice for discrete action spaces is the soft-max in action preferences:

π(a | s, θ) = exp(h(s, a, θ)) / Σ_b exp(h(s, b, θ)),

where h(s, a, θ) is a numerical preference for taking action a in state s. This soft-max
function ensures that actions with higher preferences have higher probabilities of being
chosen, while all probabilities sum to 1.
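The following sketch (with made-up preference values) shows both properties just stated:
higher preferences receive higher probabilities, and the probabilities sum to 1.

```python
import numpy as np

# Illustrative preferences h(s, a) for three actions in some state s (invented values).
h = np.array([2.0, 1.0, -0.5])

probs = np.exp(h - h.max()) / np.exp(h - h.max()).sum()
print(probs)        # approx. [0.690 0.254 0.057] -> higher preference, higher probability
print(probs.sum())  # 1.0
```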
Parameterizing the policy in this way has several advantages:
1. Approaching Deterministic Policies: A soft-max over action preferences can drive the
probability of the best action arbitrarily close to 1, approaching a deterministic policy.
2. Representing Stochastic Policies: The same parameterization can assign actions arbitrary
probabilities, which matters when the best available policy is itself stochastic.
3. Easier to Approximate in Some Problems: Policies may be easier to model than action-
value functions in certain environments.
4. Injecting Prior Knowledge: Parameterizing the policy can incorporate domain knowledge
into the learning process (see the sketch after this list).
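As a hypothetical illustration of point 4, assuming the soft-max-over-preferences
parameterization above, domain knowledge can be injected by biasing the initial preferences
so that learning starts from an informed action distribution rather than a uniform one (the
specific values below are invented for illustration).

```python
import numpy as np

# Hypothetical prior knowledge: action 0 is usually safe, action 2 usually harmful.
# Encode that belief in the initial preferences before any learning takes place.
h_init = np.array([1.5, 0.0, -2.0])   # assumed prior preference values

probs = np.exp(h_init - h_init.max()) / np.exp(h_init - h_init.max()).sum()
print(probs)  # learning starts from this informed distribution, not a uniform one
```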
Key Takeaways
• Policy parameterization via soft-max yields stochastic policies that can approach
deterministic behavior as the action preferences grow far apart.
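One way to see this takeaway concretely: scaling the preferences (a temperature-style knob,
assumed here purely for illustration) moves the soft-max from a nearly uniform, stochastic
policy toward a nearly one-hot, effectively deterministic one.

```python
import numpy as np

def softmax(h):
    z = np.exp(h - h.max())
    return z / z.sum()

h = np.array([2.0, 1.0, -0.5])
print(softmax(0.1 * h))   # small preferences -> close to uniform (stochastic)
print(softmax(10.0 * h))  # large preferences -> nearly one-hot (near-deterministic)
```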