What is the main idea behind policy-based reinforcement learning methods?
To parameterize the policy directly and optimize it by maximizing the expected return via gradient ascent. These methods are especially useful when the action space is large or continuous.
Why are policy-based methods particularly suitable for stochastic policies?
Because they directly represent policies as probability distributions over actions.
In which type of problems are stochastic policies especially useful?
In settings that require explicit exploration, and in multi-agent systems where the Nash equilibrium may be a stochastic (mixed) strategy.
Why are policy-based methods well-suited for continuous action spaces?
Because they can output continuous probability distributions or parameters of continuous actions.
What is meant by a stationary policy? And by a non-stationary policy?
A stationary policy depends only on the current state, not explicitly on time.
A non-stationary policy may depend explicitly on the time step.
What is the difference between deterministic and stochastic policies?
Deterministic policies map states to a single action, while stochastic policies map states to a probability distribution over actions.
What distinguishes on-policy from off-policy learning algorithms?
On-policy methods learn from data generated strictly by the current policy, whereas off-policy methods can learn from any behavior policy and are more sample efficient.
Why are off-policy methods often more sample efficient?
Because they can reuse past experience, including data stored in replay buffers.
What is the objective optimized by policy gradient methods?
The expected cumulative discounted return under the policy.
What is the policy gradient update rule in general form?
w_{t+1} ← w_t + η ∇_w V^{π_w}(x₀), i.e., the policy weights are moved in the direction that increases the expected return.
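A minimal sketch of this update step (all names hypothetical; `grad_V` stands in for an estimate of ∇_w V^{π_w}(x₀)):

```python
# One gradient-ascent step on the policy weights w.
# eta is the learning rate; grad_V is a gradient estimate.
def gradient_ascent_step(w, grad_V, eta=0.01):
    return [wi + eta * gi for wi, gi in zip(w, grad_V)]

w = gradient_ascent_step([0.5, -0.2], [1.0, 2.0], eta=0.1)
# each weight moves a small step in the direction of the gradient
```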
What is the policy gradient theorem?
∇_w V^{π_w}(x₀) = E[∇_w log π_w(x, a) · Q^{π_w}(x, a)], where the expectation is taken over states visited under π_w and actions drawn from π_w.
It expresses the gradient of the expected return as an expectation over states and actions weighted by Q-values.
Why does the policy gradient not depend on the gradient of the state distribution or environment?
Because the policy gradient theorem removes the need to differentiate the state distribution or environment.
What is the likelihood ratio trick used for in policy gradients?
To rewrite the gradient of the policy as the policy times the gradient of the log-policy, ∇_w π_w(x, a) = π_w(x, a) ∇_w log π_w(x, a), which turns the policy gradient into an expectation under the policy.
Write the stochastic policy gradient expression.
E[∇w log π_w(x,a) · Q^{π_w}(x,a)].
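A numeric sanity check of this expression on a toy 2-armed bandit (assumed setup, not from the source): for π_w = softmax(w) and fixed Q-values, the score-function form summed over actions should match a finite-difference estimate of the gradient of the expected return.

```python
import math

def softmax(w):
    m = max(w)
    e = [math.exp(x - m) for x in w]
    s = sum(e)
    return [x / s for x in e]

def expected_return(w, Q):
    # E_pi[Q] for a one-step bandit
    return sum(pi * qi for pi, qi in zip(softmax(w), Q))

w, Q = [0.3, -0.1], [1.0, 2.0]
p = softmax(w)
# For softmax, d log pi(a) / d w0 = 1{a=0} - pi(0).
score_form = sum(p[a] * ((1.0 if a == 0 else 0.0) - p[0]) * Q[a]
                 for a in range(2))
eps = 1e-6
fd = (expected_return([w[0] + eps, w[1]], Q)
      - expected_return([w[0] - eps, w[1]], Q)) / (2 * eps)
# score_form and fd agree closely
```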
What are the two main steps in policy gradient algorithms?
Policy evaluation and policy improvement.
What is REINFORCE?
A policy gradient algorithm that estimates Q-values using Monte Carlo on-policy rollouts.
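A hedged sketch of REINFORCE on a 2-armed bandit (toy setup, all names assumed): sample an action from a softmax policy, observe a noisy Monte Carlo return G, and ascend η · ∇_w log π_w(a) · G.

```python
import math, random

random.seed(0)

def softmax(w):
    m = max(w)
    e = [math.exp(x - m) for x in w]
    s = sum(e)
    return [x / s for x in e]

def reinforce_bandit(true_q, eta=0.1, episodes=2000):
    w = [0.0, 0.0]
    for _ in range(episodes):
        p = softmax(w)
        a = 0 if random.random() < p[0] else 1
        G = true_q[a] + random.gauss(0.0, 0.1)  # noisy one-step return
        for b in range(2):
            # d log pi(a) / d w_b = 1{b=a} - pi(b) for softmax
            grad_log = (1.0 if b == a else 0.0) - p[b]
            w[b] += eta * grad_log * G
    return softmax(w)

final_p = reinforce_bandit([1.0, 2.0])
# the policy should concentrate on arm 1, which has the higher value
```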
What is the main advantage of REINFORCE (and of on-policy methods in general)?
It provides an unbiased estimator of the policy gradient.
What is the main drawback of REINFORCE?
High variance and the need for many on-policy rollouts.
How do actor–critic methods improve over REINFORCE?
By using a learned value function to reduce variance in gradient estimates.
What is the role of a baseline in policy gradient methods?
To reduce the variance of the gradient estimator without introducing bias, typically via the advantage A(x, a) = Q(x, a) − V(x). The baseline acts as a control variate for the gradient estimator, allowing a given performance to be reached with fewer updates.
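A toy illustration of the variance reduction (assumed setup): with a uniform two-action policy and fixed Q-values, compare the empirical variance of the per-sample score-function gradient with and without the baseline b = V(x).

```python
import random

random.seed(1)

p = [0.5, 0.5]                             # current policy probabilities
Q = [1.0, 2.0]                             # fixed action values
V = sum(pi * qi for pi, qi in zip(p, Q))   # state value, used as baseline

def grad_sample(use_baseline):
    # One Monte Carlo sample of the d/dw0 gradient component.
    a = 0 if random.random() < p[0] else 1
    weight = Q[a] - (V if use_baseline else 0.0)
    score = (1.0 if a == 0 else 0.0) - p[0]  # d log pi(a) / d w0
    return score * weight

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

n = 10000
var_plain = variance([grad_sample(False) for _ in range(n)])
var_base = variance([grad_sample(True) for _ in range(n)])
# both estimators share the same expectation, but var_base < var_plain
```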
What is entropy regularization in policy gradients?
A regularization term that encourages stochasticity in the policy to promote exploration.
Why is entropy regularization useful?
It prevents premature convergence to deterministic policies and improves exploration.
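A small sketch (assumed example): the entropy bonus H(π) = −Σ_a π(a) log π(a) added to the objective is largest for a uniform (maximally stochastic) policy, so maximizing it discourages premature collapse onto a single action.

```python
import math

def entropy(p):
    # Shannon entropy of a discrete distribution, in nats.
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
# entropy(uniform) = log 4; entropy(peaked) is much smaller
```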
How are action probabilities typically parameterized in discrete action spaces?
Using a softmax function over network outputs.
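A minimal softmax parameterization (names assumed): network outputs (logits) are mapped to a valid probability distribution over actions.

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# probs are positive, sum to 1, and preserve the ordering of the logits
```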
What is a key disadvantage of pure policy gradient methods compared to value-based methods?
They typically have higher variance and lower sample efficiency.