What is the main idea behind policy-based reinforcement learning methods?
To parameterize the policy directly and optimize it by maximizing the expected return via gradient ascent. These methods are especially useful when the action space is large or continuous.
Why are policy-based methods particularly suitable for stochastic policies?
Because they directly represent policies as probability distributions over actions.
In which type of problems are stochastic policies especially useful?
In settings that require explicit exploration, and in multi-agent systems where the Nash equilibrium may be a stochastic (mixed) strategy.
Why are policy-based methods well-suited for continuous action spaces?
Because they can output continuous probability distributions or parameters of continuous actions.
What is meant by a stationary policy? And by a non-stationary policy?
A stationary policy depends only on the current state, not explicitly on time.
A non-stationary policy may depend explicitly on the time step.
What is the difference between deterministic and stochastic policies?
Deterministic policies map states to a single action, while stochastic policies map states to a probability distribution over actions.
What distinguishes on-policy from off-policy learning algorithms?
On-policy methods learn from data generated strictly by the current policy, whereas off-policy methods can learn from any behavior policy and are more sample efficient.
Why are off-policy methods often more sample efficient?
Because they can reuse past experience, including data stored in replay buffers.
What is the objective optimized by policy gradient methods?
The expected cumulative discounted return under the policy.
What is the policy gradient update rule in general form?
w_{t+1} ← w_t + η ∇_w V^{π_w}(x₀), i.e., the policy weights are moved in the direction that increases the expected return.
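A minimal sketch of this update step (all names hypothetical; `grad_V` stands in for an estimate of ∇_w V^{π_w}(x₀)):

```python
# One gradient-ascent step on the policy weights w.
# eta is the learning rate; grad_V is a gradient estimate.
def gradient_ascent_step(w, grad_V, eta=0.01):
    return [wi + eta * gi for wi, gi in zip(w, grad_V)]

w = gradient_ascent_step([0.5, -0.2], [1.0, 2.0], eta=0.1)
# each weight moves a small step in the direction of the gradient
```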
What is the policy gradient theorem?
∇_w V^{π_w}(x₀) = E[∇_w log π_w(x, a) · Q^{π_w}(x, a)], where the expectation is taken over states visited under π_w and actions drawn from π_w.
It expresses the gradient of the expected return as an expectation over states and actions weighted by Q-values.
Why does the policy gradient not depend on the gradient of the state distribution or environment?
Because the policy gradient theorem removes the need to differentiate the state distribution or environment.
What is the likelihood ratio trick used for in policy gradients?
To rewrite the gradient of the policy as the policy times the gradient of the log-policy, ∇_w π_w(x, a) = π_w(x, a) ∇_w log π_w(x, a), which turns the policy gradient into an expectation under the policy.
Write the stochastic policy gradient expression.
E[∇w log π_w(x,a) · Q^{π_w}(x,a)].
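A numeric sanity check of this expression on a toy 2-armed bandit (assumed setup, not from the source): for π_w = softmax(w) and fixed Q-values, the score-function form summed over actions should match a finite-difference estimate of the gradient of the expected return.

```python
import math

def softmax(w):
    m = max(w)
    e = [math.exp(x - m) for x in w]
    s = sum(e)
    return [x / s for x in e]

def expected_return(w, Q):
    # E_pi[Q] for a one-step bandit
    return sum(pi * qi for pi, qi in zip(softmax(w), Q))

w, Q = [0.3, -0.1], [1.0, 2.0]
p = softmax(w)
# For softmax, d log pi(a) / d w0 = 1{a=0} - pi(0).
score_form = sum(p[a] * ((1.0 if a == 0 else 0.0) - p[0]) * Q[a]
                 for a in range(2))
eps = 1e-6
fd = (expected_return([w[0] + eps, w[1]], Q)
      - expected_return([w[0] - eps, w[1]], Q)) / (2 * eps)
# score_form and fd agree closely
```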
What are the two main steps in policy gradient algorithms?
Policy evaluation and policy improvement.
What is REINFORCE?
A policy gradient algorithm that estimates Q-values using Monte Carlo on-policy rollouts.
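A hedged sketch of REINFORCE on a 2-armed bandit (toy setup, all names assumed): sample an action from a softmax policy, observe a noisy Monte Carlo return G, and ascend η · ∇_w log π_w(a) · G.

```python
import math, random

random.seed(0)

def softmax(w):
    m = max(w)
    e = [math.exp(x - m) for x in w]
    s = sum(e)
    return [x / s for x in e]

def reinforce_bandit(true_q, eta=0.1, episodes=2000):
    w = [0.0, 0.0]
    for _ in range(episodes):
        p = softmax(w)
        a = 0 if random.random() < p[0] else 1
        G = true_q[a] + random.gauss(0.0, 0.1)  # noisy one-step return
        for b in range(2):
            # d log pi(a) / d w_b = 1{b=a} - pi(b) for softmax
            grad_log = (1.0 if b == a else 0.0) - p[b]
            w[b] += eta * grad_log * G
    return softmax(w)

final_p = reinforce_bandit([1.0, 2.0])
# the policy should concentrate on arm 1, which has the higher value
```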
What is the main advantage of REINFORCE (and of on-policy methods in general)?
It provides an unbiased estimator of the policy gradient.
What is the main drawback of REINFORCE?
High variance and the need for many on-policy rollouts.
How do actor–critic methods improve over REINFORCE?
By using a learned value function to reduce variance in gradient estimates.
What is the role of a baseline in policy gradient methods?
To reduce the variance of the gradient estimator without introducing bias, typically via the advantage A(x, a) = Q(x, a) − V(x). The baseline acts as a control variate for the gradient estimator, allowing a given performance to be reached with fewer updates.
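A toy illustration of the variance reduction (assumed setup): with a uniform two-action policy and fixed Q-values, compare the empirical variance of the per-sample score-function gradient with and without the baseline b = V(x).

```python
import random

random.seed(1)

p = [0.5, 0.5]                             # current policy probabilities
Q = [1.0, 2.0]                             # fixed action values
V = sum(pi * qi for pi, qi in zip(p, Q))   # state value, used as baseline

def grad_sample(use_baseline):
    # One Monte Carlo sample of the d/dw0 gradient component.
    a = 0 if random.random() < p[0] else 1
    weight = Q[a] - (V if use_baseline else 0.0)
    score = (1.0 if a == 0 else 0.0) - p[0]  # d log pi(a) / d w0
    return score * weight

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

n = 10000
var_plain = variance([grad_sample(False) for _ in range(n)])
var_base = variance([grad_sample(True) for _ in range(n)])
# both estimators share the same expectation, but var_base < var_plain
```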
What is entropy regularization in policy gradients?
A regularization term that encourages stochasticity in the policy to promote exploration.
Why is entropy regularization useful?
It prevents premature convergence to deterministic policies and improves exploration.
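A small sketch (assumed example): the entropy bonus H(π) = −Σ_a π(a) log π(a) added to the objective is largest for a uniform (maximally stochastic) policy, so maximizing it discourages premature collapse onto a single action.

```python
import math

def entropy(p):
    # Shannon entropy of a discrete distribution, in nats.
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
# entropy(uniform) = log 4; entropy(peaked) is much smaller
```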
How are action probabilities typically parameterized in discrete action spaces?
Using a softmax function over network outputs.
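A minimal softmax parameterization (names assumed): network outputs (logits) are mapped to a valid probability distribution over actions.

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# probs are positive, sum to 1, and preserve the ordering of the logits
```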
What is a key disadvantage of pure policy gradient methods compared to value-based methods?
They typically have higher variance and lower sample efficiency.