Policy Gradient Flashcards

(31 cards)

1
Q

What is the main idea behind policy-based reinforcement learning methods?

A

To directly parameterize and optimize the policy by maximizing the expected return using gradient ascent. Useful when the action space is large or continuous.

2
Q

Why are policy-based methods particularly suitable for stochastic policies?

A

Because they directly represent policies as probability distributions over actions.

3
Q

In which type of problems are stochastic policies especially useful?

A

In explicit exploration settings and multi-agent systems where Nash equilibria are stochastic.

4
Q

Why are policy-based methods well-suited for continuous action spaces?

A

Because they can directly output parameters of continuous probability distributions over actions (e.g. the mean and variance of a Gaussian).

5
Q

What is meant by a stationary policy, and by a non-stationary policy?

A

A stationary policy depends only on the current state and not explicitly on time.

A non-stationary policy depends explicitly on the time step.

6
Q

What is the difference between deterministic and stochastic policies?

A

Deterministic policies map states to a single action, while stochastic policies map states to a probability distribution over actions.

7
Q

What distinguishes on-policy from off-policy learning algorithms?

A

On-policy methods learn from data generated strictly by the current policy, whereas off-policy methods can learn from any behavior policy and are more sample efficient.

8
Q

Why are off-policy methods often more sample efficient?

A

Because they can reuse past experience, including data stored in replay buffers.

9
Q

What is the objective optimized by policy gradient methods?

A

The expected cumulative discounted return under the policy.

10
Q

What is the policy gradient update rule in general form?

A

w_{t+1} ← w_t + η ∇_w V^{π_w}(x₀). The new policy weights move in the direction that increases the expected return.
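The update rule can be sketched numerically. This is a minimal illustration, not the full algorithm: `gradient_ascent_step`, `grad_V`, and all values are assumptions, with `grad_V` standing in for some estimate of ∇_w V^{π_w}(x₀).

```python
# Hedged sketch of one gradient-ascent step on the policy weights,
# w_{t+1} = w_t + eta * grad_V; names and values are illustrative.
def gradient_ascent_step(w, grad_V, eta=0.1):
    # Move each weight in the direction of the gradient estimate
    return [wi + eta * gi for wi, gi in zip(w, grad_V)]

w_next = gradient_ascent_step([0.5, -0.2], [1.0, 2.0])
# w_next is approximately [0.6, 0.0]
```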

11
Q

What is the policy gradient theorem?

A

∇_w V^{π_w}(x₀) = Σ_x μ^{π_w}(x) Σ_a ∇_w π_w(x,a) Q^{π_w}(x,a), where μ^{π_w} is the discounted state-visitation distribution.

It expresses the gradient of the expected return as an expectation over states and actions weighted by Q-values.

12
Q

Why does the policy gradient not depend on the gradient of the state distribution or environment?

A

Because the policy gradient theorem expresses the gradient entirely in terms of ∇_w π_w and Q-values, so the state distribution and the environment dynamics never need to be differentiated.

13
Q

What is the likelihood ratio trick used for in policy gradients?

A

To rewrite the gradient of the policy as the policy times the gradient of the log-policy: ∇_w π_w(x,a) = π_w(x,a) ∇_w log π_w(x,a).
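The identity behind the trick can be checked numerically. The sketch below assumes a toy Bernoulli policy π_w(a=1|x) = sigmoid(w), which is purely illustrative.

```python
import math

# Hedged sketch: numerically checking the likelihood ratio identity
# grad_w pi_w = pi_w * grad_w log pi_w on a toy Bernoulli policy
# pi_w(a=1 | x) = sigmoid(w); the policy choice is illustrative.
def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))

w, eps = 0.3, 1e-6
# Left-hand side: finite-difference gradient of pi_w(a=1) w.r.t. w
lhs = (sigmoid(w + eps) - sigmoid(w - eps)) / (2 * eps)
# Right-hand side: pi * grad log pi = sigmoid(w) * (1 - sigmoid(w)),
# since d/dw log sigmoid(w) = 1 - sigmoid(w)
rhs = sigmoid(w) * (1.0 - sigmoid(w))
assert abs(lhs - rhs) < 1e-6
```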

14
Q

Write the stochastic policy gradient expression.

A

∇_w V^{π_w}(x₀) = E[∇_w log π_w(x,a) · Q^{π_w}(x,a)].
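This expectation can be estimated from a single sampled action. The sketch below is hedged: the two-action softmax policy, the logits `w`, and the placeholder Q-values are all illustrative assumptions.

```python
import math
import random

# Hedged sketch: a single-sample (score function) estimate of
# E[grad_w log pi_w(x,a) * Q^{pi_w}(x,a)] for a two-action softmax
# policy with logits w; the Q-values are illustrative placeholders.
def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def grad_log_pi(w, a):
    # For a softmax policy, d/dw_i log pi(a) = 1{i == a} - pi(i)
    p = softmax(w)
    return [(1.0 if i == a else 0.0) - p[i] for i in range(len(w))]

w = [0.2, -0.1]
Q = [1.0, 0.5]                                   # placeholder Q-values
a = 0 if random.random() < softmax(w)[0] else 1  # sample a ~ pi_w
estimate = [g * Q[a] for g in grad_log_pi(w, a)]
```

Averaging such single-sample estimates over many sampled actions approximates the expectation above.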

15
Q

What are the two main steps in policy gradient algorithms?

A

Policy evaluation and policy improvement.

16
Q

What is REINFORCE?

A

A policy gradient algorithm that estimates Q-values using Monte Carlo on-policy rollouts.
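The Monte Carlo estimate at the heart of REINFORCE can be sketched as follows; the function name and sample values are illustrative.

```python
# Hedged sketch of the Monte Carlo estimate REINFORCE relies on:
# Q^{pi}(x_t, a_t) is approximated by the discounted return G_t
# observed from step t to the end of an on-policy rollout.
def discounted_returns(rewards, gamma=0.99):
    G, returns = 0.0, []
    # Accumulate backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# e.g. rewards [1, 0, 1] with gamma = 0.5 give returns [1.25, 0.5, 1.0]
```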

17
Q

What is the main advantage of REINFORCE (and on policy methods)?

A

It provides an unbiased estimator of the policy gradient.

18
Q

What is the main drawback of REINFORCE?

A

High variance and the need for many on-policy rollouts.

19
Q

How do actor–critic methods improve over REINFORCE?

A

By using a learned value function to reduce variance in gradient estimates.

20
Q

What is the role of a baseline in policy gradient methods?

A

To reduce the variance of the gradient estimator without introducing bias, typically by replacing Q(x,a) with the advantage A(x,a) = Q(x,a) − V(x). The baseline acts as a control variate for the gradient estimator, so a given performance can be reached with fewer updates.
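A minimal sketch of the baseline idea, with illustrative numbers (the dictionary layout and values are assumptions):

```python
# Hedged sketch: subtracting a state-value baseline to form the
# advantage A(x,a) = Q(x,a) - V(x); the numbers are illustrative.
Q = {0: 2.0, 1: 4.0}       # Q(x, a) for two actions in one state x
V_x = 3.0                  # baseline V(x), e.g. from a learned critic
advantage = {a: q - V_x for a, q in Q.items()}
# advantage == {0: -1.0, 1: 1.0}: the gradient weights are recentered
# around zero, lowering variance without changing the expectation
```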

21
Q

What is entropy regularization in policy gradients?

A

A regularization term that encourages stochasticity in the policy to promote exploration.

22
Q

Why is entropy regularization useful?

A

It prevents premature convergence to deterministic policies and improves exploration.
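A small sketch of the entropy bonus (the coefficient `beta` and the example policies are illustrative assumptions):

```python
import math

# Hedged sketch: the entropy bonus H(pi(.|x)) = -sum_a pi(a|x) log pi(a|x)
# added to the policy-gradient objective, weighted by an illustrative
# hyperparameter beta.
def entropy(probs):
    # Terms with p == 0 contribute nothing (lim p->0 of p log p is 0)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

beta = 0.01
# A uniform policy has maximal entropy (log 2 for two actions),
# while a deterministic policy has zero entropy.
bonus_uniform = beta * entropy([0.5, 0.5])
bonus_deterministic = beta * entropy([1.0, 0.0])
```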

23
Q

How are action probabilities typically parameterized in discrete action spaces?

A

Using a softmax function over network outputs.

24
Q

What is a key disadvantage of pure policy gradient methods compared to value-based methods?

A

They typically have higher variance and lower sample efficiency.

25
Q

What is one key advantage of policy-based methods over value-based methods?

A

They can naturally handle continuous action spaces and stochastic policies.

26
Q

How are policy gradient methods typically optimized in practice?

A

Using stochastic gradient ascent with samples collected from the environment.

27
Q

Why are policy gradient methods considered on-policy by default?

A

Because gradient estimates require data generated by the current policy.

28
Q

What caution should be taken when benchmarking policy gradient algorithms?

A

Results are highly sensitive to randomness and hyperparameter choices.

29
Q

Why is averaging over multiple random seeds important in RL benchmarking?

A

Because single runs can be misleading due to high variance.

30
Q

What is a general conclusion about policy-based methods in deep RL?

A

They are powerful and flexible but require careful variance reduction and benchmarking.

31
Q

How do policy gradient methods optimize the performance objective?

A

By finding a policy in the set of parameterized policies, using gradient ascent with respect to the policy parameters.