Q-Learning Flashcards

(33 cards)

1
Q

What is the objective of value-based reinforcement learning?

A

To learn a value function that estimates the expected return of states or state–action pairs in order to derive an optimal policy.

2
Q

What is the definition of the Q-value function Qπ(x,a)?

A

The expected discounted return obtained by taking action a in state x and then following policy π thereafter.

3
Q

Why can the optimal policy be directly obtained from Q*?

A

Because the optimal policy selects the action that maximizes Q*(x,a) in each state.

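As an illustration, greedy-policy extraction from a small made-up Q-table (all values are for illustration only):

```python
import numpy as np

# Hypothetical Q*-table: 3 states x 2 actions
Q = np.array([[0.1, 0.9],
              [0.5, 0.2],
              [0.0, 0.3]])

# The greedy policy picks, in each state, the action maximizing Q*(x, a)
policy = Q.argmax(axis=1)
print(policy)  # -> [1 0 1]
```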
4
Q

What is the Bellman optimality equation for Q*?

A

Q*(x,a) = E[ r + γ max_{a′} Q*(x′,a′) | x, a ].

5
Q

What is the Bellman operator in Q-learning?

A

An operator that maps a Q-function to a new Q-function using expected rewards and discounted future optimal Q-values.

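A minimal sketch of the operator on a toy MDP with known dynamics (all transition probabilities and rewards are made up for illustration); iterating it converges to Q* because it is a γ-contraction:

```python
import numpy as np

# Toy 2-state, 2-action MDP with known model (as dynamic programming requires):
# P[x, a, y] = transition probability, R[x, a] = expected reward.
P = np.zeros((2, 2, 2))
P[0, 0] = [1.0, 0.0]; P[0, 1] = [0.0, 1.0]
P[1, 0] = [1.0, 0.0]; P[1, 1] = [0.0, 1.0]
R = np.array([[0.0, 0.0],
              [0.0, 1.0]])
gamma = 0.9

def bellman_operator(Q):
    """(TQ)(x,a) = R(x,a) + gamma * sum_y P(y|x,a) * max_a' Q(y,a')."""
    return R + gamma * P @ Q.max(axis=1)

# Repeated application converges to the fixed point Q*.
Q = np.zeros((2, 2))
for _ in range(200):
    Q = bellman_operator(Q)
```

Here Q*(1,1) = 1/(1 − γ) = 10, since action 1 in state 1 loops with reward 1.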
6
Q

What is the key assumption of dynamic programming methods in RL?

A

Full knowledge of the MDP, including the transition and reward functions.

7
Q

How does Q-learning differ from dynamic programming?

A

Q-learning does not require prior knowledge of the MDP and learns from sampled experience.

8
Q

Write the Q-learning update rule.

A

Q(x,a) ← Q(x,a) + α [r + γ max_{a’} Q(x’,a’) − Q(x,a)].

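The update rule as a short Python sketch (the hyperparameter values α and γ are illustrative):

```python
import numpy as np

def q_update(Q, x, a, r, x_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(x,a) toward the TD target."""
    td_target = r + gamma * Q[x_next].max()
    Q[x, a] += alpha * (td_target - Q[x, a])
    return Q

Q = np.zeros((2, 2))
Q = q_update(Q, x=0, a=1, r=1.0, x_next=1)
print(Q[0, 1])  # alpha * r = 0.1, since Q started at zero
```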
9
Q

Is Q-learning an on-policy or off-policy algorithm?

A

Off-policy, because it learns the optimal policy independently of the behavior policy.

10
Q

What are the conditions for the Q-function to converge?

A

Learning continues indefinitely with a learning rate that decays appropriately (Σ αₜ = ∞ and Σ αₜ² < ∞), and the exploration policy visits every state–action pair infinitely often.

11
Q

What role does the discount factor γ play in Q-learning?

A

It controls the importance of future rewards relative to immediate rewards.

12
Q

Why does tabular Q-learning fail in large or continuous state spaces?

A

Because the number of state–action pairs grows exponentially (curse of dimensionality).

13
Q

When are function approximators required in Q-learning?

A

When the state space is large or continuous, making tabular representations infeasible.

14
Q

How is the Q-function represented in Deep Q-Learning (DQN)?

A

By a deep neural network with parameters θ that approximates Q(x,a) as Q(x,a;θ); θ is updated by gradient descent toward the target value Y.

15
Q

What loss function is minimized in DQN?

A

The squared temporal-difference error between the predicted Q-value and the target value: (Q(x,a;θ) − Y)².

16
Q

What is the DQN target value Y?

A

Y = r + γ max_{a’} Q(x’,a’; θ⁻), where θ⁻ are target network parameters.
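A toy sketch of the target computation, using a made-up linear approximator in place of a real network (the weights and state vector are illustrative):

```python
import numpy as np

def dqn_target(r, x_next, done, theta_target, gamma=0.99):
    """Y = r + gamma * max_a Q(x', a; theta^-), with frozen target weights.
    Toy linear approximator: Q(x, :) = x @ theta."""
    q_next = x_next @ theta_target            # Q-values for all actions in x'
    return r + gamma * (1.0 - done) * q_next.max()

theta_minus = np.array([[1.0, 0.5],
                        [0.0, 2.0]])          # frozen copy of online weights
x_next = np.array([1.0, 1.0])
Y = dqn_target(r=0.0, x_next=x_next, done=0.0, theta_target=theta_minus)
# q_next = [1.0, 2.5], so Y = 0.99 * 2.5
```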

17
Q

Why does DQN use a target network?

A

To stabilize learning by keeping the target values fixed for several updates.

18
Q

What is experience replay and why is it used?

A

A memory buffer that stores transitions and allows random sampling to reduce correlation and variance.
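A minimal replay-buffer sketch (capacity and batch size are arbitrary choices for illustration):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer; uniform random sampling breaks temporal correlation."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old transitions fall off the left

    def push(self, transition):               # transition = (x, a, r, x_next, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(50):
    buf.push((t, 0, 1.0, t + 1, False))
batch = buf.sample(8)                         # 8 de-correlated transitions
```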

19
Q

What are the two main sources of instability in naive deep Q-learning?

A

Strong correlation between samples and moving target values.

20
Q

What is bootstrapping in Q-learning?

A

Using the algorithm’s own value estimates to update current estimates.

21
Q

What is multistep Q-learning?

A

A variant that uses multiple future rewards before updating values, reducing bias but increasing variance.
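The n-step target can be sketched as follows (γ, rewards, and the bootstrap value are illustrative):

```python
def n_step_target(rewards, q_bootstrap, gamma=0.9):
    """n-step target: r_0 + gamma*r_1 + ... + gamma^(n-1)*r_(n-1)
    + gamma^n * max_a Q(x_n, a)."""
    n = len(rewards)
    G = sum(gamma**k * r for k, r in enumerate(rewards))
    return G + gamma**n * q_bootstrap

# 3-step target: 1 + 0.9 + 0.81 + 0.729 * 10
Y = n_step_target([1.0, 1.0, 1.0], q_bootstrap=10.0)
```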

22
Q

What is a key requirement for unbiased n-step learning?

A

Online data or additional correction techniques.

23
Q

What is Distributional DQN?

A

A DQN variant that learns the distribution of returns instead of only the expected value; the expectation of this distribution equals the optimal Q-value.

24
Q

What advantage does Distributional DQN provide?

A

Enables risk-aware behavior, since the agent can take the variance of returns into account.
Often leads to better learning performance in practice.

25
Q

Why can two policies have the same Q-value but different value distributions?

A

Because they can have identical expected returns but different reward variances.
26
Q

What is a key advantage of using deep learning in Q-learning?

A

Generalization across similar states, mitigating the curse of dimensionality.
27
Q

What is a major drawback of deep Q-learning compared to tabular Q-learning?

A

Loss of guaranteed convergence and sensitivity to hyperparameters.
28
Q

Why is Q-learning suitable for problems like Mountain Car?

A

Because it can learn long-term strategies from delayed rewards.
29
Q

What is the main challenge when choosing the discount factor in deep RL?

A

Balancing short-term learning stability with long-term planning performance.
30
Q

How can increasing the discount factor during training help?

A

It allows learning short-term behaviors first and gradually incorporating long-term rewards.
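One way to do this is a simple linear annealing schedule (start/end values are illustrative):

```python
def gamma_schedule(step, total_steps, g_start=0.9, g_end=0.99):
    """Linearly anneal the discount factor from g_start to g_end."""
    frac = min(step / total_steps, 1.0)
    return g_start + frac * (g_end - g_start)

print(gamma_schedule(0, 1000))     # short-term focus early: 0.9
print(gamma_schedule(1000, 1000))  # long-term focus at the end
```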
31
Q

Summarize the main difference between tabular Q-learning and DQN.

A

Tabular Q-learning stores explicit Q-values, while DQN approximates them using neural networks.
32
Q

How does the Q-learning algorithm work in the tabular setting?

A

Initialize Q(x,a) arbitrarily. For each episode: initialize the state x; then, for each step until a terminal state is reached: choose action a in x derived from Q (e.g. ε-greedy), observe the reward and next state, and update Q(x,a) in the table.
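A runnable sketch of this loop, using a made-up 3-state chain environment and ε-greedy exploration (all hyperparameters are illustrative):

```python
import numpy as np

class ChainEnv:
    """Hypothetical 3-state chain: action 1 moves right; reaching
    state 2 gives reward 1 and ends the episode."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        if a == 1:
            self.s = min(self.s + 1, 2)
        done = self.s == 2
        return self.s, float(done), done

def tabular_q_learning(env, n_states, n_actions, episodes=300,
                       alpha=0.1, gamma=0.9, eps=0.2, seed=0):
    Q = np.zeros((n_states, n_actions))       # initialize Q(x,a) arbitrarily
    rng = np.random.default_rng(seed)
    for _ in range(episodes):
        x, done = env.reset(), False          # initialize the state
        while not done:
            # choose a in x derived from Q (epsilon-greedy)
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[x].argmax())
            x_next, r, done = env.step(a)     # observe reward and next state
            # update Q(x,a) in the table
            Q[x, a] += alpha * (r + gamma * (0 if done else Q[x_next].max()) - Q[x, a])
            x = x_next
    return Q

Q = tabular_q_learning(ChainEnv(), n_states=3, n_actions=2)
```

After training, moving right dominates staying in both non-terminal states.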
33
Q

What is a Parallelized Q-network?

A

A Q-network trained on experience collected from parallel (vectorized) environments, removing the need for a replay memory and a target network.