What is the objective of value-based reinforcement learning?
To learn a value function that estimates the expected return of states or state–action pairs in order to derive an optimal policy.
What is the definition of the Q-value function Qπ(x,a)?
The expected discounted return obtained by taking action a in state x and then following policy π thereafter.
Why can the optimal policy be directly obtained from Q*?
Because the optimal policy selects the action that maximizes Q*(x,a) in each state.
What is the Bellman optimality equation for Q*?
Q*(x,a) = E[r + γ max_{a’} Q*(x’,a’) | x,a], where x’ is the successor state.
What is the Bellman operator in Q-learning?
An operator that maps a Q-function to a new Q-function using expected rewards and discounted future optimal Q-values.
What is the key assumption of dynamic programming methods in RL?
Full knowledge of the MDP, including the transition and reward functions.
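As a sketch, dynamic programming with full MDP knowledge can be illustrated by Q-value iteration, which repeatedly applies the Bellman optimality operator. The tiny MDP below (transition tensor P, reward matrix R, and all values) is invented purely for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s, a, s'] are transition
# probabilities, R[s, a] are expected rewards (illustrative values).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Repeatedly apply the Bellman optimality operator; this requires
# full knowledge of P and R, unlike Q-learning.
Q = np.zeros((2, 2))
for _ in range(500):
    Q = R + gamma * np.einsum('sap,p->sa', P, Q.max(axis=1))

policy = Q.argmax(axis=1)  # greedy policy with respect to Q*
```

Because the operator is a γ-contraction, the loop converges to the unique fixed point Q*.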
How does Q-learning differ from dynamic programming?
Q-learning does not require prior knowledge of the MDP and learns from sampled experience.
Write the Q-learning update rule.
Q(x,a) ← Q(x,a) + α [r + γ max_{a’} Q(x’,a’) − Q(x,a)].
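A minimal tabular sketch of this update on a hypothetical 5-state chain (the environment, rewards, and hyperparameters are invented for illustration; the behavior policy is uniformly random, which suffices because Q-learning is off-policy):

```python
import random
import numpy as np

# Chain of states 0..4: action 1 moves right, action 0 moves left;
# reward 1 for reaching the rightmost state, which ends the episode.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9

random.seed(0)
x = 0
for _ in range(10000):
    a = random.randrange(n_actions)                          # random behavior policy
    x2 = min(x + 1, n_states - 1) if a == 1 else max(x - 1, 0)
    r = 1.0 if x2 == n_states - 1 else 0.0
    # Q-learning update: bootstrap on the greedy (max) next-state value
    Q[x, a] += alpha * (r + gamma * Q[x2].max() - Q[x, a])
    x = 0 if x2 == n_states - 1 else x2                      # reset at terminal
```

After training, the greedy policy argmax_a Q(x,a) moves right in every state, even though the data came from a purely random policy.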
Is Q-learning an on-policy or off-policy algorithm?
Off-policy, because it learns the optimal policy independently of the behavior policy.
What are the conditions for the Q-function to converge in Q-learning?
Learning continues indefinitely, the learning rate decays appropriately (Σ_t α_t = ∞ and Σ_t α_t² < ∞), and the exploration policy visits every state–action pair infinitely often.
What role does the discount factor γ play in Q-learning?
It controls the importance of future rewards relative to immediate rewards.
Why does tabular Q-learning fail in large or continuous state spaces?
Because the number of state–action pairs grows exponentially (curse of dimensionality).
When are function approximators required in Q-learning?
When the state space is large or continuous, making tabular representations infeasible.
How is the Q-function represented in Deep Q-Learning (DQN)?
By a deep neural network with parameters θ that approximates Q(x,a) as Q(x,a;θ); θ is updated by gradient descent toward a target value Y.
What loss function is minimized in DQN?
The squared temporal-difference error between the predicted Q-value and the target value: (Q(x,a;θ) − Y)².
What is the DQN target value Y?
Y = r + γ max_{a’} Q(x’,a’; θ⁻), where θ⁻ are target network parameters.
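A hedged sketch of the target and loss computation, using a linear approximator in place of the deep network (all features, shapes, and hyperparameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 3
theta = rng.normal(size=(n_features, n_actions))  # online parameters θ
theta_target = theta.copy()                       # frozen target parameters θ⁻
gamma, lr = 0.99, 0.01

def q_values(phi, params):
    """Q(x, ·; params) for a feature vector phi of state x."""
    return phi @ params

# One sampled transition (x, a, r, x'), features drawn at random
phi, phi_next = rng.normal(size=n_features), rng.normal(size=n_features)
a, r = 1, 0.5

# Target Y = r + γ max_a' Q(x', a'; θ⁻), computed with the target network
Y = r + gamma * q_values(phi_next, theta_target).max()

# Squared TD error (Q(x,a;θ) − Y)² and a semi-gradient step on θ only
td_error = q_values(phi, theta)[a] - Y
theta[:, a] -= lr * td_error * phi

# Periodically: theta_target = theta.copy()
```

The update is "semi-gradient" because Y is treated as a constant: the gradient does not flow through θ⁻.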
Why does DQN use a target network?
To stabilize learning by keeping the target values fixed for several updates.
What is experience replay and why is it used?
A memory buffer that stores transitions and allows random sampling to reduce correlation and variance.
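A minimal replay-buffer sketch (the capacity and batch size are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, x, a, r, x_next, done):
        self.buffer.append((x, a, r, x_next, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(250):
    buf.push(t, 0, 0.0, t + 1, False)  # dummy transitions for illustration
batch = buf.sample(32)
```

Each training step samples a minibatch from the buffer rather than using only the most recent transition.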
What are the two main sources of instability in naive deep Q-learning?
Strong correlation between consecutive samples, and target values that move with every update.
What is bootstrapping in Q-learning?
Using the algorithm’s own value estimates to update current estimates.
What is multistep Q-learning?
A variant whose target sums n discounted future rewards before bootstrapping, which reduces the bias introduced by bootstrapping at the cost of higher variance.
What is a key requirement for unbiased n-step learning?
On-policy data, or off-policy correction techniques such as importance sampling.
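The n-step target can be sketched as follows (the rewards, γ, n, and bootstrap value are illustrative):

```python
# n-step (multistep) return target:
# Y = sum_{k=0}^{n-1} gamma^k * r_{t+k}  +  gamma^n * max_{a'} Q(x_{t+n}, a')
gamma, n = 0.9, 3
rewards = [1.0, 0.0, 2.0]   # r_t, r_{t+1}, r_{t+2}
bootstrap = 5.0             # max_{a'} Q(x_{t+n}, a')

Y = sum(gamma**k * r for k, r in enumerate(rewards)) + gamma**n * bootstrap
```

Compared with the one-step target, only the final term bootstraps on the learned Q-function, so estimation bias enters the target with a factor of γⁿ.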
What is Distributional DQN?
A DQN variant that learns the full distribution over returns rather than only its expectation; the expectation of this distribution equals the optimal Q-value.
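A sketch of how the Q-value is recovered as an expectation in a categorical distributional agent such as C51 (the atom support and the uniform predicted probabilities are illustrative):

```python
import numpy as np

atoms = np.linspace(-10, 10, 51)        # fixed return support z_1, ..., z_51
probs = np.full(51, 1 / 51)             # predicted mass p_i(x, a), uniform here
q_value = float((atoms * probs).sum())  # Q(x, a) = E[Z(x, a)]
```

The network outputs the probabilities over a fixed grid of return atoms; any statistic beyond the mean (e.g., quantiles for risk-aware behavior) is then available from the same output.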
What advantage does Distributional DQN provide?
It makes risk-aware behavior possible and tends to improve learning performance in practice.