What is the objective of value-based reinforcement learning?
To learn a value function that estimates the expected return of states or state–action pairs in order to derive an optimal policy.
What is the definition of the Q-value function Qπ(x,a)?
The expected discounted return obtained by taking action a in state x and then following policy π thereafter.
Why can the optimal policy be directly obtained from Q*?
Because the optimal policy selects the action that maximizes Q*(x,a) in each state.
What is the Bellman optimality equation for Q*?
Q*(x,a) = E[r + γ max_{a’} Q*(x’,a’) | x,a], where x’ is the successor state.
What is the Bellman operator in Q-learning?
An operator that maps a Q-function to a new Q-function using expected rewards and discounted future optimal Q-values.
What is the key assumption of dynamic programming methods in RL?
Full knowledge of the MDP, including the transition and reward functions.
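As a sketch, dynamic programming with full MDP knowledge can be illustrated by Q-value iteration, which repeatedly applies the Bellman optimality operator. The tiny MDP below (transition tensor P, reward matrix R, and all values) is invented purely for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s, a, s'] are transition
# probabilities, R[s, a] are expected rewards (illustrative values).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Repeatedly apply the Bellman optimality operator; this requires
# full knowledge of P and R, unlike Q-learning.
Q = np.zeros((2, 2))
for _ in range(500):
    Q = R + gamma * np.einsum('sap,p->sa', P, Q.max(axis=1))

policy = Q.argmax(axis=1)  # greedy policy with respect to Q*
```

Because the operator is a γ-contraction, the loop converges to the unique fixed point Q*.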
How does Q-learning differ from dynamic programming?
Q-learning does not require prior knowledge of the MDP and learns from sampled experience.
Write the Q-learning update rule.
Q(x,a) ← Q(x,a) + α [r + γ max_{a’} Q(x’,a’) − Q(x,a)].
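A minimal tabular sketch of this update on a hypothetical 5-state chain (the environment, rewards, and hyperparameters are invented for illustration; the behavior policy is uniformly random, which suffices because Q-learning is off-policy):

```python
import random
import numpy as np

# Chain of states 0..4: action 1 moves right, action 0 moves left;
# reward 1 for reaching the rightmost state, which ends the episode.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9

random.seed(0)
x = 0
for _ in range(10000):
    a = random.randrange(n_actions)                          # random behavior policy
    x2 = min(x + 1, n_states - 1) if a == 1 else max(x - 1, 0)
    r = 1.0 if x2 == n_states - 1 else 0.0
    # Q-learning update: bootstrap on the greedy (max) next-state value
    Q[x, a] += alpha * (r + gamma * Q[x2].max() - Q[x, a])
    x = 0 if x2 == n_states - 1 else x2                      # reset at terminal
```

After training, the greedy policy argmax_a Q(x,a) moves right in every state, even though the data came from a purely random policy.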
Is Q-learning an on-policy or off-policy algorithm?
Off-policy, because it learns the optimal policy independently of the behavior policy.
What are the conditions for the Q-function to converge in Q-learning?
Learning continues indefinitely, the learning rate decays appropriately (Σ_t α_t = ∞ and Σ_t α_t² < ∞), and the exploration policy visits every state–action pair infinitely often.
What role does the discount factor γ play in Q-learning?
It controls the importance of future rewards relative to immediate rewards.
Why does tabular Q-learning fail in large or continuous state spaces?
Because the number of state–action pairs grows exponentially (curse of dimensionality).
When are function approximators required in Q-learning?
When the state space is large or continuous, making tabular representations infeasible.
How is the Q-function represented in Deep Q-Learning (DQN)?
By a deep neural network with parameters θ that approximates Q(x,a) as Q(x,a;θ); θ is updated by gradient descent toward a target value Y.
What loss function is minimized in DQN?
The squared temporal-difference error between the predicted Q-value and the target value: (Q(x,a;θ) − Y)².
What is the DQN target value Y?
Y = r + γ max_{a’} Q(x’,a’; θ⁻), where θ⁻ are target network parameters.
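A hedged sketch of the target and loss computation, using a linear approximator in place of the deep network (all features, shapes, and hyperparameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 3
theta = rng.normal(size=(n_features, n_actions))  # online parameters θ
theta_target = theta.copy()                       # frozen target parameters θ⁻
gamma, lr = 0.99, 0.01

def q_values(phi, params):
    """Q(x, ·; params) for a feature vector phi of state x."""
    return phi @ params

# One sampled transition (x, a, r, x'), features drawn at random
phi, phi_next = rng.normal(size=n_features), rng.normal(size=n_features)
a, r = 1, 0.5

# Target Y = r + γ max_a' Q(x', a'; θ⁻), computed with the target network
Y = r + gamma * q_values(phi_next, theta_target).max()

# Squared TD error (Q(x,a;θ) − Y)² and a semi-gradient step on θ only
td_error = q_values(phi, theta)[a] - Y
theta[:, a] -= lr * td_error * phi

# Periodically: theta_target = theta.copy()
```

The update is "semi-gradient" because Y is treated as a constant: the gradient does not flow through θ⁻.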
Why does DQN use a target network?
To stabilize learning by keeping the target values fixed for several updates.
What is experience replay and why is it used?
A memory buffer that stores transitions and allows random sampling to reduce correlation and variance.
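A minimal replay-buffer sketch (the capacity and batch size are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, x, a, r, x_next, done):
        self.buffer.append((x, a, r, x_next, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(250):
    buf.push(t, 0, 0.0, t + 1, False)  # dummy transitions for illustration
batch = buf.sample(32)
```

Each training step samples a minibatch from the buffer rather than using only the most recent transition.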
What are the two main sources of instability in naive deep Q-learning?
Strong correlation between consecutive samples, and target values that move with every update.
What is bootstrapping in Q-learning?
Using the algorithm’s own value estimates to update current estimates.
What is multistep Q-learning?
A variant whose target sums n discounted future rewards before bootstrapping, which reduces the bias introduced by bootstrapping at the cost of higher variance.
What is a key requirement for unbiased n-step learning?
On-policy data, or off-policy correction techniques such as importance sampling.
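The n-step target can be sketched as follows (the rewards, γ, n, and bootstrap value are illustrative):

```python
# n-step (multistep) return target:
# Y = sum_{k=0}^{n-1} gamma^k * r_{t+k}  +  gamma^n * max_{a'} Q(x_{t+n}, a')
gamma, n = 0.9, 3
rewards = [1.0, 0.0, 2.0]   # r_t, r_{t+1}, r_{t+2}
bootstrap = 5.0             # max_{a'} Q(x_{t+n}, a')

Y = sum(gamma**k * r for k, r in enumerate(rewards)) + gamma**n * bootstrap
```

Compared with the one-step target, only the final term bootstraps on the learned Q-function, so estimation bias enters the target with a factor of γⁿ.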
What is Distributional DQN?
A DQN variant that learns the full distribution over returns rather than only its expectation; the expectation of this distribution equals the optimal Q-value.
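A sketch of how the Q-value is recovered as an expectation in a categorical distributional agent such as C51 (the atom support and the uniform predicted probabilities are illustrative):

```python
import numpy as np

atoms = np.linspace(-10, 10, 51)        # fixed return support z_1, ..., z_51
probs = np.full(51, 1 / 51)             # predicted mass p_i(x, a), uniform here
q_value = float((atoms * probs).sum())  # Q(x, a) = E[Z(x, a)]
```

The network outputs the probabilities over a fixed grid of return atoms; any statistic beyond the mean (e.g., quantiles for risk-aware behavior) is then available from the same output.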
What advantage does Distributional DQN provide?
It makes risk-aware behavior possible and tends to improve learning performance in practice.