What does generalization mean in reinforcement learning?
The ability of an RL agent to achieve good performance either with limited collected data or in a related but different environment.
Why is generalization particularly challenging in real-world RL problems?
The agent may not be able to interact with the true environment, only a simulation of it (the reality gap).
The agent has access to limited data due to safety constraints (robotics, medical trials), computational cost, or limited exogenous data (weather conditions, trading markets).
How does generalization in RL differ from supervised learning?
In RL, data is generated by the agent’s policy and affects future data, whereas supervised learning assumes independent identically distributed samples.
What is bias in the context of learning algorithms?
The error introduced by limitations in the model or learning algorithm, even with infinite data.
What is overfitting in learning from limited data?
Poor performance on unseen data due to excessive sensitivity to the specific training dataset.
What tradeoff governs learning from limited data in supervised learning?
The bias–variance tradeoff: error introduced by the learning algorithm's assumptions (bias) versus error due to the limited data available (variance).
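The tradeoff above is usually stated via the classical decomposition of expected squared error (a standard supervised-learning result, included here for reference):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

Here $f$ is the true function, $\hat{f}$ the learned predictor, and $\sigma^2$ irreducible noise; expectation is over training sets.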
Why is there no strict bias–variance decomposition in reinforcement learning?
Because RL objectives do not generally rely on an L2 (squared-error) loss, and the sequential decision-making setting breaks the assumptions behind the decomposition.
How is policy suboptimality decomposed in batch reinforcement learning?
Into asymptotic bias (difference from optimal policy) and error due to finite dataset size (overfitting).
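This decomposition can be written as follows (notation is illustrative: $\pi_D$ is the policy learned from a finite dataset $D$, $\pi_\infty$ the policy the same algorithm would learn with infinite data, and $\pi^*$ the optimal policy):

```latex
\underbrace{V^{\pi^*} - V^{\pi_D}}_{\text{suboptimality}}
  = \underbrace{\big(V^{\pi^*} - V^{\pi_\infty}\big)}_{\text{asymptotic bias}}
  + \underbrace{\big(V^{\pi_\infty} - V^{\pi_D}\big)}_{\text{overfitting (finite data)}}
```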
What is asymptotic bias in reinforcement learning?
The difference in performance between the optimal policy and the policy learned with infinite data.
What causes overfitting in reinforcement learning?
Learning from a limited dataset that does not sufficiently cover the state–action space.
What is the key tradeoff when selecting a policy class in RL?
Between expressiveness to reduce bias and simplicity to avoid overfitting.
Name three elements that can improve generalization in reinforcement learning.
Abstract state representations (discarding non-essential features)
Modifying the objective function (reward shaping, tuning the training discount factor)
Choosing appropriate learning algorithms (model-based/model-free) and function approximators
How does abstract state representation improve generalization?
By discarding non-essential features and reducing sensitivity to spurious correlations, which reduces overfitting.
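A minimal sketch of state abstraction as feature projection (the feature names and observation format are hypothetical, purely for illustration):

```python
def abstract_state(raw_state, keep=("position", "velocity")):
    """Project a raw observation dict onto task-relevant features,
    discarding the rest (e.g. background details, timestamps).
    Coarser `keep` sets reduce overfitting but risk merging states
    with different dynamics (asymptotic bias)."""
    return tuple(raw_state[k] for k in keep)

raw = {"position": 3, "velocity": -1, "wall_color": "red", "timestamp": 17}
print(abstract_state(raw))  # (3, -1)
```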
What is the risk of using too many features in state representations?
Overfitting due to reliance on spurious correlations.
What bias is introduced by overly coarse abstractions?
They may merge states with different dynamics, reducing policy optimality.
What is reward shaping?
Modifying the reward function away from the actual objective to facilitate learning, at the cost of introducing bias.
How can tuning the discount factor affect generalization?
It biases learning toward short-term or long-term rewards, which can stabilize training.
Why can modifying the training objective help generalization?
Because a biased but smoother objective may be easier to learn from limited data.
How does the choice of learning algorithm affect generalization?
Different algorithms balance bias and overfitting differently depending on structure and assumptions.
Each algorithm relies on different components: a value function over state–action pairs, a representation of the policy, and/or a model of the environment together with a planning algorithm.
Why do function approximators play a key role in generalization?
They determine how features are combined into higher-level abstractions (e.g., via convolutions or attention mechanisms); depending on the task, either model-free or model-based approaches may generalize best.
What is the parallel between model-free vs model-based RL and human cognition?
Model-free RL resembles fast, intuitive reasoning, while model-based RL resembles slow, deliberate reasoning.
What is transfer learning in reinforcement learning?
Using knowledge learned in one environment to improve learning in a related environment.
Why is transfer learning important in deep RL for real-world tasks?
Because training from scratch is expensive and often infeasible in real-world tasks.
What is undirected exploration?
Exploration strategies such as ε-greedy that do not use uncertainty estimates.
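A minimal sketch of ε-greedy action selection, the canonical undirected strategy (Q-values and action count are made up for illustration):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Undirected exploration: with probability epsilon pick a uniformly
    random action, otherwise the greedy one. Note it uses no estimate of
    uncertainty about the Q-values themselves."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # 2nd action (index 1) is greedy
```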