What is the difference between RL with control and RL without control?
Control in RL means that actions are being chosen by the learner, i.e. the learner is improving a policy; without control (the prediction setting), the learner only evaluates a fixed policy.
What does a contraction mapping do when applied to functions F and G?
It brings them closer together: under the max norm, ||TF - TG|| <= gamma ||F - G|| for some gamma < 1, so repeated application converges to a unique fixed point.
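A minimal numeric sketch of the contraction property, using the optimal Bellman backup on a small made-up MDP (the transition probabilities and rewards here are random, purely illustrative):

```python
# Sketch: the Bellman backup is a gamma-contraction in the max norm.
# The 2-state, 2-action MDP below is made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
n_states, n_actions = 2, 2

# Random transition probabilities P[s, a, s'] and rewards R[s, a].
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

def bellman_backup(V):
    """Optimal Bellman operator: (TV)(s) = max_a R(s,a) + gamma * E[V(s')]."""
    return np.max(R + gamma * P @ V, axis=1)

# Two arbitrary value functions F and G ...
F = rng.random(n_states)
G = rng.random(n_states)
before = np.max(np.abs(F - G))
after = np.max(np.abs(bellman_backup(F) - bellman_backup(G)))

# ... end up closer together after one backup: ||TF - TG|| <= gamma ||F - G||.
assert after <= gamma * before + 1e-12
```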
What three things need to hold for the convergence theorem to guarantee that Q converges in the limit?
Every state-action pair must be visited (and updated) infinitely often, the learning rates must satisfy sum_t alpha_t = infinity and sum_t alpha_t^2 < infinity, and rewards must be bounded.
How might we compute how much we do (or don’t) care about future rewards?
The effective horizon is H ~ 1 / (1 - gamma): a constant reward stream discounted by gamma sums to r / (1 - gamma), so that ratio measures roughly how many future steps the agent cares about.
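A quick sketch of where 1 / (1 - gamma) comes from: summing a constant reward of 1 under discount gamma (the truncation length here is arbitrary, just large enough for the geometric tail to vanish):

```python
# Sketch: a constant reward of 1 discounted by gamma sums to 1/(1 - gamma),
# so the agent effectively "cares" about about H = 1/(1 - gamma) steps.
gamma = 0.99
discounted_sum = sum(gamma**t for t in range(100_000))  # truncated geometric series

print(discounted_sum)   # ~= 100
print(1 / (1 - gamma))  # = 100, the effective horizon H
```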
Why is it not a good idea to set gamma to very small values?
You end up with an agent that acts myopically, always seeking immediate gratification rather than playing “the long game”
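A small illustration of that myopia, with made-up reward streams: under a tiny gamma, a small immediate reward beats a much larger reward a few steps later.

```python
# Sketch (reward values made up): small gamma => immediate gratification wins.
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over the finite reward stream."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

impatient = [1, 0, 0, 0]   # +1 right now
patient   = [0, 0, 0, 10]  # +10 after three steps

for gamma in (0.1, 0.9):
    print(gamma, discounted_return(impatient, gamma), discounted_return(patient, gamma))
# gamma = 0.1: myopic agent prefers the immediate +1 (1.0 vs 0.01)
# gamma = 0.9: the delayed +10 wins (1.0 vs 7.29)
```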
What are three important features of policy iteration (PI)? What is the tradeoff we make when using PI over VI? What is the most important feature of PI?
The downside of PI is computational cost per iteration: each step requires a full policy evaluation (solving for the value of the current policy over all states), whereas VI performs one cheap backup per state. The tradeoff is fewer, but more expensive, iterations.
Most important thing about PI is that it can’t get stuck in a local optimum: each iteration’s policy is at least as good as the last (monotonic value improvement), and improvement stops only once the policy is optimal.
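A compact sketch of policy iteration on a small made-up MDP (random transitions and rewards, purely illustrative): evaluate the current policy exactly via a linear solve, improve greedily, and stop when the policy is stable.

```python
# Sketch of policy iteration: exact evaluation + greedy improvement,
# on a hypothetical random 3-state, 2-action MDP.
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9
nS, nA = 3, 2
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)   # transition probabilities P[s, a, s']
R = rng.random((nS, nA))            # rewards R[s, a]

def evaluate(policy):
    """Policy evaluation: solve V = R_pi + gamma * P_pi V as a linear system."""
    P_pi = P[np.arange(nS), policy]  # (nS, nS) transitions under the policy
    R_pi = R[np.arange(nS), policy]  # (nS,) rewards under the policy
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)

policy = np.zeros(nS, dtype=int)     # arbitrary initial policy
while True:
    V = evaluate(policy)                             # expensive full evaluation
    improved = np.argmax(R + gamma * P @ V, axis=1)  # greedy improvement step
    if np.array_equal(improved, policy):
        break                        # policy stable => optimal, no local optima
    policy = improved

print(policy, V)
```

Because each improvement step produces a policy at least as good as the last and there are finitely many deterministic policies, the loop must terminate at the optimal policy.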
What is bounded loss/regret with respect to policies?
A policy that is epsilon-optimal, i.e. one whose value at every state is no more than epsilon below the value that would have been achieved by the optimal policy.
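A tiny sketch of checking that definition, with made-up value arrays V_pi (the candidate policy's values) and V_star (the optimal values):

```python
# Sketch: a policy is epsilon-optimal if V*(s) - V_pi(s) <= epsilon at every
# state. The numbers below are invented for illustration.
epsilon = 0.1
V_star = [1.0, 2.0, 3.0]    # optimal values per state
V_pi   = [0.95, 1.92, 3.0]  # values of the candidate policy

is_eps_optimal = all(vs - vp <= epsilon for vs, vp in zip(V_star, V_pi))
print(is_eps_optimal)  # True: every state is within 0.1 of optimal
```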