What is the RL problem setup
An agent interacts with an environment over time to maximize cumulative reward.
What are the key elements of an RL loop
Agent, environment, state, action, and reward: at each step the agent observes a state, takes an action, and the environment returns a reward and the next state.
Define state S_t
The information the agent receives from the environment at time t that summarizes the situation.
Define action A_t
The choice the agent makes at time t that affects the environment.
Define reward R_t
Scalar feedback signal from the environment indicating desirability of the last action.
Goal of the agent
Maximize expected cumulative reward over time.
Define policy π
A function that maps the current state either to a single action (deterministic) or to a probability distribution over actions (stochastic).
Difference between deterministic and stochastic policy
Deterministic picks one action given a state, stochastic outputs a probability distribution over actions.
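The two policy types can be sketched in Python (a hypothetical setting with integer states and two actions, purely for illustration):

```python
import random

# Hypothetical toy setting: states are integers, actions are 0 or 1.
def deterministic_policy(state):
    # Maps each state to exactly one action.
    return state % 2

def stochastic_policy(state):
    # Samples an action from a state-dependent probability distribution.
    probs = [0.7, 0.3] if state == 0 else [0.5, 0.5]
    return random.choices([0, 1], weights=probs)[0]

print(deterministic_policy(1))  # always 1
```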
Define value function V(s)
Expected cumulative reward starting from state s following policy π.
Define action-value function Q(s, a)
Expected cumulative reward starting from state s, taking action a, then following policy π.
What is an episode
A sequence of states, actions, and rewards that terminates in a terminal state.
What is a step / timestep
A single interaction cycle (state action reward next state).
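One interaction cycle repeated until termination gives the full episode loop. A minimal sketch, using a made-up toy environment (the dynamics here are assumptions for illustration, not a real library API):

```python
import random

# Hypothetical toy environment: walk from state 0 to state 3.
def env_step(state, action):
    # Returns (reward, next_state, done); dynamics are invented for illustration.
    next_state = min(state + action, 3)
    reward = 1.0 if next_state == 3 else 0.0
    done = next_state == 3
    return reward, next_state, done

state, total_reward, done = 0, 0.0, False
while not done:
    action = random.choice([0, 1])                 # agent picks action A_t
    reward, state, done = env_step(state, action)  # env returns R_{t+1}, S_{t+1}
    total_reward += reward

print(total_reward)  # 1.0: reward is given only on reaching the terminal state
```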
What is the Markov property
The future depends only on the current state not on past history.
Define MDP
A Markov decision process: a tuple (S, A, P, R, γ) of states, actions, transition probabilities, reward function, and discount factor.
Define transition function P(s' | s, a)
Probability of next state s' given current state s and action a.
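A small MDP can be written down as explicit tables. This is a hypothetical 2-state example (all numbers invented for illustration):

```python
# Hypothetical 2-state MDP specified as explicit tables.
states = [0, 1]
actions = ["stay", "go"]
gamma = 0.9

# P[s][a] -> list of (next_state, probability) pairs, i.e. P(s' | s, a)
P = {
    0: {"stay": [(0, 1.0)], "go": [(0, 0.2), (1, 0.8)]},
    1: {"stay": [(1, 1.0)], "go": [(0, 1.0)]},
}
# R[s][a] -> expected immediate reward for taking a in s
R = {0: {"stay": 0.0, "go": 1.0}, 1: {"stay": 0.5, "go": 0.0}}

# Sanity check: transition probabilities for each (s, a) sum to 1.
for s in states:
    for a in actions:
        assert abs(sum(p for _, p in P[s][a]) - 1.0) < 1e-9
```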
Define discount factor γ
Number between 0 and 1 controlling how much future rewards are valued.
Role of γ close to 0
Agent focuses on immediate rewards.
Role of γ close to 1
Agent heavily values long term rewards.
Return G_t
Sum of discounted rewards from timestep t: R_{t+1} + γR_{t+2} + γ²R_{t+3} + …
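The return formula above, and the effect of γ near 0 versus near 1, can be checked with a few lines of Python (the reward sequence is an arbitrary example):

```python
# G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ...
def discounted_return(rewards, gamma):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, 1.0))  # 3.0: all rewards weighted equally
print(discounted_return(rewards, 0.5))  # 1.75 = 1 + 0.5 + 0.25
```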
Objective in RL in terms of return
Maximize expected return E[G_t].
What defines optimal policy π*
The policy whose value function is at least as high as that of every other policy in every state: V_{π*}(s) ≥ V_π(s) for all s and all π.
Difference between model based and model free RL
Model-based RL learns or is given the transition and reward model and can plan with it; model-free RL learns values or a policy directly from experience, without modeling the environment's dynamics.