How do Markov Decision Processes (MDPs) differ from state space searches
In standard search, actions have guaranteed (deterministic) outcomes; in an MDP, actions have probabilistic outcomes (e.g. an 80% chance of reaching state x and a 20% chance of reaching state y)
Probabilistic outcomes
You don’t know for certain which state you will reach
What is the Markov Property
It is the “memoryless” property.
The future depends only on the current state and action, not the history of how you got there
What is the mathematical rule for transition probabilities in any state-action pair
For a given state s and action a, the sum of probabilities for all possible next states must equal 1
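A minimal sketch of this rule as a check in Python (the states, actions, and probabilities below are invented for illustration):

```python
# Transition model: (state, action) -> {next_state: probability}.
# All names and numbers here are made up for illustration.
transitions = {
    ("s0", "go"):   {"s1": 0.8, "s2": 0.2},
    ("s1", "stay"): {"s1": 1.0},
}

# For every state-action pair, the probabilities over all possible
# next states must sum to 1.
for (state, action), dist in transitions.items():
    total = sum(dist.values())
    assert abs(total - 1.0) < 1e-9, f"Invalid distribution for ({state}, {action})"
```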
What is the formula for Discounted Return (Gt)
Gt = r_{t+1} + γr_{t+2} + γ²r_{t+3} + … = Σ_{k=0}^{∞} γ^k r_{t+k+1}
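As a sketch, the formula can be computed directly for a finite sequence of future rewards (the reward values and γ below are made-up examples):

```python
# Discounted return G_t for a finite sequence of future rewards.
# rewards[k] corresponds to r_{t+k+1}; gamma is the discount rate.
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: r_{t+1}=1, r_{t+2}=1, r_{t+3}=1 with gamma = 0.5
# G_t = 1 + 0.5*1 + 0.25*1 = 1.75
print(discounted_return([1, 1, 1], 0.5))  # -> 1.75
```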
What does the Discounted Return formula represent
It is a way to calculate the total value of all rewards an agent receives, starting from time t.
What does r(t+1), r(t+2) … represent in the Discounted Return formula
These are the individual rewards received at each future step. The first reward is not discounted because it is received immediately
What does gamma represent in the Discounted Return formula
This is the discount rate.
It is a value between 0 and 1 that determines how much we value future rewards relative to immediate ones.
What does gamma = 1 mean
Future rewards are worth just as much as current rewards
What does gamma^k mean in the Discounted Return formula
As time goes on, k increases. Since gamma is usually less than 1, gamma^k gets smaller and smaller, meaning rewards in the distant future fade away and count for less
How does a lower discount rate change an agent’s behaviour
It motivates the decision-maker to favour immediate rewards and take actions early rather than postponing them.
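A sketch of this effect, assuming a choice between a small immediate reward and a larger delayed one (the reward sequences and γ values are invented for illustration):

```python
def discounted_return(rewards, gamma):
    # rewards[k] is the reward received k+1 steps from now
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

immediate = [5, 0, 0, 0]   # take 5 now
delayed   = [0, 0, 0, 10]  # wait three steps for 10

# With a high discount rate (gamma close to 1), waiting wins:
# 0.95^3 * 10 ≈ 8.57 > 5
assert discounted_return(delayed, 0.95) > discounted_return(immediate, 0.95)
# With a low discount rate, the immediate reward wins:
# 0.5^3 * 10 = 1.25 < 5
assert discounted_return(immediate, 0.5) > discounted_return(delayed, 0.5)
```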
What is a Policy (π) in an MDP
A strategy that specifies exactly which action to take for every possible state in the process.
What is the difference between a Deterministic and a Stochastic policy
Deterministic - Selects exactly one specific action for each state
Stochastic - Assigns probabilities to different actions for each state
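A minimal sketch of both kinds of policy (the states and actions are invented for illustration):

```python
import random

# Deterministic policy: exactly one action per state.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: a probability distribution over actions per state.
stochastic_policy = {
    "s0": {"left": 0.7, "right": 0.3},
    "s1": {"left": 0.1, "right": 0.9},
}

def act_deterministic(policy, state):
    return policy[state]

def act_stochastic(policy, state):
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs)[0]

print(act_deterministic(deterministic_policy, "s0"))  # -> left
```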
What 2 factors must be balanced to achieve optimal behaviour in an MDP
Risks and Rewards
How is the discount rate used in Climate Policy?
High SDR (social discount rate) - Values the present more than the future; used to argue against drastic immediate climate action.
Low SDR - Values the future almost as much as the present; used to argue for immediate action
States
A set of all possible situations
Actions
The choices available in each state
Transition probability
The likelihood of ending up in a new state given the current state and action
P(s_{t+1} | s_t, a_t)
Reward
The immediate payoff received after a transition
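The four components above can be tied together as one small MDP sketch in Python (all state names, probabilities, and reward values are invented for illustration):

```python
# A tiny MDP: states, actions, transition probabilities, and rewards.
states = ["cool", "warm", "overheated"]
actions = ["slow", "fast"]

# P(s_{t+1} | s_t, a_t): (state, action) -> {next_state: probability}
P = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
    ("warm", "fast"): {"overheated": 1.0},
}

# Immediate payoff received after taking an action in a state.
R = {("cool", "slow"): 1, ("cool", "fast"): 2,
     ("warm", "slow"): 1, ("warm", "fast"): -10}

# Each transition distribution sums to 1, as required.
for pair, dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```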