Unit1 Framework Flashcards

(27 cards)

1
Q

What is the RL problem setup?

A

An agent interacts with an environment over time to maximize cumulative reward.

2
Q

What are the 4 key elements of an RL loop?

A

State, action, reward, and next state: at each step the agent observes a state, takes an action, and the environment returns a reward and the next state.

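The loop in the card above can be sketched in Python (a minimal sketch: the `reset`/`step` interface imitates the common Gym-style convention, and `ToyEnv` is an invented, purely illustrative environment):

```python
import random

class ToyEnv:
    """Hypothetical 2-state environment, for illustration only."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        reward = 1.0 if action == self.state else 0.0  # reward for "matching" the state
        self.state = random.choice([0, 1])             # environment picks the next state
        done = random.random() < 0.1                   # episode ends on ~10% of steps
        return self.state, reward, done

env = ToyEnv()
state = env.reset()
total_reward = 0.0
done = False
while not done:                      # the agent-environment interaction loop
    action = random.choice([0, 1])   # here a random policy chooses the action
    state, reward, done = env.step(action)
    total_reward += reward           # the cumulative reward the agent tries to maximize
```

Each pass through the `while` loop is one timestep: state in, action out, reward and next state back.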
3
Q

Define state S_t

A

The information the agent receives from the environment at time t that summarizes the situation.

4
Q

Define action A_t

A

The choice the agent makes at time t that affects the environment.

5
Q

Define reward R_t

A

Scalar feedback signal from the environment indicating desirability of the last action.

6
Q

What is the goal of the agent?

A

Maximize expected cumulative reward over time.

8
Q

Define policy π

A

The policy is a function that, given the current state, outputs either a probability distribution over actions or a single action.

9
Q

What is the difference between a deterministic and a stochastic policy?

A

A deterministic policy picks exactly one action for a given state; a stochastic policy outputs a probability distribution over actions.

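The distinction in the card above can be made concrete (a toy sketch; the state-to-action rules and probabilities are made up for illustration):

```python
import random

def deterministic_policy(state):
    """Maps each state to exactly one action."""
    return 0 if state < 5 else 1

def stochastic_policy(state):
    """Maps each state to a probability distribution over actions."""
    p_action1 = state / 10            # toy distribution, purely illustrative
    return {0: 1 - p_action1, 1: p_action1}

def sample(dist):
    """Draw one action from a {action: probability} distribution."""
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs)[0]

action = deterministic_policy(3)      # always the same action for state 3
dist = stochastic_policy(3)           # roughly {0: 0.7, 1: 0.3}
sampled = sample(dist)                # 0 or 1, drawn at random
```

Calling `deterministic_policy(3)` twice always returns the same action; sampling from `stochastic_policy(3)` can return different actions on different calls.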
11
Q

Define value function V(s)

A

Expected cumulative reward when starting from state s and then following policy π.

12
Q

Define action-value function Q(s, a)

A

Expected cumulative reward starting from state s, taking action a, and then following policy π.

13
Q

What is an episode?

A

A sequence of states, actions, and rewards that ends in a terminal state.

14
Q

What is a step / timestep?

A

A single interaction cycle (state, action, reward, next state).

15
Q

What is the Markov property?

A

The future depends only on the current state, not on the past history.

16
Q

Define MDP

A

An MDP (Markov decision process) is an environment model specifying the states, the actions, the transition probabilities, the rewards, and the discount factor; often written as the tuple (S, A, P, R, γ).

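A small MDP can be written out explicitly as data (a hypothetical 2-state, 2-action example; all names and numbers are invented for illustration):

```python
# Hypothetical 2-state, 2-action MDP written out explicitly.
states = ["s0", "s1"]
actions = ["left", "right"]
gamma = 0.9   # discount factor

# P[(s, a)] is a dict {next_state: probability} -- the transition function
P = {
    ("s0", "left"):  {"s0": 0.8, "s1": 0.2},
    ("s0", "right"): {"s0": 0.1, "s1": 0.9},
    ("s1", "left"):  {"s0": 0.7, "s1": 0.3},
    ("s1", "right"): {"s0": 0.0, "s1": 1.0},
}

# R[(s, a)] is the expected immediate reward for taking action a in state s
R = {
    ("s0", "left"): 0.0, ("s0", "right"): 1.0,
    ("s1", "left"): 0.5, ("s1", "right"): 0.0,
}

# sanity check: every transition distribution must sum to 1
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in P.values())
```

The five pieces of data above are exactly the tuple (S, A, P, R, γ) from the card.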
17
Q

Define transition function P(s' | s, a)

A

The probability of the next state s' given the current state s and action a.

18
Q

Define discount factor γ

A

A number between 0 and 1 that controls how much the agent values future rewards relative to immediate ones.

19
Q

Role of γ close to 0

A

Agent focuses on immediate rewards.

20
Q

Role of γ close to 1

A

Agent heavily values long-term rewards.

21
Q

Define return G_t

A

Sum of discounted rewards from timestep t: R_{t+1} + γR_{t+2} + γ²R_{t+3} + …

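The return from the card above can be computed directly from a list of rewards; varying γ also shows the effect described in the two previous cards (the reward values are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [1.0, 0.0, 0.0, 10.0]            # a large reward arrives late
g_myopic = discounted_return(rewards, 0.1)   # gamma near 0: ~1.01, late reward barely counts
g_farsighted = discounted_return(rewards, 0.99)  # gamma near 1: ~10.70, late reward dominates
```

With γ = 0.1 the late reward of 10 contributes only 0.1³ × 10 = 0.01; with γ = 0.99 it contributes 0.99³ × 10 ≈ 9.70.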
22
Q

Objective in RL in terms of return

A

Maximize expected return E[G_t].

23
Q

What defines the optimal policy π*?

A

The policy whose value function is at least as high as every other policy's in all states.

24
Q

What is the difference between model-based and model-free RL?

A

Model-based methods learn or are given the transition and reward model; model-free methods learn values or a policy directly, without modeling the environment's dynamics.

25
Q

What is exploration?

A

Trying actions to gather information about the environment.

26
Q

What is exploitation?

A

Using known information to choose the best actions for reward.

27
Q

Why is exploration needed?

A

Without exploration the agent may get stuck in suboptimal behavior.
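The exploration/exploitation trade-off from the last three cards is often handled with an ε-greedy rule (a minimal sketch; the Q-values below are hypothetical numbers, not learned values):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore (random action);
    otherwise exploit (pick the action with the highest Q-value)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                 # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit

q = [0.2, 0.8, 0.5]               # hypothetical action values for one state
greedy = epsilon_greedy(q, 0.0)   # epsilon = 0: always exploits -> action 1
explored = epsilon_greedy(q, 1.0) # epsilon = 1: always explores -> any action
```

Setting ε between 0 and 1 (and often decaying it over time) lets the agent mostly exploit while still occasionally exploring, which addresses the failure mode in card 27.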