What is the RL problem setup
An agent interacts with an environment over time to maximize cumulative reward.
What are the key elements of an RL loop
Agent, environment, state, action, and reward: at each step the agent observes a state, takes an action, and the environment returns a reward and the next state.
Define state S_t
The information the agent receives from the environment at time t that summarizes the situation.
Define action A_t
The choice the agent makes at time t that affects the environment.
Define reward R_t
Scalar feedback signal from the environment indicating desirability of the last action.
Goal of the agent
Maximize expected cumulative reward over time.
Define policy π
A function that maps the current state either to a single action (deterministic) or to a probability distribution over actions (stochastic).
Difference between deterministic and stochastic policy
Deterministic picks one action given a state, stochastic outputs a probability distribution over actions.
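The two policy types can be sketched in Python (a hypothetical setting with integer states and two actions, purely for illustration):

```python
import random

# Hypothetical toy setting: states are integers, actions are 0 or 1.
def deterministic_policy(state):
    # Maps each state to exactly one action.
    return state % 2

def stochastic_policy(state):
    # Samples an action from a state-dependent probability distribution.
    probs = [0.7, 0.3] if state == 0 else [0.5, 0.5]
    return random.choices([0, 1], weights=probs)[0]

print(deterministic_policy(1))  # always 1
```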
Define value function V(s)
Expected cumulative reward starting from state s following policy π.
Define action-value function Q(s, a)
Expected cumulative reward starting from state s, taking action a, then following policy π.
What is an episode
A sequence of states, actions, and rewards that terminates in a terminal state.
What is a step / timestep
A single interaction cycle (state action reward next state).
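One interaction cycle repeated until termination gives the full episode loop. A minimal sketch, using a made-up toy environment (the dynamics here are assumptions for illustration, not a real library API):

```python
import random

# Hypothetical toy environment: walk from state 0 to state 3.
def env_step(state, action):
    # Returns (reward, next_state, done); dynamics are invented for illustration.
    next_state = min(state + action, 3)
    reward = 1.0 if next_state == 3 else 0.0
    done = next_state == 3
    return reward, next_state, done

state, total_reward, done = 0, 0.0, False
while not done:
    action = random.choice([0, 1])                 # agent picks action A_t
    reward, state, done = env_step(state, action)  # env returns R_{t+1}, S_{t+1}
    total_reward += reward

print(total_reward)  # 1.0: reward is given only on reaching the terminal state
```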
What is the Markov property
The future depends only on the current state not on past history.
Define MDP
A Markov decision process: a tuple (S, A, P, R, γ) of states, actions, transition probabilities, reward function, and discount factor.
Define transition function P(s' | s, a)
Probability of next state s' given current state s and action a.
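A small MDP can be written down as explicit tables. This is a hypothetical 2-state example (all numbers invented for illustration):

```python
# Hypothetical 2-state MDP specified as explicit tables.
states = [0, 1]
actions = ["stay", "go"]
gamma = 0.9

# P[s][a] -> list of (next_state, probability) pairs, i.e. P(s' | s, a)
P = {
    0: {"stay": [(0, 1.0)], "go": [(0, 0.2), (1, 0.8)]},
    1: {"stay": [(1, 1.0)], "go": [(0, 1.0)]},
}
# R[s][a] -> expected immediate reward for taking a in s
R = {0: {"stay": 0.0, "go": 1.0}, 1: {"stay": 0.5, "go": 0.0}}

# Sanity check: transition probabilities for each (s, a) sum to 1.
for s in states:
    for a in actions:
        assert abs(sum(p for _, p in P[s][a]) - 1.0) < 1e-9
```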
Define discount factor γ
Number between 0 and 1 controlling how much future rewards are valued.
Role of γ close to 0
Agent focuses on immediate rewards.
Role of γ close to 1
Agent heavily values long term rewards.
Return G_t
Sum of discounted rewards from timestep t: R_{t+1} + γR_{t+2} + γ²R_{t+3} + …
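The return formula above, and the effect of γ near 0 versus near 1, can be checked with a few lines of Python (the reward sequence is an arbitrary example):

```python
# G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ...
def discounted_return(rewards, gamma):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
print(discounted_return(rewards, 1.0))  # 3.0: all rewards weighted equally
print(discounted_return(rewards, 0.5))  # 1.75 = 1 + 0.5 + 0.25
```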
Objective in RL in terms of return
Maximize expected return E[G_t].
What defines optimal policy π*
The policy whose value function is at least as high as that of every other policy in every state: V_{π*}(s) ≥ V_π(s) for all s and all π.
Difference between model based and model free RL
Model-based RL learns or is given the transition and reward model and can plan with it; model-free RL learns values or a policy directly from experience, without modeling the environment's dynamics.