Reinforcement Learning - 12 Flashcards

(55 cards)

1
Q

Define reinforcement learning.

A

Reinforcement Learning is learning what to do, how to map situations to actions, to maximize a numerical reward signal.

The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.

All reinforcement learning agents have explicit goals, can sense aspects of their environments and can choose actions to influence their environments. Moreover, it is usually assumed from the beginning that the agent has to operate despite significant uncertainty about the environment it faces.

2
Q

In the real world, the deterministic successor assumption is often unrealistic. True or False?

A

True. There is randomness, i.e. taking an action might lead to any one of many possible states.

3
Q

If we increase the probability of slipping too much, then it starts being better if we opt for the ________ (_________ rewarding) final state.

A

safer; least

4
Q

RL problems can be expressed as a system consisting of an ________ and an _______________.

A

agent; environment

5
Q

Define environment.

A

An environment produces information that describes the state of the system. This is known as a state. After the agent selects an action, the environment accepts it and transitions into the next state. It then returns the next state and a reward to the agent.

6
Q

Define agent.

A

An agent interacts with an environment by observing the state and using this information to select an action.

7
Q

The cycle ___________ > ____________ > ____________ repeats until the _________________ terminates (or the problem is solved).

A

state; action; reward; environment
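
The state > action > reward cycle can be sketched as a minimal loop. The `env` and `agent` objects here are hypothetical stand-ins for illustration, not a specific library API:

```python
def run_episode(env, agent, max_steps=1000):
    """Run one episode of the state -> action -> reward cycle."""
    state = env.reset()                      # environment produces the initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)            # agent observes state, selects action
        state, reward, done = env.step(action)  # environment transitions, returns reward
        total_reward += reward
        if done:                             # cycle repeats until termination
            break
    return total_reward
```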

8
Q

An MDP can be represented as a graph. The nodes are states and chance nodes. Edges coming out of states are the possible ___________ from that state, which lead to ___________ nodes. A chance node (s, a) represents a state-action pair. Edges coming out of a chance node are the possible __________ ___________ of that __________, which end up back in _________. Our convention is to label these chance-to-state edges with the probability of a particular ____________ and the associated __________ for traversing that edge.

A

actions; chance; random outcomes; action; states; transition; reward

9
Q

Associated with each transition (s, a, s’) is a reward, which could be either positive or negative. True or False?

A

True

10
Q

In MDPs, we define a solution by using the notion of a policy. Define policy.

A

A policy is a mapping from each state to an action.

11
Q

We define a policy, which specifies an action for every _________, not just the states along a path as in search problems.

The best thing I can do is for every state to just tell you what is the best thing you can do for that particular state.

A

state

12
Q

We can maximize the total rewards (utility). True or False?

A

False. Utility is a random quantity so we cannot maximize it, we can only maximize the expected utility.

13
Q

Define expected utility (value).

A

It is the value of a policy: the expectation of the (random) utility over the paths generated by following the policy.

14
Q

Define utility.

A

Following a policy yields a random path. The utility of a policy is the (discounted) sum of the rewards on the path (this is a random quantity).

15
Q

The discounting parameter is applied ______________ to future rewards, so the distant future is always going to have a fairly ________ contribution to the utility (unless it is equal to __).

A

exponentially; small; 1
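
As a concrete illustration of exponential discounting (the reward sequence here is made up):

```python
def discounted_utility(rewards, gamma):
    """Utility = r_1 + gamma*r_2 + gamma^2*r_3 + ...
    The discount is applied exponentially to future rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# With gamma < 1, distant rewards contribute little; with gamma = 1 they count fully.
print(discounted_utility([1, 1, 1, 1], 0.5))  # 1 + 0.5 + 0.25 + 0.125 = 1.875
print(discounted_utility([1, 1, 1, 1], 1.0))  # 4.0
```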

16
Q

A larger value of the discount parameter actually means that the future is discounted _________.

A

less

17
Q

Define episode.

A

Given a policy and an MDP, following the policy produces a sequence (action, reward, new state). We call such a sequence an episode (a path in the MDP graph). Each episode is associated with a utility.

18
Q

Define Q-value of a policy.

A

It’s the expected utility of taking an action a from state s, and then following policy pi.

19
Q

Define value of a policy.

A

It’s the expected utility received by following policy pi from state s.

20
Q

In terms of the MDP graph, one can think of the value as labeling the ________ nodes, and the Q as labeling the _________ nodes.

A

state; chance

21
Q

The plan is to define recurrences relating the value and the Q-value. How do we proceed?

A

First, we get the value of state s, by just following the action edge specified by the policy and taking the Q-value.

Second, we get the Q-value by considering all possible transitions to successor states s’ and taking the expectation over the immediate reward plus the discounted future reward.
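
In symbols, writing T for the transition probabilities and R for the rewards, the two recurrences are:

```latex
V_\pi(s) = Q_\pi(s, \pi(s))

Q_\pi(s, a) = \sum_{s'} T(s, a, s') \big[ R(s, a, s') + \gamma \, V_\pi(s') \big]
```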

22
Q

For a much larger MDP with a large number of states, how can we efficiently compute the value of a policy?

A

With iterative algorithms: start with arbitrary values and repeatedly apply the recurrences until the values converge to the true ones.
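
A sketch of that iterative scheme for a small tabular MDP. The dict-based `mdp` layout is an assumption for illustration:

```python
def evaluate_policy(mdp, policy, gamma=0.9, iters=100):
    """Iteratively apply V(s) = sum_{s'} T(s,a,s') * [R + gamma * V(s')] with a = policy[s].

    mdp[s][a] is a list of (prob, next_state, reward) triples;
    terminal states simply have no entry in mdp.
    """
    V = {s: 0.0 for s in mdp}  # start from arbitrary (zero) values
    for _ in range(iters):
        V = {s: sum(p * (r + gamma * V.get(s2, 0.0))
                    for p, s2, r in mdp[s][policy[s]])
             for s in mdp}
    return V
```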

23
Q

We do not have a dependence on the number of actions because we have a fixed policy and we only need to look at the action specified by the policy. True or False?

A

True.
24
Q

The number of ____________ is exponential in the number of states.

A

policies
25
Q

Define optimal value.

A

The optimal value is the maximum value attained by any policy.
26
Q

Value iteration is guaranteed to converge to the optimal value. True or False?

A

True
27
Q

If the discount factor is ___ 1 or if the MDP is __________, then value iteration converges to the correct answer.

A

<; acyclic
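
Value iteration can be sketched the same way as policy evaluation, but taking a max over actions instead of following a fixed policy. The dict-based tabular layout is an assumption for illustration:

```python
def value_iteration(mdp, gamma=0.9, iters=100):
    """Repeatedly apply V_opt(s) = max_a sum_{s'} T(s,a,s') * [R + gamma * V_opt(s')].

    mdp[s][a] is a list of (prob, next_state, reward) triples;
    terminal states have no entry in mdp.
    """
    V = {s: 0.0 for s in mdp}
    for _ in range(iters):
        V = {s: max(sum(p * (r + gamma * V.get(s2, 0.0))
                        for p, s2, r in mdp[s][a])
                    for a in mdp[s])
             for s in mdp}
    return V
```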
28
Q

If the MDP graph has cycles, the discount factor should be less than 1. True or False?

A

True
29
Q

When the graph has cycles and the discount factor is equal to 1, if you are getting ________ rewards, you are never going to change or move from your states. You are always going to be stuck in your state.

A

zero
30
Q

When the graph has cycles and the discount factor is equal to 1, if you have non-zero rewards, you are getting an ____________ ___________. You would never ____________, so we would never find out what your utility was at the end, because it just ends up becoming numerically difficult.

A

unbounded reward; terminate
31
Q

Distinguish MDPs from reinforcement learning.

A

MDPs are offline and reinforcement learning is online. In MDPs, you have a mental model of how the world works: you go lock yourself in a room, think really hard, and come up with a policy. Then you come out and use it to act in the real world. In RL, you do not know how the world works, and you only have one life, so you have to go out into the real world and learn how it works by experiencing it and trying to take actions that yield high rewards.
32
Q

In RL we do not have access to ___________ nor to ____________.

A

T(s, a, s'); Reward(s, a, s')
33
Q

We can think of an agent (the reinforcement learning algorithm) that repeatedly chooses an ________ to perform in the environment, and receives some ___________ and information about the new state. The two big questions are how to choose __________ and how to update parameters.

A

action; reward; actions
34
Q

In model-based Monte Carlo, how do we compute the transitions?

A

The count of how often we observed an action a taken from state s leading to state s', divided by the total number of times we executed action a in state s.
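
A minimal sketch of those counting estimates over a log of observed (s, a, s') transitions; the input data format is an assumption for illustration:

```python
from collections import Counter

def estimate_transitions(transitions):
    """T_hat(s, a, s') = count(s, a, s') / count(s, a) from observed data."""
    joint = Counter((s, a, s2) for s, a, s2 in transitions)   # times (s, a) led to s'
    pair = Counter((s, a) for s, a, _ in transitions)         # times a was taken in s
    return {(s, a, s2): n / pair[(s, a)] for (s, a, s2), n in joint.items()}
```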
35
Q

What's the problem with model-based Monte Carlo?

A

So far, our policies have been deterministic, always mapping s to pi(s). However, if we use such a policy to generate our data, there are certain (s, a) pairs that we will never see, and therefore we will never be able to estimate their Q-values or know what the effect of those actions is.
36
Q

To do reinforcement learning, we need to __________ the state space.

A

explore
37
Q

Distinguish reinforcement learning from supervised learning.

A

In reinforcement learning, we actually have to act to get data, rather than just having data poured over us, like in supervised learning.
38
Q

If the goal is just to find good policies, all we need is a good estimate of the optimal ____________. From that perspective, estimating the model (transitions and rewards) was just a means towards an end; we can skip explicitly estimating the transitions and rewards and directly compute the optimal Q-value. This is called ________________ learning.

A

Q-value; model-free
39
Q

Explain how model-free learning proceeds.

A

If we take the average of u_t over all times at which s_t = s and a_t = a, then we obtain our Monte Carlo estimate of the Q-value.
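
That averaging step might look like this, given episodes already reduced to (state, action, utility-from-that-point) triples; that preprocessing is assumed here:

```python
from collections import defaultdict

def monte_carlo_q(samples):
    """Q_hat(s, a) = average of observed utilities u over all visits with s_t = s, a_t = a."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for s, a, u in samples:
        totals[(s, a)] += u
        counts[(s, a)] += 1
    return {sa: totals[sa] / counts[sa] for sa in totals}
```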
40
Q

In model-free Monte Carlo, the target was u, the discounted sum of rewards after taking an action. However, u itself is just an estimate of the Q-value. If the episode is _______, u will be a pretty _________ estimate, because u corresponds to only one __________ out of an exponential number of possible episodes; as the episode lengthens, it becomes an increasingly ________ representative sample of what could happen.

A

long; lousy; episode; less
41
Q

An alternative to model-free Monte Carlo is SARSA, whose target is a combination of the _______ (reward) and the __________ (Q-value * gamma).

A

data; estimate
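
A single SARSA update, moving toward that mixed data-plus-estimate target with learning rate eta. The tabular dict Q and the parameter names are assumptions for illustration:

```python
def sarsa_update(Q, s, a, r, s2, a2, gamma=0.9, eta=0.1):
    """Move Q(s, a) toward the target r + gamma * Q(s', a'):
    part data (the observed reward r), part estimate (gamma * Q(s', a'))."""
    target = r + gamma * Q.get((s2, a2), 0.0)
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * target
```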
42
Q

Define on-policy.

A

Estimate the value of the data-generating policy: learn about pi while applying pi itself. May get stuck in local optima.
43
Q

Define off-policy.

A

Estimate the value of another policy: learn about pi while applying some other policy. Won't get stuck in local optima (given enough experience).
44
Q

__________ uses the estimated Q-value instead of just the raw data u.

A

SARSA
45
Q

What are the advantages and disadvantages of model-free Monte Carlo?

A

It is based on one path and is unbiased. However, it has large variance and must wait until the end of the episode to update.
46
Q

What are the advantages and disadvantages of SARSA?

A

It is based on an estimate, so it is biased. However, it has small variance and can update immediately.
47
Q

If you increase numEpisodes to 1000, SARSA will behave very much like ______________ ________ _________, computing the value of the random policy. However, note that Q-learning is computing an estimate of the optimal Q-value, so the resulting Q-values will be very ___________. The average ___________ will not change (in the volcano crossing example from the slides), since we are still following and being evaluated on the same random policy. This is an important point for off-policy methods: the online performance (average utility) is generally a lot __________ and not representative of what the model has learned, which is captured in the estimated Q-values.

A

model-free Monte Carlo; different; utility; worse
48
Q

What problem leads us to choose Q-learning?

A

Model-free Monte Carlo and SARSA only estimate Q, but our goal is to get an optimal policy, which means estimating Qopt.
49
Q

Q-learning is an _______-policy algorithm.

A

off
50
Q

SARSA requires knowing what _______ I am going to take next. In Q-learning, it does not matter what action you took, since you are just going to use the one that maximizes the _____________ _____________ of the ____________ Q-value of the _________ __________.

A

action; discounted estimate; optimal; next state
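
The corresponding Q-learning update: instead of the action actually taken next, it maximizes over the actions available in the next state. The tabular dict layout and parameter names are assumptions for illustration:

```python
def q_learning_update(Q, actions, s, a, r, s2, gamma=0.9, eta=0.1):
    """Move Q(s, a) toward r + gamma * max_a' Q(s', a'); the action actually
    taken in s' is irrelevant, which is what makes Q-learning off-policy."""
    best_next = max(Q.get((s2, a2), 0.0) for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * target
```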
51
Q

How do we choose what exploration policy to use?

A

If we choose the optimal policy according to the estimated Q-values, the agent becomes too greedy and the estimates stay inaccurate (all exploitation). We can go to the other extreme and use an exploration policy that always chooses a random action. It will do a much better job of exploration, but it does not exploit what it learns and ends up with a very low utility (the exploration is not guided). Therefore, we need to balance exploration and exploitation. The natural thing to do when you have two extremes is to interpolate between them. The result is the epsilon-greedy algorithm, which explores with probability epsilon and exploits with probability 1 - epsilon.
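
The epsilon-greedy interpolation in code; the tabular Q dict and action list are assumptions for illustration:

```python
import random

def epsilon_greedy(Q, actions, s, epsilon=0.1):
    """With probability epsilon, explore (random action);
    with probability 1 - epsilon, exploit (argmax of the estimated Q)."""
    if random.random() < epsilon:
        return random.choice(actions)                           # exploration
    return max(actions, key=lambda a: Q.get((s, a), 0.0))       # exploitation
```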
52
Q

Epsilon cannot be reduced over time. True or False?

A

False. Epsilon can be reduced over time, shifting from exploration towards exploitation as the estimates improve.
53
Q

__________ ________ algorithms estimate transitions, rewards and Q-values from data.

A

Monte Carlo
54
Q

_______________ algorithms update towards a target that depends on an estimate rather than just raw data.

A

Bootstrapping
55
Q

Which algorithms are bootstrapping algorithms?

A

SARSA and Q-learning.