Define reinforcement learning.
Reinforcement learning is learning what to do, that is, how to map situations to actions, so as to maximize a numerical reward signal.
The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.
All reinforcement learning agents have explicit goals, can sense aspects of their environments, and can choose actions to influence their environments. Moreover, it is usually assumed from the beginning that the agent has to operate despite significant uncertainty about the environment it faces.
In the real world, the deterministic successor assumption is often unrealistic. True or False?
True. There is randomness, i.e. taking an action might lead to any one of many possible states.
If we increase the probability of slipping too much, then it becomes better to opt for the ________ (_________ rewarding) final state.
safer; least
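To make this concrete, here is a hypothetical comparison (all payoffs are invented for illustration): a risky path toward a high-reward final state that fails with the slip probability, versus a safe path to a lower-reward final state.

```python
# Hypothetical numbers (not from the card): the risky path reaches a
# +100 final state but, on a slip, falls to a -50 outcome; the safe
# path reaches a +20 final state with no slip risk.
def risky_value(p_slip, win=100, cliff=-50):
    """Expected reward of the risky path given the slip probability."""
    return (1 - p_slip) * win + p_slip * cliff

safe_value = 20

for p in (0.1, 0.3, 0.6):
    better = "risky" if risky_value(p) > safe_value else "safe"
    print(f"p_slip={p}: risky={risky_value(p):.1f}, safe={safe_value} -> {better}")
```

With these numbers, the risky path wins at low slip probabilities, but once slipping becomes likely enough the safer, least rewarding final state is the better choice.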
RL problems can be expressed as a system consisting of an ________ and an _______________.
agent; environment
Define environment.
An environment produces information that describes the state of the system. This is known as a state. After the agent selects an action, the environment accepts it and transitions into the next state. It then returns the next state and a reward to the agent.
Define agent.
An agent interacts with an environment by observing the state and using this information to select an action.
The cycle ___________ > ____________ > ____________ repeats until the _________________ terminates (or the problem is solved).
state; action; reward; environment
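A minimal sketch of this cycle in Python; the two-state environment here is a toy invented for illustration:

```python
# Toy environment: state "A" leads to "B" (reward 1), and "B" leads to
# a terminal state "end" (reward 5).
def step(state, action):
    """Environment: accept an action, return (next_state, reward)."""
    if state == "A":
        return ("B", 1) if action == "go" else ("A", 0)
    return ("end", 5)

state, total = "A", 0
while state != "end":                    # repeat until the environment terminates
    action = "go"                        # agent: observe the state, select an action
    state, reward = step(state, action)  # environment: next state and reward
    total += reward
print(total)  # 6
```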
An MDP can be represented as a graph. The nodes are states and chance nodes. Edges coming out of a state are the possible ___________ from that state, which lead to ___________ nodes. A chance node (s, a) represents a state-action pair. Edges coming out of a chance node are the possible __________ ___________ of that __________, which end up back in _________. Our convention is to label these chance-to-state edges with the probability of a particular ____________ and the associated __________ for traversing that edge.
actions; chance; random outcomes; action; states; transition; reward
Associated with each transition (s, a, s’) is a reward, which could be either positive or negative. True or False?
True
In MDPs, we define a solution by using the notion of a policy. Define policy.
A policy is a mapping from each state to an action.
We define a policy, which specifies an action for every _________, not just the states along a path as in search problems.
state
In other words, for every state, the policy simply tells you what action to take in that particular state.
We can maximize the total rewards (utility). True or False?
False. Utility is a random quantity so we cannot maximize it, we can only maximize the expected utility.
Define expected utility (value).
It is the expectation of the (random) utility obtained by following the policy; this expected utility is called the value of the policy.
Define utility.
Following a policy yields a random path. The utility of a policy is the (discounted) sum of the rewards on the path (this is a random quantity).
The discounting parameter is applied ______________ to future rewards, so the distant future is always going to have a fairly ________ contribution to the utility (unless it is equal to __).
exponentially; small; 1
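A quick numerical illustration: a reward r received t steps in the future contributes gamma ** t * r to the utility, and gamma ** t shrinks exponentially unless gamma is 1.

```python
# The weight gamma ** t on a reward t steps away, for a few discount
# values; with gamma = 1.0 nothing is discounted.
for gamma in (0.5, 0.9, 1.0):
    weights = [gamma ** t for t in range(5)]
    print(gamma, [round(w, 3) for w in weights])
```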
A larger value of the discount parameter actually means that the future is discounted _________.
less
Define episode.
Given a policy and an MDP, following the policy produces a sequence (action, reward, new state). We call such a sequence an episode (a path in the MDP graph). Each episode is associated with a utility.
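A sketch of generating episodes, using a made-up one-action MDP: each run of the same policy yields a (possibly different) sequence of (action, reward, new state) triples with its own utility, which is exactly why the utility is a random quantity.

```python
import random

random.seed(0)

def run_episode(gamma=0.9):
    """Follow a fixed policy; return the episode and its discounted utility."""
    state, episode, utility, t = "start", [], 0.0, 0
    while state != "end":
        action = "go"                                     # the fixed policy
        new_state = "end" if random.random() < 0.5 else "start"
        reward = 10 if new_state == "end" else 1          # invented rewards
        episode.append((action, reward, new_state))
        utility += gamma ** t * reward                    # discounted sum
        state, t = new_state, t + 1
    return episode, utility

for _ in range(3):
    episode, utility = run_episode()
    print(len(episode), round(utility, 3))
```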
Define Q-value of a policy.
It’s the expected utility of taking an action a from state s, and then following policy pi.
Define value of a policy.
It’s the expected utility received by following policy pi from state s.
In terms of the MDP graph, one can think of the value as labeling the ________ nodes, and the Q as labeling the _________ nodes.
state; chance
The plan is to define recurrences relating the value and the Q-value. How do we proceed?
First, we get the value of state s, by just following the action edge specified by the policy and taking the Q-value.
Second, we get the Q-value by considering all possible transitions to successor states s’ and taking the expectation over the immediate reward plus the discounted future reward.
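These two recurrences translate directly into code. The representation below, where `transitions[(s, a)]` is a list of `(prob, next_state, reward)` triples, is an assumption made for illustration:

```python
def q_value(transitions, V, gamma, s, a):
    """Q_pi(s, a): expectation over successors s' of r + gamma * V(s')."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[(s, a)])

def value(transitions, V, gamma, policy, s):
    """V_pi(s): follow the action edge given by the policy, take its Q-value."""
    return q_value(transitions, V, gamma, s, policy[s])
```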
For a much larger MDP with large number of states, how can we efficiently compute the value of a policy?
With iterative algorithms: start with arbitrary values and repeatedly apply the recurrences until the values converge to the true values.
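A sketch of this iterative scheme (iterative policy evaluation) on a made-up two-state MDP, repeatedly applying V(s) <- sum over s' of T(s, pi(s), s') * (r + gamma * V(s')):

```python
# Hypothetical MDP: transitions[(s, a)] lists (prob, next_state, reward).
transitions = {
    ("A", "stay"): [(1.0, "A", 1)],
    ("B", "move"): [(0.5, "A", 0), (0.5, "B", 2)],
}
policy = {"A": "stay", "B": "move"}
gamma = 0.5

V = {s: 0.0 for s in ("A", "B")}   # arbitrary initial values
for _ in range(100):               # repeat until (near) convergence
    V = {s: sum(p * (r + gamma * V[s2])
                for p, s2, r in transitions[(s, policy[s])])
         for s in V}
print(V)
```

Note that each update only looks at the single action `policy[s]`, which is why there is no dependence on the number of actions.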
We do not have a dependence on the number of actions because we have a fixed policy and we only need to look at the action specified by the policy. True or False?
True
The number of ____________ is exponential in the number of states.
policies
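A quick back-of-the-envelope check: a deterministic policy picks one of |A| actions for each of |S| states, so there are |A| ** |S| policies, which rules out naive enumeration.

```python
# Number of deterministic policies for |A| = 4 actions as the number
# of states grows: 4 ** |S| blows up fast.
n_actions = 4
for n_states in (5, 10, 20):
    print(n_states, n_actions ** n_states)
```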