reinforcement learning
inspired by operant conditioning
contrasts with the supervised-learning method
requires no labeled training examples
an AGENT—the learning program—performs ACTIONS in an ENVIRONMENT (usually a computer simulation) and occasionally receives REWARDS from the environment. These intermittent rewards are the only feedback the agent uses for learning.
The promise of reinforcement learning
the agent can learn flexible strategies on its own simply by performing actions in the world and occasionally receiving rewards (that is, reinforcement) without humans having to MANUALLY WRITE RULES or DIRECTLY TEACH THE AGENT EVERY POSSIBLE CIRCUMSTANCE
state
the state of an agent at a given time is the agent’s perception of its current situation.
In the purest form of reinforcement learning, the learning agent doesn’t remember its previous states.
what does the algorithm do
tells the agent how to learn from its experiences.
Reinforcement learning occurs by
having the agent take actions over a series of learning EPISODES, each of which consists of some number of ITERATIONS.
What does the agent learn?
upon receiving a reward, the agent learns only about:
the STATE and the ACTION that immediately preceded the reward
the value of an action
the value of action A in state S is a number reflecting the agent’s current prediction of how much reward it will EVENTUALLY obtain if, when in state S, it performs action A, AND THEN CONTINUES PERFORMING HIGH-VALUE ACTIONS
the goal of reinforcement learning
for the agent to learn values that are good predictions of upcoming rewards (assuming that the agent keeps doing the right thing after taking the action in question)
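In standard notation (not used in these notes, but conventional), this learned value is written Q(S, A), and "good predictions of upcoming rewards, assuming the agent keeps doing the right thing afterward" corresponds to the recursive target below, where R is the reward received, S' is the resulting state, and γ is a discount factor from the standard formulation:

```latex
Q(S, A) \;\approx\; R + \gamma \max_{A'} Q(S', A')
```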
Q-table
a table of states, actions, and values
Given a state, each action in that state has a numerical value; these values will change—becoming more accurate predictions of upcoming rewards—as Rosie continues to learn.
Reinforcement learning here is the gradual updating of values in the Q-table.
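A Q-table like the one described can be sketched as a mapping from states to per-action values; the state and action names below are illustrative placeholders, not from the text:

```python
# A minimal Q-table: each state maps every action to an estimated value.
from collections import defaultdict

ACTIONS = ["Forward", "Backward", "TurnLeft", "TurnRight"]

# Any state not yet visited starts with all action values at 0.0.
q_table = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

# Look up the action values for some (hypothetical) state:
values = q_table[("x=3", "y=1")]
best_action = max(values, key=values.get)  # ties broken arbitrarily
```

These values start out meaningless and become better predictions of upcoming reward as learning updates them.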
essence of q-learning
Rosie can now learn something about the action (Forward) she took in the immediately previous state (one step away).
so, memory
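The one-step learning this describes—updating only the immediately preceding state and action after a reward—is the standard Q-learning update rule; the learning rate and discount values below are illustrative assumptions:

```python
def q_update(q_table, state, action, reward, next_state,
             alpha=0.1, gamma=0.9):
    """Nudge Q(state, action) toward reward + discounted best next value."""
    best_next = max(q_table[next_state].values()) if q_table[next_state] else 0.0
    target = reward + gamma * best_next
    q_table[state][action] += alpha * (target - q_table[state][action])

# One update: the agent took "Forward" in state s, got reward 10, reached s2.
q = {"s": {"Forward": 0.0}, "s2": {"Forward": 0.0}}
q_update(q, "s", "Forward", 10.0, "s2")
# q["s"]["Forward"] moves a fraction alpha of the way toward the target.
```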
exploration versus exploitation balance
Deciding how much to explore new actions and how much to exploit actions already known to have high value.
A naive strategy would be to always choose the action with the highest value for the current state in the Q-table.
Achieving the right balance is a core issue for making reinforcement learning successful.
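One common way to strike this balance (not the only one) is an epsilon-greedy rule: usually exploit the highest-valued action, but with small probability epsilon try a random action instead. The epsilon value here is an illustrative assumption:

```python
import random

def choose_action(q_values, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    actions = list(q_values)
    if random.random() < epsilon:
        return random.choice(actions)      # explore: any action at random
    return max(actions, key=q_values.get)  # exploit: best-known action

q_values = {"Forward": 2.0, "Backward": 0.5}
action = choose_action(q_values, epsilon=0.0)  # pure exploitation
```

Setting epsilon to 0 gives exactly the naive always-exploit strategy described above, which can lock the agent into early, suboptimal habits.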
two major stumbling blocks might arise in extrapolating our “training Rosie” example to reinforcement learning in real-world tasks.
the best-known reinforcement-learning successes have been in the domain of game playing.
episode of Q-learning
at each iteration the learning agent does the following: observes its current state, chooses an action (balancing exploration and exploitation), performs the action and collects any reward, and updates the value of the preceding state–action pair in the Q-table.
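One such iteration can be sketched as a minimal tabular Q-learning loop; the two-state toy environment, episode length, and hyperparameters below are illustrative assumptions, not from the text:

```python
import random
from collections import defaultdict

random.seed(0)  # reproducible illustration

class ToyEnv:
    """Illustrative two-state world: reward 1.0 for reaching state 1."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, action):
        self.s = 1 if action == "Forward" else 0
        reward = 1.0 if self.s == 1 else 0.0
        return self.s, reward, self.s == 1  # (next state, reward, done)

ACTIONS = ["Forward", "Backward"]

def run_episode(env, q_table, episode_len=100,
                alpha=0.1, gamma=0.9, epsilon=0.1):
    state = env.reset()
    for _ in range(episode_len):
        # 1. Observe the current state; choose an action (epsilon-greedy).
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q_table[state][a])
        # 2. Perform the action; receive a reward and the next state.
        next_state, reward, done = env.step(action)
        # 3. Update the value of the preceding (state, action) pair.
        target = reward + gamma * max(q_table[next_state][a] for a in ACTIONS)
        q_table[state][action] += alpha * (target - q_table[state][action])
        state = next_state
        if done:
            break

q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
for _ in range(200):
    run_episode(ToyEnv(), q)
```

After a few hundred episodes, the value of "Forward" in state 0 approaches the reward it reliably produces, while "Backward" settles lower—the gradual Q-table updating described above.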