deep q-learning
The DeepMind group combined Q-learning with deep neural networks
Deep Q-learning is exactly the same as Q-learning, except that a convolutional neural network takes the place of the Q-table.
the Deep Q-Network (DQN)
The input:
the state of the system at a given time
The outputs:
the estimated values for each possible action, given the input state.
The network itself is a ConvNet
> Instead of the values in a Q-table, it is the weights in this network that are learned.
the state of a DQN
the current “frame”: the pixels of the current screen
plus three prior frames
This definition of state provides the system with a small amount of memory
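A minimal sketch of this frame-stacked state, assuming the 84 x 84 preprocessed frames DeepMind used; the array sizes and the `observe` helper are illustrative, not from the text:

```python
import numpy as np
from collections import deque

# DQN-style state: the current frame stacked with the three prior frames.
FRAME_SHAPE = (84, 84)

frames = deque(maxlen=4)                # automatically drops the oldest frame
for _ in range(4):                      # start with four blank frames
    frames.append(np.zeros(FRAME_SHAPE, dtype=np.float32))

def observe(new_frame):
    """Push the newest screen frame and return the stacked state."""
    frames.append(new_frame)
    return np.stack(frames)             # shape (4, 84, 84)

state = observe(np.ones(FRAME_SHAPE, dtype=np.float32))
print(state.shape)                      # (4, 84, 84)
```

The four-frame window is what gives the system its "small amount of memory": motion (e.g. a ball's direction) is visible only across consecutive frames, not in any single frame.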
updating weights with supervised learning
back-propagation works by changing a neural network’s weights so as to reduce the error in the network’s outputs.
learning to match its outputs to human-given labels
updating weights with reinforcement learning
temporal difference learning
the network learns to make its outputs consistent from one iteration to the next, assuming that later iterations give better estimates of value than earlier iterations
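A toy tabular illustration of this consistency idea (the states, actions, and values below are invented): each update nudges the earlier estimate Q(s, a) toward the later, presumably better estimate r + gamma * max Q(s', a'):

```python
# Minimal tabular Q-learning / temporal-difference update.
alpha, gamma = 0.1, 0.99

Q = {("s0", "left"): 0.0, ("s0", "right"): 0.0,
     ("s1", "left"): 5.0, ("s1", "right"): 2.0}

def td_update(s, a, reward, s_next, actions=("left", "right")):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    target = reward + gamma * best_next          # later, better estimate
    Q[(s, a)] += alpha * (target - Q[(s, a)])    # move earlier estimate toward it

td_update("s0", "right", reward=1.0, s_next="s1")
print(Q[("s0", "right")])   # ≈ 0.595, i.e. alpha * (1.0 + 0.99 * 5.0)
```

In a DQN the same target is used, but instead of overwriting a table entry, back-propagation adjusts the network's weights to reduce the gap between output and target.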
q-learning process
minimax
Once each of the bottom-row board positions has been evaluated, minimax uses these values from the end of the look-ahead process to rate the program’s immediate possible moves from its current board position.
The program then chooses the highest-rated move.
After an evaluation function assigns a value to all possible board positions, the evaluations are propagated back up the tree by minimax so that the highest-rated move can be chosen
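This evaluate-then-propagate pattern can be sketched as follows; the tiny two-level game tree and its leaf scores are made up for illustration:

```python
# Minimal minimax: evaluate positions at the look-ahead frontier,
# then propagate values up so the root can pick the highest-rated move.

def minimax(node, maximizing):
    if isinstance(node, (int, float)):  # frontier: evaluation-function score
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# Root has two moves; the opponent (minimizing) replies at the next level.
tree = [
    [3, 5],     # move A: opponent picks the minimum -> 3
    [2, 9],     # move B: opponent picks the minimum -> 2
]
move_values = [minimax(child, maximizing=False) for child in tree]
best_move = move_values.index(max(move_values))
print(move_values, best_move)   # [3, 2] 0
```

Note that move B contains the single best outcome (9), but minimax assumes the opponent plays well, so move A's guaranteed 3 beats move B's guaranteed 2.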
Deep Blue used much the same method as Samuel’s checkers player:
at a given turn, it created a partial game tree using the current board position as the root
it applied its evaluation function to the furthest layer in the tree
then used the minimax algorithm to propagate the values up the tree in order to determine which move it should make.
The major differences between Samuel’s program and Deep Blue: Deep Blue’s evaluation function was hand-designed by human experts rather than learned, and its look-ahead search was far deeper, run on specialized hardware.
Go programs are not able to use this strategy, for two reasons: Go’s branching factor is far too large for extensive look-ahead, and no one has been able to design a good evaluation function for Go board positions.
AlphaGo
Unlike Deep Blue, AlphaGo acquired its abilities by reinforcement learning via self-play.
used an intricate mix of deep Q-learning, “Monte Carlo tree search,” supervised learning, and specialized Go knowledge.
Monte Carlo Tree Search (MCTS)
The roulette-wheel-inspired randomness is used in deciding how to do the look-ahead.
The program imagines (that is, simulates) several possible ways the game could play out, counts how many wins and losses those hypothetical sequences lead to, and uses those counts to give a score to each of its possible moves.
the program chooses moves probabilistically based on any scores that those moves might have from previous rounds of Monte Carlo tree search.
At first, the program’s choice of moves from a given board position is quite random (the program is doing the equivalent of spinning a roulette wheel to choose a move), but as the program performs additional roll-outs, generating additional statistics, it is increasingly biased to choose those moves that in past roll-outs led to the most wins.
In this way, Monte Carlo tree search doesn’t have to guess from just looking at a board position which move is most likely to lead to a win; it uses its roll-outs to collect statistics on how many times a given move actually leads to a win or loss.
MCTS rollout
Such a simulation, starting from a given board position, is called a roll-out from that position
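A toy sketch of collecting roll-out statistics. The two candidate moves and their underlying win probabilities are invented; a real roll-out would simulate actual game play to the end rather than draw from known probabilities:

```python
import random

random.seed(0)
true_win_prob = {"move_a": 0.7, "move_b": 0.4}   # hypothetical, hidden from the program

def rollout(move):
    """Simulate one play-out from the position after `move`; True means a win."""
    return random.random() < true_win_prob[move]

# Tally wins and plays per move over many roll-outs.
stats = {m: {"wins": 0, "plays": 0} for m in true_win_prob}
for _ in range(1000):
    for m in stats:
        stats[m]["plays"] += 1
        stats[m]["wins"] += rollout(m)

scores = {m: s["wins"] / s["plays"] for m, s in stats.items()}
best = max(scores, key=scores.get)
print(best)   # the move with the higher empirical win rate emerges
```

This is the sense in which MCTS does not have to guess: the score for each move is an observed win frequency, not a static judgment of the board.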
The DeepMind group realized that they could improve their system by
complementing Monte Carlo tree search with a deep convolutional neural network.
Given the current board position as input, AlphaGo uses a trained deep convolutional neural network to assign a rough value to all possible moves from the current position. Then Monte Carlo tree search uses those values to kick-start its search: rather than initially choosing moves at random, Monte Carlo tree search uses values output by the ConvNet as an indicator of which initial moves should be preferred
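A minimal sketch of biasing move choice by network-predicted priors rather than choosing uniformly at random; the prior values here are an invented stand-in for a ConvNet's output:

```python
import random

random.seed(1)
priors = {"move_a": 0.6, "move_b": 0.3, "move_c": 0.1}   # stand-in for ConvNet output

def pick_move(priors):
    """Sample a move with probability proportional to its prior."""
    moves, weights = zip(*priors.items())
    return random.choices(moves, weights=weights, k=1)[0]

# Early in the search, high-prior moves get explored far more often.
counts = {m: 0 for m in priors}
for _ in range(10000):
    counts[pick_move(priors)] += 1
print(counts)   # roughly proportional to 0.6 / 0.3 / 0.1
```

In AlphaGo proper these priors are then blended with the roll-out win statistics as they accumulate, so the search shifts from trusting the network to trusting the observed counts.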
evaluation function
used to assign a score to each possible move from a given board position