deep q-learning
The DeepMind group combined Q-learning with deep neural networks
Deep Q-learning is exactly the same as Q-learning, except that a convolutional neural network takes the place of the Q-table.
the Deep Q-Network (DQN)
The input:
the state of the system at a given time
The outputs:
the estimated values for each possible action, given the input state.
The network itself is a ConvNet
> Instead of the values in a Q-table, it is the weights in this network that are learned.
the state of a DQN
the current “frame”: the pixels of the current screen
plus three prior frames
This definition of state provides the system with a small amount of memory
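A minimal sketch of this frame-stacked state, assuming the 84 x 84 preprocessed frames DeepMind used; the array sizes and the `observe` helper are illustrative, not from the text:

```python
import numpy as np
from collections import deque

# DQN-style state: the current frame stacked with the three prior frames.
FRAME_SHAPE = (84, 84)

frames = deque(maxlen=4)                # automatically drops the oldest frame
for _ in range(4):                      # start with four blank frames
    frames.append(np.zeros(FRAME_SHAPE, dtype=np.float32))

def observe(new_frame):
    """Push the newest screen frame and return the stacked state."""
    frames.append(new_frame)
    return np.stack(frames)             # shape (4, 84, 84)

state = observe(np.ones(FRAME_SHAPE, dtype=np.float32))
print(state.shape)                      # (4, 84, 84)
```

The four-frame window is what gives the system its "small amount of memory": motion (e.g. a ball's direction) is visible only across consecutive frames, not in any single frame.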
updating weights with supervised learning
back-propagation works by changing a neural network’s weights so as to reduce the error in the network’s outputs.
learning to match its outputs to human-given labels
updating weights with reinforcement learning
temporal difference learning
the network learns to make its outputs consistent from one iteration to the next, assuming that later iterations give better estimates of value than earlier iterations
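A toy tabular illustration of this consistency idea (the states, actions, and values below are invented): each update nudges the earlier estimate Q(s, a) toward the later, presumably better estimate r + gamma * max Q(s', a'):

```python
# Minimal tabular Q-learning / temporal-difference update.
alpha, gamma = 0.1, 0.99

Q = {("s0", "left"): 0.0, ("s0", "right"): 0.0,
     ("s1", "left"): 5.0, ("s1", "right"): 2.0}

def td_update(s, a, reward, s_next, actions=("left", "right")):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    target = reward + gamma * best_next          # later, better estimate
    Q[(s, a)] += alpha * (target - Q[(s, a)])    # move earlier estimate toward it

td_update("s0", "right", reward=1.0, s_next="s1")
print(Q[("s0", "right")])   # ≈ 0.595, i.e. alpha * (1.0 + 0.99 * 5.0)
```

In a DQN the same target is used, but instead of overwriting a table entry, back-propagation adjusts the network's weights to reduce the gap between output and target.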
q-learning process
minimax
Once each of the bottom-row board positions has been evaluated, minimax uses these values from the end of the look-ahead process to rate the program’s immediate possible moves from its current board position.
The program then chooses the highest-rated move.
After an evaluation function assigns a value to all possible board positions, the evaluations are propagated back up the tree by minimax so that the highest-rated move can be chosen
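This evaluate-then-propagate pattern can be sketched as follows; the tiny two-level game tree and its leaf scores are made up for illustration:

```python
# Minimal minimax: evaluate positions at the look-ahead frontier,
# then propagate values up so the root can pick the highest-rated move.

def minimax(node, maximizing):
    if isinstance(node, (int, float)):  # frontier: evaluation-function score
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# Root has two moves; the opponent (minimizing) replies at the next level.
tree = [
    [3, 5],     # move A: opponent picks the minimum -> 3
    [2, 9],     # move B: opponent picks the minimum -> 2
]
move_values = [minimax(child, maximizing=False) for child in tree]
best_move = move_values.index(max(move_values))
print(move_values, best_move)   # [3, 2] 0
```

Note that move B contains the single best outcome (9), but minimax assumes the opponent plays well, so move A's guaranteed 3 beats move B's guaranteed 2.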
Deep Blue used much the same method as Samuel’s checkers player:
at a given turn, it created a partial game tree using the current board position as the root
it applied its evaluation function to the furthest layer in the tree
then used the minimax algorithm to propagate the values up the tree in order to determine which move it should make.
The major differences between Samuel’s program and Deep Blue: Deep Blue’s evaluation function was hand-designed by human experts rather than learned, and its look-ahead search was far deeper, run on specialized hardware.
Go programs are not able to use this strategy, for two reasons: Go’s branching factor is far too large for extensive look-ahead, and no one has been able to design a good evaluation function for Go board positions.
AlphaGo
Unlike Deep Blue, AlphaGo acquired its abilities by reinforcement learning via self-play.
used an intricate mix of deep Q-learning, “Monte Carlo tree search,” supervised learning, and specialized Go knowledge.
Monte Carlo Tree Search (MCTS)
The roulette-wheel-inspired randomness is used in deciding how to do the look-ahead.
The program imagines (that is, simulates) several possible ways the game could play out, counts how many wins and losses those hypothetical sequences lead to, and uses those counts to give a score to each of its possible moves.
the program chooses moves probabilistically based on any scores that those moves might have from previous rounds of Monte Carlo tree search.
At first, the program’s choice of moves from a given board position is quite random (the program is doing the equivalent of spinning a roulette wheel to choose a move), but as the program performs additional roll-outs, generating additional statistics, it is increasingly biased to choose those moves that in past roll-outs led to the most wins.
In this way, Monte Carlo tree search doesn’t have to guess from just looking at a board position which move is most likely to lead to a win; it uses its roll-outs to collect statistics on how many times a given move actually leads to a win or loss.
MCTS rollout
Such a simulation, starting from a given board position, is called a roll-out from that position
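A toy sketch of collecting roll-out statistics. The two candidate moves and their underlying win probabilities are invented; a real roll-out would simulate actual game play to the end rather than draw from known probabilities:

```python
import random

random.seed(0)
true_win_prob = {"move_a": 0.7, "move_b": 0.4}   # hypothetical, hidden from the program

def rollout(move):
    """Simulate one play-out from the position after `move`; True means a win."""
    return random.random() < true_win_prob[move]

# Tally wins and plays per move over many roll-outs.
stats = {m: {"wins": 0, "plays": 0} for m in true_win_prob}
for _ in range(1000):
    for m in stats:
        stats[m]["plays"] += 1
        stats[m]["wins"] += rollout(m)

scores = {m: s["wins"] / s["plays"] for m, s in stats.items()}
best = max(scores, key=scores.get)
print(best)   # the move with the higher empirical win rate emerges
```

This is the sense in which MCTS does not have to guess: the score for each move is an observed win frequency, not a static judgment of the board.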
The DeepMind group realized that they could improve their system by
complementing Monte Carlo tree search with a deep convolutional neural network.
Given the current board position as input, AlphaGo uses a trained deep convolutional neural network to assign a rough value to all possible moves from the current position. Then Monte Carlo tree search uses those values to kick-start its search: rather than initially choosing moves at random, Monte Carlo tree search uses values output by the ConvNet as an indicator of which initial moves should be preferred
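A minimal sketch of biasing move choice by network-predicted priors rather than choosing uniformly at random; the prior values here are an invented stand-in for a ConvNet's output:

```python
import random

random.seed(1)
priors = {"move_a": 0.6, "move_b": 0.3, "move_c": 0.1}   # stand-in for ConvNet output

def pick_move(priors):
    """Sample a move with probability proportional to its prior."""
    moves, weights = zip(*priors.items())
    return random.choices(moves, weights=weights, k=1)[0]

# Early in the search, high-prior moves get explored far more often.
counts = {m: 0 for m in priors}
for _ in range(10000):
    counts[pick_move(priors)] += 1
print(counts)   # roughly proportional to 0.6 / 0.3 / 0.1
```

In AlphaGo proper these priors are then blended with the roll-out win statistics as they accumulate, so the search shifts from trusting the network to trusting the observed counts.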
evaluation function
used to assign a score to each possible move from a given board position