When is sequence to sequence (seq2seq) a reasonable approach?
When both the input and the output have variable length, e.g. translating a sentence of arbitrary length into another sentence of arbitrary length. A fixed-size neural network is applied repeatedly across the sequence, one element at a time.
Mention three possible seq2seq approaches
Explain the steps in a RNN-encoder-decoder structure (seq2seq with RNN) in reference to a text translation task.
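A minimal NumPy sketch of those steps, using a plain (non-gated) RNN and made-up toy sizes and weights: the encoder reads the source sentence into a final hidden state, which initializes the decoder; the decoder then emits target words one at a time, feeding each prediction back in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: vocabulary of 6 tokens, hidden/embedding size 4.
V, H = 6, 4
E = rng.normal(scale=0.1, size=(V, H))      # shared embedding matrix
W_enc = rng.normal(scale=0.1, size=(H, H))  # encoder recurrence weights
U_enc = rng.normal(scale=0.1, size=(H, H))  # encoder input weights
W_dec = rng.normal(scale=0.1, size=(H, H))  # decoder recurrence weights
U_dec = rng.normal(scale=0.1, size=(H, H))  # decoder input weights
W_out = rng.normal(scale=0.1, size=(H, V))  # hidden state -> vocab logits

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def encode(src_ids):
    # Step 1: run the encoder RNN over the source sentence;
    # the final hidden state summarizes the whole input.
    h = np.zeros(H)
    for t in src_ids:
        h = np.tanh(E[t] @ U_enc + h @ W_enc)
    return h

def decode(h, max_len=5, bos=0, eos=1):
    # Step 2: initialize the decoder with the encoder summary.
    # Step 3: emit target words one at a time, feeding each
    # prediction back in as the next input (greedy decoding).
    out, tok = [], bos
    for _ in range(max_len):
        h = np.tanh(E[tok] @ U_dec + h @ W_dec)
        tok = int(np.argmax(softmax(h @ W_out)))
        if tok == eos:
            break
        out.append(tok)
    return out

translation = decode(encode([2, 3, 4]))
```

With random untrained weights the output tokens are meaningless; the point is only the flow of data through the two RNNs.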
Explain the basic concept of beam search
Beam search keeps the N (beam width) highest-scoring partial sequences at each word prediction, where a sequence's score is its accumulated log-probability. At the next time step, each of the N kept sequences is extended with its N most probable next words, giving N^2 candidate sequences, of which the N highest-scoring are kept for the following time step.
Simply put: beam search tries to increase the accuracy of the seq2seq model over greedy decoding, which feeds only the argmax of each prediction into the next time step, by keeping the top N candidate sequences instead.
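The procedure above can be sketched as follows. For simplicity this toy version scores candidates against a fixed table of per-step log-probabilities (a stand-in for the decoder's softmax, which in a real model depends on the tokens chosen so far), and it extends each beam with every vocabulary word rather than just its top N.

```python
import math

def beam_search(step_log_probs, beam_width):
    # step_log_probs[t][token]: log-probability of `token` at step t.
    beams = [([], 0.0)]  # (token sequence, summed log-probability)
    for log_probs in step_log_probs:
        candidates = []
        for seq, score in beams:
            for tok, lp in enumerate(log_probs):
                candidates.append((seq + [tok], score + lp))
        # Keep only the beam_width highest-scoring sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

# Two decoding steps over a 3-token vocabulary.
table = [[math.log(p) for p in [0.6, 0.3, 0.1]],
         [math.log(p) for p in [0.2, 0.5, 0.3]]]
best_seq, best_score = beam_search(table, beam_width=2)[0]
# best_seq == [0, 1]: token 0 (p=0.6) followed by token 1 (p=0.5)
```

Summing log-probabilities instead of multiplying probabilities avoids numerical underflow on long sequences.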
Give a brief explanation of the FCN with self-attention. From the paper “attention is all you need”.
Consists of an encoder and a decoder, neither of which uses recurrence. The encoder maps the input sequence to an intermediate representation using stacked self-attention and position-wise feed-forward layers, so every input position can attend to every other. The decoder generates the output one word at a time, attending both to its previously generated words (masked self-attention) and to the encoder's representation (encoder-decoder attention).
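The core building block is scaled dot-product self-attention. A minimal NumPy sketch with made-up toy sizes (single head, no masking or positional encodings):

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project each position into queries, keys, and values, then let
    # every position attend to every other: the output at a position is
    # a weighted sum of all value vectors, with weights from scaled
    # query-key dot products.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (seq_len, seq_len)
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d = 3, 4                   # hypothetical toy dimensions
X = rng.normal(size=(seq_len, d))   # one embedding per input position
out = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
# out has one attended vector per input position: shape (3, 4)
```

The division by sqrt(d_k) keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.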
How can you use reinforcement learning in seq2seq for machine translation?
We want to sample the next word in the decoder from the previous softmax output. However, sampling (like argmax) is a non-differentiable operation, so gradients cannot flow through the word choice and we need an alternate training scheme. Reinforcement learning provides one: treat the word choice as an action, the softmax as a policy, and a sequence-level score such as BLEU as the reward, then use a policy-gradient method (e.g. REINFORCE) to increase the probability of word choices that led to high reward.
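A minimal sketch of the REINFORCE trick for a single decoding step, with a made-up 4-word vocabulary and a made-up per-word reward standing in for a sentence-level score like BLEU:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Hypothetical setup: logits over a 4-word vocabulary, and a made-up
# reward for each possible word choice (word 1 is the "good" word).
logits = np.zeros(4)
reward = np.array([0.0, 1.0, 0.0, 0.2])

# Sampling itself is not differentiable, but the gradient of the
# expected reward is E[ R * d/dtheta log p(word) ], which we can
# estimate by sampling words and weighting the log-prob gradient of
# each sampled word by its reward.
for _ in range(2000):
    p = softmax(logits)
    w = rng.choice(4, p=p)
    grad_log_p = -p.copy()
    grad_log_p[w] += 1.0  # d log p(w) / d logits = one_hot(w) - p
    logits += 0.1 * reward[w] * grad_log_p

# After many updates, the high-reward word dominates the softmax.
```

In a real translation model the gradient with respect to the logits would be backpropagated further into the decoder, and a baseline is typically subtracted from the reward to reduce the variance of the estimate.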
Briefly explain the concept of co-attention and when it is appropriate to use it.
Co-attention can be used when doing question answering, where we feed in both a context matrix (a paragraph of text) and a corresponding question matrix and want the network to make a prediction based on both inputs.
Co-attention is a method of combining the context and the question in such a way that each word in the question matrix can be viewed in the light of each word in the context matrix, and vice versa. This is done using an affinity matrix of dot products between the context and question word representations, which is normalized with softmax operations along each of its two axes.
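A minimal NumPy sketch of that affinity-matrix construction, with made-up sizes (5 context words, 2 question words, dimension 3); the variable names are illustrative, not from any particular paper:

```python
import numpy as np

def softmax(x, axis):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def co_attention(C, Q):
    # C: (n_context_words, d), Q: (n_question_words, d).
    # L[i, j] is the dot product of context word i with question word j.
    L = C @ Q.T                   # (n_c, n_q) affinity scores
    A_q = softmax(L, axis=0)      # attention over context, per question word
    A_c = softmax(L, axis=1)      # attention over question, per context word
    ctx_summaries = A_q.T @ C     # each question word in light of the context
    q_summaries = A_c @ Q         # each context word in light of the question
    return ctx_summaries, q_summaries

rng = np.random.default_rng(0)
C = rng.normal(size=(5, 3))   # context matrix
Q = rng.normal(size=(2, 3))   # question matrix
ctx_sum, q_sum = co_attention(C, Q)
# ctx_sum: (2, 3), q_sum: (5, 3)
```

Both directions come from the same affinity matrix L; only the softmax axis differs, which is what makes the attention "co-" rather than one-way.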