Advantage Transformer over LSTM & RNN
Transformer Architecture
Self-attention layers
Self-attention layers: functions they learn
Self-attention layer steps
-> for faster processing, the calculation is done in matrix form (instead of processing single word embeddings one at a time, stack them into a matrix)
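A minimal NumPy sketch of the matrix form of scaled dot-product self-attention (the weight matrices and dimensions here are made-up toy values, not from any real model):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over the whole sequence at once.

    X: (seq_len, d_model) matrix of word embeddings; packing all words
    into one matrix turns the per-word steps into a few matrix products.
    """
    Q = X @ Wq                                # queries, (seq_len, d_k)
    K = X @ Wk                                # keys,    (seq_len, d_k)
    V = X @ Wv                                # values,  (seq_len, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len) scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                        # weighted sum of values

# toy example: 4 tokens, d_model = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Each output row is a weighted mixture of all value vectors, which is why the whole sequence can be processed in parallel.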
Residual Connections
= connections that pass information from a lower layer to a higher layer without going through the intermediate layer. Allowing the activation going forward (and the gradient going backward) to skip a layer improves learning and gives higher-level layers direct access to information from lower layers
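The idea reduces to adding the layer's input back onto its output. A minimal sketch (`residual_block` and the ReLU sublayer are illustrative assumptions, not a specific model's code):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W):
    """Residual connection: the sublayer output is added to the input,
    so information (and gradients) can bypass the sublayer."""
    return x + relu(x @ W)   # sublayer output plus the shortcut

x = np.ones(4)
W = np.zeros((4, 4))         # a sublayer that outputs all zeros...
print(residual_block(x, W))  # ...still passes x through unchanged
```

Even with a useless sublayer, the input survives intact, which is exactly the "direct access" property described above.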
Layer Normalization
vector components are normalized by subtracting the mean from each and dividing by the standard deviation
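The normalization above can be sketched in a few lines (the small `eps` term is a common implementation detail to avoid division by zero; real layers also add learned gain and bias parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize one vector: subtract its mean, divide by its std."""
    return (x - x.mean()) / (x.std() + eps)

v = np.array([1.0, 2.0, 3.0, 4.0])
out = layer_norm(v)
print(out)
```

After normalization the components have (approximately) mean 0 and standard deviation 1.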
Multihead Attention
Positional Embeddings
Transformer Training
Cross-entropy loss
Distance between gold distribution & prediction
Perplexity
Inverse probability of the test set, normalized by number of words
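The two quantities are directly related: with a one-hot gold distribution, cross-entropy reduces to the negative log probability of the gold word, and perplexity is the exponentiated average of that loss. A toy sketch (the probabilities are made-up values):

```python
import math

# model's probability of the gold word at each position (toy values)
probs_of_gold = [0.2, 0.5, 0.1]

# cross-entropy per word: -log p(gold word)
ce = [-math.log(p) for p in probs_of_gold]
avg_ce = sum(ce) / len(ce)

# perplexity = exp(average cross-entropy)
#            = inverse probability of the test set, normalized per word
ppl = math.exp(avg_ce)
print(round(avg_ce, 3), round(ppl, 3))  # 1.535 4.642
```

Note that `ppl` equals `(0.2 * 0.5 * 0.1) ** (-1/3)`, the geometric-mean inverse probability.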
BERT definition
BERT steps
BERT - tokenization
BERT- subword segmentation
BERT - Input embeddings
BERT - segment embeddings
shows which sentence (A or B) a token belongs to -> segment embeddings are no longer used in many successor models because they proved not very effective (token embeddings with [SEP] markers can probably carry the same information)
BERT - position embeddings
position of word in sentence encoded in vectors
-> segment & position embeddings preserve ordering, since all token embeddings are fed in simultaneously
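The three input-embedding cards above combine by simple element-wise addition. A minimal NumPy sketch (all sizes, ids, and the random tables are toy assumptions; real BERT uses learned embedding tables):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, n_segments, d = 100, 16, 2, 8
tok_emb = rng.normal(size=(vocab_size, d))   # token embedding table
seg_emb = rng.normal(size=(n_segments, d))   # segment embedding table
pos_emb = rng.normal(size=(max_len, d))      # position embedding table

token_ids   = np.array([5, 17, 3, 42])       # toy ids for 4 subword tokens
segment_ids = np.array([0, 0, 0, 0])         # all from sentence A
positions   = np.arange(len(token_ids))      # 0, 1, 2, 3

# the three embeddings are summed element-wise per token
inputs = tok_emb[token_ids] + seg_emb[segment_ids] + pos_emb[positions]
print(inputs.shape)  # (4, 8)
```

Because the sum is taken per position, each row of `inputs` carries the token identity, its sentence membership, and its position at once.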
BERT - pre-training
model is trained on unlabeled data over different pre-training tasks; extracts generalizations from large amounts of text (learns the language); trained simultaneously on: Masked Language Model (MLM) & Next Sentence Prediction (NSP)
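The MLM corruption step can be sketched as follows (`mask_tokens` is a hypothetical helper; the ~15% selection rate and 80/10/10 split follow the BERT paper, the rest is a simplification over word strings instead of subword ids):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Simplified BERT-style MLM corruption: select ~15% of positions;
    of those, 80% become [MASK], 10% a random token, 10% stay unchanged.
    Returns the corrupted sequence and the positions to predict."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok                   # gold label for this slot
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token      # replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(tokens)  # random replacement
            # else: leave the token unchanged (model still predicts it)
    return corrupted, targets

toks = "the cat sat on the mat because it was tired".split()
corrupted, targets = mask_tokens(toks)
print(corrupted, targets)
```

The model then has to predict the original token at each selected position, which forces it to use bidirectional context.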
BERT - pre-training MLM
BERT - pre-training NSP
BERT - fine-tuning
BERT results evaluation