Word2Vec vs GloVe
Word2Vec:
Predictive model trained with skip-gram and negative sampling. Negative sampling draws words that do not appear in the context window, so each update touches only ~20 sampled words instead of the whole vocabulary.
GloVe:
Count-based (no predictive model)
Fits word vectors to a global word co-occurrence matrix
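A minimal sketch of the practical difference, assuming gensim 4.x (the toy corpus, hyperparameters, and the commented-out `glove-wiki-gigaword-50` download are illustrative):

```python
# Sketch: skip-gram with negative sampling vs. loading pretrained GloVe vectors.
from gensim.models import Word2Vec

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]

# sg=1 -> skip-gram; negative=20 -> ~20 negative (non-appearing) words per update
w2v = Word2Vec(sentences=corpus, vector_size=50, window=2,
               sg=1, negative=20, min_count=1, epochs=50)
print(w2v.wv.most_similar("cat", topn=3))

# GloVe has no window-prediction model to train here; its vectors come from a
# global co-occurrence matrix, so in practice we usually just load pretrained ones:
# import gensim.downloader as api
# glove = api.load("glove-wiki-gigaword-50")   # downloads pretrained vectors
# print(glove.most_similar("cat", topn=3))
```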
Difference between LSTM and GRU.
LSTMs:
Uses a gating mechanism to combat vanishing gradients (improvement on the vanilla RNN)
3 gates (input, forget, output)
2 states (cell, hidden)
More parameters, slower to train, need more data
Better at capturing long-range dependencies
GRUs:
2 gates (reset, update; the update gate combines the forget and input gates into one)
1 state (hidden)
Fewer parameters, faster to train, need less data
Not as good at capturing long-range dependencies
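A quick PyTorch sketch (input/hidden sizes are arbitrary) that makes the parameter-count difference concrete:

```python
# GRU has 3 gate-style weight blocks vs. the LSTM's 4 (input, forget, output gates
# plus the cell candidate), so it has ~3/4 the parameters at the same size.
import torch.nn as nn

input_size, hidden_size = 128, 256
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
gru = nn.GRU(input_size, hidden_size, batch_first=True)

count = lambda m: sum(p.numel() for p in m.parameters())
print("LSTM params:", count(lstm))   # 4 * (in*h + h*h + 2*h)
print("GRU params: ", count(gru))    # 3 * (in*h + h*h + 2*h)
```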
What is a CBOW and Skip Gram model?
CBOW:
Predicts the center word from the context words before and after it.
Skip-Gram:
All context words become training samples with respect to the center word; each (center, context) pair is one example.
How does word2vec train?
With either the CBOW or skip-gram objective over a sliding window, typically sped up with negative sampling (as above) or hierarchical softmax.
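A toy sketch of how one window is turned into training samples under each objective (the sentence and window size are made up):

```python
# Generate CBOW and skip-gram training samples from a sliding window.
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 2

for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    # CBOW: one sample per position -> predict the center word from all context words
    print("CBOW     :", context, "->", center)
    # Skip-gram: one sample per (center, context word) pair
    for c in context:
        print("Skip-gram:", center, "->", c)
```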
Viterbi vs Beam Search vs Greedy
Viterbi:
Exact dynamic-programming search that considers all possible candidates at each step (guaranteed to find the best path)
Beam Search:
Keeps only the K best partial candidates at each step (approximate)
Greedy:
Take the best possible candidate at each time step
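A toy sketch contrasting greedy and beam decoding; the per-step log-probabilities are made-up numbers, and a real decoder would condition each step on the prefix chosen so far (Viterbi would instead do exact dynamic programming over all paths):

```python
import math

# Illustrative per-step log-probabilities over a 3-word vocabulary.
steps = [
    {"a": math.log(0.5), "b": math.log(0.4), "c": math.log(0.1)},
    {"a": math.log(0.1), "b": math.log(0.3), "c": math.log(0.6)},
    {"a": math.log(0.4), "b": math.log(0.5), "c": math.log(0.1)},
]

# Greedy: take the single best token at each step.
greedy = [max(s, key=s.get) for s in steps]

# Beam search (K=2): keep the K best partial hypotheses at each step.
K = 2
beams = [([], 0.0)]
for s in steps:
    candidates = [(seq + [tok], score + lp)
                  for seq, score in beams for tok, lp in s.items()]
    beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:K]

print("greedy:", greedy)
print("beam  :", beams[0][0])
```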
What are some NLP Libraries in Python?
Gensim, NLTK, spaCy, John Snow Labs, AllenNLP, Hugging Face, TF Hub
PyTorch, TensorFlow, JAX
What is BERT?
BERT:
- Transformer encoder trained to predict masked words
- Selects 15% of the tokens in each sentence for prediction
- Of those, 80% are replaced with [MASK], 10% with a random token (adds noise to the embedding), and 10% are kept unchanged; the model must recover the original token
- Next Sentence Prediction: predicts whether sentence B follows sentence A (50% the true next sentence, 50% a random sentence)
- Embeddings: Positional, Segment, and Token Embedding
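A rough sketch of the masked-LM corruption rule described above (the token list and vocabulary are illustrative; the 80/10/10 split is the one reported in the BERT paper):

```python
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

corrupted, labels = [], []
for tok in tokens:
    if random.random() < 0.15:          # pick ~15% of tokens as prediction targets
        labels.append(tok)              # model must predict the original token
        r = random.random()
        if r < 0.8:
            corrupted.append("[MASK]")                # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted.append(random.choice(vocab))    # 10%: random token (noise)
        else:
            corrupted.append(tok)                     # 10%: keep the original token
    else:
        labels.append(None)             # not a prediction target
        corrupted.append(tok)

print(corrupted)
print(labels)
```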
What is ELMo?
ELMo:
- A 2-layer stacked bidirectional LSTM trained as a language model to predict the next word (and, in the backward direction, the previous word)
- At inference, takes the weighted sum of the hidden states from each layer of the bi-LSTM and the raw word vector
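A minimal sketch of that weighted sum; the tensors are random stand-ins, and in the real model the per-layer weights and the global scale gamma are learned for the downstream task:

```python
import torch

seq_len, dim, num_layers = 5, 1024, 3   # layer 0 = raw word vectors, layers 1-2 = biLSTM states
layer_outputs = torch.randn(num_layers, seq_len, dim)

s = torch.softmax(torch.randn(num_layers), dim=0)   # per-layer mixing weights (learned)
gamma = torch.tensor(1.0)                            # global scale (learned)

# Weighted sum over layers -> one contextual vector per token
elmo_repr = gamma * (s.view(-1, 1, 1) * layer_outputs).sum(dim=0)
print(elmo_repr.shape)   # (seq_len, dim)
```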
What are 3 subword embedding strategies?
BPE:
1. Start with vocab of characters
2. Merge the most frequent adjacent symbol pair and add it to the vocab
3. Repeat until the target vocab size is reached (see the toy merge loop after this block)
WordPiece, SentencePiece
- Same merge idea as BPE, except pairs are chosen by whichever merge most increases the training-data likelihood
- WordPiece treats words separately (requires pre-tokenization)
- SentencePiece treats the entire sentence as one string, with _ replacing spaces
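A toy merge loop following the three BPE steps above (the word list and number of merges are illustrative; real implementations operate on word frequency counts rather than raw strings):

```python
from collections import Counter

words = ["low", "lower", "lowest", "newer", "wider"]
# 1. Start from a character vocabulary: each word is a tuple of symbols.
corpus = [tuple(w) for w in words]
target_merges = 5

for _ in range(target_merges):
    # 2. Count adjacent symbol pairs and pick the most frequent one.
    pairs = Counter()
    for w in corpus:
        pairs.update(zip(w, w[1:]))
    if not pairs:
        break
    (a, b), _ = pairs.most_common(1)[0]
    # 3. Merge that pair everywhere, then repeat until the target size is reached.
    merged = []
    for w in corpus:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(tuple(out))
    corpus = merged
    print("merged", a + b, "->", corpus)
```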
How does RoBERTa improve on BERT?