Challenges in LM
Vanishing probabilities:
Unknown words and sequences
Exactness vs generalizations:
When to use LMs?
Selected applications
Speech recognition: Disambiguate unclear words based on likelihood
Spelling/Grammar correction: Find likely errors and suggest alternatives
Machine translation: Find likely interpretation/order in target language
How to stop generating text?
The maximum length of the output sequence may be prespecified
Also, LMs may learn to generate a special end tag, </s>
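Both stopping criteria can be combined in one generation loop. A minimal sketch (the `lm` callable is a hypothetical stand-in for any model that returns the next token given the tokens so far):

```python
def generate(lm, max_len=20, end_tag="</s>"):
    """Generate tokens until the end tag </s> is produced
    or the prespecified maximum length is reached."""
    tokens = ["<s>"]
    while len(tokens) < max_len:
        nxt = lm(tokens)  # the model proposes the next token
        if nxt == end_tag:
            break         # learned stopping criterion
        tokens.append(nxt)
    return tokens[1:]

# Toy "model" for illustration: emits two words, then the end tag.
toy = lambda toks: ["hello", "world", "</s>"][len(toks) - 1]
print(generate(toy))  # ['hello', 'world']
```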
Large language model (LLM)
A neural language model trained on huge amounts of textual data
Usually based on the transformer architecture
Transformer
A neural network architecture for processing input sequences in parallel
Models each input based on its surrounding inputs, called self-attention
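The core of self-attention can be illustrated in a few lines. This is a deliberately simplified sketch (single head, no learned query/key/value projections, plain Python lists instead of tensors): each position's output is a softmax-weighted average over all input positions.

```python
import math

def self_attention(X):
    """Simplified scaled dot-product self-attention:
    every position attends to all positions in parallel."""
    d = len(X[0])  # embedding dimension
    out = []
    for q in X:
        # Similarity of this position to every position, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]
        # Softmax turns scores into attention weights
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        weights = [e / sum(exps) for e in exps]
        # Output = weighted average of all input vectors
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d)])
    return out

# Three toy 2-dimensional input "embeddings"
print(self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```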
What n to use? (n-gram language model)
- Bigrams are used in the examples above mainly for simplicity
- In practice, mostly trigrams, 4-grams, or 5-grams are used
- The higher n, the more training data is needed for reliable probabilities
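For the bigram case, the maximum-likelihood probabilities are just relative counts: P(w2 | w1) = count(w1, w2) / count(w1). A minimal sketch on a toy corpus (the two-sentence example is illustrative, not from the slides):

```python
from collections import Counter

def bigram_probs(tokens):
    """Maximum-likelihood bigram estimates P(w2 | w1) from a token list."""
    unigrams = Counter(tokens[:-1])              # count(w1)
    bigrams = Counter(zip(tokens, tokens[1:]))   # count(w1, w2)
    return {(w1, w2): c / unigrams[w1]
            for (w1, w2), c in bigrams.items()}

corpus = "<s> i am sam </s> <s> sam i am </s>".split()
probs = bigram_probs(corpus)
print(probs[("i", "am")])  # 1.0: "i" is always followed by "am"
```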
Evaluation of LM
Perplexity
The perplexity PPL of an LM on a test set is the inverse probability of the test set, normalized by the number of tokens
Notice:
Branching factor
Sampling of sequences
Unigram sampling
Bigram sampling
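The difference between the two sampling schemes can be sketched as follows: unigram sampling draws each token independently by corpus frequency, while bigram sampling conditions each draw on the previous token. The toy corpus and helper names are illustrative assumptions:

```python
import random
from collections import Counter

def sample_unigram(tokens, n=5, seed=0):
    """Unigram sampling: each token drawn independently,
    weighted by its corpus frequency."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    words = list(counts)
    weights = [counts[w] for w in words]
    return [rng.choices(words, weights)[0] for _ in range(n)]

def sample_bigram(tokens, n=5, seed=0):
    """Bigram sampling: each token drawn from the observed
    successors of the previous token."""
    rng = random.Random(seed)
    followers = {}
    for w1, w2 in zip(tokens, tokens[1:]):
        followers.setdefault(w1, []).append(w2)
    out = ["<s>"]
    for _ in range(n):
        nxt = rng.choice(followers.get(out[-1], tokens))
        if nxt == "</s>":
            break
        out.append(nxt)
    return out[1:]

corpus = "<s> i am sam </s> <s> sam i am </s>".split()
print(sample_unigram(corpus))
print(sample_bigram(corpus))
```

Bigram samples respect local word order (e.g. "i" tends to be followed by "am"), while unigram samples are an unordered bag of frequent words.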
Sparsity
Why are zero probabilities problematic?
Unknown tokens
Out-of-vocabulary (OOV) tokens
Solution:
Alternative 1: Closed vocabulary
Alternative 2: Frequency pruning
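Frequency pruning can be sketched in a few lines: tokens rarer than a threshold are replaced by a special unknown token before training, so the model learns an explicit probability for OOV tokens. The threshold and the `<unk>` spelling are illustrative choices:

```python
from collections import Counter

def prune_vocab(tokens, min_count=2, unk="<unk>"):
    """Frequency pruning: map tokens rarer than min_count to <unk>,
    so unseen test tokens can be scored via the <unk> probability."""
    counts = Counter(tokens)
    return [t if counts[t] >= min_count else unk for t in tokens]

print(prune_vocab("the cat sat on the mat the cat".split()))
# ['the', 'cat', '<unk>', '<unk>', 'the', '<unk>', 'the', 'cat']
```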
Smoothing
Unknown sequences
General idea of smoothing (aka discounting)
Main types of smoothing
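The simplest instance of this idea is add-k (Laplace for k=1) smoothing, which shaves probability mass from seen n-grams and redistributes it to unseen ones: P(w2 | w1) = (count(w1, w2) + k) / (count(w1) + k·V), with V the vocabulary size. A minimal sketch on a toy corpus:

```python
from collections import Counter

def laplace_bigram_prob(tokens, w1, w2, k=1.0):
    """Add-k smoothed bigram probability:
    P(w2 | w1) = (count(w1, w2) + k) / (count(w1) + k * V)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    V = len(set(tokens))  # vocabulary size
    return (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * V)

corpus = "<s> i am sam </s> <s> sam i am </s>".split()
# The unseen bigram ("am", "i") now gets a small nonzero probability
# instead of zero, so longer sequences no longer vanish to 0:
print(laplace_bigram_prob(corpus, "am", "i"))
```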