What are the four essential aspects of LLMs?
1 - Natural language
2 - Artificial neural networks
3 - Self-improvement
4 - Creativity
Why did it take nearly seventy years until computers could be programmed to use natural language at a level comparable to humans, or even at a superhuman level? State the two complications and how ChatGPT solves them.
Early attempts at natural language processing (NLP) faced two major challenges:
1. Common sense: Computers lacked the world knowledge and common sense necessary to understand and generate natural language effectively.
2. Context: Natural language often relies on context to resolve ambiguities, such as the meaning of pronouns.
ChatGPT, a large language model (LLM), overcomes these challenges through two techniques:
Transformer
A type of deep neural network designed around the idea of attention. It transforms a sequence of input embeddings into a sequence of output embeddings and can be viewed as a stack of self-attention layers.
Attention
A mechanism used in a neural network that indicates the importance of a particular word or part of a word. Attention compresses the amount of information a model needs to predict the next token/word. A typical attention mechanism might consist of a weighted sum over a set of inputs, where the weight for each input is computed by another part of the neural network.
Positional Encoding
A technique transformers use to keep track of word order. A common implementation of positional encoding uses a sinusoidal function. This technique enables a Transformer model to learn to attend to different parts of the sequence based on their position.
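The sinusoidal implementation mentioned above can be sketched as follows. This is an illustrative NumPy version (function name and parameter choices are my own); it assigns each position a vector where even dimensions use sine and odd dimensions use cosine at geometrically spaced frequencies.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build a (seq_len, d_model) matrix of sinusoidal position vectors.
    Even dimensions use sine, odd dimensions use cosine."""
    positions = np.arange(seq_len)[:, np.newaxis]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]          # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=8)
print(pe.shape)  # (50, 8)
```

The resulting matrix is simply added to the token embeddings, so the model can distinguish the same token at different positions.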
encoder
In general, any ML system that converts from a raw, sparse, or external representation into a more processed, denser, or more internal representation.
decoder
In general, any ML system that converts from a processed, dense, or internal representation to a more raw, sparse, or external representation.
token
The “atomic unit” in language models that the model is training on and making predictions on. A token is typically one of the following:
1. a word
2. a character
3. subwords, in which a word can be a single token or multiple tokens. A subword consists of a root word, a prefix, or a suffix.
Each token has an ID.
Model dimension
To facilitate all connections, all embeddings and sublayers have the same length, i.e., the model dimension.
softmax
A function that determines probabilities for each possible class in a multi-class classification model. The probabilities add up to exactly 1.0.
softmax(z_i) = exp(z_i) / Σ_j exp(z_j)
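The softmax definition above can be checked with a small NumPy sketch. Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j).
    Subtracting max(z) first avoids overflow without changing the output."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs.sum())  # 1.0 -- the outputs form a probability distribution
```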
Transformer architecture for machine translation
(see notes)
auto-regressive
Previously generated tokens serve as additional input when generating the next token.
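A minimal sketch of this loop, with a stand-in for the model (the function `next_token_probs` and the token IDs are hypothetical, not from any real LLM): each generated token is appended to the input before the next prediction is made.

```python
import numpy as np

rng = np.random.default_rng(0)

def next_token_probs(token_ids, vocab_size=5):
    """Stand-in for a trained language model: returns a (fake) probability
    distribution over the vocabulary, given the tokens so far."""
    logits = rng.normal(size=vocab_size)
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Auto-regressive decoding: the output sequence grows one token at a time,
# and every previously generated token is part of the next input.
tokens = [0]  # hypothetical start-token ID
for _ in range(5):
    probs = next_token_probs(tokens)
    tokens.append(int(np.argmax(probs)))  # greedy choice of next token
print(tokens)
```

Real systems replace the greedy `argmax` with sampling strategies, but the feed-output-back-as-input structure is the same.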
Tokenizer/ Tokenization
The process of converting a text to tokens is called tokenization and is performed by a tokenizer.
Vocabulary
The set of all tokens of a tokenizer is called the vocabulary.
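The two definitions above fit in a toy example. This sketch uses whitespace tokenization for simplicity; real LLM tokenizers use subword schemes, as noted in the token entry.

```python
# Toy whitespace tokenizer (illustrative only).
text = "the cat sat on the mat"
tokens = text.split()

# The vocabulary is the set of all tokens; each token gets an integer ID.
vocabulary = sorted(set(tokens))
token_to_id = {tok: i for i, tok in enumerate(vocabulary)}
ids = [token_to_id[t] for t in tokens]

print(vocabulary)  # ['cat', 'mat', 'on', 'sat', 'the']
print(ids)         # [4, 0, 3, 2, 4, 1]
```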
Embedding
Embeddings convert tokens to real-valued vectors of dimension d(model). This makes it possible, for example, for tokens of similar meaning or use to be mapped to points that are close in the real vector space.
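Mechanically, an embedding layer is a lookup into a matrix with one row per vocabulary entry. In a trained model this matrix is learned; here it is random, purely to show the lookup.

```python
import numpy as np

vocab_size, d_model = 10, 4
rng = np.random.default_rng(0)

# One d_model-dimensional row per token in the vocabulary.
# (Random here for illustration; learned in a real model.)
embedding_matrix = rng.normal(size=(vocab_size, d_model))

token_ids = [3, 1, 7]                     # hypothetical token IDs
embeddings = embedding_matrix[token_ids]  # lookup: one row per token
print(embeddings.shape)  # (3, 4)
```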
Scaled Dot-Product Attention
(see notes for diagram + equation)
In the mechanism of scaled dot-product attention, queries and keys of dimension d(k) as well as values of dimension d(v) are used. In practice,
the attention function works on a set of queries simultaneously, and hence
the queries, keys, and values are stored in matrices Q, K, and V.
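The matrix form described above can be sketched in NumPy. The scores are the dot products of queries with keys, scaled by the square root of d(k), pushed through a softmax, and used to take a weighted sum of the values: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (num_queries, num_keys)
    weights = softmax(scores)        # each row sums to 1
    return weights @ V               # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))  # 3 queries, d_k = 8
K = rng.normal(size=(5, 8))  # 5 keys,    d_k = 8
V = rng.normal(size=(5, 6))  # 5 values,  d_v = 6
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 6): one d_v-dimensional output per query
```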
Multi-head attention
(see notes + equations)
The advantage of multi-head attention is that the model can attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
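A simplified sketch of the head-splitting idea, assuming identity projections (a real implementation also applies learned projection matrices W_Q, W_K, W_V, and an output projection W_O): the model dimension is split across heads, each head attends in its own subspace, and the head outputs are concatenated.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, num_heads):
    """Split d_model into num_heads subspaces, run scaled dot-product
    self-attention per head, then concatenate. Learned projections
    are omitted here for clarity."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # Reshape to (num_heads, seq_len, d_head): one subspace per head.
    heads = X.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    outputs = []
    for h in heads:  # each head attends independently in its subspace
        scores = h @ h.T / np.sqrt(d_head)
        outputs.append(softmax(scores) @ h)
    # Concatenate head outputs back to the full model dimension.
    return np.concatenate(outputs, axis=-1)

X = np.random.default_rng(0).normal(size=(4, 8))  # 4 tokens, d_model = 8
print(multi_head_self_attention(X, num_heads=2).shape)  # (4, 8)
```

Because each head sees only its own slice of the representation, different heads can specialize, which a single averaged head cannot do.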
Applications of Attention
self-attention
A neural network layer that transforms a sequence of embeddings (for instance, token embeddings) into another sequence of embeddings. Each embedding in the output sequence is constructed by integrating information from the elements of the input sequence through an attention mechanism.
The self part of self-attention refers to the sequence attending to itself rather than to some other context. Self-attention is one of the main building blocks for Transformers and uses dictionary lookup terminology, such as “query”, “key”, and “value”.
A self-attention layer starts with a sequence of input representations, one for each word. The input representation for a word can be a simple embedding. For each word in an input sequence, the network scores the relevance of the word to every element in the whole sequence of words. The relevance scores determine how much the word’s final representation incorporates the representations of other words.
What does GPT stand for?
Generative Pre-trained Transformer
What are the criteria the human labelers used to judge the model output of ChatGPT?
Helpful, harmless, honest
Reinforcement learning
A family of algorithms that learn an optimal policy, whose goal is to maximize return when interacting with an environment.