What are the four essential aspects of LLMs?
1 - Natural language
2 - Artificial neural networks
3 - Self-improvement
4 - Creativity
Why did it take nearly seventy years until computers could be programmed to use natural language at a level comparable to humans, or even at a superhuman level? State the two complications and how ChatGPT solves them.
Early attempts at natural language processing (NLP) faced two major challenges:
1. Common sense: Computers lacked the world knowledge and common sense necessary to understand and generate natural language effectively.
2. Context: Natural language often relies on context to resolve ambiguities, such as the meaning of pronouns.
ChatGPT, a large language model (LLM), overcomes these challenges through two techniques:
Transformer
A type of deep neural network designed around the idea of attention. It transforms a sequence of input embeddings into a sequence of output embeddings and can be viewed as a stack of self-attention layers.
Attention
A mechanism used in a neural network that indicates the importance of a particular word or part of a word. Attention compresses the amount of information a model needs to predict the next token/word. A typical attention mechanism might consist of a weighted sum over a set of inputs, where the weight for each input is computed by another part of the neural network.
Positional Encoding
A technique transformers use to keep track of word order. A common implementation of positional encoding uses a sinusoidal function. This technique enables a Transformer model to learn to attend to different parts of the sequence based on their position.
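The sinusoidal implementation mentioned above can be sketched as follows. This is an illustrative NumPy version (function name and parameter choices are my own); it assigns each position a vector where even dimensions use sine and odd dimensions use cosine at geometrically spaced frequencies.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build a (seq_len, d_model) matrix of sinusoidal position vectors.
    Even dimensions use sine, odd dimensions use cosine."""
    positions = np.arange(seq_len)[:, np.newaxis]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]          # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=8)
print(pe.shape)  # (50, 8)
```

The resulting matrix is simply added to the token embeddings, so the model can distinguish the same token at different positions.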
encoder
In general, any ML system that converts from a raw, sparse, or external representation into a more processed, denser, or more internal representation.
decoder
In general, any ML system that converts from a processed, dense, or internal representation to a more raw, sparse, or external representation.
token
The “atomic unit” in language models that the model is training on and making predictions on. A token is typically one of the following:
1. a word
2. a character
3. subwords, in which a word can be a single token or multiple tokens. A subword consists of a root word, a prefix, or a suffix.
Each token has an ID.
Model dimension
To facilitate all connections, all embeddings and sublayers have the same length, i.e., the model dimension.
softmax
A function that determines probabilities for each possible class in a multi-class classification model. The probabilities add up to exactly 1.0.
softmax(z_i) = exp(z_i) / Σ_j exp(z_j)
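The softmax definition above can be checked with a small NumPy sketch. Subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j).
    Subtracting max(z) first avoids overflow without changing the output."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs.sum())  # 1.0 -- the outputs form a probability distribution
```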
Transformer architecture for machine translation
(see notes)
auto-regressive
Previously generated tokens serve as additional input when generating the next token.
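A minimal sketch of this loop, with a stand-in for the model (the function `next_token_probs` and the token IDs are hypothetical, not from any real LLM): each generated token is appended to the input before the next prediction is made.

```python
import numpy as np

rng = np.random.default_rng(0)

def next_token_probs(token_ids, vocab_size=5):
    """Stand-in for a trained language model: returns a (fake) probability
    distribution over the vocabulary, given the tokens so far."""
    logits = rng.normal(size=vocab_size)
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Auto-regressive decoding: the output sequence grows one token at a time,
# and every previously generated token is part of the next input.
tokens = [0]  # hypothetical start-token ID
for _ in range(5):
    probs = next_token_probs(tokens)
    tokens.append(int(np.argmax(probs)))  # greedy choice of next token
print(tokens)
```

Real systems replace the greedy `argmax` with sampling strategies, but the feed-output-back-as-input structure is the same.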
Tokenizer/ Tokenization
The process of converting a text to tokens is called tokenization and is performed by a tokenizer.
Vocabulary
The set of all tokens of a tokenizer is called the vocabulary.
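The two definitions above fit in a toy example. This sketch uses whitespace tokenization for simplicity; real LLM tokenizers use subword schemes, as noted in the token entry.

```python
# Toy whitespace tokenizer (illustrative only).
text = "the cat sat on the mat"
tokens = text.split()

# The vocabulary is the set of all tokens; each token gets an integer ID.
vocabulary = sorted(set(tokens))
token_to_id = {tok: i for i, tok in enumerate(vocabulary)}
ids = [token_to_id[t] for t in tokens]

print(vocabulary)  # ['cat', 'mat', 'on', 'sat', 'the']
print(ids)         # [4, 0, 3, 2, 4, 1]
```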
Embedding
Embeddings convert tokens to real-valued vectors of dimension d(model). This makes it possible, for example, for tokens of similar meaning or use to be mapped to points that are close in the real vector space.
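Mechanically, an embedding layer is a lookup into a matrix with one row per vocabulary entry. In a trained model this matrix is learned; here it is random, purely to show the lookup.

```python
import numpy as np

vocab_size, d_model = 10, 4
rng = np.random.default_rng(0)

# One d_model-dimensional row per token in the vocabulary.
# (Random here for illustration; learned in a real model.)
embedding_matrix = rng.normal(size=(vocab_size, d_model))

token_ids = [3, 1, 7]                     # hypothetical token IDs
embeddings = embedding_matrix[token_ids]  # lookup: one row per token
print(embeddings.shape)  # (3, 4)
```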
Scaled Dot-Product Attention
(see notes for diagram + equation)
In the mechanism of scaled dot-product attention, queries and keys of dimension d(k) as well as values of dimension d(v) are used. In practice,
the attention function works on a set of queries simultaneously, and hence
the queries, keys, and values are stored in matrices Q, K, and V.
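The matrix form described above can be sketched in NumPy. The scores are the dot products of queries with keys, scaled by the square root of d(k), pushed through a softmax, and used to take a weighted sum of the values: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (num_queries, num_keys)
    weights = softmax(scores)        # each row sums to 1
    return weights @ V               # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))  # 3 queries, d_k = 8
K = rng.normal(size=(5, 8))  # 5 keys,    d_k = 8
V = rng.normal(size=(5, 6))  # 5 values,  d_v = 6
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 6): one d_v-dimensional output per query
```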
Multi-head attention
(see notes + equations)
The advantage of multi-head attention is that the model can attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
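A simplified sketch of the head-splitting idea, assuming identity projections (a real implementation also applies learned projection matrices W_Q, W_K, W_V, and an output projection W_O): the model dimension is split across heads, each head attends in its own subspace, and the head outputs are concatenated.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, num_heads):
    """Split d_model into num_heads subspaces, run scaled dot-product
    self-attention per head, then concatenate. Learned projections
    are omitted here for clarity."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # Reshape to (num_heads, seq_len, d_head): one subspace per head.
    heads = X.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    outputs = []
    for h in heads:  # each head attends independently in its subspace
        scores = h @ h.T / np.sqrt(d_head)
        outputs.append(softmax(scores) @ h)
    # Concatenate head outputs back to the full model dimension.
    return np.concatenate(outputs, axis=-1)

X = np.random.default_rng(0).normal(size=(4, 8))  # 4 tokens, d_model = 8
print(multi_head_self_attention(X, num_heads=2).shape)  # (4, 8)
```

Because each head sees only its own slice of the representation, different heads can specialize, which a single averaged head cannot do.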
Applications of Attention
self-attention
A neural network layer that transforms a sequence of embeddings (for instance, token embeddings) into another sequence of embeddings. Each embedding in the output sequence is constructed by integrating information from the elements of the input sequence through an attention mechanism.
The self part of self-attention refers to the sequence attending to itself rather than to some other context. Self-attention is one of the main building blocks for Transformers and uses dictionary lookup terminology, such as “query”, “key”, and “value”.
A self-attention layer starts with a sequence of input representations, one for each word. The input representation for a word can be a simple embedding. For each word in an input sequence, the network scores the relevance of the word to every element in the whole sequence of words. The relevance scores determine how much the word’s final representation incorporates the representations of other words.
What does GPT stand for?
Generative Pre-trained Transformer
What are the criteria the human labelers used to judge the model output of ChatGPT?
Helpful, harmless, honest
Reinforcement learning
A family of algorithms that learn an optimal policy, whose goal is to maximize return when interacting with an environment.