LLM Flashcards

(22 cards)

1
Q

What are the four essential aspects of LLMs?

A

1 - Natural language
2 - Artificial neural networks
3 - Self-improvement
4 - Creativity

2
Q

Why did it take nearly seventy years until computers could be programmed to use natural language at a level comparable to humans, or even at a superhuman level? State the two complications and how ChatGPT solves them.

A

Early attempts at natural language processing (NLP) faced two major challenges:

1. Common sense: Computers lacked the world knowledge and common sense necessary to understand and generate natural language effectively.

2. Context: Natural language often relies on context to resolve ambiguities, such as the meaning of pronouns.

ChatGPT, a large language model (LLM), overcomes these challenges through two techniques:

  1. Self-improvement: ChatGPT is trained on massive amounts of text data, allowing it to extract and store common sense knowledge.
  2. Attention mechanism: Transformers, a type of neural network architecture, use an attention mechanism to focus on specific parts of the input text when generating output. This helps the LLM to understand context and resolve references correctly.
3
Q

Transformer

A

A type of deep neural network designed around the idea of attention, which transforms a sequence of input embeddings into a sequence of output embeddings. It can be viewed as a stack of self-attention layers.

4
Q

Attention

A

A mechanism used in a neural network that indicates the importance of a particular word or part of a word. Attention compresses the amount of information a model needs to predict the next token/word. A typical attention mechanism might consist of a weighted sum over a set of inputs, where the weight for each input is computed by another part of the neural network.
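The weighted-sum core of this idea can be sketched in a few lines of plain Python (a toy illustration; here the weights are given directly rather than computed by another part of the network):

```python
def weighted_sum(inputs, weights):
    """Core of attention: combine input vectors using attention weights."""
    # Attention weights form a probability distribution over the inputs.
    assert abs(sum(weights) - 1.0) < 1e-9
    dim = len(inputs[0])
    return [sum(w * vec[j] for w, vec in zip(weights, inputs)) for j in range(dim)]

# Two input vectors; the model "attends" 80% to the first and 20% to the second.
print(weighted_sum([[1.0, 0.0], [0.0, 1.0]], [0.8, 0.2]))  # [0.8, 0.2]
```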

5
Q

Positional Encoding

A

A technique transformers use to keep track of word order.

A common implementation of positional encoding uses a sinusoidal function. This technique enables a Transformer model to learn to attend to different parts of the sequence based on their position.
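A minimal plain-Python sketch of the sinusoidal variant (assuming the common formulation where even dimensions use sine and odd dimensions use cosine):

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: each position maps to a d_model vector."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Frequency decreases as the dimension index grows.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)       # even dimensions: sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimensions: cosine
    return pe

pe = positional_encoding(4, 8)
# Position 0 encodes to alternating sin(0)=0 / cos(0)=1 entries.
```

These vectors are added to the token embeddings, so the model can distinguish the same token at different positions.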

6
Q

encoder

A

In general, any ML system that converts from a raw, sparse, or external representation into a more processed, denser, or more internal representation

7
Q

decoder

A

In general, any ML system that converts from a processed, dense, or internal representation to a more raw, sparse, or external representation.

8
Q

token

A

The “atomic unit” in language models that the model is training on and making predictions on. A token is typically one of the following:
1. a word
2. a character
3. subwords, in which a word can be a single token or multiple tokens. A subword consists of a root word, a prefix, or a suffix.
Each token has an ID.
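As a toy illustration, a greedy longest-match subword tokenizer over a small hypothetical vocabulary might look like this (real subword tokenizers such as BPE are more involved):

```python
def tokenize(text, vocab):
    """Greedy longest-match subword tokenization over a toy vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest substring first, shrinking until a vocab entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"un", "believ", "able", "token", "ize"}
print(tokenize("unbelievable", vocab))  # ['un', 'believ', 'able']
```

In a real tokenizer, each vocabulary entry would then be mapped to an integer token ID.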

9
Q

Model dimension

A

To facilitate all connections, all embeddings and sublayers have the same length, i.e., the model dimension.

10
Q

softmax

A

A function that determines probabilities for each possible class in a multi-class classification model. The probabilities add up to exactly 1.0.
(check softmax equation)
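A plain-Python sketch of the equation (subtracting the maximum is a standard numerical-stability trick and does not change the result):

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# The probabilities sum to 1.0, and the largest logit gets the highest probability.
```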

11
Q

Transformer architecture for machine translation

A

(see notes)

12
Q

auto-regressive

A

Previously generated tokens serve as additional input when generating the next token
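A sketch of the generation loop (the `model` here is a hypothetical stand-in for a real next-token predictor):

```python
def generate(model, prompt_tokens, n_new):
    """Auto-regressive decoding: each generated token is appended to the
    input before the next one is predicted."""
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        next_token = model(tokens)  # prediction conditioned on all tokens so far
        tokens.append(next_token)   # the output becomes part of the next input
    return tokens

# A toy "model" that always predicts the last token plus one:
toy_model = lambda toks: toks[-1] + 1
print(generate(toy_model, [5], 3))  # [5, 6, 7, 8]
```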

13
Q

Tokenizer/ Tokenization

A

The process of converting a text to tokens is called tokenization and is performed by a tokenizer.

14
Q

Vocabulary

A

The set of all tokens of a tokenizer is called the vocabulary.

15
Q

Embedding

A

Embeddings convert tokens to real-valued vectors of dimension d(model). This makes it possible, for example, for tokens of similar meaning or use to be converted to points that are close in the real vector space.
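A toy illustration of an embedding lookup table (the token IDs and vectors are invented for the example; real embeddings are learned during training):

```python
# A hypothetical embedding table: each token ID maps to a d_model-dimensional vector.
embedding_table = {
    0: [0.1, 0.2, 0.3, 0.4],  # e.g. token "cat"
    1: [0.1, 0.2, 0.3, 0.5],  # e.g. token "dog" -- close to "cat" in vector space
    2: [0.9, 0.8, 0.7, 0.6],  # e.g. token "the"
}

def embed(token_ids):
    """Look up the embedding vector for each token ID in a sequence."""
    return [embedding_table[t] for t in token_ids]

def dist(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Tokens of similar use ("cat", "dog") lie closer together than dissimilar ones.
assert dist(embedding_table[0], embedding_table[1]) < dist(embedding_table[0], embedding_table[2])
```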

16
Q

Scaled Dot-Product Attention

A

(see notes for diagram + equation)
In the mechanism of scaled dot-product attention, queries and keys of dimension d(k) as well as values of dimension d(v) are used. In practice, the attention function works on a set of queries simultaneously, and hence the queries, keys, and values are stored in matrices Q, K, and V.
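A minimal plain-Python sketch of the computation, softmax(QK^T / sqrt(d_k)) V, on small lists (illustrative only; real implementations use matrix libraries):

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, on plain lists."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Score each key against the query, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

out = scaled_dot_product_attention([[1.0, 0.0]],               # one query
                                   [[1.0, 0.0], [0.0, 1.0]],   # two keys
                                   [[1.0, 2.0], [3.0, 4.0]])   # two values
# The query matches the first key better, so the output leans toward [1.0, 2.0].
```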

17
Q

Multi-head attention

A

(see notes + equations)
The advantage of multi-head attention is that the model can pay attention to information from different representation subspaces at different positions. This is inhibited by averaging if a single attention head is used.

18
Q

Applications of Attention

A
  1. In the encoder-decoder attention layers, the queries come from the previous decoder layer, while the keys and the values come from the output of the encoder. In this manner, all positions in the decoder can attend to all positions in the input sequence.
  2. The attention layers in the encoder are self-attention layers. In the encoder, all keys, values, and queries come from the output of the previous encoder layer. Hence, all positions in the encoder can attend to all positions in the previous layer in the encoder.
  3. Analogously to the attention layers in the encoder, the attention layers in the decoder are also self-attention layers, but with the slight difference that each position in the decoder attends only to positions in the decoder up to and including that position.
19
Q

self-attention

A

A neural network layer that transforms a sequence of embeddings (for instance, token embeddings) into another sequence of embeddings. Each embedding in the output sequence is constructed by integrating information from the elements of the input sequence through an attention mechanism.

The self part of self-attention refers to the sequence attending to itself rather than to some other context. Self-attention is one of the main building blocks for Transformers and uses dictionary lookup terminology, such as “query”, “key”, and “value”.

A self-attention layer starts with a sequence of input representations, one for each word. The input representation for a word can be a simple embedding. For each word in an input sequence, the network scores the relevance of the word to every element in the whole sequence of words. The relevance scores determine how much the word’s final representation incorporates the representations of other words.

20
Q

What does GPT stand for?

A

Generative Pre-trained Transformer

21
Q

What are the criteria the human labelers used to judge the model output of ChatGPT?

A

Helpful, harmless, honest

22
Q

Reinforcement learning

A

A family of algorithms that learn an optimal policy, whose goal is to maximize return when interacting with an environment.