1
Q

What does GPT stand for?

A

Generative Pre-trained Transformer

2
Q

How do LLMs like ChatGPT produce text?

A

They take in a passage, and predict the next word in the passage. They output a probability distribution for the next word, then sample from that distribution. Over and over and over.
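
A toy numpy sketch of one step of that loop (the hard-coded logits stand in for a real model's output over a tiny 4-word vocabulary; not how any actual GPT is implemented):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, then normalize to a distribution
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

def sample_next_token(logits, rng):
    # Turn raw model scores (logits) into probabilities, then sample one token
    probs = softmax(np.asarray(logits, dtype=float))
    return int(rng.choice(len(probs), p=probs))

# Toy "model": fixed logits over a 4-token vocabulary
rng = np.random.default_rng(0)
token = sample_next_token([2.0, 1.0, 0.5, -1.0], rng)
```

Generation is just this, repeated: append the sampled token to the passage and run the model again.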

3
Q

How would you turn a model like that into a chat bot?

A

Basically with a system prompt. So you pass something like the following into the model and ask it to complete it:

“What follows is an interaction between a user and a helpful AI assistant:

User: <user's message>

AI Assistant: _______”

4
Q

What is a token?

A

The unit of content that ChatGPT predicts one at a time. Words, pieces of words, or punctuation marks.

In other domains, it could be little chunks of an image, or little patches of a sound for audio processing.

5
Q

What is the core concept underlying transformers?

A

Self attention. Better understanding a particular part P of the input by learning to pay attention to other parts of the input to inform your understanding of P.

6
Q

What is a transformer layer essentially?

A

A transformer layer is basically just a layer which applies self attention to an input sequence, plus some additional frills for performance (though they are meaningful: for example, the MLP after each attention mechanism seems to be where ChatGPT stores facts).

So it receives some sort of embedding of every input in a sequence, and uses self attention to output new, better embeddings for that sequence.

7
Q

What is taking the dot product of two vectors?

A

Pairwise multiplying each entry in the vectors, then summing those all up to produce a scalar

It’s positive if they point in similar directions, 0 if they’re orthogonal, negative if they point in “opposite-ish” directions
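
A quick numpy check of both points:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

# Pairwise multiply, then sum: 1*3 + 2*4 = 11
dot = float(np.sum(a * b))

# Orthogonal vectors dot to zero
orthogonal = float(np.dot(np.array([1.0, 0.0]), np.array([0.0, 1.0])))
```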

8
Q

What is the first step in GPT-3 processing an input?

A

Pass all the “words”, i.e. tokens, through an initial learned embedding matrix

9
Q

What are the inner workings of a chatbot model like GPT-3 working with? What thing are they continually refining?

A

They’re refining those initial, “baseline” embeddings for each word to be richer, more “context-dependent word embeddings” as I call them.

For example, coloring “king” by the fact that it seems like a royal king, described in Shakespearean language, rather than a king on a chess board.

In these modern LLMs, these embeddings eventually get really, really rich with meaning, such that at the very end you can use just the final embedding to accurately predict the next word.

10
Q

What is the unembedding matrix at the end of GPT?

A

An embedding matrix takes the one-hot encoded vector of the word in the model’s vocabulary, and maps it to an embedding.

An unembedding matrix takes the last contextual word embedding in the sequence, which is what GPT uses to predict the next word, and maps it to a vector with length equal to the vocabulary size, where the scalars are logits for which word is best to predict next. That’s then passed through a softmax to get predictions.
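
A toy numpy sketch of the unembedding step, with made-up sizes and random weights standing in for a trained matrix:

```python
import numpy as np

vocab_size, embed_dim = 6, 4
rng = np.random.default_rng(0)

# Stand-in unembedding matrix: maps a final embedding to one logit per vocab word
W_unembed = rng.normal(size=(vocab_size, embed_dim))
final_embedding = rng.normal(size=embed_dim)

logits = W_unembed @ final_embedding          # shape: (vocab_size,)

# Softmax turns the logits into a next-word probability distribution
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```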

11
Q

What’s the main reason why GPT only uses the final word embedding to predict the next word, rather than all of the final embeddings for the whole input sequence?

A

It makes for more efficient training to have each word embedding in that final sequence used to predict the next word, so for each forward pass you get thousands of predictions you can backpropagate on

(Note that this says quite a lot about just how rich these embeddings get, like 3b1b explains. I like the “contextual embedding” outlook, but by the end in reality they get even richer than that: even if the last word is just “the” for example, you can successfully predict the next word from only that embedding!)

12
Q

What specifically is the goal of one attention block applied to word embeddings?

A

To compute the delta that needs to be applied to those embeddings in order to make them more contextually rich. It's computing the ΔE's that you add to the E's to get the updated E′'s.

13
Q

What are Q K and V

A

These matrices are just re-representations of the incoming embedding matrix, achieved by multiplying the incoming matrix by 3 weight matrices: W_Q, W_K, and W_V. They have the same number of tokens/columns, but a different dimension/number of rows.

14
Q

What is matrix Q conceptually? Or easier, what is its entry q for one word in the input?

A

The qs, or queries, can be thought of as “the input word asking a question, that it can use to better contextually embed itself.” Like “are there any adjectives modifying me?”

15
Q

What is K conceptually?

A

The ks, or keys, are the potential answers to the questions being asked by the queries. For example, maybe the key encodes "yes, I'm an adjective!" if the word is an adjective, providing an answer to the question.

16
Q

How do you determine how well each key matches each query?

A

Compute a dot product of every possible query-key pair, yielding a scalar for each pair. So you basically see how "similar" each query and key are.

So basically, a big matrix multiplication of the Q and K matrices.
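
A minimal numpy sketch of that matrix multiplication, using 3b1b's tokens-as-columns convention with made-up dimensions and random stand-in weights:

```python
import numpy as np

rng = np.random.default_rng(1)
d_embed, d_k, n_tokens = 8, 4, 5

# Tokens as columns, following 3b1b's convention
E = rng.normal(size=(d_embed, n_tokens))
W_Q = rng.normal(size=(d_k, d_embed))
W_K = rng.normal(size=(d_k, d_embed))

Q = W_Q @ E   # one query per token (columns)
K = W_K @ E   # one key per token (columns)

# Entry (i, j) is the dot product of key i with query j
scores = K.T @ Q
```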

17
Q

What does it mean for a k to “attend to” a q?

A

It means that the network realizes it should have attention on the word corresponding to k when interpreting the word corresponding to q

18
Q

What do we do to K^T * Q, the dot products of all the query and key vectors?

A

We pass them through a softmax (within the transformer block, not just at the end of the network), so basically the dot products that are negative and around zero go to nothing, and the large positive magnitudes are the ones that matter. These are the ones that show words that “attend to” other words.

The softmax is over all keys, for each query. So as per 3b1b it’s a column-wise softmax. For each query, you’re making a vector showing how much each key attends to that query.

There’s also a simple scaling term used here for stability. You divide by the sqrt of the dimension of the q and k vectors.
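
Sketching the scaling and column-wise softmax in numpy (random stand-in Q and K, made-up sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
d_k, n_tokens = 4, 5
K = rng.normal(size=(d_k, n_tokens))
Q = rng.normal(size=(d_k, n_tokens))

scores = K.T @ Q / np.sqrt(d_k)   # divide by sqrt(d_k) for stability

# Column-wise softmax: each query's column becomes a distribution over keys
exps = np.exp(scores - scores.max(axis=0, keepdims=True))
attention = exps / exps.sum(axis=0, keepdims=True)
```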

19
Q

In 3b1b’s and Jay Alammar’s explanations, does the data flowing through the network have words/tokens along rows, or columns?

A

Columns. Hits my brain weird but it’s the way they do it

In the original Attention is All You Need paper, they do rows. But in these flashcards I generally do columns, because I used 3b1b’s videos (and also cuz it seems kinda nice I think)

20
Q

This isn’t a real flashcard, but for reference, I want to link a note I have about an error in 3b1b’s videos that tripped me up in understanding this stuff: https://docs.google.com/document/d/1ahiSnxsoEKXEe1Pq-gYK3AsVFMgykHZoE7czyy0qO4s/edit?tab=t.0

21
Q

What is masked self attention?

A

You stop the network from having later words attend to earlier words. In the case of GPT-3, this is useful so every word in a batch can be a training example for next-word prediction, without having the model “cheat by looking later in the sequence”

22
Q

How is masked self attention accomplished computationally?

A

Before you apply the column-wise softmax to (K^T * Q), you set all the entries in the matrix corresponding to a key at index beyond the query’s index to be negative infinity. So when you pass it through softmax, the attentiveness on those later values becomes zero.
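
A numpy sketch of the masking trick, with random scores standing in for a real, already-scaled K^T * Q:

```python
import numpy as np

n_tokens = 4
rng = np.random.default_rng(3)
scores = rng.normal(size=(n_tokens, n_tokens))   # stand-in for scaled K^T * Q

# Row i = key i, column j = query j: mask any key that comes after its query
mask = np.tril(np.ones((n_tokens, n_tokens), dtype=bool), k=-1)  # True where i > j
masked = np.where(mask, -np.inf, scores)

# Column-wise softmax: the -inf entries become exactly zero attention
exps = np.exp(masked - masked.max(axis=0, keepdims=True))
attention = exps / exps.sum(axis=0, keepdims=True)
```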

23
Q

How does the complexity of a transformer block scale with context size? Where in the transformer operation does this come from?

A

Quadratically. It comes from K^T * Q: multiplying a (context_size x key/query shared dim) matrix by a (key/query shared dim x context_size) matrix yields a (context_size x context_size) attention matrix.

24
Q

What is V conceptually?

A

Using Q and K, we’ve determined which words are relevant to which other words. Now we need to use that to update embeddings.

So now, in a basic conceptual sense, we need to know, “if word A is relevant to this other word B, how do we update the embedding for word B based on word A?”

By multiplying the input words in X by M_V to get V, we basically answer, “if word x is relevant to some other word, how should we update that other word?” V encodes the answer to that question for each input x. So it’s not “how should x be updated?”, it’s “if x is relevant to something else, how should that something else be updated?”

So each value v can be thought of as being associated with a key k, not a query q. Because they’re associated with the word that does-the-informing-wrt-the-word-being-updated.

25
Q

Alright, so we've got the two pieces to finish off the attention mechanism. You've got V, where each column says "if this column's word is relevant to something, how should that something be updated?" And you've got the (context x context) sized matrix product that came out of K and Q, which shows, pair-wise, "does the word in row i attend to the word in column j? Should row i's word cause an update to the word at index j?" How do you take these two components and produce our final attention output?

A

To start, you multiply them together. By turning a column of the attention matrix on its side and summing all the columns of V according to those values, what you're doing is taking the softmax over what is important to query q, and using those coefficients to take a weighted sum of all the values in V, which are statements on how to update the embedding of the word associated with q when the words associated with those values attend to it. So exactly what we want!

That final matrix multiplication gives the *deltas* we want for the embeddings, so we add that result to the original embeddings to get our output.
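
Putting the whole single-head update together as a numpy sketch (tokens as columns; random matrices with made-up sizes stand in for learned ones, and a full-size W_V stands in for the decomposed value map):

```python
import numpy as np

rng = np.random.default_rng(4)
d_embed, d_k, n_tokens = 8, 4, 5

E = rng.normal(size=(d_embed, n_tokens))      # tokens as columns
W_Q = rng.normal(size=(d_k, d_embed))
W_K = rng.normal(size=(d_k, d_embed))
W_V = rng.normal(size=(d_embed, d_embed))     # full-rank stand-in for the value map

Q, K, V = W_Q @ E, W_K @ E, W_V @ E

scores = K.T @ Q / np.sqrt(d_k)
exps = np.exp(scores - scores.max(axis=0, keepdims=True))
A = exps / exps.sum(axis=0, keepdims=True)    # column-wise softmax

deltas = V @ A            # weighted sums of value columns, one delta per token
E_out = E + deltas        # residual update: add the deltas to the embeddings
```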
26
Q

So what is the final formula for the attention update?

A

embeddings = embeddings + V * softmax(K^T * Q / sqrt(d_k))

where d_k is the dimension of the q and k vectors, the softmax is column-wise, and (for GPT-style models) the masked entries are set to negative infinity before the softmax.
27
Q

How is M_V stored?

A

Rather than a giant (embed_dim x embed_dim) matrix, it is stored as a low-rank representation of a (embed_dim x embed_dim) matrix, in the form of two matrices of size (embed_dim x smaller_size) and (smaller_size x embed_dim). In the case of GPT-3, the smaller_size there happens to be the same as the dimension of the q's and k's. 3b1b calls these two matrices the value_down and value_up matrices.
28
Q

In short, how is cross attention different from the self attention described here for GPT-3?

A

There are two different strings of tokens, and you do attention across them, rather than attending from one string to...that same string. So for language translation, you see which French words attend to which English words, for example.
29
Q

What is multi headed attention?

A

In multi-headed attention, you simply do attention several distinct times within an attention block, *kinda* like a convolutional layer with multiple output channels. A single attention head, like we've discussed here, learns one way of understanding attention: using M_Q, M_K, and the decomposed M_V, it learns one way to update the embedding of a token based on the other tokens in its context. But there could be several ways of doing this. Maybe one way is to have the query say "what adjectives are modifying what nouns?" and another say "are there other proper nouns that add color to the context of this proper noun?", etc. Each head of attention within the block gets its *own* M_Q, M_K, and decomposed M_V. You're just doing attention several times over.
30
Q

How does multi headed attention alter a token embedding?

A

You just take the delta in the embedding output by each head of attention, and add *all* of them to the embedding, all at once. So for example, 96 different instances of M_Q, M_K, and (decomposed) M_V yield 96 different deltas, which are all summed at the end of the multi-headed attention block and added to the initial embeddings that came into that block.
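
A numpy sketch of summing per-head deltas (random matrices with made-up sizes stand in for learned ones, and a full-size M_V stands in for the decomposed version):

```python
import numpy as np

rng = np.random.default_rng(8)
d_embed, d_k, n_tokens, n_heads = 8, 4, 5, 3

E = rng.normal(size=(d_embed, n_tokens))   # tokens as columns

def head_delta(E, rng):
    # Each head gets its *own* M_Q, M_K, and (full-size stand-in) M_V
    M_Q = rng.normal(size=(d_k, d_embed))
    M_K = rng.normal(size=(d_k, d_embed))
    M_V = rng.normal(size=(d_embed, d_embed))
    scores = (M_K @ E).T @ (M_Q @ E) / np.sqrt(d_k)
    exps = np.exp(scores - scores.max(axis=0, keepdims=True))
    A = exps / exps.sum(axis=0, keepdims=True)   # column-wise softmax
    return (M_V @ E) @ A

# Sum every head's delta, then add the total to the incoming embeddings
E_out = E + sum(head_delta(E, rng) for _ in range(n_heads))
```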
31
Q

3b1b says there is a "value matrix" M_V, decomposed into "value up" and "value down" matrices. How are these slightly differently represented and used in much ML literature and in practical implementation?

A

Across multi-headed attention, all the value up matrices are actually stapled together and referred to as a single "output" matrix, and the value down matrices are referred to as just "value" matrices. A possible source of confusion and parlance difference to be aware of.

Furthermore: we're thinking of v = M_V_up * M_V_down * x, and we multiply the v's by the attention softmax weights we got from Q and K. In practice, you multiply just (M_V_down * x) by those attention weights, then up-project *that* with M_V_up. This is mathematically equivalent to doing it the straightforward way, but it saves on some computation because (M_V_down * x) is smaller than (M_V_up * M_V_down * x).
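
A numpy check that the two orders of operations agree (all matrices here are random stand-ins with made-up sizes):

```python
import numpy as np

rng = np.random.default_rng(5)
d_embed, d_small, n_tokens = 8, 3, 5

X = rng.normal(size=(d_embed, n_tokens))        # token embeddings as columns
A = rng.normal(size=(n_tokens, n_tokens))       # stand-in attention weights
value_down = rng.normal(size=(d_small, d_embed))
value_up = rng.normal(size=(d_embed, d_small))

# Straightforward order: build full-size values, then weight them
straightforward = (value_up @ (value_down @ X)) @ A

# Practical order: weight the small down-projections, then up-project once
efficient = value_up @ ((value_down @ X) @ A)
```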
32
Q

What is GELU?

A

A slightly modified, smoother ReLU that models sometimes use
33
Q

What is a transformer model's temperature?

A

A hyperparameter you can set that kinda determines how "creative" the model will be. More specifically, it determines how likely or unlikely the model is to predict values that aren't at the top of the probability distribution. Higher temperature means a higher likelihood of these lower-down values.
34
Q

How is temperature implemented computationally?

A

It's done within the final softmax function. Where you would do e^logit for each logit and then sum them, you instead do e^(logit/T) for the temperature value T. So if T=1, it's just a normal softmax, but if T is high, it brings the extreme positive values closer to the more middling positive values. T=0 means always picking the most likely outcome, but that would be dividing by zero, so we create this behavior with a simple if-else statement: "if T=0, just pick the most likely word; else, do the probability computation as normal."
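
A sketch of that logic in Python (sample_with_temperature is a hypothetical helper, not a real library function):

```python
import numpy as np

def sample_with_temperature(logits, T, rng):
    logits = np.asarray(logits, dtype=float)
    if T == 0:
        # Avoid dividing by zero: T=0 means greedy decoding
        return int(np.argmax(logits))
    scaled = logits / T                      # e^(logit/T) instead of e^logit
    exps = np.exp(scaled - scaled.max())     # subtract max for stability
    probs = exps / exps.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(6)
greedy = sample_with_temperature([1.0, 3.0, 0.5], T=0, rng=rng)  # always index 1
```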
35
Q

In a transformer block such as GPT-3's, at a high level, what happens to the data after it goes through the attention block?

A

It flows through the MLP block. (Note on ordering: GPT-3, like GPT-2, uses "pre-norm" blocks, so layer norm is applied before each sublayer: layer norm -> attention -> residual add -> layer norm -> MLP -> residual add. For GPT-3's exact architecture, see https://dugas.ch/artificial_curiosity/GPT_architecture.html and https://dugas.ch/artificial_curiosity/img/GPT_architecture/fullarch.png.)
36
Q

What is the structure of the MLP block?

A

Linear layer -> ReLU -> Linear -> add that result to the input of the first linear layer.

The first linear layer significantly increases the dimension, and the second brings it back down to where it was. (So, similar to how the attention block computes a delta, this block is also computing a delta that is added to its input.)
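
A numpy sketch of that structure, with made-up dimensions and random untrained weights:

```python
import numpy as np

rng = np.random.default_rng(7)
d_embed, d_hidden = 8, 32                     # hidden layer much wider than embed

W1, b1 = rng.normal(size=(d_hidden, d_embed)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_embed, d_hidden)), np.zeros(d_embed)

def mlp_block(x):
    h = np.maximum(0.0, W1 @ x + b1)   # linear up-projection, then ReLU
    delta = W2 @ h + b2                # linear projection back down to embed_dim
    return x + delta                   # residual: add the delta to the input

x = rng.normal(size=d_embed)
y = mlp_block(x)
```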
37
Q

Do Q, K, and V, and their application to incoming data, use bias vectors?

A

Not typically, as per http://ai.stackexchange.com/questions/40252/why-are-biases-typically-not-used-in-attention-mechanism and 3b1b's video.

The MLP block after the attention block uses biases, naturally, as it's an MLP.
38
Q

What is the other major component of a transformer block in GPT-3?

A

Layer norm. (Layer norm is described in more detail in the deep learning deck.) In GPT-3 the layer norms actually come *before* the attention block and the MLP block ("pre-norm"), with one final layer norm after the last transformer block; see https://dugas.ch/artificial_curiosity/GPT_architecture.html.
39
Q

What is BERT generally?

A

BERT is a language model that takes a phrase as input and returns context-dependent word embeddings for each of the words, as well as context-dependent embeddings for the start and end tokens which are placed at the beginning and end of the input. ("Context-dependent word embeddings" is how I think about it.)

"Base" embeddings are size 768; "large" embeddings are size 1024.
40
Q

How does SBERT generally work? How does it relate to BERT, and how can it go from word embeddings to sentence embeddings?

A

SBERT is essentially a fine-tuned version of BERT that pools word embeddings to create good sentence embeddings. For a given input phrase, BERT's output is an embedding for each word, plus the start and end tokens. These can be "pooled" to make sentence embeddings: one common option is simply to output the embedding of the start token as your sentence embedding, and another is to take the average of all the embeddings. SBERT automatically does one of these based on the version (so its output on a given phrase is a single sentence embedding vector), and it has been fine-tuned to be good at this specifically.
41
Q

Different versions of SBERT can and have been trained using different methods, but what is one key method that has been used?

A

You have SBERT embed two sentences, then use something simple like cosine similarity to calculate the sentences' similarity based on the sentence embeddings, and then you compare this to a label you have between 0 and 1 showing how similar they actually are.
42
Q

How could BERT be used to build a search engine?

A

Suppose we're using it for sentence/phrase/paragraph embeddings specifically, getting them from the embedding of the start token. We can use this for search engines: get an embedding of the text from every page on the internet, get an embedding for the phrase the user typed into the search engine, and return pages whose embeddings are similar to the query's embedding.
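
A toy sketch of the retrieval idea, with hypothetical hand-made 2-D "embeddings" and page names standing in for real BERT outputs:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product normalized by the vectors' lengths
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical page embeddings (would come from BERT's start-token output)
pages = {"page_a": np.array([1.0, 0.0]), "page_b": np.array([0.0, 1.0])}
query = np.array([0.9, 0.1])   # embedding of the user's search phrase

# Return the page whose embedding is most similar to the query's embedding
best = max(pages, key=lambda name: cosine_similarity(pages[name], query))
```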
43
Q

How could BERT be used to make a classifier like a spam filter or a fact-checker?

A

Input your documents to BERT and get a sentence embedding for each, then train a few additional layers to predict your outcome variable based on those embeddings. If you have lots of data, you could also fine-tune BERT itself.