ML: Transformers Flashcards

(55 cards)

1
Q

Why were Transformers introduced?

A

Solve RNN limitations: sequential computation, vanishing gradients, long-range dependencies

Enable parallelization over sequence elements

Better scaling to large datasets

2
Q

How do Transformers achieve parallelization?

A

Self-attention computes relationships between all tokens simultaneously

Unlike RNNs, no sequential recurrence → GPU/TPU friendly

Batch computations are straightforward

3
Q

What are the main limitations of RNNs/LSTMs that Transformers overcome?

A

Sequential computation → slow

Difficulty with long-range dependencies

Vanishing/exploding gradients (mitigated in Transformers by residual connections, layer normalization for stable activations, and careful weight initialization)

Harder to scale to very large datasets

4
Q

Explain self-attention.

A

Computes attention scores between every token pair in a sequence

Weighted sum of values based on query-key similarity

Captures dependencies regardless of distance
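A minimal NumPy sketch of single-head scaled dot-product self-attention; all names and sizes are illustrative, and the identity projection matrices are just for the demo:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (n, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))          # 4 tokens, d_model = 8
W = np.eye(8)                        # identity projections, just for the demo
out = self_attention(X, W, W, W)
print(out.shape)                     # (4, 8)
```

Every output row is a mixture of all value rows, which is why distance between tokens does not matter.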

5
Q

Why is positional encoding needed?

A

Self-attention is permutation invariant → needs explicit position info

Positional encodings add order information to input embeddings

6
Q

Types of positional encoding?

A

Sinusoidal (fixed) → allows extrapolation

Learned embeddings → trainable

Relative positional encodings → distance-aware attention
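A toy sketch of the fixed sinusoidal variant for a single position (the 10000 base follows the original Transformer convention; the helper name is my own):

```python
import math

def sinusoidal_pe(pos, d_model):
    """Fixed sinusoidal encoding for one position: sin/cos pairs at geometric frequencies."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe[:d_model]

print(sinusoidal_pe(0, 4))   # [0.0, 1.0, 0.0, 1.0]
```

Because the encoding is a fixed function of position, it can be evaluated for positions longer than any seen in training, which is the extrapolation point above.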

7
Q

What are the main components of a Transformer encoder?

A

Input embedding + positional encoding

Multi-head self-attention layer

Feedforward network

Residual connections + layer normalization

8
Q

What are the main components of a Transformer decoder?

A

Masked multi-head self-attention (prevents looking ahead)

Encoder-decoder attention

Feedforward network

Residual connections + layer normalization

9
Q

What is multi-head attention and why is it useful?

A

Multiple attention heads learn different relationships/subspaces

Concatenated and projected → richer representations

Helps model complex dependencies

10
Q

Difference between encoder-only, decoder-only, and encoder-decoder models?

A

Encoder-only (BERT): classification, embedding tasks

Decoder-only (GPT): autoregressive generation

Encoder-decoder (T5, BART): sequence-to-sequence tasks

11
Q

Why is layer normalization used instead of batch normalization?

A

Sequences have variable lengths and batch statistics shift with batch composition and padding → batch norm unstable

Layer norm normalizes per sequence element → stable training
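A minimal pure-Python sketch of what layer norm does to one token's feature vector (learnable scale/shift omitted for brevity):

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token's feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

y = layer_norm([1.0, 2.0, 3.0, 4.0])
print([round(v, 3) for v in y])
```

The statistics come only from this one vector, so the result is identical regardless of batch size or sequence length.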

12
Q

What are key training challenges for Transformers?

A

Quadratic attention complexity (memory/compute)

Need massive datasets

Overfitting if small data

Optimization instability → mitigated by Adam, learning rate warmup, gradient clipping

13
Q

How do modern Transformers improve training efficiency?

A

Sparse attention / efficient attention (Longformer, BigBird)

Mixed precision / float16

Gradient checkpointing

Model parallelism & pipeline parallelism

FlashAttention and memory-efficient kernels

14
Q

What is transfer learning in Transformers?

A

Pretrain on large corpora → fine-tune on downstream tasks

Often freezes lower layers to reduce computation and avoid overfitting

15
Q

What are the building blocks of LLMs?

A

Decoder-only Transformers (GPT, LLaMA) for autoregressive text

Large embedding matrices

Positional encodings

Layer normalization, feedforward networks, multi-head attention

16
Q

How do Transformers enable RAG (Retrieval-Augmented Generation)?

A

Query embedded → nearest document vectors retrieved

Retrieved context concatenated → fed to decoder

Decoder generates answer conditioned on retrieved information
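The retrieval step above can be sketched as nearest-neighbor search over embeddings. A toy version with cosine similarity; the documents and all embedding vectors here are made up for the demo:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy document index; the embeddings are invented for illustration
docs = {
    "Refunds are processed within 14 days.": [0.9, 0.1, 0.0],
    "Standard shipping takes 3-5 days.":     [0.1, 0.9, 0.2],
}
query_emb = [0.8, 0.2, 0.1]   # pretend embedding of "How do refunds work?"
best = max(docs, key=lambda d: cosine(query_emb, docs[d]))
prompt = f"Context: {best}\n\nQuestion: How do refunds work?"
print(best)
```

In a real pipeline the embeddings come from an encoder model and the index is approximate (e.g., a vector database), but the shape of the computation is the same: embed, retrieve, concatenate, generate.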

17
Q

What problems do RAG pipelines solve?

A

Memory limitation of LLMs → can access external knowledge

Reduces hallucinations by grounding output in retrieved docs

Scalable knowledge updates without full model retraining

18
Q

How do embeddings interact with Transformers in agentic AI?

A

Embeddings used as queries/keys/values for retrieval

Clustered or indexed for efficient tool routing

Semantic similarity for planning / multi-step reasoning

19
Q

What are key improvements over vanilla Transformer?

A

Sparse / linear attention → Longformer, Performer

Relative positional encodings → improves long-range modeling

Adapter layers → efficient fine-tuning

Memory-augmented models → Retrieval-Enhanced Transformers

20
Q

How does attention scale with sequence length?

A

Standard self-attention: O(n²·d) time/memory

Sparse / local attention → roughly O(n·d) for long sequences
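The quadratic term is easy to feel with a back-of-envelope calculation; the function below is an illustrative estimate of just the attention-weight matrices for one layer (head count and fp16 width are assumptions for the example):

```python
def attn_weights_bytes(n_tokens, n_heads=12, dtype_bytes=2):
    """Memory for the full (n x n) attention-weight matrices of one layer, fp16."""
    return n_heads * n_tokens * n_tokens * dtype_bytes

# Quadratic blow-up: 10x more tokens -> 100x more memory
for n in (1_024, 10_240):
    print(n, attn_weights_bytes(n) / 2**20, "MiB")
```

Doubling the sequence length quadruples this cost, which is why long-context models switch to sparse, local, or memory-efficient attention.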

21
Q

What is instruction-tuning in LLMs?

A

Fine-tuning on human instructions → aligns model behavior

Often combined with RLHF (Reinforcement Learning from Human Feedback)

Improves agentic reasoning and task compliance

22
Q

How do Transformers handle multi-modal inputs?

A

Use modality-specific embeddings

Cross-attention layers combine text, images, audio

Examples: Flamingo, GPT-4 multi-modal, PaLI

23
Q

Sequence length is very long and self-attention uses too much memory. What do you do?

A

Use sparse attention or memory-efficient attention

Truncate or chunk sequences

Use Longformer / Performer / BigBird

Gradient checkpointing to save memory

24
Q

LLM generates incorrect facts. How does RAG help?

A

Retrieves up-to-date documents

Conditions generation on external knowledge

Reduces hallucination

25
Q

Training loss plateaus with a larger model. Possible solutions?

A

Learning rate schedule / warmup

Better initialization

Larger batch size or gradient accumulation

Mixed precision for efficiency

Layer-wise learning rate decay

26
Q

You want the model to handle both text and images. What architecture changes?

A

Embed image patches (ViT-like)

Cross-attention between modalities

Multi-modal decoder or unified encoder-decoder

Positional encodings for both modalities

27
Q

Attention scores cluster around 0.5 for all tokens. What could be wrong?

A

Queries/keys not properly scaled

Input embeddings lack variation

Softmax temperature issues

Check layer normalization or initialization

28
Q

An agentic AI must plan multi-step tasks using LLMs. How do Transformers help?

A

Autoregressive decoding models multi-step reasoning

Attention captures dependencies across steps

Can integrate retrieved knowledge at each step (RAG)

Memory modules track intermediate states
29
Q

What is fine-tuning in the context of LLMs?

A

Adjusting the weights of a pre-trained model on a smaller, task-specific dataset to specialize the model for a downstream task

30
Q

Why is fine-tuning often preferred over training from scratch?

A

Reduces data and compute requirements

Leverages pre-trained representations

Faster convergence and better generalization

Avoids overfitting when labeled data is small

31
Q

Name common strategies for fine-tuning large models.

A

Full model fine-tuning

Adapter layers / LoRA (low-rank adaptation)

Prefix-tuning / prompt-tuning

Freeze early layers, train later layers only

32
Q

How do you evaluate fine-tuned models?

A

Task-specific metrics (accuracy, F1, BLEU, ROUGE, etc.)

Human evaluation for subjective tasks

Cross-validation or hold-out validation

33
Q

When is fine-tuning preferable to RAG?

A

Task is small and narrow-domain

High accuracy required on a specific dataset

Model must produce output without external retrieval (low-latency scenarios)

34
Q

When is RAG preferable to fine-tuning?

A

Knowledge changes frequently

Dataset is too large to fine-tune on

Multi-domain tasks or open-domain QA

Reduces compute/storage costs
35
Q

Can fine-tuning and RAG be combined?

A

Yes:

Fine-tune the model for alignment, instruction-following, or task adaptation

Use RAG to provide up-to-date knowledge dynamically

Often improves factuality and flexibility

36
Q

How does model size affect fine-tuning?

A

Large models → expensive, memory-heavy

Parameter-efficient methods (LoRA, adapters) reduce cost

37
Q

How do you avoid catastrophic forgetting during fine-tuning?

A

Use small learning rates

Freeze early layers

Replay / mixed data

Regularization (e.g., L2)

38
Q

What are key challenges of RAG pipelines?

A

Retriever quality → affects answer grounding

Context length → long documents may exceed the model input

Latency of retrieval + generation

Outdated or low-quality documents → hallucinations

39
Q

You need to update a chatbot daily with the latest policies. Fine-tuning or RAG? Why?

A

RAG → updating the document index is faster and cheaper than daily model retraining

Fine-tuning daily is costly and impractical
40
Q

A legal model needs high precision on a small dataset. Fine-tuning or RAG?

A

Fine-tuning → small, domain-specific data, high accuracy needed

RAG may introduce retrieval errors

41
Q

LLM provides outdated info. How do you fix it without retraining?

A

Use RAG → add/update documents in the retrieval database

Model stays frozen; no retraining needed

42
Q

What is LoRA in the context of fine-tuning LLMs?

A

Low-Rank Adaptation (LoRA): a parameter-efficient fine-tuning method where only low-rank updates to the weight matrices are trained, while the original model weights remain frozen
43
Q

How does LoRA reduce memory and compute cost?

A

Adds small low-rank matrices A, B alongside the pre-trained weights W0

Only A, B are updated during training → far fewer trainable parameters

Avoids computing and storing gradients for the full model

44
Q

What types of layers are typically adapted with LoRA?

A

Linear projection layers (e.g., query/key/value projections in attention)

Feedforward layers in the MLP blocks of Transformers

Layers most sensitive to task adaptation

45
Q

Why is LoRA effective for LLM fine-tuning?

A

Large models have redundant capacity → low-rank updates suffice for new tasks

Preserves general pre-trained knowledge → reduces catastrophic forgetting

Efficient storage → multiple task adapters can be stored separately
46
Q

Write the LoRA weight update formula.

A

W = W0 + ΔW, where ΔW = AB (A: d×r, B: r×k, rank r ≪ min(d, k))
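The update can be sketched in a few lines of NumPy; sizes are toy values, and the zero-initialized factor (so that ΔW = 0 at the start of training) follows common LoRA practice:

```python
import numpy as np

d, k, r = 64, 64, 4                    # toy sizes; rank r << min(d, k)
rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, k))           # frozen pre-trained weight
A = rng.normal(size=(d, r)) * 0.01     # trainable low-rank factor
B = np.zeros((r, k))                   # trainable, zero-init -> dW = 0 at start
W = W0 + A @ B                         # effective weight W = W0 + AB

print(np.allclose(W, W0))                        # True at initialization
print(A.size + B.size, "trainable vs", W0.size)  # 512 trainable vs 4096
```

Only A and B receive gradients; W0 never changes, which is why adapters are cheap to store and swap.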
47
Q

How is rank r chosen in LoRA?

A

Small r → fewer parameters, faster training, may underfit

Larger r → more expressive, slower and more memory

Tune empirically based on task complexity and data size

48
Q

How does LoRA compare to full fine-tuning in storage?

A

Full fine-tuning: stores a full model per task (~billions of params)

LoRA: stores only low-rank adapters (~few million params)

Multiple adapters can share a single frozen backbone

49
Q

Can LoRA be combined with other parameter-efficient methods?

A

Yes, e.g.:

LoRA + prompt-tuning

LoRA + adapters

LoRA + prefix-tuning

Useful for multi-task learning and modular AI agents

50
Q

When is LoRA particularly useful?

A

Large models (GPT, LLaMA, Falcon) where full fine-tuning is expensive

Multiple task-specific adaptations

Situations with limited compute or memory
51
Q

How do you evaluate LoRA adapters?

A

Standard downstream task metrics (accuracy, F1, BLEU, ROUGE)

Compare to the baseline pre-trained model without adapters

Optionally compare to full fine-tuning

52
Q

Can LoRA adapters be swapped or shared across tasks?

A

Yes, because the base weights remain frozen

Adapters are modular → can dynamically switch tasks without retraining the backbone

53
Q

You fine-tune an LLM on domain-specific QA using LoRA, but it underperforms. What can you do?

A

Increase the LoRA rank r

Train more epochs or increase the learning rate

Check whether the adapted layers are appropriate (attention vs feedforward)

Combine with instruction-tuning or RAG for additional context

54
Q

You want to support multiple domains on one LLM efficiently. How does LoRA help?

A

Store one frozen backbone

Train small adapters for each domain

Load the corresponding adapter at inference → low storage and compute
55
Why is masking needed?

A

Padding masks - handling variable lengths

Sentences in a batch vary in length, so shorter ones are padded with a special token to match the longest. The problem: attention does not know the pads are filler, and would compute relationships between real words and padding tokens. The solution: a padding mask sets the attention scores at padded positions to $-\infty$ before the softmax, so the padding has zero influence on the output.

Causal (look-ahead) masks - preventing cheating

When predicting the next word, the model must not see the words after the current position. The problem: during training it receives the entire finished sentence, and without a mask it would simply "look ahead" and copy the next token rather than learn to predict it. The solution: a triangular causal mask blocks, for the word at position $i$, all positions $i+1, i+2, \dots, n$. Mathematically, an additive mask matrix $M$ (zero where attention is allowed, $-\infty$ where blocked) is added to the scores before the softmax.
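Both mask types reduce to the same mechanism: add a matrix of 0 / -inf to the scores before the softmax. A sketch of the causal case with NumPy (toy 3-token sequence, uniform raw scores):

```python
import numpy as np

def causal_mask(n):
    """Additive mask M: 0 where attention is allowed, -inf above the diagonal."""
    return np.triu(np.full((n, n), -np.inf), k=1)

scores = np.zeros((3, 3))                      # pretend raw attention scores
masked = scores + causal_mask(3)               # add M before the softmax
w = np.exp(masked - masked.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
print(w)   # row i attends only to positions 0..i
```

exp(-inf) is exactly 0, so blocked positions get zero weight while each row still sums to 1; a padding mask works identically but puts -inf at padded columns instead of future ones.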