Why were Transformers introduced?
To overcome RNN limitations: sequential computation, vanishing gradients, poor long-range dependency handling
Enable parallelization over sequence elements
Better scaling to large datasets
How do Transformers achieve parallelization?
Self-attention computes relationships between all tokens simultaneously
Unlike RNNs, no sequential recurrence → GPU/TPU friendly
Batch computations are straightforward
What are the main limitations of RNNs/LSTMs that Transformers overcome?
Sequential computation → slow
Difficulty with long-range dependencies
Vanishing/exploding gradients (Transformers mitigate these via residual connections, layer normalization, and careful weight initialization)
Harder to scale to very large datasets
Explain self-attention.
Computes attention scores between every token pair in a sequence
Weighted sum of values based on query-key similarity
Captures dependencies regardless of distance
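The three points above can be sketched in pure Python as scaled dot-product attention; the vectors and Q = K = V setup are toy assumptions (a real layer uses learned projection matrices):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention over lists of d-dimensional vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        # Score this query against every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # one weight per token, regardless of distance
        # Output is the attention-weighted sum of value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three 2-d tokens attending to each other (toy numbers, no learned projections).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(X, X, X)
```

Every query is scored against every key in one pass, which is what makes the computation parallelizable but also quadratic in sequence length.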
Why is positional encoding needed?
Self-attention is permutation invariant → needs explicit position info
Positional encodings add order information to input embeddings
Types of positional encoding?
Sinusoidal (fixed) → allows extrapolation
Learned embeddings → trainable
Relative positional encodings → distance-aware attention
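The fixed sinusoidal variant from the list above can be sketched as follows (the 10000 base is the constant from the original Transformer paper; dimensions here are toy values):

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Position -> d_model-dim vector; even dims use sin, odd dims use cos."""
    pe = []
    for i in range(d_model):
        # Frequency decreases geometrically with the dimension index.
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Deterministic, so it can extrapolate to positions never seen in training.
pe0 = sinusoidal_encoding(0, 8)    # position 0 -> [0, 1, 0, 1, ...]
pe50 = sinusoidal_encoding(50, 8)  # any position yields a valid encoding
```

These vectors are added element-wise to the input embeddings before the first attention layer.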
What are the main components of a Transformer encoder?
Input embedding + positional encoding
Multi-head self-attention layer
Feedforward network
Residual connections + layer normalization
What are the main components of a Transformer decoder?
Masked multi-head self-attention (prevents looking ahead)
Encoder-decoder attention
Feedforward network
Residual connections + layer normalization
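The "prevents looking ahead" part of masked self-attention comes down to a lower-triangular mask; a minimal sketch:

```python
def causal_mask(n):
    """n x n mask: True where token i may attend to token j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

mask = causal_mask(4)
# Row 0 sees only itself; row 3 sees all four positions up to and including itself.
```

In practice the disallowed positions are set to -inf in the score matrix before the softmax, so their attention weights become exactly zero.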
What is multi-head attention and why is it useful?
Multiple attention heads learn different relationships/subspaces
Concatenated and projected → richer representations
Helps model complex dependencies
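The split-then-concatenate mechanics above can be sketched with plain lists (toy dimensions; a real implementation also applies learned projections per head and a final output projection):

```python
def split_heads(tokens, n_heads):
    """Split each token's d_model vector into n_heads slices of d_model/n_heads."""
    d = len(tokens[0])
    assert d % n_heads == 0, "d_model must be divisible by the number of heads"
    hd = d // n_heads
    # heads[h][t] is token t's feature slice for head h.
    return [[tok[h * hd:(h + 1) * hd] for tok in tokens] for h in range(n_heads)]

def concat_heads(heads):
    """Inverse operation: concatenate per-head outputs back to d_model per token."""
    n_tokens = len(heads[0])
    return [sum((head[t] for head in heads), []) for t in range(n_tokens)]

tokens = [[float(i) for i in range(8)] for _ in range(3)]  # 3 tokens, d_model=8
heads = split_heads(tokens, 2)  # 2 heads, 4 dims each; round-trips losslessly
```

Each head runs attention independently on its slice, so different heads can specialize in different relationships at no extra total cost.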
Difference between encoder-only, decoder-only, and encoder-decoder models?
Encoder-only (BERT): classification, embedding tasks
Decoder-only (GPT): autoregressive generation
Encoder-decoder (T5, BART): sequence-to-sequence tasks
Why is layer normalization used instead of batch normalization?
Batch norm statistics depend on batch composition and padding → unstable for variable-length sequences
Layer norm normalizes each token's features independently of the batch → stable training
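A minimal sketch of the per-token normalization (omitting the learned scale and shift parameters a real layer adds):

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token's feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    # eps guards against division by zero for near-constant vectors.
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Applied per token, so batch size and sequence length are irrelevant.
h = layer_norm([2.0, 4.0, 6.0, 8.0])
```

Because the statistics come from a single vector, the operation behaves identically at training and inference time, unlike batch norm.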
What are key training challenges for Transformers?
Quadratic attention complexity (memory/compute)
Need massive datasets
Overfitting if small data
Optimization instability → mitigated by Adam, learning rate warmup, gradient clipping
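The warmup schedule mentioned above is simple to state; this is the schedule from the original Transformer paper (default d_model and warmup values shown, both tunable):

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """LR rises linearly for `warmup` steps, then decays as 1/sqrt(step)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

early = transformer_lr(100)     # small during warmup: avoids early instability
peak = transformer_lr(4000)     # maximum at the end of warmup
late = transformer_lr(100000)   # slowly decayed afterwards
```

Starting small matters because attention logits and Adam's second-moment estimates are both unreliable in the first few thousand steps.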
How do modern Transformers improve training efficiency?
Sparse attention / efficient attention (Longformer, BigBird)
Mixed precision / float16
Gradient checkpointing
Model parallelism & pipeline parallelism
FlashAttention and memory-efficient kernels
What is transfer learning in Transformers?
Pretrain on large corpora → fine-tune on downstream tasks
Often freezes lower layers to reduce computation and avoid overfitting
What are the building blocks of LLMs?
Decoder-only Transformers (GPT, LLaMA) for autoregressive text
Large embedding matrices
Positional encodings
Layer normalization, feedforward networks, multi-head attention
How do Transformers enable RAG (Retrieval-Augmented Generation)?
Query embedded → nearest document vectors retrieved
Retrieved context concatenated → fed to decoder
Decoder generates answer conditioned on retrieved information
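The retrieve-then-concatenate flow above can be sketched with cosine similarity over toy vectors; the document names, embeddings, and prompt template are all hypothetical (real pipelines use a trained embedding model and an ANN index):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical pre-computed document embeddings (toy 3-d vectors).
docs = {
    "paris_doc":  [0.9, 0.1, 0.0],
    "python_doc": [0.0, 0.8, 0.6],
}
query = [1.0, 0.0, 0.1]  # embedding of the user question

# Retrieve the nearest document, then build the augmented prompt.
best = max(docs, key=lambda name: cosine(query, docs[name]))
prompt = f"Context: {best}\nQuestion: <user question>"
```

The generator never sees raw vectors; it only sees the retrieved text prepended to the prompt, which is what grounds the answer.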
What problems do RAG pipelines solve?
Fixed parametric knowledge of LLMs → RAG gives access to external, updatable knowledge
Reduces hallucinations by grounding output in retrieved docs
Scalable knowledge updates without full model retraining
How do embeddings interact with Transformers in agentic AI?
Embeddings used as queries/keys/values for retrieval
Clustered or indexed for efficient tool routing
Semantic similarity for planning / multi-step reasoning
What are key improvements over vanilla Transformer?
Sparse / linear attention → Longformer, Performer
Relative positional encodings → improves long-range modeling
Adapter layers → efficient fine-tuning
Memory-augmented models → Retrieval-Enhanced Transformers
How does attention scale with sequence length?
Standard self-attention: O(n²·d) memory/time complexity
Sparse / local attention → O(n·d) for long sequences
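The gap between the two complexities is easy to make concrete by counting query-key score entries (sequence length and window size below are illustrative):

```python
def attention_scores_count(n, window=None):
    """Number of query-key score entries: full n^2 vs a local window per token."""
    return n * n if window is None else n * min(window, n)

full = attention_scores_count(16384)               # 16384^2 = 268,435,456 entries
local = attention_scores_count(16384, window=512)  # 16384*512 = 8,388,608 entries
```

At 16k tokens a 512-token local window cuts the score matrix by 32x, which is why sliding-window schemes like Longformer's scale to long documents.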
What is instruction-tuning in LLMs?
Fine-tuning on human instructions → aligns model behavior
Often combined with RLHF (Reinforcement Learning from Human Feedback)
Improves agentic reasoning and task compliance
How do Transformers handle multi-modal inputs?
Use modality-specific embeddings
Cross-attention layers combine text, images, audio
Examples: Flamingo, GPT-4 multi-modal, PaLI
Sequence length is very long and self-attention uses too much memory. What do you do?
Use sparse attention or memory-efficient attention
Truncate or chunk sequences
Use Longformer / Performer / BigBird
Gradient checkpointing to save memory
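The truncate-or-chunk option above can be sketched as a simple overlapping split (the sizes are toy values; overlap preserves some context across chunk boundaries):

```python
def chunk(tokens, size, overlap=0):
    """Split a long token list into chunks of `size`, overlapping by `overlap`."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

pieces = chunk(list(range(10)), size=4, overlap=1)
# -> [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

Each chunk fits the model's context window; the per-chunk outputs are then pooled or re-ranked downstream.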
LLM generates incorrect facts. How does RAG help?
Retrieves up-to-date documents
Conditions generation on external knowledge
Reduces hallucination