Why were Transformers introduced?
To overcome RNN limitations: sequential computation, vanishing gradients, poor long-range dependency handling
Enable parallelization over sequence elements
Better scaling to large datasets
How do Transformers achieve parallelization?
Self-attention computes relationships between all tokens simultaneously
Unlike RNNs, no sequential recurrence → GPU/TPU friendly
Batch computations are straightforward
What are the main limitations of RNNs/LSTMs that Transformers overcome?
Sequential computation → slow
Difficulty with long-range dependencies
Vanishing/exploding gradients (Transformers mitigate these via residual connections, layer normalization, and careful weight initialization)
Harder to scale to very large datasets
Explain self-attention.
Computes attention scores between every token pair in a sequence
Weighted sum of values based on query-key similarity
Captures dependencies regardless of distance
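The three points above can be sketched in pure Python as scaled dot-product attention; the vectors and Q = K = V setup are toy assumptions (a real layer uses learned projection matrices):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention over lists of d-dimensional vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        # Score this query against every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # one weight per token, regardless of distance
        # Output is the attention-weighted sum of value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three 2-d tokens attending to each other (toy numbers, no learned projections).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(X, X, X)
```

Every query is scored against every key in one pass, which is what makes the computation parallelizable but also quadratic in sequence length.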
Why is positional encoding needed?
Self-attention is permutation invariant → needs explicit position info
Positional encodings add order information to input embeddings
Types of positional encoding?
Sinusoidal (fixed) → allows extrapolation
Learned embeddings → trainable
Relative positional encodings → distance-aware attention
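The fixed sinusoidal variant from the list above can be sketched as follows (the 10000 base is the constant from the original Transformer paper; dimensions here are toy values):

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Position -> d_model-dim vector; even dims use sin, odd dims use cos."""
    pe = []
    for i in range(d_model):
        # Frequency decreases geometrically with the dimension index.
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Deterministic, so it can extrapolate to positions never seen in training.
pe0 = sinusoidal_encoding(0, 8)    # position 0 -> [0, 1, 0, 1, ...]
pe50 = sinusoidal_encoding(50, 8)  # any position yields a valid encoding
```

These vectors are added element-wise to the input embeddings before the first attention layer.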
What are the main components of a Transformer encoder?
Input embedding + positional encoding
Multi-head self-attention layer
Feedforward network
Residual connections + layer normalization
What are the main components of a Transformer decoder?
Masked multi-head self-attention (prevents looking ahead)
Encoder-decoder attention
Feedforward network
Residual connections + layer normalization
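The "prevents looking ahead" part of masked self-attention comes down to a lower-triangular mask; a minimal sketch:

```python
def causal_mask(n):
    """n x n mask: True where token i may attend to token j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

mask = causal_mask(4)
# Row 0 sees only itself; row 3 sees all four positions up to and including itself.
```

In practice the disallowed positions are set to -inf in the score matrix before the softmax, so their attention weights become exactly zero.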
What is multi-head attention and why is it useful?
Multiple attention heads learn different relationships/subspaces
Concatenated and projected → richer representations
Helps model complex dependencies
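The split-then-concatenate mechanics above can be sketched with plain lists (toy dimensions; a real implementation also applies learned projections per head and a final output projection):

```python
def split_heads(tokens, n_heads):
    """Split each token's d_model vector into n_heads slices of d_model/n_heads."""
    d = len(tokens[0])
    assert d % n_heads == 0, "d_model must be divisible by the number of heads"
    hd = d // n_heads
    # heads[h][t] is token t's feature slice for head h.
    return [[tok[h * hd:(h + 1) * hd] for tok in tokens] for h in range(n_heads)]

def concat_heads(heads):
    """Inverse operation: concatenate per-head outputs back to d_model per token."""
    n_tokens = len(heads[0])
    return [sum((head[t] for head in heads), []) for t in range(n_tokens)]

tokens = [[float(i) for i in range(8)] for _ in range(3)]  # 3 tokens, d_model=8
heads = split_heads(tokens, 2)  # 2 heads, 4 dims each; round-trips losslessly
```

Each head runs attention independently on its slice, so different heads can specialize in different relationships at no extra total cost.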
Difference between encoder-only, decoder-only, and encoder-decoder models?
Encoder-only (BERT): classification, embedding tasks
Decoder-only (GPT): autoregressive generation
Encoder-decoder (T5, BART): sequence-to-sequence tasks
Why is layer normalization used instead of batch normalization?
Batch norm statistics depend on batch composition and padding → unstable for variable-length sequences
Layer norm normalizes each token's features independently of the batch → stable training
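A minimal sketch of the per-token normalization (omitting the learned scale and shift parameters a real layer adds):

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token's feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    # eps guards against division by zero for near-constant vectors.
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Applied per token, so batch size and sequence length are irrelevant.
h = layer_norm([2.0, 4.0, 6.0, 8.0])
```

Because the statistics come from a single vector, the operation behaves identically at training and inference time, unlike batch norm.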
What are key training challenges for Transformers?
Quadratic attention complexity (memory/compute)
Need massive datasets
Overfitting if small data
Optimization instability → mitigated by Adam, learning rate warmup, gradient clipping
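The warmup schedule mentioned above is simple to state; this is the schedule from the original Transformer paper (default d_model and warmup values shown, both tunable):

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """LR rises linearly for `warmup` steps, then decays as 1/sqrt(step)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

early = transformer_lr(100)     # small during warmup: avoids early instability
peak = transformer_lr(4000)     # maximum at the end of warmup
late = transformer_lr(100000)   # slowly decayed afterwards
```

Starting small matters because attention logits and Adam's second-moment estimates are both unreliable in the first few thousand steps.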
How do modern Transformers improve training efficiency?
Sparse attention / efficient attention (Longformer, BigBird)
Mixed precision / float16
Gradient checkpointing
Model parallelism & pipeline parallelism
FlashAttention and memory-efficient kernels
What is transfer learning in Transformers?
Pretrain on large corpora → fine-tune on downstream tasks
Often freezes lower layers to reduce computation and avoid overfitting
What are the building blocks of LLMs?
Decoder-only Transformers (GPT, LLaMA) for autoregressive text
Large embedding matrices
Positional encodings
Layer normalization, feedforward networks, multi-head attention
How do Transformers enable RAG (Retrieval-Augmented Generation)?
Query embedded → nearest document vectors retrieved
Retrieved context concatenated → fed to decoder
Decoder generates answer conditioned on retrieved information
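The retrieve-then-concatenate flow above can be sketched with cosine similarity over toy vectors; the document names, embeddings, and prompt template are all hypothetical (real pipelines use a trained embedding model and an ANN index):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical pre-computed document embeddings (toy 3-d vectors).
docs = {
    "paris_doc":  [0.9, 0.1, 0.0],
    "python_doc": [0.0, 0.8, 0.6],
}
query = [1.0, 0.0, 0.1]  # embedding of the user question

# Retrieve the nearest document, then build the augmented prompt.
best = max(docs, key=lambda name: cosine(query, docs[name]))
prompt = f"Context: {best}\nQuestion: <user question>"
```

The generator never sees raw vectors; it only sees the retrieved text prepended to the prompt, which is what grounds the answer.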
What problems do RAG pipelines solve?
Fixed parametric knowledge of LLMs → RAG gives access to external, updatable knowledge
Reduces hallucinations by grounding output in retrieved docs
Scalable knowledge updates without full model retraining
How do embeddings interact with Transformers in agentic AI?
Embeddings used as queries/keys/values for retrieval
Clustered or indexed for efficient tool routing
Semantic similarity for planning / multi-step reasoning
What are key improvements over vanilla Transformer?
Sparse / linear attention → Longformer, Performer
Relative positional encodings → improves long-range modeling
Adapter layers → efficient fine-tuning
Memory-augmented models → Retrieval-Enhanced Transformers
How does attention scale with sequence length?
Standard self-attention: O(n²·d) memory/time complexity
Sparse / local attention → O(n·d) for long sequences
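The gap between the two complexities is easy to make concrete by counting query-key score entries (sequence length and window size below are illustrative):

```python
def attention_scores_count(n, window=None):
    """Number of query-key score entries: full n^2 vs a local window per token."""
    return n * n if window is None else n * min(window, n)

full = attention_scores_count(16384)               # 16384^2 = 268,435,456 entries
local = attention_scores_count(16384, window=512)  # 16384*512 = 8,388,608 entries
```

At 16k tokens a 512-token local window cuts the score matrix by 32x, which is why sliding-window schemes like Longformer's scale to long documents.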
What is instruction-tuning in LLMs?
Fine-tuning on human instructions → aligns model behavior
Often combined with RLHF (Reinforcement Learning from Human Feedback)
Improves agentic reasoning and task compliance
How do Transformers handle multi-modal inputs?
Use modality-specific embeddings
Cross-attention layers combine text, images, audio
Examples: Flamingo, GPT-4 multi-modal, PaLI
Sequence length is very long and self-attention uses too much memory. What do you do?
Use sparse attention or memory-efficient attention
Truncate or chunk sequences
Use Longformer / Performer / BigBird
Gradient checkpointing to save memory
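The truncate-or-chunk option above can be sketched as a simple overlapping split (the sizes are toy values; overlap preserves some context across chunk boundaries):

```python
def chunk(tokens, size, overlap=0):
    """Split a long token list into chunks of `size`, overlapping by `overlap`."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

pieces = chunk(list(range(10)), size=4, overlap=1)
# -> [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

Each chunk fits the model's context window; the per-chunk outputs are then pooled or re-ranked downstream.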
LLM generates incorrect facts. How does RAG help?
Retrieves up-to-date documents
Conditions generation on external knowledge
Reduces hallucination