What is Retrieval-Augmented Generation (RAG)?
An architecture pattern where relevant documents are retrieved from an external knowledge base (e.g., via vector search) and injected into the model’s context window, grounding the generated output in trusted data to reduce hallucination.
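The retrieve-then-inject flow can be sketched in a few lines. This is a toy illustration: `embed` here is a bag-of-words counter standing in for a real neural embedding model, and the corpus strings are invented examples.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a neural encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, corpus):
    """Inject retrieved documents into the model's context window."""
    context = "\n".join(f"- {d}" for d in retrieve(query, corpus))
    return f"Answer using only these sources:\n{context}\n\nQuestion: {query}"

corpus = [
    "FAR 52.204-21 covers basic safeguarding of contractor systems.",
    "The capital of France is Paris.",
    "BM25 ranks documents by term frequency and document rarity.",
]
print(build_prompt("What does FAR 52.204-21 cover?", corpus))
```

The grounding happens in `build_prompt`: the LLM is instructed to answer only from the retrieved sources, not from its parametric memory.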
What is Hybrid Search and why does Huyen advocate for it?
Hybrid search combines dense vector retrieval (semantic similarity via embeddings) with sparse keyword algorithms (such as BM25). Pure vector search can miss exact identifiers (e.g., FAR clause numbers, case IDs), while BM25 alone misses paraphrases and semantic meaning. The two result lists are merged (commonly via reciprocal rank fusion) to improve recall.
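One common way to merge the two result lists is Reciprocal Rank Fusion (RRF); a minimal sketch, assuming the dense and sparse retrievers each return a ranked list of document IDs (the IDs below are invented):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).
    k=60 is the constant from the original RRF paper; it damps the
    influence of any single list's top ranks."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]   # nearest neighbours by embedding similarity
sparse = ["d1", "d4", "d3"]  # BM25 keyword hits
print(rrf_fuse([dense, sparse]))  # d1 wins: ranked highly by both retrievers
```

Rank-based fusion sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.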
What is BM25?
Best Matching 25 — a sparse keyword retrieval algorithm that ranks documents by term frequency and inverse document frequency. Used alongside vector retrieval in hybrid search for precise keyword matching (e.g., regulatory citations, case numbers).
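The Okapi BM25 formula itself is compact enough to implement directly. A sketch with the conventional defaults k1=1.5, b=0.75 and a naive lowercase-whitespace tokeniser (real systems use proper analyzers):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenised = [d.lower().split() for d in docs]
    N = len(tokenised)
    avgdl = sum(len(d) for d in tokenised) / N  # average document length
    df = Counter()                              # document frequency per term
    for d in tokenised:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in tokenised:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF: rare terms (low df) contribute more.
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # TF saturation (k1) and length normalisation (b).
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```

Note how an exact token like a case number matches literally, which is precisely what dense embeddings tend to blur.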
What is Reranking in a RAG pipeline?
A second-stage retrieval step where a cross-encoder model re-scores and reorders documents from the initial search, improving precision before context is injected into the LLM. Critical for high-impact decisions where answer quality must be maximized.
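The two-stage shape looks like this. The `cross_encoder_score` below is a deliberate stand-in: a real reranker scores the (query, document) pair jointly with a transformer cross-encoder, while here token overlap keeps the sketch self-contained.

```python
def cross_encoder_score(query, doc):
    """Stand-in for a cross-encoder. Real systems feed the concatenated
    (query, doc) pair through a transformer; token overlap is used here
    only to make the pipeline runnable."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def rerank(query, candidates, top_n=3):
    """Second stage: re-score the first-stage candidates and keep the best."""
    return sorted(candidates,
                  key=lambda d: cross_encoder_score(query, d),
                  reverse=True)[:top_n]

candidates = ["grant appeal under clause 9",  # from first-stage hybrid search
              "deny appeal",
              "grant summary judgment"]
print(rerank("grant the appeal", candidates, top_n=2))
```

The design point: the first stage is cheap and casts a wide net over millions of documents; the reranker is expensive but only ever sees a few dozen candidates.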
What is Span-Level Verification?
A technique requiring the model to identify and highlight the exact sentence in a source document that supports its generated answer. Enforces truthfulness and auditability in RAG-based systems.
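The verification check itself is mechanical: does the span the model cites actually appear as a sentence in the source? A sketch with a naive regex sentence splitter (the source text is an invented example):

```python
import re

def split_sentences(text):
    """Naive sentence splitter for illustration; production systems
    would use a proper sentence segmenter."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def verify_span(answer_span, source_doc):
    """True iff the model's cited span exactly matches a source sentence."""
    return any(answer_span.strip() == s for s in split_sentences(source_doc))

source = "The filing fee is $400. Appeals must be filed within 90 days."
verify_span("Appeals must be filed within 90 days.", source)  # supported
verify_span("Appeals must be filed within 10 days.", source)  # fabricated
```

Exact matching is the strictest variant; looser systems allow fuzzy or entailment-based matching, trading auditability for flexibility.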
Why does RAG support “truth-seeking” better than fine-tuning?
RAG provides citations — every answer can be traced to source documents. If a model hallucinates, the sources can be audited. Fine-tuning buries knowledge in opaque weights where it cannot be verified for neutrality or accuracy.
What makes RAG superior for data that changes frequently?
RAG retrieves documents at inference time from an updatable corpus. When source data changes (e.g., new regulations, updated country conditions), the retrieval corpus is updated without retraining the model. Fine-tuned models require expensive retraining.
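The operational difference is that the corpus is just a mutable index. A minimal sketch (the `Corpus` class, its keyword `search`, and the fee figures are all illustrative inventions):

```python
class Corpus:
    """In-memory retrieval corpus: updating it requires no model retraining."""
    def __init__(self):
        self.docs = []

    def add(self, doc):
        self.docs.append(doc)

    def search(self, query):
        """Naive keyword search: return docs sharing any query token."""
        terms = set(query.lower().split())
        return [d for d in self.docs if terms & set(d.lower().split())]

c = Corpus()
c.add("Visa filing fee: $400 (2023 rule).")
# A regulation changes: adding one document updates every future answer.
c.add("Visa filing fee: $510 effective 2024.")
```

Contrast this with a fine-tuned model, where incorporating the 2024 rule means assembling training data and re-running a training job.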
How does a RAG-first architecture protect sensitive data?
Data stays in the retrieval corpus and is fetched at query time, injected into the context window, then discarded after the response. The model weights are never updated with the data. This means sensitive data can be used without being permanently embedded in the model.
What is the “data flywheel” in a RAG system?
Every user interaction generates data (queries, feedback, interaction logs) that feeds back into the system — improving retrieval corpus quality, evaluation pipeline calibration, and prompt optimization. The system gets better with use.