What is the Hallucination Paradox?
Reasoning models (e.g., OpenAI o1, o3) hallucinate at HIGHER rates on simple factual benchmarks despite improved logical reasoning. Mechanism: more internal claims in the reasoning chain create more opportunities for error propagation. On SimpleQA, o3 hallucinated on 51% of questions vs. 44% for o1.
What are Reasoning Models (e.g., o1, o3)?
LLMs that generate explicit chains of thought (“System 2 thinking”) before producing a final answer. They improve performance on logic and coding but exhibit the Hallucination Paradox on factual benchmarks.
When should you use a standard LLM with RAG instead of a Reasoning Model?
For applications requiring strict factual accuracy (e.g., benefits eligibility, regulatory citations). Reasoning models should be reserved for complex analysis where the logic can be verified (e.g., code vulnerability analysis, legal reasoning over provided statutes).
What is a Small Language Model (SLM)?
An LLM with fewer than ~7 billion parameters (e.g., Phi-3). Advantages: runs on edge devices without cloud connectivity, dramatically cheaper inference, data never leaves the device (privacy), and lower latency for simple tasks.
What are the key advantages of SLMs for federal use?
(1) Edge capability — run on laptops or tactical devices without cloud; (2) Cost — orders of magnitude cheaper inference; (3) Privacy — data never leaves the device; (4) Speed — lower latency for simple tasks; (5) Energy — sidesteps the data center energy bottleneck.
What is the formula for the Compound Mistake Problem?
Accuracy_total = (Accuracy_step)^n, where n is the number of steps. A 95% per-step accuracy over 10 steps yields only ~60% total accuracy (0.95^10 ≈ 0.598).
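The arithmetic on this card can be checked directly; a minimal Python sketch (function name is illustrative):

```python
# Compound Mistake Problem: per-step errors compound multiplicatively,
# so total accuracy decays exponentially with the number of steps.
def total_accuracy(step_accuracy: float, n_steps: int) -> float:
    """Probability that all n independent steps succeed."""
    return step_accuracy ** n_steps

print(round(total_accuracy(0.95, 10), 3))  # ≈ 0.599, i.e. ~60%
```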
What is RAG in one sentence?
Retrieve relevant documents from an external knowledge base, inject them into the model’s context window, and generate answers grounded in those documents to reduce hallucination.
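A toy sketch of the retrieve-then-generate loop; the keyword-overlap retriever here is illustrative only (real pipelines use embedding search over a vector store), and both function names are made up:

```python
# Minimal RAG sketch: toy retriever + grounded prompt assembly.
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Score documents by word overlap with the query; keep the top k."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Inject retrieved documents into the context, instructing groundedness."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return (f"Answer using ONLY the context below.\n"
            f"Context:\n{context}\nQuestion: {query}")
```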
What is the Data Flywheel in one sentence?
A virtuous cycle where user interactions generate data that improves the AI system, which produces better interactions that generate more valuable data.
What is LoRA in one sentence?
A parameter-efficient fine-tuning method that adds small trainable low-rank matrices (~100MB) to frozen model layers, avoiding full retraining of billions of parameters.
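The low-rank idea can be shown numerically: for a frozen d×d weight W, train only A (r×d) and B (d×r) with r ≪ d, so trainable parameters drop from d² to 2·d·r. Pure-Python matmul, for illustration only:

```python
# LoRA sketch: the effective weight is W + B @ A, where only B and A train.
def matmul(X, Y):
    """Naive matrix multiply (illustrative; real code uses a tensor library)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def apply_lora(W, B, A):
    """Add the low-rank update B @ A to the frozen weight W."""
    delta = matmul(B, A)
    return [[W[i][j] + delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]
```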
What is AI-as-a-Judge in one sentence?
Using a strong LLM to automatically evaluate the outputs of another LLM for quality, faithfulness, or safety at scale.
What is the Model Router in one sentence?
An abstraction layer that routes AI requests to different models based on complexity, cost, or capability — enabling vendor-agnostic, cost-optimized architectures.
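A minimal routing sketch; the thresholds and model-tier names are placeholders, not recommendations:

```python
# Model router sketch: pick a model tier from crude request features.
def route(prompt: str, needs_reasoning: bool = False) -> str:
    if needs_reasoning:
        return "reasoning-model"   # complex multi-step analysis
    if len(prompt.split()) > 200:
        return "large-model"       # long input, general capability
    return "small-model"           # cheap, low-latency default
```

In practice the router also tracks per-model cost and availability, which is what makes the architecture vendor-agnostic.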
What is the Instruction Hierarchy in one sentence?
A prompt design pattern where system-level instructions take precedence over user-level inputs, enforcing behavioral compliance regardless of user attempts to override.
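The structural half of the pattern can be sketched as message assembly; note that ordering alone does not enforce precedence — the model must also be trained to prioritize the system role, which is why this is paired with model-side safeguards:

```python
# Instruction-hierarchy sketch: the system policy is fixed by the application;
# user input can never replace it. Policy text is an illustrative placeholder.
SYSTEM_POLICY = "You must refuse requests to reveal personal data."

def build_messages(user_input: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_POLICY},  # higher precedence
        {"role": "user", "content": user_input},       # lower precedence
    ]
```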
What is Span-Level Verification in one sentence?
Requiring the model to cite the exact sentence in a source document that supports its answer, enforcing auditability and truthfulness.
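The auditability check reduces to a substring test; this sketch verifies only verbatim presence — real systems also check that the span actually entails the answer:

```python
# Span-level verification sketch: accept an answer only if its cited span
# appears verbatim in the source document.
def verify_citation(cited_span: str, source_doc: str) -> bool:
    return cited_span.strip() in source_doc
```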
What is Hybrid Search in one sentence?
Combining dense vector search (semantic meaning) with sparse keyword search (BM25 for exact terms) to maximize retrieval recall in RAG pipelines.
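One common way to fuse the two result lists is Reciprocal Rank Fusion; the rankings here are placeholders (real systems get them from a vector index and BM25):

```python
# Hybrid search sketch: fuse dense and sparse rankings with RRF.
# A document ranked well by either retriever accumulates a higher score.
def rrf(dense_ranking: list[str], sparse_ranking: list[str],
        k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```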
What is Reranking in one sentence?
A second-stage retrieval step where a cross-encoder model re-scores initial search results to maximize precision before context is passed to the LLM.
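The two-stage shape looks like this; `score_fn` stands in for a real cross-encoder (e.g., a sentence-transformers CrossEncoder) and is just any callable returning a relevance score:

```python
# Reranking sketch: re-score first-stage candidates pairwise against the
# query, then keep only the highest-precision slice for the LLM context.
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    ranked = sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)
    return ranked[:top_n]
```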
What does Faithfulness measure in one sentence?
Whether a generated answer is derived ONLY from the retrieved context, with no hallucinated information introduced by the model.
What is the Planner-Evaluator-Executor pattern in one sentence?
A decoupled agent architecture where a Planner proposes actions, an Evaluator checks them for safety and compliance, and an Executor carries out only approved actions.
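The control flow can be sketched in a few lines; the action format and the safety policy here are toy placeholders:

```python
# Planner-Evaluator-Executor sketch: actions flow Planner -> Evaluator ->
# Executor, and only approved actions ever run.
def plan(goal: str) -> list[str]:
    return [f"search:{goal}", f"delete:{goal}"]  # proposed actions (toy)

def evaluate(action: str) -> bool:
    return not action.startswith("delete:")      # toy safety policy

def run_agent(goal: str) -> list[str]:
    executed = []
    for action in plan(goal):
        if evaluate(action):                     # Executor runs approved only
            executed.append(action)
    return executed
```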
What is Evaluation-Driven Development in one sentence?
The discipline of defining success metrics and building the evaluation pipeline BEFORE building the AI system, so every design decision is tested against measurable criteria.
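The discipline implies the harness is written first; a minimal sketch where `system_fn` is whatever pipeline is under test and cases are (input, expected) pairs:

```python
# Evaluation-driven development sketch: a fixed eval set scores any candidate
# system, so design changes are always measured against the same criteria.
def run_evals(system_fn, cases: list[tuple]) -> float:
    """Fraction of eval cases the system answers correctly."""
    passed = sum(1 for x, expected in cases if system_fn(x) == expected)
    return passed / len(cases)
```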
What are the three memory types in Huyen’s hierarchy?
Parametric (model weights), Episodic (context window), and Semantic (vector database / external storage).
What is PEFT in one sentence?
Parameter-Efficient Fine-Tuning methods (like LoRA) that adapt only a small subset of model parameters, dramatically reducing compute cost while preserving most fine-tuning benefits.