ML: Model Evaluation & Selection Flashcards

(54 cards)

1
Q

What is the bias–variance trade-off?

A

Bias = error from oversimplification. Introduced by approximating real-world problems with an overly simplistic model. Wrong assumptions about data.

Variance = error from oversensitivity. Introduced by a model’s sensitivity to small noise in training data. Model too complex and overfits training data.

Increasing complexity ↓ bias but ↑ variance.

2
Q

Why does overfitting occur?

A

Model learns noise or spurious patterns due to:

High complexity

Too few examples

Poor regularization

Data leakage

3
Q

What is the difference between training error and generalization error?

A

Training error = performance on training data.

Generalization error = expected performance on unseen data.

4
Q

What is cross-validation and why is it used?

A

Resampling method to estimate out-of-sample performance.

Reduces variance and avoids relying on a single validation split.

5
Q

Explain k-fold vs stratified k-fold CV.

A

k-fold: split the data into k folds; each fold serves once as validation while the remaining folds train the model.

Stratified k-fold additionally preserves class proportions in each fold → better stability in classification tasks.
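A minimal pure-Python sketch of how k-fold index splitting works (function name and fold counts are illustrative; libraries such as scikit-learn provide tested implementations):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each index is validated exactly once."""
    indices = list(range(n_samples))
    # Distribute any remainder across the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 5))  # 5 folds of 2 validation samples each
```

Averaging the metric over all folds is what reduces the variance of the estimate relative to a single split.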

6
Q

What is regularization? Give examples.

A

Techniques to reduce overfitting by penalizing complexity:

L1 (sparsity)

L2 (weight shrinkage)

Dropout

Data augmentation

Early stopping

7
Q

How does L1 differ from L2 regularization?

A

L1: promotes sparse weights; feature selection

L2: distributes weights; smoother optimization; prevents large weights
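One way to see the difference: each penalty induces a characteristic weight update (its proximal operator). L1 subtracts a fixed amount and snaps small weights to exactly zero (soft-thresholding); L2 shrinks every weight by a constant factor but leaves it nonzero. A sketch with a made-up weight vector; `l1_prox`/`l2_prox` are illustrative names:

```python
def l1_prox(w, lam):
    """Soft-thresholding: weights with magnitude below lam become exactly 0."""
    return [max(abs(x) - lam, 0.0) * (1 if x > 0 else -1) for x in w]

def l2_prox(w, lam):
    """Uniform shrinkage: every weight scaled toward 0, none eliminated."""
    return [x / (1.0 + lam) for x in w]

w = [0.3, -0.05, 2.0]          # made-up weights
sparse_w = l1_prox(w, 0.1)     # middle weight is zeroed → feature selection
shrunk_w = l2_prox(w, 0.1)     # all weights shrink, all stay nonzero
```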

8
Q

What is the purpose of a validation set?

A

Hyperparameter tuning and model selection without contaminating the test set.

9
Q

What is data leakage, and how do you prevent it?

A

Using information from the test/validation set during training.

Prevent by applying preprocessing inside pipelines after splitting data.
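A toy illustration of the leak (made-up numbers): fitting a centering statistic on all the data lets a test-set outlier influence the training features.

```python
data = [1.0, 2.0, 3.0, 100.0]    # the outlier 100.0 belongs to the test split
train, test = data[:3], data[3:]

# Wrong: statistic computed on train+test → the test outlier leaks in
leaky_mean = sum(data) / len(data)      # 26.5

# Right: statistic fit on the training split only, then applied to both splits
train_mean = sum(train) / len(train)    # 2.0
train_centered = [x - train_mean for x in train]
test_centered = [x - train_mean for x in test]
```

Pipelines enforce exactly this ordering: every preprocessing step is fit only on the training portion of each split.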

10
Q

How do you detect overfitting in practice?

A

Training loss ↓, validation loss ↑

Gap between training and validation metrics

Performance degrades after deployment on real-world data

11
Q

Why is early stopping a form of regularization?

A

Stops training before the model fits noise; limits effective complexity.
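A sketch of the usual patience rule, with a hypothetical validation-loss curve (function name and numbers are illustrative):

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the epoch to stop at: no improvement for `patience` epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1

# Made-up curve: improves until epoch 2, then starts fitting noise
losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.8]
stop = early_stop_epoch(losses, patience=2)   # stops at epoch 4
```

In practice the weights from the best epoch (here, epoch 2) are restored.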

12
Q

When would AUC be preferred over accuracy?

A

Class imbalance

Different decision thresholds

When ranking performance matters

13
Q

What evaluation metric do you use for class imbalance?

A

F1-score

Balanced accuracy

ROC-AUC

Precision–Recall AUC

14
Q

Why do ML engineers care about calibration?

A

Probabilities should reflect true likelihood → critical for risk-sensitive tasks (credit scoring, medical decisions, agentic planning).

15
Q

How do you choose between a simpler and more complex model?

A

Trade-off accuracy vs interpretability, training cost, robustness, risk of overfitting, and deployment constraints.

16
Q

What is hyperparameter search, and what are common strategies?

A

Exploration of config space:

Grid search

Random search

Bayesian optimization

Hyperband / ASHA

Population-based training

17
Q

Why does dropout reduce overfitting?

A

Forces networks to not rely on specific neurons; ensemble averaging effect.

17
Q

Why is validation loss often noisier than training loss?

A

Validation set is smaller; no gradient smoothing; stochasticity in regularization (dropout, augmentation).

18
Q

How does batch normalization affect model evaluation?

A

Different behavior in train vs eval mode (uses running averages in eval).

Incorrect mode → inflated or broken metrics.

19
Q

What is the impact of label noise on model selection?

A

Increases variance

Causes overfitting

Makes validation metrics unreliable

Need robust losses or cleaning strategies.

20
Q

What is the double descent phenomenon?

A

Test error can fall again as model size grows past the interpolation threshold: deep models can generalize well even where classical bias–variance reasoning predicts poor performance.

21
Q

Why is the test set used exactly once?

A

Repeated evaluation leaks information into training → overly optimistic performance.

22
Q

What is the difference between global and local interpretability?

A

Global: model-wide behavior (feature importance, coefficients)

Local: individual prediction explanations (SHAP, LIME)

23
Q

How does SHAP compute feature attributions?

A

Uses Shapley values from cooperative game theory → average marginal contribution of each feature.

24
Q

Why is regularization related to interpretability?

A

L1 creates sparse models; discourages overly complex decision boundaries.
25
Q

Name 3 reasons interpretability matters in ML + agentic systems.

A

Debugging and trust

Safety and compliance

Detecting hallucinations or unsafe agent actions
26
Q

What is permutation feature importance?

A

Measure the drop in performance after randomly permuting one feature → assesses sensitivity.
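A deterministic toy sketch (real permutation importance averages over several random shuffles; a reversed column stands in for one permutation here, and the "model" and data are made up):

```python
# Toy "model" that thresholds feature 0 and ignores feature 1 entirely
def model(row):
    return 1 if row[0] > 0.5 else 0

X = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.9], [0.1, 0.3]]   # made-up data
y = [1, 1, 0, 0]

def accuracy(rows, labels):
    return sum(model(r) == t for r, t in zip(rows, labels)) / len(labels)

def permuted_accuracy(rows, labels, feature):
    # One deterministic permutation (reversal) of a single feature column
    col = [r[feature] for r in rows][::-1]
    permuted = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, col)]
    return accuracy(permuted, labels)

base = accuracy(X, y)                        # 1.0 on this toy data
drop0 = base - permuted_accuracy(X, y, 0)    # 1.0: feature 0 matters
drop1 = base - permuted_accuracy(X, y, 1)    # 0.0: feature 1 was never used
```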
27
Q

Why might SHAP or LIME give misleading results?

A

Correlated features

Extrapolation outside data manifold

Instability for small perturbations

Model interactions not captured well
28
Q

How do we evaluate an LLM-based agent vs a static ML model?

A

Agents require behavioral evaluation, including:

Task success rate

Consistency and robustness

Tool-use accuracy

Safety constraints

Planning reliability

Hallucination rates
29
Q

Why is standard accuracy insufficient for evaluating agentic systems?

A

Agents generate sequences of actions, not single outputs → must measure process quality, not just final predictions.
30
Q

What are key evaluation metrics for modern LLM systems?

A

Response correctness

Faithfulness

Reasoning depth

Retrieval precision/recall

Latency and token efficiency

Tool invocation success
31
Q

How does bias–variance relate to LLM fine-tuning?

A

Overly specialized fine-tuning → low bias, high variance → poor generalization

LoRA rank controls effective model complexity

Needs proper validation tasks
32
Q

How do we evaluate retrieval systems in agent memory?

A

Recall@k

MRR (Mean Reciprocal Rank)

Embedding drift over time

Query latency

Relevance consistency
33
Q

What is reward hacking in agent systems and why is it a model selection issue?

A

The agent exploits loopholes in evaluation or reward signals.

Model selection must detect gaming behavior rather than relying solely on numeric metrics.
34
Q

Why does interpretability matter more for autonomous agents?

A

Agents make multi-step decisions → need visibility into:

Planning chain

Memory usage

Tool routing

Risk of hallucination-driven actions
35
Q

What is “offline evaluation” for LLM agents, and why is it difficult?

A

Evaluating agents without executing live actions (e.g., simulated tools).

Difficult because behavior depends heavily on environment feedback.
36
Q

What is precision?

A

Fraction of predicted positives that are actually positive.

Precision = TP / (TP + FP)
37
Q

What is recall?

A

Fraction of actual positives that were correctly identified.

Recall = TP / (TP + FN)
38
Q

When is high precision more important than recall?

A

When the cost of a false positive is high, e.g.:

Fraud detection that flags bank accounts for review

Medical tests that trigger harmful interventions

Spam filters blocking important emails
39
Q

When is high recall more important than precision?

A

When missing positive cases is expensive/dangerous, e.g.:

Cancer screening

Safety anomaly detection

Search systems (don’t miss relevant documents)
40
Q

What is the F1-score and when is it preferred?

A

Harmonic mean of precision and recall. Used when:

Class imbalance exists

Precision and recall are both important
41
Q

What is Precision@K?

A

Fraction of top-K results that are relevant.

Used in ranking, retrieval, and recommender systems.
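A minimal sketch with a hypothetical ranking (doc IDs are made up):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for item in ranked_ids[:k] if item in relevant_ids) / k

ranked = ["d3", "d1", "d7", "d2", "d9"]       # hypothetical ranking
relevant = {"d1", "d2"}
p_at_3 = precision_at_k(ranked, relevant, 3)  # 1 hit ("d1") in top 3 → 1/3
```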
42
Q

Why is accuracy a poor metric for imbalanced datasets?

A

Majority class dominates the score; a naive model can achieve high accuracy without learning anything about minority classes.
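A quick illustration with a made-up 95/5 class split: a classifier that always predicts the majority class looks strong on accuracy yet finds no positives at all.

```python
# 95 negatives, 5 positives; the "model" always predicts negative (0)
labels = [0] * 95 + [1] * 5
preds = [0] * 100

accuracy = sum(p == t for p, t in zip(preds, labels)) / len(labels)  # 0.95
true_positives = sum(p == 1 and t == 1 for p, t in zip(preds, labels))
recall_positive = true_positives / 5                                 # 0.0
```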
43
Q

What is the goal of retrieval evaluation in a RAG pipeline?

A

To measure how well retrieved documents support answering the query.
44
Q

What metrics are used to evaluate retrieval quality?

A

Recall@k → proportion of relevant docs retrieved

Precision@k

nDCG (normalized discounted cumulative gain)

MRR (mean reciprocal rank)

Embedding similarity drift (advanced)
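Recall@k and MRR as minimal sketches (doc IDs and queries are hypothetical):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Proportion of all relevant docs that appear in the top k."""
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)

def mean_reciprocal_rank(rankings, relevant_sets):
    """Average of 1/rank of the first relevant doc per query (0 if none found)."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Hypothetical queries: first relevant doc at rank 1 and rank 3 → MRR = (1 + 1/3) / 2
rankings = [["a", "b", "c"], ["x", "y", "a"]]
relevant = [{"a"}, {"a"}]
score = mean_reciprocal_rank(rankings, relevant)   # 2/3
```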
45
Q

When is Recall@k more important than Precision@k in RAG?

A

When coverage matters: the LLM can filter out noisy documents, but it cannot recover facts that were never retrieved.

E.g., fact retrieval, medical Q&A, legal reasoning.
46
Q

When is Precision@k more important?

A

When poor docs confuse or mislead the LLM → reduces generation quality.

E.g., domain-specific assistive tools.
47
Q

How do you detect retrieval issues in a RAG system?

A

Ground-truth retrieval evaluation

Similarity histograms

Retrieval latency + recall

Inspecting LLM output for hallucinations caused by retrieval gaps
48
Q

What metrics evaluate the generation stage?

A

Faithfulness (does it stick to the retrieved docs?)

Correctness / factuality

Relevance

Context utilisation

Token-level overlap metrics (BLEU, ROUGE)

LLM-as-a-judge metrics (modern)
49
Q

Why are BLEU/ROUGE not enough for RAG evaluation?

A

They measure surface similarity → fail on:

Paraphrasing

Correct answers with different wording

Reasoning tasks
50
Q

How do you measure hallucinations in a RAG system?

A

Compare output against retrieved docs (faithfulness)

Use LLM-judge scoring

Entropy or perplexity-based heuristics

Retrieval mismatch detection
51
Q

How do you evaluate end-to-end performance of a RAG system?

A

Combination of:

Retrieval recall

Generation correctness

Faithfulness to retrieved content

Latency + cost metrics (tokens, retrieval time)
52
Q

Why is context compression important for RAG evaluation?

A

Long contexts dilute relevant info → reduces faithfulness and raises hallucination risk.

Evaluations should measure how well context is used.
53
Q

What are failure modes unique to RAG systems?

A

Retrieval misses relevant docs

Over-retrieval causes noise

Hallucinations even with correct retrieval

Query reformulation errors

Embedding drift over time