ML: Model Evaluation & Selection Flashcards

(54 cards)

1
Q

What is the bias–variance trade-off?

A

Bias = error from oversimplification. Introduced by approximating real-world problems with an overly simplistic model. Wrong assumptions about data.

Variance = error from oversensitivity. Introduced by a model’s sensitivity to small noise in training data. Model too complex and overfits training data.

Increasing complexity ↓ bias but ↑ variance.

2
Q

Why does overfitting occur?

A

Model learns noise or spurious patterns due to:

High complexity

Too few examples

Poor regularization

Data leakage

3
Q

What is the difference between training error and generalization error?

A

Training error = performance on training data.

Generalization error = expected performance on unseen data.

4
Q

What is cross-validation and why is it used?

A

Resampling method to estimate out-of-sample performance.

Reduces variance and avoids relying on a single validation split.

5
Q

Explain k-fold vs stratified k-fold CV.

A

k-fold: split the data into k folds; each fold serves once as validation while the remaining folds train the model.

Stratified k-fold additionally preserves class proportions in each fold → better stability in classification tasks.
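A minimal pure-Python sketch of how k-fold index splitting works (function name and fold counts are illustrative; libraries such as scikit-learn provide tested implementations):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each index is validated exactly once."""
    indices = list(range(n_samples))
    # Distribute any remainder across the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 5))  # 5 folds of 2 validation samples each
```

Averaging the metric over all folds is what reduces the variance of the estimate relative to a single split.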

6
Q

What is regularization? Give examples.

A

Techniques to reduce overfitting by penalizing complexity:

L1 (sparsity)

L2 (weight shrinkage)

Dropout

Data augmentation

Early stopping

7
Q

How does L1 differ from L2 regularization?

A

L1: promotes sparse weights; feature selection

L2: distributes weights; smoother optimization; prevents large weights
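One way to see the difference: each penalty induces a characteristic weight update (its proximal operator). L1 subtracts a fixed amount and snaps small weights to exactly zero (soft-thresholding); L2 shrinks every weight by a constant factor but leaves it nonzero. A sketch with a made-up weight vector; `l1_prox`/`l2_prox` are illustrative names:

```python
def l1_prox(w, lam):
    """Soft-thresholding: weights with magnitude below lam become exactly 0."""
    return [max(abs(x) - lam, 0.0) * (1 if x > 0 else -1) for x in w]

def l2_prox(w, lam):
    """Uniform shrinkage: every weight scaled toward 0, none eliminated."""
    return [x / (1.0 + lam) for x in w]

w = [0.3, -0.05, 2.0]          # made-up weights
sparse_w = l1_prox(w, 0.1)     # middle weight is zeroed → feature selection
shrunk_w = l2_prox(w, 0.1)     # all weights shrink, all stay nonzero
```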

8
Q

What is the purpose of a validation set?

A

Hyperparameter tuning and model selection without contaminating the test set.

9
Q

What is data leakage, and how do you prevent it?

A

Using information from the test/validation set during training.

Prevent by applying preprocessing inside pipelines after splitting data.
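A toy illustration of the leak (made-up numbers): fitting a centering statistic on all the data lets a test-set outlier influence the training features.

```python
data = [1.0, 2.0, 3.0, 100.0]    # the outlier 100.0 belongs to the test split
train, test = data[:3], data[3:]

# Wrong: statistic computed on train+test → the test outlier leaks in
leaky_mean = sum(data) / len(data)      # 26.5

# Right: statistic fit on the training split only, then applied to both splits
train_mean = sum(train) / len(train)    # 2.0
train_centered = [x - train_mean for x in train]
test_centered = [x - train_mean for x in test]
```

Pipelines enforce exactly this ordering: every preprocessing step is fit only on the training portion of each split.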

10
Q

How do you detect overfitting in practice?

A

Training loss ↓, validation loss ↑

Gap between training and validation metrics

Performance degrades after deployment on real-world data

11
Q

Why is early stopping a form of regularization?

A

Stops training before the model fits noise; limits effective complexity.
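A sketch of the usual patience rule, with a hypothetical validation-loss curve (function name and numbers are illustrative):

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the epoch to stop at: no improvement for `patience` epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1

# Made-up curve: improves until epoch 2, then starts fitting noise
losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.8]
stop = early_stop_epoch(losses, patience=2)   # stops at epoch 4
```

In practice the weights from the best epoch (here, epoch 2) are restored.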

12
Q

When would AUC be preferred over accuracy?

A

Class imbalance

Different decision thresholds

When ranking performance matters

13
Q

What evaluation metric do you use for class imbalance?

A

F1-score

Balanced accuracy

ROC-AUC

Precision–Recall AUC

14
Q

Why do ML engineers care about calibration?

A

Probabilities should reflect true likelihood → critical for risk-sensitive tasks (credit scoring, medical decisions, agentic planning).

15
Q

How do you choose between a simpler and more complex model?

A

Trade-off accuracy vs interpretability, training cost, robustness, risk of overfitting, and deployment constraints.

16
Q

What is hyperparameter search, and what are common strategies?

A

Exploration of config space:

Grid search

Random search

Bayesian optimization

Hyperband / ASHA

Population-based training

17
Q

Why does dropout reduce overfitting?

A

Forces networks to not rely on specific neurons; ensemble averaging effect.

17
Q

Why is validation loss often noisier than training loss?

A

Validation set is smaller; no gradient smoothing; stochasticity in regularization (dropout, augmentation).

18
Q

How does batch normalization affect model evaluation?

A

Different behavior in train vs eval mode (uses running averages in eval).

Incorrect mode → inflated or broken metrics.

19
Q

What is the impact of label noise on model selection?

A

Increases variance

Causes overfitting

Makes validation metrics unreliable

Need robust losses or cleaning strategies.

20
Q

What is the double descent phenomenon?

A

Test error can fall again as model size grows past the interpolation threshold: deep models can generalize well even where classical bias–variance reasoning predicts poor performance.

21
Q

Why is the test set used exactly once?

A

Repeated evaluation leaks information into training → overly optimistic performance.

22
Q

What is the difference between global and local interpretability?

A

Global: model-wide behavior (feature importance, coefficients)

Local: individual prediction explanations (SHAP, LIME)

23
Q

How does SHAP compute feature attributions?

A

Uses Shapley values from cooperative game theory → average marginal contribution of each feature.

24
Q

Why is regularization related to interpretability?

A

L1 creates sparse models; discourages overly complex decision boundaries.
25
Q

Name 3 reasons interpretability matters in ML + agentic systems.

A

Debugging and trust

Safety and compliance

Detecting hallucinations or unsafe agent actions
26
Q

What is permutation feature importance?

A

Measure the drop in performance after randomly permuting one feature → assesses sensitivity.
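A deterministic toy sketch (real permutation importance averages over several random shuffles; a reversed column stands in for one permutation here, and the "model" and data are made up):

```python
# Toy "model" that thresholds feature 0 and ignores feature 1 entirely
def model(row):
    return 1 if row[0] > 0.5 else 0

X = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.9], [0.1, 0.3]]   # made-up data
y = [1, 1, 0, 0]

def accuracy(rows, labels):
    return sum(model(r) == t for r, t in zip(rows, labels)) / len(labels)

def permuted_accuracy(rows, labels, feature):
    # One deterministic permutation (reversal) of a single feature column
    col = [r[feature] for r in rows][::-1]
    permuted = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, col)]
    return accuracy(permuted, labels)

base = accuracy(X, y)                        # 1.0 on this toy data
drop0 = base - permuted_accuracy(X, y, 0)    # 1.0: feature 0 matters
drop1 = base - permuted_accuracy(X, y, 1)    # 0.0: feature 1 was never used
```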
27
Q

Why might SHAP or LIME give misleading results?

A

Correlated features

Extrapolation outside data manifold

Instability for small perturbations

Model interactions not captured well
28
Q

How do we evaluate an LLM-based agent vs a static ML model?

A

Agents require behavioral evaluation, including:

Task success rate

Consistency and robustness

Tool-use accuracy

Safety constraints

Planning reliability

Hallucination rates
29
Q

Why is standard accuracy insufficient for evaluating agentic systems?

A

Agents generate sequences of actions, not single outputs → must measure process quality, not just final predictions.
30
Q

What are key evaluation metrics for modern LLM systems?

A

Response correctness

Faithfulness

Reasoning depth

Retrieval precision/recall

Latency and token efficiency

Tool invocation success
31
Q

How does bias–variance relate to LLM fine-tuning?

A

Overly specialized fine-tuning → low bias, high variance → poor generalization

LoRA rank controls effective model complexity

Needs proper validation tasks
32
Q

How do we evaluate retrieval systems in agent memory?

A

Recall@k

MRR (Mean Reciprocal Rank)

Embedding drift over time

Query latency

Relevance consistency
33
Q

What is reward hacking in agent systems and why is it a model selection issue?

A

The agent exploits loopholes in evaluation or reward signals.

Model selection must detect gaming behavior rather than relying solely on numeric metrics.
34
Q

Why does interpretability matter more for autonomous agents?

A

Agents make multi-step decisions → need visibility into:

Planning chain

Memory usage

Tool routing

Risk of hallucination-driven actions
35
Q

What is “offline evaluation” for LLM agents, and why is it difficult?

A

Evaluating agents without executing live actions (e.g., simulated tools).

Difficult because behavior depends heavily on environment feedback.
36
Q

What is precision?

A

Fraction of predicted positives that are actually positive.

Precision = TP / (TP + FP)
37
Q

What is recall?

A

Fraction of actual positives that were correctly identified.

Recall = TP / (TP + FN)
38
Q

When is high precision more important than recall?

A

When the cost of a false positive is high, e.g.:

Fraud detection that flags bank accounts for review

Medical tests that trigger harmful interventions

Spam filters blocking important emails
39
Q

When is high recall more important than precision?

A

When missing positive cases is expensive/dangerous, e.g.:

Cancer screening

Safety anomaly detection

Search systems (don’t miss relevant documents)
40
Q

What is the F1-score and when is it preferred?

A

Harmonic mean of precision and recall. Used when:

Class imbalance exists

Precision and recall are both important
41
Q

What is Precision@K?

A

Fraction of top-K results that are relevant.

Used in ranking, retrieval, and recommender systems.
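A minimal sketch with a hypothetical ranking (doc IDs are made up):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for item in ranked_ids[:k] if item in relevant_ids) / k

ranked = ["d3", "d1", "d7", "d2", "d9"]       # hypothetical ranking
relevant = {"d1", "d2"}
p_at_3 = precision_at_k(ranked, relevant, 3)  # 1 hit ("d1") in top 3 → 1/3
```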
42
Q

Why is accuracy a poor metric for imbalanced datasets?

A

Majority class dominates the score; a naive model can achieve high accuracy without learning anything about minority classes.
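A quick illustration with a made-up 95/5 class split: a classifier that always predicts the majority class looks strong on accuracy yet finds no positives at all.

```python
# 95 negatives, 5 positives; the "model" always predicts negative (0)
labels = [0] * 95 + [1] * 5
preds = [0] * 100

accuracy = sum(p == t for p, t in zip(preds, labels)) / len(labels)  # 0.95
true_positives = sum(p == 1 and t == 1 for p, t in zip(preds, labels))
recall_positive = true_positives / 5                                 # 0.0
```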
43
Q

What is the goal of retrieval evaluation in a RAG pipeline?

A

To measure how well retrieved documents support answering the query.
44
Q

What metrics are used to evaluate retrieval quality?

A

Recall@k → proportion of relevant docs retrieved

Precision@k

nDCG (normalized discounted cumulative gain)

MRR (mean reciprocal rank)

Embedding similarity drift (advanced)
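Recall@k and MRR as minimal sketches (doc IDs and queries are hypothetical):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Proportion of all relevant docs that appear in the top k."""
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)

def mean_reciprocal_rank(rankings, relevant_sets):
    """Average of 1/rank of the first relevant doc per query (0 if none found)."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Hypothetical queries: first relevant doc at rank 1 and rank 3 → MRR = (1 + 1/3) / 2
rankings = [["a", "b", "c"], ["x", "y", "a"]]
relevant = [{"a"}, {"a"}]
score = mean_reciprocal_rank(rankings, relevant)   # 2/3
```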
45
Q

When is Recall@k more important than Precision@k in RAG?

A

When coverage matters: the LLM can filter out noisy documents, but it cannot recover facts that were never retrieved.

E.g., fact retrieval, medical Q&A, legal reasoning.
46
Q

When is Precision@k more important?

A

When poor docs confuse or mislead the LLM → reduces generation quality.

E.g., domain-specific assistive tools.
47
Q

How do you detect retrieval issues in a RAG system?

A

Ground-truth retrieval evaluation

Similarity histograms

Retrieval latency + recall

Inspecting LLM output for hallucinations caused by retrieval gaps
48
Q

What metrics evaluate the generation stage?

A

Faithfulness (does it stick to the retrieved docs?)

Correctness / factuality

Relevance

Context utilisation

Token-level overlap metrics (BLEU, ROUGE)

LLM-as-a-judge metrics (modern)
49
Q

Why are BLEU/ROUGE not enough for RAG evaluation?

A

They measure surface similarity → fail on:

Paraphrasing

Correct answers with different wording

Reasoning tasks
50
Q

How do you measure hallucinations in a RAG system?

A

Compare output against retrieved docs (faithfulness)

Use LLM-judge scoring

Entropy or perplexity-based heuristics

Retrieval mismatch detection
51
Q

How do you evaluate end-to-end performance of a RAG system?

A

Combination of:

Retrieval recall

Generation correctness

Faithfulness to retrieved content

Latency + cost metrics (tokens, retrieval time)
52
Q

Why is context compression important for RAG evaluation?

A

Long contexts dilute relevant info → reduces faithfulness and raises hallucination risk.

Evaluations should measure how well context is used.
53
Q

What are failure modes unique to RAG systems?

A

Retrieval misses relevant docs

Over-retrieval causes noise

Hallucinations even with correct retrieval

Query reformulation errors

Embedding drift over time