Why is k-fold cross-validation generally preferred over a single train/validation split?
Because k-fold CV reduces variance in performance estimates by averaging across multiple train–test folds. It maximizes data usage, reduces sensitivity to a “lucky” or “unlucky” split, and gives a more stable generalization estimate.
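A minimal sketch with scikit-learn (the dataset and model here are illustrative choices, not prescribed by the answer):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: every sample serves as validation exactly once; the mean
# across folds is a lower-variance estimate than any single split.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```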
What conditions make cross-validation unreliable or misleading?
Time-series data (must use chronological splits, not random).
Data leakage (improper preprocessing before the split).
Highly imbalanced data unless using stratified CV.
Non-IID data, e.g., grouped or clustered samples → need GroupKFold.
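For the grouped-data case, GroupKFold guarantees that no group is split across train and validation (the group labels below are a hypothetical "same patient" example):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 1, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])  # e.g., samples from the same patient

folds = list(GroupKFold(n_splits=3).split(X, y, groups))
for train_idx, val_idx in folds:
    # No group ever appears on both sides of a split,
    # so correlated samples cannot leak across the boundary.
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```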
When should you prefer StratifiedKFold over KFold?
In classification tasks—especially imbalanced ones—to preserve label distribution across folds, preventing misleading validation metrics.
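A quick sketch of what stratification buys you on an imbalanced label vector (the 90/10 split below is illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 90 + [1] * 10)   # 90/10 imbalanced labels
X = np.zeros((100, 1))              # features are irrelevant to the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
counts = [np.bincount(y[val_idx]) for _, val_idx in skf.split(X, y)]
# Every fold keeps the exact 90/10 ratio: 18 majority, 2 minority samples.
# Plain KFold could put 0 minority samples in some folds.
```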
How do you correctly apply preprocessing inside a CV pipeline?
Wrap preprocessing steps in an sklearn Pipeline so every transformation is fit only on each fold's training data. Fitting a scaler or encoder on the full dataset before splitting leaks validation statistics into training.
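A minimal leakage-free sketch (dataset and estimator are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# cross_val_score clones and fits the whole pipeline inside each fold,
# so the scaler only ever sees that fold's training portion.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
```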
How is evaluation split design different in LLM/RAG/agent training?
You need separate splits for:
Retrieval grounding (faithfulness)
Generation tasks
Tool-use/agent interaction traces
Often requires multi-task stratification, separating evaluation of retrieval quality from generation quality.
What is the statistical purpose of bootstrapping?
Bootstrapping approximates the sampling distribution of an estimator by resampling with replacement, allowing estimation of variance, confidence intervals, and bias without strong parametric assumptions.
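A percentile bootstrap confidence interval for a sample mean, as a sketch (the data distribution and B = 2000 resamples are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)  # toy sample

# Resample with replacement B times; record the statistic each time.
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(2000)
])

# Percentile 95% confidence interval for the mean,
# with no parametric assumption about the sampling distribution.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
```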
Why does bagging reduce variance but not necessarily bias?
Because aggregating predictions from many high-variance learners (e.g., deep trees) averages out their individual errors, lowering variance. The ensemble's bias stays roughly equal to the base learner's bias, so bagging cannot rescue an underfitting base model.
Why are decision trees good candidates for bagging but linear models are not?
Trees have high variance and benefit from averaging; linear models have low variance, so bagging provides limited improvement.
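A quick empirical sketch of the variance-reduction effect (synthetic data and the 50-tree ensemble size are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0,
                       random_state=0)

tree = DecisionTreeRegressor(random_state=0)            # high-variance learner
bag = BaggingRegressor(tree, n_estimators=50, random_state=0)

tree_r2 = cross_val_score(tree, X, y, cv=5).mean()
bag_r2 = cross_val_score(bag, X, y, cv=5).mean()
# Averaging 50 bootstrapped trees typically lifts the CV R² well above
# the single tree's score; the same trick on a linear model would barely move it.
```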
How can bootstrapping ideas be applied in LLM training or evaluation?
Sampling-based confidence estimation for LLM outputs
Self-consistency decoding (majority voting over sampled responses)
Uncertainty estimation in retrieval (bootstrap sub-indexes or chunk samples)
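The self-consistency idea above reduces to a majority vote over sampled responses; a toy sketch (the sampled answers are hypothetical, not from a real model call):

```python
from collections import Counter

# Hypothetical answers sampled from an LLM at temperature > 0
samples = ["42", "42", "41", "42", "40"]

# Self-consistency: take the majority answer; its vote share is a rough,
# uncalibrated confidence proxy in the bootstrap spirit.
answer, votes = Counter(samples).most_common(1)[0]
confidence = votes / len(samples)
print(answer, confidence)  # -> 42 0.6
```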
Why can grid search be inefficient compared to random search?
For a fixed budget, random search draws a distinct value of every hyperparameter in each trial, so the few hyperparameters that actually matter get many unique settings. Grid search repeats the same values along the irrelevant dimensions, wasting computation.
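A minimal RandomizedSearchCV sketch, sampling a regularization strength from a log-uniform distribution (model, range, and n_iter are illustrative):

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# 20 trials draw 20 distinct C values; a 20-point grid shared across
# two hyperparameters would give C far fewer unique settings.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e3)},
    n_iter=20, cv=3, random_state=0,
)
search.fit(X, y)
```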
What is early-stopping and why is it both a regularizer and tuner?
Early stopping ends training once validation performance stops improving. It prevents overfitting (regularizer) and selects an optimal number of training iterations (hyperparameter tuner).
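A framework-free sketch of patience-based early stopping; the validation-loss trajectory is a made-up toy sequence standing in for a real training loop:

```python
best_val, best_step, patience, bad_epochs = float("inf"), 0, 3, 0

# Toy validation losses: improve, then plateau and start to overfit.
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74]

for step, val in enumerate(val_losses):
    if val < best_val:
        best_val, best_step, bad_epochs = val, step, 0  # checkpoint here
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # no improvement for `patience` epochs
            break                    # regularization: stop before memorizing

# best_step is the tuned "number of iterations" hyperparameter.
print(best_step, best_val)
```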
Describe a tuning strategy for models where training is very expensive.
Use coarse-to-fine search
Low-fidelity training: fewer epochs, smaller subsets
Successive Halving / Hyperband
Transfer learning from prior tuned models
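The successive-halving idea above in miniature: evaluate many configurations at low fidelity, keep the best half, double the budget, repeat. The proxy score function is entirely hypothetical; in practice it would be a short low-budget training run (scikit-learn also ships this as the experimental HalvingRandomSearchCV).

```python
import random

random.seed(0)
configs = [{"lr": 10 ** random.uniform(-4, -1)} for _ in range(16)]
initial = list(configs)

def proxy_score(cfg, budget):
    # Hypothetical cheap evaluation: pretend quality improves with budget
    # and peaks near lr = 1e-2.
    return budget - abs(cfg["lr"] - 1e-2)

budget = 1
while len(configs) > 1:
    ranked = sorted(configs, key=lambda c: proxy_score(c, budget),
                    reverse=True)
    configs = ranked[: len(ranked) // 2]  # keep the best half
    budget *= 2                           # spend more on the survivors
best = configs[0]
```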
What new hyperparameters matter in modern LLM fine-tuning?
LoRA rank / α / dropout
Sequence length / chunk size
SFT batch size / grad accumulation
RLHF hyperparameters (KL coeff, λ)
Retrieval hyperparameters for RAG agents: top-k, chunking, embedding model, re-ranking type.
Why is hyperparameter tuning hard for agents compared to classic ML models?
Agents have non-deterministic behavior, long-horizon tasks, and multiple interacting modules (retriever ↔ planner ↔ tools). Metrics are noisy, and rollout variance makes classical tuning unstable.
What does a “training curve that diverges but validation remains flat” indicate?
The onset of overfitting: the model continues to improve on (memorize) the training data while generalization stagnates.
Explain how learning rate interacts with batch size.
Large batch → lower gradient noise → supports higher LR.
Small batch → noisy gradients → requires smaller LR to avoid instability.
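This interaction is often captured by the linear-scaling heuristic: when the batch size grows by a factor k, scale the learning rate by k. The base values below are illustrative, and the rule is a starting point rather than a guarantee:

```python
BASE_LR, BASE_BATCH = 1e-3, 32   # reference point tuned at a small batch

def scaled_lr(batch_size, base_lr=BASE_LR, base_batch=BASE_BATCH):
    # Linear scaling: k times the batch -> k times the learning rate,
    # compensating for the lower gradient noise of larger batches.
    return base_lr * batch_size / base_batch

print(scaled_lr(256))  # -> 0.008
```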
What steps would you take if training loss is not decreasing?
Check data pipeline (shuffling, preprocessing)
Decrease learning rate
Simplify model
Validate gradients (not NaN or zero)
Check label formatting or class mismatch
Verify correct task loss (e.g., BCE vs CE)
Why do LLM training runs use warmup schedules?
Warmup prevents large early updates that can destabilize the weights (especially with AdamW and large batch sizes, while the second-moment estimates are still unreliable), helping avoid divergence in the first few hundred steps.
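A sketch of a common shape, linear warmup followed by cosine decay (peak LR and step counts are illustrative values, not recommendations):

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=500, total_steps=10_000):
    if step < warmup_steps:
        # Linear warmup: ramp from 0 to peak_lr over warmup_steps.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```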
What do crossing learning curves (training above validation early, but later reversed) indicate?
Underfitting initially; with more data or longer training the model generalizes better, and validation error can even drop below training error (common when dropout or augmentation inflates the training-time loss but is disabled at evaluation).
How do learning curves help diagnose data scarcity vs model capacity issues?
High bias → training and validation errors both high → add model capacity.
High variance → training low, validation high → add regularization or more data.
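The high-variance pattern can be read straight off scikit-learn's learning_curve; an unconstrained decision tree (an illustrative choice) shows the classic gap:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# An unconstrained tree memorizes: training score pins near 1.0
# while the validation score lags -> high variance.
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5),
)
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
# A persistent positive gap at full data -> regularize or collect more data;
# both curves plateauing high in error would instead signal high bias.
```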
What would you do if learning curves show high variance?
Increase dataset size
Add regularization
Reduce model complexity
Use data augmentation
Add dropout (neural nets)
What learning curves are relevant for RAG or agent training?
Retrieval recall@k vs training iterations
Hallucination rate vs training steps
Agent success rate vs number of rollouts
Context-utilization efficiency (tokens used/kept)
These curves diagnose whether the model is learning to use retrieval, follow tools, or plan actions.
You find that cross-validation performance is highly variable across folds. What does this indicate?
Data may be heterogeneous or splits may contain different distributions. Solutions: stratification, group-based CV, or checking for data leakage.
Your RandomForest drastically overfits on small datasets. What would you adjust?
Decrease tree depth
Increase min_samples_leaf
Decrease the per-tree bootstrap sample size (max_samples) to increase tree diversity
Increase number of trees
Try ExtraTrees (more randomization)
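The adjustments above as a concrete configuration; the parameter values are illustrative starting points for a small dataset, not universal defaults:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)   # stand-in for a small dataset

rf = RandomForestClassifier(
    n_estimators=200,       # more trees: smoother averaging, no extra overfit
    max_depth=6,            # cap individual tree depth
    min_samples_leaf=5,     # require larger leaves
    max_samples=0.8,        # smaller bootstrap samples -> more diverse trees
    max_features="sqrt",    # decorrelate trees
    random_state=0,
)
score = cross_val_score(rf, X, y, cv=5).mean()
```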