Why is k-fold cross-validation generally preferred over a single train/validation split?
Because k-fold CV reduces variance in performance estimates by averaging across multiple train–test folds. It maximizes data usage, reduces sensitivity to a “lucky” or “unlucky” split, and gives a more stable generalization estimate.
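A minimal sketch with scikit-learn (the dataset and model here are illustrative choices, not prescribed by the answer):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: every sample serves as validation exactly once; the mean
# across folds is a lower-variance estimate than any single split.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```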
What conditions make cross-validation unreliable or misleading?
Time-series data (must use chronological splits, not random).
Data leakage (improper preprocessing before the split).
Highly imbalanced data unless using stratified CV.
Non-IID data, e.g., grouped or clustered samples → need GroupKFold.
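For the grouped-data case, GroupKFold guarantees that no group is split across train and validation (the group labels below are a hypothetical "same patient" example):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 1, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])  # e.g., samples from the same patient

folds = list(GroupKFold(n_splits=3).split(X, y, groups))
for train_idx, val_idx in folds:
    # No group ever appears on both sides of a split,
    # so correlated samples cannot leak across the boundary.
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```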
When should you prefer StratifiedKFold over KFold?
In classification tasks—especially imbalanced ones—to preserve label distribution across folds, preventing misleading validation metrics.
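A quick sketch of what stratification buys you on an imbalanced label vector (the 90/10 split below is illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 90 + [1] * 10)   # 90/10 imbalanced labels
X = np.zeros((100, 1))              # features are irrelevant to the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
counts = [np.bincount(y[val_idx]) for _, val_idx in skf.split(X, y)]
# Every fold keeps the exact 90/10 ratio: 18 majority, 2 minority samples.
# Plain KFold could put 0 minority samples in some folds.
```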
How do you correctly apply preprocessing inside a CV pipeline?
Wrap preprocessing steps in an sklearn Pipeline so every transformation is fit only on each fold's training data. Fitting a scaler or encoder on the full dataset before splitting leaks validation statistics into training.
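A minimal leakage-free sketch (dataset and estimator are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# cross_val_score clones and fits the whole pipeline inside each fold,
# so the scaler only ever sees that fold's training portion.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
```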
How is evaluation split design different in LLM/RAG/agent training?
You need separate splits for:
Retrieval grounding (faithfulness)
Generation tasks
Tool-use/agent interaction traces
Often requires multi-task stratification, separating evaluation of retrieval quality from generation quality.
What is the statistical purpose of bootstrapping?
Bootstrapping approximates the sampling distribution of an estimator by resampling with replacement, allowing estimation of variance, confidence intervals, and bias without strong parametric assumptions.
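A percentile bootstrap confidence interval for a sample mean, as a sketch (the data distribution and B = 2000 resamples are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)  # toy sample

# Resample with replacement B times; record the statistic each time.
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(2000)
])

# Percentile 95% confidence interval for the mean,
# with no parametric assumption about the sampling distribution.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
```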
Why does bagging reduce variance but not necessarily bias?
Because aggregating predictions from many high-variance learners (e.g., deep trees) averages out their individual errors, lowering variance. The ensemble's bias stays roughly equal to the base learner's bias, so bagging cannot rescue an underfitting base model.
Why are decision trees good candidates for bagging but linear models are not?
Trees have high variance and benefit from averaging; linear models have low variance, so bagging provides limited improvement.
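A quick empirical sketch of the variance-reduction effect (synthetic data and the 50-tree ensemble size are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0,
                       random_state=0)

tree = DecisionTreeRegressor(random_state=0)            # high-variance learner
bag = BaggingRegressor(tree, n_estimators=50, random_state=0)

tree_r2 = cross_val_score(tree, X, y, cv=5).mean()
bag_r2 = cross_val_score(bag, X, y, cv=5).mean()
# Averaging 50 bootstrapped trees typically lifts the CV R² well above
# the single tree's score; the same trick on a linear model would barely move it.
```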
How can bootstrapping ideas be applied in LLM training or evaluation?
Sampling-based confidence estimation for LLM outputs
Self-consistency decoding (majority voting over sampled responses)
Uncertainty estimation in retrieval (bootstrap sub-indexes or chunk samples)
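The self-consistency idea above reduces to a majority vote over sampled responses; a toy sketch (the sampled answers are hypothetical, not from a real model call):

```python
from collections import Counter

# Hypothetical answers sampled from an LLM at temperature > 0
samples = ["42", "42", "41", "42", "40"]

# Self-consistency: take the majority answer; its vote share is a rough,
# uncalibrated confidence proxy in the bootstrap spirit.
answer, votes = Counter(samples).most_common(1)[0]
confidence = votes / len(samples)
print(answer, confidence)  # -> 42 0.6
```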
Why can grid search be inefficient compared to random search?
For a fixed budget, random search draws a distinct value of every hyperparameter in each trial, so the few hyperparameters that actually matter get many unique settings. Grid search repeats the same values along the irrelevant dimensions, wasting computation.
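A minimal RandomizedSearchCV sketch, sampling a regularization strength from a log-uniform distribution (model, range, and n_iter are illustrative):

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# 20 trials draw 20 distinct C values; a 20-point grid shared across
# two hyperparameters would give C far fewer unique settings.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e3)},
    n_iter=20, cv=3, random_state=0,
)
search.fit(X, y)
```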
What is early-stopping and why is it both a regularizer and tuner?
Early stopping ends training once validation performance stops improving. It prevents overfitting (regularizer) and selects an optimal number of training iterations (hyperparameter tuner).
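A framework-free sketch of patience-based early stopping; the validation-loss trajectory is a made-up toy sequence standing in for a real training loop:

```python
best_val, best_step, patience, bad_epochs = float("inf"), 0, 3, 0

# Toy validation losses: improve, then plateau and start to overfit.
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74]

for step, val in enumerate(val_losses):
    if val < best_val:
        best_val, best_step, bad_epochs = val, step, 0  # checkpoint here
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # no improvement for `patience` epochs
            break                    # regularization: stop before memorizing

# best_step is the tuned "number of iterations" hyperparameter.
print(best_step, best_val)
```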
Describe a tuning strategy for models where training is very expensive.
Use coarse-to-fine search
Low-fidelity training: fewer epochs, smaller subsets
Successive Halving / Hyperband
Transfer learning from prior tuned models
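The successive-halving idea above in miniature: evaluate many configurations at low fidelity, keep the best half, double the budget, repeat. The proxy score function is entirely hypothetical; in practice it would be a short low-budget training run (scikit-learn also ships this as the experimental HalvingRandomSearchCV).

```python
import random

random.seed(0)
configs = [{"lr": 10 ** random.uniform(-4, -1)} for _ in range(16)]
initial = list(configs)

def proxy_score(cfg, budget):
    # Hypothetical cheap evaluation: pretend quality improves with budget
    # and peaks near lr = 1e-2.
    return budget - abs(cfg["lr"] - 1e-2)

budget = 1
while len(configs) > 1:
    ranked = sorted(configs, key=lambda c: proxy_score(c, budget),
                    reverse=True)
    configs = ranked[: len(ranked) // 2]  # keep the best half
    budget *= 2                           # spend more on the survivors
best = configs[0]
```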
What new hyperparameters matter in modern LLM fine-tuning?
LoRA rank / α / dropout
Sequence length / chunk size
SFT batch size / grad accumulation
RLHF hyperparameters (KL coeff, λ)
Retrieval hyperparameters for RAG agents: top-k, chunking, embedding model, re-ranking type.
Why is hyperparameter tuning hard for agents compared to classic ML models?
Agents have non-deterministic behavior, long-horizon tasks, and multiple interacting modules (retriever ↔ planner ↔ tools). Metrics are noisy, and rollout variance makes classical tuning unstable.
What does a “training curve that diverges but validation remains flat” indicate?
The onset of overfitting: the model continues to improve on (memorize) the training data while generalization stagnates.
Explain how learning rate interacts with batch size.
Large batch → lower gradient noise → supports higher LR.
Small batch → noisy gradients → requires smaller LR to avoid instability.
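This interaction is often captured by the linear-scaling heuristic: when the batch size grows by a factor k, scale the learning rate by k. The base values below are illustrative, and the rule is a starting point rather than a guarantee:

```python
BASE_LR, BASE_BATCH = 1e-3, 32   # reference point tuned at a small batch

def scaled_lr(batch_size, base_lr=BASE_LR, base_batch=BASE_BATCH):
    # Linear scaling: k times the batch -> k times the learning rate,
    # compensating for the lower gradient noise of larger batches.
    return base_lr * batch_size / base_batch

print(scaled_lr(256))  # -> 0.008
```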
What steps would you take if training loss is not decreasing?
Check data pipeline (shuffling, preprocessing)
Decrease learning rate
Simplify model
Validate gradients (not NaN or zero)
Check label formatting or class mismatch
Verify correct task loss (e.g., BCE vs CE)
Why do LLM training runs use warmup schedules?
Warmup prevents large early updates that can destabilize the weights (especially with AdamW and large batch sizes, while the second-moment estimates are still unreliable), helping avoid divergence in the first few hundred steps.
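A sketch of a common shape, linear warmup followed by cosine decay (peak LR and step counts are illustrative values, not recommendations):

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=500, total_steps=10_000):
    if step < warmup_steps:
        # Linear warmup: ramp from 0 to peak_lr over warmup_steps.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```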
What do crossing learning curves (training above validation early, but later reversed) indicate?
Underfitting initially; with more data or longer training the model generalizes better, and validation error can even drop below training error (common when dropout or augmentation inflates the training-time loss but is disabled at evaluation).
How do learning curves help diagnose data scarcity vs model capacity issues?
High bias → training and validation errors both high → add model capacity.
High variance → training low, validation high → add regularization or more data.
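The high-variance pattern can be read straight off scikit-learn's learning_curve; an unconstrained decision tree (an illustrative choice) shows the classic gap:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# An unconstrained tree memorizes: training score pins near 1.0
# while the validation score lags -> high variance.
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5),
)
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
# A persistent positive gap at full data -> regularize or collect more data;
# both curves plateauing high in error would instead signal high bias.
```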
What would you do if learning curves show high variance?
Increase dataset size
Add regularization
Reduce model complexity
Use data augmentation
Add dropout (neural nets)
What learning curves are relevant for RAG or agent training?
Retrieval recall@k vs training iterations
Hallucination rate vs training steps
Agent success rate vs number of rollouts
Context-utilization efficiency (tokens used/kept)
These curves diagnose whether the model is learning to use retrieval, follow tools, or plan actions.
You find that cross-validation performance is highly variable across folds. What does this indicate?
Data may be heterogeneous or splits may contain different distributions. Solutions: stratification, group-based CV, or checking for data leakage.
Your RandomForest drastically overfits on small datasets. What would you adjust?
Decrease tree depth
Increase min_samples_leaf
Decrease the per-tree bootstrap sample size (max_samples) to increase tree diversity
Increase number of trees
Try ExtraTrees (more randomization)
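The adjustments above as a concrete configuration; the parameter values are illustrative starting points for a small dataset, not universal defaults:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)   # stand-in for a small dataset

rf = RandomForestClassifier(
    n_estimators=200,       # more trees: smoother averaging, no extra overfit
    max_depth=6,            # cap individual tree depth
    min_samples_leaf=5,     # require larger leaves
    max_samples=0.8,        # smaller bootstrap samples -> more diverse trees
    max_features="sqrt",    # decorrelate trees
    random_state=0,
)
score = cross_val_score(rf, X, y, cv=5).mean()
```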