What’s the difference between a parameter and a statistic?
A parameter describes a population (e.g., μ, σ), while a statistic describes a sample (e.g., x̄, s).
What is the Central Limit Theorem (CLT)?
As sample size grows, the sampling distribution of the sample mean approaches a normal distribution, regardless of the population’s shape (provided the population has finite variance).
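A minimal NumPy sketch of the CLT in action: the population here is an exponential distribution (heavily skewed), yet the means of repeated samples cluster normally around μ with spread σ/√n.

```python
import numpy as np

# Draw many samples of size n from a skewed Exp(1) population (mean 1, std 1)
# and look at the distribution of the sample means.
rng = np.random.default_rng(0)
n, trials = 50, 10_000

sample_means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

# CLT predicts: center ≈ μ = 1, spread ≈ σ/√n = 1/√50 ≈ 0.141
print(sample_means.mean())
print(sample_means.std())
```

Plotting a histogram of `sample_means` makes the bell shape obvious even though the underlying population is far from normal.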
Define bias and variance in model performance.
Bias is systematic error (underfitting). Variance is sensitivity to noise (overfitting). Goal: minimize both (bias–variance trade-off).
What’s the difference between correlation and causation?
Correlation measures association, not cause–effect. Causation requires controlled experiments or causal inference.
What are Type I and Type II errors?
Type I = false positive (reject true H₀); Type II = false negative (fail to reject false H₀).
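A small simulation makes the Type I error rate concrete: when H₀ is actually true, a test at significance level α = 0.05 still rejects about 5% of the time (toy z-test, hypothetical data).

```python
import numpy as np

# Simulate Type I errors: H0 (mean = 0) is true in every trial, yet the test
# rejects in roughly alpha = 5% of them.
rng = np.random.default_rng(42)
z_crit, n, trials = 1.96, 100, 5_000   # 1.96 ≈ two-sided critical value at alpha = 0.05

rejections = 0
for _ in range(trials):
    x = rng.normal(loc=0.0, scale=1.0, size=n)    # H0 is true
    z = x.mean() / (x.std(ddof=1) / np.sqrt(n))   # test statistic
    if abs(z) > z_crit:
        rejections += 1                           # a Type I error

print(rejections / trials)   # close to 0.05
```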
Difference between supervised and unsupervised learning?
Supervised uses labeled data (e.g., regression, classification); unsupervised finds patterns in unlabeled data (e.g., clustering, dimensionality reduction).
Explain cross-validation.
Splits data into multiple folds; trains on subsets and tests on held-out data to better estimate generalization.
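A minimal hand-rolled k-fold sketch (toy "model" that predicts the training-fold mean) shows the mechanics: each fold serves once as held-out data while the rest trains.

```python
import numpy as np

# 5-fold cross-validation on toy data. The "model" is just the mean of the
# training folds; each held-out fold is scored by mean squared error.
rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, size=100)
k = 5
folds = np.array_split(rng.permutation(len(y)), k)

scores = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    prediction = y[train_idx].mean()                 # "train" on k-1 folds
    mse = ((y[test_idx] - prediction) ** 2).mean()   # score on the held-out fold
    scores.append(mse)

print(np.mean(scores))   # averaged estimate of generalization error
```

Libraries like scikit-learn wrap this loop (e.g., `cross_val_score`), but the logic is exactly the above.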
What’s regularization?
Penalizing model complexity to prevent overfitting (e.g., L1 for sparsity, L2 for shrinkage).
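A minimal sketch of L2 regularization using the ridge closed form w = (XᵀX + λI)⁻¹Xᵀy on toy data: increasing λ shrinks the coefficient vector toward zero.

```python
import numpy as np

# Ridge regression via its closed-form solution; larger lambda = stronger
# penalty on coefficient size = more shrinkage.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_small = ridge(X, y, lam=0.01)
w_large = ridge(X, y, lam=100.0)
print(np.linalg.norm(w_large) < np.linalg.norm(w_small))   # shrinkage: True
```

L1 (lasso) has no closed form but behaves analogously, with the extra property of driving some coefficients exactly to zero (sparsity).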
When would you use precision vs. recall?
Precision for minimizing false positives; recall for minimizing false negatives. Use F1 when balancing both.
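Both metrics fall straight out of the error counts, as this hand computation on toy labels shows:

```python
# Precision, recall, and F1 computed by hand from toy predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)   # of predicted positives, how many are real
recall = tp / (tp + fn)      # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)   # 0.75 0.75 0.75
```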
What is feature importance?
A measure of how much each feature contributes to model predictions (e.g., via Gini importance, permutation importance, or SHAP values).
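Permutation importance, one of the methods named above, can be sketched directly: shuffle one feature at a time and measure how much the fitted model's error grows (toy least-squares model; feature 0 drives y, feature 1 is noise).

```python
import numpy as np

# Permutation importance on a toy OLS model: only feature 0 influences y,
# so shuffling it should hurt much more than shuffling feature 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

w, *_ = np.linalg.lstsq(X, y, rcond=None)
base_mse = ((X @ w - y) ** 2).mean()

importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])   # break feature j's link to y
    importances.append(((Xp @ w - y) ** 2).mean() - base_mse)

print(importances)   # importance of feature 0 dwarfs feature 1
```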
What’s the difference between tidy and wide data formats?
Tidy: each variable is a column, each observation a row. Wide: values of one variable are spread across multiple columns (e.g., one column per time point), a layout common for time series.
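A minimal pandas sketch (hypothetical city/temperature data) of converting between the two layouts with `melt` and `pivot`:

```python
import pandas as pd

# Wide layout: one column per year.
wide = pd.DataFrame({
    "city": ["Oslo", "Lima"],
    "2022": [10.1, 19.3],
    "2023": [10.4, 19.8],
})

# Wide -> tidy: one row per (city, year) observation.
tidy = wide.melt(id_vars="city", var_name="year", value_name="temp")
print(tidy)

# Tidy -> wide again.
back = tidy.pivot(index="city", columns="year", values="temp").reset_index()
```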
Common strategies for missing data?
Drop rows, impute (mean/median/mode/KNN), or model missingness explicitly.
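The first two strategies in one small pandas sketch on a toy column with gaps:

```python
import pandas as pd

# A toy column with two missing values.
s = pd.Series([1.0, None, 3.0, None, 5.0])

dropped = s.dropna()            # strategy 1: drop incomplete rows
imputed = s.fillna(s.median())  # strategy 2: impute with the median (3.0)

print(len(dropped), imputed.tolist())   # 3 [1.0, 3.0, 3.0, 3.0, 5.0]
```

Which strategy is right depends on why the data are missing (MCAR/MAR/MNAR); imputation quietly assumes the gaps are not informative.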
What’s data leakage?
When information from outside the training dataset (especially from the test set) influences model training, inflating performance estimates.
Why use log transforms?
To stabilize variance, make skewed data more normal-like, and reduce the influence of outliers.
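A quick NumPy sketch on simulated right-skewed (lognormal) data; `log1p` (log(1 + x)) is used so zeros would stay finite.

```python
import numpy as np

# Heavily right-skewed toy data; a log transform pulls in the long tail.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

logged = np.log1p(x)

# Sample skewness: third standardized moment.
skew = lambda a: ((a - a.mean()) ** 3).mean() / a.std() ** 3
print(skew(x), skew(logged))   # skewness drops sharply after the transform
```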
Difference between normalization and standardization?
Normalization (min–max scaling) rescales values to [0, 1]; standardization rescales to mean 0 and standard deviation 1 (z-scores).
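Both rescalings applied to the same toy vector:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

normalized = (x - x.min()) / (x.max() - x.min())   # min–max scaling to [0, 1]
standardized = (x - x.mean()) / x.std()            # z-scores: mean 0, std 1

print(normalized)
print(standardized.mean(), standardized.std())   # ~0.0 and 1.0
```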
What’s ROC–AUC?
The area under the ROC curve; measures the model’s ability to rank positive instances higher than negatives.
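The ranking interpretation can be computed directly: AUC equals the probability that a randomly chosen positive scores higher than a randomly chosen negative (ties count half). A sketch on toy scores:

```python
import numpy as np

# AUC via the rank interpretation: fraction of (positive, negative) score
# pairs where the positive is ranked higher.
y_true = np.array([0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9])

pos, neg = scores[y_true == 1], scores[y_true == 0]
pairs = pos[:, None] - neg[None, :]          # all positive-vs-negative pairs
auc = (pairs > 0).mean() + 0.5 * (pairs == 0).mean()
print(auc)   # 5 of 6 pairs correctly ordered -> 0.833...
```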
What’s the purpose of a confusion matrix?
To summarize classification performance via true/false positives/negatives.
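Building one by hand on toy labels makes the layout explicit (rows = actual class, columns = predicted class):

```python
import numpy as np

# Count each (actual, predicted) pair into a 2x2 matrix.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

print(cm)
# [[TN FP]    [[3 1]
#  [FN TP]] =  [1 3]]
```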
Explain overfitting and underfitting.
Overfitting: model captures noise (low train error, high test error). Underfitting: model too simple (high train/test error).
Why use pipelines in ML workflows?
To streamline preprocessing + modeling, ensure reproducibility, and prevent data leakage during cross-validation.
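The leakage-prevention point can be sketched by hand with NumPy (toy data): preprocessing parameters are fit on the training split only, then reused unchanged on the test split.

```python
import numpy as np

# The core pipeline discipline: fit preprocessing (here, standardization
# statistics) on the training data alone, then apply the *same* parameters
# to the test data. Test statistics never influence the fit.
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
X_train, X_test = X[:80], X[80:]

mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)  # "fit" on train only

X_train_s = (X_train - mu) / sigma   # "transform" train
X_test_s = (X_test - mu) / sigma     # "transform" test with train's mu/sigma

print(X_train_s.mean(axis=0))   # ~[0, 0]; test means need not be exactly 0
```

Libraries such as scikit-learn bundle these fit/transform steps with the final model (e.g., a `Pipeline`), so cross-validation automatically refits the preprocessing inside each fold.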
What’s concept drift?
When statistical properties of target or input variables change over time, degrading model performance.