What’s the difference between a parameter and a statistic?
A parameter describes a population (e.g., μ, σ), while a statistic describes a sample (e.g., x̄, s).
What is the Central Limit Theorem (CLT)?
As sample size grows, the sampling distribution of the sample mean approaches a normal distribution, regardless of the population’s shape (provided the population has finite variance).
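A minimal NumPy sketch of the CLT in action: the population here is an exponential distribution (heavily skewed), yet the means of repeated samples cluster normally around μ with spread σ/√n.

```python
import numpy as np

# Draw many samples of size n from a skewed Exp(1) population (mean 1, std 1)
# and look at the distribution of the sample means.
rng = np.random.default_rng(0)
n, trials = 50, 10_000

sample_means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

# CLT predicts: center ≈ μ = 1, spread ≈ σ/√n = 1/√50 ≈ 0.141
print(sample_means.mean())
print(sample_means.std())
```

Plotting a histogram of `sample_means` makes the bell shape obvious even though the underlying population is far from normal.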
Define bias and variance in model performance.
Bias is systematic error (underfitting). Variance is sensitivity to noise (overfitting). Goal: minimize both (bias–variance trade-off).
What’s the difference between correlation and causation?
Correlation measures association, not cause–effect. Causation requires controlled experiments or causal inference.
What are Type I and Type II errors?
Type I = false positive (reject true H₀); Type II = false negative (fail to reject false H₀).
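A small simulation makes the Type I error rate concrete: when H₀ is actually true, a test at significance level α = 0.05 still rejects about 5% of the time (toy z-test, hypothetical data).

```python
import numpy as np

# Simulate Type I errors: H0 (mean = 0) is true in every trial, yet the test
# rejects in roughly alpha = 5% of them.
rng = np.random.default_rng(42)
z_crit, n, trials = 1.96, 100, 5_000   # 1.96 ≈ two-sided critical value at alpha = 0.05

rejections = 0
for _ in range(trials):
    x = rng.normal(loc=0.0, scale=1.0, size=n)    # H0 is true
    z = x.mean() / (x.std(ddof=1) / np.sqrt(n))   # test statistic
    if abs(z) > z_crit:
        rejections += 1                           # a Type I error

print(rejections / trials)   # close to 0.05
```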
Difference between supervised and unsupervised learning?
Supervised uses labeled data (e.g., regression, classification); unsupervised finds patterns in unlabeled data (e.g., clustering, dimensionality reduction).
Explain cross-validation.
Splits data into multiple folds; trains on subsets and tests on held-out data to better estimate generalization.
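A minimal hand-rolled k-fold sketch (toy "model" that predicts the training-fold mean) shows the mechanics: each fold serves once as held-out data while the rest trains.

```python
import numpy as np

# 5-fold cross-validation on toy data. The "model" is just the mean of the
# training folds; each held-out fold is scored by mean squared error.
rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, size=100)
k = 5
folds = np.array_split(rng.permutation(len(y)), k)

scores = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    prediction = y[train_idx].mean()                 # "train" on k-1 folds
    mse = ((y[test_idx] - prediction) ** 2).mean()   # score on the held-out fold
    scores.append(mse)

print(np.mean(scores))   # averaged estimate of generalization error
```

Libraries like scikit-learn wrap this loop (e.g., `cross_val_score`), but the logic is exactly the above.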
What’s regularization?
Penalizing model complexity to prevent overfitting (e.g., L1 for sparsity, L2 for shrinkage).
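A minimal sketch of L2 regularization using the ridge closed form w = (XᵀX + λI)⁻¹Xᵀy on toy data: increasing λ shrinks the coefficient vector toward zero.

```python
import numpy as np

# Ridge regression via its closed-form solution; larger lambda = stronger
# penalty on coefficient size = more shrinkage.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_small = ridge(X, y, lam=0.01)
w_large = ridge(X, y, lam=100.0)
print(np.linalg.norm(w_large) < np.linalg.norm(w_small))   # shrinkage: True
```

L1 (lasso) has no closed form but behaves analogously, with the extra property of driving some coefficients exactly to zero (sparsity).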
When would you use precision vs. recall?
Precision for minimizing false positives; recall for minimizing false negatives. Use F1 when balancing both.
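Both metrics fall straight out of the error counts, as this hand computation on toy labels shows:

```python
# Precision, recall, and F1 computed by hand from toy predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)   # of predicted positives, how many are real
recall = tp / (tp + fn)      # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)   # 0.75 0.75 0.75
```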
What is feature importance?
A measure of how much each feature contributes to model predictions (e.g., via Gini importance, permutation importance, or SHAP values).
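Permutation importance, one of the methods named above, can be sketched directly: shuffle one feature at a time and measure how much the fitted model's error grows (toy least-squares model; feature 0 drives y, feature 1 is noise).

```python
import numpy as np

# Permutation importance on a toy OLS model: only feature 0 influences y,
# so shuffling it should hurt much more than shuffling feature 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

w, *_ = np.linalg.lstsq(X, y, rcond=None)
base_mse = ((X @ w - y) ** 2).mean()

importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])   # break feature j's link to y
    importances.append(((Xp @ w - y) ** 2).mean() - base_mse)

print(importances)   # importance of feature 0 dwarfs feature 1
```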
What’s the difference between tidy and wide data formats?
Tidy: each variable is a column, each observation a row. Wide: values of one variable are spread across multiple columns (e.g., one column per time point), a layout common for time series.
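A minimal pandas sketch (hypothetical city/temperature data) of converting between the two layouts with `melt` and `pivot`:

```python
import pandas as pd

# Wide layout: one column per year.
wide = pd.DataFrame({
    "city": ["Oslo", "Lima"],
    "2022": [10.1, 19.3],
    "2023": [10.4, 19.8],
})

# Wide -> tidy: one row per (city, year) observation.
tidy = wide.melt(id_vars="city", var_name="year", value_name="temp")
print(tidy)

# Tidy -> wide again.
back = tidy.pivot(index="city", columns="year", values="temp").reset_index()
```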
Common strategies for missing data?
Drop rows, impute (mean/median/mode/KNN), or model missingness explicitly.
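The first two strategies in one small pandas sketch on a toy column with gaps:

```python
import pandas as pd

# A toy column with two missing values.
s = pd.Series([1.0, None, 3.0, None, 5.0])

dropped = s.dropna()            # strategy 1: drop incomplete rows
imputed = s.fillna(s.median())  # strategy 2: impute with the median (3.0)

print(len(dropped), imputed.tolist())   # 3 [1.0, 3.0, 3.0, 3.0, 5.0]
```

Which strategy is right depends on why the data are missing (MCAR/MAR/MNAR); imputation quietly assumes the gaps are not informative.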
What’s data leakage?
When information from outside the training dataset (especially from the test set) influences model training, inflating performance estimates.
Why use log transforms?
To stabilize variance, make skewed data more normal-like, and reduce the influence of outliers.
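A quick NumPy sketch on simulated right-skewed (lognormal) data; `log1p` (log(1 + x)) is used so zeros would stay finite.

```python
import numpy as np

# Heavily right-skewed toy data; a log transform pulls in the long tail.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

logged = np.log1p(x)

# Sample skewness: third standardized moment.
skew = lambda a: ((a - a.mean()) ** 3).mean() / a.std() ** 3
print(skew(x), skew(logged))   # skewness drops sharply after the transform
```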
Difference between normalization and standardization?
Normalization (min–max scaling) rescales values to [0, 1]; standardization rescales to mean 0 and standard deviation 1 (z-scores).
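Both rescalings applied to the same toy vector:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

normalized = (x - x.min()) / (x.max() - x.min())   # min–max scaling to [0, 1]
standardized = (x - x.mean()) / x.std()            # z-scores: mean 0, std 1

print(normalized)
print(standardized.mean(), standardized.std())   # ~0.0 and 1.0
```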
What’s ROC–AUC?
The area under the ROC curve; measures the model’s ability to rank positive instances higher than negatives.
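The ranking interpretation can be computed directly: AUC equals the probability that a randomly chosen positive scores higher than a randomly chosen negative (ties count half). A sketch on toy scores:

```python
import numpy as np

# AUC via the rank interpretation: fraction of (positive, negative) score
# pairs where the positive is ranked higher.
y_true = np.array([0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9])

pos, neg = scores[y_true == 1], scores[y_true == 0]
pairs = pos[:, None] - neg[None, :]          # all positive-vs-negative pairs
auc = (pairs > 0).mean() + 0.5 * (pairs == 0).mean()
print(auc)   # 5 of 6 pairs correctly ordered -> 0.833...
```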
What’s the purpose of a confusion matrix?
To summarize classification performance via true/false positives/negatives.
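Building one by hand on toy labels makes the layout explicit (rows = actual class, columns = predicted class):

```python
import numpy as np

# Count each (actual, predicted) pair into a 2x2 matrix.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

print(cm)
# [[TN FP]    [[3 1]
#  [FN TP]] =  [1 3]]
```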
Explain overfitting and underfitting.
Overfitting: model captures noise (low train error, high test error). Underfitting: model too simple (high train/test error).
Why use pipelines in ML workflows?
To streamline preprocessing + modeling, ensure reproducibility, and prevent data leakage during cross-validation.
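The leakage-prevention point can be sketched by hand with NumPy (toy data): preprocessing parameters are fit on the training split only, then reused unchanged on the test split.

```python
import numpy as np

# The core pipeline discipline: fit preprocessing (here, standardization
# statistics) on the training data alone, then apply the *same* parameters
# to the test data. Test statistics never influence the fit.
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
X_train, X_test = X[:80], X[80:]

mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)  # "fit" on train only

X_train_s = (X_train - mu) / sigma   # "transform" train
X_test_s = (X_test - mu) / sigma     # "transform" test with train's mu/sigma

print(X_train_s.mean(axis=0))   # ~[0, 0]; test means need not be exactly 0
```

Libraries such as scikit-learn bundle these fit/transform steps with the final model (e.g., a `Pipeline`), so cross-validation automatically refits the preprocessing inside each fold.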
What’s concept drift?
When statistical properties of target or input variables change over time, degrading model performance.