Data Science Fundamentals Flashcards

(20 cards)

1
Q

What’s the difference between a parameter and a statistic?

A

A parameter describes a population (e.g., μ, σ), while a statistic describes a sample (e.g., x̄, s).

2
Q

What is the Central Limit Theorem (CLT)?

A

The sampling distribution of the sample mean approaches a normal distribution as sample size increases, regardless of the population’s distribution.
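A quick NumPy sketch of the CLT (illustrative numbers, assumed): sample means drawn from a heavily skewed exponential population still cluster symmetrically around the population mean once n is moderate.

```python
import numpy as np

# 10,000 samples of size n=50 from a skewed exponential population (mean 2.0).
rng = np.random.default_rng(0)
draws = rng.exponential(scale=2.0, size=(10_000, 50))
sample_means = draws.mean(axis=1)

# By the CLT the means are approximately normal around 2.0,
# with standard error sigma / sqrt(n) = 2 / sqrt(50) ≈ 0.28.
print(sample_means.mean(), sample_means.std())
```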

3
Q

Define bias and variance in model performance.

A

Bias is systematic error from an overly simple model (underfitting); variance is sensitivity to noise in the training data (overfitting). The goal is to balance the two so total error is minimized (the bias–variance trade-off).

4
Q

What’s the difference between correlation and causation?

A

Correlation measures association, not cause and effect; establishing causation requires controlled experiments or causal-inference methods.

5
Q

What are Type I and Type II errors?

A

Type I = false positive (reject true H₀); Type II = false negative (fail to reject false H₀).

6
Q

Difference between supervised and unsupervised learning?

A

Supervised uses labeled data (e.g., regression, classification); unsupervised finds patterns in unlabeled data (e.g., clustering, dimensionality reduction).

7
Q

Explain cross-validation.

A

Splits data into multiple folds; trains on subsets and tests on held-out data to better estimate generalization.
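A minimal scikit-learn sketch (iris used only as convenient sample data): each of the 5 folds is held out once while the other four train the model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# 5-fold CV: returns one held-out accuracy per fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # average generalization estimate
```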

8
Q

What’s regularization?

A

Penalizing model complexity to prevent overfitting (e.g., L1 for sparsity, L2 for shrinkage).
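A small sketch of the L1-vs-L2 difference on synthetic data (all values assumed for illustration): only feature 0 carries signal, and the L1 penalty zeroes the rest out while L2 merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=200)  # only feature 0 matters

ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks all coefficients toward 0
lasso = Lasso(alpha=0.1).fit(X, y)  # L1: drives irrelevant coefficients to exactly 0
print(np.count_nonzero(lasso.coef_), np.count_nonzero(ridge.coef_))
```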

9
Q

When would you use precision vs. recall?

A

Precision for minimizing false positives; recall for minimizing false negatives. Use F1 when balancing both.
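A tiny worked example with scikit-learn metrics (labels made up for illustration): two true positives, one false positive, one false negative.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # one FN (index 2), one FP (index 3)

# precision = TP / (TP + FP); recall = TP / (TP + FN)
print(precision_score(y_true, y_pred))  # 2/3
print(recall_score(y_true, y_pred))     # 2/3
print(f1_score(y_true, y_pred))         # harmonic mean of the two
```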

10
Q

What is feature importance?

A

A measure of how much each feature contributes to model predictions (e.g., via Gini importance, permutation importance, or SHAP values).
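A sketch of permutation importance with scikit-learn (breast-cancer data used only as a convenient example): shuffle one feature at a time and measure how much the score drops.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Bigger score drop after shuffling a feature = more important feature.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean.argmax())  # index of the most influential feature
```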

11
Q

What’s the difference between tidy and wide data formats?

A

Tidy: each variable in a column, each observation in a row. Wide: values of one variable spread across multiple columns (e.g., one column per time point), common for time series.
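A pandas sketch of the two shapes (toy values assumed): `melt` converts wide to tidy, `pivot` goes back.

```python
import pandas as pd

# Wide: one row per city, one column per year.
wide = pd.DataFrame({"city": ["Oslo", "Lima"], "2022": [10, 30], "2023": [12, 31]})

# Tidy: one row per (city, year) observation.
tidy = wide.melt(id_vars="city", var_name="year", value_name="temp")
print(tidy)

# And back to wide.
back = tidy.pivot(index="city", columns="year", values="temp")
```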

12
Q

Common strategies for missing data?

A

Drop rows, impute (mean/median/mode/KNN), or model missingness explicitly.
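A minimal imputation sketch with scikit-learn (toy matrix assumed): each NaN is replaced by its column's median.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, np.nan]])

# Median imputation: col 0 median of [1, 7] is 4; col 1 median of [2, 4] is 3.
imputed = SimpleImputer(strategy="median").fit_transform(X)
print(imputed)
```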

13
Q

What’s data leakage?

A

When information from outside the training dataset (especially from the test set) influences model training, inflating performance estimates.

14
Q

Why use log transforms?

A

To stabilize variance, make skewed data more normal-like, and reduce the influence of outliers.
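A NumPy sketch (lognormal "incomes" assumed purely for illustration): the log transform compresses the long right tail, so relative spread drops sharply.

```python
import numpy as np

rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=10, sigma=1, size=10_000)  # heavily right-skewed

# log1p compresses the right tail; the transformed data is roughly normal.
logged = np.log1p(incomes)
print(incomes.std() / incomes.mean())  # large relative spread before
print(logged.std() / logged.mean())    # much smaller after
```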

15
Q

Difference between normalization and standardization?

A

Normalization (min–max scaling) rescales features to [0, 1]; standardization (z-scoring) rescales to mean 0, standard deviation 1.
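A side-by-side sketch with scikit-learn scalers (toy column assumed):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0]])

normalized = MinMaxScaler().fit_transform(X)      # rescaled to [0, 1]
standardized = StandardScaler().fit_transform(X)  # mean 0, std 1
print(normalized.ravel(), standardized.ravel())
```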

16
Q

What’s ROC–AUC?

A

The area under the ROC curve; measures the model’s ability to rank positive instances higher than negatives.
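A tiny worked example (scores made up): of the four positive/negative pairs, three are ranked correctly, so AUC = 0.75.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# AUC = probability a random positive outranks a random negative;
# here the 0.35 positive vs 0.4 negative pair is the one mis-ranking.
auc = roc_auc_score(y_true, scores)
print(auc)  # 0.75
```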

17
Q

What’s the purpose of a confusion matrix?

A

To summarize classification performance via true/false positives/negatives.
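A minimal sketch with scikit-learn (labels made up): rows are actual classes, columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Layout for binary labels:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[1 1] [1 2]]
```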

18
Q

Explain overfitting and underfitting.

A

Overfitting: model captures noise (low train error, high test error). Underfitting: model too simple (high train/test error).

19
Q

Why use pipelines in ML workflows?

A

To streamline preprocessing + modeling, ensure reproducibility, and prevent data leakage during cross-validation.
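A scikit-learn sketch (iris again as convenient sample data): because the scaler lives inside the pipeline, it is re-fit within each CV fold, so held-out statistics never leak into training.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling a raw X before CV would leak test-fold statistics into training;
# the pipeline fits the scaler on each training fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```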

20
Q

What’s concept drift?

A

When statistical properties of target or input variables change over time, degrading model performance.