Statistical Learning Flashcards

(43 cards)

1
Q

Difference between prediction and inference?

A

prediction: minimize the prediction error of ŷ
inference: build an interpretable model to understand f(x)

2
Q

What is a parametric approach?

A
  1. make an assumption about the functional form of f
  2. use a fitting procedure to train a model of that form
3
Q

What is supervised learning?

A

for each observation of a set of features Xᵢ there is a measured response yᵢ

4
Q

Difference between regression and classification problems?

A

r: quantitative response
c: qualitative response

5
Q

What is the variance of a model?

A

how much f(x) changes with different sets of training data

6
Q

What is the bias of a model?

A

the error introduced by approximating a real-life problem with a simpler model

7
Q

What is the error rate?

A

incorrect predictions / total number of predictions

8
Q

What is the accuracy?

A

1 - error rate

9
Q

What is the Bayes classifier?

A

the classifier that assigns each observation to its most likely class given the predictor values; it has the smallest possible probability of misclassification (benchmark)

10
Q

K-nearest neighbours classification:

A

identify the k nearest neighbours of x, estimate the conditional probability of each class j as the fraction of neighbours in class j, then assign x to the class with the highest probability
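A minimal Python sketch of this procedure, assuming Euclidean distance and a made-up toy dataset (function name and data are illustrative):

```python
import math
from collections import Counter

def knn_classify(train, x, k=3):
    """Classify point x by majority vote among its k nearest neighbours.

    train is a list of (features, label) pairs; distance is Euclidean.
    """
    # sort training points by distance to x and keep the k closest
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
    # the vote fractions estimate the conditional class probabilities
    votes = Counter(label for _, label in neighbours)
    # assign x to the class with the highest estimated probability
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_classify(train, (0.5, 0.5), k=3))  # a point near the "A" cluster
```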

11
Q

Residual Sum of Squares (RSS)

A

RSS = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
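The formula translates directly into Python (toy numbers are illustrative):

```python
def rss(y, y_hat):
    # RSS = sum over i of (y_i - yhat_i)^2
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

print(rss([1, 2, 3], [1.1, 1.9, 3.2]))  # small residuals -> small RSS
```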

12
Q

Standard Error (SE(ß))

A

SE(ß₁)² = σ² / ∑ᵢ₌₁ⁿ (xᵢ − x̄)²
(the formula gives the squared standard error, i.e. the variance of the slope estimate)

13
Q

F-test formula

A

F = ((TSS − RSS) / p) / (RSS / (n − p − 1))

if F > F crit reject H0
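The F statistic is easy to compute once TSS and RSS are known; a short sketch with illustrative toy numbers:

```python
def f_statistic(tss, rss, n, p):
    """F = ((TSS - RSS) / p) / (RSS / (n - p - 1))"""
    return ((tss - rss) / p) / (rss / (n - p - 1))

# toy values: a large F suggests at least one predictor matters
print(f_statistic(tss=100.0, rss=20.0, n=50, p=4))
```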

14
Q

Forward Selection:

A

starting from the null model, add the variable whose inclusion yields the lowest RSS; repeat until a stopping rule is satisfied
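The greedy loop can be sketched generically; here `rss_of` is a hypothetical stand-in for fitting a linear model on a subset and returning its RSS, and the stopping rule is simply a maximum model size:

```python
def forward_selection(predictors, rss_of, max_size):
    """Greedy forward selection: starting from the null model, add the
    predictor whose inclusion gives the lowest RSS, until the stopping
    rule (here a maximum model size) is satisfied."""
    selected, remaining = [], list(predictors)
    while remaining and len(selected) < max_size:
        best = min(remaining, key=lambda p: rss_of(selected + [p]))
        selected.append(best)
        remaining.remove(best)
    return selected

# toy stand-in for a real RSS computation: x1 explains the most variance
scores = {"x1": 5, "x2": 3, "x3": 1}
rss_of = lambda subset: 10 - sum(scores[p] for p in subset)
print(forward_selection(["x1", "x2", "x3"], rss_of, max_size=2))  # ['x1', 'x2']
```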

15
Q

Backward Selection

A

Starting with a full model, step by step the variable with the largest p-value is removed until a stopping condition is reached

16
Q

Mixed Selection:

A

like forward selection, but if at any point a variable's p-value exceeds a threshold, that variable is removed; the procedure continues until every variable is either in the model or has been discarded

17
Q

Polynomial regression

A

Y = ß0 + ß1·X + ß2·X² + …
in R: y ~ poly(x, n)
with n being the degree of the polynomial

18
Q

Heteroscedasticity

A

the variance of the error terms εᵢ is not constant, e.g. it increases with the magnitude of the response yᵢ

19
Q

standardization

A

rescale data to have mean 0 and standard deviation 1.
xnew = (xᵢ – x̄) / s
s = standard deviation

20
Q

normalization

A

rescaling the data so that every observation falls between 0 and 1
xnew = (xi – xmin) / (xmax – xmin)
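Both rescalings from cards 19 and 20 fit in a few lines of Python (function names and toy data are illustrative):

```python
import statistics

def standardize(xs):
    # rescale to mean 0 and standard deviation 1
    mean = statistics.fmean(xs)
    s = statistics.stdev(xs)  # sample standard deviation
    return [(x - mean) / s for x in xs]

def normalize(xs):
    # min-max rescaling into [0, 1]
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

data = [2.0, 4.0, 6.0, 8.0]
print(standardize(data))
print(normalize(data))
```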

21
Q

Outlier

A

observation with unusual and significantly different response yᵢ

22
Q

High leverage point

A

observation with an unusual set of features that has “more weight” in determining the model due to its distance from other observations

23
Q

Variance inflation factor

A

measures collinearity: VIF(ßⱼ) = 1 / (1 − R²ₓⱼ)
where R²ₓⱼ is the R² of a regression of Xⱼ onto all other predictors
VIF = 1: no collinearity; VIF > 5: problematic

24
Q

Logistic function

A

p(x) = e^f(x) / (1 + e^f(x))
p(x) is between 0 and 1 -> probability
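A direct Python translation of the formula:

```python
import math

def logistic(f_x):
    # p(x) = e^f(x) / (1 + e^f(x)), always strictly between 0 and 1
    return math.exp(f_x) / (1.0 + math.exp(f_x))

print(logistic(0.0))  # 0.5: zero log-odds means a 50/50 probability
```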

25

Q

Odds

A

p(x) / (1 − p(x))
increasing X by one unit changes the log odds by ß1
26

Q

What is Likelihood?

A

Likelihood measures how plausible different parameter values are, given the observed data. It is a function of parameters for fixed data. In contrast, probability is a function of data for fixed parameters.
27

Q

What does linear discriminant analysis do?

A

linear discriminant analysis considers two criteria simultaneously while performing complexity reduction: maximizing the distance between class means u and minimizing the within-class variance (scatter = s²)
(u1 − u2)² / (s1² + s2²) -> ideally large
28

Q

What is sensitivity? (aka recall)

A

true positives / all positives (TP + FN)
29

Q

What is specificity?

A

true negatives / all negatives (TN + FP)
30

Q

What is Precision?

A

true positives / all predicted positives (TP + FP)
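The three ratios from cards 28–30 in one sketch (the confusion-matrix counts are made up for illustration):

```python
def sensitivity(tp, fn):
    return tp / (tp + fn)  # recall: fraction of actual positives found

def specificity(tn, fp):
    return tn / (tn + fp)  # fraction of actual negatives correctly rejected

def precision(tp, fp):
    return tp / (tp + fp)  # fraction of positive predictions that are right

# toy confusion-matrix counts (illustrative)
tp, fp, tn, fn = 80, 10, 90, 20
print(sensitivity(tp, fn), specificity(tn, fp), precision(tp, fp))
```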
31

Q

What is Type 1 Error?

A

false positive rate = false positives / all negatives (false alarm)
32

Q

What is Type 2 Error?

A

false negative rate = false negatives / all positives (miss rate)
33

Q

What assumptions does LDA make?

A

fk is the density function of a multivariate normal random variable with class-specific mean uk and a shared covariance matrix Σ
34

Q

What assumptions does QDA make?

A

fk is the density function of a multivariate normal random variable with class-specific mean uk and class-specific covariance matrix Σk
35

Q

What assumption does Naive Bayes make?

A

within the kth class, the p predictors are independent
36

Q

When to use LDA over QDA and vice versa?

A

QDA estimates far more parameters (Kp(p+1)/2 versus Kp), so LDA has lower variance. In contrast, if LDA's shared-covariance assumption is off, it suffers from high bias. If n is large relative to p, QDA is preferred, as variance is less of a concern.
37

Q

What is a Poisson distribution?

A

Y takes non-negative integer values: counts of independent events occurring in a fixed interval of time/distance/space. It arises as the limit of a binomial where n is extremely large and the success probability is extremely small.
38

Q

Cross validation

A

split the data into k folds; in turn, use each fold for validation of a model trained on all other folds, then average the k error estimates
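A minimal sketch of the k-fold split (indices only; the actual model fitting is left out):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation:
    each fold serves exactly once as the validation set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

for train, val in k_fold_indices(10, 5):
    print(train, val)
```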
39

Q

Bootstrapping

A

real population -> sample -> bootstrap samples
from a sample, draw new samples with replacement to estimate the distribution, variance and confidence interval of any statistic
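A sketch of the bootstrap estimate of a statistic's standard error (sample values and the number of resamples are illustrative):

```python
import random
import statistics

def bootstrap_se(sample, stat=statistics.fmean, n_boot=2000, seed=0):
    """Estimate the standard error of a statistic by repeatedly
    resampling the observed sample with replacement."""
    rng = random.Random(seed)
    reps = [stat(rng.choices(sample, k=len(sample))) for _ in range(n_boot)]
    return statistics.stdev(reps)

sample = [2.1, 3.4, 1.9, 5.0, 4.2, 2.8, 3.7, 4.9]
print(bootstrap_se(sample))  # close to the analytic SE of the mean, s/sqrt(n)
```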
40

Q

Best subset selection

A

1. fit all (p over k) models that contain exactly k of the p predictors
2. pick the best model among them and call it Mk
3. select a single best model from M0, …, Mp using cross-validated prediction error or adjusted R²
41

Q

What to minimize in a ridge regression?

A

RSS + λ ∑ⱼ₌₁ᵖ ßⱼ²
where λ >= 0 is a tuning parameter that penalizes big coefficients; ridge regression shrinks less significant coefficients towards 0, but never exactly to 0
42

Q

What to minimize in lasso regression?

A

RSS + λ ∑ⱼ₌₁ᵖ |ßⱼ|
where λ >= 0 is a tuning parameter that penalizes big coefficients; the lasso can shrink coefficients to exactly zero, performing variable selection
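The different behaviour of the two penalties shows up clearly in the orthonormal-design special case, where both have closed-form solutions: ridge scales the least-squares coefficient down, while the lasso soft-thresholds it and can return exactly zero. A sketch under that assumption:

```python
def ridge_shrink(beta_ols, lam):
    # ridge (orthonormal design): proportional shrinkage, never exactly zero
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    # lasso (orthonormal design): soft-thresholding sets small coefficients to zero
    sign = 1.0 if beta_ols >= 0 else -1.0
    return sign * max(abs(beta_ols) - lam, 0.0)

print(ridge_shrink(0.5, 1.0))  # 0.25: shrunk but nonzero
print(lasso_shrink(0.5, 1.0))  # 0.0: zeroed out entirely
```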