Statistical Learning Flashcards

(43 cards)

1
Q

Difference between prediction and inference?

A

prediction: minimize the prediction error of ŷ
inference: build an interpretable model to understand f(x)

2
Q

What is a parametric approach?

A
  1. make an assumption about the functional form of f
  2. use a fitting procedure to train a model of that form
3
Q

What is supervised learning?

A

for each observation of a set of features Xᵢ there is a measured response yᵢ

4
Q

Difference between regression and classification problems?

A

r: quantitative response
c: qualitative response

5
Q

What is the variance of a model?

A

how much f(x) changes with different sets of training data

6
Q

What is the bias of a model?

A

the error introduced by approximating a real-life problem with a simpler model

7
Q

What is the error rate?

A

incorrect predictions / total number of predictions

8
Q

What is the accuracy?

A

1 - error rate

9
Q

What is the Bayes classifier?

A

the classifier that assigns each observation to its most likely class given the predictor values; it has the smallest possible probability of misclassification (benchmark)

10
Q

K-nearest neighbours classification:

A

identify the k nearest neighbours of x, estimate the conditional probability of each class j as the fraction of neighbours in class j, then assign x to the class with the highest probability
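A minimal Python sketch of this procedure, assuming Euclidean distance and a made-up toy dataset (function name and data are illustrative):

```python
import math
from collections import Counter

def knn_classify(train, x, k=3):
    """Classify point x by majority vote among its k nearest neighbours.

    train is a list of (features, label) pairs; distance is Euclidean.
    """
    # sort training points by distance to x and keep the k closest
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
    # the vote fractions estimate the conditional class probabilities
    votes = Counter(label for _, label in neighbours)
    # assign x to the class with the highest estimated probability
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_classify(train, (0.5, 0.5), k=3))  # a point near the "A" cluster
```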

11
Q

Residual Sum of Squares (RSS)

A

RSS = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
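The formula translates directly into Python (toy numbers are illustrative):

```python
def rss(y, y_hat):
    # RSS = sum over i of (y_i - yhat_i)^2
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

print(rss([1, 2, 3], [1.1, 1.9, 3.2]))  # small residuals -> small RSS
```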

12
Q

Standard Error (SE(ß))

A

SE(ß₁)² = σ² / ∑ᵢ₌₁ⁿ (xᵢ − x̄)²
(the formula gives the squared standard error, i.e. the variance of the slope estimate)

13
Q

F-test formula

A

F = ((TSS − RSS) / p) / (RSS / (n − p − 1))

if F > F crit reject H0
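The F statistic is easy to compute once TSS and RSS are known; a short sketch with illustrative toy numbers:

```python
def f_statistic(tss, rss, n, p):
    """F = ((TSS - RSS) / p) / (RSS / (n - p - 1))"""
    return ((tss - rss) / p) / (rss / (n - p - 1))

# toy values: a large F suggests at least one predictor matters
print(f_statistic(tss=100.0, rss=20.0, n=50, p=4))
```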

14
Q

Forward Selection:

A

starting from the null model, add the variable whose inclusion yields the lowest RSS; repeat until a stopping rule is satisfied
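The greedy loop can be sketched generically; here `rss_of` is a hypothetical stand-in for fitting a linear model on a subset and returning its RSS, and the stopping rule is simply a maximum model size:

```python
def forward_selection(predictors, rss_of, max_size):
    """Greedy forward selection: starting from the null model, add the
    predictor whose inclusion gives the lowest RSS, until the stopping
    rule (here a maximum model size) is satisfied."""
    selected, remaining = [], list(predictors)
    while remaining and len(selected) < max_size:
        best = min(remaining, key=lambda p: rss_of(selected + [p]))
        selected.append(best)
        remaining.remove(best)
    return selected

# toy stand-in for a real RSS computation: x1 explains the most variance
scores = {"x1": 5, "x2": 3, "x3": 1}
rss_of = lambda subset: 10 - sum(scores[p] for p in subset)
print(forward_selection(["x1", "x2", "x3"], rss_of, max_size=2))  # ['x1', 'x2']
```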

15
Q

Backward Selection

A

Starting with a full model, step by step the variable with the largest p-value is removed until a stopping condition is reached

16
Q

Mixed Selection:

A

like forward selection, but if at any point a variable's p-value exceeds a threshold, that variable is removed; the procedure continues until every variable is either in the model or has been discarded

17
Q

Polynomial regression

A

Y = ß0 + ß1·X + ß2·X² + …
in R: y ~ poly(x, n)
with n being the degree of the polynomial

18
Q

Heteroscedasticity

A

the variance of the error terms εᵢ is not constant, e.g. it increases with the magnitude of the response yᵢ

19
Q

standardization

A

rescale data to have mean 0 and standard deviation 1.
xnew = (xᵢ – x̄) / s
s = standard deviation

20
Q

normalization

A

rescaling the data so that every observation falls between 0 and 1
xnew = (xi – xmin) / (xmax – xmin)
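Both rescalings from cards 19 and 20 fit in a few lines of Python (function names and toy data are illustrative):

```python
import statistics

def standardize(xs):
    # rescale to mean 0 and standard deviation 1
    mean = statistics.fmean(xs)
    s = statistics.stdev(xs)  # sample standard deviation
    return [(x - mean) / s for x in xs]

def normalize(xs):
    # min-max rescaling into [0, 1]
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

data = [2.0, 4.0, 6.0, 8.0]
print(standardize(data))
print(normalize(data))
```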

21
Q

Outlier

A

observation with unusual and significantly different response yᵢ

22
Q

High leverage point

A

observation with an unusual set of features that has “more weight” in determining the model due to its distance from other observations

23
Q

Variance inflation factor

A

measures collinearity: VIF(ßⱼ) = 1 / (1 − R²ₓⱼ)
where R²ₓⱼ is the R² of a regression of Xⱼ onto all other predictors
VIF = 1: no collinearity; VIF > 5: problematic

24
Q

Logistic function

A

p(x) = e^f(x) / (1 + e^f(x))
p(x) is between 0 and 1 -> probability
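A direct Python translation of the formula:

```python
import math

def logistic(f_x):
    # p(x) = e^f(x) / (1 + e^f(x)), always strictly between 0 and 1
    return math.exp(f_x) / (1.0 + math.exp(f_x))

print(logistic(0.0))  # 0.5: zero log-odds means a 50/50 probability
```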

25

Q

Odds

A

p(x) / (1 − p(x))
increasing X by one unit changes the log odds by ß1
26

Q

What is Likelihood?

A

Likelihood measures how plausible different parameter values are, given the observed data. It is a function of parameters for fixed data. In contrast, probability is a function of data for fixed parameters.
27

Q

What does linear discriminant analysis do?

A

linear discriminant analysis considers two criteria simultaneously while performing complexity reduction: maximizing the distance between class means u and minimizing the within-class variance (scatter = s²)
(u1 − u2)² / (s1² + s2²) -> ideally large
28

Q

What is sensitivity? (aka recall)

A

true positives / all positives (TP + FN)
29

Q

What is specificity?

A

true negatives / all negatives (TN + FP)
30

Q

What is Precision?

A

true positives / all predicted positives (TP + FP)
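The three ratios from cards 28–30 in one sketch (the confusion-matrix counts are made up for illustration):

```python
def sensitivity(tp, fn):
    return tp / (tp + fn)  # recall: fraction of actual positives found

def specificity(tn, fp):
    return tn / (tn + fp)  # fraction of actual negatives correctly rejected

def precision(tp, fp):
    return tp / (tp + fp)  # fraction of positive predictions that are right

# toy confusion-matrix counts (illustrative)
tp, fp, tn, fn = 80, 10, 90, 20
print(sensitivity(tp, fn), specificity(tn, fp), precision(tp, fp))
```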
31

Q

What is Type 1 Error?

A

false positive rate = false positives / all negatives (false alarm)
32

Q

What is Type 2 Error?

A

false negative rate = false negatives / all positives (miss rate)
33

Q

What assumptions does LDA make?

A

fk is the density function of a multivariate normal random variable with class-specific mean uk and a shared covariance matrix Σ
34

Q

What assumptions does QDA make?

A

fk is the density function of a multivariate normal random variable with class-specific mean uk and class-specific covariance matrix Σk
35

Q

What assumption does Naive Bayes make?

A

within the kth class, the p predictors are independent
36

Q

When to use LDA over QDA and vice versa?

A

QDA estimates far more parameters (Kp(p+1)/2 versus Kp), so LDA has lower variance. In contrast, if LDA's shared-covariance assumption is off, it suffers from high bias. If n is large relative to p, QDA is preferred, as variance is less of a concern.
37

Q

What is a Poisson distribution?

A

Y takes non-negative integer values: counts of independent events occurring in a fixed interval of time/distance/space. It arises as the limit of a binomial where n is extremely large and the success probability is extremely small.
38

Q

Cross validation

A

split the data into k folds; in turn, use each fold for validation of a model trained on all other folds, then average the k error estimates
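A minimal sketch of the k-fold split (indices only; the actual model fitting is left out):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation:
    each fold serves exactly once as the validation set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

for train, val in k_fold_indices(10, 5):
    print(train, val)
```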
39

Q

Bootstrapping

A

real population -> sample -> bootstrap samples
from a sample, draw new samples with replacement to estimate the distribution, variance and confidence interval of any statistic
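A sketch of the bootstrap estimate of a statistic's standard error (sample values and the number of resamples are illustrative):

```python
import random
import statistics

def bootstrap_se(sample, stat=statistics.fmean, n_boot=2000, seed=0):
    """Estimate the standard error of a statistic by repeatedly
    resampling the observed sample with replacement."""
    rng = random.Random(seed)
    reps = [stat(rng.choices(sample, k=len(sample))) for _ in range(n_boot)]
    return statistics.stdev(reps)

sample = [2.1, 3.4, 1.9, 5.0, 4.2, 2.8, 3.7, 4.9]
print(bootstrap_se(sample))  # close to the analytic SE of the mean, s/sqrt(n)
```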
40

Q

Best subset selection

A

1. fit all (p over k) models that contain exactly k of the p predictors
2. pick the best model among them and call it Mk
3. select a single best model from M0, …, Mp using cross-validated prediction error or adjusted R²
41

Q

What to minimize in a ridge regression?

A

RSS + λ ∑ⱼ₌₁ᵖ ßⱼ²
where λ >= 0 is a tuning parameter that penalizes big coefficients; ridge regression shrinks less significant coefficients towards 0, but never exactly to 0
42

Q

What to minimize in lasso regression?

A

RSS + λ ∑ⱼ₌₁ᵖ |ßⱼ|
where λ >= 0 is a tuning parameter that penalizes big coefficients; the lasso can shrink coefficients to exactly zero, performing variable selection
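The different behaviour of the two penalties shows up clearly in the orthonormal-design special case, where both have closed-form solutions: ridge scales the least-squares coefficient down, while the lasso soft-thresholds it and can return exactly zero. A sketch under that assumption:

```python
def ridge_shrink(beta_ols, lam):
    # ridge (orthonormal design): proportional shrinkage, never exactly zero
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    # lasso (orthonormal design): soft-thresholding sets small coefficients to zero
    sign = 1.0 if beta_ols >= 0 else -1.0
    return sign * max(abs(beta_ols) - lam, 0.0)

print(ridge_shrink(0.5, 1.0))  # 0.25: shrunk but nonzero
print(lasso_shrink(0.5, 1.0))  # 0.0: zeroed out entirely
```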