What are the three types of error in a ML model? Briefly explain each.
The three types of error are bias (error from overly simple assumptions that miss real patterns), variance (error from sensitivity to the particular training set), and irreducible error (noise inherent in the data that no model can remove).
What is the bias-variance trade off?
Bias refers to error from an estimator that is too simple (inflexible) and fails to learn relationships in the data that would allow it to make better predictions.
Variance refers to error from an estimator that is too specific (overly flexible), learning relationships particular to the training set that will not generalize well to new data.
In short, bias-variance trade off is the trade off between underfitting and overfitting. As we decrease variance, we tend to increase bias. As we increase variance, we tend to decrease bias.
Our goal is to create models that minimize the overall error by careful model selection and tuning to ensure there is a balance between bias and variance: general enough to make good predictions on new data but specific (flexible) enough to pick up as much signal as possible.
What are some naive approaches to classification that can be used as a baseline for results?
Common naive baselines include predicting the most frequent class for every sample, guessing uniformly at random, or sampling predictions in proportion to the training-set class distribution (stratified guessing). These baselines are good to calculate at the start, and we should include at least one when making any assertions about the efficacy of a model, e.g., claiming “our model was 50% more accurate than the naive approach of suggesting all customers buy the most popular car.”
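These baselines can be sketched with scikit-learn's DummyClassifier; the toy labels below are invented purely for illustration.

```python
# A minimal sketch of naive baselines, assuming a toy imbalanced label array.
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((10, 1))                          # features are ignored by the dummy
y = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])   # 70% positive class

# Majority-class baseline: always predict the most frequent label
majority = DummyClassifier(strategy="most_frequent").fit(X, y)
print(majority.score(X, y))  # 0.7 on this toy set

# Uniform random-guessing baseline
random_guess = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)
```

Any real model should comfortably beat these scores before we make claims about its efficacy.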
Explain the classification metrics Area Under the Curve (AUC) and Gini.
AUC is the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. It can have values between 0 and 1, with values closer to 1 indicating a more predictive model.
An uninformative model (guessing at random on a balanced data set) will yield a score of 0.5, and a model that ranks every negative instance above every positive one will have a score of 0.
Gini is a similar metric that scales AUC between -1 and 1 so that 0 represents a model that makes random predictions. Gini = 2*AUC-1.
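As a quick sketch, both metrics can be computed with scikit-learn's roc_auc_score; the labels and scores below are made up for illustration.

```python
# Compute AUC from predicted scores, then derive Gini as 2*AUC - 1.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities for the positive class

auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1
print(auc, gini)  # AUC = 0.75 here, so Gini = 0.5
```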
What is the difference between boosting and bagging?
Bagging and boosting are both ensemble methods: they combine weak predictors to create a strong predictor. One key difference is that bagging builds independent models in parallel, whereas boosting builds them sequentially, at each step emphasizing the observations that were misclassified in previous steps.
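The contrast can be sketched with scikit-learn's ensemble classes; the synthetic data set and ensemble sizes below are arbitrary choices for illustration.

```python
# Bagging vs. boosting on a synthetic classification problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Bagging: independent models trained on bootstrap samples (parallelizable)
bagger = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Boosting: models built sequentially, reweighting misclassified samples
booster = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

print(bagger.score(X, y), booster.score(X, y))
```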
How can we tell if our model is underfitting the data?
If our training and validation errors are relatively equal and very high, then our model is most likely underfitting our training data.
How can we tell if our model is overfitting the training data?
If our training error is low and our validation error is high, then our model is most likely overfitting our training data.
Name and briefly explain several evaluation metrics that are useful for classification problems.
Accuracy = (true positives + true negatives) / all samples
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
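These formulas can be checked against scikit-learn's metric functions; the toy predictions below are invented for illustration.

```python
# Accuracy, precision, and recall on a small hand-checkable example.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]  # 2 TP, 1 FN, 1 FP, 2 TN

acc = accuracy_score(y_true, y_pred)    # (2 + 2) / 6
prec = precision_score(y_true, y_pred)  # 2 / (2 + 1)
rec = recall_score(y_true, y_pred)      # 2 / (2 + 1)
print(acc, prec, rec)
```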
Name and explain several metrics that are useful for regression problems.
R^2 (the coefficient of determination) measures the proportion of variance in the target that the model explains:
R^2 = 1 - (SS_res / SS_total) ~= SS_reg / SS_total
where
SS_reg = sum((yhat_i - ybar)^2)
SS_res = sum((y_i - yhat_i)^2)
SS_total = sum((y_i - ybar)^2) = SS_res + SS_reg (this decomposition holds for linear regression with an intercept)
Other useful regression metrics: MAE = mean(|y_i - yhat_i|) (mean absolute error), MSE = mean((y_i - yhat_i)^2) (mean squared error), and RMSE = sqrt(MSE), which is in the same units as the target.
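As a minimal sketch, R^2 can be computed both from the sums of squares above and with scikit-learn's r2_score; the numbers are invented for illustration.

```python
# R^2 computed manually and via sklearn, to confirm the formula.
import numpy as np
from sklearn.metrics import r2_score

y = np.array([3.0, -0.5, 2.0, 7.0])     # true targets
yhat = np.array([2.5, 0.0, 2.0, 8.0])   # model predictions

ss_res = np.sum((y - yhat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y, yhat))  # both ≈ 0.9486
```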
Why use ROC?
An ROC curve summarizes the confusion matrices produced at every decision threshold, so it shows which threshold gives the best trade-off for a given classifier.
This is especially useful when a false negative would be catastrophic, e.g., when classifying Ebola infection: a false-negative misclassification risks an outbreak, so we choose a threshold that accepts more false positives in exchange for fewer false negatives.
The ROC x-axis is the False Positive Rate = 1 - specificity = FP / (FP + TN), the proportion of the negative class misclassified as positive.
The ROC y-axis is the True Positive Rate = sensitivity = TP / (TP + FN), the proportion of the positive class correctly classified.
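The thresholds behind the curve can be sketched with scikit-learn's roc_curve; the scores below are made up for illustration.

```python
# Trace the (FPR, TPR) points that make up an ROC curve.
from sklearn.metrics import roc_curve

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```

Each row is one confusion matrix summarized as a point; sweeping the threshold from high to low moves along the curve from (0, 0) to (1, 1).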
Why use AUC?
The AUC condenses each ROC curve into a single number, which allows us to COMPARE the ROC of one classifier to the ROC of a different classifier.
So if AUC_logistic > AUC_SVM, then we would select logistic regression over an SVM classifier!
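This comparison can be sketched as follows; the synthetic data set, train/test split, and default model settings are assumptions made purely for illustration.

```python
# Compare logistic regression and an SVM by held-out AUC.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

logit = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
svm = SVC(probability=True, random_state=0).fit(Xtr, ytr)

auc_logistic = roc_auc_score(yte, logit.predict_proba(Xte)[:, 1])
auc_svm = roc_auc_score(yte, svm.predict_proba(Xte)[:, 1])
print(auc_logistic, auc_svm)  # select whichever classifier scores higher
```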
Explain how to visualize how the performance of a model changes as the value of a hyperparameter changes.
Many training algorithms have hyperparameters that must be chosen before training begins. For example, one hyperparameter in a random forest is the number of trees in the ensemble, and it is useful to visualize how the model's performance changes as that value changes.
In sklearn, we can calc a validation curve which contains three important parameters:
param_name: the name of the hyperparam to vary
param_range: the range of hyperparameter values to try
scoring: the evaluation metric
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

rf = RandomForestClassifier()
param_range = np.arange(10, 100, 5)

# compute accuracy scores over the hyperparameter range
# (Xtrain, ytrain are the training features and labels)
train_scores, test_scores = validation_curve(
    rf, Xtrain, ytrain, param_name="n_estimators",
    param_range=param_range, cv=3, scoring="accuracy")

# mean score across CV folds for each value of n_estimators
train_mean = np.mean(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)

# plot mean train and test scores
plt.plot(param_range, train_mean, label="train_score")
plt.plot(param_range, test_mean, label="test_score")
plt.legend()
plt.show()