ML: What is a generative model?
It models P(X|Y) and P(Y), then obtains P(Y|X) via Bayes' rule: P(Y|X) ∝ P(X|Y)P(Y)
ML: What is a discriminative model?
Models P(Y|X) directly, without modeling how the X's themselves are distributed
ML: What is a Bayes classifier?
The classifier that predicts the most probable class given x: argmax over y of P(Y=y|X=x). It achieves the lowest possible expected error (the Bayes error), making it the ideal benchmark; in practice P(Y|X) must be estimated.
ML: At a high level, how would a Bayes Classifier be constructed in a case where errors are asymmetric, meaning some errors are worse than others?
We would weight each possible error with its own loss L(i,j), where i is the predicted class and j is the true class. The classifier then predicts the class that minimizes expected loss: argmin over i of the sum over j of L(i,j)P(Y=j|X=x).
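As a sketch of this asymmetric-loss decision rule (all numbers below are hypothetical): we predict the class i that minimizes the expected loss sum over j of L(i,j)P(Y=j|x), which can flip the decision away from the most probable class.

```python
# Sketch with illustrative numbers: under asymmetric losses we predict the
# class minimizing expected loss, not just the most probable class.

# L[i][j] = loss for predicting class i when the true class is j (assumed values)
L = [[0.0, 10.0],   # predicting 0: very costly if the truth is 1
     [1.0,  0.0]]   # predicting 1: mild cost if the truth is 0 (false alarm)

def bayes_decision(posterior):
    """posterior[j] = P(Y=j | x); return the loss-minimizing prediction."""
    expected = [sum(L[i][j] * posterior[j] for j in range(len(posterior)))
                for i in range(len(L))]
    return min(range(len(expected)), key=expected.__getitem__)

# Class 0 is more probable here, but predicting 0 risks a loss of 10,
# so the loss-weighted decision flips to class 1.
print(bayes_decision([0.8, 0.2]))    # -> 1
print(bayes_decision([0.99, 0.01]))  # -> 0
```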
ML: What is a simplistic view of logistic regression when Y is binary?
We are just doing a linear regression on our X’s, then squishing the outcome into [0,1]
ML: In binary logistic regression, what is the formula for P(Y=1|X=x)?
For x of dimension D, it is:
P(Y=1|X=x) = 1 / (1 + e^-(B0 + B1x1 + … + BDxD))
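A minimal numerical sketch, assuming the standard sigmoid form P(Y=1|x) = 1 / (1 + e^-(B0 + B·x)):

```python
import math

def logistic_prob(beta0, beta, x):
    """P(Y=1 | X=x) = 1 / (1 + exp(-(beta0 + beta . x))).
    Coefficients here are hypothetical, not fitted."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

# When the linear score is 0, the probability is exactly 0.5
print(logistic_prob(0.0, [0.0, 0.0], [1.0, 2.0]))  # -> 0.5
```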
ML: What decision boundary does binary logistic regression yield?
It yields the linear separator B0 + B1x1 + … + BDxD = 0: we predict Y=1 exactly when B0 + B1x1 + … + BDxD ≥ 0, which is when P(Y=1|X=x) ≥ 1/2.
ML: For binary logistic regression, how do we learn the coefficients of the model B0, B1, B2… in order to make predictions?
We estimate them using maximum likelihood, computed iteratively with something like gradient descent. So we find the coefficients B0, B1, B2… that maximize the likelihood of producing the observed training data with the observed labels.
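The fitting step can be sketched as plain gradient ascent on the log-likelihood (the toy data, learning rate, and step count below are made up; a real implementation would use a library optimizer):

```python
import math

def fit_logistic(xs, ys, lr=0.5, steps=2000):
    """Fit binary logistic regression by gradient ascent on the log-likelihood.
    xs: list of feature vectors, ys: list of 0/1 labels. Toy sketch."""
    d = len(xs[0])
    b0, b = 0.0, [0.0] * d
    for _ in range(steps):
        g0, g = 0.0, [0.0] * d
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + sum(bi * xi for bi, xi in zip(b, x)))))
            # Gradient of the log-likelihood: (y - p) times the features
            g0 += y - p
            for j in range(d):
                g[j] += (y - p) * x[j]
        b0 += lr * g0 / len(xs)
        for j in range(d):
            b[j] += lr * g[j] / len(xs)
    return b0, b

# 1-D toy data, separable around x = 2
xs = [[0.0], [1.0], [3.0], [4.0]]
ys = [0, 0, 1, 1]
b0, b = fit_logistic(xs, ys)
p_low = 1.0 / (1.0 + math.exp(-(b0 + b[0] * 0.0)))
p_high = 1.0 / (1.0 + math.exp(-(b0 + b[0] * 4.0)))
print(round(p_low, 3), round(p_high, 3))  # low probability at x=0, high at x=4
```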
ML: How do we extend binary logistic regression to multinomial logistic regression? So our X's are still n-dimensional vectors, but now our Y's are labels in {1, …, K}?
We learn one coefficient vector B^(k) per class and turn the K linear scores into probabilities with a softmax: P(Y=k|X=x) = e^(B^(k)·x) / (e^(B^(1)·x) + … + e^(B^(K)·x)). (One class is often fixed as a baseline with B^(K) = 0, since shifting every score by the same amount leaves the probabilities unchanged.) We predict the class with the highest probability.
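Assuming the standard softmax formulation, where each class k gets a linear score B^(k)·x and the scores are exponentiated and normalized, a sketch of the per-class probabilities:

```python
import math

def softmax_probs(betas, x):
    """P(Y=k | x) via a softmax over one linear score per class.
    betas: K hypothetical weight vectors (one per class); x: feature vector."""
    scores = [sum(b * xi for b, xi in zip(beta, x)) for beta in betas]
    m = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax_probs([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [2.0, 1.0])
print([round(p, 3) for p in probs])  # sums to 1; class 0 has the highest score
```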
ML: How do you specify a hyperplane classifier for D-dimensional x, and Y either 0 or 1?
Choose a weight vector w in R^D and an intercept b; predict Y=1 when w·x + b ≥ 0, and Y=0 otherwise. The set of points where w·x + b = 0 is the separating hyperplane.
ML: How does prediction work for a KNN classifier or regressor?
For each new point, you find the K points in the training set closest to it (by some measure such as euclidean distance); then for classification you’d return the most common label among the nearest neighbors, and for regression you’d return the average label of the nearest neighbors.
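This prediction rule can be sketched as follows (Euclidean distance, toy data):

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, x, k=3, mode="classify"):
    """K-nearest-neighbors prediction sketch using Euclidean distance."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    neighbors = [train_y[i] for i in order[:k]]
    if mode == "classify":
        return Counter(neighbors).most_common(1)[0][0]  # most common label
    return sum(neighbors) / k                           # average label (regression)

train_x = [[0.0], [1.0], [2.0], [10.0], [11.0]]
train_y = [0, 0, 0, 1, 1]
print(knn_predict(train_x, train_y, [1.5], k=3))                 # -> 0
print(knn_predict(train_x, train_y, [10.5], k=3, mode="reg"))    # average of the 3 nearest labels
```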
ML: How do ensemble classifiers generally work?
Several models are learned, and they vote on what the prediction is.
ML: What is boosting? How does it generally work?
In boosting, you fit several (typically very basic) classifiers iteratively, with each new classifier trying to do well on the training points that the previous classifiers did poorly on.
ML: How does boosting improve the performance of the classifiers it uses? In other words, how are bias and/or variance impacted before and after the use of boosting?
Because it uses very simple predictors (for example, decision stumps: one-split trees), each one individually has very high bias. But by fitting many of them, each correcting the previous ones' mistakes, the ensemble's bias is decreased.
ML: What is the (somewhat high-level) algorithm for Adaboost?
We start with each training example Xi having equal weight wi = 1/n. Then, repeat:
1. Fit a weak classifier h_t to the weighted training data.
2. Compute its weighted error e_t (the total weight of the examples it misclassifies).
3. Give it the vote weight a_t = ½ ln((1 − e_t)/e_t), so more accurate classifiers get larger votes.
4. Increase the weights of the misclassified examples, decrease the weights of the correct ones, and renormalize the weights to sum to 1.
The final prediction is the weighted vote of all the classifiers: sign of the sum over t of a_t h_t(x).
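A compact sketch of this loop on 1-D data, using one-split decision stumps as the weak classifiers and labels in {-1, +1} (the data and round count are illustrative):

```python
import math

def fit_stump(xs, ys, w):
    """Best threshold stump on 1-D data under example weights w (toy helper)."""
    best = None
    for thresh in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x > thresh else -sign for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thresh, sign)
    return best

def adaboost(xs, ys, rounds=5):
    """AdaBoost sketch on 1-D inputs with labels in {-1, +1}."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, thresh, sign = fit_stump(xs, ys, w)
        err = max(err, 1e-10)  # avoid division by zero on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thresh, sign))
        # Upweight mistakes, downweight correct points, then renormalize
        w = [wi * math.exp(-alpha * y * (sign if x > thresh else -sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    vote = sum(a * (s if x > t else -s) for a, t, s in ensemble)
    return 1 if vote > 0 else -1

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [-1, -1, -1, 1, 1, 1]
model = adaboost(xs, ys)
print([predict(model, x) for x in xs])  # matches ys on this easy data
```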
ML: How (at a high level) does gradient boosting work?
At each step, we compute the negative gradient of the loss function with respect to the current ensemble's predictions (for squared loss, this is just the vector of residuals). We then fit the next classifier to approximate this negative gradient, and add it to the ensemble with some step size.
So at each step, we choose the next classifier to move the predictions in the direction the loss gradient tells us to go.
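A sketch of this idea under squared loss, where the negative gradient at each point is simply the residual y − F(x), so each new stump is fit to the residuals (toy data, made-up learning rate):

```python
def fit_reg_stump(xs, residuals):
    """Fit a one-split regression stump to the current residuals (toy helper)."""
    best = None
    for thresh in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= thresh]
        right = [r for x, r in zip(xs, residuals) if x > thresh]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thresh, lmean, rmean)
    return best[1:]

def gradient_boost(xs, ys, rounds=50, lr=0.3):
    """Gradient boosting sketch with squared loss: each stump approximates
    the negative gradient, i.e. the residuals y - F(x)."""
    preds = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient
        thresh, lmean, rmean = fit_reg_stump(xs, residuals)
        stumps.append((thresh, lmean, rmean))
        preds = [p + lr * (lmean if x <= thresh else rmean)
                 for p, x in zip(preds, xs)]
    return stumps

def gb_predict(stumps, x, lr=0.3):
    return sum(lr * (l if x <= t else r) for t, l, r in stumps)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.0, 1.0, 1.0]
model = gradient_boost(xs, ys)
print([round(gb_predict(model, x), 2) for x in xs])  # close to ys after boosting
```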
ML: What is the xgboost library at a very high level?
The xgboost library is an ML library for training models that are ensembles of decision trees using gradient boosting (a version of boosting, which you're familiar with). The implementation contains many optimizations, making it very fast and effective, hence its popularity throughout the ML world.
Creating models with xgboost is simple: its interface is very similar to sklearn for example.
It was created by Tianqi Chen (now a CMU professor)!
ML: How does bagging work, at a fairly high level? How do we train a bagging ensemble given a training set of n Xi’s?
Draw m bootstrapped samples from your training data: each sample has size n and is drawn with replacement.
Train a (typically complex) model on each of the m bootstrapped samples, and have these models vote on the prediction.
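These two steps can be sketched as follows (the threshold weak learner and the data are made up for illustration; any learner could be plugged in):

```python
import random
from collections import Counter

def bagging_fit(train, m, fit_model):
    """Bagging sketch: draw m bootstrap samples of size n (with replacement)
    and fit one model per sample. fit_model is any learner taking (xs, ys)."""
    n = len(train)
    models = []
    for _ in range(m):
        sample = [random.choice(train) for _ in range(n)]  # bootstrap resample
        xs, ys = zip(*sample)
        models.append(fit_model(xs, ys))
    return models

def bagging_predict(models, x):
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]  # majority vote

# Hypothetical weak learner: threshold halfway between the class means
def fit_threshold(xs, ys):
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / max(1, ys.count(0))
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / max(1, ys.count(1))
    mid = (mean0 + mean1) / 2
    return lambda x: 1 if x > mid else 0

random.seed(0)
train = [(0.0, 0), (1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1), (10.0, 1)]
models = bagging_fit(train, m=25, fit_model=fit_threshold)
print(bagging_predict(models, 0.5), bagging_predict(models, 9.5))  # -> 0 1
```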
ML: How do bias and variance work with bagging? So what is the level of bias and variance of one of the individual classifiers in the ensemble, and then what happens to bias and variance when we have several classifiers vote?
The classifiers in bagging are complex; if they’re trees, they’re deep trees. As a result, the individual classifiers have low bias, but high variance.
However, when we use several classifiers and have them vote, we can drive down this variance, because we're averaging votes from classifiers trained on different bootstrapped samples. (As n goes to infinity, these samples become independent.)
ML: What are the high-level differences between boosting and bagging, and how they achieve the goal of making several classifiers work effectively together?
Boosting fits several models that are not complex, and thus have high bias. But by fitting several and correcting errors from previous iterations, bias is reduced.
Bagging fits several complex models, so bias is lower but variance is high. But by averaging many predictions on slightly different datasets, we can lower variance.
ML: What is an improvement we can make to bagging (specifically classification bagging) when we are calculating our predicted class?
Rather than just picking the class with the most votes from each of the classifiers, average the probabilities of each class from each of the classifiers, then predict the class with the highest probability.
(For a tree classifier, the "probability" of class j is the proportion of training examples in the leaf we reach that had class j.)
This improvement makes performance more consistent, decreasing variance a bit.
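A toy comparison of the two vote-combination schemes (illustrative probabilities): hard voting counts each classifier's predicted class once, while soft voting averages the class probabilities before picking the winner.

```python
from collections import Counter

def hard_vote(class_preds):
    """Pick the class predicted by the most classifiers."""
    return Counter(class_preds).most_common(1)[0][0]

def soft_vote(prob_lists):
    """Average each class's probability across classifiers, then take the argmax.
    prob_lists[i][j] = classifier i's probability for class j."""
    k = len(prob_lists[0])
    avg = [sum(p[j] for p in prob_lists) / len(prob_lists) for j in range(k)]
    return max(range(k), key=avg.__getitem__)

# Three classifiers: two weakly prefer class 0, one strongly prefers class 1
probs = [[0.55, 0.45], [0.55, 0.45], [0.05, 0.95]]
print(hard_vote([0, 0, 1]))  # -> 0: majority of the hard votes
print(soft_vote(probs))      # -> 1: the averaged probabilities favor class 1
```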

ML: How do Random Forests work?
Random Forests work basically just by using bagging with deep decision trees: bootstrapping several resampled training sets, training a deep tree on each of them, and having them vote for a classification or regressed value.
The key change from traditional bagging is this: if there are p features of a training example X, at each split in a decision tree, only m < p features are considered for that split. (Typically m = sqrt(p))
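The per-split feature subsampling can be sketched as below (the helper name is made up):

```python
import math
import random

def candidate_features(p):
    """At each split, consider only m = sqrt(p) of the p features (rounded
    down), chosen uniformly at random: the random forest subsampling step."""
    m = max(1, int(math.sqrt(p)))
    return random.sample(range(p), m)

random.seed(1)
p = 16
for split in range(3):
    print(candidate_features(p))  # 4 distinct random feature indices per split
```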
ML: Why is the Random Forest alteration to bagging often advantageous?
By only considering m < p features at each split of a decision tree, the trees become more independent, which helps the random forest algorithm decrease the variance in the overall prediction (which is the general goal of bagging).
ML: What method is typically used to evaluate random forests instead of cross-validated error? Why is this other approach typically better for random forests?
You use out-of-bag (OOB) error. Because each classifier is trained on a bootstrapped training set, each point in the original training set was left out of the training of some of the classifiers (each bootstrap sample misses about 36.8% of the points). So, for each point, get a prediction from only the classifiers that did not train on it, and average the resulting error over all the points.
This is preferable to cross-validation because you don’t need to retrain the model for each of the folds; training takes a long time for random forests, so we want to avoid it. But with enough trees, out-of-bag error tends to be very similar to CV error.
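A quick check of the 36.8% figure: the chance a given point misses one bootstrap sample of size n is (1 - 1/n)^n, which tends to e^-1 ≈ 0.368 as n grows.

```python
import math
import random

# Exact probability a given point is left out of one bootstrap sample of size n
n = 10000
p_out = (1 - 1 / n) ** n
print(round(p_out, 4), round(math.exp(-1), 4))  # both close to 0.3679

# Empirical check: fraction of indices missing from one bootstrap resample
random.seed(0)
sample = {random.randrange(n) for _ in range(n)}
print(round(1 - len(sample) / n, 3))  # close to 0.368
```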