ML Flashcards

(55 cards)

1
Q

ML: What is a generative model?

A

It models the joint distribution P(X,Y) = P(X|Y)P(Y); P(Y|X) is then derived via Bayes' rule.

2
Q

ML: What is a discriminative model?

A

Models P(Y|X) directly, without first modeling the distribution of the inputs X.

3
Q

ML: What is a Bayes classifier?

A
A classifier that predicts the most probable label given the input: ŷ = argmax over y of P(Y=y|X=x). It is the optimal classifier under 0-1 loss when the true distribution is known.
4
Q

ML: At a high level, how would a Bayes Classifier be constructed in a case where errors are asymmetric, meaning some errors are worse than others?

A

We would assign each type of error its own loss L_{i,j}, where i is what you said the answer is and j is what the answer should've been, and then classify so as to minimize expected loss rather than raw error rate.

5
Q

ML: What is a simplistic view of logistic regression when Y is binary?

A

We are just doing a linear regression on our X's, then squishing the outcome into [0,1] with the sigmoid function.

6
Q

ML: In binary logistic regression, what is the formula for P(Y=1|X=x)?

A

For x of any dimension, with intercept β0 and coefficient vector β, it is:

P(Y=1|X=x) = e^(β0 + βᵀx) / (1 + e^(β0 + βᵀx)) = 1 / (1 + e^(−(β0 + βᵀx)))

7
Q

ML: What decision boundary does binary logistic regression yield?

A

It yields the linear separator β0 + βᵀx = 0: points with β0 + βᵀx > 0 are classified as 1, the rest as 0.

8
Q

ML: For binary logistic regression, how do we learn the coefficients of the model B0, B1, B2… in order to make predictions?

A

We estimate them by maximum likelihood, computed iteratively with something like gradient descent. So we find the coefficients β0, β1, β2… that maximize the likelihood of producing the observed labels given the observed training data.
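As a sketch of that idea, here is a minimal 1-D binary logistic regression fit by gradient ascent on the log-likelihood, in plain Python. The toy data, learning rate, and step count are made up for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Fit 1-D binary logistic regression by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            g0 += y - p        # gradient of the log-likelihood w.r.t. b0
            g1 += (y - p) * x  # gradient w.r.t. b1
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# Toy data: the label is 1 for larger x
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
p_low = sigmoid(b0 + b1 * 0.0)   # should be well below 0.5
p_high = sigmoid(b0 + b1 * 5.0)  # should be well above 0.5
```

Real implementations use second-order or stochastic methods and regularization, but the fitted probabilities come from exactly this likelihood.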

9
Q

ML: How do we extend binary logistic regression to multinomial logistic regression? So our X's are still n-dimensional vectors, but now our Y's are labels in {1, …, K}?

A
We give each class k its own coefficient vector β^(k) and use the softmax function:

P(Y=k|X=x) = e^(β^(k)ᵀx) / Σ_j e^(β^(j)ᵀx)

(One class's coefficients can be fixed at zero as a reference class; with K=2 this recovers binary logistic regression.)
10
Q

ML: How do you specify a hyperplane classifier for D-dimensional x, and Y either 0 or 1?

A
With the intercept folded into w, predict ŷ = 1 if wᵀx > 0 and ŷ = 0 otherwise; the hyperplane wᵀx = 0 is the decision boundary.
11
Q

ML: How does prediction work for a KNN classifier or regressor?

A

For each new point, you find the K points in the training set closest to it (by some measure such as Euclidean distance). Then, for classification you return the most common label among those nearest neighbors, and for regression you return the average of their labels.
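The prediction procedure above fits in a few lines of plain Python; the toy training set and the `knn_predict` helper are hypothetical:

```python
from collections import Counter

def knn_predict(train, query, k=3, mode="classify"):
    """train: list of (x_vector, label) pairs; query: an x_vector."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    # Find the k training points closest to the query (Euclidean distance)
    neighbors = sorted(train, key=lambda pt: dist(pt[0], query))[:k]
    labels = [y for _, y in neighbors]
    if mode == "classify":
        return Counter(labels).most_common(1)[0][0]  # majority vote
    return sum(labels) / len(labels)                 # average for regression

train = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
         ((5, 5), 1), ((5, 6), 1), ((6, 5), 1)]
near_big_blob = knn_predict(train, (5.5, 5.5), k=3)   # neighbors all labeled 1
near_small_blob = knn_predict(train, (0.2, 0.2), k=3) # neighbors all labeled 0
```

Note there is no training step at all; KNN defers all the work to prediction time.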

12
Q

ML: How do ensemble classifiers generally work?

A

Several models are learned, and they vote on what the prediction is.

13
Q

ML: What is boosting? How does it generally work?

A

In boosting, you fit several (typically very basic) classifiers iteratively, with each new classifier trying to do well on the training points that the previous classifiers did poorly on.

14
Q

ML: How does boosting improve the performance of the classifiers it's using? In other words, how are bias and/or variance impacted before and after the use of boosting?

A

Because it uses very simple predictors (for example, stumps of decision trees), each one individually has very high bias. But, by fitting a lot of them and correcting for previous mistakes, bias is decreased.

15
Q

ML: What is the (somewhat high-level) algorithm for Adaboost?

A

We start with each training example Xi having equal weights wi. Then, repeat:

  1. Fit a new classifier that minimizes weighted training error, based on the weights wi. (Points on which earlier classifiers have performed poorly have higher weights, and errors on those points cost more)
  2. Based on the weighted error of that classifier, find a weight alphaj for the classifier for when it votes later on. The lower its weighted error, the higher its alphaj, and thus the higher its voting power.
  3. Update each of the point weights wi based on this round’s results.
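The three steps above can be sketched in plain Python, using 1-D decision stumps as the weak classifiers on made-up data. The weight-update and alpha formulas below are the standard AdaBoost ones for labels in {-1, +1}:

```python
import math

def stump_fit(xs, ys, weights):
    """Find the 1-D decision stump (threshold + direction) with lowest weighted 0-1 error."""
    best = None
    for thresh in sorted(set(xs)):
        for direction in (1, -1):  # which side of the threshold predicts +1
            preds = [direction if x > thresh else -direction for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thresh, direction)
    return best

def adaboost(xs, ys, rounds=5):
    n = len(xs)
    weights = [1.0 / n] * n                 # start with equal point weights
    ensemble = []
    for _ in range(rounds):
        err, thresh, direction = stump_fit(xs, ys, weights)   # step 1
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)               # step 2: lower error -> bigger vote
        ensemble.append((alpha, thresh, direction))
        for i in range(n):                                    # step 3: upweight mistakes
            pred = direction if xs[i] > thresh else -direction
            weights[i] *= math.exp(-alpha * ys[i] * pred)
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (d if x > t else -d) for a, t, d in ensemble)
    return 1 if score > 0 else -1

xs = [0, 1, 2, 3, 4, 5]
ys = [-1, -1, -1, 1, 1, 1]
model = adaboost(xs, ys)
```

Each round's stump is fit against the current weights, and each stump's final say in the vote is its alpha.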
16
Q

ML: How (at a high level) does gradient boosting work?

A

At each step, we compute the gradient of the loss function with respect to the current ensemble's predictions (for squared loss, this is just the residuals). We then fit the next classifier to approximate the negative of this gradient vector, and add it to the ensemble.

So each new classifier is effectively a step of (approximate) gradient descent on the loss, taken in function space.

17
Q

What is the xgboost library at a very high level?

A

The xgboost library is an ML library for training models that are ensembles of decision trees using gradient boosting (a version of boosting, which you're familiar with). The implementation contains many optimizations, making it very fast and effective, hence its popularity throughout the ML world.

Creating models with xgboost is simple: its interface is very similar to sklearn's, for example.

It was made by CMU professor Tianqi Chen!

18
Q

ML: How does bagging work, at a fairly high level? How do we train a bagging ensemble given a training set of n Xi’s?

A

Resample several samples from your training data; say m bootstrapped samples of size n, with replacement.

Train a (typically complex) model on each of the m bootstrapped samples. Have these models vote on the prediction.
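A minimal sketch of that procedure in plain Python, using a deliberately weak, hypothetical base learner (a 1-D stump that splits at the mean x of its bootstrap sample) in place of the complex models bagging would normally use:

```python
import random
from collections import Counter

def bagging_fit(train, train_model, m=25, seed=0):
    """Train m models, each on a bootstrap sample (size n, drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(m):
        sample = [rng.choice(train) for _ in range(len(train))]
        models.append(train_model(sample))
    return models

def bagging_predict(models, x):
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]  # majority vote

# Hypothetical weak base learner: split at the mean x of the bootstrap sample
def train_stump(sample):
    mid = sum(x for x, _ in sample) / len(sample)
    return lambda x: 1 if x > mid else 0

train = [(0.0, 0), (0.5, 0), (1.0, 0), (4.0, 1), (4.5, 1), (5.0, 1)]
models = bagging_fit(train, train_stump)
```

Swap in deep decision trees for `train_stump` and this skeleton is exactly the bagging described above.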

19
Q

ML: How do bias and variance work with bagging? That is, what are the bias and variance of one of the individual classifiers in the ensemble, and what happens to bias and variance when we have several classifiers vote?

A

The classifiers in bagging are complex; if they’re trees, they’re deep trees. As a result, the individual classifiers have low bias, but high variance.

However, when we use several classifiers and have them vote, we can drive down this variance, because we're averaging votes from classifiers trained on different bootstrapped samples. (As n goes to infinity, these samples approach independence.)

20
Q

ML: What are the high-level differences between boosting and bagging, and how they achieve the goal of making several classifiers work effectively together?

A

Boosting fits several models that are not complex, and thus have high bias. But by fitting several and correcting errors from previous iterations, bias is reduced.

Bagging fits several complex models, so bias is lower but variance is high. But by averaging many predictions on slightly different datasets, we can lower variance.

21
Q

ML: What is an improvement we can make to bagging (specifically classification bagging) when we are calculating our predicted class?

A

Rather than just picking the class with the most votes from each of the classifiers, average the probabilities of each class from each of the classifiers, then predict the class with the highest probability.

(If we get to the end of a branch in one of the classifiers, the “probability” of class j predicted by that classifier is the proportion of training examples that had class j).

This improvement makes performance more consistent, decreasing variance a bit.

22
Q

ML: How do Random Forests work?

A

Random Forests work basically just by using bagging with deep decision trees: bootstrapping several resampled training sets, training a deep tree on each of them, and having them vote for a classification or regressed value.

The key change from traditional bagging is this: if there are p features of a training example X, at each split in a decision tree, only m < p features are considered for that split. (Typically m = sqrt(p))

23
Q

ML: Why is the Random Forest alteration to bagging often advantageous?

A

By only considering m < p features at each split of a decision tree, the trees become more independent, which helps the random forest algorithm decrease the variance in the overall prediction (which is the general goal of bagging).

24
Q

ML: What method is typically used to evaluate random forests instead of cross-validated error? Why is this other approach typically better for random forests?

A

You use out-of-bag error. Because each classifier is trained on a bootstrapped training set, each point in the original training set was left out of the training of some of the classifiers (each point is out-of-bag for about 36.8% of them). So, for each point, get a prediction from only those classifiers that didn't train on it, and compute the error over all points.

This is preferable to cross-validation because you don't need to retrain the model for each of the folds; training takes a long time for random forests, so we want to avoid repeating it. And with enough trees, out-of-bag error tends to be very similar to CV error.
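The 36.8% figure comes from the bootstrap itself: a sample of size n drawn with replacement omits any given point with probability (1 − 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. A quick check:

```python
import math

def oob_fraction(n):
    """Probability that a given point is absent from one bootstrap sample of size n."""
    return (1 - 1 / n) ** n

frac = oob_fraction(1000)   # close to...
limit = math.exp(-1)        # ...1/e ≈ 0.3679
```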

25
ML: For random forests, what are the 2 most common ways to calculate variable importance metrics?
  1. Find all of the splits that use that variable across all of the decision trees, and measure how much, on average, those splits improve the predictions, using some measure of split quality like the Gini index.
  2. Permutation importance: randomly permute the values of each variable, one at a time, and see how much model performance decreases for each. (Concretely: when computing out-of-bag error, shuffle that variable's values among the out-of-bag points before predicting, and measure the drop in accuracy.)
26
ML: What is clustering, and how does it differ from classification?
Clustering is an unsupervised learning technique, so our training data are not labeled, and we try to divide the data into groups that are in some way "similar". So basically, we try to label each point such that similar points have the same label. Classification is similar because each point has a label, but classification is supervised: we are given a labeling scheme and want to learn how to predict it as best as possible. With clustering, we aren't given a labeling scheme, and want to find a sensible one.
27
ML: What does k-means clustering approximately minimize? And how does its approximation of this minimization work on a theory level?
It approximately minimizes the within-cluster sum of squared distances: the total squared Euclidean distance from each point to its cluster center. Minimizing this exactly is NP-hard, so k-means (Lloyd's algorithm) uses alternating optimization: fix the assignments and optimize the centers, then fix the centers and optimize the assignments. Each step can only decrease the objective, so the algorithm converges, but only to a local minimum.
28
ML: How do you run the k-means clustering algorithm?
After choosing your number of clusters k, randomly initialize the locations of your k cluster centers. Then, repeat until the assignments stop changing:

  1. Assign each training point to the cluster center that is closest by Euclidean distance.
  2. Set each cluster center to the average of all points assigned to its cluster.
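The two-step loop above (Lloyd's algorithm) can be sketched in plain Python for 2-D points; the toy data and iteration count are made up:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm on 2-D points; returns (centers, assignments)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers at random data points
    assign = [0] * len(points)
    for _ in range(iters):
        # Step 1: assign every point to its nearest center
        assign = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Step 2: move each center to the mean of its assigned points
        for j in range(k):
            cluster = [p for p, a in zip(points, assign) if a == j]
            if cluster:
                centers[j] = (sum(x for x, _ in cluster) / len(cluster),
                              sum(y for _, y in cluster) / len(cluster))
    return centers, assign

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, assign = kmeans(points, 2)  # the two blobs end up in separate clusters
```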
29
ML: What are some drawbacks to k-means clustering?
  1. It's nondeterministic due to the random initialization, and is not guaranteed to reach the global optimum.
  2. The cluster centers aren't interpretable, as each is an average of a bunch of points rather than an actual data point.
30
ML: What is an advantage to k-means clustering?
It will always converge, as each step either decreases the in-cluster variance or leaves it unchanged (until it settles at a local minimum).
31
ML: How does the k-medoids algorithm work? Specifically, what alterations are made to the k-means algorithm?
Here the cluster centers are actual points in the dataset, rather than an average of a bunch of points. So we randomly pick k points to be the initial cluster centers, then repeat:

  1. Assign each training point to the cluster center (a real point) that is closest to it.
  2. For each cluster, choose as the new cluster center the point in the cluster that minimizes mean squared distance to the other points. (In other words, minimize in-cluster variation.)
32
ML: What does it mean to "fold the intercept in" for a hyperplane classifier?
Rather than writing the hyperplane as wTx + b = 0, you write it as wTx = 0, and implicitly assume that the first (zeroth) entry in w is your intercept b, and the first (zeroth) entry in your x vector is a 1.
33
ML: What is a simple way to draw a hyperplane classifier wTx = 0, as well as the direction of the arrow pointing towards y=1?
Just draw the hyperplane by normally drawing a line, converting into y = mx + b form. To figure out which side has which label and draw that arrow, plug some easy-to-evaluate sample point into the original form wᵀx and see whether wᵀx is above or below 0.
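A small sketch of folding in the intercept and checking which side of wᵀx = 0 a point falls on; the hyperplane and test points here are made up:

```python
def classify(w, x):
    """Hyperplane classifier with the intercept folded in: w[0] is b, x gets a leading 1."""
    x = [1.0] + list(x)
    score = sum(wi * xi for wi, xi in zip(w, x))  # this is w^T x
    return 1 if score > 0 else 0

# Hypothetical hyperplane: -3 + x1 + x2 = 0, i.e. the line x2 = 3 - x1
w = [-3.0, 1.0, 1.0]
above = classify(w, (4, 2))  # -3 + 4 + 2 = 3 > 0  -> class 1
below = classify(w, (1, 1))  # -3 + 1 + 1 = -1 < 0 -> class 0
```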
34
ML: What is the goal of dimensionality reduction?
We want to take our D-dimensional dataset, and represent it using fewer dimensions (or fewer features). And we want to preserve as much of the structure and information in the data as we can: we want similar points in our original dataset to also be similar in the dimension-reduced dataset, and want dissimilar points to also remain dissimilar (by whatever measure of similarity makes sense in context).
35
ML: What does PCA give you?
With D-dimensional data, PCA finds up to D new linear dimensions, or vectors, or "number lines" as I think of them, which are linear combinations of the original dimensions (e1, e2, …). Each of these dimensions is called a principal component. The first is the vector v1 maximizing the variance of Xv1, the projections of the points in (centered) design matrix X onto the number line of v1. Each remaining principal component maximizes the variance of the projection Xvi subject to vi being orthogonal to all previous principal components. (All D principal components thus form an orthonormal basis for R^D.) Using these principal components, you can choose k < D of them to represent your data in the k-dimensional subspace preserving as much of the variance in the data as possible. (Even though the construction looks greedy, it is in fact globally optimal: the top k principal components span the k-dimensional subspace of maximum projected variance.)
36
ML: In PCA, what is proportion of variance explained?
For some subset of the D principal components, say the top k of them, the proportion of variance explained is how much of the variance in the original dataset is preserved by this k-dimensional representation: the variance of the k-dimensional projection divided by the total variance in the original data (equivalently, the sum of the top k eigenvalues of the covariance matrix divided by the sum of all D). We are often interested in proportion of variance explained *as a function of k*, so we can see when k is high enough to explain a reasonable proportion of the variance, such as 90%.
37
ML: In PCA, what is a scree plot?
It plots proportion of variance explained vs. number of principal components used, as a means of visualizing how much information is preserved by each additional principal component. This is the general idea, but there are several versions: some plot the eigenvalue associated with each successive principal component (the two are related, as described in another flashcard), and others plot the proportion of variance explained by each individual PC rather than the cumulative total.
38
ML: What equation is satisfied by a matrix M, an eigenvector v of M, and the associated eigenvalue λ (lambda)?
Mv = λv
39
ML: In PCA, once you have your top k principal components vi, how do you get your k-dimensional representation of data matrix X?
Xnew = XVk, where the columns of Vk are the k principal components vi. In other words, you find all k PC scores, Xvi, which is just the projection of each point x onto the number line of direction vi, and then you combine all of those projections into your new representation Xnew.
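As an illustrative sketch (not how real PCA libraries do it), the top principal component of 2-D data can be approximated by power iteration on the covariance matrix. The toy data below lies roughly along the diagonal y = x, so v1 should come out close to (1, 1)/√2:

```python
def first_pc(X, iters=200):
    """Approximate the top principal component of 2-D data via power iteration
    on the (unnormalized) covariance matrix of the centered data."""
    n = len(X)
    mx = sum(x for x, _ in X) / n
    my = sum(y for _, y in X) / n
    C = [(x - mx, y - my) for x, y in X]  # center the data
    # Covariance matrix entries, up to a 1/n factor (which doesn't change eigenvectors)
    sxx = sum(a * a for a, _ in C)
    sxy = sum(a * b for a, b in C)
    syy = sum(b * b for _, b in C)
    v = (1.0, 1.0)
    for _ in range(iters):  # power iteration: repeatedly apply the matrix, renormalize
        v = (sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1])
        norm = (v[0] ** 2 + v[1] ** 2) ** 0.5
        v = (v[0] / norm, v[1] / norm)
    return v

# Points spread mostly along the diagonal y = x
X = [(0, 0), (1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9)]
v1 = first_pc(X)
```

The PC scores Xv1 would then just be the dot product of each centered point with v1.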
40
ML: There are three common high-level ways to compute PCA. What are they? In other words, what 3 matrices can you do what three things to to get relevant information for PCA?
Given (centered) data matrix X, you can:

  1. Find the eigendecomposition VDVᵀ of its covariance matrix Σ̂ (the columns of V are the principal components).
  2. Find the singular value decomposition X = UD*Vᵀ (same V).
  3. Take the inner product matrix XXᵀ and find XXᵀ = U(D*)²Uᵀ, where U and D* are the same as in option 2.
41
ML: How do you Kernel PCA for some dataset, and what does Kernel PCA allow you to do?
Kernel PCA allows you to find *nonlinear embeddings* using PCA, which ordinarily finds only optimal linear embeddings. You'll recall that you can do normal PCA via the inner product matrix: XXᵀ = U(D*)²Uᵀ, where (XXᵀ)i,j = xiᵀxj. Instead, use the kernel trick and build a different similarity matrix, a kernel matrix K with entries Ki,j = k(xi, xj) for some nonlinear kernel. Then do the same decomposition, K = U(D*)²Uᵀ (for new matrices U and D*), and find the PC scores in the implied feature space as UD*, just as you would with XXᵀ. This way you can compute the PCA dimension reduction of X after implicitly mapping the points of X into a new (typically nonlinear, possibly infinite-dimensional) feature space, without ever computing that mapping explicitly.
42
ML: How do you train a model using cross-validation? How does it differ from normal validation?
With normal validation, once you've set aside your test data, you split your remaining data into train and validation sets. You train all your models on the train data, then validate them on the validation data; you pick a model and test it on the test data. With cross-validation (say with k folds, such as 5 or 10), after you set aside test data, you instead divide the remaining data into k groups, or k "folds". For every model you try, you train it on k-1 of the folds and find its validation error on the remaining fold, and you do this *for each of the k folds*. Finally, you get your validation error by averaging the k error measures from the folds. (So you train each model k times rather than once when validating.) This "wastes" less data, but takes more time and compute.
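The fold-splitting bookkeeping can be sketched in plain Python; the round-robin split used here is one arbitrary way to form folds:

```python
def kfold_splits(data, k):
    """Yield (train, validation) pairs: each fold is held out exactly once."""
    folds = [data[i::k] for i in range(k)]  # round-robin assignment into k folds
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

data = list(range(10))
splits = list(kfold_splits(data, 5))  # 5 (train, val) pairs of sizes (8, 2)
```

In practice you'd shuffle the data first (and stratify by label for classification) before forming folds.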
43
ML: When cross-validating, are more or fewer folds better? What is the "ideal" number of folds? What is the intuition behind this?
Generally more folds is better, and you want as many folds as possible. The "ideal" cross validation is leave-one-out cross validation or LOOCV, where each datapoint is its own fold, with the intuition being that the more data you use to train for each fold, the better you simulate how the model will perform on actual held-out data. But of course, more folds means more time, and thus LOOCV is often not realistic. 5-fold and 10-fold CV are common and tend to do the trick.
44
What is an SVM at a high level? Hard and soft-margin SVM?
Support vector machines (SVMs) are an algorithm for finding the optimal linear/hyperplane classifier for a dataset. In hard-margin SVM, no points may be misclassified, and the classifier maximizes the "margin": the distance between the hyperplane and the closest point(s). In soft-margin SVM, points are allowed to violate the margin. Each point is assigned a "slack variable" describing how far it is inside the margin (a concept kept from hard-margin) or past the decision boundary (i.e. misclassified).
45
ML: How does multi-class SVM work, i.e. where there's more than two classes?
Similar to multiclass logistic regression, we just fit k binary-classification SVMs: one for each of the k labels, predicting whether it's that specific label vs *any of the others* (we treat all other labels as a single label for each binary classifier). We then predict via ŷ = argmax over k of (w_kᵀx + b_k), i.e. the class whose one-vs-rest classifier is most confident.
46
ML: What is the idea of the kernel trick? What two valuable advantages does the kernel trick give us?
The kernel trick lets us work in a high-dimensional (even infinite-dimensional) feature space φ(x) without ever computing φ explicitly: wherever an algorithm uses only inner products xiᵀxj, we replace them with a kernel k(xi, xj) = φ(xi)ᵀφ(xj). The two advantages: 1. We can learn *nonlinear* decision boundaries with otherwise-linear methods (the boundary is linear in feature space but nonlinear in the input space). 2. It's computationally cheap: we never materialize the high-dimensional features, we just evaluate k on pairs of points.
47
ML: How does using the kernel trick for SVM work? Where do we sub in the kernel?
The dual form of the SVM optimization problem depends on the training data only through the inner products xiᵀxj, so that is where we substitute the kernel: replace every xiᵀxj with k(xi, xj). Prediction likewise depends only on inner products with the support vectors, so it becomes sign(Σi αi yi k(xi, x) + b).
48
ML: What does a norm accomplish at a very high level?
It maps a vector to a non-negative scalar in a way that measures its "length"
49
ML: What is the *lp* norm (as in *l0, l1,* etc) for integer values from 0 all the way to infinity?
For 1 ≤ p < ∞, ||x||p = (Σi |xi|^p)^(1/p). The edge cases: the "*l0* norm" (not a true norm) counts the number of nonzero entries of x, and the *l∞* norm is the maximum absolute entry, max over i of |xi|.
50
ML: What is the *l2* norm? What does it find, and how do we calculate it for a vector?
The *l2* norm, or the Euclidean norm, is simply the distance formula: it's the length of the vector in Euclidean space. If x = [1, 5, 2], ||x||2 = sqrt(1² + 5² + 2²).
51
ML: What is the *l1* norm? How do we calculate it?
For ||x||1, you just sum the absolute values of the entries in x. ||x||1 = |x1| + |x2| + |x3|
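Both of these, plus the edge cases from the *lp* card, fit in one small helper (the function name is made up):

```python
def lp_norm(x, p):
    """lp norm for 1 <= p < inf, plus the l0 count and l-infinity max as edge cases."""
    if p == 0:
        return sum(1 for xi in x if xi != 0)   # "l0 norm": number of nonzero entries
    if p == float("inf"):
        return max(abs(xi) for xi in x)        # l-infinity: largest absolute entry
    return sum(abs(xi) ** p for xi in x) ** (1 / p)

x = [1, 5, 2]
l1 = lp_norm(x, 1)                 # |1| + |5| + |2| = 8
l2 = lp_norm(x, 2)                 # sqrt(1 + 25 + 4) = sqrt(30)
linf = lp_norm(x, float("inf"))    # 5
```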
52
ML: In the context of machine learning, what is the key difference between the *l1* and *l2* norms, and in the outcomes they produce? How does this concept impact the regularization techniques we're familiar with?
Suppose we are trying to find a vector x that minimizes a norm, either the *l1* or *l2* norm. So in both cases, we're trying to find a vector x whose entries are near zero.

The *l2* norm doesn't care much whether an entry in x is exactly zero or merely very nearly zero; as such, when we minimize the *l2* norm, we get a bunch of values that are very nearly zero. Conversely, the *l1* norm *does* care a lot whether an entry is exactly zero or very nearly zero; as such, when we minimize the *l1* norm, we get a bunch of values that are exactly zero, where with the *l2* norm they weren't quite zeroed out.

This pops up in regularization: ridge regression penalizes the parameter vector with the *l2* norm, so we get many values that are almost zero but none that are truly zeroed out. Lasso regression uses the *l1* norm and thus produces many parameters that are exactly 0. There are tradeoffs here: lasso yields more interpretable results, but ridge is better at capturing very small relationships that lasso tends to just zero out.
53
402/401: What is the correct way of interpreting a coefficient Bi in a linear regression? What is an incorrect way of interpreting it? Why is the former correct while the latter is incorrect?
**Correct:** Bi is the expected difference in the outcome variable Y between two points whose difference in Xi is 1, assuming the values of all other predictors are held equal. **Incorrect:** Bi is the expected difference in Y caused by increasing Xi by 1. The incorrect one is wrong because *it assumes some sort of causal relationship*. Our linear model by itself says nothing about what will happen if we *change* the value of Xi, it just says that on average, points with higher Xi have higher (or lower, depending on sign of Bi) values of Y based proportionally on Bi, which is what the correct interpretation is saying.
54
ML: At an extremely high level, how does graph clustering work?
Form a graph from your training points, connecting points that are similar. Then, make graph cuts which separate dissimilar points to form your clusters.
55
ML: What is a key advantage of graph clustering?
It is very good at capturing *non-blob-like clusters*, or clusters with atypical shapes. This can be very useful when clusters aren't separable by distance to a center, as centroid-based methods like k-means assume (e.g. concentric rings or crescents).