ML Flashcards

(55 cards)

1
Q

ML: What is a generative model?

A

It models the joint distribution P(X,Y) = P(X|Y)P(Y); P(Y|X) is then derived via Bayes' rule.

2
Q

ML: What is a discriminative model?

A

Models P(Y|X) directly, without first modeling the distribution of the inputs X.

3
Q

ML: What is a Bayes classifier?

A
A classifier that predicts the most probable label given the input: ŷ = argmax over y of P(Y=y|X=x). It is the optimal classifier under 0-1 loss when the true distribution is known.
4
Q

ML: At a high level, how would a Bayes Classifier be constructed in a case where errors are asymmetric, meaning some errors are worse than others?

A

We would assign each type of error its own loss L_{i,j}, where i is what you said the answer is and j is what the answer should've been, and then classify so as to minimize expected loss rather than raw error rate.

5
Q

ML: What is a simplistic view of logistic regression when Y is binary?

A

We are just doing a linear regression on our X's, then squishing the outcome into [0,1] with the sigmoid function.

6
Q

ML: In binary logistic regression, what is the formula for P(Y=1|X=x)?

A

For x of any dimension, with intercept β0 and coefficient vector β, it is:

P(Y=1|X=x) = e^(β0 + βᵀx) / (1 + e^(β0 + βᵀx)) = 1 / (1 + e^(−(β0 + βᵀx)))

7
Q

ML: What decision boundary does binary logistic regression yield?

A

It yields the linear separator β0 + βᵀx = 0: points with β0 + βᵀx > 0 are classified as 1, the rest as 0.

8
Q

ML: For binary logistic regression, how do we learn the coefficients of the model B0, B1, B2… in order to make predictions?

A

We estimate them by maximum likelihood, computed iteratively with something like gradient descent. So we find the coefficients β0, β1, β2… that maximize the likelihood of producing the observed labels given the observed training data.
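As a sketch of that idea, here is a minimal 1-D binary logistic regression fit by gradient ascent on the log-likelihood, in plain Python. The toy data, learning rate, and step count are made up for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Fit 1-D binary logistic regression by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            g0 += y - p        # gradient of the log-likelihood w.r.t. b0
            g1 += (y - p) * x  # gradient w.r.t. b1
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# Toy data: the label is 1 for larger x
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
p_low = sigmoid(b0 + b1 * 0.0)   # should be well below 0.5
p_high = sigmoid(b0 + b1 * 5.0)  # should be well above 0.5
```

Real implementations use second-order or stochastic methods and regularization, but the fitted probabilities come from exactly this likelihood.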

9
Q

ML: How do we extend binary logistic regression to multinomial logistic regression? So our X's are still n-dimensional vectors, but now our Y's are labels in {1, …, K}?

A
We give each class k its own coefficient vector β^(k) and use the softmax function:

P(Y=k|X=x) = e^(β^(k)ᵀx) / Σ_j e^(β^(j)ᵀx)

(One class's coefficients can be fixed at zero as a reference class; with K=2 this recovers binary logistic regression.)
10
Q

ML: How do you specify a hyperplane classifier for D-dimensional x, and Y either 0 or 1?

A
With the intercept folded into w, predict ŷ = 1 if wᵀx > 0 and ŷ = 0 otherwise; the hyperplane wᵀx = 0 is the decision boundary.
11
Q

ML: How does prediction work for a KNN classifier or regressor?

A

For each new point, you find the K points in the training set closest to it (by some measure such as Euclidean distance). Then, for classification you return the most common label among those nearest neighbors, and for regression you return the average of their labels.
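The prediction procedure above fits in a few lines of plain Python; the toy training set and the `knn_predict` helper are hypothetical:

```python
from collections import Counter

def knn_predict(train, query, k=3, mode="classify"):
    """train: list of (x_vector, label) pairs; query: an x_vector."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    # Find the k training points closest to the query (Euclidean distance)
    neighbors = sorted(train, key=lambda pt: dist(pt[0], query))[:k]
    labels = [y for _, y in neighbors]
    if mode == "classify":
        return Counter(labels).most_common(1)[0][0]  # majority vote
    return sum(labels) / len(labels)                 # average for regression

train = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
         ((5, 5), 1), ((5, 6), 1), ((6, 5), 1)]
near_big_blob = knn_predict(train, (5.5, 5.5), k=3)   # neighbors all labeled 1
near_small_blob = knn_predict(train, (0.2, 0.2), k=3) # neighbors all labeled 0
```

Note there is no training step at all; KNN defers all the work to prediction time.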

12
Q

ML: How do ensemble classifiers generally work?

A

Several models are learned, and they vote on what the prediction is.

13
Q

ML: What is boosting? How does it generally work?

A

In boosting, you fit several (typically very basic) classifiers iteratively, with each new classifier trying to do well on the training points that the previous classifiers did poorly on.

14
Q

ML: How does boosting improve the performance of the classifiers it's using? In other words, how are bias and/or variance impacted before and after the use of boosting?

A

Because it uses very simple predictors (for example, stumps of decision trees), each one individually has very high bias. But, by fitting a lot of them and correcting for previous mistakes, bias is decreased.

15
Q

ML: What is the (somewhat high-level) algorithm for Adaboost?

A

We start with each training example Xi having equal weights wi. Then, repeat:

  1. Fit a new classifier that minimizes weighted training error, based on the weights wi. (Points on which earlier classifiers have performed poorly have higher weights, and errors on those points cost more)
  2. Based on the weighted error of that classifier, find a weight alphaj for the classifier for when it votes later on. The lower its weighted error, the higher its alphaj, and thus the higher its voting power.
  3. Update each of the point weights wi based on this round’s results.
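The three steps above can be sketched in plain Python, using 1-D decision stumps as the weak classifiers on made-up data. The weight-update and alpha formulas below are the standard AdaBoost ones for labels in {-1, +1}:

```python
import math

def stump_fit(xs, ys, weights):
    """Find the 1-D decision stump (threshold + direction) with lowest weighted 0-1 error."""
    best = None
    for thresh in sorted(set(xs)):
        for direction in (1, -1):  # which side of the threshold predicts +1
            preds = [direction if x > thresh else -direction for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thresh, direction)
    return best

def adaboost(xs, ys, rounds=5):
    n = len(xs)
    weights = [1.0 / n] * n                 # start with equal point weights
    ensemble = []
    for _ in range(rounds):
        err, thresh, direction = stump_fit(xs, ys, weights)   # step 1
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)               # step 2: lower error -> bigger vote
        ensemble.append((alpha, thresh, direction))
        for i in range(n):                                    # step 3: upweight mistakes
            pred = direction if xs[i] > thresh else -direction
            weights[i] *= math.exp(-alpha * ys[i] * pred)
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (d if x > t else -d) for a, t, d in ensemble)
    return 1 if score > 0 else -1

xs = [0, 1, 2, 3, 4, 5]
ys = [-1, -1, -1, 1, 1, 1]
model = adaboost(xs, ys)
```

Each round's stump is fit against the current weights, and each stump's final say in the vote is its alpha.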
16
Q

ML: How (at a high level) does gradient boosting work?

A

At each step, we compute the gradient of the loss function with respect to the current ensemble's predictions (for squared loss, this is just the residuals). We then fit the next classifier to approximate the negative of this gradient vector, and add it to the ensemble.

So each new classifier is effectively a step of (approximate) gradient descent on the loss, taken in function space.

17
Q

What is the xgboost library at a very high level?

A

The xgboost library is an ML library for training models that are ensembles of decision trees using gradient boosting (a version of boosting, which you're familiar with). The implementation contains many optimizations, making it very fast and effective, hence its popularity throughout the ML world.

Creating models with xgboost is simple: its interface is very similar to sklearn's, for example.

It was made by CMU professor Tianqi Chen!

18
Q

ML: How does bagging work, at a fairly high level? How do we train a bagging ensemble given a training set of n Xi’s?

A

Resample several samples from your training data; say m bootstrapped samples of size n, with replacement.

Train a (typically complex) model on each of the m bootstrapped samples. Have these models vote on the prediction.
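A minimal sketch of that procedure in plain Python, using a deliberately weak, hypothetical base learner (a 1-D stump that splits at the mean x of its bootstrap sample) in place of the complex models bagging would normally use:

```python
import random
from collections import Counter

def bagging_fit(train, train_model, m=25, seed=0):
    """Train m models, each on a bootstrap sample (size n, drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(m):
        sample = [rng.choice(train) for _ in range(len(train))]
        models.append(train_model(sample))
    return models

def bagging_predict(models, x):
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]  # majority vote

# Hypothetical weak base learner: split at the mean x of the bootstrap sample
def train_stump(sample):
    mid = sum(x for x, _ in sample) / len(sample)
    return lambda x: 1 if x > mid else 0

train = [(0.0, 0), (0.5, 0), (1.0, 0), (4.0, 1), (4.5, 1), (5.0, 1)]
models = bagging_fit(train, train_stump)
```

Swap in deep decision trees for `train_stump` and this skeleton is exactly the bagging described above.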

19
Q

ML: How do bias and variance work with bagging? That is, what are the bias and variance of one of the individual classifiers in the ensemble, and what happens to bias and variance when we have several classifiers vote?

A

The classifiers in bagging are complex; if they’re trees, they’re deep trees. As a result, the individual classifiers have low bias, but high variance.

However, when we use several classifiers and have them vote, we can drive down this variance, because we're averaging votes from classifiers trained on different bootstrapped samples. (As n goes to infinity, these samples approach independence.)

20
Q

ML: What are the high-level differences between boosting and bagging, and how they achieve the goal of making several classifiers work effectively together?

A

Boosting fits several models that are not complex, and thus have high bias. But by fitting several and correcting errors from previous iterations, bias is reduced.

Bagging fits several complex models, so bias is lower but variance is high. But by averaging many predictions on slightly different datasets, we can lower variance.

21
Q

ML: What is an improvement we can make to bagging (specifically classification bagging) when we are calculating our predicted class?

A

Rather than just picking the class with the most votes from each of the classifiers, average the probabilities of each class from each of the classifiers, then predict the class with the highest probability.

(If we get to the end of a branch in one of the classifiers, the “probability” of class j predicted by that classifier is the proportion of training examples that had class j).

This improvement makes performance more consistent, decreasing variance a bit.

22
Q

ML: How do Random Forests work?

A

Random Forests work basically just by using bagging with deep decision trees: bootstrapping several resampled training sets, training a deep tree on each of them, and having them vote for a classification or regressed value.

The key change from traditional bagging is this: if there are p features of a training example X, at each split in a decision tree, only m < p features are considered for that split. (Typically m = sqrt(p))

23
Q

ML: Why is the Random Forest alteration to bagging often advantageous?

A

By only considering m < p features at each split of a decision tree, the trees become more independent, which helps the random forest algorithm decrease the variance in the overall prediction (which is the general goal of bagging).

24
Q

ML: What method is typically used to evaluate random forests instead of cross-validated error? Why is this other approach typically better for random forests?

A

You use out-of-bag error. Because each classifier is trained on a bootstrapped training set, each point in the original training set was left out of the training of some of the classifiers (each point is out-of-bag for about 36.8% of them). So, for each point, get a prediction from only those classifiers that didn't train on it, and compute the error over all points.

This is preferable to cross-validation because you don't need to retrain the model for each of the folds; training takes a long time for random forests, so we want to avoid repeating it. And with enough trees, out-of-bag error tends to be very similar to CV error.
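The 36.8% figure comes from the bootstrap itself: a sample of size n drawn with replacement omits any given point with probability (1 − 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. A quick check:

```python
import math

def oob_fraction(n):
    """Probability that a given point is absent from one bootstrap sample of size n."""
    return (1 - 1 / n) ** n

frac = oob_fraction(1000)   # close to...
limit = math.exp(-1)        # ...1/e ≈ 0.3679
```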

25
ML: For random forests, what are the 2 most common ways to calculate variable importance metrics?
  1. Find all of the splits that use that variable across all of the decision trees, and measure how much, on average, those splits improve the predictions, using some measure of split quality like the Gini index.
  2. Permutation importance: randomly permute the values of each variable, one at a time, and see how much model performance decreases for each. (Concretely: when computing out-of-bag error, shuffle that variable's values among the out-of-bag points before predicting, and measure the drop in accuracy.)
26
ML: What is clustering, and how does it differ from classification?
Clustering is an unsupervised learning technique, so our training data are not labeled, and we try to divide the data into groups that are in some way "similar". So basically, we try to label each point such that similar points have the same label. Classification is similar because each point has a label, but classification is supervised: we are given a labeling scheme and want to learn how to predict it as best as possible. With clustering, we aren't given a labeling scheme, and want to find a sensible one.
27
ML: What does k-means clustering approximately minimize? And how does its approximation of this minimization work on a theory level?
It approximately minimizes the within-cluster sum of squared distances: the total squared Euclidean distance from each point to its cluster center. Minimizing this exactly is NP-hard, so k-means (Lloyd's algorithm) uses alternating optimization: fix the assignments and optimize the centers, then fix the centers and optimize the assignments. Each step can only decrease the objective, so the algorithm converges, but only to a local minimum.
28
ML: How do you run the k-means clustering algorithm?
After choosing your number of clusters k, randomly initialize the locations of your k cluster centers. Then, repeat until the assignments stop changing:

  1. Assign each training point to the cluster center that is closest by Euclidean distance.
  2. Set each cluster center to the average of all points assigned to its cluster.
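The two-step loop above (Lloyd's algorithm) can be sketched in plain Python for 2-D points; the toy data and iteration count are made up:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm on 2-D points; returns (centers, assignments)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers at random data points
    assign = [0] * len(points)
    for _ in range(iters):
        # Step 1: assign every point to its nearest center
        assign = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Step 2: move each center to the mean of its assigned points
        for j in range(k):
            cluster = [p for p, a in zip(points, assign) if a == j]
            if cluster:
                centers[j] = (sum(x for x, _ in cluster) / len(cluster),
                              sum(y for _, y in cluster) / len(cluster))
    return centers, assign

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, assign = kmeans(points, 2)  # the two blobs end up in separate clusters
```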
29
ML: What are some drawbacks to k-means clustering?
  1. It's nondeterministic due to the random initialization, and is not guaranteed to reach the global optimum.
  2. The cluster centers aren't interpretable, as each is an average of a bunch of points rather than an actual data point.
30
ML: What is an advantage to k-means clustering?
It will always converge, as each step either decreases the in-cluster variance or leaves it unchanged (until it settles at a local minimum).
31
ML: How does the k-medoids algorithm work? Specifically, what alterations are made to the k-means algorithm?
Here the cluster centers are actual points in the dataset, rather than an average of a bunch of points. So we randomly pick k points to be the initial cluster centers, then repeat:

  1. Assign each training point to the cluster center (a real point) that is closest to it.
  2. For each cluster, choose as the new cluster center the point in the cluster that minimizes mean squared distance to the other points. (In other words, minimize in-cluster variation.)
32
ML: What does it mean to "fold the intercept in" for a hyperplane classifier?
Rather than writing the hyperplane as wTx + b = 0, you write it as wTx = 0, and implicitly assume that the first (zeroth) entry in w is your intercept b, and the first (zeroth) entry in your x vector is a 1.
33
ML: What is a simple way to draw a hyperplane classifier wTx = 0, as well as the direction of the arrow pointing towards y=1?
Just draw the hyperplane by normally drawing a line, converting into y = mx + b form. To figure out which side has which label and draw that arrow, plug some easy-to-evaluate sample point into the original form wᵀx and see whether wᵀx is above or below 0.
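A small sketch of folding in the intercept and checking which side of wᵀx = 0 a point falls on; the hyperplane and test points here are made up:

```python
def classify(w, x):
    """Hyperplane classifier with the intercept folded in: w[0] is b, x gets a leading 1."""
    x = [1.0] + list(x)
    score = sum(wi * xi for wi, xi in zip(w, x))  # this is w^T x
    return 1 if score > 0 else 0

# Hypothetical hyperplane: -3 + x1 + x2 = 0, i.e. the line x2 = 3 - x1
w = [-3.0, 1.0, 1.0]
above = classify(w, (4, 2))  # -3 + 4 + 2 = 3 > 0  -> class 1
below = classify(w, (1, 1))  # -3 + 1 + 1 = -1 < 0 -> class 0
```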
34
ML: What is the goal of dimensionality reduction?
We want to take our D-dimensional dataset, and represent it using fewer dimensions (or fewer features). And we want to preserve as much of the structure and information in the data as we can: we want similar points in our original dataset to also be similar in the dimension-reduced dataset, and want dissimilar points to also remain dissimilar (by whatever measure of similarity makes sense in context).
35
ML: What does PCA give you?
With D-dimensional data, PCA finds up to D new linear dimensions, or vectors, or "number lines" as I think of them, which are linear combinations of the original dimensions (e1, e2, …). Each of these dimensions is called a principal component. The first is the vector v1 maximizing the variance of Xv1, the projections of the points in (centered) design matrix X onto the number line of v1. Each remaining principal component maximizes the variance of the projection Xvi subject to vi being orthogonal to all previous principal components. (All D principal components thus form an orthonormal basis for R^D.) Using these principal components, you can choose k < D of them to represent your data in the k-dimensional subspace preserving as much of the variance in the data as possible. (Even though the construction looks greedy, it is in fact globally optimal: the top k principal components span the k-dimensional subspace of maximum projected variance.)
36
ML: In PCA, what is proportion of variance explained?
For some subset of the D principal components, say the top k of them, the proportion of variance explained is how much of the variance in the original dataset is preserved by this k-dimensional representation: the variance of the k-dimensional projection divided by the total variance in the original data (equivalently, the sum of the top k eigenvalues of the covariance matrix divided by the sum of all D). We are often interested in proportion of variance explained *as a function of k*, so we can see when k is high enough to explain a reasonable proportion of the variance, such as 90%.
37
ML: In PCA, what is a scree plot?
It plots proportion of variance explained vs. number of principal components used, as a means of visualizing how much information is preserved by each additional principal component. This is the general idea, but there are several versions: some plot the eigenvalue associated with each successive principal component (the two are related, as described in another flashcard), and others plot the proportion of variance explained by each individual PC rather than the cumulative total.
38
ML: What equation is satisfied by a matrix M, an eigenvector v of M, and the associated eigenvalue λ (lambda)?
Mv = λv
39
ML: In PCA, once you have your top k principal components vi, how do you get your k-dimensional representation of data matrix X?
Xnew = XVk, where the columns of Vk are the k principal components vi. In other words, you find all k PC scores, Xvi, which is just the projection of each point x onto the number line of direction vi, and then you combine all of those projections into your new representation Xnew.
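As an illustrative sketch (not how real PCA libraries do it), the top principal component of 2-D data can be approximated by power iteration on the covariance matrix. The toy data below lies roughly along the diagonal y = x, so v1 should come out close to (1, 1)/√2:

```python
def first_pc(X, iters=200):
    """Approximate the top principal component of 2-D data via power iteration
    on the (unnormalized) covariance matrix of the centered data."""
    n = len(X)
    mx = sum(x for x, _ in X) / n
    my = sum(y for _, y in X) / n
    C = [(x - mx, y - my) for x, y in X]  # center the data
    # Covariance matrix entries, up to a 1/n factor (which doesn't change eigenvectors)
    sxx = sum(a * a for a, _ in C)
    sxy = sum(a * b for a, b in C)
    syy = sum(b * b for _, b in C)
    v = (1.0, 1.0)
    for _ in range(iters):  # power iteration: repeatedly apply the matrix, renormalize
        v = (sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1])
        norm = (v[0] ** 2 + v[1] ** 2) ** 0.5
        v = (v[0] / norm, v[1] / norm)
    return v

# Points spread mostly along the diagonal y = x
X = [(0, 0), (1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9)]
v1 = first_pc(X)
```

The PC scores Xv1 would then just be the dot product of each centered point with v1.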
40
ML: There are three common high-level ways to compute PCA. What are they? In other words, what 3 matrices can you do what three things to to get relevant information for PCA?
Given (centered) data matrix X, you can:

  1. Find the eigendecomposition VDVᵀ of its covariance matrix Σ̂ (the columns of V are the principal components).
  2. Find the singular value decomposition X = UD*Vᵀ (same V).
  3. Take the inner product matrix XXᵀ and find XXᵀ = U(D*)²Uᵀ, where U and D* are the same as in option 2.
41
ML: How do you Kernel PCA for some dataset, and what does Kernel PCA allow you to do?
Kernel PCA allows you to find *nonlinear embeddings* using PCA, which ordinarily finds only optimal linear embeddings. You'll recall that you can do normal PCA via the inner product matrix: XXᵀ = U(D*)²Uᵀ, where (XXᵀ)i,j = xiᵀxj. Instead, use the kernel trick and build a different similarity matrix, a kernel matrix K with entries Ki,j = k(xi, xj) for some nonlinear kernel. Then do the same decomposition, K = U(D*)²Uᵀ (for new matrices U and D*), and find the PC scores in the implied feature space as UD*, just as you would with XXᵀ. This way you can compute the PCA dimension reduction of X after implicitly mapping the points of X into a new (typically nonlinear, possibly infinite-dimensional) feature space, without ever computing that mapping explicitly.
42
ML: How do you train a model using cross-validation? How does it differ from normal validation?
With normal validation, once you've set aside your test data, you split your remaining data into train and validation sets. You train all your models on the train data, then validate them on the validation data; you pick a model and test it on the test data. With cross-validation (say with k folds, such as 5 or 10), after you set aside test data, you instead divide the remaining data into k groups, or k "folds". For every model you try, you train it on k-1 of the folds and find its validation error on the remaining fold, and you do this *for each of the k folds*. Finally, you get your validation error by averaging the k error measures from the folds. (So you train each model k times rather than once when validating.) This "wastes" less data, but takes more time and compute.
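The fold-splitting bookkeeping can be sketched in plain Python; the round-robin split used here is one arbitrary way to form folds:

```python
def kfold_splits(data, k):
    """Yield (train, validation) pairs: each fold is held out exactly once."""
    folds = [data[i::k] for i in range(k)]  # round-robin assignment into k folds
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

data = list(range(10))
splits = list(kfold_splits(data, 5))  # 5 (train, val) pairs of sizes (8, 2)
```

In practice you'd shuffle the data first (and stratify by label for classification) before forming folds.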
43
ML: When cross-validating, are more or fewer folds better? What is the "ideal" number of folds? What is the intuition behind this?
Generally more folds is better, and you want as many folds as possible. The "ideal" cross validation is leave-one-out cross validation or LOOCV, where each datapoint is its own fold, with the intuition being that the more data you use to train for each fold, the better you simulate how the model will perform on actual held-out data. But of course, more folds means more time, and thus LOOCV is often not realistic. 5-fold and 10-fold CV are common and tend to do the trick.
44
What is an SVM at a high level? Hard and soft-margin SVM?
Support vector machines (SVMs) are an algorithm for finding the optimal linear/hyperplane classifier for a dataset. In hard-margin SVM, no points may be misclassified, and the classifier maximizes the "margin": the distance between the hyperplane and the closest point(s). In soft-margin SVM, points are allowed to violate the margin. Each point is assigned a "slack variable" describing how far it is inside the margin (a concept kept from hard-margin) or past the decision boundary (i.e. misclassified).
45
ML: How does multi-class SVM work, i.e. where there's more than two classes?
Similar to multiclass logistic regression, we just fit k binary-classification SVMs: one for each of the k labels, predicting whether it's that specific label vs *any of the others* (we treat all other labels as a single label for each binary classifier). We then predict via ŷ = argmax over k of (w_kᵀx + b_k), i.e. the class whose one-vs-rest classifier is most confident.
46
ML: What is the idea of the kernel trick? What two valuable advantages does the kernel trick give us?
The kernel trick lets us work in a high-dimensional (even infinite-dimensional) feature space φ(x) without ever computing φ explicitly: wherever an algorithm uses only inner products xiᵀxj, we replace them with a kernel k(xi, xj) = φ(xi)ᵀφ(xj). The two advantages: 1. We can learn *nonlinear* decision boundaries with otherwise-linear methods (the boundary is linear in feature space but nonlinear in the input space). 2. It's computationally cheap: we never materialize the high-dimensional features, we just evaluate k on pairs of points.
47
ML: How does using the kernel trick for SVM work? Where do we sub in the kernel?
The dual form of the SVM optimization problem depends on the training data only through the inner products xiᵀxj, so that is where we substitute the kernel: replace every xiᵀxj with k(xi, xj). Prediction likewise depends only on inner products with the support vectors, so it becomes sign(Σi αi yi k(xi, x) + b).
48
ML: What does a norm accomplish at a very high level?
It maps a vector to a non-negative scalar in a way that measures its "length"
49
ML: What is the *lp* norm (as in *l0, l1,* etc) for integer values from 0 all the way to infinity?
For 1 ≤ p < ∞, ||x||p = (Σi |xi|^p)^(1/p). The edge cases: the "*l0* norm" (not a true norm) counts the number of nonzero entries of x, and the *l∞* norm is the maximum absolute entry, max over i of |xi|.
50
ML: What is the *l2* norm? What does it find, and how do we calculate it for a vector?
The *l2* norm, or the Euclidean norm, is simply the distance formula: it's the length of the vector in Euclidean space. If x = [1, 5, 2], ||x||2 = sqrt(1² + 5² + 2²).
51
ML: What is the *l1* norm? How do we calculate it?
For ||x||1, you just sum the absolute values of the entries in x. ||x||1 = |x1| + |x2| + |x3|
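Both of these, plus the edge cases from the *lp* card, fit in one small helper (the function name is made up):

```python
def lp_norm(x, p):
    """lp norm for 1 <= p < inf, plus the l0 count and l-infinity max as edge cases."""
    if p == 0:
        return sum(1 for xi in x if xi != 0)   # "l0 norm": number of nonzero entries
    if p == float("inf"):
        return max(abs(xi) for xi in x)        # l-infinity: largest absolute entry
    return sum(abs(xi) ** p for xi in x) ** (1 / p)

x = [1, 5, 2]
l1 = lp_norm(x, 1)                 # |1| + |5| + |2| = 8
l2 = lp_norm(x, 2)                 # sqrt(1 + 25 + 4) = sqrt(30)
linf = lp_norm(x, float("inf"))    # 5
```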
52
ML: In the context of machine learning, what is the key difference between the *l1* and *l2* norms, and in the outcomes they produce? How does this concept impact the regularization techniques we're familiar with?
Suppose we are trying to find a vector x that minimizes a norm, either the *l1* or *l2* norm. So in both cases, we're trying to find a vector x whose entries are near zero.

The *l2* norm doesn't care much whether an entry in x is exactly zero or merely very nearly zero; as such, when we minimize the *l2* norm, we get a bunch of values that are very nearly zero. Conversely, the *l1* norm *does* care a lot whether an entry is exactly zero or very nearly zero; as such, when we minimize the *l1* norm, we get a bunch of values that are exactly zero, where with the *l2* norm they weren't quite zeroed out.

This pops up in regularization: ridge regression penalizes the parameter vector with the *l2* norm, so we get many values that are almost zero but none that are truly zeroed out. Lasso regression uses the *l1* norm and thus produces many parameters that are exactly 0. There are tradeoffs here: lasso yields more interpretable results, but ridge is better at capturing very small relationships that lasso tends to just zero out.
53
402/401: What is the correct way of interpreting a coefficient Bi in a linear regression? What is an incorrect way of interpreting it? Why is the former correct while the latter is incorrect?
**Correct:** Bi is the expected difference in the outcome variable Y between two points whose difference in Xi is 1, assuming the values of all other predictors are held equal. **Incorrect:** Bi is the expected difference in Y caused by increasing Xi by 1. The incorrect one is wrong because *it assumes some sort of causal relationship*. Our linear model by itself says nothing about what will happen if we *change* the value of Xi, it just says that on average, points with higher Xi have higher (or lower, depending on sign of Bi) values of Y based proportionally on Bi, which is what the correct interpretation is saying.
54
ML: At an extremely high level, how does graph clustering work?
Form a graph from your training points, connecting points that are similar. Then, make graph cuts which separate dissimilar points to form your clusters.
55
ML: What is a key advantage of graph clustering?
It is very good at capturing *non-blob-like clusters*, or clusters with atypical shapes. This can be very useful when clusters aren't separable by distance to a center, as centroid-based methods like k-means assume (e.g. concentric rings or crescents).