Overview ML - 01 Flashcards

(74 cards)

1
Q

What is machine learning?

A

It is the study of algorithms that improve their performance at some task with some experience: a learning algorithm gets better at its task as it gains experience from data.

2
Q

A spam filter based on Machine Learning techniques automatically learns words and phrases that are good ___________ of spam by detecting unusually __________ patterns of words.

A spam filter based on Machine Learning techniques automatically notices the new ___________ has become unusually frequent in spam flagged by the users, and it starts flagging them ____________ your intervention.

A

predictors; frequent; pattern; without

3
Q

What are the different types of machine learning?

A
  • Supervised learning: model is trained on labeled data
  • Unsupervised learning: model is trained on unlabeled data
  • Semi-supervised learning: model is trained on labeled and unlabeled data
  • Reinforcement learning: agent learns to perform a task to obtain maximum reward
4
Q

In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels. True or False?

A

True

5
Q

Define classification.

A

The task of predicting the class label of an input sample.

6
Q

Define regression.

A

The task of predicting a target numeric value for an input sample.

7
Q

Classification and regression are the two main tasks of ________________ learning.

A

supervised

8
Q

What are the most typical supervised learning algorithms?

A

1) k-nearest neighbor
2) Linear regression
3) Logistic regression
4) Support vector machines
5) Decision trees and random forests
6) Neural networks

9
Q

In ______________ learning, the training data is unlabeled. The system tries to learn without a teacher.

A

unsupervised

10
Q

What are the main tasks of unsupervised learning (4)?

A

1) Clustering: the task of detecting groups of similar data

2) Anomaly detection and novelty detection: the task of detecting whether a new instance looks like a normal one or is likely to be an anomaly

3) Visualization and dimensionality reduction: the task of visualizing or simplifying the data without losing too much information

4) Association rule learning: the task of discovering interesting relations between attributes

11
Q

What are the most typical unsupervised learning algorithms (4)?

A

1) k-means
2) Hierarchical clustering
3) Principal component analysis
4) Kernel PCA

12
Q

In semi-supervised learning, the training data is partially labeled, usually a lot of _____________ data and a little bit of ___________ data.

A

unlabeled; labeled

13
Q

Most semi-supervised algorithms are combinations of unsupervised and supervised algorithms. For example, Deep Belief Networks (DBNs) are based on _____________ components called restricted Boltzmann machines (RBMs) stacked on top of one another. RBMs are trained sequentially in an _________________ manner, and then the whole system is fine-tuned using _____________ learning techniques.

A

unsupervised; unsupervised; supervised

14
Q

In reinforcement learning, the learning system, called an _________, can observe the ________________, select and perform actions, and get ___________ in return (or penalties in the form of negative rewards). The agent must then learn by itself what is the best strategy, called __________, to get the most reward over time. The policy defines what __________ the agent should choose when in a given situation.

A

agent; environment; rewards; policy; action

15
Q

Define ML model.

A

Models form the central concept in machine learning as they are what is being learned from the data to solve a given task. There is a considerable range of machine learning models to choose from.

16
Q

What are the different types of models?

A

1) Geometric models: constructed directly in instance space, using geometric concepts such as lines, planes, and distances

2) Probabilistic models: statistical models that capture the inherent uncertainty in data and incorporate it into their predictions by means of probability distributions

3) Logical models: models of this type can be easily translated into rules that are understandable by humans

17
Q

What are the stages of the machine learning workflow (6)?

A

1) Data collection
2) Data preparation
3) Choosing a learning algorithm
4) Training the model
5) Evaluating the model
6) Predictions

18
Q

What are the 3 types of data?

A

1) Numeric (e.g., income, age)
2) Categorical (e.g., gender, nationality)
3) Ordinal (e.g., low/medium/high)

19
Q

Data pre-processing is the process of cleaning the raw data, i.e., converting data collected in the real world into a clean dataset. Whenever data is gathered from different sources, it is collected in a raw format, and this raw data isn't feasible for analysis. True or False?

A

True

20
Q

What are some pre-processing techniques?

A

1) Conversion of data
2) Scaling data
3) Missing values
4) Outliers detection

21
Q

Why is using feature scaling important?

A

1) Scaling guarantees that all features are on a comparable scale and have comparable ranges: the magnitude of the features has an impact on many machine learning techniques, and larger-scale features may dominate the learning process and have an excessive impact on the outcomes.

2) Algorithm performance improvement: when the features are scaled, several machine learning methods, including gradient descent-based algorithms, distance-based algorithms (such as k-nearest neighbors), and support vector machines, perform better or converge more quickly.

3) Preventing numerical instability: numerical instability can be prevented by avoiding significant scale disparities between features. Examples include distance calculations or matrix operations, where features with radically differing scales can result in numerical overflow or underflow problems.

4) Scaling ensures that each feature is considered equally during the learning process: without scaling, larger-scale features could dominate the learning, producing skewed outcomes.
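The two common scalings can be sketched in a few lines of plain Python (hypothetical helper names; a library such as scikit-learn provides production-grade implementations):

```python
def standardize(values):
    """Z-score standardization: zero mean, unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def min_max_scale(values, lo=0.0, hi=1.0):
    """Range scaling: map values linearly into [lo, hi]."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

incomes = [20_000, 35_000, 50_000, 120_000]
print(min_max_scale(incomes))  # -> [0.0, 0.15, 0.3, 1.0]
```

Note how the single large income compresses the other three into a narrow band of the [0, 1] range: this is the outlier sensitivity of range scaling mentioned in the cards below.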

22
Q

Standardization makes all variables directly ______________ and ___________ handle ____________.

A

comparable; cannot; outliers

23
Q

Range scaling requires a specific range of values, may ____________ data and ___________ are an issue.

A

compress; outliers

24
Q

It is important to fit the scalers to the _____________ data only, not to the full dataset. Only then can you use them to transform the training set and the test set (and new data).

A

training
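A minimal sketch of this rule in plain Python (hypothetical helper name; scikit-learn's `fit`/`transform` split follows the same pattern): the scaling parameters are learned from the training set only and then reused, unchanged, on the test set and on new data.

```python
def fit_standardizer(train):
    """Learn mean and std from the TRAINING data only; return a transform."""
    n = len(train)
    mean = sum(train) / n
    std = (sum((v - mean) ** 2 for v in train) / n) ** 0.5

    def transform(values):
        # Reuse the training-set parameters, whatever data comes in.
        return [(v - mean) / std for v in values]

    return transform

scale = fit_standardizer([10.0, 20.0, 30.0])  # fit on the training set only
train_scaled = scale([10.0, 20.0, 30.0])
test_scaled = scale([25.0])                   # same parameters, no refitting
```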

25
Q

What are the different strategies to handle missing values?

A

- Ignoring the missing values: we can remove the row or column of data. However, this shouldn't be done if there are a lot of missing values in the dataset.
- Data imputation: univariate imputation (replace each missing value with a common statistic, like the mode, mean, or median); ML imputation (predict what value should be present at the empty position using the existing data, e.g., kNN imputation).

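Univariate (mean) imputation can be sketched in plain Python (hypothetical helper name; libraries such as scikit-learn offer `SimpleImputer` and kNN-based imputers):

```python
def impute_mean(column, missing=None):
    """Univariate imputation: replace missing entries with the column mean."""
    observed = [v for v in column if v is not missing]
    mean = sum(observed) / len(observed)
    return [mean if v is missing else v for v in column]

ages = [25, None, 40, None, 31]
print(impute_mean(ages))  # -> [25, 32.0, 40, 32.0, 31]
```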
26
Q

Data imputation is essentially a ___________ process: as with any modeling process, it must be made with the __________ set. The model for imputation is part of the global model and must always be used before the main ______________ model. We want to put in the empty cells values that will not disrupt the final model: identify a "good" value to fill in the missing values so that the overall dataset properties do not ___________.

A

modeling; training; inference; change

27
Q

Choosing a learning algorithm depends upon the type of ___________ that needs to be solved and the type of _______ we have.

A

problem; data

28
Q

If the problem is to classify and the data is labeled, ____________ algorithms are used. If the problem is to perform a regression task and the data is labeled, ______________ algorithms are used. If the problem is to create clusters and the data is unlabeled, _____________ algorithms are used.

A

classification; regression; clustering

29
Q

Data should be visualized and inspected before fitting a model. What is this step called?

A

Exploratory data analysis

30
Q

Different features may require different types of data scaling and imputation. True or False?

A

True

31
Q

Typically, do data scaling __________ imputation. Imputing data before scaling may __________ the overall properties of each feature and ____________ its contribution to the overall model.

A

before; change; disrupt

32
Q

The best models are often the __________ models. When in doubt, select the simplest model available (fewer hyperparameter ______________, fewer _________ parameters).

A

simplest; combinations; fitted

33
Q

Overfitting occurs when a model fits perfectly to its ____________ data: the trained model overfits to the training data rather than generalizing to new and ____________ data. We avoid overfitting by ____________ the amount of training data, augmenting data, feature ____________, cross-validation, _______________, ...

A

training; unknown; increasing; selection; regularization

34
Q

______________ occurs when a model performs poorly on the ___________ data: the model is unable to capture the relationship between the input examples and the target value (i.e., the model is too simple). We avoid it by selecting a more powerful ___________ (with more parameters), feeding better ____________, or ___________ regularization.

A

Underfitting; training; model; features; reducing

35
Q

Cross-validation (CV) is a technique for ____________ a machine learning model and ___________ its performance. It helps to compare and select an appropriate _________ for the specific predictive modeling problem.

A

evaluating; testing; model

36
Q

CV tends to have a higher bias than other methods used to count the model's efficiency scores. True or False?

A

False

37
Q

What is the general algorithm to cross-validate a model?

A

1. Divide the dataset into two parts: one for training and the other for testing.
2. Train the model on the training set.
3. Validate the model on the test set (or an independent validation set).
4. Repeat steps 1 to 3 a number of times, depending on the CV method you are using.

38
Q

What are the disadvantages of hold-out CV (2)?

A

1) The dataset might not be evenly distributed: the training set may not represent the test set. The two sets may differ a lot, and one of them might be easier or harder than the other.
2) Testing the model only once is a bottleneck: the result obtained by the hold-out technique may be inaccurate.

39
Q

What is the hold-out CV algorithm?

A

1. Divide the dataset into two parts: the training set and the test set. Usually, 80% of the dataset goes to the training set and 20% to the test set, but you may choose any split that suits you better.
2. Train the model on the training set.
3. Validate the model on the test set.
4. Save the result of the validation.

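The steps above can be sketched in plain Python (hypothetical helper name; scikit-learn's `train_test_split` does the same job):

```python
import random

def holdout_split(data, test_ratio=0.2, seed=42):
    """Shuffle the sample indices, then split into training and test sets."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(len(data) * test_ratio)
    test = [data[i] for i in indices[:n_test]]
    train = [data[i] for i in indices[n_test:]]
    return train, test

train, test = holdout_split(list(range(100)))
print(len(train), len(test))  # -> 80 20
```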
40
Q

Hold-out CV is the ___________ and most ___________ technique. It is usually used on __________ datasets, as it requires training the model only ________.

A

simplest; common; large; once

41
Q

K-fold CV is a technique that ____________ the disadvantages of the hold-out method. In general, it is always better to use the K-fold technique instead of hold-out. It gives a more _________ and ________________ result, since training and testing are performed on several different _________ of the dataset.

A

minimizes; stable; trustworthy; parts

42
Q

What is the disadvantage of k-fold CV?

A

Increasing K results in training more models, and the training process might be really expensive and time-consuming.

43
Q

What is the algorithm of k-fold CV?

A

1. Pick a number of folds, K. Usually, K is 5 or 10, but you can choose any number less than the dataset's length.
2. Split the dataset into K equal (if possible) parts, called folds.
3. Choose K − 1 folds as the training set. The remaining fold will be the test set.
4. Train the model on the training set. On each iteration of CV, you must train a new model, independently of the model trained on the previous iteration.
5. Validate on the test set.
6. Save the result of the validation.
7. Repeat steps 3 to 6 K times. Each time, use the remaining fold as the test set. In the end, you should have validated the model on every fold.
8. To get the final score, average the results from step 6.

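The fold bookkeeping in the steps above can be sketched in plain Python (hypothetical helper name; scikit-learn's `KFold` provides the same splits):

```python
def kfold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    # Build k folds of (approximately) equal size.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    # Each fold serves as the test set exactly once.
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in kfold_splits(n_samples=10, k=5):
    pass  # train a fresh model on `train`, validate on `test`, save the score
```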
44
Q

Leave-one-out CV (LOOCV) is an extreme case of ________ CV: K is equal to N, where N is the number of __________ in the dataset. It is only used on _________ sample size datasets.

A

k-fold; samples; small

45
Q

What is the disadvantage of LOOCV?

A

LOOCV is more computationally expensive than K-fold, and it may take plenty of time to cross-validate the model.

46
Q

What is the algorithm of LOOCV?

A

1. Choose one sample from the dataset, which will be the test set.
2. The remaining N − 1 samples will be the training set.
3. Train the model on the training set. On each iteration, a new model must be trained.
4. Validate on the test set.
5. Save the result of the validation.
6. Repeat steps 1 to 5 N times, since for N samples we have N different training and test sets.
7. To get the final score, average the results from step 5.

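As the steps above suggest, LOOCV is just the K = N case; a minimal sketch in plain Python (hypothetical helper name; scikit-learn's `LeaveOneOut` is equivalent):

```python
def loocv_splits(n_samples):
    """Each sample is the test set exactly once; the other N - 1 train."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]

splits = list(loocv_splits(4))
print(len(splits))  # -> 4, i.e., N models must be trained
```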
47
Q

Stratified K-fold is a variation of the standard K-fold CV technique, designed to be effective in case of a large ______________ of the target value in the dataset. Stratified K-fold splits the dataset into K folds such that each fold contains approximately the same _____________ of samples of each __________ class as the complete set.

A

imbalance; percentage; target

48
Q

What is the algorithm of stratified k-fold CV?

A

1. Pick a number of folds, K.
2. Split the dataset into K folds. Each fold must contain approximately the same percentage of samples of each target class as the complete set.
3. Choose K − 1 folds as the training set. The remaining fold will be the test set.
4. Train the model on the training set. On each iteration, a new model must be trained.
5. Validate on the test set.
6. Save the result of the validation.
7. Repeat steps 3 to 6 K times. Each time, use the remaining fold as the test set. In the end, you should have validated the model on every fold.
8. To get the final score, average the results from step 6.

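The stratified assignment in step 2 can be sketched in plain Python (hypothetical helper name; scikit-learn's `StratifiedKFold` implements this properly): each class is dealt round-robin across the folds, so class proportions are preserved.

```python
def stratified_folds(labels, k):
    """Assign sample indices to k folds, preserving class proportions."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class round-robin
    return folds

labels = ["a"] * 8 + ["b"] * 4
folds = stratified_folds(labels, k=4)
# every fold holds 2 "a" samples and 1 "b" sample, like the full set (2:1)
```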
49
Q

What is the disadvantage of nested k-fold CV?

A

It is computationally expensive, because many models are trained and evaluated.

50
Q

Unlike the other CV techniques, which are designed to evaluate the quality of an algorithm, nested K-fold CV is used to train a model in which _________________ also need to be optimized. It estimates the ________________ _________ of the underlying model and its (hyper)parameter search. The __________ loop performs CV to identify the best ___________ and model ___________________ using the K − 1 data folds available at each iteration of the _______ loop. The model is trained for each outer loop step and evaluated on the held-out data fold (the test fold). This process yields ____ evaluations of the model performance, one for each data fold, and allows the model to be tested on every ___________.

A

hyperparameters; generalization error; inner; features; hyperparameters; outer; K; sample

51
Q

What is the algorithm of nested k-fold CV?

A

1. Define the set of hyperparameter combinations, C, for the current model. If the model has no hyperparameters, C is the empty set.
2. Divide the data into K folds with approximately equal distribution of cases in each target class.
3. (outer loop) For each fold k in the K folds:
- Set fold k as the test set.
- Perform automated feature selection on the remaining K − 1 folds.
- For each parameter combination c in C:
  ◦ (inner loop) For each fold k′ in the remaining K − 1 folds:
    * Set fold k′ as the validation set.
    * Train the model on the remaining K − 2 folds.
    * Evaluate the model performance on fold k′.
  ◦ Calculate the average performance over the K − 1 validation folds for parameter combination c.
- Train the model on the K − 1 folds using the hyperparameter combination that yields the best average performance over all steps of the inner loop.
- Evaluate the model performance on fold k.
4. Calculate the average performance over the K folds.

52
Q

The proportion of samples correctly classified is called _____________.

A

accuracy [(TP + TN) / (P + N)]

53
Q

True Positive Rate, Recall, or ______________ is the proportion of correctly classified samples among the ___________ samples. +1 means the model didn't miss any _____; low recall (< 0.5) means the model has a high number of _____ due to an _______________ class or ___________ model hyperparameters.

A

Sensitivity; positive; TP; FN; imbalanced; untuned [TPR = TP / P = TP / (TP + FN)]

54
Q

Positive Predictive Value or ___________ is the proportion of correctly classified samples among the positive _____________. +1 means the model produced no FP; low precision (< 0.5) means the model has a high number of FP due to an imbalanced class or untuned model hyperparameters.

A

Precision; predictions [precision = TP / (TP + FP)]

55
Q

What's the F1-score?

A

It is the harmonic mean of precision and recall. A high F1-score signals high precision and high recall; a low F1-score tells you (almost) nothing (low precision, or low recall?).

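From the confusion-matrix counts, precision, recall, and F1 can be computed directly (a plain-Python sketch with a hypothetical helper name):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
# precision = recall = 0.8, so their harmonic mean F1 is also 0.8
```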
56
Q

The ROC curve is a visual representation of model performance, drawn by calculating the _______ and _______ at every possible threshold. The perfect model has a TPR of 1 and an FPR of 0. The area under the ROC curve (AUC) represents the ____________ that the model will rank a positive sample ________ than a negative one.

A

TPR; FPR; probability; higher

57
Q

What is the Matthews Correlation Coefficient?

A

It is the correlation coefficient between the observed and predicted classifications. It can be used with classes of different sizes. +1 represents a perfect prediction; 0 is no better than a random prediction; −1 indicates total disagreement between prediction and observation.

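MCC can be computed from the four confusion-matrix counts using the standard formula, MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) (plain-Python sketch; the zero-denominator convention is an assumption):

```python
def matthews_corrcoef(tp, tn, fp, fn):
    """Correlation between observed and predicted binary classifications."""
    num = tp * tn - fp * fn
    den = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return num / den if den else 0.0  # convention: MCC = 0 on degenerate counts

print(matthews_corrcoef(tp=10, tn=10, fp=0, fn=0))  # -> 1.0 (perfect)
```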
58
Q

What are the evaluation metrics for regression models?

A

- Ratio of variance explained: provides a relative measure of the overall model quality. A value of 1 is the perfect regressor; a value of 0 is the trivial regressor (which predicts the mean).
- Mean absolute error: measures how far the predictions are from the actual output. It is non-differentiable, as opposed to RMSE, which is differentiable.
- Root mean squared error: evaluates the average error of the model; it is expressed in the units of the dependent variable and allows for understanding the actual model uncertainty.

59
Q

Imbalanced data is a common problem in machine learning, which brings challenges to feature ____________, class ___________, and evaluation, resulting in poor model performance.

A

correlation; separation

60
Q

The _____________ class is the class with the highest number of samples. The ___________ class is the class with the lowest number of samples. The _________ __________ for a given dataset is defined as the ratio between the size of the ___________ class and the size of the majority class. Empirically, class ratios of at least ____% do not significantly affect performance. This is no longer true, however, as the ratio becomes _________.

A

majority; minority; class ratio; minority; 25; smaller

61
Q

What are the challenges arising from datasets with very small class ratios (3)?

A

1) Modeling and learning feature correlation properties for the lower-sampled classes.
2) Detecting relevant feature class separation, i.e., identifying relevant features unique to each class.
3) Addition of a large bias to "standard" evaluation metrics, which are generally designed for similar class sizes.

These issues can be mitigated at one of the following levels:
- Model level: models can be modified to introduce heavier weighting for the smaller classes, penalizing errors on them more heavily during training.
- Evaluation level: alternative evaluation metrics must be used to account for class balance. Note that this solves only the problem of performance evaluation; it does not actually lead to better model classification.
- Data level: alternatively, the data itself can be transformed. If done smartly, new instances can be introduced in a way that allows models to better represent these classes. These methods are known as data samplers.

62
Q

What are the different sampling techniques (4)?

A

1) Random undersampling
2) Random oversampling
3) Synthetic Minority Oversampling Technique (SMOTE)
4) Other variants of SMOTE (such as BorderlineSMOTE, SVM-SMOTE, ADASYN)

63
Q

What is random undersampling?

A

It downsamples the larger classes by randomly selecting available instances from each class. The number of instances sampled is defined as part of an acceptable class-balance threshold and is variable. It ensures that no data is artificially generated and that all resulting data is a subset of the original input dataset. For high degrees of imbalance, this usually leads to a significant loss of available training data and ultimately reduces model performance.

64
Q

What is random oversampling?

A

The smaller classes are oversampled until the class sizes are balanced. With oversampling, instances can (and do) appear multiple times, introducing bias into the dataset. It solves the problem of data deletion (unlike random undersampling). However, the models will focus on the precise feature values of the repeated samples, rather than identifying relevant separating regions and boundaries.

65
Q

First, use random undersampling to trim the number of examples in the majority class, then use SMOTE to oversample the minority class to balance the class distribution. True or False?

A

True

66
Q

A general downside of SMOTE is that synthetic examples are created without considering the __________ class, possibly resulting in ambiguous examples if there is a strong ___________ between the classes.

A

majority; overlap

67
Q

Synthetic Minority Oversampling Technique (SMOTE) works by selecting examples that are close in __________ space, drawing a line between them, and generating a new sample at a point along the line. Specifically, a random example from the ___________ class is first chosen. Then k of the nearest neighbors of that example are found (k = 5 is the default value). A randomly selected __________ is chosen, and a synthetic example is created at a ___________ selected point between the two examples in the feature space.

A

feature; minority; neighbor; randomly

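The interpolation step can be sketched in plain Python (hypothetical helper name and simplifications: one synthetic sample at a time, brute-force neighbor search; the imbalanced-learn library provides a full SMOTE implementation):

```python
import random

def smote_sample(minority, k=5, seed=0):
    """One SMOTE step: pick a minority point, one of its k nearest
    neighbours, and interpolate at a random point between them."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # brute-force nearest neighbours among the other minority points
    others = sorted(
        (p for p in minority if p is not base),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
    )
    neighbour = rng.choice(others[:k])
    t = rng.random()  # interpolation factor in [0, 1)
    return tuple(a + t * (b - a) for a, b in zip(base, neighbour))

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote_sample(minority, k=2)  # lies between two minority points
```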
68
Q

The correct application of oversampling during k-fold cross-validation is to apply the method to the ___________ dataset only, then evaluate the model on the _____________ but ______________ test set.

A

training; stratified; non-transformed

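A sketch of the correct placement in plain Python (hypothetical names; a simple duplicate-based oversampler stands in for SMOTE). The key point is that the oversampler is called inside the loop, on the training fold only, and never on the test fold:

```python
import random

def oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until all classes are balanced
    (a stand-in for SMOTE in this sketch)."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for y, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples += group + extra
        out_labels += [y] * target
    return out_samples, out_labels

# inside each CV iteration:
#   train_s, train_y = oversample(train_s, train_y)  # training fold only
#   ...fit the model, then evaluate on the UNTOUCHED test fold
```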
69
Q

Random undersampling leads to a significant ________ of data; the sample of the majority class chosen could be ________ (it might not accurately represent the real world), the result of the analysis may be ____________, and the classifier performs poorly on real unseen data.

A

loss; biased; inaccurate

70
Q

Oversampling using SMOTE can increase the overlap between classes and can introduce additional _______. Additionally, it can lead to _____________.

A

noise; overfitting

71
Q

It might be preferable to use _______________, since it doesn't lead to any loss of information and, in some cases, may perform better than _________________. To balance these issues, certain scenarios might require a ______________ of both over- and undersampling.

A

oversampling; undersampling; combination

72
Q

What are the main challenges of machine learning (6)?

A

Data-related:
- Insufficient quantity of training data: it takes a lot of data for most Machine Learning algorithms to work properly. Even for very simple problems, you typically need thousands of examples; for complex problems, such as image or speech recognition, millions of examples may be required.
- Non-representative training data: to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. This is often harder than it sounds: if the sample is too small, you will have sampling noise (i.e., non-representative data resulting from chance). Even very large samples can be non-representative if the sampling method is flawed, a phenomenon known as sampling bias.
- Poor-quality data: if your training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will be harder for the system to detect the underlying patterns, so your system is less likely to perform well. It is often well worth the effort to spend time cleaning up your training data. If some instances are clearly outliers, it may help to discard them or to manually correct the errors. If some instances are missing a few features, you must decide whether to ignore this attribute altogether, ignore these instances, fill in the missing values, or train one model with the feature and one model without it, and so on.
- Irrelevant features: your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. The process of coming up with a good set of features to train on is called feature engineering.

Model-related:
- Overfitting the training data
- Underfitting the training data

73
Q

What is sampling noise?

A

Non-representative data resulting from chance.

74
Q

What is feature engineering and what does it involve?

A

It is the process of coming up with a good set of features to train on. It involves:
- Feature selection: selecting the most useful features to train on among existing features.
- Feature extraction: combining existing features to produce a more useful one (dimensionality reduction can help).
- Creating new features by gathering new data.