Overview ML - 01 Flashcards

(74 cards)

1
Q

What is machine learning?

A

It is the study of algorithms that improve their performance at some task with some experience: a learning algorithm gets better at its task as it gains experience from data.

2
Q

A spam filter based on Machine Learning techniques automatically learns words and phrases that are good ___________ of spam by detecting unusually __________ patterns of words.

A spam filter based on Machine Learning techniques automatically notices the new ___________ has become unusually frequent in spam flagged by the users, and it starts flagging them ____________ your intervention.

A

predictors; frequent; pattern; without

3
Q

What are the different types of machine learning?

A
  • Supervised learning: model is trained on labeled data
  • Unsupervised learning: model is trained on unlabeled data
  • Semi-supervised learning: model is trained on labeled and unlabeled data
  • Reinforcement learning: agent learns to perform a task to obtain maximum reward
4
Q

In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels. True or False?

A

True

5
Q

Define classification.

A

The task of predicting the class label of an input sample.

6
Q

Define regression.

A

The task of predicting a target numeric value for an input sample.

7
Q

Classification and regression are the two main tasks of ________________ learning.

A

supervised

8
Q

What are the most typical supervised learning algorithms?

A

1) k-nearest neighbor
2) Linear regression
3) Logistic regression
4) Support vector machines
5) Decision trees and random forests
6) Neural networks

9
Q

In ______________ learning, the training data is unlabeled. The system tries to learn without a teacher.

A

unsupervised

10
Q

What are the main tasks of unsupervised learning (4)?

A

1) Clustering: the task of detecting groups of similar data

2) Anomaly detection and novelty detection: the task of detecting whether a new instance looks like a normal one or is likely to be an anomaly

3) Visualization and dimensionality reduction: the task of visualizing or simplifying the data without losing too much information

4) Association rule learning: the task of discovering interesting relations between attributes

11
Q

What are the most typical unsupervised learning algorithms (4)?

A

1) k-means
2) Hierarchical clustering
3) Principal component analysis
4) Kernel PCA

12
Q

In semi-supervised learning, the training data is partially labeled, usually a lot of _____________ data and a little bit of ___________ data.

A

unlabeled; labeled

13
Q

Most semi-supervised algorithms are combinations of unsupervised and supervised algorithms. For example, Deep Belief Networks (DBNs) are based on _____________ components called restricted Boltzmann machines (RBMs) stacked on top of one another. RBMs are trained sequentially in an _________________ manner, and then the whole system is fine-tuned using _____________ learning techniques.

A

unsupervised; unsupervised; supervised

14
Q

In reinforcement learning, the learning system, called an _________, can observe the ________________, select and perform actions, and get ___________ in return (or penalties in the form of negative rewards). The agent must then learn by itself what is the best strategy, called __________, to get the most reward over time. The policy defines what __________ the agent should choose when in a given situation.

A

agent; environment; rewards; policy; action

15
Q

Define ML model.

A

Models form the central concept in machine learning as they are what is being learned from the data to solve a given task. There is a considerable range of machine learning models to choose from.

16
Q

What are the different types of models?

A

1) Geometric models: constructed directly in instance space, using geometric concepts such as lines, planes, and distances

2) Probabilistic models: statistical models that capture the inherent uncertainty in data and incorporate it into their predictions by means of probability distributions

3) Logical models: models of this type can be easily translated into rules that are understandable by humans

17
Q

What are the stages of the machine learning workflow (6)?

A

1) Data collection
2) Data preparation
3) Choosing a learning algorithm
4) Training the model
5) Evaluating the model
6) Predictions

18
Q

What are the 3 types of data?

A

1) Numeric (e.g., income, age)
2) Categorical (e.g., gender, nationality)
3) Ordinal (e.g., low/medium/high)

19
Q

Data pre-processing is the process of cleaning the raw data, i.e., converting data collected in the real world into a clean dataset. Whenever data is gathered from different sources, it is collected in a raw format, and this raw data isn't feasible for analysis. True or False?

A

True

20
Q

What are some pre-processing techniques?

A

1) Conversion of data
2) Scaling data
3) Missing values
4) Outliers detection

21
Q

Why is using feature scaling important?

A

1) Scaling guarantees that all features are on a comparable scale and have comparable ranges: the magnitude of the features has an impact on many machine learning techniques, and larger-scale features may dominate the learning process and have an excessive impact on the outcomes.

2) Algorithm performance improvement: when the features are scaled, several machine learning methods, including gradient descent-based algorithms, distance-based algorithms (such as k-nearest neighbors), and support vector machines, perform better or converge more quickly.

3) Preventing numerical instability: numerical instability can be prevented by avoiding significant scale disparities between features. Examples include distance calculations or matrix operations, where features with radically differing scales can result in numerical overflow or underflow problems.

4) Scaling ensures that each feature is considered equally during the learning process: without scaling, larger-scale features could dominate the learning, producing skewed outcomes.
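The two common scalings can be sketched in a few lines of plain Python (hypothetical helper names; a library such as scikit-learn provides production-grade implementations):

```python
def standardize(values):
    """Z-score standardization: zero mean, unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def min_max_scale(values, lo=0.0, hi=1.0):
    """Range scaling: map values linearly into [lo, hi]."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

incomes = [20_000, 35_000, 50_000, 120_000]
print(min_max_scale(incomes))  # -> [0.0, 0.15, 0.3, 1.0]
```

Note how the single large income compresses the other three into a narrow band of the [0, 1] range: this is the outlier sensitivity of range scaling mentioned in the cards below.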

22
Q

Standardization makes all variables directly ______________ and ___________ handle ____________.

A

comparable; cannot; outliers

23
Q

Range scaling requires a specific range of values, may ____________ data and ___________ are an issue.

A

compress; outliers

24
Q

It is important to fit the scalers to the _____________ data only, not to the full dataset. Only then can you use them to transform the training set and the test set (and new data).

A

training
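A minimal sketch of this rule in plain Python (hypothetical helper name; scikit-learn's `fit`/`transform` split follows the same pattern): the scaling parameters are learned from the training set only and then reused, unchanged, on the test set and on new data.

```python
def fit_standardizer(train):
    """Learn mean and std from the TRAINING data only; return a transform."""
    n = len(train)
    mean = sum(train) / n
    std = (sum((v - mean) ** 2 for v in train) / n) ** 0.5

    def transform(values):
        # Reuse the training-set parameters, whatever data comes in.
        return [(v - mean) / std for v in values]

    return transform

scale = fit_standardizer([10.0, 20.0, 30.0])  # fit on the training set only
train_scaled = scale([10.0, 20.0, 30.0])
test_scaled = scale([25.0])                   # same parameters, no refitting
```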

25
Q

What are the different strategies to handle missing values?

A

- Ignoring the missing values: we can remove the row or column of data. However, this shouldn't be done if there are a lot of missing values in the dataset.
- Data imputation: univariate imputation (replace each missing value with a common statistic, like the mode, mean, or median); ML imputation (predict what value should be present at the empty position using the existing data, e.g., kNN imputation).

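Univariate (mean) imputation can be sketched in plain Python (hypothetical helper name; libraries such as scikit-learn offer `SimpleImputer` and kNN-based imputers):

```python
def impute_mean(column, missing=None):
    """Univariate imputation: replace missing entries with the column mean."""
    observed = [v for v in column if v is not missing]
    mean = sum(observed) / len(observed)
    return [mean if v is missing else v for v in column]

ages = [25, None, 40, None, 31]
print(impute_mean(ages))  # -> [25, 32.0, 40, 32.0, 31]
```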
26
Q

Data imputation is essentially a ___________ process: as with any modeling process, it must be made with the __________ set. The model for imputation is part of the global model and must always be used before the main ______________ model. We want to put in the empty cells values that will not disrupt the final model: identify a "good" value to fill in the missing values so that the overall dataset properties do not ___________.

A

modeling; training; inference; change

27
Q

Choosing a learning algorithm depends upon the type of ___________ that needs to be solved and the type of _______ we have.

A

problem; data

28
Q

If the problem is to classify and the data is labeled, ____________ algorithms are used. If the problem is to perform a regression task and the data is labeled, ______________ algorithms are used. If the problem is to create clusters and the data is unlabeled, _____________ algorithms are used.

A

classification; regression; clustering

29
Q

Data should be visualized and inspected before fitting a model. What is this step called?

A

Exploratory data analysis

30
Q

Different features may require different types of data scaling and imputation. True or False?

A

True

31
Q

Typically, do data scaling __________ imputation. Imputing data before scaling may __________ the overall properties of each feature and ____________ its contribution to the overall model.

A

before; change; disrupt

32
Q

The best models are often the __________ models. When in doubt, select the simplest model available (fewer hyperparameter ______________, fewer _________ parameters).

A

simplest; combinations; fitted

33
Q

Overfitting occurs when a model fits perfectly to its ____________ data: the trained model overfits to the training data rather than generalizing to new and ____________ data. We avoid overfitting by ____________ the amount of training data, augmenting data, feature ____________, cross-validation, _______________, ...

A

training; unknown; increasing; selection; regularization

34
Q

______________ occurs when a model performs poorly on the ___________ data: the model is unable to capture the relationship between the input examples and the target value (i.e., the model is too simple). We avoid it by selecting a more powerful ___________ (with more parameters), feeding better ____________, or ___________ regularization.

A

Underfitting; training; model; features; reducing

35
Q

Cross-validation (CV) is a technique for ____________ a machine learning model and ___________ its performance. It helps to compare and select an appropriate _________ for the specific predictive modeling problem.

A

evaluating; testing; model

36
Q

CV tends to have a higher bias than other methods used to count the model's efficiency scores. True or False?

A

False

37
Q

What is the general algorithm to cross-validate a model?

A

1. Divide the dataset into two parts: one for training and the other for testing.
2. Train the model on the training set.
3. Validate the model on the test set (or an independent validation set).
4. Repeat steps 1 to 3 a number of times, depending on the CV method you are using.

38
Q

What are the disadvantages of hold-out CV (2)?

A

1) The dataset might not be evenly distributed: the training set may not represent the test set. The two sets may differ a lot, and one of them might be easier or harder than the other.
2) Testing the model only once is a bottleneck: the result obtained by the hold-out technique may be inaccurate.

39
Q

What is the hold-out CV algorithm?

A

1. Divide the dataset into two parts: the training set and the test set. Usually, 80% of the dataset goes to the training set and 20% to the test set, but you may choose any split that suits you better.
2. Train the model on the training set.
3. Validate the model on the test set.
4. Save the result of the validation.

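The steps above can be sketched in plain Python (hypothetical helper name; scikit-learn's `train_test_split` does the same job):

```python
import random

def holdout_split(data, test_ratio=0.2, seed=42):
    """Shuffle the sample indices, then split into training and test sets."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(len(data) * test_ratio)
    test = [data[i] for i in indices[:n_test]]
    train = [data[i] for i in indices[n_test:]]
    return train, test

train, test = holdout_split(list(range(100)))
print(len(train), len(test))  # -> 80 20
```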
40
Q

Hold-out CV is the ___________ and most ___________ technique. It is usually used on __________ datasets, as it requires training the model only ________.

A

simplest; common; large; once

41
Q

K-fold CV is a technique that ____________ the disadvantages of the hold-out method. In general, it is always better to use the K-fold technique instead of hold-out. It gives a more _________ and ________________ result, since training and testing are performed on several different _________ of the dataset.

A

minimizes; stable; trustworthy; parts

42
Q

What is the disadvantage of k-fold CV?

A

Increasing K results in training more models, and the training process might be really expensive and time-consuming.

43
Q

What is the algorithm of k-fold CV?

A

1. Pick a number of folds, K. Usually, K is 5 or 10, but you can choose any number less than the dataset's length.
2. Split the dataset into K equal (if possible) parts, called folds.
3. Choose K − 1 folds as the training set. The remaining fold will be the test set.
4. Train the model on the training set. On each iteration of CV, you must train a new model, independently of the model trained on the previous iteration.
5. Validate on the test set.
6. Save the result of the validation.
7. Repeat steps 3 to 6 K times. Each time, use the remaining fold as the test set. In the end, you should have validated the model on every fold.
8. To get the final score, average the results from step 6.

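The fold bookkeeping in the steps above can be sketched in plain Python (hypothetical helper name; scikit-learn's `KFold` provides the same splits):

```python
def kfold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    # Build k folds of (approximately) equal size.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    # Each fold serves as the test set exactly once.
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in kfold_splits(n_samples=10, k=5):
    pass  # train a fresh model on `train`, validate on `test`, save the score
```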
44
Q

Leave-one-out CV (LOOCV) is an extreme case of ________ CV: K is equal to N, where N is the number of __________ in the dataset. It is only used on _________ sample size datasets.

A

k-fold; samples; small

45
Q

What is the disadvantage of LOOCV?

A

LOOCV is more computationally expensive than K-fold, and it may take plenty of time to cross-validate the model.

46
Q

What is the algorithm of LOOCV?

A

1. Choose one sample from the dataset, which will be the test set.
2. The remaining N − 1 samples will be the training set.
3. Train the model on the training set. On each iteration, a new model must be trained.
4. Validate on the test set.
5. Save the result of the validation.
6. Repeat steps 1 to 5 N times, since for N samples we have N different training and test sets.
7. To get the final score, average the results from step 5.

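As the steps above suggest, LOOCV is just the K = N case; a minimal sketch in plain Python (hypothetical helper name; scikit-learn's `LeaveOneOut` is equivalent):

```python
def loocv_splits(n_samples):
    """Each sample is the test set exactly once; the other N - 1 train."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]

splits = list(loocv_splits(4))
print(len(splits))  # -> 4, i.e., N models must be trained
```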
47
Q

Stratified K-fold is a variation of the standard K-fold CV technique, designed to be effective in case of a large ______________ of the target value in the dataset. Stratified K-fold splits the dataset into K folds such that each fold contains approximately the same _____________ of samples of each __________ class as the complete set.

A

imbalance; percentage; target

48
Q

What is the algorithm of stratified k-fold CV?

A

1. Pick a number of folds, K.
2. Split the dataset into K folds. Each fold must contain approximately the same percentage of samples of each target class as the complete set.
3. Choose K − 1 folds as the training set. The remaining fold will be the test set.
4. Train the model on the training set. On each iteration, a new model must be trained.
5. Validate on the test set.
6. Save the result of the validation.
7. Repeat steps 3 to 6 K times. Each time, use the remaining fold as the test set. In the end, you should have validated the model on every fold.
8. To get the final score, average the results from step 6.

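The stratified assignment in step 2 can be sketched in plain Python (hypothetical helper name; scikit-learn's `StratifiedKFold` implements this properly): each class is dealt round-robin across the folds, so class proportions are preserved.

```python
def stratified_folds(labels, k):
    """Assign sample indices to k folds, preserving class proportions."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class round-robin
    return folds

labels = ["a"] * 8 + ["b"] * 4
folds = stratified_folds(labels, k=4)
# every fold holds 2 "a" samples and 1 "b" sample, like the full set (2:1)
```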
49
Q

What is the disadvantage of nested k-fold CV?

A

It is computationally expensive, because many models are trained and evaluated.

50
Q

Unlike the other CV techniques, which are designed to evaluate the quality of an algorithm, nested K-fold CV is used to train a model in which _________________ also need to be optimized. It estimates the ________________ _________ of the underlying model and its (hyper)parameter search. The __________ loop performs CV to identify the best ___________ and model ___________________ using the K − 1 data folds available at each iteration of the _______ loop. The model is trained for each outer loop step and evaluated on the held-out data fold (the test fold). This process yields ____ evaluations of the model performance, one for each data fold, and allows the model to be tested on every ___________.

A

hyperparameters; generalization error; inner; features; hyperparameters; outer; K; sample

51
Q

What is the algorithm of nested k-fold CV?

A

1. Define the set of hyperparameter combinations, C, for the current model. If the model has no hyperparameters, C is the empty set.
2. Divide the data into K folds with approximately equal distribution of cases in each target class.
3. (outer loop) For each fold k in the K folds:
- Set fold k as the test set.
- Perform automated feature selection on the remaining K − 1 folds.
- For each parameter combination c in C:
  ◦ (inner loop) For each fold k′ in the remaining K − 1 folds:
    * Set fold k′ as the validation set.
    * Train the model on the remaining K − 2 folds.
    * Evaluate the model performance on fold k′.
  ◦ Calculate the average performance over the K − 1 validation folds for parameter combination c.
- Train the model on the K − 1 folds using the hyperparameter combination that yields the best average performance over all steps of the inner loop.
- Evaluate the model performance on fold k.
4. Calculate the average performance over the K folds.

52
Q

The proportion of samples correctly classified is called _____________.

A

accuracy [(TP + TN) / (P + N)]

53
Q

True Positive Rate, Recall, or ______________ is the proportion of correctly classified samples among the ___________ samples. +1 means the model didn't miss any _____; low recall (< 0.5) means the model has a high number of _____ due to an _______________ class or ___________ model hyperparameters.

A

Sensitivity; positive; TP; FN; imbalanced; untuned [TPR = TP / P = TP / (TP + FN)]

54
Q

Positive Predictive Value or ___________ is the proportion of correctly classified samples among the positive _____________. +1 means the model produced no FP; low precision (< 0.5) means the model has a high number of FP due to an imbalanced class or untuned model hyperparameters.

A

Precision; predictions [precision = TP / (TP + FP)]

55
Q

What's the F1-score?

A

It is the harmonic mean of precision and recall. A high F1-score signals high precision and high recall; a low F1-score tells you (almost) nothing (low precision, or low recall?).

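From the confusion-matrix counts, precision, recall, and F1 can be computed directly (a plain-Python sketch with a hypothetical helper name):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
# precision = recall = 0.8, so their harmonic mean F1 is also 0.8
```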
56
Q

The ROC curve is a visual representation of model performance, drawn by calculating the _______ and _______ at every possible threshold. The perfect model has a TPR of 1 and an FPR of 0. The area under the ROC curve (AUC) represents the ____________ that the model will rank a positive sample ________ than a negative one.

A

TPR; FPR; probability; higher

57
Q

What is the Matthews Correlation Coefficient?

A

It is the correlation coefficient between the observed and predicted classifications. It can be used with classes of different sizes. +1 represents a perfect prediction; 0 is no better than a random prediction; −1 indicates total disagreement between prediction and observation.

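MCC can be computed from the four confusion-matrix counts using the standard formula, MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) (plain-Python sketch; the zero-denominator convention is an assumption):

```python
def matthews_corrcoef(tp, tn, fp, fn):
    """Correlation between observed and predicted binary classifications."""
    num = tp * tn - fp * fn
    den = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return num / den if den else 0.0  # convention: MCC = 0 on degenerate counts

print(matthews_corrcoef(tp=10, tn=10, fp=0, fn=0))  # -> 1.0 (perfect)
```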
58
Q

What are the evaluation metrics for regression models?

A

- Ratio of variance explained: provides a relative measure of the overall model quality. A value of 1 is the perfect regressor; a value of 0 is the trivial regressor (which predicts the mean).
- Mean absolute error: measures how far the predictions are from the actual output. It is non-differentiable, as opposed to RMSE, which is differentiable.
- Root mean squared error: evaluates the average error of the model; it is expressed in the units of the dependent variable and allows for understanding the actual model uncertainty.

59
Q

Imbalanced data is a common problem in machine learning, which brings challenges to feature ____________, class ___________, and evaluation, resulting in poor model performance.

A

correlation; separation

60
Q

The _____________ class is the class with the highest number of samples. The ___________ class is the class with the lowest number of samples. The _________ __________ for a given dataset is defined as the ratio between the size of the ___________ class and the size of the majority class. Empirically, class ratios of at least ____% do not significantly affect performance. This is no longer true, however, as the ratio becomes _________.

A

majority; minority; class ratio; minority; 25; smaller

61
Q

What are the challenges arising from datasets with very small class ratios (3)?

A

1) Modeling and learning feature correlation properties for the lower-sampled classes.
2) Detecting relevant feature class separation, i.e., identifying relevant features unique to each class.
3) Addition of a large bias to "standard" evaluation metrics, which are generally designed for similar class sizes.

These issues can be mitigated at one of the following levels:
- Model level: models can be modified to introduce heavier weighting for the smaller classes, penalizing errors on them more heavily during training.
- Evaluation level: alternative evaluation metrics must be used to account for class balance. Note that this solves only the problem of performance evaluation; it does not actually lead to better model classification.
- Data level: alternatively, the data itself can be transformed. If done smartly, new instances can be introduced in a way that allows models to better represent these classes. These methods are known as data samplers.

62
Q

What are the different sampling techniques (4)?

A

1) Random undersampling
2) Random oversampling
3) Synthetic Minority Oversampling Technique (SMOTE)
4) Other variants of SMOTE (such as BorderlineSMOTE, SVM-SMOTE, ADASYN)

63
Q

What is random undersampling?

A

It downsamples the larger classes by randomly selecting available instances from each class. The number of instances sampled is defined as part of an acceptable class-balance threshold and is variable. It ensures that no data is artificially generated and that all resulting data is a subset of the original input dataset. For high degrees of imbalance, this usually leads to a significant loss of available training data and ultimately reduces model performance.

64
Q

What is random oversampling?

A

The smaller classes are oversampled until the class sizes are balanced. With oversampling, instances can (and do) appear multiple times, introducing bias into the dataset. It solves the problem of data deletion (unlike random undersampling). However, the models will focus on the precise feature values of the repeated samples, rather than identifying relevant separating regions and boundaries.

65
Q

First, use random undersampling to trim the number of examples in the majority class, then use SMOTE to oversample the minority class to balance the class distribution. True or False?

A

True

66
Q

A general downside of SMOTE is that synthetic examples are created without considering the __________ class, possibly resulting in ambiguous examples if there is a strong ___________ between the classes.

A

majority; overlap

67
Q

Synthetic Minority Oversampling Technique (SMOTE) works by selecting examples that are close in __________ space, drawing a line between them, and generating a new sample at a point along the line. Specifically, a random example from the ___________ class is first chosen. Then k of the nearest neighbors of that example are found (k = 5 is the default value). A randomly selected __________ is chosen, and a synthetic example is created at a ___________ selected point between the two examples in the feature space.

A

feature; minority; neighbor; randomly

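The interpolation step can be sketched in plain Python (hypothetical helper name and simplifications: one synthetic sample at a time, brute-force neighbor search; the imbalanced-learn library provides a full SMOTE implementation):

```python
import random

def smote_sample(minority, k=5, seed=0):
    """One SMOTE step: pick a minority point, one of its k nearest
    neighbours, and interpolate at a random point between them."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # brute-force nearest neighbours among the other minority points
    others = sorted(
        (p for p in minority if p is not base),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
    )
    neighbour = rng.choice(others[:k])
    t = rng.random()  # interpolation factor in [0, 1)
    return tuple(a + t * (b - a) for a, b in zip(base, neighbour))

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote_sample(minority, k=2)  # lies between two minority points
```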
68
Q

The correct application of oversampling during k-fold cross-validation is to apply the method to the ___________ dataset only, then evaluate the model on the _____________ but ______________ test set.

A

training; stratified; non-transformed

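A sketch of the correct placement in plain Python (hypothetical names; a simple duplicate-based oversampler stands in for SMOTE). The key point is that the oversampler is called inside the loop, on the training fold only, and never on the test fold:

```python
import random

def oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until all classes are balanced
    (a stand-in for SMOTE in this sketch)."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for y, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples += group + extra
        out_labels += [y] * target
    return out_samples, out_labels

# inside each CV iteration:
#   train_s, train_y = oversample(train_s, train_y)  # training fold only
#   ...fit the model, then evaluate on the UNTOUCHED test fold
```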
69
Q

Random undersampling leads to a significant ________ of data; the sample of the majority class chosen could be ________ (it might not accurately represent the real world), the result of the analysis may be ____________, and the classifier performs poorly on real unseen data.

A

loss; biased; inaccurate

70
Q

Oversampling using SMOTE can increase the overlap between classes and can introduce additional _______. Additionally, it can lead to _____________.

A

noise; overfitting

71
Q

It might be preferable to use _______________, since it doesn't lead to any loss of information and, in some cases, may perform better than _________________. To balance these issues, certain scenarios might require a ______________ of both over- and undersampling.

A

oversampling; undersampling; combination

72
Q

What are the main challenges of machine learning (6)?

A

Data-related:
- Insufficient quantity of training data: it takes a lot of data for most Machine Learning algorithms to work properly. Even for very simple problems, you typically need thousands of examples; for complex problems, such as image or speech recognition, millions of examples may be required.
- Non-representative training data: to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. This is often harder than it sounds: if the sample is too small, you will have sampling noise (i.e., non-representative data resulting from chance). Even very large samples can be non-representative if the sampling method is flawed, a phenomenon known as sampling bias.
- Poor-quality data: if your training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will be harder for the system to detect the underlying patterns, so your system is less likely to perform well. It is often well worth the effort to spend time cleaning up your training data. If some instances are clearly outliers, it may help to discard them or to manually correct the errors. If some instances are missing a few features, you must decide whether to ignore this attribute altogether, ignore these instances, fill in the missing values, or train one model with the feature and one model without it, and so on.
- Irrelevant features: your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. The process of coming up with a good set of features to train on is called feature engineering.

Model-related:
- Overfitting the training data
- Underfitting the training data

73
Q

What is sampling noise?

A

Non-representative data resulting from chance.

74
Q

What is feature engineering and what does it involve?

A

It is the process of coming up with a good set of features to train on. It involves:
- Feature selection: selecting the most useful features to train on among existing features.
- Feature extraction: combining existing features to produce a more useful one (dimensionality reduction can help).
- Creating new features by gathering new data.