Machine Learning Flashcards

(144 cards)

1
Q

Imagine you work at a major credit card company and are given a dataset of 600,000 credit card transactions to build a fraud detection model.

How would you approach this task? Write your answer in the comments.

A

1. Understanding the Data and Preprocessing

  • Explore the data: check distributions, missing values, outliers.
  • Handle imbalanced classes (fraud is rare) using oversampling, undersampling, or synthetic data (SMOTE).
  • Feature engineering: transaction amount, time features, merchant type, location, user behavior patterns.
  • Normalize or scale numeric features; encode categorical features.

2. Model Selection

  • Start with tree-based models (XGBoost, LightGBM) or neural networks.
  • Consider anomaly detection methods for rare fraud patterns (one-class SVM, autoencoder-based methods, etc.).

3. Training and Validation

  • Split data into train/validation/test sets.
  • Use metrics suitable for imbalanced data: Precision, Recall, F1, ROC-AUC, PR-AUC.

4. Handling Imbalance & Bias:

  • Weighted loss functions (assign a higher weight to the rare fraud class in the loss function) or sampling strategies (e.g., SMOTE oversampling) to account for rare fraud cases.
  • Monitor for biases (e.g., merchant type, geography).

5. Deployment & Monitoring:

  • Real-time scoring for incoming transactions.
  • Track model drift and retrain the model periodically.
  • Implement alerting thresholds with human-in-the-loop verification.
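The imbalance-handling step above can be made concrete with "balanced" class weights for a weighted loss. A minimal pure-Python sketch; `balanced_class_weights` is a hypothetical helper implementing the common n_samples / (n_classes * class_count) heuristic:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so the rare fraud class gets a proportionally larger weight in the loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Toy labels: 0 = legitimate (98%), 1 = fraud (2%)
labels = [0] * 98 + [1] * 2
weights = balanced_class_weights(labels)
```

Passing such weights into a model's loss (e.g., via `class_weight` in scikit-learn estimators) penalizes missed fraud cases far more heavily than missed legitimate ones.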
2
Q

What does it mean to track model drift and update periodically in fraud detection systems?

A

Model drift occurs when the data distribution changes over time, causing the model’s performance to degrade.

How to track:

  • Monitor key metrics (Precision, Recall, F1, PR-AUC) over time.
  • Track feature distributions and detect shifts (population drift, covariate shift).
  • Compare recent predictions with historical patterns.

Updating the model:

  • Retrain periodically with new labeled data.
  • Use incremental learning or online learning if possible.
  • Incorporate feedback from human-in-the-loop reviews to improve accuracy.

Purpose:

  • Maintain high fraud detection performance.
  • Adapt to new fraud patterns or changes in customer behavior.
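Tracking feature-distribution shift can be done with, for example, the Population Stability Index, one common drift metric (the card does not prescribe it). `psi` below is a hypothetical numpy helper with illustrative bin and sample choices:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a recent
    sample: sum((p - q) * ln(p / q)) over shared histogram bins."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    p, _ = np.histogram(expected, bins=edges)
    q, _ = np.histogram(actual, bins=edges)
    # clip avoids division by zero in empty bins
    p = np.clip(p / p.sum(), 1e-6, None)
    q = np.clip(q / q.sum(), 1e-6, None)
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 5_000)
no_drift = psi(baseline, rng.normal(0, 1, 5_000))    # near 0
drifted = psi(baseline, rng.normal(0.5, 1, 5_000))   # clearly larger
```

A rising PSI on key features is a signal to investigate and possibly retrain.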
3
Q

Let’s say your manager asks you to build a model with a neural network to solve a business problem.

How would you justify the complexity of building such a model and explain the predictions to non-technical stakeholders?

A

Justify complexity: Neural networks capture complex patterns and can improve accuracy over simpler models.

Explain predictions: Use visualizations (SHAP, feature importance) to show which factors influence predictions, focusing on business impact rather than technical details (e.g., “This model helps identify high-value customers likely to churn, enabling targeted retention campaigns”).

4
Q

Let’s say that you’re training a classification model.

How would you combat overfitting when building tree-based models?

A
  • Limit tree depth: Prevents overly complex splits (set a maximum depth).
  • Minimum samples per leaf/node: Ensures splits have enough data.
  • Pruning: Remove branches that add little predictive value.
  • Use ensembles: Random Forest or Gradient Boosting reduces overfitting.
  • Regularization: Apply constraints like max features, max leaf nodes, or learning rate.
  • Cross-validation: Tune hyperparameters and evaluate generalization.
  • Monitoring: Track training vs validation performance to detect overfitting early, then adjust tree complexity or ensemble parameters accordingly.
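The effect of depth limits and leaf-size minimums can be seen in a few lines, assuming scikit-learn is available. The noisy toy dataset and the specific hyperparameter values are illustrative choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Noisy toy data: the label depends on x0, plus 20% label noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(400) < 0.2
y = np.where(flip, 1 - y, y)
X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

# Unconstrained tree: memorizes the training set, noise included.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Constrained tree: limited depth and minimum leaf size.
shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                                 random_state=0).fit(X_tr, y_tr)
```

The unconstrained tree reaches perfect training accuracy by fitting the flipped labels; the constrained tree cannot, which is exactly the train-vs-validation gap the monitoring bullet tells you to watch.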
5
Q

Let’s say you work as a data scientist at a bank.

You are tasked with building a decision tree model to predict if a borrower will pay back a personal loan they are taking out.

How would you evaluate whether using a decision tree algorithm is the correct model for the problem?

Let’s say you move forward with the decision tree model. How would you evaluate the performance of the model before deployment and after?

A

Evaluating suitability of a decision tree:

  • Check whether the problem involves tabular, structured data with clear features and labels — decision trees handle these well.
  • Consider interpretability: trees are easy to explain to stakeholders, important in banking.
  • Compare with baseline models (logistic regression, random forest) on accuracy and interpretability.
  • Assess if the data has non-linear relationships or interactions — trees capture these naturally.

Performance evaluation before deployment:

  • Split data into train, validation, and test sets.
  • Use metrics suitable for class imbalance: Precision, Recall, F1-score, ROC-AUC.
  • Perform cross-validation for stable estimates.
  • Check for overfitting by comparing train vs validation performance.
  • Visualize feature importance to verify meaningful patterns.

Performance evaluation after deployment:

  • Monitor real-world predictions: track accuracy, fraud/missed repayment rates, false positives/negatives.
  • Detect data drift: changes in borrower behavior or features over time.
  • Incorporate human-in-the-loop feedback to correct errors and retrain the model periodically.

Summary:
Decision trees are suitable for structured, interpretable predictions. Evaluate with cross-validation and proper metrics before deployment, and continuously monitor performance and drift after deployment.

6
Q

Let’s say that you work at a bank that wants to build a model to detect fraud on the platform.

The bank also wants to implement a text messaging service that will text customers when the model detects a fraudulent transaction, so the customer can approve or deny the transaction with a text response.

How would we build this model?

A

1. Data Preparation:

  • Collect historical transaction data with fraud labels.
  • Feature engineering: transaction amount, time, location, merchant, user behavior patterns.
  • Handle class imbalance (fraud is rare) with sampling or weighted loss.

2. Model Selection & Training:

  • Use tree-based models (XGBoost, LightGBM) or neural networks for structured data.
  • Consider anomaly detection methods for rare or novel fraud patterns.
  • Validate with cross-validation and metrics: Precision, Recall, F1, ROC-AUC.

3. Fraud Scoring & Thresholding:

  • Model outputs a fraud probability score.
  • Set thresholds to trigger alerts (e.g., >0.8 probability), tuned to the business risk tolerance.

4. Text Messaging Service:

  • Integrate with an SMS provider (e.g., Twilio) to send alerts and capture the customer’s approve/deny reply.

5. Feedback Loop & Model Updating:

  • Incorporate customer responses to retrain or fine-tune the model.
  • Monitor drift and retrain periodically to adapt to new fraud patterns.

6. Security & Compliance:

  • Encrypt sensitive data in transit and at rest.
  • Comply with banking regulations (PCI DSS, GDPR).
  • Log alerts and actions for audit purposes.
7
Q

What is feature engineering in machine learning?

A

Feature engineering is the process of creating, transforming, or selecting input variables (features) from raw data to improve a model’s performance and help it learn more effectively.

Purpose:

  • Highlight important patterns in the data.
  • Reduce noise and irrelevant information.
  • Make data suitable for the chosen model.

Examples:

  • From transaction data: compute average spending, time since last purchase, or merchant category.
  • From text: extract word counts, sentiment, or keywords.
  • From images: extract color histograms or edges.
8
Q

What is a Random Forest and how does it work?
Why does it generalize better than a single decision tree?

A

A Random Forest is an ensemble learning algorithm that builds multiple decision trees and combines their outputs to improve accuracy and reduce overfitting.

It works by using bagging (bootstrap aggregation) — each tree is trained on a random sample of the data (with replacement), and at each node split, it selects a random subset of features instead of using all features.

This introduces diversity among trees, reducing correlation and variance.

  • For classification, predictions are made by majority voting across trees.
  • For regression, predictions are the average of all tree outputs.

The core idea is that while individual trees may overfit, their ensemble average generalizes much better.

In short: Random Forest = many decorrelated trees → aggregated → stable, accurate model.
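The bagging-plus-voting idea above can be sketched by hand, assuming scikit-learn is available. This is illustrative only (real Random Forest implementations also handle feature randomness internally; here `max_features="sqrt"` stands in for the per-split feature subsetting):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

trees = []
for seed in range(25):
    # Bagging: each tree trains on a bootstrap sample (rows drawn with replacement).
    idx = rng.integers(0, len(X), len(X))
    t = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(t.fit(X[idx], y[idx]))

# Classification: majority vote across the 25 trees.
votes = np.stack([t.predict(X) for t in trees])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (ensemble_pred == y).mean()
```

For regression, the same loop would average tree outputs instead of voting.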

9
Q

How does a Random Forest differ from a single decision tree? Is this easier to interpret?

A

A single decision tree builds a model by recursively splitting data on the most informative features. While it’s easy to interpret, it tends to overfit — capturing noise and leading to high variance.

In contrast, a Random Forest is an ensemble of many decision trees trained on different bootstrap samples and random feature subsets. It combines their predictions (via majority vote or averaging), which:
- Reduces overfitting by averaging out noise.
- Improves generalization due to decorrelation among trees.
- Increases robustness and predictive accuracy.

Key differences:

| Aspect           | Decision Tree | Random Forest                      |
| ---------------- | ------------- | ---------------------------------- |
| Model type       | Single tree   | Ensemble of many trees             |
| Variance         | High          | Low                                |
| Bias             | Low           | Slightly higher but better balance |
| Overfitting      | Common        | Greatly reduced                    |
| Interpretability | High          | Lower                              |
| Accuracy         | Moderate      | Higher                             |

Summary:

> A Random Forest sacrifices interpretability for higher stability and performance by combining many independent trees.

10
Q

What are the main advantages of using Random Forest?

A

Random Forests offer several advantages that make them one of the most widely used ensemble methods in practice:

  1. High Accuracy & Robustness:
  2. Resistance to Overfitting:
    The randomness in data sampling (bagging) and feature selection decorrelates trees, preventing overfitting compared to a single decision tree.
  3. Handles Both Classification and Regression:
  4. Feature Importance Estimation:
    They naturally provide feature importance scores, helping identify the most influential variables in prediction.
  5. Works Well with Missing or Noisy Data:
    Random Forests are relatively robust to missing values and outliers, as multiple trees can compensate for corrupted data.
  6. Nonlinear and Nonparametric:
    They don’t assume any data distribution or linearity, making them suitable for complex relationships.
  7. Built-in Validation via OOB Error:
    Out-of-Bag samples offer an internal estimate of test error without needing a separate validation set.

In summary:

> Random Forests combine simplicity, robustness, and high predictive power — a strong default choice for many ML problems.

11
Q

How does Random Forest achieve feature randomness?

A

Random Forest introduces feature randomness to decorrelate trees and prevent any single feature from dominating all splits.

How it works:

  1. At each node of a decision tree, instead of considering all features to find the best split, Random Forest selects a random subset of features (denoted as max_features).
    • For classification: typically √p features, where p = total features
    • For regression: typically p/3 features
  2. The tree chooses the best split only among this subset.

Why it matters:

  • Prevents strong predictors from being repeatedly chosen, ensuring diversity among trees.
  • Increases robustness and reduces correlation between trees, improving ensemble performance.

In short:

> Feature randomness + bagging → decorrelated trees → lower variance and better generalization.

12
Q

🌲 What is out-of-bag (OOB) error in Random Forest?

A

Out-of-Bag (OOB) error is an internal validation metric in Random Forests that estimates model performance without a separate test set.

How it works:

  1. Each tree is trained on a bootstrap sample (~63% of the data).
  2. The remaining ~37% of samples not included in that tree are called out-of-bag samples.
  3. Each OOB sample is predicted only by trees that didn’t see it during training.
  4. Aggregating predictions for all OOB samples gives an estimate of the model’s generalization error.

Why it’s useful:

  • Provides a built-in cross-validation metric.
  • Efficient — no need for extra validation split.
  • Helps tune hyperparameters during training.

In short:

> OOB error = unbiased internal test error estimated from unused samples for each tree.
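The mechanism above maps directly onto scikit-learn's `oob_score=True` option (assuming scikit-learn is available; the synthetic dataset and tree count are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=10,
                           n_informative=5, random_state=0)

# With oob_score=True, each sample is scored only by the trees
# whose bootstrap sample did not contain it.
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
oob = rf.oob_score_   # internal generalization estimate, no holdout set needed
```

`oob_score_` can be compared across hyperparameter settings as a cheap stand-in for cross-validation.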

13
Q

Are Random Forests biased towards attributes with more levels? Explain your answer.

A

Yes, Random Forests can exhibit slight bias toward attributes with more levels, but much less than a single decision tree.

Explanation:

  • Decision trees tend to favor features with many distinct values (high cardinality), because these features can create more splits and may appear more informative.
  • Random Forest mitigates this bias through:
    1. Feature randomness: Only a random subset of features is considered at each split, reducing the chance high-cardinality features dominate.
    2. Ensemble averaging: Multiple decorrelated trees reduce individual tree biases.

Practical note:

  • Bias can still occur for categorical features with extremely high cardinality.
  • Solutions include target encoding, feature binning, or careful preprocessing.

In short:

> Random Forest reduces, but does not fully eliminate, bias toward high-level categorical attributes.

14
Q

How do you handle missing values in a Random Forest model?

A

Random Forests are relatively robust to missing data, and there are multiple strategies to handle them:

  1. During training:
    • Some implementations (e.g., R’s randomForest) use surrogate splits, where alternative splits are used if a value is missing.
    • Others (e.g., scikit-learn) require imputation before fitting.
  2. During prediction:
    • Missing values can be handled probabilistically, sending a sample down multiple branches weighted by training distributions.
  3. Practical preprocessing:
    • Numeric features: fill missing values with mean or median.
    • Categorical features: fill missing values with mode or a special “missing” category.

Key point:

> Random Forests tolerate missing data better than single trees, but explicit handling or imputation generally improves accuracy.

15
Q

What is the difference between bagging and boosting?

A
  • Bagging: Trains multiple models independently on different bootstrap samples; reduces variance; all models contribute equally (e.g., Random Forest).
  • Boosting: Trains models sequentially, each focusing on mistakes of the previous; reduces bias; later models weighted more (e.g., AdaBoost, Gradient Boosting).

Key point: Bagging = parallel + variance reduction; Boosting = sequential + bias reduction.
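The contrast can be exercised side by side with scikit-learn's two built-in ensembles (assuming scikit-learn is available; the synthetic dataset is an illustrative stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=12, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Bagging: 100 independent trees on bootstrap samples, averaged.
bagged = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

# Boosting: 100 shallow trees fit sequentially on the previous errors.
boosted = GradientBoostingClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

bag_acc, boost_acc = bagged.score(X_te, y_te), boosted.score(X_te, y_te)
```

Which one wins depends on the data; the structural difference (parallel vs sequential training) is what the card is testing.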

16
Q

Explain why Random Forest reduces overfitting compared to a single decision tree.

A
  • A single tree can memorize the training data → high variance and overfitting.
  • Random Forest combines many trees trained on different bootstrap samples and random subsets of features, making trees decorrelated.
  • Aggregating predictions (majority vote or averaging) cancels out individual tree errors, reducing variance and improving generalization.

In short: Ensemble averaging stabilizes predictions and prevents overfitting.

17
Q

True or False: Random Forest always uses all features at each split.

A

False

  • Random Forest considers only a random subset of features at each split (max_features) to decorrelate trees and improve generalization.
18
Q

True or False: Increasing n_estimators always guarantees better accuracy.

A

False

  • More trees reduce variance and stabilize predictions, but beyond a certain point accuracy plateaus.
  • Very large n_estimators increases training time and memory usage without significant accuracy gain.
19
Q

Compare Random Forest with Gradient Boosting in terms of bias and variance.

A
| Aspect          | Random Forest                             | Gradient Boosting                                          |
| --------------- | ----------------------------------------- | ---------------------------------------------------------- |
| Training        | Trees trained **independently** (bagging) | Trees trained **sequentially**, correcting previous errors |
| Bias            | Moderate bias                             | Low bias (sequential correction)                           |
| Variance        | Low variance (averaging reduces it)       | Higher variance; can overfit if too many trees             |
| Parallelization | Can train trees in parallel               | Sequential → slower training                               |
| Robustness      | More robust to noise/outliers             | Sensitive to noise/outliers                                |

Key takeaway: Random Forest reduces variance, Boosting reduces bias.

20
Q

Compare Random Forest and a single decision tree in interpretability, accuracy, and robustness.

A

| Aspect           | Single Decision Tree        | Random Forest                             |
| ---------------- | --------------------------- | ----------------------------------------- |
| Interpretability | High (easy to visualize)    | Low (ensemble of many trees)              |
| Accuracy         | Moderate                    | Higher (ensemble reduces variance)        |
| Robustness       | Sensitive to noise/outliers | More robust due to averaging              |
| Overfitting      | Prone                       | Reduced by bagging and feature randomness |

Summary:

> Random Forest sacrifices interpretability for higher stability, accuracy, and robustness.

21
Q

You train a Random Forest and notice high training accuracy but low test accuracy. What could be wrong?

A
  • Likely overfitting, possibly due to:
    • Trees being too deep (max_depth too high).
    • Insufficient number of trees to average out noise.
    • Data quality issues (outliers, noise).
  • Solutions:
    • Limit tree depth (max_depth) or increase min_samples_leaf.
    • Increase n_estimators to stabilize predictions.
    • Check preprocessing (handle missing values, normalize if needed).

Key point: Random Forest reduces overfitting but hyperparameters must be tuned carefully.

22
Q

If one feature has 1000 levels and another has 3 levels, which is likely to dominate a single tree split, and how does Random Forest mitigate this?

A
  • Single tree: High-cardinality feature (1000 levels) is more likely to dominate splits, introducing bias.
  • Random Forest mitigation:
    1. Feature randomness — only a subset of features considered at each split.
    2. Ensemble averaging — reduces impact of biased splits in individual trees.

Key point: Random Forest reduces, but does not completely eliminate, bias toward high-cardinality features.

23
Q

Given a Random Forest with 100 trees, if a sample is not included in 40 of the bootstrap datasets, how many trees will contribute to its OOB error?

A

OOB error uses only trees that did not see the sample during training.

Here, 40 trees did not include the sample → all 40 contribute to OOB prediction.

Key point: OOB error provides an unbiased internal estimate using unseen samples per tree.

24
Q

List 2 advantages of Random Forest over a single decision tree.

A
  1. Higher accuracy and robustness — ensemble averaging reduces variance.
  2. Less prone to overfitting — bagging and feature randomness decorrelate trees.

Optional bonus: Can handle both classification and regression tasks and provides feature importance.

25
You notice that some trees in your Random Forest are consistently giving poor predictions. What could be the reason?
  1. **Insufficient features considered** at splits → poor split decisions.
  2. **Small bootstrap sample** → the tree didn’t see enough data.
  3. **Tree depth too shallow** → cannot capture patterns.
  4. **Noisy or missing data** affecting this tree more than others.

**Key point:** Individual trees can be weak; Random Forest **averages out errors**, so some poor trees do not heavily affect overall performance.
26
How can Random Forest be used to determine **feature importance**?

  1. Measure how much each feature **reduces impurity** (Gini or variance) across all trees.
  2. Features that contribute more to **splits that improve prediction** have higher importance.
  3. Can also use **permutation importance**: randomly shuffle a feature and observe the **drop in model accuracy**.

**Key point:** Random Forest provides **built-in methods** to rank and interpret feature relevance.
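Permutation importance (point 3) can be sketched by hand, assuming scikit-learn is available. `accuracy_drop` is a hypothetical helper, and the toy data deliberately gives only feature 0 any signal:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)          # only feature 0 carries signal

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

def accuracy_drop(model, X, y, col, rng):
    """Permutation importance: accuracy drop after shuffling one column."""
    baseline = model.score(X, y)
    Xp = X.copy()
    rng.shuffle(Xp[:, col])            # destroy this column's relationship to y
    return baseline - model.score(Xp, y)

drop_signal = accuracy_drop(model, X, y, 0, rng)   # large drop
drop_noise = accuracy_drop(model, X, y, 1, rng)    # near zero
```

Shuffling the informative feature wrecks accuracy; shuffling the noise feature barely matters, which is exactly how the importance ranking emerges.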
27
You have a large dataset with many correlated features. How does Random Forest handle correlated features?
  • Correlated features may lead to some trees relying on one over the other.
  • Random feature selection at splits helps **reduce correlation between trees**.
  • Overall, **ensemble averaging** ensures that correlated features do not dominate predictions.
  • For interpretation, feature importance may be **shared among correlated features**, so be cautious when ranking.

**Key point:** Random Forest is robust to correlated features, but feature importance may be less reliable.
28
How does Random Forest handle **outliers** in the dataset?
  • Individual trees may be influenced by outliers, but **ensemble averaging** reduces their impact on overall predictions.
  • For **regression**, outliers affect the average less as many trees contribute.
  • For **classification**, a few trees misclassifying outliers are outvoted by the majority.
  • Optional preprocessing: remove or cap extreme values to improve robustness.

**Key point:** Random Forest is **naturally robust to outliers** due to averaging across trees.
29
How do you measure the performance of a Random Forest for **regression** tasks?
Common regression metrics:

  1. **Mean Squared Error (MSE):** Average squared difference between predicted and actual values.
  2. **Root Mean Squared Error (RMSE):** Square root of MSE, interpretable in the original units.
  3. **Mean Absolute Error (MAE):** Average absolute difference between predicted and actual values.
  4. **R² (Coefficient of Determination):** Proportion of variance explained by the model (0 ≤ R² ≤ 1 for typical fits).
  5. **OOB R²/error:** Random Forest’s internal estimate using out-of-bag samples.

**Key point:** Choose the metric based on **sensitivity to outliers** (e.g., MAE is less sensitive than MSE).
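These metrics are simple to compute directly; a small numpy sketch, where `regression_metrics` is a hypothetical helper and the toy example shows that predicting the mean of y yields R² = 0:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and R² from their definitions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    mse = float(np.mean(err ** 2))
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "mse": mse,
        "rmse": mse ** 0.5,
        "mae": float(np.mean(np.abs(err))),
        "r2": 1.0 - ss_res / ss_tot,
    }

# Predicting the mean (2.0) for y = [1, 2, 3] gives R² = 0 by construction.
m = regression_metrics([1, 2, 3], [2, 2, 2])
```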
30
What are the main **limitations of Random Forest**?

  1. **Interpretability:** Hard to visualize or explain individual predictions due to many trees.
  2. **Large memory & computation:** Training many trees on large datasets can be slow and memory-intensive.
  3. **Overfitting on noisy data:** Although robust, very noisy data can still affect predictions.
  4. **Feature importance bias:** Can be biased toward features with more levels or high cardinality.
  5. **Limited extrapolation for regression:** Random Forest cannot predict values outside the range seen in training.

**Key point:** Random Forest is powerful but trades **interpretability and efficiency** for accuracy and robustness.
31
You train a Random Forest for a **binary classification** with 1% positive class. Your model predicts all negatives. What happened and how do you fix it?
  • **Problem:** Severe class imbalance → model biased toward the majority class. Accuracy is misleading.
  • **Solutions:**
    1. **Class weighting:** Give higher weight to minority class samples.
    2. **Resampling:** Oversample the minority or undersample the majority.
    3. **Balanced bootstrap:** Ensure each tree sees a more balanced subset.
    4. **Threshold adjustment:** Lower the decision threshold for the positive class.

**Key point:** Random Forest can handle imbalance, but **explicit strategies** are needed for rare classes.
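Strategy 1 (class weighting) maps onto scikit-learn's `class_weight="balanced"` option (assuming scikit-learn is available; the synthetic 1%-positive dataset below is contrived so the positives are separable and recall is measurable):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# 1% positive class: an "always negative" model would look 99% accurate.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = np.zeros(2000, dtype=int)
pos = rng.choice(2000, size=20, replace=False)
y[pos] = 1
X[pos] += 3.0   # shift positives so they are learnable

rf = RandomForestClassifier(n_estimators=100,
                            class_weight="balanced",  # strategy 1 above
                            random_state=0).fit(X, y)
pred = rf.predict(X)
recall = pred[y == 1].mean()   # fraction of true positives recovered
```

Recall, not accuracy, is the number to watch here; accuracy would look excellent even for a model that never predicts the positive class.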
32
After training, your Random Forest **regression model shows high variance**. What metrics would you check, and what steps could you take?
  • **RMSE or MSE** on training vs test set → a large gap indicates overfitting.
  • **R²** on training vs test set → low test R² vs high training R² confirms high variance.
  • **Steps to reduce variance:**
    1. Increase `n_estimators` → stabilize predictions.
    2. Limit `max_depth` → shallower trees generalize better.
    3. Increase `min_samples_split` or `min_samples_leaf` → prevent overfitting on small nodes.
    4. Feature selection or dimensionality reduction → reduce noisy features.

**Key point:** Random Forest reduces variance via averaging, but **hyperparameter tuning** is still essential.
33
What is **variance** in the context of machine learning models?
  • **Variance** measures how much a model’s predictions **change with different training data**.
  • High variance → model **fits training data too closely** (overfitting), performs poorly on unseen data.
  • Low variance → model predictions are **stable** across different datasets.

**Key point:** Random Forest reduces variance by **averaging predictions** of multiple decorrelated trees, improving generalization.
34
What is **classification** in machine learning?
Classification is a supervised learning task where an algorithm learns to assign input data to predefined discrete categories or labels.

Footnote: It is used in applications like email filtering, fraud detection, and medical diagnosis.
35
What does **training** involve in classification?
Training involves presenting the model with labeled examples so it can learn the mapping between features and labels, often by minimizing a loss function such as cross-entropy.

Footnote: This process is critical for the model's learning.
36
What is a **decision boundary** in classification?
A surface (line in 2D, plane in 3D, or hyperplane in higher dimensions) that separates data points of different classes in the feature space.

Footnote: It is essential for understanding how the model classifies data.
37
Which algorithms use **linear decision boundaries**?
  • Logistic Regression
  • Linear Discriminant Analysis (LDA)
  • Support Vector Machines (SVM) with linear kernels

Footnote: These algorithms are effective for linearly separable data.
38
Which algorithms can learn **non-linear decision boundaries**?
  • Decision Trees
  • Random Forests
  • Kernel SVMs
  • Neural Networks

Footnote: These algorithms can capture complex relationships in data.
39
What are common **metrics** for evaluating classification performance?
  • Accuracy (overall correctness)
  • Precision (positive predictive value)
  • Recall (sensitivity)
  • F1-score (harmonic mean of precision and recall)

Footnote: These metrics help assess the effectiveness of classification models.
40
Why do we use the **log function** in **logistic loss**?
Because it transforms probability maximization into a minimization problem, penalizes confident wrong predictions sharply, ensures numerical stability when combining probabilities, and makes the loss convex for easier optimization.

Footnote: The log function is crucial for optimizing models in machine learning.
41
What **numerical benefit** does the **log function** provide in **logistic loss**?
It prevents underflow by turning products of small probabilities into sums of log-probabilities, which are easier and more stable to compute.

Footnote: This stability is important for numerical computations in machine learning.
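The underflow problem is easy to demonstrate with nothing but the standard library (2000 factors of 0.5 is an illustrative choice, deep enough to push the product below the smallest representable float):

```python
import math

# 2000 moderately small probabilities: the raw product underflows to 0.0,
# while the equivalent sum of log-probabilities stays representable.
probs = [0.5] * 2000

product = 1.0
for p in probs:
    product *= p          # drifts below the smallest float, ends at 0.0

log_sum = sum(math.log(p) for p in probs)   # finite: 2000 * ln(0.5)
```

Any likelihood-based loss over many samples faces exactly this product, which is why log-likelihoods (and log loss) are used instead.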
42
Which activation functions are typically paired with cross-entropy loss?
Sigmoid for binary classification and Softmax for multi-class classification.
43
What is the most effective **encoding** for low-cardinality nominal features?
One-hot encoding, especially for linear models or neural networks.

Footnote: This method creates binary columns for each category, making it suitable for models that require numerical input.
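The "one binary column per category" idea can be shown in a few lines of plain Python (`one_hot` is a hypothetical helper; libraries such as scikit-learn and pandas provide production equivalents):

```python
def one_hot(values):
    """Minimal one-hot encoder for a low-cardinality nominal feature:
    one binary column per category, in sorted category order."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]
    return categories, rows

cats, encoded = one_hot(["red", "green", "red", "blue"])
# cats    -> ['blue', 'green', 'red']
# encoded -> [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```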
44
What is the most effective **encoding** for ordinal features?
Label encoding that reflects the order of categories.

Footnote: This method assigns integer values to categories based on their order.
45
What is the most effective **encoding** for high-cardinality features in neural networks?
Embedding layers are most effective for high-cardinality features.

Footnote: Embedding layers help in reducing dimensionality and capturing relationships between categories.
46
What encoding is recommended for **high-cardinality features** in tree-based models?
Label encoding, frequency encoding, or target encoding with careful cross-validation.

Footnote: These methods help maintain the integrity of the data while allowing tree-based models to make splits effectively.
47
What is the key principle for choosing **categorical encoding**?
Choose an encoding that captures the relationship between the feature and target, and is compatible with the model type and dataset size.

Footnote: This ensures that the encoding enhances model performance and interpretability.
48
What is logistic regression?
A supervised learning algorithm used for binary classification that models the probability of a class using the logistic (sigmoid) function.
49
What type of output does logistic regression predict?
It predicts probabilities between 0 and 1 for each class, which can then be thresholded to make binary predictions.
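The probability-then-threshold step can be sketched with the standard library alone (`sigmoid` and `classify` are hypothetical helper names; the 0.5 threshold is the conventional default, not a requirement):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Threshold the predicted probability to obtain a binary label."""
    return int(sigmoid(z) >= threshold)
```

A score of 0 sits exactly on the default decision boundary (probability 0.5); lowering the threshold trades precision for recall, as in the fraud cards earlier.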
50
What does DBSCAN stand for and what is its main purpose?
DBSCAN = Density-Based Spatial Clustering of Applications with Noise. It groups points closely packed together (high density) and marks low-density points as noise (outliers).
51
What does HDBSCAN stand for?
HDBSCAN = Hierarchical Density-Based Spatial Clustering of Applications with Noise. It extends DBSCAN by building a hierarchy of clusters and extracting the most stable ones.
52
Which of the following parameters are required by DBSCAN? A. min_samples and eps B. min_cluster_size and min_samples C. eps and min_cluster_size D. Only min_samples
Answer: A. min_samples and eps
53
How does DBSCAN decide clusters vs noise?
It uses a fixed distance threshold (eps) — points within eps of at least min_samples neighbors form a cluster. Points not reachable are labeled as noise.
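The eps / min_samples mechanics can be seen directly with scikit-learn's DBSCAN (assuming scikit-learn is available; the coordinates are contrived so two tight groups form clusters and one isolated point has no neighbors within eps):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups plus one isolated point.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],          # group A
    [10.0, 10.0], [10.1, 10.0], [10.0, 10.1], [10.1, 10.1],  # group B
    [5.0, 5.0],                                              # isolated
])

# Points with >= min_samples neighbors within eps become core points;
# unreachable points get the special noise label -1.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)
```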
54
How does HDBSCAN handle the need for a fixed eps value?
It removes the need for a global eps by building a hierarchy of clusters over varying density levels and selecting clusters based on stability.
55
(True/False) HDBSCAN can detect clusters with varying densities better than DBSCAN.
Answer: True.
56
Which algorithm tends to perform better on datasets with variable density regions? A. DBSCAN B. HDBSCAN C. Both equally
Answer: B. HDBSCAN
57
What happens to clusters if you set a poor eps in DBSCAN?
Too small eps → many points labeled as noise. Too large eps → clusters merged incorrectly.
58
In terms of complexity, how does HDBSCAN compare to DBSCAN?
HDBSCAN is typically more computationally expensive (O(n log n)) but more robust; DBSCAN is simpler and slightly faster.
59
Which algorithm is better suited for exploratory data analysis when the density structure is unknown? A. DBSCAN B. HDBSCAN
Answer: B. HDBSCAN
60
Why does DBSCAN struggle with varying density levels, and how does HDBSCAN solve this?
In DBSCAN, you must set a global eps (epsilon) — a fixed distance threshold. Points within distance eps are considered neighbors, and a cluster forms when a point has at least min_samples neighbors within eps. ⚠️ The problem: a single eps value works well only if all clusters in your data have similar density. If some clusters are dense and others are sparse, DBSCAN either merges sparse clusters incorrectly or labels many points in sparse regions as noise. ➡️ So DBSCAN can't handle clusters with different densities because eps is global and static. Meanwhile, HDBSCAN builds a hierarchy of clusters over varying density levels and extracts the most stable clusters, so no single global eps is needed and clusters of different densities can be recovered.
61
Question: If latitude range = 180° and longitude range = 360°, which of the following scaling methods correctly balances the axes? A. Latitude ×1, Longitude ×1 B. Latitude ×2, Longitude ×1 C. Latitude ×1, Longitude ×0.5 D. Both B and C
D. Both B and C
62
Question: What is the main problem if we use raw latitude and longitude for distance-based clustering? A. Latitude and longitude are categorical variables B. Differences in longitude dominate distances due to larger range C. Latitude values are always negative D. Euclidean distance does not work on geographic coordinates
B. Differences in longitude dominate distances due to larger range
63
Why can a feature with a much larger range dominate clustering when using Euclidean distance?
Because Euclidean distance sums squared differences for each feature, so a feature with a larger numeric range contributes much more to the distance, making other features almost negligible.
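A quick NumPy check of this effect, with illustrative values (X in 0–10, Y in 0–1000):

```python
import numpy as np

# Two points: a small difference in X (range 0-10), a large one in Y (range 0-1000).
a = np.array([2.0, 100.0])
b = np.array([9.0, 900.0])

# Euclidean distance sums squared per-feature differences.
sq_diff = (a - b) ** 2
y_share = sq_diff[1] / sq_diff.sum()   # fraction of the distance driven by Y

# Min-max scaling each feature by its range restores balance.
ranges = np.array([10.0, 1000.0])
sq_diff_s = ((a / ranges) - (b / ranges)) ** 2
```

Before scaling, Y accounts for essentially all of the distance; after scaling, both features contribute comparably.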
64
You have two features, X in range 0–10 and Y in range 0–1000. You run K-means on raw data. Which happens? A. X and Y contribute equally to clustering B. Y dominates clustering → clusters reflect mostly Y C. X dominates clustering → clusters reflect mostly X D. Clustering is random
✅ Answer: B. Y dominates clustering → clusters reflect mostly Y
65
(True/False) If a large-range feature dominates, clustering results may ignore differences in smaller-range features, causing “wrong clusters.”
True
66
Q: How can you fix the problem of large-range features dominating distances? A. Remove the large-range feature B. Scale all features to comparable ranges C. Use only the large-range feature D. Increase the number of clusters
B
67
Question: DBSCAN can find clusters of arbitrary shape, while K-Means assumes spherical clusters.
True. K-Means partitions data into convex, roughly spherical clusters because it minimizes variance within each cluster. DBSCAN identifies dense regions, allowing for arbitrary shapes and noise detection.
68
Question: What is the main purpose of dimensionality reduction?
Answer: To reduce the number of input features while retaining the most important information. It simplifies models, removes redundant or noisy features, helps visualization in 2D/3D, and can prevent overfitting
69
PCA reduces dimensionality by: A. Selecting random features B. Projecting data onto directions of maximum variance C. Clustering features D. Scaling all features equally
B. Explanation: PCA computes principal components (orthogonal axes) that capture the largest variance in the data. It projects the data onto these components, keeping only the top ones to reduce dimensionality.
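A short sketch of this, assuming scikit-learn; the data is synthetic, with almost all variance along one axis:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Almost all variance lies along the first axis.
X = np.column_stack([rng.normal(0.0, 10.0, size=200),
                     rng.normal(0.0, 0.1, size=200)])

ratios = PCA(n_components=2).fit(X).explained_variance_ratio_
# The first principal component captures nearly all the variance.
```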
70
What’s the main difference between PCA and t-SNE?
PCA: Linear, preserves global structure (variance across all samples). t-SNE: Nonlinear, preserves local structure (pairwise neighborhood relationships). Thus, PCA is better for feature compression, t-SNE for visualizing clusters.
71
Question: Which of the following is not a feature engineering technique? A. One-hot encoding B. Normalization C. K-Means clustering D. Feature extraction via PCA
Answer: C. Explanation: K-Means is a clustering algorithm, not a feature engineering technique. However, you can use K-Means results (like cluster labels) as new features — which becomes feature engineering.
72
Why might we apply PCA before K-Means? A. To make K-Means faster and more stable B. To increase the number of clusters C. To visualize high-dimensional data D. Both A and C
Answer: D. Explanation: PCA reduces dimensionality, making K-Means computationally cheaper and less sensitive to noise. It also allows visualization of clusters in 2D or 3D.
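The PCA-then-K-Means combination can be sketched as a pipeline, assuming scikit-learn (the 5-dimensional blobs are synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
# Two well-separated blobs in 5 dimensions.
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 5)),
               rng.normal(10.0, 0.5, size=(50, 5))])

# PCA first shrinks 5 features to 2, then K-Means clusters the projection.
pipe = make_pipeline(PCA(n_components=2),
                     KMeans(n_clusters=2, n_init=10, random_state=0))
labels = pipe.fit_predict(X)
```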
73
Question: What are the benefits of using clustering for feature selection compared to correlation filtering?
Clustering groups features by global similarity patterns, not just pairwise correlation. It can capture nonlinear relationships. It’s more robust in high dimensions where correlations alone may be misleading.
74
Which workflow best combines the three concepts? A. Clustering → Dimensionality reduction → Feature engineering B. Dimensionality reduction → Clustering → Use cluster labels as engineered features C. Feature engineering → Dimensionality reduction → Drop clustering
Answer: B. Explanation: Reducing dimensions (e.g., via PCA) simplifies data → clustering reveals structure → cluster labels are used as additional engineered features to enhance model performance.
75
(True/False) Clustering is a supervised learning technique.
❌ A: False — it’s unsupervised (no labels are used).
76
What is hierarchical clustering?
It builds a tree (dendrogram) of clusters by recursively merging or splitting groups. Two types: Agglomerative (bottom-up): Start with each point as its own cluster, then merge. Divisive (top-down): Start with one cluster, then split recursively. You can “cut” the tree at a chosen level to define the number of clusters. → Used in biology for gene expression analysis or document taxonomy.
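A minimal agglomerative example, assuming scikit-learn, where cutting the tree at two clusters recovers the two groups:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],    # group A
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])   # group B

# Agglomerative (bottom-up): each point starts as its own cluster and the
# closest pair of clusters merges repeatedly; n_clusters=2 "cuts the tree".
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
```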
77
What is density-based clustering?
A: Algorithms like DBSCAN or HDBSCAN find clusters as areas of high point density separated by sparse regions. They automatically detect noise/outliers and handle clusters of irregular shape. → Used in spatial data (e.g., detecting hotspots of crimes or disease spread).
78
Q6: What are typical real-world use cases for clustering?
Marketing: Segment customers by demographics or spending patterns (K-Means). Healthcare: Cluster patients by symptoms or gene expression (Hierarchical). Cybersecurity: Detect anomalous network traffic (DBSCAN / HDBSCAN). Finance: Group similar investment profiles (GMM). NLP / Vision: Cluster embeddings from deep models (UMAP + HDBSCAN).
79
What is PCA and when is it used best?
Principal Component Analysis (PCA) is a linear technique that projects data onto orthogonal directions (principal components) capturing maximum variance. Best for linearly correlated features. Reduces redundancy and speeds up clustering. → Example: Before K-Means, to stabilize centroids and visualize data in 2D.
80
What is t-SNE and when is it used?
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear manifold learning technique preserving local structure (neighborhoods). Ideal for visualization in 2D/3D of high-dimensional clusters. Sensitive to parameters (perplexity, learning_rate). → Example: Visualizing clusters of image embeddings or word embeddings after deep learning models.
81
What is UMAP and how is it different from t-SNE?
UMAP (Uniform Manifold Approximation and Projection): Nonlinear and faster than t-SNE. Preserves both local and some global structure. Scales better for large datasets and can be used before clustering. → Example: UMAP + HDBSCAN for clustering text embeddings or genetic data.
82
What type of supervised learning model predicts probabilities for each class and uses a sigmoid function?
Logistic Regression.
83
: Which algorithm directly uses the proximity (distance) between data points for classification?
KNN
84
: Which algorithm splits data into branches using feature thresholds?
Decision Trees
85
Which algorithm finds a hyperplane that maximizes the margin between two classes?
SVM
86
How does KNN differ from Logistic Regression in terms of decision boundaries?
KNN: Can produce nonlinear boundaries depending on the local structure of data. Logistic Regression: Produces a linear boundary unless features are transformed.
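A small sketch of this difference on XOR data, which no linear boundary can separate, assuming scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# XOR labels: no single straight line separates the two classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

knn_acc = KNeighborsClassifier(n_neighbors=1).fit(X, y).score(X, y)
lr_acc = LogisticRegression().fit(X, y).score(X, y)
# KNN adapts to local structure and fits XOR perfectly;
# a linear boundary can classify at most 3 of the 4 points.
```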
87
Which model can handle both linear and nonlinear classification problems?
SVM (with kernel trick) and Decision Tree.
88
Which model is most sensitive to irrelevant features or scaling issues?
KNN — because distance measures can be distorted by unscaled or noisy features.
89
A company wants to predict whether a customer will purchase a product based on how close they are to other similar customers. Which model should they use?
K-Nearest Neighbors (KNN), since it classifies based on proximity to similar data points.
90
You have a dataset with thousands of samples and want a fast prediction model after training. Which algorithm is least suitable?
KNN, because it’s computationally expensive at prediction time.
91
You want to understand which features influence your binary classification results the most. Which model is easiest for interpretation?
Logistic Regression — it provides interpretable coefficients that show feature influence.
92
Your dataset is small but contains complex nonlinear relationships. Which model would likely perform best?
Decision Tree or SVM (with nonlinear kernel).
93
You have noisy data with many irrelevant features. Which algorithm might overfit most easily?
Decision Tree — prone to overfitting unless pruned or regularized.
94
(True/False) Logistic Regression is suitable only for regression problems.
❌ False — despite the name, it’s used for classification.
95
KNN is a parametric algorithm.
❌ False — it’s non-parametric because it makes no assumptions about data distribution.
96
SVM only works for linearly separable data.
❌ False — it can use kernels to handle nonlinear data.
97
Decision Trees can handle both numerical and categorical features.
✅ True.
98
KNN requires a training phase to build a model before making predictions.
❌ False — it’s a lazy learner that does not build a model during training.
99
Suppose you have customer data with geographic coordinates, and you want to classify whether a new customer is likely to buy based on nearby customers’ behavior. Which model best suits this problem, and why?
K-Nearest Neighbors (KNN) — because it directly uses distance (e.g., Euclidean distance) between customers to classify the new one based on similar nearby examples.
100
Q1: Which regression model is used to model constant growth over time?
A1: Linear Regression.
101
Between logarithmic and exponential regression, which one decelerates over time?
Logarithmic Regression.
102
What is the Elbow Method in clustering?
A6: A heuristic for choosing the optimal number of clusters (k) by plotting the sum of squared distances (inertia) vs. k and looking for the “elbow” where improvement slows.
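A sketch of the elbow computation on three synthetic blobs, assuming scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Three well-separated 2-D blobs, so the elbow should appear at k=3.
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (0.0, 5.0, 10.0)])

inertias = {}
for k in range(1, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_   # sum of squared distances to the closest center
# Inertia drops sharply up to k=3, then improvement flattens: the elbow.
```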
103
A product’s sales increase quickly at launch but then plateau. Which regression should be considered?
Logarithmic Regression.
104
Q13: A small company’s revenue grows steadily by $10,000 every month. Which regression is suitable?
A13: Linear Regression.
105
What is precision?
A3: Precision = TP / (TP + FP), the proportion of true positives among all predicted positives.
106
What is recall (sensitivity)?
A4: Recall = TP / (TP + FN), the proportion of true positives among all actual positives.
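The two formulas side by side, with illustrative confusion-matrix counts:

```python
# Illustrative confusion-matrix counts from some classifier.
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)   # of all predicted positives, how many were right
recall = tp / (tp + fn)      # of all actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
# precision = 0.8, recall ~= 0.667, f1 ~= 0.727
```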
107
Q8: When is precision more important than recall?
A8: When false positives are costly (e.g., spam detection).
108
Q9: When is recall more important than precision?
A9: When false negatives are costly (e.g., disease detection).
109
Q12: A cancer detection model flags 90 out of 100 patients with cancer correctly but also misclassifies 50 healthy patients as sick. Should the doctor prioritize precision, recall, or F1-score?
A12: Recall is critical because missing cancer patients (FN) is more dangerous than false positives.
110
In email spam detection, users complain about too many normal emails being marked as spam. Which metric should you optimize?
A13: Precision — reduce false positives.
111
Q14: A fraud detection system has 99% accuracy, but the dataset has 99% non-fraud cases. Why is this misleading?
A14: The model predicts almost all as non-fraud, achieving high accuracy but failing to detect fraud (low recall for minority class).
112
Q11: You predict housing prices in dollars. Your model has MAE = 5,000. What does this mean?
A11: On average, the predicted house price differs from the actual price by $5,000.
113
Q12: Your model has RMSE = 8,000 and MAE = 5,000. What does the difference suggest?
A12: There are some large errors because RMSE > MAE (RMSE penalizes large errors more).
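A quick numeric check of this effect, with made-up predictions containing one large error:

```python
import numpy as np

y_true = np.array([100.0, 100.0, 100.0, 100.0])
y_pred = np.array([101.0, 99.0, 100.0, 110.0])   # one large error

errors = y_pred - y_true
mae = np.abs(errors).mean()            # 3.0
rmse = np.sqrt((errors ** 2).mean())   # ~5.05: the squared 10 dominates
```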
114
Q13: You fit a regression model with many features. R² = 0.95, but Adjusted R² = 0.75. What does this indicate?
A13: Some features do not contribute meaningfully; the model may overfit.
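The adjustment formula makes this concrete; the sample and predictor counts below are illustrative:

```python
# Adjusted R^2 penalizes each extra predictor:
#   adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With 30 samples and 23 predictors, R^2 = 0.95 shrinks to about 0.76.
adj = adjusted_r2(0.95, n=30, p=23)
```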
115
Q14: Which metric would you use to compare models across different datasets with different scales?
A14: R² (scale-independent).
116
Q19: Adjusted R² can decrease when adding a predictor.
A19: ✅ True — it penalizes irrelevant features.
117
Q1: Why is evaluating unsupervised learning models more challenging than supervised learning?
A1: Because there are no labeled outputs to directly compare predictions against, so evaluation relies on internal structure, heuristics, or external proxies.
119
Q2: Name three common evaluation approaches for unsupervised learning.
A2: (1) Internal evaluation (model-intrinsic measures, e.g., cohesion and separation); (2) External evaluation (when ground-truth labels exist, e.g., Adjusted Rand Index); (3) Relative evaluation / heuristics (comparison between models, cross-validation, stability, or visualization).
120
What does a low inertia indicate?
Points are closer to their cluster centers, meaning clusters are tight and cohesive.
121
Q8: You compute inertia for k=2,3,4 clusters and get: 1200, 800, 790. What does the elbow method suggest?
A8: k=3 clusters, because adding a fourth cluster reduces inertia only slightly.
122
What is cross-validation?
A1: A technique to evaluate a model’s generalization ability by splitting the dataset into training and validation subsets multiple times.
123
What is k-fold cross-validation?
The dataset is divided into k equal folds. The model is trained on k-1 folds and validated on the remaining fold. This is repeated k times, and results are averaged.
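A sketch of the fold structure, assuming scikit-learn's KFold:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)   # 10 samples

splits = list(KFold(n_splits=5).split(X))
# 5 iterations: each trains on 8 samples and validates on 2,
# and every sample appears in exactly one validation fold.
```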
124
Q3: What is leave-one-out cross-validation (LOOCV)?
A3: Special case of k-fold where k = n (number of samples). Each sample is used once as the validation set, and the rest as training.
125
What is stratified k-fold cross-validation?
A version of k-fold CV that preserves the class distribution in each fold, useful for imbalanced classification problems.
126
Q6: Why is cross-validation preferred over a simple train-test split?
A6: It reduces variance due to random splits and gives a more robust estimate of model performance.
127
Q8: Compare k-fold CV and LOOCV.
k-fold: Fewer iterations, less computationally expensive, slightly higher bias. LOOCV: One iteration per sample, very low bias, high computational cost, can have high variance.
128
Q12: You are tuning hyperparameters for an SVM. How should you estimate its generalization performance?
A12: Use nested cross-validation — inner loop for tuning, outer loop for performance evaluation.
129
Q22: Why is nested CV preferred when tuning hyperparameters in small datasets?
It prevents data leakage from tuning into the evaluation set, providing an unbiased performance estimate.
130
What is data leakage?
A1: Data leakage occurs when a model has access to information it wouldn’t have in a real prediction scenario, leading to overestimated performance.
131
What is target leakage?
When a feature contains information directly related to the target, which wouldn’t be available at prediction time.
132
Give an example of target leakage.
Predicting loan default using a feature “Has account been written off?” — it directly indicates default.
133
What is temporal leakage?
Using future information to predict past or present, which is not available at prediction time.
134
If a model performs unusually well on a small dataset, it may indicate data leakage.
✅ True.
135
Data leakage can only happen in supervised learning tasks.
❌ False — it can also occur in unsupervised preprocessing or feature engineering.
136
You normalize a feature on the full dataset before splitting into train/test. Is this data leakage?
✅ Yes — the test set influenced training.
137
How can you prevent train-test contamination?
Always split dataset first, then apply preprocessing only on training data. Apply same transformations to test data.
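A minimal sketch of the leak-free pattern, with made-up numbers:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
train, test = data[:4], data[4:]   # split FIRST

# Compute scaling statistics on the training split only...
mu, sigma = train.mean(), train.std()
train_scaled = (train - mu) / sigma
# ...then apply the SAME statistics to the test split.
test_scaled = (test - mu) / sigma

# Leaky alternative (avoid): fitting on `data` would let the extreme
# test value 100.0 shift the statistics seen during training.
```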
138
How does nested CV prevent leakage?
A15: By keeping the outer validation set separate from hyperparameter tuning, avoiding leakage from tuning into evaluation.
139
Q16: How can you avoid target leakage?
A16: Exclude features that directly include target information or are only available after the outcome occurs.
140
Q17: What should you do for time-series data to prevent leakage?
A17: Ensure training data precedes validation/test data chronologically, never using future information for past prediction.
141
Q19: You are building a credit default model. One candidate feature is “monthly payment made last month”. The target is “will default next month?”. Is this safe?
✅ Safe — past information is allowed. Only future information or features including the target itself would cause leakage.
142
You notice your model achieves 99% accuracy on a small imbalanced dataset. What should you check first?
A20: Check for data leakage, e.g., target leakage, preprocessing contamination, or temporal leakage.
143
Dataset: 100 samples. Model: SVM with C and gamma hyperparameters. How do standard k-fold CV and nested CV work here?
Standard CV (k=5): Split 100 samples into 5 folds. Try multiple combinations of C and gamma on the same CV folds. Compute average accuracy → may overestimate true performance.

Nested CV (k_outer=5, k_inner=3): Outer fold: 20 samples as test, 80 samples as training. Inner fold: split 80 training samples into 3 folds. Try all hyperparameter combinations using inner CV. Select best hyperparameters, train on all 80 training samples. Evaluate on 20 outer test samples (never used in tuning). Repeat for all outer folds → average performance → unbiased estimate.

| Step | Action |
| --- | --- |
| Outer Fold 1 | 20% of data = outer validation (test) |
| Outer Training Set | 80% of data → used in inner loop |
| Inner Loop | Split 80% into 3 folds, train/test for hyperparameter tuning |
| Train Best Model | On full 80% with best hyperparameters |
| Evaluate | On 20% outer validation fold |
| Repeat | For all 5 outer folds → average performance |
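In scikit-learn this nesting can be sketched with GridSearchCV inside cross_val_score (using the iris dataset for illustration; the card's 100-sample setup is analogous):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

# Inner loop (cv=3): hyperparameter tuning only.
inner = GridSearchCV(SVC(), param_grid, cv=3)
# Outer loop (cv=5): scores folds the tuning never saw.
outer_scores = cross_val_score(inner, X, y, cv=5)
```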
144