Machine Learning Flashcards

(144 cards)

1
Q

Imagine you work at a major credit card company and are given a dataset of 600,000 credit card transactions to build a fraud detection model.

How would you approach this task? Write your answer in the comments.

A

1. Understanding the Data and Preprocessing

  • Explore the data: check distributions, missing values, outliers.
  • Handle imbalanced classes (fraud is rare) using oversampling, undersampling, or synthetic data (SMOTE).
  • Feature engineering: transaction amount, time features, merchant type, location, user behavior patterns.
  • Normalize or scale numeric features; encode categorical features.

2. Model Selection

  • Start with tree-based models (XGBoost, LightGBM) or neural networks.
  • Consider anomaly detection methods for rare fraud patterns (one-class SVM, autoencoder-based methods, etc.).

3. Training and Validation

  • Split data into train/validation/test sets.
  • Use metrics suitable for imbalanced data: Precision, Recall, F1, ROC-AUC, PR-AUC.

4. Handling Imbalance & Bias:

  • Weighted loss functions (assign a higher weight to the rare fraud class in the loss function) or sampling strategies (e.g., SMOTE oversampling) to account for rare fraud cases.
  • Monitor for biases (e.g., merchant type, geography).

5. Deployment & Monitoring:

  • Real-time scoring for incoming transactions.
  • Track model drift and retrain the model periodically.
  • Implement alerting thresholds with human-in-the-loop verification.
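The imbalance-handling step above can be made concrete with "balanced" class weights for a weighted loss. A minimal pure-Python sketch; `balanced_class_weights` is a hypothetical helper implementing the common n_samples / (n_classes * class_count) heuristic:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so the rare fraud class gets a proportionally larger weight in the loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Toy labels: 0 = legitimate (98%), 1 = fraud (2%)
labels = [0] * 98 + [1] * 2
weights = balanced_class_weights(labels)
```

Passing such weights into a model's loss (e.g., via `class_weight` in scikit-learn estimators) penalizes missed fraud cases far more heavily than missed legitimate ones.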
2
Q

What does it mean to track model drift and update periodically in fraud detection systems?

A

Model drift occurs when the data distribution changes over time, causing the model’s performance to degrade.

How to track:

  • Monitor key metrics (Precision, Recall, F1, PR-AUC) over time.
  • Track feature distributions and detect shifts (population drift, covariate shift).
  • Compare recent predictions with historical patterns.

Updating the model:

  • Retrain periodically with new labeled data.
  • Use incremental learning or online learning if possible.
  • Incorporate feedback from human-in-the-loop reviews to improve accuracy.

Purpose:

  • Maintain high fraud detection performance.
  • Adapt to new fraud patterns or changes in customer behavior.
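Tracking feature-distribution shift can be done with, for example, the Population Stability Index, one common drift metric (the card does not prescribe it). `psi` below is a hypothetical numpy helper with illustrative bin and sample choices:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a recent
    sample: sum((p - q) * ln(p / q)) over shared histogram bins."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    p, _ = np.histogram(expected, bins=edges)
    q, _ = np.histogram(actual, bins=edges)
    # clip avoids division by zero in empty bins
    p = np.clip(p / p.sum(), 1e-6, None)
    q = np.clip(q / q.sum(), 1e-6, None)
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 5_000)
no_drift = psi(baseline, rng.normal(0, 1, 5_000))    # near 0
drifted = psi(baseline, rng.normal(0.5, 1, 5_000))   # clearly larger
```

A rising PSI on key features is a signal to investigate and possibly retrain.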
3
Q

Let’s say your manager asks you to build a model with a neural network to solve a business problem.

How would you justify the complexity of building such a model and explain the predictions to non-technical stakeholders?

A

Justify complexity: Neural networks capture complex patterns and can improve accuracy over simpler models.

Explain predictions: Use visualizations (SHAP, feature importance) to show which factors influence predictions, focusing on business impact rather than technical details (e.g., “This model helps identify high-value customers likely to churn, enabling targeted retention campaigns”).

4
Q

Let’s say that you’re training a classification model.

How would you combat overfitting when building tree-based models?

A
  • Limit tree depth: Prevents overly complex splits (set a maximum depth).
  • Minimum samples per leaf/node: Ensures splits have enough data.
  • Pruning: Remove branches that add little predictive value.
  • Use ensembles: Random Forest or Gradient Boosting reduces overfitting.
  • Regularization: Apply constraints like max features, max leaf nodes, or learning rate.
  • Cross-validation: Tune hyperparameters and evaluate generalization.
  • Monitoring: Track training vs validation performance to detect overfitting early, then adjust tree complexity or ensemble parameters accordingly.
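The effect of depth limits and leaf-size minimums can be seen in a few lines, assuming scikit-learn is available. The noisy toy dataset and the specific hyperparameter values are illustrative choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Noisy toy data: the label depends on x0, plus 20% label noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(400) < 0.2
y = np.where(flip, 1 - y, y)
X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

# Unconstrained tree: memorizes the training set, noise included.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Constrained tree: limited depth and minimum leaf size.
shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                                 random_state=0).fit(X_tr, y_tr)
```

The unconstrained tree reaches perfect training accuracy by fitting the flipped labels; the constrained tree cannot, which is exactly the train-vs-validation gap the monitoring bullet tells you to watch.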
5
Q

Let’s say you work as a data scientist at a bank.

You are tasked with building a decision tree model to predict if a borrower will pay back a personal loan they are taking out.

How would you evaluate whether using a decision tree algorithm is the correct model for the problem?

Let’s say you move forward with the decision tree model. How would you evaluate the performance of the model before deployment and after?

A

Evaluating suitability of a decision tree:

  • Check whether the problem involves tabular, structured data with clear features and labels — decision trees handle these well.
  • Consider interpretability: trees are easy to explain to stakeholders, important in banking.
  • Compare with baseline models (logistic regression, random forest) on accuracy and interpretability.
  • Assess if the data has non-linear relationships or interactions — trees capture these naturally.

Performance evaluation before deployment:

  • Split data into train, validation, and test sets.
  • Use metrics suitable for class imbalance: Precision, Recall, F1-score, ROC-AUC.
  • Perform cross-validation for stable estimates.
  • Check for overfitting by comparing train vs validation performance.
  • Visualize feature importance to verify meaningful patterns.

Performance evaluation after deployment:

  • Monitor real-world predictions: track accuracy, fraud/missed repayment rates, false positives/negatives.
  • Detect data drift: changes in borrower behavior or features over time.
  • Incorporate human-in-the-loop feedback to correct errors and retrain the model periodically.

Summary:
Decision trees are suitable for structured, interpretable predictions. Evaluate with cross-validation and proper metrics before deployment, and continuously monitor performance and drift after deployment.

6
Q

Let’s say that you work at a bank that wants to build a model to detect fraud on the platform.

The bank also wants to implement a text messaging service that will text customers when the model detects a fraudulent transaction, so the customer can approve or deny the transaction with a text response.

How would we build this model?

A

1. Data Preparation:

  • Collect historical transaction data with fraud labels.
  • Feature engineering: transaction amount, time, location, merchant, user behavior patterns.
  • Handle class imbalance (fraud is rare) with sampling or weighted loss.

2. Model Selection & Training:

  • Use tree-based models (XGBoost, LightGBM) or neural networks for structured data.
  • Consider anomaly detection methods for rare or novel fraud patterns.
  • Validate with cross-validation and metrics: Precision, Recall, F1, ROC-AUC.

3. Fraud Scoring & Thresholding:

  • Model outputs a fraud probability score.
  • Set thresholds to trigger alerts (e.g., >0.8 probability), tuned to the business risk tolerance.

4. Text Messaging Service:

  • Integrate with an SMS provider (e.g., Twilio) to send alerts and capture the customer’s approve/deny reply.

5. Feedback Loop & Model Updating:

  • Incorporate customer responses to retrain or fine-tune the model.
  • Monitor drift and retrain periodically to adapt to new fraud patterns.

6. Security & Compliance:

  • Encrypt sensitive data in transit and at rest.
  • Comply with banking regulations (PCI DSS, GDPR).
  • Log alerts and actions for audit purposes.
7
Q

What is feature engineering in machine learning?

A

Feature engineering is the process of creating, transforming, or selecting input variables (features) from raw data to improve a model’s performance and help it learn more effectively.

Purpose:

  • Highlight important patterns in the data.
  • Reduce noise and irrelevant information.
  • Make data suitable for the chosen model.

Examples:

  • From transaction data: compute average spending, time since last purchase, or merchant category.
  • From text: extract word counts, sentiment, or keywords.
  • From images: extract color histograms or edges.
8
Q

What is a Random Forest and how does it work?
Why does it generalize better than a single decision tree?

A

A Random Forest is an ensemble learning algorithm that builds multiple decision trees and combines their outputs to improve accuracy and reduce overfitting.

It works by using bagging (bootstrap aggregation) — each tree is trained on a random sample of the data (with replacement), and at each node split, it selects a random subset of features instead of using all features.

This introduces diversity among trees, reducing correlation and variance.

  • For classification, predictions are made by majority voting across trees.
  • For regression, predictions are the average of all tree outputs.

The core idea is that while individual trees may overfit, their ensemble average generalizes much better.

In short: Random Forest = many decorrelated trees → aggregated → stable, accurate model.
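The bagging-plus-voting idea above can be sketched by hand, assuming scikit-learn is available. This is illustrative only (real Random Forest implementations also handle feature randomness internally; here `max_features="sqrt"` stands in for the per-split feature subsetting):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

trees = []
for seed in range(25):
    # Bagging: each tree trains on a bootstrap sample (rows drawn with replacement).
    idx = rng.integers(0, len(X), len(X))
    t = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(t.fit(X[idx], y[idx]))

# Classification: majority vote across the 25 trees.
votes = np.stack([t.predict(X) for t in trees])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (ensemble_pred == y).mean()
```

For regression, the same loop would average tree outputs instead of voting.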

9
Q

How does a Random Forest differ from a single decision tree? Is this easier to interpret?

A

A single decision tree builds a model by recursively splitting data on the most informative features. While it’s easy to interpret, it tends to overfit — capturing noise and leading to high variance.

In contrast, a Random Forest is an ensemble of many decision trees trained on different bootstrap samples and random feature subsets. It combines their predictions (via majority vote or averaging), which:
- Reduces overfitting by averaging out noise.
- Improves generalization due to decorrelation among trees.
- Increases robustness and predictive accuracy.

Key differences:

| Aspect           | Decision Tree | Random Forest                      |
| ---------------- | ------------- | ---------------------------------- |
| Model type       | Single tree   | Ensemble of many trees             |
| Variance         | High          | Low                                |
| Bias             | Low           | Slightly higher but better balance |
| Overfitting      | Common        | Greatly reduced                    |
| Interpretability | High          | Lower                              |
| Accuracy         | Moderate      | Higher                             |

Summary:

> A Random Forest sacrifices interpretability for higher stability and performance by combining many independent trees.

10
Q

What are the main advantages of using Random Forest?

A

Random Forests offer several advantages that make them one of the most widely used ensemble methods in practice:

  1. High Accuracy & Robustness:
  2. Resistance to Overfitting:
    The randomness in data sampling (bagging) and feature selection decorrelates trees, preventing overfitting compared to a single decision tree.
  3. Handles Both Classification and Regression:
  4. Feature Importance Estimation:
    They naturally provide feature importance scores, helping identify the most influential variables in prediction.
  5. Works Well with Missing or Noisy Data:
    Random Forests are relatively robust to missing values and outliers, as multiple trees can compensate for corrupted data.
  6. Nonlinear and Nonparametric:
    They don’t assume any data distribution or linearity, making them suitable for complex relationships.
  7. Built-in Validation via OOB Error:
    Out-of-Bag samples offer an internal estimate of test error without needing a separate validation set.

In summary:

> Random Forests combine simplicity, robustness, and high predictive power — a strong default choice for many ML problems.

11
Q

How does Random Forest achieve feature randomness?

A

Random Forest introduces feature randomness to decorrelate trees and prevent any single feature from dominating all splits.

How it works:

  1. At each node of a decision tree, instead of considering all features to find the best split, Random Forest selects a random subset of features (denoted as max_features).
    • For classification: typically √p features, where p = total features
    • For regression: typically p/3 features
  2. The tree chooses the best split only among this subset.

Why it matters:

  • Prevents strong predictors from being repeatedly chosen, ensuring diversity among trees.
  • Increases robustness and reduces correlation between trees, improving ensemble performance.

In short:

> Feature randomness + bagging → decorrelated trees → lower variance and better generalization.

12
Q

🌲 What is out-of-bag (OOB) error in Random Forest?

A

Out-of-Bag (OOB) error is an internal validation metric in Random Forests that estimates model performance without a separate test set.

How it works:

  1. Each tree is trained on a bootstrap sample (~63% of the data).
  2. The remaining ~37% of samples not included in that tree are called out-of-bag samples.
  3. Each OOB sample is predicted only by trees that didn’t see it during training.
  4. Aggregating predictions for all OOB samples gives an estimate of the model’s generalization error.

Why it’s useful:

  • Provides a built-in cross-validation metric.
  • Efficient — no need for extra validation split.
  • Helps tune hyperparameters during training.

In short:

> OOB error = unbiased internal test error estimated from unused samples for each tree.
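The mechanism above maps directly onto scikit-learn's `oob_score=True` option (assuming scikit-learn is available; the synthetic dataset and tree count are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=10,
                           n_informative=5, random_state=0)

# With oob_score=True, each sample is scored only by the trees
# whose bootstrap sample did not contain it.
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
oob = rf.oob_score_   # internal generalization estimate, no holdout set needed
```

`oob_score_` can be compared across hyperparameter settings as a cheap stand-in for cross-validation.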

13
Q

Are Random Forests biased towards attributes with more levels? Explain your answer.

A

Yes, Random Forests can exhibit slight bias toward attributes with more levels, but much less than a single decision tree.

Explanation:

  • Decision trees tend to favor features with many distinct values (high cardinality), because these features can create more splits and may appear more informative.
  • Random Forest mitigates this bias through:
    1. Feature randomness: Only a random subset of features is considered at each split, reducing the chance high-cardinality features dominate.
    2. Ensemble averaging: Multiple decorrelated trees reduce individual tree biases.

Practical note:

  • Bias can still occur for categorical features with extremely high cardinality.
  • Solutions include target encoding, feature binning, or careful preprocessing.

In short:

> Random Forest reduces, but does not fully eliminate, bias toward high-level categorical attributes.

14
Q

How do you handle missing values in a Random Forest model?

A

Random Forests are relatively robust to missing data, and there are multiple strategies to handle them:

  1. During training:
    • Some implementations (e.g., R’s randomForest) use surrogate splits, where alternative splits are used if a value is missing.
    • Others (e.g., scikit-learn) require imputation before fitting.
  2. During prediction:
    • Missing values can be handled probabilistically, sending a sample down multiple branches weighted by training distributions.
  3. Practical preprocessing:
    • Numeric features: fill missing values with mean or median.
    • Categorical features: fill missing values with mode or a special “missing” category.

Key point:

> Random Forests tolerate missing data better than single trees, but explicit handling or imputation generally improves accuracy.

15
Q

What is the difference between bagging and boosting?

A
  • Bagging: Trains multiple models independently on different bootstrap samples; reduces variance; all models contribute equally (e.g., Random Forest).
  • Boosting: Trains models sequentially, each focusing on mistakes of the previous; reduces bias; later models weighted more (e.g., AdaBoost, Gradient Boosting).

Key point: Bagging = parallel + variance reduction; Boosting = sequential + bias reduction.
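The contrast can be exercised side by side with scikit-learn's two built-in ensembles (assuming scikit-learn is available; the synthetic dataset is an illustrative stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=12, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Bagging: 100 independent trees on bootstrap samples, averaged.
bagged = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

# Boosting: 100 shallow trees fit sequentially on the previous errors.
boosted = GradientBoostingClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

bag_acc, boost_acc = bagged.score(X_te, y_te), boosted.score(X_te, y_te)
```

Which one wins depends on the data; the structural difference (parallel vs sequential training) is what the card is testing.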

16
Q

Explain why Random Forest reduces overfitting compared to a single decision tree.

A
  • A single tree can memorize the training data → high variance and overfitting.
  • Random Forest combines many trees trained on different bootstrap samples and random subsets of features, making trees decorrelated.
  • Aggregating predictions (majority vote or averaging) cancels out individual tree errors, reducing variance and improving generalization.

In short: Ensemble averaging stabilizes predictions and prevents overfitting.

17
Q

True or False: Random Forest always uses all features at each split.

A

False

  • Random Forest considers only a random subset of features at each split (max_features) to decorrelate trees and improve generalization.
18
Q

True or False: Increasing n_estimators always guarantees better accuracy.

A

False

  • More trees reduce variance and stabilize predictions, but beyond a certain point accuracy plateaus.
  • Very large n_estimators increases training time and memory usage without significant accuracy gain.
19
Q

Compare Random Forest with Gradient Boosting in terms of bias and variance.

A
| Aspect          | Random Forest                             | Gradient Boosting                                          |
| --------------- | ----------------------------------------- | ---------------------------------------------------------- |
| Training        | Trees trained **independently** (bagging) | Trees trained **sequentially**, correcting previous errors |
| Bias            | Moderate bias                             | Low bias (sequential correction)                           |
| Variance        | Low variance (averaging reduces it)       | Higher variance; can overfit if too many trees             |
| Parallelization | Can train trees in parallel               | Sequential → slower training                               |
| Robustness      | More robust to noise/outliers             | Sensitive to noise/outliers                                |

Key takeaway: Random Forest reduces variance, Boosting reduces bias.

20
Q

Compare Random Forest and a single decision tree in interpretability, accuracy, and robustness.

A

| Aspect           | Single Decision Tree        | Random Forest                             |
| ---------------- | --------------------------- | ----------------------------------------- |
| Interpretability | High (easy to visualize)    | Low (ensemble of many trees)              |
| Accuracy         | Moderate                    | Higher (ensemble reduces variance)        |
| Robustness       | Sensitive to noise/outliers | More robust due to averaging              |
| Overfitting      | Prone                       | Reduced by bagging and feature randomness |

Summary:

> Random Forest sacrifices interpretability for higher stability, accuracy, and robustness.

21
Q

You train a Random Forest and notice high training accuracy but low test accuracy. What could be wrong?

A
  • Likely overfitting, possibly due to:
    • Trees being too deep (max_depth too high).
    • Insufficient number of trees to average out noise.
    • Data quality issues (outliers, noise).
  • Solutions:
    • Limit tree depth (max_depth) or increase min_samples_leaf.
    • Increase n_estimators to stabilize predictions.
    • Check preprocessing (handle missing values, normalize if needed).

Key point: Random Forest reduces overfitting but hyperparameters must be tuned carefully.

22
Q

If one feature has 1000 levels and another has 3 levels, which is likely to dominate a single tree split, and how does Random Forest mitigate this?

A
  • Single tree: High-cardinality feature (1000 levels) is more likely to dominate splits, introducing bias.
  • Random Forest mitigation:
    1. Feature randomness — only a subset of features considered at each split.
    2. Ensemble averaging — reduces impact of biased splits in individual trees.

Key point: Random Forest reduces, but does not completely eliminate, bias toward high-cardinality features.

23
Q

Given a Random Forest with 100 trees, if a sample is not included in 40 of the bootstrap datasets, how many trees will contribute to its OOB error?

A

OOB error uses only trees that did not see the sample during training.

Here, 40 trees did not include the sample → all 40 contribute to OOB prediction.

Key point: OOB error provides an unbiased internal estimate using unseen samples per tree.

24
Q

List 2 advantages of Random Forest over a single decision tree.

A
  1. Higher accuracy and robustness — ensemble averaging reduces variance.
  2. Less prone to overfitting — bagging and feature randomness decorrelate trees.

Optional bonus: Can handle both classification and regression tasks and provides feature importance.

25
You notice that some trees in your Random Forest are consistently giving poor predictions. What could be the reason?
  1. **Insufficient features considered** at splits → poor split decisions.
  2. **Small bootstrap sample** → the tree didn’t see enough data.
  3. **Tree depth too shallow** → cannot capture patterns.
  4. **Noisy or missing data** affecting this tree more than others.

**Key point:** Individual trees can be weak; Random Forest **averages out errors**, so some poor trees do not heavily affect overall performance.
26
How can Random Forest be used to determine **feature importance**?

  1. Measure how much each feature **reduces impurity** (Gini or variance) across all trees.
  2. Features that contribute more to **splits that improve prediction** have higher importance.
  3. Can also use **permutation importance**: randomly shuffle a feature and observe the **drop in model accuracy**.

**Key point:** Random Forest provides **built-in methods** to rank and interpret feature relevance.
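Permutation importance (point 3) can be sketched by hand, assuming scikit-learn is available. `accuracy_drop` is a hypothetical helper, and the toy data deliberately gives only feature 0 any signal:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)          # only feature 0 carries signal

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

def accuracy_drop(model, X, y, col, rng):
    """Permutation importance: accuracy drop after shuffling one column."""
    baseline = model.score(X, y)
    Xp = X.copy()
    rng.shuffle(Xp[:, col])            # destroy this column's relationship to y
    return baseline - model.score(Xp, y)

drop_signal = accuracy_drop(model, X, y, 0, rng)   # large drop
drop_noise = accuracy_drop(model, X, y, 1, rng)    # near zero
```

Shuffling the informative feature wrecks accuracy; shuffling the noise feature barely matters, which is exactly how the importance ranking emerges.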
27
You have a large dataset with many correlated features. How does Random Forest handle correlated features?
  • Correlated features may lead to some trees relying on one over the other.
  • Random feature selection at splits helps **reduce correlation between trees**.
  • Overall, **ensemble averaging** ensures that correlated features do not dominate predictions.
  • For interpretation, feature importance may be **shared among correlated features**, so be cautious when ranking.

**Key point:** Random Forest is robust to correlated features, but feature importance may be less reliable.
28
How does Random Forest handle **outliers** in the dataset?
  • Individual trees may be influenced by outliers, but **ensemble averaging** reduces their impact on overall predictions.
  • For **regression**, outliers affect the average less as many trees contribute.
  • For **classification**, a few trees misclassifying outliers are outvoted by the majority.
  • Optional preprocessing: remove or cap extreme values to improve robustness.

**Key point:** Random Forest is **naturally robust to outliers** due to averaging across trees.
29
How do you measure the performance of a Random Forest for **regression** tasks?
Common regression metrics:

  1. **Mean Squared Error (MSE):** Average squared difference between predicted and actual values.
  2. **Root Mean Squared Error (RMSE):** Square root of MSE, interpretable in the original units.
  3. **Mean Absolute Error (MAE):** Average absolute difference between predicted and actual values.
  4. **R² (Coefficient of Determination):** Proportion of variance explained by the model (0 ≤ R² ≤ 1 for typical fits).
  5. **OOB R²/error:** Random Forest’s internal estimate using out-of-bag samples.

**Key point:** Choose the metric based on **sensitivity to outliers** (e.g., MAE is less sensitive than MSE).
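These metrics are simple to compute directly; a small numpy sketch, where `regression_metrics` is a hypothetical helper and the toy example shows that predicting the mean of y yields R² = 0:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and R² from their definitions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    mse = float(np.mean(err ** 2))
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "mse": mse,
        "rmse": mse ** 0.5,
        "mae": float(np.mean(np.abs(err))),
        "r2": 1.0 - ss_res / ss_tot,
    }

# Predicting the mean (2.0) for y = [1, 2, 3] gives R² = 0 by construction.
m = regression_metrics([1, 2, 3], [2, 2, 2])
```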
30
What are the main **limitations of Random Forest**?

  1. **Interpretability:** Hard to visualize or explain individual predictions due to many trees.
  2. **Large memory & computation:** Training many trees on large datasets can be slow and memory-intensive.
  3. **Overfitting on noisy data:** Although robust, very noisy data can still affect predictions.
  4. **Feature importance bias:** Can be biased toward features with more levels or high cardinality.
  5. **Limited extrapolation for regression:** Random Forest cannot predict values outside the range seen in training.

**Key point:** Random Forest is powerful but trades **interpretability and efficiency** for accuracy and robustness.
31
You train a Random Forest for a **binary classification** with 1% positive class. Your model predicts all negatives. What happened and how do you fix it?
  • **Problem:** Severe class imbalance → model biased toward the majority class. Accuracy is misleading.
  • **Solutions:**
    1. **Class weighting:** Give higher weight to minority class samples.
    2. **Resampling:** Oversample the minority or undersample the majority.
    3. **Balanced bootstrap:** Ensure each tree sees a more balanced subset.
    4. **Threshold adjustment:** Lower the decision threshold for the positive class.

**Key point:** Random Forest can handle imbalance, but **explicit strategies** are needed for rare classes.
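Strategy 1 (class weighting) maps onto scikit-learn's `class_weight="balanced"` option (assuming scikit-learn is available; the synthetic 1%-positive dataset below is contrived so the positives are separable and recall is measurable):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# 1% positive class: an "always negative" model would look 99% accurate.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = np.zeros(2000, dtype=int)
pos = rng.choice(2000, size=20, replace=False)
y[pos] = 1
X[pos] += 3.0   # shift positives so they are learnable

rf = RandomForestClassifier(n_estimators=100,
                            class_weight="balanced",  # strategy 1 above
                            random_state=0).fit(X, y)
pred = rf.predict(X)
recall = pred[y == 1].mean()   # fraction of true positives recovered
```

Recall, not accuracy, is the number to watch here; accuracy would look excellent even for a model that never predicts the positive class.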
32
After training, your Random Forest **regression model shows high variance**. What metrics would you check, and what steps could you take?
  • **RMSE or MSE** on training vs test set → a large gap indicates overfitting.
  • **R²** on training vs test set → low test R² vs high training R² confirms high variance.
  • **Steps to reduce variance:**
    1. Increase `n_estimators` → stabilize predictions.
    2. Limit `max_depth` → shallower trees generalize better.
    3. Increase `min_samples_split` or `min_samples_leaf` → prevent overfitting on small nodes.
    4. Feature selection or dimensionality reduction → reduce noisy features.

**Key point:** Random Forest reduces variance via averaging, but **hyperparameter tuning** is still essential.
33
What is **variance** in the context of machine learning models?
  • **Variance** measures how much a model’s predictions **change with different training data**.
  • High variance → model **fits training data too closely** (overfitting), performs poorly on unseen data.
  • Low variance → model predictions are **stable** across different datasets.

**Key point:** Random Forest reduces variance by **averaging predictions** of multiple decorrelated trees, improving generalization.
34
What is **classification** in machine learning?
Classification is a supervised learning task where an algorithm learns to assign input data to predefined discrete categories or labels.

Footnote: It is used in applications like email filtering, fraud detection, and medical diagnosis.
35
What does **training** involve in classification?
Training involves presenting the model with labeled examples so it can learn the mapping between features and labels, often by minimizing a loss function such as cross-entropy.

Footnote: This process is critical for the model's learning.
36
What is a **decision boundary** in classification?
A surface (line in 2D, plane in 3D, or hyperplane in higher dimensions) that separates data points of different classes in the feature space.

Footnote: It is essential for understanding how the model classifies data.
37
Which algorithms use **linear decision boundaries**?
  • Logistic Regression
  • Linear Discriminant Analysis (LDA)
  • Support Vector Machines (SVM) with linear kernels

Footnote: These algorithms are effective for linearly separable data.
38
Which algorithms can learn **non-linear decision boundaries**?
  • Decision Trees
  • Random Forests
  • Kernel SVMs
  • Neural Networks

Footnote: These algorithms can capture complex relationships in data.
39
What are common **metrics** for evaluating classification performance?
  • Accuracy (overall correctness)
  • Precision (positive predictive value)
  • Recall (sensitivity)
  • F1-score (harmonic mean of precision and recall)

Footnote: These metrics help assess the effectiveness of classification models.
40
Why do we use the **log function** in **logistic loss**?
Because it transforms probability maximization into a minimization problem, penalizes confident wrong predictions sharply, ensures numerical stability when combining probabilities, and makes the loss convex for easier optimization.

Footnote: The log function is crucial for optimizing models in machine learning.
41
What **numerical benefit** does the **log function** provide in **logistic loss**?
It prevents underflow by turning products of small probabilities into sums of log-probabilities, which are easier and more stable to compute.

Footnote: This stability is important for numerical computations in machine learning.
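The underflow problem is easy to demonstrate with nothing but the standard library (2000 factors of 0.5 is an illustrative choice, deep enough to push the product below the smallest representable float):

```python
import math

# 2000 moderately small probabilities: the raw product underflows to 0.0,
# while the equivalent sum of log-probabilities stays representable.
probs = [0.5] * 2000

product = 1.0
for p in probs:
    product *= p          # drifts below the smallest float, ends at 0.0

log_sum = sum(math.log(p) for p in probs)   # finite: 2000 * ln(0.5)
```

Any likelihood-based loss over many samples faces exactly this product, which is why log-likelihoods (and log loss) are used instead.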
42
Which activation functions are typically paired with cross-entropy loss?
Sigmoid for binary classification and Softmax for multi-class classification.
43
What is the most effective **encoding** for low-cardinality nominal features?
One-hot encoding, especially for linear models or neural networks.

Footnote: This method creates binary columns for each category, making it suitable for models that require numerical input.
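The "one binary column per category" idea can be shown in a few lines of plain Python (`one_hot` is a hypothetical helper; libraries such as scikit-learn and pandas provide production equivalents):

```python
def one_hot(values):
    """Minimal one-hot encoder for a low-cardinality nominal feature:
    one binary column per category, in sorted category order."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]
    return categories, rows

cats, encoded = one_hot(["red", "green", "red", "blue"])
# cats    -> ['blue', 'green', 'red']
# encoded -> [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```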
44
What is the most effective **encoding** for ordinal features?
Label encoding that reflects the order of categories.

Footnote: This method assigns integer values to categories based on their order.
45
What is the most effective **encoding** for high-cardinality features in neural networks?
Embedding layers are most effective for high-cardinality features.

Footnote: Embedding layers help in reducing dimensionality and capturing relationships between categories.
46
What encoding is recommended for **high-cardinality features** in tree-based models?
Label encoding, frequency encoding, or target encoding with careful cross-validation.

Footnote: These methods help maintain the integrity of the data while allowing tree-based models to make splits effectively.
47
What is the key principle for choosing **categorical encoding**?
Choose an encoding that captures the relationship between the feature and target, and is compatible with the model type and dataset size.

Footnote: This ensures that the encoding enhances model performance and interpretability.
48
What is logistic regression?
A supervised learning algorithm used for binary classification that models the probability of a class using the logistic (sigmoid) function.
49
What type of output does logistic regression predict?
It predicts probabilities between 0 and 1 for each class, which can then be thresholded to make binary predictions.
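The probability-then-threshold step can be sketched with the standard library alone (`sigmoid` and `classify` are hypothetical helper names; the 0.5 threshold is the conventional default, not a requirement):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Threshold the predicted probability to obtain a binary label."""
    return int(sigmoid(z) >= threshold)
```

A score of 0 sits exactly on the default decision boundary (probability 0.5); lowering the threshold trades precision for recall, as in the fraud cards earlier.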
50
What does DBSCAN stand for and what is its main purpose?
DBSCAN = Density-Based Spatial Clustering of Applications with Noise. It groups points closely packed together (high density) and marks low-density points as noise (outliers).
51
What does HDBSCAN stand for?
HDBSCAN = Hierarchical Density-Based Spatial Clustering of Applications with Noise. It extends DBSCAN by building a hierarchy of clusters and extracting the most stable ones.
52
Which of the following parameters are required by DBSCAN? A. min_samples and eps B. min_cluster_size and min_samples C. eps and min_cluster_size D. Only min_samples
Answer: A. min_samples and eps
53
How does DBSCAN decide clusters vs noise?
It uses a fixed distance threshold (eps) — points within eps of at least min_samples neighbors form a cluster. Points not reachable are labeled as noise.
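The eps / min_samples mechanics can be seen directly with scikit-learn's DBSCAN (assuming scikit-learn is available; the coordinates are contrived so two tight groups form clusters and one isolated point has no neighbors within eps):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups plus one isolated point.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],          # group A
    [10.0, 10.0], [10.1, 10.0], [10.0, 10.1], [10.1, 10.1],  # group B
    [5.0, 5.0],                                              # isolated
])

# Points with >= min_samples neighbors within eps become core points;
# unreachable points get the special noise label -1.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)
```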
54
How does HDBSCAN handle the need for a fixed eps value?
It removes the need for a global eps by building a hierarchy of clusters over varying density levels and selecting clusters based on stability.
55
(True/False) HDBSCAN can detect clusters with varying densities better than DBSCAN.
Answer: True.
56
Which algorithm tends to perform better on datasets with variable density regions? A. DBSCAN B. HDBSCAN C. Both equally
Answer: B. HDBSCAN
57
What happens to clusters if you set a poor eps in DBSCAN?
Too small eps → many points labeled as noise. Too large eps → clusters merged incorrectly.
58
In terms of complexity, how does HDBSCAN compare to DBSCAN?
HDBSCAN is typically more computationally expensive (O(n log n)) but more robust; DBSCAN is simpler and slightly faster.
59
Which algorithm is better suited for exploratory data analysis when the density structure is unknown? A. DBSCAN B. HDBSCAN
Answer: B. HDBSCAN
60
Why does DBSCAN struggle with varying density levels, and how does HDBSCAN solve this?
In DBSCAN, you must set a global eps (epsilon) — a fixed distance threshold. Points within distance eps are considered neighbors, and a cluster forms when a point has at least min_samples neighbors within eps. ⚠️ The problem: a single eps value works well only if all clusters in your data have similar density. If some clusters are dense and others are sparse, DBSCAN either merges sparse clusters incorrectly or labels many points in sparse regions as noise. ➡️ So DBSCAN can't handle clusters with different densities because eps is global and static. Meanwhile, HDBSCAN builds a hierarchy of clusters over varying density levels and extracts the most stable clusters, so no single global eps is needed and clusters of different densities can be recovered.
61
Question: If latitude range = 180° and longitude range = 360°, which of the following scaling methods correctly balances the axes? A. Latitude ×1, Longitude ×1 B. Latitude ×2, Longitude ×1 C. Latitude ×1, Longitude ×0.5 D. Both B and C
D. Both B and C
62
Question: What is the main problem if we use raw latitude and longitude for distance-based clustering? A. Latitude and longitude are categorical variables B. Differences in longitude dominate distances due to larger range C. Latitude values are always negative D. Euclidean distance does not work on geographic coordinates
B. Differences in longitude dominate distances due to larger range
63
Why can a feature with a much larger range dominate clustering when using Euclidean distance?
Because Euclidean distance sums squared differences for each feature, so a feature with a larger numeric range contributes much more to the distance, making other features almost negligible.
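A quick NumPy check of this effect, with illustrative values (X in 0–10, Y in 0–1000):

```python
import numpy as np

# Two points: a small difference in X (range 0-10), a large one in Y (range 0-1000).
a = np.array([2.0, 100.0])
b = np.array([9.0, 900.0])

# Euclidean distance sums squared per-feature differences.
sq_diff = (a - b) ** 2
y_share = sq_diff[1] / sq_diff.sum()   # fraction of the distance driven by Y

# Min-max scaling each feature by its range restores balance.
ranges = np.array([10.0, 1000.0])
sq_diff_s = ((a / ranges) - (b / ranges)) ** 2
```

Before scaling, Y accounts for essentially all of the distance; after scaling, both features contribute comparably.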
64
You have two features, X in range 0–10 and Y in range 0–1000. You run K-means on raw data. Which happens? A. X and Y contribute equally to clustering B. Y dominates clustering → clusters reflect mostly Y C. X dominates clustering → clusters reflect mostly X D. Clustering is random
✅ Answer: B. Y dominates clustering → clusters reflect mostly Y
65
(True/False) If a large-range feature dominates, clustering results may ignore differences in smaller-range features, causing “wrong clusters.”
True
66
Q: How can you fix the problem of large-range features dominating distances? A. Remove the large-range feature B. Scale all features to comparable ranges C. Use only the large-range feature D. Increase the number of clusters
B
67
Question: DBSCAN can find clusters of arbitrary shape, while K-Means assumes spherical clusters.
True. K-Means partitions data into convex, roughly spherical clusters because it minimizes variance within each cluster. DBSCAN identifies dense regions, allowing for arbitrary shapes and noise detection.
68
Question: What is the main purpose of dimensionality reduction?
Answer: To reduce the number of input features while retaining the most important information. It simplifies models, removes redundant or noisy features, helps visualization in 2D/3D, and can prevent overfitting
69
PCA reduces dimensionality by: A. Selecting random features B. Projecting data onto directions of maximum variance C. Clustering features D. Scaling all features equally
B. Explanation: PCA computes principal components (orthogonal axes) that capture the largest variance in the data. It projects the data onto these components, keeping only the top ones to reduce dimensionality.
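A short sketch of this, assuming scikit-learn; the data is synthetic, with almost all variance along one axis:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Almost all variance lies along the first axis.
X = np.column_stack([rng.normal(0.0, 10.0, size=200),
                     rng.normal(0.0, 0.1, size=200)])

ratios = PCA(n_components=2).fit(X).explained_variance_ratio_
# The first principal component captures nearly all the variance.
```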
70
What’s the main difference between PCA and t-SNE?
PCA: Linear, preserves global structure (variance across all samples). t-SNE: Nonlinear, preserves local structure (pairwise neighborhood relationships). Thus, PCA is better for feature compression, t-SNE for visualizing clusters.
71
Question: Which of the following is not a feature engineering technique? A. One-hot encoding B. Normalization C. K-Means clustering D. Feature extraction via PCA
Answer: C. Explanation: K-Means is a clustering algorithm, not a feature engineering technique. However, you can use K-Means results (like cluster labels) as new features — which becomes feature engineering.
72
Why might we apply PCA before K-Means? A. To make K-Means faster and more stable B. To increase the number of clusters C. To visualize high-dimensional data D. Both A and C
Answer: D. Explanation: PCA reduces dimensionality, making K-Means computationally cheaper and less sensitive to noise. It also allows visualization of clusters in 2D or 3D.
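The PCA-then-K-Means combination can be sketched as a pipeline, assuming scikit-learn (the 5-dimensional blobs are synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
# Two well-separated blobs in 5 dimensions.
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 5)),
               rng.normal(10.0, 0.5, size=(50, 5))])

# PCA first shrinks 5 features to 2, then K-Means clusters the projection.
pipe = make_pipeline(PCA(n_components=2),
                     KMeans(n_clusters=2, n_init=10, random_state=0))
labels = pipe.fit_predict(X)
```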
73
Question: What are the benefits of using clustering for feature selection compared to correlation filtering?
Clustering groups features by global similarity patterns, not just pairwise correlation. It can capture nonlinear relationships. It’s more robust in high dimensions where correlations alone may be misleading.
74
Which workflow best combines the three concepts? A. Clustering → Dimensionality reduction → Feature engineering B. Dimensionality reduction → Clustering → Use cluster labels as engineered features C. Feature engineering → Dimensionality reduction → Drop clustering
Answer: B. Explanation: Reducing dimensions (e.g., via PCA) simplifies data → clustering reveals structure → cluster labels are used as additional engineered features to enhance model performance.
75
(True/False) Clustering is a supervised learning technique.
❌ A: False — it’s unsupervised (no labels are used).
76
What is hierarchical clustering?
It builds a tree (dendrogram) of clusters by recursively merging or splitting groups. Two types: Agglomerative (bottom-up): Start with each point as its own cluster, then merge. Divisive (top-down): Start with one cluster, then split recursively. You can “cut” the tree at a chosen level to define the number of clusters. → Used in biology for gene expression analysis or document taxonomy.
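A minimal agglomerative example, assuming scikit-learn, where cutting the tree at two clusters recovers the two groups:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],    # group A
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])   # group B

# Agglomerative (bottom-up): each point starts as its own cluster and the
# closest pair of clusters merges repeatedly; n_clusters=2 "cuts the tree".
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
```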
77
What is density-based clustering?
A: Algorithms like DBSCAN or HDBSCAN find clusters as areas of high point density separated by sparse regions. They automatically detect noise/outliers and handle clusters of irregular shape. → Used in spatial data (e.g., detecting hotspots of crimes or disease spread).
78
Q6: What are typical real-world use cases for clustering?
Marketing: Segment customers by demographics or spending patterns (K-Means). Healthcare: Cluster patients by symptoms or gene expression (Hierarchical). Cybersecurity: Detect anomalous network traffic (DBSCAN / HDBSCAN). Finance: Group similar investment profiles (GMM). NLP / Vision: Cluster embeddings from deep models (UMAP + HDBSCAN).
79
What is PCA and when is it used best?
Principal Component Analysis (PCA) is a linear technique that projects data onto orthogonal directions (principal components) capturing maximum variance. Best for linearly correlated features. Reduces redundancy and speeds up clustering. → Example: Before K-Means, to stabilize centroids and visualize data in 2D.
80
What is t-SNE and when is it used?
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear manifold learning technique preserving local structure (neighborhoods). Ideal for visualization in 2D/3D of high-dimensional clusters. Sensitive to parameters (perplexity, learning_rate). → Example: Visualizing clusters of image embeddings or word embeddings after deep learning models.
81
What is UMAP and how is it different from t-SNE?
UMAP (Uniform Manifold Approximation and Projection): Nonlinear and faster than t-SNE. Preserves both local and some global structure. Scales better for large datasets and can be used before clustering. → Example: UMAP + HDBSCAN for clustering text embeddings or genetic data.
82
What type of supervised learning model predicts probabilities for each class and uses a sigmoid function?
Logistic Regression.
83
: Which algorithm directly uses the proximity (distance) between data points for classification?
KNN
84
: Which algorithm splits data into branches using feature thresholds?
Decision Trees
85
Which algorithm finds a hyperplane that maximizes the margin between two classes?
SVM
86
How does KNN differ from Logistic Regression in terms of decision boundaries?
KNN: Can produce nonlinear boundaries depending on the local structure of data. Logistic Regression: Produces a linear boundary unless features are transformed.
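A small sketch of this difference on XOR data, which no linear boundary can separate, assuming scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# XOR labels: no single straight line separates the two classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

knn_acc = KNeighborsClassifier(n_neighbors=1).fit(X, y).score(X, y)
lr_acc = LogisticRegression().fit(X, y).score(X, y)
# KNN adapts to local structure and fits XOR perfectly;
# a linear boundary can classify at most 3 of the 4 points.
```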
87
Which model can handle both linear and nonlinear classification problems?
SVM (with kernel trick) and Decision Tree.
88
Which model is most sensitive to irrelevant features or scaling issues?
KNN — because distance measures can be distorted by unscaled or noisy features.
89
A company wants to predict whether a customer will purchase a product based on how close they are to other similar customers. Which model should they use?
K-Nearest Neighbors (KNN), since it classifies based on proximity to similar data points.
90
You have a dataset with thousands of samples and want a fast prediction model after training. Which algorithm is least suitable?
KNN, because it’s computationally expensive at prediction time.
91
You want to understand which features influence your binary classification results the most. Which model is easiest for interpretation?
Logistic Regression — it provides interpretable coefficients that show feature influence.
92
Your dataset is small but contains complex nonlinear relationships. Which model would likely perform best?
Decision Tree or SVM (with nonlinear kernel).
93
You have noisy data with many irrelevant features. Which algorithm might overfit most easily?
Decision Tree — prone to overfitting unless pruned or regularized.
94
(True/False) Logistic Regression is suitable only for regression problems.
❌ False — despite the name, it’s used for classification.
95
KNN is a parametric algorithm.
❌ False — it’s non-parametric because it makes no assumptions about data distribution.
96
SVM only works for linearly separable data.
❌ False — it can use kernels to handle nonlinear data.
97
Decision Trees can handle both numerical and categorical features.
✅ True.
98
KNN requires a training phase to build a model before making predictions.
❌ False — it’s a lazy learner that does not build a model during training.
99
Suppose you have customer data with geographic coordinates, and you want to classify whether a new customer is likely to buy based on nearby customers’ behavior. Which model best suits this problem, and why?
K-Nearest Neighbors (KNN) — because it directly uses distance (e.g., Euclidean distance) between customers to classify the new one based on similar nearby examples.
100
Q1: Which regression model is used to model constant growth over time?
A1: Linear Regression.
101
Between logarithmic and exponential regression, which one decelerates over time?
Logarithmic Regression.
102
What is the Elbow Method in clustering?
A6: A heuristic for choosing the optimal number of clusters (k) by plotting the sum of squared distances (inertia) vs. k and looking for the “elbow” where improvement slows.
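A sketch of the elbow computation on three synthetic blobs, assuming scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Three well-separated 2-D blobs, so the elbow should appear at k=3.
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (0.0, 5.0, 10.0)])

inertias = {}
for k in range(1, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_   # sum of squared distances to the closest center
# Inertia drops sharply up to k=3, then improvement flattens: the elbow.
```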
103
A product’s sales increase quickly at launch but then plateau. Which regression should be considered?
Logarithmic Regression.
104
Q13: A small company’s revenue grows steadily by $10,000 every month. Which regression is suitable?
A13: Linear Regression.
105
What is precision?
A3: Precision = TP / (TP + FP), the proportion of true positives among all predicted positives.
106
What is recall (sensitivity)?
A4: Recall = TP / (TP + FN), the proportion of true positives among all actual positives.
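The two formulas side by side, with illustrative confusion-matrix counts:

```python
# Illustrative confusion-matrix counts from some classifier.
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)   # of all predicted positives, how many were right
recall = tp / (tp + fn)      # of all actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
# precision = 0.8, recall ~= 0.667, f1 ~= 0.727
```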
107
Q8: When is precision more important than recall?
A8: When false positives are costly (e.g., spam detection).
108
Q9: When is recall more important than precision?
A9: When false negatives are costly (e.g., disease detection).
109
Q12: A cancer detection model flags 90 out of 100 patients with cancer correctly but also misclassifies 50 healthy patients as sick. Should the doctor prioritize precision, recall, or F1-score?
A12: Recall is critical because missing cancer patients (FN) is more dangerous than false positives.
110
In email spam detection, users complain about too many normal emails being marked as spam. Which metric should you optimize?
A13: Precision — reduce false positives.
111
Q14: A fraud detection system has 99% accuracy, but the dataset has 99% non-fraud cases. Why is this misleading?
A14: The model predicts almost all as non-fraud, achieving high accuracy but failing to detect fraud (low recall for minority class).
112
Q11: You predict housing prices in dollars. Your model has MAE = 5,000. What does this mean?
A11: On average, the predicted house price differs from the actual price by $5,000.
113
Q12: Your model has RMSE = 8,000 and MAE = 5,000. What does the difference suggest?
A12: There are some large errors because RMSE > MAE (RMSE penalizes large errors more).
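A quick numeric check of this effect, with made-up predictions containing one large error:

```python
import numpy as np

y_true = np.array([100.0, 100.0, 100.0, 100.0])
y_pred = np.array([101.0, 99.0, 100.0, 110.0])   # one large error

errors = y_pred - y_true
mae = np.abs(errors).mean()            # 3.0
rmse = np.sqrt((errors ** 2).mean())   # ~5.05: the squared 10 dominates
```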
114
Q13: You fit a regression model with many features. R² = 0.95, but Adjusted R² = 0.75. What does this indicate?
A13: Some features do not contribute meaningfully; the model may overfit.
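The adjustment formula makes this concrete; the sample and predictor counts below are illustrative:

```python
# Adjusted R^2 penalizes each extra predictor:
#   adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With 30 samples and 23 predictors, R^2 = 0.95 shrinks to about 0.76.
adj = adjusted_r2(0.95, n=30, p=23)
```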
115
Q14: Which metric would you use to compare models across different datasets with different scales?
A14: R² (scale-independent).
116
Q19: Adjusted R² can decrease when adding a predictor.
A19: ✅ True — it penalizes irrelevant features.
117
Q1: Why is evaluating unsupervised learning models more challenging than supervised learning?
A1: Because there are no labeled outputs to directly compare predictions against, so evaluation relies on internal structure, heuristics, or external proxies.
119
Q2: Name three common evaluation approaches for unsupervised learning.
A2: (1) Internal evaluation (model-intrinsic measures, e.g., cohesion and separation); (2) External evaluation (when ground-truth labels exist, e.g., Adjusted Rand Index); (3) Relative evaluation / heuristics (comparison between models, cross-validation, stability, or visualization).
120
What does a low inertia indicate?
Points are closer to their cluster centers, meaning clusters are tight and cohesive.
121
Q8: You compute inertia for k=2,3,4 clusters and get: 1200, 800, 790. What does the elbow method suggest?
A8: k=3 clusters, because adding a fourth cluster reduces inertia only slightly.
122
What is cross-validation?
A1: A technique to evaluate a model’s generalization ability by splitting the dataset into training and validation subsets multiple times.
123
What is k-fold cross-validation?
The dataset is divided into k equal folds. The model is trained on k-1 folds and validated on the remaining fold. This is repeated k times, and results are averaged.
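A sketch of the fold structure, assuming scikit-learn's KFold:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)   # 10 samples

splits = list(KFold(n_splits=5).split(X))
# 5 iterations: each trains on 8 samples and validates on 2,
# and every sample appears in exactly one validation fold.
```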
124
Q3: What is leave-one-out cross-validation (LOOCV)?
A3: Special case of k-fold where k = n (number of samples). Each sample is used once as the validation set, and the rest as training.
125
What is stratified k-fold cross-validation?
A version of k-fold CV that preserves the class distribution in each fold, useful for imbalanced classification problems.
126
Q6: Why is cross-validation preferred over a simple train-test split?
A6: It reduces variance due to random splits and gives a more robust estimate of model performance.
127
Q8: Compare k-fold CV and LOOCV.
k-fold: Fewer iterations, less computationally expensive, slightly higher bias. LOOCV: One iteration per sample, very low bias, high computational cost, can have high variance.
128
Q12: You are tuning hyperparameters for an SVM. How should you estimate its generalization performance?
A12: Use nested cross-validation — inner loop for tuning, outer loop for performance evaluation.
129
Q22: Why is nested CV preferred when tuning hyperparameters in small datasets?
It prevents data leakage from tuning into the evaluation set, providing an unbiased performance estimate.
130
What is data leakage?
A1: Data leakage occurs when a model has access to information it wouldn’t have in a real prediction scenario, leading to overestimated performance.
131
What is target leakage?
When a feature contains information directly related to the target, which wouldn’t be available at prediction time.
132
Give an example of target leakage.
Predicting loan default using a feature “Has account been written off?” — it directly indicates default.
133
What is temporal leakage?
Using future information to predict past or present, which is not available at prediction time.
134
If a model performs unusually well on a small dataset, it may indicate data leakage.
✅ True.
135
Data leakage can only happen in supervised learning tasks.
❌ False — it can also occur in unsupervised preprocessing or feature engineering.
136
You normalize a feature on the full dataset before splitting into train/test. Is this data leakage?
✅ Yes — the test set influenced training.
137
How can you prevent train-test contamination?
Always split dataset first, then apply preprocessing only on training data. Apply same transformations to test data.
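A minimal sketch of the leak-free pattern, with made-up numbers:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
train, test = data[:4], data[4:]   # split FIRST

# Compute scaling statistics on the training split only...
mu, sigma = train.mean(), train.std()
train_scaled = (train - mu) / sigma
# ...then apply the SAME statistics to the test split.
test_scaled = (test - mu) / sigma

# Leaky alternative (avoid): fitting on `data` would let the extreme
# test value 100.0 shift the statistics seen during training.
```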
138
How does nested CV prevent leakage?
A15: By keeping the outer validation set separate from hyperparameter tuning, avoiding leakage from tuning into evaluation.
139
Q16: How can you avoid target leakage?
A16: Exclude features that directly include target information or are only available after the outcome occurs.
140
Q17: What should you do for time-series data to prevent leakage?
A17: Ensure training data precedes validation/test data chronologically, never using future information for past prediction.
141
Q19: You are building a credit default model. One candidate feature is “monthly payment made last month”. The target is “will default next month?”. Is this safe?
✅ Safe — past information is allowed. Only future information or features including the target itself would cause leakage.
142
You notice your model achieves 99% accuracy on a small imbalanced dataset. What should you check first?
A20: Check for data leakage, e.g., target leakage, preprocessing contamination, or temporal leakage.
143
Dataset: 100 samples. Model: SVM with C and gamma hyperparameters. How do standard k-fold CV and nested CV work here?
Standard CV (k=5): Split 100 samples into 5 folds. Try multiple combinations of C and gamma on the same CV folds. Compute average accuracy → may overestimate true performance.

Nested CV (k_outer=5, k_inner=3): Outer fold: 20 samples as test, 80 samples as training. Inner fold: split 80 training samples into 3 folds. Try all hyperparameter combinations using inner CV. Select best hyperparameters, train on all 80 training samples. Evaluate on 20 outer test samples (never used in tuning). Repeat for all outer folds → average performance → unbiased estimate.

| Step | Action |
| --- | --- |
| Outer Fold 1 | 20% of data = outer validation (test) |
| Outer Training Set | 80% of data → used in inner loop |
| Inner Loop | Split 80% into 3 folds, train/test for hyperparameter tuning |
| Train Best Model | On full 80% with best hyperparameters |
| Evaluate | On 20% outer validation fold |
| Repeat | For all 5 outer folds → average performance |
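In scikit-learn this nesting can be sketched with GridSearchCV inside cross_val_score (using the iris dataset for illustration; the card's 100-sample setup is analogous):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

# Inner loop (cv=3): hyperparameter tuning only.
inner = GridSearchCV(SVC(), param_grid, cv=3)
# Outer loop (cv=5): scores folds the tuning never saw.
outer_scores = cross_val_score(inner, X, y, cv=5)
```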
144