Imagine you work at a major credit card company and are given a dataset of 600,000 credit card transactions to build a fraud detection model.
How would you approach this task? Write your answer in the comments.
1. Understanding the Data and Preprocessing:
* Explore data: check distributions, missing values, outliers.
3. Training and Validation
* Split data into train/validation/test
* Use metrics suitable for imbalanced data: Precision, Recall, F1, ROC-AUC, PR-AUC.
4. Handling Imbalance & Bias:
* Use weighted loss functions (give the minority fraud class a higher weight in the loss) or sampling strategies (e.g., SMOTE oversampling, …) to account for rare fraud cases.
* Monitor for biases (e.g., merchant type, geography).
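The weighted-loss idea above can be sketched with a small helper that mirrors scikit-learn's `class_weight='balanced'` heuristic; the function name and the 1% fraud rate are illustrative:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency, as in
    scikit-learn's class_weight='balanced': n / (k * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# With 1% fraud, fraud examples get 99x the weight of legitimate ones
labels = [0] * 99 + [1]
weights = balanced_class_weights(labels)
```

These weights would then scale each example's contribution to the loss during training.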
5. Deployment & Monitoring:
What does it mean to track model drift and update periodically in fraud detection systems?
Model drift occurs when the data distribution changes over time, causing the model’s performance to degrade.
How to track: monitor live precision/recall on confirmed outcomes, and compare live feature and score distributions against the training-time baseline (e.g., with a stability index).
Updating the model: retrain periodically on recent labeled data, or trigger retraining when drift metrics cross a threshold.
Purpose: fraud patterns evolve as fraudsters adapt, so a static model steadily loses effectiveness.
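One concrete drift-tracking tool is the Population Stability Index (PSI) between a feature's (or score's) training-time distribution and its live distribution; a common rule of thumb flags PSI > 0.25 as significant drift. A minimal stdlib sketch, with illustrative binning and floor constant:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline (training)
    sample and a live (production) sample of the same variable."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def bin_fractions(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / width), bins - 1)
            counts[max(i, 0)] += 1
        # floor at a tiny value to avoid log(0)
        return [max(c / len(xs), 1e-6) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]
shifted = [x + 0.5 for x in baseline]   # simulated drifted distribution
```

`psi(baseline, baseline)` is ~0, while the shifted sample lands well above the 0.25 alert level.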
Let’s say your manager asks you to build a model with a neural network to solve a business problem.
How would you justify the complexity of building such a model and explain the predictions to non-technical stakeholders?
Justify complexity: Neural networks capture complex patterns and can improve accuracy over simpler models.
Explain predictions: Use visualizations (SHAP, feature importance) and show which factors influence predictions, focusing on business impact rather than technical details (e.g., “This model helps identify high-value customers likely to churn, enabling targeted retention campaigns”).
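When SHAP is unavailable, permutation importance gives a model-agnostic ranking of which factors drive predictions: shuffle one feature and measure the accuracy drop. A toy stdlib sketch; the two-feature "model" is illustrative:

```python
import random

def permutation_importance(predict, X, y, col, n_repeats=10, seed=0):
    """Mean accuracy drop after shuffling one feature column."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    base = accuracy(X)
    drops = []
    for _ in range(n_repeats):
        vals = [r[col] for r in X]
        rng.shuffle(vals)
        shuffled = [r[:col] + [v] + r[col + 1:] for r, v in zip(X, vals)]
        drops.append(base - accuracy(shuffled))
    return sum(drops) / n_repeats

# Toy "model" that only looks at feature 0
predict = lambda row: int(row[0] > 0.5)
X = [[i / 10, random.random()] for i in range(10)]
y = [predict(r) for r in X]

imp_used = permutation_importance(predict, X, y, col=0)
imp_unused = permutation_importance(predict, X, y, col=1)
```

The ignored feature scores zero importance, which is exactly the kind of contrast that is easy to show stakeholders.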
Let’s say that you’re training a classification model.
How would you combat overfitting when building tree-based models?
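Typical defenses are pre-pruning (cap `max_depth`, raise `min_samples_leaf` / `min_samples_split`), cost-complexity post-pruning, more training data, and ensembling (e.g., Random Forest). The effect of pre-pruning can be seen schematically by counting how many nodes a tree is allowed to grow; the split-in-half rule below is a deliberate simplification:

```python
def node_count(n_samples, depth=0, max_depth=3, min_samples_leaf=5):
    """Schematic tree growth: stop splitting (pre-pruning) once the
    depth or leaf-size limit is hit; return the number of nodes."""
    if depth >= max_depth or n_samples < 2 * min_samples_leaf:
        return 1  # this node becomes a leaf
    left = n_samples // 2
    return (1
            + node_count(left, depth + 1, max_depth, min_samples_leaf)
            + node_count(n_samples - left, depth + 1, max_depth, min_samples_leaf))

# Stricter limits -> smaller tree -> less capacity to memorize noise
small = node_count(1000, max_depth=3)
large = node_count(1000, max_depth=8)
```

Tuning these limits with cross-validation picks the tree size that generalizes best rather than the one that fits training data best.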
Let’s say you work as a data scientist at a bank.
You are tasked with building a decision tree model to predict if a borrower will pay back a personal loan they are taking out.
How would you evaluate whether using a decision tree algorithm is the correct model for the problem?
Let’s say you move forward with the decision tree model. How would you evaluate the performance of the model before deployment and after?
Evaluating suitability of a decision tree: check that the data is structured/tabular, that interpretability matters (it does for loan decisions, including regulatory explainability), and that non-linear feature interactions are present; compare against a simple baseline such as logistic regression.
Performance evaluation before deployment: cross-validate on held-out data with imbalance-aware metrics (precision, recall, F1, ROC-AUC) and sanity-check the tree's splits with domain experts.
Performance evaluation after deployment: monitor precision/recall on realized loan outcomes, track input and score drift, and retrain when performance degrades.
Summary:
Decision trees are suitable for structured, interpretable predictions. Evaluate with cross-validation and proper metrics before deployment, and continuously monitor performance and drift after deployment.
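"Evaluate with cross-validation" means every observation serves in a test fold exactly once. A stdlib sketch of the index bookkeeping (the last fold simply absorbs the remainder):

```python
def kfold_indices(n, k=5):
    """Yield (train, test) index lists for k-fold cross-validation."""
    fold = n // k
    for i in range(k):
        start = i * fold
        stop = (i + 1) * fold if i < k - 1 else n  # last fold takes the remainder
        test = list(range(start, stop))
        test_set = set(test)
        train = [j for j in range(n) if j not in test_set]
        yield train, test

folds = list(kfold_indices(12, k=5))
```

In practice a library routine (e.g., a stratified splitter, to preserve the default rate in each fold) would replace this, but the contract is the same: disjoint train/test indices covering the data.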
Let’s say that you work at a bank that wants to build a model to detect fraud on the platform.
The bank wants to implement a text messaging service in addition that will text customers when the model detects a fraudulent transaction in order for the customer to approve or deny the transaction with a text response.
How would we build this model?
1. Data Preparation: clean and label historical transactions, engineer features (amount, merchant, time, location), and address class imbalance.
2. Model Selection & Training: start with tree-based models (Random Forest, gradient boosting); train with imbalance-aware weighting and validate on held-out data.
3. Fraud Scoring & Thresholding: score each transaction in real time and choose alert thresholds from the precision/recall trade-off.
4. Text Messaging Service: Integrate with an SMS provider (e.g., Twilio) to send an alert when a transaction is flagged and capture the customer's approve/deny reply.
5. Feedback Loop & Model Updating: feed customers' approve/deny responses back as labels and retrain periodically.
6. Security & Compliance: protect customer PII, secure the SMS channel, and follow applicable regulations (e.g., PCI DSS).
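Steps 3 and 4 above reduce to mapping each fraud score to an action; the thresholds below are illustrative and would be tuned on the precision/recall trade-off:

```python
def route_transaction(fraud_score, sms_threshold=0.7, block_threshold=0.95):
    """Map a model's fraud probability to an action (thresholds illustrative)."""
    if fraud_score >= block_threshold:
        return "block"        # near-certain fraud: decline outright
    if fraud_score >= sms_threshold:
        return "sms_verify"   # ask the customer to approve/deny by text
    return "approve"          # low risk: let it through
```

The `"sms_verify"` branch is where the SMS-provider integration (e.g., Twilio) would hook in, and the customer's reply becomes a label for the feedback loop in step 5.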
What is feature engineering in machine learning?
Feature engineering is the process of creating, transforming, or selecting input variables (features) from raw data to improve a model’s performance and help it learn more effectively.
Purpose: expose signal that is hidden in the raw data so the model can learn it more easily.
Examples: extracting hour-of-day from a timestamp, ratios (transaction amount vs. a customer's average), aggregations (transactions in the last 24 hours), and encodings of categorical variables.
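As a concrete sketch, raw transaction fields can be turned into the kinds of features listed above; the field names and feature set here are illustrative:

```python
from datetime import datetime

def engineer_features(txn, history):
    """Derive model features from one raw transaction plus the
    customer's recent transactions (illustrative feature set)."""
    ts = datetime.fromisoformat(txn["timestamp"])
    amounts = [h["amount"] for h in history] or [txn["amount"]]
    avg_amount = sum(amounts) / len(amounts)
    return {
        "hour_of_day": ts.hour,                       # extraction from timestamp
        "is_night": int(ts.hour < 6),                 # binning / indicator
        "amount_vs_avg": txn["amount"] / avg_amount,  # ratio feature
        "recent_txn_count": len(history),             # aggregation
    }

features = engineer_features(
    {"timestamp": "2024-01-01T03:30:00", "amount": 200.0},
    [{"amount": 100.0}, {"amount": 100.0}],
)
```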
What is a Random Forest and how does it work?
Why does it generalize better than a single decision tree?
A Random Forest is an ensemble learning algorithm that builds multiple decision trees and combines their outputs to improve accuracy and reduce overfitting.
It works by using bagging (bootstrap aggregation) — each tree is trained on a random sample of the data (with replacement), and at each node split, it selects a random subset of features instead of using all features.
This introduces diversity among trees, reducing correlation and variance.
The core idea is that while individual trees may overfit, their ensemble average generalizes much better.
In short: Random Forest = many decorrelated trees → aggregated → stable, accurate model.
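The two aggregation mechanics, bootstrap sampling and majority voting, look like this in miniature (the helpers are illustrative, not a real library API):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Each tree trains on n draws with replacement (bagging)."""
    return [rng.choice(data) for _ in data]

def majority_vote(tree_predictions):
    """Classification forests aggregate by majority vote."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(42)
sample = bootstrap_sample(list(range(10)), rng)
verdict = majority_vote(["fraud", "legit", "fraud"])
```

Each bootstrap sample is the same size as the original data but typically repeats some rows and omits others, which is what makes the trees diverse.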
How does a Random Forest differ from a single decision tree? Is it easier to interpret?
A single decision tree builds a model by recursively splitting data on the most informative features. While it’s easy to interpret, it tends to overfit — capturing noise and leading to high variance.
In contrast, a Random Forest is an ensemble of many decision trees trained on different bootstrap samples and random feature subsets. It combines their predictions (via majority vote or averaging), which:
- Reduces overfitting by averaging out noise.
- Improves generalization due to decorrelation among trees.
- Increases robustness and predictive accuracy.
Key differences:
| Aspect | Decision Tree | Random Forest |
| ---------------- | ------------- | ---------------------------------- |
| Model type | Single tree | Ensemble of many trees |
| Variance | High | Low |
| Bias | Low | Slightly higher but better balance |
| Overfitting | Common | Greatly reduced |
| Interpretability | High | Lower |
| Accuracy | Moderate | Higher |

Summary:
> A Random Forest sacrifices interpretability for higher stability and performance by combining many independent trees.
Main advantage of using Random Forest?
Random Forests offer several advantages that make them one of the most widely used ensemble methods in practice:
* High accuracy with little tuning and a low risk of overfitting.
* Robustness to noise, outliers, and irrelevant features.
* Built-in feature importance and out-of-bag error estimation.
* Works for both classification and regression.
In summary:
> Random Forests combine simplicity, robustness, and high predictive power — a strong default choice for many ML problems.
How does Random Forest achieve feature randomness?
Random Forest introduces feature randomness to decorrelate trees and prevent any single feature from dominating all splits.
How it works:
* At each node split, a tree considers only a random subset of the features (controlled by a hyperparameter such as `max_features`).
Why it matters:
In short:
> Feature randomness + bagging → decorrelated trees → lower variance and better generalization.
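Concretely, at each split only a random subset of the features is eligible; a common classification default is the square root of the feature count (mirroring scikit-learn's `max_features='sqrt'`):

```python
import random

def features_for_split(n_features, rng):
    """Random feature subset considered at one split;
    sqrt(n_features) is the usual classification default."""
    k = max(1, int(n_features ** 0.5))
    return rng.sample(range(n_features), k)

rng = random.Random(7)
# With 16 features, each split sees only 4 of them
chosen = features_for_split(16, rng)
```

Because a new subset is drawn at every split, no single strong feature can dominate every tree, which is what decorrelates them.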
🌲 What is out-of-bag (OOB) error in Random Forest?
Answer:
Out-of-Bag (OOB) error is an internal validation metric in Random Forests that estimates model performance without a separate test set.
How it works: each tree's bootstrap sample leaves out roughly one-third of the rows; those out-of-bag rows are predicted by that tree, and aggregating these predictions across the forest yields the OOB error.
Why it’s useful: it provides a near-unbiased estimate of generalization error for free, without holding out a separate validation set.
In short:
> OOB error = unbiased internal test error estimated from unused samples for each tree.
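The mechanics can be simulated: each bootstrap sample misses roughly (1 - 1/n)^n ≈ 36.8% of the rows, and exactly those trees form a row's OOB committee. The sample and tree counts below are illustrative:

```python
import random

rng = random.Random(0)
n_samples, n_trees = 200, 50

# Which rows each tree saw: n draws with replacement per tree
in_bag = [{rng.randrange(n_samples) for _ in range(n_samples)}
          for _ in range(n_trees)]

# A row's OOB prediction uses only the trees that never saw it
oob_trees = {i: [t for t in range(n_trees) if i not in in_bag[t]]
             for i in range(n_samples)}

avg_oob_fraction = (sum(len(v) for v in oob_trees.values())
                    / (n_samples * n_trees))
```

On average each row is out-of-bag for about 37% of the trees, so every row gets a sizable committee "for free".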
Are Random Forests biased towards attributes with more levels? Explain your answer.
Yes, Random Forests can exhibit slight bias toward attributes with more levels, but much less than a single decision tree.
Explanation:
Practical note:
In short:
> Random Forest reduces, but does not fully eliminate, bias toward high-level categorical attributes.
How do you handle missing values in a Random Forest model?
Random Forests are relatively robust to missing data, and there are multiple strategies to handle them:
* Impute before training (e.g., median/mode or model-based imputation).
* Some implementations (e.g., R's `randomForest`) use surrogate splits, where alternative splits are used if a value is missing.
Key point:
> Random Forests tolerate missing data better than single trees, but explicit handling or imputation generally improves accuracy.
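A minimal imputation pass looks like this (here using the upper median for an even count; `None` marks a missing value):

```python
def impute_median(rows, col):
    """Replace missing values (None) in one column with the
    column's median (upper median for an even count)."""
    present = sorted(r[col] for r in rows if r[col] is not None)
    median = present[len(present) // 2]
    return [r[:col] + [median] + r[col + 1:] if r[col] is None else r
            for r in rows]

data = [[1.0, 5], [None, 6], [3.0, 7]]
filled = impute_median(data, col=0)
```

Median imputation is robust to outliers; mode imputation plays the same role for categorical columns.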
What is the difference between bagging and boosting?
Bagging trains models independently on bootstrap samples and aggregates them by voting or averaging; boosting trains models sequentially, with each new model correcting the errors of the ones before it.
Key point: Bagging = parallel + variance reduction; Boosting = sequential + bias reduction.
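The structural difference shows up directly in how predictions are combined; the stage functions and learning rate below are illustrative:

```python
def bagging_predict(models, x):
    """Bagging: independent models, combined by averaging."""
    return sum(m(x) for m in models) / len(models)

def boosting_predict(stages, x, learning_rate=0.5):
    """Boosting: sequential stages, each nudging the running
    prediction toward the remaining error."""
    prediction = 0.0
    for stage in stages:
        prediction += learning_rate * stage(x)
    return prediction

avg = bagging_predict([lambda x: 1.0, lambda x: 3.0], x=0)
boosted = boosting_predict([lambda x: 2.0, lambda x: 1.0], x=0)
```

Averaging independent models shrinks variance; summing scaled sequential corrections shrinks bias.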
Explain why Random Forest reduces overfitting compared to a single decision tree.
In short: Ensemble averaging stabilizes predictions and prevents overfitting.
True or False: Random Forest always uses all features at each split.
❌ False
✅ At each split it considers only a random subset of features (controlled by `max_features`) to decorrelate trees and improve generalization.

True or False: Increasing `n_estimators` always guarantees better accuracy.
❌ False
✅ Accuracy plateaus once there are enough trees; beyond that point, increasing `n_estimators` increases training time and memory usage without significant accuracy gain.

Compare Random Forest with Gradient Boosting in terms of **bias and variance**.
| Aspect | Random Forest | Gradient Boosting |
| --------------- | ----------------------------------------- | ---------------------------------------------------------- |
| Training | Trees trained **independently** (bagging) | Trees trained **sequentially**, correcting previous errors |
| Bias | Moderate bias | Low bias (sequential correction) |
| Variance | Low variance (averaging reduces it) | Higher variance; can overfit if too many trees |
| Parallelization | Can train trees in parallel | Sequential → slower training |
| Robustness | More robust to noise/outliers | Sensitive to noise/outliers |

Key takeaway: Random Forest reduces variance, Boosting reduces bias.
Compare Random Forest and a single decision tree in **interpretability, accuracy, and robustness**.
| Aspect | Single Decision Tree | Random Forest |
| ---------------- | --------------------------- | ----------------------------------------- |
| Interpretability | High (easy to visualize) | Low (ensemble of many trees) |
| Accuracy | Moderate | Higher (ensemble reduces variance) |
| Robustness | Sensitive to noise/outliers | More robust due to averaging |
| Overfitting | Prone | Reduced by bagging and feature randomness |

Summary:
> Random Forest sacrifices interpretability for higher stability, accuracy, and robustness.
You train a Random Forest and notice high training accuracy but low test accuracy. What could be wrong?
* Likely cause: overfitting, typically trees grown too deep (`max_depth` too high) or leaves too small.
* Fix: limit tree depth (`max_depth`) or increase `min_samples_leaf`.
* Fix: increase `n_estimators` to stabilize predictions.
Key point: Random Forest reduces overfitting, but hyperparameters must still be tuned carefully.
If one feature has 1000 levels and another has 3 levels, which is likely to dominate a single tree split, and how does Random Forest mitigate this?
The feature with 1000 levels will likely dominate a single tree's splits, since more levels mean more candidate split points and more chances to fit noise. Random Forest mitigates this by considering only a random feature subset at each split, so the high-cardinality feature is often not even available.
Key point: Random Forest reduces, but does not completely eliminate, bias toward high-cardinality features.
Given a Random Forest with 100 trees, if a sample is not included in 40 of the bootstrap datasets, how many trees will contribute to its OOB error?
OOB error uses only trees that did not see the sample during training.
Here, 40 trees did not include the sample → all 40 contribute to OOB prediction.
Key point: OOB error provides an unbiased internal estimate using unseen samples per tree.
List 2 advantages of Random Forest over a single decision tree.
1. Lower variance and less overfitting, thanks to bagging and feature randomness.
2. Higher accuracy and robustness, from aggregating many decorrelated trees.
Optional bonus: can handle both classification and regression tasks and provides feature importance.