What is a Decision Tree?
A supervised learning algorithm that splits data into subsets based on feature values, forming a tree-like structure of nodes and leaves.
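A minimal sketch of this, assuming scikit-learn is available; the four-point dataset is invented purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: the label simply equals the first feature.
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]

# The tree recursively splits the data on feature values.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict([[1, 1]]))  # follows the learned splits to a leaf
```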
What are the two main types of Decision Trees?
Classification trees (categorical outcomes) and Regression trees (continuous outcomes).
What is a Random Forest?
An ensemble method that constructs multiple decision trees and outputs the mode (classification) or mean (regression) of their predictions.
A bagging ensemble - bootstrap sampling + aggregating predictions.
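A quick illustration, assuming scikit-learn; `make_classification` supplies synthetic data for the sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Each tree is trained on a bootstrap sample; at predict time the
# trees vote and the forest returns the majority class (the mode).
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(rf.score(X, y))
```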
What is Ensemble Learning?
A technique that combines multiple weak learners to create a single, more robust strong learner.
Define Bagging.
Bootstrap Aggregating: training models independently on random data subsets with replacement and averaging results.
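The bootstrap + aggregate recipe can be sketched with the standard library alone, using the sample mean as a stand-in "model":

```python
import random

random.seed(0)
data = [2, 4, 6, 8, 10]

# Bootstrap: draw samples of the same size WITH replacement,
# fit a trivial "model" (the mean) on each independent sample...
estimates = []
for _ in range(100):
    sample = random.choices(data, k=len(data))
    estimates.append(sum(sample) / len(sample))

# ...then aggregate: average the individual estimates.
bagged = sum(estimates) / len(estimates)
print(round(bagged, 1))
```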
Define Boosting.
A sequential ensemble technique where each new model corrects errors made by previous models.
What is a Root Node?
The topmost node in a tree, representing the entire dataset and the first split.
What is a Leaf Node?
A terminal node that contains the final prediction and does not split further.
What is Pruning?
Removing sections of a tree that provide little predictive power to reduce overfitting.
Name one advantage of tree-based models.
They are non-parametric, handle non-linearities well, and require minimal data preprocessing.
Difference between Gini Impurity and Entropy?
Gini: 1 minus the sum of squared class probabilities (faster to compute). Measures the probability of incorrectly classifying a randomly chosen element if it were labelled randomly according to the class distribution. Range [0, 0.5] for binary classification.
Entropy: negative sum of p x log2(p) (more sensitive to changes in class probabilities). Measures randomness or uncertainty in the data. Range [0, 1] for binary classification.
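Both measures are a couple of lines of plain Python (log base 2, as is conventional for entropy); the function names are made up for illustration:

```python
from math import log2

def gini(probs):
    # 1 - sum(p^2): 0 for a pure node, 0.5 for two balanced classes.
    return 1 - sum(p * p for p in probs)

def entropy(probs):
    # -sum(p * log2(p)): 0 for a pure node, 1.0 for two balanced classes.
    return -sum(p * log2(p) for p in probs if p > 0)

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 1.0
```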
How does Random Forest reduce variance?
By averaging uncorrelated trees built on different data and feature subsets, noise cancels out.
What is Feature Randomness?
Used in Random Forests.
At each node split, only a random subset of features is considered instead of all features.
This decorrelates the individual trees and reduces variance.
Prevents overfitting by forcing some trees to learn less obvious patterns.
What is Out-of-Bag (OOB) error?
Used in Random Forest.
Due to sampling with replacement, each tree leaves out roughly one-third of the samples from its bootstrap sample - these OOB samples act as a built-in validation set for the trees that never saw them.
Internal method used to estimate the model's prediction error without needing a separate validation set.
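A sketch assuming scikit-learn, where `oob_score=True` exposes this estimate directly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

# With oob_score=True, each sample is scored only by trees
# whose bootstrap sample did not contain it.
rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                            random_state=0).fit(X, y)
print(rf.oob_score_)  # accuracy estimated without a held-out set
```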
What is Information Gain?
The reduction in entropy or impurity achieved by splitting a dataset on a specific feature.
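A worked example in plain Python (the helper names are invented for illustration): the gain is the parent's entropy minus the size-weighted entropy of the child subsets.

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, children):
    # Parent entropy minus the size-weighted entropy of each child subset.
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# A perfect split removes all uncertainty:
print(information_gain([0, 0, 1, 1], [[0, 0], [1, 1]]))  # 1.0
```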
Difference between Gradient Boosting and AdaBoost?
AdaBoost - adaptive boosting. Weights misclassified points. Forces subsequent models (decision stumps) to focus on ‘hard’ samples.
Gradient Boosting fits each new model to the residuals (the negative gradient of the loss) of the previous ones. Sequentially builds models that correct errors via gradient descent.
Common hyperparameters for Decision Trees?
max_depth - maximum depth (number of levels) of the tree. Low = reduce overfit.
min_samples_split - min no. samples an internal node must contain before it can be considered for further splitting. High = reduce overfit - ensure decisions made on enough samples.
min_samples_leaf - min no. samples required to be present in a leaf node. High = reduce overfit - avoid leaf nodes that only contain a single sample result.
max_features - no. features to consider when looking for the best split. Introduces randomness into feature selection.
min_impurity_decrease - only split if it yields a decrease in impurity greater than this value.
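A sketch showing where each knob lives in scikit-learn's `DecisionTreeClassifier`; the values are hypothetical, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical values chosen purely to show the parameters.
tree = DecisionTreeClassifier(
    max_depth=5,                 # cap the number of levels
    min_samples_split=20,        # need >= 20 samples to split a node
    min_samples_leaf=10,         # every leaf keeps >= 10 samples
    max_features="sqrt",         # random sqrt(n_features) per split
    min_impurity_decrease=0.01,  # split only if impurity drops by > 0.01
    random_state=0,
)
print(tree.get_params()["max_depth"])
```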
Why are trees prone to overfitting?
They can continue splitting until they perfectly memorize training data noise.
Significance of Learning Rate in Boosting?
Scales the contribution of each new tree, controlling the influence of each successive model in the ensemble.
Smaller rates usually improve generalization but require more trees.
Used as regularisation - if each tree is allowed to fully 'fix' the errors of the previous tree, the model will quickly start fitting the noise in the training data. Shrinking each tree's contribution ensures the model captures the overall trend.
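A sketch with scikit-learn's `GradientBoostingClassifier`, pairing a small `learning_rate` with more trees; the dataset and values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A small learning_rate shrinks each tree's contribution, so more
# trees are needed, but the ensemble tends to generalise better.
slow = GradientBoostingClassifier(learning_rate=0.05, n_estimators=200,
                                  random_state=0).fit(X_tr, y_tr)
print(slow.score(X_te, y_te))
```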
Can trees handle missing values?
Standard CART requires imputation, but XGBoost/LightGBM can learn default directions for missing values.
How does XGBoost handle regularization?
It includes L1 and L2 regularization in its objective function to penalize complex trees.
How is Feature Importance calculated in RF?
Typically by measuring the average decrease in impurity (Gini/Entropy) across all trees for that feature.
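A sketch assuming scikit-learn, where `feature_importances_` holds this normalised mean decrease in impurity:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Mean decrease in impurity per feature, averaged over all trees;
# the importances are normalised to sum to 1.
print(rf.feature_importances_)
```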
Curse of Dimensionality in trees?
High-dimensional sparse data makes it hard to find meaningful splits, leading to overfitting.
LightGBM vs XGBoost?
LightGBM grows trees leaf-wise (faster/efficient) while XGBoost traditionally grows level-wise.