What is a Decision Tree?
A supervised learning algorithm that splits data into subsets based on feature values, forming a tree-like structure of nodes and leaves.
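A minimal sketch of this, assuming scikit-learn is available; the four-point dataset is invented purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: the label simply equals the first feature.
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]

# The tree recursively splits the data on feature values.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict([[1, 1]]))  # follows the learned splits to a leaf
```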
What are the two main types of Decision Trees?
Classification trees (categorical outcomes) and Regression trees (continuous outcomes).
What is a Random Forest?
An ensemble method that constructs multiple decision trees and outputs the mode (classification) or mean (regression) of their predictions.
A bagging ensemble - bootstrap sampling + aggregating predictions.
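A quick illustration, assuming scikit-learn; `make_classification` supplies synthetic data for the sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Each tree is trained on a bootstrap sample; at predict time the
# trees vote and the forest returns the majority class (the mode).
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(rf.score(X, y))
```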
What is Ensemble Learning?
A technique that combines multiple weak learners to create a single, more robust strong learner.
Define Bagging.
Bootstrap Aggregating: training models independently on random data subsets with replacement and averaging results.
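The bootstrap + aggregate recipe can be sketched with the standard library alone, using the sample mean as a stand-in "model":

```python
import random

random.seed(0)
data = [2, 4, 6, 8, 10]

# Bootstrap: draw samples of the same size WITH replacement,
# fit a trivial "model" (the mean) on each independent sample...
estimates = []
for _ in range(100):
    sample = random.choices(data, k=len(data))
    estimates.append(sum(sample) / len(sample))

# ...then aggregate: average the individual estimates.
bagged = sum(estimates) / len(estimates)
print(round(bagged, 1))
```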
Define Boosting.
A sequential ensemble technique where each new model corrects errors made by previous models.
What is a Root Node?
The topmost node in a tree, representing the entire dataset and the first split.
What is a Leaf Node?
A terminal node that contains the final prediction and does not split further.
What is Pruning?
Removing sections of a tree that provide little predictive power to reduce overfitting.
Name one advantage of tree-based models.
They are non-parametric, handle non-linearities well, and require minimal data preprocessing.
Difference between Gini Impurity and Entropy?
Gini: 1 minus the sum of squared class probabilities (faster to compute). Measures the probability of incorrectly classifying a randomly chosen element if it were labelled randomly according to the class distribution. Range [0, 0.5] for binary classification.
Entropy: negative sum of p x log2(p) (more sensitive to changes in class probabilities). Measures randomness or uncertainty in the data. Range [0, 1] for binary classification.
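Both measures are a couple of lines of plain Python (log base 2, as is conventional for entropy); the function names are made up for illustration:

```python
from math import log2

def gini(probs):
    # 1 - sum(p^2): 0 for a pure node, 0.5 for two balanced classes.
    return 1 - sum(p * p for p in probs)

def entropy(probs):
    # -sum(p * log2(p)): 0 for a pure node, 1.0 for two balanced classes.
    return -sum(p * log2(p) for p in probs if p > 0)

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 1.0
```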
How does Random Forest reduce variance?
By averaging uncorrelated trees built on different data and feature subsets, noise cancels out.
What is Feature Randomness?
Used in Random Forests.
At each node split, only a random subset of features is considered instead of all features.
This decorrelates the individual trees and reduces variance.
Prevents overfitting by forcing some trees to learn less obvious patterns.
What is Out-of-Bag (OOB) error?
Used in Random Forest.
Due to sampling with replacement, each tree leaves out roughly one-third of the samples from its bootstrap sample - these OOB samples act as a built-in validation set for the trees that never saw them.
Internal method used to estimate the model's prediction error without needing a separate validation set.
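A sketch assuming scikit-learn, where `oob_score=True` exposes this estimate directly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

# With oob_score=True, each sample is scored only by trees
# whose bootstrap sample did not contain it.
rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                            random_state=0).fit(X, y)
print(rf.oob_score_)  # accuracy estimated without a held-out set
```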
What is Information Gain?
The reduction in entropy or impurity achieved by splitting a dataset on a specific feature.
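A worked example in plain Python (the helper names are invented for illustration): the gain is the parent's entropy minus the size-weighted entropy of the child subsets.

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, children):
    # Parent entropy minus the size-weighted entropy of each child subset.
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# A perfect split removes all uncertainty:
print(information_gain([0, 0, 1, 1], [[0, 0], [1, 1]]))  # 1.0
```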
Difference between Gradient Boosting and AdaBoost?
AdaBoost - adaptive boosting. Weights misclassified points. Forces subsequent models (decision stumps) to focus on ‘hard’ samples.
Gradient Boosting fits each new model to the residuals (the negative gradient of the loss) of the previous ones. Sequentially builds models that correct errors via gradient descent.
Common hyperparameters for Decision Trees?
max_depth - maximum depth (number of levels) of the tree. Low = reduce overfit.
min_samples_split - min no. samples an internal node must contain before it can be considered for further splitting. High = reduce overfit - ensure decisions made on enough samples.
min_samples_leaf - min no. samples required to be present in a leaf node. High = reduce overfit - avoid leaf nodes that only contain a single sample result.
max_features - no. features to consider when looking for the best split. Introduces randomness into feature selection.
min_impurity_decrease - only split if it yields a decrease in impurity greater than this value.
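A sketch showing where each knob lives in scikit-learn's `DecisionTreeClassifier`; the values are hypothetical, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical values chosen purely to show the parameters.
tree = DecisionTreeClassifier(
    max_depth=5,                 # cap the number of levels
    min_samples_split=20,        # need >= 20 samples to split a node
    min_samples_leaf=10,         # every leaf keeps >= 10 samples
    max_features="sqrt",         # random sqrt(n_features) per split
    min_impurity_decrease=0.01,  # split only if impurity drops by > 0.01
    random_state=0,
)
print(tree.get_params()["max_depth"])
```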
Why are trees prone to overfitting?
They can continue splitting until they perfectly memorize training data noise.
Significance of Learning Rate in Boosting?
Scales the contribution of each new tree, controlling the influence of each successive model in the ensemble.
Smaller rates usually improve generalization but require more trees.
Used as regularisation - if each tree is allowed to fully 'fix' the errors of the previous tree, the model will quickly start fitting the noise in the training data. Shrinking each tree's contribution ensures the model captures the overall trend.
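A sketch with scikit-learn's `GradientBoostingClassifier`, pairing a small `learning_rate` with more trees; the dataset and values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A small learning_rate shrinks each tree's contribution, so more
# trees are needed, but the ensemble tends to generalise better.
slow = GradientBoostingClassifier(learning_rate=0.05, n_estimators=200,
                                  random_state=0).fit(X_tr, y_tr)
print(slow.score(X_te, y_te))
```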
Can trees handle missing values?
Standard CART requires imputation, but XGBoost/LightGBM can learn default directions for missing values.
How does XGBoost handle regularization?
It includes L1 and L2 regularization in its objective function to penalize complex trees.
How is Feature Importance calculated in RF?
Typically by measuring the average decrease in impurity (Gini/Entropy) across all trees for that feature.
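A sketch assuming scikit-learn, where `feature_importances_` holds this normalised mean decrease in impurity:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Mean decrease in impurity per feature, averaged over all trees;
# the importances are normalised to sum to 1.
print(rf.feature_importances_)
```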
Curse of Dimensionality in trees?
High-dimensional sparse data makes it hard to find meaningful splits, leading to overfitting.
LightGBM vs XGBoost?
LightGBM grows trees leaf-wise (faster/efficient) while XGBoost traditionally grows level-wise.