In a decision tree what does the first (top-most) split indicate?
The most important predictor (most influential on the response variable)
Terminal nodes aka tree leaves
Internal nodes
Intermediate splits along the tree
Stump
A tree with only a single split (two terminal nodes)
Child nodes
The two nodes produced when a parent node is split
Branches
The lines connecting any two nodes
How is a tree partitioned?
Predictor space is partitioned into g regions such that the SSE is minimized.
Recursive binary splitting
Top-down: start with everything in one region and select the next best split each time, working down the tree.
Binary: because two regions are created at each split.
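The greedy search above can be sketched in a few lines. This is a minimal illustration with made-up toy data (the x/y values are hypothetical): for one predictor, try every candidate cutpoint s and keep the one whose two regions have the smallest total SSE.

```python
# Hypothetical toy data: one predictor x and response y
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 0.9, 1.0, 3.9, 4.1, 4.0]

def sse(vals):
    """Sum of squared errors of vals around their own mean."""
    if not vals:
        return 0.0
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def best_split(x, y):
    """Try every cutpoint s; return the one minimizing the total SSE
    of the two regions {x < s} and {x >= s}."""
    best = (None, float("inf"))
    for s in sorted(set(x))[1:]:  # candidate cutpoints
        left = [yi for xi, yi in zip(x, y) if xi < s]
        right = [yi for xi, yi in zip(x, y) if xi >= s]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (s, total)
    return best

s, err = best_split(x, y)
print(s)  # 4.0 — separates the low-response points from the high ones
```

Recursive binary splitting repeats this search inside each resulting region, which is why it is top-down and binary.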
Why is recursive binary splitting considered “greedy”?
Because at each step it chooses the split that is best at that step, without looking ahead to whether a different split might lead to a better tree later on.
Stopping criteria
Criteria for when to stop splitting the predictor space.
e.g. No region allowed to have fewer than some number of observations.
When the predictor space is split and observations are divided, how is the predicted value chosen?
By computing the sample mean of the response in each region (y-bar of Rm, the mean of the observations falling in region Rm).
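A small sketch of this step, using a hypothetical split point s = 4.0 and toy data: each region's prediction is simply the sample mean of the responses that fall in it.

```python
# Toy example: after a split at s = 4.0, the prediction for each
# region is the sample mean (y-bar) of that region's responses.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 0.9, 1.0, 3.9, 4.1, 4.0]
s = 4.0

left = [yi for xi, yi in zip(x, y) if xi < s]
right = [yi for xi, yi in zip(x, y) if xi >= s]
ybar_left = sum(left) / len(left)     # prediction for region x < s
ybar_right = sum(right) / len(right)  # prediction for region x >= s

def predict(x_new):
    """A new observation gets the mean of whichever region it lands in."""
    return ybar_left if x_new < s else ybar_right

print(predict(2.5), predict(5.5))  # approximately 1.0 and 4.0
```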
Pruning
Reducing the number of leaves (terminal nodes) in a tree to reduce its flexibility, variance, and risk of overfitting. (Recursive binary splitting is prone to overfitting because it can produce a large, complex tree.)
We do this by creating a subtree: collapsing one or more internal nodes of the full tree.
Cost complexity pruning
What does cost complexity pruning result in?
A nested sequence of subtrees, one for each value of the tuning parameter lambda
How does cost complexity pruning work?
Rather than considering every possible subtree, it introduces a tuning parameter lambda and, for each value of lambda, finds the subtree that minimizes RSS + lambda * |T|. Larger values of lambda penalize tree size more heavily and so produce smaller subtrees.
What does the graph of # of terminal nodes against lambda look like?
A decreasing step function: as lambda increases, terminal nodes are pruned away one region at a time, so |T| drops in steps.
How do we select the best subtree after pruning the tree?
Cross-validation
Choose lambda by applying k-fold cross-validation and picking the lambda with the lowest CV error.
The best subtree is then the one corresponding to that value of lambda.
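The selection step can be sketched as follows. The subtree sizes and RSS values below are made-up numbers standing in for a nested sequence produced by pruning; the point is that minimizing RSS + lambda * |T| picks ever-smaller subtrees as lambda grows.

```python
# Hypothetical nested sequence of subtrees: (|T|, training RSS).
# Larger trees fit the training data better, so RSS falls as |T| grows.
subtrees = [(1, 50.0), (2, 20.0), (3, 12.0), (5, 8.0), (8, 7.0)]

def best_subtree(lam):
    """Pick the subtree minimizing the penalized criterion RSS + lam * |T|."""
    return min(subtrees, key=lambda t: t[1] + lam * t[0])

for lam in (0.0, 1.0, 5.0, 40.0):
    size, rss = best_subtree(lam)
    print(lam, "->", size, "terminal nodes")
# lambda = 0 keeps the full tree (8 leaves); lambda = 40 prunes to the root stump.
```

In practice each candidate lambda would be scored by its k-fold CV error rather than training RSS, and the lambda with the lowest CV error determines the final subtree.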
What does |T| denote?
The size of the tree, i.e. its number of terminal nodes
What is the primary objective of tree pruning?
To find a subtree that minimizes the TEST error rate.
Effect of tree pruning on:
Variance
Flexibility
Bias
Residual sum of squares (RSS, aka SSE)
Number of terminal nodes |T|
Tuning parameter (lambda)
Penalty term (lambda * |T|)
Variance: DOWN
Flexibility: DOWN
Bias: UP
SSE: UP
|T|: DOWN
Lambda: UP
Penalty term: DOWN (with |T|)
Examples of greedy methods
Methods that think one step at a time, without holistic consideration of future steps, e.g. recursive binary splitting and forward stepwise selection.
For a smaller subtree, what happens?
The criterion RSS + (# of terminal nodes × tuning parameter), aka RSS + |T| * lambda, is minimized by a smaller subtree when lambda is larger, because the penalty then outweighs the increase in RSS.
Decision Tree Yes/No
Yes = left branch
No = right branch
Impurity function (classification)
Used in classification decision trees to decide the best split (instead of splitting to minimize the error sum of squares). Common choices are the classification error rate, the Gini index, and cross-entropy.
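As one concrete impurity measure, the Gini index can be computed directly from a node's class proportions. This is a minimal sketch with hypothetical label lists:

```python
# Gini index of a node: G = sum over classes k of p_k * (1 - p_k),
# where p_k is the proportion of the node's observations in class k.
# A pure node (all one class) has G = 0; the best split is the one
# giving the largest weighted decrease in impurity.
def gini(labels):
    n = len(labels)
    props = [labels.count(c) / n for c in set(labels)]
    return sum(p * (1 - p) for p in props)

print(gini(["a", "a", "a", "a"]))  # 0.0 — pure node
print(gini(["a", "a", "b", "b"]))  # 0.5 — maximally mixed (two classes)
```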