CART - (6)
Classification and Regression Trees
How is the tree built?
greedy recursive binary splitting: start with a single region R1; repeatedly pick the predictor Xj and cutpoint s whose split gives the largest drop in RSS, and recurse within each new region.
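The greedy split search above can be sketched from scratch. This is a toy illustration, not scikit-learn's implementation; the function name `best_split` and the brute-force scan over unique values are my own choices:

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the split (feature j, threshold s) that
    minimizes the total RSS of the two resulting regions."""
    n, p = X.shape
    best = (np.inf, None, None)  # (rss, feature index, threshold)
    for j in range(p):
        # candidate cutpoints: every unique value except the max,
        # so both sides of the split are always non-empty
        for s in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[0]:
                best = (rss, j, s)
    return best
```

In a full tree-growing loop this search would be re-run inside each region until a size or depth limit is hit.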
How do we control for overfitting?
wrong method 1: search over all subtrees and pick the best by CV (exponentially many subtrees, so the search is infeasible and the selection overfits)
wrong method 2: use a stopping rule while growing (fails because a good split can sit below a seemingly bad one)
solution: prune a large tree from leaves to root
Tree Pruning: (2)
weakest link pruning: starting with the full tree T0, collapse the internal node whose subtree-to-leaf substitution minimizes [RSS(T1) - RSS(T0)] / [|T0| - |T1|] to obtain T1
iterate the pruning to obtain a sequence T0, T1, ..., Tm, where Tm is the null tree (root only)
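A minimal sketch of one weakest-link step, assuming a hypothetical nested-dict tree in which each node stores 'rss' (the RSS it would incur if collapsed to a leaf) and, for internal nodes, 'left'/'right' children; the representation and function names are my own:

```python
def leaves_and_rss(node):
    """Return (number of leaves, total leaf RSS) of a subtree."""
    if 'left' not in node:  # leaf
        return 1, node['rss']
    nl, rl = leaves_and_rss(node['left'])
    nr, rr = leaves_and_rss(node['right'])
    return nl + nr, rl + rr

def weakest_link(node, best=None):
    """Find the internal node whose collapse minimizes
    [RSS(T1) - RSS(T0)] / [|T0| - |T1|]; returns (criterion, node)."""
    if 'left' not in node:
        return best
    n_leaves, sub_rss = leaves_and_rss(node)
    # collapsing this node raises RSS by (node['rss'] - sub_rss)
    # and removes (n_leaves - 1) leaves
    g = (node['rss'] - sub_rss) / (n_leaves - 1)
    if best is None or g < best[0]:
        best = (g, node)
    best = weakest_link(node['left'], best)
    return weakest_link(node['right'], best)
```

Repeating "collapse the weakest link" until only the root remains produces the sequence T0, T1, ..., Tm.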
cost complexity pruning: solve the problem
minimize RSS(T) + α|T|, where α ≥ 0 penalizes the number of leaves |T|
if α is large we select the null tree Tm
if α = 0 we select the full tree T0
for every α, the solution lies in the weakest-link pruning sequence T0, T1, ..., Tm
choose the optimal α by CV
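scikit-learn exposes exactly this procedure: `cost_complexity_pruning_path` returns the increasing sequence of α values at which the optimal subtree changes (the weakest-link sequence), and refitting with a given `ccp_alpha` yields the corresponding pruned tree. The synthetic dataset here is purely illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression  # synthetic data, purely illustrative
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# increasing alphas; each one corresponds to a tree in the sequence T0, ..., Tm
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# larger alpha -> fewer leaves; the largest alpha gives the null tree (root only)
n_leaves = [
    DecisionTreeRegressor(random_state=0, ccp_alpha=a).fit(X, y).get_n_leaves()
    for a in (path.ccp_alphas[0], path.ccp_alphas[-1])
]
print(n_leaves)
```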
Incorrect CV vs Correct CV
incorrect CV: grows and prunes the trees on the full data first, then does the k-fold split, so the test folds have already influenced the trees
correct CV: split the data into k folds first; on each training set, grow the tree and prune it for each α; then pick the α (equivalently, the tree Ti) that minimizes the average test error
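The correct order can be made explicit with a manual k-fold loop: the split happens first, and every tree is grown and pruned using only its training folds. The alpha grid and the synthetic dataset are my own assumptions for the sketch:

```python
import numpy as np
from sklearn.datasets import make_regression  # synthetic data, purely illustrative
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
alphas = np.linspace(0.0, 50.0, 20)  # assumed grid of penalties to try

kf = KFold(n_splits=5, shuffle=True, random_state=0)
cv_err = np.zeros(len(alphas))
for train, test in kf.split(X):        # split FIRST ...
    for i, a in enumerate(alphas):     # ... then grow and prune per training set
        t = DecisionTreeRegressor(random_state=0, ccp_alpha=a).fit(X[train], y[train])
        cv_err[i] += ((t.predict(X[test]) - y[test]) ** 2).mean() / kf.get_n_splits()

best_alpha = alphas[np.argmin(cv_err)]
```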
Classification Trees - types of classification loss (4 total notes)
in each region Ri, predict the most common class (majority vote).
types of classification loss: 0-1 loss, gini index, cross entropy (similar to gini)
Gini index, sum_k phat_mk (1 - phat_mk), is the expected misclassification rate when, instead of predicting the most likely class, we predict a class drawn at random from the distribution phat_mk
it is typical to use Gini or cross entropy to grow the tree, then misclassification error to prune it
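The three losses can be computed directly from a node's class proportions phat_mk. A small sketch, not a library API; the function name `node_losses` is my own:

```python
import numpy as np

def node_losses(p):
    """Classification losses for a node with class proportions p = phat_mk."""
    p = np.asarray(p, dtype=float)
    misclass = 1.0 - p.max()                    # 0-1 loss of the majority vote
    gini = np.sum(p * (1.0 - p))                # expected error of a random draw from p
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return misclass, gini, entropy
```

All three are zero for a pure node and largest for a uniform one; Gini and entropy are smoother in p, which is why they behave better when growing the tree.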
Advantages of Trees
easy to interpret, closer to human decision making, easy to visualize, easily handle qualitative predictors and missing data
How do we deal with Categorical variables
For 2 categories the split is obvious. With more than 2 categories, order the categories by the mean of the response within each; it can be shown that the optimal 2-way partition respects this ordering, so only |C| - 1 splits need to be checked.
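The ordering trick can be sketched as follows (function name and return format are my own; a toy illustration):

```python
import numpy as np

def ordered_category_splits(cats, y):
    """Order categories by mean response; the optimal 2-way partition is
    then among the |C| - 1 splits of this ordering."""
    levels = np.unique(cats)
    means = np.array([y[cats == c].mean() for c in levels])
    order = levels[np.argsort(means)]
    # candidate splits: first k categories vs the rest, k = 1 .. |C| - 1
    return [(set(order[:k]), set(order[k:])) for k in range(1, len(order))]
```

With |C| categories this reduces the search from 2^(|C|-1) - 1 partitions to |C| - 1 candidates.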
Missing Data
suppose we can assign every sample to a leaf Ri despite missing data
when choosing a new split on variable Xj (growing the tree), evaluate the split using only the samples for which Xj is observed