Decision tree
A decision tree predicts the class of an object by applying a sequence of tests (splits) on its attributes. Each attribute offers possible splits; after listing the candidate splits, the best ones have to be chosen so that we end up with pure nodes, i.e., nodes containing objects of a single class. To decide which splits are best, entropy and information gain are used.
Entropy (E)
Measures the uncertainty in a node: E = - sum over classes i of p_i * log2(p_i). For a pure node E = 0; at maximal uncertainty (e.g. two equally likely classes) E = 1.
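As a minimal sketch (standard-library Python, with made-up example labels), the entropy of a node can be computed directly from its class labels:

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return 0.0 - sum(p * math.log2(p) for p in probs)

print(entropy(["yes", "yes", "yes"]))       # pure node: E = 0
print(entropy(["yes", "no", "yes", "no"]))  # two equally likely classes: E = 1
```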
Information gain (I)
Measures the decrease in uncertainty obtained by a split: I = E(parent) - weighted average of the children's entropies.
Theory regarding entropy
The weighted entropy after a split is never greater than the original entropy, so the information gain is never negative.
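This can be sketched with a small example (standard-library Python, hypothetical labels): a perfect split earns the full parent entropy as gain, while a useless split earns nothing.

```python
import math

def entropy(labels):
    n = len(labels)
    return 0.0 - sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                     for c in set(labels))

def information_gain(parent, children):
    """I = E(parent) - weighted average of the children's entropies."""
    n = len(parent)
    after = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - after

parent = ["yes", "yes", "no", "no"]
# Perfect split: both children are pure, so the gain equals E(parent) = 1
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))
# Useless split: each child mirrors the parent distribution, gain = 0
print(information_gain(parent, [["yes", "no"], ["yes", "no"]]))
```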
Split on attributes
Decision boundary
Borderline between two neighbouring regions of different classes. The decision boundary is parallel to the axes because each test condition involves a single attribute at a time.
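A minimal sketch of why the boundaries are axis-parallel (hypothetical attributes x1, x2 and classes A/B/C): each test compares one attribute against a threshold, so each boundary is a vertical or horizontal line.

```python
def classify(x1, x2):
    """Each test uses a single attribute, so every decision boundary
    is a line parallel to an axis (here x1 = 0.5, then x2 = 0.5)."""
    if x1 <= 0.5:      # vertical boundary at x1 = 0.5
        return "A"
    if x2 <= 0.5:      # horizontal boundary at x2 = 0.5
        return "B"
    return "C"

print(classify(0.2, 0.9))  # left of x1 = 0.5 -> "A"
print(classify(0.8, 0.3))  # right of x1 = 0.5, below x2 = 0.5 -> "B"
```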
Types of decision trees
Type of predicted variable (end node): a classification tree predicts a categorical class label, a regression tree predicts a continuous value.
Tree induction
Measures for node impurity (a.k.a. when to stop splitting)
Quality of a split
Objective: obtaining pure nodes, i.e., nodes that contain objects from a single class (if a node is impure, take the majority class as its label).
- Measures for impurity
  + Misclassification error: the fraction of objects that are classified incorrectly if we assign every object in the node to the majority class.
Splitting based on classification error
Classification error at a node t:
- Error(t) = 1 - max_i P(i|t)
Measures misclassification error made by a node
- Maximum (1 - 1/nc) when records are equally distributed among all nc classes.
- Minimum (0.0) when all records belong to one class (pure node).
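The two extremes can be checked with a short sketch (standard-library Python, made-up class counts for a node t):

```python
def classification_error(counts):
    """Error(t) = 1 - max_i P(i|t), where counts are the class counts
    of the records in node t."""
    n = sum(counts)
    return 1 - max(counts) / n

print(classification_error([6, 0]))  # pure node: minimum error 0.0
print(classification_error([3, 3]))  # equal distribution, nc = 2: maximum 1 - 1/2 = 0.5
```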
For efficient computation
For each attribute:
Regression tree overfitting
Divide the x-axis into parts in such a way that in each part the value of the node equals the average of the red points. By continuing to split you can fit the red points arbitrarily well.
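A minimal sketch of that idea (standard-library Python, with hypothetical points standing in for the "red points"): each leaf predicts the average of the training points in its segment, and with one leaf per point the training error drops to zero.

```python
def fit_piecewise_mean(xs, ys, boundaries):
    """Split the x-axis at the given boundaries; each leaf predicts the
    mean y of the training points falling into its segment."""
    def segment(x):
        return sum(1 for b in boundaries if x >= b)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(segment(x), []).append(y)
    means = {s: sum(v) / len(v) for s, v in groups.items()}
    return lambda x: means[segment(x)]

xs = [0.0, 1.0, 2.0, 3.0]   # hypothetical "red points"
ys = [1.0, 3.0, 2.0, 5.0]

def training_sse(predict):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys))

coarse = fit_piecewise_mean(xs, ys, [2.0])           # 2 leaves
fine = fit_piecewise_mean(xs, ys, [1.0, 2.0, 3.0])   # 4 leaves, one point each
print(training_sse(coarse))  # > 0: the leaf means smooth over the points
print(training_sse(fine))    # 0.0: one leaf per point fits them exactly
```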
Growing the tree
Using the training data:
- Split nodes based on an impurity criterion.
- Continue until all leaf nodes are pure (impurity = 0).
Problem: if the tree is too large, the model also fits the noise in the training data.
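A small sketch of this problem (standard-library Python, fabricated illustrative data where the true signal is y = 0 and the training labels carry noise): the single-leaf tree generalizes, while a tree with one leaf per training point memorizes the noise.

```python
xs = [0.0, 1.0, 2.0, 3.0]
train_ys = [1.0, -1.0, 1.0, -1.0]   # true signal y = 0 plus noise
test_ys = [0.0, 0.0, 0.0, 0.0]      # noise-free points from the same signal

def sse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys))

# Simple tree: a single leaf predicting the training mean (= 0.0)
mean = sum(train_ys) / len(train_ys)
simple = [mean] * len(xs)

# Overgrown tree: one pure leaf per training point (memorizes the noise)
overgrown = list(train_ys)

print(sse(simple, train_ys), sse(simple, test_ys))        # 4.0 0.0
print(sse(overgrown, train_ys), sse(overgrown, test_ys))  # 0.0 4.0
```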
Underfitting
When the model is too simple, both training and test errors are large.
Overfitting
When the model is too complex, training errors are small but test errors are large:
- Overfitting results in decision trees that are more complex than necessary.
- Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
- Solutions:
Occam’s Razor principle:
+ Given two models of similar generalization errors, one should prefer the simpler model over the more complex model.
+ For a complex model, there is a greater chance that it accidentally fits errors (noise) in the data.
Possible solution:
- Include model complexity when evaluating a model.
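One way to sketch this (standard-library Python; the per-leaf penalty of 0.5 and the two example trees are assumed illustrative values, not from the source): estimate the generalization error as the training errors plus a fixed penalty per leaf, so a complex tree with a slightly lower raw training error can still lose to a simpler one.

```python
def penalized_error(train_errors, n_records, n_leaves, penalty=0.5):
    """Estimated generalization error: training errors plus a fixed
    penalty per leaf, divided by the number of training records.
    (penalty = 0.5 is an assumed illustrative value.)"""
    return (train_errors + penalty * n_leaves) / n_records

# Two hypothetical trees evaluated on 100 training records:
simple_tree = penalized_error(train_errors=10, n_records=100, n_leaves=4)
complex_tree = penalized_error(train_errors=6, n_records=100, n_leaves=20)
print(simple_tree)   # 0.12
print(complex_tree)  # 0.16 -> the simpler tree is preferred
```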
Stopping criteria
Two systematic methods: early stopping (stop growing the tree before it is complete) and pruning (grow the full tree, then trim it back).
Tree pruning