Gini Index
Impurity measure of node t.
i(t) = 1 - sum_j (p_jt)^2
p_jt is the relative frequency of class j at node t.
The Gini index of a split is the sum of the child-node impurities i(t), each weighted by the relative number of cases in the node.
We choose the attribute that provides the smallest Gini split measure.
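A minimal Python sketch of these two quantities (the function names gini and gini_split, and the example labels, are illustrative):

```python
import numpy as np

def gini(labels):
    """Gini impurity i(t) = 1 - sum_j p_jt^2 for the labels at node t."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(parent_labels, children_labels):
    """Weighted Gini of a split: child impurities weighted by the
    relative number of cases falling into each child node."""
    n = len(parent_labels)
    return sum(len(c) / n * gini(c) for c in children_labels)

# Example: a candidate binary split of 10 cases
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]
print(gini(parent), gini_split(parent, [left, right]))
```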
Information gain
Measures the homogeneity of a node.
Entropy:
i(t) = - sum_j p_jt * log_2(p_jt)
The entropy gain of a split is the difference between the entropy before the split and the sum of the entropies of the nodes after the split, weighted by their relative frequencies.
We choose the split that achieves the greatest reduction in entropy, i.e. the one that maximises the gain.
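A minimal sketch in the same style (entropy and information_gain are illustrative names; log base 2 is assumed):

```python
import numpy as np

def entropy(labels):
    """i(t) = -sum_j p_jt * log2(p_jt); a pure node has entropy 0."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, children_labels):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(parent_labels)
    after = sum(len(c) / n * entropy(c) for c in children_labels)
    return entropy(parent_labels) - after
```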
Split Info and Gain Ratio
Split info is minus the sum, over the child nodes, of the relative number of cases in each node times the log of that relative number: SplitInfo = - sum_k (n_k/n) * log_2(n_k/n).
Gain ratio is the entropy gain divided by the split info; dividing by the split info penalises splits that fragment the data into many small nodes.
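A sketch of the two quantities, assuming the information gain of the split has already been computed as above (split_info and gain_ratio are illustrative names):

```python
import numpy as np

def split_info(children_labels):
    """SplitInfo = -sum_k (n_k/n) * log2(n_k/n) over the child nodes."""
    sizes = np.array([len(c) for c in children_labels], dtype=float)
    w = sizes / sizes.sum()
    return -np.sum(w * np.log2(w))

def gain_ratio(gain, children_labels):
    """GainRatio = entropy gain of the split / SplitInfo."""
    return gain / split_info(children_labels)
```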
Stop criteria
CART
Characteristics:
CHAID
- Variables: qualitative dependent variables.
- Split type: can be on binary or multiple nodes.
- Splitting rule: based on the chi-square test for the null hypothesis of statistical independence between the dependent variable and the explanatory variable (see the sketch after this list).
- Stop rule: explicit, and must relate to the maximum dimension of the tree, the maximum number of levels, or the minimum number of elements in a node.
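A minimal sketch of the chi-square independence test underlying the CHAID splitting rule, using scipy; the contingency-table counts are made up for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows = categories of the explanatory variable,
# columns = classes of the dependent variable (counts are illustrative).
table = np.array([[30, 10],
                  [12, 28],
                  [20, 20]])

chi2, p_value, dof, _ = chi2_contingency(table)
# A small p-value rejects independence, making the variable a
# candidate for splitting; CHAID picks the most significant one.
print(chi2, p_value, dof)
```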
C4.5/C5.0
Similar to CART, but differs in the following respects:
QUEST
Quick, Unbiased, Efficient, Statistical Tree
Random Forest
Instead of a single tree, the Random Forest method employs a set of decision trees.
Each tree is fitted on a bootstrap resample of the data (N cases drawn with replacement) and on a random subset of the predictor variables.
X trees are estimated in this way, and the classification suggested by the majority of the trees is taken as the final classification.
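A minimal sketch with scikit-learn's RandomForestClassifier on synthetic data; the hyperparameter values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# bootstrap=True resamples the cases for every tree; max_features="sqrt"
# draws a random subset of predictors at each split; the forest's
# prediction is the majority vote of the n_estimators trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))
```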
Pros and cons of Random Forest
Pros:
Cons:
XGBoost
The forest of trees is estimated sequentially, such that each new tree takes into account the prediction errors of the previous tree.
Pros:
Objective function of XGBoost: Loss + Regularisation
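In the notation of the XGBoost paper, the objective is Obj = sum_i l(y_i, yhat_i) + sum_k Omega(f_k), where Omega(f) = gamma*T + (lambda/2)*||w||^2 penalises the number of leaves T and the leaf weights w. A minimal sketch with the xgboost package on synthetic data (hyperparameter values are illustrative):

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Trees are added sequentially (n_estimators boosting rounds);
# reg_lambda and gamma correspond to the regularisation terms
# of the objective above.
model = XGBClassifier(n_estimators=50, learning_rate=0.1,
                      reg_lambda=1.0, gamma=0.0)
model.fit(X, y)
print(model.predict(X[:5]))
```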
Ensemble, Bagging, Boosting
Types of Ensembles: