Why is the decision tree a popular classification technique?
It is inexpensive to build, fast at classifying new records, and easy to interpret.
In each node there are brackets. Based on the lecture, what order do we assume for the values in the brackets?
[Non acceptor, acceptor] –> meaning [Negative, Positive]
Unless it is stated otherwise.
Where do we look if we want to find incorrectly predicted records or count the TN/TP/FP/FN?
At the leaf nodes: look at their color (the predicted class) and the values in their brackets.
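As a sketch of how the leaf counts turn into a confusion matrix, assume a hypothetical representation where each leaf stores its predicted class plus its bracket counts in the [Negative, Positive] order from above (the leaf contents here are made-up example numbers, not from the lecture):

```python
# Hypothetical leaves: (predicted_class, [non_acceptors, acceptors])
leaves = [
    ("Negative", [40, 5]),   # a leaf predicting Negative
    ("Positive", [3, 22]),   # a leaf predicting Positive
]

tn = fp = fn = tp = 0
for predicted, (neg, pos) in leaves:
    if predicted == "Negative":
        tn += neg   # negatives correctly predicted negative
        fn += pos   # positives wrongly predicted negative
    else:
        fp += neg   # negatives wrongly predicted positive
        tp += pos   # positives correctly predicted positive

print(tn, fp, fn, tp)  # 40 3 5 22
```

The key point: the predicted class comes from the leaf itself, and the true classes come from the counts in its brackets.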
Which two types of split do you have for nominal attributes?
A multi-way split (one branch per value) or a binary split (grouping the values into two subsets).
Which two types of split do you have for continuous attributes?
A binary split (A < v versus A >= v) or a multi-way split (discretizing the range into intervals).
How do you determine the best split?
You look at the information gain per split. You compute a measure of impurity, either the Gini index or entropy, for each candidate split, and then you pick the split with the highest information gain.
The lower the (weighted) Gini, the higher the information gain, and the better the split.
You use this when you are comparing splits!
How do you compute the GINI index?
If you have a node, its records fall into two (or more) classes.
For each class, you divide the #records in that class by the total #records in that node. This gives you the proportion per class.
You square those proportions and subtract all of them from 1. That is your GINI.
Example: Class 1 has 2 records and Class 2 has 4. Node total = 6.
GINI = 1 - (2/6)^2 - (4/6)^2 = 1 - 1/9 - 4/9 = 4/9 ≈ 0.444
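The Gini computation above can be sketched as a small Python function (the function name and the list-of-counts input are my own choices, not from the lecture):

```python
def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# The worked example from above: Class 1 has 2 records, Class 2 has 4.
print(gini([2, 4]))  # 1 - (2/6)^2 - (4/6)^2 = 4/9 ≈ 0.444
```

Because it sums over all class proportions, the same function also works for nodes with more than two classes.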
What are the numbers in the nodes?
The main number is the total amount in the node.
The number in the brackets is the amount per class.
Remember to compute TP, TN, etc. only with the leaf nodes.
What is the combined impurity?
You calculate the GINI index for each child node produced by a split.
You then take a weighted average (weighted by node size) to get the combined GINI:
(#records node 1 / (#records node 1 + #records node 2)) * GINI node 1 + (#records node 2 / (#records node 1 + #records node 2)) * GINI node 2
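The weighted average above can be sketched in Python (function names and the example split counts are my own illustration, not from the lecture):

```python
def gini(counts):
    """Gini impurity of one node: 1 minus the sum of squared proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def combined_gini(left, right):
    """Weighted average of the children's Gini, weighted by node size."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini(left) + (n_right / n) * gini(right)

# Hypothetical split: left child [1, 4], right child [5, 2]
print(combined_gini([1, 4], [5, 2]))
```

To compare candidate splits, compute the combined Gini for each one and prefer the split with the lowest value (highest information gain).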
What is the entropy measure?
Similar to GINI, but with a different computation: instead of 1 minus the sum of squared proportions, entropy is minus the sum of p * log2(p) over the class proportions.
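As a sketch of the standard entropy formula (the function name and example counts are my own choices, not from the lecture):

```python
import math

def entropy(counts):
    """Entropy: minus the sum of p * log2(p) over the class proportions."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Same [2, 4] node as the Gini example above.
print(entropy([2, 4]))
```

Like Gini, entropy is 0 for a pure node; for two classes it is maximal (1.0) when the classes are split 50/50, and it can be plugged into the same weighted-average scheme to compare splits.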