We have a particular life insurance product we would like to sell. We have a nice offer, but we incur a cost each time we target a customer with it. How should we proceed?
How do we choose, at each step, which attribute to use to segment the population?
General rule: the resulting groups should be as pure as possible, i.e., homogeneous with respect to the target variable.
The concept of information provides a way to quantify the amount of surprise of an event, measured in bits.
Intuition
Events that are rare (low probability) are more surprising, and therefore carry more information, than events that are common (high probability).
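To make the intuition concrete, here is a minimal sketch that assumes the standard definition of information content, -log2(p), for an event with probability p. The function name `information_bits` and the example probabilities are illustrative assumptions, not part of the original material.

```python
import math

def information_bits(p: float) -> float:
    """Information content (surprise) of an event with probability p, in bits.

    Assumes the standard definition: -log2(p).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -math.log2(p)

# Hypothetical probabilities chosen for illustration.
rare = information_bits(1 / 64)   # 6.0 bits: a rare event is very surprising
common = information_bits(1 / 2)  # 1.0 bit: a common event is less surprising
```

The rarer event yields the larger value, which matches the intuition stated above.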
Entropy
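Assuming the standard Shannon definition, the entropy of a group is the sum over classes of p * log2(1/p), where p is the share of each class in the group. A pure group has entropy 0, and a two-class group split 50/50 has entropy 1 bit. The sketch below uses a hypothetical helper name, `entropy`, and made-up labels.

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    counts = Counter(labels)
    # Sum of p * log2(1/p) over the classes present in the group.
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Hypothetical target values ("buys the product": yes / no).
pure_group = ["yes", "yes", "yes", "yes"]
mixed_group = ["yes", "yes", "no", "no"]

pure_entropy = entropy(pure_group)    # 0.0: perfectly homogeneous
mixed_entropy = entropy(mixed_group)  # 1.0: maximally impure for two classes
```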
Information Gain
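Assuming the usual definition, information gain is the entropy of the parent group minus the size-weighted average entropy of the child groups produced by a split. The following sketch illustrates that calculation; the function names and the toy labels are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children) -> float:
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical target labels for ten customers.
parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes perfectly gains the full 1 bit.
perfect_gain = information_gain(parent, [["yes"] * 5, ["no"] * 5])

# A split whose children are as mixed as the parent gains nothing.
useless_gain = information_gain(
    parent,
    [["yes", "yes", "no", "no"], ["yes", "yes", "yes", "no", "no", "no"]],
)
```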
How do we choose, at each step, which attribute to use to segment the population?
Rule: choose the variable that provides the most information gain with respect to the target variable
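The rule can be sketched as follows: compute the information gain of each candidate attribute and keep the one with the largest value. The attribute names (`married`, `homeowner`), the target name (`buys`), and the four records are hypothetical and serve only to illustrate the selection step.

```python
import math
from collections import Counter, defaultdict

def entropy(labels) -> float:
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(rows, attribute, target) -> float:
    """Information gain from splitting rows on one categorical attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    parent = [row[target] for row in rows]
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(parent) - weighted

# Hypothetical toy data set.
rows = [
    {"married": "yes", "homeowner": "yes", "buys": "yes"},
    {"married": "yes", "homeowner": "no",  "buys": "yes"},
    {"married": "no",  "homeowner": "yes", "buys": "no"},
    {"married": "no",  "homeowner": "no",  "buys": "no"},
]

candidates = ["married", "homeowner"]
gains = {a: information_gain(rows, a, "buys") for a in candidates}
best_attribute = max(gains, key=gains.get)  # "married" for this toy data
```

In this toy data, `married` separates buyers from non-buyers perfectly, while `homeowner` leaves both groups as mixed as before, so `married` is chosen.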
Do decision trees evaluate the information gain of all the variables at each split?
Yes
Can we use the same variable to split the data more than once?
Yes
How is the split done for continuous variables (e.g., income)?
Different thresholds are tested; the threshold with the highest information gain (IG) is used.
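A minimal sketch of the threshold search follows. It assumes that candidate thresholds are the midpoints between consecutive distinct values, which is one common choice and not necessarily what a given tool does. The income figures and labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, information gain) of the best binary split."""
    n = len(labels)
    parent_entropy = entropy(labels)
    distinct = sorted(set(values))
    best_t, best_gain = None, -1.0
    # Assumption: candidates are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = parent_entropy - (
            len(left) / n * entropy(left) + len(right) / n * entropy(right)
        )
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical incomes (in thousands) and purchase outcomes.
incomes = [20, 25, 30, 60, 70, 80]
buys = ["no", "no", "no", "yes", "yes", "yes"]

threshold, gain = best_threshold(incomes, buys)  # threshold 45.0, gain 1 bit
```

Because each threshold defines a different binary split, the same continuous variable can be reused deeper in the tree with a different threshold.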
The confusion matrix
The confusion matrix allows visualization of the performance of a model
True Positives (TP)
actual positives correctly predicted as positive
True Negatives (TN)
actual negatives correctly predicted as negative
False Positives (FP)
actual negatives incorrectly predicted as positive
False Negatives (FN)
actual positives incorrectly predicted as negative
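The four cells can be counted directly from paired actual and predicted labels. The sketch below is illustrative only: the function name `confusion_counts`, the choice of `"yes"` as the positive class, and the eight example labels are assumptions.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, TN, FP and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # actual positive, predicted positive
        elif a != positive and p != positive:
            tn += 1  # actual negative, predicted negative
        elif a != positive and p == positive:
            fp += 1  # actual negative, predicted positive
        else:
            fn += 1  # actual positive, predicted negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Hypothetical labels for eight customers.
actual    = ["yes", "yes", "yes", "no", "no", "no", "no",  "no"]
predicted = ["yes", "yes", "no",  "no", "no", "no", "yes", "no"]

counts = confusion_counts(actual, predicted)
# counts == {"TP": 2, "TN": 4, "FP": 1, "FN": 1}
```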
RapidMiner Studio
A commercial software product that provides an integrated environment for machine learning and business analytics