A fitting graph shows
the accuracy (or error rate) of a model as a function of model complexity
Generally, there will be more overfitting as…
one allows the model to be more complex
Complexity is a measure of
the flexibility of a model
If the model is a mathematical function, complexity is measured by
the number of parameters
If the model is a tree, complexity is measured by
the number of nodes
To detect overfitting, compare the model's performance on the training data against its performance on the
holdout data
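A minimal sketch of this comparison, with synthetic data: polynomial degree stands in for model complexity, and training error is compared against holdout error for each degree (all names and numbers here are illustrative).

```python
# Detecting overfitting: training error keeps falling as complexity grows,
# while holdout error eventually gets worse.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy target

idx = rng.permutation(30)            # random split: 20 train, 10 holdout
train, hold = idx[:20], idx[20:]

def mse(deg):
    """Fit a degree-`deg` polynomial on the training points; return (train, holdout) MSE."""
    coefs = np.polyfit(x[train], y[train], deg)
    pred = np.polyval(coefs, x)
    return (float(np.mean((pred[train] - y[train]) ** 2)),
            float(np.mean((pred[hold] - y[hold]) ** 2)))

for deg in (1, 3, 9):
    tr, ho = mse(deg)
    print(f"degree {deg}: train MSE {tr:.3f}, holdout MSE {ho:.3f}")
```

Training error alone cannot reveal overfitting, since it only improves with complexity; the holdout column is what exposes it.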
Generally: A procedure that grows trees until the leaves are pure tends to overfit
Ways to avoid overfitting for tree induction
- Stop growing the tree before it gets too complex (e.g., require a minimum number of instances per leaf)
- Prune back a tree that is too large (reduce its size)
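A minimal reduced-error-pruning sketch (an illustrative strategy, not any particular tool's exact algorithm; the dict-based tree format and all data are made up for the example): a subtree is replaced by a majority-class leaf whenever that does not increase errors on the holdout data.

```python
def predict(node, x):
    """Walk the tree until a leaf; internal nodes split on x[feat] <= thresh."""
    while "label" not in node:
        node = node["lo"] if x[node["feat"]] <= node["thresh"] else node["hi"]
    return node["label"]

def errors(node, data):
    return sum(predict(node, x) != label for x, label in data)

def prune(node, data):
    """Post-order prune; `data` is the holdout set [(features, label), ...]."""
    if "label" in node:
        return node
    lo_data = [(x, c) for x, c in data if x[node["feat"]] <= node["thresh"]]
    hi_data = [(x, c) for x, c in data if x[node["feat"]] > node["thresh"]]
    node["lo"], node["hi"] = prune(node["lo"], lo_data), prune(node["hi"], hi_data)
    if data:
        majority = max({c for _, c in data}, key=lambda m: sum(c == m for _, c in data))
        if errors({"label": majority}, data) <= errors(node, data):
            return {"label": majority}   # the leaf does at least as well: prune
    return node

# An overgrown tree whose deepest split fits noise; the holdout data reveals it.
tree = {"feat": 0, "thresh": 0.5,
        "lo": {"label": 0},
        "hi": {"feat": 0, "thresh": 0.8, "lo": {"label": 1}, "hi": {"label": 0}}}
holdout = [([0.2], 0), ([0.3], 0), ([0.6], 1), ([0.9], 1)]
pruned = prune(tree, holdout)
print(pruned)
```

The spurious split at 0.8 is collapsed into a single leaf, while the useful split at 0.5 survives because pruning it would increase holdout errors.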
Tuning model parameters
When tuning, each candidate setting is evaluated on a holdout portion of the training data, and the final test set is used only after all choices are made. This “nested” holdout set is often called the
“validation” set (to differentiate it from the final test set).
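A minimal sketch of this nested-holdout discipline, with made-up scores and labels: a parameter (here a score threshold) is chosen on the validation set only, and the test set is touched exactly once at the end.

```python
# Validation set picks the parameter; test set gives the final estimate.
val_scores  = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]   # model scores on validation set
val_labels  = [0,   0,   1,    1,   0,    1]
test_scores = [0.2, 0.7, 0.55, 0.95]
test_labels = [0,   1,   0,    1]

def accuracy(scores, labels, t):
    return sum((s >= t) == bool(y) for s, y in zip(scores, labels)) / len(labels)

# choose the threshold that does best on the validation data only
best_t = max([0.3, 0.5, 0.7], key=lambda t: accuracy(val_scores, val_labels, t))
print("chosen threshold:", best_t)
print("final test accuracy:", accuracy(test_scores, test_labels, best_t))
```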
Cross-validation
the dataset is first shuffled, then divided into k equal-sized partitions (e.g., ten for 10-fold cross-validation); each partition in turn serves as the test set while the model is trained on the remaining ones
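A small sketch of building the ten partitions (indices only; the helper name and seed are made up for the example):

```python
import random

def make_folds(n, k=10, seed=0):
    """Shuffle indices 0..n-1, then deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = make_folds(100)
# each round, one fold is held out for testing; the other nine train the model
test_fold, train_idx = folds[0], [i for f in folds[1:] for i in f]
print(len(folds), len(test_fold), len(train_idx))
```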
How to find which variables are the most important for the model?
There are multiple ways of determining how important a variable is:
“Random Forest”
is a tree-based ensemble model that builds many decision trees (each on a random sample of the data and features) and combines their predictions by voting or averaging
A “Random Forest” model is equivalent to a decision tree if:
the number of trees is set to 1 (and the random sampling of rows and features is disabled)
RapidMiner has an operator called “Weight by Tree Importance” that calculates variable importance based on information gain (among other criteria)
This operator only works with models of type “Random Forest”
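One of those other ways is permutation importance, sketched below with a toy two-column dataset and a stand-in "model" (not what the RapidMiner operator computes, but the same goal of ranking variables): shuffle one column at a time and measure how much accuracy drops.

```python
# Permutation importance: a variable the model relies on hurts accuracy when
# shuffled; an irrelevant variable does not.
X = [[0, 1], [0, 0], [1, 1], [1, 0], [0, 1], [1, 0]]   # column 0 drives the label
y = [0, 0, 1, 1, 0, 1]
model = lambda row: row[0]          # stand-in for a trained classifier

def acc(rows):
    return sum(model(r) == target for r, target in zip(rows, y)) / len(y)

base = acc(X)
importances = []
for col in range(2):
    shuffled = [r[:] for r in X]
    order = (3, 4, 5, 0, 1, 2)      # fixed permutation, for reproducibility
    for r, v in zip(shuffled, [X[i][col] for i in order]):
        r[col] = v
    importances.append(base - acc(shuffled))
print(importances)
```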
A learning curve is…
a plot of the generalization performance (measured on testing data) against the amount of training data
Generalization performance improves as…
more training data are available
The curve is steep initially, but the marginal advantage of more data then decreases
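A sketch of computing a learning curve on synthetic data, using a nearest-centroid classifier as the (made-up) learner: test accuracy is recorded as the training set grows.

```python
# Learning curve: holdout accuracy vs. training-set size.
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Draw n balanced points: class 0 centered at -1, class 1 at +1."""
    y = np.tile([0, 1], n // 2)
    x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)
    return x, y

test_x, test_y = sample(500)
curve = []
for n in (6, 20, 80, 320):
    x, y = sample(n)
    c0, c1 = x[y == 0].mean(), x[y == 1].mean()   # "train": one centroid per class
    pred = (np.abs(test_x - c1) < np.abs(test_x - c0)).astype(int)
    curve.append((n, float((pred == test_y).mean())))
print(curve)
```

Plotting `curve` (size on the x-axis, accuracy on the y-axis) gives the characteristic shape: steep at first, then flattening.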
The ROC graph shows
the entire space of performance possibilities for a given model, independent of class balance
A ROC curve can be constructed by
sweeping a decision threshold over the scores of a ranking classifier; each threshold yields one (false positive rate, true positive rate) point
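A minimal sketch of that sweep, with made-up scores and labels:

```python
# Build ROC points by lowering the threshold through the ranked scores.
scores = [0.9, 0.8, 0.7, 0.55, 0.4, 0.3]   # classifier scores, highest first
labels = [1,   1,   0,   1,    0,   0]     # true classes

P, N = sum(labels), len(labels) - sum(labels)
points = [(0.0, 0.0)]                                   # threshold above all scores
for t in sorted(set(scores), reverse=True):             # one point per threshold
    tp = sum(l for s, l in zip(scores, labels) if s >= t)
    fp = sum(1 - l for s, l in zip(scores, labels) if s >= t)
    points.append((fp / N, tp / P))                     # (FPR, TPR)
print(points)
```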
The area under the ROC curve
is the probability that the model will rank a randomly chosen positive case higher than a randomly chosen negative case
AUC is useful when
a single number is needed to summarize performance, or when nothing is known about the operating conditions
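That probabilistic definition can be computed directly, by counting correctly ranked (positive, negative) pairs (ties count half; scores and labels are the same made-up example as a ranked list would give):

```python
# AUC = fraction of (positive, negative) pairs the model ranks correctly.
scores = [0.9, 0.8, 0.7, 0.55, 0.4, 0.3]
labels = [1,   1,   0,   1,    0,   0]

pos = [s for s, l in zip(scores, labels) if l == 1]
neg = [s for s, l in zip(scores, labels) if l == 0]
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))
print(auc)
```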
Comparing models
We can compare the performance of different models by looking
at their ROC curves.
For each operating condition (threshold), the optimal model is the one whose ROC curve is highest at that point.
The expected value framework is …
an analytical tool that is extremely helpful in organizing thinking about data-analytic problems.
The expected value framework combines:
the probability of each possible outcome with the cost or benefit of that outcome
The benefit/cost matrix
summarizes the benefits and costs of each potential outcome, always comparing with a base scenario
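A minimal worked example (all numbers invented): outcome probabilities are estimated from a confusion matrix, then weighted by a cost-benefit matrix expressed relative to the do-nothing baseline.

```python
# Expected value = sum over outcomes of p(outcome) * value(outcome).
# Rows/cols: predicted x actual, in the order (positive, negative).
confusion = [[60, 20],    # TP, FP counts
             [10, 110]]   # FN, TN counts
benefit   = [[99, -1],    # benefit of a TP, cost of a FP (vs. the base scenario)
             [0,   0]]    # FN and TN: no action taken, so no change vs. baseline

n = sum(sum(row) for row in confusion)
expected_value = sum(confusion[i][j] / n * benefit[i][j]
                     for i in range(2) for j in range(2))
print(expected_value)
```

Repeating this calculation for each candidate model (or threshold) lets you pick the one with the highest expected value under the stated costs and benefits.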