What are the possible reasons for overfitting when creating a decision tree? How
can overfitting be avoided?
1) too much variance in the training data, so the data is not a representative sample of the instance space and the tree splits on irrelevant features
2) too much noise in the training data: incorrect feature values or class labels
3) avoid by:
- pre-pruning: stop growing the tree at the point where there is insufficient data to make reliable decisions
- post-pruning: grow the full decision tree, then remove nodes with insufficient evidence
- post-pruning mechanism: prune the children of a node S if all children are leaves and the accuracy on the validation set does not decrease when the most frequent class label is assigned to all items at S
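The post-pruning mechanism above (reduced-error pruning) can be sketched in Python. The tree representation, data, and helper names here are assumptions for illustration, not part of the notes:

```python
from collections import Counter

class Node:
    """Binary decision-tree node; internal nodes hold a split, leaves a label."""
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, label=None):
        self.feature = feature      # feature index used for the split
        self.threshold = threshold  # go left if x[feature] <= threshold
        self.left = left
        self.right = right
        self.label = label          # class label; set only on leaves

    def is_leaf(self):
        return self.label is not None

def predict(tree, x):
    node = tree
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label

def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def prune(node, tree, train_subset, val_data):
    """Reduced-error post-pruning: bottom-up, collapse a node S whose
    children are all leaves into a leaf carrying the most frequent class
    label at S, keeping the change only if validation accuracy does not
    decrease."""
    if node.is_leaf():
        return
    left_sub = [(x, y) for x, y in train_subset
                if x[node.feature] <= node.threshold]
    right_sub = [(x, y) for x, y in train_subset
                 if x[node.feature] > node.threshold]
    prune(node.left, tree, left_sub, val_data)
    prune(node.right, tree, right_sub, val_data)
    if node.left.is_leaf() and node.right.is_leaf() and train_subset:
        before = accuracy(tree, val_data)
        saved = (node.feature, node.threshold, node.left, node.right)
        majority = Counter(y for _, y in train_subset).most_common(1)[0][0]
        node.feature = node.threshold = node.left = node.right = None
        node.label = majority
        if accuracy(tree, val_data) < before:
            # pruning hurt validation accuracy: undo it
            node.feature, node.threshold, node.left, node.right = saved
            node.label = None
```

Calling `prune(root, root, train_data, val_data)` walks the tree bottom-up so that once a node's subtrees have been collapsed into leaves, the node itself becomes a pruning candidate.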
Outline the causes of the different forms of generalisation error
1) approximation error:
- due to the hypothesis space being smaller than the target space ⇒ the underlying function may lie outside the hypothesis space
- a poor choice of model space ⇒ large approximation error, i.e. model mismatch
2) estimation error:
- due to the learning procedure selecting a non-optimal model from the hypothesis space
- arises because the learner minimises empirical risk on a finite sample, yielding the estimate f̂_{n,N}(x) rather than the best model in the hypothesis space
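The two error terms above can be written as a risk decomposition (a standard formulation; the risk functional R and the symbol f_N for the best model in the hypothesis space are assumptions here, chosen to match the f̂_{n,N} notation in the notes):

```latex
% f^*         : true target function
% H_N         : hypothesis space
% f_N         : best model within H_N
% \hat{f}_{n,N}: model learned from n samples
R(\hat{f}_{n,N}) - R(f^*)
  = \underbrace{R(f_N) - R(f^*)}_{\text{approximation error}}
  + \underbrace{R(\hat{f}_{n,N}) - R(f_N)}_{\text{estimation error}}
```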
Three types of statistical learning
1) Empirical modelling
2) Neural networks
3) Support Vector Machines (SVMs)
Given a set of data samples, what is the aim of a support vector machine?
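For context (a standard definition, not spelled out in these notes): an SVM seeks the separating hyperplane that maximises the margin to the nearest samples of each class (the support vectors). In one dimension this reduces to the midpoint between the closest opposite-class points; a toy sketch with made-up data:

```python
def max_margin_boundary(neg, pos):
    """For 1-D linearly separable data (all neg < all pos), the
    maximum-margin decision boundary is the midpoint between the
    closest points of the two classes; the margin is half their gap."""
    boundary = (max(neg) + min(pos)) / 2
    margin = (min(pos) - max(neg)) / 2
    return boundary, margin

# closest opposite-class points are -1.0 and 2.0
b, m = max_margin_boundary(neg=[-3.0, -1.0], pos=[2.0, 4.0])
```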
Given a set of data samples, what are the characteristics of an applied neural network?
Given a set of data samples, what is the aim of empirical modelling?