Lecture 10 - Modelling Flashcards

(27 cards)

1
Q

What is overfitting?

A

An overfit model looks great on the training data and then performs poorly on new data.

Training Error: Model’s prediction error for the training data.

Generalisation Error: Model’s prediction error for new data.

Usually, the training error will be smaller than the generalisation error (no big surprise). Ideally, though, the two errors should be close to each other.

If the generalisation error is large and your model’s test performance is poor while your training error is small, then your model has probably overfit the training data.
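A tiny illustration of that gap (a Python sketch; the lookup-table "model" below is an invented stand-in for a model that has memorised its training data):

```python
# A "memorizing" model: perfect on the training data, useless on new data.
train = {(1, 2): "a", (3, 4): "b"}

def memorizer(x):
    # Looks up the exact training instance; guesses "a" for anything unseen.
    return train.get(x, "a")

# Training error: fraction of training instances it gets wrong.
train_error = sum(memorizer(x) != y for x, y in train.items()) / len(train)

# Generalisation error: fraction of *new* instances it gets wrong.
new_data = [((5, 6), "b"), ((7, 8), "b")]
gen_error = sum(memorizer(x) != y for x, y in new_data) / len(new_data)
```

The training error is 0 while the generalisation error is large: the signature of overfitting.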

2
Q

Why is overfitting bad, what is preferred?

A

An overfitting model has memorised the training data instead of discovering generalisable rules or patterns.

Simpler models are preferred as they tend to generalise better and avoid overfitting.

3
Q

What is log likelihood?

A

Log likelihood is a measure (a non-positive number) of how well a model’s predictions “match” the true class labels.

4
Q

Why is a larger log likelihood better?

A

The larger the magnitude of the log likelihood, the worse the match. Since the log likelihood is non-positive, we prefer a higher value, i.e., one close to 0.

The log likelihood of a model’s prediction on a specific instance is the logarithm of the probability that the model assigns to the instance’s actual class.
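As a sketch of that definition (in Python rather than the R used in the module; the function name is my own), the log likelihood of a binary classifier's predictions can be computed like this:

```python
import math

def log_likelihood(probs, labels):
    """Sum of log(probability the model assigned to the actual class).

    probs: predicted probability of the positive class for each instance.
    labels: actual class labels (1 = positive, 0 = negative).
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # Probability the model assigned to the instance's *actual* class.
        p_actual = p if y == 1 else 1 - p
        total += math.log(p_actual)
    return total  # non-positive; closer to 0 is better

# A confident, correct model scores closer to 0 than an unsure one.
good = log_likelihood([0.9, 0.1], [1, 0])  # 2 * log(0.9)
poor = log_likelihood([0.6, 0.4], [1, 0])  # 2 * log(0.6)
```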

5
Q

What is deviance?

A

Deviance measures how far your model is from a perfect model. It is defined as
−2 × (logLikelihood − S),
where S is a technical constant called “the log likelihood of the saturated model.”

In most cases, the saturated model is a perfect model that returns probability 1 for items in the class and probability 0 for items not in the class (so S = 0).

The lower the deviance, the better the model.
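A minimal Python sketch of the formula, assuming the usual case S = 0 (the function names are my own, not from the slides):

```python
import math

def deviance(log_likelihood, saturated_ll=0.0):
    # deviance = -2 * (logLikelihood - S); S = 0 for the usual saturated model
    return -2 * (log_likelihood - saturated_ll)

# A model assigning probability 0.8 to each of 3 instances' actual classes:
ll = 3 * math.log(0.8)   # non-positive log likelihood
d = deviance(ll)         # lower deviance = better model
```

A perfect model (log likelihood 0) has deviance 0; worse models have larger, positive deviance.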

6
Q

What is Akaike Information Criterion (AIC)?

A

AIC is defined as

deviance + 2 × numberOfParameters

The more parameters are in the model, the more complex the model is;
the more complex a model is, the more likely it is to overfit.

Thus, AIC is deviance penalised for model complexity.

When comparing models (on the same test set), you will generally prefer
the model with a smaller AIC.
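The arithmetic is simple enough to sketch directly (the two models and their numbers below are hypothetical):

```python
def aic(deviance, n_params):
    # AIC = deviance + 2 * numberOfParameters: deviance penalised for complexity
    return deviance + 2 * n_params

# Two hypothetical models evaluated on the same data:
simple_model  = aic(deviance=120.0, n_params=3)    # 120 + 6  = 126.0
complex_model = aic(deviance=118.0, n_params=12)   # 118 + 24 = 142.0
# The simpler model wins (smaller AIC) despite its slightly higher deviance.
```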

7
Q

What is AIC useful for?

A

The AIC is useful for comparing models of differing complexity,
and for comparing models built on variables with differing numbers of levels.

8
Q

How can a model be scored using AIC?

A

A model can be scored with
- a bonus proportional to its scaled log likelihood on the calibration data
- minus a penalty proportional to the complexity of the model.

9
Q

What are evaluation methods?

A

When we want to find the best single-variable model, we can use evaluation methods such as log likelihood and deviance:
- the best model has the largest log likelihood;
- equivalently, it has the smallest deviance.

10
Q

What are the steps to the evaluation method when checking how good a model is?

A

- Compute the log likelihood.
- Run through the categorical variables: pick variables based on the reduction in deviance with respect to the null deviance.
- Run through the numerical variables.

REFER TO SLIDES FOR CODE EXAMPLES
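The slides contain the actual R code; the following Python sketch only illustrates the deviance-reduction idea with made-up predictions (all names and numbers are mine):

```python
import math

def deviance_from_probs(probs, labels):
    # -2 * sum of log(prob assigned to the actual class); saturated LL = 0
    return -2 * sum(math.log(p if y == 1 else 1 - p)
                    for p, y in zip(probs, labels))

labels = [1, 1, 0, 0, 0, 1]

# Null model: predict the overall positive rate for every instance.
base_rate = sum(labels) / len(labels)
null_dev = deviance_from_probs([base_rate] * len(labels), labels)

# A hypothetical single-variable model's predictions:
model_probs = [0.8, 0.7, 0.2, 0.3, 0.1, 0.9]
model_dev = deviance_from_probs(model_probs, labels)

# Prefer the variable giving the largest reduction from the null deviance.
reduction = null_dev - model_dev
```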

11
Q

What are Decision tree models?

A

Decision trees are a simple model type – they make a prediction that is piecewise constant.
- The construction of a decision tree involves splitting the training data into pieces and using a simple constant on each piece.

Decision trees can be used to quickly predict categorical or numeric outcomes.

12
Q

How do decision trees work?

A

Decision trees are binary trees. A decision tree is built by iteratively finding, at each node, the optimal feature out of all the features and the optimal threshold on that feature to split the node, e.g., at the root node, Age is the optimal feature and 27 is the optimal threshold found by the decision tree’s algorithm.

This splitting process results in training instances being divided and passed down the branches to the two child nodes. As we progress down the decision tree, there are fewer and fewer training instances in each node.
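A minimal sketch of the split search for one numeric feature (Python; Gini impurity is used as the split score here, which is one common choice — the slides' algorithm may score splits differently, and the data below is made up to echo the Age/27 example):

```python
def best_split(xs, ys):
    """Find the threshold on one numeric feature that best separates classes,
    scored by weighted Gini impurity of the two child nodes (a sketch)."""
    def gini(labels):
        if not labels:
            return 0.0
        p = sum(labels) / len(labels)
        return 2 * p * (1 - p)  # 0 when the node is pure

    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical ages and a binary outcome; the split found is Age <= 27.
threshold, impurity = best_split([22, 25, 27, 30, 41], [1, 1, 1, 0, 0])
```

In a real tree this search is repeated over every feature at every node, and the winner becomes that node's split.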

13
Q

How are splitting of nodes terminated in decision trees?

A

The splitting of a node can be terminated by any of the following criteria:
- the node contains only instances of the same class;
- the node is at the pre-defined maximum depth value for the tree;
- the node has too few instances for further splitting.
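The three criteria can be sketched as a single check (Python; the function and the parameter names `max_depth`/`min_split` are my own, loosely echoing rpart's `maxdepth`/`minsplit` controls):

```python
def should_stop(labels, depth, max_depth=5, min_split=20):
    # Stop splitting a node if any of the three criteria holds.
    pure = len(set(labels)) <= 1          # only one class left in the node
    too_deep = depth >= max_depth         # reached the maximum tree depth
    too_small = len(labels) < min_split   # too few instances to split further
    return pure or too_deep or too_small

stop_pure = should_stop([1, 1, 1], depth=2)        # True: node is pure
stop_ok = should_stop([1] * 15 + [0] * 15, depth=2)  # False: keep splitting
```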

14
Q

What is the objective of decision trees?

A

We can also consider that the objective behind decision tree methods is to partition the feature space into homogeneous regions (i.e., having instances belonging to one class only) as much as possible. Also, the regions should not be narrow and long.

15
Q

What are different types of decision trees?

A

Notable Decision Tree algorithms include:
- ID3 (Iterative Dichotomiser 3)
- C4.5 (successor of ID3)
- CART (Classification And Regression Tree)
- CHAID (CHi-squared Automatic Interaction Detector)
- MARS: extends decision trees to handle numerical data better.
- Conditional Inference Trees.

16
Q

What are the pros of decision trees?

A
  • Decision trees take any type of data — numerical or categorical — without any distributional assumptions and without preprocessing.
  • Most implementations (in particular, R) handle missing data; the method is also robust to redundant and non-linear data.
  • The algorithm is easy to use, and the output (the tree) is relatively easy to understand.
  • They naturally express certain kinds of interactions among the input variables: those of the form “IF x is true AND y is true, THEN…”
  • Once the model is fit, scoring is fast.
17
Q

What are the cons of decision trees?

A
  • They have a tendency to overfit, especially without pruning.
  • They have high training variance: samples drawn from the same population can produce trees with different structures and different prediction accuracy.
  • Simple decision trees are not as reliable as other tree-based ensemble methods: e.g., random forests.
18
Q

How to build a decision tree?

A

In R, we can use the rpart package: its rpart() function builds a decision tree from a formula and training data.

19
Q

What can we do if the model looks too good on the training data and not as good on the calibration and test data?

A

One workaround is to fit the model on reprocessed variables that hide the categorical levels (replacing them with numeric predictions) and treat NAs as just another level.

If the model still performs quite poorly on the calibration data, we turn our suspicion to overfitting.

20
Q

What hyperparameters can help improve the AUC of the decision tree model?

A

Setting the minsplit, minbucket, and maxdepth hyperparameters appropriately can help improve the AUC of the model. The next thing to try is using only the reprocessed numerical variables that achieved high AUC scores.

21
Q

How to interpret a decision tree?

A
  • Node 1 is always called the root.
  • A node with no children is called a leaf node. Leaf nodes are marked with stars.
  • Each node other than the root node has a parent, and the parent of node k is node floor(k/2).
  • Each node other than the root is named by the condition that must be true to move from the parent to the node.
The remaining three numbers reported for each node are:
  • the number of training items that navigated to the node,
  • the deviance of the set of training items that navigated to the node (a measure of how much uncertainty remains at a given decision tree node),
  • the fraction of items that were in the positive class at the node (which is the prediction for leaf nodes).
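The node numbering follows simple binary-heap arithmetic, which can be sketched directly (Python; the helper names are my own):

```python
def parent(k):
    # The parent of node k is floor(k / 2); node 1 (the root) has no parent.
    return None if k == 1 else k // 2

def children(k):
    # Conversely, node k's children (if it has any) are nodes 2k and 2k + 1.
    return (2 * k, 2 * k + 1)

# Walk from node 11 back up to the root, recording the path.
path = []
node = 11
while node is not None:
    path.append(node)
    node = parent(node)
# path traces 11 -> 5 -> 2 -> 1
```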
22
Q

If there is poor performance of a decision tree, what other model can we use?

A

The best guess is that this dataset is unsuitable for decision trees and a method that deals better with overfitting issues is needed – such as random forests.

23
Q

What are KNNs?

A

kNN (k-nearest neighbours): predicting a property of a datum based on the datum or data that are most similar to it. It can be used for regression and multi-class classification.

24
Q

How do KNNs work?

A
  • The value of k needs to be determined a priori. This determines the number of neighbours to be used for the prediction.
  • A distance function (the default is the Euclidean distance) is needed as a measure of nearness between data points in the feature space.
  • For categorical variables in the dataset, the Hamming distance (see later) should be used.
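The whole procedure fits in a few lines. A minimal Python sketch of kNN classification with Euclidean distance and majority vote (my own function names, not R's knn; the data is made up):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points
    (Euclidean distance; a sketch of the idea)."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    # Sort training instances by distance to the query; take the k nearest.
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda xy: dist(xy[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Two clearly separated clusters:
X = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
y = ["a", "a", "a", "b", "b", "b"]
pred = knn_predict(X, y, query=(2, 2), k=3)  # the 3 nearest are all "a"
```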
25
Q

What are the kNN distance functions that can be used?

A

- Euclidean Distance: only makes sense when all the data is real-valued (quantitative). Often referred to as straight-line distance; it is also the L2 norm of the difference vector x₁ − x₂.
- Manhattan Distance: measures the total number of units along each dimension it takes to get from one (real-valued) point to the other (no diagonal moves). Also known as the L1 norm of x₁ − x₂.
- L-Infinity: the L-infinity distance between two points x₁ and x₂. Also known as the L-infinity norm of the difference vector x₁ − x₂.
- Hamming Distance (for categorical variables): counts the number of mismatches, i.e., the distance is 0 if two values are in the same category (a perfect match) and 1 otherwise. If the categories are ordered (like small/medium/large) so that some categories are “closer” to each other than others, then you should convert them to a numerical sequence.

REFER TO SLIDES FOR FORMULA AND EXAMPLES
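The slides give the formulas; as a Python sketch, all four distances are one-liners (function names are mine):

```python
import math

def euclidean(a, b):   # L2 norm: straight-line distance
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):   # L1 norm: per-dimension moves summed, no diagonals
    return sum(abs(x - y) for x, y in zip(a, b))

def l_infinity(a, b):  # L-infinity norm: the single largest per-dimension gap
    return max(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):     # for categorical values: count the mismatches
    return sum(1 for x, y in zip(a, b) if x != y)

p1, p2 = (0, 0), (3, 4)
# euclidean gives 5.0, manhattan gives 7, l_infinity gives 4;
# hamming(("red", "small"), ("red", "large")) gives 1.
```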
26
Q

What data preparation do you need to do for kNNs?

A

Apart from splitting the dataset into a training set and a calibration set (assuming the test set is completely unknown), additional data preparation steps are required for kNN:
- Determine a set of k values. Use the calibration set to help find the optimal k value from the set.
- Numerical columns having too many NAs should be dropped from the set of input features.
- If the number of NAs and/or missing values in a numerical column is small, they can be imputed by the median or mean value of the column; NAs in a categorical column may be treated as a separate level.
- Categorical columns whose levels can be ordered (e.g., small/medium/large) should be converted into numerical ones.
- kNN is sensitive to the different scales of the numerical columns. Normalization is usually required to ensure that no numerical columns dominate the training set. Feature columns in the calibration and test sets will need to be normalized the same way.
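The normalization step can be sketched with min-max scaling (one common choice; the function names and data are mine). The key point is that the scaling parameters come from the training set only and are then reused on the calibration/test columns:

```python
def minmax_fit(column):
    # Learn the normalization parameters on the *training* column only.
    lo, hi = min(column), max(column)
    return lo, hi

def minmax_apply(column, lo, hi):
    # Scale into [0, 1] using the training set's range, so calibration/test
    # columns are normalized the same way (values may fall slightly outside).
    return [(v - lo) / (hi - lo) for v in column]

train_ages = [20, 30, 40, 60]
lo, hi = minmax_fit(train_ages)
train_scaled = minmax_apply(train_ages, lo, hi)  # 20 -> 0.0, 60 -> 1.0
calib_scaled = minmax_apply([50], lo, hi)        # scaled with the same lo/hi
```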
27
Q

Why are kNNs sometimes not good?

A

They are expensive in both time and space, so models like logistic regression can be a better choice.