What is a target variable
A target variable is the dependent variable (the y variable)
Target variables can be continuous, categorical or ordinal
What is the definition of features
Features are the x variables
What is the definition of a training set
Training set is the sample used to FIT the model
What is the definition of a hyperparameter
This is a model input which is specified by the researcher rather than learned from the data.
What is the definition of supervised machine learning
This is when you use labelled training data, where the y variable is clearly defined and provided to the algorithm.
The goal is to guide the algorithm towards higher accuracy by providing the correct answers during training.
What are the main key tasks, and when is each used
Key tasks are regression and classification
Regression is used when the target is continuous
And classification is used when the target is ordinal or categorical.
This can be a binary classification such as something either being a dog or not.
Multiple regression is a good example of supervised learning
What is unsupervised machine learning
Unsupervised machine learning is when the algorithm is not given any labelled training data.
Therefore there is no answer given to the model
Instead of using classifiers, like sector similarities, you could use unsupervised learning to group stocks into clusters which have behaved the most similarly.
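The stock-grouping idea can be sketched with a minimal k-means loop. This is a toy illustration, not a real clustering library: the "stocks" are invented pairs of return features, and the initial centroids are chosen by hand.

```python
# Minimal k-means sketch: group "stocks" by similarity of two
# hypothetical return features (all numbers are invented).
def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters, centroids

# Two tight groups: low-volatility vs high-volatility behaviour.
stocks = [(0.01, 0.02), (0.02, 0.01), (0.30, 0.28), (0.28, 0.31)]
clusters, _ = kmeans(stocks, centroids=[(0.0, 0.0), (0.3, 0.3)])
print([len(c) for c in clusters])  # [2, 2]
```

No labels are supplied anywhere: the algorithm finds the two groups purely from the similarity of the observations.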
What is the definition of deep learning.
Deep learning is used for highly complex tasks which involve non-linearities
They are based on neural networks which have many hidden layers.
At least two but usually more than 20
Reinforcement learning is when models use trial and error to maximise an outcome, based on the results of their previous attempts.
What are the key tasks that unsupervised learning is likely to perform
Unsupervised learning is likely to perform clustering, where a model groups observations.
And it’s also likely to be used for image recognition, deep learning, and natural language processing.
What is supervised learning
When do you use it
Supervised learning uses labelled training data, where the y variable is clearly defined and provided to the algorithm.
Classification is used when the target variable is categorical or ordinal, and can be a binary classification or a multi category classification.
What is the meaning of reinforcement learning.
Reinforcement learning is where an agent learns through trial and error how to maximise a reward, subject to a set of constraints.
What is the meaning of overfitting when it comes to machine learning
When does overfitting happen
Overfitting is when a model is too complex because it has too many features; it's a bit like an OLS regression having too many variables.
Overfitting happens when the model mistakes random noise for a signal or pattern
What happens to the r squared and the adjusted r squared when you have overfitting
When you have overfitting the r squared is likely to be high, whereas the adjusted r squared is likely to be low.
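The adjusted r squared penalises extra features explicitly, which is why it falls while plain r squared rises. A small worked example using the standard adjustment formula (the r squared values and sample size are illustrative, not from any real regression):

```python
# Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - k - 1).
# With n = 30 observations, a small rise in R-squared from adding
# many variables can still lower the adjusted figure.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 30
lean = adjusted_r2(0.80, n, k=3)      # 3 features
bloated = adjusted_r2(0.82, n, k=15)  # 15 features, barely better fit
print(round(lean, 3), round(bloated, 3))  # 0.777 0.627
```

Even though the bloated model has the higher r squared (0.82 vs 0.80), its adjusted r squared is clearly lower.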
What does it mean when a model is said to generalise well
A model generalises well when it retains its explanatory power when applied to new and out of sample data.
What are the three prediction errors that data scientists use to understand and address overfitting
1 bias error. This is the in sample error that results from a model having a poor fit.
2 variance error. This is the out of sample error that results from overfitted models.
3 base error. This is the random error that is impossible to eliminate from a model due to random noise.
What happens to variance error and bias error through increased model complexity
Variance error increases with model complexity
Bias error decreases with complexity
What are the 5 ways that you can address overfitting and explain each of them
1 complexity reduction
This is when you impose a penalty to exclude features that do not contribute to the out of sample prediction accuracy. You attempt to create a parsimonious model, which is a model that achieves the highest level of explanation using the smallest number of variables possible.
2 penalised regression (LASSO regression)
It's a method that minimises the sum of squared errors plus a penalty based on the absolute values of the slope coefficients.
It automatically eliminates the least predictive features.
3 cross validation
Cross validation estimates the out of sample error by dividing the data into k parts.
The model is trained on k-1 parts and then validated on the remaining part, repeating the process k times.
4 regularisation and pruning.
In classification and regression trees, overfitting is addressed by taking away sections of the decision tree that do not offer a lot of explanatory power.
5 ensemble learning
Random forests mitigate overfitting by training the trees on different subsets of data. Because each tree uses different features, errors across them tend to cancel out.
What does LASSO stand for and what does it measure
What is the metric used to determine the balance between overfitting and parsimony
Lasso stands for least absolute shrinkage and selection operator.
Lambda determines the balance between overfitting and parsimony.
What is K cross validation how does it work.
You first divide the data into k different parts, which are of equal size.
Then the model is trained k times; in each training iteration, k-1 parts are used as the training sample.
The remaining part is used for validation.
The process is repeated until every one of the folds has served as the validation set exactly once.
Then the error rate is measured for each iteration and the mean of these errors is used as the estimate of the out of sample error.
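The k-fold bookkeeping described above can be sketched directly. The "model" here is deliberately a stub (a function that just returns an error number for each fold), since the point is the splitting and averaging, not any particular estimator:

```python
# k-fold cross-validation sketch: split indices into k folds, train on
# k-1 folds, validate on the held-out fold, and average the errors.
def kfold_indices(n, k):
    folds = []
    size, extra = divmod(n, k)
    start = 0
    for i in range(k):
        stop = start + size + (1 if i < extra else 0)
        folds.append(list(range(start, stop)))
        start = stop
    return folds

def cross_validate(n, k, error_fn):
    folds = kfold_indices(n, k)
    errors = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        errors.append(error_fn(train_idx, val_idx))  # fit + validate stub
    return sum(errors) / k  # mean error estimates the out of sample error

print(kfold_indices(10, 5))  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

Each of the 10 observations appears in the validation set exactly once across the 5 iterations.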
What are penalised regressions and what are they used for
Penalised regression models are used to address overfitting.
They impose a penalty on the model's complexity which increases with the number of features.
The concept is that the models minimise the SSE plus a penalty term. This means the model achieves a high level of explanation but with as few predictors as possible.
An example is LASSO, where the penalty is the sum of the absolute values of the slope coefficients.
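The LASSO objective (SSE plus lambda times the sum of absolute slope coefficients) can be computed directly. The data, fitted values and coefficients below are made up purely to show the arithmetic:

```python
# LASSO objective sketch: SSE + lambda * sum(|slope coefficients|).
def lasso_objective(y, y_hat, betas, lam):
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    penalty = lam * sum(abs(b) for b in betas)
    return sse + penalty

y = [1.0, 2.0, 3.0]         # observed values (invented)
y_hat = [1.1, 1.9, 3.2]     # fitted values (invented)
betas = [0.5, -0.3]         # slope coefficients (intercept not penalised)
print(lasso_objective(y, y_hat, betas, lam=0.1))  # SSE 0.06 + penalty 0.08
```

A larger lambda makes the penalty term dominate, pushing the least predictive coefficients towards zero, which is how the feature elimination works.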
What is the support vector machine
What is a support vector
What is a soft margin classification
What is a kernel trick
A support vector machine is an algorithm mainly used for classification
It tries to find the optimal decision boundary that separates data into two groups with the biggest margin between the two groups.
A support vector is an observation lying near the boundary that is used to define the boundary's position.
A soft margin classification is an adaptation that allows some misclassified observations, trading off a wider margin against classification error.
A kernel trick is a method used to reshape data into higher dimensions, to find a clear split when groups are not separable.
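The kernel-trick intuition can be shown with a tiny mapping example: points that no single threshold can split in one dimension become linearly separable after mapping x to (x, x squared). The data below are invented, with class A in the middle and class B on both sides:

```python
# 1-D data that no single threshold can separate: B, A, A, A, B.
data = [(-2, "B"), (-1, "A"), (0, "A"), (1, "A"), (2, "B")]

# Map each point into a higher dimension: x -> (x, x**2).
mapped = [((x, x * x), label) for x, label in data]

# In the new second dimension, one boundary at x**2 = 2.5 separates them.
boundary = 2.5
preds = ["A" if x2 < boundary else "B" for (x1, x2), label in mapped]
print(preds)  # ['B', 'A', 'A', 'A', 'B']
```

A real SVM never computes the mapped coordinates explicitly; the kernel function gives it the same effect cheaply, but the geometric idea is the one above.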
What is the K nearest neighbour
What is the meaning of k
What happens if k is too small or k is too big
What is the investment application of this method
K nearest neighbour classifies a new observation according to its nearness to observations in the training sample.
K is the number of neighbors that the algorithm considers
When k is too small the model might overfit (high error rate)
When k is too big it dilutes the result by averaging across different outcomes.
This method is used when you need to assign bonds to different ratings, or when you want to create different indices.
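The bond-rating application can be sketched in a few lines: classify a new bond by the majority rating among its k closest neighbours in feature space. The features (leverage, coverage) and ratings below are invented for illustration:

```python
# Minimal k-nearest-neighbours sketch for assigning a bond rating.
def knn_predict(train, query, k):
    # Sort training points by squared distance to the query point.
    by_dist = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    labels = [label for _, label in by_dist[:k]]
    return max(sorted(set(labels)), key=labels.count)  # majority vote

# (leverage, coverage) -> rating, all numbers invented
train = [((0.2, 5.0), "AA"), ((0.3, 4.5), "AA"),
         ((0.8, 1.2), "B"), ((0.9, 1.0), "B")]
print(knn_predict(train, query=(0.25, 4.8), k=3))  # AA
```

With k=3 the query bond's two nearest neighbours are both AA, so the vote is AA; pushing k up to 4 would drag in both B bonds and start diluting the result, which is the "k too big" problem from the card above.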
What is a classification and regression tree
What is the difference between the two
A classification and regression tree is essentially a tree of questions
It organises data into different nodes
The root node at the top of the tree is the most important
And the decision nodes are points further down the tree where data is split further
Terminal nodes are final points where the algorithm stops splitting and provides the outcome of the model
A classification tree is used when the target variable is binary
Regression trees are used when the target variable is continuous.
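The root-node idea can be sketched as a single best split: try candidate thresholds on one feature and keep the one that classifies the training data best. This is a toy version under strong simplifications (one feature, accuracy instead of Gini impurity, no recursion into decision nodes), with invented data:

```python
# Toy classification-tree root split: find the threshold on one feature
# that best separates the "low" class from the "high" class.
def best_split(points):
    best = None
    xs = sorted(x for x, _ in points)
    # Try thresholds halfway between consecutive feature values.
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        # Rule: predict "low" below the threshold, "high" above it.
        correct = sum(1 for x, y in points if (y == "low") == (x < t))
        best = max(best or (0, t), (correct, t))
    return best  # (number classified correctly, threshold)

points = [(1, "low"), (2, "low"), (3, "low"), (8, "high"), (9, "high")]
correct, threshold = best_split(points)
print(correct, threshold)  # 5 5.5
```

A full CART repeats this search at every decision node on the subset of data that reaches it, stopping at the terminal nodes.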
How do you address overfitting in a CART model
You can set limits to the model’s complexity
You can prune sections of the model that have minimal explanatory power