Data are shuffled randomly and then divided into k equal subsamples.
One subsample is held out as the validation sample, and the other k-1 subsamples are used as training samples. The process is repeated k times so that each subsample is used for validation once.
K-fold cross-validation
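A minimal sketch of the splitting step described above, using only the Python standard library. The function name k_fold_splits and the choice of 10 observations and 5 folds are illustrative assumptions, not part of the original notes.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) once per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # shuffle once, up front
    folds = [indices[i::k] for i in range(k)]   # k near-equal subsamples
    for i in range(k):
        validation = folds[i]                   # one subsample held out
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_splits(n_samples=10, k=5))
for train, validation in splits:
    print(sorted(validation), "held out;", len(train), "used for training")
```

Each observation lands in the validation sample exactly once across the k rounds, which is what lets the k validation errors be averaged into one estimate.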
Technique of combining predictions from a number of models, with the objective of canceling out noise
Ensemble Learning
Results in: more accurate & more stable predictions (vs. a single model)
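A rough illustration of why averaging cancels noise, assuming each model's error is independent and centered on zero. The "models" here are simulated (a hypothetical true value plus Gaussian noise), so this is a sketch of the idea rather than a real ensemble.

```python
import random
import statistics

rng = random.Random(42)
TRUE_VALUE = 10.0   # hypothetical quantity every model tries to predict

def noisy_model_prediction():
    # Each simulated model is right on average but carries independent noise.
    return TRUE_VALUE + rng.gauss(0, 2.0)

def ensemble_prediction(n_models):
    # The ensemble simply averages the individual predictions.
    return statistics.mean(noisy_model_prediction() for _ in range(n_models))

single_errors = [abs(noisy_model_prediction() - TRUE_VALUE) for _ in range(2000)]
ensemble_errors = [abs(ensemble_prediction(25) - TRUE_VALUE) for _ in range(2000)]

print("mean abs error, single model:", round(statistics.mean(single_errors), 3))
print("mean abs error, 25-model ensemble:", round(statistics.mean(ensemble_errors), 3))
```

If the models' errors were strongly correlated, the benefit would be much smaller; independence is the assumption doing the work here.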
Neural Networks
Neural networks with many hidden layers (often >20); like reinforcement learning algorithms, they learn from their own prediction errors
Used for: complex tasks; image, pattern, & character recognition
Deep Learning Networks
Reinforcement Learning
Inputs & outputs are identified for the computer, and the algorithm uses this labeled training data to model relationships
Supervised Learning
Computer is provided unlabeled data that the algorithm uses to determine the structure of the data
Unsupervised Learning
Least Absolute Shrinkage and Selection Operator (LASSO) is useful in building:
Penalized regression model
Parsimonious models, through feature reduction
K-Nearest Neighbor (KNN) investment applications include:
Used in: classification & regression
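A small sketch of how one KNN routine can serve both classification (majority vote among the k nearest neighbors) and regression (average of their targets). The data, feature meanings, and function name knn_predict are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3, task="classification"):
    """train: list of (features, target) pairs; query: a feature tuple."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    targets = [target for _, target in nearest]
    if task == "classification":
        return Counter(targets).most_common(1)[0][0]   # majority vote
    return sum(targets) / len(targets)                  # regression: average

# Hypothetical labeled data: (debt ratio, interest coverage) -> rating bucket
labeled = [((0.20, 8.0), "investment grade"), ((0.30, 6.5), "investment grade"),
           ((0.25, 7.0), "investment grade"), ((0.70, 1.5), "high yield"),
           ((0.80, 1.0), "high yield"), ((0.75, 2.0), "high yield")]
predicted_class = knn_predict(labeled, (0.72, 1.8))
print(predicted_class)

# Hypothetical numeric data for the regression case
numeric = [((1.0,), 2.0), ((2.0,), 4.0), ((3.0,), 6.0), ((10.0,), 20.0)]
predicted_value = knn_predict(numeric, (2.1,), k=3, task="regression")
print(predicted_value)
```

In practice the features would usually be standardized first, since KNN distances are sensitive to the scale of each input.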
Random Forest investment applications include:
Linear relationships
A penalized regression model tries to use a limited number of the most important features that…
explain the variation in the dependent variable
Example: monthly returns on 100 stocks
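A minimal sketch of LASSO-style feature reduction, assuming centered data with no intercept and a coordinate-descent solver with soft-thresholding. The data are simulated (5 hypothetical features, only 2 of which matter), and the penalty value 0.2 is an arbitrary choice for illustration.

```python
import random

def soft_threshold(value, penalty):
    # Shrinks value toward zero; anything within the penalty becomes exactly 0.
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso_coordinate_descent(X, y, penalty, n_iter=50):
    # Minimizes (1/2n) * sum of squared residuals + penalty * sum of |coefficients|.
    n, p = len(X), len(X[0])
    coefs = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0
            for row, target in zip(X, y):
                pred = sum(c * x for c, x in zip(coefs, row))
                # Residual with feature j's own contribution added back
                rho += row[j] * (target - pred + coefs[j] * row[j])
            rho /= n
            z = sum(row[j] ** 2 for row in X) / n
            coefs[j] = soft_threshold(rho, penalty) / z
    return coefs

rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.1) for row in X]
coefs = lasso_coordinate_descent(X, y, penalty=0.2)
print([round(c, 2) for c in coefs])
```

The irrelevant features end up with coefficients of exactly zero, which is the feature-reduction behavior that makes the resulting model parsimonious; the retained coefficients are shrunk somewhat toward zero as the price of the penalty.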
Overfitting occurs when:
Bias error:
Variance error:
when the model fits the training data too well
Bias error: low
Variance error: high
displaying non-linear characteristics
Generalization is the degree to which the model retains its explanatory power when:
predicting out of sample
Bias error is the degree to which:
the model fits the training data
Variance error shows how much the model responds to:
new data
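A hedged illustration of low bias error paired with high variance error, using simulated data. The "memorizer" (a 1-nearest-neighbor lookup) fits the training data perfectly, while a straight line fits it less closely; the comparison of their in-sample and out-of-sample errors is the point. All names and parameters are assumptions made for this sketch.

```python
import random

rng = random.Random(7)

def make_data(n):
    # Hypothetical process: a linear signal plus noise
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0, 3.0)) for x in xs]

train, test = make_data(50), make_data(500)

def memorizer(x):
    # 1-nearest neighbor: returns the target of the closest training point
    return min(train, key=lambda p: abs(p[0] - x))[1]

def fit_line(data):
    # Ordinary least squares with one feature
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x, _ in data))
    return lambda x: my + slope * (x - mx)

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

line = fit_line(train)
for name, model in [("memorizer", memorizer), ("straight line", line)]:
    print(name, "| in-sample MSE:", round(mse(model, train), 2),
          "| out-of-sample MSE:", round(mse(model, test), 2))
```

The memorizer has fitted the noise as well as the signal, so its in-sample error is zero but its error on new data is much larger, which is the overfitting pattern described above.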
How to prevent overfitting:
Complexity Reduction:
Dimension Reduction
Use: PCA
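A minimal sketch of the idea behind PCA, assuming power iteration on the sample covariance matrix to find only the first principal component. The three simulated features are driven by one hypothetical common factor, so a single component should capture most of the variation; a real application would typically use a numerical library instead.

```python
import math
import random

def first_principal_component(rows, n_iter=100):
    """Power iteration on the sample covariance matrix of centered data."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    vec = [1.0] * p
    for _ in range(n_iter):
        vec = [sum(cov[a][b] * vec[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    # Project each observation onto the component: 3 features become 1 score
    scores = [sum(c * v for c, v in zip(r, vec)) for r in centered]
    return vec, scores

rng = random.Random(3)
rows = []
for _ in range(300):
    factor = rng.gauss(0, 1)   # hypothetical common driver
    rows.append([factor + rng.gauss(0, 0.1),
                 2 * factor + rng.gauss(0, 0.1),
                 -factor + rng.gauss(0, 0.1)])

direction, scores = first_principal_component(rows)
print([round(v, 2) for v in direction])
```

Replacing the three correlated features with the single score per observation is the dimension reduction: fewer inputs, so less room for a downstream model to overfit.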
With supervised learning, the training data contains:
ground truth
Supervised ML algorithm
Classification focuses on sorting observations into:
distinct categories:
* pass or fail
Regression-based models use:
continuous variables
Regression:
CART & Random forests are used for:
complex & non-linear
Classified unsupervised data:
K-means is used for:
complex & linear data
with a known number of k clusters
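A minimal k-means sketch, assuming the number of clusters k is supplied by the user as the card above describes. The two simulated blobs of points and the function names are hypothetical, and starting centroids are simply k randomly chosen observations.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, n_iter=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # start from k random observations
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k),
                          key=lambda c: squared_distance(point, centroids[c]))
            clusters[nearest].append(point)
        # Update step: each centroid moves to the mean of its cluster
        centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

data_rng = random.Random(11)
points = ([(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(20)] +
          [(data_rng.gauss(10, 0.5), data_rng.gauss(10, 0.5)) for _ in range(20)])
centroids, clusters = k_means(points, k=2)
print(sorted(len(c) for c in clusters))
print(sorted(tuple(round(v, 1) for v in c) for c in centroids))
```

No labels are used anywhere: the algorithm only sees the points and the chosen k, which is why it belongs with the unsupervised methods.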