What are the stages of the full machine learning pipeline?
Data Collection -> Data Cleaning -> Feature Engineering -> Model Selection -> Hyperparameter Tuning -> Model Training -> Model Evaluation -> Deployment -> Monitoring
What is the distinction between feature engineering and data cleaning?
Data cleaning is “fixing” the data so that it is correct and consistent.
E.g. removing duplicates, handling missing values, fixing inconsistent formatting, correcting data types, removing outliers.
Feature engineering is adding “signal” to the data: transforming it to create better predictors.
E.g. encoding categorical variables, creating aggregate features, extracting features from text/images, normalization.
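A minimal sketch of the distinction, assuming a tiny made-up dataset (the field names `city` and `age`, the mean imputation, and the `age_decades` feature are all illustrative choices, not a prescribed recipe). `clean` only repairs what is already there; `engineer` derives new predictor columns.

```python
# Hypothetical raw records; field names and values are made up for illustration.
raw_rows = [
    {"city": "Paris ", "age": "34"},
    {"city": "paris", "age": None},
    {"city": "Berlin", "age": "51"},
    {"city": "Berlin", "age": "51"},  # exact duplicate
]

def clean(rows):
    """Data cleaning: fix formatting and types, drop duplicates, fill missing values."""
    seen = set()
    cleaned = []
    for row in rows:
        city = row["city"].strip().lower()  # fix inconsistent formatting
        age = int(row["age"]) if row["age"] is not None else None  # correct the data type
        key = (city, age)
        if key not in seen:  # remove duplicates
            seen.add(key)
            cleaned.append({"city": city, "age": age})
    # Handle missing values with mean imputation (one option among several).
    known_ages = [r["age"] for r in cleaned if r["age"] is not None]
    mean_age = sum(known_ages) / len(known_ages)
    for r in cleaned:
        if r["age"] is None:
            r["age"] = mean_age
    return cleaned

def engineer(rows):
    """Feature engineering: one-hot encode the category and add a derived feature."""
    cities = sorted({r["city"] for r in rows})
    features = []
    for r in rows:
        vec = {f"city_{c}": int(r["city"] == c) for c in cities}  # encode categorical variable
        vec["age_decades"] = r["age"] / 10  # simple derived feature
        features.append(vec)
    return features

cleaned = clean(raw_rows)
features = engineer(cleaned)
```

In a real project the imputation statistic would be computed on the training split only.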
What are the ways to handle missing data? When should you use each?
What is data leakage? What are the 2 types?
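When information from outside the training dataset (information that would not be available at prediction time) is accidentally used to train the model, which makes evaluation scores look better than they really are.
The two commonly cited types are target leakage (a feature encodes the label) and train-test contamination (test or validation data influences training or preprocessing).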
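One common way leakage creeps in is computing preprocessing statistics on the full dataset before splitting. A small sketch, assuming a made-up feature column and a simple positional split (both are illustrative assumptions):

```python
from statistics import mean, pstdev

# Hypothetical feature column; the last value plays the role of the held-out test set.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = values[:4], values[4:]

# Leaky: the scaling statistics are computed on train + test together,
# so information about the test point influences how training data is transformed.
leaky_mu, leaky_sigma = mean(values), pstdev(values)

# Safe: fit the statistics on the training split only, then apply them to both splits.
mu, sigma = mean(train), pstdev(train)
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]
```

The two means differ noticeably here because the test point is an extreme value, which is exactly the information the model should not see during training.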
How do we know data leakage is occurring?
How do we mitigate data leakage?
How can we ensure that data leakage will not occur?
Why are imbalanced datasets an issue?
How can we deal with imbalanced datasets?
What kinds of metrics would you use for imbalanced datasets?
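Recall, F1, and PR-AUC, since these are imbalance-aware: they focus on the positive (usually minority) class instead of being dominated by the majority class the way accuracy is.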
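A quick sketch of why accuracy alone is misleading here, assuming a made-up dataset with 95 negatives and 5 positives and a classifier that always predicts the majority class:

```python
# Hypothetical labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A useless classifier that always predicts the majority class.
y_pred = [0] * 100

pairs = list(zip(y_true, y_pred))
accuracy = sum(t == p for t, p in pairs) / len(pairs)

true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
recall = true_pos / (true_pos + false_neg)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy looks high even though the classifier never finds a single positive, while recall exposes the problem immediately.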
What are the methods of hyperparameter tuning?
Grid search: systematically tries every combination of hyperparameter values in a predefined grid.
Random search: randomly samples hyperparameter combinations from specified ranges or distributions.
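The two methods above can be sketched in plain Python. This is an illustration only: `validation_loss` is a hypothetical stand-in for "train a model and measure its validation loss", and the grid values, sampling ranges, and trial count are assumptions.

```python
import itertools
import random

def validation_loss(learning_rate, batch_size):
    """Hypothetical stand-in for training a model and returning its validation loss."""
    return (learning_rate - 0.01) ** 2 + (batch_size - 64) ** 2 * 1e-6

def grid_search(grid):
    """Evaluate every combination of the listed values and keep the best one."""
    names = list(grid)
    best = None
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        loss = validation_loss(**params)
        if best is None or loss < best[0]:
            best = (loss, params)
    return best

def random_search(n_trials, seed=0):
    """Sample n_trials random combinations and keep the best one."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform between 1e-4 and 1e-1
            "batch_size": rng.choice([16, 32, 64, 128, 256]),
        }
        loss = validation_loss(**params)
        if best is None or loss < best[0]:
            best = (loss, params)
    return best

grid = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}
grid_loss, grid_params = grid_search(grid)   # 3 x 3 = 9 evaluations
rand_loss, rand_params = random_search(20)   # 20 evaluations, any value in range
```

Grid search cost grows multiplicatively with each added hyperparameter, while random search keeps a fixed budget and can land on values between the grid points.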
What are the key hyperparameters used in deep learning algorithms for optimization?
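Learning rate - step size of each weight update, often denoted by alpha (or eta).
Optimizer - algorithm used to minimize the loss, e.g. Adam, SGD, RMSprop.
Momentum - carries over a fraction of the previous update to smooth and accelerate updates.
Weight decay - regularization term that penalizes large weights (typically L2; L1 is a related alternative).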
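A rough sketch of how these hyperparameters interact in a single update step, assuming classic SGD with momentum and weight decay applied as an L2 term added to the gradient (other optimizers, and decoupled weight decay, behave differently). The one-dimensional loss `(w - 3)^2` and all default values are made up for illustration.

```python
def sgd_momentum_step(w, velocity, grad, learning_rate=0.1, momentum=0.9, weight_decay=0.0):
    """One SGD update with momentum and L2-style weight decay."""
    grad = grad + weight_decay * w          # weight decay: extra pull of the weight toward zero
    velocity = momentum * velocity + grad   # momentum: running accumulation of past gradients
    w = w - learning_rate * velocity        # learning rate: size of the step actually taken
    return w, velocity

def minimize(weight_decay, steps=300):
    """Minimize the toy loss (w - 3)^2, whose gradient is 2 * (w - 3)."""
    w, velocity = 0.0, 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w, velocity = sgd_momentum_step(w, velocity, grad, weight_decay=weight_decay)
    return w

w_plain = minimize(weight_decay=0.0)  # settles near the unregularized minimum, w = 3
w_decay = minimize(weight_decay=0.5)  # settles near w = 2.4, shrunk toward zero
```

With weight decay 0.5 the effective gradient is `2 * (w - 3) + 0.5 * w`, which is zero at `w = 2.4`, so the regularized solution is smaller in magnitude than the unregularized one.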
What are the key hyperparameters used in deep learning algorithms for architecture?
What is precision?
Of all positive predictions, how many were correct?
Precision = True Positives / (True Positives + False Positives)
What is recall?
Of all actual positives, how many were identified?
Recall = True Positives / (True Positives + False Negatives)
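Both definitions can be checked with a small sketch. The labels below are made up, class 1 is treated as the positive class, and returning 0.0 when the denominator is zero is an assumed convention (libraries differ on this).

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, fn

def precision(y_true, y_pred):
    """Of all positive predictions, the fraction that were correct."""
    tp, fp, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0  # assumed convention for no positive predictions

def recall(y_true, y_pred):
    """Of all actual positives, the fraction that were identified."""
    tp, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0  # assumed convention for no actual positives

# Hypothetical labels: 4 actual positives, 3 positive predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1]

p = precision(y_true, y_pred)  # 2 TP / (2 TP + 1 FP)
r = recall(y_true, y_pred)     # 2 TP / (2 TP + 2 FN)
```

Here the model is right about two of its three positive predictions but finds only two of the four actual positives, so precision is higher than recall.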
How do we mitigate overfitting?
How do we mitigate underfitting?