What are some naive feature engineering techniques that improve model efficiency?
What are three methods for scaling your data?
Explain one major drawback to each of the scaling methods.
When should you scale your data and why?
When your algorithm weights each input (e.g., gradient descent used by many neural networks) or relies on distance metrics (e.g., KNN), model performance can often be improved by normalizing, standardizing, or otherwise scaling your data so that each feature is given relatively equal weight.
Scaling is also important when features are measured in different units. For example, if feature A is measured in inches, feature B in feet, and feature C in dollars, scaling these features ensures they are weighted and/or represented equally.
In some cases, model efficacy will not change, but perceived feature importance may, e.g. the coefficients in a linear regression.
Note: scaling your data typically does not change performance or feature importance for tree-based models, since the split points will simply shift to compensate for the scaled data.
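As a minimal sketch (using NumPy; function names are illustrative, not from any particular library), min-max normalization and z-score standardization, two of the most common scaling methods, can be implemented as:

```python
import numpy as np

def min_max_scale(x):
    # Rescale each column (feature) to the [0, 1] range.
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def standardize(x):
    # Rescale each column to zero mean and unit variance (z-scores).
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Two features measured on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
print(min_max_scale(X))  # each column now spans 0..1
print(standardize(X))    # each column now has mean 0, std 1
```

In practice, libraries such as scikit-learn provide equivalents (e.g. `MinMaxScaler`, `StandardScaler`) that also handle fitting on training data and transforming test data consistently.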
Describe basic feature encoding for categorical variables.
Feature encoding involves replacing the classes in a categorical variable with new values such as integers or real values; e.g., ['red', 'blue', 'green'] could be encoded as [8, 5, 11].
When should you encode your features and why?
You should encode your categorical features so that they can be processed by algorithms that require numeric input, i.e. so that ML algorithms can learn from them.
What are three encoding methods for categorical features?