Describe the 2 categories of feature selection methods
Scalar methods -
Consider each feature independently, evaluating its importance and relevance to the task. Do not take into account the relationships or dependencies between features
Vector methods -
Consider the joint distribution or relationships between features
Evaluate features in groups or as a whole, taking into account interactions between them
More computationally intensive than scalar methods, but can lead to better feature selection when features interact in complex ways
Key characteristics of scalar methods
Each feature is assessed on its own merit
They are simple and computationally efficient
Suitable for problems where features are independent or have minimal interactions
Common techniques of scalar methods
Filter methods: Apply a statistical measure to evaluate the relationship between a feature and a target variable
Univariate selection / Statistical tests: features are ranked based on a statistical test and the top ranked ones are selected
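A minimal sketch of a filter method in NumPy (illustrative, not from the source): each feature is scored independently by its absolute Pearson correlation with the target, and the top n are kept.

```python
import numpy as np

def correlation_filter(X, y, n):
    # Filter method: score each feature by |Pearson correlation|
    # with the target, then keep the n top-ranked features.
    scores = np.array([abs(np.corrcoef(X[:, k], y)[0, 1])
                       for k in range(X.shape[1])])
    return np.argsort(scores)[::-1][:n]

rng = np.random.default_rng(0)
y = rng.standard_normal(300)
X = np.column_stack([y + 0.1 * rng.standard_normal(300),   # strongly related
                     rng.standard_normal(300),             # pure noise
                     -y + 0.5 * rng.standard_normal(300)]) # moderately related
sel = correlation_filter(X, y, 2)
print(sel)  # features 0 and 2 outrank the noise feature
```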
Key characteristics of vector methods
Features are considered together in their joint distributions
Can capture correlations and interactions between features
More computationally expensive but more powerful when dealing with correlated features
Define wrapper methods
Vector methods which evaluate subsets of features by training a model on them and measuring the model’s performance
3 examples of wrapper methods
Forward selection: Starts with an empty set and adds the best feature one at a time
Backward elimination: Starts with all features and removes the least useful ones step by step
Recursive Feature Elimination: Recursively removes the least important features based on model performance
Define embedded methods
Vector method that performs feature selection during the model training process
Give 4 examples of embedded methods
Lasso
Decision trees/Random forests
Principal Component Analysis
Independent Component analysis
Describe the lasso method
Adds a penalty on the absolute values of the coefficients (L1 regularisation), which drives some coefficients exactly to zero, effectively removing those features
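A minimal sketch of the lasso idea (an illustrative ISTA/proximal-gradient implementation, not a library API): the soft-threshold step produced by the L1 penalty zeroes the coefficients of uninformative features.

```python
import numpy as np

def lasso_ista(X, y, lam, steps=2000):
    # Minimal lasso via proximal gradient (ISTA): least-squares loss
    # plus lam * ||w||_1; the soft-threshold step zeroes small weights.
    n, d = X.shape
    w = np.zeros(d)
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of the gradient
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n
        z = w - grad / L
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0])   # only features 0 and 3 matter
y = X @ true_w + 0.01 * rng.standard_normal(200)
w = lasso_ista(X, y, lam=0.1)
print(np.nonzero(np.abs(w) > 1e-3)[0])  # features 0 and 3 survive the L1 penalty
```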
Describe the decision tree/random forest method
Feature importance can be derived from tree-based models by measuring how much each feature contributes to the splits (e.g. its total impurity reduction across the trees)
Describe the PCA method
Transforms the feature space into a new set of orthogonal axes, capturing the maximum variance of the data. Not strictly a feature selection method: it reduces the dimensionality by creating a smaller set of uncorrelated features (feature extraction rather than selection)
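A minimal NumPy sketch of PCA via the SVD of the centred data (illustrative names): correlated features are projected onto orthogonal axes, and the retained components are uncorrelated.

```python
import numpy as np

def pca(X, n_components):
    # Project the centred data onto the top principal axes
    # (orthogonal directions of maximum variance) via the SVD.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(2)
# 3 correlated features that really live on a 2-D subspace (plus noise)
Z = rng.standard_normal((500, 2))
X = Z @ rng.standard_normal((2, 3)) + 0.01 * rng.standard_normal((500, 3))
scores = pca(X, 2)
cov = np.cov(scores.T)
print(round(float(cov[0, 1]), 6))  # off-diagonal covariance ~0: components are uncorrelated
```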
Describe the ICA method
Similar to PCA, ICA focuses on separating statistically independent components rather than uncorrelated components
Describe the process of simple scalar feature selection
Choose a 1-dimensional class separability criterion, C (something that evaluates one feature at a time, e.g. divergence)
The value of C(k) is computed for each feature, k
Select the n features corresponding to the n best values of C(k)
Simple to perform but does not consider correlation between features
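The steps above can be sketched in NumPy. As the 1-dimensional criterion C(k) this sketch uses the symmetric KL divergence between the two class-conditional distributions of each feature, assumed Gaussian (one possible choice; the source only says "e.g. divergence").

```python
import numpy as np

def gaussian_divergence(X, y):
    # Scalar criterion C(k): symmetric KL divergence between the two
    # class-conditional distributions of feature k, assumed Gaussian.
    C = []
    for k in range(X.shape[1]):
        a, b = X[y == 0, k], X[y == 1, k]
        va, vb = a.var() + 1e-12, b.var() + 1e-12
        dm = (a.mean() - b.mean()) ** 2
        C.append(0.5 * (va / vb + vb / va - 2) + 0.5 * dm * (1 / va + 1 / vb))
    return np.array(C)

rng = np.random.default_rng(5)
y = np.array([0] * 100 + [1] * 100)
X = np.column_stack([np.where(y == 0, 0.0, 2.0) + rng.standard_normal(200),
                     rng.standard_normal(200),
                     rng.standard_normal(200)])
C = gaussian_divergence(X, y)
n = 1
print(np.argsort(C)[::-1][:n])  # the class-separating feature ranks first
```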
Describe the improved scalar feature selection method
Calculate the value of C(k) for each feature, k as before
Select the feature with the largest C(k)
Calculate C'(k) for each remaining feature
C'(k) = C(k) - |ρ(k, j)|, where ρ(k, j) is the cross-correlation coefficient between feature k and the already-selected feature j
This gives the next best feature that does not correlate with the feature(s) already selected
In general C'(k) has weights, with the correlation term averaged over all features selected so far:
C'(k) = a1C(k) - a2|ρ(k, j)|
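A NumPy sketch of this greedy procedure (illustrative; the correlation penalty is averaged over all already-selected features, one common variant). Features 0 and 1 are near-duplicates here, so the penalty skips 1 in favour of the independent feature 2.

```python
import numpy as np

def improved_scalar_select(C, X, n, a1=1.0, a2=1.0):
    # Greedy scalar selection with a correlation penalty: pick the best
    # C(k) first, then repeatedly pick the feature maximising
    # a1*C(k) - a2 * mean(|rho(k, j)|) over the already-selected j.
    rho = np.abs(np.corrcoef(X.T))
    selected = [int(np.argmax(C))]
    while len(selected) < n:
        remaining = [k for k in range(X.shape[1]) if k not in selected]
        best = max(remaining,
                   key=lambda k: a1 * C[k] - a2 * rho[k, selected].mean())
        selected.append(best)
    return selected

rng = np.random.default_rng(3)
f0, f2 = rng.standard_normal(200), rng.standard_normal(200)
# Features 0 and 1 are near-duplicates; feature 2 is independent
X = np.column_stack([f0, f0 + 0.05 * rng.standard_normal(200), f2])
C = np.array([1.0, 0.95, 0.6])   # assumed precomputed criterion values
print(improved_scalar_select(C, X, 2))  # picks 0, then skips 1 in favour of 2
```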
Describe the Sequential forward selection algorithm
Start with an empty feature vector and progressively add features
At each iteration try adding each feature in turn to see which gives the best n-dimensional separability measure
Repeat until the vector is of the required length
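A sketch of the algorithm in NumPy (illustrative: the joint separability measure here is the R² of a least-squares fit, one possible choice of criterion).

```python
import numpy as np

def r2(Xs, y):
    # Joint criterion: R^2 of a least-squares fit on the candidate subset
    A = np.column_stack([Xs, np.ones(len(y))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return 1 - (y - A @ w).var() / y.var()

def sfs(X, y, n, criterion):
    # Sequential forward selection: grow the set one feature at a time,
    # each step adding the feature that maximises the joint criterion.
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n:
        best = max(remaining, key=lambda k: criterion(X[:, selected + [k]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(4)
X = rng.standard_normal((300, 4))
y = 2 * X[:, 1] - X[:, 3] + 0.1 * rng.standard_normal(300)
print(sfs(X, y, 2, r2))  # features 1 and 3 carry the signal
```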
Describe the Sequential backward selection algorithm
Start by selecting all the features
Evaluate the impact of removing each feature one at a time
Select the feature whose removal has the least impact on the model performance and remove this
Keep selecting and removing until n features are left
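The backward variant, sketched under the same illustrative assumptions (R² of a least-squares fit as the performance measure): at each step the feature whose removal hurts least is dropped.

```python
import numpy as np

def r2(Xs, y):
    # Joint criterion: R^2 of a least-squares fit on the candidate subset
    A = np.column_stack([Xs, np.ones(len(y))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return 1 - (y - A @ w).var() / y.var()

def sbs(X, y, n, criterion):
    # Sequential backward selection: start from all features and
    # repeatedly drop the one whose removal hurts the criterion least.
    selected = list(range(X.shape[1]))
    while len(selected) > n:
        drop = max(selected,
                   key=lambda k: criterion(X[:, [j for j in selected if j != k]], y))
        selected.remove(drop)
    return selected

rng = np.random.default_rng(4)
X = rng.standard_normal((300, 4))
y = 2 * X[:, 1] - X[:, 3] + 0.1 * rng.standard_normal(300)
print(sbs(X, y, 2, r2))  # the uninformative features are removed first
```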
When to use forward/backward selection
Both are greedy and therefore suboptimal
If the desired number of features is close to the total number available, choose backward selection (fewer steps)
If the desired number is closer to 1, choose forward selection