Why dimension reduction?
- Curse of dimensionality
How dimension reduction?
1 Feature selection
2 Feature extraction:
Linear:
Non-linear:
How: Feature selection
How: Feature extraction
Linear:
Non-linear:
Why visualization?
Curse of dimensionality
https://www.youtube.com/watch?v=8CpRLplmdqE
In short:
- # of training examples needed to generalize accurately grows exponentially with # of dimensions
If we want N training items per unit of “feature space”,
then each additional dimension multiplies the required # of training samples by the # of units per axis.
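A toy sketch of that growth (the 10 units per axis are an assumption for illustration):

```python
# Samples needed to keep N training items per "cell" of feature space,
# assuming each feature axis is split into `bins` units.
def samples_needed(n_per_cell, bins, n_dims):
    # each extra dimension multiplies the number of cells by `bins`
    return n_per_cell * bins ** n_dims

for d in (1, 2, 3, 5):
    print(d, samples_needed(10, 10, d))
```

With 10 items per cell and 10 units per axis, 1 dimension needs 100 samples but 5 dimensions already need 1,000,000.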
Support Vector Machine
SVMs can make more complex decision boundaries by using kernels (see next card for more kernel functions, soft margin formulation explained here)
In short:
https://www.youtube.com/watch?v=efR1C6CvhmE
https://www.r-bloggers.com/support-vector-machines-with-the-mlr-package/
For linearly-separable data
The SVM algorithm finds an optimal linear hyperplane that separates the classes. For a two-dimensional feature space, such as in the example in figure 1, a hyperplane is simply a straight line. For a three-dimensional feature space, a hyperplane is a surface. The principle is the same: they are surfaces that cut through the feature space.
The margin is the shortest distance between the threshold and the observations. Linearly separable data can use a (1) Maximum margin classifier or (2) Soft margin classifier.
(2) We might not want to choose a decision boundary that perfectly separates the data to avoid overfitting. Allow SVM to make a certain number of mistakes and keep margin as wide as possible so that other points can still be classified correctly.
So a soft margin allows misclassification.
We use cross validation to determine how many misclassifications and observations to allow inside the soft margin to get the best classification.
When we use a soft margin to determine the location of a threshold, we use a soft margin classifier.
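A minimal sketch of that cross-validation step using scikit-learn, where the penalty parameter `C` controls how many margin violations are tolerated (the candidate `C` values are assumptions for illustration):

```python
# Tune the soft-margin penalty C by cross-validation:
# small C -> wide soft margin (more misclassifications allowed),
# large C -> narrow margin (fewer misclassifications allowed).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X, y)
best_C = search.best_params_["C"]            # the C with the best CV score
```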
Kernel functions
A kernel computes relationships between instances as if they were mapped into a higher-dimensional space, so that classes that are not linearly separable can become linearly separable.
So, SVMs can fit linear decision boundaries for non-linearly separable problems by using the kernel trick.
Non-linearly separable data
The algorithm learns linear hyperplanes, which seems like a contradiction. Here’s what makes the SVM algorithm so powerful: it can add an extra dimension to your data to find a linear way to separate non-linear data.
The SVM algorithm adds an extra dimension to the data, such that a linear hyperplane can separate the classes in this new, higher dimensional space.
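A tiny sketch of that "extra dimension" idea (the data points and the x² mapping are assumptions for illustration; this is the explicit mapping the kernel trick avoids computing):

```python
# 1-D points with class 1 sandwiched between class 0 are not linearly
# separable on the line, but adding x**2 as a second dimension makes them
# separable by the horizontal line x2 = 1.
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([0, 0, 1, 1, 1, 0, 0])          # class 1 sits in the middle

lifted = np.column_stack([x, x ** 2])        # map x -> (x, x^2)
# in the lifted space every class-1 point has x^2 < 1 and every class-0
# point has x^2 > 1, so a linear hyperplane now separates the classes
assert all(lifted[y == 1, 1] < 1) and all(lifted[y == 0, 1] > 1)
```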
Linear kernel functions
==> Soft margins or Maximum margins
Polynomial kernel function
==> Has a parameter, d, that stands for the degree of the polynomial. E.g. when d is 1 it computes the relationship between each pair of observations in 1 dimension; when d is 2 it computes the 2-dimensional relationship between each pair of observations.
Radial Basis Function
==> behaves as a weighted nearest neighbour. Observations that are close have much influence on the classification, and the ones that are far away have very little influence.
Radial Basis Function Kernels
Two hyperparameters
C -> proportional to the misclassification penalty
Gamma -> range of influence in feature space
Sigma -> 1/Gamma
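A minimal scikit-learn sketch of the two hyperparameters on a non-linear toy dataset (the `C` and `gamma` values are arbitrary assumptions; in practice both are tuned by cross-validation):

```python
# RBF kernel SVM: C scales the misclassification penalty,
# gamma sets the range of influence of each observation in feature space
# (large gamma -> only close neighbours influence the classification).
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
train_acc = clf.score(X, y)
```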
Feature selection
https://www.youtube.com/watch?v=sOWqWmJhQ8A
https://www.youtube.com/watch?v=rA_WFUf2-YM
All the action in feature selection is in how to decide which dimensions to remove.
3 strategies
The filter strategy (feature selection)
In short:
Filters features before handing them to a learning algorithm.
The criterion for knowing whether you did well is buried inside the search algorithm itself.
Perhaps the most basic method of eliminating features
• It considers each feature dimension separately
• Does not depend on the type of model you are using
Example criteria: correlation, mutual information, various significance tests, information gain (as in decision trees), entropy.
Criteria should be relatively fast to compute.
Advantages:
• Fast
• Independent of the model
Disadvantages:
• Only considers dimensions independently
• Does not actually evaluate the importance of features for a model. There is no feedback: the learning algorithm cannot inform the search algorithm (e.g. the search algorithm may remove a feature without which the learning algorithm performs poorly, with no way to communicate that)
• Independent of the model
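A minimal sketch of the filter strategy in scikit-learn, scoring each feature separately with mutual information (one of the example criteria above) and keeping the top k, with no model in the loop:

```python
# Filter strategy: score each feature dimension independently, keep the
# k highest-scoring ones, regardless of the downstream model.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
selector = SelectKBest(mutual_info_classif, k=3).fit(X, y)
X_filtered = selector.transform(X)           # only the 3 top-scoring columns
```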
The wrapper strategy (feature selection)
In short:
The search for the features is wrapped around whatever your learning algorithm is
The wrapper strategy (feature selection)
Advantage & Disadvantages
Advantages:
• Actually makes decisions about features based on performance (accuracy, precision, recall, etc.)
• Considers all possible sets of features
• Evaluates the features for this model, not in general
Disadvantages:
• This process can be really slow
• The set of possible feature dimensions to include could be really, really large
• With N dimensions, the number of possible sets is 2^N
• The set of features is completely dependent on the model
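A minimal sketch of the wrapper strategy in scikit-learn: a greedy forward search wrapped around a concrete model, scored by cross-validated performance (the logistic regression model is an assumption; any estimator could be wrapped):

```python
# Wrapper strategy: the search over feature subsets is wrapped around the
# learning algorithm and driven by its cross-validated performance.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
model = LogisticRegression(max_iter=1000)
# greedy forward search: far cheaper than trying all 2^N subsets,
# but still slow compared to a filter, and tied to this model
sfs = SequentialFeatureSelector(model, n_features_to_select=3, cv=3).fit(X, y)
X_selected = sfs.transform(X)
```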
The embedded strategy (feature selection)
• Embed the filtering strategy within the wrapper strategy
• These are special cases of wrapper strategies where
part of fitting the model involves filtering out some
feature dimensions
– Canonical example is LASSO regression, where as part of the iterative estimation of the parameter weights (a wrapper strategy), the regression weights for many features are set to 0 (a filtering strategy)
The embedded strategy (feature selection)
Advantage & Disadvantages
Benefits:
• Gets feedback from the model (like a wrapper) while filtering features as part of fitting (cheaper than a full wrapper search)
Issues:
• Can be slow relative to filtering strategies
• The selected features are model dependent
• Only reduces dimensions, can’t find any new ones!
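A minimal sketch of the LASSO example in scikit-learn (the data and `alpha=0.1` are assumptions for illustration): fitting the model drives the weights of uninformative features to exactly 0, so selection happens inside the fit.

```python
# Embedded strategy: LASSO sets many regression weights to exactly 0
# as part of fitting, filtering those features out.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# only features 0 and 1 actually drive the target; the rest are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
kept = np.flatnonzero(lasso.coef_ != 0)      # indices of surviving features
```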
Feature Extraction
Feature Extraction is building a set of new and better (and fewer) features
Why (feature extraction)?
Maybe none of the existing features is a particularly
good feature
• Let’s build some new ones!
• You’ve already partially done this when normalizing
the existing features for regression and classification:
– Subtracting the overall mean
– Dividing by the overall variance
PCA (feature extraction technique)
NOTE
PCA itself only recodes the dimensions; dimensions are reduced when you afterwards drop the least important components.
The most common (linear) technique is Principal Component Analysis (PCA).
It can be used to identify patterns in highly complex datasets, and it can tell you which variables in your data are the most important
PCA is a way of projecting a high dimensional space into a lower space that has nice properties.
PCA finds the best fitting line by MAXIMIZING the sum of the squared distances from the projected points to the origin. Looking for dimensions of greatest variance.
The steps of PCA are:
1. Center the data (subtract the mean of each feature)
2. Compute the covariance matrix
3. Find its eigenvectors (the principal components) and eigenvalues (the variance along each component)
4. Project the data onto the top components
PCA: Recoding into new features
Assuming we have N original features,
these N original features can be recombined into N new features without losing any information, where the new features are uncorrelated and ordered by the amount of variance they capture.
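A minimal scikit-learn sketch of that "no information lost" claim (the random 4-feature data is an assumption): keeping all N components, PCA is just a rotation, so the original data can be reconstructed exactly.

```python
# Recode N = 4 original features into 4 new PCA features and back:
# with all components kept, no information is lost.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

pca = PCA(n_components=4)
Z = pca.fit_transform(X)                     # scores on the principal components
X_back = pca.inverse_transform(Z)            # undo the rotation
assert np.allclose(X, X_back)                # original data recovered
```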
PCA: in higher dimensions
https://youtu.be/FgakZw6K1QQ
• For datasets with N features (N>2), PCA rotates the coordinate system in such a way that the first axis points in the direction of greatest variance, and each subsequent axis captures the greatest remaining variance while being orthogonal to the previous axes.
PCA in behavioural sciences
In data mining we determine the optimal number of components empirically!
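A sketch of determining the number of components empirically via explained variance (the 90% cutoff and the toy low-rank data are assumptions for illustration):

```python
# Pick the smallest number of components that explains 90% of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
base = rng.normal(size=(200, 3))
# 6 columns, but the last 3 nearly duplicate the first 3 -> intrinsic rank ~3
X = np.hstack([base, base + 0.01 * rng.normal(size=(200, 3))])

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)        # cumulative variance
n_components = int(np.searchsorted(cumvar, 0.90) + 1)    # first index past 90%
```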
Random Decision Forests
• The complexity of RDFs is determined by the number of trees (and their depths)
• In some decision forests, trees are induced on the same complete set of features
• In random decision forests, trees are induced on randomly selected subsets of features
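A minimal scikit-learn sketch of those knobs (note scikit-learn's implementation randomizes the feature subset per split rather than per tree; the parameter values are assumptions for illustration):

```python
# Random forest: n_estimators / max_depth set the number of trees and their
# depth (the model's complexity), max_features the size of the random
# feature subset considered at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=12, random_state=0)
forest = RandomForestClassifier(n_estimators=50, max_depth=4,
                                max_features="sqrt", random_state=0).fit(X, y)
acc = forest.score(X, y)
```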