Why use clustering?
Clustering in the CRISP-DM cycle
Mostly:
• Data understanding - exploration
• Data preparation - preprocessing, reducing dimensionality
• Modelling
How to interpret clusters?
Characteristically.
• Interpreting a cluster by looking at its members
OR
• Interpreting clusters by looking at a typical cluster member or typical characteristic(s).
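A minimal sketch of the characteristic view, assuming numeric variables: each cluster is summarised by its centroid (typical characteristics) and its medoid (a typical actual member). The toy customer data, the variable meanings and the helper names `centroid` and `medoid` are hypothetical and serve only as illustration.

```python
from statistics import mean

def centroid(points):
    """Mean of each variable across the cluster's members."""
    return [mean(col) for col in zip(*points)]

def medoid(points):
    """Actual member with the smallest total squared distance to all members."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(points, key=lambda p: sum(sq_dist(p, q) for q in points))

# Hypothetical toy data: (age, monthly_spend) per customer, plus cluster IDs.
data = [(25, 40), (27, 50), (29, 45), (60, 200), (62, 220), (70, 210)]
labels = [0, 0, 0, 1, 1, 1]

# Group the members of each cluster.
clusters = {}
for point, label in zip(data, labels):
    clusters.setdefault(label, []).append(point)

# One profile per cluster: typical characteristics and a typical member.
profiles = {c: {"centroid": centroid(m), "medoid": medoid(m)}
            for c, m in clusters.items()}
for c, profile in sorted(profiles.items()):
    print(c, profile)
```

The centroid need not correspond to any real data point, while the medoid always does, which can make it easier to present as "a typical member".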
Differentially.
• What differentiates Cluster X from Cluster Y?
• Supervised learning approach.
‣ Each data point has a new label - cluster ID.
‣ Predictive modelling with cluster ID as a target variable.
How would the supervised learning approach work?
• Set up a classification task: 1) a k-class task or 2) binary classification.
• Ensure intelligibility - be able to get the classifier definition.
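A minimal sketch of the binary variant, assuming a one-cluster-versus-rest setup and a deliberately simple classifier: a one-rule decision stump whose definition can be read directly. The data, the variable names and the helper `best_stump` are hypothetical; any intelligible classifier (for example a shallow decision tree) could take its place.

```python
def best_stump(rows, is_target, names):
    """Search every variable and threshold for the single rule
    'variable <= t' or 'variable > t' that best separates the target
    cluster from the rest. Returns (accuracy, rule_text)."""
    best = (0.0, "")
    n = len(rows)
    for j, name in enumerate(names):
        values = sorted({r[j] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            # Number of rows where 'value <= t' agrees with the target flag.
            hits = sum((r[j] <= t) == y for r, y in zip(rows, is_target))
            for acc, op in ((hits / n, "<="), ((n - hits) / n, ">")):
                if acc > best[0]:
                    best = (acc, f"{name} {op} {t}")
    return best

# Hypothetical toy data with cluster IDs used as the new label.
names = ["age", "monthly_spend"]
rows = [(25, 40), (27, 50), (29, 45), (60, 200),
        (62, 220), (70, 210), (40, 400), (45, 420)]
cluster_id = [0, 0, 0, 1, 1, 1, 2, 2]

# One binary task per cluster: "this cluster" versus "all others".
rules = {}
for c in sorted(set(cluster_id)):
    is_target = [label == c for label in cluster_id]
    rules[c] = best_stump(rows, is_target, names)
    print(f"cluster {c}: {rules[c][1]} (accuracy {rules[c][0]:.2f})")
```

Each printed rule is the classifier definition itself, so it doubles as a description of what sets that cluster apart. Accuracy is measured on the same data used to find the rule, which is acceptable here only because the goal is description rather than prediction.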
Cluster validity
How good is a given clustering?
Internal criteria
• Can be used to specify the optimal number of clusters.
• Indices are often (not always) method-dependent.
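One widely used internal index is the silhouette coefficient, which compares how close each point is to its own cluster with how close it is to the nearest other cluster. The sketch below assumes Euclidean distance and at least two clusters; the points and the two candidate labelings (as if produced by runs with k = 2 and k = 3) are hypothetical.

```python
from math import dist
from statistics import mean

def silhouette(points, labels):
    """Mean silhouette coefficient over all points (between -1 and 1).
    Assumes at least two clusters; singleton members score 0 by convention."""
    scores = []
    for i, (p, c) in enumerate(zip(points, labels)):
        # Distances from p to every other point, grouped by cluster.
        by_cluster = {}
        for j, (q, d) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(d, []).append(dist(p, q))
        if c not in by_cluster:  # p is the only member of its cluster
            scores.append(0.0)
            continue
        a = mean(by_cluster[c])  # cohesion: mean distance within own cluster
        b = min(mean(v) for k, v in by_cluster.items() if k != c)  # separation
        scores.append((b - a) / max(a, b))
    return mean(scores)

# Hypothetical toy data with three visually obvious groups.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
labels_k2 = [0, 0, 0, 1, 1, 1, 1, 1, 1]  # two groups merged into one
labels_k3 = [0, 0, 0, 1, 1, 1, 2, 2, 2]

s2 = silhouette(points, labels_k2)
s3 = silhouette(points, labels_k3)
print(f"k=2: {s2:.3f}   k=3: {s3:.3f}")
```

Computing the index for several values of k and preferring the highest score is one way to use an internal criterion to choose the number of clusters.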
External criteria
• Compare the created clusters to some external reference:
‣ Expert opinion, existing theory.
‣ External variables/groupings.
‣ Labels generated by a different clustering method.
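A minimal sketch of an external comparison, using the (unadjusted) Rand index as one possible agreement measure: the share of point pairs that both groupings treat the same way. The cluster labels and the expert grouping below are hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Share of point pairs on which two groupings agree: both put
    the pair in the same group, or both keep it apart."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical example: cluster IDs versus an expert's grouping.
clusters = [0, 0, 0, 1, 1, 1]
expert = ["young", "young", "young", "senior", "senior", "young"]

ri = rand_index(clusters, expert)
print(f"Rand index: {ri:.3f}")

# Label names do not matter, only who is grouped with whom.
same_partition = rand_index(clusters, [5, 5, 5, 9, 9, 9])
```

Because the measure works on pairs, the reference does not need to use the same label names or even the same number of groups. A chance-corrected variant, the adjusted Rand index, is often preferred in practice.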
Steps in a typical cluster analysis:
A. Collect the data to use for clustering. Preprocess if needed.
B. Select the variables.
C. Select a distance measure.
D. Select a clustering method.
E. Experiment with different sets of variables/measures/methods.
F. Determine the validity of the selected solution.
G. Interpret the results.
Experiment! It is difficult to anticipate which combinations of variables, similarity measures, and clustering methods will lead to interesting results.
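Steps A to E can be sketched in a few lines, assuming numeric data, z-score standardisation as the preprocessing, Euclidean distance and a bare-bones k-means. The data, the third "noise" variable and the helpers `standardize`, `select` and `kmeans` are hypothetical; validity checks and interpretation (steps F and G) are not covered here.

```python
from math import dist
from statistics import mean, pstdev

def standardize(rows):
    """Step A: rescale every variable to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [tuple((x - m) / s for x, m, s in zip(r, mus, sds)) for r in rows]

def select(rows, variables):
    """Step B: keep only the chosen variables (column indices)."""
    return [tuple(r[i] for i in variables) for r in rows]

def kmeans(points, k, distance=dist, iterations=20):
    """Steps C-D: bare-bones k-means. Initial centres are simply the first
    k points (a simplification). Centres are updated as means, which
    properly matches Euclidean distance only."""
    centres = list(points[:k])
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: distance(p, centres[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centres[c] = tuple(mean(col) for col in zip(*members))
    return labels

# Hypothetical toy data: (age, monthly_spend, unrelated noise variable).
rows = [(25, 40, 7), (60, 200, 3), (27, 50, 1),
        (62, 220, 8), (29, 45, 4), (70, 210, 2)]
scaled = standardize(rows)

# Step E: try different variable sets and compare the resulting labels.
results = {}
for variables in [(0, 1), (0, 1, 2)]:
    results[variables] = kmeans(select(scaled, variables), k=2)
    print(variables, results[variables])
```

Swapping the `distance` argument or the list of variable sets is how the "experiment" step would be carried out in this sketch; a real analysis would normally rely on an established library implementation instead.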
Random Forest model to…
assess the explanatory power of each variable -> variable importance
Where else can we find clustering applications?