What are some examples of unsupervised machine learning?
Examples include clustering (e.g., K-Means, DBSCAN, hierarchical clustering) and dimensionality reduction
Describe Clustering. What inputs does it take? What is the output?
Clustering is a way of grouping data into a number of clusters without labels being present
Input: Set of objects described by features xi
Output: An assignment of objects into “groups”
Unlike classification, we are not given the “groups”. The algorithm must figure these groups out
Can you give some examples of use cases for clustering?
Common use cases include customer/market segmentation, grouping similar documents or products, anomaly/outlier detection (points that fit no cluster), and image colour quantization
How do you normalize/scale data?
You can either standardize each feature (subtract the mean, divide by the standard deviation) or min-max scale it (rescale the feature to the [0, 1] range)
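As a minimal NumPy sketch of the two most common options (standardization and min-max scaling; the variable names are illustrative):

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardization (z-score): each feature gets zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: each feature is rescaled to the [0, 1] range
X_mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```

Both operate per feature (per column), which is what makes distance-based methods like K-Means treat features on an equal footing.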
What is K-Means Clustering? What is the input of the algorithm? What are the assumptions? Describe the 4 steps in the algorithm.
K Means clustering is one of the most popular clustering methods
Input:
- A set of samples described by features xi
- The number of clusters ‘k’ (hyperparameter)
Assumptions:
- The center of each cluster is the mean of all samples belonging to that cluster
- Each sample is closer to the center of its own cluster than to centers of other clusters
The four steps are like so:
1. Initialize: pick k initial cluster centers (e.g., k random samples)
2. Assign: assign each sample to its nearest center
3. Update: move each center to the mean of the samples assigned to it
4. Repeat steps 2 and 3 until the assignments no longer change
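The four steps can be sketched in NumPy as follows (a minimal illustration of Lloyd's algorithm, not a production implementation; names are my own):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means (Lloyd's algorithm). Returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centers with k randomly chosen samples
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each sample to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned samples
        # (a center that lost all its samples is simply kept in place)
        new_centers = centers.copy()
        for j in range(k):
            if (labels == j).any():
                new_centers[j] = X[labels == j].mean(axis=0)
        # Step 4: stop once the centers (and hence the assignments) stabilize
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Note that the result depends on the random initialization, which is one of the issues discussed further below.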
What are the assumptions of K-Means clustering?
The center of each cluster is the mean of all samples belonging to that cluster
Each sample is closer to the center of its own cluster than to centers of other clusters
How can you relate K-Means clustering to set theory?
We can interpret K-Means steps as trying to minimize an objective:
Given a set of observations (x1, x2, …, xn), the algorithm’s goal is to partition the n observations into k sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares: arg min over S of Σᵢ₌₁ᵏ Σ_{x ∈ Sᵢ} ‖x − μᵢ‖², where μᵢ is the mean of the points in Sᵢ
{See the rest of the math in Notion}
How can you determine how many K’s in K-Means clustering?
You can determine how many clusters using the Elbow Method or silhouette analysis
What is the Elbow Method?
Elbow Method: run K-Means for a range of values of k, plot the within-cluster sum of squares (inertia) against k, and pick the k at the “elbow”, i.e. the point where adding more clusters stops giving a large decrease
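A rough NumPy sketch of the idea (illustrative only, not sklearn's implementation; it uses a basic K-Means with greedy farthest-point initialization so a single run lands near a good solution):

```python
import numpy as np

def kmeans_inertia(X, k, n_iter=50, seed=0):
    """Fit basic K-Means and return the within-cluster sum of squares (inertia)."""
    rng = np.random.default_rng(seed)
    # Greedy farthest-point initialization keeps the initial centers spread out
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return ((X - centers[labels]) ** 2).sum()

# Three well-separated blobs: inertia drops sharply until k = 3, then flattens,
# so the "elbow" of the inertia-vs-k curve sits at k = 3
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 5, 10)])
inertias = {k: kmeans_inertia(X, k) for k in range(1, 7)}
```

Plotting `inertias` against k (e.g. with matplotlib) gives the usual elbow curve.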
What is silhouette analysis?
Each sample gets a silhouette coefficient s = (b − a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster; s ranges from −1 to 1, and values near 1 mean the sample sits well inside its cluster
Thickness of each cluster’s band in the silhouette plot shows the size of the cluster (how many datapoints are assigned to the cluster)
The clusters in the plot should have approximately similar silhouette coefficients, they should not fall below the mean silhouette coefficient, and ideally they should also be of approximately the same thickness (unless the clusters genuinely differ in size)
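A minimal NumPy sketch of the silhouette coefficient itself (illustrative; in practice one would typically use sklearn.metrics.silhouette_samples):

```python
import numpy as np

def silhouette_scores(X, labels):
    """Per-sample silhouette coefficient s = (b - a) / max(a, b).
    Assumes every cluster contains at least two points."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    s = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        # a: mean distance to the other points in the same cluster
        a = D[i, same & (np.arange(n) != i)].mean()
        # b: mean distance to the points of the nearest other cluster
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
```

Sorting these scores within each cluster and plotting them as horizontal bars produces the silhouette plot described above.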
What are some issues with K-Means clustering?
Final cluster assignment depends on initialization of centers
Assumes you know the number of clusters ‘k’
- Lots of heuristic approaches to picking ‘k’
Each object is assigned to one (and only one) cluster:
- No overlapping clusters or “soft” (partial) assignments
Sensitive to scale
When is a set convex?
A set is convex if a line between two points in the set stays in the set (See images on Notion)
Can K-Means cluster into non-convex sets?
No, K-Means cannot: each cluster is the region of space closest to its center (a Voronoi region), and such regions are always convex
What is Density based Clustering?
Density-based clustering treats clusters as regions of high point density separated by regions of low density; points in low-density regions are treated as noise/outliers
What is DBSCAN? Which hyperparameters does it have?
DBSCAN is a density based clustering algorithm.
It has two hyperparameters:
- Epsilon (ε): Distance we use to decide if another point is a “neighbour”.
- minNeighbours (a.k.a. minPts): The minimum number of neighbours a point needs to be considered a “core” point.
Describe the algorithm of density-based clustering (the process)
For each example xi:
- If xi has already been visited, skip it
- Find xi’s neighbours (points within distance ε)
- If xi has fewer than minNeighbours neighbours, leave it as noise (it may still be claimed by a later cluster as a boundary point)
- Otherwise xi is a “core” point: start a new cluster and call the “expand” function
“Expand” cluster function:
- Assign all of xi’s neighbours to the current cluster
- For each neighbour that is itself a core point, recursively expand from it as well
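A minimal NumPy sketch of this process (illustrative only; the neighbourhood here includes the point itself, and boundary points go to whichever cluster reaches them first):

```python
import numpy as np

def dbscan(X, eps, min_neighbours):
    """Minimal DBSCAN sketch. Returns a label per point; -1 marks noise."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbours = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbours[i]) < min_neighbours:
            continue  # not a core point; stays noise unless a cluster claims it
        # "Expand cluster": flood-fill outward from the core point i
        labels[i] = cluster
        stack = list(neighbours[i])
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster  # boundary or core point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbours[j]) >= min_neighbours:
                    stack.extend(neighbours[j])  # j is core: keep expanding
        cluster += 1
    return labels
```

Computing the full pairwise distance matrix up front is what makes this sketch memory-hungry on large datasets, one of the issues noted below.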
What are some of the issues with density-based clustering?
Some points are not assigned to a cluster
- Good/bad depending on the application
Ambiguity of “non-core” (boundary) points between clusters
Consumes a lot of memory with large datasets
Sensitive to the choice of ε and minNeighbours
- Otherwise, not sensitive to initialization (except for boundary points)
What are the two ways of doing hierarchical clustering?
Hierarchical clustering can be split into the following two types of clustering:
- Agglomerative (bottom-up): start with each sample in its own cluster and successively merge clusters
- Divisive (top-down): start with all samples in one cluster and successively split clusters
In general, Agglomerative clustering works much better in practice
In Agglomerative clustering, clusters are successively merged…
until all samples belong to one cluster
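A naive sketch of agglomerative merging with single linkage (illustrative and O(n³); in practice one would use something like scipy.cluster.hierarchy):

```python
import numpy as np

def agglomerative(X, n_clusters):
    """Start with one cluster per sample; repeatedly merge the two closest
    clusters (single linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge the two closest clusters
    return clusters
```

Recording the order and distance of each merge is exactly the information a dendrogram visualizes.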
True or False? If uncertain whether scaling is required, I should scale my data
True, if you’re not sure whether scaling is needed, scale it.
Hierarchical clustering is often visually inspected using…
A dendrogram
Which is a tree diagram that shows the hierarchy and how the data is split into clusters
Which distance metrics are typically used in Agglomerative clustering?
Euclidean Distance
Manhattan (block) distance
Which different linkages (for hierarchical clustering) are there?
- Single linkage: minimum distance between members of the two clusters
- Complete linkage: maximum distance between members of the two clusters
- Average linkage: average distance over all pairs of members
- Centroid linkage: distance between the cluster centroids
- Ward linkage: merge the pair of clusters that least increases the within-cluster variance
What is a centroid linkage?
Centroid: The distance between the centroids of each cluster