Basic idea of unsupervised learning.
Find structure within a set of instances defined by descriptive features alone.
What is a clustering algorithm?
Given a finite set of data points, it finds homogeneous subgroups of points with similar characteristics. The end result is the generation of a new feature that describes which cluster each point belongs to.
Give one use case of clustering.
Customer segmentation.
Clustering algorithm fundamentals?
Feature-space and distance measure.
It is a form of representation learning, focused on creating a new representation of the instances in the expectation that this representation will be useful later.
How is data pre-processed before k-means clustering?
Convert categorical data to numerical data, and apply feature reduction techniques.
T or F. K-Means is highly sensitive to outliers.
True.
Name the four steps of k-means.
Step 1: Initialize K & Centroids
Step 2: Assigning Clusters to Datapoints
Step 3: Updating Centroids
Step 4: Stopping Criterion
Explain step 1: Initialize K & Centroids.
Tell the model how many clusters there will be (k), and pick k datapoints as the initial centroids.
What are the initial centroids often called?
Seeds
Explain Step 2: Assigning clusters to datapoints.
Calculate the distance between each datapoint and every cluster centroid, and assign each datapoint to the cluster of its closest centroid.
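The assignment step can be sketched in NumPy (the toy points and centroids here are made up for illustration):

```python
import numpy as np

# Hypothetical toy data: 4 points in 2-D and 2 centroids.
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
centroids = np.array([[0.0, 0.5], [5.5, 5.0]])

# Euclidean distance from every point to every centroid (4 x 2 matrix),
# then assign each point to the index of its nearest centroid.
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
assignments = distances.argmin(axis=1)
print(assignments)  # → [0 0 1 1]
```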
Explain step 3: Updating cluster centroids.
We then split the data on features (e.g. x, y, … co-ordinates) and take the average of each feature over the datapoints in each cluster to get a new cluster centroid. The old centroid is not included as a datapoint unless it happens to coincide with one.
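The centroid-update step, continuing the same kind of toy example (assumed data):

```python
import numpy as np

# Hypothetical toy cluster memberships after the assignment step.
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
assignments = np.array([0, 0, 1, 1])

# New centroid = mean of each feature over the points in the cluster;
# the old centroid itself is not averaged in.
new_centroids = np.array([points[assignments == k].mean(axis=0) for k in range(2)])
print(new_centroids)  # → [[0.  0.5] [5.5 5. ]]
```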
Explain step 4: Stopping criterion.
Steps 2 and 3 are performed iteratively until a stopping criterion is met, e.g. the distance of datapoints from their centroids falls below some threshold, or no cluster membership changes on a given iteration.
Outline the k means clustering algorithm.
Select k cluster centroids.
Loop until stopping criterion met:
- Calculate distance of each datapoint from each cluster centroid.
- Assign each datapoint to its closest cluster centroid.
- Update each cluster centroid by taking the average of the datapoints in its cluster.
Return: The clusters and the datapoints in each of them at the end, along with the final k centroids.
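The outline above can be sketched as a minimal NumPy implementation, assuming Euclidean distance and "no membership change" as the stopping criterion (the function name and toy data are illustrative):

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    """Plain k-means sketch. Stopping criterion: no membership change
    (or max_iters reached). Assumes no cluster ever becomes empty."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct datapoints as the initial centroids ("seeds").
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    assignments = np.full(len(points), -1)
    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid.
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # Step 4: no membership change -> stop.
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its members.
        centroids = np.array([points[assignments == c].mean(axis=0) for c in range(k)])
    return assignments, centroids

pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
labels, cents = k_means(pts, k=2)
print(labels, cents)
```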
What is the output of k-means clustering algorithm? How would new value be assigned?
The datapoints, the clusters they belong to, and the final centroids.
A new value is assigned to whichever centroid it is closest to.
What makes a good cluster? (informally)
Member datapoints are close together and far from other clusters.
How do we measure cluster quality?
The Inertia Score: inertia tells how far the data points within a cluster are from their centroid. It ranges upward from 0, with lower values being desirable.
Silhouette Width: considers both intra-cluster and inter-cluster distances to determine whether a given point is well placed. It ranges from -1 to 1, with 1 being desirable.
How do we calculate inertia?
The sum of squared distances between each datapoint and its assigned centroid, over all clusters.
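A quick inertia computation on toy data, using the sum-of-squared-distances convention (the one scikit-learn's `inertia_` attribute reports; the points and centroids are invented):

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
assignments = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 0.5], [5.5, 5.0]])

# Squared distance from each point to its own centroid, summed over all points.
inertia = np.sum(np.linalg.norm(points - centroids[assignments], axis=1) ** 2)
print(inertia)  # → 1.0
```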
How do you calculate silhouette width?
s(i) = (b(i) - a(i)) / max(a(i), b(i))
for point i, where a(i) is the average distance from i to the other points in its own cluster, and b(i) is the average distance from i to the points in the next closest cluster.
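The formula applied to one point of a toy dataset (all values are illustrative):

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
labels = np.array([0, 0, 1, 1])
i = 0  # compute s(i) for the first point

# a(i): mean distance from i to the other points in its own cluster.
a = np.mean([np.linalg.norm(points[i] - points[j])
             for j in range(len(points)) if labels[j] == labels[i] and j != i])
# b(i): mean distance from i to the points of the next closest cluster.
b = np.mean([np.linalg.norm(points[i] - points[j])
             for j in range(len(points)) if labels[j] != labels[i]])
s = (b - a) / max(a, b)
print(round(s, 3))  # → 0.866
```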
How can we determine the number of clusters to have in k-means?
The Elbow Method or the Silhouette Method.
Explain The Elbow Method.
We want a low value of inertia and a small number of clusters (k). Inertia decreases as k increases. The “elbow point” in the inertia-k graph is a good choice because after that the change in the value of inertia is not significant.
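A sketch of picking the elbow programmatically; the inertia values are invented for illustration, and "largest second difference" is just one simple heuristic for locating where the curve flattens:

```python
# Hypothetical inertia values measured for k = 1..6 (numbers invented for illustration).
inertias = [1200.0, 700.0, 250.0, 150.0, 130.0, 120.0]

# Drop in inertia when moving from k clusters to k + 1.
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
# The elbow is where the drop slows down most: the largest second difference.
elbow_k = max(range(len(drops) - 1), key=lambda i: drops[i] - drops[i + 1]) + 2
print(elbow_k)  # → 3
```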
Explain The Silhouette Method (for determining appropriate k).
Compute the silhouette widths for different values of k; the k whose silhouette values lie mostly towards 1, with fewer outliers towards -1, is the better choice.
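A NumPy-only sketch comparing mean silhouette width across candidate clusterings (the labelings stand in for what k-means might return at each k; the data is made up):

```python
import numpy as np

def mean_silhouette(points, labels):
    """Mean silhouette width over all points (assumes every cluster has >= 2 points)."""
    n = len(points)
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    scores = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        a = dist[i, own].mean()  # mean distance to the rest of i's cluster
        b = min(dist[i, [j for j in range(n) if labels[j] == c]].mean()
                for c in set(labels) if c != labels[i])  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
# Labelings standing in for what k-means might return at k = 2 and k = 3.
s_k2 = mean_silhouette(points, [0, 0, 0, 1, 1, 1])
s_k3 = mean_silhouette(points, [0, 0, 1, 1, 2, 2])
print(s_k2 > s_k3)  # → True
```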
What is association rule learning?
A rule-based machine learning method that finds associations between attributes, creating rules that can predict any attribute or combination of attributes.
How is association rule learning different to k-means?
There are no clusters / classes
What is the name for identified rules and format? What is the itemset?
Association Rules
{Antecedent} => {Consequent}
I.e. {data we find} => {data that often occurs at the same time}, e.g. {bread, butter} => {milk}.
The itemset is the set of all antecedent and consequent items for a given rule.