What is this clustering called?
Hierarchical clustering
Because:
It starts with each object as its own cluster at the bottom, then merges clusters step-by-step.
How useful are steps 1 and 5?
NOT useful at ALL.
(you have to figure out where to stop: step 2, 3 or 4?)
How do we calculate the similarity between two observations?
(How do we group them together?)
CALCULATE THE DISTANCE BETWEEN OBJECTS!
What are the 3 main formulas used to calculate the distances between objects?
1. Euclidean
2. City-Block
3. Chebychev
How does the “Euclidean” distance-calculation work?
Measure the direct (straight-line) distance from point B to point C
How does the “City-Block” distance-calculation work?
City-Block calculates distance the way you would travel city blocks in, say, New York (so it is the difference along the x-axis + the difference along the y-axis)
How does the “Chebychev” distance-calculation work?
Chebychev looks at the same two component distances as City-Block, but takes only the biggest one. In this case the horizontal difference (x) is larger, so that alone is the Chebychev distance.
Which of the 3 distance-calculation formulas is most important/ popular?
Euclidean (direct distance from point B to C)
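Not from the lecture (which likely used SPSS): a minimal NumPy sketch of the three formulas, with made-up coordinates for points B and C:

```python
import numpy as np

# Two hypothetical observations (points B and C on a 2-D map)
b = np.array([1.0, 2.0])
c = np.array([4.0, 6.0])

diff = np.abs(b - c)  # per-axis differences: [3, 4]

euclidean = np.sqrt((diff ** 2).sum())  # direct distance: sqrt(9 + 16) = 5
city_block = diff.sum()                 # x-diff + y-diff: 3 + 4 = 7
chebychev = diff.max()                  # biggest single axis difference: 4

print(euclidean, city_block, chebychev)
```

Note how the three values differ for the same pair of points: City-Block is always the largest, Chebychev the smallest, and Euclidean sits in between.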
What are the 5 “Clustering Algorithms”?
1. Single Linkage
2. Complete Linkage
3. Average Linkage
4. Centroid
5. Ward’s Method
What is the difference between “Single” and “Complete” Linkage?
They are OPPOSITES!
Single = merges based on the shortest distance between any two members of the two clusters (nearest neighbour)
Complete = merges based on the longest distance between any two members (furthest neighbour)
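To see the “opposites” in action, here is a SciPy sketch (the four toy points are invented): two tight pairs far apart, so the final merge height differs between the two linkages.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Hypothetical observations: two tight pairs, 5 units apart
X = np.array([[0.0, 0.0], [0.0, 1.0],
              [5.0, 0.0], [5.0, 1.0]])

d = pdist(X)  # condensed matrix of pairwise Euclidean distances

# Single linkage: merge clusters by their NEAREST members
single = linkage(d, method='single')

# Complete linkage: merge clusters by their FURTHEST members
complete = linkage(d, method='complete')

# Final merge height: single uses the shortest cross-cluster distance
# (5.0), complete the longest (sqrt(26) ~ 5.10)
print(single[-1, 2], complete[-1, 2])
```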
What is the problem with “Single Linkage”, and why is it mainly used?
It’s a CONTRACTIVE method
It tends to form few BIG groups (chains)
Lecturer says it still serves a useful purpose though:
–> QUICKLY spotting any outliers
What is “Average Linkage”?
The distance between two clusters is defined as the average distance between all pairs of the two clusters’ members
What is “Centroid”?
The geometric center (centroid) of each cluster is computed first, then the distance between those centroids is used
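A sketch of both methods on the same invented points as above, via SciPy (note SciPy’s `centroid` method expects raw observations rather than a precomputed distance matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Invented points: two tight pairs, 5 units apart
X = np.array([[0.0, 0.0], [0.0, 1.0],
              [5.0, 0.0], [5.0, 1.0]])

# Average linkage: mean of ALL cross-cluster pairwise distances
avg = linkage(X, method='average')

# Centroid: distance between the geometric centers of the clusters
cen = linkage(X, method='centroid')

# Average: (5 + 5 + sqrt(26) + sqrt(26)) / 4 ~ 5.05
# Centroid: centers are (0, 0.5) and (5, 0.5), so exactly 5
print(avg[-1, 2], cen[-1, 2])
```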
Why can “Average Linkage” and “Centroid” be useful “Clustering Algorithms”?
Because they are CONSERVATIVE methods which produce HOMOGENOUS CLUSTERS of SIMILAR SIZES
What is “Ward’s Method”?
The objects whose merger increases the overall within-cluster variance by the smallest possible amount are combined
(so rather than looking at “closest by distance” in the usual sense, it looks at least added variance/SSE)
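A minimal illustration of that “least added variance” idea, using SciPy on invented 1-D data. The merge heights in Ward’s output reflect the SSE cost of each merge, so the cheap merges (tight pairs) come first and the expensive merge comes last:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Invented 1-D observations: two tight pairs far apart
X = np.array([[0.0], [1.0], [10.0], [11.0]])

# Ward's method: at each step, merge the pair of clusters whose union
# adds the LEAST within-cluster variance (SSE), not simply the closest pair
Z = linkage(X, method='ward')

# First merges pair the nearby points (0 with 1, 10 with 11) at low cost;
# the expensive merge of the two groups happens last, at a much greater height
print(Z[:, 2])
```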
Which “Clustering Algorithm” is superior?
(if clusters can be expected to have similar sizes and data does NOT include outliers)
Ward’s Method
What is this called?
Dendrogram
From the dendrogram, how do you determine the number of clusters?
You read the dendrogram from left to right
- find the longest horizontal line (this is where you “CUT”)
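The same “cut at the longest line” idea can be done numerically instead of visually. A sketch with SciPy (the data are invented): the biggest jump between successive merge heights corresponds to the longest line on the dendrogram, and `fcluster` cuts the tree there.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Invented data: three visually obvious groups of two points each
X = np.array([[0, 0], [0, 1],
              [10, 0], [10, 1],
              [20, 0], [20, 1]], dtype=float)

Z = linkage(X, method='ward')

# Column 2 of Z holds the merge heights; the biggest jump between
# successive heights is where the dendrogram lines are longest
heights = Z[:, 2]
i = int(np.diff(heights).argmax())
cut = (heights[i] + heights[i + 1]) / 2  # cut inside the biggest gap

labels = fcluster(Z, t=cut, criterion='distance')
print(labels)  # each tight pair gets its own cluster label
```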
Does the dendrogram always give you the exact answer for the number of clusters?
NOT NECESSARILY
After you read the dendrogram, what is the most important factor for determining the number of clusters further?
INTERPRETABILITY + VALIDITY
The clusters must be manageable and large enough to warrant attention
How can you assess “STABILITY”?
Stability:
- split the dataset in two, and test whether the same analysis yields the same results
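A rough sketch of that split-half check in Python (all data and thresholds are invented for illustration): cluster each random half with the same method and see whether both halves recover the same group structure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Invented data: two well-separated blobs of 40 points each
X = np.vstack([rng.normal(0, 1, (40, 2)),
               rng.normal(8, 1, (40, 2))])

# Split-half stability check: shuffle, cut the dataset in two halves
idx = rng.permutation(len(X))
half_a, half_b = X[idx[:40]], X[idx[40:]]

def two_clusters(data):
    # Same analysis on each half: Ward linkage, forced to 2 clusters
    return fcluster(linkage(data, method='ward'), t=2, criterion='maxclust')

la, lb = two_clusters(half_a), two_clusters(half_b)

# Crude stability signal: both halves should recover two groups whose
# centers sit roughly where the original blobs are (x near 0 and near 8)
for labels, data in ((la, half_a), (lb, half_b)):
    centers = sorted(data[labels == k].mean(axis=0)[0] for k in (1, 2))
    print(centers)
```

If the two halves produced clusters with very different centers or sizes, the solution would be considered unstable.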
How can you assess “VALIDITY”?
As in… if clusters are hard to describe, or you can’t identify who belongs to them in practice → low validity, even if the algorithm produced them (aka… USELESS)