What is this clustering called?
Hierarchical clustering
Because:
It starts with each object as its own cluster at the bottom, then merges clusters step-by-step.
How useful are steps 1 and 5?
NOT useful at ALL.
(you have to figure out where to stop: step 2, 3 or 4?)
How do we calculate the similarity between two observations?
(How do we group them together?)
CALCULATE THE DISTANCE BETWEEN OBJECTS!
What are the 3 main formulas used to calculate the distances between objects?
1. Euclidean
2. City-Block
3. Chebychev
How does the “Euclidean” distance-calculation work?
Measure the direct (straight-line) distance from point B to point C
How does the “City-Block” distance-calculation work?
City-Block calculates distance the way you would travel city blocks in, say, New York (so it is the difference along the x-axis + the difference along the y-axis)
How does the “Chebychev” distance-calculation work?
Chebychev looks at the same two component distances as City-Block, but takes only the biggest one. In this case the horizontal difference (x) is larger, so that alone is the Chebychev distance.
Which of the 3 distance-calculation formulas is most important/ popular?
Euclidean (direct distance from point B to C)
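Not from the lecture (which likely used SPSS): a minimal NumPy sketch of the three formulas, with made-up coordinates for points B and C:

```python
import numpy as np

# Two hypothetical observations (points B and C on a 2-D map)
b = np.array([1.0, 2.0])
c = np.array([4.0, 6.0])

diff = np.abs(b - c)  # per-axis differences: [3, 4]

euclidean = np.sqrt((diff ** 2).sum())  # direct distance: sqrt(9 + 16) = 5
city_block = diff.sum()                 # x-diff + y-diff: 3 + 4 = 7
chebychev = diff.max()                  # biggest single axis difference: 4

print(euclidean, city_block, chebychev)
```

Note how the three values differ for the same pair of points: City-Block is always the largest, Chebychev the smallest, and Euclidean sits in between.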
What are the 5 “Clustering Algorithms”?
1. Single Linkage
2. Complete Linkage
3. Average Linkage
4. Centroid
5. Ward’s Method
What is the difference between “Single” and “Complete” Linkage?
They are OPPOSITES!
Single = merges based on the shortest distance between any two members of the two clusters (nearest neighbour)
Complete = merges based on the longest distance between any two members (furthest neighbour)
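To see the “opposites” in action, here is a SciPy sketch (the four toy points are invented): two tight pairs far apart, so the final merge height differs between the two linkages.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Hypothetical observations: two tight pairs, 5 units apart
X = np.array([[0.0, 0.0], [0.0, 1.0],
              [5.0, 0.0], [5.0, 1.0]])

d = pdist(X)  # condensed matrix of pairwise Euclidean distances

# Single linkage: merge clusters by their NEAREST members
single = linkage(d, method='single')

# Complete linkage: merge clusters by their FURTHEST members
complete = linkage(d, method='complete')

# Final merge height: single uses the shortest cross-cluster distance
# (5.0), complete the longest (sqrt(26) ~ 5.10)
print(single[-1, 2], complete[-1, 2])
```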
What is the problem with “Single Linkage”, and why is it mainly used?
It’s a CONTRACTIVE method
It tends to form few BIG groups (chains)
Lecturer says it still serves a useful purpose though:
–> QUICKLY spotting any outliers
What is “Average Linkage”?
The distance between two clusters is defined as the average distance between all pairs of the two clusters’ members
What is “Centroid”?
The geometric center (centroid) of each cluster is computed first, then the distance between those centroids is used
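A sketch of both methods on the same invented points as above, via SciPy (note SciPy’s `centroid` method expects raw observations rather than a precomputed distance matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Invented points: two tight pairs, 5 units apart
X = np.array([[0.0, 0.0], [0.0, 1.0],
              [5.0, 0.0], [5.0, 1.0]])

# Average linkage: mean of ALL cross-cluster pairwise distances
avg = linkage(X, method='average')

# Centroid: distance between the geometric centers of the clusters
cen = linkage(X, method='centroid')

# Average: (5 + 5 + sqrt(26) + sqrt(26)) / 4 ~ 5.05
# Centroid: centers are (0, 0.5) and (5, 0.5), so exactly 5
print(avg[-1, 2], cen[-1, 2])
```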
Why can “Average Linkage” and “Centroid” be useful “Clustering Algorithms”?
Because they are CONSERVATIVE methods which produce HOMOGENOUS CLUSTERS of SIMILAR SIZES
What is “Ward’s Method”?
The objects whose merger increases the overall within-cluster variance by the smallest possible amount are combined
(so rather than looking at “closest by distance” in the usual sense, it looks at least added variance/SSE)
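A minimal illustration of that “least added variance” idea, using SciPy on invented 1-D data. The merge heights in Ward’s output reflect the SSE cost of each merge, so the cheap merges (tight pairs) come first and the expensive merge comes last:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Invented 1-D observations: two tight pairs far apart
X = np.array([[0.0], [1.0], [10.0], [11.0]])

# Ward's method: at each step, merge the pair of clusters whose union
# adds the LEAST within-cluster variance (SSE), not simply the closest pair
Z = linkage(X, method='ward')

# First merges pair the nearby points (0 with 1, 10 with 11) at low cost;
# the expensive merge of the two groups happens last, at a much greater height
print(Z[:, 2])
```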
Which “Clustering Algorithm” is superior?
(if clusters can be expected to have similar sizes and data does NOT include outliers)
Ward’s Method
What is this called?
Dendrogram
From the dendrogram, how do you determine the number of clusters?
You read the dendrogram from left to right
- find the longest horizontal line (this is where you “CUT”)
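The same “cut at the longest line” idea can be done numerically instead of visually. A sketch with SciPy (the data are invented): the biggest jump between successive merge heights corresponds to the longest line on the dendrogram, and `fcluster` cuts the tree there.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Invented data: three visually obvious groups of two points each
X = np.array([[0, 0], [0, 1],
              [10, 0], [10, 1],
              [20, 0], [20, 1]], dtype=float)

Z = linkage(X, method='ward')

# Column 2 of Z holds the merge heights; the biggest jump between
# successive heights is where the dendrogram lines are longest
heights = Z[:, 2]
i = int(np.diff(heights).argmax())
cut = (heights[i] + heights[i + 1]) / 2  # cut inside the biggest gap

labels = fcluster(Z, t=cut, criterion='distance')
print(labels)  # each tight pair gets its own cluster label
```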
Does the dendrogram always give you the exact answer for the number of clusters?
NOT NECESSARILY
After you read the dendrogram, what is the most important factor for determining the number of clusters further?
INTERPRETABILITY + VALIDITY
The clusters must be manageable and large enough to warrant attention
How can you assess “STABILITY”?
Stability:
- split the dataset in two, and test whether the same analysis yields the same results
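A rough sketch of that split-half check in Python (all data and thresholds are invented for illustration): cluster each random half with the same method and see whether both halves recover the same group structure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Invented data: two well-separated blobs of 40 points each
X = np.vstack([rng.normal(0, 1, (40, 2)),
               rng.normal(8, 1, (40, 2))])

# Split-half stability check: shuffle, cut the dataset in two halves
idx = rng.permutation(len(X))
half_a, half_b = X[idx[:40]], X[idx[40:]]

def two_clusters(data):
    # Same analysis on each half: Ward linkage, forced to 2 clusters
    return fcluster(linkage(data, method='ward'), t=2, criterion='maxclust')

la, lb = two_clusters(half_a), two_clusters(half_b)

# Crude stability signal: both halves should recover two groups whose
# centers sit roughly where the original blobs are (x near 0 and near 8)
for labels, data in ((la, half_a), (lb, half_b)):
    centers = sorted(data[labels == k].mean(axis=0)[0] for k in (1, 2))
    print(centers)
```

If the two halves produced clusters with very different centers or sizes, the solution would be considered unstable.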
How can you assess “VALIDITY”?
As in… if clusters are hard to describe, or you can’t identify who belongs to them in practice → low validity, even if the algorithm produced them (aka… USELESS)