4b. Continued. Flashcards

(22 cards)

1
Q

What is this clustering called?

A

Hierarchical clustering (agglomerative, i.e. bottom-up)

Because:
It starts with each object as its own cluster at the bottom, then merges clusters step by step.
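
A minimal sketch of the bottom-up merging idea, assuming 1-D data and a nearest-member merge rule (both are simplifications chosen for illustration; the function name `agglomerate` and the sample values are made up, not from the lecture):

```python
def agglomerate(points):
    """Merge 1-D points bottom-up and record the clusters after every step."""
    clusters = [[p] for p in points]           # step 1: every object is its own cluster
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        # Find the two clusters whose closest members are nearest to each other
        # (an assumed merge rule, used here only to keep the sketch short).
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append([list(c) for c in clusters])
    return history

steps = agglomerate([1.0, 2.0, 6.0, 7.0, 15.0])
for number, state in enumerate(steps, start=1):
    print(f"step {number}: {state}")
```

With five objects this gives five states: five single-object clusters at the start and one all-inclusive cluster at the end.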

2
Q

How useful are steps 1 and 5?

A

NOT useful at ALL.
(Step 1 has every object in its own cluster and step 5 has everything in one cluster; you have to figure out where to stop in between: step 2, 3 or 4?)

3
Q

How do we calculate the similarity between two observations?
(How do we group them together?)

A

CALCULATE THE DISTANCE BETWEEN OBJECTS!

4
Q

What are the 3 main formulas used to calculate the distances between objects?

A
  1. Euclidean
  2. City-block
  3. Chebychev
5
Q

How does the “Euclidean” distance-calculation work?

A

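Measure the direct (straight-line) distance between two points, e.g. from point B to C: the square root of the sum of the squared differences on each variable.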

6
Q

How does the “City-Block” distance-calculation work?

A

City-block distance works like walking along city blocks, as in New York: it is the sum of the absolute differences along each axis (difference along the x-axis + difference along the y-axis).

7
Q

How does the “Chebychev” distance-calculation work?

A

Chebychev distance looks at the same per-axis differences as city-block, but keeps only the biggest one. In this case the horizontal difference is bigger (the x difference is larger), so that is the Chebychev distance.
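
A small sketch comparing the three distance measures on one pair of points. The coordinates for B and C are made up for illustration (they are not the points from the lecture figure), chosen so that the x difference is the larger one:

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def city_block(p, q):
    """Sum of the absolute differences along each axis."""
    return sum(abs(a - b) for a, b in zip(p, q))

def chebychev(p, q):
    """Largest absolute difference along any single axis."""
    return max(abs(a - b) for a, b in zip(p, q))

# Hypothetical points: x difference = 4, y difference = 3
B = (1.0, 2.0)
C = (5.0, 5.0)

print(euclidean(B, C))   # 5.0
print(city_block(B, C))  # 7.0
print(chebychev(B, C))   # 4.0
```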

8
Q

Which of the 3 distance-calculation formulas is most important/popular?

A

Euclidean (direct distance from point B to C)

9
Q

What are the 5 “Clustering Algorithms”?

A
  1. Single Linkage
  2. Complete Linkage
  3. Average Linkage
  4. Centroid
  5. Ward’s Method
10
Q

What is the difference between “Single” and “Complete” Linkage?

A

They are OPPOSITES!

single = the distance between two clusters is the shortest distance between any two members, one from each cluster (nearest neighbour)

complete = the distance between two clusters is the longest distance between any two members, one from each cluster (furthest neighbour)
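
A minimal sketch of the two rules, assuming Euclidean distance and two made-up clusters (`left` and `right` are illustrative names, not from the lecture):

```python
import math
from itertools import product

def pair_distances(cluster_1, cluster_2):
    """Euclidean distance for every pair with one member from each cluster."""
    return [math.dist(a, b) for a, b in product(cluster_1, cluster_2)]

def single_linkage(cluster_1, cluster_2):
    """Nearest neighbour: the shortest of all cross-cluster distances."""
    return min(pair_distances(cluster_1, cluster_2))

def complete_linkage(cluster_1, cluster_2):
    """Furthest neighbour: the longest of all cross-cluster distances."""
    return max(pair_distances(cluster_1, cluster_2))

left = [(0.0, 0.0), (1.0, 0.0)]
right = [(4.0, 0.0), (6.0, 0.0)]

print(single_linkage(left, right))    # 3.0
print(complete_linkage(left, right))  # 6.0
```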

11
Q

What is the problem with “Single Linkage”, and what is it mainly used for?

A

It's a CONTRACTIVE method
It tends to form a few BIG groups (chains)
The lecturer says it still serves a useful purpose though:
-> QUICKLY spotting any outliers

12
Q

What is “Average Linkage”?

A

The distance between two clusters is defined as the average distance between all pairs of members, with one member taken from each cluster
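
A short sketch of that definition, assuming Euclidean distance and two made-up clusters (`left` and `right` are illustrative only):

```python
import math
from itertools import product

def average_linkage(cluster_1, cluster_2):
    """Mean of the distances over every cross-cluster pair of members."""
    distances = [math.dist(a, b) for a, b in product(cluster_1, cluster_2)]
    return sum(distances) / len(distances)

left = [(0.0, 0.0), (1.0, 0.0)]
right = [(4.0, 0.0), (6.0, 0.0)]

# Cross-cluster distances are 4, 6, 3 and 5, so the average is 18 / 4
print(average_linkage(left, right))  # 4.5
```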

13
Q

What is “Centroid”?

A

The geometric center (centroid) of each cluster is computed first, then the distance between the two centroids is used as the distance between the clusters
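
A minimal sketch of the centroid approach, assuming Euclidean distance between centroids and made-up clusters (`cluster_a` and `cluster_b` are illustrative names):

```python
import math

def centroid(cluster):
    """Geometric center: the mean of the members on each variable."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_distance(cluster_1, cluster_2):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(cluster_1), centroid(cluster_2))

cluster_a = [(0.0, 0.0), (2.0, 0.0)]   # centroid (1.0, 0.0)
cluster_b = [(4.0, 3.0), (4.0, 5.0)]   # centroid (4.0, 4.0)

print(centroid(cluster_a))
print(centroid(cluster_b))
print(centroid_distance(cluster_a, cluster_b))  # 5.0
```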

14
Q

Why can “Average Linkage” and “Centroid” be useful “Clustering Algorithms”?

A

Because they are CONSERVATIVE methods which produce HOMOGENEOUS CLUSTERS of SIMILAR SIZES

15
Q

What is “Ward’s Method”?

A

The objects (or clusters) whose merger increases the overall within-cluster variance by the smallest possible amount are combined

(so rather than looking at “closest by distance” in the usual sense, it looks at the least added variance/SSE)
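
A minimal sketch of the "least added SSE" idea. The helper names (`sse`, `ward_cost`) and the three clusters A, B and C are made up for illustration; this compares candidate merges in one step only and is not a full implementation of the method:

```python
from itertools import combinations

def sse(cluster):
    """Sum of squared deviations from the cluster mean, over all variables."""
    n = len(cluster)
    dims = len(cluster[0])
    means = [sum(p[d] for p in cluster) / n for d in range(dims)]
    return sum((p[d] - means[d]) ** 2 for p in cluster for d in range(dims))

def ward_cost(cluster_1, cluster_2):
    """Increase in total within-cluster SSE caused by merging the two clusters."""
    return sse(cluster_1 + cluster_2) - sse(cluster_1) - sse(cluster_2)

# Hypothetical clusters
clusters = {
    "A": [(1.0, 1.0), (2.0, 1.0)],
    "B": [(4.0, 1.0)],
    "C": [(9.0, 1.0), (10.0, 1.0)],
}

costs = {
    (a, b): ward_cost(clusters[a], clusters[b])
    for a, b in combinations(sorted(clusters), 2)
}
best_pair = min(costs, key=costs.get)

for pair, cost in costs.items():
    print(pair, round(cost, 3))
print("merge next:", best_pair)
```

Here merging A and B adds the least SSE, so that pair would be combined first.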

16
Q

Which “Clustering Algorithm” is superior?

(if clusters can be expected to have similar sizes and data does NOT include outliers)

A

Ward’s Method

17
Q

What is this called?

18
Q

From the dendrogram, how do you determine the number of clusters?

A

You read the dendrogram from left to right
- find the longest horizontal line, i.e. the biggest jump in merging distance (this is where you “CUT”)
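
The "longest line" rule can be sketched numerically as finding the largest gap between consecutive merge distances. This assumes the merge distances are already listed in increasing order; the function name and the numbers are made up for illustration:

```python
def clusters_at_largest_gap(merge_heights, n_objects):
    """Number of clusters left when cutting inside the largest gap.

    merge_heights: distances at which successive merges happened (ascending).
    n_objects: number of objects that were clustered.
    """
    gaps = [later - earlier for earlier, later in zip(merge_heights, merge_heights[1:])]
    widest = max(range(len(gaps)), key=gaps.__getitem__)  # index of the largest gap
    merges_kept = widest + 1          # merges that happen before the cut
    return n_objects - merges_kept    # every merge reduces the cluster count by one

# Hypothetical merge distances for 7 objects (6 merges in total)
heights = [0.5, 0.7, 1.0, 1.2, 4.5, 5.0]
n_clusters = clusters_at_largest_gap(heights, n_objects=7)
print(n_clusters)  # 3
```

The biggest jump is between 1.2 and 4.5, after four merges, which leaves 7 - 4 = 3 clusters.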

19
Q

Does the dendrogram always give you the exact answer for the number of clusters?

A

NOT NECESSARILY

20
Q

After you read the dendrogram, what is the most important factor in further determining the number of clusters?

A

INTERPRETABILITY + VALIDITY

The clusters must be manageable and large enough to warrant attention

21
Q

How can you assess “STABILITY”?

A

Stability:
- split the dataset in two and check whether the same clustering procedure yields the same results on both halves
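
A rough sketch of a split-half check. Everything here is an assumption made for illustration: the 1-D data, the tiny complete-linkage routine `cluster_1d`, the alternating split (in practice the split would usually be random), and the tolerance of 0.5 for calling two cluster means "the same":

```python
def cluster_1d(values, k):
    """Tiny agglomerative clustering (complete linkage) of 1-D values into k clusters."""
    clusters = [[v] for v in values]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between the furthest members
                d = max(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters

def sorted_means(clusters):
    """Cluster means in ascending order, so the two halves can be compared."""
    return sorted(sum(c) / len(c) for c in clusters)

# Hypothetical data with three visible groups
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1, 9.0, 9.2, 8.8, 9.1]
half_a, half_b = data[0::2], data[1::2]   # simple alternating split

means_a = sorted_means(cluster_1d(half_a, k=3))
means_b = sorted_means(cluster_1d(half_b, k=3))

tolerance = 0.5  # assumed threshold
stable = all(abs(a - b) < tolerance for a, b in zip(means_a, means_b))
print(means_a, means_b, stable)
```

If both halves produce roughly the same cluster centers, the solution can be considered stable under this simple check.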

22
Q

How can you assess “VALIDITY”?

A

By checking whether the clusters can be described and identified in practice. If clusters are hard to describe or you can’t identify who’s in them in practice → low validity, even if the algorithm produced them (i.e. USELESS)