Module 2 - Structured Data and Hindsight Flashcards

(6 cards)

1
Q

Explain the steps of K-Means Clustering

A
  1. Choose the number of clusters, K, to group the objects into
  2. Randomly initialize K centroids
  3. Calculate the distance between each instance and every centroid
  4. Assign each instance to the cluster with the nearest centroid; an instance that is closer to a different centroid than its current one is reassigned
  5. Recalculate each centroid as the mean of the instances assigned to it
  6. Repeat steps 3 to 5 until the cluster assignments no longer change, at which point the algorithm stops
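
The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the names `k_means`, `squared_distance` and `mean_point`, the sample points, and the choice to initialize by picking K of the data points at random are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 2: pick K distinct data points as the starting centroids (assumption).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Steps 3-4: assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 6: stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 5: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean_point(members)
    return centroids, assignments

# Hypothetical data: two visibly separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)
```

The `max_iterations` cap is a safety net for the sketch; on this small dataset the loop stops as soon as the assignments repeat.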
2
Q

How does the initial placement of centroids impact the clustering?

A

The initial centroids affect both the quality of the result and the speed of convergence. They can be selected at random or chosen to maximize the distance between them.

  • Different starting centroids can give different final clusters
  • Bad starts can cause slow convergence, which is computationally expensive (many iterations)
  • k-means++ can be used to pick smarter starting centroids: the first is chosen at random, and each later one is sampled with probability proportional to its squared distance from the nearest centroid already chosen, which spreads the centroids out
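
A minimal sketch of k-means++ style seeding, to make the last bullet concrete. The function name `kmeans_plus_plus_init`, the fixed seed and the sample points are assumptions for illustration, and the sketch assumes the data contains at least K distinct points.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_plus_plus_init(points, k, seed=0):
    rng = random.Random(seed)
    # The first centroid is a uniformly random data point.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Weight of each point: squared distance to its nearest chosen centroid.
        # Points already chosen get weight 0, so they cannot be picked again.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Sample the next centroid with probability proportional to its weight.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

# Hypothetical data: two separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
initial_centroids = kmeans_plus_plus_init(points, k=3)
```

Because far-away points carry larger weights, the starting centroids tend to land in different regions of the data, which is what reduces the chance of a bad start.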
3
Q

How do you evaluate clustering?

A
  • No objective ground truth to compare against
  • No single agreed-upon measure of quality

Goal is to obtain clusters with:
  • High intra-cluster similarity (objects in the same cluster are alike)
  • Low inter-cluster similarity (objects in different clusters are not)
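
One rough way to put numbers on the two goals above is to compare the average distance between points in the same cluster with the average distance between points in different clusters. This is only an illustrative sketch: treating Euclidean distance as the inverse of similarity, the helper name `mean_pairwise_distances`, and the sample data are assumptions, and the function expects at least one same-cluster pair and one cross-cluster pair.

```python
from itertools import combinations
from math import dist

def mean_pairwise_distances(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    intra, inter = [], []
    for (p, label_p), (q, label_q) in combinations(zip(points, labels), 2):
        # Same label: the pair measures cohesion; different label: separation.
        (intra if label_p == label_q else inter).append(dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Hypothetical clustering of six points into two clusters.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = [0, 0, 0, 1, 1, 1]
intra_distance, inter_distance = mean_pairwise_distances(points, labels)
```

A small intra-cluster distance together with a large inter-cluster distance corresponds to high intra-cluster similarity and low inter-cluster similarity.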
4
Q

When do we want to perform dimensionality reduction? What are the two options?

A

When the dataset has so many dimensions that it becomes hard to conduct exploratory data analysis (EDA).

Option 1: Attribute selection - keep only the most informative attributes and drop the others

Option 2: Dimensionality reduction - merge redundant attributes together into a lower-dimensional space
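
As a small illustration of Option 1, the sketch below ranks attributes by variance and keeps the top few. Using variance as a stand-in for "most informative" is an assumption made only for this example (variance depends on each attribute's scale, and other criteria are possible); the function name `select_by_variance`, the attribute names and the rows are hypothetical.

```python
from statistics import pvariance

def select_by_variance(rows, names, keep):
    """Return the `keep` attribute names with the highest variance, plus all variances."""
    columns = list(zip(*rows))  # turn rows into one tuple per attribute
    variances = {name: pvariance(col) for name, col in zip(names, columns)}
    ranked = sorted(names, key=lambda name: variances[name], reverse=True)
    return ranked[:keep], variances

# Hypothetical dataset: attribute "c" is constant, so it carries no information.
names = ["a", "b", "c"]
rows = [(1.0, 10.0, 5.0), (2.0, 20.0, 5.0), (3.0, 15.0, 5.0), (4.0, 30.0, 5.0)]
selected, variances = select_by_variance(rows, names, keep=2)
```

Here the constant attribute is dropped and the original attributes that remain are left unchanged, which is the difference from Option 2, where new combined dimensions are created.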

5
Q

What is PCA?

A

Principal Component Analysis: a linear projection of a high-dimensional space onto a lower-dimensional space (using matrix decomposition) that keeps as much of the data's variance as possible

6
Q

What are the steps of conducting PCA?

A
  1. Define a new coordinate system whose axes (the principal components) are the directions of maximum variance in the data
  2. Order the components by the variance they capture, from highest to lowest
  3. Select the first N components as the new, lower-dimensional space
  4. Each component combines multiple dimensions of the original space, giving more weight to certain dimensions, e.g. attributes that are highly correlated are combined
  5. This removes redundant information and reduces the complexity of computing similarity
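
The steps above can be sketched without any libraries. This version finds the components of the covariance matrix by power iteration with deflation, which is a swapped-in technique chosen to keep the example self-contained; library implementations typically use a matrix decomposition routine instead. All function names, the seed, the iteration count and the sample data are assumptions for illustration.

```python
import math
import random

def center(rows):
    """Subtract each attribute's mean so the data is centered at the origin."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance_matrix(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, rng, iterations=500):
    """Power iteration: direction of maximum variance and the variance along it."""
    d = len(cov)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, variance

def pca(rows, n_components, seed=0):
    rng = random.Random(seed)
    centered = center(rows)
    cov = covariance_matrix(centered)
    d = len(cov)
    components, variances = [], []
    # Steps 1-3: extract components in order of decreasing variance.
    for _ in range(n_components):
        v, var = top_component(cov, rng)
        components.append(v)
        variances.append(var)
        # Deflation: remove the variance already explained by v.
        cov = [[cov[i][j] - var * v[i] * v[j] for j in range(d)] for i in range(d)]
    # Step 4: each new coordinate is a weighted combination of the original ones.
    projected = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
                 for row in centered]
    return components, variances, projected

# Hypothetical data: two highly correlated attributes.
rows = [(2.0, 1.9), (4.0, 4.1), (6.0, 6.2), (8.0, 7.8), (10.0, 10.0)]
components, variances, projected = pca(rows, n_components=1)
```

With these two correlated attributes, the first component weights both attributes almost equally, so a single new coordinate replaces the redundant pair.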