Clustering Flashcards

(19 cards)

1
Q

Clustering types and algorithms

A

Types:

Centroid-based:
Clusters points based on proximity to a centroid.
Algorithms: K-Means, K-Medoids

Connectivity-based:
Clusters points based on proximity between clusters.
Algorithms:
Hierarchical clustering (agglomerative and divisive)

Density-based:
Clusters points based on their density instead of proximity.
Algorithms:
DBSCAN
OPTICS

2
Q

K means clustering

A

K-Means is a partitioning clustering algorithm.
It randomly divides the data into k clusters and iteratively improves the clustering until no better clusters can be found. The value of k is predetermined.

3
Q

K means clustering steps

A

Step-1: Select the number K to decide the number of clusters
Step-2: Select K random points as centroids
Step-3: Assign each data point to its closest centroid using
the Euclidean distance, which forms the K clusters
Step-4: Set a new centroid for each cluster by taking the average
of the points assigned to that cluster
Step-5: Repeat 3-4 until the centroids don't change
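
The steps above can be sketched in a few lines of NumPy (a minimal illustration, not a production implementation; the initial centroids are passed in explicitly here rather than chosen at random):

```python
import numpy as np

def kmeans(X, k, init, n_iter=100):
    """Plain K-Means: assign each point to its nearest centroid (Step 3),
    recompute each centroid as the mean of its cluster (Step 4),
    and stop when the centroids no longer move (Step 5)."""
    centroids = init.astype(float).copy()
    for _ in range(n_iter):
        # Euclidean distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# two well-separated groups; initialize with one point from each
X = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.], [10., 11.], [11., 10.]])
labels, centroids = kmeans(X, k=2, init=X[[0, 3]])
```

Seeding with one point per true group sidesteps the empty-cluster edge case this sketch does not handle.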

4
Q

K means clustering objective

A

The objective is to minimize the within-cluster sum of squares (WCSS): the sum of squared Euclidean distances between each point and the centroid of its cluster. This cost is used for evaluating and improving clustering results. The K-Means algorithm iteratively updates the centroids and point assignments to minimize this cost, leading to more compact and well-separated clusters.
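
The cost can be computed directly; a small NumPy helper (illustrative, using hand-picked points and centroids):

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squares: squared Euclidean distance
    from each point to the centroid of its assigned cluster."""
    return sum(((X[labels == j] - c) ** 2).sum()
               for j, c in enumerate(centroids))

X = np.array([[0., 0.], [0., 2.], [10., 10.], [10., 12.]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0., 1.], [10., 11.]])
cost = wcss(X, labels, centroids)  # each point is distance 1 from its centroid
```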

5
Q

Weaknesses of K-Means?

A

The number of clusters needs to be specified by the user
The algorithm is sensitive to outliers

6
Q

Why use K-Means

A

Easy to Understand and Explain
Computationally Efficient
Works Well with Large Datasets
Cluster Centers can provide meaningful insights

7
Q

How to choose number of k

A

A common method is the elbow method.
To find the optimal number of clusters, run K-Means for different values of k, calculate the WCSS for each, and plot WCSS against k to find the elbow point.

The elbow point is where adding more clusters no longer significantly
improves the WCSS.
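
A sketch of the elbow method with scikit-learn (assuming scikit-learn is available; `inertia_` is sklearn's name for the WCSS). On three synthetic blobs, the drop in WCSS flattens after k = 3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three well-separated 2-D blobs of 30 points each
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in (0.0, 5.0, 10.0)])

wcss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)
# plotting wcss against k would show the elbow at k = 3
```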

8
Q

The K-Medoids algorithm steps

A

Choose k medoids at random from the original dataset.
Assign each of the remaining points in the dataset to its closest medoid.
Iteratively replace one of the medoids with a non-medoid point if the swap improves the total clustering cost.
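
A naive PAM-style sketch of these steps in NumPy (illustrative only; real K-Medoids implementations use smarter swap orderings and caching):

```python
import numpy as np

def kmedoids(X, k, seed=0, n_iter=50):
    """Naive K-Medoids: greedily swap a medoid for a non-medoid
    whenever the swap lowers the total distance cost."""
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    medoids = rng.choice(n, size=k, replace=False)       # random initial medoids
    cost = D[:, medoids].min(axis=1).sum()
    improved = True
    while improved and n_iter > 0:
        improved, n_iter = False, n_iter - 1
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = h                              # try swapping medoid i for h
                trial_cost = D[:, trial].min(axis=1).sum()
                if trial_cost < cost:
                    medoids, cost, improved = trial, trial_cost, True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

X = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.], [10., 11.], [11., 10.]])
medoids, labels = kmedoids(X, k=2)
```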

9
Q

K means vs medoids

A

In K-Means, a cluster center is the mean of all points in the cluster and may not be an actual data point; in K-Medoids, the center (medoid) is always an actual data point from the dataset.
K-Medoids is more costly but can be used with any distance measure, while K-Means only works with Euclidean distance.

10
Q

Hierarchical clustering

A

Produces a set of nested clusters organized as a hierarchical tree, which can be visualized as a dendrogram.
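
A short example with SciPy (assuming SciPy is available): `linkage` builds the merge tree behind the dendrogram, and `fcluster` cuts it into flat clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
Z = linkage(X, method='average')                  # agglomerative merge tree
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 clusters
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree
```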

11
Q

Types of hierarchical clusterings

A

Agglomerative (bottom-up), Divisive (top-down)

12
Q

What is hierarchical clustering

A

Hierarchical clustering is a clustering algorithm based on distances between observations
(not distances from centroids)

13
Q

Pros and cons of average linkage

A

Average linkage does well at separating clusters when there is noise between clusters, but it is biased toward globular clusters.

14
Q

DBSCAN

A

DBSCAN is a density-based algorithm that identifies regions of high point density separated by areas of low density.
Density is defined as the number of points within a specified radius (ε).
A point is considered a core point if it has at least MinPts points (including itself) within that radius, indicating that it lies in a dense region of the data.
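
A small scikit-learn example (assuming scikit-learn is available; sklearn's `eps` and `min_samples` parameters correspond to ε and MinPts):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two dense groups plus one isolated point
X = np.array([[0., 0.], [0., 0.2], [0.2, 0.],
              [5., 5.], [5., 5.2], [5.2, 5.],
              [20., 20.]])
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
labels = db.labels_  # -1 marks noise points
```

The isolated point at (20, 20) has no neighbors within ε, so it is labeled noise (-1) rather than forced into a cluster.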

15
Q

When dbscan works well

A
  • Resistant to Noise
  • Can handle clusters of different shapes and sizes
16
Q

When dbscan does not work well

A
  • Cannot handle varying densities
  • Sensitive to parameters
17
Q

How to evaluate clustering

A

● External evaluation
Employ criteria not inherent to the clusters (e.g., expert specific knowledge or class labels)

● Internal evaluation
Employ criteria derived from the data itself (e.g., how similar points are within the same cluster, and how far apart points in different clusters are)

18
Q

Silhouette score

A

The silhouette score measures cluster quality: how well each point fits its own cluster compared with the nearest other cluster. A score close to 1 is good. (Internal evaluation method.)
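
With scikit-learn (assuming it is available), two tight, far-apart clusters score close to 1:

```python
import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[0., 0.], [0., 0.1], [10., 10.], [10., 10.1]])
labels = [0, 0, 1, 1]
score = silhouette_score(X, labels)  # near 1: tight, well-separated clusters
```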

19
Q

Cluster Purity

A

A metric used to evaluate the quality of clustering results against known class labels (external evaluation method). Each cluster is assigned its most frequent class, and purity is the fraction of all points that match their cluster's class.
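
A minimal NumPy sketch of purity, assuming integer class labels (illustrative helper, not a library function):

```python
import numpy as np

def purity(true_labels, cluster_labels):
    """Each cluster counts as its majority class; purity is the
    fraction of points matching their cluster's majority class."""
    total = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        total += np.bincount(members).max()  # size of the majority class
    return total / len(true_labels)

true_labels = np.array([0, 0, 1, 1, 1])
cluster_labels = np.array([0, 0, 0, 1, 1])
p = purity(true_labels, cluster_labels)  # (2 + 2) / 5
```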