Clustering types and algorithms
Types:
Centroid-based:
Cluster points based on proximity to a centroid.
Algorithms: K-Means, K-Medoids
Connectivity-based:
Cluster points based on the proximity (connectivity) between points, merging the closest clusters.
Algorithms:
Hierarchical clustering (agglomerative and divisive)
Density-based:
Cluster points based on their density instead of proximity.
Algorithms:
DBSCAN
OPTICS
K-Means clustering
K-Means is a partitioning clustering algorithm.
The algorithm divides the data randomly into k clusters and iteratively improves the clustering until it can no longer find better clusters. The value of k is predetermined.
K-Means clustering steps
Step-1: Select the number K to decide the number of clusters
Step-2: Select K random points as centroids
Step-3: Assign each data point to its closest centroid using the Euclidean distance, which will form the K clusters
Step-4: Set a new centroid for each cluster by taking the average of the points assigned to that cluster
Step-5: Repeat steps 3-4 until the centroids don't change
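The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the toy dataset, K value, and iteration cap are illustrative choices.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick K random data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs; K-Means should recover them as the 2 clusters
X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (20, 2)),
               np.random.default_rng(2).normal(5, 0.5, (20, 2))])
labels, centroids = kmeans(X, k=2)
```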
K-Means clustering objective
The objective is to minimize the within-cluster sum of squares (WCSS): the sum of squared distances between each point and the centroid of its cluster. This cost function provides a quantitative basis for evaluating and improving clustering results. The K-Means algorithm iteratively updates the centroids and point assignments to minimize this cost, leading to more compact and well-separated clusters.
Weaknesses of K-Means?
The number of clusters needs to be specified by the user
The algorithm is sensitive to outliers
Why use K-Means
Easy to Understand and Explain
Computationally Efficient
Works Well with Large Datasets
Cluster Centers can provide meaningful insights
How to choose the number k
A common method is the elbow method.
To find the optimal number of clusters, run K-Means for different values of k, calculate the WCSS for each, and plot WCSS against k to find the elbow point.
The elbow point is where adding more clusters no longer significantly reduces the WCSS.
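A sketch of the elbow method with scikit-learn, which exposes the WCSS of a fitted model as `inertia_`. The three-blob dataset is illustrative; on it the elbow should appear at k = 3.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs -> the elbow should appear at k = 3
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 5, 10)])

# Run K-Means for a range of k values and record the WCSS of each fit
wcss = {}
for k in range(1, 7):
    wcss[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

# Plotting wcss against k would show a sharp drop up to k = 3,
# then only marginal improvement afterwards (the "elbow")
```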
The K-Medoids algorithm steps
Choose k medoids at random from the original dataset.
Assign each of the remaining points in the dataset to its closest medoid. Iteratively replace one of the medoids with one of the non-medoids if doing so improves the total clustering cost.
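The swap step above can be sketched as a naive PAM-style search: try replacing each medoid with each non-medoid and keep any swap that lowers the total cost. This brute-force version is only suitable for tiny datasets; the toy data is illustrative.

```python
import numpy as np

def total_cost(X, medoid_idx):
    # Clustering cost: sum of distances from each point to its closest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def kmedoids(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k medoids at random from the dataset
    medoids = list(rng.choice(len(X), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        # Try replacing each medoid with each non-medoid; keep improving swaps
        for i in range(k):
            for j in range(len(X)):
                if j in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = j
                if total_cost(X, candidate) < total_cost(X, medoids):
                    medoids = candidate
                    improved = True
    # Final assignment of every point to its closest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return np.array(medoids), d.argmin(axis=1)

# Two well-separated blobs; each should get one medoid
X = np.vstack([np.random.default_rng(1).normal(0, 0.4, (10, 2)),
               np.random.default_rng(2).normal(5, 0.4, (10, 2))])
medoids, labels = kmedoids(X, k=2)
```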
K-Means vs K-Medoids
Unlike the centroid in K-Means clustering, which is the mean of all points in a cluster and may not necessarily be an actual data point, a medoid is an actual data point from the dataset.
Medoids are more costly to compute but can be used with any distance measure, while K-Means only works with the Euclidean distance.
Hierarchical clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram.
Types of hierarchical clustering
Agglomerative (bottom-up), Divisive (top-down)
What is hierarchical clustering
Hierarchical clustering is a clustering algorithm based on distances between observations
(not distances from centroids)
Pros and cons of average linkage
Average linkage does well at separating clusters when there is noise between them, but it is biased towards globular clusters.
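Agglomerative clustering with average linkage can be sketched with SciPy: build the full merge hierarchy, then cut the dendrogram into a chosen number of flat clusters. The two-blob data is illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated blobs of 10 points each
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(4, 0.3, (10, 2))])

# Agglomerative merging based on average pairwise distance between clusters
Z = linkage(X, method="average")
# Cut the hierarchical tree into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
```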
DBSCAN
DBSCAN is a density-based algorithm that identifies regions of high point density separated by areas of low density.
Density is defined as the number of points within a specified radius (ε).
A point is considered a core point if it has at least MinPts points (including itself) within that radius, indicating that it lies in a dense region of the data.
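A quick scikit-learn sketch of the definitions above: `eps` is the radius ε and `min_samples` is MinPts. The parameter values and toy data are illustrative choices for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus one far-away point that should be labelled noise
X = np.vstack([rng.normal(0, 0.2, (20, 2)),
               rng.normal(5, 0.2, (20, 2)),
               [[50.0, 50.0]]])

# eps = radius of the neighbourhood, min_samples = MinPts
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
# DBSCAN assigns the label -1 to noise points
```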
When DBSCAN works well
When clusters have arbitrary (non-globular) shapes and the data contains noise/outliers, since DBSCAN labels low-density points as noise.
When DBSCAN does not work well
When clusters have widely varying densities (a single ε/MinPts setting cannot fit all of them) or the data is high-dimensional, where distance-based density becomes less meaningful.
How to evaluate clustering
● External evaluation
Employ criteria not inherent to the clusters (e.g., expert domain knowledge or class labels)
● Internal evaluation
Employ criteria that are derived from the data itself (e.g., how similar are the points in the
same clusters, how far apart are points in different clusters)
Silhouette score
The silhouette score measures cluster quality: how well each point fits in its own cluster compared to the nearest other cluster. A score close to 1 is good. (Internal evaluation method.)
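A sketch comparing silhouette scores for a good clustering and a random labelling of the same data, using scikit-learn's `silhouette_score`. The toy data is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated blobs
X = np.vstack([rng.normal(0, 0.3, (25, 2)), rng.normal(5, 0.3, (25, 2))])

good = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
bad = rng.integers(0, 2, size=len(X))   # random labels for comparison

good_score = silhouette_score(X, good)  # near 1 for well-separated clusters
bad_score = silhouette_score(X, bad)    # near 0 for random labels
```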
Cluster Purity
A metric used to evaluate the quality of clustering results. (External evaluation method.)
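Purity can be computed by assigning each cluster its majority true class label and taking the fraction of points that match. A minimal sketch with hypothetical toy labels:

```python
import numpy as np

def purity(true_labels, cluster_labels):
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    correct = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        # Count the points belonging to the majority true class of this cluster
        correct += np.bincount(members).max()
    return correct / len(true_labels)

# One "misplaced" point out of six -> purity = 5/6
true_classes = [0, 0, 0, 1, 1, 1]
clusters     = [0, 0, 0, 1, 1, 0]
```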