Clustering types and algorithms
Types:
Centroid-based:
Cluster points based on proximity to a centroid.
Algorithms: K-Means, K-Medoids
Connectivity-based:
Cluster points based on the proximity (connectivity) between points, merging the closest clusters.
Algorithms:
Hierarchical clustering (agglomerative and divisive)
Density-based:
Cluster points based on their density instead of proximity.
Algorithms:
DBSCAN
OPTICS
K-Means clustering
K-Means is a partitioning clustering algorithm.
The algorithm divides the data randomly into k clusters and iteratively improves the clustering until it can no longer find better clusters. The value of k is predetermined.
K-Means clustering steps
Step-1: Select the number K to decide the number of clusters
Step-2: Select K random points as centroids
Step-3: Assign each data point to its closest centroid using the Euclidean distance, which will form the K clusters
Step-4: Set a new centroid for each cluster by taking the average of the points assigned to that cluster
Step-5: Repeat steps 3-4 until the centroids don't change
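The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the toy dataset, K value, and iteration cap are illustrative choices.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick K random data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs; K-Means should recover them as the 2 clusters
X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (20, 2)),
               np.random.default_rng(2).normal(5, 0.5, (20, 2))])
labels, centroids = kmeans(X, k=2)
```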
K-Means clustering objective
The objective is to minimize the within-cluster sum of squares (WCSS): the sum of squared distances between each point and the centroid of its cluster. This cost function provides a quantitative basis for evaluating and improving clustering results. The K-Means algorithm iteratively updates the centroids and point assignments to minimize this cost, leading to more compact and well-separated clusters.
Weaknesses of K-Means?
The number of clusters needs to be specified by the user
The algorithm is sensitive to outliers
Why use K-Means
Easy to Understand and Explain
Computationally Efficient
Works Well with Large Datasets
Cluster Centers can provide meaningful insights
How to choose the number k
A common method is the elbow method.
To find the optimal number of clusters, run K-Means for different values of k, calculate the WCSS for each, and plot WCSS against k to find the elbow point.
The elbow point is where adding more clusters no longer significantly reduces the WCSS.
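A sketch of the elbow method with scikit-learn, which exposes the WCSS of a fitted model as `inertia_`. The three-blob dataset is illustrative; on it the elbow should appear at k = 3.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs -> the elbow should appear at k = 3
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 5, 10)])

# Run K-Means for a range of k values and record the WCSS of each fit
wcss = {}
for k in range(1, 7):
    wcss[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

# Plotting wcss against k would show a sharp drop up to k = 3,
# then only marginal improvement afterwards (the "elbow")
```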
The K-Medoids algorithm steps
Choose k medoids at random from the original dataset.
Assign each of the remaining points in the dataset to its closest medoid. Iteratively replace one of the medoids with one of the non-medoids if doing so improves the total clustering cost.
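The swap step above can be sketched as a naive PAM-style search: try replacing each medoid with each non-medoid and keep any swap that lowers the total cost. This brute-force version is only suitable for tiny datasets; the toy data is illustrative.

```python
import numpy as np

def total_cost(X, medoid_idx):
    # Clustering cost: sum of distances from each point to its closest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def kmedoids(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k medoids at random from the dataset
    medoids = list(rng.choice(len(X), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        # Try replacing each medoid with each non-medoid; keep improving swaps
        for i in range(k):
            for j in range(len(X)):
                if j in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = j
                if total_cost(X, candidate) < total_cost(X, medoids):
                    medoids = candidate
                    improved = True
    # Final assignment of every point to its closest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return np.array(medoids), d.argmin(axis=1)

# Two well-separated blobs; each should get one medoid
X = np.vstack([np.random.default_rng(1).normal(0, 0.4, (10, 2)),
               np.random.default_rng(2).normal(5, 0.4, (10, 2))])
medoids, labels = kmedoids(X, k=2)
```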
K-Means vs K-Medoids
Unlike the centroid in K-Means clustering, which is the mean of all points in a cluster and may not necessarily be an actual data point, a medoid is an actual data point from the dataset.
Medoids are more costly to compute but can be used with any distance measure, while K-Means only works with the Euclidean distance.
Hierarchical clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram.
Types of hierarchical clustering
Agglomerative (bottom-up), Divisive (top-down)
What is hierarchical clustering
Hierarchical clustering is a clustering algorithm based on distances between observations
(not distances from centroids)
Pros and cons of average linkage
Average linkage does well at separating clusters when there is noise between them, but it is biased towards globular clusters.
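Agglomerative clustering with average linkage can be sketched with SciPy: build the full merge hierarchy, then cut the dendrogram into a chosen number of flat clusters. The two-blob data is illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated blobs of 10 points each
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(4, 0.3, (10, 2))])

# Agglomerative merging based on average pairwise distance between clusters
Z = linkage(X, method="average")
# Cut the hierarchical tree into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
```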
DBSCAN
DBSCAN is a density-based algorithm that identifies regions of high point density separated by areas of low density.
Density is defined as the number of points within a specified radius (ε).
A point is considered a core point if it has at least MinPts points (including itself) within that radius, indicating that it lies in a dense region of the data.
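A quick scikit-learn sketch of the definitions above: `eps` is the radius ε and `min_samples` is MinPts. The parameter values and toy data are illustrative choices for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus one far-away point that should be labelled noise
X = np.vstack([rng.normal(0, 0.2, (20, 2)),
               rng.normal(5, 0.2, (20, 2)),
               [[50.0, 50.0]]])

# eps = radius of the neighbourhood, min_samples = MinPts
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
# DBSCAN assigns the label -1 to noise points
```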
When DBSCAN works well
When clusters have arbitrary (non-globular) shapes and the data contains noise/outliers, since DBSCAN labels low-density points as noise.
When DBSCAN does not work well
When clusters have widely varying densities (a single ε/MinPts setting cannot fit all of them) or the data is high-dimensional, where distance-based density becomes less meaningful.
How to evaluate clustering
● External evaluation
Employ criteria not inherent to the clusters (e.g., expert domain knowledge or class labels)
● Internal evaluation
Employ criteria that are derived from the data itself (e.g., how similar are the points in the
same clusters, how far apart are points in different clusters)
Silhouette score
The silhouette score measures cluster quality: how well each point fits in its own cluster compared to the nearest other cluster. A score close to 1 is good. (Internal evaluation method.)
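A sketch comparing silhouette scores for a good clustering and a random labelling of the same data, using scikit-learn's `silhouette_score`. The toy data is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated blobs
X = np.vstack([rng.normal(0, 0.3, (25, 2)), rng.normal(5, 0.3, (25, 2))])

good = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
bad = rng.integers(0, 2, size=len(X))   # random labels for comparison

good_score = silhouette_score(X, good)  # near 1 for well-separated clusters
bad_score = silhouette_score(X, bad)    # near 0 for random labels
```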
Cluster Purity
A metric used to evaluate the quality of clustering results. (External evaluation method.)
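Purity can be computed by assigning each cluster its majority true class label and taking the fraction of points that match. A minimal sketch with hypothetical toy labels:

```python
import numpy as np

def purity(true_labels, cluster_labels):
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    correct = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        # Count the points belonging to the majority true class of this cluster
        correct += np.bincount(members).max()
    return correct / len(true_labels)

# One "misplaced" point out of six -> purity = 5/6
true_classes = [0, 0, 0, 1, 1, 1]
clusters     = [0, 0, 0, 1, 1, 0]
```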