What is Cluster Analysis and what is its Goal
Cluster analysis groups data points so that points within a cluster are similar to each other and dissimilar to points in other clusters.
Goal: Get a better understanding of the data
Types of clustering and the differences between them
Application areas of cluster analysis
Market Research:
-Identify similar customers
E-Commerce:
-Identify offers of the same product
Image Recognition:
-Identify parts of an image that belong to the same object (1. clustering identifies a region, 2. a classifier identifies the region as an object)
Which types of learning tasks exist and which type is cluster analysis
Supervised Learning:
Unsupervised Learning (Cluster analysis):
List cluster analysis methods
Describe K-Means Clustering
What is a Convergence Criterion and list them for K-means
Convergence criterion: stopping condition for the algorithm (a setting that fits the task's requirements)
Default:
- no or minimum change of centroids
Alternative:
- no or minimum re-assignments of data points to different clusters
- stop after x iterations
- min decrease in sum of squared errors (SSE)
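The default criterion above can be sketched in code. The following is a minimal k-means implementation, assuming numpy and toy 2-D data; the tolerance `tol` and the random initialization scheme are illustrative choices, not a fixed standard:

```python
import numpy as np

def kmeans(X, k, tol=1e-4, max_iter=100, seed=0):
    """Minimal k-means sketch. Stops when the centroids barely move
    (the default convergence criterion) or after max_iter iterations."""
    rng = np.random.default_rng(seed)
    # initialize centroids with k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute centroids as cluster means
        # (keep the old centroid if a cluster became empty)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # convergence criterion: no or minimal change of centroids
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids
```

Swapping the `if` condition for an iteration counter or a check on re-assigned labels gives the alternative criteria listed above.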
What is a cohesion measure
Similarity of the data points within a cluster:
How close the data points in a cluster are to their centroid
How to evaluate K-Means Clustering
Most common: Sum of Squared Errors (SSE)
Error: distance of a point to its nearest centroid
SSE: square the errors and sum them over the k clusters
SSE = SUM over clusters i=1..k ( SUM over points x in C_i ( dist(x, c_i)^2 ) )
Minimization task
Weaknesses of K-Means
How can you increase the chance to find good clusters
Option 1) Restart several times with different random seeds
Option 2) Run k-means with different k’s
- SSE of clusterings with different k's can't be compared directly (the higher k, the lower the SSE)
Workaround:
1. Choose k where the SSE improvement decreases
(knee value of k)
2. Use X-Means (automatically determines k)
- start with small k, split large clusters and check if
results improve
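Workaround 1 (the knee/elbow) can be sketched by running k-means for several k's and watching where the SSE improvement drops off. A minimal sketch, assuming scikit-learn is available and using generated toy data with three blobs; `n_init=10` implements the "restart several times" option, and `inertia_` is scikit-learn's name for the SSE:

```python
import numpy as np
from sklearn.cluster import KMeans

# toy data: three well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2))
               for loc in ([0, 0], [5, 5], [10, 0])])

# restart several times per k (n_init) and record the SSE (inertia_)
sses = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sses[k] = km.inertia_

# SSE always decreases as k grows; the knee is where the
# improvement per extra cluster drops off sharply (here: k = 3)
for k in range(2, 7):
    print(k, sses[k - 1] - sses[k])
```

Plotting SSE against k makes the knee visible by eye; here the improvement from k=2 to k=3 dwarfs the one from k=3 to k=4.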
How to handle outliers with K-Means
Use K-Medoids
Use DBSCAN
Advantages of K-Means
Efficient: time complexity O(t * k * n), with t: iterations, k: clusters, n: data points
Characteristics of DBSCAN
Density: number of points within a specified radius (eps)
Three different classes of data points:
1) Core points: Neighboring points (within eps) >= MinPts
2) Border points: Neighboring points (within eps) < MinPts, but a neighbor of a core point
3) Noise point: any point that is not 1 or 2
How does DBSCAN work
Time complexity of O(n log n): dominated by the neighborhood search for each point
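A minimal usage sketch, assuming scikit-learn is available and using toy data (two dense groups plus one isolated point); `eps` and `min_samples` are illustrative values, and scikit-learn labels noise points `-1`:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two dense groups plus one isolated noise point
X = np.array([[0, 0], [0, 0.1], [0.1, 0], [0.1, 0.1],
              [5, 5], [5, 5.1], [5.1, 5], [5.1, 5.1],
              [10, 10]], dtype=float)

db = DBSCAN(eps=0.3, min_samples=3).fit(X)
print(db.labels_)               # cluster id per point; -1 marks noise
print(db.core_sample_indices_)  # indices of the core points
```

With these parameters every point in the two dense groups is a core point, and the isolated point is classified as noise.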
How to determine suitable eps and MinPts values
Ideal: points in a cluster should have their kth nearest neighbour at roughly the same distance; noise points have their kth nearest neighbour at a farther distance
Adjusting:
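This heuristic is usually applied via a sorted k-distance plot. A minimal numpy-only sketch (brute-force distances, fine for small data): compute each point's distance to its kth nearest neighbour and sort descending; the knee of that curve suggests eps, with MinPts = k:

```python
import numpy as np

def k_distances(X, k):
    """Distance from each point to its kth nearest neighbour,
    sorted descending. Plotted, the knee of this curve suggests
    an eps value (with MinPts = k)."""
    # full pairwise distance matrix (brute force, O(n^2))
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d.sort(axis=1)            # column 0 is each point's distance to itself
    return np.sort(d[:, k])[::-1]
```

Noise points show up at the left of the curve with clearly larger kth-neighbour distances than the cluster points.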
When does DBSCAN work well
- If clusters have different shapes and sizes
When does DBSCAN not work well
What is Hierarchical Clustering
- Can be visualized as a Dendrogram
What is a Dendrogram
Strength of Hierarchical Clustering
What are the two main types of hierarchical clustering
Agglomerative (bottom-up):
Divisive (Top-down):
Agglomerative is used more often
The basic agglomerative algorithm
Key operation: computing the proximity matrix (updated after every merge)
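The basic agglomerative algorithm can be sketched with scipy, which handles the proximity-matrix updates internally. A minimal sketch with toy data; `method='single'` (min distance between clusters) is one illustrative choice of inter-cluster similarity:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy data: two compact groups
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])

# agglomerative (bottom-up): start with every point as its own cluster
# and repeatedly merge the closest pair; Z records each merge and can be
# drawn as a dendrogram via scipy.cluster.hierarchy.dendrogram
Z = linkage(X, method='single')

# cut the hierarchy into 2 flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```

Cutting the dendrogram at a different height (or a different `t`) yields a different flat clustering from the same hierarchy, which is a key strength of the method.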
How to determine Inter-Cluster Similarity