Document Clustering
Classification
- Assigns documents to a predefined set of categories
Methods of clustering
Clustering is all about distance metrics: documents that are close under the chosen metric end up in the same cluster.
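As one illustration of a distance metric for documents, here is a minimal sketch of cosine distance over raw word counts. The whitespace tokenization, the function name, and the choice of cosine distance itself are assumptions made for this example; the notes do not say which metric was used.

```python
import math
from collections import Counter

def cosine_distance(doc_a: str, doc_b: str) -> float:
    """Return 1 - cosine similarity of the two documents' word-count vectors."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    # Only words present in both documents contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here: an empty document is maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

Documents with the same word proportions get a distance near 0, and documents with no words in common get a distance of 1.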
K-means clustering
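A minimal sketch of the standard k-means loop (Lloyd's algorithm), assuming documents are already represented as numeric vectors and that the caller supplies the initial centroids. The function name and parameters are hypothetical.

```python
import math

def k_means(points, centroids, max_iters=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids
```

For example, `k_means([(0, 0), (0, 1), (10, 10), (10, 11)], [(0, 0), (10, 10)])` assigns the first two points to cluster 0 and the last two to cluster 1.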
Problems
Hierarchical Clustering
Agnes (agglomerative nesting) - bottom-up. Start with every document as its own cluster, merge the two most similar clusters, and keep merging until all documents are in one super cluster. Ward’s minimum variance is one way to choose which clusters to merge: pick the merge that creates the least increase in variance among the documents within the resulting cluster.
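The merging procedure can be sketched as follows, assuming documents are numeric vectors. The merge cost used here is the usual formulation of Ward's criterion (the increase in within-cluster sum of squares). The function names and the `n_clusters` stopping parameter are hypothetical, and the brute-force pair search is for clarity rather than speed.

```python
def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    return [sum(xs) / len(cluster) for xs in zip(*cluster)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist

def agnes(points, n_clusters=1):
    clusters = [[p] for p in points]  # every document starts as its own cluster
    while len(clusters) > n_clusters:
        # Pick the pair whose merge adds the least variance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters
```

With the default `n_clusters=1` the loop runs until a single super cluster remains. Passing a larger value stops the merging early.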
Diana (divisive analysis) - top-down. Start with one super cluster, then split repeatedly. Within a cluster, find the farthest outlier (the document least like the others) and break it off into a new cluster, then move over the documents that are more similar to it than to the rest. Repeat on the resulting clusters. Splitting is complete when there is one document per cluster, but you can stop sooner: stop when further division would produce clusters that are difficult to tell apart.
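A single split step of this kind might look like the sketch below. It assumes numeric vectors, Euclidean distance, and at least two points per cluster. "Least like the others" is interpreted as the largest average distance to the rest, and the function name is hypothetical.

```python
import math

def diana_split(points):
    """Split one cluster in two, seeding the new cluster with the farthest outlier."""
    def avg_dist(i, group):
        # Mean distance from point i to the members of group, excluding i itself.
        others = [j for j in group if j != i]
        if not others:
            return 0.0
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    remaining = list(range(len(points)))
    # The document least like the others starts the splinter cluster.
    seed = max(remaining, key=lambda i: avg_dist(i, remaining))
    remaining.remove(seed)
    splinter = [seed]
    moved = True
    while moved:
        moved = False
        for i in list(remaining):
            # Move i if it is closer, on average, to the splinter cluster.
            if len(remaining) > 1 and avg_dist(i, splinter) < avg_dist(i, remaining):
                remaining.remove(i)
                splinter.append(i)
                moved = True
    return [points[i] for i in remaining], [points[i] for i in splinter]
```

Applying `diana_split` again to each returned cluster continues the division. Deciding when to stop is left to the caller.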
Principal Component Analysis
- Reduces the data to a 2D plot even if there are 30,000 or more dimensions
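To make the projection concrete, here is a small pure-Python sketch that finds the top two principal components by power iteration with deflation and projects each row onto them. The function names, the fixed start vector, and the iteration count are assumptions. For tens of thousands of dimensions a numerical library would be the practical choice, since this version builds the full covariance matrix.

```python
import math

def top_component(cov, iters=200):
    """Power iteration: approximate the dominant eigenvector of a covariance matrix."""
    d = len(cov)
    v = [float(i + 1) for i in range(d)]  # arbitrary fixed start vector (an assumption)
    for _ in range(iters):
        w = [sum(cov[r][c] * v[c] for c in range(d)) for r in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            return [0.0] * d  # no variance left in any direction
        v = [x / norm for x in w]
    return v

def pca_2d(rows):
    """Project each row onto the two directions of greatest variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)] for i in range(d)]
    components = []
    for _ in range(2):
        v = top_component(cov)
        eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
        components.append(v)
        # Deflation: remove the variance already explained by v.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)] for i in range(d)]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components] for row in centered]
```

Each returned pair can be used directly as the (x, y) coordinates of a document in a scatter plot.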
Labelling clusters
Cluster Cohesion
Measures how closely related objects are in a cluster
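One common way to put a number on cohesion is the within-cluster sum of squared errors. The sketch below assumes numeric vectors, and the function name is hypothetical; the notes do not say which cohesion measure was intended.

```python
def cohesion_sse(cluster):
    """Sum of squared distances from each member to the cluster centroid."""
    centroid = [sum(xs) / len(cluster) for xs in zip(*cluster)]
    return sum(sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in cluster)
```

Lower values mean a tighter cluster. A cluster whose members are all identical scores 0.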
Cluster Separation
Measures how distinct or well-separated a cluster is from other clusters. Clusters should be as far apart as possible. If they aren’t far apart, then they should be one cluster.
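Separation can be quantified in several ways. A minimal sketch of one of them, the between-cluster sum of squares, is shown here under the assumption of numeric vectors and with a hypothetical function name.

```python
def separation_bss(clusters):
    """Size-weighted squared distance of each cluster centroid from the overall mean."""
    all_points = [p for cluster in clusters for p in cluster]
    overall = [sum(xs) / len(all_points) for xs in zip(*all_points)]
    total = 0.0
    for cluster in clusters:
        centroid = [sum(xs) / len(cluster) for xs in zip(*cluster)]
        total += len(cluster) * sum((c - m) ** 2 for c, m in zip(centroid, overall))
    return total
```

Higher values mean the clusters sit farther apart. A value close to 0 suggests the clusters largely overlap and may really be one cluster.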
F measure
Harmonic mean of precision and recall
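Written out, the harmonic mean of precision P and recall R is 2PR / (P + R). A minimal sketch follows; the function name is hypothetical, and returning 0 when both inputs are 0 is a convention chosen here to avoid dividing by zero.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention: avoid division by zero
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, the score is pulled toward the lower of the two values: perfect precision with zero recall still scores 0.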
Evaluating Cluster Quality
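Internal measures - judge the clustering from the data alone, e.g., cohesion and separation
External measures - compare the clustering against known labels, e.g., precision, recall, and F measure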
Entity Resolution
Identifying and linking/grouping different manifestations of the same real-world object, e.g., references to the same person in text (can’t go just by the name), web pages of the same business, etc.
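As a rough illustration of linking records without relying on the name, the sketch below groups records that share any identifying value, using a union-find structure. The `email` and `phone` fields, the normalization, and the function name are all hypothetical; real entity resolution usually also needs fuzzy matching.

```python
def resolve_entities(records):
    """Group record indices that share at least one identifier, directly or transitively."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group's representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # identifier value -> index of the first record carrying it
    for idx, record in enumerate(records):
        for key in ("email", "phone"):  # hypothetical identifying fields
            value = record.get(key)
            if not value:
                continue
            ident = (key, value.strip().lower())
            if ident in first_seen:
                parent[find(idx)] = find(first_seen[ident])  # link the two groups
            else:
                first_seen[ident] = idx

    groups = {}
    for idx in range(len(records)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())
```

Two records named "John Smith" stay separate unless they share an email or phone, while "J. Smith" and "Johnny Smith" are grouped if a chain of shared identifiers connects them.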
Clustering on graphs
Querying over a graph (traversing a graph) is faster than joining many tables, especially at the scale of a trillion records.
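To show what traversal looks like in practice, here is a minimal sketch that treats each connected component of an undirected graph as one cluster and finds the components by breadth-first search. Connected components are only one simple form of graph clustering, chosen here as an assumption. The edge-list input format and the function name are hypothetical.

```python
from collections import deque

def graph_clusters(edges):
    """Treat each connected component of an undirected graph as one cluster."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            # Follow edges directly instead of joining tables on shared keys.
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        clusters.append(sorted(component))
    return sorted(clusters)
```

Each node and edge is visited once, so the cost grows with the size of the neighbourhood actually explored.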