Clustering - what is it?
• Finding groups in data.
• Organising data into groups such that there is:
(1) high similarity within each group,
(2) low similarity across the groups.
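The two criteria above can be checked numerically on a toy example. The sketch below assumes a small, hand-made set of 2-D points that have already been assigned to two groups (the data and the function names are hypothetical, chosen only for illustration) and uses Euclidean distance as the notion of dissimilarity.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Hypothetical toy data: two hand-assigned groups of 2-D points.
groups = {
    "a": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "b": [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)],
}

def mean_within_distance(groups):
    """Average distance between pairs of points that share a group."""
    pairs = [
        dist(p, q)
        for points in groups.values()
        for p, q in combinations(points, 2)
    ]
    return sum(pairs) / len(pairs)

def mean_across_distance(groups):
    """Average distance between pairs of points from different groups."""
    pairs = [
        dist(p, q)
        for points_a, points_b in combinations(groups.values(), 2)
        for p in points_a
        for q in points_b
    ]
    return sum(pairs) / len(pairs)

within = mean_within_distance(groups)
across = mean_across_distance(groups)
print(f"within: {within:.2f}, across: {across:.2f}")
```

For a good grouping, the within-group average should be clearly smaller than the across-group average (small distance corresponds to high similarity).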
Is clustering the same as classification?
No
• In classification, class labels can be found directly in the data, e.g., blood type. In clustering, no labels are given.
• The goals differ: clustering aims to “understand” the data better (explore) and to organise the information we have.
Distance measures
Jaccard distance is used when the possession of a common characteristic between two items is important, but the common absence of a characteristic is not.
• Especially useful when dealing with problems that involve (large) sets of
characteristics that may not be ‘symmetrically’ important.
• Text mining: compare whether two documents contain the same words.
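A minimal sketch of the Jaccard distance for the text-mining case, assuming each document is reduced to the set of words it contains. The function name, the example sentences, and the convention for two empty sets are assumptions made for illustration. Words absent from both documents never enter the calculation, which is the asymmetry described above.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Hypothetical documents, tokenised by splitting on whitespace.
doc1 = "the cat sat on the mat".split()
doc2 = "the dog sat on the log".split()

# Shared words: {the, sat, on} (3); all distinct words: 7.
print(jaccard_distance(doc1, doc2))  # 1 - 3/7, about 0.571
```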
Cosine distance is often encountered in text mining and recommendation engines.
Edit distance (Levenshtein metric)
• Applications: autocorrect (spelling mistakes).
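The edit distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below uses the standard dynamic-programming formulation; the tiny vocabulary and the "pick the closest word" rule are hypothetical stand-ins for a real autocorrect, which would typically also use word frequencies and context.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions from s to t."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop
    # previous[j] holds the distance between the processed prefix of s and t[:j].
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]  # turning s[:i] into the empty string takes i deletions
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(
                previous[j] + 1,         # delete char_s
                current[j - 1] + 1,      # insert char_t
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3

# Hypothetical autocorrect: suggest the vocabulary word closest to the typo.
vocabulary = ["spelling", "spring", "sailing"]
typed = "speling"
suggestion = min(vocabulary, key=lambda word: levenshtein(typed, word))
print(suggestion)  # spelling
```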
Euclidean distance
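Euclidean distance is the straight-line distance between two points: the square root of the sum of squared coordinate differences. A minimal sketch, with a hypothetical function name and example points:

```python
from math import sqrt

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```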
Manhattan distance
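Manhattan distance is the sum of the absolute coordinate differences, i.e. the distance travelled when movement is restricted to axis-parallel steps, as on a city grid. A minimal sketch, with a hypothetical function name and example points:

```python
def manhattan_distance(p, q):
    """Sum of absolute coordinate differences between two points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan_distance((0, 0), (3, 4)))  # 7
```

For the same pair of points the straight-line distance is 5, so the Manhattan distance is never smaller than the straight-line one.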
Cosine distance
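The name comes from how it is measured: it is based on the cosine of the angle between two vectors, usually as 1 minus that cosine. Only the direction of the vectors matters, not their length.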
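A minimal sketch of cosine distance, assuming the common definition of 1 minus the cosine similarity. The function name and the example vectors (which could stand for word counts of two documents) are hypothetical.

```python
from math import sqrt

def cosine_distance(u, v):
    """1 - cos(angle between u and v); undefined for zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

# Same direction, different length: distance is (numerically) zero.
print(cosine_distance([1, 2, 0], [2, 4, 0]))

# Perpendicular vectors: distance is 1.
print(cosine_distance([1, 0], [0, 1]))
```

Because only the angle matters, a long document and a short document with the same word proportions end up at distance close to zero.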