Types of Semantic Similarity
Word Similarity
When two words function in a similar way (i.e., can be used interchangeably).
Statistical approach - use statistics to measure how closely two words are associated in a corpus
Structural Approach
Dependent word probabilities
Jack & Jill example: compare the probability of "Jack" and "Jill" co-occurring against what independence would predict. When the dependent (joint) probability is higher than the independent one, there is some kind of relationship between the words.
PPMI, the positive variant of PMI
Co-occurrence
Pointwise Mutual Information
PMI(x, y) = log2( P(x, y) / (P(x) P(y)) )
If the PMI is 0 or lower, throw it out: the words are not similar, so negative values are changed to zero (the "positive" in PPMI)
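The PMI/PPMI definitions above can be sketched as follows (a minimal sketch; the function names and the Jack & Jill probabilities are illustrative, not from the notes):

```python
import math

def pmi(p_xy, p_x, p_y):
    # PMI(x, y) = log2( P(x, y) / (P(x) P(y)) )
    return math.log2(p_xy / (p_x * p_y))

def ppmi(p_xy, p_x, p_y):
    # PPMI clamps negative PMI values to zero
    return max(pmi(p_xy, p_x, p_y), 0.0)

# Jack & Jill example: if each word appears with probability 0.01,
# independence predicts a joint probability of 0.01 * 0.01 = 0.0001.
# An observed joint probability of 0.0008 signals an association:
print(pmi(0.0008, 0.01, 0.01))  # log2(8) = 3.0
```

A joint probability below the independence prediction gives a negative PMI, which PPMI reports as 0.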
Vector semantics, aka distributional semantics
A word is characterized by the company it keeps
Latent Semantic Analysis (LSA)
Algorithms to reduce dimensions in the vector space
Cosine similarity
Looks at the angle between vectors: the smaller the angle, the more directionally similar the vectors, and so the more similar the words
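Cosine similarity can be sketched directly from its definition (a minimal sketch with made-up toy vectors):

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# A vector and a scaled copy point in the same direction:
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0
```

Because only the angle matters, a word vector and any positive multiple of it score 1.0, while orthogonal vectors score 0.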
Document similarity
The heart and soul of NLP
Jaccard similarity
How many terms the two documents share:
overlap is the set of words they have in common
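The term-overlap idea can be sketched as (a minimal sketch; whitespace splitting stands in for a real tokenizer):

```python
def jaccard(doc_a, doc_b):
    # |A ∩ B| / |A ∪ B| over the sets of terms in each document
    a = set(doc_a.lower().split())
    b = set(doc_b.lower().split())
    return len(a & b) / len(a | b)

print(jaccard("the cat sat", "the cat ran"))  # 2 shared / 4 total = 0.5
```

Note that Jaccard ignores term frequency entirely: it only asks whether a term appears, not how often.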
Cosine Similarity
Efficient on sparse vectors
Auto-adjusts for documents of different lengths
Document similarity is the angle
Dot product: multiply every feature in doc 1 against the corresponding feature in doc 2, then add the products
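On sparse vectors, the dot product only needs to touch the features the two documents share, which is why cosine is efficient here. A minimal sketch with dict-based sparse vectors (the documents and weights are made up):

```python
import math

def sparse_cosine(d1, d2):
    # Dot product over matching features only; terms missing from
    # either document contribute zero.
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2)

doc1 = {"nlp": 2.0, "semantics": 1.0}
doc2 = {"nlp": 1.0, "parsing": 3.0}
print(sparse_cosine(doc1, doc2))  # ≈ 0.283
```

Dividing by the vector lengths is what auto-adjusts for documents of different lengths: a long document with the same term proportions gets the same score.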
Why don’t we want term frequency?
Why is TF-IDF so important?
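Raw term frequency overweights words like "the" that appear everywhere; TF-IDF discounts a term by how many documents contain it. A minimal sketch of the standard tf × idf weighting (the toy corpus is made up):

```python
import math

def tf_idf(term, doc, corpus):
    # tf: frequency of the term within this document
    tf = doc.count(term) / len(doc)
    # idf: log of (number of documents / documents containing the term)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
# "the" appears in every document, so its idf (and weight) is 0:
print(tf_idf("the", corpus[0], corpus))  # 0.0
```

A term that is frequent in one document but rare across the corpus gets the highest weight, which is exactly what makes it discriminative for similarity.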
Documents as Probability Distribution
Hellinger distance:
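When documents are treated as probability distributions over terms, Hellinger distance compares them; a minimal sketch of the standard formula (the example distributions are made up):

```python
import math

def hellinger(p, q):
    # H(P, Q) = (1 / sqrt(2)) * sqrt( sum_i (sqrt(p_i) - sqrt(q_i))^2 )
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q))) / math.sqrt(2)

print(hellinger([0.5, 0.5], [0.5, 0.5]))  # 0.0: identical distributions
```

It ranges from 0 (identical distributions) to 1 (no overlapping support), so it behaves like a proper distance where cosine behaves like a similarity.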
BM25
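A minimal sketch of one term's Okapi BM25 contribution, using the common Lucene-style idf variant; the parameter defaults k1=1.5 and b=0.75 are conventional choices, not from the notes:

```python
import math

def bm25_term(tf, df, n_docs, doc_len, avg_len, k1=1.5, b=0.75):
    # idf rewards terms that appear in few documents
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # tf saturates (controlled by k1) and is normalized by document
    # length relative to the corpus average (controlled by b)
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_len))
```

Unlike raw TF-IDF, repeated occurrences of a term give diminishing returns, and long documents are penalized for length rather than rewarded.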
WordNet: get synsets to match documents that are similar but don't share many of the same words. Replace words with their synsets.
Probability
20% chance that randomly selecting a word
Similarity does not equal synonymy
-collocation in key terms: "boot" + "camp" vs. "bootcamp", which are similar but not synonymous
Word2Vec
Lightweight neural network model
Skip-gram
Opposite of CBOW model
Focus word is the single input vector, and the target context words are the output layer
Works well with a small amount of training data (<100k samples)
-represents rare words and phrases well
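The skip-gram setup (focus word in, context words out) can be illustrated by generating its training pairs; a minimal sketch with a made-up window size and toy sentence:

```python
def skipgram_pairs(tokens, window=2):
    # Skip-gram: each focus word predicts its surrounding context words.
    # (CBOW reverses this: the context predicts the focus word.)
    pairs = []
    for i, focus in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((focus, tokens[j]))
    return pairs

print(skipgram_pairs(["jack", "and", "jill"], window=1))
# [('jack', 'and'), ('and', 'jack'), ('and', 'jill'), ('jill', 'and')]
```

Each (focus, context) pair is one training sample, which is why even a rare word contributes several samples and ends up with a usable embedding.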
CBOW
Doc2Vec
Dynamic word embeddings