Distributional hypothesis
If two words have similar contexts, we can assume that they have similar meanings.
A distributional approach to lexical semantics
Context windows
"I bake bread for breakfast"
Bag of Words (BoW)
Grammatical context
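A minimal sketch of a bag-of-words context window, using the example sentence above (the function name and window size are my own choices):

```python
def context_window(tokens, target, size=2):
    """Collect a bag-of-words context: the `size` tokens on each side of the target."""
    i = tokens.index(target)
    left = tokens[max(0, i - size):i]
    right = tokens[i + 1:i + 1 + size]
    return left + right

tokens = "I bake bread for breakfast".split()
print(context_window(tokens, "bread"))  # ['I', 'bake', 'for', 'breakfast']
```

In the BoW view the order of these context words is ignored; only their identity (and possibly frequency) matters.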
Tokenization
Splitting a text into sentences and words, or other units.
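A rough regex-based tokenizer sketching the idea (real tokenizers handle abbreviations, hyphenation, etc.; this is only illustrative):

```python
import re

def tokenize(text):
    """Split text into sentences, then each sentence into word tokens."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [re.findall(r"\w+(?:'\w+)?", s) for s in sentences]

print(tokenize("I bake bread. It smells great!"))
# [['I', 'bake', 'bread'], ['It', 'smells', 'great']]
```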
Stop-list
Filter out closed-class words or function words. The idea is that only content words provide relevant context.
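A sketch of stop-word filtering with a tiny illustrative stop-list (a real stop-list would be much larger):

```python
# Tiny illustrative stop-list of function words; not exhaustive.
STOP_WORDS = {"the", "a", "an", "is", "for", "of", "to", "in", "i"}

def remove_stop_words(tokens):
    """Keep only content words: drop closed-class/function words on the stop-list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("I bake bread for breakfast".split()))  # ['bake', 'bread', 'breakfast']
```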
Lemmatized string from the raw string “The programmer’s program had been programmed”.
“The programmer ‘s program have be program”
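To make the mapping concrete, here is a toy lemmatizer over the example above. The lemma table is hand-made for this one sentence; a real system would use a morphological analyzer or lexicon:

```python
# Toy lemma table covering only the example sentence; purely illustrative.
LEMMAS = {"programmer's": "programmer 's", "had": "have", "been": "be", "programmed": "program"}

def lemmatize(tokens):
    return " ".join(LEMMAS.get(t.lower(), t) for t in tokens)

print(lemmatize("The programmer's program had been programmed".split()))
# The programmer 's program have be program
```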
“Relatedness” vs. “Sameness”
Similarity in domain:
{car, gas, road, service, traffic, driver}
Similarity in content:
{car, train, bicycle, truck, vehicle, airplane}
Vector space model
Semantic spaces
Semantic spaces AKA distributional semantic models or word space models
A semantic space is a vector space model where:
One standard metric for spatial proximity
Euclidean distance:
d(x, y) = √(Σᵢ (xᵢ − yᵢ)²)
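A direct implementation of the Euclidean distance between two vectors:

```python
import math

def euclidean_distance(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2)"""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))  # 5.0
```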
Norm of a vector
The length of the vector: ||x|| = √(Σᵢ xᵢ²)
Potential problem with Euclidian distance
It is very sensitive to extreme values and the length of the vectors.
Overcome length bias
Normalize vector to unit length
Divide the vector by its length (norm).
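A sketch of the norm and of unit-length normalization (dividing each component by the norm):

```python
import math

def norm(x):
    """||x|| = sqrt(sum_i x_i^2), the Euclidean length of the vector."""
    return math.sqrt(sum(xi ** 2 for xi in x))

def normalize(x):
    """Divide each component by the norm, giving a unit-length vector."""
    n = norm(x)
    return [xi / n for xi in x]

print(normalize([3.0, 4.0]))  # [0.6, 0.8]
```

After normalization every vector lies on the unit sphere, so length differences can no longer dominate the distance.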
Cosine of angle between vectors
cos(x, y) = (x · y) / (||x|| ||y||)
Cosine of angle between unit vectors
cos(x, y) = x · y (just the dot product, since ||x|| = ||y|| = 1)
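A sketch of the cosine measure; for unit vectors the denominator is 1, leaving just the dot product:

```python
import math

def cosine(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi ** 2 for xi in x))
    ny = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (nx * ny)

print(cosine([1.0, 0.0], [1.0, 1.0]))  # about 0.707, i.e. a 45-degree angle
```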
The cosine measures ____ rather than distance
The cosine measures proximity rather than distance
___ has the same relative rank order as the Euclidean distance for _____
Cosine measure has the same relative rank order as the Euclidean distance for unit vectors!
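This follows because for unit vectors d(x, y)² = 2 − 2·cos(x, y), so distance is a decreasing function of cosine. A small sanity-check sketch (the example vectors are arbitrary):

```python
import math

def unit(x):
    n = math.sqrt(sum(xi ** 2 for xi in x))
    return [xi / n for xi in x]

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cos_unit(x, y):
    return sum(a * b for a, b in zip(x, y))  # vectors are already unit length

query = unit([1.0, 2.0])
candidates = [unit(v) for v in ([2.0, 1.0], [1.0, 3.0], [0.5, 1.0])]

# Rank candidates by increasing distance and by decreasing cosine:
by_distance = sorted(range(3), key=lambda i: euclid(query, candidates[i]))
by_cosine = sorted(range(3), key=lambda i: -cos_unit(query, candidates[i]))
print(by_distance == by_cosine)  # True: same relative rank order
```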