Different forms of (static) word embeddings
(representing words mathematically)
universal dependencies tagset classes
UDT: open class
UDT: closed class words
UDT: other
Why might it be useful to predict upcoming words or assign probabilities to sentences?
the ability to predict upcoming words or assign probabilities to sentences lets a system rank competing hypotheses, which underpins applications such as speech recognition, machine translation, spelling correction, and autocomplete, making interaction with language technology more accurate and natural
one-hot vector
word2vec + process
word embeddings
cosine similarity
quantifies similarity between two vectors by calculating the cosine of the angle between them
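the definition above can be sketched directly in code; a minimal stand-alone version (not tied to any particular embedding library):

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# vectors pointing in the same direction -> similarity near 1
# orthogonal vectors -> similarity 0
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

note that cosine similarity ignores vector length, so two words with very different frequencies can still score as highly similar if their vectors point the same way.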
representing words as discrete symbols
problem with representing words as discrete symbols + solution
what drives semantic similarity
how do we approximate semantic similarity
representing words by their context: distributional semantics
distributional semantics
a word’s meaning is given by the words that appear frequently close-by
‘you shall know a word by the company it keeps’
context in distributional semantics
the set of words that appear nearby a target word within a fixed-size window
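a fixed-size window can be sketched as follows; the function name and the window size of 2 are illustrative choices, not part of any standard API:

```python
def context_words(tokens, t, window=2):
    # words within +/- window positions of the target at index t,
    # excluding the target itself
    left = tokens[max(0, t - window):t]
    right = tokens[t + 1:t + 1 + window]
    return left + right

sent = "you shall know a word by the company it keeps".split()
# target = "word" at index 4
print(context_words(sent, 4, window=2))  # ['know', 'a', 'by', 'the']
```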
word vectors (word embeddings/word representations)
properties of dense word embeddings
problem with word embeddings
they can reinforce and propagate biases present in the data they are trained on
skip-gram algorithm
-> to identify context words, we define a window of size j: for the center word at position t, the model treats the words at positions t - j through t + j (excluding t itself) as the context
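generating the (center, context) training pairs that skip-gram learns from can be sketched like this; a toy version assuming a tokenized sentence and window size j:

```python
def skipgram_pairs(tokens, j=2):
    # one (center, context) pair for each context word
    # at positions t - j .. t + j around each center word at t
    pairs = []
    for t, center in enumerate(tokens):
        for offset in range(-j, j + 1):
            c = t + offset
            if offset != 0 and 0 <= c < len(tokens):
                pairs.append((center, tokens[c]))
    return pairs

print(skipgram_pairs(["the", "cat", "sat"], j=1))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

each pair becomes one training example: the model is asked to predict the context word given the center word.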
word2vec objective function
minimizing the objective function <-> maximizing predictive accuracy
calculating L(theta) = P(O|C)
softmax function: maps arbitrary values x_i to a probability distribution p_i
P(O|C): dot product
measures the similarity of O and C
a larger dot product indicates higher similarity, which translates to a higher probability
P(O|C): exponentiation
exponentiating the dot product ensures a positive result, so the values can be normalized into probabilities
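the dot-product, exponentiation, and softmax steps above fit together as follows; a minimal sketch with made-up 2-dimensional vectors (v_c for the center word, one u_o per candidate outside word), purely for illustration:

```python
import math

def softmax(xs):
    # map arbitrary scores x_i to probabilities p_i that sum to 1
    exps = [math.exp(x) for x in xs]  # exp ensures positivity
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical toy vectors, not learned from data
v_c = [1.0, 0.5]                      # center word vector
outside = {"dog": [0.9, 0.4],         # candidate outside words
           "cat": [0.8, 0.8],
           "car": [-1.0, 0.2]}

# score each candidate by its dot product with the center vector
scores = [sum(a * b for a, b in zip(v_c, u)) for u in outside.values()]
probs = softmax(scores)

# a larger dot product (more similar vector) gets a higher probability
print(dict(zip(outside, probs)))
```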