What are the problems with resources like WordNet?
What are the problems with one-hot vectors?
• Example: in web search, if user searches for “Seattle motel”, we would like to match documents containing “Seattle hotel”.
• But
Motel = [0,0,0,1,0,…,0,0]
Hotel = [0,1,0,0,0,…,0,0]
• These two vectors are orthogonal → every pair of words is orthogonal, and hence all word pairs have the same “distance” to each other
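A minimal sketch of the problem, assuming a hypothetical 10-word vocabulary with “motel” and “hotel” at arbitrary indices: the dot product between any two distinct one-hot vectors is zero, so one-hot encodings carry no similarity signal.

```python
import numpy as np

# Hypothetical 10-word vocabulary; indices 3 and 1 are arbitrary choices
vocab_size = 10
motel = np.zeros(vocab_size); motel[3] = 1.0
hotel = np.zeros(vocab_size); hotel[1] = 1.0

# Dot product of two distinct one-hot vectors is always 0
print(np.dot(motel, hotel))  # 0.0 -> no notion of similarity
```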
How do you represent the meaning of a word?
• Distributional semantics (aka distributional hypothesis): A word’s
meaning is given by the words that frequently appear close-by.
“You should know a word by the company it keeps” (J. R. Firth)
“Words are similar, if they occur in similar contexts” (Harris)
What could a distributional-semantics-based word vector solution look like?
Solution 1: count co-occurrence of words (co-occurrence matrix)
• Capture in which contexts a word appears
• Context is modelled using a window over the words
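The two bullets above can be sketched as follows; this is a toy implementation (window size and corpus are made up) of counting (word, context-word) pairs within a symmetric window.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count (word, context-word) pairs within a symmetric window."""
    counts = Counter()
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

toks = "the cat sat on the mat".split()
counts = cooccurrence_counts(toks, window=1)
print(counts[("cat", "sat")])  # 1
```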
Give details about the co-occurrence matrix.
Problems with SVD + co-occurrence matrix?
• Instead of using the high-dimensional original co-occurrence matrix M, use U(t)
(dimension t is given by the user)
Cons:
• High computational cost for large datasets
• Cannot be dynamically updated with new words
• Didn’t work too well in practice
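A hedged sketch of the SVD step on a toy 4-word co-occurrence matrix (values are illustrative): keep only the first t singular dimensions to get t-dimensional word vectors.

```python
import numpy as np

# Toy symmetric co-occurrence matrix M for a 4-word vocabulary
M = np.array([[0, 2, 1, 0],
              [2, 0, 3, 1],
              [1, 3, 0, 2],
              [0, 1, 2, 0]], dtype=float)

U, S, Vt = np.linalg.svd(M)
t = 2                           # target dimension, chosen by the user
word_vecs = U[:, :t] * S[:t]    # one t-dimensional vector per word
print(word_vecs.shape)          # (4, 2)
```

Note the cost issue from the bullets above: full SVD is cubic in the matrix size, which is why this approach struggles on large vocabularies.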
How do you learn low-dimensional word vectors?
Target:
• build a dense vector for each word, chosen so that it is similar to vectors of
words that appear in similar contexts
• Dense: in contrast to co-occurrence matrix word vectors, which are sparse
(high dimensional)
• Word vectors are sometimes called word embeddings or word representations. They are a distributed representation.
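With dense vectors (unlike one-hot vectors), similarity becomes measurable. A small sketch with made-up vectors; in practice the vectors come from training.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up dense vectors for illustration only
motel = np.array([0.7, 1.2, -0.3])
hotel = np.array([0.6, 1.1, -0.2])
print(cosine(motel, hotel))  # close to 1 -> similar
```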
What is word2vec and what is the idea of word2vec?
Check slide 11 pg 18 - 22
What should you do with negative sampling?
Down-sample the non-contextual words, so that the numbers of contextual and
non-contextual words in O are the same
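A minimal sketch of drawing negative samples, with hypothetical vocabulary and context. Sampling here is uniform for simplicity; word2vec actually samples from a unigram distribution raised to the 0.75 power.

```python
import random

random.seed(0)  # for reproducibility of this sketch

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def negative_samples(context_words, k):
    """Sample k words from outside the context as negatives
    (uniform here; word2vec uses a unigram^0.75 distribution)."""
    non_context = [w for w in vocab if w not in context_words]
    return random.sample(non_context, k)

context = {"cat", "sat"}
negs = negative_samples(context, k=2)
print(negs)  # 2 words drawn from outside the context
```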
Give details about the dot product in word2vec.
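As a hedged illustration (the vectors below are hypothetical): in skip-gram with negative sampling, the dot product between a centre-word vector and a context-word vector acts as the similarity score, and a sigmoid turns it into the probability that the pair is a real (word, context) pair.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical centre-word and context-word vectors
v_center = np.array([0.5, -0.2, 0.8])
u_context = np.array([0.4, -0.1, 0.9])

score = np.dot(v_center, u_context)  # raw similarity score
prob = sigmoid(score)                # P(pair is a real (word, context) pair)
print(prob > 0.5)  # True: high dot product -> likely context pair
```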
What are word2vec’s two approaches?
CBOW (continuous bag-of-word): Given a context, predict the missing word
• same procedure ____ every year
• as long ___ you sing
• please stay ___ you are
Skip-gram: given a word, predict the context words
• ____ ______ as _____ ______
• If window size is two, we aim to predict:
• (w, c−2), (w, c−1), (w, c+1) and (w, c+2)
Note that order information is not preserved, i.e. we do not distinguish whether a word is more likely to occur before or after the centre word
Name Embedding Models (Two at least)
Word2vec
GloVe
Many other word embedding models exist; these two are (arguably) the most
popular
Bengio’s NNLM vs word2vec
• Uni-directional vs bi-directional
• NNLM (neural network based language model) predicts the next word using
the preceding words (left to right prediction)
• Word2vec predicts both preceding words and succeeding words (or using
context words to predict the centre word, in CBOW), hence is bi-directional
prediction
Why does word2vec work?
• Not all neural embeddings are good
• Mikolov et al. (2013) survey four models (Recurrent, MLP NNLM, CBOW, Skip-Gram)
• Some of them work quite poorly (under their parametrizations)
• Skip-Gram and CBOW are the “simplest” possible models
• Can run on much more data
• And are much faster than predecessor neural models
• “Tricks” play an important role (negative sampling)
• It took a long stream of experimentation to make neural net language models
successful (~10 years)
Why do “similar” words have similar vectors under skip-gram or CBOW?
• Word2vec training pushes words in the same context window to have similar vectors (i.e. a high dot product)
Which notion of “similarity” does word2vec capture?
Mainly semantic (distributional) similarity, but also syntactic/morphological similarity