Assignment 8: Word2Vec

Deadline: December 13th, 9am

This week, we will look at “the” classic model for learning word embeddings. This will be another tutorial-based assignment; find the link here.

The key points are:

Questions for Understanding

As in the last assignment, answer these questions in your submission to make sure you understand what is happening in the tutorial code!

Possible Improvements & Extensions

Optional: CBOW Model

The tutorial only covers the Skip-gram model; however, the same paper also proposed the (perhaps more intuitive) Continuous Bag-of-Words (CBOW) model. Here, instead of predicting the context from the center word, it is the other way around: the center word is predicted from its context. If you are looking for more of a challenge and want to implement a model by yourself, the changes should roughly be as follows (see the sketch after this list):

- The input is the whole context window rather than a single center word.
- The context word embeddings are combined into a single vector, e.g. by averaging them.
- The model then predicts the center word from this combined context vector.
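
To make the swap concrete, here is a minimal PyTorch-style sketch of such a CBOW model. It assumes you follow the tutorial in PyTorch; the names (`CBOW`, `vocab_size`, `embed_dim`) are illustrative, not taken from the tutorial code.

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Minimal CBOW sketch: average the context embeddings,
    then score a candidate center word (illustrative names throughout)."""

    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embed_dim)   # context ("input") vectors
        self.out_embed = nn.Embedding(vocab_size, embed_dim)  # center ("output") vectors

    def forward(self, context: torch.Tensor, center: torch.Tensor) -> torch.Tensor:
        # context: (batch, window_size) indices of the surrounding words
        # center:  (batch,) index of the word to predict
        ctx = self.in_embed(context).mean(dim=1)  # (batch, embed_dim), averaged context
        ctr = self.out_embed(center)              # (batch, embed_dim)
        # Score = dot product between averaged context and candidate center word
        return (ctx * ctr).sum(dim=1)             # (batch,) logits
```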

The rest stays pretty much the same. You will still need to generate negative examples through sampling, since computing the full softmax over the vocabulary is just as inefficient as for the Skip-gram model.
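
For reference, here is one way the negative-sampling objective could look for the CBOW sketch above. The original word2vec paper draws negatives from the unigram distribution raised to the power 3/4; the function and argument names (`negative_sampling_loss`, `word_freqs`, `k`) are assumptions for this sketch, not the tutorial's exact code.

```python
import torch
import torch.nn.functional as F

def negative_sampling_loss(model, context, center, word_freqs, k=5):
    """Sketch of negative sampling for the CBOW model above.
    `word_freqs` is a (vocab_size,) tensor of raw word counts (assumed)."""
    # Sample k negatives per example from the unigram distribution ^ 0.75,
    # as suggested in the original word2vec paper.
    sampling_dist = word_freqs.float() ** 0.75
    sampling_dist /= sampling_dist.sum()
    batch_size = center.size(0)
    negatives = torch.multinomial(sampling_dist, batch_size * k, replacement=True)
    negatives = negatives.view(batch_size, k)

    # Positive pairs should score high (label 1) ...
    pos_logits = model(context, center)  # (batch,)
    pos_loss = F.binary_cross_entropy_with_logits(
        pos_logits, torch.ones_like(pos_logits))

    # ... and sampled negatives should score low (label 0).
    neg_logits = torch.stack(
        [model(context, negatives[:, i]) for i in range(k)], dim=1)  # (batch, k)
    neg_loss = F.binary_cross_entropy_with_logits(
        neg_logits, torch.zeros_like(neg_logits))
    return pos_loss + neg_loss
```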

Compare the results of the CBOW model with those of the Skip-gram one!
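
One simple qualitative comparison is to look at the nearest neighbors of the same query words in both embedding spaces. A minimal sketch, assuming you keep the learned embedding matrix and vocabulary mappings around (all names here are illustrative):

```python
import torch
import torch.nn.functional as F

def nearest_neighbors(embed_matrix, word_to_idx, idx_to_word, query, k=5):
    """Top-k cosine-similarity neighbors of `query` in one embedding space.
    `word_to_idx` / `idx_to_word` are assumed vocabulary mappings."""
    vectors = F.normalize(embed_matrix, dim=1)      # unit-length rows
    query_vec = vectors[word_to_idx[query]]
    sims = vectors @ query_vec                      # cosine similarity to all words
    top = sims.topk(k + 1).indices.tolist()[1:]     # skip the query word itself
    return [idx_to_word[i] for i in top]

# e.g. call this on cbow_model.in_embed.weight.detach() and on the Skip-gram
# model's embedding matrix with the same query words, and compare the lists.
```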