Deadline: November 14th, 9am
In this assignment and the next, we are switching to a different modality of data: Text. Namely, we will see how to assign a single label to input sequences of arbitrary length. This has many applications, such as detecting hate speech on social media or detecting spam emails. Here, we will look at sentiment analysis, which is supposed to tell what kind of emotion is associated with a piece of text.
In part 1, we are mainly concerned with implementing RNNs at the low level so that we understand how they work in detail. The models themselves will be rather rudimentary. We will also see the kinds of problems that arise when working with sequence data, specifically text. Next week, we will build better models and deal with some of these issues.
The notebook associated with the practical exercise can be found here.
We will be using the IMDB movie review dataset. This dataset comes with Keras and consists of 50,000 movie reviews with binary labels (positive or negative), divided into training and testing sets of 25,000 sequences each.
The data can be loaded the same way as MNIST or CIFAR – tf.keras.datasets.imdb.load_data().
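For illustration, loading the data might look like this (the variable names are just for this sketch):

```python
import tensorflow as tf

# Each element of train_sequences/test_sequences is a list of word indices;
# the labels are 0 (negative) or 1 (positive).
(train_sequences, train_labels), (test_sequences, test_labels) = \
    tf.keras.datasets.imdb.load_data()

print(len(train_sequences), len(test_sequences))   # 25000 25000
print(train_sequences[0][:10])                     # first ten word indices of the first review
```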
If you print the sequences, however, you will see that they are numbers, not text.
Recall that deep learning is essentially a pile of linear algebra. As such, neural networks cannot take text as input, which is why it needs to be converted to numbers. This has already been done for us – each word has been replaced by a number, and thus a movie review is a sequence of numbers (punctuation has been removed). If you want to restore the text, tf.keras.datasets.imdb.get_word_index() has the mapping – see the notebook for how you can use this, as well as some additional steps you need to actually get correct outputs.
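As a sketch of what this can look like (the offset of 3 corresponds to load_data's default reserved indices for padding, start-of-sequence and unknown words):

```python
# get_word_index maps words to indices starting at 1; load_data shifts all
# indices up by 3 by default, reserving 0 (padding), 1 (start) and 2 (unknown).
word_index = tf.keras.datasets.imdb.get_word_index()
index_to_word = {index + 3: word for word, index in word_index.items()}
index_to_word.update({0: "<pad>", 1: "<start>", 2: "<unk>"})

decoded_review = " ".join(index_to_word.get(i, "<unk>") for i in train_sequences[0])
print(decoded_review)
```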
Our sequences are numbers, so they can be put into a neural network. But does this make sense? Recall the kind of transformation a layer implements: a linear map followed by an (optional) non-linearity. But that would mean, for example, that the word represented by index 10 would be “10 times as much” as the word represented by index 1. And if we simply swapped the mapping (which we can do, as it is completely arbitrary), the roles would be reversed! Clearly, this does not make sense.
A simple fix is to use one-hot vectors: Replace a word index by a vector with as many entries as there are words in the vocabulary, where all entries are 0 except the one corresponding to the respective word, which is 1 – see the notebook.
Thus, each word gets its own “feature dimension” and can be transformed separately.
With this transformation, our data points are now sequences of one-hot vectors with shape (sequence_length, vocabulary_size).
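A minimal sketch using tf.one_hot (vocabulary_size is an assumed value here; it must be at least the largest index in the data plus one):

```python
vocabulary_size = 20000  # assumes the vocabulary was limited accordingly (see below)

sequence = tf.constant(train_sequences[0])                     # shape (sequence_length,)
one_hot_sequence = tf.one_hot(sequence, depth=vocabulary_size)
print(one_hot_sequence.shape)                                  # (sequence_length, vocabulary_size)
```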
Of course, not all movie reviews have the same length. This actually represents a huge problem for us: We would like to process inputs in batches, but tensors generally have to be “rectangular”, i.e. we cannot have different sequence lengths in the same batch! The standard way to deal with this is padding: Appending additional elements to shorter sequences such that all sequences have the same length.
In the notebook, this is done in a rather crude way: All sequences are padded to the length of the longest sequence in the dataset.
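A sketch of this crude scheme (pre vs. post padding is discussed further down):

```python
# Pad every review with zeros until it matches the longest training review.
max_length = max(len(sequence) for sequence in train_sequences)
padded_train = tf.keras.preprocessing.sequence.pad_sequences(
    train_sequences, maxlen=max_length, padding="post", value=0)
print(padded_train.shape)   # (25000, max_length)
```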
Food for thought #1: Why is this wasteful? Can you think of a smarter padding scheme that is more efficient? Consider the fact that RNNs can work on arbitrary sequence lengths, and that training minibatches are pretty much independent of each other.
Once we define the model, we will run into two issues with our data: First, the sequences are quite long, which makes training slow and memory-hungry; you will likely want to truncate them to some maximum length. Both load_data and pad_sequences have arguments to do this. We recommend the latter, as you can choose between “pre” or “post” truncation. Second, the vocabulary is very large, which makes the one-hot vectors (and thus the model) huge. load_data has an argument to keep only the n most common words and replace less frequent ones by a special “unknown word” token (index 2 by default). As a start, try keeping only the 20,000 most common words or so.
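Both fixes combined might look roughly like this (the concrete numbers are only examples to experiment with):

```python
vocabulary_size = 20000   # keep only the 20,000 most common words
max_length = 500          # pad/truncate all reviews to 500 time steps

(train_sequences, train_labels), (test_sequences, test_labels) = \
    tf.keras.datasets.imdb.load_data(num_words=vocabulary_size)

padded_train = tf.keras.preprocessing.sequence.pad_sequences(
    train_sequences, maxlen=max_length, truncating="post", padding="post")
padded_test = tf.keras.preprocessing.sequence.pad_sequences(
    test_sequences, maxlen=max_length, truncating="post", padding="post")
```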
Food for thought #2: Between truncating long sequences and removing them, which option do you think is better? Why?
Food for thought #3: Can you think of a way to avoid the one-hot vectors completely? Even if you cannot implement it, a conceptual idea is fine.
With these issues taken care of, we should be ready to build an RNN!
A TensorFlow RNN “layer” can be confusing due to its black-box character: All computations over a full sequence of inputs are done internally. To make sure you understand how an RNN “works”, you are asked to implement one from the ground up, defining variables yourself and using basic operations such as tf.matmul to define the computations at each time step and over a full input sequence. There are some related tutorials available on the TF website, but all of these use Keras.
For this assignment, you are asked not to use the RNNCell classes nor any related Keras functionality. Instead, you should study the basic RNN equations and “just” translate these into code. You can still use Keras optimizers, losses etc. You can also use Dense layers instead of low-level ops, but make sure you know what you are doing.
You might want to proceed as follows: Start with the computation for a single time step, i.e. how to get from the current input and the previous state to the new state and output. At each time step, you are dealing with tensors of shape batch x features – just like with an MLP!
Food for thought #4: How can it be that we can choose how many outputs we have, i.e. how can both be correct? Are there differences between both choices as well as (dis)advantages relative to each other?
To compute the state at the first time step, you would need a “previous state”, but there is none. To fix this, you can define an “initial state” for the network. A common solution is to simply use a tensor filled with zeros. You could also add a trainable variable and learn an initial state instead!
Food for thought #5: All sequences start with the same special “beginning of sequence” token (coded by index 1). Given this fact, is there a point in learning an initial state? Why (not)?
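To make this concrete, here is a minimal sketch of a from-scratch forward pass for the basic recurrence state_t = tanh(x_t W_xh + state_{t-1} W_hh + b); all names and sizes are illustrative, not a required implementation:

```python
vocabulary_size = 20000   # matches the vocabulary limit chosen above
state_size = 128          # freely chosen

# Trainable parameters: input-to-state, state-to-state, and a final readout.
W_xh = tf.Variable(tf.random.normal([vocabulary_size, state_size], stddev=0.01))
W_hh = tf.Variable(tf.random.normal([state_size, state_size], stddev=0.01))
b_h = tf.Variable(tf.zeros([state_size]))
W_out = tf.Variable(tf.random.normal([state_size, 1], stddev=0.01))
b_out = tf.Variable(tf.zeros([1]))


def rnn_forward(one_hot_inputs):
    """one_hot_inputs: (batch, time, vocabulary_size). Returns logits of shape (batch, 1)."""
    # Zero initial state; a trainable initial state would be a tf.Variable instead.
    state = tf.zeros([tf.shape(one_hot_inputs)[0], state_size])
    for t in range(one_hot_inputs.shape[1]):
        x_t = one_hot_inputs[:, t, :]   # (batch, vocabulary_size), just like an MLP input
        state = tf.tanh(tf.matmul(x_t, W_xh) + tf.matmul(state, W_hh) + b_h)
    # Classify the whole review from the state after the final time step.
    return tf.matmul(state, W_out) + b_out
```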
Recall that we padded all sequences to be the same length. Unfortunately, the RNN is not aware that we did this. This can be an issue, as we are basically computing new states (thus computing outputs as well as influencing future states) based on “garbage” inputs.
Food for thought #6: pad_sequences allows for pre or post padding. Try both to see the difference. Which option do you think is better? Recall that we use the final time step output from our model.
Food for thought #7: Can you think of a way to prevent the RNN from computing new states on padded time steps? One idea might be to “pass through” the previous state in case the current time step is padding. Note that, within a batch, some sequences might be padded for a given time step while others are not.
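A hedged sketch of this pass-through idea, reusing the parameter names from the earlier forward-pass sketch and assuming index 0 marks padding:

```python
def rnn_forward_masked(one_hot_inputs, word_indices):
    """word_indices: (batch, time) integer tensor in which 0 marks padded positions."""
    state = tf.zeros([tf.shape(one_hot_inputs)[0], state_size])
    for t in range(one_hot_inputs.shape[1]):
        x_t = one_hot_inputs[:, t, :]
        new_state = tf.tanh(tf.matmul(x_t, W_xh) + tf.matmul(state, W_hh) + b_h)
        # (batch, 1) mask: True where this time step is real data, False where it is padding.
        is_real = tf.not_equal(word_indices[:, t], 0)[:, tf.newaxis]
        # Keep the new state for real inputs, pass through the old state for padded ones.
        state = tf.where(is_real, new_state, state)
    return tf.matmul(state, W_out) + b_out
```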
Be aware that it might take several thousand steps for the loss to start moving at all, so don’t stop training too early if nothing is happening. Experiment with weight initializations and learning rates. For fast learning, the goal is usually to set them as large as possible without the model “exploding”.
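For reference, a minimal training step under the same assumptions might look like this (the learning rate and loss setup are just a starting point to experiment with):

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)
trainable_variables = [W_xh, W_hh, b_h, W_out, b_out]


def train_step(one_hot_inputs, labels):
    """labels: float tensor of shape (batch, 1) with entries 0 or 1."""
    with tf.GradientTape() as tape:
        logits = rnn_forward(one_hot_inputs)
        loss = loss_fn(labels, logits)
    gradients = tape.gradient(loss, trainable_variables)
    optimizer.apply_gradients(zip(gradients, trainable_variables))
    return loss
```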
A major issue with our “last output summarizes the sequence” approach is that the information from the end has to backpropagate all the way to the early time steps, which leads to extreme vanishing gradient issues. You could try to use the RNN output more effectively, for example by averaging (or summing) the per-step outputs instead of only using the final one, or by applying the loss to the output at every time step.
Food for thought #8: What could be the advantage of using methods like the above? What are disadvantages? Can you think of other methods to incorporate the full output sequence instead of just the final step?
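One of the ideas mentioned above, sketched under the same assumptions as before (padded time steps are ignored in the average for simplicity):

```python
def rnn_forward_mean(one_hot_inputs):
    """Averages the states over all time steps instead of using only the final one."""
    state = tf.zeros([tf.shape(one_hot_inputs)[0], state_size])
    states = []
    for t in range(one_hot_inputs.shape[1]):
        x_t = one_hot_inputs[:, t, :]
        state = tf.tanh(tf.matmul(x_t, W_xh) + tf.matmul(state, W_hh) + b_h)
        states.append(state)
    mean_state = tf.reduce_mean(tf.stack(states, axis=1), axis=1)   # (batch, state_size)
    return tf.matmul(mean_state, W_out) + b_out
```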