Deadline: June 3rd, 20:00
No exercise on May 28th/30th!
This week, we will switch gears and consider recurrent neural networks for working with text sequences. While RNNs have been largely supplanted by Transformers, it’s still a good idea to get a basic grasp of the concept. In order to stay in somewhat familiar territory, we will look at a sequence classification task, where entire sequences of text are assigned a single label. Later, we will also consider sequence-to-sequence tasks like language modeling.
We will use the IMDB Movie Review Dataset for our model. This is a dataset of 50,000 movie reviews (25,000 each for training and testing), labeled either positive or negative. As such, this is a binary classification task, where each data point is a sequence of text. Since neural networks do mathematical operations, however, we first need to convert this text to numbers in some way. There are actually many already preprocessed versions of this dataset around the web. Still, it can be valuable to see how this is done.
You can find a notebook with the complete preprocessing pipeline on E-Learning.
It also contains explanations of the different steps taken.
Download the raw data from the official website and unpack the
.tar.gz archive.
The notebook assumes the resulting aclImdb folder is itself in a folder called data, although you can of course change
the folder structure as you like.
Note for Colab users: You need to get the data into the virtual machine somehow. Since the runtime gets deleted after you stop using it for some time, uploading it right there is not a good idea. The best option is probably to upload it to your drive. See the notes in Assignment 1 on Google Colab for instructions on how to access your drive.
Your main task is to build the RNN model class itself.
You can use modules such as nn.Linear, but do not use any of
the already built RNN modules!
Some guiding principles can be:
RNNCell class that implements the computation for a single time step.
The forward method should return a matrix of inputs (for a single time step) and the previous state, and return the new
state using the RNN update equation.
Wx + b or Uh + c are exactly what a nn.Linear module implements.RNN class that performs the loop over the input and calls the RNNCell at each time step.
batch x hidden_dim.torch.nn.functional.one_hot or nn.Embedding, respectively.BCEWithLogitsLoss
(link).
This expects logits, i.e. do not apply sigmoid yourself.
However, you would need to change the accuracy computation in the training/evaluation code:
We only have one output, so taking the argmax makes no sense.
Instead, set a threshold (for example 0.5 for the sigmoid output, or equivalently 0 for the logits) and classify all
outputs above the threshold as 1, below as 0.You should be able to completely re-use our previous training & evaluation code (e.g. from the CNN assignments) without any changes (unless you want to use sigmoid output, see above)! The only things that change are the data and the model!
Unfortunately, there is a good chance that, even if you implemented everything correctly, the network will not perform well. In particular, training is often very unstable, with both training and validation performance jumping wildly. You can try adjusting learning rates, weight initialization or parameters such as weight decay. You can also make the maximum sequence length smaller, which should make problems like exploding or vanishing gradients slightly better, at the cost of losing more information from the reviews. But at the end of the day, there is a reason architectures like the LSTM were invented!
The most promising route of improvement is to replace the basic RNN cell with a more advanced one.
The “classic” one is the LSTM.
However, this can be a bit annoying to implement because of the large number of gates (four in total) as well as the
fact that we now have a state h_t as well as what is often called the “cell state” c_t, and we have to carry both
through time.
In particular, this would make your loop implementation more complex, as the cell may have two (LSTM) or just one
(basic RNN) state variable.
As an alternative, consider the Gated Recurrent Unit (GRU). This design only has two gates and no additional cell state. The equations for both LSTM and GRU can be found in many places, but Wikipedia serves just fine:
You just need to implement the equation for h_t, which necessitates all the equations before that, as well.
For the GRU, you only need to write a new cell; the RNN loop should not have to change.
With these advanced architectures, even a small RNN with around 100 hidden units should easily fit the training data.
You should of course be using advanced optimizers like AdamW with learning rate decay.
For reference, we got close to 98% training and 87% testing accuracy with a GRU with 128 hidden units and an embedding
dimension of just 32.
A naive GRU or LSTM implementation will be much slower than the basic RNN. This is due to the large number of linear transformations we have to do, which are all done in sequence. However, it is actually possible to combine multiple matrix multiplications into a single one!
W*x + U*h, this is equivalent to a single matrix multiplication [W U] * [x h] where the square
brackets denote concatenation (you have to check which axis to concatenate along…).y1 = W*x and y2 = U*x can be written as [y1 y2] = [W U] * x.
Again, the square brackets denote concatenation along a suitable axis.
The result can then be split into two parts to recover separate y1, y2 outputs.There are concrete code examples for both in the notebook on E-Learning! These so-called fused implementations are much more efficient than the naive ones. For an LSTM, all linear transformations (eight in total) can be fused into a single one! For a GRU, we unfortunately need two, since the reset gate needs to be computed first, which then serves as an input to the “candidate activation vector” (using the Wikipedia terminology).
Even with this simple setup, there are already plenty of things to investigate:
<start> token compared to leaving it out?