Assignment 6: More Realistic Language Modeling & Recurrent Neural Networks

Deadline: December 4th, 9am

In this task, we will once again deal with language modeling. This time, however, we will be using TensorFlow’s RNN functionality, which makes defining models significantly easier. On the flip side, we will have to handle variable-length inputs, which in turn makes defining models significantly more complicated. We will stick with character-level models for now; while word-level models are more common in practice, they come with additional problems.

Preparing the Data

Once again, we provide you with a script to help you process raw text into a format TensorFlow can understand. Download the script here. This script differs from the previous one in a few regards:

The issue of differing sequence lengths also means that you need to provide the data to TensorFlow in a slightly different way during training. At the end of the day, TensorFlow works on tensors, and these have a fixed size in each dimension. That is, sequences of different lengths can’t be put in the same tensor (and thus not in the same batch). The standard workaround is padding: sequences are filled up with “dummy values” to bring them all to the same length (namely, that of the longest sequence in the batch). The most straightforward approach is to simply append these dummy values at the end, and the most common value to use is 0. Padding is simple in TensorFlow: use padded_batch instead of batch in tf.data.
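As a minimal sketch of what padded_batch does, consider the toy example below. The three index sequences are made up for illustration and do not come from the provided script; in your own code the dataset would come from the processed text instead.

```python
import tensorflow as tf

# Made-up character-index sequences of different lengths (illustration only).
sequences = [[5, 3, 8], [2, 7], [4, 1, 6, 9, 3]]

dataset = tf.data.Dataset.from_generator(
    lambda: iter(sequences),
    # shape=(None,) declares that the sequence length may vary.
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int32),
)

# Each batch is padded with 0 (the default padding value) up to the length
# of the longest sequence within that batch.
batches = list(dataset.padded_batch(2).as_numpy_iterator())
for batch in batches:
    print(batch.shape)
    print(batch)
```

Note that the padded length can differ from batch to batch: here the first batch has 3 time steps and the second has 5.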

Building an RNN

Defining an RNN is much simpler when using the full TensorFlow/Keras library. There are tutorials available here and here, but these are the basic steps:

Way 1: Fully pre-built

Classes like tf.keras.layers.SimpleRNN, LSTM, and GRU define a “full” RNN: they expect 3D inputs (batch, time, and feature axes) and run the whole sequence through the network in one go. This can be extremely efficient because the whole computation can be implemented as a single “operation” in TensorFlow. However, you are somewhat limited in how much you can customize.
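A small sketch of the pre-built variant might look as follows. The sizes (4 sequences, 10 time steps, 30 characters, 64 units) are arbitrary example values, and the input is random one-hot data rather than real text.

```python
import tensorflow as tf

batch_size, time_steps, vocab_size = 4, 10, 30  # arbitrary example sizes

# Random character indices, one-hot encoded to shape (batch, time, features).
indices = tf.random.uniform(
    (batch_size, time_steps), maxval=vocab_size, dtype=tf.int32)
inputs = tf.one_hot(indices, depth=vocab_size)

# The layer consumes the whole 3D tensor in a single call.
rnn = tf.keras.layers.LSTM(64, return_sequences=True)
outputs = rnn(inputs)
print(outputs.shape)  # (4, 10, 64): one 64-dimensional output per time step
```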

Way 2: Using Cells

Another way to define an RNN is to first define a “cell”, e.g. tf.keras.layers.SimpleRNNCell or LSTMCell, and then wrap this cell in a tf.keras.layers.RNN. Basically, the cell defines the computations at one time step, and the RNN layer turns this into a computation over a whole sequence. This allows for more control over what the cell does, but tends to be less efficient.
Note that you can build a “deep RNN” by passing a list of cells to RNN.
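The cell-based variant, including a two-layer “deep RNN”, could be sketched like this. Layer sizes and the random input are again arbitrary choices for illustration.

```python
import tensorflow as tf

# A single cell wrapped into a sequence-level layer.
cell = tf.keras.layers.LSTMCell(32)
rnn = tf.keras.layers.RNN(cell, return_sequences=True)

# A "deep RNN": a list of cells is stacked, the output of one cell
# being the input of the next at every time step.
deep_rnn = tf.keras.layers.RNN(
    [tf.keras.layers.SimpleRNNCell(32), tf.keras.layers.SimpleRNNCell(16)],
    return_sequences=True,
)

inputs = tf.random.normal((4, 10, 30))  # (batch, time, features)
single_out = rnn(inputs)
deep_out = deep_rnn(inputs)
print(single_out.shape)  # (4, 10, 32)
print(deep_out.shape)    # (4, 10, 16): size of the last cell in the stack
```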

Putting it together

No matter how you built your RNN, you should integrate it into a Keras Model along with a dense output layer that maps back to the vocabulary (this is not part of the Keras RNNs). You will also want to set return_sequences to True in your RNN, since we want output probabilities for the next character at each time step.
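Put together, a minimal model could look like the sketch below. The vocabulary size, the number of units, the choice of GRU and the one-hot encoding outside the model are all assumptions made for this example; the dense layer outputs logits, so a softmax still has to be applied to get probabilities.

```python
import tensorflow as tf

vocab_size = 30    # assumed vocabulary size
hidden_units = 64  # arbitrary

model = tf.keras.Sequential([
    # None in the time axis allows sequences of any length.
    tf.keras.Input(shape=(None, vocab_size)),
    tf.keras.layers.GRU(hidden_units, return_sequences=True),
    # The dense layer is applied at every time step and maps back to the vocabulary.
    tf.keras.layers.Dense(vocab_size),
])

batch = tf.constant([[5, 3, 8, 0], [2, 7, 1, 4]])  # made-up character indices
logits = model(tf.one_hot(batch, depth=vocab_size))
probabilities = tf.nn.softmax(logits, axis=-1)
print(logits.shape)  # (2, 4, 30): one distribution over characters per time step
```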

The very least you should do is re-implement the task from the last assignment with this functionality. That is, you may work with fixed, known sequence lengths as a start. However, the real task lies ahead, and you may skip the re-implementation and go straight for that one if you wish.

Dealing with Variable-length Sequences

You may have noticed that there is one problem remaining: We padded shorter sequences to get valid tensors, but the RNN functions as well as the cost computations have no way of actually knowing that we did this. This means that the network will “learn” on the padding elements as well, which will artificially lower the loss and divert the training process from optimizing those parts of the data that really count (the non-padding elements). Let’s fix these issues.
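One common way to handle the cost side is to mask out the padded time steps before averaging. The sketch below is just one option, not necessarily the intended solution; the function name masked_loss is made up, and it assumes that the true sequence lengths are available alongside the padded batch.

```python
import tensorflow as tf

def masked_loss(targets, logits, lengths):
    """Mean cross-entropy over real (non-padding) time steps only.

    targets: (batch, time) integer character indices
    logits:  (batch, time, vocab) unnormalized model outputs
    lengths: (batch,) true sequence lengths (assumed to be known)
    """
    per_step = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=targets, logits=logits)  # shape (batch, time)
    # 1.0 for real time steps, 0.0 for padding.
    mask = tf.sequence_mask(
        lengths, maxlen=tf.shape(targets)[1], dtype=per_step.dtype)
    # Divide by the number of real steps, not by batch * time.
    return tf.reduce_sum(per_step * mask) / tf.reduce_sum(mask)

targets = tf.constant([[1, 2, 3], [2, 0, 0]])  # second sequence is padded
lengths = tf.constant([3, 1])
logits = tf.zeros((2, 3, 4))                   # uniform predictions, 4 classes
loss = masked_loss(targets, logits, lengths)
print(float(loss))  # log(4), about 1.386
```

With this kind of masking, whatever the network outputs at padded positions has no influence on the loss. The Keras RNN layers also accept a boolean mask argument when called, which can be used so that padded steps are skipped inside the RNN as well.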

While this was a lot of explanation, your program should hopefully be more succinct than the previous one, and it’s more flexible as well! Experiment with different cell types or even multiple layers and see the effects on the cost. Be prepared for significantly longer training times than with feedforward networks like CNNs.

Sampling with Keras

Sampling with Keras works a lot like before: we can input single characters by simply treating them as length-1 sequences. The process should be stopped not after a certain number of steps, but when the special character </S> is sampled (you could also continue sampling and see if your network breaks down…). However, we now have a bit of a problem: the Keras RNNs take a whole sequence as input at once, but we want to generate the sequence ourselves, one character at a time. Here’s what you can do:
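One simple (if inefficient) possibility, sketched here under several assumptions, is to feed the whole sequence generated so far back into the model at every step and only use the prediction at the last time step. The indices start_index and end_index are hypothetical placeholders for a start character and </S>, the model is untrained, and max_steps is only a safety cap so that the loop always terminates.

```python
import tensorflow as tf

vocab_size = 30
start_index, end_index = 1, 2  # hypothetical indices of a start character and </S>

# Untrained stand-in model; in practice you would use your trained one.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, vocab_size)),
    tf.keras.layers.GRU(64, return_sequences=True),
    tf.keras.layers.Dense(vocab_size),
])

def sample_sequence(model, max_steps=200):
    """Sample character indices until </S> is drawn (or max_steps is hit)."""
    sequence = [start_index]
    for _ in range(max_steps):
        inputs = tf.one_hot([sequence], depth=vocab_size)  # (1, time, vocab)
        logits = model(inputs)[:, -1, :]                   # next-character logits
        next_index = int(tf.random.categorical(logits, num_samples=1)[0, 0])
        sequence.append(next_index)
        if next_index == end_index:
            break
    return sequence

print(sample_sequence(model, max_steps=20))
```

Re-running the full prefix costs more computation with every step; keeping the RNN state between steps avoids this, at the price of some extra bookkeeping.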

Applying a Language Model

Finally, keep in mind that language modeling is actually about assessing the “quality” of a piece of language, usually formalized via probabilities. The probability of a given sequence is simply the product of the probabilities of its characters (i.e. the softmax output of the RNN for the character that actually appeared), each given the previous characters. Try this out: Compute the probabilities for some sequences typical for the training corpus (you could just take them straight from there). Compare these to the probabilities for some not-so-typical sequences. Note that only sequences of the same length should be compared, since every additional character multiplies in another factor of at most 1, so longer sequences tend to receive lower probabilities. For example, in the King James corpus you could simply replace the word “LORD” by “LOOD” somewhere and see how this affects the probability of the overall sequence.
Note that for numerical reasons, you will generally want to work with log probabilities instead of “regular” ones: the product of many small probabilities quickly underflows to zero, whereas the corresponding sum of log probabilities remains well-behaved.
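The computation itself can be sketched in a few lines of plain Python. The helper name sequence_log_prob and the softmax outputs below are made up for illustration (a 3-character vocabulary and two time steps); in practice, the per-step distributions would come from your trained model.

```python
import math

def sequence_log_prob(step_probs, sequence):
    """Sum of log P(character at step t | previous characters).

    step_probs[t] is the softmax output (a list over the vocabulary) at step t,
    sequence[t] is the index of the character that actually appeared there.
    """
    return sum(math.log(probs[char]) for probs, char in zip(step_probs, sequence))

# Made-up softmax outputs for two time steps.
step_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]

typical = sequence_log_prob(step_probs, [0, 1])   # log(0.7) + log(0.8)
atypical = sequence_log_prob(step_probs, [2, 0])  # log(0.1) + log(0.1)
print(typical, atypical)
```

The sequence the model considers more likely receives the higher (less negative) log probability, and exponentiating the sum recovers the ordinary product of probabilities.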