Assignment 6: Language Modeling & Recurrent Neural Networks

Deadline: November 21st, 9am

Send your colab link and group member info to jens.johannsmeier@ovgu.de

In this assignment, you will tackle language modeling using RNNs. Language modeling forms an important basis for most NLP applications such as tagging, parsing or machine translation. However, it can also be used on its own to generate “natural” language.

NOTE: This and the next assignment will be quite wordy. This basically replaces the usual extra readings of official Tensorflow tutorials. If you have questions or run into trouble, use Gitlab or Mattermost!

Language Modeling

A language model assigns a probability (or, more generally, some kind of score) to a piece of text. Most of the time, this is done by interpreting the text as a sequence of words and computing probabilities of each word given the previous ones. Check out this Wikipedia article for a quick overview, especially on the classic n-gram models.

A consequence of having a probability distribution over words given previous words is that we can sample from this distribution. This way, we can generate whole sequences of language (usually of questionable quality and sense).
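
To make the sampling idea concrete, here is a minimal sketch of a word-level bigram model (an n-gram model with n = 2) in plain Python. The toy corpus and the function names train_bigram, next_word_probs and sample_sequence are made up for illustration; they are not part of the course material.

```python
import random
from collections import Counter, defaultdict


def train_bigram(words):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Turn the follower counts of `prev` into a probability distribution."""
    followers = counts.get(prev, {})
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}


def sample_sequence(counts, start, length, rng):
    """Generate up to `length` words by repeatedly sampling the next word."""
    out = [start]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:  # the last word was never followed by anything
            break
        words = list(followers.keys())
        weights = list(followers.values())
        out.append(rng.choices(words, weights=weights, k=1)[0])
    return out


# Toy corpus, invented for this example.
corpus = "the cat sat on the mat and the cat ran".split()
counts = train_bigram(corpus)
probs = next_word_probs(counts, "the")
sample = sample_sequence(counts, "the", 8, random.Random(0))
print(probs)
print(" ".join(sample))
```

Each generated word depends only on the single word before it, which is exactly the limited context that makes n-gram models struggle once the relevant history gets longer.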

Language Modeling can also be done on a character level, however. That is, the text is predicted character-for-character instead of word-for-word. n-gram models quickly fail here due to their limited context. RNNs offer a compelling alternative due to their memory reaching back an arbitrary amount of time (in theory). Check out this “famous” blog post by Andrej Karpathy to get an impression of what can be done here.

The basic idea is that we train the RNN to predict the next element of a sequence given the previous elements. That is, at each time step the RNN receives a character as input. From this input and its current state, it computes a new state and produces a probability distribution over the next character. Later, we can generate sequences by sampling single elements from the RNN’s output probability distribution and feeding them back into the network as input.

RNNs in Tensorflow

Tensorflow has a reputation of having not-so-great support for RNNs, though this has gotten much better in recent times. However, an RNN “layer” can be confusing due to its black box character: All computations over a full sequence of inputs are done internally. To make sure you understand how an RNN “works”, you are asked to implement one from the ground up, defining variables yourself and using basic operations such as tf.matmul to define the “unrolled” computation graph. There is an RNN tutorial on the Tensorflow website, but it is severely lacking: it presents incomplete code snippets out of context, while the full tutorial code is extremely bloated. There is another, more recent tutorial on text generation, but it uses eager execution and the high-level Keras API. For this assignment, you are asked not to use the RNNCell classes or any Keras functionality. Also, we recommend not using eager execution: While eager makes the RNN definition significantly more “natural”, it also seems to have a very negative impact on performance in this case (why do you think this might be?). You might want to proceed as follows:

Download this helper file for preparing datasets. It provides functions to process text files into a “TFRecords” dataset. Some text corpora (as well as pointers to more resources) can be found here. The TFRecords format is the “recommended” data format to be used with Tensorflow. It’s also not properly documented at all. You don’t need to know how the format works for this assignment, but reading through the processing code should give you a rudimentary understanding. You can also find (usually outdated) tutorials on the web. The only official bits seem to be here and here. If you can find any more useful information on this format, definitely share it with the course.
The processing function will actually split your text into character sequences of equal length (try 100 or 200 as a starting point). This means that any dependencies between those sequences will be lost, as they will be presented as independent examples to the RNN (although this effect can be lessened by supplying an overlap). This is unfortunate, but makes the task much simpler. It will also store each character as an index that actually represents a one-hot vector of “features”. That is, at each time step the features serving as input to the RNN are one-hot vectors coding the presence of a certain character.
To get an understanding of how to actually use the dataset, have another look at the dataset tutorial on the Tensorflow website. In particular, look at how to consume TFRecord data and how to parse tf.Example protobufs. Basically, all you need to do is create a TFRecordDataset and map this via the parse_seq function we provide. Hint: You will need to create a new function from this with a fixed sequence length that only takes an example as input, e.g. data.map(lambda x: parse_seq(x, 200)) for sequences of length 200.
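
The splitting and index encoding described above can be sketched in plain Python as follows. This is only an illustration of the idea, not the helper file's actual code: build_vocab and split_into_sequences are hypothetical names, and details such as the <S> beginning-of-sequence character or the handling of leftover characters at the end of the text are left out.

```python
def build_vocab(text):
    """Map every distinct character to an integer index."""
    chars = sorted(set(text))
    return {ch: i for i, ch in enumerate(chars)}


def split_into_sequences(text, seq_len, overlap=0):
    """Cut text into pieces of length seq_len.

    Consecutive pieces share `overlap` characters. Leftover characters
    at the end that do not fill a whole piece are dropped (an assumption
    of this sketch).
    """
    if not 0 <= overlap < seq_len:
        raise ValueError("overlap must be in [0, seq_len)")
    step = seq_len - overlap
    return [text[start:start + seq_len]
            for start in range(0, len(text) - seq_len + 1, step)]


text = "abcdefghij"
vocab = build_vocab(text)
seqs = split_into_sequences(text, seq_len=4, overlap=2)
# Each character is stored as its index, not as a one-hot vector.
encoded = [[vocab[ch] for ch in seq] for seq in seqs]
print(seqs)     # ['abcd', 'cdef', 'efgh', 'ghij']
print(encoded)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

With an overlap of 2, the last two characters of each sequence reappear at the start of the next one, so some of the context across the cut is still seen during training.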

Having prepared the data, build an RNN as follows. Note that you should write this model at a low level, i.e. do not use the Estimator interface.

Expand your data from indices into one-hot vectors. The necessary depth can be obtained from the vocabulary file that was created alongside the dataset (you can load it via pickle.load(open(path, mode='rb'))). That is, turn your input from batch_size x seq_len into batch_size x seq_len x vocab_size.
Set up the variables for computing one time step of the RNN. You will need weights/biases to go from input to hidden, hidden to hidden and hidden to output. You also need to set up an initial state. You could do this as a placeholder that you feed with a batch_size x state_size matrix of zeros at training time.
Iterate over the time axis of your input (simply with a Python for-loop) and do the RNN computations for each step: Compute the new state given the old state and the current input, and from this compute the output (logits).
Compute the softmax loss between the output and the target at this time step. Keep in mind that the targets are just the input shifted by one time step.
Either sum or average the loss over time. Which option do you think is preferable and why?
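
To check your understanding of the shapes involved, the steps above can be written out in NumPy. This is a sketch of the arithmetic only, not the Tensorflow graph you are asked to build: it runs eagerly, has no variables or training, and makes several assumptions that the assignment does not prescribe (a tanh state update, the parameter names W_xh, W_hh, b_h, W_hy, b_y, and averaging the loss over time, which is just one of the two options mentioned above).

```python
import numpy as np


def one_hot(indices, depth):
    """batch x steps integer indices -> batch x steps x depth one-hot vectors."""
    return np.eye(depth, dtype=np.float32)[indices]


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def rnn_loss(seqs, params):
    """Average next-character cross-entropy of a simple (assumed tanh) RNN."""
    W_xh, W_hh, b_h, W_hy, b_y = params
    vocab_size, state_size = W_xh.shape
    # Targets are the inputs shifted by one time step.
    inputs = one_hot(seqs[:, :-1], vocab_size)
    targets = seqs[:, 1:]
    batch_size, steps = targets.shape
    state = np.zeros((batch_size, state_size), dtype=np.float32)
    step_losses = []
    for t in range(steps):  # explicit loop over the time axis
        state = np.tanh(inputs[:, t] @ W_xh + state @ W_hh + b_h)
        logits = state @ W_hy + b_y
        log_probs = log_softmax(logits)
        # Negative log-probability of the correct next character.
        nll = -log_probs[np.arange(batch_size), targets[:, t]]
        step_losses.append(nll.mean())
    return float(np.mean(step_losses))  # averaged over time in this sketch


rng = np.random.default_rng(0)
vocab_size, state_size = 5, 8
params = (
    rng.normal(0.0, 0.1, (vocab_size, state_size)),  # input -> hidden
    rng.normal(0.0, 0.1, (state_size, state_size)),  # hidden -> hidden
    np.zeros(state_size),
    rng.normal(0.0, 0.1, (state_size, vocab_size)),  # hidden -> output
    np.zeros(vocab_size),
)
seqs = rng.integers(0, vocab_size, size=(3, 11))  # random stand-in for data
loss = rnn_loss(seqs, params)
print(loss)
```

A useful sanity check: if all weights and biases are zero, every character gets the same probability, so the loss must equal log(vocab_size). An untrained network with small random weights should start close to that value.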

All of these things happen at the symbolic level!! At this point you haven’t launched a session or anything like that!! It is critical that you understand this – you are defining the full computation graph first, with all time steps being represented explicitly, and then a single session.run later will go through all time steps.

For now you might be happy with just training the RNN. Experiment with different layer sizes or sequence lengths. As a reference, an average loss of ~1.5 should be achievable on the Shakespeare corpus using length-200 sequences, with 512 hidden units (batch size 128 and Adam optimizer, 20 or so epochs). Training might take a while – it’s okay to shoot for values around 2.0 instead as a start. Visualize the computation graph in Tensorboard and contemplate your life choices. If you’re feeling fancy, you could even construct a “deep” RNN (stacking multiple RNN layers) or implement more advanced architectures such as LSTMs or GRUs, but these will appear in the next assignment anyway.

Generating Language

Having trained an RNN, you can use it to generate language – technically, you’re “sampling from the language model”. To do this, you should:

Save a checkpoint of your trained RNN using tf.train.Saver (see here for a refresher – first section only).
Re-create the RNN (you probably want to do this in a new file). You should define all the variables again, with the same names (so that the saver can associate them). However, this time you only need to define computations for a single time step. Also, you will need the actual probabilities this time (not just logits – simply apply softmax), but no costs, since we’re not training anything.
Restore your checkpoint using saver.restore. You might be able to simplify this and the previous step using the meta graph functionality in Tensorflow.
Generate a probability distribution over the next character. Feed as input the last character as well as the current state. To start this process, the “last” character should be <S> (the beginning-of-sequence character inserted when creating the dataset) and the last state filled with zeros. Make sure to output the resulting state along with the probabilities so you can feed it into the network for the next step (this is where defining the initial state as a placeholder becomes useful).
Sample from the probability distribution (e.g. using random.choice). This will give you an index that you can feed back as input into the network for the next step. Also, you can map this to a character using the vocabulary file.
Repeat this process for as long as you want, maybe for several hundred characters. Join all the characters into a single string and look at your output. Note that even if your network was only trained on sequences of a certain length, you can keep this process going for much longer – although your network might not be able to handle it.
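
The sampling loop itself can be sketched in NumPy as follows. The random weights here merely stand in for a restored checkpoint, so the output is gibberish; the tanh update, the parameter names, the tiny vocabulary and the generate function are assumptions made for illustration. In your solution, the single-step computation would be a session.run call that takes the last character and state and returns the probabilities and the new state.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax for a single vector of logits."""
    e = np.exp(logits - logits.max())
    return e / e.sum()


def generate(params, index_to_char, start_index, num_chars, rng):
    """Sample num_chars characters, feeding each sample back in as input."""
    W_xh, W_hh, b_h, W_hy, b_y = params
    vocab_size, state_size = W_xh.shape
    state = np.zeros(state_size)  # the initial "last state" is all zeros
    last = start_index            # the initial "last character" is <S>
    chars = []
    for _ in range(num_chars):
        x = np.zeros(vocab_size)
        x[last] = 1.0             # one-hot encoding of the last character
        state = np.tanh(x @ W_xh + state @ W_hh + b_h)
        probs = softmax(state @ W_hy + b_y)
        last = int(rng.choice(vocab_size, p=probs))
        chars.append(index_to_char[last])
    return "".join(chars)


# Hypothetical tiny vocabulary; index 0 plays the role of <S>.
index_to_char = ["<S>", "a", "b", "c", " "]
rng = np.random.default_rng(0)
vocab_size, state_size = len(index_to_char), 8
params = (
    rng.normal(0.0, 0.5, (vocab_size, state_size)),
    rng.normal(0.0, 0.5, (state_size, state_size)),
    np.zeros(state_size),
    rng.normal(0.0, 0.5, (state_size, vocab_size)),
    np.zeros(vocab_size),
)
text = generate(params, index_to_char, start_index=0, num_chars=50, rng=rng)
print(text)
```

Note that nothing in this loop prevents <S> from being sampled in the middle of the output; whether you filter it out or keep it is up to you.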

Assuming your network was trained properly and your generation process works, the output should at least superficially resemble the training data. For example, in the case of Shakespeare you should see a dialogue structure with proper use of newlines and whitespace. Depending on how long you trained, the text itself should hopefully “look like” English, although there will likely be plenty of fantasy words. This is not a problem per se – chances are the task is just too difficult for this simple network. If your output looks completely jumbled, there is probably something wrong with your generation process.