Assignment 6: Autoregressive Language Modeling

Discussion: May 30th
Deadline: May 29th, 20:00

By far the most “hyped” AI application right now is generating natural language (ChatGPT). The good news is that state-of-the-art systems are conceptually very simple, relying on sequential autoregressive generation. The bad news is that they require massive scale.

In this assignment, we will build such a system on a small scale to understand the underlying principles, and have some fun with the terrible generations we will probably get.

Tutorials for this are plentiful:

As usual, please do not just copy a tutorial and call it a day; you will not learn anything. If you want to roll your own version, below is a rough summary of the necessary steps:

Preparing Data

Finding a Dataset

You have LOTS of options here. Some of the tutorials linked above also provide data. Common tutorial choices are things like the collected works of Shakespeare or the IMDB dataset. However, these are tiny by modern standards, and you will likely not get a good model from them. Another option is Tensorflow Datasets – check the “Text generation” tab in the catalogue, and the guide for how to use the library. A third option that works well is wikitext – there is the small wikitext-2, and the larger (but still manageable) wikitext-103.

Tokenization

Independently of the dataset, you will have to settle on a level of representation. Do you want to model characters, words, or something in between? You will have to process the data, which is usually provided as raw strings, into sequences of tokens.

You will also need to create a vocabulary (token-to-index mapping) and use it to convert your tokens to indices.
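The two steps above can be sketched as follows – a minimal version using whitespace tokenization, with a vocabulary built from scratch. The function names (`build_vocab`, `encode`) are illustrative, not from any library:

```python
def build_vocab(texts):
    """Map each unique token to an integer index; index 0 is reserved for <unk>."""
    vocab = {"<unk>": 0}
    for text in texts:
        for token in text.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(text, vocab):
    """Convert a raw string into a list of token indices, mapping unknowns to <unk>."""
    return [vocab.get(token, vocab["<unk>"]) for token in text.split()]

corpus = ["the cat sat", "the dog sat"]
vocab = build_vocab(corpus)
ids = encode("the bird sat", vocab)  # "bird" is out-of-vocabulary
```

For character-level modeling, replace `text.split()` with iteration over the string; the rest stays the same.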

Making Sequences

Obviously, you can’t put the entire dataset into your model in one go – you will have to divide it into smaller subsequences that can be treated as single training examples. There are basically two choices:
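The simplest variant can be sketched like this: concatenate all token indices into one long stream and slice it into non-overlapping windows of fixed length (the helper name `make_sequences` is illustrative):

```python
def make_sequences(token_ids, seq_len):
    """Split a long token stream into subsequences of length seq_len,
    dropping the incomplete remainder at the end."""
    n = (len(token_ids) // seq_len) * seq_len
    return [token_ids[i:i + seq_len] for i in range(0, n, seq_len)]

chunks = make_sequences(list(range(10)), 4)  # two full chunks; the last 2 tokens are dropped
```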

Building a Model

This part is actually quite simple! You essentially need three components:

  1. Embedding layer for your tokens.
  2. A sequence model. You will probably want to start with an RNN, e.g. a GRU or LSTM. Later, if you want, you could also attempt a Transformer.
  3. A (softmax) output layer with one class per vocabulary entry.
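The three components above can be sketched as a Keras model; `vocab_size` and the layer sizes here are placeholder choices, not recommendations:

```python
import tensorflow as tf

vocab_size = 10000  # placeholder; use the size of your vocabulary
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128),       # 1. token embeddings
    tf.keras.layers.GRU(256, return_sequences=True),  # 2. sequence model
    tf.keras.layers.Dense(vocab_size),                # 3. one logit per vocabulary entry
])
```

Note that the final layer outputs raw logits; the softmax is then applied inside the loss function (e.g. via `from_logits=True`), which is numerically more stable.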

The model needs to be trained to predict the next token given the previous ones. One step of training proceeds as follows:

  1. Get a batch of sequences.
  2. Put the batch in your model to get a batch of output sequences.
  3. Compute the cross-entropy between the outputs and the targets. This can be done on the entire sequence at once; it will compute the cross-entropy per-timestep and average over time.
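One such training step could be sketched as follows, assuming `model` maps an int tensor of shape (batch, time) to logits of shape (batch, time, vocab):

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

def train_step(model, sequences):
    # shift by one: inputs drop the last token, targets drop the first
    inputs, targets = sequences[:, :-1], sequences[:, 1:]
    with tf.GradientTape() as tape:
        logits = model(inputs)
        loss = loss_fn(targets, logits)  # averaged over batch and time
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```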

What are the targets? For any time step, the target is always the next token. This can be achieved as follows:

inputs = sequences[:-1]   # all tokens except the last
targets = sequences[1:]   # all tokens except the first

Inputs exclude the final token (it doesn’t predict anything); targets exclude the first token (it is not predicted by anything). This aligns inputs and targets so that at each time step, the target is the next token.

Generating Text

Having trained your model, generation proceeds as follows:

Note: If using an RNN, you will need to carry over the state after inputting each new token, as you obviously can’t just apply the model to the whole sequence at once (it doesn’t exist yet). As such, you should set up your model so that it returns the current state along with the outputs, and takes a state as an argument alongside the input. The Tensorflow RNNs provide functionality for this:
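A sketch of such a sampling loop, assuming the model is split into an embedding layer, a GRU with `return_state=True`, and a dense output layer (the names and sizes are illustrative):

```python
import tensorflow as tf

vocab_size = 10000  # placeholder
embed = tf.keras.layers.Embedding(vocab_size, 128)
gru = tf.keras.layers.GRU(256, return_sequences=True, return_state=True)
out = tf.keras.layers.Dense(vocab_size)

def generate(start_id, num_steps):
    token = tf.constant([[start_id]])  # shape (batch=1, time=1)
    state = None                       # None lets the GRU use its zero initial state
    generated = [start_id]
    for _ in range(num_steps):
        x = embed(token)
        y, state = gru(x, initial_state=state)  # carry the state across steps
        logits = out(y[:, -1, :])
        # sample the next token from the predicted distribution
        next_id = tf.random.categorical(logits, num_samples=1)[0, 0]
        token = tf.reshape(tf.cast(next_id, tf.int32), (1, 1))
        generated.append(int(next_id))
    return generated
```

Sampling from the distribution (rather than always taking the argmax) avoids degenerate repetitive output; temperature scaling of the logits is a common refinement.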

Experiments