Discussion: May 30th
Deadline: May 29th, 20:00
By far the most “hyped” AI application right now is generating natural language (ChatGPT). The good news is that state-of-the-art systems are conceptually very simple, relying on sequential autoregressive generation. The bad news is that they require massive scale.
In this assignment, we will build such a system on a small scale to understand the underlying principles, and have some fun with the terrible generations we will probably get.
Tutorials for this are plentiful:
As usual, please do not just copy a tutorial and call it a day; you will not learn anything. If you want to roll your own version, below is a rough summary of the necessary steps:
You have LOTS of options here. Some of the tutorials linked above also provide data. Common tutorial choices are things like the collected works of Shakespeare or the IMDB dataset. However, these are tiny by modern standards, and you will likely not get a good model from such datasets. Another option is Tensorflow Datasets – check the “Text generation” tab in the catalogue, and the guide for how to use the library. An option that works well is wikitext – there is the small wikitext-2, and the larger (but still manageable) wikitext-103.
Independently of the dataset, you will have to settle on a level of representation. Do you want to model characters, words, or something in between? You will have to process the data, which is usually provided as raw strings, into sequences of tokens.
You should then convert your tokens to indices via a vocabulary (token-to-index mapping) that you will also need to create.
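As a minimal plain-Python sketch of this step (word-level tokens; `build_vocab` and `encode` are illustrative names, not part of any library):

```python
def build_vocab(texts):
    """Map each unique token to an integer index (word-level split)."""
    vocab = {}
    for text in texts:
        for token in text.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(text, vocab):
    """Turn a raw string into a sequence of token indices."""
    return [vocab[token] for token in text.split()]

vocab = build_vocab(["the cat sat", "the dog sat"])
encode("the dog", vocab)  # -> [0, 3]
```

For character-level modeling, you would iterate over the characters of each string instead of `text.split()`.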
Obviously, you can’t put the entire dataset into your model in one go – you will have to divide it into smaller subsequences that can be treated as single training examples. There are basically two choices:
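Without preempting your choice, here is a minimal sketch of one common approach – cutting the token stream into fixed-length, non-overlapping chunks (the function name is illustrative):

```python
def chunk(indices, seq_len):
    """Split a long index sequence into fixed-length training examples,
    dropping any remainder that does not fill a full chunk."""
    return [indices[i:i + seq_len]
            for i in range(0, len(indices) - seq_len + 1, seq_len)]

chunk(list(range(10)), 4)  # -> [[0, 1, 2, 3], [4, 5, 6, 7]]
```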
This part is actually quite simple! You essentially need three components:
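The components are not spelled out here, but a very common setup is an embedding layer, a recurrent layer, and an output projection to vocabulary-sized logits. A toy NumPy forward pass, assuming exactly that setup (all names and dimensions are illustrative, and in practice you would use framework layers such as the Tensorflow RNNs mentioned below):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 50, 8, 16

# 1) Embedding table: one vector per vocabulary index.
E = rng.normal(size=(vocab_size, embed_dim))
# 2) Recurrent cell weights (a plain tanh RNN).
W_x = rng.normal(size=(embed_dim, hidden_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
# 3) Output projection from hidden state to vocabulary logits.
W_o = rng.normal(size=(hidden_dim, vocab_size))

def forward(token_ids):
    """Run the toy RNN over a sequence, returning next-token logits
    at every position."""
    h = np.zeros(hidden_dim)
    logits = []
    for t in token_ids:
        h = np.tanh(E[t] @ W_x + h @ W_h)
        logits.append(h @ W_o)
    return np.stack(logits)

forward([3, 1, 4]).shape  # (3, 50): one logit vector per input position
```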
The model needs to be trained to predict the next token given the previous ones. One step of training proceeds as follows:
What are the targets? For any time step, it should always be the next token. This can be achieved as follows:
inputs = sequences[:-1]
targets = sequences[1:]
Inputs exclude the final token (there is nothing left for it to predict); targets exclude the first token (there is nothing before it to predict it). This aligns inputs and targets in such a way that at each time step, the target is the next token.
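Concretely, for one short token sequence:

```python
sequence = [5, 2, 7, 9]    # token indices for one training example
inputs = sequence[:-1]     # [5, 2, 7]
targets = sequence[1:]     # [2, 7, 9]
# Step 0: the model sees 5 and should predict 2;
# step 1: it has seen 5, 2 and should predict 7; and so on.
```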
Having trained your model, generation proceeds as follows:

- Feed the model a token and obtain the probabilities for the next token; sample from this distribution (e.g. using tfp.distributions.Categorical). Put the sampled token back into your network, get the probabilities for the next token, etc. Continue for some number of steps. What happens if you use the highest probability (argmax) instead of sampling?
- To give the model a starting point, you can insert special START tokens into
your data (e.g. at the beginning of every movie review, paragraph, etc.) and
provide that as input – the model can then learn what should come after a START
token and generate a reasonable beginning itself.

Note: If using an RNN, you will need to “conserve” the state after inputting each new token, as you obviously can’t just apply it to the full sequence (it doesn’t exist yet). As such, you should set up your model such that it returns the current state along with the outputs, and takes a state as an argument alongside the input. The Tensorflow RNNs provide functionality for this:
- return_state=True will return the state at the end.
- call takes an initial_state argument.
- Alternatively, you could use a stateful RNN, but this is a bit awkward and not recommended.

A useful trick is to divide the logits by a temperature T. With high T, the probabilities become more uniform, leading to more
chaotic samples. With low T, probabilities become more “extreme”, leading to more
deterministic behavior. As T -> 0, we approach argmax instead of random sampling.
Often, T slightly smaller than 1 leads to better results (e.g. T=0.8).

Another option is top-k sampling: keep only the k (another hyperparameter) tokens with the largest logits, and
only sample between these. This guarantees that only the most probable next tokens
may be sampled.
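The sampling tricks above (temperature and top-k) can be sketched independently of the model; the following NumPy function assumes you already have a vector of next-token logits (the function name and signature are illustrative):

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, rng=None):
    """Sample a token index from a vector of next-token logits.

    temperature < 1 sharpens the distribution (approaching argmax
    as T -> 0); top_k restricts sampling to the k largest logits."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        # Mask out everything below the k-th largest logit.
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits >= cutoff, logits, -np.inf)
    # Numerically stable softmax.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

In the full generation loop, you would feed the returned index back into the model (carrying the RNN state along via initial_state) and call this function on each new logits vector.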