Discussion: November 21st
Deadline: November 20th, 18:00
When deep generative models first started emerging, the biggest competitors were VAEs and GANs. GANs are still among the top contenders for high-quality outputs; however, their training is rather difficult. In this assignment, we will implement basic GANs and see if we can improve them a little.
Compared to VAEs, GANs are somewhat light on theory, making the implementation more straightforward: We don’t have to think as much about how to put mathematical equations into practice. However, difficulties arise from other sources:
TrainerBase completely.
In fact, GANs will be the only model where we cannot just implement a core_step; we will have to overwrite the entire train_step instead.
We once again have a starter notebook (05_gan_starter.ipynb) and parts of the model/training code (lgm/gan.py) given.
This has been optimized somewhat by creating a single GAN object that contains both the generator (G) and discriminator (D);
this allows us to get away with a single model/optimizer input to the Trainer class.
Otherwise, we would have to create a completely new class that doesn’t inherit from TrainerBase to take two separate model/optimizer inputs.
This also puts some constraints on our training loop. There is a “naive” version of GAN training where you would train the models in alternation. This would require calling each model twice (once for D training, once for G training). We instead opt for a more optimized version that only calls each model once and updates them in parallel. This is faster, and only differs slightly from the alternating version. It’s also the only version we can really use with a single optimizer object. On the flip side, this version is perhaps slightly more difficult to understand than the “naive” one.
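To make the parallel update more concrete, here is a minimal sketch of one possible training step. It is not the code from lgm/gan.py: the function name parallel_gan_step, its arguments, and the use of torch.autograd.grad to route each loss to its own network are assumptions made for illustration.

```python
import torch
from torch import nn


def parallel_gan_step(generator, discriminator, optimizer, real, noise_dim):
    """One step that updates G and D together with a single optimizer (sketch)."""
    loss_fn = nn.BCEWithLogitsLoss()
    batch_size = real.shape[0]

    # Single forward pass through G.
    noise = torch.randn(batch_size, noise_dim, device=real.device)
    fake = generator(noise)

    # Single forward pass through D, on real and generated data together.
    logits = discriminator(torch.cat([real, fake]))
    real_logits, fake_logits = logits[:batch_size], logits[batch_size:]

    ones = torch.ones_like(real_logits)
    zeros = torch.zeros_like(fake_logits)
    d_loss = loss_fn(real_logits, ones) + loss_fn(fake_logits, zeros)
    g_loss = loss_fn(fake_logits, ones)  # G wants D to label fakes as real

    # Each network only receives the gradient of its own loss.
    d_params = list(discriminator.parameters())
    g_params = list(generator.parameters())
    d_grads = torch.autograd.grad(d_loss, d_params, retain_graph=True)
    g_grads = torch.autograd.grad(g_loss, g_params)
    for param, grad in zip(d_params + g_params, d_grads + g_grads):
        param.grad = grad

    # One optimizer holding the parameters of both networks.
    optimizer.step()
    optimizer.zero_grad()
    return d_loss.item(), g_loss.item()
```

The difference to the alternating version is that G is updated against the D from before this step's update, rather than against the freshly updated one.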
You can view D similarly to the encoder in a VAE, and G similarly to the decoder. This makes it relatively easy to build neural network architectures. A few things to keep in mind:
nn.GroupNorm instead.
Even once you have a technically functioning training loop, chances are the results will be very bad. Thus, we really need to think about employing some advanced methods to improve performance. At the same time, we want to keep the training relatively simple/close to the original GAN formulation.
First, let’s start with a collection of simple tricks:
Many GAN training loops make some changes to the standard optimizer setup:
0.001.
Try reducing this by a factor of 5-10, e.g. 0.0002.
It is often advocated to scale data such that it lies in the range [-1, 1] rather than [0, 1].
This is trivially implemented using the Normalize transform from lgm.data.
An example is given in the starter notebook.
Note that we will need to revert this transformation whenever plotting data.
Also, G needs to output data in the correct range, but this can be enforced using nn.Tanh as an output activation.
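As a rough illustration of the range handling, the sketch below uses two hypothetical helper functions (to_symmetric and to_unit) in place of the Normalize transform from lgm.data, whose exact interface is not shown here. The toy generator and its layer sizes are likewise made up.

```python
import torch
from torch import nn


def to_symmetric(x):
    """Map data from [0, 1] to [-1, 1]."""
    return 2.0 * x - 1.0


def to_unit(x):
    """Invert to_symmetric, e.g. before plotting."""
    return (x + 1.0) / 2.0


# Toy generator: the final Tanh keeps every output inside [-1, 1].
generator = nn.Sequential(
    nn.Linear(16, 64),
    nn.ReLU(),
    nn.Linear(64, 784),
    nn.Tanh(),
)

samples = generator(torch.randn(8, 16))
images = to_unit(samples)  # back in [0, 1], ready for plotting
```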
To prevent the D loss from saturating too much, we can employ label smoothing, replacing the 0/1 labels by soft versions. Improved Techniques for Training GANs proposes to only smooth the positive labels away from 1 (e.g. 0.9), but leave the negative labels as 0.
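One-sided label smoothing could look roughly like this. The function name smoothed_d_loss is hypothetical, and the sketch assumes that D outputs raw logits rather than probabilities.

```python
import torch
import torch.nn.functional as F


def smoothed_d_loss(real_logits, fake_logits, real_label=0.9):
    """Cross-entropy D loss with one-sided label smoothing (sketch)."""
    # Only the positive targets are softened; negatives stay at exactly 0.
    real_targets = torch.full_like(real_logits, real_label)
    fake_targets = torch.zeros_like(fake_logits)
    real_loss = F.binary_cross_entropy_with_logits(real_logits, real_targets)
    fake_loss = F.binary_cross_entropy_with_logits(fake_logits, fake_targets)
    return real_loss + fake_loss
```

With a target of 0.9, D is penalized for becoming extremely confident on real data, which keeps its outputs away from the saturated region.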
We can add small amounts of noise to the data, as well as generated outputs, before putting them into D. This makes the job slightly harder for the latter.
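A minimal sketch of this idea, assuming Gaussian noise and a hypothetical helper add_instance_noise. The standard deviation of 0.05 is an arbitrary placeholder, not a recommended value.

```python
import torch


def add_instance_noise(batch, std=0.05):
    """Return a noisy copy of the batch; apply it to real and generated data alike."""
    return batch + std * torch.randn_like(batch)
```

The noisy copies are only used as input to D; the G output itself stays untouched, so gradients still flow back to G through the addition.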
A popular technique for modern generative models is to use an exponential moving average (EMA) over model parameters. During training, we keep an average of the parameters at each step. After training, the parameters of the final learning step are replaced by the EMA parameters. This is supposed to smooth out the variance introduced by individual training steps, and can reduce issues with models “going in circles”.
This is already implemented in the lgm.common.EMA class.
You just have to create the object and pass it to the Trainer class.
Afterwards, one function call applies the EMA parameters.
It also creates a backup of the original parameters, so in case the EMA parameters don’t perform well, it’s easy to revert.
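To illustrate the mechanism, here is a generic sketch of parameter EMA with a backup. This is not the lgm.common.EMA class; the class name SimpleEMA, its method names, and the decay value are assumptions made for this example.

```python
import torch
from torch import nn


class SimpleEMA:
    """Keeps an exponential moving average of a model's parameters (sketch)."""

    def __init__(self, model, decay=0.999):
        self.model = model
        self.decay = decay
        self.shadow = {
            name: param.detach().clone()
            for name, param in model.named_parameters()
        }
        self.backup = None

    @torch.no_grad()
    def update(self):
        # shadow <- decay * shadow + (1 - decay) * current parameters
        for name, param in self.model.named_parameters():
            self.shadow[name].mul_(self.decay).add_(param, alpha=1.0 - self.decay)

    @torch.no_grad()
    def apply(self):
        # Back up the trained parameters, then overwrite them with the average.
        self.backup = {
            name: param.detach().clone()
            for name, param in self.model.named_parameters()
        }
        for name, param in self.model.named_parameters():
            param.copy_(self.shadow[name])

    @torch.no_grad()
    def restore(self):
        # Revert to the parameters saved by apply().
        for name, param in self.model.named_parameters():
            param.copy_(self.backup[name])
```

In this sketch, update() would be called once after every optimizer step, and apply() once after training.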
There are other possibilities to use for the loss function besides the classical cross-entropy. One is the Least-Squares-GAN (LSGAN). While in principle there is quite a bit of theory as well as different possible setups, the basic idea is just to use the squared error for the D loss. That’s it!
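A sketch of the squared-error variant follows. The function names are hypothetical, D is assumed to return unbounded scores with no sigmoid applied, and the matching G loss is an assumption based on the common LSGAN setup with targets 1 for real and 0 for generated data.

```python
import torch
import torch.nn.functional as F


def lsgan_d_loss(real_out, fake_out):
    """Squared error between D outputs and the 1 (real) / 0 (fake) targets."""
    real_loss = F.mse_loss(real_out, torch.ones_like(real_out))
    fake_loss = F.mse_loss(fake_out, torch.zeros_like(fake_out))
    return real_loss + fake_loss


def lsgan_g_loss(fake_out):
    """G tries to push D's output on generated data towards the 'real' target."""
    return F.mse_loss(fake_out, torch.ones_like(fake_out))
```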
We can help D enforce variety in the generated batches by explicitly computing features over the batch, rather than the typical per-input features.
A simple example is the SDLayer given in the provided code.
This can easily be added into D.
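For intuition, a batch-level feature layer might look like the sketch below. It is not the SDLayer from the provided code; the class name BatchStdFeature and the choice to append one extra channel are assumptions, and the input is assumed to have shape (batch, channels, height, width).

```python
import torch
from torch import nn


class BatchStdFeature(nn.Module):
    """Appends one channel holding the average standard deviation over the batch."""

    def forward(self, x):
        # Standard deviation of every feature across the batch dimension.
        std = x.std(dim=0, unbiased=False)
        # Collapse to a single number describing how varied the batch is.
        mean_std = std.mean().view(1, 1, 1, 1)
        extra = mean_std.expand(x.shape[0], 1, x.shape[2], x.shape[3])
        return torch.cat([x, extra], dim=1)
```

If G produces nearly identical samples, the extra channel is close to zero, which gives D an easy cue to tell such a batch apart from real data. Note that the following layer needs one more input channel.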
This is the most important of all these techniques. The intuition is quite simple. If we had a good G that was producing samples whose overall distribution looks like the data distribution, then the features computed by D in the hidden layers should on average be similar between the real data and samples from G. This means:
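One common way to write this down is given below, where f(x) denotes the hidden features D computes for input x and z is the noise input to G. The squared Euclidean norm is an assumption; other distance measures are possible.

```latex
\mathcal{L}_{\mathrm{FM}}
  = \left\| \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[ f(x) \big]
          - \mathbb{E}_{z \sim p(z)}\big[ f(G(z)) \big] \right\|_2^2
```

In practice, both expectations are estimated by averaging over the current batch.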
In principle, we can just add this so-called feature matching loss to the regular G loss, or even replace the old one completely. This is a very powerful technique that most modern GANs will employ in some way.
There is just one problem:
How do we get the features?
Turns out we must adapt our D model to return hidden layer outputs (i.e. “features”) in addition to its regular classification output.
This can be achieved by a change to our layers module.
We discuss this in the exercise; otherwise there is a usage example in the starter notebook.
Which features to use is completely arbitrary.
We decide to return all block outputs.
Then the difference outlined above is computed for each block output, and everything is summed together.
This ensures that early, middle, and late features from D all give a training signal to G.
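The following sketch shows one way the pieces could fit together. It does not use the course's layers module: the toy FeatureDiscriminator, its layer sizes, and the function feature_matching_loss are hypothetical, and the squared difference of batch means is an assumed choice of distance.

```python
import torch
from torch import nn


class FeatureDiscriminator(nn.Module):
    """Toy D that returns its logits together with every block output."""

    def __init__(self, in_dim=784, hidden=64):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden), nn.LeakyReLU(0.2)),
            nn.Sequential(nn.Linear(hidden, hidden), nn.LeakyReLU(0.2)),
        ])
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        features = []
        for block in self.blocks:
            x = block(x)
            features.append(x)
        return self.head(x), features


def feature_matching_loss(real_features, fake_features):
    """Sum over blocks of the squared difference between batch-mean features."""
    loss = 0.0
    for real, fake in zip(real_features, fake_features):
        # Real features act as fixed targets, so no gradient flows through them.
        diff = real.detach().mean(dim=0) - fake.mean(dim=0)
        loss = loss + diff.pow(2).mean()
    return loss
```

The result can be added to the regular G loss or used in its place. D itself is still trained with its usual classification loss.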
Note that with this setup, it’s quite likely that D achieves pretty much perfect classification, and yet G is still learning fine. We don’t have to balance the “game” between the two networks; we just have to make sure there is a good training signal for G, and this is much simpler with feature matching. Try it out!
All in all, we need to
Ideally, you would add these tricks one-by-one and train a new GAN each time. But this will take quite long. So you may bundle a few of them up and add each “package” at once. That way, you might get 2-4 or so iterations on your model. If you are short on time, the most impactful one should be feature matching, so definitely try that one!
At the end, you should reflect on which methods worked particularly well, and which seemed to make little or no difference. Once again, we have the issue of how to evaluate the models (this will be alleviated next week). For now, you can judge the visual sample quality. Also look out for signs of low mode coverage: Generate a large batch of images, say 100. Can you see different images that look almost or exactly the same?
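Besides looking at the samples, a crude automated check for near-duplicates can help. The sketch below is only a heuristic: the function name count_near_duplicates and the threshold of 0.05 are assumptions, and the threshold would need tuning to the data range.

```python
import torch


def count_near_duplicates(images, threshold=0.05):
    """Count image pairs whose RMS per-pixel difference is below the threshold."""
    flat = images.flatten(start_dim=1)
    # Euclidean distance between all pairs, rescaled to an RMS difference.
    dists = torch.cdist(flat, flat) / flat.shape[1] ** 0.5
    # Only look at each unordered pair once, skipping the diagonal.
    rows, cols = torch.triu_indices(flat.shape[0], flat.shape[0], offset=1)
    return int((dists[rows, cols] < threshold).sum())
```

A large count on a batch of generated images hints at low mode coverage, but a count of zero does not prove good coverage.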
Finally, in the starter notebook there is some code to achieve a kind of “approximate inference” using GANs, even though they lack the encoder network of VAEs. This can be considered more of a bonus.