Assignment 8: Improved GANs
Discussion: June 22nd
This week, we want to try out some advanced architectures. In particular, we
want to focus on the implementation of Wasserstein GANs. Although the derivation of WGANs is mathematically sophisticated, their
implementation is quite simple. Recall
the reading from the
lecture; the papers contain pseudocode algorithms you can use as a reference.
Basic WGAN
For the standard WGAN, the differences are as follows:
- The discriminator uses a linear output activation.
- The loss is a simple difference of outputs between real/fake samples. For the
generator, the loss is flipped.
- The discriminator includes weight clipping. You can do this easily in Keras
using a layer's kernel_constraint and bias_constraint arguments together with
the clip_by_value function; see the sketch at the end of this section.
- The discriminator should be trained for several steps per generator step.
Experiment with hyperparameters, such as the ratio of discriminator/generator
training steps and, most importantly, the weight clipping boundaries. See what
happens when you use either extremely tight bounds or exceedingly loose ones (or
remove clipping altogether). How do the outputs differ in each case?
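A minimal sketch of these pieces in TensorFlow/Keras follows. The helper names
(make_critic, critic_step, generator_step), the layer sizes, and all
hyperparameter values are placeholders for illustration, not a prescribed
architecture; the parts that matter are the clipping constraint, the linear
critic output, the loss signs, and the n_critic loop.

import tensorflow as tf
from tensorflow import keras

class ClipConstraint(keras.constraints.Constraint):
    # Clip every weight of the layer to [-c, c] after each update.
    def __init__(self, c=0.01):
        self.c = c
    def __call__(self, w):
        return tf.clip_by_value(w, -self.c, self.c)
    def get_config(self):
        return {"c": self.c}

def make_critic(clip_value=0.01):
    clip = ClipConstraint(clip_value)
    return keras.Sequential([
        keras.layers.Dense(256, activation="relu",
                           kernel_constraint=clip, bias_constraint=clip),
        keras.layers.Dense(256, activation="relu",
                           kernel_constraint=clip, bias_constraint=clip),
        # Linear output: the critic returns an unbounded score, not a probability.
        keras.layers.Dense(1, kernel_constraint=clip, bias_constraint=clip),
    ])

def critic_step(critic, generator, real_batch, latent_dim, opt):
    z = tf.random.normal([tf.shape(real_batch)[0], latent_dim])
    with tf.GradientTape() as tape:
        fake_batch = generator(z, training=True)
        # Wasserstein critic loss: mean score on fakes minus mean score on reals.
        loss = (tf.reduce_mean(critic(fake_batch, training=True))
                - tf.reduce_mean(critic(real_batch, training=True)))
    grads = tape.gradient(loss, critic.trainable_variables)
    opt.apply_gradients(zip(grads, critic.trainable_variables))
    return loss

def generator_step(critic, generator, batch_size, latent_dim, opt):
    z = tf.random.normal([batch_size, latent_dim])
    with tf.GradientTape() as tape:
        # Generator loss is the flipped critic score on generated samples.
        loss = -tf.reduce_mean(critic(generator(z, training=True), training=True))
    grads = tape.gradient(loss, generator.trainable_variables)
    opt.apply_gradients(zip(grads, generator.trainable_variables))
    return loss

# Training loop: several critic updates per generator update.
# for real_batch in dataset:
#     for _ in range(n_critic):
#         critic_step(critic, generator, real_batch, latent_dim, critic_opt)
#     generator_step(critic, generator, batch_size, latent_dim, generator_opt)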
Improved WGAN
Next, move on to the WGAN-GP formulation. Compared to the above, there are just
two differences:
- You need to implement the gradient penalty (a sketch follows this list). To do this,
- Create an “interpolation batch” between the real and generated batches.
- Run this batch through the critic, and compute gradients of the critic’s
output with respect to the input batch. Note that this means having a
GradientTape
(to compute input gradients) inside another one (to compute
parameter gradients), but this is fine.
- Compute the deviation of the gradients' L2 norm from the desired value of 1,
square it, and add this penalty term (scaled by a coefficient) to the original
critic loss.
- With the gradient penalty, you can (should) remove weight clipping again.
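A sketch of the penalty term in TensorFlow is given below. The function and
argument names (gradient_penalty, lambda_gp) and the assumption of 4-D image
batches (NHWC) are illustrative choices; adapt the shapes to your own data.

import tensorflow as tf

def gradient_penalty(critic, real_batch, fake_batch, lambda_gp=10.0):
    batch_size = tf.shape(real_batch)[0]
    # Interpolation batch: a random point on the line between each real/fake pair.
    eps = tf.random.uniform([batch_size, 1, 1, 1], 0.0, 1.0)
    interpolated = eps * real_batch + (1.0 - eps) * fake_batch
    with tf.GradientTape() as inner_tape:
        inner_tape.watch(interpolated)
        scores = critic(interpolated, training=True)
    # Gradients of the critic's output with respect to its *input*.
    grads = inner_tape.gradient(scores, interpolated)
    norms = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    # Penalize the squared deviation of the gradient norm from 1.
    return lambda_gp * tf.reduce_mean(tf.square(norms - 1.0))

# Called inside the critic step, so the inner tape sits inside the outer one:
# with tf.GradientTape() as outer_tape:
#     loss = (tf.reduce_mean(critic(fake_batch)) - tf.reduce_mean(critic(real_batch))
#             + gradient_penalty(critic, real_batch, fake_batch))
# grads = outer_tape.gradient(loss, critic.trainable_variables)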
Optional: Advanced Architectures
If you have more time, try to implement other architectures from the reading.
In particular,
- Progressive growing should require some extra code, but otherwise be
straightforward. Note that, in the paper, new layers are slowly “faded in”
by interpolating between the previous (smaller) size and the new, larger one
(see the fade-in sketch below).
- StyleGAN requires an additional MLP to compute the vector w from z, as
well as an implementation of adaptive instance normalization (there are probably
plenty of those around the web) to inject the style at each layer; a small
sketch follows below.
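For the fade-in, the following TensorFlow sketch may help; old_block_output
(the upsampled output of the previous, smaller resolution) and new_block_output
are assumed to be tensors of the same shape, and alpha is ramped from 0 to 1
over the course of training.

import tensorflow as tf

def fade_in(alpha, old_block_output, new_block_output):
    # Linear interpolation between the old (smaller) pathway and the new layer.
    return (1.0 - alpha) * old_block_output + alpha * new_block_output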
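For adaptive instance normalization, here is a minimal TensorFlow sketch. The
style MLP mapping z to w and the per-layer affine transform that turns w into
(style_scale, style_bias) are assumed to exist elsewhere; only the normalization
itself is shown, for NHWC feature maps.

import tensorflow as tf

def adain(x, style_scale, style_bias, eps=1e-5):
    # style_scale / style_bias should broadcast against x,
    # e.g. shape [batch, 1, 1, channels].
    # Instance normalization: per-sample, per-channel statistics over H and W.
    mean, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)
    normalized = (x - mean) / tf.sqrt(var + eps)
    # Re-scale and shift with the style-derived statistics.
    return style_scale * normalized + style_bias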