Assignment 8: Improved GANs
Discussion: June 22nd
This week, we want to try out some advanced architectures. In particular, we
want to focus on the implementation of Wasserstein GANs. Although the derivation of WGANs is mathematically sophisticated, their
implementation is quite simple. Recall
the reading from the
lecture; the papers contain pseudocode algorithms you can use as a reference.
Basic WGAN
For the standard WGAN, the differences are as follows:
- The discriminator uses a linear output activation.
- The loss is a simple difference of outputs between real/fake samples. For the
generator, the loss is flipped.
- The discriminator includes weight clipping. You can do this easily in Keras
using a layer's kernel_constraint and bias_constraint arguments together with
the clip_by_value function; see the sketch at the end of this section.
- The discriminator should be trained for several steps per generator step.
Experiment with hyperparameters, such as the ratio of discriminator/generator
training steps and, most importantly, the weight clipping boundaries. See what
happens when you use either extremely tight bounds or exceedingly loose ones (or
remove clipping altogether). How do the outputs differ in each case?
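A minimal sketch of these pieces in TensorFlow/Keras follows. The helper names
(make_critic, critic_step, generator_step), the layer sizes, and all
hyperparameter values are placeholders for illustration, not a prescribed
architecture; the parts that matter are the clipping constraint, the linear
critic output, the loss signs, and the n_critic loop.

import tensorflow as tf
from tensorflow import keras

class ClipConstraint(keras.constraints.Constraint):
    # Clip every weight of the layer to [-c, c] after each update.
    def __init__(self, c=0.01):
        self.c = c
    def __call__(self, w):
        return tf.clip_by_value(w, -self.c, self.c)
    def get_config(self):
        return {"c": self.c}

def make_critic(clip_value=0.01):
    clip = ClipConstraint(clip_value)
    return keras.Sequential([
        keras.layers.Dense(256, activation="relu",
                           kernel_constraint=clip, bias_constraint=clip),
        keras.layers.Dense(256, activation="relu",
                           kernel_constraint=clip, bias_constraint=clip),
        # Linear output: the critic returns an unbounded score, not a probability.
        keras.layers.Dense(1, kernel_constraint=clip, bias_constraint=clip),
    ])

def critic_step(critic, generator, real_batch, latent_dim, opt):
    z = tf.random.normal([tf.shape(real_batch)[0], latent_dim])
    with tf.GradientTape() as tape:
        fake_batch = generator(z, training=True)
        # Wasserstein critic loss: mean score on fakes minus mean score on reals.
        loss = (tf.reduce_mean(critic(fake_batch, training=True))
                - tf.reduce_mean(critic(real_batch, training=True)))
    grads = tape.gradient(loss, critic.trainable_variables)
    opt.apply_gradients(zip(grads, critic.trainable_variables))
    return loss

def generator_step(critic, generator, batch_size, latent_dim, opt):
    z = tf.random.normal([batch_size, latent_dim])
    with tf.GradientTape() as tape:
        # Generator loss is the flipped critic score on generated samples.
        loss = -tf.reduce_mean(critic(generator(z, training=True), training=True))
    grads = tape.gradient(loss, generator.trainable_variables)
    opt.apply_gradients(zip(grads, generator.trainable_variables))
    return loss

# Training loop: several critic updates per generator update.
# for real_batch in dataset:
#     for _ in range(n_critic):
#         critic_step(critic, generator, real_batch, latent_dim, critic_opt)
#     generator_step(critic, generator, batch_size, latent_dim, generator_opt)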
Improved WGAN
Next, move on to the WGAN-GP formulation. Compared to the above, there are just
two differences:
- You need to implement the gradient penalty (a sketch follows this list). To do this,
- Create an “interpolation batch” between the real and generated batches.
- Run this batch through the critic, and compute gradients of the critic’s
output with respect to the input batch. Note that this means having a
GradientTape
(to compute input gradients) inside another one (to compute
parameter gradients), but this is fine.
- Compute the deviation of the gradients' L2 norm from the desired value of 1,
square it, and add this penalty term (scaled by a coefficient) to the original
critic loss.
- With the gradient penalty, you can (should) remove weight clipping again.
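A sketch of the penalty term in TensorFlow is given below. The function and
argument names (gradient_penalty, lambda_gp) and the assumption of 4-D image
batches (NHWC) are illustrative choices; adapt the shapes to your own data.

import tensorflow as tf

def gradient_penalty(critic, real_batch, fake_batch, lambda_gp=10.0):
    batch_size = tf.shape(real_batch)[0]
    # Interpolation batch: a random point on the line between each real/fake pair.
    eps = tf.random.uniform([batch_size, 1, 1, 1], 0.0, 1.0)
    interpolated = eps * real_batch + (1.0 - eps) * fake_batch
    with tf.GradientTape() as inner_tape:
        inner_tape.watch(interpolated)
        scores = critic(interpolated, training=True)
    # Gradients of the critic's output with respect to its *input*.
    grads = inner_tape.gradient(scores, interpolated)
    norms = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    # Penalize the squared deviation of the gradient norm from 1.
    return lambda_gp * tf.reduce_mean(tf.square(norms - 1.0))

# Called inside the critic step, so the inner tape sits inside the outer one:
# with tf.GradientTape() as outer_tape:
#     loss = (tf.reduce_mean(critic(fake_batch)) - tf.reduce_mean(critic(real_batch))
#             + gradient_penalty(critic, real_batch, fake_batch))
# grads = outer_tape.gradient(loss, critic.trainable_variables)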
Optional: Advanced Architectures
If you have more time, try to implement other architectures from the reading.
In particular,
- Progressive growing should require some extra code, but otherwise be
straightforward. Note that, in the paper, new layers are slowly “faded in”
by interpolating between the previous (smaller) size and the new, larger one
(see the fade-in sketch below).
- StyleGAN requires an additional MLP to compute the vector w from z, as
well as an implementation of adaptive instance normalization (there are probably
plenty of those around the web) to inject the style at each layer; a small
sketch follows below.
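For the fade-in, the following TensorFlow sketch may help; old_block_output
(the upsampled output of the previous, smaller resolution) and new_block_output
are assumed to be tensors of the same shape, and alpha is ramped from 0 to 1
over the course of training.

import tensorflow as tf

def fade_in(alpha, old_block_output, new_block_output):
    # Linear interpolation between the old (smaller) pathway and the new layer.
    return (1.0 - alpha) * old_block_output + alpha * new_block_output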
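For adaptive instance normalization, here is a minimal TensorFlow sketch. The
style MLP mapping z to w and the per-layer affine transform that turns w into
(style_scale, style_bias) are assumed to exist elsewhere; only the normalization
itself is shown, for NHWC feature maps.

import tensorflow as tf

def adain(x, style_scale, style_bias, eps=1e-5):
    # style_scale / style_bias should broadcast against x,
    # e.g. shape [batch, 1, 1, channels].
    # Instance normalization: per-sample, per-channel statistics over H and W.
    mean, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)
    normalized = (x - mean) / tf.sqrt(var + eps)
    # Re-scale and shift with the style-derived statistics.
    return style_scale * normalized + style_bias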