Assignment 6/7: GANs
Discussion: June 3rd
Deadline: June 2nd, 23:59
No exercise on May 27th!
Guess what, we’re doing GANs this time.
Note: For this assignment and in general, you can upload more than one notebook
if you prefer, e.g. to avoid one overly long notebook containing a loooong string
of pretty much independent experiments.
Basic Setup
Implementing the basic GAN logic isn’t too difficult. You will likely want to
use low-level training loops (i.e. using GradientTape) because of the non-trivial
control flow. There are many examples around the web that can help you get
started (but note that many are outdated, e.g. using TF 1.x).
For example, there is a DCGAN guide
on the TF website.
- Define two models: One that maps from a noise space to data (generator) and
one that maps from data to a single number (discriminator). Use whichever dataset
you like!
- Train the two models. The simplest setup is to alternate one step of generator
training and one step of discriminator training.
- To train D, get a batch of generated samples from G and a batch of real samples
from the dataset and minimize D’s classification error. Be sure to only
update the parameters of D!
- To train G, generate a batch of samples (in theory you are not supposed to
reuse the one generated before, but many implementations will actually do this)
and maximize D’s classification error. Be sure to only update the parameters of
G!
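To make this concrete, here is a minimal sketch of one such alternating training step in TF 2, assuming G and D are tf.keras.Model instances, g_opt and d_opt are their optimizers and noise_dim is the size of the noise space (all of these names are placeholders, not prescribed by the assignment):

```python
import tensorflow as tf

# Cross-entropy on logits, shared by both loss computations.
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

@tf.function
def train_step(real_batch):
    batch_size = tf.shape(real_batch)[0]

    # Discriminator step: real -> 1, fake -> 0; only D's weights are updated.
    noise = tf.random.normal([batch_size, noise_dim])
    with tf.GradientTape() as tape:
        fake_batch = G(noise, training=True)
        real_logits = D(real_batch, training=True)
        fake_logits = D(fake_batch, training=True)
        d_loss = (bce(tf.ones_like(real_logits), real_logits)
                  + bce(tf.zeros_like(fake_logits), fake_logits))
    d_grads = tape.gradient(d_loss, D.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, D.trainable_variables))

    # Generator step: label a fresh fake batch as "real", so that minimizing the
    # cross-entropy maximizes D's classification error; only G's weights are updated.
    noise = tf.random.normal([batch_size, noise_dim])
    with tf.GradientTape() as tape:
        fake_logits = D(G(noise, training=True), training=True)
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    g_grads = tape.gradient(g_loss, G.trainable_variables)
    g_opt.apply_gradients(zip(g_grads, G.trainable_variables))

    return d_loss, g_loss
```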
This should be enough for a functional training procedure. Train some models and
generate samples for evaluation. They will most likely be terrible.
Take note: Evaluating whether GAN training is progressing/”working” is
difficult. The loss values are not very informative. You will want
to take some samples and plot them every so often while training is progressing
to get an impression of the current state. However, even this can be misleading:
You might run into mode collapse problems early on, which you can take as
evidence that training is not working, and stop the process early. But it can
actually happen that the mode collapse “magically” gets fixed over the
course of a few training iterations, and diverse samples are produced. For this
reason, consider always training for a large number of steps (larger than for e.g.
VAEs) and just seeing what happens.
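One possible way to do this monitoring, assuming an image dataset available as a tf.data.Dataset called dataset (grayscale here, just for simplicity) and the G, noise_dim and train_step placeholders from the sketch above; the plotting interval and step budget are arbitrary:

```python
import matplotlib.pyplot as plt

# A fixed noise batch makes samples from different training steps directly comparable.
fixed_noise = tf.random.normal([16, noise_dim])

for step, real_batch in enumerate(dataset.repeat()):
    d_loss, g_loss = train_step(real_batch)

    if step % 500 == 0:
        samples = G(fixed_noise, training=False)
        fig, axes = plt.subplots(4, 4, figsize=(4, 4))
        for img, ax in zip(samples, axes.flat):
            ax.imshow(img.numpy().squeeze(), cmap="gray")
            ax.axis("off")
        plt.show()

    if step >= 20000:  # train longer than feels necessary before judging results
        break
```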
Improving GAN training
GANs are notoriously difficult to train. In the rest of this assignment, you are
asked to try out various ways to improve the basic procedure. There are countless
advanced GAN variants, but for now you may focus on “tricks” to make the original
formulation more stable. Here are some leads:
- The original GAN paper
proposes (at the end of section 3) to “flip” the generator loss, which should
lead to better gradients early in training.
- This follow-up discusses some techniques
for improved training. Some of these are probably a bit much to implement, but
you could try things like one-sided label smoothing, minibatch discrimination or
making use of class labels (semi-supervised learning).
- In particular, feature matching can stabilize and speed up GAN training
significantly and mitigate mode collapse. Here, you train D as usual,
but G is trained with a loss that tries to match the features D computes for generated
data with those it computes for real data. The paper proposes using “an intermediate layer”
of D to get the features, without specifying further. In practice, you can
get good results by summing this loss over all layers of D, or at least a number
of them (e.g. after every activation function, or after every second one, or…).
A minimal sketch is given after this list.
- DCGAN is a reference architecture for
how to implement GANs with CNNs (though it is outdated by now)
and includes many “tricks” that seem to work well
in practice. In particular, if using Adam as optimizer, have a look at their
hyperparameters (very non-standard).
- ganhacks is a repository of yet more
tricks that might help.
- Don’t be afraid to go big! These models can really benefit from increased capacity
in terms of the number of layers and layer sizes.
- It can take literally thousands or even tens of thousands of training steps
on more complex datasets until results really start “cleaning up”. Be prepared to
wait some time for your final models.
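As announced above, a possible sketch of the feature-matching loss. It assumes D is a tf.keras.Sequential model (so its layers can be applied one at a time); matching per-batch mean activations over all layers, as done here, is just one of several reasonable choices:

```python
def d_features(x):
    # Collect D's intermediate activations for a batch (D assumed to be Sequential).
    feats = []
    h = x
    for layer in D.layers:
        h = layer(h, training=True)
        feats.append(h)
    return feats

def feature_matching_loss(real_batch, fake_batch):
    # Match the batch-mean statistics of every intermediate representation.
    loss = 0.0
    for real_f, fake_f in zip(d_features(real_batch), d_features(fake_batch)):
        loss += tf.reduce_mean(tf.square(
            tf.reduce_mean(real_f, axis=0) - tf.reduce_mean(fake_f, axis=0)))
    return loss
```

In the generator step, this loss (computed inside the GradientTape, with D’s usual loss left unchanged) would replace the cross-entropy term from the basic sketch.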
Include as many of these methods as you want/need into your model and try to
achieve some nice samples!
Part 2: Advanced GAN Architectures
GANs have come a long way since their inception in 2014, and discussing
and/or implementing all significant improvements is not feasible. Below you can
find some more leads for advanced architectures; implement at least one of these,
experiment with the associated parameters and compare with your previous implementation.
In principle you can also mix and match these methods with each other or the
improvements from part 1. Some of these mixes might work well, others not at all…
- Conditional GAN
- Works for datasets that have class information or any other kind of metadata
to condition on.
- Simply add this information (encoded somehow, e.g. as a one-hot class vector
or an embedding) to the generator and discriminator. Now the discriminator can
learn to reject/accept data based on the class, and the generator can generate
data for specific labels.
- Wasserstein GAN
- Use a linear output activation in the discriminator.
- The loss is just the difference between real/fake outputs,
flipped for the generator loss.
- The discriminator includes weight clipping.
You can do this easily in Keras using a layer’s
kernel_constraint and bias_constraint arguments together with the
clip_by_value function (a short example appears at the end of this section).
All layers with weights need to have such constraints!
Be careful with normalization layers such as Batch Normalization, as these
could easily “break” the discriminator by rescaling the outputs to arbitrary ranges.
- Usually the discriminator is trained for several steps each time (e.g. 5-10
discriminator steps per generator step).
- Improved Wasserstein GAN
- Remove the weight clipping from the WGAN.
- Add the gradient penalty as described in the paper:
- Create an “interpolation batch” between the real and generated batches.
- Run this batch through the critic, and compute gradients of the critic’s
output with respect to the input batch.
Note that this means having a
GradientTape (to compute input gradients)
inside another one (to compute parameter gradients), but this is fine.
- Compute the deviation of the norm of these gradients from the desired value of 1,
and add this (squared) term to the original loss function, scaled by some factor
(see the sketch at the end of this section).
- Least Squares GAN
- Simple: Use a linear output activation in the discriminator and a squared
error loss function. Note, however, that the paper discusses different choices
for which numbers to use as labels for real/fake data.
- Spectral Normalization
- Heavy on theory, but to implement this you basically just need to wrap all
layers in D that have weights in a
SpectralNormalization object you can find
in tensorflow_addons.layers. You should also remove all other normalization
layers such as batchnorm from D, and only use “contractive” activation
functions (ReLU and LeakyReLU are fine).
- In principle, you can view this as enforcing a 1-Lipschitz constraint,
instead of only regularizing for it as in the WGAN. However, you can use spectral
normalization with any kind of GAN loss (cross-entropy, least-squares, Wasserstein…).
- Progressive Growing,
StyleGAN or
StyleGAN2 (these will all require
significantly more effort to implement).
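For the WGAN weight clipping mentioned above, a minimal example of attaching a clipping constraint to a single Keras layer (0.01 is the clip value used in the WGAN paper; any callable works as a constraint):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Clip every weight of this layer to [-0.01, 0.01] after each update.
clip = lambda w: tf.clip_by_value(w, -0.01, 0.01)
critic_layer = layers.Dense(128, kernel_constraint=clip, bias_constraint=clip)
```

And the gradient-penalty sketch referenced in the improved-WGAN item, again using the G, D, noise_dim and d_opt placeholders from part 1 and assuming image-shaped data; gp_weight is the penalty coefficient (the paper uses 10):

```python
def gradient_penalty(real_batch, fake_batch):
    batch_size = tf.shape(real_batch)[0]
    # Random interpolates between real and generated samples.
    eps = tf.random.uniform([batch_size, 1, 1, 1], 0.0, 1.0)  # shape assumes image data
    interp = eps * real_batch + (1.0 - eps) * fake_batch
    with tf.GradientTape() as inner_tape:  # inner tape: gradients w.r.t. the inputs
        inner_tape.watch(interp)
        critic_out = D(interp, training=True)
    grads = inner_tape.gradient(critic_out, interp)
    norms = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    return tf.reduce_mean(tf.square(norms - 1.0))

@tf.function
def critic_step(real_batch):
    noise = tf.random.normal([tf.shape(real_batch)[0], noise_dim])
    with tf.GradientTape() as tape:  # outer tape: gradients w.r.t. D's weights
        fake_batch = G(noise, training=True)
        # Wasserstein critic loss: difference between fake and real scores...
        d_loss = (tf.reduce_mean(D(fake_batch, training=True))
                  - tf.reduce_mean(D(real_batch, training=True)))
        # ...plus the gradient penalty, scaled by gp_weight.
        d_loss += gp_weight * gradient_penalty(real_batch, fake_batch)
    d_grads = tape.gradient(d_loss, D.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, D.trainable_variables))
    return d_loss
```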