Assignment 8: Score-based Generative Models

Discussion: June 13th
Deadline: June 12th, 20:00

We continue our mission to implement every single kind of generative model ever invented by tackling score-based generative models. Arguably the simplest approach is to use denoising score matching with multiple noise scales. The ingredients are:

- a noise-conditional score model,
- a denoising score matching loss over multiple noise scales,
- annealed Langevin dynamics for sampling,
- a sensible choice of noise scales.

Let’s look at these components in turn. We will be referring back to these papers:

- Song & Ermon, “Generative Modeling by Estimating Gradients of the Data Distribution” (NeurIPS 2019) – “the original paper” below,
- Song & Ermon, “Improved Techniques for Training Score-Based Generative Models” (NeurIPS 2020) – “the improved paper” below.

Once again, there is a template on Gitlab!

Noise-conditional Score Model

In general, a score model should return the gradient of log p(x) with respect to x, given x. This means the model should take in a data sample and return an output of the same shape. It should also have no output activation, so that it can output any real number.

We cannot expect a single model to work at all noise scales, but training one model per scale is not feasible either. Instead, the model should take a second input: the scalar noise level sigma. Then the model can return different scores depending on the noise level. Recall that you can create multi-input models like this:

model = tf.keras.Model([x_input, noise_input], score_output)

The bigger question is what this conditioning should look like, exactly. The original paper proposes conditional instance normalization, which normalizes the hidden layers to different means and standard deviations depending on sigma. The follow-up proposes something much simpler: simply run the model on x, and then divide the output by sigma. See “Technique 3” in section 3.3 of the improved paper. If you do this, you can even avoid the hassle of multi-input models, as long as you remember to divide the score model output by sigma everywhere you use it.
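With the latter option, the score computation might look like this (a minimal sketch; model is assumed to be any network mapping inputs to same-shaped outputs):

def score_fn(model, x, sigma):
    # Technique 3: the network itself is unconditional; noise-conditioning
    # happens purely by rescaling the output.
    return model(x) / sigma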

A note on the model architecture: Do not use batch normalization! See if you can figure out why this is a bad idea for these models. Recommended replacements are GroupNormalization or InstanceNormalization (the latter is only in the tensorflow_addons package).
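For example (a sketch; GroupNormalization is built into Keras only in recent TF versions):

import tensorflow as tf

inputs = tf.keras.Input(shape=(28, 28, 1))
h = tf.keras.layers.Conv2D(64, 3, padding="same")(inputs)
h = tf.keras.layers.GroupNormalization(groups=8)(h)  # instead of BatchNormalization
# alternatively: tensorflow_addons' tfa.layers.InstanceNormalization()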

Loss function

The loss is relatively simple; you can find it, for example, in equation 2 of the improved paper (section 2.2), or equation 5 of the original (section 4.2). There, x is a data sample, and “x tilde” is a noisy version of that sample. Noisy versions are obtained simply by adding random normal noise with mean 0 and the given standard deviation.

The problem is that we need to sum the loss over all noise scales, and the score network needs to be applied separately to each noisy sample. This is very slow for many noise scales. Because of this, you should sample a noise scale randomly at each training step, and only do the training for that noise scale.

There is also a noise-dependent weighting function that the papers propose to set to sigma**2. In the aforementioned equation 2 in the improved paper, this has already been “integrated” into the loss itself, which is why it looks different from the original.
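Putting the last few paragraphs together, one training step might look roughly like this (a minimal sketch, assuming the sigma-division conditioning from above, NHWC image batches, and a 1-D tensor sigmas of noise levels; with the sigma**2 weighting folded in, the loss becomes 0.5 * ||sigma * score + z||**2 for x_tilde = x + sigma * z):

import tensorflow as tf

def train_step(model, optimizer, x, sigmas):
    # Sample a single noise scale for this step (per-example sampling works too).
    i = tf.random.uniform([], maxval=tf.shape(sigmas)[0], dtype=tf.int32)
    sigma = tf.gather(sigmas, i)
    z = tf.random.normal(tf.shape(x))
    x_tilde = x + sigma * z  # noisy version of the data
    with tf.GradientTape() as tape:
        score = model(x_tilde) / sigma  # Technique-3 conditioning
        # sigma**2-weighted denoising score matching loss:
        loss = 0.5 * tf.reduce_mean(
            tf.reduce_sum(tf.square(sigma * score + z), axis=[1, 2, 3]))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss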

Finally, note that the sampling approach makes the loss somewhat unreliable. It may fluctuate significantly, depending on what noise is sampled at each step, so don’t be too alarmed if it goes up and down a lot.

Annealed Langevin Dynamics

Again, this can be found in the papers, e.g. Algorithm 1 of the improved version.

For efficiency, it’s a good idea to use tf.function. However, wrapping the full sampling loop in a tf.function often doesn’t work well, since long Python loops are unrolled into one huge graph during tracing. Instead, you can offload the per-step update into a separate function, wrap only that with tf.function, and then loop over it in plain Python.
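The per-step function could look like this (a sketch using the improved paper’s update rule and the sigma-division score model from above, here called score_model; pass sigma and alpha as tensors so tf.function doesn’t retrace for every new value):

@tf.function
def langevin_step(x, sigma, alpha):
    # One Langevin update: x <- x + alpha * score + sqrt(2 * alpha) * noise
    z = tf.random.normal(tf.shape(x))
    score = score_model(x) / sigma
    return x + alpha * score + tf.sqrt(2.0 * alpha) * z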

It is typical to choose the overall number of steps on the order of 1000 or so – so you would use 1000 / number_of_noise_scales steps per scale. More may be better.

Note that the algorithms in the first and second paper are slightly different with respect to alpha: the step size in one is twice that in the other. If you are using the recommendations from the improved version, you should also implement the algorithm of that paper!

Finally, pay attention that alpha is scaled using the variances, not the standard deviations. Things like this are easy to get wrong and may screw up your entire implementation!
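Putting it together (a sketch building on langevin_step above; sigmas is assumed to be sorted from largest to smallest, so sigmas[-1] is the smallest scale):

def annealed_langevin(x, sigmas, eps, steps_per_scale):
    sigma_min = sigmas[-1]
    for sigma in sigmas:  # anneal from the largest scale down to the smallest
        # Step size scales with the ratio of VARIANCES, not standard deviations:
        alpha = eps * (sigma / sigma_min) ** 2
        for _ in range(steps_per_scale):
            x = langevin_step(x, tf.constant(sigma), tf.constant(alpha))
    return x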

Choosing Noise Scales

So far, we have not discussed how to choose the noise scales. The original paper does this in an ad-hoc fashion: a geometric sequence of L = 10 scales, ranging from sigma = 1 down to sigma = 0.01.

These values may work for small datasets. In the follow-up, the authors conduct analyses to arrive at more principled choices. These are techniques 1 and 2 in section 3 of the paper. These result in much higher sigma_L and L. For example, for MNIST I arrive at sigma_L ~ 20 and L > 100. The necessary formulas are available in the template notebook.
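Either way, the scales form a geometric sequence, which you can generate like this (a sketch; the endpoint values below are placeholders, the principled ones come out of the notebook’s formulas):

import numpy as np

sigma_max, sigma_min, L = 20.0, 0.01, 150  # placeholder values
sigmas = np.geomspace(sigma_max, sigma_min, L).astype(np.float32)  # largest first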

In principle, your code should not get any more complex whether you have 10 noise scales or 10000! It’s all just loops anyway.

Other Techniques

The improved paper also offers guidance on how to choose the step size epsilon in Langevin sampling. The aforementioned notebook also implements this.

Finally, “Technique 5” states that we should use exponential averages of the learned parameters in inference, instead of using the parameters directly. This functionality is built into TF optimizers in recent versions:
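(A sketch, assuming TF >= 2.11, where Keras optimizers accept the use_ema and ema_momentum arguments.)

optimizer = tf.keras.optimizers.Adam(1e-4, use_ema=True, ema_momentum=0.999)

# ... train as usual, then, before sampling, overwrite the model weights
# with their exponential moving averages:
optimizer.finalize_variable_values(model.trainable_variables)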

Fun Variations on Generation

Score-based models are quite flexible in their applications. A few things you could try are:

Inpainting

The original paper has a variation of Langevin dynamics in appendix B, Algorithm 2. Here, we mask out certain parts of an input and have the model generate the rest. That is, the model has to “fill in the gaps”.
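Roughly, each step does a normal Langevin update and then resets the known pixels to a freshly-noised version of the original (a sketch of that idea, reusing langevin_step from above; m is a binary mask that is 1 where pixels are known, and y is the original image):

def inpainting_step(x, y, m, sigma, alpha):
    x = langevin_step(x, sigma, alpha)
    # Noise the known pixels to the current scale so that the known and the
    # generated regions match in noise level, then overwrite:
    y_noisy = y + sigma * tf.random.normal(tf.shape(y))
    return m * y_noisy + (1.0 - m) * x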

Interpolation

Interpolating between images is possible, but a bit more complicated than e.g. in VAEs or GANs. One method is described in appendix B.2 of the improved paper.

Creating Variations

It’s possible to partially diffuse an image and then reconstruct it from there, yielding a similar but different picture. You can modify Langevin dynamics as follows: instead of initializing from random noise at the largest scale, add noise at some intermediate scale k to a real image, and then run the annealing loop starting from scale k.

This will “reconstruct” the noisy data sample, but since the process is random, the result will be different. How different it is depends on the noise scale you started from: the larger, the more different the result will be.
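In code, this could look as follows (a sketch reusing sigmas and annealed_langevin from above; the start index k is arbitrary):

k = 50  # index of the scale to start from; smaller k = larger scale = more variation
x = image + sigmas[k] * tf.random.normal(tf.shape(image))  # partially "diffuse"
x = annealed_langevin(x, sigmas[k:], eps, steps_per_scale)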