Discussion: June 13th
Deadline: June 12th, 20:00
We continue our mission to implement every single kind of generative model ever invented by tackling score-based generative models. Arguably the simplest approach is to use denoising score matching with multiple noise scales. The ingredients are:
Let’s look at these components in turn. We will be referring back to these papers:
Once again, there is a template on Gitlab!
In general, a score model should return the gradient of log(p(x)), given x.
This means the model should have a structure where it takes in a data sample, and
returns an output with the same shape. It should also have no output activation
so it can output any real number.
We cannot expect the score model to work at all noise scales, but training one model
per scale is not feasible. Instead, the model should take a second input, which is
the scalar noise value sigma. Then the model, can return different scores depending
on the noise level. Recall that you can create multi-input models like this:
model = tf.keras.Model([x_input, noise_input], score_output)
The bigger question is how this conditioning should look like, exactly. The original
paper proses conditional Instance normalization, which normalizes the hidden
layers to different means and standard deviations depending on sigma. The
follow-up proposes something much simpler: Simply run the model on x, and then
divide the output by sigma. See “Technique 3” in section 3.3 of the improved paper.
If you do this, you can even avoid the hassle of multi-input models, as long as
you remember to divide the score model output by sigma everywhere you use it.
A note on the model architecture: Do not use batch normalization! See if you
can figure out why this is a bad idea for these models. Recommended replacements
are GroupNormalization or InstanceNormalization (the latter is only in the tensorflow_addons
package).
The loss is relatively simple, for example, you can find it in equation 2 of the
improved paper (section 2.2), or equation 5 of the original (section 4.2).
There, x is a data sample, and “x tilde” is a
noisy version of that sample. Noisy versions are attained simply by adding
random normal noise with mean 0 and the given standard deviation.
The problem is that we need to sum the loss over all noise scales, and the score network needs to be applied separately to each noisy sample. This is very slow for many noise scales. Because of this, you should sample a noise scale randomly at each training step, and only do the training for that noise scale.
There is also a noise-dependent weighting function that the papers propose to set
to sigma**2. In the aforementioned equation 2 in the improved paper, this has
already been “integrated” into the loss itself, which is why it looks different
from the original.
Finally, note that the sampling approach makes the loss somewhat unreliable. It may fluctuate significantly, depending on what noise is sampled at each step, so don’t be too alarmed if it goes up and down a lot.
Again, this can be found in the papers, e.g. Algorithm 1 of the improved version.
For efficiency, it’s a good idea to use tf.function. However, wrapping loops
in a function often doesn’t work so well. Instead, you can offload the per-step
update into a separate function, only wrap that with tf.function, and then loop
over that.
It is typical to choose the overall number of steps on the order of 1000 or so –
so you would use 1000 / number_of_noise_scales steps per scale. More may be better.
Note the algorithms in the first and second paper are slightly different with
respect to alpha; one uses two times that of the other. If you are using the
recommendations from the improved version, you should also implement the algorithm
of that paper!
Finally, pay attention that alpha is scaled using the variances, not standard
deviations. Things like this are easy to get wrong and may screw up your entire
implementation!
So far, we have not discussed how to choose the noise scales. The original paper does this in an ad-hoc fashion:
sigma_1 = 0.01. This can still be a bit high (i.e. visible to the eye)
in some cases; maybe try 0.001.sigma_L = 1 for image data scaled to [0, 1].L = 10 noise scales, with a geometric progression. Check out np.geomspace
to generate this.These values may work for small datasets. In the follow-up, the authors conduct
analyses to arrive at more principled choices. These are techniques 1 and 2 in
section 3 of the paper. These result in much higher sigma_L and L. For example,
for MNIST I arrive at sigma_L ~ 20 and L > 100. The necessary formulas are
available in the template notebook.
In principle, your code should not get any more complex whether you have 10 noise scales or 10000! It’s all just loops anyway.
The improved paper also offers guidance on how to choose the step size epsilon
in Langevin sampling. The aforementioned notebook also implements this.
Finally, “Technique 5” states that we should use exponential averages of the learned parameters in inference, instead of using the parameters directly. This functionality is built into TF optimizers in recent versions:
use_ema=True to the optimizer constructor.ema_momentum in the constructor to an appropriate value. ;)model.fit(), the variables are automatically overwritten by the averages.optimizer.finalize_variable_values(model.variables).Score-based models are quite flexible in their applications. A few things you could try are:
The original paper has a variation of Langevin dynamics in the appendix B, algorithm 2. Here, we mask out certain parts of an input and have the model generate the rest. That is, the model has to “fill in the gaps”.
Interpolating between images is possible, but a bit more complicated than e.g. in VAEs or GANs. One method is named in appendix B.2 of the improved paper.
It’s possible to partially diffuse an image and then reconstruct it from there to receive a similar, but different picture. You can modify Langevin dynamics as such:
This will “reconstruct” the noisy data sample, but since the process is random, the result will be different. How different will depend on what noise scale you started from. The larger, the more different it will be.