Discussion: July 1st
Deadline: June 30th, 23:59
In this assignment, we will implement the final “class” of generative models treated in this course. Diffusion models have seen a sharp rise in popularity recently: they are stable to train and can produce very high-quality outputs.
See the reference paper here. In what follows, we will list the main steps required for a successful implementation.
First, make sure that you have understood the forward process, then verify your implementation of it:

- Create the noise schedule beta_t. The paper recommends T = 1000 steps and beta_t increasing linearly from 1e-4 to 0.02 (see top of section 4 or appendix B). You can use a linspace to create this directly.
- Compute alpha_t and alpha_bar_t as noted in the paper (note np.cumprod). Be aware that the paper uses the covariance matrix to define normal distributions, but you may need to use the standard deviation instead at some points!
- Apply the forward process to some data and visualize the result every few steps t (say, every 50 or 100 steps or so). You should be able to see how the data is slowly overpowered by noise.

To train the backward process, we need a parametric model that essentially separates the noise from the data. You can use any kind of model that takes an input with the same shape as the data and returns an output of the same shape. Papers tend to use U-net-like models, which are essentially encoder-decoder networks with skip connections linking encoder layers to decoder layers operating at the same resolution. Building such a model should be simple in principle. There is just one complication…
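The forward-process bookkeeping described above (linear beta schedule, cumulative products, closed-form diffusion) can be sketched in numpy as follows; all names here are placeholders and not prescribed by the assignment:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear beta_t schedule from the paper
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # alpha_bar_t = product of alpha_s for s <= t

def diffuse(x0, t, rng=np.random.default_rng()):
    """Sample x_t ~ q(x_t | x_0) in closed form.

    Note: the paper defines the distribution via its *variance*
    (1 - alpha_bar_t) I, so we take a square root here to get a
    standard deviation for sampling.
    """
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return x_t, noise
```

Plotting `diffuse(x0, t)[0]` for a few increasing values of t should show the data being gradually overwhelmed by noise.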
In the mathematical framework of diffusion models, our network has to be able to
return different outputs depending on where we are in the process (step t).
Thus, t needs to be given as input to the model in some shape or form, or the
model has to be otherwise conditioned on it.
Some possible choices were discussed in the
exercise, but the large t we tend to deal with in diffusion models makes many
of these infeasible.
A sinusoidal positional encoding (as used in transformers) works well here:

- Write a function that takes a batch of t values and returns the encoding for each, with a desired number of frequencies for the sinusoids.
- It is recommended to normalize the t values to [0, 1] (simply divide by your chosen maximum T, e.g. 1000).
- The highest frequency should be on the order of T/2; with T=1000, this would be a frequency of 500.
- For example, each t is mapped to a 20-dimensional positional encoding (10 frequencies with a sine and cosine for each).

For a working model, it should suffice to add t only to the input, but note that the paper proposes adding it to each residual block in the model!
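One possible shape for such an encoding function is sketched below; the function name and the geometric frequency spacing are my assumptions, not requirements:

```python
import numpy as np

def timestep_encoding(t, n_freqs=10, T=1000):
    """Sinusoidal encoding of diffusion steps (a sketch).

    Each t is normalized to [0, 1] and mapped to 2 * n_freqs values:
    a sine and a cosine per frequency, with frequencies spaced
    geometrically from 1 up to T / 2.
    """
    t = np.asarray(t, dtype=np.float64) / T          # normalize to [0, 1]
    freqs = np.geomspace(1.0, T / 2.0, n_freqs)      # e.g. 1 ... 500 for T = 1000
    angles = 2.0 * np.pi * t[..., None] * freqs      # shape (..., n_freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
```

With the defaults, a batch of t values of shape (B,) is mapped to encodings of shape (B, 20).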
Training is surprisingly simple; just follow algorithm 1 on page 4 of the paper. For each training step:

- Sample t values; it is recommended to sample a t for each batch element instead of using one t for the whole batch.
- Diffuse the batch to the sampled steps t, reusing your forward-process code.
- Compute the encoding for each t.
- Feed the diffused data and the encoded t into the model, and train it to predict the noise.

You may have trouble doing all these steps in TensorFlow. But note that everything before running the actual model does not need to be backpropagated through. That is, you can do all these steps “outside” the training step, using general Python code, numpy, etc.
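The per-step bookkeeping of algorithm 1 can be sketched without any deep-learning framework; the model below is a stand-in for your real network, and all names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def model(x_t, t_norm):
    """Stand-in for the real noise-prediction network (hypothetical)."""
    return np.zeros_like(x_t)

def training_loss(x0):
    """One step of algorithm 1, up to (but not including) the gradient update.

    Everything before the model call is plain numpy; only the model
    itself needs to live inside the framework's training step.
    """
    t = rng.integers(0, T, size=x0.shape[0])           # one t per batch element
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bars[t][:, None]                        # broadcast over features
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps   # closed-form q(x_t | x_0)
    eps_hat = model(x_t, t / T)                        # condition the model on t
    return np.mean((eps - eps_hat) ** 2)               # simple MSE on the noise
```

Here `t / T` stands in for whatever conditioning you use; in practice you would pass the positional encoding of t instead.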
For sampling, we have to fully implement the “backward process”. This is given in algorithm 2 of the paper.
- Run t steps backwards, starting from random noise (it is important to apply the correct conditioning in the model at each step) and apply the formulas. Be mindful of choosing the correct alpha_t, alpha_bar_t, correctly applying square roots (or not), etc.
- You also need to choose sigma_t; possible choices for this are discussed at the beginning of section 3.2 of the paper, e.g. sigma_t = sqrt(beta_t).

That’s it! Have fun with your model. Besides straight generation, you can also do a few more things with the diffusion process:
- Diffuse a real image for some smaller number of steps t instead of the maximum, and use the diffused image instead of a random sample as the initialization for the backward process. You should get images that look similar to the original, but somewhat different (the more diffusion steps you run, the more different the result will be).
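A sketch of the sampling loop from algorithm 2, including the partial-diffusion initialization just described, might look as follows (using sigma_t = sqrt(beta_t) and a stand-in model; all names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def model(x_t, t):
    """Stand-in for the trained noise predictor (hypothetical)."""
    return np.zeros_like(x_t)

def sample(shape, start_t=T, x_start=None):
    """Run the backward process from step start_t down to 1 (algorithm 2).

    With x_start=None this is plain generation from pure noise; passing a
    partially diffused image together with a smaller start_t gives the
    "similar but different" reconstructions described above.
    """
    x = rng.standard_normal(shape) if x_start is None else x_start
    for t in range(start_t - 1, -1, -1):      # 0-based indices into the schedules
        eps_hat = model(x, t)                 # condition on the current step t
        # Mean of p(x_{t-1} | x_t); note which factors use alpha vs alpha_bar.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                             # no noise is added at the final step
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)  # sigma_t = sqrt(beta_t)
    return x
```

Calling `sample(shape)` generates from scratch; calling `sample(shape, start_t=t, x_start=diffused)` with a diffused real image runs only the last t backward steps.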