Discussion: June 17th
Deadline: June 16th 23:59
In this assignment, we want to implement some simple flow models on toy datasets, as well as attempt to fit simple “real” datasets like MNIST.
The OG flow model is the NICE model from this paper. It is also very simple, making it a good candidate for a first experience with these models. Recall that one of the readings from the lecture gives code examples for implementing flows in TensorFlow Probability. However, here you should implement the model yourself.
But first, a note on terminology: In principle, it doesn’t matter which direction of the flow you call forward or backward, in which direction a function f is applied and in which its inverse, etc. However, it’s easy to get confused here because people use different conventions. I will strictly stick to the convention from the NICE paper, which is:

- forward goes from the data (complex distribution) to the simple distribution. This is the function f (“encoder” if you will).
- backward goes from the simple distribution to the complex one (data). This is f^-1, the inverse of f (“decoder”).

You might want to proceed as follows:
- Implement the model components yourself (tf.keras.layers.Layer might still be useful)!
- Implement the coupling network m. You can use a Keras model for this. It should take a d-dimensional input and also produce a d-dimensional output (what d should be is explained further below). The network itself can be arbitrarily complex (this also means hidden layers can be much larger or smaller).
- The forward coupling layer computes y1 = x1 and y2 = x2 + m(x1), where m is the network defined above. For the backward layer, subtract the shift instead.
- Decide how to split each input into the two parts x1 and x2; d above would be half the input dimensionality. A very simple scheme would be to split in the middle (e.g. tf.split). You could also try other schemes, e.g. all even dimensions go into part 1 and all odd dimensions into part 2, or…
- Note that a single coupling layer passes x1 through unchanged! The usual way to solve this issue is to alternate layers, such that one modifies the first half, the next one the second half, the next one the first half again, etc. There are different ways to implement this.
- One option is to swap the two halves between layers, e.g. via tf.split followed by tf.concat but the other way around. This may be more “elegant”, but it is error-prone.
- Add a scaling layer at the end: create a tf.Variable, one number per data dimension, and simply multiply this with the output at the very end of forward. Accordingly, you need to invert this at the start of backward (via division).
- When stacking multiple coupling layers, take care not to always apply m to x; after the first layer, its input should most certainly be the respective h output of the previous layer.
- Sanity-check that your forward and backward functions are actually inverses of each other. That is, check that the difference between data and backward(forward(data)) (and/or the other way around) is near 0 (small differences due to numerical reasons are okay).

That takes care of the model itself. Once this works, setting up training is very simple!
We train by maximizing the log likelihood of the data, which by the change of variables formula is log(p(data)) = log(p(forward(data))) + log(|det(forward_jacobian(data))|). The ingredients are:
- Computing forward is simple, if you sanity-checked that your implementation works!
- We need a “simple” probability distribution for the “latent” space. In principle, this can be anything that is easy to evaluate. As is often the case, the Normal distribution is a popular choice, although the paper proposes using a logistic distribution instead. You should use tensorflow_probability to create a distribution and use its log_prob function to evaluate probabilities.
- The coupling layers have a Jacobian determinant of 1, meaning that the term disappears completely. If you use a scaling layer at the end (recommended), the determinant is simply prod(scale_factors) (or sum(log(scale_factors)) for the log determinant). See section 3.3 in the paper!

With all this taken care of, your model is ready to train. First, try it on simple toy data. See the notebook for a sample dataset (parabola). Training proceeds as usual, by gradient descent. You can use the negative log likelihood as a loss function and use standard TF optimizers. Feel free to try other toy datasets as well. Make sure you can successfully fit such datasets before moving on! If training fails, there are likely problems with your implementation.
Building a functional NICE model from the ground up is already quite an achievement. If you want to move beyond this, try the following: