Discussion: June 29th
In this assignment, we want to implement some simple flow models on toy datasets, as well as attempt to fit simple “real” datasets like MNIST. Please note that there is a notebook in the gitlab with some starter/example code that could be useful!
The OG flow model is the NICE model from this paper. It is also very simple, making it a good candidate for first experiences with these models. Recall that one of the readings from the lecture gives code examples for how to implement flows in TensorFlow Probability.
But first, a note on terminology: In principle, it doesn’t matter which direction of the flow you call forward or backward, in which direction a function f is applied and in which its inverse, etc. However, it’s easy to get confused here because people use different conventions. I will strictly stick to the convention from the NICE paper, which is:
- forward goes from the data (complex distribution) to the simple distribution. This is the function f (“encoder” if you will).
- backward goes from the simple distribution to the complex (data) distribution. This is f^-1, the inverse of f (“decoder”).
You might want to proceed as follows:
- Implement the coupling layers, e.g. as a plain Python class instead of inheriting from tf.keras.layers.Layer.
- Each coupling layer needs a shift network m. You can use a Keras model for this. It should take a d-dimensional input and also produce a d-dimensional output (what d should be is explained further below). The network itself can be arbitrarily complex (this also means hidden layers can be much larger or smaller).
- The forward computation is y1 = x1 and y2 = x2 + shift(x1), where shift is the network defined above. For the backward pass, subtract the shift instead.
- Split the input into two parts x1 and x2 (e.g. via tf.split). In this case, d above would be half the input dimensionality. You could also try other splitting schemes, e.g. all even dimensions go into part 1 and all odd dimensions into part 2, or…
- Note that each coupling layer passes x1 through unchanged! The usual way to solve this issue is to alternate layers, such that one modifies the first half, the next one the second half, the next one the first half again, etc. There are different ways to implement this.
- One way is to swap the two halves after each coupling layer: use tf.split followed by tf.concat, but put the parts back together the other way around.
- Finally, add a diagonal scaling layer: create a tf.Variable, one number per data dimension, and simply multiply this with the output at the very end of forward. Accordingly, you need to invert this at the start of backward (via division).
- As a sanity check, verify that your forward and backward functions are actually inverses of each other. That is, check that the difference between data and backward(forward(data)) (and/or the other way around) is near 0 (small differences due to numerical reasons are okay). A sketch of one possible implementation, including this check, follows below.
That takes care of the model itself. Once this works, setting up training is very simple!
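To make the steps above concrete, here is one possible sketch (not the official solution) of how the pieces could fit together. All names (make_shift_net, AdditiveCoupling, Nice) are my own choices, and, deviating slightly from the scaling bullet above, the scale is stored in log space, i.e. the output is multiplied by tf.exp(log_scale); a plain multiplicative tf.Variable works just as well.

```python
import tensorflow as tf


def make_shift_net(d, hidden=128):
    # Shift network m: d-dimensional input -> d-dimensional output.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(d),
    ])


class AdditiveCoupling:
    # One additive coupling layer; `swap` controls which half is modified.
    def __init__(self, d, swap=False):
        self.shift = make_shift_net(d)
        self.swap = swap

    def _split(self, x):
        a, b = tf.split(x, 2, axis=-1)
        return (b, a) if self.swap else (a, b)

    def _merge(self, a, b):
        return tf.concat(([b, a] if self.swap else [a, b]), axis=-1)

    def forward(self, x):   # data -> simple distribution
        x1, x2 = self._split(x)
        return self._merge(x1, x2 + self.shift(x1))

    def backward(self, y):  # simple distribution -> data
        y1, y2 = self._split(y)
        return self._merge(y1, y2 - self.shift(y1))


class Nice:
    # Alternating coupling layers plus a final diagonal scaling.
    def __init__(self, dim, n_layers=4):
        self.layers = [AdditiveCoupling(dim // 2, swap=(i % 2 == 1))
                       for i in range(n_layers)]
        self.log_scale = tf.Variable(tf.zeros(dim))  # actual scale is exp(log_scale)

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x * tf.exp(self.log_scale)

    def backward(self, z):
        z = z * tf.exp(-self.log_scale)   # invert the scaling first
        for layer in reversed(self.layers):
            z = layer.backward(z)
        return z

    @property
    def trainable_variables(self):
        variables = [self.log_scale]
        for layer in self.layers:
            variables += layer.shift.trainable_variables
        return variables


# Sanity check: backward(forward(x)) should reproduce x up to numerical error.
model = Nice(dim=2)
x = tf.random.normal((16, 2))
print(tf.reduce_max(tf.abs(x - model.backward(model.forward(x)))))
```

Storing the scale in log space keeps it positive and makes the log-determinant term in the loss below simply the sum of log_scale.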
The training objective is the log likelihood of the data: log(p(data)) = log(p(forward(data))) + log(det(forward_jacobian(data))). The ingredients are:
- Computing forward is simple, if you sanity-checked that your implementation works! Now we need a “simple” probability distribution for the “latent” space. In principle, this can be anything that is easy to evaluate. As is often the case, the Normal distribution is a popular choice. You should use tensorflow_probability to create a distribution and use its log_prob function to evaluate probabilities.
- The additive coupling layers have a determinant of 1 for the Jacobian, meaning that the term disappears completely. If you use a scaling layer at the end (recommended), the log-determinant is simply sum(log(scale_factors)) (or just the sum of the parameters if you store the scale in log space). See section 3.3 in the paper!
With all this taken care of, your model is ready to train; a minimal training sketch follows below. First, try it on simple toy data. See the notebook for a sample dataset (parabola). Training proceeds as usual, by gradient descent. You can use the negative log likelihood as a loss function and use standard TF optimizers. Feel free to try other toy datasets as well. Make sure you can successfully fit such datasets before moving on! If training fails, there are likely problems with your implementation.
Building a functional NICE model from the ground up is already quite an achievement. If you want to move beyond this, try the following: