Discussion: June 6th
Deadline: June 5th, 20:00
In this assignment, we want to implement some simple flow models on toy datasets, as well as attempt to fit simple “real” datasets like MNIST.
NOTE: On Gitlab, you can find assignment07_template.ipynb. This is a full
NICE implementation with just some (key) steps missing. Thus, if you want, you can
approach this by filling in the gaps (look for NotImplementedError, or a SyntaxError caused by ...). This should alleviate the load in terms of code design etc.
If you want, you can of course also do everything from the ground up yourself,
or make other changes to the template as desired.
The OG flow model is the NICE model from this paper. It is also relatively simple, making it a good candidate for first experiences with these models. Recall that in one of the readings from the lecture, code examples for how to implement flows in Tensorflow Probability are given. However, here you should implement the model yourself.
But first, a note on terminology: In principle, it doesn’t matter which direction
of the flow you call forward or backward, in which direction a function f is
applied and in which its inverse, etc. However, it’s easy to get confused here
because people use different conventions. I will strictly stick to the convention
from the NICE paper, which is:
- forward goes from the data (complex distribution) to the simple distribution. This is the function f (“encoder” if you will).
- backward goes from the simple distribution to the complex one (data). This is f^-1, the inverse of f (“decoder”).

You might want to proceed as follows:
Implement a single NICE coupling layer. Note that this is not a single neural network
layer (but inheriting from tf.keras.layers.Layer is still useful)!
Start by implementing the network m. You can use a Keras model for this. This should
take a d-dimensional input and also produce a d-dimensional output (what
d should be is explained further below). The network itself can be arbitrarily
complex (this also means hidden layers can be much larger or smaller than d).
The forward layer should split its input into two parts x1 and x2 and compute y1 = x1 and y2 = x2 + m(x1),
where m is the network defined above. For the backward layer, subtract
the shift instead. Afterwards, put the two parts together again and return the result.
How exactly you split the input is up to you; d above would be half the input dimensionality. A very simple way would be to simply split it in the middle (e.g. tf.split).

The full NICE model simply stacks an arbitrary number of such coupling layers.
Note that a single coupling layer passes x1 through unchanged! The usual way to solve this issue
is to alternate layers, such that one modifies the first half, the next one the
second half, the next one the first half again, etc. There are different ways
to implement this.
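One way the coupling step and the alternation could be sketched is shown below. This is only a sketch; the class name AdditiveCoupling and the parity argument are my own choices, not from the template:

```python
import tensorflow as tf


class AdditiveCoupling(tf.keras.layers.Layer):
    """Additive coupling layer: one half is passed through unchanged,
    the other half is shifted by m(unchanged half)."""

    def __init__(self, m, parity=0):
        super().__init__()
        self.m = m            # network mapping d -> d dimensions (d = half the input)
        self.parity = parity  # 0: modify second half, 1: modify first half

    def _split(self, x):
        h1, h2 = tf.split(x, 2, axis=-1)
        # returns (unchanged part, part to be shifted)
        return (h2, h1) if self.parity else (h1, h2)

    def _join(self, part1, part2):
        # undo _split: put the halves back in their original positions
        return tf.concat([part2, part1] if self.parity else [part1, part2], axis=-1)

    def forward(self, x):
        x1, x2 = self._split(x)
        return self._join(x1, x2 + self.m(x1))  # y1 = x1, y2 = x2 + m(x1)

    def backward(self, y):
        y1, y2 = self._split(y)
        return self._join(y1, y2 - self.m(y1))  # invert by subtracting the shift
```

Alternating the parity from layer to layer ensures that every dimension gets modified eventually.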
One alternative would be to actually swap the two halves between layers, i.e. tf.split followed
by tf.concat, but the other way around. This may be more “elegant”, but is
error-prone. Don’t do this.
Finally, the paper proposes a diagonal scaling layer: Create a tf.Variable, one number per data dimension,
and simply multiply this with the output at the very end of forward.
Accordingly, you need to invert this at the start of backward (via
division).
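A minimal sketch of such a scaling layer (the class name is illustrative; here the scale is stored in log space, which keeps it positive and makes the log-determinant trivial):

```python
import tensorflow as tf


class Scaling(tf.keras.layers.Layer):
    """Diagonal scaling layer: one scale factor per data dimension."""

    def __init__(self, dim):
        super().__init__()
        # log of the scale, initialized to 0 (i.e. scale 1)
        self.log_scale = tf.Variable(tf.zeros(dim))

    def forward(self, x):
        return x * tf.exp(self.log_scale)

    def backward(self, y):
        return y / tf.exp(self.log_scale)

    def log_det(self):
        # log det of the diagonal Jacobian = sum of log scale factors
        return tf.reduce_sum(self.log_scale)
```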
It is a good idea to parameterize the scaling via tf.exp, which makes many computations
(such as taking logarithms) easy – check the paper!
Also note that the paper’s equations apply m to x, but in a stacked model the input should be the respective h
output of the previous layer.

As a first sanity check, you should set up a very simple model on some toy data
and check that the forward and backward functions are actually inverses of
each other. That is, check that the difference between data and
backward(forward(data)) (and/or the other way around) is near 0 (small
differences due to numerical reasons are okay).
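Such a check might look like the following self-contained toy version, where a single hand-rolled coupling step stands in for your full model:

```python
import tensorflow as tf

# stand-in shift network for a 2D input (so each half is 1-dimensional)
m = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu"),
                         tf.keras.layers.Dense(1)])

def forward(x):
    # toy stand-in for your model's forward pass
    x1, x2 = tf.split(x, 2, axis=-1)
    return tf.concat([x1, x2 + m(x1)], axis=-1)

def backward(y):
    y1, y2 = tf.split(y, 2, axis=-1)
    return tf.concat([y1, y2 - m(y1)], axis=-1)

data = tf.random.normal((100, 2))
diff = float(tf.reduce_max(tf.abs(backward(forward(data)) - data)))
print(diff)  # near 0; small float32 error is fine
```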
That takes care of the model itself. Once this works, setting up training is very simple! Recall the change-of-variables formula:
log(p(data)) = log(p(forward(data))) + log(|det(forward_jacobian(data))|). The
ingredients are:
Computing forward is simple, if you sanity-checked that your
implementation works! Now we need a “simple” probability distribution for the
“latent” space. In principle, this can be anything that is easy to evaluate.
As is often the case, the Normal distribution is a popular choice, although
the paper proposes using a logistic distribution instead. You should
use tensorflow_probability to create a distribution and use the log_prob
function to evaluate probabilities.
Coupling layers as defined above have a determinant of 1
for the Jacobian, meaning that the term disappears completely. If you use a
scaling layer at the end (recommended) the determinant is simply
prod(scale_factors) (or sum(log(scale_factors)) for the log determinant). See section 3.3 in the paper!

With all this taken care of, your model is ready to train. First, try it on simple toy data. See the notebook for a sample dataset (parabola). Training proceeds as usual, by gradient descent. You can use the negative log likelihood as a loss function and use standard TF optimizers. Feel free to try other toy datasets as well. Make sure you can successfully fit such datasets before moving on! If training fails, there are likely problems with your implementation.
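Putting the pieces together, a training loop might look like the sketch below. For self-containment it uses a trivial scaling-only "flow" and a hand-coded standard-logistic log-density (which computes the same quantity as tfp.distributions.Logistic(0., 1.).log_prob); in practice you would swap in your NICE model and tensorflow_probability. All names here are illustrative:

```python
import tensorflow as tf


class TinyFlow:
    # trivial scaling-only stand-in for your NICE model
    def __init__(self, dim):
        self.log_scale = tf.Variable(tf.zeros(dim))

    def forward(self, x):
        return x * tf.exp(self.log_scale)

    def log_det(self):
        # log-determinant of the diagonal Jacobian
        return tf.reduce_sum(self.log_scale)


def logistic_log_prob(z):
    # standard-logistic log-density, same as tfp's Logistic(0., 1.).log_prob
    return -z - 2.0 * tf.nn.softplus(-z)


def nll(model, x):
    # negative mean log-likelihood via the change-of-variables formula
    z = model.forward(x)
    log_px = tf.reduce_sum(logistic_log_prob(z), axis=-1) + model.log_det()
    return -tf.reduce_mean(log_px)


model = TinyFlow(2)
variables = [model.log_scale]
opt = tf.keras.optimizers.Adam(0.05)
data = tf.random.normal((256, 2))

loss_before = float(nll(model, data))
for _ in range(200):
    with tf.GradientTape() as tape:
        loss = nll(model, data)
    opt.apply_gradients(zip(tape.gradient(loss, variables), variables))
loss_after = float(nll(model, data))
```

Even this trivial model should reduce the loss, since it can stretch the data to better match the logistic prior.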
Do not expect great results on datasets like MNIST. At the end of the day, this is not such a nice model. Haha.
Here are a few things you can do with your trained model.
See Section 5.2 of the NICE paper. You can fix a part of the input, and have the model generate the rest. This is also possible with e.g. autoregressive models, but they are limited due to their fixed generation order. With flows, any dimensions can be inpainted given any others.
The method simply uses gradient ascent to create an input that maximizes the likelihood, but only optimizing the “unknown” pixels (fixing the known ones).
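A hedged sketch of that procedure, assuming a trained model exposing forward() and log_det() and a 0/1 mask marking the known dimensions (all names are illustrative, and the dummy model only stands in for your trained NICE model):

```python
import tensorflow as tf


class DummyFlow:
    # stand-in for a trained NICE model
    def __init__(self, dim):
        self.log_scale = tf.Variable(tf.zeros(dim))

    def forward(self, x):
        return x * tf.exp(self.log_scale)

    def log_det(self):
        return tf.reduce_sum(self.log_scale)


def inpaint(model, x_known, mask, steps=100, lr=0.1):
    """Gradient-ascend the log-likelihood over the unknown dimensions only."""
    x_free = tf.Variable(tf.random.normal(x_known.shape))
    opt = tf.keras.optimizers.Adam(lr)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            # known part stays fixed; only x_free receives gradients
            x = mask * x_known + (1.0 - mask) * x_free
            z = model.forward(x)
            # standard-logistic log-density of the latent code
            log_pz = tf.reduce_sum(-z - 2.0 * tf.nn.softplus(-z), axis=-1)
            loss = -tf.reduce_mean(log_pz + model.log_det())
        opt.apply_gradients([(tape.gradient(loss, x_free), x_free)])
    return mask * x_known + (1.0 - mask) * x_free


model = DummyFlow(4)
x_known = tf.random.normal((8, 4))
mask = tf.constant([1.0, 1.0, 0.0, 0.0])  # first two dimensions are known
result = inpaint(model, x_known, mask)
```

For MNIST, the mask would mark known pixels instead of toy dimensions, but the optimization is the same.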
A trained flow should be able to detect “atypical” inputs. A simple experiment could go like this: