Assignment 6: Variational Autoencoders & Parzen Windows
Discussion: June 8th
In this assignment, we will implement a variational autoencoder (VAE). Furthermore,
we will have a look at Parzen windows, which are a popular (albeit flawed)
method of evaluating generative models.
Implementing a VAE
From an implementation standpoint, a VAE is pretty much just an autoencoder with
a stochastic encoder and a regularized latent space. As such, you might want to
proceed as follows:
- Build an autoencoder. Use any dataset/model of your choice.
- Add stochasticity in the last encoder layer. With the common choice of a
Gaussian distribution, this just means splitting the layer into two parts, one
of which generates means, the other variances. Then you use these values to take
Gaussian samples. You can use
tf.random.normal
for this – take samples from a standard normal distribution, multiply by
the standard deviation and add the mean (a sketch follows this list). Be careful with the layer that
generates stds/variances; think about what value range these can be in and what
value range your layer returns. That is, choose a sensible activation function!
- Add a regularizer term to the reconstruction loss.
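To make these steps concrete, here is a minimal TensorFlow sketch of the sampling step (the “reparameterization trick”) and the KL regularizer for a Gaussian latent space. All names and values here (mean_head, logvar_head, encoder_net, decoder_net, latent_dim, beta) are illustrative placeholders, not part of the course code; adapt them to whatever autoencoder you built.

```python
import tensorflow as tf

latent_dim = 32   # size of the latent space; pick your own
beta = 1e-3       # weight of the KL regularizer; often needs to be well below 1

# Two heads on top of your existing encoder: one produces means, the other
# log-variances. Predicting log-variances sidesteps the positivity constraint,
# so a plain linear activation is fine here.
mean_head = tf.keras.layers.Dense(latent_dim)
logvar_head = tf.keras.layers.Dense(latent_dim)

def reparameterize(mean, logvar):
    """Draw z ~ N(mean, diag(exp(logvar))) via the reparameterization trick."""
    eps = tf.random.normal(tf.shape(mean))
    return mean + tf.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mean, logvar):
    """KL( N(mean, diag(exp(logvar))) || N(0, I) ), one value per example."""
    return -0.5 * tf.reduce_sum(
        1.0 + logvar - tf.square(mean) - tf.exp(logvar), axis=-1)

# Inside your training step (encoder_net/decoder_net are your own models):
#   h = encoder_net(x)
#   mean, logvar = mean_head(h), logvar_head(h)
#   z = reparameterize(mean, logvar)
#   x_recon = decoder_net(z)
#   recon_loss = ...  # whatever reconstruction loss your autoencoder already uses
#   loss = recon_loss + beta * tf.reduce_mean(kl_to_standard_normal(mean, logvar))
```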
You will likely find many VAE implementations around the web. Feel free to use
these for “inspiration”, but make sure you understand what you are doing!
Depending on the details of the data, loss function etc.,
you might need to scale down the regularizer significantly (by multiplying it
by a number much smaller than 1) to achieve any learning at all. A typical
sign of “overregularization” is that all reconstructions look the same.
Train your VAE and generate some samples, perhaps trying out multiple
architectures and datasets (a short sketch of prior sampling follows the list
below). Think about the following issues:
- In case you run into the aforementioned problem and your VAE refuses to
reconstruct anything, forcing you to tone down the regularizer: why do you think
this happens? Even if you don’t run into this issue, think about why the VAE
regularization might be a particularly troublesome one.
- How can you check whether the regularization was “successful”? Try your method
of choice on your own model(s).
- Compare VAE reconstructions with those of a normal autoencoder. They will
likely be significantly blurrier. Why does this happen? Aside from that, why
is blurriness an issue even in “normal” AEs?
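Regarding sample generation: once training is done, generating new data just means decoding draws from the standard normal prior. A minimal sketch, reusing the hypothetical names from above (the toy decoder_net is only a stand-in for your own trained decoder):

```python
import tensorflow as tf

latent_dim = 32  # must match the latent size you trained with

# Stand-in for your trained decoder; replace this with your own model.
decoder_net = tf.keras.Sequential([
    tf.keras.layers.Dense(28 * 28, activation="sigmoid"),
    tf.keras.layers.Reshape((28, 28)),
])

# New samples are just decoded draws from the prior, z ~ N(0, I).
z = tf.random.normal([16, latent_dim])
generated = decoder_net(z)
```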
Density Estimation Using Parzen Windows
Sometimes you might be wondering how to evaluate a generative model. Evaluating
whatever loss function you trained with might be a start, but it is often not
very informative. Also, this precludes comparison between different frameworks
(e.g. RBMs vs VAEs). Looking at samples is cumbersome and highly subjective.
An objective measure of the “goodness of fit” of the model would be nice. One
attempt at such a measure uses kernel density estimation, in particular the
method of Parzen Windows. Read
this short doc
for a simple explanation of the method. If you prefer a more detailed explanation,
check this one.
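The basic idea is simple: center one kernel on each model sample and average. With a Gaussian kernel of width sigma (the usual choice) and model samples s_1, ..., s_N in d dimensions, the estimated density at a test point x is

$$\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\!\left(-\frac{\lVert x - s_i \rVert^2}{2\sigma^2}\right).$$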
You can use Parzen windows to evaluate your models as follows:
- The quantity we want to estimate is the likelihood of the test set given the
model. The higher this is, the better we assume our model to be.
- Take an arbitrary number of samples from your model (a common number is 10000,
though considerably more is recommended; start with fewer for debugging).
- Choose some kernel. A Gaussian kernel is the standard choice. You can start
with a uniform one (a “hypercube”), but you are likely to get rather useless
results, in particular on MNIST. Why do you think this is?
- For each element of the test set (of whichever dataset you are using), compute
the Parzen estimate of its probability, using the aforementioned samples to
provide the kernel centers. Since the data is very high-dimensional, you will
likely run into numerical issues even when computing a single probability. In
this case, work with log probabilities instead. You should try to implement this
yourself (a sketch is also given at the end of this sheet), but if you get stuck
there might be something in
utils.math
in the
course repo to help you.
- The test set likelihood would be the product of all individual probabilities.
Since this will definitely cause numerical issues, it is customary to compute
the log-likelihood instead by summing over all the log probabilities.
- You will usually need to choose some width parameter for the kernel (e.g. the
variance of the Gaussian kernel). This is usually chosen such that the
log-likelihood is maximized. If you want to do this properly, you should set
aside a separate validation set, choose the variance that maximizes the
likelihood on it, and then report the value on the test set with this variance.
If you’re short on time, just maximize it on the test set directly, but don’t
tell anyone you did this.
- Evaluate at least two different models and compare results. These could be
two VAEs, or an RBM and a VAE, or any other generative model. You could also
construct one of the models to perform badly on purpose and see whether the
Parzen estimate agrees.
- Have a look at this paper and
immediately forget all of this.
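In case you get stuck on the numerics, here is a minimal NumPy sketch of the log-probability and log-likelihood computation described above, assuming a Gaussian kernel. The function names are made up for this sheet; they are not the helpers from the course repo.

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_prob(test_batch, model_samples, sigma):
    """Log Parzen density estimate for each test point, Gaussian kernel.

    test_batch:    (B, d) array of test points
    model_samples: (N, d) array of samples drawn from the model
    sigma:         kernel width (standard deviation)
    """
    n, d = model_samples.shape
    # squared distances between every test point and every model sample, shape (B, N)
    diffs = test_batch[:, None, :] - model_samples[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    # log N(x | s_i, sigma^2 * I) for every pair ...
    log_kernels = -0.5 * sq_dists / sigma ** 2 - 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    # ... averaged over the model samples, in log space
    return logsumexp(log_kernels, axis=1) - np.log(n)

def parzen_log_likelihood(data, model_samples, sigma, batch_size=100):
    """Sum of log probabilities over a whole dataset, computed in batches
    to keep the (B, N, d) difference tensor small."""
    total = 0.0
    for start in range(0, len(data), batch_size):
        total += parzen_log_prob(data[start:start + batch_size],
                                 model_samples, sigma).sum()
    return total

# Bandwidth selection: pick sigma on a validation set, report the test value.
# best_sigma = max(sigmas, key=lambda s: parzen_log_likelihood(valid_set, samples, s))
# test_ll = parzen_log_likelihood(test_set, samples, best_sigma)
```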