Assignment 5: Density Estimation Using Parzen Windows

Discussion: May 20th
Deadline: May 19th, 23:59

You may wonder how to evaluate a generative model. Evaluating the loss function you trained with is a start, but it is often not very informative, and it precludes comparison between different frameworks (e.g. RBMs vs VAEs), whose training losses are not directly comparable. Looking at samples is cumbersome and highly subjective. An objective measure of the “goodness of fit” of the model would be nice. One attempt at such a measure uses kernel density estimation, in particular the method of Parzen windows. Work through this interactive article for a good explanation. Then implement Parzen windows and use the method to evaluate your models as follows:

The quantity we want to estimate is the likelihood of the test set given the model. The higher this is, the better we assume our model to be.
Take an arbitrary number of samples from your model (a common choice is 10,000, though considerably more is recommended; start with fewer while debugging).
Choose some kernel. A Gaussian kernel is the standard choice. You can start with a uniform one (a “hypercube”), but you are likely to get rather useless results, particularly on MNIST. Why do you think this is?
For each element of the test set (of whichever dataset you are using), compute the Parzen estimate of the probability, using the aforementioned samples to provide the kernels. To do this, compute the value of the kernel function (appropriately normalized) between the test set element and each sample, then average those.
- Since the data is very high-dimensional, you will likely run into numerical issues even when computing a single probability. In this case, use log probabilities instead (keep in mind tf.reduce_logsumexp or the log_prob methods of tfp.distributions classes).
- Iterating over samples and computing each kernel function one after the other is very slow. Think about how you can optimize this process. In principle, you can compute the kernel between a test sample and all model samples at the same time, or even the kernel between all test samples and all model samples, through appropriate tensor operations. In practice, however, the latter will take too much memory (generally there is a time-space tradeoff), so you will need to find some middle ground, such as processing the test set in batches.
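The per-example estimate described above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the TensorFlow solution the assignment asks for: the function names log_mean_exp and parzen_log_probs, the batch_size parameter, and the use of a standard deviation sigma as the kernel width are all chosen here for illustration. It processes the test set in batches and stays in log space throughout.

```python
import numpy as np


def log_mean_exp(a, axis):
    """Numerically stable log(mean(exp(a))) along `axis`."""
    a_max = np.max(a, axis=axis, keepdims=True)
    out = a_max + np.log(np.mean(np.exp(a - a_max), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)


def parzen_log_probs(test_x, samples, sigma, batch_size=100):
    """Log Parzen density of each row of test_x.

    Assumes an isotropic Gaussian kernel with standard deviation `sigma`
    centred on each row of `samples` (an illustrative choice).
    """
    n_dims = samples.shape[1]
    # Log of the Gaussian normalising constant (2*pi*sigma^2)^(-d/2).
    log_norm = -0.5 * n_dims * np.log(2.0 * np.pi * sigma ** 2)
    sample_sq = np.sum(samples ** 2, axis=1)  # shape (n_samples,)
    out = []
    for start in range(0, len(test_x), batch_size):
        batch = test_x[start:start + batch_size]
        # Squared distances via ||x||^2 - 2 x.s + ||s||^2,
        # giving a (batch, n_samples) matrix without a 3-D intermediate.
        sq_dist = (np.sum(batch ** 2, axis=1)[:, None]
                   - 2.0 * batch @ samples.T
                   + sample_sq[None, :])
        sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negatives
        log_kernels = log_norm - sq_dist / (2.0 * sigma ** 2)
        # Averaging the kernels corresponds to log-mean-exp in log space.
        out.append(log_mean_exp(log_kernels, axis=1))
    return np.concatenate(out)
```

A larger batch_size trades memory for speed; the same structure should carry over to TensorFlow with tf.reduce_logsumexp in place of the hand-written log_mean_exp.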
Assuming the test examples are independent, the test set likelihood is the product of the individual probabilities of the test set elements. Since this will definitely cause numerical issues, it is customary to compute the log-likelihood instead by summing the log probabilities.
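To see why the product is a problem, consider the toy illustration below. The probability values are hypothetical and chosen only to trigger floating-point underflow; real per-example densities will differ.

```python
import math

# Hypothetical per-example probabilities: 1,000 test points, each 1e-5.
probs = [1e-5] * 1000

# The true product is 1e-5000, far below the smallest representable
# double, so the floating-point result underflows to zero.
product = math.prod(probs)

# Summing the logs keeps the same information in a representable range.
log_likelihood = sum(math.log(p) for p in probs)

print(product)
print(log_likelihood)
```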
You will need to choose some width parameter for the kernel (e.g. variance for the Gaussian kernel). This is usually chosen such that the log-likelihood is maximized. If you want to do this properly, you should put aside a separate validation set, choose the variance that maximizes the likelihood on it, and then report the value on the test set with this variance. If you’re short on time, just try to maximize it on the test set directly, but don’t tell anyone you did this.
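The validation procedure might look like the sketch below. Everything here is an assumption made for illustration: the toy 2-D Gaussian data stands in for model samples and dataset splits, the helper names parzen_log_likelihood and select_sigma are hypothetical, the candidate grid is arbitrary, and the width is parameterized by the standard deviation sigma rather than the variance.

```python
import numpy as np


def parzen_log_likelihood(x, samples, sigma):
    """Sum of log Parzen densities of the rows of x (Gaussian kernel)."""
    d = samples.shape[1]
    diff = x[:, None, :] - samples[None, :, :]  # (n_x, n_samples, d)
    sq_dist = np.sum(diff ** 2, axis=-1)
    log_k = (-0.5 * d * np.log(2 * np.pi * sigma ** 2)
             - sq_dist / (2 * sigma ** 2))
    # Stable log-mean-exp over the model samples.
    m = log_k.max(axis=1, keepdims=True)
    log_p = m[:, 0] + np.log(np.mean(np.exp(log_k - m), axis=1))
    return float(np.sum(log_p))


def select_sigma(valid_x, samples, candidates):
    """Return the candidate width with the highest validation score."""
    scores = [parzen_log_likelihood(valid_x, samples, s) for s in candidates]
    return candidates[int(np.argmax(scores))], scores


# Toy stand-ins for model samples and validation/test splits.
rng = np.random.default_rng(0)
samples = rng.normal(size=(500, 2))
valid_x = rng.normal(size=(100, 2))
test_x = rng.normal(size=(100, 2))

candidates = np.logspace(-2, 1, 10)  # widths from 0.01 to 10
best_sigma, scores = select_sigma(valid_x, samples, candidates)

# Report the test log-likelihood only once, with the selected width.
test_ll = parzen_log_likelihood(test_x, samples, best_sigma)
print(best_sigma, test_ll)
```

For real image data the full (n_x, n_samples, d) difference tensor used here would be far too large, so the pairwise distances would need to be computed in batches.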
Evaluate at least two different models and compare results. These could be two VAEs, or an RBM and a VAE, or any other generative model. You could also construct one of the models to perform badly on purpose and see whether the Parzen estimate agrees.
Have a look at this paper and immediately forget all of this. In case you are short on time, some takeaways are:
- Likelihood and sample quality are more or less independent,
- Parzen windows are very bad at estimating likelihoods,
- In general, the evaluation method needs to be chosen appropriately depending on how the model will be used.