Reading Assignment 5: Evaluating Generative Models

There is no single “correct” way to evaluate generative models. To start, the following articles give an overview of different approaches:

The method of Parzen windows is relatively straightforward, though outdated.
The Inception Score evaluates the class distribution of generated images, as predicted by a pretrained image classifier, and is thus of limited applicability outside of image data with meaningful class labels.
The Fréchet Inception Distance (FID) is currently the most widely used metric. You don’t need to understand the theory in full, but try to grasp the formula for the Gaussian case. Another article, including code, can be found here.
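
For orientation, the Gaussian case of the Fréchet distance between N(mu_1, Sigma_1) and N(mu_2, Sigma_2) is d^2 = ||mu_1 - mu_2||^2 + Tr(Sigma_1 + Sigma_2 - 2 (Sigma_1 Sigma_2)^(1/2)). The following is a minimal NumPy sketch of that formula, not the reference implementation from any of the linked articles. The function names are made up for illustration, and the feature arrays are assumed to have already been extracted by some network (for FID, an Inception model), which is not shown here.

```python
import numpy as np


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Squared Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    sigma1 = np.asarray(sigma1, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)

    # Squared Euclidean distance between the means.
    mean_term = np.sum((mu1 - mu2) ** 2)

    # Tr((sigma1 sigma2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma1 @ sigma2. For covariance matrices these are real
    # and non-negative; small numerical deviations are clipped away.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))

    return float(mean_term + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)


def fid_from_features(feats_real, feats_gen):
    """Fit a Gaussian to each feature set (rows = samples) and compare them."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)
```

Calling fid_from_features on two arrays of shape (num_samples, feature_dim) gives a single non-negative number, with lower values meaning the two fitted Gaussians are closer. Note that the estimate depends on the number of samples, so scores are only comparable when computed with the same sample sizes and the same feature extractor.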

Aside from that, there is a very influential paper that shows how problematic evaluation really is. For example, different metrics can be effectively independent of each other, so a model that does well on one may do poorly on another.

Finally, here is an interesting article on the concept of typicality and how it relates to likelihood. All in all, the latter two articles show that maximum likelihood may not be the ideal goal for training generative models.
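
To get a feeling for typicality, a standard high-dimensional Gaussian is a convenient toy case (chosen here for illustration; the linked article may use different examples). The density is highest at the origin, yet samples essentially never land there: their norms concentrate around sqrt(d). The sketch below, with hypothetical function names, compares the log-density at the mode with that of actual samples.

```python
import numpy as np


def log_density_std_normal(x):
    """Log-density of a standard normal N(0, I) at the rows of x."""
    d = x.shape[-1]
    return -0.5 * np.sum(x ** 2, axis=-1) - 0.5 * d * np.log(2.0 * np.pi)


def typicality_demo(d=1000, n=2000, seed=0):
    """Draw n samples in d dimensions and summarise where they fall."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))

    norms = np.linalg.norm(x, axis=1)                   # distance from the mode
    logp_samples = log_density_std_normal(x)            # density of real samples
    logp_mode = log_density_std_normal(np.zeros(d))     # density at the origin

    return {
        "mean_norm": float(norms.mean()),
        "std_norm": float(norms.std()),
        "max_logp_of_samples": float(logp_samples.max()),
        "logp_at_mode": float(logp_mode),
    }
```

With d = 1000, the mean norm comes out close to sqrt(1000), roughly 31.6, with a small spread, and even the most likely of the drawn samples has a far lower log-density than the origin. The point with the highest likelihood is therefore not a typical sample at all.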