Assignment 8: Autoencoders

Deadline: January 8th, 9am

In this assignment, you will implement various autoencoder architectures on our beloved (Fashion) MNIST data. In particular, you will gain some insight into the problem of training convolutional autoencoders.

Autoencoders in Tensorflow

Building autoencoders in Tensorflow is pretty simple. You need to define an encoding based on the input, a decoding based on the encoding, and a loss function that measures the distance between decoding and input. A common choice for image data is simply the mean squared error. To start off, you could try simple MLPs. Note that you are in no way obligated to choose the “reverse” encoder architecture for your decoder; e.g. you could use a 10-layer MLP as an encoder and a single layer as a decoder if you wanted. Note: The activation function of the last decoder layer is very important, as it needs to be able to cover the range of the input data. Having data in the range [0, 1] allows you to use a sigmoid output activation, for example. Experiment with different activations such as sigmoid, relu or linear (no) activation and see how they affect the model. Your loss function should also “fit” the output function, e.g. a sigmoid output layer goes well with a binary (!) cross-entropy loss.
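As a starting point, the MLP setup described above could look roughly like the following sketch. The layer sizes, the helper name build_mlp_autoencoder and the random stand-in data are assumptions made for illustration only; in the assignment you would use flattened (Fashion) MNIST images scaled to [0, 1].

```python
import numpy as np
import tensorflow as tf

def build_mlp_autoencoder(input_dim=784, h_dim=32):
    """Deliberately asymmetric: two-layer encoder, single-layer decoder."""
    encoder = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(h_dim, activation="relu"),
    ], name="encoder")
    decoder = tf.keras.Sequential([
        tf.keras.Input(shape=(h_dim,)),
        # Sigmoid output matches data in [0, 1].
        tf.keras.layers.Dense(input_dim, activation="sigmoid"),
    ], name="decoder")
    autoencoder = tf.keras.Sequential([encoder, decoder], name="autoencoder")
    # Binary cross-entropy "fits" the sigmoid output; "mse" is the other common choice.
    autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
    return encoder, decoder, autoencoder

# Random stand-in for flattened images in [0, 1] (assumption, not real data).
x = np.random.rand(64, 784).astype("float32")
encoder, decoder, autoencoder = build_mlp_autoencoder()
# The input serves as its own target.
autoencoder.fit(x, x, epochs=1, batch_size=32, verbose=0)
recon = autoencoder.predict(x, verbose=0)
```

Keeping encoder and decoder as separate models makes it easy to inspect codes or decode modified codes later on.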

Convolutional Autoencoders

Next, you should switch to a convolutional encoder/decoder to make use of the fact that we are working with image data. The encoding should simply be one or more convolutional layers, with any filter size and number of filters (you can optionally apply fully-connected layers at the end). As an “inverse” of a Conv2D, Conv2DTranspose is commonly used. However, you could also use UpSampling2D along with regular convolutions. Again, there is no requirement to make the parameters of encoder and decoder “fit”, e.g. you don’t need to use the same filter sizes. However, you need to take care when choosing padding/strides such that the output has the same dimensions as the input. This also means that the last convolutional (transpose) layer should have as many filters as the input has channels (e.g. one filter for MNIST or three for CIFAR).
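The shape bookkeeping is the part that tends to go wrong, so here is one possible sketch in which the spatial size is tracked in the comments. The filter counts, kernel sizes and the helper name build_conv_autoencoder are arbitrary assumptions; the decoder mixes Conv2DTranspose and UpSampling2D only to show both options.

```python
import numpy as np
import tensorflow as tf

def build_conv_autoencoder(input_shape=(28, 28, 1)):
    layers = tf.keras.layers
    return tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        # Encoder: strided convolutions halve the spatial size.
        layers.Conv2D(16, 3, strides=2, padding="same", activation="relu"),  # 28 -> 14
        layers.Conv2D(8, 3, strides=2, padding="same", activation="relu"),   # 14 -> 7
        # Decoder, option 1: transposed convolution doubles the size.
        layers.Conv2DTranspose(8, 3, strides=2, padding="same", activation="relu"),  # 7 -> 14
        # Decoder, option 2: upsampling followed by a regular convolution.
        layers.UpSampling2D(2),  # 14 -> 28
        # One filter per input channel, with a different kernel size than the encoder.
        layers.Conv2D(input_shape[-1], 5, padding="same", activation="sigmoid"),
    ], name="conv_autoencoder")

model = build_conv_autoencoder()
# Zero batch used only to check that output and input shapes agree.
out = model(np.zeros((2, 28, 28, 1), dtype="float32"))
```

Calling model.summary() is a quick way to check the output shape of every layer when you change strides or padding.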

What do Autoencoders Learn?

Keep in mind that the reconstruction loss is not a good proxy for the “quality” of an autoencoder; instead you need to get an impression of what the model learned about the input. Note that if you use a single-layer (fully-connected) decoder, its weight matrix will be h_dim x 784 and each of the h_dim rows can be reshaped to 28x28 to get an impression of what kind of image the respective hidden dimension represents. The same holds for the encoder of course, which in the single-layer case will have a 784 x h_dim weight matrix. You should visualize some of your model’s filters to see what it learns.
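A minimal sketch of the reshaping step, using NumPy only: each row of an h_dim x 784 matrix becomes a 28x28 tile in one large image. The helper name weight_rows_to_grid and the random matrix are assumptions for illustration; with a Keras Dense decoder layer the matrix would typically come from layer.get_weights()[0], and an encoder matrix of shape 784 x h_dim would need to be transposed first.

```python
import numpy as np

def weight_rows_to_grid(weights, n_cols=8, side=28):
    """Tile each row of an (h_dim, side*side) matrix as a side x side image."""
    h_dim = weights.shape[0]
    n_rows = -(-h_dim // n_cols)  # ceiling division
    grid = np.zeros((n_rows * side, n_cols * side), dtype=weights.dtype)
    for i in range(h_dim):
        tile = weights[i].reshape(side, side)
        # Rescale every tile to [0, 1] so tiles with small weights stay visible.
        span = tile.max() - tile.min()
        tile = (tile - tile.min()) / span if span > 0 else np.zeros_like(tile)
        r, c = divmod(i, n_cols)
        grid[r * side:(r + 1) * side, c * side:(c + 1) * side] = tile
    return grid

# Random stand-in for a trained decoder weight matrix with h_dim = 20.
W_dec = np.random.randn(20, 784).astype("float32")
grid = weight_rows_to_grid(W_dec)  # 3 rows of tiles, 8 tiles per row
```

The resulting array can be shown with any image viewer, e.g. matplotlib's imshow with a grayscale colormap.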

Another way to interpret what a model has learned is by manipulating the code space. This is particularly useful for deeper models where the above method is of limited use. The idea is as follows: Take an image and encode it. Pick a single encoding dimension (you could also pick multiple ones of course) and change the value, then decode and inspect the resulting image. By attempting many different values for the dimension (we may call this “walking the code space”), we might be able to interpret what this particular variable encodes. For example, gradually changing the value in one dimension might change the stroke thickness in MNIST digits. Try this out for a few dimensions of a trained model’s code and see whether you can interpret their “meaning”. This will likely work better for models with small and/or sparse encodings.
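The procedure could be sketched as follows. The function walk_code_dimension is a hypothetical helper that accepts any pair of encode/decode callables (for example two Keras models); the random linear maps below are stand-ins so that the snippet runs without a trained model.

```python
import numpy as np

def walk_code_dimension(encode, decode, image, dim, values):
    """Decode copies of one image's code that differ only in dimension `dim`."""
    code = np.asarray(encode(image[None, ...]))   # shape (1, h_dim)
    codes = np.repeat(code, len(values), axis=0)  # one copy per value
    codes[:, dim] = values                        # overwrite a single dimension
    return np.asarray(decode(codes))

# Stand-in "model" (assumption): random linear encoder, sigmoid decoder.
rng = np.random.default_rng(0)
W_enc = 0.01 * rng.normal(size=(784, 16))
W_dec = rng.normal(size=(16, 784))
encode = lambda x: x @ W_enc
decode = lambda z: 1.0 / (1.0 + np.exp(-(z @ W_dec)))

image = rng.random(784)
values = np.linspace(-3.0, 3.0, 7)
decoded = walk_code_dimension(encode, decode, image, dim=0, values=values)
```

Plotting the rows of decoded side by side (reshaped to 28x28) shows how the reconstruction changes along the walk. A sensible range for values depends on the model, so it is worth checking which values the chosen dimension actually takes on real data.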

Unsupervised Pretraining

Autoencoders are useful in that they can learn from unlabeled data. This can significantly improve performance in settings where large amounts of data are available, but only few labels. We can artificially create such a situation by just “pretending” that part of our data has no labels. Try this: