Deadline: January 4th, 11am
In this assignment, you will implement various autoencoder architectures on our beloved (Fashion) MNIST data. In particular, you will gain some insight into the problem of training convolutional autoencoders.
Building autoencoders in Tensorflow is pretty simple. You need to define an encoding based on the input, a decoding based on the encoding, and a loss function that measures the distance between decoding and input. An obvious choice may be simply the mean squared error (but see below). To start off, you could try simple MLPs. Note that you are in no way obligated to make the decoder the “reverse” of the encoder architecture; e.g. you could use a 10-layer MLP as an encoder and a single-layer decoder if you wanted. As a start, you can/should opt for an “undercomplete” architecture where the encoding is smaller than the data.
Note: The activation function of the last decoder layer is very important, as it needs to be able to cover the input data range. Having data in the range [0, 1] allows you to use a sigmoid output activation, for example. Experiment with different activations such as sigmoid, relu or linear (no) activation and see how they affect the model. Your loss function should also “fit” the output activation, e.g. a sigmoid output layer goes well with a binary (!) cross-entropy loss.
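To make this concrete, here is a minimal sketch of such an undercomplete MLP autoencoder in Keras. The layer sizes and the 32-dimensional code are arbitrary choices for illustration, not requirements:

```python
import tensorflow as tf
from tensorflow import keras

# Load and scale the data to [0, 1] so a sigmoid output can match it.
(x_train, _), (x_test, _) = keras.datasets.fashion_mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

h_dim = 32  # size of the (undercomplete) code

encoder = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(h_dim, activation="relu"),
])
decoder = keras.Sequential([
    keras.Input(shape=(h_dim,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(784, activation="sigmoid"),  # matches the [0, 1] input range
])
autoencoder = keras.Sequential([encoder, decoder])

# A sigmoid output pairs with binary cross-entropy; the "targets" are the inputs themselves.
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.fit(x_train, x_train, epochs=10, batch_size=128,
                validation_data=(x_test, x_test))
```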
Next, you should switch to a convolutional encoder/decoder to make use of the fact that we are working with image data. The encoder should simply be one or more convolutional layers, with any filter size and number of filters (you can optionally apply fully-connected layers at the end). As an “inverse” of a Conv2D, Conv2DTranspose is commonly used. However, you could also use UpSampling2D along with regular convolutions. Again, there is no requirement to make the parameters of encoder and decoder “fit”, e.g. you don’t need to use the same filter sizes. However, you need to take care when choosing padding/strides such that the output has the same dimensions as the input. This can be a problem with MNIST (why?). It also means that the last convolutional (transpose) layer should have as many filters as the input has channels (e.g. one filter for MNIST or three for CIFAR).
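For illustration, one convolutional architecture that gets the dimensions right for 28x28 inputs might look like this (all filter counts and kernel sizes here are just one possible choice):

```python
import tensorflow as tf
from tensorflow import keras

conv_autoencoder = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    # Encoder: 28x28x1 -> 14x14x16 -> 7x7x32
    keras.layers.Conv2D(16, 3, strides=2, padding="same", activation="relu"),
    keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),
    # Decoder: 7x7x32 -> 14x14x16 -> 28x28x1 (one filter = one output channel)
    keras.layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu"),
    keras.layers.Conv2DTranspose(1, 3, strides=2, padding="same", activation="sigmoid"),
])
conv_autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```

Here, two stride-2 layers take 28x28 down to 14x14 and then 7x7, and two stride-2 Conv2DTranspose layers map back up to exactly 28x28; adding a third stride-2 stage is where the dimension bookkeeping starts to get tricky.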
Keep in mind that the reconstruction loss is not a good proxy for the “quality” of an autoencoder; instead, you need to get an impression of what the model has learned about the input.
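A quick way to get such an impression is to plot a few test images next to their reconstructions, e.g. (a sketch, assuming the MLP autoencoder and the flattened x_test from above):

```python
import matplotlib.pyplot as plt

n = 8
reconstructions = autoencoder.predict(x_test[:n])

fig, axes = plt.subplots(2, n, figsize=(2 * n, 4))
for i in range(n):
    axes[0, i].imshow(x_test[i].reshape(28, 28), cmap="gray")          # original
    axes[1, i].imshow(reconstructions[i].reshape(28, 28), cmap="gray") # reconstruction
    axes[0, i].axis("off")
    axes[1, i].axis("off")
plt.show()
```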
Note that if you use a single-layer (fully-connected) decoder, its weight matrix will be h_dim x 784, and each of the h_dim rows can be reshaped to 28x28 to get an impression of what kind of image the respective hidden dimension represents. The same holds for the encoder, of course, which in the single-layer case will have a 784 x h_dim weight matrix. You should visualize some of your model’s filters to see what it learns.
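Such a visualization could be sketched as follows for a hypothetical single-layer decoder (called decoder_layer here, a trained Dense(784) layer; the 4x8 grid assumes h_dim is at least 32):

```python
import matplotlib.pyplot as plt

# For a single Dense(784) decoder layer, the kernel has shape (h_dim, 784),
# so each row is one hidden dimension's "image".
weights = decoder_layer.get_weights()[0]  # hypothetical single-layer decoder

fig, axes = plt.subplots(4, 8, figsize=(12, 6))
for i, ax in enumerate(axes.flat):
    ax.imshow(weights[i].reshape(28, 28), cmap="gray")
    ax.axis("off")
plt.show()
```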
Another way to interpret what a model has learned is by manipulating the code space. This is particularly useful for deeper models where the above method is of limited use. The idea is as follows: Take an image and encode it. Pick a single encoding dimension (you could also pick multiple ones of course) and change the value, then decode and inspect the resulting image. By attempting many different values for the dimension (we may call this “walking the code space”), we might be able to interpret what this particular variable encodes. For example, gradually changing the value in one dimension might change the stroke thickness in MNIST digits. Try this out for a few dimensions of a trained model’s code and see whether you can interpret their “meaning”. This will likely work better for models with small and/or sparse encodings.
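A minimal sketch of such a code-space walk, assuming the separate encoder/decoder models from the MLP sketch above (the chosen dimension and value range are arbitrary; adapt them to the actual scale of your model’s codes):

```python
import numpy as np
import matplotlib.pyplot as plt

code = encoder.predict(x_test[:1])   # encode a single image
dim = 5                              # the dimension to "walk" (arbitrary choice)
values = np.linspace(0.0, 3.0, 9)    # range chosen for a relu code; adapt as needed

fig, axes = plt.subplots(1, len(values), figsize=(18, 2))
for ax, v in zip(axes, values):
    modified = code.copy()
    modified[0, dim] = v                      # overwrite one code dimension
    img = decoder.predict(modified)[0]
    ax.imshow(img.reshape(28, 28), cmap="gray")
    ax.set_title(f"{v:.1f}")
    ax.axis("off")
plt.show()
```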
Autoencoders are useful in that they can learn from unlabeled data. This can significantly improve performance in settings where large amounts of data are available, but few labels. We can artificially evoke such a situation by just “pretending” that part of our data has no labels. Try this: Train your autoencoder on the full training set (ignoring the labels), then train a classifier on top of the frozen encoder using only a small labeled subset, and compare it to a classifier trained from scratch on that same subset.
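Such an experiment could be sketched as follows, assuming the trained encoder from above and (hypothetically) only 1000 labeled examples:

```python
from tensorflow import keras

(x_train, y_train), _ = keras.datasets.fashion_mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

# Pretend only 1000 examples have labels.
n_labeled = 1000
x_small, y_small = x_train[:n_labeled], y_train[:n_labeled]

# Classifier on top of the frozen, pre-trained encoder.
encoder.trainable = False
clf = keras.Sequential([encoder, keras.layers.Dense(10, activation="softmax")])
clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
clf.fit(x_small, y_small, epochs=20, batch_size=64)

# Baseline: a same-sized classifier trained from scratch on the labeled subset.
baseline = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
baseline.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
baseline.fit(x_small, y_small, epochs=20, batch_size=64)
```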
Note: This will only “work” if your autoencoder has learned the “correct” level of abstraction of the data. Else, its features might not actually be useful for classification. Appropriate features are more likely to emerge with proper regularization on the latent space.
There are many other things to try here. For example, you could try different kinds of data, including non-image data. You have already worked with text, although it is rarely used in an autoencoder context. Another option could be audio data, e.g. the Tensorflow Speech Commands dataset. This is discussed here in a classification context, but the sections on preprocessing should be helpful.
Aside from that, you may try different kinds of autoencoders from the book.
Another interesting approach is offered by winner-take-all autoencoders, which use an extremely sparse encoding mechanism to learn very complex filters.
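As a rough sketch of the spatial-sparsity idea behind such models (not the full method from the paper), one could zero out all but the largest activation in each feature map with a custom layer:

```python
import tensorflow as tf
from tensorflow import keras

class SpatialWinnerTakeAll(keras.layers.Layer):
    """Keeps only the largest activation per feature map, zeroing the rest.

    A simplified sketch of the "spatial sparsity" idea from winner-take-all
    autoencoders; ties keep more than one winner, which is fine for a sketch.
    """
    def call(self, x):
        # x has shape (batch, height, width, channels)
        winners = tf.reduce_max(x, axis=[1, 2], keepdims=True)  # per-map maximum
        mask = tf.cast(tf.equal(x, winners), x.dtype)            # 1 only at the winner
        return x * mask

wta_autoencoder = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(64, 5, padding="same", activation="relu"),
    SpatialWinnerTakeAll(),
    # A large kernel lets a single surviving activation "paint" a whole patch.
    keras.layers.Conv2DTranspose(1, 11, padding="same", activation="sigmoid"),
])
wta_autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```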