In this assignment, you will implement various autoencoder architectures on our beloved (Fashion) MNIST data. In particular, you will gain some insight into the problem of training convolutional autoencoders.
Building autoencoders in TensorFlow is pretty simple. You need to define an encoding based on the input, a decoding based on the encoding, and a loss function that measures the distance between decoding and input. A common choice for image data is simply the mean squared error. To start off, you could try simple MLPs. Note that your decoder is in no way obligated to be the “reverse” of your encoder architecture; e.g. you could use a 10-layer MLP as an encoder and a single layer as a decoder if you wanted. Note: The activation function of the last decoder layer is very important, as it needs to be able to cover the input data range. Having data in the range [0, 1] allows you to use a sigmoid output activation, for example. Experiment with different activations such as sigmoid, ReLU or linear (i.e. no) activation and see how this affects the model.
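For illustration, a minimal MLP autoencoder sketch in the TF 1.x layers API (the layer sizes, the code size of 32, and the learning rate are arbitrary choices, not requirements):

    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 784])        # flattened 28x28 images, scaled to [0, 1]

    # encoder: 784 -> 128 -> 32
    hidden = tf.layers.dense(x, 128, activation=tf.nn.relu)
    code = tf.layers.dense(hidden, 32)

    # decoder: 32 -> 784; sigmoid output to match the [0, 1] input range
    reconstruction = tf.layers.dense(code, 784, activation=tf.nn.sigmoid)

    loss = tf.losses.mean_squared_error(labels=x, predictions=reconstruction)
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)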
Next, you should switch to a convolutional encoder/decoder to make use of the
fact that we are working with image data. The encoding should simply be one or
more convolutional layers, with any filter size and number of filters (you can
optionally apply fully-connected layers at the end). The “inverse” of a
tf.layers.conv2d is tf.layers.conv2d_transpose. Again, there is no
requirement to make the parameters of encoder and decoder “fit”, e.g. you don’t
need to use the same filter sizes. As long as you use “same” padding and unit
strides, there should be no problems with input/output shapes mismatching. The
last convolutional transpose should have only one filter to go back to what is
basically black-and-white image space (one channel, like the MNIST input).
Note that pooling cannot be reverted that easily, so you should not use it here. Instead, go for all-convolutional networks that exclusively employ (strided) convolutions; these are easy to invert via transposed convolutions with a matching stride. You can also try a fully convolutional network without strided convolutions (i.e. unit strides everywhere). Visualize the learned filters and feature maps. What behavior do you observe, and why do you think this is happening?
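For reference, a rough all-convolutional sketch with stride-2 convolutions in the encoder and matching transposed convolutions in the decoder (the filter counts, kernel sizes and learning rate are arbitrary choices):

    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 28, 28, 1])   # MNIST images, one channel

    # encoder: two stride-2 convolutions, 28x28 -> 14x14 -> 7x7
    h = tf.layers.conv2d(x, filters=16, kernel_size=3, strides=2,
                         padding='same', activation=tf.nn.relu)
    code = tf.layers.conv2d(h, filters=32, kernel_size=3, strides=2,
                            padding='same', activation=tf.nn.relu)

    # decoder: transposed convolutions with the same strides, 7x7 -> 14x14 -> 28x28
    d = tf.layers.conv2d_transpose(code, filters=16, kernel_size=3, strides=2,
                                   padding='same', activation=tf.nn.relu)
    reconstruction = tf.layers.conv2d_transpose(d, filters=1, kernel_size=3, strides=2,
                                                padding='same', activation=tf.nn.sigmoid)

    loss = tf.losses.mean_squared_error(labels=x, predictions=reconstruction)
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)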
Keep in mind that the cost is not a good proxy for the “quality” of an
autoencoder; instead you need to get an impression of what the model learned
about the input.
Note that if you use a single-layer (fully-connected) decoder, its weight
matrix will be h_dim x 784 and each of the h_dim rows can be reshaped to
28x28 to get an impression of what kind of image the respective hidden
dimension represents.
The same holds for the encoder of course, which in the single-layer case will
have a 784 x h_dim weight matrix. You should visualize some of your model’s
filters to see what it learns.
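One possible way to do this, assuming a trained single-layer encoder and an open tf.Session named sess (the way of fetching the kernel below is just one convenient option):

    import tensorflow as tf
    import matplotlib.pyplot as plt

    # pick the encoder's kernel; for a single dense layer this has shape (784, h_dim)
    kernel = [v for v in tf.trainable_variables() if 'kernel' in v.name][0]
    w = sess.run(kernel)

    # each column belongs to one code dimension; reshape it back into a 28x28 image
    for i in range(10):
        plt.subplot(1, 10, i + 1)
        plt.imshow(w[:, i].reshape(28, 28), cmap='gray')
        plt.axis('off')
    plt.show()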
Another way to interpret what a model has learned is by manipulating the code space. This is particularly useful for deeper models, where the above method is of limited use. The idea is as follows: Take an image and encode it. Pick a single encoding dimension (you could of course also pick multiple ones) and change its value, then decode and inspect the resulting image. By trying many different values for that dimension (we may call this “walking the code space”), we might be able to interpret what this particular variable encodes. For example, gradually changing the value in one dimension might change the stroke thickness in MNIST digits. Try this out for a few dimensions of a trained model’s code and see whether you can interpret their “meaning”.
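A minimal sketch of such a code-space walk, assuming a flat code and tensors named x, code and reconstruction as in the MLP sketch above, an open session sess, and an example image some_image of shape (784,) (all of these names are stand-ins for your own):

    import numpy as np

    # encode a single image
    base_code = sess.run(code, feed_dict={x: some_image[None, :]})

    # sweep one (arbitrarily chosen) code dimension and decode each modified code;
    # feeding the `code` tensor directly makes the graph skip the encoder
    dim = 5
    decoded = []
    for value in np.linspace(-3.0, 3.0, num=9):
        modified = base_code.copy()
        modified[0, dim] = value
        decoded.append(sess.run(reconstruction, feed_dict={code: modified}))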
As you should have seen, unregularized convolutional autoencoders don’t learn interesting filters. One way of regularizing is by enforcing extremely sparse activations. Read the paper on Winner-Take-All Autoencoders (an understanding of the method is enough, no need to read e.g. the experiments in detail) and implement this scheme. You only need to implement spatial sparsity for convolutional networks; consider lifetime sparsity “advanced optional”. However, it might actually be easier to start with implementing fully-connected WTA autoencoders with lifetime sparsity, and you may be able to reuse this functionality for the convolutional variant. Implementing any of these parts requires some careful manipulation of tensors and indices and is a good exercise in and of itself. Some hints:
- It can help to reshape the feature maps (e.g. from batch x filters x w x h to batch x filters x w*h) and then use tf.reduce_max/tf.argmax over the last dimension to get the maximum value and its location for each input and feature map.
- tf.nn.top_k should be interesting. Note that WTA autoencoders are defined in terms of a sparsity proportion, whereas top_k takes an integer k, so you will need to convert between the two.
- tf.scatter_nd allows you to fill a mostly-zero tensor with a selection of values at specific indices. This function can be hard to understand in its full generality, and building appropriate index arrays requires some careful tinkering. It is probably easiest to restrict the use of scatter to filling up 1D arrays, and use tf.unstack and tf.stack to take apart/build up higher-dimensional arrays.

Experiment with different sparsity settings and see how the learned filters are affected. If you want to go the extra mile, you could even compare the learned filters and/or embedding spaces of your autoencoders with those of a comparable classification model (i.e. the encoder followed by a softmax classification layer), or use a trained encoder to initialize the weights of such a model.
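To make the spatial-sparsity part concrete, here is one possible sketch. It uses a max-based mask instead of the reshape/scatter route from the hints above, assumes channels-last feature maps as produced by tf.layers.conv2d, and would keep more than one value per map in the (rare) case of ties:

    import tensorflow as tf

    def spatial_sparsity(feature_maps):
        # feature_maps: (batch, height, width, filters). Keep only the single largest
        # activation per sample and feature map; set everything else to zero.
        winners = tf.reduce_max(feature_maps, axis=[1, 2], keepdims=True)
        mask = tf.cast(tf.equal(feature_maps, winners), feature_maps.dtype)
        return feature_maps * mask

    # applied between encoder and decoder, e.g. with the tensors from the sketch above:
    # sparse_code = spatial_sparsity(code)
    # reconstruction = tf.layers.conv2d_transpose(sparse_code, ...)

For lifetime sparsity, the sparsity proportion translates into the integer k for tf.nn.top_k roughly as k = max(1, int(round(proportion * batch_size))), since the top k activations of each feature map across the mini-batch are kept.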