Assignment 5: Chairs!

Discussion: May 25th

In this assignment, we will have a first look at “differentiable generator networks” in the convenient scenario where our data has all relevant “latent variables” fully specified.

Dataset

We will be using a dataset of 3D-rendered chairs that was developed here. There are about 1400 chairs in total, with 62 images per chair (31 rotation angles around the vertical axis and two elevation angles).

You will need to download and process this dataset. Please check the course repository on GitLab for the quick_data.py script. Running this script with h as the dataset argument will process the chairs data into a TFRecords file. Then, you can use the parse_chairs function from utils/data to read the images. A bare-bones example notebook can also be found in the repository.
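For instance, a minimal input pipeline could look like this (the TFRecords file name below is a placeholder, and we assume parse_chairs maps one serialized example to its components):

```python
import tensorflow as tf
from utils.data import parse_chairs

# Build an input pipeline over the processed file; "chairs.tfrecords"
# is a placeholder for whatever quick_data.py actually produces.
dataset = (tf.data.TFRecordDataset("chairs.tfrecords")
           .map(parse_chairs, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(10_000)
           .batch(64)
           .prefetch(tf.data.AUTOTUNE))
```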

Each example consists of an image, the ID of the chair, an ID encoding the rotation angle (between 0 and 30) and an ID encoding the elevation angle (0 or 1). The great thing is that, in principle, each image is fully determined by the type of chair and the rotation/elevation angles, so we can learn a mapping from these “latent variables” to the chair images.

Setting Up a Model

There is a paper that deals with this task (Dosovitskiy et al., “Learning to Generate Chairs with Convolutional Neural Networks”); it is also referenced in the Deep Learning book. You can read it for inspiration, but there is no need to copy the approach exactly (in particular, you can ignore the segmentation aspect entirely).

In principle, you just need to build a network that maps from the latent inputs to an image, and then optimize the reconstruction error between the generated images and the true ones. The images are 600x600 (3 channels), but by default the parse function resizes them to 128x128. Feel free to change this; e.g. you could start with 64x64 for prototyping (anything smaller starts to look horrible). Of course, you can also go larger for a more challenging task. You could use either an MLP or a CNN for generation; in the latter case, use either transposed convolutions or upsampling followed by regular convolutions to construct the images.
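As a concrete starting point, here is a minimal sketch of a transposed-convolution generator in Keras producing 64x64 images; all layer sizes are arbitrary choices, not recommendations:

```python
import tensorflow as tf

def build_generator(latent_dim):
    # Map a latent vector to a 64x64 RGB image.
    inputs = tf.keras.Input(shape=(latent_dim,))
    # Project the latent vector to a small spatial feature map.
    x = tf.keras.layers.Dense(4 * 4 * 256, activation="relu")(inputs)
    x = tf.keras.layers.Reshape((4, 4, 256))(x)
    # Each transposed convolution doubles the resolution: 4 -> 8 -> 16 -> 32 -> 64.
    for filters in (128, 64, 32, 16):
        x = tf.keras.layers.Conv2DTranspose(
            filters, kernel_size=4, strides=2, padding="same",
            activation="relu")(x)
    # Three output channels in [0, 1], matching images rescaled to that range.
    outputs = tf.keras.layers.Conv2D(3, kernel_size=3, padding="same",
                                     activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)

generator = build_generator(latent_dim=128)
generator.compile(optimizer="adam", loss="mse")  # pixel-wise reconstruction error
```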

Representing the Input

You might be wondering how best to supply the latent inputs to the model. Neural networks generally work on continuous/real numbers as input, so inputting the IDs directly makes little sense, e.g. the chair with ID 2 would be “twice as large” as the chair with ID 1. There are two main options here (as discussed in the exercise):

- One-hot vectors: represent each ID as a binary vector that is 1 at the ID's index and 0 everywhere else.
- Embeddings: learn a trainable real-valued vector for each ID and feed that to the network.

Embeddings are generally preferred due to their statistical efficiency.
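A minimal sketch of the embedding approach, with one table per latent variable (the table sizes follow the dataset description above; the embedding dimensions are arbitrary):

```python
import tensorflow as tf

chair_id = tf.keras.Input(shape=(), dtype=tf.int32)
rotation = tf.keras.Input(shape=(), dtype=tf.int32)
elevation = tf.keras.Input(shape=(), dtype=tf.int32)

# One trainable lookup table per latent variable; input_dim must cover
# the full ID range (adjust 1400 to the actual number of chairs).
chair_emb = tf.keras.layers.Embedding(input_dim=1400, output_dim=64)(chair_id)
rot_emb = tf.keras.layers.Embedding(input_dim=31, output_dim=8)(rotation)
elev_emb = tf.keras.layers.Embedding(input_dim=2, output_dim=2)(elevation)

# Concatenate into a single latent vector to feed to the generator.
latent = tf.keras.layers.Concatenate()([chair_emb, rot_emb, elev_emb])
```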

With a working model, you can try some fun stuff, such as interpolating between chair embeddings to morph one chair into another, or generating views of a chair at rotation/elevation combinations that were held out during training.
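For instance, morphing could look roughly like this; chair_table, view and generator are hypothetical names for the trained chair embedding layer, a fixed rotation/elevation embedding vector, and the trained generator model:

```python
import numpy as np
import tensorflow as tf

# Interpolate between the embeddings of two chairs (IDs 3 and 17 are
# arbitrary) while keeping the viewpoint embedding `view` fixed.
emb_a = chair_table(np.array([3]))
emb_b = chair_table(np.array([17]))
for alpha in np.linspace(0.0, 1.0, num=8):
    mixed = (1 - alpha) * emb_a + alpha * emb_b
    latent = tf.concat([mixed, view], axis=-1)
    image = generator(latent)  # one frame of the chair morph
```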

Since the task is conceptually quite simple, you might want to spend some extra time on testing different architectures to find some that produce good results – you might be able to use the same architectures in the future for more complicated models (such as GANs).

Bonus: A More Challenging Dataset

In case you are underwhelmed by chairs, there is the NSynth dataset. This is a dataset of synthetic musical notes, containing about 300,000 examples of four seconds each. At 16 kHz, this means each example contains 64,000 numbers! The dataset is available as TFRecords, and there is a parse function in the course repo. Check the dataset website for a description of the entries each example contains. Examples are richly annotated; the instrument, pitch and velocity entries should probably be enough (definitely use embeddings!), but the model could perhaps also profit from further entries such as instrument_family or qualities.

While the basic setup is identical to the chairs task, this dataset is much more challenging. You could try 1D CNNs on the raw waveforms, autoregressive models (RNNs), 2D CNNs on spectrogram representations, or maybe DDSP… It might also make sense to use only a subset of the data, e.g. only acoustic instruments or only piano sounds, to make the task more manageable.
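Such a subset is easy to obtain with a dataset filter. A sketch, assuming the course parse function is called parse_nsynth and returns a dict of features (in the NSynth annotations, instrument_source 0 means “acoustic”):

```python
import tensorflow as tf
from utils.data import parse_nsynth  # assumed name of the course parse function

# Keep only acoustic notes to shrink the dataset.
dataset = (tf.data.TFRecordDataset("nsynth-train.tfrecord")
           .map(parse_nsynth)
           .filter(lambda example: example["instrument_source"] == 0))
```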

Assuming you have a model set up that can produce audio, you can listen to it inside a notebook using IPython.display.Audio (check the IPython documentation for details).
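A minimal sketch, assuming generated holds a single four-second model output:

```python
import numpy as np
from IPython.display import Audio

# `generated` is assumed to be one model output: 64000 float samples.
waveform = np.squeeze(np.asarray(generated))
# Renders a playable audio widget when it is the last expression in a cell.
Audio(waveform, rate=16000)
```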