Discussion: May 25th
In this assignment, we will have a first look at “differentiable generator networks” in the convenient scenario where our data has all relevant “latent variables” fully specified.
We will be using a dataset of 3D-rendered chairs that was developed here. There are about 1400 chairs in total, with 62 images (31 rotation angles around the side and two elevation angles) per chair.
You will need to download and process this dataset. Please check the course repository on Gitlab for the quick_data.py script. Running this script with the appropriate string for the dataset argument (see the script for the exact name) will process the chairs data into a TFRecords file. Then, you can use the parse_chairs function from utils/data to read the images. A bare-bones example notebook can also be found in the repository.
Each example consists of an image, the ID of the chair, an ID encoding rotation angle (between 0 and 30) and an ID encoding elevation angle (0 or 1). The great thing is that, in principle, each image is fully determined by the information about the type of chair and the rotation/elevation angles, and thus we can learn a mapping from these “latent variables” to the chair images.
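To make the input pipeline concrete, here is a minimal sketch of how the processed data might be read. The file path and the exact signature of parse_chairs are assumptions; adapt them to what the course repository actually provides.

```python
import tensorflow as tf

from utils.data import parse_chairs  # course-repo helper (assumed import path)

# Path to the TFRecords file produced by quick_data.py -- hypothetical name, adjust to your setup.
RECORD_PATH = "data/chairs.tfrecords"

# Build a tf.data pipeline: read serialized examples, parse them into
# (image, chair_id, rotation_id, elevation_id) tuples, then shuffle and batch.
dataset = (tf.data.TFRecordDataset(RECORD_PATH)
           .map(parse_chairs, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(10_000)
           .batch(64)
           .prefetch(tf.data.AUTOTUNE))

# Peek at one batch to verify the shapes (images are 128x128x3 by default).
for batch in dataset.take(1):
    print([t.shape for t in batch])
```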
There is a paper that deals with exactly this task; it is also referenced in the Deep Learning book. You can read it for inspiration, but there is no need to copy its approach exactly (in particular, you can ignore the segmentation aspect entirely).
In principle, you just need to build a network that maps from the latent input to an image, and then optimize the reconstruction error between the generated images and the true ones. The images are 600x600 (3 channels), but by default the parse function resizes them to 128x128. Feel free to change this, e.g. you could start with 64x64 for prototyping (any less starts to look horrible). Of course you can also go larger for a more challenging task. You could use either an MLP or a CNN for generation, and in the latter case either transposed convolution or upsampling followed by regular convolution to construct the images.
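As a starting point, here is a minimal generator sketch. The layer sizes, the 64x64 target resolution and the use of transposed convolutions are illustrative choices, not requirements.

```python
import tensorflow as tf

def build_generator(latent_dim, image_size=64):
    """Map a flat latent vector to an image via transposed convolutions."""
    start = image_size // 16  # spatial size before upsampling (4 for 64x64 output)
    return tf.keras.Sequential([
        tf.keras.Input(shape=(latent_dim,)),
        tf.keras.layers.Dense(start * start * 256, activation="relu"),
        tf.keras.layers.Reshape((start, start, 256)),
        # Each transposed convolution doubles the spatial resolution.
        tf.keras.layers.Conv2DTranspose(128, 4, strides=2, padding="same", activation="relu"),
        tf.keras.layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu"),
        tf.keras.layers.Conv2DTranspose(32, 4, strides=2, padding="same", activation="relu"),
        # Final layer outputs 3 channels in [0, 1] to match images normalized to that range.
        tf.keras.layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="sigmoid"),
    ])

generator = build_generator(latent_dim=84)  # e.g. the size of the concatenated latent features
generator.compile(optimizer="adam", loss="mse")  # pixel-wise reconstruction error
```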
You might be wondering how best to supply the latent inputs to the model. Neural networks generally work on continuous/real numbers as input, so inputting the IDs directly makes little sense, e.g. the chair with ID 2 would be “twice as large” as the chair with ID 1. There are two main options here (as discussed in the exercise):
1. One-hot encoding: Convert each ID into a one-hot vector (e.g. via tf.one_hot with the corresponding depth). You do the same for the rotation and elevation angles and then stick the three one-hot vectors together, ending up with a 1457-element vector with three non-zero entries.
2. Embeddings: Learn a dense vector for each ID; you can use tf.keras.layers.Embedding for this. You should create one embedding each for chairs, rotation and elevation, but stick the three resulting embeddings together into one feature vector, just as with the one-hot vectors.

Embeddings are generally preferred due to their statistical efficiency.
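A rough sketch of the embedding variant is shown below; the chair count is inferred from the 1457-element one-hot vector, and the embedding sizes are arbitrary choices.

```python
import tensorflow as tf

# Assumed counts, consistent with the 1457-element one-hot vector (1424 + 31 + 2).
NUM_CHAIRS, NUM_ROTATIONS, NUM_ELEVATIONS = 1424, 31, 2

# One embedding table per latent factor; the embedding sizes are arbitrary choices.
chair_emb = tf.keras.layers.Embedding(NUM_CHAIRS, 64)
rotation_emb = tf.keras.layers.Embedding(NUM_ROTATIONS, 16)
elevation_emb = tf.keras.layers.Embedding(NUM_ELEVATIONS, 4)

def encode_latents(chair_id, rotation_id, elevation_id):
    """Look up the three embeddings and concatenate them into one feature vector."""
    return tf.concat([chair_emb(chair_id),
                      rotation_emb(rotation_id),
                      elevation_emb(elevation_id)], axis=-1)

# Example: a batch of two latent inputs -> shape (2, 64 + 16 + 4).
latents = encode_latents(tf.constant([0, 5]), tf.constant([3, 17]), tf.constant([1, 0]))
print(latents.shape)  # (2, 84)
```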
With a working model, you can try some fun stuff, for example interpolating between the latent inputs of two chairs and looking at the generated in-between images.
Since the task is conceptually quite simple, you might want to spend some extra time on testing different architectures to find some that produce good results – you might be able to use the same architectures in the future for more complicated models (such as GANs).
In case you are underwhelmed by chairs, there is the NSynth dataset. This is a dataset of individual musical notes, containing about 300,000 examples of four seconds each. At 16 kHz, this means each example contains 64,000 numbers! The dataset is available as TFRecords, and there is a parse function in the course repo. Check the dataset website for a description of the entries each example contains.
Examples are richly annotated; probably the instrument, pitch and velocity entries should be enough (definitely use embeddings!), but perhaps the model could also profit from further entries such as instrument_family or qualities.
While the basic setup is identical to the chairs task, this dataset is much more challenging. You could try 1D CNNs on the raw data, or maybe autoregressive models (RNNs), or 2D CNNs on spectrogram representations, or maybe DDSP… Also, it might make sense to only take a subset of the data, e.g. only acoustic instruments, or only piano sounds or something similar, to make the task more manageable.
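For the spectrogram route, a log-magnitude spectrogram for a 2D CNN could be computed with tf.signal.stft; the frame length and step below are plausible defaults, not values prescribed by the assignment.

```python
import tensorflow as tf

def to_log_spectrogram(audio, frame_length=1024, frame_step=256):
    """Convert a batch of raw 16 kHz waveforms into log-magnitude spectrograms."""
    # audio: float tensor of shape (batch, 64000)
    stft = tf.signal.stft(audio, frame_length=frame_length, frame_step=frame_step)
    magnitude = tf.abs(stft)                 # (batch, frames, frequency bins)
    log_mag = tf.math.log(magnitude + 1e-6)  # compress the dynamic range
    return log_mag[..., tf.newaxis]          # add a channel axis for 2D convolutions

spec = to_log_spectrogram(tf.random.normal((2, 64000)))
print(spec.shape)  # (2, 247, 513, 1) with the settings above
```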
Assuming you have a model set up that can produce audio, you can listen to it inside a notebook by importing IPython and using IPython.display.Audio (check Google for the docs).
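For instance, a generated waveform (here just a placeholder sine wave) could be played back like this; the 16 kHz sample rate matches the NSynth data.

```python
import numpy as np
from IPython.display import Audio

# Placeholder for model output: a one-second 440 Hz sine wave at 16 kHz.
sample_rate = 16000
t = np.linspace(0, 1, sample_rate, endpoint=False)
waveform = np.sin(2 * np.pi * 440 * t)

# In a notebook cell, this displays an audio player for the waveform.
Audio(waveform, rate=sample_rate)
```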