Discussion: July 4th
Deadline: July 3rd, 20:00
This is the final assignment of the class. For once, we will not look at a new type of generative model, but focus on a different modality, namely audio. This is used in various tasks such as text-to-speech or music generation. To do this properly would require a significant amount of setup and prior knowledge in signal processing. As such, this should be taken as a very basic example showcasing the wide applicability of deep generative models in different domains.
Actually finding a dataset with desired properties is not simple:
We will settle for
Tensorflow Speech Commands.
This does not fulfill the “high quality” requirement, but oh well. Since this is
not one of the nicely processed keras.datasets, we will have to do some preprocessing ourselves.
We have two options:
First, use the tensorflow_datasets module and associated methods. An example can be found in
this tutorial focused on classification.
Note that I did not test this, as the tensorflow_datasets download has issues
on our group’s servers. I used the second method instead.
Second option: follow the steps in
data/process_tfsc.ipynb.
This notebook walks through the process of going over the unpacked directory and
putting everything into a “TF Records” file format. You don’t have to understand
the details of this format to use it. This format is read off disk in “chunks”,
so the whole dataset doesn’t need to be in memory at the same time (like with a
numpy array). It’s also more efficient than loading many raw audio files one by one.
Make sure base_path is correctly set to the directory containing the uncompressed
dataset.
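To make the writing step concrete, here is a minimal sketch of serializing one waveform into a TFRecord file. The feature key "audio" and the file name are assumptions for illustration; the actual notebook may use different names, and there the waveform would come from librosa rather than random numbers.

```python
import numpy as np
import tensorflow as tf

# Hypothetical sketch of the TFRecord-writing step. In the real notebook the
# waveform comes from something like librosa.load(path, sr=desired_sampling_rate);
# here we use a random stand-in so the example is self-contained.
desired_sampling_rate = 16000

def serialize_waveform(waveform: np.ndarray) -> bytes:
    # Store the raw float32 samples as a single bytes feature.
    # The key "audio" is an assumption, not necessarily the notebook's choice.
    feature = {
        "audio": tf.train.Feature(
            bytes_list=tf.train.BytesList(
                value=[waveform.astype(np.float32).tobytes()]
            )
        )
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

# Stand-in for one second of audio at 16 kHz.
waveform = np.random.uniform(-1.0, 1.0, size=desired_sampling_rate)

with tf.io.TFRecordWriter("speech_commands_demo.tfrecord") as writer:
    writer.write(serialize_waveform(waveform))
```

In the real notebook this loop runs over every file in the unpacked dataset directory, which is why setting base_path correctly matters.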
The notebook uses the librosa library to load audio. You will likely need
to install this. It is only required for preprocessing, not for training etc.,
so you don’t have to keep installing it on Colab after the data has been processed.

Also pay attention to desired_sampling_rate in preprocessing.
This is set to 16000 by default. You could reduce it to 8000; this reduces
the dimensionality of the data, but also the audio quality. Even lower sampling
rates can cause issues with audio playback, so this is not recommended. 4000 may still work,
but likely the audio will sound no better than through a telephone.

assignment11_starter.ipynb.
Basically, TFRecords is just pure uninterpreted bytes, and we have to tell Tensorflow
how to parse those bytes back into tensors.

You can find a starter notebook on Gitlab. This does not include the model; only
preparation of the dataset and example code on how to play audio (so you can later
listen to your generations). You can now, in principle, train any model we have
previously implemented on this dataset (diffusion models work well). The only difference is that the data now
has shape (16000, 1) – 16000 time steps and one channel (you need to add the channel
axis!).
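The parsing step can be sketched as follows. This assumes each record stores float32 samples under a feature key "audio", matching the hypothetical writing sketch above; the starter notebook defines the actual keys and format, so treat the names here as placeholders.

```python
import tensorflow as tf

# Sketch of turning TFRecord bytes back into tensors of shape (16000, 1).
# Feature key "audio" and raw-float32 storage are assumptions; use whatever
# the preprocessing notebook actually wrote.
SAMPLING_RATE = 16000  # must match desired_sampling_rate from preprocessing

def parse_example(serialized):
    features = tf.io.parse_single_example(
        serialized, {"audio": tf.io.FixedLenFeature([], tf.string)}
    )
    audio = tf.io.decode_raw(features["audio"], tf.float32)
    audio = tf.reshape(audio, [SAMPLING_RATE])
    # Add the channel axis -> shape (16000, 1), as required by the models.
    return audio[:, tf.newaxis]

# Usage (file name is a placeholder):
# dataset = tf.data.TFRecordDataset("train.tfrecord").map(parse_example).batch(32)
```

Because the dataset is mapped lazily, records are decoded in chunks as they stream off disk, which is what keeps memory usage low.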
Note: If you changed the sampling rate in preprocessing, be sure to also change
it in the model! Then of course, the data shapes will be different.
Convolutions need to be replaced with their 1D counterparts (layers.Conv1D), and similar
1D analogues exist for pooling, upsampling, etc. Aside from that, the architectures can
stay exactly the same!

If you are using the diffusion model code (assignment10),
there are a few more changes needed as some parts of the code broadcast lower-dimensional
tensors to the dimensionality of the data, which is 4D for images but only 3D for
audio. In the assignment 10 upload, these points are marked with capital # WARNING
comments. Note that I may have forgotten to mark some parts – if you run into
trouble with dimensions, feel free to ask for help!
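The broadcasting difference can be illustrated in a few lines: a per-example scalar (e.g. a diffusion noise-level coefficient) must be reshaped to the rank of the data before multiplying, and audio has one axis fewer than images. Variable names here are illustrative, not taken from the assignment code.

```python
import tensorflow as tf

# Minimal illustration of the rank mismatch: the same per-example scalar
# needs three added axes for 4D image batches, but only two for 3D audio.
batch = 4
images = tf.random.normal([batch, 32, 32, 3])   # 4D: (batch, H, W, channels)
audio = tf.random.normal([batch, 16000, 1])     # 3D: (batch, time, channels)
coeff = tf.random.uniform([batch])              # one scalar per example

scaled_images = coeff[:, None, None, None] * images  # broadcast to rank 4
scaled_audio = coeff[:, None, None] * audio          # broadcast to rank 3
```

In practice this means every `[:, None, None, None]`-style reshape marked with a # WARNING comment in the assignment 10 code needs one fewer axis for audio.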