Deadline: November 8th, 9am
In this assignment, you will get to know Keras, the high-level API in Tensorflow. You will also create a better model for the MNIST dataset using convolutional neural networks.
The low-level TF functions we have used so far are nice for keeping full control over everything that is happening, but they are cumbersome when we just need “everyday” neural network functionality. For such cases, TensorFlow has integrated Keras to provide abstractions over many common workflows. Keras has tons of stuff in it; we will only look at some of it for this assignment and get to know more over the course of the semester. In particular:
- tf.keras.layers provides wrappers for many common layers such as dense (fully connected) or convolutional layers. This takes care of creating and storing weights, applying activation functions, regularizers etc.
- tf.keras.Model in turn wraps layers into a cohesive whole that allows us to handle whole networks at once.
- tf.optimizers make training procedures such as gradient descent simple.
- tf.losses and tf.metrics allow for easier tracking of model performance.
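For a first impression, here is a minimal sketch of these building blocks in isolation (the layer size, batch shape and optimizer settings are made up purely for illustration):

```python
import tensorflow as tf

# A Dense layer creates and stores its own weights and applies the activation for us.
layer = tf.keras.layers.Dense(128, activation="relu")
inputs = tf.random.normal((32, 784))   # a made-up batch of 32 flattened inputs
hidden = layer(inputs)                 # weights are created on first call; shape (32, 128)

# Optimizers, losses and metrics are objects, too.
optimizer = tf.optimizers.SGD(learning_rate=0.1)
loss_fn = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
accuracy = tf.metrics.SparseCategoricalAccuracy()
```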
Unfortunately, none of the TF tutorials are quite what we would like here, so you’ll have to mix-and-match a little bit:

- For the loss, use tf.nn.sparse_softmax_cross_entropy_with_logits.
- Use Optimizer instances instead of manually subtracting gradients from variables.
- Use metrics to keep track of model performance.
- Build your model with Sequential. For additional examples, you can look at the top of this tutorial, or this one, or maybe this one… In each, look for the part model = tf.keras.Sequential.... You just put in a list of layers that will be applied in sequence. Check the API to get an impression of what layers there are and which arguments they take.

Later, we will see how to wrap entire model definitions, training loops and evaluations in a handful of lines of code. For now, you might want to rewrite your MLP code with these Keras functions (a sketch follows below) and make sure it still works as before.
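A minimal sketch of what such a rewrite might look like (layer sizes, the learning rate and the train_step helper are illustrative choices, not requirements):

```python
import tensorflow as tf

# The MLP as a Sequential model: just a list of layers applied in order.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10),  # logits, no activation here
])

optimizer = tf.optimizers.SGD(learning_rate=0.1)
train_accuracy = tf.metrics.SparseCategoricalAccuracy()

def train_step(images, labels):
    """One gradient step on a batch of flattened images and integer labels."""
    with tf.GradientTape() as tape:
        logits = model(images)
        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_accuracy.update_state(labels, logits)
    return loss
```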
An example notebook can be found here.
You should have seen that, with Keras, modifying layer sizes, changing activation functions etc. is simple: you can generally change parts of the model without affecting the rest of the program (training loop etc.). In fact, you can change the full pipeline from input to model output without having to change anything else (restrictions apply).
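For example, reusing the hypothetical train_step from the sketch above, swapping in a different model would not require touching the training code at all:

```python
import tensorflow as tf

# Only the model definition changes; the training loop stays exactly the same.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="tanh"),
    tf.keras.layers.Dense(128, activation="tanh"),
    tf.keras.layers.Dense(10),
])
```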
Replace your MNIST MLP by a CNN. The tutorials linked above might give you some ideas for architectures. Generally:

- Convolutional layers expect image-shaped inputs of width x height x channels. So for MNIST, make sure your images have shape (28, 28, 1), not (784,)!
- Use some Conv2D and possibly MaxPool2D layers.
- Flatten the feature maps at some point.
- Follow up with Dense layers and the final classification (logits) layer.

Note: Depending on your machine, training a CNN may take much longer than the MLPs we’ve seen so far. Here, using Colab’s GPU support could be useful (Edit -> Notebook settings -> Hardware Accelerator). Also, processing the full test set in one go for evaluation might be too much for your RAM. In that case, you could break up the test set into smaller chunks and average the results (easy using Keras metrics) – or just make the model smaller.
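A minimal sketch of such a CNN (the number of layers, filter counts and kernel sizes are just illustrative choices):

```python
import tensorflow as tf

def to_images(batch):
    """Turn raw MNIST batches into float images of shape (n, 28, 28, 1).

    Assumes pixel values in 0-255; works for (n, 784) as well as (n, 28, 28) input.
    """
    batch = tf.cast(batch, tf.float32) / 255.0
    return tf.reshape(batch, (-1, 28, 28, 1))

cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Conv2D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),  # logits
])
```

And for the chunked evaluation mentioned in the note, Keras metrics do the averaging for you (the chunk size and the test_images/test_labels arrays are assumptions for this sketch):

```python
test_accuracy = tf.metrics.SparseCategoricalAccuracy()
for i in range(0, len(test_images), 1000):          # evaluate 1000 examples at a time
    chunk_x = test_images[i:i + 1000]
    chunk_y = test_labels[i:i + 1000]
    test_accuracy.update_state(chunk_y, cnn(to_images(chunk_x)))
print("test accuracy:", test_accuracy.result().numpy())
```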
You should consider using a better optimization algorithm than the basic
SGD
. One option is to use adaptive algorithms, the most
popular of which is called Adam. Check out tf.optimizers.Adam
. This will
usually lead to much faster learning without manual tuning of the learning rate
or other parameters. We will discuss advanced optimization strategies later in
the class, but the basic idea behind Adam is that it automatically
adapts a per-parameter learning rate and incorporates momentum.
Using Adam, your CNN should beat your MLP after only a few hundred steps of
training. The
general consensus is that a well-tuned gradient descent with momentum and
learning rate decay will outperform adaptive methods, but you will need to
invest some time into finding a good parameter setting – we will look into
these topics later.
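Switching the optimizer in the sketches above is then a one-line change (the learning rates shown are only for illustration; Adam's default already works well in most cases):

```python
import tensorflow as tf

# Adam adapts a per-parameter learning rate and incorporates momentum.
optimizer = tf.optimizers.Adam(learning_rate=1e-3)

# For comparison: plain SGD with momentum, which typically needs more manual tuning.
# optimizer = tf.optimizers.SGD(learning_rate=0.01, momentum=0.9)
```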
If your CNN is set up well, you should reach extremely high accuracy.
This is arguably where MNIST stops being interesting. If you haven’t done so,
consider working with Fashion-MNIST instead (see
Assignment 1). This should
present more of a challenge and make improvements due to hyperparameter tuning
more obvious/meaningful. You could even try CIFAR10 or CIFAR100 as in one of the
tutorials linked above. They have 32x32 3-channel color images with much more
variation. These datasets are also available in tf.keras.datasets
.
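Loading one of these alternatives takes a single call, for example (variable names are just illustrative):

```python
import tensorflow as tf

# Fashion-MNIST: 28x28 grayscale images in the same format as MNIST.
(train_images, train_labels), (test_images, test_labels) = \
    tf.keras.datasets.fashion_mnist.load_data()
```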
Note: For some reason, the CIFAR labels are organized somewhat differently –
shaped (n, 1)
instead of just (n,)
. You should do something like
labels = labels.reshape((-1,))
or this will mess up the loss function.
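Concretely, for CIFAR10 that fix might look like this:

```python
import tensorflow as tf

# CIFAR10: 32x32 RGB images; note the labels come back with shape (n, 1).
(train_images, train_labels), (test_images, test_labels) = \
    tf.keras.datasets.cifar10.load_data()
print(train_labels.shape)               # (50000, 1)

# Flatten the label arrays to shape (n,) so the sparse cross-entropy loss works as expected.
train_labels = train_labels.reshape((-1,))
test_labels = test_labels.reshape((-1,))
```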
Note: Even with Keras optimizers, gradient computation itself works the same as before (via GradientTape).