Deadline: May 13th, 20:00
The convolution/correlation operation used in CNNs is equivariant to translation (i.e. moving the image). We further get some degree of invariance from pooling layers. This might make you think that a CNN in general has some amount of “resistance” to moving input images around. Unfortunately, this is not really the case. A major problem here is the use of padding, which we want to investigate in this week’s assignment. There is a very interesting paper on this; you do not need to read it to do the assignment, but it provides a lot of background detail.
Note: This assignment is a little bit experimental. We wanted to go beyond just “build a CNN”. We hope you learn something interesting. :)
On E-Learning, you can find an implementation of a VGG-style CNN for CIFAR10.
You can train it as-is, or make changes for potentially better performance.
There is also a model checkpoint already available on E-Learning that you can load via model.load_state_dict in case you don’t want to do the training yourself (e.g. for lack of hardware). In that case you can skip the training. See here for information on how to save/load models.
Make sure the model is working fine by using the evaluate function given in the notebook.
You should reach over 93% accuracy on the test set.
While not state of the art, this is fairly decent performance on CIFAR10.
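The checkpoint round trip might look roughly like this. Note that the model below is a tiny placeholder and the file path is a temporary one; substitute the notebook’s VGG-style CNN and the checkpoint file from E-Learning:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Placeholder model -- in the assignment, this is the VGG-style CNN from the
# notebook, and the checkpoint is the file provided on E-Learning.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(8 * 32 * 32, 10))

path = os.path.join(tempfile.mkdtemp(), "cifar10_cnn.pt")
torch.save(model.state_dict(), path)       # saving a checkpoint

state_dict = torch.load(path, map_location="cpu")
model.load_state_dict(state_dict)          # loading it back
model.eval()                               # switch off train-time behaviour
```

Remember to call model.eval() before running the evaluate function, so that layers like batch norm behave deterministically.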
Let’s take it apart a little bit!
There are two options for things you can investigate here. You may choose one of them! Of course you could do both…
How robust is our CNN actually to translation (i.e. shifting) of images? A systematic approach could go as follows:
Write an evaluate function that translates all images before putting them into the model.
For this, you can use torchvision.transforms.v2.functional.affine (docs).
Set angle=0, scale=1, shear=0 and translate to the desired amount of translation. For example, translate=(-5, 3) would shift the images to the left by 5 pixels, and down by 3 pixels.
Evaluate over a grid of x/y translations. You do not have to test every single shift (-32, -31, -30, ...); a coarser grid (-32, -30, -28, ...) saves a lot of compute.
Display the results in a suitable manner, e.g. plotting as a matrix with plt.imshow, with loss and/or accuracy shown as a function of x/y translation.
What do you think about the results?
Things to look out for are, for example:
Finally, are there any issues with this evaluation method that may give us a wrong impression?
What does the model “see” for completely empty inputs? Are there any consistent patterns in the features across many inputs, that are seemingly unrelated to the data?
To do this, you have to “take apart” the model.
There are various ways of doing this, but with nn.Sequential the easiest way is probably to recognize the following:
model[0] will give you the first submodule, model[1] the second, and so on. Indexing also nests, with model[0][1] giving you the second submodule of the first submodule of model. This lets you run the model step by step: set x = inputs, then x = model[0](x), then x = model[1](x), etc.
We can store x somewhere after each step, and then later go through all hidden outputs.
Note that model[i] is an entire “level” of the network which consists of multiple layers. If you want to look at each convolutional layer one-by-one, you would have to look at model[i][j] and iterate over both i and j.
You can try this on any image input – best only a single image, but keep in mind you always need a batch axis.
So a single image input would have the shape (1 x 3 x h x w).
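The step-by-step approach above can be sketched as follows; the small two-level Sequential here is a placeholder standing in for the notebook’s model:

```python
import torch
import torch.nn as nn

# Placeholder: two "levels", each itself a Sequential of layers, mimicking the
# structure where model[i] is a block and model[i][j] a single layer.
model = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU()),
)

x = torch.randn(1, 3, 32, 32)  # a single image, with the required batch axis
hidden = []
with torch.no_grad():
    for block in model:        # iterates model[0], model[1], ...
        x = block(x)
        hidden.append(x)       # store the output of each "level"

# hidden[k] now holds the feature maps after block k; to go layer-by-layer
# instead, loop over the layers inside each block (model[i][j]).
```

You can then visualize individual channels, e.g. plt.imshow(hidden[0][0, c]) for some channel c.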
Looking at each layer one by one, you should see feature maps that somewhat resemble the input in the early layers,
with increasingly abstract results as you go deeper into the network.
Now put in a completely blank image, i.e. a tensor full of 0s. Most likely, it will look like Figure 2 in the paper linked at the top: you will see strange artifacts at the edges/corners that move closer to the center and become more pronounced and complex as you go deeper. What do you think is going on here?
Another thing you can do is to put in actual images from the dataset, but average the feature responses and plot these averages (Figure 1 in the paper). The artifacts will likely still be present, hinting that this is a systematic issue plaguing the CNN hidden representations, NOT just some strange occurrence for completely “empty” inputs.
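Averaging feature responses can be sketched like this (the layer and images below are random placeholders; in the assignment you would use the trained CNN’s hidden outputs on real CIFAR10 batches):

```python
import torch
import torch.nn as nn

layer = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())  # placeholder
images = torch.randn(64, 3, 32, 32)  # stand-in for a batch of CIFAR10 images

with torch.no_grad():
    feats = layer(images)            # shape (64, 8, 32, 32)

# Average over the batch (and optionally the channel) axis; anything that
# survives this averaging is independent of the individual inputs, so border
# artifacts show up as consistent patterns at the edges/corners.
avg_map = feats.mean(dim=(0, 1))     # a single 32x32 map, ready for imshow
```

Averaging over many inputs is what separates input-dependent features from the systematic, input-independent artifacts the paper describes.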
Note: This is likely more work/more difficult than Option 1 (Translation), but it’s also really good practice to see how to “look into” the networks we built, rather than just treating them as black boxes that just return some output through arcane magic.
Since it can take a while to figure out what you really need to do here, there is no extra work this time. :) But as an optional bonus, you could look into ways of how to fix or at least alleviate these issues. The paper linked above proposes some:
For example, you could use a different padding mode: try reflect padding instead of the symmetric one they prefer in the paper (symmetric padding is not built into PyTorch’s convolutions, but reflect is). Another option uses the Resize transform in torchvision.
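Switching the padding mode is a one-argument change, as the minimal sketch below shows (layer sizes are placeholders; in the assignment you would change the convolutions inside the notebook’s model):

```python
import torch
import torch.nn as nn

# Default Conv2d pads with zeros; padding_mode='reflect' mirrors the image
# content at the border instead of inserting constant zeros.
conv_zeros = nn.Conv2d(3, 8, kernel_size=3, padding=1)  # padding_mode='zeros'
conv_refl = nn.Conv2d(3, 8, kernel_size=3, padding=1, padding_mode='reflect')

x = torch.randn(1, 3, 32, 32)
out_zeros, out_refl = conv_zeros(x), conv_refl(x)
# Same output size either way; only the behaviour at the border changes.
```

Note that changing the padding mode changes what the weights “see” at the border, so a model trained with zero padding should be retrained (or at least fine-tuned) after the switch.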