Assignment 4: Sorry, This Assignment Is Just Padding!

Alternate Title: Lost in Translation

Deadline: May 13th, 20:00

The convolution/correlation operation used in CNNs is equivariant to translation (i.e. moving the image). We further get some degree of invariance from pooling layers. This might make you think that a CNN in general has some amount of “resistance” to moving input images around. Unfortunately, this is not really the case. A major problem here is the use of padding, which we want to investigate in this week’s assignment. There is a very interesting paper on this; you do not need to read it to do the assignment, but it provides a lot of background detail.

Note: This assignment is a little bit experimental. We wanted to go beyond just “build a CNN”. We hope you learn something interesting. :)

Our Model

On E-Learning, you can find an implementation of a VGG-style CNN for CIFAR10. You can train it as-is, or make changes for potentially better performance. There is also a model checkpoint already available on E-Learning that you can load via model.load_state_dict in case you don’t want to do the training yourself (e.g. lack of hardware). In that case you can skip the training. See here for information on how to save/load models.
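As a rough sketch of the save/load round trip mentioned above: the model below is only a stand-in and the file name is hypothetical, so replace both with the network from the notebook and the checkpoint file you downloaded. The key point is that the model object has to be built first, with the same architecture as the one that was saved, before the stored parameters can be loaded into it.

```python
import os
import tempfile

import torch
from torch import nn

# Stand-in model (assumption). In the assignment this would be the VGG-style
# network from the notebook, built with the same architecture as the checkpoint.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())

# Hypothetical file name; use the path of the file downloaded from E-Learning.
checkpoint_path = os.path.join(tempfile.gettempdir(), "cnn_checkpoint.pt")

# Saving stores only the parameters (the "state dict"), not the model code.
torch.save(model.state_dict(), checkpoint_path)

# Loading: map_location="cpu" lets a checkpoint trained on a GPU be loaded
# on a machine without one.
state_dict = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(state_dict)
model.eval()  # evaluation mode: dropout off, batch norm uses running statistics
```

If load_state_dict complains about missing or unexpected keys, the architecture you built most likely differs from the one the checkpoint was saved from.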

Make sure the model is working fine by using the evaluate function given in the notebook. You should reach over 93% accuracy on the test set. While not state of the art, this is fairly decent performance on CIFAR10. Let’s take it apart a little bit!

CNN Properties And Their Reality

There are two options for things you can investigate here. You may choose one of them! Of course you could do both…

Option 1: Translation

How robust is our CNN actually to translation (i.e. shifting) of images? A systematic approach could go as follows:

  1. Write a modified evaluate function that translates all images before putting them into the model. For this, you can use torchvision.transforms.v2.functional.affine (docs). Set angle=0, scale=1, shear=0 and translate to the desired amount of translation. For example, translate=(-5, 3) would shift the images to the left by 5 pixels, and down by 3 pixels.
  2. Iterate over all possible shifts in the x and y direction. Note that the maximum useful amount of translation is equal to the size of the image, since at that point nothing but a black image is left. E.g. for uncropped CIFAR you could iterate from -32 to 32 in both x and y. This would result in many (65 x 65 = 4225) evaluations of the entire dataset, which can take a very long time. To speed this up, you could try larger steps, e.g. instead of going -32, -31, -30... you could go -32, -30, -28....
  3. Evaluate the model performance for each x/y shift.

Display the results in a suitable manner, e.g. plotting as a matrix with plt.imshow, with loss and/or accuracy shown as a function of x/y translation. What do you think about the results? Things to look out for are, for example:

Finally, are there any issues with this evaluation method that may give us a wrong impression?
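For displaying the results as a matrix, a minimal plotting sketch could look like this. It assumes the accuracies (or losses) have already been collected into a 2D array whose rows correspond to y shifts and whose columns correspond to x shifts; plot_shift_grid is a hypothetical helper name.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_shift_grid(values, shifts, label="accuracy"):
    """Show a len(shifts) x len(shifts) grid; rows are y shifts, columns are x shifts."""
    values = np.asarray(values)
    fig, ax = plt.subplots()
    # extent puts the actual shift values on the axes. Half a step of margin
    # centres each cell on its shift value.
    half = (shifts[1] - shifts[0]) / 2 if len(shifts) > 1 else 0.5
    extent = (shifts[0] - half, shifts[-1] + half,   # left, right
              shifts[-1] + half, shifts[0] - half)   # bottom, top
    image = ax.imshow(values, extent=extent, cmap="viridis")
    ax.set_xlabel("x shift (pixels)")
    ax.set_ylabel("y shift (pixels)")
    fig.colorbar(image, ax=ax, label=label)
    return fig, ax
```

Because the bottom of the extent is the largest y shift, positive y shifts (images moved down) appear in the lower half of the plot, matching the image coordinate convention.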

Option 2: Feature Map Artifacts

What does the model “see” for completely empty inputs? Are there any consistent patterns in the features across many inputs, that are seemingly unrelated to the data?

To do this, you have to “take apart” the model. There are various ways of doing this, but with nn.Sequential the easiest way is probably to recognize the following:

You can try this on any image input. A single image is best, but keep in mind you always need a batch axis, so a single image input would have the shape (1 x 3 x h x w). Looking at each layer one by one, you should see feature maps that somewhat resemble the input in the early layers, with increasingly abstract results as you go deeper into the network.
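One possible way of looking at each layer in turn is sketched below. It relies on the fact that an nn.Sequential can be iterated over, yielding its layers in order. The toy network is only a stand-in for the assignment model, and collect_feature_maps is a hypothetical helper name. If the real model nests several nn.Sequential blocks, the loop would treat each nested block as a single "layer".

```python
import torch
from torch import nn


@torch.no_grad()
def collect_feature_maps(model, image):
    """Pass image through an nn.Sequential one layer at a time, keeping every output."""
    model.eval()
    outputs = []
    x = image
    for layer in model:  # iterating an nn.Sequential yields its layers in order
        x = layer(x)
        outputs.append((layer.__class__.__name__, x))
    return outputs


# Stand-in network (assumption), NOT the model from the notebook.
toy = nn.Sequential(
    nn.Conv2d(3, 4, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(4, 8, 3, padding=1), nn.ReLU(),
)
image = torch.rand(1, 3, 32, 32)  # a single image, with the batch axis in front
for name, fmap in collect_feature_maps(toy, image):
    print(name, tuple(fmap.shape))
```

Each entry has shape (1 x channels x h x w), so fmap[0, c] is a 2D array for channel c that can be passed directly to plt.imshow.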

Now put in a completely blank image, i.e. a tensor full of 0s. Most likely, it will be as in Figure 2 in the paper linked at the top: You will see some strange artifacts at the edges/corners that will move closer to the center, and become more pronounced and complex, as you go deeper. What do you think is going on here?
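Besides looking at plots, you could also put a number on how different the borders are from the rest of a feature map. The sketch below is one possible way to do that; border_and_center_mean is a hypothetical helper, the margin of 2 pixels is an arbitrary choice, and the randomly initialized toy network merely stands in for the trained model.

```python
import torch
from torch import nn


def border_and_center_mean(fmap, margin=2):
    """Mean absolute activation in the outer `margin` pixels vs. the interior.

    fmap has shape (batch, channels, height, width).
    """
    interior = torch.zeros(fmap.shape[-2:], dtype=torch.bool)
    interior[margin:-margin, margin:-margin] = True  # True marks the interior
    magnitude = fmap.abs().mean(dim=(0, 1))          # average over batch and channels
    return magnitude[~interior].mean().item(), magnitude[interior].mean().item()


# Stand-in network (assumption), NOT the model from the notebook.
toy = nn.Sequential(
    nn.Conv2d(3, 4, 3, padding=1), nn.ReLU(),
    nn.Conv2d(4, 4, 3, padding=1), nn.ReLU(),
)
blank = torch.zeros(1, 3, 32, 32)  # completely blank input with batch axis
with torch.no_grad():
    features = toy(blank)
border, center = border_and_center_mean(features)
print(f"border: {border:.4f}  center: {center:.4f}")
```

Applying this to the output of every layer shows how the gap between border and interior develops with depth.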

Another thing you can do is to put in actual images from the dataset, but average the feature responses over many images and plot these averages (Figure 1 in the paper). The artifacts will likely still be present, hinting that this is a systematic issue plaguing the CNN hidden representations, NOT just some strange occurrence for completely “empty” inputs.
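Averaging over many images could be done along these lines. The sketch assumes a data loader yielding (images, labels) batches and a feature extractor that maps an image batch to feature maps. Since an nn.Sequential supports slicing, something like model[:k] for the first k layers is one way to obtain such an extractor. mean_feature_map is a hypothetical helper name.

```python
import torch
from torch import nn


@torch.no_grad()
def mean_feature_map(feature_extractor, loader, device="cpu"):
    """Average the extractor output over all images; result has shape (channels, h, w)."""
    feature_extractor.eval()
    total, count = None, 0
    for images, _ in loader:
        fmaps = feature_extractor(images.to(device))
        batch_sum = fmaps.sum(dim=0)  # sum over the batch axis
        total = batch_sum if total is None else total + batch_sum
        count += images.shape[0]
    return total / count
```

The result still has one map per channel. Calling .mean(dim=0) on it additionally averages over channels and gives a single 2D map to plot.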

Note: This is likely more work/more difficult than Option 1 (Translation), but it’s also really good practice to see how to “look into” the networks we built, rather than just treating them as black boxes that just return some output through arcane magic.

6 CP Extra

Since it can take a while to figure out what you really need to be doing in this assignment, there is no extra work this time. :) But as an optional bonus, you could look into ways to fix or at least alleviate these issues. The paper linked above proposes some:

Summary: What To Submit