Assignment 13: Introspection Part 2
Deadline: January 23rd, 9am
This is the final assignment for this class.
In this assignment, you will try to detect misbehavior of models and explain errors
using Introspection methods.
Unmasking Clever Hans Predictors
We will start with a synthetic example in which we purposely make it easy for
the model to cheat. Afterwards, we will apply Introspection to detect the cheating
and explain how the model did it.
- Start with a simple data set and augment the examples of one class with some
kind of identifier (see the sketch at the end of this section), which can be
- very obvious, like a piece of common text (e.g. a time stamp) added to the
image, or maybe a big black square,
- less obvious, like a highly transparent watermark,
- a particular image enhancement (like adding a lightness gradient from one
corner to the other)…
- Train a model on this data set.
It should perform very well on the augmented class because of the added
information.
- Finally, apply Introspection techniques to the trained model to figure out
whether the model really made use of the cheating opportunity that we provided.
Instead of only altering one class, you could also apply different kinds of
cheating opportunities to different classes.
Note: any regularization or preprocessing you add to your pipeline should not
remove the information that we have inserted into the class. For example, if your
“extra information” is that the images of one class are highly saturated, then
preprocessing that normalizes image saturation, or applies random saturation to
all images, would destroy that extra information.
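To make this concrete, here is a minimal sketch of the first two steps, assuming CIFAR-10 as the base dataset, a black square in the top-left corner as the identifier, and a small ad-hoc CNN. The tagged class, square size, number of epochs, and architecture are arbitrary placeholders to adapt to your own setup.

```python
# Sketch: insert a "Clever Hans" cue into one class of CIFAR-10
# and train a small CNN on the manipulated data.
import numpy as np
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

TAGGED_CLASS = 3  # assumption: tag class 3 ("cat")

def add_black_square(images, labels, size=6):
    """Overwrite the top-left corner of every image of TAGGED_CLASS with a black square."""
    images = images.copy()
    mask = (labels.flatten() == TAGGED_CLASS)
    images[mask, :size, :size, :] = 0.0
    return images

x_train_tagged = add_black_square(x_train, y_train)
x_test_tagged = add_black_square(x_test, y_test)

# A deliberately simple CNN; it only needs to be good enough to pick up the cue.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),  # logits
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
model.fit(x_train_tagged, y_train, epochs=5,
          validation_data=(x_test_tagged, y_test))

# Sanity check: per-class accuracy on the tagged class should be suspiciously high.
```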
Contrastive Explanations
This is a more realistic scenario in which you can also try out more advanced
methods to create saliency maps.
You will work with a pre-trained network and try to explain its wrong decisions
with different Introspection techniques and contrastive explanations.
- Pick a pre-trained model of your choice (image classification would be best).
- Gather examples that were wrongly predicted by the model, so that you have a
list of images together with their (wrong) predicted label and their correct
annotated label (the examples do not need to come from the training data, but
ideally they are not too far from the training data distribution).
- Now compute saliency maps, e.g. using tf-explain, for both the predicted class
and the correct label (see the sketch below).
- Compare the two saliency maps and investigate:
- Are they different (maybe you need to look at the actual difference of the
saliency maps)?
- Can you learn something about why the model made the error (either from the
saliency map of the predicted class only, or by comparing to the
target class saliency map)?
If you use tf-explain (or something similar), you can easily try out different
saliency map methods and compare which one helps you most in explaining
classification errors.
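As a starting point, here is a minimal sketch of the contrastive comparison using plain gradient saliency, so no extra library is needed; the model (assumed to output class scores or logits), the image, and the two class indices are placeholders, and tf-explain methods such as Grad-CAM can be swapped in for the saliency function.

```python
# Sketch: gradient saliency maps for the predicted and the correct class
# of a misclassified image, plus their difference.
import numpy as np
import tensorflow as tf

def gradient_saliency(model, image, class_index):
    """Gradient of the class score w.r.t. the input pixels, reduced to a 2D heat map."""
    x = tf.convert_to_tensor(image[np.newaxis], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        score = model(x, training=False)[0, class_index]
    grads = tape.gradient(score, x)
    return tf.reduce_max(tf.abs(grads), axis=-1)[0].numpy()  # shape (H, W)

def contrastive_saliency(model, image, predicted_class, correct_class):
    sal_pred = gradient_saliency(model, image, predicted_class)
    sal_true = gradient_saliency(model, image, correct_class)
    return sal_pred, sal_true, sal_pred - sal_true  # inspect all three

# Usage (placeholders): sal_pred, sal_true, diff = contrastive_saliency(model, img, 5, 3)
```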
Let’s use our feature visualization technique from the last assignment in a
different way.
- Train a model, e.g. on CIFAR-10, like last time.
- Pick examples that were wrongly classified, together with their predicted
class and their annotated label.
- Now optimize the example such that it is classified as the correct class,
keeping the changes as small as possible. To this end, you can add a penalty to
the loss function (e.g. penalize the pixel-wise difference between the optimized
input and the original input); see the sketch at the end of this assignment.
- Finally, inspect what the optimization needed to change in the image to make
the model recognize it as the correct class (e.g. by looking at the difference
image).
- Can you learn something about why the error was made from the changes in the
input?
Note from Valerie: I have no idea whether this approach works.
But I would be very excited to see some experiments :)
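Here is a minimal sketch of one way to set up this experiment, assuming a model that outputs logits and images scaled to [0, 1]; the penalty weight, learning rate, and number of steps are arbitrary starting points you will likely need to tune.

```python
# Sketch: nudge a misclassified image towards the correct class while
# penalizing pixel changes, then inspect the difference image.
import tensorflow as tf

def optimize_towards_class(model, image, target_class, steps=200,
                           learning_rate=0.01, pixel_penalty=1.0):
    """Optimize `image` so the model predicts `target_class`, changing pixels as little as possible."""
    original = tf.constant(image[None], dtype=tf.float32)
    x = tf.Variable(original)
    optimizer = tf.keras.optimizers.Adam(learning_rate)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

    for _ in range(steps):
        with tf.GradientTape() as tape:
            logits = model(x, training=False)
            class_loss = loss_fn([target_class], logits)
            # L2 penalty keeps the optimized image close to the original.
            penalty = pixel_penalty * tf.reduce_mean(tf.square(x - original))
            loss = class_loss + penalty
        grads = tape.gradient(loss, [x])
        optimizer.apply_gradients(zip(grads, [x]))
        x.assign(tf.clip_by_value(x, 0.0, 1.0))  # assumes inputs in [0, 1]

    optimized = x.numpy()[0]
    return optimized, optimized - image  # second output is the difference image

# Usage (placeholders): new_img, diff = optimize_towards_class(model, img, correct_label)
```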