Deadline: January 17th, 9am
In this assignment, we will explore the phenomenon of adversarial examples and how to defend against them. This is an active research field and the phenomenon is still somewhat poorly understood, but we will look at some basic examples.
Train a model of your choice on “your favorite dataset”. It should be an image classification task (e.g. MNIST… although a more complex dataset such as CIFAR should work better!). Alternatively, see the section below on using pretrained models. Next, let’s create some adversarial examples:
Compute the gradient of the loss with respect to the input images rather than the weights – you can do this with a GradientTape where you watch the input batch – and then perturb the inputs in the direction of that gradient so that the loss increases (see the details and the sketch below).
The above is an “untargeted” attack – we simply care about increasing the
loss as much as possible. You could also try targeted attacks where the goal
is to make the network misclassify an input in a specific way – e.g. classify
every MNIST digit as a 3. Think about how you could achieve this, and feel free
to try it out.
If you think about it, this is very similar to computing saliency maps or
activation maximization!
Some more details:
- Take the tf.math.sign of the gradient first, then multiply by a small constant.
- Evaluate your model on the adversarial examples, e.g. with model.evaluate, to compare to the performance on the original test set.
Hopefully, you are able to “break” your models somewhat reliably, but you don’t have to expect a 100% success rate with your attacks.
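To make this concrete, here is a minimal sketch of such an untargeted attack (essentially the “fast gradient sign method”). The names model, x_test and y_test as well as the epsilon value are placeholders for your own model, data and hyperparameter, and the sketch assumes inputs scaled to [0, 1] and a model that outputs class probabilities:

import tensorflow as tf

def make_adversarial(model, images, labels, epsilon=0.1):
    # epsilon is the "small constant" - the value here is just a placeholder to tune
    images = tf.convert_to_tensor(images)
    with tf.GradientTape() as tape:
        tape.watch(images)  # inputs are not Variables, so the tape has to watch them explicitly
        predictions = model(images)
        # per-example loss; pass from_logits=True here if your model outputs logits
        loss = tf.keras.losses.sparse_categorical_crossentropy(labels, predictions)
    gradients = tape.gradient(loss, images)  # gradient w.r.t. the inputs, not the weights
    # sign of the gradient times a small constant, added so that the loss goes up
    # (for a targeted attack you would compute the loss on your target labels
    # and subtract the perturbation instead)
    adversarial = images + epsilon * tf.math.sign(gradients)
    return tf.clip_by_value(adversarial, 0.0, 1.0)  # keep the images in the valid range

# compare performance on the clean and the adversarial test set
adversarial_test = make_adversarial(model, x_test, y_test)
model.evaluate(x_test, y_test)
model.evaluate(adversarial_test, y_test)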
Note: There is also a tutorial on the TF website!
Depending on your viewpoint, adversarial examples are either really amazing or
really depressing (or both). Either way, it would be desirable if our models
weren’t quite as susceptible to them as they are right now. One such “defense
method” is called adversarial training – explicitly train your models to classify
adversarial examples correctly. The procedure is quite simple (you probably want
to use a custom training step, rather than model.fit):
Note: It’s important that, during model training, you do not backpropagate through the adversarial example creation process. That is, adversarial training should look like this:
with tf.GradientTape() as adversarial_tape:
    ...  # create adversarial examples
with tf.GradientTape() as training_tape:
    ...  # train model on adversarial and/or true batch
NOT like this:
with tf.GradientTape() as training_tape:
    with tf.GradientTape() as adversarial_tape:
        ...  # create adversarial examples
    ...  # train model
There are ways to get the second option to work (and it’s less wasteful computationally), but it’s easier to get them wrong.
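For reference, here is a minimal sketch of a custom training step that follows the first pattern; the loss function, optimizer and epsilon value are placeholder choices, and the adversarial examples are created with the sign-of-the-gradient perturbation from above:

import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()  # assumes probability outputs
optimizer = tf.keras.optimizers.Adam()  # placeholder choice of optimizer

def adversarial_train_step(model, images, labels, epsilon=0.1):
    images = tf.convert_to_tensor(images)
    # first tape: create the adversarial batch, taking gradients w.r.t. the inputs only
    with tf.GradientTape() as adversarial_tape:
        adversarial_tape.watch(images)
        loss = loss_fn(labels, model(images))
    input_gradients = adversarial_tape.gradient(loss, images)
    # this happens outside the training tape, so no gradients flow through it later
    adversarial_images = images + epsilon * tf.math.sign(input_gradients)
    # second tape: an ordinary training step on the adversarial (and/or clean) batch
    with tf.GradientTape() as training_tape:
        train_loss = loss_fn(labels, model(adversarial_images))
    weight_gradients = training_tape.gradient(train_loss, model.trainable_variables)
    optimizer.apply_gradients(zip(weight_gradients, model.trainable_variables))
    return train_loss

If you want to train on both clean and adversarial examples, you can simply add the loss on the original images inside the training tape as well.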