Assignment 10: Adversarial Examples & Training

Deadline: December 19th, 9am

In this assignment, we will explore the phenomenon of adversarial examples and how to defend against them. This is an active research field and somewhat poorly understood, but we will look at some basic examples.

Creating Adversarial Examples

Train a model of your choice on “your favorite dataset”. It should be an image classification task (e.g. MNIST, although a more complex dataset such as CIFAR should work better!). Alternatively, see the section below on using pretrained models. Next, let’s create some adversarial examples:
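A standard attack to start with is the Fast Gradient Sign Method (FGSM): nudge each pixel in the direction that increases the model’s loss. Below is a minimal, self-contained sketch; the tiny untrained `model` and the random batch are placeholders standing in for your trained classifier and real data.

```python
import tensorflow as tf

# Placeholder classifier -- substitute your trained model here.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def fgsm(images, labels, epsilon=0.1):
    """Fast Gradient Sign Method: perturb each input by epsilon in the
    sign of the loss gradient, then clip back to the valid pixel range."""
    images = tf.convert_to_tensor(images)
    with tf.GradientTape() as tape:
        tape.watch(images)  # inputs are not variables, so watch explicitly
        loss = loss_fn(labels, model(images))
    gradient = tape.gradient(loss, images)
    adversarial = images + epsilon * tf.sign(gradient)
    return tf.clip_by_value(adversarial, 0.0, 1.0)

# Random stand-in batch with pixel values in [0, 1]:
x = tf.random.uniform((8, 28, 28, 1))
y = tf.random.uniform((8,), maxval=10, dtype=tf.int32)
x_adv = fgsm(x, y, epsilon=0.1)
```

The resulting `x_adv` differs from `x` by at most `epsilon` per pixel; with a trained model, feeding it back through `model` should noticeably drop the accuracy.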

Hopefully, you will be able to “break” your models somewhat reliably, but you shouldn’t expect a 100% success rate from your attacks.

Adversarial Training

Depending on your viewpoint, adversarial examples are either really amazing or really depressing (or both). Either way, it would be desirable if our models weren’t quite as susceptible to them as they are right now. One such “defense method” is called adversarial training: explicitly training your model to classify adversarial examples correctly. The procedure is quite simple:

Note It’s important that, during model training, you do not backpropagate through the adversarial example creation process. That is, adversarial training should look like this:

with tf.GradientTape() as adversarial_tape:
    ...  # create adversarial examples

with tf.GradientTape() as training_tape:
    ...  # train model on adversarial and/or true batch

NOT like this:

with tf.GradientTape() as training_tape:
    with tf.GradientTape() as adversarial_tape:
        ...  # create adversarial examples
    ...  # train model

There are ways to make the second option work (and it is computationally less wasteful), but it is easier to get wrong.
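The two-tape pattern above can be fleshed out into a full training step. The following is a sketch under assumptions not fixed by the assignment: FGSM as the attack, plain SGD as the optimizer, and training on the clean and adversarial batches concatenated together; the small model and random batch are placeholders.

```python
import tensorflow as tf

# Placeholder model and optimizer -- substitute your own.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(0.01)

def adversarial_train_step(images, labels, epsilon=0.1):
    # First tape: craft FGSM adversarial examples. This tape is only used
    # to get gradients w.r.t. the inputs, not the model parameters.
    with tf.GradientTape() as adversarial_tape:
        adversarial_tape.watch(images)
        loss = loss_fn(labels, model(images))
    grad = adversarial_tape.gradient(loss, images)
    x_adv = tf.clip_by_value(images + epsilon * tf.sign(grad), 0.0, 1.0)
    # Make the cut explicit: no training gradients flow into the attack.
    x_adv = tf.stop_gradient(x_adv)

    # Second tape: an ordinary training step on clean + adversarial inputs.
    with tf.GradientTape() as training_tape:
        batch = tf.concat([images, x_adv], axis=0)
        batch_labels = tf.concat([labels, labels], axis=0)
        loss = loss_fn(batch_labels, model(batch))
    grads = training_tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

x = tf.random.uniform((8, 28, 28, 1))
y = tf.random.uniform((8,), maxval=10, dtype=tf.int32)
loss_value = adversarial_train_step(x, y)
```

Whether to train on adversarial examples only, or mixed with the clean batch as done here, is a design choice worth experimenting with.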

Bonus: Pretrained Models

Keras comes with a host of models already trained on ImageNet. This would be a decent time to try them out. However, given the available time and resources, you most likely won’t be able to download the entire dataset or do adversarial training.
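These models live in `tf.keras.applications`; the sketch below uses MobileNetV2 (chosen for its small size) and runs the same gradient-sign attack against it. To keep it runnable offline it builds the network with random weights and a random stand-in image; pass `weights="imagenet"` to get the actual pretrained model (a ~14 MB download), and preprocess real photos with `tf.keras.applications.mobilenet_v2.preprocess_input` first.

```python
import tensorflow as tf

# weights=None builds the architecture with random weights (no download);
# use weights="imagenet" for the pretrained version.
model = tf.keras.applications.MobileNetV2(weights=None)
# MobileNetV2's classification head ends in a softmax, so the loss
# receives probabilities rather than logits.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

image = tf.random.uniform((1, 224, 224, 3))  # stand-in for a real photo
label = tf.constant([42])                    # any of the 1000 ImageNet classes

with tf.GradientTape() as tape:
    tape.watch(image)
    loss = loss_fn(label, model(image))
signed_grad = tf.sign(tape.gradient(loss, image))
# Valid pixel range depends on your preprocessing; [0, 1] matches the
# random image used here.
adversarial = tf.clip_by_value(image + 0.05 * signed_grad, 0.0, 1.0)
```

With the pretrained weights and a real image, `tf.keras.applications.mobilenet_v2.decode_predictions` lets you compare the top predicted classes before and after the perturbation.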