The dream: pwn a convolutional neural network. The goal: teach myself something. The disclaimer: I’m not a deep learning expert.

Deep learning is everywhere

Deep learning is the hot new thing. Well, it’s not really new anymore. The hot, state-of-the-art thing, then?

If it’s not already everywhere it soon will be. Go champions have been defeated. Self-driving cars are recognising things on the road thanks to deep neural networks (DNNs). Presumably, surveillance of all stripes is being upgraded with DNNs. Malware and network intrusions are being detected. And evaded.

However, kind of like the internet, machine learning was never built to be secure.

Adversarial machine learning

We’re sitting in your self-driving car as it zooms down a busy road. Halfway through the commute, the cameras in our car see a stop sign. Instead of stopping, our car accelerates through an intersection. We die. Or at least, we’re severely injured. I wonder how insurance works with self-driving cars?

What just happened? Did a hacker just take over our car? Not directly in this case – our car misinterpreted the stop sign as something else. The hacker caused our car to misinterpret its environment. Maybe they used a marker to add a few dots to the sign.

Standard machine learning algorithms aren’t built to handle an adversary. They ingest their training data assuming nothing is wrong. They predict labels for new data, also assuming nothing is wrong. If we imagine ourselves as the hacker, we can start to see a few weak points. There’s a good discussion of attacking ML algorithms in Huang et al. (2011). Let’s begin with a quick summary.

A taxonomy of attacks

Consider first the stage of the ML pipeline at which we aim our attack. Attacks can be causative, where we target the training data. If we manage to poison this, the classifier will infer a poor model. We could potentially control in what way the model turns out poor – but either way, we can exploit it. Attacks can also be exploratory – we probe the model after it’s been trained, trying to find weak spots. Maybe we learn something about how it works, or what it was trained on. Maybe we can force it to misclassify something.

Attacks can also vary in the kind of damage they aim to do. Integrity attacks aim for false negatives – we slip something past the classifier undetected. Availability attacks aim to cause so many misclassifications – whether false positives or negatives – that the classifier essentially becomes unusable. It’s a bit like a denial-of-service attack.

Finally, attacks can be targeted – aimed at a particular sample or narrow class of samples – or indiscriminate, degrading performance across the board.

Let’s attack a classifier

Adversarial machine learning sounded really cool. I figured the best way to learn something was to actually attack a classifier. With that in mind, I’ll be implementing a black-box attack as described in Papernot et al. (2016) using a technique from Goodfellow et al. (2015). In the above taxonomy, this is an exploratory attack.

I’ll be using python and tensorflow for this, targeting a convolutional neural network trained on MNIST handwritten digit data per the tensorflow tutorial. Here’s some music (nsfw?) I used to get into character.

Just before we begin, note that there is a python library called cleverhans that implements this attack. In fact, the authors of the paper we’re implementing contributed to it. What follows is half me reading the paper(s) and half pulling the library apart in an attempt to understand the attack.
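For reference, all the snippets below assume some setup boilerplate roughly like this (the exact choices – the names mnist and sess, an interactive session – are mine, following the tutorial):

import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

#load MNIST the same way the tutorial does, with one-hot labels
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
sess = tf.InteractiveSession()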

Anatomy of an attack

Most attacks tend to assume you have access to the target classifier’s internals (parameters, gradients, and so on). Then, given this information, you can craft adversarial samples that cause it to misclassify (here’s a cute example involving a kitten).

The Papernot et al. (2016) black-box attack (henceforth just “black-box attack”) assumes that we don’t have access to this information. We only have access to the predictions – that is, we have access to the target as an oracle. We can query this oracle with some data and it will give us a prediction.
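To make that concrete, here’s a minimal sketch of what oracle access might look like in this setup. The target’s output tensor and placeholders (which I’ll call oracle_logits, x and keep_prob) are assumptions based on the tutorial CNN – the point is simply that the attacker only ever sees the prediction, never the weights or gradients:

def oracle_predict(images):
    #query the target CNN and keep only its predicted labels,
    #one-hot encoded – the only information the black-box attack
    #assumes we can get out of the target
    probs = sess.run(oracle_logits, feed_dict={x: images, keep_prob: 1.0})
    return np.eye(10)[np.argmax(probs, axis=1)]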

The black-box attack works by exploiting this access to train a substitute model. This substitute model essentially duplicates the target model. We’re not out to maximise accuracy directly, but to learn the target’s decision boundaries.

In fact, we don’t even have to use the same type of model as the target. This is handy, because we may not know what it is. We could make some educated guesses, but not needing to makes the attack more realistic. In this example, the target model is a convolutional neural network and the adversarial model is plain old multinomial logit.
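To pin that down, here’s a minimal sketch of the kind of substitute I mean – softmax regression on the raw pixels. The names xm and yhat match the snippets further down; the rest (ym, the optimiser, the learning rate) are assumptions about my setup:

#substitute (adversarial) model: multinomial logit on 784 raw pixels
xm = tf.placeholder(tf.float32, [None, 784])
ym = tf.placeholder(tf.float32, [None, 10])

Wm = tf.Variable(tf.zeros([784, 10]))
bm = tf.Variable(tf.zeros([10]))
yhat = tf.matmul(xm, Wm) + bm  #logits of the adversary

adv_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
  labels = ym, logits = yhat))
adv_train_step = tf.train.GradientDescentOptimizer(0.5).minimize(adv_loss)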

To train the adversarial model, we start with 100 samples from the MNIST training set. We then get the oracle to label these for us; that defines our training labels. Our training loop then looks like this:

adv_train_epochs = 5
#next_batch returns (images, labels); keep only the images –
#the labels will come from the oracle, not from MNIST
adv_train_set = mnist.train.next_batch(100)[0]

for adv_train_epoch in range(adv_train_epochs):
    #ask the target (the oracle) to label our current training set
    oracle_labels = oracle_predict(adv_train_set)

    #fit the substitute model to the oracle's labels
    #(train, like augment below, is a stand-in for the pieces described next)
    train(adv_model, {x: adv_train_set, y: oracle_labels})

    #Jacobian-based dataset augmentation, explained below
    adv_train_set = augment(adv_train_set)

Here, we run through adv_train_epochs of adversarial training. Each epoch is called a substitute training epoch (although I’ve called it an adversarial training epoch in the code). The only other thing that requires explanation here is the augment step. What it does is build a new synthetic dataset based on the old one. It takes every training sample we have and perturbs it a bit (new_example = example + lambda*perturbation_vector, where lambda is a perturbation factor). This new, perturbed dataset is then added to the old one. So if we had 100 samples, now we have 200, where half of them are perturbed versions of the other half.

But we don’t just use white noise to generate the perturbation vector. We use a heuristic called Jacobian-based dataset augmentation. First, we grab the Jacobian of our adversarial model (a matrix of gradients of our predictions – of the adversarial model – with respect to inputs). We then evaluate it at every training sample we have. Finally, we take the row of that Jacobian corresponding to the label the oracle assigns to that particular sample. So in practice the Jacobian is a list of lists – 10 lists of 784 elements (28x28 MNIST images). If the oracle predicts a 7, we choose the row for class 7 out of the 10. We can therefore define some helper functions:

def jacobian(predictions, inputs, num_classes):
    #gradient of each class's prediction with respect to the inputs –
    #that is, how does the cth element of yhat vary wrt x?
    return [tf.gradients(predictions[:, c], inputs)[0]
            for c in range(num_classes)]

def jacobian_prediction_dimension(grads, predictions):
    #for each sample i, pick the row of the Jacobian corresponding
    #to the class the oracle predicted for that sample
    return [grads[predictions[i]][i] for i in np.arange(len(predictions))]

This information – how our model output varies with respect to the inputs – lets us generate some useful variance from which to learn the oracle’s decision boundaries. To complete the heuristic, we take the sign of our selected row of the Jacobian and add it to the original example, subject to some perturbation factor lambda. Lambda can vary over the training epochs; here we’ve made it flip sign every tau epochs. The code for the augment heuristic:

#Jacobian-based dataset augmentation
#note that yhat is the logit output of the adversary
#and oracle_labels is one-hot encoded
grads = sess.run(jacobian(yhat, xm, 10), feed_dict={xm: adv_train_set})
jpd = jacobian_prediction_dimension(grads, np.argmax(oracle_labels, 1))

perturbed_set = []
#lambda flips sign every tau substitute training epochs
jbda_epoch_lambda = jbda_lambda * np.power(-1, np.floor(adv_train_epoch/tau))
for idx, example in enumerate(adv_train_set):
    #perturb each sample along the sign of its selected Jacobian row
    new_example = example + jbda_epoch_lambda * np.sign(jpd[idx])
    perturbed_set.append(new_example)
#stack the perturbed copies under the originals: the set doubles in size
adv_train_set = np.vstack((adv_train_set, np.array(perturbed_set)))

And that’s it! We’ve now trained the adversary.
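One sanity check worth running at this point (my own aside, not from the paper): how often does the substitute now agree with the oracle? If the substitute really has picked up the target’s decision boundaries, the two should mostly predict the same labels on held-out data:

#compare substitute and oracle predictions on the test set
oracle_test_labels = oracle_predict(mnist.test.images)
substitute_logits = sess.run(yhat, feed_dict={xm: mnist.test.images})
agreement = np.mean(np.argmax(substitute_logits, 1) ==
                    np.argmax(oracle_test_labels, 1))
print("Substitute agrees with oracle on %f of test images" % agreement)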

Generating adversarial examples

Now we want to generate some adversarial examples that pass the human sniff test but are misclassified.

Now that we have a substitute model we can use some of the white-box attacks developed elsewhere in the literature. Recall that these attacks assumed we have some knowledge of the target classifier. Now we do – we’ve trained a substitute model to mimic the target’s decision boundaries!

Generating adversarial examples involves perturbing some original input in some fashion. Right off the bat we can think of one method – what if we just added random perturbations? A smarter idea might be to run some sort of optimisation method over the noise we generate – genetic algorithms, particle swarm optimisation, simulated annealing. This is feasible since we have access to a substitute model (which we can query in our own time without being detected). But it may not be the smartest.
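Just to ground that first idea, a random-perturbation baseline with the same budget as the attack below might look something like this (my own sketch, not from the papers). The whole point of the gradient-based method that follows is to spend the same budget in the direction the loss gradient says will hurt most, rather than in a random one:

#random-sign baseline: same epsilon budget, but no gradient information
random_signs = np.sign(np.random.uniform(-1.0, 1.0, size=mnist.test.images.shape))
random_examples = np.clip(mnist.test.images + 0.3 * random_signs, 0.0, 1.0)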

I use the method described in Goodfellow et al. (2015) – henceforth the Goodfellow attack or Fast Gradient Sign Method (FGSM). Essentially, this attack perturbs all of the image a little bit (there’s another attack, which perturbs some of the image a lot, described in the paper). We calculate the perturbation vector in a similar fashion to how we calculate it for Jacobian-based dataset augmentation above, except the gradient is of the loss function with respect to the inputs, using the model’s own predictions as labels (to prevent label leakage, described in Kurakin et al. 2016 – thanks cleverhans!).

Here’s how I did it:

#Goodfellow attack (FGSM)
goodfellow_eps = 0.3
#grab the adversary predictions of the test data to use as labels: adv_onehot
#define loss function
goodfellow_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
  labels = adv_onehot, logits = yhat))

#calculate the signed gradient matrix of loss wrt x given labels
fgsm = tf.sign(tf.gradients(goodfellow_loss, xm)[0])

#add it to the examples subject to a perturbation factor epsilon
adv_test = tf.stop_gradient(xm + goodfellow_eps * fgsm)

#clip to remain in the MNIST domain [0, 1]
adv_test_clip = tf.clip_by_value(adv_test, 0.0, 1.0)

#get tensorflow to generate them
adv_examples = sess.run(adv_test_clip, feed_dict={xm: mnist.test.images})

And then some reports:

#calculate accuracy of oracle on normal test data and perturbed test data
test_acc_nonadv = sess.run(accuracy, feed_dict =
  {x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0})
test_acc_adv = sess.run(accuracy, feed_dict =
  {x: adv_examples, y_: mnist.test.labels, keep_prob: 1.0})

print("Test accuracy %f" % test_acc_nonadv)
print("Test accuracy (after attack) %f" % test_acc_adv)
print("Accuracy reduction %f" % (test_acc_nonadv - test_acc_adv))

After running it a few times (rather quickly, because logits are fast to train), I usually got an accuracy reduction of about 20 percentage points. That takes the model down from 99.2 per cent accuracy to around 80 per cent. One caveat: the epsilon I’m using as a perturbation factor, 0.3, seems high – but it’s reasonable for a first pass at the attack (I may have gotten the code wrong, after all!).
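If I wanted to chase that caveat properly, a simple sweep over epsilon (my own follow-up, not from the papers) would show how much of the accuracy reduction survives at less visible perturbation levels:

#sweep the perturbation factor and watch the oracle's accuracy fall
#(building a fresh tensor per epsilon is wasteful, but fine for a quick look)
for eps in [0.05, 0.1, 0.2, 0.3]:
    adv_sweep_op = tf.clip_by_value(tf.stop_gradient(xm + eps * fgsm), 0.0, 1.0)
    adv_sweep = sess.run(adv_sweep_op, feed_dict={xm: mnist.test.images})
    acc = sess.run(accuracy, feed_dict=
      {x: adv_sweep, y_: mnist.test.labels, keep_prob: 1.0})
    print("epsilon %.2f -> oracle accuracy %f" % (eps, acc))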

Here’s a particularly entertaining log:

Test accuracy 0.992600
Test accuracy (after attack) 0.664400
Accuracy reduction 0.328200

I think an accuracy reduction of 33 percentage points is enough to make a classifier unusable. Cool!

Further reading