DeepXplore: Automated Whitebox
Testing of Deep Learning Systems
By Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana
DOI: 10.1145/3361566
Abstract
Deep learning (DL) systems are increasingly deployed in
safety- and security-critical domains such as self-driving
cars and malware detection, where the correctness and predictability of a system’s behavior for corner case inputs are
of great importance. Existing DL testing depends heavily on
manually labeled data and therefore often fails to expose
erroneous behaviors for rare inputs.
We design, implement, and evaluate DeepXplore, the first
white-box framework for systematically testing real-world
DL systems. First, we introduce neuron coverage for measuring the parts of a DL system exercised by test inputs. Next, we
leverage multiple DL systems with similar functionality as
cross-referencing oracles to avoid manual checking. Finally,
we demonstrate how finding inputs for DL systems that
both trigger many differential behaviors and achieve high
neuron coverage can be represented as a joint optimization
problem and solved efficiently using gradient-based search
techniques.
DeepXplore efficiently finds thousands of incorrect corner case behaviors (e.g., self-driving cars crashing into
guard rails and malware masquerading as benign software) in state-of-the-art DL models with thousands of neurons trained on five popular datasets such as ImageNet
and Udacity self-driving challenge data. For all tested DL
models, on average, DeepXplore generated one test input
demonstrating incorrect behavior within one second while
running only on a commodity laptop. We further show that
the test inputs generated by DeepXplore can also be used to
retrain the corresponding DL model to improve the model’s
accuracy by up to 3%.
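To make the neuron coverage and joint optimization ideas above concrete, the following is a minimal sketch of how they could be expressed for a Keras model. It is an illustration only, not DeepXplore's actual implementation: the 0.75 activation threshold, the lambda weights, the per-batch scaling scheme, and the helper names are assumptions.

import numpy as np
import tensorflow as tf

def neuron_coverage(model, inputs, threshold=0.75):
    # Fraction of neurons whose scaled activation exceeds `threshold` for at
    # least one input in the batch `inputs` (threshold value is illustrative).
    outputs = [l.output for l in model.layers
               if not isinstance(l, tf.keras.layers.InputLayer)]
    probe = tf.keras.Model(model.input, outputs)
    covered, total = 0, 0
    for acts in probe(np.asarray(inputs)):
        acts = acts.numpy().reshape(len(inputs), -1)
        # Scale each neuron's activations to [0, 1] across the batch.
        lo, hi = acts.min(axis=0), acts.max(axis=0)
        scaled = (acts - lo) / (hi - lo + 1e-8)
        covered += int((scaled.max(axis=0) > threshold).sum())
        total += scaled.shape[1]
    return covered / total

def joint_objective(models, x, label, uncovered_neurons,
                    lambda1=1.0, lambda2=0.1):
    # Differential-behavior term: push models[0] away from `label` while the
    # remaining models keep predicting it; `label` is the class all models
    # currently agree on for input x.
    obj1 = (tf.add_n([m(x)[0, label] for m in models[1:]])
            - lambda1 * models[0](x)[0, label])
    # Coverage term: reward activating currently uncovered neurons, passed in
    # here as a list of scalar activation tensors for the input x.
    obj2 = tf.add_n(uncovered_neurons)
    return obj1 + lambda2 * obj2

Gradient ascent on the input x to maximize this objective (the gradient-based search mentioned above) then yields difference-inducing, coverage-increasing test inputs.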
1. INTRODUCTION
Over the past few years, Deep Learning (DL) has made
tremendous progress, achieving or surpassing human-level
performance for a diverse set of tasks in many application
domains. These advances have led to widespread adoption
and deployment of DL in security- and safety-critical systems such as self-driving cars,1 malware detection,4 and aircraft collision avoidance systems.6
This wide adoption of DL techniques presents new challenges, as the predictability and correctness of such systems are of crucial importance. Unfortunately, DL systems, despite their impressive capabilities, often demonstrate unexpected or incorrect behaviors in corner cases for several reasons such as biased training data and overfitting of the models. In safety- and security-critical settings, such incorrect behaviors can lead to disastrous consequences such as a fatal collision of a self-driving car. For example, a Google self-driving car recently crashed into a bus because it expected the bus to yield under a set of rare conditions but the bus did not.a A Tesla car in autopilot mode crashed into a trailer because the autopilot system failed to recognize the trailer as an obstacle due to its “white color against a brightly lit sky” and the “high ride height”.b Such corner cases were not part of Google’s or Tesla’s test sets and thus never showed up during testing.
Therefore, DL systems, just like traditional software, must be tested systematically for different corner cases to detect and ideally fix any potential flaws or undesired behaviors. This presents a new systems problem, as automated and systematic testing of large-scale, real-world DL systems with thousands of neurons and millions of parameters for all corner cases is extremely challenging.
The standard approach for testing DL systems is to gather
and manually label as much real-world test data as possible.
Some DL systems, such as Google self-driving cars, also use simulation to generate synthetic training data. However, such simulation is completely unguided, as it does not consider the internals of the target DL system. Therefore, for
the large input spaces of real-world DL systems (e.g., all possible road conditions for a self-driving car), none of these
approaches can hope to cover more than a tiny fraction (if
any at all) of all possible corner cases.
Recent works on adversarial deep learning3 have demonstrated that synthetic images carefully crafted by adding minimal perturbations to an existing image can fool state-of-the-art DL systems. The key idea is to create synthetic images that get classified by DL models differently than the original picture but still look the same to the human eye. Although such adversarial images expose some erroneous behaviors of a DL model, the main restriction of this approach is that it must limit its perturbations to tiny, invisible changes and requires ground truth labels. Moreover, just like other forms of existing DL testing, the adversarial images only cover a small part (52.3%) of the DL system’s logic, as shown in Section 5. In essence, the current machine learning testing practices for finding incorrect corner cases are analogous to finding bugs in traditional software by using
The original version of this paper was published in Proceedings of the 26th Symposium on Operating Systems Principles (Shanghai, China, Oct. 28–31, 2017), 1–18.

a. http://www.theverge.com/2016/2/29/11134344/google-self-driving-car-crash-report
b. https://electrek.co/2016/07/01/understanding-fatal-tesla-accident-autopilot-nhtsa-probe/
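For contrast with the whitebox approach above, the adversarial perturbation idea discussed in Section 1 can be sketched as a fast-gradient-sign-style step. This is a generic illustration, not DeepXplore's technique; the model (assumed to output class probabilities), the loss, and the epsilon value are assumptions, and, as noted above, it requires the ground truth label.

import tensorflow as tf

def fgsm_perturb(model, image, true_label, epsilon=0.01):
    # `image` is assumed to be a float array scaled to [0, 1]. Add a small,
    # nearly invisible perturbation that increases the model's loss on the
    # ground-truth label, which tends to change its prediction.
    x = tf.convert_to_tensor(image[None, ...], dtype=tf.float32)
    y = tf.convert_to_tensor([true_label])
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.keras.losses.sparse_categorical_crossentropy(y, model(x))
    grad = tape.gradient(loss, x)
    # Step in the sign of the gradient; keep pixel values in a valid range.
    return tf.clip_by_value(x + epsilon * tf.sign(grad), 0.0, 1.0)[0]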