The main components
of DeepXplore are shown in Figure 5. DeepXplore takes
unlabeled test inputs as seeds and generates new tests that
cover a large number of neurons (i.e., activate them to a
value above a customizable threshold) while causing the
tested DNNs to behave differently. Specifically, DeepXplore
solves a joint optimization problem that maximizes both
differential behaviors and neuron coverage. Note that both
goals are crucial for thorough testing of DNNs and finding diverse erroneous corner case behaviors. High neuron
coverage alone may not induce many erroneous behaviors,
whereas just maximizing different behaviors might simply
identify different manifestations of the same underlying root cause.
DeepXplore also supports enforcing custom domain-specific constraints as part of the joint optimization process.
For example, the value of an image pixel has to be between
0 and 255. Such domain-specific constraints can be specified by the users of DeepXplore to ensure that the generated
test inputs are valid and realistic.
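To make these two notions concrete, here is a minimal sketch in Python/NumPy of a neuron-coverage metric with a customizable threshold and an image-domain constraint. All names and values here are hypothetical illustrations, not DeepXplore's actual implementation.

```python
import numpy as np

def neuron_coverage(activations, t=0.5):
    # activations: (num_inputs, num_neurons) matrix of neuron outputs.
    # A neuron counts as covered if its output exceeds the customizable
    # threshold t on at least one test input.
    covered = (activations > t).any(axis=0)
    return covered.mean()

def clip_to_valid_image(x):
    # Example domain-specific constraint: image pixels must stay in [0, 255].
    return np.clip(x, 0, 255)

acts = np.array([[0.9, 0.1, 0.0],
                 [0.2, 0.7, 0.3]])
print(neuron_coverage(acts, t=0.5))  # 2 of 3 neurons exceed 0.5 on some input
```

Maximizing this coverage fraction jointly with a differential-behavior term, while applying the clipping constraint after every update, is the shape of the optimization problem described above.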
We designed an algorithm for efficiently solving the
joint optimization problem mentioned above using gradient ascent. First, we compute the gradients of the outputs
of the neurons in both the output and hidden layers with
respect to the input, treating the input value as a variable and the weight parameters as
constants. Such gradients can be computed efficiently for
most DNNs. Note that DeepXplore is designed to operate
on pretrained DNNs. The gradient computation is efficient
because our whitebox approach has access to the pretrained
DNNs’ weights and the intermediate neuron values. Next,
we iteratively perform gradient ascent to modify the test
input toward maximizing the objective function of the joint
optimization problem described above. Essentially, we perform a gradient-guided local search starting from the seed
inputs and find new inputs that maximize the desired goals.
Note that, at a high level, our gradient computation is similar to the backpropagation performed during the training of
a DNN, but the key difference is that, unlike our algorithm,
backpropagation treats the input value as a constant and the
weight parameter as a variable.
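The gradient-ascent loop can be sketched as follows. The two "DNNs" here are toy linear models with hand-derived gradients, assumed purely for illustration; a real implementation would obtain the gradients from a framework's automatic differentiation over the pretrained networks.

```python
import numpy as np

w1 = np.array([1.0, -2.0])   # weights of DNN 1 (held constant)
w2 = np.array([0.5,  1.0])   # weights of DNN 2 (held constant)
v  = np.array([0.0,  3.0])   # weights feeding an uncovered hidden neuron
lam = 0.1                    # balances differential behavior vs. coverage

def objective(x):
    # Maximize (difference between the two DNNs' outputs)
    # plus lam * (output of the neuron we want to cover).
    return (w1 @ x - w2 @ x) + lam * (v @ x)

x = np.array([0.2, 0.2])         # seed test input
for _ in range(100):
    grad = (w1 - w2) + lam * v   # d(objective)/dx; weights stay constant
    x = x + 0.01 * grad          # gradient ascent step on the *input*
    x = np.clip(x, 0.0, 1.0)     # domain constraint keeps the input valid
```

Note the inversion relative to training: the weights never change, and each step moves the input itself toward higher objective value, starting from the seed.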
A working example. We use Figure 6 as an example to
show how DeepXplore generates test inputs. Consider that
we have two DNNs to test—both perform similar tasks, that
is, classifying images into cars or faces, as shown in Figure 6,
but they are trained independently with different datasets
and parameters. Therefore, the DNNs will learn similar but
slightly different classification rules. Let us also assume that
classifiers (with state-of-the-art performance on randomly
picked testing sets) still incorrectly classify synthetic images
generated by adding humanly imperceptible perturbations to
a test image.3 However, the adversarial inputs, similar to random
test inputs, also only cover a small part of the rules learned
by a DNN as they are not designed to maximize coverage.
Moreover, they are also inherently limited to small imperceptible perturbations around a test input as larger perturbations
will visually change the input and therefore will require manual
inspection to ensure correctness of the DNN’s decision.
Problems with low-coverage DNN tests. To better understand the problem of low test coverage of rules learned by a
DNN, we provide an analogy to a similar problem in testing
traditional software. Figure 4 shows a side-by-side comparison of how a traditional program and a DNN handle inputs
and produce outputs. Specifically, the figure shows the
similarity between traditional software and DNNs: in a software
program, each statement performs a certain operation to
transform the output of previous statement(s) into the input
of the following statement(s), whereas in a DNN, each neuron
transforms the output of previous neuron(s) into the input of
the following neuron(s). Of course, unlike traditional software, DNNs do not have explicit branches but a neuron’s
influence on the downstream neurons decreases as the neuron’s output value gets lower. A lower output value indicates
less influence and vice versa. When the output value of a
neuron becomes zero, the neuron does not have any influence on the downstream neurons.
As demonstrated in Figure 4a, the problem of low coverage in testing traditional software is obvious. In this case,
the buggy behavior will never be seen unless the test input
is 0xdeadbeef. The chances of randomly picking such
a value are very small. Similarly, low-coverage test inputs
will also leave different behaviors of DNNs unexplored. For
example, consider a simplified neural network, as shown
in Figure 4b, that takes an image as an input and classifies
it into two different classes: cars and faces. The text in each
neuron (represented as a node) denotes the object or property that the neuron detects,c and the number in each neuron is the real value outputted by that neuron. The number
indicates how confident the neuron is about its output.
Note that randomly picked inputs are highly unlikely to
set high output values for the unlikely combination of
neurons. Therefore, many incorrect DNN behaviors will
remain unexplored even after performing a large number
of random tests. For example, if an image causes neurons
labeled as “Nose” and “Red” to produce high output values
and the DNN misclassifies the input image as a car, such
a behavior will never be seen during regular testing as the
chances of an image containing a red nose (e.g., a picture
of a clown) are very small.
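To make the 0xdeadbeef analogy concrete, the following hypothetical sketch measures how often random testing reaches such a rare branch; the program and trial count are assumptions for illustration only.

```python
import random

def buggy_program(x):
    # Hypothetical program from the Figure 4a analogy: the buggy
    # behavior is guarded by a single rare branch condition.
    if x == 0xdeadbeef:
        return "buggy behavior"
    return "normal behavior"

random.seed(0)  # deterministic run for illustration
trials = 100_000
hits = sum(buggy_program(random.getrandbits(32)) == "buggy behavior"
           for _ in range(trials))
print(hits)  # random 32-bit inputs virtually never hit the branch
```

Even 100,000 random inputs exercise the buggy branch with probability only about 100,000/2^32, which is roughly one chance in 43,000, so the erroneous behavior goes unobserved, just as rare neuron-activation combinations go unobserved under random DNN testing.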
In this section, we provide a general overview of DeepXplore,
our whitebox framework for systematically testing DNNs for erroneous corner case behaviors.
Figure 5. DeepXplore workflow.
c Note that one cannot always map each neuron to a particular task, i.e.,
detecting specific objects/properties. Figure 4b simply highlights that different neurons often tend to detect different features.