It has recently been shown that each neuron in a DNN tends to independently extract a specific feature of the input instead of collaborating with other neurons for feature extraction.18 This finding intuitively explains why neuron coverage is a good metric for DNN testing comprehensiveness. To empirically confirm this observation, we perform two different experiments, as described below.
First, we show that neuron coverage is a significantly better metric than code coverage for measuring the comprehensiveness of DNN test inputs. More specifically, we find that a small number of test inputs can achieve 100% code coverage for all DNNs while neuron coverage stays below 34%. Second, we evaluate neuron activations for test inputs from different classes. Our results show that inputs from different classes tend to activate more unique neurons than inputs from the same class. Both findings confirm that neuron coverage provides a good estimate of the number and types of DNN rules exercised by an input.
Neuron coverage vs. code coverage. We compare both
code and neuron coverage achieved by the same number of inputs by evaluating the test DNNs on ten randomly picked testing samples, as described in Section 5.1. We measure
a DNN’s code coverage in terms of the line coverage of the
Python code used in the training and testing process. We set
the threshold t for neuron coverage to 0.75; that is, a neuron is
considered covered only if its output is greater than 0.75 for
at least one input.
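For concreteness, the following is a minimal sketch of how neuron coverage at t = 0.75 could be computed from recorded activations. The function name, the per-layer array layout, and the assumption that activations have already been scaled to [0, 1] within each layer are ours; this is an illustration, not the DeepXplore implementation. The line coverage of the Python training and testing code, by contrast, can be collected with an off-the-shelf tool such as coverage.py.

```python
import numpy as np

def neuron_coverage(layer_activations, t=0.75):
    """Fraction of neurons whose output exceeds t for at least one input.

    layer_activations: list with one array per layer, each of shape
    (num_inputs, num_neurons), holding activations assumed to be scaled
    to [0, 1] within each layer (hypothetical layout).
    """
    covered = 0
    total = 0
    for acts in layer_activations:
        # A neuron is covered if it exceeds t on at least one input.
        covered += int(np.count_nonzero(acts.max(axis=0) > t))
        total += acts.shape[1]
    return covered / total

# Toy usage: 10 random inputs over two layers of 128 and 10 neurons.
rng = np.random.default_rng(0)
activations = [rng.random((10, 128)), rng.random((10, 10))]
print(f"neuron coverage at t=0.75: {neuron_coverage(activations):.1%}")
```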
The results, as shown in Table 5, clearly demonstrate that
neuron coverage is a significantly better metric than code
coverage for measuring DNN testing comprehensiveness.
Even 10 randomly picked inputs result in 100% code coverage for all DNNs, whereas the neuron coverage never goes
above 34% for any of the DNNs. Moreover, neuron coverage
changes significantly based on the tested DNNs and the test
inputs. For example, the neuron coverage for the complete
MNIST testing set (i.e., 10,000 testing samples) only reaches
57.7%, 76.4%, and 83.6% for C1, C2, and C3, respectively. In contrast, the neuron coverage for the complete Contagio/VirusTotal test set reaches 100%.
Activation of neurons for different classes of inputs. We
measure the number of active neurons that are common
across the LeNet-5 DNN running on pairs of MNIST inputs
of the same and different classes, respectively. In particular,
we randomly select 200 input pairs where 100 pairs have the
same label (e.g., labeled as 8) and 100 pairs have different labels.
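As a rough illustration of this measurement, the sketch below counts the neurons activated (output above the threshold) by both inputs of a pair; the helper names and the per-layer activation arrays are hypothetical, assuming the same [0, 1] scaling and t = 0.75 as above.

```python
import numpy as np

def active_neurons(layer_outputs, t=0.75):
    """Set of (layer_index, neuron_index) pairs activated by one input.

    layer_outputs: list of 1-D arrays of scaled activations, one per layer
    (hypothetical layout; any per-input activation dump works).
    """
    active = set()
    for li, acts in enumerate(layer_outputs):
        for ni in np.flatnonzero(acts > t):
            active.add((li, int(ni)))
    return active

def common_active(acts_a, acts_b, t=0.75):
    """Number of neurons activated by both inputs of a pair."""
    return len(active_neurons(acts_a, t) & active_neurons(acts_b, t))

# Toy usage with random "activations" for two inputs over two layers.
rng = np.random.default_rng(1)
a = [rng.random(84), rng.random(10)]   # e.g., LeNet-5's last two layers
b = [rng.random(84), rng.random(10)]
print("overlapping active neurons:", common_active(a, b))
```

Averaging this count separately over the same-class and the different-class pairs yields the comparison described above.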
Figure 7. The first row shows the seed test inputs and the second row shows the difference-inducing test inputs generated by DeepXplore. The left three columns show results under different lighting effects, the middle three use a single occlusion box, and the right three use black rectangles as the transformation constraints. For each type of transformation (three pairs of images), the images from left to right are from the self-driving car, MNIST, and ImageNet datasets. (Panel labels, shared prediction vs. the deviating DNN's prediction: all:right vs. DRV_C1:left; all:1 vs. MNI_C1:8; all:diver vs. IMG_C1:ski; all:right vs. DRV_C1:left; all:5 vs. MNI_C1:3; all:cauliflower vs. IMG_C1:carbonara; all:left vs. DRV_C1:right; all:1 vs. MNI_C1:2; all:castle vs. IMG_C1:beacon.)
Table 3. The features added to the manifest file for generating two malware inputs that Android app classifiers (Drebin) incorrectly mark as benign.
Input | Added features
input 1 | feature::…
input 2 | provider::…
Table 4. The top-3 most in(de)cremented features for generating two sample malware inputs that PDF classifiers incorrectly mark as benign.
Input | Features
input 1 | …
input 2 | …
Table 5. Comparison of code coverage and neuron coverage for
10 randomly selected inputs from the original test set of each DNN.
Dataset | Code coverage (C1 / C2 / C3) | Neuron coverage (C1 / C2 / C3)
MNIST | 100% / 100% / 100% | 32.7% / 33.1% / 25.7%
ImageNet | 100% / 100% / 100% | 1.5% / 1.1% / 0.3%
Driving | 100% / 100% / 100% | 2.5% / 3.1% / 3.9%
VirusTotal | 100% / 100% / 100% | 19.8% / 17.3% / 17.3%
Drebin | 100% / 100% / 100% | 16.8% / 10% / 28.6%