fragile with respect to feature selection
and preprocessing. For example, the
coefficient corresponding to the association between flu risk and vaccination might be positive or negative, depending on whether the feature set
includes indicators of old age, infancy,
or immunodeficiency.
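To make this concrete, here is a minimal sketch on synthetic data (the variable names, effect sizes, and confounding structure are invented purely for illustration, not drawn from any real study). It fits the same outcome with and without an old-age indicator and watches the vaccination coefficient change sign:

```python
# Illustrative sketch: the sign of a linear coefficient can flip
# depending on which other features are included (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000

old_age = rng.binomial(1, 0.3, n)                   # confounder
vaccinated = rng.binomial(1, 0.2 + 0.6 * old_age)   # the old vaccinate more often
flu_risk = 3.0 * old_age - 0.5 * vaccinated + rng.normal(0, 0.5, n)

# Model 1: vaccination only -- the coefficient absorbs the age effect.
m1 = LinearRegression().fit(vaccinated.reshape(-1, 1), flu_risk)
# Model 2: vaccination plus the old-age indicator.
m2 = LinearRegression().fit(np.column_stack([vaccinated, old_age]), flu_risk)

print("vaccination coefficient, age omitted: ", m1.coef_[0])   # positive
print("vaccination coefficient, age included:", m2.coef_[0])   # roughly -0.5
```

With the age indicator omitted, the vaccination coefficient soaks up the correlated age effect and comes out positive; adding the indicator restores the negative association.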
Algorithmic transparency. A final notion of transparency might apply at the
level of the learning algorithm itself. In
the case of linear models, you may understand the shape of the error surface.
You can prove that training will converge to a unique solution, even for previously unseen datasets. This might
provide some confidence that the model will behave in an online setting requiring programmatic retraining on
previously unseen data. On the other
hand, modern deep learning methods
lack this sort of algorithmic transparency. While the heuristic optimization
procedures for neural networks are demonstrably powerful, we do not understand how they work, and at present
cannot guarantee a priori they will
work on new problems. Note, however,
that humans exhibit none of these
forms of transparency.
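As a rough illustration of this asymmetry (a sketch on synthetic data, not a proof), ordinary least squares reaches the same unique minimizer from any starting point, while two identically configured neural networks that differ only in their random seed settle into different solutions:

```python
# Sketch: the least-squares objective is convex with one minimum, reached
# from any initialization; a small neural network's solution depends on
# its random seed (synthetic data).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 200)

# Closed-form least-squares solution.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient descent from two random starts converges to that same minimum.
def gd(w, lr=0.01, steps=5000):
    for _ in range(steps):
        w -= lr * (X.T @ (X @ w - y)) / len(y)
    return w

w_a = gd(rng.normal(size=5))
w_b = gd(rng.normal(size=5))
print(np.allclose(w_a, w_star, atol=1e-3), np.allclose(w_b, w_star, atol=1e-3))

# Two identically configured networks, differing only in random seed,
# typically end with different weights and different training loss.
nets = [MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                     random_state=s).fit(X, y) for s in (0, 1)]
print([round(net.loss_, 4) for net in nets])
```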
Post hoc interpretability represents a
distinct approach to extracting information from learned models. While
post hoc interpretations often do not
elucidate precisely how a model works,
they may nonetheless confer useful information for practitioners and end users of machine learning. Some common approaches to post hoc
interpretations include natural language explanations, visualizations of
learned representations or models,
and explanations by example (for example, a particular tumor is classified
as malignant because to the model it
looks a lot like certain other tumors).
To the extent that we might consider
humans to be interpretable, this is the
sort of interpretability that applies. For
all we know, the processes by which humans make decisions and those by
which they explain them may be distinct. One advantage of this concept of
interpretability is that opaque models
can be interpreted after the fact, without sacrificing predictive performance.
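Explanation by example, mentioned above, admits a particularly simple sketch. For brevity this version measures similarity in raw feature space; a real system would typically use the model's learned representation:

```python
# Sketch of explanation by example: alongside a prediction, report the
# training instances considered most similar to the query. Raw features
# stand in here for a learned representation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(random_state=0).fit(X[:-1], y[:-1])
index = NearestNeighbors(n_neighbors=3).fit(X[:-1])

query = X[-1:]                        # a held-out "new tumor"
pred = clf.predict(query)[0]
_, neighbor_ids = index.kneighbors(query)

print("predicted class:", pred)
print("most similar training cases:", neighbor_ids[0],
      "with labels", y[neighbor_ids[0]])
```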
Text explanations. Humans often justify decisions verbally. Similarly, one model might be trained to generate predictions, and a separate model trained to generate an explanation of those predictions. Such an explanation may prove informative even without shedding light on a model's inner workings. For example, a diagnosis model might provide intuition to a human decision maker by pointing to similar cases in support of a diagnostic decision. In some cases, a supervised learning model is trained when the real task more closely resembles unsupervised learning: the real goal might be to explore the underlying structure of the data, and the labeling objective serves only as weak supervision.
Fair and ethical decision making. At present, politicians, journalists, and researchers have expressed concern that interpretations must be produced to assess whether decisions made automatically by algorithms conform to ethical standards. 7 Recidivism predictions are already used to
determine whom to release and whom
to detain, raising ethical concerns.
How can you be sure predictions do not
discriminate on the basis of race? Conventional evaluation metrics such as
accuracy or AUC (area under the curve)
offer little assurance that ML-based decisions will behave acceptably. Thus,
demands for fairness often lead to demands for interpretable models.
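The following synthetic sketch illustrates why: two groups can experience very different error rates behind an apparently respectable overall accuracy (the group variable and the per-group rates are invented for illustration):

```python
# Sketch: identical overall accuracy can hide very different error rates
# across groups, so accuracy alone says little about fairness (synthetic).
import numpy as np

rng = np.random.default_rng(0)
group = rng.binomial(1, 0.5, 10_000)     # 0/1 protected attribute
y_true = rng.binomial(1, 0.5, 10_000)

# A classifier that is right 95% of the time for group 0
# but only 75% of the time for group 1.
correct = rng.random(10_000) < np.where(group == 0, 0.95, 0.75)
y_pred = np.where(correct, y_true, 1 - y_true)

print("overall accuracy:", (y_pred == y_true).mean())   # about 0.85
for g in (0, 1):
    mask = group == g
    print(f"accuracy for group {g}:", (y_pred[mask] == y_true[mask]).mean())
```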
The Transparency
Notion of Interpretability
Let’s now consider the techniques and
model properties that are proposed
to confer interpretability. These fall
broadly into two categories. The first
relates to transparency (that is, how
does the model work?). The second
consists of post hoc explanations (that is, what else can the model tell me?).
Informally, transparency is the opposite of opacity or “black-boxness.” It connotes some sense of understanding the
mechanism by which the model works.
Transparency is considered here at the level of the entire model (simulatability), at the level of individual components such as parameters (decomposability), and at the level of the training algorithm (algorithmic transparency).
Simulatability. In the strictest sense,
a model might be called transparent if
a person can contemplate the entire
model at once. This definition suggests
an interpretable model is a simple
model. For example, for a model to be
fully understood, a human should be
able to take the input data together
with the parameters of the model and
in reasonable time step through every
calculation required to produce a prediction. This accords with the common
claim that sparse linear models, as
produced by lasso regression, 27 are
more interpretable than dense linear
models learned on the same inputs.
Ribeiro et al. 23 also adopt this notion
of interpretability, suggesting that an
interpretable model is one that “can
be readily presented to the user with
visual or textual artifacts.”
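A small sketch of that claim on synthetic data: lasso keeps only a handful of nonzero terms, leaving a prediction a person could plausibly step through by hand, while the dense model fit to the same inputs carries a term for every feature:

```python
# Sketch: lasso retains only a few nonzero coefficients, while ordinary
# least squares on the same inputs has one term per feature (synthetic).
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))
true_w = np.zeros(100)
true_w[:3] = [2.0, -1.5, 0.7]                 # only 3 features matter
y = X @ true_w + rng.normal(0, 0.1, 500)

dense = LinearRegression().fit(X, y)
sparse = Lasso(alpha=0.1).fit(X, y)

print("nonzero terms, dense model:", np.sum(dense.coef_ != 0))   # 100
print("nonzero terms, lasso model:", np.sum(sparse.coef_ != 0))  # a few
```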
The trade-off between model size and the computation required to produce a single prediction varies across models. For example, in some models, such as decision trees, the size of the model (total number of nodes) may grow quite large compared to the time required to perform inference (the length of a single pass from root to leaf). This suggests simulatability may admit two subtypes: one based on the size of the model and another based on the computation required to perform inference.
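The following sketch (synthetic data) makes that gap visible for a fully grown scikit-learn decision tree, comparing total node count with the depth of a single root-to-leaf pass:

```python
# Sketch: a decision tree's total size can dwarf the work needed for one
# prediction, which only walks a single root-to-leaf path.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

print("total nodes (size to read the whole model):   ", tree.tree_.node_count)
print("maximum depth (work to simulate one prediction):", tree.get_depth())
```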
Fixing a notion of simulatability, the
quantity denoted by reasonable is subjective. Clearly, however, given the limited capacity of human cognition, this
ambiguity might span only several orders of magnitude. In this light, neither linear models, rule-based systems,
nor decision trees are intrinsically interpretable. Sufficiently high-dimensional models, unwieldy rule lists, and
deep decision trees could all be considered less transparent than comparatively compact neural networks.
Decomposability. A second notion of
transparency might be that each part
of the model—input, parameter, and
calculation—admits an intuitive explanation. This accords with the property of
intelligibility as described by Lou
et al. 15 For example, each node in a
decision tree might correspond to a
plain text description (for example, all
patients with diastolic blood pressure
over 150). Similarly, the parameters of
a linear model could be described as
representing strengths of association
between each feature and the label.
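A brief sketch of both forms of decomposability, using scikit-learn on a standard dataset: a shallow tree prints as plain-text rules, and a linear model's coefficients can be listed per feature (with the caveat, noted below, that such weights depend on which features are included and how they are scaled):

```python
# Sketch: each component of a shallow tree or a linear model can be read
# off individually -- a node as a plain-text rule, a coefficient as a
# strength of association with the label.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X, y = data.data, data.target

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(data.feature_names)))

linear = LogisticRegression(max_iter=5000).fit(X, y)
for name, coef in sorted(zip(data.feature_names, linear.coef_[0]),
                         key=lambda t: -abs(t[1]))[:5]:
    print(f"{name}: {coef:+.3f}")
```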
Note this notion of interpretability
requires that inputs themselves be individually interpretable, disqualifying
some models with highly engineered
or anonymous features. While this notion is popular, it should not be accepted blindly. The weights of a linear model might seem intuitive, but they can be