such as a recurrent neural network
language model, to generate an explanation. Such an approach is taken in a
line of work by Krening et al.12 They
propose a system in which one model
(a reinforcement learner) chooses actions to optimize cumulative discounted return. They train another model to map the agent's state representation onto verbal explanations of strategy. The explanation model is trained to maximize the likelihood of previously observed ground-truth explanations from human players, so its outputs may not faithfully describe the agent's decisions, however plausible they appear. A connection exists between this
approach and recent work on neural
image captioning in which the representations learned by a discriminative
CNN (trained for image classification)
are co-opted by a second model to
generate captions. These captions
might be regarded as interpretations
that accompany classifications.
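To make this two-model setup concrete, the sketch below conditions a small recurrent decoder on the agent's state representation and trains it by maximum likelihood on human-provided explanations. The class name, layer sizes, and training details are illustrative assumptions, not Krening et al.'s implementation.

import torch
import torch.nn as nn

class ExplanationModel(nn.Module):
    # Hypothetical post hoc explainer: maps an agent's state representation
    # to a sequence of explanation tokens via a GRU decoder.
    def __init__(self, state_dim, vocab_size, hidden=128):
        super().__init__()
        self.project = nn.Linear(state_dim, hidden)   # condition on agent state
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, state_repr, explanation_tokens):
        h0 = torch.tanh(self.project(state_repr)).unsqueeze(0)  # (1, batch, hidden)
        emb = self.embed(explanation_tokens)                    # (batch, seq, hidden)
        output, _ = self.rnn(emb, h0)
        return self.out(output)                                 # next-token logits

# Training minimizes cross-entropy against ground-truth human explanations,
# e.g., nn.CrossEntropyLoss()(logits.flatten(0, 1), targets.flatten()),
# which is exactly why the generated text can be plausible without being faithful.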
In work on recommender systems,
McAuley and Leskovec18 use text to explain the decisions of a latent factor
model. Their method consists of simultaneously training a latent factor model
for rating prediction and a topic model
for product reviews. During training
they alternate between decreasing the
squared error on rating prediction and
increasing the likelihood of review text.
The models are connected because
they use normalized latent factors as
topic distributions. In other words, latent factors are regularized such that
they are also good at explaining the
topic distributions in review text. The
authors then explain user-item compatibility by examining the top words
in the topics corresponding to matching components of their latent factors.
Note that the practice of interpreting topic models by presenting the top words is itself a post hoc interpretation technique that has invited scrutiny.4 Moreover, note that we have spoken only to the form factor of an explanation (that it consists of natural language), but not to what precisely constitutes correctness. So far, the literature has dodged this issue, sometimes punting by embracing
a subjective view of the problem and
asking people what they prefer.
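As a rough illustration of the kind of coupling described above, the sketch below ties a latent factor rating model to a topic-style model of review words by pushing the item factors through a softmax to obtain topic proportions. All names, dimensions, and weights are invented, and for brevity it takes a single weighted gradient step on both terms rather than alternating between them as the authors do; it is not their actual model.

import torch
import torch.nn.functional as F

n_users, n_items, k, vocab = 1000, 500, 10, 5000
user_f = torch.randn(n_users, k, requires_grad=True)        # user latent factors
item_f = torch.randn(n_items, k, requires_grad=True)        # item latent factors
word_topics = torch.randn(k, vocab, requires_grad=True)     # topic-word weights
opt = torch.optim.Adam([user_f, item_f, word_topics], lr=0.01)

def training_step(u, i, rating, review_word_counts, mu=0.1):
    opt.zero_grad()
    rating_err = (user_f[u] @ item_f[i] - rating) ** 2       # squared rating error
    theta = F.softmax(item_f[i], dim=0)                      # item factors double as topic proportions
    word_dist = theta @ F.softmax(word_topics, dim=1)        # mixture distribution over the vocabulary
    review_ll = (review_word_counts * torch.log(word_dist + 1e-9)).sum()
    (rating_err - mu * review_ll).backward()                 # trade off the two objectives
    opt.step()

# Explanations would then be read off post hoc by listing the highest-weight
# words in the topics aligned with the largest components of a user-item match.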
Visualization. Another common approach to generating post hoc interpretations is to render visualizations in the hope of determining qualitatively what a model has learned. One popular method is to visualize high-dimensional distributed representations with t-distributed stochastic neighbor embedding (t-SNE),28 a technique that renders 2D visualizations in which nearby data points are likely to appear close together.
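A minimal sketch of this kind of visualization, with the learned representations stood in for by random vectors (in practice they would be hidden activations extracted from the model), might look like the following.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
representations = rng.normal(size=(500, 64))   # stand-in for learned features
labels = rng.integers(0, 10, size=500)         # stand-in for class labels

# Project the high-dimensional representations to 2D; points that are close in
# the original space tend to land near each other in the embedding.
embedding = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(representations)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=5, cmap="tab10")
plt.title("t-SNE of learned representations (illustrative)")
plt.show()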
Mordvintsev et al.20 attempt to explain what an image classification network has learned by altering the input through gradient descent to enhance the activations of certain nodes selected from the hidden layers. An inspection of the perturbed inputs can give clues as to what the model has learned. Likely because the model was trained on a large corpus of animal images, they observed that enhancing some nodes caused dog faces to appear throughout the input image.
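The basic recipe of this kind of activation enhancement is gradient ascent on the input. The sketch below assumes a PyTorch model and a hypothetical layer module of interest; the regularization and multi-scale tricks used in practice are omitted.

import torch

def enhance_activations(model, layer, image, steps=50, lr=0.05):
    # Store the chosen layer's output on each forward pass.
    model.eval()
    activations = {}
    handle = layer.register_forward_hook(
        lambda module, inp, out: activations.update(value=out))
    x = image.clone().unsqueeze(0).requires_grad_(True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        model(x)
        # Ascend the mean activation by descending its negative.
        loss = -activations["value"].mean()
        loss.backward()
        optimizer.step()
    handle.remove()
    return x.detach().squeeze(0)   # the perturbed input to inspect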
In the computer vision community, similar approaches have been explored to investigate what information is retained at various layers of a neural network. Mahendran and Vedaldi17 pass an image through a discriminative CNN to generate a representation. They then demonstrate that the original image can be recovered with high fidelity, even from reasonably high-level representations (layer 6 of an AlexNet), by performing gradient descent on randomly initialized pixels. As with text explanations, discussions of visualization focus on form factor and appeal, but we still lack a rigorous standard of correctness.
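A schematic version of this inversion procedure, assuming a hypothetical feature_extractor that maps an image batch to the representation being inverted, is sketched below; the natural-image priors and careful tuning of the original work are omitted.

import torch
import torch.nn.functional as F

def invert_representation(feature_extractor, target_image, steps=200, lr=0.1):
    # Representation of the original image (no gradients needed here).
    with torch.no_grad():
        target = feature_extractor(target_image.unsqueeze(0))
    # Start from random pixels and descend toward the target representation.
    x = torch.randn(1, *target_image.shape, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.mse_loss(feature_extractor(x), target)
        loss.backward()
        optimizer.step()
    return x.detach().squeeze(0)   # the reconstructed image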
Local explanations. While it may be difficult to describe succinctly the full mapping learned by a neural network, some of the literature focuses instead on explaining what a neural network depends on locally. One popular approach for deep neural nets is to compute a saliency map. Typically, these maps are produced by taking the gradient of the output corresponding to the correct class with respect to a given input vector. For images, this gradient can be applied as a mask, highlighting regions of the input that, if changed, would most influence the output.25,30
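For a concrete picture, here is a minimal sketch of such a gradient-based saliency map in PyTorch; the trained classifier and the preprocessed input image are assumed to exist.

import torch

def saliency_map(model, image, target_class):
    model.eval()
    x = image.unsqueeze(0).requires_grad_(True)        # add a batch dimension
    logits = model(x)
    # Gradient of the target class score with respect to the input pixels.
    logits[0, target_class].backward()
    # Collapse per-channel gradients into one 2D map of absolute sensitivity.
    return x.grad.detach().abs().max(dim=1).values.squeeze(0)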
Note that these explanations of what a model is focusing on may be misleading. The saliency map is a local explanation only: once you move a single pixel, you may get a very different saliency map.