Figure 1. An illustration of three lines of interpretable machine learning techniques, taking
DNN as an example.
In this article, we first summarize the current progress along three lines of research on interpretable machine learning: designing inherently interpretable models (both globally and locally interpretable), post-hoc global explanation, and post-hoc local explanation. We then introduce applications and challenges of current techniques. Finally, we discuss the limitations of current explanations and propose directions toward more human-friendly explanations.
Inherently Interpretable Model
Intrinsic interpretability can be
achieved by designing self-explanatory
models that incorporate interpretability directly into the model structures.
These constructed models are either globally interpretable or able to provide explanations when they make individual predictions.
Globally interpretable models can be constructed in two ways: directly trained from data as usual but with interpretability constraints, or extracted from a complex and opaque model.
Adding interpretability constraints.
The interpretability of a model can be promoted by incorporating interpretability constraints. Some representative examples include enforcing sparsity terms or imposing semantic monotonicity constraints in classification models.14 Here, sparsity means a model is encouraged to use relatively few features for prediction, while monotonicity enables the features to have monotonic relations with the prediction. Similarly, decision trees are pruned by replacing subtrees with leaves to encourage long and deep trees rather than wide and more balanced ones.29 These constraints make a model simpler and can increase the model's comprehensibility for users.
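As a concrete illustration of these two constraints, the following sketch, which is not from the article and uses synthetic data with a hypothetical feature layout, imposes sparsity through an L1 penalty on a linear classifier and monotonicity through per-feature monotonic constraints in a gradient-boosted model.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import HistGradientBoostingClassifier

# Synthetic data: four hypothetical features, only the first two informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=500) > 0).astype(int)

# Sparsity constraint: the L1 penalty pushes weights of uninformative
# features toward zero, so the model relies on fewer features.
sparse_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
sparse_clf.fit(X, y)
print("sparse coefficients:", sparse_clf.coef_)

# Monotonicity constraint: force the prediction to be non-decreasing in
# feature 0 and non-increasing in feature 1; the rest are unconstrained.
mono_clf = HistGradientBoostingClassifier(monotonic_cst=[1, -1, 0, 0])
mono_clf.fit(X, y)

Inspecting the near-zero coefficients of the sparse model, or reading off each constrained feature's direction of influence, is what makes such constrained models easier for users to reason about.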
Besides these, more semantically meaningful constraints can be added to a model to further improve interpretability. For instance, interpretable convolutional neural networks (CNNs) add a regularization loss to the higher convolutional layers of a CNN to learn disentangled representations, resulting in filters that can detect semantically meaningful natural objects.
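This idea can be sketched in code, though only loosely: the snippet below is a simplified stand-in rather than the exact regularizer used in interpretable CNNs. It attaches a penalty to a higher convolutional layer that rewards spatially concentrated filter activations, so each filter tends to fire on one localized region. The architecture, data, and loss weight are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),  # "higher" conv layer
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, x):
        fmap = self.features(x)                    # (B, 32, H, W)
        logits = self.head(fmap.mean(dim=(2, 3)))  # global average pooling
        return logits, fmap

def concentration_penalty(fmap):
    # Entropy of each filter's normalized activation map; lower entropy means
    # the filter responds at a few localized positions rather than diffusely.
    p = fmap.clamp(min=0).flatten(2) + 1e-8        # (B, C, H*W), positive
    p = p / p.sum(dim=-1, keepdim=True)
    entropy = -(p * p.log()).sum(dim=-1)           # (B, C)
    return entropy.mean()

model = SmallCNN()
x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
logits, fmap = model(x)
loss = F.cross_entropy(logits, y) + 0.1 * concentration_penalty(fmap)
loss.backward()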
Another work combines novel neural units, called capsules, to construct a capsule network.
Inherently interpretable models and post-hoc explanations differ in the trade-off they make between model accuracy and explanation fidelity. Inherently interpretable models can provide accurate and undistorted explanations but may sacrifice prediction performance to some extent. Post-hoc explanations are limited by their approximate nature but leave the accuracy of the underlying model intact.
Based on the categorization noted here, we further differentiate two types of interpretability: global interpretability and local interpretability. Global interpretability means users can understand how the model works globally by inspecting the structures and parameters of a complex model, while local interpretability examines an individual prediction of a model locally, trying to figure out why the model makes the decision it does. Using the DNN in Figure 1 as an example, global interpretability is achieved by understanding the representations captured by the neurons at an intermediate layer, while local interpretability is obtained by identifying the contributions of each feature in a specific input to the prediction made by the DNN. These two types bring different benefits. Global interpretability can illuminate the inner working mechanisms of machine learning models and thus increase their transparency. Local interpretability helps uncover the causal relations between a specific input and its corresponding model prediction. The two help users trust a model and trust a prediction, respectively.
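As a minimal sketch of the local view, assuming a small illustrative DNN and gradient-times-input as the attribution heuristic (one of several possible choices), the snippet below assigns a contribution to each input feature for a single prediction.

import torch
import torch.nn as nn

# A small illustrative DNN with four input features and three classes.
dnn = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))

# One specific input whose prediction we want to explain.
x = torch.tensor([[0.2, -1.3, 0.7, 0.0]], requires_grad=True)
logits = dnn(x)
pred_class = logits.argmax(dim=1).item()

# Gradient of the predicted class score with respect to the input features,
# multiplied elementwise by the input (gradient-times-input attribution).
logits[0, pred_class].backward()
contributions = (x.grad * x).detach().squeeze()

for i, c in enumerate(contributions.tolist()):
    print(f"feature {i}: contribution {c:+.4f}")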
Figure 2. A traditional machine learning pipeline using feature engineering, and a deep
learning pipeline using DNN-based representation learning.
(Pipelines depicted: raw input → feature engineering → features → traditional ML model → output; raw input → DNN-based representation learning → output.)