component in support of strong AI.

In the next section, I describe a three-level hierarchy that restricts and governs inferences in causal reasoning. The final section summarizes how traditional impediments are circumvented through modern tools of causal inference. In particular, I present seven tasks that are beyond the reach of "associational" learning systems and have been (and can be) accomplished only through the tools of causal modeling.

The Three-Level Causal Hierarchy

A useful insight brought to light through the theory of causal models is the classification of causal information in terms of the kind of questions each class is capable of answering. The classification forms a three-level hierarchy in the sense that questions at level i (i = 1, 2, 3) can be answered only if information from level j (j ≥ i) is available.

Figure 1 outlines the three-level hierarchy, together with the characteristic questions that can be answered at each level. I call the levels 1. Association, 2. Intervention, and 3. Counterfactual, to match their usage. I call the first level Association because it invokes purely statistical relationships, defined by the naked data.a For instance, observing a customer who buys toothpaste makes it more likely that this customer will also buy floss; such associations can be inferred directly from the observed data using standard conditional probabilities and conditional expectation [15]. Questions at this layer, because they require no causal information, are placed at the bottom level in the hierarchy. Answering them is the hallmark of current machine learning methods [4].

a Other terms used in connection with this layer include "model-free," "model-blind," "black-box," and "data-centric"; Darwiche [5] used "function-fitting," as it amounts to fitting data by a complex function defined by a neural network architecture.

The second level, Intervention, ranks higher than Association because it involves not just seeing what is but changing what we see. A typical question at this level would be: What will happen if we double the price? Such a question cannot be answered from sales data alone, as it involves a change in customers' choices in reaction to the new pricing. These choices may differ substantially from those taken in previous price-raising situations—unless we replicate precisely the market conditions that existed when the price reached double its current value.

Finally, the top level invokes Counterfactuals, a mode of reasoning that goes back to the philosophers David Hume and John Stuart Mill and that has been given computer-friendly semantics in the past two decades [1, 18]. A typical question in the counterfactual category is: "What if I had acted differently?" thus necessitating retrospective reasoning.
I place Counterfactuals at the top of the hierarchy because they subsume interventional and associational questions. If we have a model that can answer counterfactual queries, we can also answer questions about interventions and observations. For example, the interventional question "What will happen if we double the price?" can be answered by asking the counterfactual question "What would happen had the price been twice its current value?" Likewise, associational questions can be answered once we answer interventional questions; we simply ignore the action part and let observations take over. The translation does not work in the opposite direction. Interventional questions cannot be answered from purely observational information, that is, from statistical data alone. No counterfactual question involving retrospection can be answered from purely interventional information, such as that acquired from controlled experiments; we cannot re-run an experiment on human subjects who were treated with a drug and see how they might have behaved had they not been given the drug. The hierarchy is therefore directional, with the top level being the most powerful one.
Counterfactuals are the building
blocks of scientific thinking, as well
as of legal and moral reasoning. For
example, in civil court, a defendant is
considered responsible for an injury
if, but for the defendant’s action, it is
more likely than not the injury would
not have occurred. The computational
meaning of “but for” calls for comparing the real world to an alternative
world in which the defendant’s action
did not take place.
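
To make the "but for" comparison concrete, here is a minimal Python sketch of the test in a toy, fully deterministic structural model; the variables (action, icy_road, injury) and the mechanism are invented for illustration and are not taken from the article.

```python
# A minimal sketch of the "but for" test in a toy deterministic
# structural model. All variables and the mechanism are hypothetical.

def injury(action: bool, icy_road: bool) -> bool:
    """Structural equation: the injury occurs if the defendant acted
    while the road was icy (an invented mechanism)."""
    return action and icy_road

# The real world: background conditions plus the defendant's action.
background = {"icy_road": True}
actual = injury(action=True, **background)            # True: injury occurred

# The alternative world: same background, the action removed.
counterfactual = injury(action=False, **background)   # False: no injury

# But-for causation: the injury occurred, and it would not have
# occurred had the defendant not acted.
print(actual and not counterfactual)  # True
```

The essential feature is that both worlds share the same background conditions; only the defendant's action differs between them.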
Each layer in the hierarchy has a syntactic signature that characterizes the sentences admitted into that layer. For example, the Association layer is characterized by conditional probability sentences, as in P(y|x) = p, stating that the probability of event Y = y, given that we observed event X = x, is equal to p. In large systems, such evidentiary sentences can be computed efficiently through Bayesian networks or any number of machine learning techniques.
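
As an illustration, the following sketch estimates one such evidentiary sentence, P(floss = 1 | toothpaste = 1), by simple frequency counting over made-up purchase records; any conditional-probability estimator, including a learned one, would play the same role.

```python
# An Association-layer query answered by counting alone: estimate
# P(floss = 1 | toothpaste = 1) from (invented) purchase records.
records = [
    {"toothpaste": 1, "floss": 1},
    {"toothpaste": 1, "floss": 0},
    {"toothpaste": 1, "floss": 1},
    {"toothpaste": 0, "floss": 0},
    {"toothpaste": 0, "floss": 1},
]

bought_toothpaste = [r for r in records if r["toothpaste"] == 1]
p = sum(r["floss"] for r in bought_toothpaste) / len(bought_toothpaste)
print(p)  # 0.666...: a purely statistical relationship, no causal claim
```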
At the Intervention layer, we deal with sentences of the type P(y|do(x), z), which denote "The probability of event Y = y, given that we intervene and set the value of X to x and subsequently observe event Z = z." Such expressions can be estimated experimentally from randomized trials or analytically using causal Bayesian networks [18]. A child learns the effects of interventions through playful manipulation of the environment (usually in a deterministic playground), and AI planners obtain interventional knowledge by exercising admissible sets of actions. Interventional expressions cannot be inferred from passive observations alone, regardless of how big the data.
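
The gap between seeing and doing can be simulated directly. The sketch below assumes an invented structural model in which a hidden confounder U drives both X and Y; conditioning on an observed X = 1 suggests a strong association, while forcing X = 1 by intervention reveals that X has no effect on Y.

```python
import random

# Contrast P(Y=1 | X=1) with P(Y=1 | do(X=1)) in a made-up model:
# a hidden confounder U determines X (when not intervened) and Y.
random.seed(0)

def sample(do_x=None):
    u = random.random() < 0.5                    # hidden confounder
    x = u if do_x is None else do_x              # X follows U unless set
    y = random.random() < (0.9 if u else 0.1)    # Y depends on U only
    return x, y

n = 100_000

# Association: estimated from passive observation.
obs = [sample() for _ in range(n)]
p_y_given_x1 = sum(y for x, y in obs if x) / sum(1 for x, y in obs if x)

# Intervention: estimated by setting X = 1, breaking the link from U.
do = [sample(do_x=True) for _ in range(n)]
p_y_do_x1 = sum(y for _, y in do) / n

print(round(p_y_given_x1, 2))  # ~0.90: association via the confounder
print(round(p_y_do_x1, 2))     # ~0.50: the intervention changes nothing
```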
Finally, at the Counterfactual level, we deal with expressions of the type P(y_x | x′, y′), which stand for "The probability that event Y = y would be observed had X been x, given that we actually observed X to be x′ and Y to be y′." For example, the probability that Joe's salary would be y had he finished college, given that his actual salary is y′ and that he had only two years of college. Such sentences can be computed only when the model is based on functional relations or structural equations [18].
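
Such a computation can be sketched with the standard three steps over a structural model: abduction (infer the unobserved background from the evidence), action (modify the model), and prediction (recompute the outcome). The linear salary equation and all numbers below are invented for illustration.

```python
# Counterfactual query P(y_x | x', y') in a toy deterministic model:
# what would Joe's salary be had he finished college (x), given his
# observed education x' and salary y'? All numbers are hypothetical.

def salary(edu_years: int, u: float) -> float:
    """Structural equation: salary as a function of education and an
    unobserved background factor u (ability, luck, ...)."""
    return 20_000 + 5_000 * edu_years + u

edu_actual, salary_actual = 2, 41_000.0   # evidence: x' and y'

# 1. Abduction: infer u from the observed world.
u_joe = salary_actual - (20_000 + 5_000 * edu_actual)   # u = 11,000

# 2. Action: set education to a completed degree (four years).
edu_cf = 4

# 3. Prediction: recompute salary with u held at its abduced value.
print(salary(edu_cf, u_joe))  # 51000.0, Joe's counterfactual salary
```

Because the model is deterministic once u is fixed, the counterfactual is a point value; in general, abduction yields a posterior over the background factors and the answer is a probability.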
This three-level hierarchy, and the
formal restrictions it entails, explains
why machine learning systems, based
only on associations, are prevented
from reasoning about (novel) actions,
experiments, and causal explanations.b
b One could be tempted to argue that deep
learning is not merely “curve fitting” because
it attempts to minimize “overfit,” through, say,
sample-splitting cross-validation, as opposed
to maximizing “fit.” Unfortunately, the theoretical barriers that separate the three layers in
the hierarchy tell us the nature of our objective
function does not matter. As long as our system optimizes some property of the observed
data, however noble or sophisticated, while
making no reference to the world outside the
data, we are back to level-1 of the hierarchy,
with all the limitations this level entails.