These impediments have changed
dramatically in the past three decades; for example, a mathematical language has been developed for
managing causes and effects, accompanied by a set of tools that turn
causal analysis into a mathematical
game, like solving algebraic equations or finding proofs in high-school
geometry. These tools permit scientists to express causal questions formally, codify their existing knowledge
in both diagrammatic and algebraic
forms, and then leverage data to estimate the answers. Moreover, the theory warns them when the state of existing knowledge or the available data
is insufficient to answer their questions and then suggests additional
sources of knowledge or data to make
the questions answerable.
The development of the tools has had a transformative impact on all data-intensive sciences, especially social science and epidemiology, in which causal diagrams have become a second language.14,34 In these disciplines, causal diagrams have helped scientists extract causal relations from associations and deconstruct paradoxes that have baffled researchers for decades.23,25
I call the mathematical framework
that led to this transformation “
structural causal models” (SCM), which consists of three parts: graphical models,
structural equations, and counterfactual and interventional logic. Graphical models serve as a language for
representing what agents know about
the world. Counterfactuals help them
articulate what they wish to know. And
structural equations serve to tie the two
together in a solid semantics.
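To make these three parts concrete, here is a minimal sketch in Python; the variables, mechanisms, and probabilities are invented for illustration and do not come from the article. Each variable gets a structural equation; the graph is implicit in which parents each equation listens to; an intervention do(X = x) is simulated by overriding X's equation; and replaying the same exogenous noise under a different intervention gives a counterfactual.

    import random

    def f_z(u_z):                    # Z has no parents in this toy graph
        return 1 if u_z < 0.5 else 0

    def f_x(z, u_x):                 # X "listens to" Z
        return 1 if u_x < (0.8 if z else 0.2) else 0

    def f_y(x, z, u_y):              # Y "listens to" X and Z
        return 1 if u_y < (0.7 if x else 0.2) + (0.1 if z else 0.0) else 0

    def sample(do_x=None, noise=None):
        """Generate one unit. do_x overrides X's equation (an intervention);
        reusing the same noise under a different do_x gives a counterfactual."""
        u_z, u_x, u_y = noise if noise is not None else (
            random.random(), random.random(), random.random())
        z = f_z(u_z)
        x = f_x(z, u_x) if do_x is None else do_x
        y = f_y(x, z, u_y)
        return x, y, z

Running sample() repeatedly yields observational data on {X, Y, Z}, while sample(do_x=1) yields data from the corresponding randomized experiment.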
Figure 2 illustrates the operation
of SCM in the form of an inference
engine. The engine accepts three inputs—Assumptions, Queries, and
Data—and produces three outputs—
Estimand, Estimate, and Fit indices.
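To see what an output of the first kind looks like, suppose the Assumptions state that a single variable Z is the only common cause of a treatment X and an outcome Y (an assumption chosen here purely for illustration) and the Query asks for P(y | do(x)). The engine would then return the adjustment formula P(y | do(x)) = Σ_z P(y | x, z) P(z) as the Estimand, a recipe phrased entirely in terms of quantities estimable from observational data, and apply it to the Data to produce the Estimate (the answer to the query).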
Questions Answered with a Causal Model
Consider the following five questions:
• How effective is a given treatment in preventing a disease?
• Was it the new tax break that caused our sales to go up?
• What annual health-care costs are attributed to obesity?
• Can hiring records prove an employer guilty of sex discrimination?
• I am about to quit my job, but should I?
The common feature of these questions concerns cause-and-effect relationships. We recognize them through such words as “preventing,” “cause,” “attributed to,” “discrimination,” and “should I.” Such words are common in everyday language, and modern society constantly demands answers to such questions. Yet, until very recently, science gave us no means even to articulate them, let alone answer them. Unlike the rules of geometry, mechanics, optics, or probabilities, the rules of cause and effect have been denied the benefits of mathematical analysis.
To appreciate the extent of this denial, readers would likely be stunned
to learn that only a few decades ago
scientists were unable to write down
a mathematical equation for the obvious fact that “Mud does not cause
rain.” Even today, only the top echelon
of the scientific community can write
such an equation and formally distinguish “mud causes rain” from “rain
causes mud.”
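Using the do-operator of level 2 in Figure 1, the distinction can now be written in one line: intervening on rain changes the probability of mud, while intervening on mud leaves the probability of rain untouched, that is, P(mud | do(rain)) ≠ P(mud), whereas P(rain | do(mud)) = P(rain).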
Figure 1. The causal hierarchy. Questions at level i can be answered only if information from level i or higher is available.
Level (Symbol) | Typical Activity | Typical Questions | Examples
1. Association P(y|x) | Seeing | What is? How would seeing X change my belief in Y? | What does a symptom tell me about a disease? What does a survey tell us about the election results?
2. Intervention P(y|do(x), z) | Doing, Intervening | What if? What if I do X? | What if I take aspirin, will my headache be cured? What if we ban cigarettes?
3. Counterfactuals P(y_x|x′, y′) | Imagining, Retrospection | Why? Was it X that caused Y? What if I had acted differently? | Was it the aspirin that stopped my headache? Would Kennedy be alive had Oswald not shot him? What if I had not been smoking the past two years?
Figure 2. How the SCM “inference engine” combines data with a causal model (or assumptions) to produce answers to queries of interest.
[Diagram: Inputs: Assumptions (graphical model), Query, Data. Outputs: Estimand (recipe for answering the query), Estimate (answer to query), Fit indices.]
Figure 3. Graphical model depicting
causal assumptions about three variables;
the task is to estimate the causal effect
of X on Y from non-experimental data on
{X, Y, Z}.
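As a sketch of the estimation step in code, and again assuming purely for illustration that Z is a common cause of X and Y so that adjusting for Z is licensed (the arrows in Figure 3 determine whether this is actually the case), the adjustment formula above can be evaluated from a finite sample of (x, y, z) records and contrasted with the ordinary conditional probability of level 1.

    def p_y_given_x(rows, x):
        """Level 1: P(Y=1 | X=x), a plain conditional probability."""
        ys = [y for (xi, y, z) in rows if xi == x]
        return sum(ys) / len(ys)

    def p_y_do_x(rows, x):
        """Level 2 via adjustment: P(Y=1 | do(X=x)) = sum_z P(Y=1 | x, z) P(z)."""
        n = len(rows)
        total = 0.0
        for z in (0, 1):
            stratum = [(xi, y) for (xi, y, zi) in rows if zi == z]
            matches = [y for (xi, y) in stratum if xi == x]
            if not matches:
                continue   # no support in this stratum; a real analysis would flag it
            total += (sum(matches) / len(matches)) * (len(stratum) / n)
        return total

For example, data = [sample() for _ in range(100000)] drawn from the earlier sketch gives p_y_given_x(data, 1) and p_y_do_x(data, 1), which will generally differ whenever Z influences both X and Y; that gap is precisely the difference between seeing and doing in Figure 1.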