Technical Perspective
Building a Safety Net for Data Reuse
By Jonathan Ullman

DOI: 10.1145/3051086
To view the accompanying paper, visit doi.acm.org/10.1145/3051088

Most evidence in the empirical sciences is statistical in nature, and scientists rely on a variety of statistical tests to distinguish valid scientific discoveries from spurious ones. Unfortunately, there is a growing recognition that many important research findings based on statistical evidence are not reproducible, raising the question of whether there is a gap between what these statistical tests ensure and the way they are used.
While there are many reasons that research findings may be non-reproducible, this work focuses on just one: interactivity. Existing statistical methods
assume the procedure being used to
analyze the data is fixed before the data
is collected. For example, performing a
regression analysis on a fixed set of variables. However, in practice, the methods
used to analyze a dataset are often chosen based on previous interactions with
the same dataset. For example, using
the same dataset to first select variables
and then perform a regression analysis.
Analyzing a fixed dataset in this interactive manner is known to lead to spurious
conclusions, even when each procedure
is statistically sound in isolation.
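
To make this concrete, here is a minimal simulation (my own illustration, not taken from the paper; all names and parameters are mine) of the select-then-regress example: on a dataset of pure noise, choosing the variables most correlated with the outcome and then regressing on them, using the same data, yields a large in-sample fit even though no variable carries any signal.

# Hypothetical demo of interactive reuse: variable selection and
# regression performed on the same pure-noise dataset.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 1000, 10              # samples, candidate variables, kept

X = rng.standard_normal((n, d))      # no variable actually predicts y
y = rng.standard_normal(n)

# Step 1: select the k variables most correlated with y.
scores = X.T @ y / n
chosen = np.argsort(np.abs(scores))[-k:]

# Step 2: regress y on the chosen variables -- on the SAME data.
beta, *_ = np.linalg.lstsq(X[:, chosen], y, rcond=None)
resid = y - X[:, chosen] @ beta
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
print(f"in-sample R^2 on pure noise: {r2:.2f}")   # well above 0

Each step is a perfectly sound procedure in isolation; it is the adaptive composition, with the selection step peeking at the very data the regression will reuse, that manufactures the spurious fit.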
How should we study interactive data analysis? The most natural approach is to explicitly model the entire interactive process as a single procedure. For example, there is a vast literature on statistically sound ways to select variables and then regress on those variables. But how can we model the entire process of interacting with a dataset from start to finish when there are many steps, and the way early steps influence subsequent steps is complex and possibly ill-defined?
This work takes a different approach and embraces interactivity. The authors consider the feasibility of designing a statistical-validity safety net that goes around a fixed dataset, so that the dataset can be analyzed interactively without compromising statistical validity. More concretely, the dramatis personae are an unknown population and a data analyst who wants to study that population but has only a random sample from it. The analyst is allowed to
ask very general questions about the
population, and each question may depend in an arbitrary, unknown way on
the answers to previous questions. The
answers are filtered through the safety
net, which should ensure the answers
remain approximately accurate for the
population no matter how the analyst
chooses the queries.
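
Schematically (this is my sketch, not the paper's formalism; the class and method names are invented for illustration), the interaction looks like the loop below, where the analyst sees only the safety net's answers and may choose each query based on earlier ones.

# Schematic of the model: the analyst never touches the sample directly.
import numpy as np

class SafetyNet:
    """Holds the sample; answers statistical queries q: point -> [0, 1]."""
    def __init__(self, sample):
        self.sample = np.asarray(sample)

    def answer(self, q):
        # Naive baseline: the empirical mean. Accurate for queries fixed
        # in advance, but exactly what fails once queries depend on
        # earlier answers.
        return float(np.mean([q(x) for x in self.sample]))

rng = np.random.default_rng(1)
net = SafetyNet(rng.standard_normal(1000))

a1 = net.answer(lambda x: float(x > 0.0))   # first query
a2 = net.answer(lambda x: float(x > a1))    # adaptively chosen follow-up

The question the paper addresses is what answer should do instead of returning the raw empirical mean, so that every answer stays close to the corresponding population mean no matter how the analyst chooses the next query.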
The authors of the following paper show there is a way to construct such a safety net. They do so using a substantial strengthening of the folklore result that differentially private algorithms automatically ensure statistical validity, and then deploying the rich toolkit of differentially private algorithms for interactive data analysis to construct such procedures.
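
As a toy instance of that recipe (my illustration; the paper's mechanisms and parameter choices are considerably more refined), one can perturb each empirical answer with Laplace noise. A query with answers in [0, 1] changes its empirical mean by at most 1/n when a single sample point changes, so noise of scale 1/(eps * n) makes each answer eps-differentially private, and the strengthened transfer result turns that privacy into statistical validity for adaptively chosen queries.

# Hypothetical differentially private safety net: Laplace-noised means.
import numpy as np

class LaplaceSafetyNet:
    def __init__(self, sample, eps_per_query=0.1, seed=None):
        self.sample = np.asarray(sample)
        self.eps = eps_per_query
        self.rng = np.random.default_rng(seed)

    def answer(self, q):
        """Answer a statistical query q: point -> [0, 1]."""
        empirical = np.mean([q(x) for x in self.sample])
        # A [0, 1]-valued mean has sensitivity 1/n, so this noise scale
        # gives eps-differential privacy per query; the guarantee across
        # many queries degrades according to composition theorems.
        noise = self.rng.laplace(scale=1.0 / (self.eps * len(self.sample)))
        return float(np.clip(empirical + noise, 0.0, 1.0))

This is a drop-in replacement for the naive SafetyNet sketched above; how little noise suffices, and how the guarantee scales with the number of queries, is precisely where the sharper bounds discussed next come in.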
Of course, the paper itself is the
best way to learn about the results.
But for those who become interested
in this topic, I will give my own incomplete, very biased, and probably already out-of-date summary of the body of work on interactive data analysis that has appeared since.
Sharper Bounds. What are the best
possible statistical guarantees for
interactive data analysis? In a subsequent work with Bassily et al.,a we
gave improved bounds on the error for
estimating adaptively chosen statistics, but we still do not know whether these bounds are optimal. One reason we cannot currently close this gap is that we do not know necessary conditions for ensuring statistical validity in interactive data analysis. The authors show that a condition weaker than differential privacy, bounded max-information, is sufficient, and several other notions have been considered,a,b,c but we still do not have necessary conditions.
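
For context, and in my own notation, the known rates for answering k adaptively chosen statistical queries (with answers in [0, 1]) from n samples look roughly as follows, up to logarithmic factors; whether the last rate is optimal remains open:

\[
\underbrace{\alpha = O\!\left(\sqrt{\tfrac{\log k}{n}}\right)}_{\text{queries fixed in advance}}
\qquad
\underbrace{\alpha = O\!\left(\sqrt{\tfrac{k}{n}}\right)}_{\text{adaptive, sample splitting}}
\qquad
\underbrace{\alpha = O\!\left(\tfrac{k^{1/4}}{\sqrt{n}}\right)}_{\text{adaptive, noise-based}}
\]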
Computational and Statistical Bottlenecks. In concurrent work
with Moritz Hardt,d we approached this problem from the other direction and showed that interactive data analysis is inherently more difficult than non-interactive data analysis. Very roughly, we show that if either the dimensionality of the dataset is large, or the procedure is required to be computationally efficient in the worst case over interactions and datasets, then it is infeasible to ensure statistical validity for a large number of rounds of interaction.

Relaxing the Problem. In my opinion, the most promising direction is to find meaningful relaxations of the model that allow for more useful procedures and circumvent the bottlenecks we just discussed. The authors' reusable holdout is one such relaxation. For another example, Blum and Hardte gave an efficient algorithm for a specific application in data science competitions, showing the benefit of a more tailored approach. I believe there are many more opportunities to design useful tools by finding the right balance between generality and specificity.

As you can see, the foundations of interactive data analysis are ready to be developed and brought up to speed with the foundations of non-interactive data analysis. I look forward to reading more work on this exciting topic.

a R. Bassily et al. Algorithmic stability for adaptive data analysis. In Proceedings of STOC '16.
b D. Russo, J. Zou. Controlling bias in adaptive data analysis using information theory. In Proceedings of AISTATS '16.
c R. Rogers, A. Roth, A. Smith, O. Thakkar. Max-information, differential privacy, and post-selection hypothesis testing. In Proceedings of FOCS '16.
d M. Hardt, J. Ullman. Preventing false discovery in interactive data analysis is hard. In Proceedings of FOCS '14.
e A. Blum, M. Hardt. The Ladder: A reliable leaderboard for machine learning competitions. In Proceedings of ICML '15.

Jonathan Ullman is an assistant professor in the College of Computer and Information Sciences at Northeastern University, Boston, MA.

Copyright held by author.