The record declaration calls a subroutine (or a function) in a manner that
allows the VCR system to record real-time details of its execution, including
the subroutine’s inputs, outputs, and
source code. Outputs of recorded functions are automatically assigned VRIs
and become verifiable results. Finally,
to prepare an important verifiable result for publication, we do not save it to
a local data or graphics file, but rather
invoke the publish declaration, which
assigns it a VRI and turns it into a verifiable result.
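The record and publish declarations described above can be illustrated with a minimal Python sketch. It is not the VCR implementation: the names `record`, `publish`, `REPOSITORY`, and `_make_id`, the in-memory store, and the hash-based identifier are all assumptions made for illustration, and the real VRI scheme is not specified here.

```python
import hashlib
import inspect
import json
import time

# Hypothetical in-memory store standing in for a VCR repository.
REPOSITORY = {}


def _make_id(payload):
    """Derive a stand-in identifier from the payload (not the real VRI scheme)."""
    text = json.dumps(payload, sort_keys=True, default=repr)
    return "vri:" + hashlib.sha256(text.encode()).hexdigest()[:16]


def record(func):
    """Wrap func so that each call stores its inputs, output, and source code."""
    def wrapper(*args, **kwargs):
        output = func(*args, **kwargs)
        try:
            source = inspect.getsource(func)
        except (OSError, TypeError):
            source = "<source unavailable>"
        entry = {
            "function": func.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": output,
            "source": source,
        }
        vri = _make_id(entry)              # identifier depends only on content
        entry["recorded_at"] = time.time() # timestamp added after the id is fixed
        REPOSITORY[vri] = entry
        return vri, output
    return wrapper


def publish(value, label):
    """Store a value directly and return its stand-in identifier."""
    entry = {"function": None, "inputs": {"label": label},
             "output": value, "source": None}
    vri = _make_id(entry)
    REPOSITORY[vri] = entry
    return vri


@record
def mean(values):
    return sum(values) / len(values)


mean_vri, mean_value = mean([1.0, 2.0, 3.0])
figure_vri = publish({"x": [1, 2], "y": [3, 4]}, label="figure-1")
```

In this sketch the identifier is computed from the recorded content only, so repeating the same call yields the same identifier; whether real VRIs behave this way is an assumption.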
In a short time, a group or community
that uses VCR will accumulate a mass
of verifiable results that can benefit
from dream applications. With that
in mind, let’s briefly review the implementation principles and significance
of the three dream applications.
Search. To implement Search, available
verifiable results are crawled and
indexed according to a variety of
properties, including the following:
• the time it was created;
• the researchers who created it;
• the words in the program code
used to create the computation;
• the values of the tuning parameters used;
• the VRIs of datasets used in
creating the computation; and
• the VRIs of the results that repro-
Amalgamate. The researcher provides the Amalgamate application a
list of VRIs that reference numerical
results. Amalgamate then downloads
the results in some desired format by
making the appropriate request to
each of the verifiable results, and creates an amalgamated dataset. The
amalgamated dataset becomes a new
Unlike tables created by manual
(cut-and-paste) amalgamation, which
contain only numeric values, tables
created by the Amalgamate application
unambiguously cite the source of every
table entry. In medical research, for example, this application can significantly reduce the time required to prepare
meta-analysis research; new research
methods could also emerge that use
such automatic tools to mine the enormous body of medical literature.
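As a rough illustration of the Amalgamate idea, the following Python sketch builds a table in which every entry carries the identifier of the result it came from. The function `amalgamate`, the `fetch` callback, and the `FAKE_RESULTS` dictionary with its identifiers and values are hypothetical stand-ins; a real application would request each value from the VCR system rather than from a local dictionary.

```python
from typing import Callable, Dict, List


def amalgamate(vris: List[str],
               fetch: Callable[[str], float]) -> List[Dict[str, object]]:
    """Build a table whose rows pair each value with the identifier it came from."""
    table = []
    for vri in vris:
        # fetch is assumed to return the numerical result behind one identifier.
        table.append({"source": vri, "value": fetch(vri)})
    return table


# Hypothetical stand-in for a repository of verifiable results (made-up values).
FAKE_RESULTS = {"vri:study-a": 0.42, "vri:study-b": 0.37}

table = amalgamate(["vri:study-a", "vri:study-b"], FAKE_RESULTS.__getitem__)
```

Because the source identifier travels with each value, a reader of the table can trace any entry back to the computation that produced it, which is the property that manual cut-and-paste tables lack.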
Tweak. The proliferation of quantitative platform-independent scripting languages (such as Python, R, and
Matlab), along with advances in virtualization, is rapidly creating a situation that will let us truly share and re-execute each other’s computations. All
computer programs have an expiration date, after which it becomes practically impossible to execute them on
widely available machines. However,
in addition to being portable across
platforms, scripts of platform-independent languages and computations
staged in virtual machine images
seem to have longer expiration periods than binaries.
For re-execution to provide real
and lasting value to science, it must
be paired with the notion of computation as an object that can be formally
declared, permanently stored, and
later manipulated and operated on.
As we previously mentioned, basic operations on such a computation object
include replacing or shuffling a dataset that the computation imports,
changing environment variables such
as the random seed, and changing
any of the tuning parameters before
VCR can create exactly such computation objects as it records the computation that created a verifiable result.
Furthermore, VCR rules make it easy to
formally manipulate computations simply by replacing all occurrences of some
VRI in the program code with another.
The Tweak application is given
a VRI and uses the VCR interface to
download all required source code
and dependencies. It then stages the
computation for re-execution on the
local machine. Before re-execution,
however, it tweaks the computation by
replacing some of the VRIs in the program code with new ones.
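The replacement step can be sketched in a few lines of Python. This is only an illustration of swapping identifiers in program text: the function `tweak_source`, the identifier strings, and the `load` and `fit` calls inside the sample program are hypothetical, and downloading and staging the computation are not shown.

```python
import re
from typing import Dict


def tweak_source(source: str, replacements: Dict[str, str]) -> str:
    """Return program text with every old identifier swapped for its new one."""
    if not replacements:
        return source
    # Match longer identifiers first so one is never cut short by its own prefix.
    ordered = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(old) for old in ordered))
    # A single pass avoids re-replacing text that an earlier swap introduced.
    return pattern.sub(lambda match: replacements[match.group(0)], source)


# Hypothetical recorded program that imports a dataset by identifier.
original = 'data = load("vri:dataset-2011")\nmodel = fit(data, seed=1)\n'
tweaked = tweak_source(original, {"vri:dataset-2011": "vri:dataset-2012"})
```

Doing the substitution in one pass, rather than with successive string replacements, is a design choice that lets two identifiers be exchanged for each other without the second replacement undoing the first.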
Science is undergoing a digitization
revolution in which more and more
stages of the scientific process are performed on computers. In principle,
digitization offers many new opportunities to leverage the massive body
of knowledge being created by scientists; but most of these opportunities
are too hard to pursue right now. The
dream applications we describe in
this article illustrate the far-reaching
potential benefits offered by the ongoing digitization of science; they allow
scientists to transparently combine
computations and data published by
others, and obtain new scientific results. While these dream applications
are very remote from current practice, the barriers preventing them can
be elegantly surmounted. Only a few
changes in scientific computing practices are needed to enable these dream
applications; the main step is to adopt
the notion of verifiable computational
results. This will allow computational
results to become discoverable, combinable, and generalizable.
A current effort to implement VCR
and dream applications can be found at
Matan Gavish is a doctoral student in statistics at
Stanford University. Gavish has an M.Sc. in mathematics
from the Hebrew University of Jerusalem.
David L. Donoho is a professor at Stanford University. He
has a Ph.D. in statistics from Harvard University and
received the Doctor of Science degree (honorary) from
the University of Chicago. He is a member of the American
Academy of Arts and Sciences and the US National
Academy of Sciences.
Amos Onn is an undergraduate student in mathematics at
the Hebrew University of Jerusalem.