provenance. In Proceedings of the 7th Conference on
File and Storage Technologies, (2009), 1–14.
17. Macko, P. and Seltzer, M. A general-purpose
provenance library. In Proceedings of the 4th Usenix
Conference on Theory and Practice of Provenance,
(2012), 6–6.
18. Macko, P. and Seltzer, M. Provenance Map Orbiter:
interactive exploration of large provenance graphs.
In Proceedings of the 3rd Conference on Theory and
Practice of Provenance, (2011).
19. McDaniel, P. et al. Towards a secure and efficient
system for end-to-end provenance. In Proceedings
of the 2nd Conference on Theory and Practice of
Provenance, (2010), 2–2.
20. Moreau, L. and Missier, P. PROV-DM: The PROV Data
Model. Technical Report. World Wide Web Consortium,
2013.
21. Moreau, L., et al. The Open Provenance Model Core
Specification (V1.1). Future Generations Computer
Systems 27, 6 (2011), 743–756.
22. Muniswamy-Reddy, K.-K., et al. Layering in provenance
systems. In Proceedings of the Usenix Annual
Technical Conference, 2009.
23. Muniswamy-Reddy, K.-K., et al. Provenance-aware
storage systems. In Proceedings of the Usenix Annual
Technical Conference, (2006), 43–56.
24. Park, H., Ikeda, R. and Widom, J. RAMP: A system
for capturing and tracing provenance in MapReduce
workflows. In Proceedings of the 37th International
Conference on Very Large Databases, (2011).
25. Saxena, P., Sekar, R. and Puranik, V. Efficient fine-grained binary instrumentation with applications to
taint-tracking. In Proceedings of the 6th Annual IEEE/
ACM International Symposium on Code Generation
and Optimization, (2008), 74–83.
26. Scheidegger, C., et al. Tackling the provenance
challenge one layer at a time. Concurrency and
Computation: Practice and Experience 20, 5 (2008),
473–483.
27. Shamir, A. How to share a secret. Commun.
ACM 22, 11 (Nov. 1979), 612–613.
28. Widom, J. Trio: A system for integrated management
of data, accuracy, and lineage. Technical Report 2004-
40, 2004.
Lucian Carata (lucian.carata@cl.cam.ac.uk) is a Ph.D.
student in the Computer Laboratory, University of
Cambridge. His research focuses on the next-generation
disclosed provenance systems, with the aim of
understanding and controlling the behavior of complex
systems.
Sherif Akoush (sherif.akoush@cl.cam.ac.uk) is a
Research Associate at the University of Cambridge Computer
Laboratory. He is exploring provenance in “Big Data”
systems and its applications.
Nikilesh Balakrishnan (nikilesh.balakrishnan@cl.cam.ac.uk)
is a Research Assistant in the Computer Laboratory,
University of Cambridge. His research focuses on building
general-purpose provenance systems with emphasis on
usability and wide adoption among the user community.
Thomas Bytheway (thomas.bytheway@cl.cam.ac.uk)
is a Research Assistant in the Computer Laboratory,
University of Cambridge. His research interests are
in building general-purpose provenance systems and
exploring querying and visualization techniques.
Ripduman Sohan (ripduman.sohan@cl.cam.ac.uk) is
a Senior Research Associate and Co-PI of the Fabric
For Reproducible Computation (FRESCO) project in
the Computer Laboratory, University of Cambridge. He
previously worked on storage, virtualization, networking
and energy-efficient computing.
Margo Seltzer (margo@eecs.harvard.edu) is the Herchel
Smith Professor of Computer Science in Harvard’s School
of Engineering and Applied Sciences. She was co-founder
and CTO of Sleepycat Software, the makers of Berkeley
DB, until Oracle acquired Sleepycat in 2006. She is now an
architect in Oracle Labs.
Andy Hopper (ah12@cam.ac.uk) is Professor of
Computer Technology at the University of Cambridge,
Head of Department of the Computer Laboratory, and
elected member of the University Council. His research
interests include computer networking, pervasive and
sensor-driven computing, and using computers to ensure
the sustainability of the planet.
Copyright held by Owner/Author(s). Publication rights
licensed to ACM. $15.00.
While all systems acknowledge the security of provenance is a fundamental
concern, the rest rely on existing access-control mechanisms such as SQL
grant privileges and file permissions to
ensure security.
Research Challenges and Opportunities
Contrasting the initial use cases and
what can actually be achieved with
current provenance systems makes it
clear that research is needed in a number of areas.
Querying and visualization. Despite
the research carried out so far toward
querying and visualizing provenance,
these are still challenging problems.
It remains to be seen how existing
knowledge about graph exploration
and visualizations could be applied, or
whether totally different representations are required.
Computing with provenance. Moving
beyond human queries, provenance
should be made available to applications, allowing automated validation
of inputs, limiting error propagation,
or self-diagnosing changes in output
quality or system behavior.
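As a minimal sketch of what application-level use of provenance might look like, the Python fragment below checks an input's ancestry before consuming it. The `ProvenanceStore` class, its methods, and the file names are all hypothetical and are not taken from any system discussed in this article; a real provenance system would expose a much richer model and a persistent store.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceStore:
    """Hypothetical in-memory store mapping each artifact to its direct inputs."""
    parents: dict = field(default_factory=dict)
    flagged: set = field(default_factory=set)

    def record(self, output, inputs):
        # Record that `output` was derived from `inputs`.
        self.parents[output] = list(inputs)

    def flag(self, artifact):
        # Mark an artifact as known-bad (for example, a corrupted source file).
        self.flagged.add(artifact)

    def ancestors(self, artifact):
        # Iterative depth-first walk over the derivation graph.
        seen, stack = set(), list(self.parents.get(artifact, []))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(self.parents.get(node, []))
        return seen

    def is_safe(self, artifact):
        # Accept an input only if neither it nor any ancestor is flagged.
        return not (({artifact} | self.ancestors(artifact)) & self.flagged)


store = ProvenanceStore()
store.record("clean.csv", ["raw.csv"])
store.record("report.pdf", ["clean.csv", "template.tex"])
store.record("summary.txt", ["template.tex"])
store.flag("raw.csv")

print(store.is_safe("report.pdf"))   # False: derived from the flagged raw.csv
print(store.is_safe("summary.txt"))  # True: no flagged ancestor
```

Under these assumptions, an application could refuse `report.pdf` while still accepting `summary.txt`, which is one simple way to limit error propagation from a bad source.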
Distributed systems. There have been
attempts to extend provenance to networked systems, but problems related
to heterogeneity (not all nodes being
provenance aware), scalability, long-term collection, and storage remain to
be solved.
Security and privacy. Collecting provenance has implications for data security and privacy, but most implementations have not considered untrusted
environments or adversarial workloads.
Conclusion
The computing power and storage capacities available today allow large
quantities of data to be processed in complex ways. Sometimes the
transformations applied are not directly controlled by or even known to
developers (multiple layers of abstraction, learning algorithms). Therefore,
a lot of information about a result is lost when no provenance is recorded,
making it harder to assess quality or reproducibility. Computing is becoming
pervasive, and the need for guarantees about it being dependable will only
aggravate those problems; treating provenance as a first-class citizen in
data processing represents a possible solution.
Acknowledgments
We would like to thank George Coulouris for his feedback and our reviewers for their constructive comments
and suggestions.
Related articles
on queue.acm.org
Provenance in Sensor Data Management
Zachary Hensley, Jibonananda Sanyal,
Joshua New
http://queue.acm.org/detail.cfm?id=2574836
CTO Roundtable: Storage
Mache Creeger
http://queue.acm.org/detail.cfm?id=1483110
Better Scripts, Better Games
Walker White, Christoph Koch,
Johannes Gehrke, Alan Demers
http://queue.acm.org/detail.cfm?id=1483106
References
1. Amsterdamer, Y. et al. Putting lipstick on pig: Enabling
database-style workflow provenance. In Proceedings
of the VLDB Endowment 5, 4 (2011), 346–357.
2. Biton, O., Cohen-Boulakia, S. and Davidson, S. B.
ZOOM*UserViews: Querying relevant provenance
in workflow systems. In Proceedings of the 33rd
International Conference on Very Large Databases,
(2007), 1366–1369.
3. Blum, M. Coin flipping by telephone: a protocol
for solving impossible problems. In Advances in
Cryptology—A Report on CRYPTO ’81, (1982).
4. Borkin, M.A. et al. Evaluation of filesystem provenance
visualization tools. IEEE Transactions on Visualization
and Computer Graphics 19, 12 (2013), 2476–2485.
5. Braun, U., Shinnar, A. and Seltzer, M. Securing
provenance. In Proceedings of the 3rd Usenix Workshop
on Hot Topics in Security, (2008), 1–5.
6. Braun, U. et al. Issues in automatic provenance
collection. In Proceedings of the International
Conference on Provenance and Annotation of Data,
(2006), 171–183.
7. Buneman, P., Khanna, S. and Tan, W. C. Why and where:
A characterization of data provenance. In Proceedings
of the 8th International Conference on Database
Theory, (2002), 316–330.
8. Callahan, S.P. et al. Towards process provenance
for existing applications. In Proceedings of the 2nd
International Provenance and Annotation Workshop,
(2008), 120–127.
9. Cui, Y., Widom, J. and Wiener, J. L. Tracing the lineage
of view data in a warehousing environment. ACM
Transactions on Database Systems 25, 2 (2000), 179–227.
10. Freire, J. et al. Managing rapidly evolving scientific
workflows. In Proceedings of the International
Conference on Provenance and Annotation of Data,
(2006), 10–18.
11. Gates, C. and Bishop, M. One of these records is not like
the others. In Proceedings of the 3rd Usenix Workshop
on the Theory and Practice of Provenance, (2011).
12. Gehani, A. and Tariq, D. SPADE: Support for
provenance auditing in distributed environments. In
Proceedings of the 13th International Middleware
Conference, (2012), 101–120.
13. Green, T. J., Karvounarakis, G. and Tannen, V. Provenance
semirings. In Proceedings of the 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of
Database Systems, (2007), 31–40.
14. Guo, P. J. and Seltzer, M. Burrito: Wrapping your
lab notebook in computational infrastructure. In
Proceedings of the 4th Usenix Conference on Theory
and Practice of Provenance, (2012), 7–7.
15. Halevy, D. and Shamir, A. The LSD broadcast
encryption scheme. In Advances in Cryptology,
(2002), 47–60.
16. Hasan, R., Sion, R. and Winslett, M. The case of the
fake Picasso: preventing history forgery with secure