6. Bates, A.M. et al. Trustworthy whole-system
provenance for the Linux kernel. In Proceedings of the
USENIX Security Symposium (2015) 319–334.
7. Buneman, P. et al. Why and where: A characterization
of data provenance. In Proceedings of the International
Conference on Database Theory. Springer, 2001, 316–330.
8. Carata, L. et al. A primer on provenance. Commun.
ACM 57, 5 (May 2014), 52–60.
9. Cheney, J. et al. Provenance in databases: Why, how,
and where. Foundations and Trends in Databases 1, 4
(2009), 379–474.
10. Crabtree, A. et al. Building accountability into the
Internet of Things: The IoT databox model. Journal of
Reliable Intelligent Environments (2018).
11. Davidson, S. et al. Provenance views for module
privacy. In Proceedings of the Thirtieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of
Database Systems. ACM, 2011, 175–186.
12. De Hert, P. et al. The right to data portability in the GDPR:
Towards user-centric interoperability of digital services.
Computer Law & Security Review. Elsevier, (2017).
13. Eldefrawy, K. et al. SMART: Secure and minimal
architecture for (establishing dynamic) root of
trust. In Network and Distributed System Security
Symposium 12 (2012), 1–15.
14. Han, X. et al. FRAPpuccino: Fault-detection through
Runtime Analysis of Provenance. In Proceedings
of the Workshop on Hot Topics in Cloud Computing
(HotCloud’17). USENIX (2017).
15. Hasan, R. et al. The case of the fake Picasso:
Preventing history forgery with secure provenance.
In Proceedings of the Conference on File and Storage
Technologies (FAST’09), (2009), 1–14.
16. Herschel, M. et al. A survey on provenance: What for?
What form? What from? The VLDB Journal—The
International Journal on Very Large Data Bases 26, 6
(2017), 881–906.
17. Hossain, M. N. et al. Dependence-preserving data
compaction for scalable forensic analysis. In
Proceedings of the USENIX Security Symposium.
18. King, S. T. and Chen, P.M. Backtracking intrusions. ACM
SIGOPS Operating Systems Review 37, 5 (May 2003).
19. Liang, X. et al. Provchain: A blockchain-based data
provenance architecture in cloud environment with
enhanced privacy and availability. In International
Symposium on Cluster, Cloud and Grid Computing.
IEEE/ACM, (2017), 468–477.
20. Missier, P. et al. ProvAbs: Model, policy, and tooling
for abstracting PROV graphs. In Proceedings of the
International Provenance and Annotation Workshop.
Springer, 2017, 3–15.
21. Moyer, T. and Gadepally, V. High-throughput ingest
of data provenance records into Accumulo. In
Proceedings of the High Performance Extreme
Computing Conference (HPEC), IEEE, 2016, 1–6.
22. Pasquier, T. et al. Runtime analysis of whole system
provenance. In Proceedings of the Conference on
Computer and Communications Security (CCS’18).
ACM, 2018.
23. Pasquier, T. et al. If these data could talk. Scientific
Data 4 (2017), http://www.nature.com/sdata2017114.
24. Pasquier, T. et al. Data provenance to audit compliance
with privacy policy in the Internet of Things. Personal
and Ubiquitous Computing (2018), 333–344.
25. Pohly, D. J. et al. Hi-Fi: Collecting high-fidelity whole-system provenance. In Proceedings of the 28th Annual
Computer Security Applications Conference. ACM,
2012, 259–268.
26. Schreiber, A. and Struminski, R. Tracing personal data
using comics. In Proceedings of the International
Conference on Universal Access in Human-Computer
Interaction. Springer, 2017, 444–455.
27. Singh, J. et al. Twenty security considerations for
cloud-supported Internet of Things. IEEE Internet of
Things Journal 3, 3 (Mar. 2016), 269–284.
Thomas Pasquier (http://tfjmp.org) is a Lecturer
(Assistant Professor) at the University of Bristol’s Cyber
Security Group, and a visiting scholar at the University of
Cambridge, U.K.
David Eyers (https://www.cs.otago.ac.nz/staff/David_Eyers)
is an Associate Professor in the Department of Computer
Science at the University of Otago, New Zealand.
Jean Bacon (http://www.cl.cam.ac.uk/~jmb25/) is
Professor Emerita of Distributed Systems at the University
of Cambridge, U.K.
Copyright held by authors.
If we assume trustworthy provenance
capture is achievable, the issue of guaranteeing that the provenance record can
be audited remains. If you are to audit
the processing of personal data, guarantees about the integrity and availability
of the provenance record must exist. If
you agreed to share your daily activity
for research, the activities of insurance
companies scraping your data for possible health risks must not be able to
masquerade as benign research use,
nor should data collection for political
purposes be able to pass as harmless entertainment, as in the Cambridge Analytica scandal.h Similarly, the availability
(durability) of the audit record must be
guaranteed. There is no point to an audit record if it can simply be deleted.
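One common way to make an audit record tamper-evident is a hash chain, in which each entry commits to the digest of the one before it. The following is a minimal sketch of that general technique, not the mechanism proposed in this article or in any cited system. The record fields, the `GENESIS` placeholder, and the function names are all hypothetical.

```python
import hashlib
import json

# Hypothetical placeholder digest that precedes the first record.
GENESIS = "0" * 64


def append_record(log: list, payload: dict) -> None:
    """Append a record whose digest covers its payload and the previous digest."""
    prev = log[-1]["digest"] if log else GENESIS
    body = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
    log.append({"payload": payload, "prev": prev, "digest": digest})


def verify(log: list) -> bool:
    """Recompute every digest and check that each record links to its predecessor."""
    prev = GENESIS
    for entry in log:
        body = json.dumps(entry["payload"], sort_keys=True)
        expected = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        if entry["prev"] != prev or entry["digest"] != expected:
            return False
        prev = entry["digest"]
    return True


# Illustrative records only; the field names are assumptions.
log = []
append_record(log, {"actor": "insurer", "purpose": "risk-scoring"})
append_record(log, {"actor": "university", "purpose": "research"})
print(verify(log))  # True

# Rewriting history so the insurer's access looks like research breaks the chain.
log[0]["payload"]["purpose"] = "research"
print(verify(log))  # False
```

A chain of this kind detects edits to existing entries and the removal of any entry other than the last. It does not detect truncation of the tail unless the latest digest is also held somewhere the adversary cannot reach, and it does nothing for availability, which has to be addressed separately.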
Further, Moyer et al. evaluated the
storage requirements of provenance
when used for security purposes in relatively modest distributed systems.21 In
such a context, several thousands of
graph elements can be generated per
second and per machine, resulting in
a graph containing billions of nodes to
represent system execution over several
months. It is unclear how some past research outcomes, for example, detection
of suspicious behavior,2 privacy-aware
provenance,11 or provenance integrity,15
scale to very large graphs, as such concerns were not evaluated. Similarly,
while blockchain is heralded19 as an integrity-preserving means to store provenance, it is unclear how well it could expand to such scale. Several options have
been explored to reduce graph size, such
as identifying and tracking only sensitive data objects5 or performing property-preserving graph compression;17 however, none has yet adequately addressed
the scalability challenge.
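The jump from thousands of graph elements per second to billions of nodes can be checked with simple arithmetic. The sketch below uses an assumed capture rate of 2,000 elements per second on a single machine over 90 days. Both numbers are illustrative choices, not measurements reported by Moyer et al.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def graph_elements(rate_per_second: int, days: int, machines: int = 1) -> int:
    """Total provenance graph elements produced at a constant capture rate."""
    return rate_per_second * SECONDS_PER_DAY * days * machines


# Hypothetical inputs: 2,000 elements per second, one machine, 90 days.
total = graph_elements(rate_per_second=2_000, days=90)
print(f"{total:,} elements")  # 15,552,000,000 elements
```

Even under these modest assumptions a single machine produces more than 15 billion elements in three months, and the total grows linearly with the number of machines.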
How to Communicate Information?
Means must be developed to communicate about data usage, but also about
the risks of inference from the data.
Not only must the nature of the data be
considered, but also other properties
such as the frequency of capture.3 For
example, a 100 Hz smart-meter reading can in some cases indicate what
television channel is currently being
watched; even a daily average reading
could inform about occupancy. Here,
it is important to be able to explore
h See https://nyti.ms/2HH74vA
Provenance visualization has been
an active research topic for over a decade, yet no fully satisfactory solution
has been proposed. The simplest possible visualization is to render the graph;
however, beyond trivially simple graphs,
such a representation is too complex
and dense to be easily understood, even
by experts. We go further and suggest
that how interpretable such information is for end users also depends on
educational background, socioeconomic environment, and culture.
In order to make the accountability
and transparency of IoT platforms effective, a better communication medium
must be provided. An approach often
taken is to analyze motifs in the graph
to extract high-level abstractions (for
example, Missier et al.20) meaningful to
the average end user. In recent work, it
was proposed to represent such a high-level abstraction as a comic strip.26
We Need to Care About Digital Provenance
Building transparent and auditable systems may be one of the greatest software
engineering challenges of the coming
decade. As a consequence, digital provenance and its application to cybersecurity and the management of personal data
have become a hot research topic. We
have highlighted key active areas of research and their associated challenges.
It is fundamental for industry practitioners to understand the threat posed by
the black-box nature of the IoT, the potential solutions, and the challenges to a
practical deployment of those solutions.
Accountability-by-design must become
a core objective of IoT platforms.
References
1. Acar, U. et al. A graph model of data and workflow
provenance. In Proceedings of the TAPP’10 Second
Conference on Theory and Practice of Provenance,
USENIX, 2010.
2. Allen, M.D. et al. Provenance for collaboration:
Detecting suspicious behaviors and assessing trust in
information. In Proceedings of the 7th International
Conference on Collaborative Computing: Networking,
Applications and Worksharing (CollaborateCom).
IEEE, 2011, 342–351.
3. Amar, Y. et al. An information theoretic approach to
time-series data privacy. In Proceedings of the 1st
Workshop on Privacy by Design in Distributed Systems.
ACM, (2018), 3.
4. Balakrishnan, N. et al. Non-repudiable disk I/O in
untrusted kernels. In Proceedings of the 8th Asia-Pacific Workshop on Systems. ACM, 2017, 24.
5. Bates, A. et al. Take only what you need: Leveraging
mandatory access control policy to reduce provenance
storage costs. In Proceedings of the Conference on
Theory and Practice of Provenance (2015), USENIX, 7–7.