SELTZER You’re saying that you need provenance and
you need the tools to add the provenance, so that when
Photoshop makes a copy there’s a record that says, “OK,
this is now a Photoshop document, but it came from this
other document and then it was transformed by
Photoshop.”
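The record Seltzer describes can be sketched as a small data structure. This is a minimal illustration, not any real application's format: the `ProvenanceRecord` type, its field names, and the `derive` helper are all hypothetical, and SHA-256 is assumed as the way to identify a document.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record attached to a copy of a document."""
    content_hash: str            # SHA-256 of this copy's bytes
    tool: str                    # application that produced this copy
    parent_hash: Optional[str]   # hash of the document it came from, if any


def fingerprint(data: bytes) -> str:
    """Identify a document by the SHA-256 digest of its contents."""
    return hashlib.sha256(data).hexdigest()


def derive(parent: bytes, child: bytes, tool: str) -> ProvenanceRecord:
    """Record that `child` was produced from `parent` by `tool`."""
    return ProvenanceRecord(
        content_hash=fingerprint(child),
        tool=tool,
        parent_hash=fingerprint(parent),
    )


# Example: an image is opened and saved as a transformed copy.
original = b"raw image bytes"
edited = b"raw image bytes, retouched"
record = derive(original, edited, tool="Photoshop")
```

With a record like this, a storage system could follow `parent_hash` links to see that the edited file is a derivation rather than an unrelated object.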
BREWER I completely agree with provenance, but I
thought you said that it was inherently not going to
work because users could always make copies that are not
under anyone’s control. I think that’s the breach and not
the observance. Most copies are made by software.
SELTZER I agree, but I think that those copies have a
way of leaking outside of the domain, to places where things like
de-duplication can’t do anything about them. What
typically happens is I go through the firewall, open up
something on the corporate server, and then, as I am
about to go on my trip, I save a file to my laptop and take
my laptop away. Steve’s de-duplication software is never
going to see my laptop again.
BREWER Yes, and that was my earlier point about managing the data. If you were to go to any system administrator with that scenario, they would get these big eyes and
be really afraid. It should be a lot harder to do exactly
what you just stated. That particular problem is perceived
as a huge problem by lawyers and system administrators
everywhere. The leakage of that data is a big issue.
KLEIMAN Companies that actually own the end-user
applications will have to set architectures and policies
around this area. They’ll certainly sign and possibly
encrypt the document. Over time, they will also take
responsibility for the things that we have been talking
about: encryption, controlling usage, and external copies.
Part of this problem is solved in the application universe,
and there are only a few companies that are practical
owners of that space.
SELTZER There are times when you want that kind of
provenance and there are times when you really don’t.
CREEGER There’s going to be a hazy line between the
two. Defining what is an extraneous copy or derivation of
a data object will be intimately tied up with the original
object’s semantics. Storage systems are going to be called
on to have a more semantic understanding of the objects
they store, and deciding whether that information is
redundant and delete-able will be much more complex.
KLEIMAN The good news is the trend for end-user application companies, such as Microsoft, is to be relatively
open about their protocols. Having those protocols open
and accessible will allow people to leverage a common
model across the entire system. So, yes, if you kept
encrypting blindly, you would defeat any de-duplication because everything is Klingon poetry at that point. I
should be able to determine whether two documents that
are copied and separately encrypted are the same or not.
I’m hoping that will be possible.
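One way the comparison Kleiman hopes for could work is to store a keyed fingerprint of the plaintext next to each ciphertext. The sketch below assumes this approach; it is not something the panel specifies. `toy_encrypt` is a deliberately insecure stand-in for a real cipher, and the shared `tag_key` is a hypothetical organization-wide secret.

```python
import hashlib
import hmac
import os


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Placeholder cipher for illustration only. NOT secure.

    XORs the plaintext with a SHA-256-derived keystream under a random nonce,
    so encrypting the same document twice yields different ciphertexts.
    """
    nonce = os.urandom(16)
    stream = b""
    counter = 0
    while len(stream) < len(plaintext):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    body = bytes(p ^ s for p, s in zip(plaintext, stream))
    return nonce + body


def seal(document: bytes, enc_key: bytes, tag_key: bytes) -> tuple[str, bytes]:
    """Return (fingerprint tag, ciphertext) for a document."""
    # The tag depends only on the plaintext and the shared tag key.
    tag = hmac.new(tag_key, document, hashlib.sha256).hexdigest()
    return tag, toy_encrypt(enc_key, document)


def same_document(tag_a: str, tag_b: str) -> bool:
    """Compare two tags without decrypting either ciphertext."""
    return hmac.compare_digest(tag_a, tag_b)


tag_key = b"hypothetical shared dedup key"
doc = b"quarterly report"
tag1, ct1 = seal(doc, os.urandom(32), tag_key)
tag2, ct2 = seal(doc, os.urandom(32), tag_key)
tag3, ct3 = seal(b"different report", os.urandom(32), tag_key)
```

The two ciphertexts of the same report differ, yet their tags match, so a store could treat them as duplicates. The trade-off is that anyone holding the tags learns which documents are equal.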
CREEGER What recommendations are we going to be able
to make? If IT managers are going to be making investments in archival types of solutions, disaster recovery,
de-duplication, and so on, what should they be thinking
about in terms of how they design their architectures
today and in the next 18 months?
KLEIMAN Over the next decade enterprise-level data
is going to migrate to a central archive function that is
compressed and de-duplicated, potentially with compliance, disaster-recovery, and whatever other features
you might want. Once data is in this archive and has
certain known properties, the enterprise storage manager
can control how it is accessed. They may have copies
out on the edges of the network for performance reasons—maybe it’s Flash, maybe it’s high-performance disks,
maybe it’s something else—but for all that data there’s a
central access and control point.
CREEGER So, people should be looking at building a
central archival store that has known properties. Then,
once a centralized archive is in place, people can take
advantage of other features, such as virtualization or de-duplication, and not sweat the peripheral/edge storage
stuff as much.
KLEIMAN I do that today at home, where I use a service
that backs up all the data on my home servers to the
Internet. When I tell them to back up all my Microsoft
files, the Microsoft files don’t go over the network. The
service knows that they don’t have to copy Word.exe.
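The behavior Kleiman describes, where well-known files never cross the network, is consistent with hash-based de-duplication on the client side. How his service actually works is not stated, so the following is only a sketch under that assumption; `plan_backup` and the `known_hashes` set are hypothetical names.

```python
import hashlib


def plan_backup(files: dict[str, bytes], known_hashes: set[str]) -> list[str]:
    """Return the names of files whose contents the service has not seen.

    `known_hashes` stands in for the set of SHA-256 digests the service
    already stores (for example, a common executable such as Word.exe).
    """
    to_upload = []
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest not in known_hashes:
            to_upload.append(name)
            known_hashes.add(digest)  # later identical copies are skipped too
    return to_upload


# Example: the service already holds the bytes of Word.exe.
word_exe = b"pretend these are the bytes of Word.exe"
known = {hashlib.sha256(word_exe).hexdigest()}
files = {
    "Word.exe": word_exe,
    "report.doc": b"my own document",
    "report-copy.doc": b"my own document",
}
uploads = plan_backup(files, known)
```

Only `report.doc` would be sent here: the executable is already known to the service, and the second copy of the report has the same digest as the first.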
BAKER I’m going to disagree a little bit. One of the things
I’ve been doing the past few years is looking at how
people and organizations lose data. There’s an amazing
richness of ways in which you can lose stuff, and a lot of
the disaster stories were due to a centralized archive, even
if only in a virtual sense.
There’s a lot to be said for having those edge copies
under other administrative domains. The effectiveness of
securing data in this way depends on how seriously you
want to keep the data, for how long, and what kind of
threat environment you have. The convenience and economics of a centralized archive are very compelling, but
it depends on what kinds of risks you want to take with
your data over how long a period of time.
SELTZER What happens if Steve’s Internet archive service
goes out of business?
ACM QUEUE November/December 2008 37