and what it actually represents.
MARGO SELTZER: With the data provenance
you can identify copies, whether
they were made intentionally or uninten-
tionally. That’s a start. However, answer-
ing the other semantic questions, such
as “Why was the copy made?” will still
require user intervention, which histori-
cally has been very difficult to get.
STEVE KLEIMAN: Each set of data (a
database, a user’s home directory)
has certain properties associated with
it. With a database you want to make
sure it has a certain quality of service, a
disaster recovery strategy, and a certain
number of archival copies so that you
can go back a number of years. You
may also want a certain number of
backup checkpoints to go back to in
case of corruption.
Those are all properties of the data
set that can be predefined. Once set,
the system can do the right thing, in-
cluding making as many copies as is
relevant. It’s not that people are mak-
ing copies for the sake of making cop-
ies; they’re trying to accomplish this
higher-level goal and not telling the
system what that goal is.
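As an illustrative aside, the idea of declaring a data set's goals once and letting the system derive the copies might be sketched as below. The class name `DataSetPolicy`, its fields, the tier names, and the copy-counting rule are all hypothetical, chosen only to mirror the properties Kleiman lists; no real storage product's interface is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSetPolicy:
    """Hypothetical per-data-set goals, declared once by an administrator."""
    name: str
    service_level: str        # made-up tier label, e.g. "gold" or "bronze"
    disaster_recovery: bool   # keep one replica at a remote site?
    archive_years: int        # assumed: one archival copy per retained year
    backup_checkpoints: int   # checkpoints kept for corruption recovery


def planned_copies(policy: DataSetPolicy) -> dict[str, int]:
    """Derive how many copies of each kind the system should maintain.

    The arithmetic is an assumption for illustration: one primary, one
    remote replica if disaster recovery is requested, one archive per
    retained year, and one copy per backup checkpoint.
    """
    return {
        "primary": 1,
        "remote_replica": 1 if policy.disaster_recovery else 0,
        "archive": policy.archive_years,
        "checkpoint": policy.backup_checkpoints,
    }


# Two example data sets with different goals (names are invented).
database = DataSetPolicy("orders-db", "gold", True,
                         archive_years=7, backup_checkpoints=24)
home_dir = DataSetPolicy("home/alice", "bronze", False,
                         archive_years=0, backup_checkpoints=7)

for policy in (database, home_dir):
    plan = planned_copies(policy)
    print(policy.name, plan, "total:", sum(plan.values()))
```

The point of the sketch is that the administrator states the goal (retention, recovery, service level) and never issues a "make a copy" command; the number of copies falls out of the policy.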
MARGO SELTZER: You’re saying that
you need provenance and you need the
tools to add the provenance, so that
when Photoshop makes a copy there’s
a record that says, “Okay, this is now
a Photoshop document, but it came
from this other document and then it
was transformed by Photoshop.”
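A minimal sketch of the kind of record Seltzer describes, assuming each application logs a small provenance entry whenever it saves a file. The record layout, the function names, and the file paths are hypothetical; they are not taken from any real provenance system or from Photoshop itself.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical entry written by a tool each time it saves a file."""
    path: str                    # where this version was written
    content_hash: str            # SHA-256 of the saved bytes
    created_by: str              # tool that produced this version
    derived_from: Optional[str]  # path of the source document, if any


def record_save(log, path, data, tool, derived_from=None):
    """Add a provenance entry for a newly saved file to the log."""
    rec = ProvenanceRecord(path, hashlib.sha256(data).hexdigest(),
                           tool, derived_from)
    log[path] = rec
    return rec


def lineage(log, path):
    """Follow derived_from links from a file back to its origin."""
    chain, seen = [], set()
    while path in log and path not in seen:  # `seen` guards against cycles
        seen.add(path)
        rec = log[path]
        chain.append(f"{rec.path} ({rec.created_by})")
        path = rec.derived_from
    return chain


def exact_copies(log, path):
    """Other logged files whose bytes are identical to the given file."""
    target = log[path].content_hash
    return sorted(p for p, r in log.items()
                  if r.content_hash == target and p != path)


log = {}
edited = b"layered edit of the cat photo"
record_save(log, "/photos/cat.raw", b"raw camera image", "camera-import")
record_save(log, "/photos/cat.psd", edited, "Photoshop",
            derived_from="/photos/cat.raw")
record_save(log, "/laptop/cat.psd", edited, "file-copy",
            derived_from="/photos/cat.psd")

print(lineage(log, "/laptop/cat.psd"))
print(exact_copies(log, "/photos/cat.psd"))
```

The content hash identifies copies whether or not anyone meant to make them; the `derived_from` link records where a copy came from. Neither answers why the copy was made, which is the part Seltzer says still needs user input.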
ERIC BREWER: I completely agree with
provenance, but I thought you said that
it was inherently not going to work be-
cause users could always make copies
that are not under anyone’s control.
I think that’s the breach and not the
observance. Most copies are made by
software.
MARGO SELTZER: I agree, but I think
that those copies have a way of leaking
outside of the domain, to places where
things like de-duplication can’t do anything
about them. What typically happens is I
go through the firewall, open up some-
thing on the corporate server, and then,
as I am about to go on my trip, I save a
file to my laptop and take my laptop
away. Steve’s de-duplication software is
never going to see my laptop again.
ERIC BREWER: Yes, and that was my
earlier point about managing the data.
If you were to go to any system administrator with that scenario they’d get
these big eyes and be really afraid. It
should be a lot harder to do exactly
what you just stated. That particular
problem is perceived as a huge prob-
lem by lawyers and system administra-
tors everywhere. The leakage of that
data is a big issue.
STEVE KLEIMAN: Companies that actually
own the end-user applications will
have to set architectures and policies
around this area. They’ll certainly sign
and possibly encrypt the document.
Over time, they will also take responsi-
bility for the things that we have been
talking about: encryption, controlling
usage, and external copies. Part of
this problem is solved in the applica-
tion universe and there are only a few
companies that are practical owners
of that space.
MARGO SELTZER: There are times when
you want that kind of provenance and
there are times when you really don’t.
MACHE CREEGER: There’s going to be
a hazy line between the two. Defining
what is an extraneous copy or deriva-
tion of a data object will be intimately
tied up with the original object’s se-
mantics. Storage systems are going
to be called on to have a more seman-
tic understanding of the objects they
store, and deciding that information
is redundant and delete-able will be a
much more complex decision.
STEVE KLEIMAN: The good news is the
trend for end-user application com-
panies, such as Microsoft, is to be
relatively open about their protocols.
Having those protocols open and ac-
cessible will allow people to leverage
a common model across the entire
system. So, yes, if you kept encrypting
blindly you’d defeat any de-duplication
because everything is Klingon poetry
at that point. I should be able to deter-
mine whether two documents that are
copied and separately encrypted are
the same or not. I’m hoping that will be
possible.
MACHE CREEGER: What recommendations
are we going to be able to make?
If IT managers are going to be making
investments in archival types of solu-
tions, disaster recovery, de-duplica-
tion, and so on, what should they be
thinking about in terms of how they
design their architectures today and in
the next 18 months?
STEVE KLEIMAN: Over the next decade
enterprise-level data is going to mi-