SELTZER You’re saying that you need provenance and you need the tools to add the provenance, so that when Photoshop makes a copy there’s a record that says, “OK, this is now a Photoshop document, but it came from this other document and then it was transformed by Photoshop.”
BREWER I completely agree with provenance, but I thought you said that it was inherently not going to work because users could always make copies that are not under anyone’s control. I think that’s the breach and not the observance. Most copies are made by software.
SELTZER I agree, but I think that those copies have a way of leaking outside of the domain where things like de-duplication can’t do anything about them. What typically happens is I go through the firewall, open up something on the corporate server, and then, as I am about to go on my trip, I save a file to my laptop and take my laptop away. Steve’s de-duplication software is never going to see my laptop again.
BREWER Yes, and that was my earlier point about managing the data. If you were to go to any system administrator with that scenario, they would get these big eyes and be really afraid. It should be a lot harder to do exactly what you just stated. That particular problem is perceived as a huge problem by lawyers and system administrators everywhere. The leakage of that data is a big issue.
KLEIMAN Companies that actually own the end-user applications will have to set architectures and policies around this area. They’ll certainly sign and possibly encrypt the document. Over time, they will also take responsibility for the things that we have been talking about: encryption, controlling usage, and external copies. Part of this problem is solved in the application universe, and there are only a few companies that are practical owners of that space.
SELTZER There are times when you want that kind of provenance and there are times when you really don’t.
CREEGER There’s going to be a hazy line between the two.
Defining what is an extraneous copy or derivation of a data object will be intimately tied up with the original object’s semantics. Storage systems are going to be called on to have a more semantic understanding of the objects they store, and deciding whether that information is redundant and delete-able will be much more complex.
KLEIMAN The good news is the trend for end-user application companies, such as Microsoft, is to be relatively open about their protocols. Having those protocols open and accessible will allow people to leverage a common model across the entire system. So, yes, if you kept encrypting blindly, you would defeat any de-duplication because everything is Klingon poetry at that point. I should be able to determine whether two documents that are copied and separately encrypted are the same or not. I’m hoping that will be possible.
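As an editorial illustration of Kleiman's point, one possible way (not something the panel describes) to tell whether two separately encrypted copies hold the same document is to store a keyed fingerprint of the plaintext next to each ciphertext. The sketch below is a minimal one under stated assumptions: the names `StoredBlob`, `store`, `same_document`, and `dedup_key` are hypothetical, and `toy_encrypt` is a deliberately insecure stand-in for whatever real cipher an application would use.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass


@dataclass
class StoredBlob:
    ciphertext: bytes   # opaque to the storage system
    fingerprint: bytes  # keyed digest of the plaintext, used only for matching


def fingerprint(plaintext: bytes, dedup_key: bytes) -> bytes:
    # Keyed hash: only holders of dedup_key can compute or test fingerprints.
    return hmac.new(dedup_key, plaintext, hashlib.sha256).digest()


def toy_encrypt(plaintext: bytes, user_key: bytes) -> bytes:
    # Hypothetical stand-in for a real cipher; NOT secure. The random nonce
    # makes each encryption of the same plaintext yield different bytes.
    nonce = os.urandom(16)
    stream = hashlib.shake_256(user_key + nonce).digest(len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def store(plaintext: bytes, user_key: bytes, dedup_key: bytes) -> StoredBlob:
    # Fingerprint is taken before encryption, so it survives "blind" encryption.
    return StoredBlob(toy_encrypt(plaintext, user_key),
                      fingerprint(plaintext, dedup_key))


def same_document(a: StoredBlob, b: StoredBlob) -> bool:
    # Constant-time comparison of the two fingerprints.
    return hmac.compare_digest(a.fingerprint, b.fingerprint)
```

Two users calling `store` on the same bytes with different `user_key` values would produce different ciphertexts whose fingerprints still match. The trade-off is that anyone holding `dedup_key` learns which stored objects are identical, so such a scheme only fits where that disclosure is acceptable.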
CREEGER What recommendations are we going to be able to make? If IT managers are going to be making investments in archival types of solutions, disaster recovery, de-duplication, and so on, what should they be thinking about in terms of how they design their architectures today and in the next 18 months?
KLEIMAN Over the next decade enterprise-level data is going to migrate to a central archive function that is compressed and de-duplicated, potentially with compliance and whatever other disaster-recovery features that you might want. Once data is in this archive and has certain known properties, the enterprise storage manager can control how it is accessed. They may have copies out on the edges of the network for performance reasons (maybe it’s Flash, maybe it’s high-performance disks, maybe it’s something else), but for all that data there’s a central access and control point.
CREEGER So, people should be looking at building a central archival store that has known properties. Then, once a centralized archive is in place, people can take advantage of other features, such as virtualization or de-duplication, and not sweat the peripheral/edge storage stuff as much.
KLEIMAN I do that today at home, where I use a service that backs up all the data on my home servers to the Internet. When I tell them to back up all my Microsoft files, the Microsoft files don’t go over the network. The service knows that they don’t have to copy Word.exe.
BAKER I’m going to disagree a little bit. One of the things I’ve been doing the past few years is looking at how people and organizations lose data. There’s an amazing richness of ways in which you can lose stuff, and a lot of the disaster stories were due to, even in a virtual sense, a centralized archive. There’s a lot to be said for having those edge copies under other administrative domains. The effectiveness of securing data in this way depends on how seriously you want to keep the data, for how long, and what kind of threat environment you have. The convenience and economics of a centralized archive are very compelling, but it depends on what kinds of risks you want to take with your data over how long a period of time.
SELTZER What happens if Steve’s Internet archive service goes out of business?
ACM QUEUE November/December 2008 37