a user’s exported data in a lossless manner is key to data liberation—it may take
more time to implement, but we think
the result is worthwhile.
Case Study: Blogger
One of the problems we often encounter when doing a liberation project is catering to the power user. These are our
favorite users. They are the ones who
love to use the service, put a lot of data
into it, and want the comfort of being
able to do very large imports or exports
of data at any time. Five years of journalism through blog posts and photos,
for example, can easily extend beyond
a few gigabytes of information, and attempting to move that data in one fell
swoop is a real challenge. In an effort
to make import and export as simple as
possible for users, we decided to implement a one-click solution that would
provide the user with a Blogger export
file that contains all of the posts, comments, static pages, and even settings
for any Blogger blog. This file is downloaded to the user’s hard drive and can
be imported back into Blogger later
or transformed and moved to another
blogging service.
One mistake we made when creating the import/export experience for
Blogger was relying on one HTTP transaction for an import or an export. HTTP
connections become fragile when the
size of the data you are transferring becomes large. Any interruption in that
connection voids the action and can
lead to incomplete exports or missing
data upon import. These are extremely
frustrating scenarios for users and,
unfortunately, much more prevalent
for power users with lots of blog data.
We neglected to implement any form
of partial export as well, which means
power users sometimes need to resort
to silly things such as breaking up their
export files by hand in order to have
better success when importing. We
recognize this is a bad experience for
users and are hoping to address it in a
future version of Blogger.
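To illustrate the partial export described above, here is a minimal sketch that splits a blog's entries into fixed-size batches so that each batch can be written to, and imported from, its own file. The function name `chunk_entries`, the `max_per_file` parameter, and the dictionary shape of an entry are assumptions made for this example; they do not describe Blogger's actual export format or code.

```python
from typing import Iterator, List, Sequence


def chunk_entries(entries: Sequence[dict], max_per_file: int) -> Iterator[List[dict]]:
    """Yield successive batches of at most max_per_file entries."""
    if max_per_file < 1:
        raise ValueError("max_per_file must be at least 1")
    for start in range(0, len(entries), max_per_file):
        yield list(entries[start:start + max_per_file])


# Hypothetical blog with five posts, exported two posts per file.
posts = [{"id": i, "title": f"Post {i}"} for i in range(1, 6)]
batches = list(chunk_entries(posts, max_per_file=2))
```

With batches like these, a failed import costs the user one small file to retry rather than the whole export.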
A better approach, one taken by rival
blogging platforms, is not to rely
on the user’s hard drive to serve as the
intermediary when attempting to migrate
lots of data between cloud-based
blogging services. Instead, data liberation
is best provided through APIs,
and data portability is best provided by
building code using those APIs to perform
cloud-to-cloud migration. These
types of migrations require multiple
RPCs between services to move the
data piece by piece, and each of these
RPCs can be retried upon failure automatically
without user intervention. It
is a much better model than the
one-transaction import. It increases the
likelihood of total success and is an
all-around better experience for the
user. True cloud-to-cloud portability,
however, works only when each cloud
provides a liberated API for all of the
user’s data. We think cloud-to-cloud
portability is really good for users, and
it’s a tenet of the Data Liberation Front.
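The piece-by-piece migration with automatic retries can be sketched as follows. This is only an illustration under stated assumptions: `TransientError`, `call_with_retry`, `migrate`, and the in-memory `source` and `destination` dictionaries are hypothetical stand-ins for real service APIs, and the simulated failure on every first write exists solely to exercise the retry path.

```python
import time
from typing import Callable, Dict, Iterable, List


class TransientError(Exception):
    """Stands in for a recoverable RPC failure such as a dropped connection."""


def call_with_retry(rpc: Callable[[], None], max_attempts: int = 3,
                    base_delay: float = 0.0) -> None:
    """Run rpc, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            rpc()
            return
        except TransientError:
            if attempt == max_attempts:
                raise  # Give up only after the final attempt fails.
            time.sleep(base_delay * 2 ** (attempt - 1))


def migrate(item_ids: Iterable[int], read_item, write_item,
            max_attempts: int = 3) -> List[int]:
    """Move items one at a time; each item is its own retryable step."""
    migrated = []
    for item_id in item_ids:
        def move_one(item_id=item_id):
            write_item(item_id, read_item(item_id))
        call_with_retry(move_one, max_attempts=max_attempts)
        migrated.append(item_id)
    return migrated


# Hypothetical source and destination services, modeled as dictionaries.
source: Dict[int, str] = {1: "first post", 2: "second post", 3: "third post"}
destination: Dict[int, str] = {}
failed_once = set()


def read_item(item_id: int) -> str:
    return source[item_id]


def flaky_write(item_id: int, body: str) -> None:
    # Fail the first attempt for every item to simulate an interruption.
    if item_id not in failed_once:
        failed_once.add(item_id)
        raise TransientError(f"write of item {item_id} interrupted")
    destination[item_id] = body


migrated = migrate(source.keys(), read_item, flaky_write)
```

Because each item is transferred in its own call, an interruption costs one retry of one item instead of voiding the entire migration.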
Challenges
As you have seen from these case studies, the first step on the road to data
liberation is to decide exactly what users need to export. Once you have covered the data that users have imported into
or created in your product themselves, it starts to get complicated. Take
Google Docs, for example: a user clearly owns a document that he or she created, but what about a document that
belongs to another user and is then edited
by the user currently doing the export?
What about documents to which the
user has only read access? The set of
documents the user has read access
to may be considerably larger than the
set of documents the user has actually
read or opened if you take into account
globally readable documents. Lastly,
you have to take into account document metadata such as access control
lists. This is just one example, but it
applies to any product that lets users
share or collaborate on data.
Another important challenge to
keep in mind involves security and
authentication. When you are making
it very easy and fast for users to pull
their data out of a product, you drastically reduce the time required for an
attacker to make off with a copy of all
your data. This is why it’s a good idea to
require users to re-authenticate before
exporting sensitive data (such as their
search history), as well as to over-communicate export activity back to the user
(for example, email notification that
an export has occurred). We are exploring these mechanisms and more as we
continue liberating products.
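The two safeguards mentioned above, re-authentication and notification, might be combined along the following lines. Everything here is an assumption made for illustration: the five-minute window, the `Session` and `ExportGate` names, and the list that stands in for an email system. It is not a description of how any Google product implements these checks.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed policy: the user must have entered a password in the last 5 minutes.
REAUTH_WINDOW_SECONDS = 300


@dataclass
class Session:
    user_email: str
    last_password_check: float  # Seconds since the epoch.


@dataclass
class ExportGate:
    # Stands in for an outgoing email queue in this sketch.
    sent_notices: List[str] = field(default_factory=list)

    def request_export(self, session: Session, now: float) -> bool:
        """Allow the export only for a recently re-authenticated session."""
        if now - session.last_password_check > REAUTH_WINDOW_SECONDS:
            return False  # Caller should prompt for the password again.
        # Always tell the account owner that an export took place.
        self.sent_notices.append(f"Export started for {session.user_email}")
        return True


gate = ExportGate()
fresh = Session("user@example.com", last_password_check=1000.0)
stale = Session("user@example.com", last_password_check=0.0)

fresh_allowed = gate.request_export(fresh, now=1100.0)
stale_allowed = gate.request_export(stale, now=1100.0)
```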
Large data sets pose another challenge.
An extensive photo collection,
for example, which can easily scale into
multiple gigabytes, can pose difficulties
with delivery given the current transfer
speeds of most home Internet connections.
In this case, either we have a client
for the product that can sync data
to and from the service (such as Picasa),
or we rely on established protocols and
APIs (for example, POP and IMAP for
Gmail) to allow users to sync incrementally
or export their data.
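The idea behind incremental sync is to transfer only what has changed since the last run. A minimal sketch, assuming each item carries a version number and modeling both sides as dictionaries (the name `sync_incrementally` and this data shape are hypothetical and are not how Picasa, POP, or IMAP represent data):

```python
from typing import Dict, List, Tuple

# Maps an item name to (version, contents).
Store = Dict[str, Tuple[int, bytes]]


def sync_incrementally(remote: Store, local: Store) -> List[str]:
    """Copy only items that are missing locally or newer on the remote side."""
    transferred = []
    for name, (version, data) in remote.items():
        if name not in local or local[name][0] < version:
            local[name] = (version, data)
            transferred.append(name)
    return transferred


remote: Store = {
    "a.jpg": (1, b"aaa"),
    "b.jpg": (2, b"bbb-edited"),
    "c.jpg": (1, b"ccc"),
}
local: Store = {
    "a.jpg": (1, b"aaa"),
    "b.jpg": (1, b"bbb"),
}

first_pass = sync_incrementally(remote, local)   # Moves only b.jpg and c.jpg.
second_pass = sync_incrementally(remote, local)  # Nothing left to move.
```

After the first pass the two stores match, so later runs transfer nothing until the remote side changes again.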
Conclusion
Allowing users to get a copy of their
data is just the first step on the road to
data liberation: we have a long way to
go to get to the point where users can
easily move their data from one product on the Internet to another. We look
forward to this future, where we as engineers can focus less on schlepping
data around and more on building interesting products that can compete
on their technical merits—not by holding users hostage. Giving users control
over their data is an important part of
establishing user trust, and we hope
more companies will see that if they
want to retain their users for the long
term, the best way to do that is by setting them free.
Acknowledgments
Thanks to Bryan O’Sullivan, Ben Collins-Sussman, Danny Berlin, Josh
Bloch, Stuart Feldman, and Ben Laurie
for reading drafts of this article.
Related articles
on queue.acm.org
Other People’s Data
Stephen Petschulat
http://queue.acm.org/detail.cfm?id=1655240
Why Cloud Computing Will Never Be Free
Dave Durkee
http://queue.acm.org/detail.cfm?id=1772130
Brian Fitzpatrick started Google’s Chicago engineering
office in 2005 and is the engineering manager for the
Data Liberation Front and the Google Affiliate Network.
A published author, frequent speaker, and open source
contributor for more than 12 years, Fitzpatrick is a
member of the Apache Software Foundation and the Open
Web Foundation, as well as a former engineer at Apple
and CollabNet.
JJ Lueck joined the software engineering party at Google
in 2007. An MIT graduate and prior engineer at AOL,
Bose, and startups Bang Networks and Reactivity, he
enjoys thinking about problems such as cloud-to-cloud
interoperability and exploring the depths and potentials of
virtual machines.
© 2010 ACM 0001-0782/10/1100 $10.00