disaster recovery. Data analysis may
require computing power or specialized computing systems not available locally. Policy or sociology may
require replication of data sets in
distinct geographical regions; in high-energy physics, for example, all data
produced at the Large Hadron Collider in Geneva, Switzerland, must be replicated in the U.S. and
elsewhere for independent analysis.
It is also frequently the case that the
aggregate data-analysis requirements
of a community exceed the analysis
capacity of a data provider, in which
case data must be downloaded for local analysis. This is the case in, for example, the Earth System Grid, which
delivers climate simulation output to
its 25,000 users worldwide.
Another common question about
GO is whether data can be moved
more effectively by physically shipping media rather than
by transferring it over networks. After all, no network can exceed the bandwidth of a FedEx truck.
The answer is, again, that while physical shipment has its place (and may
be much cheaper if the alternative is
to pay for a high-speed network connection), it is not suitable in all situations. Latency is high, and so is the
human overhead associated with
loading and unloading media, as well
as with keeping track of what has been
shipped. Nevertheless, methods
for orchestrating physical shipment,
when it is determined to be faster
or more cost-effective, could feasibly
be integrated into GO, as in Cho
and Gupta's Pandora ("People
and networks moving data around").
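The truck-versus-network comparison is easy to quantify. As a back-of-envelope sketch (the 100TB payload and 24-hour transit time are illustrative assumptions, not figures from the text):

```python
def effective_gbps(payload_bytes: float, transit_seconds: float) -> float:
    """Effective bandwidth of a bulk shipment, in gigabits per second."""
    return payload_bytes * 8 / transit_seconds / 1e9

# Illustrative assumptions: a box of disks holding 100 TB, shipped overnight (24 h).
truck = effective_gbps(100e12, 24 * 3600)

# Comparable in raw bandwidth to a dedicated 10 Gb/s link -- but with a day of
# latency, plus the human overhead of loading, unloading, and tracking media.
print(round(truck, 2))  # ~9.26 Gb/s
```

The arithmetic supports the FedEx-truck aphorism for large payloads, while the latency and handling costs are exactly the drawbacks noted above.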
One alternative for data movement involves running tools on the user's computer;
for example, rsync,20 scp, file transfer program (FTP), secure FTP, and bbftp13 are all
used to move data between a client computer and a remote location. Other software
(such as globus-url-copy, Reliable File Transfer, File Transfer Service, and Lightweight
Data Replicator) can each manage large numbers of transfers. However, the need to
download, install, and run software is a significant barrier to use. Users spend much time
configuring, operating, and updating such tools, yet rarely have the IT and networking
knowledge necessary to fix things when they do not "just work," which is all too often.
Some big-science projects have developed specialized solutions to the problem; for
example, the PhEDEx high-throughput data-transfer-management system9 manages
data movement among sites participating in the Compact Muon Solenoid experiment
at CERN, and the Laser Interferometer Gravitational-Wave Observatory (LIGO) project
developed the LIGO Data Replicator.4 These centrally managed systems allow users
to hand off data-movement tasks to a third-party service that performs them on their
behalf. However, these services require professional operators and function only among
carefully controlled endpoints within their communities.
Managed services (such as YouSendIt and Dropbox) also provide data-management
solutions but do not address researchers' need for high-performance movement of
large quantities of data. BitTorrent8 and content distribution networks21 are good at
distributing a relatively stable set of large files (such as movies) but do not address data
scientists' need for many frequently updated files managed in directory hierarchies.
The Integrated Rule-Oriented Data System17 is often run in hosted configurations,
but, though it performs some data-transfer operations (such as for data import), data
transfer is not its primary function or focus.
The Kangaroo,19 Stork,14 and CATCH15 systems all manage data movement over
wide-area networks, using intermediate storage systems where appropriate to optimize
end-to-end reliability and/or performance. They are not designed as SaaS data-
movement solutions, but their methods could be incorporated into GO.
Web and REST interfaces to centrally operated services are conventional in
business, underpinning such services as Salesforce.com (customer relationship
management), Google Docs, Facebook, and Twitter, but this approach is not yet common in
science. Two exceptions are the PhEDEx Data Service,9 with both REST and command-line
interfaces, and the National Energy Research Scientific Computing Center (NERSC) Web
toolkit NEWT,7 which enables RESTful operations against HPC center resources.
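The REST style described here can be sketched in a few lines. The endpoint URL, resource path, and JSON field names below are invented for illustration; they are not the actual PhEDEx or NEWT APIs. The sketch constructs a request without sending it:

```python
import json
from urllib.request import Request

def build_transfer_request(base_url, source, destination, paths):
    """Construct (but do not send) a RESTful transfer submission.

    The URL layout and field names are hypothetical; a real service
    defines its own resource paths and request schema.
    """
    body = json.dumps({
        "source": source,
        "destination": destination,
        "paths": list(paths),
    }).encode()
    return Request(
        f"{base_url}/transfers",           # hypothetical resource path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_transfer_request(
    "https://transfer.example.org/v1",     # placeholder host
    "siteA", "siteB", ["/data/run42/"])
print(req.get_method(), req.full_url)
```

The appeal for science is the same as for business: a user (or script) needs only an HTTP client, not locally installed and configured transfer software.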
The Computation Institute at the University of Chicago and Argonne National Laboratory operates GO as a highly available service (http://www.globusonline.org/) to which users submit data-movement and synchronization requests. A typical transfer request proceeds as follows: A user authenticates with GO and submits a request. GO records the request into its state database, inspects the request to determine what endpoints are involved, and if necessary prompts the user to provide credentials GO can use to interact with those endpoints on the user's behalf. GO then establishes authenticated GridFTP control channels with each endpoint and issues the appropriate GridFTP commands to transfer the requested files directly between the endpoints. GO monitors the transfer progress and updates transfer state in the state database. This information can be used to restart transfers after faults and re-

[Figure: Different interfaces for …]

Principal Globus Online data-transfer commands.

ls: List files and directories on an endpoint.
transfer: Request data transfer of one or more files or directories between endpoints; supports recursive directory transfer and rsync-like synchronization.
scp: Request data transfer of a single file or directory; syntax and semantics based on the secure copy utility to facilitate retargeting to GO of scripts that use scp for data movement.
status: List transfers initiated by the requesting user, along with summary information (such as status, start time, and completion time).
details: Provide details on a transfer (such as number of files transferred and number of faults).
events: List events associated with a specified transfer (such as start and stop).
cancel: Terminate a specified transfer or an individual file in a transfer.
wait: Wait for a specified transfer to complete; show progress bar.
Alter deadline for a transfer.
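The command set above implies a submit-and-poll workflow: transfer returns a task handle, status and wait follow it to completion, and cancel terminates it. A minimal stub, using an in-memory toy client rather than the real service (the class, method signatures, and state names are invented for illustration):

```python
import itertools

class StubGOClient:
    """Toy stand-in for a GO-like service, mimicking the command
    semantics in the table: transfer returns a task id, status reports
    task state, wait blocks until completion, cancel terminates a task."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._tasks = {}

    def transfer(self, src, dst, path):
        task_id = next(self._ids)
        self._tasks[task_id] = {"src": src, "dst": dst,
                                "path": path, "state": "ACTIVE"}
        return task_id

    def status(self, task_id):
        return self._tasks[task_id]["state"]

    def wait(self, task_id):
        # A real client would poll the service; the stub completes at once.
        self._tasks[task_id]["state"] = "SUCCEEDED"
        return self._tasks[task_id]["state"]

    def cancel(self, task_id):
        self._tasks[task_id]["state"] = "CANCELED"

go = StubGOClient()
task = go.transfer("siteA", "siteB", "/data/run42/")
print(go.status(task))   # ACTIVE
print(go.wait(task))     # SUCCEEDED
```

The hand-off pattern is the point: once transfer returns, the service owns the task, and the user's script merely observes or adjusts it.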