a local disk, perform its computation, then transfer output data from
the local disk back to S3. Making multiple copies in this way can
reduce workflow performance.
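The stage-in, compute, stage-out pattern described above can be sketched as follows. This is only an illustration under stated assumptions, not the mechanism of any particular workflow system: a local directory stands in for the S3 bucket, and the names `run_task_with_staging`, `object_store`, and `uppercase` are hypothetical. The function returns the number of copies made purely for staging, which is the extra cost the text refers to.

```python
import shutil
import tempfile
from pathlib import Path


def run_task_with_staging(object_store, input_name, output_name, compute):
    """Stage input in, run compute on local disk, stage output back out.

    object_store is a directory standing in for an S3 bucket (assumption).
    Returns the number of file copies made purely for staging.
    """
    store = Path(object_store)
    copies = 0
    with tempfile.TemporaryDirectory() as scratch:
        local_in = Path(scratch) / input_name
        local_out = Path(scratch) / output_name

        # Stage in: object store -> local disk.
        shutil.copyfile(store / input_name, local_in)
        copies += 1

        # The task itself only ever reads and writes local files.
        compute(local_in, local_out)

        # Stage out: local disk -> object store.
        shutil.copyfile(local_out, store / output_name)
        copies += 1
    return copies


def uppercase(src, dst):
    """Toy task (hypothetical): write an upper-cased copy of src to dst."""
    dst.write_text(src.read_text().upper())
```

Every task pays for two copies in addition to its own reads and writes, so a workflow with many small tasks can spend a noticeable share of its time on staging alone.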
Another alternative would be to deploy a file system in the cloud
that could be used by the workflow. For example, in Amazon EC2, an
extra VM can be started to host an NFS file system and worker VMs
can mount that file system as a local partition. If better performance is
needed then several VMs can be started to host a parallel file system
such as PVFS [23, 52] or GlusterFS [16].
Although clouds like Amazon’s already provide several good alternatives to HPC systems for workflow computation, communication
and storage, there are still challenges to overcome.
Virtualization overhead. Although virtualization provides greater
flexibility, it comes with a performance cost. This cost comes from
intercepting and simulating certain low-level operating system calls
while the VM is running. In addition, there is the overhead of deploying and unpacking VM images before the VM can start. These overheads are critical for scientific workflows because in many cases the
entire point of using a workflow is to run a computation in parallel to
improve performance. Current estimates put the overhead of existing
virtualization software at around 10 percent [2, 15, 51], and VM
startup takes between 15 and 80 seconds, depending on the size
of the VM image [19, 32]. Fortunately, advances in virtualization technology, such as improved hardware-assisted virtualization, may
reduce or eliminate runtime overheads in the future.
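To make these figures concrete, the sketch below estimates the wall-clock time of one task on a freshly started VM. It is a rough model, assuming the runtime overhead is a constant fraction of native runtime and that the startup cost is paid once per task; the function names and default values are illustrative, with the defaults taken from the upper end of the ranges quoted above.

```python
def virtualized_runtime(native_seconds, runtime_overhead=0.10, startup_seconds=80.0):
    """Estimate wall-clock seconds for a task on a freshly started VM.

    runtime_overhead: fractional slowdown from virtualization (about 0.10 in the text).
    startup_seconds: time to deploy and boot the VM (15 to 80 s in the text).
    """
    if native_seconds < 0 or runtime_overhead < 0 or startup_seconds < 0:
        raise ValueError("inputs must be non-negative")
    return startup_seconds + native_seconds * (1.0 + runtime_overhead)


def slowdown(native_seconds, **kwargs):
    """Ratio of estimated virtualized time to native time."""
    if native_seconds <= 0:
        raise ValueError("native_seconds must be positive")
    return virtualized_runtime(native_seconds, **kwargs) / native_seconds
```

Under this model a 60-second task takes roughly 146 seconds, more than twice its native time, while a one-hour task takes roughly 4,040 seconds, about 12 percent longer. Startup cost therefore dominates for short tasks, and the runtime overhead dominates for long ones.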
Lack of shared or parallel file systems. Although clouds provide
many different types of shared storage systems, they are not typically
designed for use as file systems. For example, Amazon EBS does not
allow volumes to be mounted on multiple instances, and Amazon S3
does not provide a standard file system interface. To run on a cloud
like Amazon’s, a workflow application must either be modified to use
these different storage systems, which takes time, or it must create
its own file system using services available in the cloud, which is at
least difficult and potentially impossible depending on the file system
desired (for example, Lustre cannot be deployed on Amazon EC2
because it requires kernel modifications that EC2 does not allow).
Relatively slow networks. In addition to fast storage systems, scientific workflows rely on high-performance networks to transfer data
quickly between tasks running on different hosts. The HPC systems typically used for scientific workflows are built using high-bandwidth, low-latency networks such as InfiniBand [20] and Myrinet [27]. In comparison, most existing commercial clouds are equipped with commodity
gigabit Ethernet, which results in poor performance for demanding
workflow applications. Fortunately, the use of commodity networking
hardware is not a fundamental characteristic of clouds and it should be
possible to build clouds with high-performance networks in the future.
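The effect of link speed on data movement between tasks can be approximated with a simple model: transfer time is latency plus size divided by bandwidth. The sketch below assumes an ideal link with no protocol overhead or contention, so its results are lower bounds; the function name and the 10 Gbit/s comparison figure are illustrative assumptions, not measurements from any system discussed here.

```python
def transfer_seconds(size_bytes, bandwidth_bits_per_second, latency_seconds=0.0):
    """Lower-bound time to move size_bytes over a link of the given bandwidth."""
    if bandwidth_bits_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    if size_bytes < 0 or latency_seconds < 0:
        raise ValueError("size and latency must be non-negative")
    return latency_seconds + (size_bytes * 8) / bandwidth_bits_per_second


GIGABIT = 1e9          # commodity gigabit Ethernet, in bits per second
TEN_GIGABIT = 10e9     # hypothetical faster interconnect, for comparison only

one_gigabyte = 1e9     # bytes
on_gigabit = transfer_seconds(one_gigabyte, GIGABIT)
on_faster_link = transfer_seconds(one_gigabyte, TEN_GIGABIT)
```

Moving 1 GB takes at least 8 seconds on gigabit Ethernet under this model and a tenth of that on the hypothetical faster link. For a workflow that passes many such files between hosts, the difference accumulates across every transfer.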
Future Outlook
While many scientists can make use of existing clouds that were
designed with business users in mind, in the future we are likely to see
a great proliferation of clouds that have been designed specifically for
science applications. We already see science clouds being deployed at
traditional academic computing centers [14, 28, 30]. One can imagine
that these science clouds will be similar to existing clouds, but will
come equipped with features and services that are even more useful to
computational scientists. Like existing clouds, they will potentially
come in a variety of flavors depending on the level of abstraction
desired by the user.
Biographies
Gideon Juve is a PhD student in computer science at the University of
Southern California. His research interests include distributed and high-performance computing, scientific workflows, and computational science.
Ewa Deelman is a research associate professor at the University of
Southern California Computer Science Department and a project leader
at the USC Information Sciences Institute, where she heads the Pegasus
project, which designs and implements workflow mapping techniques for
large-scale workflows running in distributed environments.
References
1. Amazon. Elastic compute cloud. http://aws.amazon.com/ec2/.
2. Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., and Warfield, A. 2003. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems
Principles. 164-177.
3. Barish, B. C. and Weiss, R. 1999. LIGO and the detection of gravitational
waves. Physics Today 52. 44.
4. Berriman, G. B., Deelman, E., Good, J., Jacob, J., Katz, D. S., Kesselman, C.,
Laity, A., Prince, T. A., Singh, G., and Su, M.-H. 2004. Montage: A grid-enabled engine for delivering custom science-grade mosaics on demand. In
SPIE Conference 5487: Astronomical Telescopes.
5. Brown, D. A., Brady, P. R., Dietz, A., Cao, J., Johnson, B., and McNabb, J.
2006. A case study on the use of workflow technologies for scientific
analysis: Gravitational wave data analysis. In Workflows for e-Science,
Taylor, I., Deelman, E., Gannon, D., and Shields, M., Eds., Springer.
6. Callaghan, S., Maechling, P., Deelman, E., Vahi, K., Mehta, G., Juve, G.,
Milner, K., Graves, R., Field, E., Okaya, D., Gunter, D., Beattie, K., and
Jordan, T. 2008. Reducing time-to-solution using distributed high-throughput mega-workflows—Experiences from SCEC CyberShake.
Crossroads
www.acm.org/crossroads
Spring 2010/ Vol. 16, No. 3