By Gideon Juve and Ewa Deelman
In recent years, empirical science has been evolving from physical experimentation to computation-based research. In astronomy, researchers seldom spend time at a telescope, but instead access the large number of image databases that are created and curated by the community [42]. In bioinformatics, data repositories hosted by entities such as the National Institutes of Health [29] provide
the data gathered by Genome-Wide Association Studies and enable researchers to link particular
genotypes to a variety of diseases.
Besides public data repositories, scientific collaborations maintain
community-wide data resources. For example, in gravitational-wave
physics, the Laser Interferometer Gravitational-Wave Observatory [3]
maintains geographically distributed repositories holding time-series
data collected by the instruments and their associated metadata.
Along with the large increase in online data, the need to process these
data is growing.
In addition to traditional high-performance computing (HPC) centers, a nationwide cyberinfrastructure is being provided to the scientific community,
including the Open Science Grid (OSG) [36] and the TeraGrid [47].
Here, a cyberinfrastructure is a computational environment, usually distributed, that hosts a number of heterogeneous
resources; the term can refer to grids, clouds, or a mix of the two.
Infrastructures such as OSG and the TeraGrid, also known as grids [13], allow access to high-performance resources over wide area networks. For example, the
TeraGrid is composed of computational and data resources at Indiana
University, Louisiana University, University of Illinois, and others.
These resources are accessible to users for storing data and performing parallel and sequential computations. They provide remote login
access as well as remote data transfer and job scheduling capabilities.
Scientific workflows are used to bring together these various data
and compute resources and answer complex research questions.
Workflows describe the relationship of the individual computational
components and their input and output data in a declarative way. In
astronomy, scientists are using workflows to generate science-grade
mosaics of the sky [26], to examine the structure of galaxies [46], and,
in general, to understand the structure of the universe. In bioinformatics, researchers are using workflows to understand the underpinnings
of complex diseases [34, 44]. In earthquake science, workflows are used
to predict the magnitude of earthquakes within a geographic area over
a period of time [10]. In physics, workflows are used to search for gravitational waves [5] and model the structure of atoms [40]. In ecology,
scientists use workflows to explore the issues of biodiversity [21].
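To make the idea of a declarative description concrete, the following is a minimal sketch in Python and not the format of any particular workflow system. Each task lists only the files it reads and writes; the task names and file names are hypothetical. The dependencies between tasks, and a valid execution order, are then derived from those declarations rather than written out by hand.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks: each one declares only its input and output files.
tasks = {
    "reproject_a": {"inputs": ["a.fits"], "outputs": ["a_proj.fits"]},
    "reproject_b": {"inputs": ["b.fits"], "outputs": ["b_proj.fits"]},
    "combine": {
        "inputs": ["a_proj.fits", "b_proj.fits"],
        "outputs": ["mosaic.fits"],
    },
}


def dependencies(tasks):
    """Map each task to the set of tasks that produce its input files."""
    producer = {
        out: name for name, task in tasks.items() for out in task["outputs"]
    }
    return {
        name: {producer[f] for f in task["inputs"] if f in producer}
        for name, task in tasks.items()
    }


deps = dependencies(tasks)

# Any order in which every task runs after the tasks it depends on.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Because the two reprojection tasks share no files, nothing forces one to run before the other, so a workflow system would be free to run them in parallel on different resources.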
Today, workflow applications are running on national and international cyberinfrastructures such as OSG, TeraGrid, and EGEE [11].
The broad spectrum of distributed computing provides unique opportunities for large-scale, complex scientific applications in terms of
resource selection, performance optimization, and reliability. In addition to the large-scale cyberinfrastructure, applications can target
campus clusters, or utility computing platforms such as commercial
[1, 17] and academic clouds [31].
However, these opportunities also bring with them many challenges. It’s hard to decide which resources to use and how long they will
be needed. It’s hard to determine what the cost-benefit tradeoffs are
when running in a particular environment. And it’s difficult to achieve
good performance and reliability for an application on a given system.
Workflow Applications
Scientific workflows are being used today in a number of disciplines.
They stitch together computational tasks so that they can be executed
automatically and reliably on behalf of the researcher. In astronomy, for example, the Montage workflows used to generate mosaics of the sky
are composed of a number of image-processing applications that discover the geometry of the input images on the sky, calculate the geometry of the output mosaic on the sky, re-project the flux in the input images to conform to the geometry of the output mosaic, model the
background radiation in the input images to achieve common flux scales
and background levels across the mosaic, and rectify the background so that
all constituent images conform to a common background level.
These normalized images are added together to form the final mosaic.
Figure 1 shows a mosaic of the Rho Oph dark cloud created using
this workflow.
Montage mosaics can be constructed in different sizes, which dictate the number of images and computational tasks in the workflow.
For example, a 4-degree square mosaic (the moon is 0.5 degrees
square) corresponds to a workflow with approximately 5,000 tasks and
750 input images. Workflow management systems enable the efficient
and reliable execution of these tasks and manage the data products
they produce (both intermediate and final).
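The link between mosaic size and task count can be illustrated with a rough sketch. The stage structure below is an assumption made for illustration, not Montage's actual task layout: one reprojection and one background correction per image, one fitting task per pair of overlapping images (a stage introduced here), and a single background-modeling task and a single addition task. Input images are assumed to tile a regular grid and to overlap only with their horizontal and vertical neighbors. The function name and stage names are hypothetical, and the counts are not meant to reproduce the figures quoted above.

```python
from itertools import combinations


def montage_like_task_counts(rows, cols):
    """Count tasks per stage for a rows x cols grid of input images.

    Assumed structure (illustrative only): per-image reprojection and
    background correction, one fit per overlapping pair of images, and
    single tasks for background modeling and for the final addition.
    """
    images = [(r, c) for r in range(rows) for c in range(cols)]
    # Assume two images overlap only if they are direct grid neighbors.
    overlaps = [
        (a, b)
        for a, b in combinations(images, 2)
        if abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
    ]
    return {
        "reproject": len(images),
        "fit_overlap": len(overlaps),
        "model_background": 1,
        "correct_background": len(images),
        "add": 1,
    }


for size in (2, 3):
    counts = montage_like_task_counts(size, size)
    print(size * size, "images ->", sum(counts.values()), "tasks")
```

Even in this simplified model, the task count grows faster than the image count because the number of overlapping pairs grows along with the grid, which is one reason larger mosaics benefit from automated workflow management.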
Figure 2 shows a graphical representation of a small Montage
workflow containing 1,200 computational tasks. Workflow management systems such as Pegasus [4, 9, 39] orchestrate the execution of
these tasks on desktops, grids, and clouds.
Another example is from the earthquake science domain, where
researchers use workflows to generate earthquake hazard maps of
www.acm.org/crossroads
Crossroads