The schema of SDSS data includes
more than 70 tables, though most
user queries focus on only a few of
them, referring, as needed, to spectra
and images. The queries aim to spot
objects with specific characteristics,
similarities, and correlations. Patterns of query expression are also limited, featuring conjunctions of range
and user-defined functions in both
the predicate and the join clause.
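As a rough illustration of this query pattern, the sketch below runs a conjunction of range predicates plus a user-defined function in the join clause against an in-memory SQLite database. The table names (photo_obj, spec_obj), the columns, the sample rows, and the ang_sep_deg function are all hypothetical stand-ins chosen for the example; they are not the actual SDSS schema or its function library.

```python
import math
import sqlite3


def ang_sep_deg(ra1, dec1, ra2, dec2):
    """Approximate angular separation in degrees (small-angle formula)."""
    d_ra = (ra1 - ra2) * math.cos(math.radians((dec1 + dec2) / 2.0))
    return math.hypot(d_ra, dec1 - dec2)


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it (hypothetical UDF).
conn.create_function("ang_sep_deg", 4, ang_sep_deg)

# Hypothetical tables: photometric objects and spectroscopic objects.
conn.execute(
    "CREATE TABLE photo_obj (obj_id INTEGER PRIMARY KEY, ra REAL, dec REAL, r_mag REAL)"
)
conn.execute(
    "CREATE TABLE spec_obj (spec_id INTEGER PRIMARY KEY, ra REAL, dec REAL, redshift REAL)"
)
conn.executemany(
    "INSERT INTO photo_obj VALUES (?, ?, ?, ?)",
    [(1, 180.0, 0.0, 17.2), (2, 180.5, 0.1, 21.5), (3, 185.0, 2.0, 16.8)],
)
conn.executemany(
    "INSERT INTO spec_obj VALUES (?, ?, ?, ?)",
    [(10, 180.0001, 0.0001, 0.08), (11, 185.0001, 2.0001, 0.45)],
)

rows = conn.execute(
    """
    SELECT p.obj_id, s.spec_id
    FROM photo_obj AS p
    JOIN spec_obj AS s
      ON ang_sep_deg(p.ra, p.dec, s.ra, s.dec) < 0.001   -- UDF in the join clause
    WHERE p.r_mag BETWEEN 15.0 AND 18.0                  -- range predicate
      AND s.redshift BETWEEN 0.0 AND 0.1                 -- range predicate
    ORDER BY p.obj_id
    """
).fetchall()
print(rows)
```

Only the first photometric object survives: the third one also has a nearby spectrum, but its redshift falls outside the requested range.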
Simulation scientific data. Earth
science employs simulation models to help predict the motion of the
ground during earthquakes. Ground
motion is modeled with an octree-based hexahedral mesh [19] produced by
a mesh generator, using soil density
as input (see Figure 2). A “solver” tool
simulates the propagation of seismic
waves through the Earth by approximating the solution to the wave equation at each mesh node. During each
time step, the solver computes an
estimate of each node velocity in the
spatial directions, writing the results
to the disk. The result is a 4D spatio-temporal earthquake data set describing the ground’s velocity response.
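The per-time-step output described above can be sketched as follows. This is a minimal illustration, assuming a fixed-size binary record of three velocity components per node; the toy_solver_step function is a hypothetical placeholder that merely damps the velocities and does not solve the wave equation, and the file layout is an assumption made for the example rather than the format used by the actual solver.

```python
import os
import struct
import tempfile

# One record per node per time step: (vx, vy, vz) as little-endian doubles.
RECORD = struct.Struct("<3d")


def toy_solver_step(velocities, dt):
    """Hypothetical stand-in for the solver: damps every velocity component."""
    factor = 1.0 - dt
    return [(vx * factor, vy * factor, vz * factor) for vx, vy, vz in velocities]


def run(path, initial, n_steps, dt):
    """Advance the toy solver and append every node's velocity at each step."""
    velocities = list(initial)
    with open(path, "wb") as out:
        for _ in range(n_steps):
            velocities = toy_solver_step(velocities, dt)
            for v in velocities:
                out.write(RECORD.pack(*v))


def read_velocity(path, n_nodes, step, node):
    """Random access into the (time, node) grid via a computed byte offset."""
    offset = (step * n_nodes + node) * RECORD.size
    with open(path, "rb") as src:
        src.seek(offset)
        return RECORD.unpack(src.read(RECORD.size))


initial = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
path = os.path.join(tempfile.mkdtemp(), "velocities.bin")
run(path, initial, n_steps=3, dt=0.5)

file_size = os.path.getsize(path)          # steps * nodes * 24 bytes
v = read_velocity(path, n_nodes=2, step=1, node=1)
print(file_size, v)
```

Because every record has the same size, the node positions (three spatial dimensions) together with the step index (time) address the whole 4D data set without any additional lookup structure.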
Various types of analysis can be performed on the data set, employing
both time-varying and space-varying
queries. For example, a user might describe a feature in the ground-mesh,
and the DBMS finds the approximate
location of the feature in the simulation data set through multidimensional indexes.
Figure 3. Workflow of the ATLAS experiment.
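To make the two query styles concrete, here is a small sketch that uses a uniform grid as a simple multidimensional index over mesh nodes. The GridIndex class, the node coordinates, and the velocity values are all invented for illustration; a uniform grid is swapped in for simplicity and is not the indexing method of any particular DBMS or of the octree-based mesh itself.

```python
from collections import defaultdict


class GridIndex:
    """Uniform-grid spatial index mapping 3D cells to node ids (illustrative)."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)
        self.positions = {}

    def _cell(self, point):
        return tuple(int(c // self.cell_size) for c in point)

    def insert(self, node_id, point):
        self.positions[node_id] = point
        self.cells[self._cell(point)].append(node_id)

    def query_box(self, lo, hi):
        """Return ids of nodes inside the axis-aligned box [lo, hi]."""
        lo_c, hi_c = self._cell(lo), self._cell(hi)
        hits = []
        for i in range(lo_c[0], hi_c[0] + 1):
            for j in range(lo_c[1], hi_c[1] + 1):
                for k in range(lo_c[2], hi_c[2] + 1):
                    for node_id in self.cells.get((i, j, k), ()):
                        p = self.positions[node_id]
                        if all(l <= c <= h for l, c, h in zip(lo, p, hi)):
                            hits.append(node_id)
        return sorted(hits)


index = GridIndex(cell_size=1.0)
nodes = {0: (0.5, 0.5, 0.5), 1: (1.5, 0.5, 0.5), 2: (5.0, 5.0, 5.0)}
for node_id, point in nodes.items():
    index.insert(node_id, point)

# velocity[t][node_id] = (vx, vy, vz); made-up values for three time steps.
velocity = [
    {n: (float(t * (n + 1)), 0.0, 0.0) for n in nodes} for t in range(3)
]

# Space-varying query: all nodes in a region at one fixed time step.
region = index.query_box((0.0, 0.0, 0.0), (2.0, 1.0, 1.0))
snapshot = {n: velocity[2][n] for n in region}

# Time-varying query: one node's velocity across all time steps.
history = [velocity[t][1] for t in range(len(velocity))]
print(region, snapshot, history)
```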
Combined simulation and observational data. The ATLAS experiment
(http://atlas.ch/), a particle-physics
experiment at the Large Hadron Collider (http://lhc.web.cern.ch/lhc/) beneath the Swiss-French border near
Geneva, is an example of scientific
data processing that combines both
simulated and observed data. ATLAS
aims to make new discoveries
in the head-on collisions of two highly
energized proton beams. The entire
workflow of the experiment involves
petabytes of data and thousands of users from organizations the world over
(see Figure 3).
We first describe some of the major ATLAS data types. The raw data is the direct observational data of the particle
collisions. The detector’s output rate
is about 200 Hz, and raw data, or electrical signals, is generated at about
320 MB/sec, then reconstructed using
various algorithms to produce event
summary data (ESD). ESD has an object-oriented representation of the
reconstructed events (collisions), with
content intended to make access to
raw data unnecessary for most physics
applications. ESD is further processed
to create analysis object data (AOD),
a reduced event representation suitable for user analysis. Data volume
decreases gradually from raw to ESD
to AOD. Another important data type
is tag data, or event-level metadata,
stored in relational databases, designed to support efficient identification and selection of events of interest
to a given analysis.
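The role of tag data can be sketched with a small relational example: an analysis first queries the compact event-level metadata, then opens only the AOD files that hold the selected events. The event_tag table, its columns (n_muons, missing_et_gev, aod_file), the file names, and the selection cuts are all hypothetical and chosen only for illustration; they do not reflect the real ATLAS tag schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tag table: one small row of metadata per reconstructed event.
conn.execute(
    """
    CREATE TABLE event_tag (
        event_id       INTEGER PRIMARY KEY,
        n_muons        INTEGER,
        missing_et_gev REAL,
        aod_file       TEXT      -- where the full AOD record lives
    )
    """
)
conn.executemany(
    "INSERT INTO event_tag VALUES (?, ?, ?, ?)",
    [
        (1, 0, 12.0, "aod_000.dat"),
        (2, 2, 55.0, "aod_000.dat"),
        (3, 2, 8.0, "aod_001.dat"),
        (4, 3, 71.5, "aod_001.dat"),
    ],
)
# An index on the selection attributes keeps event lookup cheap.
conn.execute("CREATE INDEX idx_tag ON event_tag (n_muons, missing_et_gev)")

# Select events of interest using metadata only; no AOD file is read here.
selected = conn.execute(
    """
    SELECT event_id, aod_file
    FROM event_tag
    WHERE n_muons >= ? AND missing_et_gev > ?
    ORDER BY event_id
    """,
    (2, 50.0),
).fetchall()

# Only these files would need to be opened by the analysis job.
files_to_open = sorted({aod_file for _, aod_file in selected})
print(selected, files_to_open)
```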
Due to the complexity of the experiment and the project’s worldwide
scope, participating sites are divided
into multiple layers. The Tier-0 layer is
a single site—CERN itself—where the
detector is located and the raw data
is collected. The first reconstruction
of the observed electrical signals into
physics events is also done at CERN,
producing ESD, AOD, and tag data.
Tier-1 sites are typically large national
computing centers that receive replicated data from the Tier-0 site. Tier-1
sites are also responsible for reprocessing older data, as well as for storing the final results of Monte Carlo
simulations run at Tier-2 sites. Tier-2 sites
are mostly institutes and universities
providing computing resources for
Monte Carlo simulations and end-user analysis. All sites have pledged
computing resources, though the vast
majority is not dedicated to ATLAS or
to high-energy physics experiments.
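The division of labor among the tiers can be summarized as a small data structure. The sketch below encodes only the responsibilities stated in the text; the Tier class, the task labels, and the tiers_for helper are names invented for this example and do not correspond to any ATLAS software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    sites: str
    tasks: frozenset


# Responsibilities as described in the text; the labels are illustrative.
TIERS = (
    Tier(
        "Tier-0",
        "CERN (single site)",
        frozenset({"raw data collection", "first reconstruction"}),
    ),
    Tier(
        "Tier-1",
        "large national computing centers",
        frozenset({"replica storage", "reprocessing", "simulation result storage"}),
    ),
    Tier(
        "Tier-2",
        "institutes and universities",
        frozenset({"Monte Carlo simulation", "end-user analysis"}),
    ),
)


def tiers_for(task):
    """Return the names of the tiers responsible for a given task label."""
    return [tier.name for tier in TIERS if task in tier.tasks]


print(tiers_for("reprocessing"))
print(tiers_for("Monte Carlo simulation"))
```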
The Tier-0 site is both computation- and storage-intensive, since it stores
the raw data and performs the initial
event reconstruction. It also serves
data to the Tier-1 sites, with aggregate
sustained transfer rates for raw, ESD,
[Figure 3 labels: physics discovery; improving algorithms; data taking; reprocessing; MC simulation; analysis; data management of raw, ESD, and AOD at T0 and T1; observed data only; both observed and simulated data.]