right data regions for testing hypotheses and drawing conclusions. The key
to efficient data processing is a carefully designed database, which is why
automated physical database design
is the subject of recent research
(discussed in the following section). In addition, there is an imminent need for
online data processing (discussed in
the section after that).
Automation
Errors and inefficiencies due to human-handled physical database design are common in both metadata
management and data processing.
Much recent research has focused on
automating procedures for these two
phases of scientific data management.
Metadata management. Metadata
processing involves determining the
data model, annotations, experimental setup, and provenance. The data
model can be generated automatically
by finding dependencies between different attributes of data. 10 However,
experimenters typically determine
the model since this is a one-time process, and dependencies (such as A = πr²) are easily identified at the attribute level.
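As a rough illustration of what finding dependencies between attributes can mean in practice, the following sketch tests every ordered pair of attributes for a functional dependency on a small sample. The record layout, the attribute names, and the function names are assumptions made for this example and are not taken from the cited work; a dependency found this way is only consistent with the sample, not proven for the full data set.

```python
from itertools import permutations


def holds(rows, lhs, rhs):
    """Return True if every value of `lhs` maps to exactly one value of `rhs`."""
    seen = {}
    for row in rows:
        key, value = row[lhs], row[rhs]
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True


def find_dependencies(rows):
    """List all single-attribute dependencies (lhs, rhs) consistent with rows."""
    attributes = list(rows[0].keys())
    return [(a, b) for a, b in permutations(attributes, 2) if holds(rows, a, b)]


# Hypothetical sample: the area column is derived from the radius column.
rows = [
    {"sample": "s1", "radius": 1.0, "area": 3.14},
    {"sample": "s2", "radius": 2.0, "area": 12.57},
    {"sample": "s3", "radius": 1.0, "area": 3.14},
]
deps = find_dependencies(rows)
```

Here `radius` determines `area`, while `radius` does not determine `sample`, because two samples share the same radius. The pairwise check is quadratic in the number of attributes and ignores multi-attribute dependencies, which is one reason experimenters usually specify the model by hand.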
Annotations are meta-information
about the raw scientific data and are especially important if the data is not numeric. For example, annotations are
used in biology and astronomy image
data. Given the vast scale of scientific
data, automatically generating these
annotations is essential. Current automated techniques for gathering annotations from documents involve
machine-learning algorithms that learn the annotations from a set of
pre-annotated documents. 14 Similar
techniques are applied to images
and other scientific data but must be
scaled to terabyte or petabyte scale.
Once annotations are built, they can
be managed through a DBMS.
Experimental setups are generally recorded in notebooks, both paper and electronic, then converted to
query-able digital records. The quality
of such metadata is typically enforced
through policies that must be as automated as possible. For example,
when data is collected from instruments, instrument parameters can be
recorded automatically in a database.
For manually generated data, the policies must be enforced automatically.
For the ATLAS experiment, the parameters of the detectors, as well as
the influence of external magnetic devices and collider configurations, are
all stored automatically as metadata.
Some policies can be enforced automatically through a knowledge base
of logical statements; the rest can
be verified through questionnaires.
Many commercial tools are available
for validating policies in the enterprise scenario, and the scientific community can borrow technology from
them to automate the process (http://www.compliancehome.com/).
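One way to picture a knowledge base of logical statements is as a list of named predicates that each metadata record must satisfy. The sketch below is a minimal illustration under that assumption; the field names (`operator`, `temperature_k`, `calibrated_at`, `measured_at`) and the three rules are hypothetical and do not come from any particular experiment or tool.

```python
def has_operator(record):
    """The person who ran the experiment must be recorded."""
    return bool(record.get("operator"))


def positive_temperature(record):
    """Temperature is stored in kelvin, so it must be a positive number."""
    value = record.get("temperature_k")
    return isinstance(value, (int, float)) and value > 0


def calibrated_before_measurement(record):
    """Both timestamps must be present, with calibration not after measurement."""
    calibrated = record.get("calibrated_at")
    measured = record.get("measured_at")
    # ISO 8601 timestamps in the same format compare correctly as strings.
    return bool(calibrated and measured) and calibrated <= measured


# Hypothetical policy knowledge base: (description, predicate) pairs.
POLICIES = [
    ("operator is recorded", has_operator),
    ("temperature is positive", positive_temperature),
    ("calibration precedes measurement", calibrated_before_measurement),
]


def violations(record):
    """Return the descriptions of all policies the record fails."""
    return [name for name, rule in POLICIES if not rule(record)]


good = {
    "operator": "jdoe",
    "temperature_k": 293.15,
    "calibrated_at": "2010-03-01T08:00:00",
    "measured_at": "2010-03-01T09:00:00",
}
bad = {
    "operator": "",
    "temperature_k": -5,
    "calibrated_at": "2010-03-02T09:00:00",
    "measured_at": "2010-03-01T09:00:00",
}
```

Calling `violations(good)` returns an empty list, while `violations(bad)` names every failed policy. Rules that cannot be expressed as predicates over recorded fields are the ones that would fall back to questionnaires.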
Provenance data includes experimental parameters and task history
associated with the data. Provenance
can be maintained for each data entry or for each data set. Since workload management tracks all tasks applied to the data, it can automatically
tag the data with task information. Hence,
automating provenance is the most
straightforward of the metadata-processing automation tasks. The enormous volume of automatically collected metadata complicates
the effort to identify the subset of metadata relevant to the processing task
at hand. Some research systems are
capable of automatically managing a
DBMS’s provenance information. 2
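The idea that a workload manager can tag data with task information can be sketched as a wrapper that runs each task and appends a record of it to the result. Everything named below (`TrackedDataset`, `run_task`, and the `scale` and `clip` tasks) is hypothetical and serves only to illustrate per-data-set provenance; it does not describe any specific research system.

```python
from datetime import datetime, timezone


class TrackedDataset:
    """A data set that carries the history of tasks applied to it."""

    def __init__(self, name, values):
        self.name = name
        self.values = values
        self.provenance = []


def run_task(dataset, task, **params):
    """Apply `task` to the data set and record it in the result's provenance."""
    result = TrackedDataset(dataset.name, task(dataset.values, **params))
    # Copy the input history and append this task; the input is left unchanged.
    result.provenance = dataset.provenance + [{
        "task": task.__name__,
        "params": params,
        "started_at": datetime.now(timezone.utc).isoformat(),
    }]
    return result


def scale(values, factor):
    return [v * factor for v in values]


def clip(values, upper):
    return [min(v, upper) for v in values]


raw = TrackedDataset("run-42", [1.0, 2.0, 3.0])
processed = run_task(run_task(raw, scale, factor=2.0), clip, upper=5.0)
task_names = [entry["task"] for entry in processed.provenance]
```

After the two calls, `processed.provenance` lists the `scale` and `clip` tasks in order, each with its parameters and a timestamp. Recording one such entry per task on every data set is also how the volume of collected metadata grows quickly.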
Data processing. Data processing
depends on how data is physically
organized. Commercial DBMSs usually offer a number of options for determining how to store and access data.
Since scientific data might come in
petabyte-scale quantities and many
scientists work on the same data simultaneously, the requirements for
efficient data organization and retrieval are demanding. Furthermore,
the data might be distributed or replicated in multiple geographically
dispersed systems; hence, network
resources play an important role in
facilitating data access. Possibly hundreds or thousands of scientists could
simultaneously query a petabyte-scale
database over the network, requiring
more than 1GB/sec of bandwidth.
To speed data access, the database
administrator might have to tune several parameters, changing either the data’s
logical design (by normalizing the
data) or its physical design. The logical design is determined by the data
model in the metadata-processing