phase of a science experiment. The
physical design determines optimal
data organization and location, caching techniques, indexes, and other
performance-enhancing techniques.
All depend on the data-access pattern,
which is dynamic and changes
much more frequently than the logical
design; physical design automation
is therefore critical for efficient data
processing.
Considering the number of parameters involved in the physical design
of a scientific database, requiring the
database administrator to specify and
optimize the parameters for all these
techniques is unreasonable. Data
storage and organization must be automated.
All DBMSs today provide techniques for tuning databases. Though
the provision of these techniques is
a step in the right direction, existing
tools are insufficient for four main
reasons:
Precision. They require the query
workload to be static and precise;
Relational databases. They consider
only auxiliary structures to be built on
relational databases and do not consider other types of data organization;
Static database. They assume a
static database, so the statistics in the
database are similar to the statistics at
the time the tool is run; and
Query optimizer. They depend on
the query optimizer to direct their
search algorithms, making them slow
for large workloads.
Recent database research has addressed these inherent DBMS limitations. For example, some techniques
do not require prespecifying the
workload,[1] and others make the cost
model more efficient, enabling more
thorough search in the data space.[18]
However, they also fall short in several
areas; for example, they are not robust
to changes in database statistics
and do not consider data organization
other than relational data. Likewise,
data-organization methods for distributed data and network caches are
nascent today. Automatically utilizing
multiple processing units tuned for
data-intensive workloads to scale the
computation is a promising research
direction, and systems (such as GrayWulf[24]) apply this technique to achieve
scalability.
Physical and logical design-automation tools must consider all parameters and suggest optimal organization. The tools must be robust to small
variations in data and queries,
dynamically suggesting changes in
the data organization when the query
workload or data changes significantly.
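One way to picture the "robust to small variations, reactive to significant change" requirement is a drift check on the access pattern. The sketch below is illustrative only and is not taken from any existing tuning tool: it assumes each query can be summarized as the tuple of columns it touches, measures drift as the total variation distance between two access distributions, and uses a hypothetical threshold of 0.3. The function names (`access_distribution`, `drift`, `needs_retuning`) and the column names are invented for the example.

```python
from collections import Counter


def access_distribution(queries):
    """Return the fraction of column accesses that go to each column.

    Each query is assumed to be a tuple of the column names it touches.
    """
    counts = Counter(col for query in queries for col in query)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {col: n / total for col, n in counts.items()}


def drift(baseline, current):
    """Total variation distance between two access distributions.

    0.0 means identical access patterns; 1.0 means no columns in common.
    """
    cols = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(c, 0.0) - current.get(c, 0.0)) for c in cols)


def needs_retuning(baseline_queries, recent_queries, threshold=0.3):
    """Suggest re-running physical design only when drift exceeds the threshold.

    The threshold is an assumed tuning knob, not a recommended value.
    """
    d = drift(access_distribution(baseline_queries),
              access_distribution(recent_queries))
    return d > threshold


if __name__ == "__main__":
    baseline = [("ra", "dec")] * 10
    slightly_changed = [("ra", "dec")] * 9 + [("ra", "mag")]
    shifted = [("mag", "redshift")] * 10
    print(needs_retuning(baseline, slightly_changed))  # small variation
    print(needs_retuning(baseline, shifted))           # significant change
```

A real tool would also have to weigh the cost of reorganizing the data against the expected benefit; the sketch covers only the decision of when to reconsider the design.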
Online Processing
Most data-management techniques in
the scientific community are offline today; that is, they provide the full result
of the computation only after processing an entire data set. However, the
ever-growing volume of scientific data
means that even simple
processes, such as one-time data movement,
checksum computation, and verification of data integrity, might have to run
for days before completion.
Simple errors can take hours to
be noticed by scientists, and restarting the process consumes even more
time. Therefore, it is important that
all processing of scientific data be performed online. Converting the processes from offline to online provides
the following benefits:
Efficiency. Many operations can be
applied in a pipeline manner as data
is generated or moved around. The operations are performed on the data
while it is already in memory, which is
much closer to the CPU than a disk
or tape. Not having to read from the
disk and write computation results
back saves hours to days of scientific
work, giving scientists more time to
investigate the data.
Feedback. Giving feedback on the
operations performed on the scientific
data is important, because it allows
scientists to plan their analysis according
to the progress of the operation.
Modern DBMSs typically lack a progress
indicator for queries, hence scientists
running queries or other processes
on DBMSs are typically blind to
the completion time of their queries.
This blindness may lead to canceling
the query and issuing a different one
or abandoning the DBMS altogether.
DBMSs usually allow a query issuer
to compute the “cost” of a query in a
unit specific to the DBMS. This cost is
not very useful to scientists, since it
doesn’t correspond to actual running
time or account for the complete set of
resources (such as memory size, bandwidth,
and operation sharing) available
to the DBMS for running the query.
Operations, including querying/
updating data, should thus provide
real-time feedback about the query
progress to enable scientists to better
plan their experiments.
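The two benefits above can be illustrated together with a minimal sketch of an online data movement: each chunk is checksummed while it is still in memory, so no second pass over the data is needed, and a progress report with a rough time-remaining estimate is emitted after every chunk. This is an illustration under stated assumptions, not a DBMS progress estimator: the function name `copy_with_checksum`, the `report` callback, and the constant-throughput extrapolation are all choices made for the example.

```python
import hashlib
import io
import time


def copy_with_checksum(src, dst, total_bytes, chunk_size=1 << 20, report=None):
    """Copy src to dst in chunks, hashing each chunk as it passes through memory.

    After every chunk, report(done, total_bytes, eta_seconds) is called, where
    eta_seconds extrapolates from the average throughput observed so far
    (an assumption; real throughput is rarely constant).
    """
    digest = hashlib.sha256()
    done = 0
    start = time.monotonic()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)        # data movement
        digest.update(chunk)    # checksum on the same in-memory chunk
        done += len(chunk)
        elapsed = time.monotonic() - start
        rate = done / elapsed if elapsed > 0 else 0.0
        eta = (total_bytes - done) / rate if rate > 0 else float("inf")
        if report is not None:
            report(done, total_bytes, eta)
    return digest.hexdigest()


if __name__ == "__main__":
    data = b"x" * 2500
    checksum = copy_with_checksum(
        io.BytesIO(data), io.BytesIO(), len(data), chunk_size=1000,
        report=lambda done, total, eta: print(f"{done}/{total} bytes, ~{eta:.1f}s left"),
    )
    print(checksum)
```

Because the checksum is available as soon as the last chunk is written, a corrupted transfer can be detected at the end of the copy rather than after a separate verification run.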