nature the human brain could otherwise never imagine. Scientists must
be able to manage data derived from
observations and simulations. Constant improvement of observational
instruments and simulation tools gives
modern science effective options for
abundant information capture, reflecting the rich diversity of complex
life forms and cosmic phenomena.
Moreover, the need for in-depth analysis of huge amounts of data relentlessly drives demand for additional
Microsoft researcher and ACM
Turing Award laureate Jim Gray once
said, “A fourth data-intensive science is emerging. The goal is to have
a world in which all of the science literature is online, all the science data
is online, and they interoperate with
each other.” 9 Unfortunately, today’s
commercial data-management tools
are incapable of supporting the unprecedented scale, rate, and complexity of scientific data collection and
Despite its variety, scientific data
does share some common features:
• Scale usually dwarfing the scale of transactional data sets;
• Generated through complex and
• Typically multidimensional;
• Embedded physical models;
• Important metadata about experiments and their provenance;
• Floating-point heavy; and
• Low update rates, with most updates append-only.
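Several of these features can be made concrete with a small sketch. The Python class below is purely illustrative and is not taken from any system discussed in this article: the name `AppendOnlyDataset`, its fields, and the sample instrument and run values are all hypothetical. It assumes a data set whose records are floating-point values located by multidimensional coordinates, carry provenance metadata, and are only ever appended.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AppendOnlyDataset:
    """Hypothetical container illustrating the features listed above."""
    dims: Tuple[str, ...]        # names of the dimensions, e.g. ("x", "y", "t")
    provenance: Dict[str, str]   # metadata about the experiment that produced the data
    _coords: List[Tuple[float, ...]] = field(default_factory=list)
    _values: List[float] = field(default_factory=list)

    def append(self, coord: Tuple[float, ...], value: float) -> None:
        """Add one measurement; existing records are never modified."""
        if len(coord) != len(self.dims):
            raise ValueError(
                f"expected {len(self.dims)} coordinates, got {len(coord)}"
            )
        self._coords.append(tuple(float(c) for c in coord))
        self._values.append(float(value))

    def __len__(self) -> int:
        return len(self._values)

    def mean(self) -> float:
        """Simple aggregate over all appended values."""
        if not self._values:
            raise ValueError("dataset is empty")
        return sum(self._values) / len(self._values)


# Usage with made-up values: two measurements in a three-dimensional space.
ds = AppendOnlyDataset(
    dims=("x", "y", "t"),
    provenance={"instrument": "example-sensor", "run": "42"},
)
ds.append((0.0, 1.0, 0.5), 2.0)
ds.append((0.5, 1.0, 1.0), 4.0)
print(len(ds), ds.mean())
```

Because the class exposes no update or delete operation, the append-only assumption is enforced by the interface itself. A real scientific DBMS would also have to address scale and embedded physical models, which this sketch does not attempt.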
Needed are generic, rather than one-off, DBMS
solutions automating storage and analysis of
data from scientific collaborations.
By Anastasia Ailamaki, Verena Kantere,
and Debabrata Dash
Data-oriented scientific processes depend on
fast, accurate analysis of experimental data generated
through empirical observation and simulation.
However, scientists are increasingly overwhelmed
by the volume of data produced by their own
experiments. With improving instrument precision
and the growing complexity of simulated models, data
overload promises to only get worse. The inefficiency
of existing database management systems (DBMSs)
for addressing the requirements of scientists has led
to many application-specific systems. Unlike their
general-purpose counterparts, these systems require
more resources, hindering reuse of knowledge. Still,
the data-management community aspires to general-purpose scientific data management. Here, we explore
the most important requirements of such systems and
the techniques being used to address them.
Observation and simulation of phenomena are key
to proving scientific theories and discovering facts of
Managing the enormous amount of
scientific data being collected is the key
to scientific progress.
Though technology allows for the
extreme collection rates of scientific
data, processing is still performed
with stale techniques developed for
small data sets; efficient processing
is necessary to be able to exploit the
value of huge scientific data collections.
Proposed solutions also promise
to achieve efficient management for
almost any other kind of data.