barrodale.com/) but are not generic
enough for other file formats (such as
HDF and ROOT). Aiming for file manipulation, the Data Format Description Language Work Group project
dfdl-wg) is developing an XML-based
language for describing the metadata
of files. However, these approaches
do not provide complete support for
DBMS features on the files. MapReduce-based techniques for processing
data stored in files [17, 20] do not replicate
DBMS functionalities and are mainly
used for batch processing of files.
A management level on top of the
database and file system that manages structured and unstructured
data separately but transparently
seems more feasible in the near future. The challenge in developing this
approach is twofold. First, middleware must be constructed through
which users define their queries in a
declarative, high-level manner; the middleware
must also include a mechanism that transcribes each query into input for the DBMS or the file system and
routes it appropriately. Second, a query mechanism dedicated to the file
system must be developed. The benefit of a separate file-querying mechanism is that it includes only procedures targeted at querying, thereby
avoiding the complications introduced by the database mechanisms (insert,
delete, update) that serve other database
operations. However, the procedures
involved in the querying mechanism
must be designed and implemented
from scratch, an approach that precludes uniform querying of both
structured and unstructured data;
it also limits uniform query
optimization and query-execution efficiency. Future grid middleware promises to support such integration (http://www.ogsadai.org.uk/),
though the related research is still
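The routing role of the middleware described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the names `Query`, `DbmsBackend`, `FileBackend`, and `Middleware` are hypothetical, an in-memory SQLite database stands in for the DBMS, and a CSV string stands in for a scientific file. The user states a declarative query once; the middleware transcribes it for whichever back end holds the data.

```python
import csv
import io
import sqlite3
from dataclasses import dataclass


@dataclass
class Query:
    """Declarative request: a dataset, the columns wanted, and equality filters."""
    source: str
    columns: list
    where: dict


class DbmsBackend:
    """Hypothetical back end that transcribes a Query into SQL."""

    def __init__(self, conn):
        self.conn = conn

    def run(self, q):
        # Identifiers are trusted in this sketch; only the values are bound.
        cols = ", ".join(q.columns)
        cond = " AND ".join(f"{k} = ?" for k in q.where) or "1 = 1"
        sql = f"SELECT {cols} FROM {q.source} WHERE {cond}"
        return [tuple(r) for r in self.conn.execute(sql, list(q.where.values()))]


class FileBackend:
    """Hypothetical back end that answers a Query by scanning a CSV file."""

    def __init__(self, files):
        self.files = files  # dataset name -> CSV text (assumed file format)

    def run(self, q):
        reader = csv.DictReader(io.StringIO(self.files[q.source]))
        return [
            tuple(row[c] for c in q.columns)
            for row in reader
            if all(row[k] == str(v) for k, v in q.where.items())
        ]


class Middleware:
    """Routes each query to the back end registered for its dataset."""

    def __init__(self, catalog):
        self.catalog = catalog  # dataset name -> back end

    def execute(self, q):
        return self.catalog[q.source].run(q)


# Illustrative data: structured run metadata in the DBMS, raw events in a file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (run_id INTEGER, detector TEXT)")
conn.executemany("INSERT INTO runs VALUES (?, ?)", [(1, "A"), (2, "B")])
files = {"events": "run_id,energy\n1,10.5\n2,7.25\n1,3.0\n"}

mw = Middleware({"runs": DbmsBackend(conn), "events": FileBackend(files)})
print(mw.execute(Query("runs", ["run_id"], {"detector": "B"})))
print(mw.execute(Query("events", ["energy"], {"run_id": 1})))
```

The sketch also shows the cost noted above: the file back end reimplements filtering and projection from scratch, and nothing optimizes a query uniformly across the two back ends.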
Scientific data management suffers
from storage and processing limitations
that must be overcome for scientific
research to take on demanding
experimentation involving data
collection and processing. Future solutions
promise to integrate automation,
online processing, integration,
and file management, but data manipulation
must still address the diversity
of experimentation tasks across the
sciences, the complexity of scientific
data representation and processing,
and the volume of collected data and
metadata. Nevertheless, data-management
research in all these areas
suggests the inherent management
problems of scientific data will indeed
be addressed and solved.
We would like to express our gratitude
to Miguel Branco of CERN for contributing the dataflow of the ATLAS experiment, allowing us to demonstrate
a scientific application requiring both
observational and simulation data.
We would also like to acknowledge the
European Young Investigator Award
by the European Science Foundation
1. Bruno, N. and Chaudhuri, S. Online autoadmin:
Physical design tuning. In Proceedings of the
ACM International Conference on Management of
Data (Beijing, June 11–14). ACM Press, New York,
2. Buneman, P., Chapman, A., and Cheney, J.
Provenance management in curated databases.
In Proceedings of the ACM SIGMOD International
Conference on Management of Data (Chicago, June
27–29). ACM Press, New York, 2006, 539–550.
3. Buneman, P., Khanna, S., Tajima, K., and Tan, W.
Archiving scientific data. In Proceedings of the ACM
SIGMOD International Conference on Management
of Data (Madison, WI, June 3–6). ACM Press, New
York, 2002, 1–12.
4. Chervenak, A.L., Schuler, R., Ripeanu, M., Amer, M.A.,
Bharathi, S., Foster, I., Iamnitchi, A., and Kesselman,
C. The Globus Replica Location Service: Design and
experience. IEEE Transactions on Parallel and Distributed
Systems 20, 9 (Sept. 2009), 1260–1272.
5. Cohen, S., Hurley, P., Schulz, K.W., Barth, W.L., and
Benton, B. Scientific formats for object-relational
database systems: A study of suitability and
performance. SIGMOD Record 35, 2 (June 2006),
6. Cudre-Mauroux, P., Kimura, H., Lim, K., Rogers, J.,
Simakov, R., Soroush, E., Velikhov, P., Wang, D.L.,
Balazinska, M., Becla, J., DeWitt, D., Heath, B., Maier,
D., Madden, S., Patel, J., Stonebraker, M., and Zdonik,
S. A demonstration of SciDB: A science-oriented
DBMS. Proceedings of the VLDB Endowment 2, 2 (Aug.
7. Davidson, S.B. and Freire, J. Provenance and
scientific workflows: Challenges and opportunities.
In Proceedings of the ACM SIGMOD International
Conference on Management of Data (Vancouver, B.C.,
June 9–12). ACM Press, New York, 1345–1350.
8. Gray, J. and Thomson, D. Supporting Finite-Element
Analysis with a Relational Database Backend,
Parts I–III. MSR-TR-2005-49, MSR-TR-2006-21, MSR-TR-2005-151.
Microsoft Research, Redmond, WA,
9. Hey, T., Tansley, S., and Tolle, K. The Fourth
Paradigm: Data-Intensive Scientific Discovery.
Microsoft, Redmond, WA, Oct. 2009.
10. Ilyas, I.F., Markl, V., Haas, P., Brown, P., and
Aboulnaga, A. CORDS: Automatic discovery of
correlations and soft functional dependencies. In
Proceedings of the ACM SIGMOD International
Conference on Management of Data (Paris, June
13–18). ACM Press, New York, 2004, 647–658.
11. Kunszt, P.Z., Szalay, A.S., and Thakar, A.R. The
hierarchical triangular mesh. In Proceedings of the
MPA/ESO/MPE Workshop (Garching, Germany, July
31–Aug. 4). Springer, Berlin, 2000, 631–637.
Anastasia Ailamaki (email@example.com) is director of
the Data-Intensive Applications and Systems Laboratory
and a professor at Ecole Polytechnique Fédérale de
Lausanne, Lausanne, Switzerland, and an adjunct
professor at Carnegie Mellon University, Pittsburgh, PA.
Verena Kantere (firstname.lastname@example.org) is a
postdoctoral researcher in the Data-Intensive
Applications and Systems Laboratory at Ecole
Polytechnique Fédérale de Lausanne, Lausanne,
Debabrata Dash (email@example.com) is a Ph.D.
student in the Data-Intensive Applications and Systems
Laboratory at Ecole Polytechnique Fédérale de Lausanne,