well. The research challenges include
language design, efficient compilers
and runtimes, and techniques to optimize code automatically across both
the horizontal distribution of parallel
processors and the vertical distribution of tiers. It seems natural that the
techniques behind parallel and distributed databases—partitioned dataflow
and cost-based query optimization—
should extend to new environments.
However, to succeed, these languages
must be fairly expressive, going beyond
simple MapReduce and select-project-join-aggregate dataflows. This agenda
will require “synthesis” work to harvest
useful techniques from the literature
on database and logic programming
languages and optimization, as well
as to realize and extend them in new
programming environments.
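The classic example of the needed expressiveness is recursion: graph reachability cannot be written as a single select-project-join-aggregate or MapReduce dataflow, yet it is natural in Datalog-style languages. The following minimal sketch (plain Python, purely illustrative and not tied to any particular system) shows the semi-naive fixpoint evaluation such a language implies; in a parallel setting, each round's join is a partitioned dataflow of exactly the kind cost-based optimizers already handle.

# A minimal sketch (hypothetical, not any particular system's API) of a
# recursive query: reachability over an "edge" relation, computed as a
# semi-naive fixpoint. A single select-project-join-aggregate or MapReduce
# dataflow cannot express this unbounded recursion directly.

def transitive_closure(edges):
    """edges: set of (src, dst) pairs; returns all reachable (src, dst) pairs."""
    reached = set(edges)
    delta = set(edges)              # facts newly derived in the last round
    while delta:
        # join the new facts with the base relation, then drop duplicates
        new = {(a, d) for (a, b) in delta for (c, d) in edges if b == c}
        delta = new - reached
        reached |= delta
    return reached

print(transitive_closure({(1, 2), (2, 3), (3, 4)}))
# {(1, 2), (2, 3), (3, 4), (1, 3), (2, 4), (1, 4)}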
To genuinely improve programmer
productivity, these new approaches
also need to pay attention to the softer issues that capture the hearts and
minds of programmers (such as attractive syntax, typing and modularity,
development tools, and smooth interaction with the rest of the computing ecosystem, including networks,
files, user interfaces, Web services,
and other languages). This work also
needs to consider the perspective of
programmers who want to keep working in their favorite programming languages and to use data services as primitives in those languages. Example code and practical
tutorials are also critical.
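As a hedged illustration of what "data services as primitives" might feel like to such a programmer, the sketch below expresses a join-plus-aggregate query as an ordinary comprehension in the host language; the customers and orders collections and their field names are invented for illustration, and a real binding would translate the same expression into a query plan executed by a remote data service.

# A hypothetical sketch of a data service exposed as a first-class value in a
# host language, so a query is just an ordinary expression. Collection and
# field names ("customers", "orders") are invented for illustration.

customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
orders = [{"cust_id": 1, "total": 120.0}, {"cust_id": 1, "total": 40.0},
          {"cust_id": 2, "total": 75.0}]

# A join plus aggregate written with the language's own comprehensions;
# a library could push the same expression down to a remote query processor.
spend = {
    c["name"]: sum(o["total"] for o in orders if o["cust_id"] == c["id"])
    for c in customers
}
print(spend)   # {'Ada': 160.0, 'Grace': 75.0}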
To execute successfully, database
research must look beyond its traditional boundaries and find allies throughout computing. This is a unique opportunity for a fundamental “reformation”
of the notion of data management, not
as a single system but as a set of services that can be embedded as needed in
many computing contexts.
Interplay of structured and unstructured data. A growing number of data-management scenarios involve both
structured and unstructured data.
Within enterprises, we see large heterogeneous collections of structured data
linked with unstructured data (such
as document and email repositories).
On the Web, we also see a growing
amount of structured data primarily
from three sources: millions of databases hidden behind forms (the deep
Web); hundreds of millions of high-
quality data items in HTML tables on
Web pages and a growing number of
mashups providing dynamic views on
structured data; and data contributed
by Web 2.0 services (such as photo and
video sites, collaborative annotation
services, and online structured-data
repositories).
A significant long-term goal for the database community is to transition from managing traditional databases consisting of well-defined schemata for structured business data to the much more challenging task of managing a rich collection of structured, semi-structured, and unstructured data spread over many repositories in the enterprise and on the Web—sometimes referred to as the challenge of managing dataspaces.

In principle, this challenge is closely related to the general problem of data integration, a longstanding area for database research. The recent advances in this area and the new issues due to Web 2.0 resulted in significant discussion at the Claremont meeting. On the Web, the database community has contributed primarily in two ways: First, it developed technology that enables the generation of domain-specific (“vertical”) search engines with relatively little effort; and second, it developed domain-independent technology for crawling through forms (that is, automatically submitting well-formed queries to forms) and surfacing the resulting HTML pages in a search-engine index. Within the enterprise, the database research community recently contributed to enterprise search and the discovery of relationships between structured and unstructured data.
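A minimal sketch of that form-crawling idea appears below; the form URL, field name, and candidate query values are hypothetical, and a real deep-Web surfacing system must also discover the forms and choose informative probe values automatically.

# A minimal sketch (hypothetical URL and form field) of "surfacing" the deep
# Web: probe a search form with candidate values and keep the result pages so
# a search engine can index them like ordinary static HTML.
import requests

FORM_URL = "http://example.org/search"          # hypothetical form endpoint
FIELD = "q"                                     # hypothetical form field name
candidate_values = ["toyota", "honda", "ford"]  # e.g., drawn from a seed corpus

surfaced_pages = {}
for value in candidate_values:
    resp = requests.get(FORM_URL, params={FIELD: value}, timeout=10)
    if resp.ok and resp.text:
        # store the rendered result page under its full URL for later indexing
        surfaced_pages[resp.url] = resp.text

print(f"surfaced {len(surfaced_pages)} result pages for indexing")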
The first challenge database researchers face is how to extract structure and meaning from unstructured and semistructured data. Information-extraction technology can now
pull structured entities and relationships out of unstructured text, even in
unsupervised Web-scale contexts. We
expect in coming years that hundreds
of extractors will be applied to a given
data source. Hence developers and
analysts need techniques for applying
and managing predictions from large
numbers of independently developed
extractors. They also need algorithms
that can introspect about the correctness of extractions and therefore
combine multiple pieces of extraction
evidence in a principled fashion.
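As one hedged illustration of principled evidence combination, suppose several independently developed extractors each report a confidence for the same candidate fact; under the (strong) assumption that their errors are independent, a noisy-or combination yields a single merged score. All facts and confidence values below are invented.

# A toy sketch of combining evidence from independently developed extractors.
# Assumes (strongly) that extractor errors are independent, so a noisy-or
# combination applies; the facts and confidence values are invented.
from collections import defaultdict

# (fact, confidence) pairs emitted by three hypothetical extractors
extractor_outputs = [
    [(("Acme Corp", "headquartered_in", "Boston"), 0.70)],
    [(("Acme Corp", "headquartered_in", "Boston"), 0.60),
     (("Acme Corp", "founded_in", "1985"), 0.40)],
    [(("Acme Corp", "founded_in", "1985"), 0.55)],
]

evidence = defaultdict(list)
for output in extractor_outputs:
    for fact, confidence in output:
        evidence[fact].append(confidence)

# noisy-or: P(fact) = 1 - product of (1 - p_i) over extractors emitting it
for fact, probs in evidence.items():
    miss = 1.0
    for p in probs:
        miss *= (1.0 - p)
    print(fact, round(1.0 - miss, 3))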
The database community is not alone in
these efforts; to contribute in this area,
database researchers should continue