The New Frontier
We are not the first to notice these
tides of change. In 1998, the leading
database researchers concluded that
database management systems were
becoming too complex and that automated configuration and management
were becoming essential.[2] Two years
later, Surajit Chaudhuri and Gerhard
Weikum proposed radically rethinking database management system
architecture.[4] They suggested that database management systems be made
more modular and that we broaden
our thoughts about data management
to include rather simple, component-based building blocks. Most recently,
Michael Stonebraker joined the chorus, arguing that “one size no longer
fits all,” and citing particular application examples where the conventional
RDBMS architecture is inappropriate.[8]
As argued by Stonebraker, the relational vendors have been providing the
illusion that an RDBMS is the answer to
any data management need. For example, as data warehousing and decision
support emerged as important application domains, the vendors adapted
products to address the specialized
needs that arise in these new domains.
They do this by hiding fairly different
data management implementations
behind the familiar SQL front end.
This model breaks down, however, as
one begins to examine emerging data
needs in more depth.
Data warehousing. Retail organizations now have the ability to record
every customer transaction, producing
an enormous data source that can be
mined for information about customers’
purchasing patterns, trends in product
popularity, geographical preferences, and
countless other phenomena that can be
exploited to increase
sales or decrease the cost of doing business. This database is read-mostly: it is
updated in bulk by periodically adding
new transactions to the collection, but
it is read frequently as analysts cull the
data, extracting useful tidbits. This application domain is characterized by
enormous tables (tens or hundreds
of terabytes), queries that access only
a few of the many columns in a table,
and a need to scan tables sorted in a
number of different ways.
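The advantage of specialized warehouse engines over a conventional row store can be sketched in a few lines. The mini-table below is invented for illustration (real warehouses hold terabytes), but it shows why queries that touch only a few of many columns favor a column-oriented layout:

```python
# Sketch: row-oriented vs. column-oriented layout for scan-heavy analytics.
# The table and query are hypothetical; only the access pattern matters.

rows = [  # row-oriented: each record stored contiguously
    ("2024-01-05", "store-12", "widget", 3),
    ("2024-01-05", "store-07", "gadget", 1),
    ("2024-01-06", "store-12", "widget", 2),
]
columns = {  # column-oriented: each attribute stored contiguously
    "date":    [r[0] for r in rows],
    "store":   [r[1] for r in rows],
    "product": [r[2] for r in rows],
    "qty":     [r[3] for r in rows],
}

# Analytic query touching two of four columns: total quantity per product.
# The row layout must read every field of every record...
by_product_rows = {}
for _, _, product, qty in rows:
    by_product_rows[product] = by_product_rows.get(product, 0) + qty

# ...while the column layout scans only the two columns involved.
by_product_cols = {}
for product, qty in zip(columns["product"], columns["qty"]):
    by_product_cols[product] = by_product_cols.get(product, 0) + qty

assert by_product_rows == by_product_cols
print(by_product_cols)  # {'widget': 5, 'gadget': 1}
```

With tens of terabytes and hundreds of columns, reading only the columns a query names is the difference between scanning gigabytes and scanning the whole table.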
Directory services. As organizations
become increasingly dependent upon
distributed resources and personnel,
the demand for directory services has
exploded.[3] Directory servers provide
fast lookup of entities arranged in a
hierarchical structure that frequently
matches the hierarchical structure of
an organization. The LDAP standard
emerged in the 1990s in response to the
heavyweight ISO X.400/X.500 directory
services. LDAP is now at the core of authentication and identity management
systems from a number of vendors (for
example, IBM Tivoli’s Directory Server,
Microsoft’s Active Directory Server, the
Sun ONE Directory Server). Like data
warehousing, LDAP is characterized by
read-mostly access. Queries are either
single-row retrieval (find the record
that corresponds to this user) or lookups based on attribute values (find all
users in the engineering department).
The prevalence of multivalued attributes makes a relational representation quite inefficient.
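The mismatch between multivalued attributes and the relational model can be made concrete with a small sketch (the entry and table names below are hypothetical, not any vendor's schema):

```python
# Sketch: a directory entry holds multivalued attributes inline, while a
# normalized relational encoding needs one extra table per such attribute.
# All names and values are invented for illustration.

entry = {
    "dn": "uid=jdoe,ou=engineering,o=example",
    "cn": ["Jane Doe"],
    "mail": ["jdoe@example.com", "jane.doe@example.com"],
    "telephoneNumber": ["+1 555 0100", "+1 555 0101"],
}

# Relational version: the multivalued attributes move into side tables
# keyed by user id, and reassembling one logical record requires joins.
users = [(1, "Jane Doe")]
user_mail = [(1, "jdoe@example.com"), (1, "jane.doe@example.com")]
user_phone = [(1, "+1 555 0100"), (1, "+1 555 0101")]

def reassemble(uid):
    # One directory read becomes three table accesses plus a join.
    name = next(n for (u, n) in users if u == uid)
    return {
        "cn": [name],
        "mail": [m for (u, m) in user_mail if u == uid],
        "telephoneNumber": [p for (u, p) in user_phone if u == uid],
    }

rec = reassemble(1)
assert rec["mail"] == entry["mail"]
```

Every multivalued attribute adds another table and another join, which is exactly the overhead a purpose-built directory server avoids.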
Web search. Internet search engines lie at the intersection of database
management and information retrieval. The objects upon which they operate are typically semistructured (that
is, HTML instead of raw text), but the
queries posed are most often keyword
lookups where the desired response is
a sorted list of possible answers. Practically all the successful search engines
today have developed their own data
management solution to this problem,
constructing efficient inverted indices
and highly parallelized implementations of index and lookup. This application is read-mostly with bulk updates
and nontraditional indexing.
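The core structure these engines build, an inverted index mapping each term to the documents containing it, can be sketched in miniature (the documents below are invented; production systems add ranking, compression, and massive parallelism):

```python
# Minimal inverted index: term -> posting set of document ids.
# Documents are illustrative; real engines index billions of pages.
from collections import defaultdict

docs = {
    1: "database systems store structured data",
    2: "search engines build inverted indices",
    3: "inverted indices map words to documents",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def lookup(*terms):
    """Keyword query: ids of documents containing all query terms."""
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(lookup("inverted", "indices"))  # -> [2, 3]
```

The bulk-update, read-mostly pattern shows up here too: the index is rebuilt or merged periodically, while lookups run continuously.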
Mobile device caching. The prevalence of small, mobile devices introduces yet another category of application: caching relevant portions of a larger dataset on a smaller, low-functionality device. While today’s users
think of their cell phone’s directory as
their own data collection, another view
might be to think of it as a cache of a
global phone and address directory.
This model has attractive properties—
in particular, the ability to augment
the local dataset with entries as they
are used or needed. Mobile telephony
infrastructure requires similar caching
capabilities to maintain communication channels to the devices. The access pattern observed in these caches
is also read-mostly, and the data itself
is completely transitory; it can be lost
and regenerated if necessary.
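The cache-of-a-global-directory view can be sketched as a read-through cache whose contents are disposable (all names here are hypothetical):

```python
# Sketch: a device-side cache over a global directory. Entries are
# fetched on demand, and the whole cache can be wiped and regenerated
# from the authoritative store. Names and numbers are invented.

GLOBAL_DIRECTORY = {  # stand-in for the network-resident directory
    "alice": "+1 555 0100",
    "bob": "+1 555 0199",
}

class DeviceCache:
    def __init__(self, backing):
        self.backing = backing
        self.local = {}  # the phone's "own" contact list

    def lookup(self, name):
        # Read-through: populate the local copy on first use.
        if name not in self.local:
            self.local[name] = self.backing[name]
        return self.local[name]

    def wipe(self):
        # Losing the cache loses no data; it is purely transitory.
        self.local.clear()

cache = DeviceCache(GLOBAL_DIRECTORY)
assert cache.lookup("alice") == "+1 555 0100"
cache.wipe()                                   # simulate losing device state
assert cache.lookup("alice") == "+1 555 0100"  # regenerated on demand
```

The `wipe`-then-`lookup` sequence captures the property noted above: the cached data is transitory and can always be reconstructed from the global source.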
XML management. Online transactions are increasingly being conducted
by exchanging XML-encoded documents. The standard solution today involves converting these documents into
a canonical relational organization,
storing them in an RDBMS, and then
converting again when one wishes to
use them. As more documents are created, transmitted, and operated upon in
XML, these translations become unnecessary, inefficient, and tedious. Surely
there must be a better way. Native XML
data stores with XQuery and XPath access patterns represent the next wave
of storage evolution. While new items
are constantly added to and removed
from an XML repository, the documents
themselves are largely read-only.
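Querying stored XML in place, rather than shredding it into tables and back, looks like the following sketch. Python's standard library supports only a small XPath subset, and the purchase-order document is invented for illustration:

```python
# Sketch: native XPath-style retrieval over an XML document, avoiding the
# XML -> relational -> XML round trip. Document contents are hypothetical.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<orders>
  <order id="1001">
    <customer>Acme</customer>
    <item sku="W-1" qty="3"/>
  </order>
  <order id="1002">
    <customer>Globex</customer>
    <item sku="G-7" qty="1"/>
  </order>
</orders>
""")

# The document structure stays intact; queries navigate it directly.
skus = [item.get("sku") for item in doc.findall(".//item")]
acme = doc.findall(".//order[customer='Acme']")

print(skus)                         # ['W-1', 'G-7']
print([o.get("id") for o in acme])  # ['1001']
```

A native XML store applies the same idea at scale, with indices over paths and values instead of a per-query tree walk.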
Stream processing. Stream processing is a bit of an outcast in this laundry list of data-intensive applications.