ating systems, and it must provide explicit mechanisms to allow portability,
through simple interposition libraries
or source-code availability.
Even on a single platform, the developer makes architectural choices
that affect the database system. For example, a system may be built using: a
single thread of control; a collection of
cooperating processes, each of which
is single-threaded; multiple threads
of control in a single process; multiple
multithreaded processes; or a strictly
event-based architecture. These choices are driven by a combination of the
application’s requirements, the developer’s preferences, the operating system, and the hardware. The database
system must accommodate them.
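These architectural options can be illustrated with a minimal sketch. The `EmbeddedStore` class below is hypothetical (it is not any real engine's API); it shows how an embedded library can stay agnostic to the application's threading model, serving a single thread of control and a collection of worker threads through the same interface:

```python
import threading

class EmbeddedStore:
    """Hypothetical embedded key-value store. A real engine would
    persist to disk; here a single internal lock makes the same API
    safe under whatever threading model the application chooses."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)

store = EmbeddedStore()

# Single thread of control: plain sequential calls.
store.put("config", "v1")

# Multiple threads in one process: the same API, unchanged.
workers = [threading.Thread(target=store.put, args=(f"k{i}", i))
           for i in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()

print(store.get("k3"))  # -> 3
```

The point is not the locking strategy itself but that the library imposes no process or thread structure of its own; an event-driven or multi-process design could wrap the same interface.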
The database must also avoid making decisions about network protocols.
Since the database will run in environments where communication takes
place over backplanes, as well as environments where it takes place over
WANs, the developer should select
the appropriate communication infrastructure. A special-purpose telephone
switch chassis may include a custom
backplane and protocol for fast communication among redundant boards;
the database must not prevent the developer from using it.
Up to this point, configurability has
revolved around adapting to the hardware and software environment of the
application. The last area of configuration that we address revolves around
the application’s data. Data layout, indexing, and access are critical performance considerations. There are three
main design points with respect to data:
the physical clustering, the indexing
mechanism, and the internal structure
of items in the database. Some of these, such as the indexing mechanism, are really runtime configuration decisions; others are about giving the application the freedom to make its own design decisions rather than being forced into them by the database management system.
Database management systems designed for spinning magnetic media
expend considerable effort clustering
related data together on disk so that
seek and rotation times can be amortized by transferring a large amount of
data per repositioning event. In general, this clustering is good, as long as
the data is clustered according to the
correct criteria. In the case of a configurable database system, this means that
the developer needs to retain control
over primary key selection (as is done
in most relational database management systems) and must be able to ignore clustering issues if the persistent
medium either does not exist or does
not show performance benefits to accessing locations that are “close” to the
last access.
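As a rough illustration of why the clustering criterion matters, the sketch below (with a hypothetical block size and two layouts of the same ten records) counts how many disk blocks a range scan must touch when records are clustered in primary-key order versus left in arrival order:

```python
BLOCK = 4  # records per disk block (assumed for illustration)

# Ten records keyed 0..9: one layout clustered by primary key,
# one in (shuffled) arrival order.
clustered = list(range(10))
unclustered = [7, 2, 9, 0, 5, 1, 8, 3, 6, 4]

def blocks_touched(layout, lo, hi):
    """Which disk blocks hold keys in [lo, hi] under this layout?"""
    return {i // BLOCK for i, k in enumerate(layout) if lo <= k <= hi}

print(blocks_touched(clustered, 0, 3))    # one contiguous block: {0}
print(blocks_touched(unclustered, 0, 3))  # scattered across blocks
```

With key-order clustering, one repositioning event followed by a sequential transfer answers the range query; with arrival-order layout, the same query is spread over multiple blocks, and on media with no seek penalty the clustering effort buys nothing.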
On a related note, the developer
must be left the flexibility to select an
indexing structure for the primary keys
that is appropriate for the workload.
Workloads with locality of reference
are probably well served by B+ trees;
those with huge datasets and truly random access might be better off with
hash tables. Perhaps the data is highly
dimensional and requires a completely
different indexing structure; the extensibility discussed in the previous section should allow a developer to provide an application-specific indexing
mechanism and use it with all of the
system’s other features (for example,
locking, transactions). At a minimum,
the configurable database should provide a range of alternative indexing
structures that support iteration, fast
equality searches, and range searches,
including searches on partial keys.
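The trade-off can be sketched in a few lines. In this illustration a plain dict stands in for a hash index and a sorted key list searched with `bisect` stands in for a B+tree-style ordered index; only the ordered structure supports range scans and partial-key searches efficiently:

```python
import bisect

# Hypothetical records keyed by a string primary key.
keys = sorted(["ada", "adam", "alan", "grace", "linus"])

# Hash-style index: constant-time equality lookups, no ordering.
hash_index = {k: k.upper() for k in keys}

def range_search(lo, hi):
    """Ordered (B+tree-like) index: range scan via binary search."""
    i = bisect.bisect_left(keys, lo)
    j = bisect.bisect_right(keys, hi)
    return keys[i:j]

def prefix_search(prefix):
    """Partial-key search: all keys beginning with `prefix`."""
    i = bisect.bisect_left(keys, prefix)
    out = []
    while i < len(keys) and keys[i].startswith(prefix):
        out.append(keys[i])
        i += 1
    return out

print(hash_index["grace"])        # equality lookup -> GRACE
print(range_search("ada", "al"))  # -> ['ada', 'adam']
print(prefix_search("ad"))        # -> ['ada', 'adam']
```

A workload of pure random equality lookups never pays the ordering cost; a workload of range queries cannot do without it. The configurable engine lets the developer pick per dataset.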
Unlike relational engines, the configurable engine should permit the
programmer to determine the internal structure of its data items. If the
application has a dynamic or evolving
schema or must support ad hoc queries, then the internal structure should
be one that enables high-level query access such as SQL, XPath, XQuery, or LDAP. If, however, the schema is static
and the query set is known, selecting
an internal structure that maps more
directly to the application’s internal
data structures provides significant
performance improvements. For example, if an application’s data is inherently nonrelational (for example, containing multivalued attributes or large
chunks of unstructured data), then
forcing it into a relational organization simply to facilitate SQL access will
cost performance in the translation
and is unlikely to reap the benefits of
the relational store. Similarly, if the application’s data is relational, forcing
it into a different format (for example,
XML, object-oriented, among others)
would add overhead for no benefit. The
configurable engine must support storing data in the format that is most natural for the application. It is then the
programmer’s responsibility to select
the format that meets the “most natural” criteria.
New-Style Databases for New-Style Problems
Old-style database systems solve old-style problems; we need new-style databases to solve new-style problems.
While the need for conventional database management systems isn’t going away, many of today’s problems
require a configurable database system. Even without a crystal ball, it
seems clear that tomorrow’s systems
will also require a significant degree of
configurability. As programmers and
engineers, we learn to select the right
tool to do a job; selecting a database is
no exception. We need to operate in a
mode where we recognize that there
are options in data management, and
we should select the right tool to get
the job done as efficiently, robustly,
and simply as possible.
Margo I. Seltzer (margo@eecs.harvard.edu) is the
Herchel Smith Professor of Computer Science and a
Harvard College Professor in the Division of Engineering
and Applied Sciences at Harvard University, Cambridge,
MA. She is also a founder and CTO of Sleepycat Software,
the makers of Berkeley DB.
A previous version of this article appeared in the April 2005 issue of ACM Queue, Vol. 3, No. 3.