The relational model of data management, which dates to 1970, still
dominates today and influences new paradigms as the field evolves.
Forty years ago this June, an article appeared in these pages that would shape the long-term direction of information technology like few other ideas
in computer science. The opening sentence of the article, “A Relational Model
of Data for Large Shared Data Banks,”
summed it up in a way as simple and elegant as the model itself: “Future users
of large data banks must be protected
from having to know how the data is organized in the machine,” wrote Edgar F.
Codd, a researcher at IBM.
And protect them it did. Programmers and users at the time dealt
mostly with crude homegrown database systems or commercial products
like IBM’s Information Management
System (IMS), which was based on a
low-level, hierarchical data model.
“These databases were very rigid, and
they were hard to understand,” recalls
Ronald Fagin, a Codd protégé and now
a computer scientist at IBM Almaden
Research Center. The hierarchical
“trees” in IMS were brittle. Adding a
single data element, a common occurrence, or even making tuning changes could
involve major reprogramming. In addition, the programming language
used with IMS was a low-level language
akin to an assembler.
But Codd’s relational model stored
data by rows and columns in simple
tables, which were accessed via a high-level data manipulation language
(DML). The model raised the level of
abstraction so that users specified what
they wanted, but not how to get it. And
when their needs changed, reprogramming was usually unnecessary. It was
similar to the transition 10 years earlier
from assembler languages to Fortran
and COBOL, which also raised the level
of abstraction so that programmers no
longer had to know and keep track of
details like memory addresses.
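To make the "what, not how" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The employee table, its columns, and its rows are hypothetical, invented only for illustration, and SQLite is just a convenient modern relational engine, not one of the systems discussed in this article.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Ada", "research", 120), ("Bob", "sales", 90), ("Cy", "research", 100)],
)

def names_in_dept(dept):
    # The query states WHAT is wanted (names in one department, sorted).
    # It says nothing about storage layout, pointers, or access paths.
    rows = conn.execute(
        "SELECT name FROM employee WHERE dept = ? ORDER BY name", (dept,)
    )
    return [name for (name,) in rows]

research = names_in_dept("research")

# The schema changes: a new column is added to the table.
conn.execute("ALTER TABLE employee ADD COLUMN office TEXT")

# The unchanged query still works; no reprogramming was needed.
research_after = names_in_dept("research")
```

The same function returns the same answer before and after the new column is added, which is the kind of insulation from storage details that the relational model promised.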
“People were stunned to learn that
complex, page-long [IMS] queries could
be done in a few lines of a relational
language,” says Raghu Ramakrishnan,
chief scientist for audience and cloud
computing at Yahoo!
Codd’s model came to dominate a
multibillion-dollar database market,
but it was hardly an overnight success. The model was just too simple
to work, some said. And even if it did
work, it would never run as efficiently
as a finely tuned IMS program, others
said. And although Codd’s relational
concepts were simple and elegant, his
mathematically rigorous languages,
relational calculus and relational algebra, could be intimidating.
In 1969, an ad hoc consortium called
CODASYL proposed a hierarchical database model built on the concepts behind IMS. CODASYL claimed that its approach was more flexible than IMS, but
it still required programmers to keep
track of far more details than the relational model did. It became the basis
for a number of commercial products,
including the Integrated Database Management
System (IDMS) from the company that would become Cullinet.
Contentious debates raged over the
models in the CS community through
much of the 1970s, with relational enthusiasts arrayed against CODASYL advocates while IMS users coasted along
on waves of legacy software.
As brilliant and elegant as the relational model was, it might have remained confined to computer science
curricula had it not been for three projects
aimed at real-world implementation of
the relational database management
system (RDBMS). In the mid-1970s,
IBM’s System R project and the University of California at Berkeley’s Ingres
project set out to translate the relational concepts into workable, maintainable, and efficient computer code.
Support for multiple users, locking,
logging, error recovery, and more were developed.
System R went after the lucrative
mainframe market with what would become DB2. In particular, System R produced the Structured Query Language
(SQL), which became the de facto standard language for relational databases.
Meanwhile Ingres was aimed at UNIX
machines and Digital Equipment Corp.
Then, in 1979, another watershed paper appeared. “Access Path Selection in
a Relational Database Management System,” by IBM System R researcher Patricia Selinger and coauthors, described
an algorithm by which a relational
system, presented with a user query,
could pick the best path to a solution
from multiple alternatives. It did that
by considering the total cost of the various approaches in terms of CPU time,
required disk space, and more.
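The general idea of cost-based path selection can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm from the paper: the AccessPath fields, the weighted-sum cost function, the cpu_weight value, and all of the numbers are assumptions made up for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AccessPath:
    # Hypothetical description of one way to answer a query.
    name: str
    io_cost: int   # assumed estimate of disk work
    cpu_cost: int  # assumed estimate of CPU work

def estimated_cost(path, cpu_weight=0.05):
    # Assumption: total cost is disk work plus weighted CPU work.
    return path.io_cost + cpu_weight * path.cpu_cost

def choose_path(paths, cpu_weight=0.05):
    # Pick the candidate with the lowest estimated total cost.
    return min(paths, key=lambda p: estimated_cost(p, cpu_weight))

# Made-up candidates for a single query.
candidates = [
    AccessPath("full table scan", io_cost=1000, cpu_cost=50000),
    AccessPath("index on dept", io_cost=120, cpu_cost=6000),
    AccessPath("index on salary", io_cost=400, cpu_cost=20000),
]

best = choose_path(candidates)
print(best.name)
```

The user's query never names a path. The system enumerates the alternatives, estimates what each would cost, and runs the cheapest one.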
“Selinger’s paper was really the piece
of work that made relational database
systems possible,” says David DeWitt,
director of Microsoft’s Jim Gray Systems