and so forth, and relational systems
will get richer in what they can handle,
but we’re not going to replace all of
the technologies with any one single
answer.
Jh Do you see content management
systems of the future mostly layered
on relational database systems, or do
you see them as independent stores
built using some of what we’ve learned
over the past 30 years of working on relational technologies?
Ps I like the architecture in the DB2
content manager, where DB2 is the
library server—the card catalogue,
so to speak—and it uses some extra
semantics in a system-level application surrounding DB2 with some new
user-defined types and functions,
and stored procedures implementing
those applications. Then it has separate resource managers, each
capable of handling a certain class of
data types and styles: this kind
of document, those kinds of images.
The content can be physically stored either in the DB2 instance acting as the library server or
in some separate place or file system out
on a number of different engines.
It gives you a flexible configuration.
You can exploit as much as you like of
the functions of DB2—XML, for example—or you can choose to use some
of these repository managers. They
may be less feature-rich but are expert
in a particular kind of information
and could be stored locally to where
you need that data—particularly if
it’s massive amounts of data, such as
mass spectrometry results. Those are
huge files and you want them close to
where you’re doing the analysis.
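As a rough illustration of the catalogue-plus-stores layout described above, here is a minimal sketch. The class and method names (`LibraryServer`, `ResourceManager`, `store`, `locate`, `fetch`) are invented for this example and are not the DB2 Content Manager API; the point is only that the catalogue records where each item lives while the bytes sit in whichever store suits them.

```python
# Hypothetical sketch of a catalogue-plus-stores layout. All names here are
# invented for illustration; this is not the DB2 Content Manager API.


class ResourceManager:
    """Holds the actual bytes for one class of content (e.g., images)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


class LibraryServer:
    """The 'card catalogue': records which store holds each item."""

    def __init__(self) -> None:
        self._catalog: dict[str, ResourceManager] = {}

    def store(self, item_id: str, data: bytes, manager: ResourceManager) -> None:
        # Write the bytes to the chosen store, then record the location.
        manager.put(item_id, data)
        self._catalog[item_id] = manager

    def locate(self, item_id: str) -> str:
        return self._catalog[item_id].name

    def fetch(self, item_id: str) -> bytes:
        return self._catalog[item_id].get(item_id)


images = ResourceManager("image-store")
spectra = ResourceManager("lab-local-store")  # kept close to the analysis
library = LibraryServer()
library.store("scan-001", b"image bytes", images)
library.store("run-42", b"spectrometry data", spectra)
```

Asking `library.locate("run-42")` returns the name of the store near the lab, which is the flexibility the answer describes: large files stay where the analysis happens, while the catalogue remains the single place to look things up.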
Jh Given that relational stores now
support XML and full-text search,
what’s missing? Why haven’t extended relational systems had a bigger impact in the unstructured world?
Ps The semantics of content management go beyond just the data storage parts, the data storage engines, the
DB2s of the world. There’s a significant
set of other abstractions and management techniques that either have to go
on top or have to come from a content
management system that uses and exploits an extended relational engine
but doesn’t solely rely on it.
For example, content management
systems have the ability to allow Pat
access to Chapter 1 of a document,
and James access to Chapter 2, and
Ed access to Chapters 1 and 2, at the
sub-sub-document level. This is something that relational systems don’t do
today. Similarly, foldering, the idea
of document collections that
aren't related by similar structure but
are tied together by some higher-level semantic
content, is beyond what relational systems are undertaking at this point.
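The per-chapter access example can be sketched in a few lines. This is only an illustration under assumed names (`Document`, `grant`, `read`, `readable` are hypothetical, not taken from any content management product): access is checked per chapter rather than per document.

```python
# Hypothetical sketch: access control below the document level.
# The names Document, grant, read, and readable are illustrative only.
from dataclasses import dataclass, field


@dataclass
class Document:
    # Chapter number -> chapter text.
    chapters: dict[int, str]
    # User name -> set of chapter numbers that user may read.
    acl: dict[str, set[int]] = field(default_factory=dict)

    def grant(self, user: str, *chapter_numbers: int) -> None:
        """Allow `user` to read the given chapters."""
        self.acl.setdefault(user, set()).update(chapter_numbers)

    def read(self, user: str, chapter_number: int) -> str:
        """Return the chapter text, or raise if the user lacks access."""
        if chapter_number not in self.acl.get(user, set()):
            raise PermissionError(f"{user} may not read chapter {chapter_number}")
        return self.chapters[chapter_number]

    def readable(self, user: str) -> list[int]:
        """List the chapters `user` may read, in order."""
        return sorted(self.acl.get(user, set()))


doc = Document(chapters={1: "Chapter 1 text", 2: "Chapter 2 text"})
doc.grant("Pat", 1)
doc.grant("James", 2)
doc.grant("Ed", 1, 2)
```

With these grants, `doc.read("Pat", 1)` succeeds while `doc.read("Pat", 2)` raises `PermissionError`, and `doc.readable("Ed")` lists both chapters.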
Jh Are there other areas where you
see research needed for content managers and relational stores to improve
and help customers manage a wider
variety of data?
Ps If I were choosing today to do
research or advanced development,
there are a number of areas that are
very, very interesting to me. There’s
continued invention needed in the
autonomics. What do you have to do
to have a truly hands-free data system
that could be embedded in anything?
What do you have to do to have truly
massive parallelism at the millions-of-systems (e.g., Internet) level? As commodity hardware becomes smaller
and smaller, can we link and talk to
systems and compute things on a
scale of millions, where today we’re at
a technology level of thousands? How
do you deal with data streams where
the queries are fixed and the data is
rushing by, and it could be unstructured data? How do you accumulate
metadata and keep it up to date? How
do you manage it, learn from it, derive
information from it?
Searching is still in its first generation. There are lots of opportunities
to make search better. If it knew you
were angry when you typed in your
three keywords to a search engine,
would that help it understand what
you were searching for? If it knew what
email you had just seen before you
typed those search keywords, would
that help it understand what you were
looking for? How can a search engine
find what you intended as opposed to
what you typed?
How reliable is derived information? There are many sources of unreliability. What if I have a source of
information that’s right only half the
time? How do I rate that information
compared with another source that’s
right all of the time? How do I join together that information, and what’s
the level of confidence I have in the
resulting joined information?
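One naive way to put numbers on that question is sketched below. It rests on an assumption the interview does not make: that the sources err independently, so a joined fact is correct only when every input is correct. Under that assumption the confidences simply multiply. The function name `joined_confidence` is hypothetical, and real sources are often correlated, which is part of why this remains a research question.

```python
# Hypothetical sketch: confidence in a fact derived by joining sources.
# Assumption (not from the interview): the sources err independently, so the
# joined result is right only when every contributing source is right.


def joined_confidence(*source_reliabilities: float) -> float:
    """Multiply the per-source probabilities of being correct."""
    confidence = 1.0
    for p in source_reliabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("reliability must be a probability in [0, 1]")
        confidence *= p
    return confidence


half_right = 0.5    # a source that is right only half the time
always_right = 1.0  # a source that is right all of the time

print(joined_confidence(half_right, always_right))  # 0.5
print(joined_confidence(half_right, half_right))    # 0.25
```

Joining the half-right source with the always-right one leaves the result no more trustworthy than the weaker input, and joining two half-right sources drops the confidence to a quarter.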
All of those things, as we start dealing with unstructured data and incomplete answers and inexact answers and
so forth, are great opportunities for research and advanced development.
Jh We’ve started to see open source
having an increasingly large role in
server-side computing. Specifically in
the database world, we’ve now got a
couple of open source competitors. Is
open source a good thing for the database world?
Ps I love the idea of open source.
I was the manager of the IBM Cloudscape team at the time that we contributed it to Apache, where it has become an incubator project under the
name Derby. My dream is that this allows many more opportunities for using databases in places where people
wouldn’t ordinarily go out and buy a
database engine.
So open source can bring the benefits of the reliability, the recoverability,
the set-oriented query capabilities to
another class of applications—small
businesses—and the ability to exploit
the wonderful characteristics of database systems across a much richer set
of applications. I think it’s good for
the industry.
A previous version of this interview appeared in the April
2005 issue of ACM Queue.