and so forth, and relational systems will get richer in what they can handle, but we’re not going to replace all of the technologies with any one single answer.
Jh Do you see content management systems of the future mostly layered on relational database systems, or do you see them as independent stores built using some of what we’ve learned over the past 30 years of working on relational technologies?
Ps I like the architecture in the DB2 Content Manager, where DB2 is the library server (the card catalogue, so to speak) and it uses some extra semantics in a system-level application surrounding DB2, with some new user-defined types and functions, and stored procedures implementing those applications. Then it has separate resource managers, each capable of handling a certain class of data types and styles: this kind of document, these kinds of images. The content could be physically stored either in DB2 as the library server or in some separate place or file system out on a number of different engines.
It gives you a flexible configuration. You can exploit as much as you like of the functions of DB2—XML, for example—or you can choose to use some of these repository managers. They may be less feature-rich but are expert in a particular kind of information and could be stored locally to where you need that data—particularly if it’s massive amounts of data, such as mass spectrometry results. Those are huge files and you want them close to where you’re doing the analysis.
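The split between a catalog and specialized stores can be illustrated with a minimal sketch. This is not the DB2 Content Manager API; the class names `LibraryServer` and `ResourceManager`, their methods, and the sample identifiers are all hypothetical, and the in-memory dictionaries merely stand in for a database and remote stores.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Hypothetical store specialized for one class of content."""
    name: str
    location: str  # e.g. the site where the data is analyzed
    _objects: dict = field(default_factory=dict)

    def put(self, object_id: str, payload: bytes) -> None:
        self._objects[object_id] = payload

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]


class LibraryServer:
    """Hypothetical catalog that records where each item is kept."""

    def __init__(self) -> None:
        self._managers = {}  # content type -> ResourceManager
        self._catalog = {}   # object id -> (content type, manager name)

    def register(self, content_type: str, manager: ResourceManager) -> None:
        self._managers[content_type] = manager

    def store(self, object_id: str, content_type: str, payload: bytes) -> None:
        # Route the payload to the manager for its type, then catalog it.
        manager = self._managers[content_type]
        manager.put(object_id, payload)
        self._catalog[object_id] = (content_type, manager.name)

    def locate(self, object_id: str):
        return self._catalog[object_id]

    def fetch(self, object_id: str) -> bytes:
        content_type, _ = self._catalog[object_id]
        return self._managers[content_type].get(object_id)


library = LibraryServer()
library.register("mass-spec", ResourceManager("spectra-store", "analysis-lab"))
library.store("run-42", "mass-spec", b"\x00\x01")
print(library.locate("run-42"))
```

The catalog holds only metadata, so a large payload can stay in whichever store sits closest to the analysis.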
Jh Given that relational stores now support XML and full-text search, what’s missing? Why haven’t extended relational systems had a bigger impact in the unstructured world?
Ps The semantics of content management go beyond just the data storage parts, the data storage engines, the DB2s of the world. There’s a significant set of other abstractions and management techniques that either have to go on top or have to come from a content management system that uses and exploits an extended relational engine but doesn’t solely rely on it.
For example, content management systems have the ability to allow Pat access to Chapter 1 of a document, and James access to Chapter 2, and Ed access to Chapters 1 and 2, at the sub-sub-document level. This is something that relational systems don’t do today. Similarly, foldering, the idea of document collections that really aren’t related by similar structure but are tied to some higher-level semantic content, is beyond what relational systems are undertaking at this point.
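The per-chapter access example can be sketched as follows. This is only an illustration under stated assumptions: `SectionAcl` and its methods are hypothetical names, and a real content manager would persist these grants and enforce them inside the store rather than in application code.

```python
from collections import defaultdict


class SectionAcl:
    """Hypothetical access-control list keyed by (document, section)."""

    def __init__(self):
        self._readers = defaultdict(set)

    def grant(self, user, document, section):
        self._readers[(document, section)].add(user)

    def can_read(self, user, document, section):
        # .get avoids creating empty entries for unknown sections.
        return user in self._readers.get((document, section), set())

    def visible_sections(self, user, document, sections):
        return [s for s in sections if self.can_read(user, document, s)]


acl = SectionAcl()
acl.grant("Pat", "report", "Chapter 1")
acl.grant("James", "report", "Chapter 2")
acl.grant("Ed", "report", "Chapter 1")
acl.grant("Ed", "report", "Chapter 2")

chapters = ["Chapter 1", "Chapter 2"]
for user in ("Pat", "James", "Ed"):
    print(user, acl.visible_sections(user, "report", chapters))
```

The unit of protection here is a section within a document, not a whole row or table.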
Jh Are there other areas where you see research needed for content managers and relational stores to improve and help customers manage a wider variety of data?
Ps If I were choosing today to do research or advanced development, there are a number of areas that are very, very interesting to me. There’s continued invention needed in the autonomics. What do you have to do to have a truly hands-free data system that could be embedded in anything? What do you have to do to have truly mass-parallelism at the millions-of-systems (e.g., Internet) level? As commodity hardware becomes smaller and smaller, can we link and talk to systems and compute things on a scale of millions, where today we’re at a technology level of thousands? How do you deal with data streams where the queries are fixed and the data is rushing by, and it could be unstructured data? How do you accumulate metadata and keep it up to date? How do you manage it, learn from it, derive information from it?
Searching is still in its first generation. There are lots of opportunities to make search better. If it knew you were angry when you typed in your three keywords to a search engine, would that help it understand what you were searching for? If it knew what email you had just seen before you typed those search keywords, would that help it understand what you were looking for? How can a search engine find what you intended as opposed to what you typed?
How reliable is derived information? There are many sources of unreliability. What if I have a source of information that’s right only half the time? How do I rate that information compared with another source who’s right all of the time? How do I join together that information, and what’s the level of confidence I have in the resulting joined information?
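One simple way to make the joined-confidence question concrete is sketched below. It rests on two strong assumptions that the interview does not state: each source's reliability can be expressed as a probability of being correct, and sources err independently, so a joined result is trusted only as much as the product of its inputs. The `Fact` type, the function name, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    confidence: float  # assumed probability that this fact is correct


def join_with_confidence(left, right):
    """Join facts on key; multiply confidences (independence assumed)."""
    right_by_key = {}
    for fact in right:
        right_by_key.setdefault(fact.key, []).append(fact)
    joined = []
    for lf in left:
        for rf in right_by_key.get(lf.key, []):
            joined.append((lf.key, lf.value, rf.value,
                           lf.confidence * rf.confidence))
    return joined


# A source that is right half the time, joined with one that is always right.
shaky = [Fact("acme", "Boston", 0.5)]
solid = [Fact("acme", "120 employees", 1.0)]
print(join_with_confidence(shaky, solid))
```

Under these assumptions the joined row is no more trustworthy than its weakest input. Correlated errors between sources would require a different model.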
All of those things, as we start dealing with unstructured data and incomplete answers and inexact answers and so forth, are great opportunities for research and advanced development.
Jh We’ve started to see open source having an increasingly large role in server-side computing. Specifically in the database world, we’ve now got a couple of open source competitors. Is open source a good thing for the database world?
Ps I love the idea of open source. I was the manager of the IBM Cloudscape team at the time that we contributed it to Apache, where it has become an incubator project under the name Derby. My dream is that this allows many more opportunities for using databases in places where people wouldn’t ordinarily go out and buy a database engine.
So open source can bring the benefits of the reliability, the recoverability, the set-oriented query capabilities to another class of applications—small businesses—and the ability to exploit the wonderful characteristics of database systems across a much richer set of applications. I think it’s good for the industry.
A previous version of this interview appeared in the April 2005 issue of ACM Queue.