data, and then you have to plan which
of those access techniques makes
sense for any given kind of query.
The trick to cost-based query optimization is estimating a cost for each of the different ways of accessing the data and each of the different ways of joining information from multiple tables, and estimating the sizes of results, the savings you would get by having data in the buffer pools, the number of rows you'll actually touch if you use an index to access the data, and so forth.
The more deeply you can model
the cost of accessing the data, the
better the choice you’ll make for an
access path. What we did back in the
late 1970s was to invent this cost-based query optimization and provide
a model that was good enough for
searching a very large space of choices
within a reasonable amount of time,
then coming up with a very good cost
estimate and, therefore, a very good
access path.
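To make the idea concrete, here is a toy cost-model sketch in Python; the formulas and constants are illustrative assumptions only, not the System R or DB2 cost model. Each candidate access path gets an estimated I/O cost, and the optimizer picks the cheapest.

```python
# Toy cost model sketch (illustrative assumptions, not System R's or
# DB2's actual formulas): estimate an I/O cost for each candidate
# access path and choose the cheapest.

def table_scan_cost(total_pages: int) -> float:
    """Sequentially read every page of the table."""
    return float(total_pages)

def index_scan_cost(selectivity: float, total_rows: int, index_pages: int,
                    clustered: bool, total_pages: int) -> float:
    """Walk the matching fraction of the index, then fetch qualifying rows."""
    matching_rows = selectivity * total_rows
    if clustered:
        # Qualifying rows sit on contiguous data pages.
        data_io = selectivity * total_pages
    else:
        # Pessimistic: one data-page fetch per qualifying row.
        data_io = matching_rows
    return selectivity * index_pages + data_io

# Example: 1,000,000-row table on 10,000 pages; predicate matches 1% of rows.
candidates = {
    "table scan":        table_scan_cost(10_000),
    "clustered index":   index_scan_cost(0.01, 1_000_000, 50, True, 10_000),
    "unclustered index": index_scan_cost(0.01, 1_000_000, 50, False, 10_000),
}
best = min(candidates, key=candidates.get)
print(candidates, "-> choose:", best)
```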
Jh It’s amazing that, all these years later, this work remains the dominant approach to relational database
system query optimization. Cost-based
optimizers have been a real success
in technology transfer from research
to industry. Can you tell us a little bit
about why that was so successful?
Ps The quality of cost-based query
optimization has really made it possible for people to have relatively hands-free application development. That is,
the application developer doesn’t have
to know a huge amount about the layout of the data on disk and the exact
places where the records are and the
exact access paths to those records.
There’s a huge upside from the application productivity standpoint to being able to do really good cost-based
query optimization. So, there’s a compelling marketplace force for having
good cost-based query optimization.
I participated in inventing the System R query optimizer, which was
taken lock, stock, and barrel and put
into IBM’s DB2 relational database
product, where it has been continually enhanced. Many of the simplifying assumptions made to keep the problem tractable back in the late 1970s have been eliminated, and the model is now
deeper and richer and includes more
techniques for accessing the data.
It’s a growing science, and I think
that’s part of its success. It has been
able to grow and adapt as new inventions come along for accessing the
data, or for joining the data in different ways. All of the relational database
products have gone to a cost-based
query optimization approach and
have moved away from a rules-based
approach, which was really too inflexible to get you good performance all
the time.
The technology still has room to
grow—for example, when the data itself behaves differently from what the
model assumes. Many optimizers do
not model highly correlated data really well. For example, 90210 is a ZIP
code that’s only in California. ZIP
codes are not evenly distributed across
states, and there isn’t a 90210 in every
state of the union. For a user request,
nailing down the ZIP code to 90210 is sufficient; applying another predicate, such as state equals California, doesn’t change the result. It won’t reduce the number of rows, because the only 90210 is in California.
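A small numeric sketch of the point: under the textbook independence assumption, an optimizer multiplies per-column selectivities and badly underestimates the correlated case. The row counts and selectivities below are made up for illustration.

```python
# Illustration of the independence assumption with made-up numbers:
# multiplying per-column selectivities underestimates correlated predicates.

TOTAL_ROWS = 1_000_000

sel_zip   = 1 / 40_000   # assumed fraction of rows with zip = '90210'
sel_state = 1 / 50       # assumed fraction of rows with state = 'CA'

# Independence assumption: selectivities multiply.
est_independent = TOTAL_ROWS * sel_zip * sel_state   # 0.5 rows

# Reality for correlated columns: every 90210 row is already in California,
# so the state predicate removes nothing further.
est_correlated = TOTAL_ROWS * sel_zip                # 25 rows

print(f"independence estimate:      {est_independent:.1f} rows")
print(f"correlation-aware estimate: {est_correlated:.1f} rows")
```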
Jh One of the enemies of industrial query optimizers is complexity,
and that can sometimes yield a lack of query plan robustness. Small changes in the queries or in the data being queried can lead to substantially different
plans. Customers often ask me for a
good plan that is stable rather than a
near-optimal plan that changes frequently in unpredictable ways. What
direction should we be looking to
make progress on the optimal query
plan versus query plan robustness
problem?
Ps I think we have to approach it in
two ways. One is that you have to be
able to execute good plans, and during
the execution of a plan you want to notice when the actual data is deviating
dramatically from what you expected.
If you expected five rows and you’ve got
a million, chances are your plan is not
going to do well because you chose it
based on the assumption of five. Thus,
being able to correct mid-course is an
area of enhancement for query optimizers that IBM is pursuing.
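As a rough sketch of that mid-course check (the threshold and the re-planning hook here are assumptions, not any particular product's mechanism), a runtime monitor can compare actual row counts against the optimizer's estimate and flag the plan when they diverge badly:

```python
# Sketch of runtime cardinality feedback (threshold and re-planning hook
# are assumptions, not a specific product's mechanism): flag the plan when
# actual rows deviate far from the optimizer's estimate.

def should_reoptimize(estimated_rows: float, actual_rows: int,
                      factor: float = 100.0) -> bool:
    """True when the estimate is off by more than `factor` in either direction."""
    estimated_rows = max(estimated_rows, 1.0)
    ratio = actual_rows / estimated_rows
    return ratio > factor or ratio < 1.0 / factor

# The optimizer expected about 5 rows; the scan actually produced a million.
if should_reoptimize(estimated_rows=5, actual_rows=1_000_000):
    print("estimate badly wrong -> re-plan the remaining operators")
```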
Second, you have to continue to
deepen the model because you’ve got
to come up with reasonable plans before you can fine-tune them dynamically. Understanding the correlation
between rows or between columns in
different tables—as in the ZIP code example I gave before—is a very important part of continuing to understand the data more deeply and therefore being able to do even better query
optimization.
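One hypothetical way to deepen the model along these lines is a column-group statistic: track the joint distribution of a column pair, such as (state, ZIP), so a conjunctive predicate is estimated from the pair directly rather than by multiplying independent selectivities. The tiny table below is invented for illustration.

```python
# Hypothetical column-group statistic: count (state, zip) pairs so a
# conjunctive predicate on both columns is estimated from the pair itself.

from collections import Counter

rows = [("CA", "90210"), ("CA", "90210"), ("CA", "94301"),
        ("NY", "10001"), ("NY", "10001"), ("WA", "98101")]

pair_counts = Counter(rows)          # joint frequencies of (state, zip)
total_rows  = len(rows)

def pair_selectivity(state: str, zip_code: str) -> float:
    """Selectivity of `state = ? AND zip = ?` read from the pair statistic."""
    return pair_counts[(state, zip_code)] / total_rows

# No independence assumption needed: the pair statistic already reflects
# that 90210 occurs only with CA.
print(pair_selectivity("CA", "90210"))   # 2/6, about 0.33
```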
The fact is that as customers use
more and more shrink-wrapped packages or have ad hoc users who haven’t
gone to SQL school for a year, there’s a
real need to be able to do good query
optimization. You can’t have database
administrators running into the room,
saying, “Don’t hit Enter yet. I’ve got to
look at your query to see if it’s going to
be OK.” Outstanding cost-based query
optimization is critical to speeding
application productivity and lowering
total cost of ownership.
Jh Let’s look for a moment at query
optimization and where the technology can be taken beyond database
management systems. IBM and the industry as a whole have been investing in recent years in auto-tuning and
in autonomic computing. Do you see
a role for cost-based optimization in
this application area?
Ps Absolutely. Companies have a lot
of data that is quite well structured—
an order, a customer, an employee
record—but that’s maybe 15% of all of
the data that a company has. The rest
of it is in document files, it’s in XML,
it’s in pictures, it’s on Web pages—all
of this information also needs to be
managed. XML provides a mechanism
for being able to do that, but the data
isn’t quite as regularly structured. It’s
very dynamic. Every record could look
different from the next record even in