data, and then you have to plan which of those access techniques makes sense for any given kind of query.
The trick to cost-based query optimization is estimating a cost for each of the different ways of accessing the data, each of the different ways of joining information from multiple tables, and estimating the sizes of results and the savings that you would get by having data in the buffer pools, estimating the number of rows you’ll actually touch if you use an index to access the data, and so forth.
The more deeply you can model the cost of accessing the data, the better the choice you’ll make for an access path. What we did back in the late 1970s was to invent this cost-based query optimization and provide a model that was good enough for searching a very large space of choices within a reasonable amount of time, then coming up with a very good cost estimate and, therefore, a very good access path.
Jh It’s amazing that this number of years later, this work remains the dominant approach to relational database system query optimization. Cost-based optimizers have been a real success in technology transfer from research to industry. Can you tell us a little bit about why that was so successful?
Ps The quality of cost-based query optimization has really made it possible for people to have relatively hands-free application development. That is, the application developer doesn’t have to know a huge amount about the layout of the data on disk and the exact places where the records are and the exact access paths to those records. There’s a huge upside from the application productivity standpoint to being able to do really good cost-based query optimization. So, there’s a compelling marketplace force for having good cost-based query optimization.
I participated in inventing the System R query optimizer, which was taken lock, stock, and barrel and put into IBM’s DB2 relational database product where it has been continually enhanced. Many of the simplifying assumptions to make the problem trackable back in the late 1970s have been eliminated, and the model is now deeper and richer and includes more techniques for accessing the data.
It’s a growing science, and I think that’s part of its success. It has been able to grow and adapt as new inventions come along for accessing the data, or for joining the data in different ways. All of the relational database products have gone to a cost-based query optimization approach and have moved away from a rules-based approach, which was really too inflexible to get you good performance all the time.
The technology still has room to grow—for example, when the data itself behaves differently from what the model assumes. Many optimizers do not model highly correlated data really well. For example, 90210 is a ZIP code that’s only in California. ZIP codes are not evenly distributed across states, and there isn’t a 90210 in every state of the union. For a user request, nailing down the ZIP code to 90210 is sufficient and applying another predicate, such as state equals California, doesn’t change the result. It won’t reduce the number of rows because the only 90210 is in California.
Jh One of the enemies of industrial query optimizers is complexity, and that can sometimes yield lack of query plan robustness. Small changes in the queries or in the data being que-ried can lead to substantially different plans. Customers often ask me for a good plan that is stable rather than a near-optimal plan that changes frequently in unpredictable ways. What direction should we be looking to make progress on the optimal query plan versus query plan robustness problem?
Ps I think we have to approach it in two ways. One is that you have to be able to execute good plans, and during the execution of a plan you want to notice when the actual data is deviating dramatically from what you expected. If you expected five rows and you’ve got a million, chances are your plan is not going to do well because you chose it based on the assumption of five. Thus, being able to correct mid-course is an area of enhancement for query optimizers that IBM is pursuing.
Second, you have to continue to deepen the model because you’ve got to come up with reasonable plans before you can fine-tune them dynamically. Understanding the correlation between rows or between columns in different tables—noting the ZIP code example I gave before—is a very important part of continuing to understand the data more deeply and therefore being able to do even better query optimization.
The fact is that as customers use more and more shrink-wrapped packages or have ad hoc users who haven’t gone to SQL school for a year, there’s a real need to be able to do good query optimization. You can’t have database administrators running into the room, saying, “Don’t hit Enter yet. I’ve got to look at your query to see if it’s going to be OK.” Outstanding cost-based query optimization is critical to speeding application productivity and lowering total cost of ownership.
Jh Let’s look for a moment at query optimization and where the technology can be taken beyond database management systems. IBM, and the industry as a whole, have been investing in recent years in auto-tuning and in autonomic computing. Do you see a role for cost-based optimization in this application area?
Ps Absolutely. Companies have a lot of data that is quite well structured— an order, a customer, an employee record—but that’s maybe 15% of all of the data that a company has. The rest of it is in document files, it’s in XML, it’s in pictures, it’s on Web pages—all of this information also needs to be managed. XML provides a mechanism for being able to do that, but the data isn’t quite as regularly structured. It’s very dynamic. Every record could look different from the next record even in
References:
Archives