region shape is a consequence of the class of cost functions
(piecewise-linear functions) that we consider.
The algorithm needs to perform several elementary operations on Pareto regions and cost functions. For instance,
it must verify whether a Pareto region is empty or calculate
a parameter space region in which one plan is preferable
to a second one. We show how all those operations can be
implemented based on the aforementioned representation
of Pareto regions. We implement those operations using linear programming.
1. 4. Outline
The remainder of this paper is organized as follows. We
define the MPQ problem and related concepts more formally in Section 2. In Section 3, we describe our algorithm for
MPQ with piecewise-linear plan cost functions. In Section 4,
we analyze the MPQ problem and the asymptotic complexity of our algorithm. We present experimental results for an
implementation of our algorithm in Section 5 and discuss
related work in Section 6.
2. FORMAL PROBLEM STATEMENT
We now define the MPQ problem. A query describes data to
generate. The description of our algorithm for solving MPQ
problems, given in the next section, focuses on simple SQL
join queries. An SQL join query is defined by a set of tables
to join. A subquery joins a subset of tables. Standard methods exist by which a query optimization algorithm for this
simple query language can be extended into an algorithm
supporting full SQL queries.
A query plan describes how to generate data. We say
that a query plan answers a query if it generates the data, that
is, described by the query. We assume in the following that
query plans consist of a sequence of scan operations and
binary join operations. For a query q, we denote by P(q) the
set of alternative plans that answer the query.
We compare query plans according to their execution
cost. The cost of a given plan depends on a set of real-valued parameters. The set of parameters is a property of
the query. All alternative plans in P(q) depend therefore on
the same parameters. A parameter value vector contains for
each parameter a corresponding value. We do not know the
parameter values at optimization time. The parameter space
is the set of all possible parameter value vectors. We assume
in the following that there are n parameters and denote by
X ⊆ n the n-dimensional parameter space. A parameter
space region is a subset of the parameter space.
We compare query plans according to multiple execution cost metrics. A cost vector contains for each cost metric
a nonnegative cost value. We assume in the following that
there are m execution cost metrics and denote by C = m
the space of cost vectors. We associate each query plan p
with a cost function cp: X → C that maps the n-dimensional
parameter space to the m-dimensional cost space. We can
compare the cost of query plans for specific parameter values. Denote by x ∈ X a parameter value vector and by p1 and
p2 two plans answering the same query. We say that p1 dominates p2 for x, written p1 x p2, if p1 has lower or equivalent
cost than p2 according to each metric for parameter values x.
In other words, p1 dominates p2 if cp1 (x) contains for no component a higher value than cp2 (x). Now, we are ready to introduce the MPQ problem.
Definition 1. An MPQ problem is defined by a query q, a
parameter space X, and a cost space C. A solution is a subset
S ⊆ P(q) of query plans such that for each possible plan p
∈ P(q) and for each possible parameter value vector x ∈ X
there is a solution plan s ∈ S such that s dominates p for x,
that is, s x p.
We focus on a subclass of MPQ problems that restricts
the class of cost functions. In order to define the class of
cost functions that we consider, we must first introduce
convex polytopes. A convex polytope is defined by a set of
linear inequalities. The convex polytope is the set of points
in the parameter space that satisfy all its inequalities. We
use the terms convex polytope and polytope as synonyms.
A linear cost function is defined by a constant b and an
n-dimensional weight vector w ∈ n such that b + w T × x is
the associated cost value for each parameter vector x ∈ X.
A scalar piecewise-linear cost function is a cost function
that allows to partition the parameter space into convex
polytopes such that the function is linear in each polytope.
A vector-valued piecewise-linear cost function consists of
one piecewise-linear cost function for each cost metric. We
use the terms vector-valued piecewise-linear cost function
and piecewise-linear cost function as synonyms. We restrict
our scope to MPQ with piecewise-linear cost functions.
Our algorithm produces a set of relevant plans for a given
query. A plan is relevant if its execution cost is Pareto-optimal for some parameter value combination.
3. 1. Overview
Our algorithm splits the input query recursively into
smaller and smaller parts until we obtain atomic subqueries. We start with atomic subqueries and calculate the set
of relevant plans for each of them. After that, larger subqueries are treated. We treat subqueries in an order which
makes sure that before treating a query, we have calculated relevant plan sets for each of its subqueries. The reason for restricting the order is that we want to calculate the
set of relevant plans for a query out of the sets of relevant
plans for its subqueries. More precisely, we can guarantee that each relevant plan for a query can be obtained by
splitting the query into two subqueries and combining a
relevant plan for the first subquery with a relevant plan
for the second subquery, thereby generating a new query
plan. Having calculated the set of relevant plans for each
subquery, we can therefore obtain a superset of relevant
query plans by iterating over all possible splits into subqueries and over all possible combinations of relevant
subplans. In order to reduce the superset to the actual
set of relevant query plans, we must prune plans answering the same query. Pruning them means to identify and
to discard plans that are irrelevant. The input query is
treated last. The set of relevant plans for the input query is