FEBRUARY 2019 | VOL. 62 | NO. 2 | COMMUNICATIONS OF THE ACM 109
order of magnitude faster than the ILP solver used directly
on the entire problem; ( 2) scales up to sizes that the solver
cannot manage directly; ( 3) produces packages of very good
quality in terms of objective value.
2. LANGUAGE SUPPORT FOR PACKAGES
Database systems do not natively support package queries.
While there are ways to express package queries in SQL,
these are cumbersome and inefficient.
Specifying packages with self-joins. In the limited case of
packages with strict cardinality, that is, a fixed number of
tuples, it is possible to express package queries using relational self-joins. The query of Example 1 requires three
meals (a package with cardinality three) and can be
expressed as a three-way self-join:
SELECT * FROM Recipes R1, Recipes R2, Recipes R3
WHERE R1.pk < R2.pk AND R2.pk < R3.pk AND
R1.gluten = ‘free’ AND R2.gluten = ‘free’ AND R3.gluten = ‘free’
AND R1.kcal + R2.kcal + R3.kcal BETWEEN
2.0 AND
2. 5
ORDER BY R1.saturated_fat +
R2.saturated_fat +
R3.saturated_fat
Such a query is efficient only for constructing packages with
very small cardinality: larger cardinality requires a larger
number of self-joins, quickly rendering evaluation time prohibitive (Figure 1). The benefit of this specification is that
the optimizer can use the traditional relational algebra operators and augment its decisions with package-specific strategies. However, this method does not apply for packages of
unbounded cardinality.
Specifying packages using recursion. SQL can express
package queries by generating and testing each possible
subset of the input relation. This requires recursion to build
a powerset table; checking each set in the powerset table for
the query conditions will yield the result packages. This
approach has three major drawbacks. First, it is not declarative, and the specification is tedious and complex. Second, it
is not amenable to optimization in existing systems. Third,
it is extremely inefficient to evaluate, because the powerset
table generates an exponential number of candidates.
2. 1. PaQL: The package query language
Our goal is to support declarative and intuitive package
specification. In this section, we describe PaQL, a declarative query language that introduces simple extensions to
SQL to define package semantics and package-level constraints. Figure 2 shows the general syntax of PaQL (left) and
the specification for the query of Example 1 (right), which we
use as a running example to demonstrate PaQL’s features.
Square brackets enclose optional clauses and arguments,
and a vertical bar separates syntax alternatives. In this speci-
fication, repeat is a non-negative integer; w_expression
is a Boolean expression over tuple values (as in standard
SQL) and can only contain references to relation_name
and relation_alias; st_expression is a Boolean
expression and obj_expression is an expression over
aggregate functions or SQL subqueries with aggregate func-
tions; both st_expression and obj_expression can
evaluating package queries, even if one were to express
them in SQL. Figure 1 shows the performance of evaluating
a package query expressed as a multi-way self-join query in
traditional SQL. As the cardinality of the package increases,
so does the number of joins, and the runtime quickly
becomes prohibitive: In a small set of 100 tuples from the
Sloan Digital Sky Survey (SDSS) dataset,
22 SQL evaluation
takes almost 24 hours to construct a package of 7 tuples.
Our goal is to extend the database evaluation engine to take
advantage of external tools, such as ILP solvers, which are
more effective for combinatorial problems.
The third challenge pertains to query evaluation
performance and scaling to large datasets. Integer programming
solvers have two major limitations: they require the entire
problem to fit in main memory, and they fail when the problem is too complex (e.g., too many variables and/or too many
constraints). Our goal is to overcome these limitations
through sophisticated evaluation methods that allow solvers to scale to large data sizes.
Our work addresses these challenges through the design
of language and algorithmic support for the specification
and evaluation of package queries. We present PaQL
(Package Query Language), a declarative language that provides simple extensions to standard SQL to support constraints at the package level. PaQL is at least as expressive as
ILP, which implies that evaluation of package queries is
NP-hard.
5 We present a fundamental evaluation strategy,
Direct, that combines the capabilities of databases and
constraint optimization solvers to derive solutions to package queries. The core of our approach is a set of translation
rules that transform a package query to an ILP. This translation allows for the use of highly-optimized external solvers
for the evaluation of package queries. We introduce an
offline data partitioning strategy that allows package query
evaluation to scale to large data sizes. The core of our evaluation strategy, SketchRefine, lies in separating the package computation into multiple stages, each with small
subproblems, which the solver can evaluate efficiently. In
the first stage, the algorithm “sketches” an initial sample
package from a set of representative tuples, while the subsequent stages “refine” the current package by solving an ILP
within each partition. SketchRefine offers strong approximation guarantees for the package results compared to
Direct. We present an extensive experimental evaluation
on real-world data that shows that our query evaluation
method SketchRefine: ( 1) is able to produce packages an
10–3
101
105
1234567
T
i
me
(s
)
Package cardinality
SQL Formulation ILP Formulation
Figure 1. Traditional database technology is ineffective at package
evaluation, and the runtime of a SQL formulation of a package query
grows exponentially. In contrast, tools such as ILP solvers are more
effective.