FEBRUARY2019 | VOL. 62 | NO. 2 | COMMUNICATIONS OF THE ACM 111
not included in the output package, and xi = k means that
tuple ti is included k times. Thus, the result of is the package:
{t2, t3, t5}.
4. SCALABLE PACKAGE EVALUATION
The Direct algorithm has two crucial drawbacks. First, it is
only applicable if the input relation is small enough to fit
entirely in main memory: ILP solvers, such as IBM’s CPLEX,
require the entire problem to be loaded in memory before
execution. Second, even for problems that fit in main memory, this approach may fail due to the complexity of the integer problem. In fact, ILP is a notoriously hard problem, and
modern ILP solvers use algorithms, such as
branch-and-cut,
16 that often perform well in practice, but can “choke”
even on small problem sizes due to their exponential worst-case complexity.
8 This may result in unreasonable performance if the solvers use too many resources (main memory,
virtual memory, CPU time), eventually thrashing the entire
system.
In this section, we present SketchRefine, an approximate divide-and-conquer evaluation technique for efficiently
answering package queries on large datasets. Rather than
solving the original large problem with Direct, SketchRefine
smartly decomposes a query into smaller queries, formulates
them as ILP problems, and employs an ILP solver as a black-box evaluation method to answer each individual query. By
breaking down the problem into smaller subproblems, the
algorithm avoids the drawbacks of Direct.
The algorithm is based on an important observation:
similar tuples are likely to be interchangeable within packages. A
group of similar tuples can therefore be “compressed” to a
single representative tuple for the entire group.
SketchRefine sketches an initial answer package using
only the set of representative tuples, which is substantially
smaller than the original dataset. This initial solution is
then refined by evaluating a subproblem for each group, iteratively replacing the representative tuples in the current
package solution with original tuples from the dataset.
Figure 4 provides a high-level illustration of the three main
steps of SketchRefine:
1. Offline Partitioning (Section 4. 1): The algorithm
assumes a partitioning of the data into groups of similar
tuples, with a representative tuple chosen for each
group. This partitioning is performed offline (not at
query time).
2. Sketch (Section 4. 2. 1): SketchRefine sketches an
can take on. Specifically, REPEAT ρ implies 0 ≤ xi ≤ ρ + 1.
Base predicate. Let b be a base predicate, for example,
R.gluten = ‘free’, and Rb the relation containing tuples from
R satisfying b. We encode b by setting xi = 0 for every tuple
ti ∉ Rb.
Global predicate. Each global predicate in the SUCH
THAT clause takes the form f (P) v. For each such predicate,
we derive a linear function over the integer variables.
A cardinality constraint f(P) = COUNT(P.*) is translated into a
linear function . A summation constraint f (P) =
SUM(
P.attr) is translated into a linear function
Boolean expressions over the global predicates can be
encoded into a linear program with the help of Boolean variables and linear transformation tricks found in the literature.
3 We refer to the original version of this paper for further
details.
4, 5
Objective clause. We encode MAXIMIZE f(P) as max ,
where is the encoding of f(P). Similarly MINIMIZE f(P) is
encoded as min .
Example 2 (ILP translation). Figure 3 shows a toy example
of the Recipes table, with two columns and 5 tuples. To transform into an ILP, we first create a non-negative, integer variable for each tuple: x1, …, x5. The cardinality constraint
specifies that the sum of the xi variables should be exactly 3.
The global constraint on SUM(P.kcal) is formed by multiplying
each xi with the value of the kcal column of the corresponding
tuple, and specifying that the sum should be between 2 and 2. 5.
The objective of minimizing SUM(
P.sat_fat) is similarly formed
by multiplying each xi with the sat_fat value of the corresponding tuple.
3. 2. Query evaluation with DIRECT
Using the ILP formulation, we develop Direct, our basic
evaluation method for package queries. In Section 4, we
extend this technique to our main algorithm, SketchRefine,
which supports efficient package evaluation in large datasets. Package evaluation with Direct employs three steps:
1. Base Relations: We first compute the base relations,
such as Rb, Rc, and Rp, with a series of standard SQL
queries, one for each, or by simply scanning R once
and populating these relations simultaneously.
2. ILP Formulation: We transform the PaQL query to an
ILP problem using the rules described in Section 3. 1.
After this phase, all variables xi such that xi = 0 can
be eliminated from the ILP problem because the corresponding tuple ti cannot appear in any package
solution.
3. ILP Execution: We employ an off-the-shelf ILP solver,
as a black box, to get a solution to each of the integer
variables xi. Each xi informs the number of times tuple
ti should be included in the answer package.
Example 3 (ILP solution). The ILP solver operating on the
program of Figure 3 returns the variable assignments to xi
that lead to the optimal solution; xi = 0 means that tuple ti is
Recipes
sat_fat kcal
t1 x1 = 0
= 1
= 1
= 0
= 1
t2 x2
t3 x3
t4 x4
t5
7. 1
5. 2
3. 2
6. 5
2.0
0.45
0.55
0.25
0.15
1. 20 x5
min 7.1x1 + 5.2x2 + 3.2x3 + 6.5x4 + 2.0x5
s.t. x1 +x2 +x3 +x4 +x5 = 3
0.45x1 + 0.55x2 + 0.25x3
+ 0.15x4 + 1.20x5 ≥ 2.0
0.45x1 + 0.55x2 + 0.25x3
+ 0.15x4 + 1.20x5 ≤ 2. 5
x1,x2,x3, x4,x5 ∈ {0, 1}
Figure 3. Example ILP formulation and solution for query Q, on a
sample Recipe dataset. There are only two packages that satisfy all
the constraints, namely {t2, t3, t5} and {t1, t2, t5}, but the first one is the
optimal because it minimizes the objective function.