114 COMMUNICATIONS OF THE ACM | FEBRUARY 2019 | VOL. 62 | NO. 2
algorithm terminates as soon as it encounters a complete
package, which it returns. The algorithm assumes a (
initially random) refinement order for all groups in and
places them in a priority queue. During refinement, this
group order can change by prioritizing groups with infeasible refinements.
Runtime complexity. In the best case, all refine queries
are feasible and the algorithm never backtracks. In this case,
the algorithm makes up to m calls to the ILP solver to solve
problems of size up to t, one for each refining group. In the
worst case, SketchRefine tries every group ordering leading to an exponential number of calls to the ILP solver. Our
experiments show that the best case is the most common
and backtracking occurs infrequently.
4. 3. Theoretical guarantees
We present two important results on the theoretical guarantees of SketchRefine: ( 1) it produces packages that closely
approximate the objective value of the packages produced
by Direct; ( 2) the probability of false negatives (i.e., queries
incorrectly deemed infeasible) is low and bounded. The
extended version of this work4 includes the formal proofs of
both results.
For a desired approximation parameter e, we can derive
diameter bounds wi j for the offline partitioning that guarantee that SketchRefine will produce a package with objective value ( 1±e)-factor close to the objective value of the
solution generated by Direct for the same query.
Theorem 1 (Approximation Bounds). Let R(attr1, . .., attrk)
be a relation with k attributes, and let be a feasible package
query with a maximization (minimization, resp.) objective over
R. Let S be an exact solver that produces an answer to with
optimal objective value OPT. We denote with ALG the objective
value of the package returned by SketchRefine using S as a
black-box solver. For any e ∈ [0, 1) (e ∈ [0, ∞), resp.), there
exists b ∈ [0, 1) (b ∈ [ 1, ∞), resp.) that depends on e, such that if
R is partitioned into m groups with diameter limits:
( 1)
then ALG ≥ ( 1 − e)OPT (ALG ≤ ( 1 + e)OPT, resp.).
For a feasible query , false infeasibility may happen in two
cases: ( 1) when the sketch query (R~) is infeasible; ( 2) when
greedy backtracking fails (possibly due to suboptimal partitioning). In both cases, SketchRefine would (incorrectly)
report a feasible package query as infeasible. False negatives
are, however, extremely rare, as the following theorem
establishes.
Theorem 2 (False-infeasibility Bounds). For any query
and any random package P, if P is feasible for , then with high
probability: ( 1) the Sketch query (R~) is feasible; ( 2) all
Refine queries i(p ),
1 ≤ i ≤ m, are feasible. Thus,
SketchRefine returns a feasible result.
5. EXPERIMENTAL EVALUATION
This section presents an extensive experimental evaluation of
our techniques for package query execution on real-world
data. The results show the following properties of our meth-
ods: ( 1) SketchRefine evaluates package queries an order of
magnitude faster than Direct; ( 2) SketchRefine scales up to
sizes that Direct cannot handle directly; ( 3) SketchRefine
produces packages of high quality (similar objective value as
the packages returned by Direct). We have also performed
extensive experiments on benchmark data that demonstrate
the robustness of SketchRefine under imperfect partition-
ing and different approximation parameters.
4, 5
5. 1. Experimental setup
We implemented our package evaluation system as a layer
on top of PostgreSQL.a The system interacts with the DBMS
via SQL and uses IBM’s CPLEX12 as the black-box ILP solver.
A package is materialized into the DBMS as a relation, only
when necessary (e.g., to compute its objective value). The
experiments compare Direct with SketchRefine. Both
methods use the PaQL to ILP translation presented in
Section 3.1: Direct translates and solves the original query;
SketchRefine translates and solves the subqueries. We
demonstrate the performance of our query evaluation methods using a real-world dataset consisting of approximately
5. 5 million tuples extracted from the Galaxy view of the
SDSS,
22 and a workload of seven feasible package queries
(Figure 5) constructed by adapting some of the real-world
sample SQL queries available directly from the SDSS
Website. The experiments use the following efficiency and
effectiveness metrics:
Response time. We measure response time as wall-clock
time to generate an answer package. This includes the time
to translate the PaQL query into one or several ILP problems,
the time to load the problems to the solver, and the time the
solver takes to produce a solution.
Approximation ratio. We compare the objective value of a
package returned by SketchRefine with the objective value
of the package returned by Direct on the same query. Using
ObjS and ObjD to denote the objective values of Sketch Refine
and Direct, respectively, we report the empirical
approximation ratio for maximization queries, and for minimization queries. An approximation ratio of one indicates that
SketchRefine produces a solution with same objective
value as the solution produced by the solver on the entire
problem. The higher the approximation ratio, the lower the
quality of the result package.
5. 2. Results and discussion
We evaluate two fundamental aspects of our algorithms: ( 1)
Query 1234567
Objective
of SUM constraints
COUNT (∗)
max
2
min
4
min
2
min
1
min
1
min
5
max
5
BETWEEN
5 AND
10
Figure 5. Summary of queries in the Galaxy workload. The full PaQL
queries appear in the extended version of this work.
4
a Our code is publicly available on our project Website: http://packagebuilder.
cs.umass.edu.