setting can lead to an order of magnitude improvement in
query response time.
The diameter bounds, wi j, are not required, but they can
be enforced to ensure a desired approximation guarantee.
In general, enforcing the diameter limits may cause the
resulting partitions to become excessively small. While still
obeying the approximation guarantees, this could increase
the number of resulting partitions and thus degrade the
running time performance of SketchRefine. This is an
important trade-off between running time and quality that
we also observe in our experiments, and it is a very common
characteristic of most approximation schemes.
24
Partitioning method. Our partitioning procedure is
based on k-dimensional quad-tree indexing.
11 The method
recursively partitions a relation into groups until all the
groups satisfy the size threshold and meet the diameter
limits. First, relation R is augmented with an extra group
ID column gid, such that
t.gid = i if tuple t is assigned to
group Gi. The procedure initially creates a single group G1
that includes all the original tuples from relation R, by initializing gid = 1 for all tuples. Our method recursively computes the sizes and diameters of the current groups, as well
as the centroid of each group. It then partitions the groups that
violate either the size or the diameter limits, using the centroids as partitioning boundaries. In the last iteration, the
centroids for each group become the representative tuples,
, 1 ≤ i ≤ m, and get stored in a new representative relation R~
(gid, attr1, …, attrk).
One-time cost. Partitioning is an expensive procedure.
Partitioning the data in advance avoids this cost at query
time. For a known workload, our experiments show that
partitioning the dataset on the union of all query attributes
provides the best performance in terms of query evaluation
time and approximation error for the computed answer
package. We also demonstrate that our query evaluation
approach is robust to a wide range of partition sizes, and to
imperfect partitions that cover more or fewer attributes
than those used in a particular query. This means that,
even without a known workload, a partitioning performed
on all of the data attributes still provides good perfor-
mance. Note that the same partitioning can be used to sup-
port different queries over the same dataset. In our
initial package by evaluating the package query only
over the set of representative tuples.
3. Refine (Section 4. 2. 2): Finally, SketchRefine transforms
the initial package into a complete package by replacing
each representative tuple with some of the original tuples
from the same group, one group at a time.
SketchRefine always constructs approximate feasible
packages, that is, packages that satisfy all the query con-
straints, but with a possibly sub-optimal objective value that
is guaranteed to be within certain approximation bounds.
SketchRefine may suffer from false infeasibility, which
happens when the algorithm reports a feasible query to be
infeasible. The probability of false infeasibility is, however, low
and bounded. We formalize these properties in Section 4. 3.
In the subsequent discussion, we use R(attr1, …, attrk) to
denote an input relation with k attributes. R is partitioned
into m groups G1, …, Gm. Each group Gi ⊆ R, 1 ≤ i ≤ m, has a
representative tuple , which may not always appear in R.
We denote the partitioned space with .
We refer to packages that contain representative tuples as
sketch packages and packages with only original tuples as
complete packages (or simply packages). We denote a com-
plete package with p and a sketch package with p , where
⊆ is the set of groups that are yet to be refined to trans-
form p to a complete answer package p.
4. 1. Offline partitioning
SketchRefine relies on an offline partitioning of the input
relation R into groups of similar tuples. Partitioning is based
on a set of partitioning attributes from the input relation R, a
size threshold, and a set of diameter bounds. The size thresh-
old t, 1 ≤ t ≤ n, restricts the size of each partitioning group Gi,
1 ≤ i ≤ m, to a maximum of t original tuples, that is, |Gi| ≤ t.
The diameter di j ≥ 0 of a group Gi, 1 ≤ i ≤ m, on attribute attrj, 1
≤ j ≤ k, is the greatest absolute distance between all pairs of
tuples within group Gi. The diameter bounds, wi j ≥ 0, 1 ≤ i ≤ m,
1 ≤ j ≤ k, require all diameters to be bounded by di j ≤ wi j.
Setting the partitioning parameters. The size threshold,
t, affects the number of partitions, m: a lower t leads to
smaller partitions, but more of them (larger m). For best
response time of SketchRefine, t should be set so that
both m and t are small. Our experiments show that a proper
1
02
21
G1 G2
G3
G4
2
G1 G2
G3
G4
1
0
G1 G2
G3
G4
21
G1 G2
G3
G4
(b) Initial query using
representative tuples
(c) Initial package (e) Skipping G2 (g) Refinement
query for group G4
(h) Final approximate
package
REFINE PARTITION SKETCH
(d) Refinement
query for group G1
(f) Refinement
query for group G3
(a) Original tuples
Multiplicity of representative
tuples in the initial package
Representative and original tuples selected during previous steps, shown by
hatching lines, are aggregated and used to modify later refinement queries
Figure 4. The original tuples (a) are partitioned into four groups and a representative is constructed for each group (b). The initial sketch
package (c) contains only representative tuples, with possible repetitions up the size of each group. The refine query for group G1 (d)
involves the original tuples from G1 and the aggregated solutions to all other groups (G2, G3, and G4). Group G2 can be skipped (e) because no
representatives could be picked from it. Any solution to previously refined groups is used while refining the solution for the remaining groups
(f and g). The final approximate package (h) contains only original tuples.