WHEN WE TALK about big data and data
analytics, a big—some say, the biggest—component of it is what is known
as data wrangling: extracting, integrating, querying, and otherwise preparing
data for meaningful analytic algorithms
to be applied. Data wrangling relies on
well-known and trusted database technology, but many classical database
questions now are posed in new settings. One reason for this is that parallel
processing becomes very important for
handling large amounts of data. This
has given rise to a steady line of research
on classical database problems in new
environments where costs caused by
massive parallelism dominate the usual
I/O costs of the standard database environment. These new costs are primarily
related to communication.
What is the most drastic way to reduce the cost of communication for
parallel data processing algorithms,
for example, query evaluation? If we
could distribute data to servers in a
single round of communication, let
them do their work, and then collect
the results to produce the answer to
our query, that would be ideal. This is
precisely the kind of question studied
in the following paper. It looks at join
algorithms, the most common and important task in database query processing, and investigates conditions
on joins that make one-round parallel
algorithms produce correct results.
Its authors are not the first to look at this problem. In 2010, Afrati and Ullman initiated the study of such multi-join algorithms. A refinement, the Hypercube algorithm, was proposed in 2013 by Beame, Koutris, and Suciu. In those algorithms,
the network topology is a hypercube. To
evaluate a query, one replicates each tuple at several of the hypercube's nodes and then lets
each node perform its computation.
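To make the scheme concrete, here is a minimal sketch of Hypercube-style evaluation for the triangle query Q(x,y,z) :- R(x,y), S(y,z), T(z,x). The grid dimensions, hash function, and relation names are illustrative choices of mine, not taken from the paper:

```python
from collections import defaultdict

# Hypercube sketch for Q(x,y,z) :- R(x,y), S(y,z), T(z,x).
# Servers form a P1 x P2 x P3 grid, one dimension per variable;
# each tuple is replicated along the dimension of the one variable
# it does not mention.
P1 = P2 = P3 = 2

def h(v, p):
    return v % p  # toy hash; any hash function works

def distribute(R, S, T):
    """One communication round: send each tuple to every server
    whose grid coordinates are consistent with it."""
    servers = defaultdict(lambda: ([], [], []))
    for (a, b) in R:                      # R(x,y): replicated along the z dimension
        for k in range(P3):
            servers[(h(a, P1), h(b, P2), k)][0].append((a, b))
    for (b, c) in S:                      # S(y,z): replicated along the x dimension
        for i in range(P1):
            servers[(i, h(b, P2), h(c, P3))][1].append((b, c))
    for (c, a) in T:                      # T(z,x): replicated along the y dimension
        for j in range(P2):
            servers[(h(a, P1), j, h(c, P3))][2].append((c, a))
    return servers

def local_join(Rl, Sl, Tl):
    """The join each server computes locally, with no further communication."""
    return {(a, b, c) for (a, b) in Rl for (b2, c) in Sl if b2 == b
                      for (c2, a2) in Tl if c2 == c and a2 == a}

def hypercube_triangles(R, S, T):
    # the union of the local answers equals the centralized join
    return set().union(*(local_join(*parts) for parts in distribute(R, S, T).values()))
```

Every triangle R(a,b), S(b,c), T(c,a) meets at the server with coordinates (h(a), h(b), h(c)), so the one-round union is exactly the centralized answer.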
While the hypercube is a rather natural
distribution policy, it is certainly not
the only one. But can we reason about
single-round join evaluation under arbitrary distribution policies?
Also, distribution policies are query-dependent. While finding one policy
for all scenarios is of course unrealistic, what about a more down-to-earth
requirement: if we already know that
a policy works for a query Q, perhaps
we can use the same policy for another
query Q′, without redistributing data?
This paper addresses these questions.
The formalism. It is very simple
and elegant. A network is a set of
node names; a distribution policy assigns each tuple in a relation to a set
of nodes. This is the communication
round. The query Q is then executed locally at each node. Q is parallel-correct under the policy
if such a distributed evaluation gives
the result of Q; that is, tuples in the answer to Q are exactly those produced locally at network nodes.
Next, if we have two queries Q and
Q′, and we know that each distribution
policy that makes Q parallel-correct
does the same for Q′, we say that parallel-correctness transfers from Q to
Q′. In this case, the work done for Q in
terms of looking for the right distribution policy need not be redone for Q′.
The results, and what they tell us.
This is a theory paper; the main results
are about the complexity of checking
parallel-correctness and parallel-transferability. It concentrates on the class of
conjunctive queries, that is, slightly more
general queries than multi-way joins.
Parallel-correctness, under mild assumptions, is Π₂ᵖ-complete. That is, it
is a bit harder than NP or coNP, but still
well within polynomial space. In practice, this means that checking whether
a distribution policy guarantees correctness for all databases can be done
in exponential time. Note that this is a
static analysis problem (the database is
not an input), and exponential time is
tolerable and in fact the expected best
case for conjunctive queries (as their
containment is NP-complete).
The authors then show that the same problem for conjunctive queries with negation requires (modulo some complexity-theoretic assumptions) double-exponential time; that is, it is realistically
unsolvable, which means one needs to
restrict attention to simple joins.
Finally, transferability of parallel-correctness for conjunctive queries is
solvable in exponential time (remember, this is a problem about queries, not
about data), and importantly it is in NP
for many classes of conjunctive queries,
like multi-joins (which hints at the possibility of using efficient solvers for NP problems, such as SAT solvers, to
address this problem in practice).
To conclude, I would like to explain
why I view this as a model database theory paper. Such a paper ought to have
several key ingredients:
- It should consider a real data management problem of interest in practice;
- It should provide a clean and simple formalism that can be followed by theoreticians and practitioners alike;
- It should provide theoretical results that give us insights about the original practical problem.
The paper ticks all these boxes: It
provides an elegant theoretical investigation of a practically important problem, and gives a good set of results that
delineate the feasibility boundary.
Leonid Libkin ( firstname.lastname@example.org) is a professor in the
School of Informatics and chair of Foundations of Data
Management at the University of Edinburgh, Scotland.
Copyright held by author.