server receives 1/pth of the data. There are no assumptions on
the particular partitioning scheme. At the end of the execution,
the output must be present in the union of the p servers. As the
model focuses primarily on quantifying the amount of communication there is no a priori bound on the computational power
of a server. A relevant measure is the load at each server, which
is the amount of data received by a server during a particular
round. Examples of optimization goals are minimizing total
load (e.g., Ref.
1) and minimizing maximum load (e.g., Ref.
12).
To get a feeling for the model, we next present simple
examples of single- and multi-round algorithms in the MPC
model for evaluating specific conjunctive queries.
Example 3. 1. ( 1) Consider the query Q1
H(x, y, z) ← R(x, y), S( y, z),
joining two binary relations R and S over a common attribute.
Let h be a hash function mapping every domain value to one of
the p servers. The following single-round algorithm computes
Q1. In the communication phase, executed by every server on
its local data, every tuple R(a, b) is sent to server h(b) while every
tuple S(c, d) is sent to server h(c). In the computation phase,
every server evaluates Q1 on the received data. The output of the
algorithm is the union of the results computed at the compu-
tation phase. This strategy is called a repartition join in Ref.
6
( 2) Let Q2 be the triangle query:
H (x, y, z) ← R (x, y), S ( y, z), T (z, x).
One way to evaluate Q2 is by two binary joins leading to a two-round algorithm. We assume two hash functions h and h′.
In the first round, all tuples R(a, b) and S(c, d) are sent to servers h(b) and h(c), respectively. The computation phase computes the join of R and S at each server in a relation K. In the
second round, each resulting triple K(e, f, g) is sent to h′(e, g),
while each tuple T (i, j ) is sent to h′( j, i ). Finally, K and T are
joined at each server.
We note that every MapReduce8 program can be seen as
an algorithm within the MPC model since the map phase
and reducer phase readily translate to the communication
and computation phase of MPC.
4. HYPERCUBE ALGORITHM
To illustrate the HyperCube algorithm, we show in the following example that the triangle query of Example 3. 1( 2) can be
evaluated by a single-round MPC algorithm.
Example 4. 1. Consider again the triangle query Q2 of Example
3. 1( 2):
H (x, y, z) ← R(x, y), S ( y, z), T (z, x).
Let αx, αy, and αz be positive natural numbers such that
αxαyαz = p. Every server can then uniquely be identified by
a triple in [ 1, αx] × [ 1, αy] × [ 1, αz]. For c ∈ {x, y, z}, let hc be a
hash function mapping each domain value to a number in
[ 1, αc]. The algorithm then operates as follows. In the com-
munication phase, every fact
• R(a, b) is sent to every server with coordinate (hx(a),
hy(b), α) for every α ∈ [ 1, αz]; so, R(a, b) is sent to the sub-
cube determined by the hash values hx(a) and hy(b) in the
x- and y-dimension, respectively, as illustrated in Figure 1a;
• S(b, c) is sent to every server with coordinate (α, hy(b),
hz(c) ) for every α ∈ [ 1, αx]; and
• T(c, a) is sent to every server with coordinate (hx(a), α,
hz(c) ) for every α ∈ [ 1, αy].
We note that every R-tuple is replicated αz times and similarly
for S- and T-tuples.
The computation phase consists of evaluating Q2 on the
local data at each server. The algorithm is correct because
for every valuation V for Q2 some server contains the facts
{V (R(x, y)), V (S( y, z)), V ( T (z, x))},
if the (hypothetical) centralized database contains them. In
this sense, the algorithm distributes the space of all valuations
of Q2 over the computing servers in an instance independent
way through hashing of domain values. In the special case that
αx = αy = αz = p1/3, each tuple is replicated p1/3 times. Assuming
each relation consists of m tuples and there is no skew, each
server will receive m/p2/3 tuples for each of the relations R, S,
and T. So, the maximum load per server is O(m/p2/3).
hy (b) = 2
hx (a) = 3
(a)
hy (b) = 2
hz (c) = 0
(b)
hx (a) = 3 hz (c) = 0
(c)
Figure 1. HyperCube distribution policies view the computing nodes in the network as arranged in a multi-dimensional grid. Each dimension
corresponds to a variable of the query to be computed. Replication happens in a structurally restricted way: along a line, a plane, or a
hyperplane. This figure illustrates the replication of facts R(a, b), S(b, c), T(c, a) as required by a valuation for the triangle query in Example 4. 1
for values p = 72, αx = 6, αy = 4, and αz = 3. All facts meet at the node with coordinate (hx(a), hy(b), hz(c) ) = ( 3, 2, 0). Therefore the fact H(a, b, c)
can be derived locally, as desired. (a) Replication of R(a, b). (b) Replication of S(b, c). (c) Replication of T(c, a).