YARN) or an Omega-style shared-state architecture (for example, Microsoft’s Apollo and HashiCorp’s Nomad).

Another deliberate consequence of the offer-driven design is that the Mesos resource manager is fairly simple, which benefits its scalability. The next two papers look deeper into this concern: the first one proposes an even simpler design for even greater scalability, and the second suggests that scaling a complex scheduler is more feasible than widely thought.

Breaking the Scheduler Further Apart
Ousterhout, K. et al.
Sparrow: Distributed, low-latency scheduling. In Proceedings of the Symposium on Operating Systems Principles, 2013, 69–84; http://dl.acm.

Even within what Mesos considers framework-level tasks, there may be another level of scheduling. Specifically, some data-analytics systems break their processing into many short work units: Spark, for example, generates application-level “tasks” that often run for only a few hundred milliseconds (note that these are different from the cluster-level tasks that Borg or Mesos frameworks place!). Using such short tasks has many benefits: it implicitly balances load across the workers that process them; failing tasks lose only a small amount of state; and straggler tasks that run much longer than others have a smaller absolute impact.

Shorter tasks, however, impose a higher load on the scheduler that places them. In practice, this is usually a framework-level scheduler, or a scheduler within the job itself (as in standalone Spark). With tens of thousands of tasks, this scheduler might get overwhelmed: it might simply be unable to support the decision throughput required. Indeed, the paper shows that the centralized application-level task scheduler within each Spark job scales to only about 1,500 tasks per second. Queueing tasks for assignment at a single scheduler hence increases their overall “makespan” (the time between task submission and completion). It also leaves resources idle while waiting for new tasks to be assigned by the overwhelmed scheduler.

Sparrow addresses this problem in a radical way: it builds task queues at each worker and breaks the scheduler into several distributed schedulers that populate these worker-side queues independently. Using the “power of two random choices” property, which says that (under certain assumptions) it suffices to poll two random queues to achieve a good load balance, Sparrow then randomly places tasks at workers. This requires neither state at the scheduler nor communication between schedulers, and hence scales well by simply adding more schedulers.

This paper includes several important details that make the random-placement approach practical and bring it close to the choices that an omniscient scheduler would make to balance queue lengths perfectly. As an example, Sparrow speculatively enqueues a given task in several queues, spreading its bets across multiple workers and smoothing head-of-line blocking from other straggler tasks. It can also deal with placement constraints and offer weighted fair sharing across multiple jobs that share the same workers.

Sparrow could in principle be used as a cluster-level scheduler, but in practice works best when it load-balances application-level tasks over long-running workers of a single framework (which, for example, serves queries or runs analytics jobs). At the cluster-scheduler level, task startup overheads normally make tasks shorter than tens of seconds impractical, because steps such as package distribution and container launch already take several seconds. Consequently, the open source Sparrow implementation supplies a Spark application-level scheduler plug-in.

Finally, while Sparrow’s randomized, distributed decisions are scalable, their quality
depends on the
workload and the