This book, a revised version of the 2014 ACM Dissertation Award-winning dissertation, proposes an
architecture for cluster computing systems that can tackle emerging data processing workloads at scale.
Whereas early cluster computing systems, like MapReduce, handled batch processing, our architecture also
enables streaming and interactive queries, while keeping MapReduce’s scalability and fault tolerance. And
whereas most deployed systems only support simple one-pass computations (e.g., SQL queries), ours also
extends to the multi-pass algorithms required for complex analytics like machine learning. Finally, unlike the
specialized systems proposed for some of these workloads, our architecture allows these computations to be
combined, enabling rich new applications that intermix, for example, streaming and batch processing.
We achieve these results through a simple extension to MapReduce that adds primitives for data sharing,
called Resilient Distributed Datasets (RDDs). We show that this extension is enough to capture a wide range of
workloads. We implement RDDs in the open-source Spark system, which we evaluate using synthetic and
real workloads. Spark matches or exceeds the performance of specialized systems in many domains, while
offering stronger fault tolerance properties and allowing these workloads to be combined. Finally, we examine
the generality of RDDs from both a theoretical modeling perspective and a systems perspective.
This version of the dissertation makes corrections throughout the text and adds a new section on the
evolution of Apache Spark in industry since 2014. In addition, the text has been edited and reformatted, and
links have been added to the references.