This book, a revised version of the dissertation that won the 2014 ACM Doctoral Dissertation Award, proposes an
architecture for cluster computing systems that can tackle emerging data processing workloads at scale.
Whereas early cluster computing systems, like MapReduce, handled batch processing, our architecture also
enables streaming and interactive queries, while keeping MapReduce’s scalability and fault tolerance. And
whereas most deployed systems only support simple one-pass computations (e.g., SQL queries), ours also
extends to the multi-pass algorithms required for complex analytics like machine learning. Finally, unlike the
specialized systems proposed for some of these workloads, our architecture allows these computations to be
combined, enabling rich new applications that intermix, for example, streaming and batch processing.
We achieve these results through a simple extension to MapReduce that adds primitives for data sharing,
called Resilient Distributed Datasets (RDDs). We show that this is enough to capture a wide range of
workloads. We implement RDDs in the open source Spark system, which we evaluate using synthetic and
real workloads. Spark matches or exceeds the performance of specialized systems in many domains, while
offering stronger fault tolerance properties and allowing these workloads to be combined. Finally, we examine
the generality of RDDs from both a theoretical modeling perspective and a systems perspective.
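To make the data-sharing primitive concrete, the following is a minimal sketch (not taken from the dissertation) of how an RDD is built once, cached in memory, and reused by several computations in Spark's Scala API; the input path and the tab-separated field layout are placeholder assumptions.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddSharingSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("rdd-sharing").setMaster("local[*]"))

        // Placeholder input path; any text source works.
        val lines  = sc.textFile("hdfs://namenode/logs")
        // The shared dataset: filtered once, kept in memory for reuse.
        val errors = lines.filter(_.contains("ERROR")).cache()

        // Multiple passes over the same cached RDD -- the multi-pass reuse
        // pattern that one-pass MapReduce jobs cannot express directly.
        val total     = errors.count()
        val byService = errors.map(l => (l.split("\t")(0), 1))
                              .reduceByKey(_ + _)
                              .collect()

        println(s"$total errors; counts by service: ${byService.mkString(", ")}")
        sc.stop()
      }
    }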
This version of the dissertation makes corrections throughout the text and adds a new section on the
evolution of Apache Spark in industry since 2014. In addition, the references have been edited, formatted
consistently, and given links.