I subsequently found that quick and merge become competitive with bubble if I repeat each of their comparisons six times and return the most common answer. But that felt like a cheap hack: an after-the-fact fix tailored to one situation. Here, the failed comparisons were independent, identically distributed (IID) events, but real-world faults often occur in bursts or other patterns. For example, a mistaken person gives wrong directions even if you repeat the question. How far should you drive based on their answers? Or in this case, how far should a sorter swap data?

Bubble sort wins on some non-IID faults as well. Suppose our fallible comparator also suffered from a sticky output, so half the time it just repeated its previous answer. Given such awful directions, bubble’s error rose to nearly 90, but quick exceeded 550, merge exceeded 650, and even their best-of-six variants both broke 250. Bubble sort is inefficient, but it is robust.

Efficiency is key when demands exceed resources. Yet many computers are usually idling. They could have been checking their work, but that is off the radar for CEO Software. Optimizing efficiency conspires with program correctness to undermine robustness. The more work done per program action, as with the long swaps of efficient sorters, the more serious the disruption from faulty action. Yet as long as the optimized code remains logically correct, with CEO Software there is no pushback. That risk is never even assessed, because fault-induced failures, no matter how severe, are billed to someone else.
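To make the experiment above concrete, here is a minimal sketch in Python. It is not the author’s demonstration code (that is reference [1]); the sorters, fault models, and best-of-six vote follow the text’s description, but every name here (`faulty_less`, `sticky_less`, `positional_error`, and so on) is invented for this sketch, and its exact numbers will differ from the figures quoted above.

```python
import random

def positional_error(perm):
    # Sum over all cards of |current position - correct position|,
    # for a permutation of 0..n-1 (the sorted position of value v is v).
    return sum(abs(i - v) for i, v in enumerate(perm))

def bubble_sort(items, less):
    a = list(items)
    for end in range(len(a) - 1, 0, -1):
        for i in range(end):
            if less(a[i + 1], a[i]):
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

def merge_sort(items, less):
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left, right = merge_sort(items[:mid], less), merge_sort(items[mid:], less)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if less(right[j], left[i]):
            out.append(right[j]); j += 1
        else:
            out.append(left[i]); i += 1
    return out + left[i:] + right[j:]

def faulty_less(p=0.10):
    # IID fault model: with probability p, report the wrong answer.
    def less(x, y):
        truth = x < y
        return (not truth) if random.random() < p else truth
    return less

def sticky_less(p=0.10, stick=0.5):
    # Non-IID fault model: half the time the comparator ignores its
    # inputs and simply repeats its previous answer.
    base, last = faulty_less(p), [False]
    def less(x, y):
        if random.random() < stick:
            return last[0]
        last[0] = base(x, y)
        return last[0]
    return less

def best_of(less, votes=6):
    # Repeat each comparison several times and return the most common
    # answer (a 3-3 tie counts as "no"). Helps against IID flips,
    # much less against sticky or bursty faults.
    def voted(x, y):
        return sum(less(x, y) for _ in range(votes)) * 2 > votes
    return voted

def average_error(sorter, less, n=52, trials=200):
    return sum(positional_error(sorter(random.sample(range(n), n), less))
               for _ in range(trials)) / trials

if __name__ == "__main__":
    for name, sorter in [("bubble", bubble_sort), ("merge", merge_sort)]:
        print(name,
              round(average_error(sorter, faulty_less()), 1),
              round(average_error(sorter, best_of(faulty_less())), 1),
              round(average_error(sorter, sticky_less()), 1))
```

Under the IID model the majority vote rescues quick and merge, as described above; under the sticky model it cannot, because repeating the question just fetches the same wrong answer.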
Ironically, even fault tolerance researchers can succumb to the lure of CEO Software. With commendable candor, one group has argued that optimizing fault-free efficiency, “a practice that we have engaged in with relish..., is increasingly misguided, dangerous, and even futile” [6]. Broader awareness of CEO Software’s liabilities could help avoid such traps.
Over 60 years ago, von Neumann recognized the scalability limits of his namesake design. He predicted a future computing model, a “logic of automata,” in which short runtimes would be preferred, and all computational operations would tolerate faults [11]. With transistor sizes continuing to shrink beneath churning piles of dubious software, plus fault-injecting attackers wielding lasers and X-rays [3], we continue to delay that future at our peril.
Today’s computing base is optimized for the perfect logician, so the quest for robustness has many fronts. A clean-slate approach is robust, indefinitely scalable architectures [2], but research topics abound.
For example, it would help to have better visibility into efficiency-robustness trade-offs. Existing software quality metrics could be expanded to include correctness degradation under various fault models. Bubble sort was not even designed to be robust. How will other algorithms behave?
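As a hedged sketch of what such an expanded metric might look like, reusing the harness from the first code sketch: report an algorithm’s output error as a function of fault rate, rather than a single pass/fail bit at fault rate zero. The name `degradation_curve` and the sample rates are illustrative only.

```python
def degradation_curve(sorter, fault_rates=(0.0, 0.01, 0.05, 0.10, 0.20)):
    # Correctness degradation under an IID fault model: average
    # positional error at each fault rate, instead of a binary
    # "correct/incorrect" verdict measured only at rate zero.
    return {p: round(average_error(sorter, faulty_less(p)), 1)
            for p in fault_rates}

print("bubble:", degradation_curve(bubble_sort))   # tends to degrade smoothly
print("merge: ", degradation_curve(merge_sort))    # tends to degrade sharply
```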
To set the stage formally, it would help to have results on the trade-offs between efficiency and robustness, perhaps like a Brewer’s Theorem [8] extended to degrees of failure. A starting point might be highly optimized tolerance [4], which explicitly represents failure size.
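To give “explicitly represents failure size” a concrete form: one common, simplified presentation of highly optimized tolerance [4] is a resource-allocation problem. The symbols below ($p_i$, $\ell_i$, $r_i$, $R$) follow that presentation and are assumptions of this sketch, not notation from this article:

$$\min_{r_1,\ldots,r_n} \; J \;=\; \sum_i p_i\,\ell_i(r_i) \qquad \text{subject to} \qquad \sum_i r_i \le R,$$

where event $i$ occurs with probability $p_i$, its loss (failure size) $\ell_i$ shrinks as more design resource $r_i$ is spent guarding against it, and $R$ is the total budget. Minimizing the expected loss $J$ concentrates resources on common events, leaving rare events with large failure sizes; in [4] this trade-off produces the power-law failure distributions characteristic of optimized designs.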
We also need to understand robust applications programming better. The ongoing trend away from monolithic mainlines toward quick event handlers could be seen as compatible with von Neumann’s short program runtimes. More possibilities arise as we replace prima donna CEO Software with robust team player code that gives better than it gets. For example, Dijkstra’s self-stabilization [9] technique continually “falls toward” correctness. It is a beautiful embodiment of the robust spirit, but requires extra work to interface with CEO Software, because it can violate correctness while stabilizing, which CEO Software cannot tolerate.
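As a toy illustration of “falling toward” correctness, consider this hedged sketch (it is not Dijkstra’s original token-ring construction; see [9] for the real thing): a background task repairs at most one local defect per tick, so a transiently corrupted structure converges back to a legal state instead of failing permanently. The one-swap-per-tick policy and all names are choices made for this sketch.

```python
def stabilize_step(a):
    # One self-stabilization tick: find and repair the first adjacent
    # inversion, if any. Each swap removes exactly one inversion, so
    # repeated ticks converge to sorted order from ANY starting state,
    # including after faults corrupt data that was already "done".
    for i in range(len(a) - 1):
        if a[i + 1] < a[i]:
            a[i], a[i + 1] = a[i + 1], a[i]
            return True   # made progress; run me again
    return False          # every local constraint holds

# Usage: corrupt a sorted list, then let the stabilizer catch up.
data = list(range(20))
data[3], data[15] = data[15], data[3]   # a transient fault
while stabilize_step(data):
    pass                                # short, resumable units of work
assert data == sorted(data)
```

Note the interfacing cost the text describes: between ticks the list is visibly out of order, so a CEO-style caller that demands a correct answer on every read cannot consume this structure directly.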
Finally, we could develop self-tuning algorithms that trade efficiency against robustness dynamically, based on workload properties, and resource availability and quality. Some existing work [5] uses failure-detection heuristics to switch from efficient to robust processing. More could be done in a computational model that supported signaling external resource characteristics, and internal robustness margins, between components.
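A hedged sketch of that switching idea, loosely inspired by, but not copied from, the adaptive approach of [5]: run the efficient sorter first, apply a cheap failure-detection heuristic, and fall back to more robust processing only when the check fails. It reuses the sorters, fault models, and `random` import from the first sketch; the heuristic, its thresholds, and all names are invented here.

```python
def looks_sorted(a, less, samples=64, tolerance=0.25):
    # Cheap failure-detection heuristic: spot-check random adjacent
    # pairs. The checker shares the fallible comparator, so allow a
    # noise floor of apparent violations before declaring failure.
    bad = sum(less(a[i + 1], a[i])
              for i in (random.randrange(len(a) - 1) for _ in range(samples)))
    return bad <= samples * tolerance

def self_tuning_sort(items, less):
    out = merge_sort(items, less)       # efficient path first
    if looks_sorted(out, less):
        return out                      # no fault detected; done cheaply
    return bubble_sort(out, less)       # robust fallback: short, local swaps
```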
It is time to study and manage incorrectness in the interest of robustness. We should not shun the trade-off, but rather, we should understand, engineer, and teach computation beyond correctness and efficiency only.

Robust-first computation now.
References
1. Ackley, D.H. Bubble sort robustness demonstration code (Apr. 2012); http://livingcomputation.com/robusort2.tar.
2. Ackley, D.H., Cannon, D.C., and Williams, L.R. A movable architecture for robust spatial computing. The Computer Journal, 2012.
3. Bar-El, H. et al. The sorcerer’s apprentice guide to fault attacks. Proceedings of the IEEE 94, 2 (Feb. 2006), 370–382.
4. Carlson, J.M. and Doyle, J. Highly optimized tolerance: A mechanism for power laws in designed systems. Phys. Rev. E 60 (Aug. 1999), 1412–1427.
5. Chang, I., Hiltunen, M.A., and Schlichting, R.D. Affordable fault tolerance through adaptation. In Proceedings of IPPS/SPDP Workshops (1998), 585–603.
6. Clement, A. et al. Making Byzantine fault tolerant systems tolerate Byzantine faults. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’09). USENIX Association, Berkeley, CA, 2009, 153–168.
7. Dohi, T. and Cukic, B., eds. Proceedings of the IEEE 22nd International Symposium on Software Reliability Engineering (ISSRE 2011) (Nov. 29–Dec. 2, 2011, Hiroshima, Japan). IEEE, 2011.
8. Gilbert, S. and Lynch, N. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant Web services. SIGACT News 33, 2 (June 2002), 51–59.
9. Schneider, M. Self-stabilization. ACM Comput. Surv. 25, 1 (Mar. 1993), 45–67.
10. Swarz, R.S., Koopman, P., and Cukier, M., eds. IEEE/IFIP International Conference on Dependable Systems and Networks (Boston, MA, June 25–28, 2012). IEEE Computer Society, 2012.
11. von Neumann, J. The general and logical theory of automata. In L.A. Jeffress, ed., Cerebral Mechanisms in Behaviour: The Hixon Symposium (1948). Wiley, 1951, 15–19. Also appears as pages 302–306 in A.H. Taub, ed., John von Neumann Collected Works, Volume V: Design of Computers, Theory of Automata and Numerical Analysis. Pergamon Press, 1963.
David H. Ackley (ackley@cs.unm.edu) is an associate professor of computer science at the University of New Mexico in Albuquerque, NM.

The author thanks Brian Christian, Howard Shrobe, George Furnas, Jim Hollan, Michael Lesk, Terry Jones, and John King for their review comments.

Copyright held by author/owner(s).
Figure 2. Errors sorting a deck of cards with 10% failed comparisons. (Bar chart: total card positional error, averaged, by sort algorithm: quick, merge, and bubble.)