Esteem for efficiency should be tempered with respect for robustness.
Computer science often emphasizes processing efficiency, leaving robustness to be addressed separately. However, robustness requires redundancy, which efficiency eliminates. For safer and more scalable computing, we must embrace and manage this trade-off.
You’ve seen them: Crashed computers, frozen on the job (see Figure 1).
Fortunately, the result is seldom worse
than user inconvenience or owner embarrassment. Still, as computer scientists we wonder why the computer
inside the machine is so often the part that fails.
Computers keep gaining new responsibilities. In everything from
smartphones to cars to medical equipment, we need computers to be robust.
They should be competent at their jobs,
but also sensible about the unexpected,
and prudent about the malicious.
Over the years we have learned
much about how to keep computers
working. Fields like fault tolerance [10] and software reliability [7] employ structured redundancy to enhance robustness. Data centers and other high-availability systems have benefited,
but the techniques rarely reach the
mass market. Meanwhile, many areas
of computer science—for example,
algorithm and database design, and
programming generally—view redundancy as waste. A common perspective, here stated categorically for emphasis, says software designers and
programmers should assume a 100%
reliable deployment platform, and the
goal of software is ‘CEO’: Correctness
and Efficiency Only.
That ‘CEO Software’ mind-set has
gone largely unchallenged because it
has history and technology behind it.
Our traditional digital architectures,
error-correcting hardware, and fault-masking subsystems like TCP libraries work together to support it. Yet it
is misleading and risky. It implies efficiency and robustness are separate,
when actually they are coupled. Teaching it to our students perpetuates that misconception.
CEO Software powers much of computing today, often with great success,
but it is neither inevitable nor harmless in general. This Viewpoint reviews
its origins, offers a canonical example
of its hidden risks, and suggests opportunities for balancing efficiency and
robustness throughout the computational stack.
It is easy to blame the woes of modern computing on flaky hardware, miscreants spreading malware, clueless users clicking, sloppy coders writing buggy code, and companies shipping first and patching later. Certainly in my programming classes I have a stern face for missing error checks and all program ugliness, but the deeper problem is our dependence on the basic von Neumann model of computation: a CPU with RAM, cranking out a vast daisy chain of ideal logical inferences. This is ultimately unscalable, as von Neumann himself observed [11] in 1948.
We have stretched von Neumann’s
model far beyond what he saw for it.
The cracks are showing.
The original concept of general-purpose digital computing amounts to
this: Reliability is a hardware problem;
desirability is a software problem. The
goal was to engineer a “perfect logician” to flawlessly follow a logic ‘recipe’
provided later. Reliability was achieved
through massive redundancy, using
whole wires to carry individual bits,
and amplifiers everywhere to squish
out errors. Analog machines of the day
could compute a desirable result, such
as an artillery firing solution, using fewer than 10 amplifiers. Early digital machines did similar work more flexibly,
but used thousands of amplifiers.
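To see what all that redundancy buys, consider a small simulation. It is purely illustrative; the stage count, noise level, and function names are my inventions, not a model of any real hardware. An 'analog' chain lets small per-stage errors accumulate, while a 'digital' chain re-quantizes to a clean 0 or 1 at every stage, squishing the error back out:

```python
import random

def analog_chain(signal, stages, noise=0.02):
    """Each amplifier stage adds a little error, and the errors accumulate."""
    for _ in range(stages):
        signal += random.gauss(0, noise)
    return signal

def digital_chain(signal, stages, noise=0.02):
    """Same noisy stages, but each one snaps the value back to 0 or 1."""
    for _ in range(stages):
        signal += random.gauss(0, noise)
        signal = 1.0 if signal >= 0.5 else 0.0   # redundancy absorbs the error
    return signal

random.seed(1)
print(analog_chain(1.0, 1000))   # drifts well away from 1.0
print(digital_chain(1.0, 1000))  # almost surely still exactly 1.0
```

After a thousand noisy stages the analog value has wandered far from its starting point, while the digital value is almost surely still exactly 1.0; that reliability is bought with a whole amplifier-backed wire per bit.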
Meanwhile, computer scientists
and programmers devised correct recipes for desirable computations. The
huge costs but modest abilities of early
hardware demanded those recipes
be ruthlessly efficient. CEO Software
was born. Many efficient algorithms
were developed, along with efficiency
enhancements such as keeping the
processor busy with multiple recipes,
sharing memory to speed task interactions, and caching intermediate results to save time.
However, if an intermediate result might, for any reason, be incorrect, reusing it might just make things worse. Sharing resources lets problems cascade between recipes. If efficiency is optimized above all, a single fault can corrupt a machine's behavior arbitrarily. Our crashed machines may be canaries in the coal mine.
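To make the hazard concrete, here is a toy sketch; the names are invented and the fault is injected by hand rather than by a cosmic ray. The 'fast' cache reuses entries blindly, in CEO style; the 'safe' cache pays for one crude form of structured redundancy, storing each value twice and cross-checking on every read:

```python
fast_memo, safe_memo = {}, {}

def fib_fast(n):
    """Efficient: trusts every cached intermediate result absolutely."""
    if n not in fast_memo:
        fast_memo[n] = n if n < 2 else fib_fast(n - 1) + fib_fast(n - 2)
    return fast_memo[n]

def fib_safe(n):
    """Robust: duplicate storage exposes silent corruption on read."""
    if n in safe_memo:
        a, b = safe_memo[n]
        if a == b:
            return a
        del safe_memo[n]          # copies disagree: discard and recompute
    v = n if n < 2 else fib_safe(n - 1) + fib_safe(n - 2)
    safe_memo[n] = (v, v)
    return v

fib_fast(20); fib_safe(20)        # warm both caches through fib(20)

fast_memo[20] ^= 1                # flip one bit in one cached entry
a, b = safe_memo[20]
safe_memo[20] = (a ^ 1, b)        # same fault, injected into the safe cache

print(fib_fast(30))   # wrong: the fault cascades into every value past 20
print(fib_safe(30))   # 832040: the mismatch is caught and the entry repaired
```

One flipped bit silently poisons every fast answer computed after it, while the duplicated entries let the safe version detect the disagreement, discard the corrupted entry, and recompute.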