then receives each reply in turn, gathering them in a list. The client-side
code for a server call is reused entirely
as is.
By using worker processes, libraries
are free to use receive expressions as
needed without worrying about blocking their caller. If the caller does not
wish to block, it is always free to spawn
a worker.
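As a minimal sketch of that last point, a caller that does not want to block can hand the blocking work to a freshly spawned worker and collect the reply later. The module name `worker_sketch` and the functions `async_call/1` and `await/2` are hypothetical names chosen for illustration; they are not part of the article's code.

```erlang
-module(worker_sketch).
-export([async_call/1, await/2]).

%% Run Fun (which may block in a receive) in a fresh worker process.
%% Returns a unique reference that tags the eventual reply.
async_call(Fun) ->
    Caller = self(),
    Ref = make_ref(),
    spawn(fun() -> Caller ! {Ref, Fun()} end),
    Ref.

%% Collect the reply tagged with Ref, giving up after Timeout milliseconds.
await(Ref, Timeout) ->
    receive
        {Ref, Result} -> {ok, Result}
    after Timeout ->
        timeout
    end.
```

The caller can do other work between `async_call/1` and `await/2`; only the worker is blocked while `Fun` runs.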
Dangers of concurrency
Though it eliminates shared state,
Erlang is not immune to races. The
server behaviour allows its application
code to execute as a critical section accessing protected data, but it’s always
possible to draw the lines of this protection incorrectly.
Figure 5, for example, illustrates
that if we had implemented sequences with raw primitives to read and
write the counter, we would be just as
vulnerable to races as a shared-state
implementation that forgot to take
locks.
This code is insidious as it will pass
simple unit tests and can perform reliably in the field for a long time before
it silently encounters an error. Both
the client-side wrappers and server-side callbacks, however, look quite
different from those of the correct
implementation. By contrast, an incorrect shared-state program would
look nearly identical to a correct one. It
takes a trained eye to inspect a shared-state program and notice the missing
lock requests.
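For contrast, here is a sketch of a sequence whose read and increment happen in a single step inside the owning process. It is written with plain spawn, send, and receive rather than the article's server behaviour so that it stands alone; the module name `seq_sketch` and its message formats are assumptions made for this example.

```erlang
-module(seq_sketch).
-export([start/0, get_next/1, reset/1]).

%% Start a process that owns the counter.
start() -> spawn(fun() -> loop(0) end).

%% One request, one reply: the read and the increment both happen
%% inside the owning process, so no other client can interleave.
get_next(Seq) ->
    Ref = make_ref(),
    Seq ! {get_next, self(), Ref},
    receive
        {Ref, N} -> N
    end.

%% Asynchronous reset; no reply is needed.
reset(Seq) ->
    Seq ! reset,
    ok.

loop(N) ->
    receive
        {get_next, From, Ref} ->
            From ! {Ref, N},
            loop(N + 1);
        reset ->
            loop(0)
    end.
```

The difference from the race-prone version is visible in the client wrapper: `get_next/1` sends one message and waits for one reply, instead of composing a separate read and write.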
All standard errors in concurrent
programming have their equivalents in
Erlang: races, deadlock, livelock, starvation, and so on. Even with the help
Erlang provides, concurrent programming is far from easy, and the nondeterminism of concurrency means that
it is always difficult to know when the
last bug has been removed.
Testing helps eliminate most gross
errors—to the extent that the test cases model the behaviour encountered
in the field. Injecting timing jitter and
allowing long burn-in times will improve
coverage, but the combinatorial explosion of possible event orderings in a
concurrent system means that no nontrivial application can be tested for all
possible cases.
When reasonable efforts at testing
reach their end, the remaining bugs
are usually heisenbugs, 5 which occur
nondeterministically but rarely. They
can be seen only when some unusual
timing pattern emerges in execution.
They are the bane of debugging since
they are difficult to reproduce, but this
curse is also a blessing in disguise. If
a heisenbug is difficult to reproduce,
then if you rerun the computation, you
might not see the bug. This suggests
that flaws in concurrent programs,
while unavoidable, can have their impact lessened with an automatic retry
mechanism—as long as the impact of
the initial bug event can be detected
and constrained.
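A retry mechanism of this kind might be sketched as follows, assuming the computation can safely be rerun from scratch. The work runs in its own monitored process so that a failure is constrained to that process and detected by the caller. The module name `retry_sketch` and the function `with_retry/2` are hypothetical.

```erlang
-module(retry_sketch).
-export([with_retry/2]).

%% Run Fun in its own monitored process so that a crash is contained
%% there and detected here. Try at most Attempts times.
with_retry(_Fun, 0) ->
    {error, retries_exhausted};
with_retry(Fun, Attempts) when Attempts > 0 ->
    Parent = self(),
    Ref = make_ref(),
    {Pid, MRef} = spawn_monitor(fun() -> Parent ! {Ref, Fun()} end),
    receive
        {Ref, Result} ->
            %% Success: drop the monitor and any pending 'DOWN' message.
            erlang:demonitor(MRef, [flush]),
            {ok, Result};
        {'DOWN', MRef, process, Pid, _Reason} ->
            %% The worker failed before replying: try again.
            with_retry(Fun, Attempts - 1)
    end.
```

This sketch retries on any failure; a real system would also bound how often retries may happen and report the failures it absorbs.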
Figure 5: badsequence.erl.

    %% BAD - race-prone implementation - do not use - BAD
    -module(badsequence).
    -export([make_sequence/0, get_next/1, reset/1]).
    -export([init/0, handle_call/2, handle_cast/2]).

    %% API
    make_sequence() -> server:start(badsequence).

    get_next(Sequence) ->
        N = read(Sequence),
        write(Sequence, N + 1),  % BAD: race!
        N.

    reset(Sequence) -> write(Sequence, 0).

    read(Sequence) -> server:call(Sequence, read).

    write(Sequence, N) ->
        server:cast(Sequence, {write, N}).

    %% Server callbacks
    init() -> 0.
    handle_call(read, N) -> {N, N}.
    handle_cast({write, N}, _) -> N.

Failure and supervision
Erlang is a safe language—all runtime faults, such as division by zero,
an out-of-range index, or sending a
message to a process that has terminated, result in clearly defined behavior, usually an exception. Application
code can install exception handlers
to contain and recover from expected faults, but an uncaught exception
means that the process cannot continue to run. Such a process is said to
have failed.
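As a small illustration, the sketch below installs a handler for one expected fault, the `badarith` error that integer division by zero raises. The module name `fault_sketch` and the function `safe_div/2` are hypothetical.

```erlang
-module(fault_sketch).
-export([safe_div/2]).

%% Integer division with a handler for the one fault we expect.
%% Any other exception is left uncaught and would fail the process.
safe_div(A, B) ->
    try
        {ok, A div B}
    catch
        error:badarith -> {error, badarith}
    end.
```

Calling `fault_sketch:safe_div(1, 0)` returns `{error, badarith}`, whereas evaluating `1 div 0` with no handler in a spawned process terminates that process.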
Sometimes a process can get stuck
in an infinite loop instead of failing
overtly. We can guard against stuck
processes with internal watchdog
processes. These watchdogs make periodic calls to various corners of the
running application, ideally causing
a chain of events that covers all long-lived processes, and fail if they don’t
receive a response within a generous
but finite timeout. Process failure is
the uniform way of detecting errors in
Erlang.
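A watchdog along these lines could be sketched as below. The `{ping, From, Ref}` and `{pong, Ref}` messages are an assumed protocol that the probed process would have to implement, and the module name `watchdog_sketch` is hypothetical.

```erlang
-module(watchdog_sketch).
-export([start/3]).

%% Probe Target every Interval ms; fail if a reply takes longer
%% than Timeout ms. Both values are chosen by the caller.
start(Target, Interval, Timeout) ->
    spawn(fun() -> loop(Target, Interval, Timeout) end).

loop(Target, Interval, Timeout) ->
    Ref = make_ref(),
    Target ! {ping, self(), Ref},
    receive
        {pong, Ref} ->
            timer:sleep(Interval),
            loop(Target, Interval, Timeout)
    after Timeout ->
        %% No reply in time: turn the stuck process into an overt failure
        %% that whoever monitors or links to the watchdog can observe.
        exit({watchdog_timeout, Target})
    end.
```

The watchdog does not try to repair anything itself. It only converts silence into a process failure, which is the signal the rest of the system already knows how to handle.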
Erlang’s error-handling philosophy stems from the observation that
any robust cluster of hardware must
consist of at least two machines, one
of which can react to the failure of the
other and take steps toward recovery. 2
If the recovery mechanism were on the
broken machine, it would be broken,
too. The recovery mechanism must be
outside the range of the failure. In Erlang, the process is not only the unit of
concurrency, but also the range of failure. Since processes share no state, a
fatal error in a process makes its state
unavailable but won’t corrupt the state
of other processes.
Erlang provides two primitives for
one process to notice the failure of
another. Establishing monitoring of
another process creates a one-way notification of failure, and linking two
processes establishes mutual notification. Monitoring is used during temporary relationships, such as a client-server call, and mutual linking is used
for more permanent relationships. By
default, when a fault notification is
delivered to a linked process, it causes
the receiver to fail as well, but a process-local flag can be set to turn fault
notification into an ordinary message
that can be handled by a receive expression.
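The two primitives can be sketched side by side using the built-in functions `spawn_monitor/1`, `spawn_link/1`, and `process_flag/2`. The module name `notify_sketch` and the exit reason `boom` are placeholders for this example.

```erlang
-module(notify_sketch).
-export([monitor_demo/0, link_demo/0]).

%% One-way notification: the monitoring process gets a 'DOWN' message.
monitor_demo() ->
    {Pid, MRef} = spawn_monitor(fun() -> exit(boom) end),
    receive
        {'DOWN', MRef, process, Pid, Reason} -> {down, Reason}
    end.

%% Mutual notification: with trap_exit set, the exit signal from a
%% linked process arrives as an ordinary {'EXIT', Pid, Reason} message
%% instead of causing this process to fail too.
link_demo() ->
    Old = process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> exit(boom) end),
    Result = receive
        {'EXIT', Pid, Reason} -> {exit, Reason}
    end,
    %% Restore the previous setting of the flag.
    process_flag(trap_exit, Old),
    Result.
```

Without the `trap_exit` flag, the linked child's exit with reason `boom` would terminate the calling process as well.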
In general application program-