Erlang
for Concurrent Programming
long burn-in times will help the coverage; the combinatorial explosion of possible event orderings in a concurrent
system means that no nontrivial application can be tested
for all possible cases.
When reasonable efforts at testing reach their end,
the remaining bugs are usually heisenbugs,[5] which occur
nondeterministically but rarely. They can be seen only
when some unusual timing pattern emerges in execution.
They are the bane of debugging since they are difficult to
reproduce, but this curse is also a blessing in disguise. A
heisenbug that is difficult to reproduce is also unlikely to
recur when you rerun the computation. This suggests
that flaws in concurrent programs, while unavoidable,
can have their impact lessened with an automatic retry
mechanism—as long as the impact of the initial bug
event can be detected and constrained.
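
As a rough illustration of such a retry mechanism, the sketch below runs a computation in a fresh, monitored process so that a failure is contained, and reruns it a bounded number of times. The module name retry_demo and the function with_retry/2 are hypothetical names chosen for this example; they are not part of Erlang/OTP.

```erlang
-module(retry_demo).
-export([with_retry/2]).

%% Hypothetical helper: run Fun in its own process and retry on failure.
%% Returns {ok, Result}, or {error, Reason} once Retries extra attempts
%% have also failed.
with_retry(Fun, Retries)
  when is_function(Fun, 0), is_integer(Retries), Retries >= 0 ->
    Parent = self(),
    Tag = make_ref(),
    %% spawn_monitor/1 creates the process and the monitor atomically.
    {Pid, MRef} = spawn_monitor(fun() -> Parent ! {Tag, Fun()} end),
    receive
        {Tag, Result} ->
            %% Drop the monitor and any 'DOWN' message already queued.
            erlang:demonitor(MRef, [flush]),
            {ok, Result};
        {'DOWN', MRef, process, Pid, _Reason} when Retries > 0 ->
            with_retry(Fun, Retries - 1);
        {'DOWN', MRef, process, Pid, Reason} ->
            {error, Reason}
    end.
```

Because each attempt runs in a separate process, a crash cannot corrupt the caller's state. The bound on retries matters: a fault that recurs on every attempt is not a heisenbug, and retrying it forever would only hide it.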
FAILURE AND SUPERVISION
Erlang is a safe language—all runtime faults, such as division by zero, an out-of-range index, or sending a message
to a process that has terminated, result in clearly defined
behavior, usually an exception. Application code can
install exception handlers to contain and recover from
expected faults, but an uncaught exception means that
the process cannot continue to run. Such a process is said
to have failed.
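
The difference between a handled and an uncaught exception can be sketched as follows. The module fault_demo and its functions are hypothetical names for this example; the try ... catch expression, spawn_monitor/1, and the badarith error raised by integer division by zero are standard Erlang.

```erlang
-module(fault_demo).
-export([safe_div/2, crash_is_contained/0]).

%% An expected fault is contained by an exception handler.
safe_div(A, B) ->
    try
        {ok, A div B}
    catch
        error:badarith -> {error, badarith}
    end.

%% An uncaught exception makes the process fail; the observer only
%% sees the failure as a 'DOWN' message and keeps running.
crash_is_contained() ->
    {Pid, MRef} = spawn_monitor(fun() -> 1 div zero() end),
    receive
        {'DOWN', MRef, process, Pid, {badarith, _Stack}} -> child_failed
    after 1000 ->
        timeout
    end.

%% Helper so the compiler does not flag a literal division by zero.
zero() -> 0.
```

Here safe_div(1, 0) returns {error, badarith}, while the process spawned in crash_is_contained/0 has no handler and therefore fails.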
Sometimes a process can get stuck in an infinite loop
instead of failing overtly. We can guard against stuck processes with internal watchdog processes. These watchdogs
make periodic calls to various corners of the running
application, ideally causing a chain of events that covers
all long-lived processes, and fail if they don’t receive a
response within a generous but finite timeout. Process
failure is the uniform way of detecting errors in Erlang.
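
A watchdog of this kind might look like the sketch below. The module name, the {ping, From, Ref} and {pong, Ref} message shapes, and the exit reason {unresponsive, Target} are assumptions made for this example, not an established protocol. A real watchdog would call into the application's own interfaces rather than ping a single process.

```erlang
-module(watchdog_demo).
-export([start_watchdog/3, responsive_loop/0]).

%% Start a monitored watchdog that pings Target every IntervalMs and
%% fails if no reply arrives within TimeoutMs.
start_watchdog(Target, IntervalMs, TimeoutMs) ->
    spawn_monitor(fun() -> watch(Target, IntervalMs, TimeoutMs) end).

watch(Target, IntervalMs, TimeoutMs) ->
    Ref = make_ref(),
    Target ! {ping, self(), Ref},
    receive
        {pong, Ref} ->
            timer:sleep(IntervalMs),
            watch(Target, IntervalMs, TimeoutMs)
    after TimeoutMs ->
        %% Turn "stuck" into an ordinary process failure.
        exit({unresponsive, Target})
    end.

%% A well-behaved process answers pings as part of its main loop.
responsive_loop() ->
    receive
        {ping, From, Ref} ->
            From ! {pong, Ref},
            responsive_loop();
        stop ->
            ok
    end.
```

The watchdog does not try to repair anything itself. It only converts a silent hang into a process failure, which whoever monitors or links to the watchdog can then act on.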
Erlang’s error-handling philosophy stems from the
observation that any robust cluster of hardware must
consist of at least two machines, one of which can react
to the failure of the other and take steps toward recovery.[6]
If the recovery mechanism were on the broken machine,
it would be broken, too. The recovery mechanism must
be outside the range of the failure. In Erlang, the process
is not only the unit of concurrency, but also the range of
failure. Since processes share no state, a fatal error in a
process makes its state unavailable but won’t corrupt the
state of other processes.
Erlang provides two primitives for one process to
notice the failure of another. Establishing monitoring of
another process creates a one-way notification of failure,
and linking two processes establishes mutual notification. Monitoring is used during temporary relationships,
such as a client-server call, and mutual linking is used for
more permanent relationships. By default, when a fault
notification is delivered to a linked process, it causes the
receiver to fail as well, but a process-local flag can be set
to turn fault notification into an ordinary message that
can be handled by a receive expression.
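
The two primitives can be compared side by side in a short sketch. The module and function names are hypothetical; spawn_monitor/1, spawn_link/1, and process_flag(trap_exit, true) are the standard built-ins, and trap_exit is the process-local flag mentioned above.

```erlang
-module(notify_demo).
-export([monitor_example/0, link_example/0]).

%% Monitoring: a one-way notification, delivered as a 'DOWN' message.
monitor_example() ->
    {Pid, MRef} = spawn_monitor(fun() -> exit(oops) end),
    receive
        {'DOWN', MRef, process, Pid, Reason} -> {down, Reason}
    after 1000 ->
        timeout
    end.

%% Linking: mutual notification. With trap_exit set, the exit signal
%% arrives as an ordinary {'EXIT', Pid, Reason} message instead of
%% causing this process to fail as well.
link_example() ->
    OldFlag = process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> exit(oops) end),
    Result = receive
                 {'EXIT', Pid, Reason} -> {exit, Reason}
             after 1000 ->
                 timeout
             end,
    process_flag(trap_exit, OldFlag),
    Result.
```

Without the trap_exit flag, the abnormal exit of the linked process in link_example/0 would terminate the caller too, which is the default behavior described above.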
In general application programming, robust server
deployments include an external “nanny” that will monitor the running operating-system process and restart it if
it fails. The restarted process reinitializes itself by reading
its persistent state from disk and then resumes running.
Any pending operations and volatile state will be lost, but
assuming that the persistent state isn’t irreparably corrupted, the service can resume.
The Erlang version of a nanny is the supervisor behaviour. A supervisor process spawns a set of child processes
and links to them so it will be informed if they fail. A
supervisor uses an initialization callback to specify a
strategy and a list of child specifications. A child specification gives instructions on how to launch a new child.
The strategy tells the supervisor what to do if one of
its children dies: restart that child, restart all children,
or several other possibilities. If the child died from a
persistent condition rather than a bad command or a rare
heisenbug, then the restarted child will just fail again. To
avoid looping forever, the supervisor’s strategy also gives
a maximum rate of restarting. If restarts exceed this rate,
the supervisor itself will fail.
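
A minimal supervisor along these lines might be written as shown below. The module demo_sup, the child id demo_worker, and the toy worker that fails when it receives the atom crash are all hypothetical. The map-based flags and child specification follow the OTP supervisor behaviour, which requires OTP 18 or later.

```erlang
-module(demo_sup).
-behaviour(supervisor).
-export([start_link/0, start_worker/0]).
-export([init/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

%% Initialization callback: returns the strategy and child specifications.
init([]) ->
    SupFlags = #{strategy => one_for_one, % restart only the failed child
                 intensity => 3,          % at most 3 restarts ...
                 period => 10},           % ... within any 10 seconds
    ChildSpecs = [#{id => demo_worker,
                    start => {?MODULE, start_worker, []},
                    restart => permanent,
                    type => worker}],
    {ok, {SupFlags, ChildSpecs}}.

%% Called by the supervisor to launch (or relaunch) the child.
start_worker() ->
    Pid = proc_lib:spawn_link(fun worker_loop/0),
    {ok, Pid}.

%% Toy worker: fails on request so that restarts can be observed.
worker_loop() ->
    receive
        crash -> exit(crashed)
    end.
```

Sending crash to the worker makes it fail, and supervisor:which_children/1 then reports a new pid for demo_worker. If more than three restarts happen within ten seconds, the supervisor gives up and fails itself, which is the maximum restart rate described above.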
Children can be normal behaviour-running processes,
or they can be supervisors themselves, giving rise to a tree
structure of supervision. If a restart fails to clear an error,
then it will trigger a supervisor subtree failure, resulting
in a restart with an even wider scope. At the root of the
supervision tree, an application can choose the overall
strategy, such as retrying forever, quitting, or possibly
restarting the Erlang virtual machine.
Since linkage is bidirectional, a failing server will
notify or fail the children under it. Ephemeral worker
processes are usually spawned linked to their long-lived
parent. If the parent fails, the workers automatically
fail, too. This linking prevents uncollected workers from
accumulating in the system. In a properly written Erlang