programmer or program analysis tool
in reasonable time.
Designs are generally abstractions
of systems, omitting certain details.
For example, even the most detailed
design may not specify how behaviors
change if the system is incinerated or
crushed. However, an implementation
of the design does have specific reactions to these events (albeit probably
not predictable reactions). Reliability is
the extent to which an implementation
of a design delivers correct behaviors
over time and under varying operating conditions. A system that tolerates
more operating conditions or remains
correct for a longer period of time is
more reliable. Operating conditions include those in the environment (such
as temperature, input values, timing
of inputs, and humidity) but may also
include those in the system itself (such
as fault conditions like failures in communications and loss of power).
A brittle system is one in which small
changes in the operating conditions or
in the design yield incorrect behaviors.
Conversely, a robust system remains
correct with small changes in operating conditions or in design. Making
these concepts mathematically precise
is extremely difficult for most design
languages, so engineers are often limited to intuitive and approximate assessments of these properties.
Requirements
Embedded systems have always been
held to a higher reliability standard
than general-purpose computing systems. Consumers do not expect their
TVs to crash and reboot. They count
on highly reliable cars in which computer controllers have dramatically
improved both reliability and efficiency compared to electromechanical or
manual controllers. In the transition
to CPS, the expectation of reliability
will only increase. Without improved
reliability, CPS will not be deployed
into such applications as traffic control, automotive safety, and health care
in which human lives and property are
potentially at risk.
The physical world is never entirely
predictable. A CPS will not operate in
controlled environments and must be
robust to unexpected conditions and
adaptable to subsystem failures. Engineers face an intrinsic tension between
predictable performance and an unpredictable environment; designing reliable components makes it easier to assemble these components into reliable
systems, but no component is perfectly
reliable, and the physical environment
will inevitably manage to foil reliability
by presenting unexpected conditions.
Given components that are reliable,
how much can designers depend on
that reliability when designing a system? How do they avoid brittle design?
The problem of designing reliable
systems is not new in engineering. Two
basic engineering tools are analysis
and testing. Engineers analyze designs
to predict behaviors under various operating conditions. For this analysis to
work, the properties of interest must
be predictable and yield to such analysis. Engineers also test systems under
various operating conditions. Without
repeatable properties, testing yields incoherent results.
Digital circuit designers have the
luxury of working with a technology
that delivers predictable and repeatable logical function and timing. This
predictability and reliability hold despite the highly random underlying
physics. Circuit designers have learned
to harness intrinsically stochastic
physical processes to deliver a degree
of repeatability and predictability that
is unprecedented in the history of human innovation. Software designers
should be extremely reluctant to give
up what this harnessing of stochastic
physical processes achieves.
The principle that designers must follow
is simple: Components at any level of
abstraction should be made as predictable and repeatable as is technologically feasible. The next level of abstraction
above these components must compensate for any remaining variability
with robust design.
Some successful designs today follow this principle. It is (still) technically
feasible to make predictable gates with
repeatable behaviors that include both
logical function and timing. Engineers
design systems that count on these
behaviors being repeatable. It is more
difficult to make wireless links predictable and repeatable. Engineers compensate one level up, using robust coding
schemes and adaptive protocols.
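To make the idea of compensating one level up concrete, here is a minimal sketch in Python (chosen only for brevity) of a rate-1/3 repetition code, about the simplest robust coding scheme there is. It assumes a link that flips at most one symbol in each group of three. The function names and the deterministic "noisy link" are hypothetical illustrations, and practical wireless systems use far stronger codes.

```python
# Minimal sketch: rate-1/3 repetition code with majority-vote decoding.
# Illustrative stand-in for "robust coding"; all names are hypothetical.

def encode(bits: list[int]) -> list[int]:
    """Repeat every data bit three times: 1 -> 1,1,1 and 0 -> 0,0,0."""
    return [b for b in bits for _ in range(3)]


def decode(symbols: list[int]) -> list[int]:
    """Majority-vote each group of three; corrects one flip per group."""
    return [1 if sum(symbols[i:i + 3]) >= 2 else 0
            for i in range(0, len(symbols), 3)]


def flip_one_per_group(symbols: list[int]) -> list[int]:
    """Deterministic stand-in for an unreliable link: flips exactly one
    symbol in every group of three, so the demo is repeatable."""
    out = list(symbols)
    for g in range(0, len(out), 3):
        out[g + (g // 3) % 3] ^= 1  # vary which position in the group is hit
    return out


message = [1, 0, 1, 1, 0, 0, 1, 0]
received = flip_one_per_group(encode(message))
print(decode(received) == message)  # True: every single-symbol error corrected
```

Two flips within the same group would decode incorrectly, so the code narrows, rather than eliminates, the variability that higher levels must still handle.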
Is it technically feasible to make
software systems that yield predictable
and repeatable properties for a CPS? At
the foundation of computer architecture and programming languages, software is essentially perfectly predictable
and repeatable, if we consider only the
properties expressed by the programming languages. Given an imperative
language with no concurrency, well-defined semantics, and a correct compiler, designers can, with nearly 100%
confidence, count on any computer
with adequate memory to perform exactly what is specified in the program.
The problem of how to ensure reliable and predictable behavior arises
when we scale up from simple programs to software systems, particularly
to CPS. Even the simplest C program is
not predictable and repeatable in the
context of CPS applications because
the design does not express properties
that are essential to the system. It may
execute perfectly, exactly matching its
semantics (to the extent that C has semantics), yet still fail to deliver the properties needed by the system; it could,
for example, miss timing deadlines.
Since timing is not in the semantics
of C, whether or not a program misses
deadlines is irrelevant to determining
whether it has executed correctly but
is very relevant to determining whether
the system has performed correctly. A
component that is perfectly predictable and repeatable turns out not to be
predictable and repeatable in the dimensions that matter. Such lack of predictability and repeatability is a failure
of abstraction.
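The gap between executing correctly and performing correctly can be sketched in a few lines. The example below uses Python for brevity; like C, its language semantics say nothing about execution time. The controller function and the `interference_s` parameter (a stand-in for preemption or other platform effects outside the program text) are hypothetical.

```python
import time


def control_step(samples: list[float]) -> float:
    """Hypothetical controller step: average the latest sensor samples.
    The language semantics fully determine this value."""
    return sum(samples) / len(samples)


def run_with_deadline(samples: list[float], deadline_s: float,
                      interference_s: float = 0.0) -> tuple[float, bool]:
    """Run one step and return (value, deadline_met).

    interference_s is a hypothetical stand-in for platform effects
    (preemption, cache misses, I/O) that the semantics do not describe.
    """
    start = time.perf_counter()
    time.sleep(interference_s)        # timing perturbation the semantics ignore
    value = control_step(samples)     # logically identical on every run
    elapsed = time.perf_counter() - start
    return value, elapsed <= deadline_s


if __name__ == "__main__":
    print(run_with_deadline([1.0, 2.0, 3.0], deadline_s=1.0))
    print(run_with_deadline([1.0, 2.0, 3.0], deadline_s=0.01, interference_s=0.05))
```

On a typical machine the first call meets its deadline and the second misses it, yet both return the same value: by its language semantics the program is correct in both runs, while the system fails in the second.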
The problem of how to ensure predictable and repeatable behavior gets
more difficult as software systems get
more complex. If software designers
step outside C and use operating system primitives to perform I/O or set up
concurrent threads, they immediately
move from essentially perfect predictability and repeatability to wildly nondeterministic behavior that must be
carefully anticipated and reined in by
the software designer. Semaphores,
mutual exclusion locks, transactions,
and priorities are some of the tools
software designers have developed to
attempt to compensate for the loss of
predictability and repeatability.
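As a minimal sketch of one of these compensating tools, the Python code below has several threads increment a shared counter under a mutual exclusion lock (`threading.Lock` from the standard library). The function name and its parameters are hypothetical; the point is that the lock makes the final total repeatable without making the interleaving of threads repeatable.

```python
import threading


def count_with_lock(n_threads: int, increments: int) -> int:
    """Each thread adds `increments` to a shared counter. The lock makes the
    read-modify-write of `counter += 1` indivisible, so the final total is
    repeatable even though the thread interleaving is not."""
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            with lock:  # mutual exclusion around the shared update
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every worker before reading the result
    return counter


if __name__ == "__main__":
    print(count_with_lock(4, 10_000))  # 40000
```

The lock recovers one repeatable property, the total, but the order in which threads acquire it, and hence when each update happens, still varies from run to run.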
But computer scientists must ask
whether the loss of predictability and
repeatability is necessary. No, it is not.
If we find a way to deliver predictable