In complex systems involving software logic and humans making cognitively complex decisions, we need to stop pretending that probabilistic estimates of safety have anything to do with reality, and we should not base our confidence in safety on them. I have examined hundreds of accident reports in my 40 years in system safety engineering. Virtually every accident involved a system with a probabilistic risk assessment showing that the accident could not or would not occur, usually in exactly the way it did happen.
Misconception #3:
The Safety of Components in a Complex System Is a Useful Concept; That Is, We Can Model or Analyze the Safety of Software in Isolation from the Entire System Design
While the components of a more complex system can have hazards (states
that can lead to some type of loss),
these are usually not of great interest
when the component is not the entire
system of interest. For example, a valve in a car or an aircraft can have sharp edges that could lead to abrasions or cuts for those handling it. But the more interesting hazards are always at the system level: the valve's sharp edges are irrelevant to the hazards associated with its role in the inadvertent release of nuclear radiation from a nuclear power plant or the release of noxious chemicals from a chemical plant, for example.
In other words, safety is primarily
a system property and the hazards of
interest are system-level hazards. The
component’s behavior can, of course,
contribute to system hazards, but its
contribution cannot be determined
without considering the behavior of
all the system components as a whole.
Potentially effective approaches to
safety engineering involve identifying
the system hazards and then eliminating or, if that is not possible, preventing or mitigating them at the system
level. The system hazards can usually
be traced down to behavior of the system components, but the reverse is
not true. One cannot show that each
component is safe in isolation and
then use that analysis to conclude the
system as a whole will be safe.
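A tiny sketch can make this concrete. The valve rule and the two system contexts below are hypothetical illustrations (they are not taken from the text): the identical component behavior is safe in one system and hazardous in another, so the question "is this component safe?" has no answer in isolation.

```python
# Hypothetical illustration: the same component-level rule is safe or unsafe
# depending entirely on the system it is embedded in.

def valve_command_on_sensor_fault() -> str:
    """Component-level rule: on loss of the flow sensor, fail to the CLOSED position."""
    return "CLOSE"

# The hazard depends on the surrounding system, not on the valve itself.
contexts = {
    "furnace fuel line": "closing removes the energy source, moving away from the hazard",
    "reactor coolant line": "closing removes cooling, moving toward the hazard",
}

for system, consequence in contexts.items():
    print(f"{system}: component commands {valve_command_on_sensor_fault()} -> {consequence}")
```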
Another way of saying this is that a system component failure is not equivalent to a hazard. Component failures can lead to system hazards, but a component failure is not necessary for a hazard to occur. In addition, even when a component failure does occur, it may not contribute to a system hazard at all. This is simply another way of clarifying Misconception #2 concerning the difference between reliability and safety.
Three Examples of Accidents Due to Unsafe Interactions between System Components
Some Navy aircraft were ferrying missiles from
one point to another. One pilot executed a
planned test by aiming at the aircraft in front
(as he had been told to do) and firing a dummy
missile. Apparently, nobody knew that the
“smart” software was designed to substitute a
different missile if the one that was commanded
to be fired was not in a good position. In this case,
there was an antenna between the dummy missile
and the target, so the software decided to fire a
live missile located in a different (better) position
instead. What aircraft component(s) failed here?
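The pattern can be sketched in a few lines. Everything below (the Station type, the select_missile function, the two-station setup) is a hypothetical illustration of the kind of substitution logic described above, not the actual weapons-system software.

```python
# Hypothetical sketch: when the commanded missile is judged to be poorly
# positioned, the software silently substitutes another station, without
# distinguishing dummy rounds from live ones.
from dataclasses import dataclass

@dataclass
class Station:
    missile_type: str         # "dummy" or "live"
    clear_line_of_fire: bool  # False if something (for example, an antenna) blocks this station

def select_missile(commanded: int, stations: list[Station]) -> int:
    """Return the station to fire: the commanded one, or a 'better' substitute."""
    if stations[commanded].clear_line_of_fire:
        return commanded
    # Flaw: the substitution ignores what kind of missile the crew intended to fire.
    for i, station in enumerate(stations):
        if station.clear_line_of_fire:
            return i
    return commanded

stations = [
    Station("dummy", clear_line_of_fire=False),  # the commanded round, blocked by an antenna
    Station("live", clear_line_of_fire=True),    # a better-positioned station
]
fired = select_missile(commanded=0, stations=stations)
print(stations[fired].missile_type)  # "live": a live missile is fired at the lead aircraft
```

Each piece behaves exactly as specified: the blocked dummy is skipped and a better-positioned missile is chosen. The hazard arises only from the interaction, not from any component failure.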
This loss involved the Mars Polar Lander. It is
necessary to slow the spacecraft down to land
safely. Ways to do this include using the Martian
atmosphere, a parachute and descent engines
(controlled by software). As soon as the spacecraft
lands, the software must immediately shut
down the descent engines to avoid damage to
the spacecraft. Some very sensitive sensors on
the landing legs provide this information. But it
turned out that spurious signals (noise) are generated when the legs are deployed. This expected behavior was not in the software requirements, perhaps because the software was not supposed to be operating at that time; the software engineers, however, decided to start it early to even out the load on the processor. The software concluded the spacecraft had landed and shut down the descent engines while the spacecraft was still 40 meters above the planet's surface. Which spacecraft components failed here?
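A simplified sketch can make the failure mechanism concrete. The DescentController class, the 40-meter enable threshold, and all other names below are a hypothetical reconstruction of the pattern just described, not the actual flight software.

```python
# Hypothetical sketch: a leg-sensor transient at deployment is latched as
# "touchdown," and when the touchdown check is later enabled the engines are
# shut down while the lander is still well above the surface.

ENABLE_TOUCHDOWN_CHECK_BELOW_M = 40.0  # illustrative threshold, an assumption

class DescentController:
    def __init__(self) -> None:
        self.touchdown_latched = False
        self.engines_on = True

    def update(self, altitude_m: float, leg_signals: list[bool]) -> None:
        # Started earlier than originally planned to even out processor load,
        # so it also sees the leg-deployment transient.
        if any(leg_signals):
            self.touchdown_latched = True      # flaw: a spurious signal is never cleared
        if altitude_m <= ENABLE_TOUCHDOWN_CHECK_BELOW_M and self.touchdown_latched:
            self.engines_on = False            # premature engine shutdown

ctrl = DescentController()
ctrl.update(altitude_m=1500.0, leg_signals=[True, False, False])  # leg-deployment noise
ctrl.update(altitude_m=40.0, leg_signals=[False, False, False])
print(ctrl.engines_on)  # False: engines commanded off 40 meters above the surface
```

Nothing in this sketch "fails": the leg sensors behave exactly as the mechanical design predicts, and the software satisfies its (incomplete) requirements, yet their interaction shuts the engines down 40 meters up.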
It is dangerous for an aircraft’s thrust reversers
(which are used to slow the aircraft after it has
touched down) to be activated when the aircraft
is still in the air. Protection is designed into
the software to prevent a human pilot from
erroneously activating the thrust reversers when
the aircraft is not on the ground. Without going
into the details, some of the clues for the software
to determine the plane has landed are weight
on wheels and wheel spinning rate, which for a
variety of reasons did not hold in this case. For
example, the runway was very wet and the wheels
hydroplaned. As a result, the pilots could not
activate the thrust reversers and the aircraft ran
off the end of the runway into a small hill. What
aircraft components failed here?
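The interlock can be sketched as a simple predicate. The structure and thresholds below are illustrative assumptions about the kind of air/ground logic described above, not the actual aircraft implementation.

```python
# Hypothetical sketch of an air/ground interlock: reverse thrust is permitted
# only when the software is convinced the aircraft is on the ground.

WEIGHT_ON_WHEELS_N = 12.0e3   # illustrative per-strut compression threshold
MIN_WHEEL_SPEED_KTS = 72.0    # illustrative wheel spin-up threshold

def reversers_permitted(left_strut_n: float, right_strut_n: float,
                        wheel_speed_kts: float) -> bool:
    on_ground_by_weight = (left_strut_n > WEIGHT_ON_WHEELS_N and
                           right_strut_n > WEIGHT_ON_WHEELS_N)
    on_ground_by_spin = wheel_speed_kts > MIN_WHEEL_SPEED_KTS
    return on_ground_by_weight or on_ground_by_spin

# Wet runway, hydroplaning (non-spinning) wheels, one lightly loaded strut:
# the aircraft is on the runway, yet neither condition is satisfied, so the
# pilots' reverse-thrust command is rejected.
print(reversers_permitted(left_strut_n=20.0e3, right_strut_n=4.0e3,
                          wheel_speed_kts=30.0))  # False
```

With hydroplaning wheels and one lightly loaded strut, the predicate returns False even though the aircraft is on the runway, so protection designed to prevent an unsafe action ends up preventing a necessary one.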
Misconception #4: Software Can Be Shown to Be Safe by Testing, Simulation, or Standard Formal Verification

Testing: Exhaustive testing of software is impossible. The problem can be explained by examining what “exhaustive” might mean in the domain of software testing:
- Inputs: The domain of possible