Aims
To understand the factors that affect the reliability of a system, and to
introduce how software and hardware design faults can be tolerated.
We look at:
- Safety and Dependability
- Reliability, failure and faults
- Failure modes
- Fault prevention and fault tolerance
- N-Version programming
- Dynamic Redundancy
Safety and Reliability
Safety
E.g., measures which increase the likelihood of a
weapon firing when required may well increase
the possibility of its accidental detonation
* In many ways, the only safe airplane is one that
never takes off; however, it is not very reliable
* As with reliability, to ensure that the safety
requirements of an embedded system are met, system
safety analysis must be performed throughout
all stages of its life cycle
Reliability, Failure and Faults
Fault Types
Software Faults
Effect of failure in the US Patriot Missile system
Failure modes
fig 1 & fig 2
Approaches to Achieving Reliable Systems
Fault Prevention
Two stages: fault avoidance and fault removal
* Fault avoidance attempts to limit the introduction of faults during system
construction by:
1. use of the most reliable components within the given cost and performance
constraints
2. use of thoroughly refined techniques for interconnection of components
and assembly of subsystems
3. use of proven design methodologies
4. use of languages with facilities for data abstraction and modularity
5. use of software engineering environments to help manipulate software
components and thereby manage complexity
Fault Removal
Design errors (hardware and software) will exist
* Fault removal: procedures for finding and removing the causes of errors;
- e.g. design reviews, program verification, code inspections and system testing
* System testing can never be exhaustive and remove all potential faults
- A test can only be used to show the presence of faults, not their absence
- It is sometimes impossible to test under realistic conditions
- Most tests are done with the system in simulation mode, and it is difficult to guarantee that the simulation is
accurate
- Requirements errors introduced during the system's development may not manifest themselves until the system
becomes operational
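The point that a test can show the presence of faults but not their absence can be illustrated with a small sketch. The function and test cases below are hypothetical and chosen only for illustration: the test suite passes, yet a fault remains for an input the tests never exercise.

```python
def is_leap_year(year: int) -> bool:
    # Deliberately faulty (hypothetical example): ignores the century rule,
    # so 1900 is wrongly reported as a leap year.
    return year % 4 == 0


def run_tests() -> bool:
    # A small, non-exhaustive test suite; every case happens to pass.
    cases = {2020: True, 2023: False, 2024: True, 2000: True}
    return all(is_leap_year(y) == expected for y, expected in cases.items())


if __name__ == "__main__":
    print("tests pass:", run_tests())
    # The fault is still present for an untested input.
    print("is_leap_year(1900):", is_leap_year(1900))
```

Passing tests here say nothing about the year 1900, which is exactly the input on which the fault shows.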
Failure of Fault Prevention Approach
Levels of Fault Tolerance
Graceful Degradation in an Air Traffic Control System & Redundancy
Hardware Fault Tolerance
* Two types: static (or masking) and dynamic redundancy
* Static: redundant components are used inside a system to hide the
effects of faults; e.g. Triple Modular Redundancy (TMR)
- TMR — 3 identical subcomponents and majority voting circuits; the outputs are compared and
if one differs from the other two, that output is masked out
- Assumes the fault is not common to all components (such as a design error) but is either transient or due to
component deterioration
- Masking faults in more than one component requires N-modular redundancy (NMR)
* Dynamic: redundancy supplied inside a component that indicates
whether its output is in error; it provides an error detection facility only,
so recovery must be provided by another component
- E.g. communications checksums and memory parity bits
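The two kinds of redundancy above can be sketched in software terms. This is a minimal illustration, not a hardware design: `majority_vote`, `parity_bit` and `parity_ok` are hypothetical names, the voter stands in for a TMR voting circuit, and the even-parity check stands in for a memory parity bit.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Static redundancy: return the strict-majority output, or None if there is none."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word contains an odd number of 1 bits."""
    return bin(word).count("1") % 2


def parity_ok(word: int, stored_parity: int) -> bool:
    """Dynamic redundancy: detects a single-bit error but cannot correct it."""
    return parity_bit(word) == stored_parity


if __name__ == "__main__":
    # One of three modules disagrees; its output is masked.
    print(majority_vote([7, 7, 9]))
    # A stored word with one flipped bit fails the parity check.
    stored = 0b1011
    p = parity_bit(stored)
    print(parity_ok(stored ^ 0b0001, p))
```

The voter hides a single faulty output without any further action, whereas the parity check only reports that an error exists and leaves recovery to another component.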
Software Fault Tolerance
N-Version Programming
Design diversity
* The independent generation of N (N > 2) functionally equivalent
programs from the same initial specification
No interactions between groups
* The programs execute concurrently with the same inputs and their
results are compared by a driver process
* The results (VOTES) should be identical; if they differ, the consensus
result, assuming there is one, is taken to be correct
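The driver process described above can be sketched as follows. This is a minimal sketch under stated assumptions: the three versions (`version_a`, `version_b`, `version_c`) and the `driver` function are hypothetical, threads stand in for concurrently executing programs, and a strict majority is used as the consensus rule.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence


def version_a(x: int) -> int:
    return x * x


def version_b(x: int) -> int:
    return x ** 2


def version_c(x: int) -> int:
    # Deliberately faulty version (for illustration only): wrong when x == 3.
    return x * x + (1 if x == 3 else 0)


def driver(versions: Sequence[Callable[[int], int]], x: int) -> Optional[int]:
    """Run every version on the same input and return the consensus vote, if any."""
    with ThreadPoolExecutor(max_workers=len(versions)) as pool:
        votes = list(pool.map(lambda version: version(x), versions))
    value, count = Counter(votes).most_common(1)[0]
    # None signals that no consensus exists among the votes.
    return value if count > len(votes) // 2 else None


if __name__ == "__main__":
    print(driver([version_a, version_b, version_c], 3))
```

With the input 3, the faulty version is outvoted by the other two, so the driver still returns the correct result.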
N-Version Programming
Vote Comparison
Consistent Comparison Problem
Fig 1
* Each version will produce a different but correct result
* The problem occurs even if inexact comparison techniques are used
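The consistent comparison problem can be illustrated with a threshold test. The sketch below is hypothetical: the threshold, the tolerance and the three readings are invented values standing in for the slightly different results that independently written versions might compute from the same input.

```python
from typing import List

THRESHOLD = 100.0  # hypothetical threshold, for illustration only


def decide_exact(reading: float) -> str:
    return "open_valve" if reading > THRESHOLD else "keep_closed"


def decide_inexact(reading: float, tolerance: float = 0.01) -> str:
    # Inexact comparison: readings within `tolerance` below the threshold count as above it.
    return "open_valve" if reading > THRESHOLD - tolerance else "keep_closed"


def votes(decide, readings: List[float]) -> List[str]:
    return [decide(r) for r in readings]


if __name__ == "__main__":
    # Each value is a plausible result for the same input, yet the versions disagree.
    print(votes(decide_exact, [99.999, 100.0, 100.001]))
    # The tolerance only moves the boundary; values near the new boundary still disagree.
    print(votes(decide_inexact, [99.985, 99.989, 99.995]))
```

In both runs two versions vote to keep the valve closed and one votes to open it, even though every reading is an acceptable result on its own.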
N-version programming depends on
Error Detection
Software Dynamic Redundancy
Alternative to static redundancy: four phases
* error detection — no fault tolerance scheme can be utilised until the
associated error is detected
* damage confinement and assessment — to what extent has the
system been corrupted?
- The delay between a fault occurring and the detection of the error means
erroneous information could have spread throughout the system
* error recovery — techniques should aim to transform the corrupted
system into a state from which it can continue its normal operation
(perhaps with degraded functionality)
* fault treatment and continued service — an error is a symptom of a
fault; although the damage is repaired, the fault may still exist
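The four phases above can be sketched for a single sensor reading. This is a minimal sketch under stated assumptions: the `Controller` class, the valid range of 0 to 100 and the choice to isolate the sensor are all hypothetical, and recovery here simply restores the previously saved value.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Controller:
    level: float = 50.0          # last accepted tank level (percent)
    sensor_enabled: bool = True  # cleared once the sensor is judged faulty
    log: List[str] = field(default_factory=list)


def detect_error(level: float) -> bool:
    # Assumed validity check: a level outside 0..100 percent is erroneous.
    return not (0.0 <= level <= 100.0)


def handle_reading(ctl: Controller, reading: float) -> None:
    checkpoint = ctl.level  # state saved before the update
    ctl.level = reading

    # Phase 1: error detection
    if not detect_error(ctl.level):
        return
    ctl.log.append("error detected")

    # Phase 2: damage confinement and assessment
    damaged = ["level"]  # only this field was written since the checkpoint
    ctl.log.append(f"damage confined to {damaged}")

    # Phase 3: error recovery
    ctl.level = checkpoint
    ctl.log.append("state restored")

    # Phase 4: fault treatment and continued service
    ctl.sensor_enabled = False
    ctl.log.append("sensor isolated")


if __name__ == "__main__":
    ctl = Controller()
    handle_reading(ctl, 60.0)
    handle_reading(ctl, 250.0)
    print(ctl.level, ctl.sensor_enabled, ctl.log)
```

Restoring the state repairs the damage, but the last step is still needed because the faulty sensor would otherwise produce the same error again.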
Damage Confinement and Assessment
Error Recovery
Backward Error Recovery (BER)