Attributes of dependability
• Safety: absence of harm to people and environment
• Availability: the readiness for correct service
• Integrity: absence of improper system alterations
• Reliability: continuity of correct service
• Maintainability: ability to undergo modifications and repairs
Reliability metric: MTTF (mean time to failure)
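As a rough illustration of the metric, MTTF can be estimated as the mean of observed times to failure. The sketch below also evaluates R(t) = exp(-t / MTTF), which holds only under the added assumption of a constant failure rate (exponentially distributed failures); the notes themselves do not state a failure distribution. The function names and the sample hours are hypothetical.

```python
import math


def mttf_from_observations(times_to_failure):
    """Estimate MTTF as the arithmetic mean of observed times to failure."""
    if not times_to_failure:
        raise ValueError("need at least one observed time to failure")
    return sum(times_to_failure) / len(times_to_failure)


def reliability(t, mttf):
    """Probability of surviving until time t.

    Assumption: constant failure rate, so R(t) = exp(-t / MTTF).
    """
    return math.exp(-t / mttf)


# Hypothetical observations: hours of operation before each unit failed.
observed_hours = [900.0, 1100.0, 1000.0]
mttf = mttf_from_observations(observed_hours)
survival_at_mttf = reliability(mttf, mttf)  # equals exp(-1) under the assumption
```

Under the constant-rate assumption, only about 37% of units are still running at t = MTTF, so MTTF should not be read as a guaranteed lifetime.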
Fault, Error & Failures
• Fault: a defect within the system or a situation that can lead to failure
• Error: manifestation (symptom) of the fault, an unexpected behaviour
• Failure: system not performing its intended function
Effects in time:
Transient / Intermittent / Permanent
Dependability techniques
• Goal of system verification and validation is to remove faults
• Goal of hazard/risk analysis is to focus on more important faults
• Goal of fault tolerance is to reduce effects of errors if they appear: eliminate or delay failures
Fault model: describes the foreseen faults in fault tolerance
Node failures
– Crash
– Omission
– Timing
– Byzantine
Channel failures in distributed systems
– Crash (and potential partitions)
– Message loss
– Message delay
– Erroneous/arbitrary messages
On-line error management
• Detection: by program or its environment
• Mitigation:
  – Fault containment by architectural choices
  – Fault tolerance using redundancy
    • in software (redundancy in space or time)
    • in hardware
    • in data
Static Redundancy
Used all the time (whether an error has appeared or not), just in case…
– SW: N-version programming
– HW: Voting systems
– Data: Parity bits, checksums
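Two of the static techniques above can be sketched in a few lines: a majority voter over replica outputs (as in a voting system or N-version programming) and a single even-parity bit over data. This is a minimal sketch with hypothetical function names, not a specific system's implementation. Both mechanisms run on every call, whether or not an error has occurred.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas."""
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    return value


def even_parity_bit(bits):
    """Parity bit that makes the total number of 1s even."""
    return sum(bits) % 2


def parity_ok(bits_with_parity):
    """True if the word (data bits plus parity bit) has an even number of 1s."""
    return sum(bits_with_parity) % 2 == 0


# Three replicas, one of which returns a wrong value: the voter masks it.
voted = majority_vote([42, 42, 7])

# Data word protected by one parity bit.
word = [1, 0, 1, 1]
stored = word + [even_parity_bit(word)]
corrupted = stored.copy()
corrupted[0] ^= 1  # a single flipped bit is detected, but not located
```

The voter masks a minority of faulty replicas. A single parity bit only detects an odd number of flipped bits and cannot correct them.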
Dynamic Redundancy
Used when error appears and specifically aids the treatment
– SW:
  • Space: Exceptions, Rollback recovery
  • Time: Re-computing a result
– HW: Switching to back-up module
– Data: Self-correcting codes
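The following sketch illustrates dynamic redundancy in software: the recovery path runs only after an error is detected (here, an exception). It first re-computes the result (redundancy in time) and, if the fault persists, switches to a backup (loosely mirroring the hardware back-up module). The names `run_with_recovery` and `make_flaky` are hypothetical helpers for illustration, and catching every `Exception` is a simplification.

```python
def run_with_recovery(primary, backup, attempts=3):
    """Call primary; on error retry up to `attempts` times, then use backup."""
    for _ in range(attempts):
        try:
            return primary()
        except Exception:
            # Error detected: re-compute (time redundancy).
            continue
    # Fault appears permanent: switch to the backup module.
    return backup()


def make_flaky(failures_before_success, result):
    """Build a function that fails a fixed number of times, then succeeds."""
    remaining = {"n": failures_before_success}

    def call():
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise RuntimeError("simulated fault")
        return result

    return call


# Transient fault: disappears after two failures, so a retry is enough.
transient_result = run_with_recovery(make_flaky(2, "primary"), lambda: "backup")

# Longer-lasting fault: all three attempts fail, so the backup takes over.
persistent_result = run_with_recovery(make_flaky(5, "primary"), lambda: "backup")
```

Retrying helps only with transient or intermittent faults. A permanent fault needs the switch to a separate module.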
Byzantine agreement protocol
a