Critical Systems Flashcards

(119 cards)

1
Q

What is a system?

A

A construct or collection of different elements that together produce results not obtainable by the elements alone.

2
Q

What is a critical system?

A

A system whose failure leads directly to an incident that has an associated loss of some kind.

3
Q

What are some essential properties of critical systems?

A

Safety
Availability
Reliability
Security
Resilience
[Integrity]
[Confidentiality]
-> not all attributes are relevant for a given system

4
Q

What are primary safety-critical systems?

A

Embedded software systems whose failure can cause the associated hardware to fail and directly threaten people.

5
Q

What are secondary safety-critical systems?

A

Systems whose failure results in faults in other (socio-technical) systems, which can then have safety consequences.

6
Q

What is an accident (or mishap)?

A

An unplanned event or sequence of events which results in human death or injury, or in damage to property or the environment.

7
Q

What is a hazard?

A

A condition with the potential for causing or contributing to an accident.

8
Q

What is damage?

A

A measure of the loss resulting from a mishap.

9
Q

What are some approaches for security assurance?

A

Vulnerability avoidance
Attack detection and elimination
Exposure limitation and recovery

10
Q

What is a computer-based system?

A

A socio-technical system.

11
Q

In the pathology of failures, when is a fault defined as active? Dormant?

A

Active: When it produces an error.
Dormant: When it does not produce an error.

12
Q

What is error propagation?

A

When an error successively transforms into other errors.
Chaining of errors.

13
Q

What is a service failure?

A

When an error propagates to the service interface and causes the service to deviate from what it should do. This can also chain.

14
Q

What is fault tolerance?

A

The ability to avoid service failures in the presence of faults.

15
Q

What is fault removal?

A

Reduce the number and severity of faults.

16
Q

What is fault forecasting?

A

Estimates the present number, future incidence and hence the likely consequences of faults.

17
Q

What are two definitions of a dependable system?

A

1: A system that has the ability to deliver a service that can justifiably be trusted.
2: A system that can avoid service failures that are more frequent or more severe than is acceptable.

18
Q

Give an example of a chain software failure.

A
  • error by programmer leads to a dormant fault in the written software
  • upon activation the fault becomes active, producing an error
  • once the error affects the delivered service, a failure occurs
19
Q

What are some verification approaches (system not exercised aka static verification)?

A
  1. System
    - static analysis
    - theorem proving
  2. behaviour model
    - model checking
20
Q

What are some verification approaches (system exercised aka dynamic verification)?

A
  1. Symbolic inputs
    - symbolic execution
  2. Actual inputs
    - testing
21
Q

Why isn’t testing enough?

A
  • Testing can show the presence of bugs, but never their absence (Dijkstra).
  • Developers should be able to provide evidence that their system satisfies given dependability goals.
  • Software certification may still rely primarily on testing.
22
Q

What do we mean by a system cannot be dependable without evidence?

A

Dependability is not merely the absence of defects or of the failures that result from them, but the presence of concrete evidence suggesting that such failures will not occur.

23
Q

What are the building blocks of a critical system?

A
  1. system boundaries, modularity
    - e.g. components and interactions
  2. critical properties and their level of confidence.
    - not associated with a single function; typically cutting across several.
24
Q

How do we define the dependability case?

A
  1. Auditable
    - allows a third party certifier to evaluate it
  2. Complete
    - the argument that the critical properties hold should contain no holes to be filled in by the certifier
  3. Sound
    - the claims, and the reasoning connecting them, must be correct
25
What do we mean by decoupling and simplicity in the dependability mindset?
Localise critical properties to individual components; this makes assurances easier to check locally.
26
What do we mean by formal verification?
- Rigorous techniques for the specification, development, and (manual or automated) verification of software and hardware systems.
- Logic-based (e.g. propositional logic), so that we can formulate what we want to check.
- Includes model checking.
27
What is model checking?
- Allows desired behavioural properties to be verified against a suitable model of the system.
- Completely automatic.
- Offers counterexamples when a property fails.
28
What is the underlying state based model - LTS?
- A set of states connected by transitions.
- Transitions are labelled with elements from an alphabet (actions).
- States denote a snapshot or configuration of the system.
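An LTS can be sketched in a few lines of Python: a dictionary maps (state, action) pairs to successor states, and replaying a trace of actions walks the transitions. The drink-machine states and actions below are invented purely for illustration.

```python
# A minimal labelled transition system (LTS):
# keys are (state, action) pairs, values are successor states.
transitions = {
    ("idle", "coin"): "paid",
    ("paid", "dispense"): "idle",
    ("paid", "refund"): "idle",
}

def run(trace, start="idle"):
    """Replay a trace of actions; return the final state, or None if a step is not allowed."""
    state = start
    for action in trace:
        state = transitions.get((state, action))
        if state is None:
            return None
    return state
```

For example, `run(["coin", "dispense"])` ends back in `"idle"`, while `run(["dispense"])` returns `None` because the machine cannot dispense before payment.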
29
What does dependability mean?
A system can be trusted to give the right service.
30
What is availability?
The system is up and running when needed.
31
What is reliability?
The system gives the correct service again and again.
32
What is safety?
The system does not harm people or the environment.
33
What is security?
The system can resist attacks or mistakes.
34
What is a fault?
A problem in the system (like bad code).
35
What is an error?
A wrong state inside the system.
36
What is a failure?
The system gives the wrong service to users.
37
How do we avoid failures (dependable software)?
Fault avoidance: stop mistakes early. Fault detection/removal: find and fix errors. Fault tolerance: keep working even if faults exist.
38
What is safety in systems?
Work without causing injury, death, or damage.
39
How do we achieve safety?
Hazard avoidance: design so hazards can’t happen. Hazard detection/removal: catch hazards before accidents. Damage limitation: reduce harm if accidents occur.
40
How do we assure security?
Avoid vulnerabilities Detect and stop attacks Limit damage and recover
41
What is the chain of software failure?
Programmer makes a mistake → a fault in the code. The fault gets activated → creates an error state. If the error affects the service → a failure.
42
What happened in the case studies (Denver, St Helena, Renfe)?
Denver: baggage system failed → huge delays. St Helena: airport built but planes couldn’t land safely. Renfe: trains too big for tunnels → wasted money. Common issue: poor planning + weak requirements.
43
What is a dependability case?
Auditable → can be checked by others. Complete → no missing parts. Sound → arguments are correct. Shows why a system can be trusted.
44
What is model checking? (railroad crossing example)
Build a model of system behaviour. Check safety rules (e.g., gates closed when train passes). Automatic → gives counterexamples if rules fail. Helps prove safety before real use.
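A toy version of the railroad-crossing check can be written as an exhaustive breadth-first search over the state space. The model below (train positions, gate behaviour, initial state) is an assumption for illustration, not the course's actual model; note that because this sketchy controller is never *forced* to close the gate, the checker finds a counterexample, which is exactly the kind of output model checking provides.

```python
from collections import deque

# Toy railroad-crossing model: a state is (train, gate),
# with train in {"far", "near", "in"} and gate in {"open", "closed"}.
def next_states(state):
    train, gate = state
    succ = [({"far": "near", "near": "in", "in": "far"}[train], gate)]  # train advances
    if train == "near":
        succ.append((train, "closed"))  # controller may close the gate
    if train == "far":
        succ.append((train, "open"))    # controller may reopen it
    return succ

def check_invariant(initial, successors, safe):
    """Breadth-first state-space search; returns a path to an unsafe state, or None."""
    seen = {initial}
    queue = deque([[initial]])
    while queue:
        path = queue.popleft()
        if not safe(path[-1]):
            return path  # counterexample trace
        for nxt in successors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Safety rule: the gate must be closed whenever the train is in the crossing.
counterexample = check_invariant(
    ("far", "open"), next_states,
    lambda s: not (s[0] == "in" and s[1] == "open"))
```

Here the search returns the trace ("far","open") → ("near","open") → ("in","open"): the train can reach the crossing before the gate ever closes, so the model violates the safety rule.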
45
What is the environment problem in verification?
Systems run in real-world settings. Example: plane thrust only works on ground, not in air. Hard to model humans, traffic, or changing rules. Environment adds uncertainty.
46
What is the socio-technical systems (STS) stack?
Equipment → hardware/devices. OS → basic software. Middleware → connects systems. Applications → do tasks. Business processes → people + systems. Organisation → strategy. Society → laws, culture. All layers affect each other.
47
What are emergent properties?
Appear when parts work together. Examples: reliability, security, usability. Cannot be seen in single components. Only visible in whole system.
48
Why are socio-technical systems non-deterministic?
People act differently each time. Systems change often (software, hardware, data). Same input ≠ same output.
49
What is success vs failure in systems?
Success depends on viewpoint. Example: hospital system → managers happy (reports), doctors unhappy (less patient time). One group’s success = another’s failure.
50
What are normal failures?
Everyday glitches. Cause extra work for users. Not catastrophic, but waste time. Recovery cost = extra effort.
51
What is requirements engineering?
Requirements engineering is the systematic process of identifying, analysing, documenting, validating, and managing stakeholder needs so that the system built actually solves the right problem and meets stakeholder expectations.
52
Why is requirements engineering difficult?
Environments change fast. Stakeholders disagree. People unclear about needs. Politics influence decisions. Hard to get stable, clear requirements.
53
What was the SERUMS healthcare project?
EU project for secure medical data sharing. Patient-focused, GDPR compliant. Used blockchain + privacy-preserving AI. Aim: trustworthy healthcare systems.
54
What is the inevitability of failures in critical systems?
All systems fail at some point. Failures cannot be fully avoided. We must plan for them.
55
Why is modelling the environment hard?
Real world is complex. Traffic, people, weather, rules all change. Hard to capture everything in one model.
56
What are requirement conflicts?
Different groups want different things. One group’s success = another group’s failure. Conflicts never fully go away.
57
What are viewpoints in requirements engineering?
Ways to group needs by stakeholder type. Examples: end‑user, manager, admin, engineer.
58
What are concerns in requirements engineering?
Big issues that affect all. Examples: safety, privacy, cost, usability. Link goals → system needs.
59
What laws affect the MHCPM system?
Data Protection Act → keep info private. Mental Health Act → rules for patient detention.
60
Give one MHCPM safety requirement.
System must warn if patient allergic to medicine. Prescriber can override, but system records it.
61
What is the V‑Model in system engineering?
Step‑by‑step process. From requirements → design → build → test → integrate → validate.
62
Why do requirements change over time?
Tech changes. Organisations change. Markets, politics, laws change. So system needs change too.
63
What is stakeholder uncertainty?
Busy people give vague needs. Hard to get clear, detailed requirements.
64
Why does process variability matter?
Different systems need different detail. Example: railway signals need strict specs. Games need storyboards.
65
What is risk‑driven specification?
Start with risks and find ways to reduce them. Steps: 1. risk identification, 2. risk analysis, 3. risk reduction, 4. risk requirements.
66
What are the phases of risk analysis?
Preliminary → risks from environment. Life cycle → risks from design/build. Operational → risks from users/operators.
67
What is ALARP in risk assessment?
“As Low As Reasonably Practicable.” Keep risks small, within cost/time limits.
68
What is hazard identification?
Find dangers that may harm system. Types: physical, electrical, biological, service failure.
69
What is the main safety rule for insulin pumps?
Never give too much insulin. Overdose is life‑threatening.
70
What is a fault tree?
Diagram of causes of a hazard. Shows how failures combine. Goal: avoid single points of failure.
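A fault tree's AND/OR structure can be evaluated mechanically. The sketch below uses nested tuples for gates; the insulin-pump event names are invented for illustration and are not from the course material.

```python
# A fault tree as nested tuples: ("AND", child, ...), ("OR", child, ...),
# or ("basic", "event-name").
def occurs(gate, events):
    """Evaluate whether the top event occurs, given which basic events are true."""
    kind = gate[0]
    if kind == "basic":
        return events[gate[1]]
    results = [occurs(child, events) for child in gate[1:]]
    return all(results) if kind == "AND" else any(results)

# Hypothetical hazard: an overdose occurs if the dose computation is buggy,
# or if a sensor fault coincides with a missing hardware interlock.
overdose = ("OR",
            ("basic", "dose_computation_bug"),
            ("AND", ("basic", "sensor_fault"), ("basic", "no_hardware_interlock")))
```

A sensor fault alone does not trigger the top event (the AND gate needs the interlock to be absent too), whereas the dose-computation bug is a single point of failure on its own branch.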
71
What are dependability requirements?
Functional → check for errors, recover from attacks. Non‑functional → usability, reliability, availability. Excluding → avoid unsafe system states.
72
How do we check requirements?
Use methods like DISCOS to test if they’re complete and consistent.
73
What are the main risk reduction strategies?
Avoid the risk. Detect & remove faults. Limit damage if failure happens.
74
What is a Safety Decision Support System (SDSS)?
A tool that helps humans judge risks better. It’s itself safety‑critical.
75
What are the 5 levels of driving automation?
L0 → No automation. L1–2 → Driver assist (ADAS). L3 → Conditional automation (car monitors environment, driver still needed). L4 → High automation (car handles most tasks, driver steps in sometimes). L5 → Full automation (no driver needed at all).
76
What are the key reliability metrics?
POFOD → Probability of Failure on Demand; use for on‑demand safety systems (airbags, shutdown systems).
ROCOF → Rate of Occurrence of Failures; use for continuous systems where failure frequency matters.
MTTF → Mean Time To Failure; use when you need time‑to‑failure for hardware or long‑running components.
Availability → % of time the system is up and running; use when service uptime is the key requirement (web, ATC, hospitals).
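Each metric is a simple ratio over observed data, which can be captured in a few helper functions. The numbers in the usage examples are made up for illustration.

```python
def pofod(failures, demands):
    """Probability of Failure on Demand: fraction of demands that failed."""
    return failures / demands

def rocof(failures, operating_time):
    """Rate of Occurrence of Failures per unit of operating time."""
    return failures / operating_time

def mttf(times_to_failure):
    """Mean Time To Failure over a set of observed component lifetimes."""
    return sum(times_to_failure) / len(times_to_failure)

def availability(uptime, downtime):
    """Fraction of total time the system was up."""
    return uptime / (uptime + downtime)
```

For instance, 2 failures in 1000 demands gives `pofod(2, 1000)` = 0.002, and 999 hours up against 1 hour down gives `availability(999, 1)` = 0.999.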
77
How is security different from safety?
Safety → failures are accidental. Security → failures may be caused by attackers who know system weaknesses.
78
What steps are in security risk assessment?
1. Identifying assets, 2. Identifying threats, 3. Analysing vulnerabilities, 4. Evaluating and prioritising risks based on likelihood and impact.
79
What is redundancy vs diversity?
Redundancy → backup copies/components. Diversity → different ways to do the same job, so one bug doesn’t break all versions.
80
What is a Protection System?
A separate system that monitors and shuts down equipment if danger is detected. Example: reactor shutdown system.
81
What is Self‑Monitoring Architecture?
Multiple channels run the same computation. If outputs differ → assume a failure. Used in Airbus flight control.
82
What is Triple Modular Redundancy (TMR)?
Three identical components. They vote on the output. If one disagrees → it is ignored.
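The voting step of TMR can be sketched directly. This is a simplification: real voters must also handle timing and tolerance bands for analogue values.

```python
def tmr_vote(a, b, c):
    """Majority vote over three redundant channel outputs.

    A single disagreeing channel is outvoted; if all three disagree,
    there is no majority and the voter signals an error.
    """
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: all three channels disagree")
```

So `tmr_vote(42, 42, 7)` masks the faulty third channel and returns 42, while three distinct answers raise an error, since voting cannot decide between them.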
83
What is N‑Version Programming?
Several teams build different versions of the same software. A voting system picks the majority result.
84
What are the problems with software diversity?
Teams make similar mistakes. Specification errors affect all versions. Hard to ensure true independence.
85
What is Emergent Behaviour?
Unexpected behaviour that appears when components interact. Not visible when looking at components alone.
86
What are Timed Automata?
LTS + clock variables. Used to model real‑time systems like controllers.
87
What are Priced (or Cost) Timed Automata?
Timed automata with costs such as energy, memory, or usage counts.
88
What is a Discrete‑Time Markov Chain (DTMC)?
A model where transitions have probabilities.
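A DTMC can be represented as a dictionary from each state to its outgoing transition probabilities (which sum to 1), and a distribution over states can be pushed forward one step at a time. The two-state repairable-system model and its probabilities below are illustrative assumptions.

```python
# DTMC: state -> {successor: probability}; probabilities out of each state sum to 1.
chain = {
    "up":   {"up": 0.99, "down": 0.01},   # small chance of failing in each step
    "down": {"up": 0.90, "down": 0.10},   # usually repaired within one step
}

def step(dist, chain):
    """Push a probability distribution over states forward by one transition."""
    out = {}
    for state, p in dist.items():
        for nxt, q in chain[state].items():
            out[nxt] = out.get(nxt, 0.0) + p * q
    return out
```

Starting from certainty that the system is up, `step({"up": 1.0}, chain)` yields 0.99 probability of still being up and 0.01 of being down; iterating `step` approximates longer-run behaviour.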
89
What is a Markov Decision Process (MDP)?
Mix of choices (actions) and probabilities.
90
What is a Hybrid Automaton?
Automaton with continuous variables (e.g., physics).
91
What is SCADA?
Industrial systems that monitor and control infrastructure like water, power, gas.
92
What risks do LLMs introduce?
Hallucinations, indirect prompt injection, jailbreaks.
93
Why are hallucinations dangerous in healthcare?
LLMs may produce confident but false medical statements. This can lead to unsafe treatment.
94
What was the Kegworth Air Disaster?
A 1989 crash of a British Midland Boeing 737‑400 near Kegworth after the crew shut down the wrong engine.
95
What caused the initial problem (kegworth)?
A fan blade failure in the left engine caused vibration, smoke smell, and loss of power. The pilots misinterpreted vibration readings, cockpit cues, and smell location. They shut down the right engine, which was working.
96
What role did human factors play in kegworth?
New aircraft model. Crew trained on the older version. Instruments placed differently; the vibration gauge moved to a new location. Crew relied on habit, not the new layout.
97
How does this relate to the taxonomy of dependable systems (kegworth)?
Fault: fan blade failure. Error: crew believed the wrong engine was failing. Failure: shutting down the working engine → loss of thrust. Hazard: aircraft unable to maintain flight. Accident: crash short of the runway.
98
What lessons relate to N‑version or diversity (kegworth)?
The aircraft relied on pilot judgement rather than diverse independent systems. A second independent diagnostic system could have prevented the wrong engine shutdown.
99
What happened (denver)?
Automated baggage system failed → jams, lost bags, huge delays.
100
What caused it (denver)?
Over‑ambition, late requirement changes, poor integration, no full testing.
101
Who is to blame (denver)?
Shared — airport authority, contractors, airlines, weak project governance.
102
How could it have been prevented (denver)?
Freeze requirements, incremental rollout, proper integration testing, realistic schedule.
103
What exam notions does it illustrate?
Requirements engineering, socio‑technical systems, dependability, emergent failures.
104
What happened (ST HELENA)?
£285m airport built but planes couldn’t land safely due to wind shear.
105
What caused it (ST HELENA)?
Poor environmental modelling, missed hazard, political pressure.
106
Who is to blame (ST HELENA)?
Government planners, consultants, political stakeholders.
107
How could it have been prevented (ST HELENA)?
Full hazard analysis, independent safety review, prototype testing.
108
What exam notions does it illustrate (ST HELENA)?
Safety vs reliability, hazard identification, system boundaries, socio‑technical pressure.
109
What happened (ARIANE)?
The Ariane 5 rocket exploded 37 seconds after launch (1996) due to a software failure in the inertial reference system.
110
What caused it (ARIANE)?
- Reused Ariane 4 software without re‑validating assumptions.
- Integer overflow: a 64‑bit floating‑point → 16‑bit integer conversion failed.
- Unhandled exception → inertial system shut down.
- Backup system failed the same way (identical software).
- Rocket received nonsense attitude data, causing a violent course correction → breakup.
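The hazard at the heart of the failure can be sketched in a few lines: a 64-bit float squeezed into a signed 16-bit integer. The velocity values below are invented for illustration; on Ariane 5 the analogous out-of-range conversion raised an exception that no handler caught.

```python
# Range of a signed 16-bit integer.
INT16_MIN, INT16_MAX = -32768, 32767

def to_int16(value: float) -> int:
    """Convert a float to a 16-bit signed integer, raising if it does not fit."""
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in 16 bits")
    return result
```

A value that was always modest under the Ariane 4 flight profile converts fine (`to_int16(1200.0)` returns 1200), while a larger Ariane 5-style value such as `to_int16(40000.0)` raises `OverflowError`; with no handler, that exception shuts the computation down, as it shut down both inertial reference systems.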
111
Who is to blame (ARIANE)?
- Design team, for reusing software without checking environmental assumptions.
- Management, for not requiring full re‑validation.
- A process failure, not an individual failure:
  - no exception handling,
  - no independent diversity in the backup system,
  - inadequate testing for Ariane 5’s different flight profile.
112
How could it have been prevented (ARIANE)?
- Validate reused software against the new system environment.
- Add exception handling for overflow conditions.
- Use design diversity in backup systems.
- Perform full system‑level testing under Ariane 5 flight conditions.
- Apply risk‑driven specification to identify critical assumptions.
113
What exam notions does it illustrate (ARIANE)?
- Dependability (failure due to an unhandled fault).
- Fault → Error → Failure chain (classic example).
- Fault prevention (bad assumptions).
- Fault tolerance (identical backup system → no diversity).
- Safety‑critical software engineering.
- Requirements engineering (environment assumptions not captured).
- Reuse of components without re‑verification.
- Hazard analysis (integer overflow as an intolerable risk).
- System boundaries (software assumed Ariane 4 flight dynamics).
114
How could it have been prevented (KEGWORTH)?
- Better human‑centred design of engine indicators.
- Clearer warning systems for engine failure.
- Updated training for the new engine behaviour.
- Stronger cockpit–cabin communication protocols.
- Use of formal methods to check assumptions about human interaction and system cues.
115
What does DISCOS stand for?
Distributed, Interacting, Complex, Organisational Systems — a socio‑technical method for analysing accidents by looking at the whole system, not just individuals.
116
What is the main purpose of the DISCOS method?
To understand how organisational structures, communication, tools, and people interact to create conditions for failure, avoiding “blame the operator”.
117
What kinds of failures does DISCOS help reveal?
Latent organisational failures such as poor training, outdated procedures, mismatched assumptions, unclear responsibilities, and weak communication channels.
118
What types of systems is DISCOS most useful for analysing?
Large‑scale socio‑technical systems: aviation, healthcare, rail, defence, banking, and complex IT deployments — anywhere many people + tech interact.
119
How does the Kegworth air accident illustrate the DISCOS method?
Kegworth shows that the crash wasn’t just “pilot error”:
- training was based on older aircraft models,
- cockpit indicators were ambiguous,
- organisational communication didn’t highlight differences in the new engine behaviour,
- procedures didn’t match real‑world cues.
DISCOS reveals how distributed organisational decisions created the conditions in which the pilots’ mistake became likely.