Reliability
Ability of a system to operate correctly over time, under expected load and conditions.
Fault Tolerance
Ability of a system to continue operating correctly even when some of its components fail.
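
One common way to achieve fault tolerance is redundancy: if one component fails, another takes over. The sketch below is only an illustration under that assumption; `call_with_failover` and the replica callables are hypothetical names, not part of any particular library.

```python
from typing import Callable, Optional, Sequence


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a failed component; move on to the next one
            last_error = exc
    # Every replica failed, so the failure can no longer be masked.
    raise RuntimeError("all replicas failed") from last_error


def broken_replica() -> str:
    raise ConnectionError("replica is down")


def healthy_replica() -> str:
    return "ok"


result = call_with_failover([broken_replica, healthy_replica])
```

The caller still gets a correct answer even though the first replica failed, which is the behavior the definition describes.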
Uptime
Percentage of time a system is operational, measured over a given period.
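
As a rough illustration of the arithmetic, uptime can be computed from the length of the measurement period and the downtime within it. The function names below are hypothetical, and the 30-day period is an assumption chosen for the example.

```python
def uptime_percent(total_seconds: float, downtime_seconds: float) -> float:
    """Return the percentage of the period during which the system was up."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= downtime_seconds <= total_seconds:
        raise ValueError("downtime_seconds must be between 0 and total_seconds")
    return 100.0 * (total_seconds - downtime_seconds) / total_seconds


def allowed_downtime_seconds(target_percent: float, total_seconds: float) -> float:
    """Return how much downtime a given uptime target permits in the period."""
    return total_seconds * (100.0 - target_percent) / 100.0


THIRTY_DAYS = 30 * 24 * 3600  # assumed measurement period, in seconds
budget = allowed_downtime_seconds(99.9, THIRTY_DAYS)
```

Under these assumptions, a 99.9% target over 30 days leaves a downtime budget of about 2592 seconds, or roughly 43.2 minutes.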
Resilience
Ability to recover quickly from failures or degraded performance.
Snapshotting
Saving system state periodically so that it can be restored after a crash.
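
A minimal sketch of snapshotting, assuming the state fits in a JSON-serializable dictionary and is stored in a local file. `save_snapshot` and `load_snapshot` are hypothetical helpers. The snapshot is written to a temporary file and then renamed, so a crash during the write should not leave a half-written snapshot in place.

```python
import json
import os
import tempfile


def save_snapshot(state: dict, path: str) -> None:
    """Write state to path, replacing any previous snapshot atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before the rename
        os.replace(tmp_path, path)  # atomic swap of old and new snapshot
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_snapshot(path: str, default: dict) -> dict:
    """Return the last saved state, or default if no snapshot exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

On restart, a process would call `load_snapshot` first and resume from the returned state; any work done after the last snapshot is lost unless it is also logged elsewhere.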
Idempotency
Property that repeating an operation has the same effect as performing it once; important for safe retries.
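
One way to make a non-idempotent operation safe to retry is to attach a unique request ID and remember the result of each ID already processed. The sketch below assumes an in-memory store; `Account`, `deposit`, and `request_id` are hypothetical names used only for illustration.

```python
class Account:
    """Toy account whose deposits are idempotent per request ID."""

    def __init__(self) -> None:
        self.balance = 0
        self._results: dict[str, int] = {}  # request_id -> balance after that request

    def deposit(self, request_id: str, amount: int) -> int:
        # A retried request returns the stored result without applying it again.
        if request_id in self._results:
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance


account = Account()
first = account.deposit("req-1", 50)
retry = account.deposit("req-1", 50)  # e.g. the client timed out and retried
```

Here the retry returns the same result as the first call and the balance is increased only once, so a client that is unsure whether its request arrived can resend it safely.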