A system runs on a single large EC2 instance and crashes during peak demand. Which reliability principles are violated and how should the architecture be improved?
Violated:
• Scale horizontally
• Avoid single point of failure
Fix:
• Replace with multiple smaller instances
• Use load balancing + Auto Scaling
A company has backup systems but has never tested them. During failure, recovery does not work. Which principle is missing and why is it critical?
Test recovery procedures
It is critical because untested recovery plans often fail in real scenarios.
A system detects failure but requires manual intervention to recover. What reliability principle is missing and how should it be implemented?
Automatically recover from failure
Implement using:
• Monitoring + alarms
• Auto recovery mechanisms
• Self-healing systems
A company provisions fixed capacity based on estimates and frequently over/under-provisions resources. What principle fixes this and how?
Stop guessing capacity
Use:
• Monitoring metrics
• Auto Scaling to dynamically adjust resources
A system handles failures well but breaks whenever updates are deployed. Which reliability concept is being ignored?
Manage change in automation
Changes should be:
• Automated
• Controlled
• Tested before deployment