Approaches for coping with load?
Scaling up or vertical scaling?
Moving to a more powerful machine
Scaling out or horizontal scaling?
Distributing the load across multiple smaller machines
What are Elastic systems? And when are they useful?
Elastic systems automatically add computing resources when they detect a load increase.
They are most useful when load is highly unpredictable.
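As a rough illustration of the elastic idea, the sketch below computes how many machines a scaler might request for a measured load. The function name, the 70% target utilization, and the instance limits are assumptions made for this example, not part of any particular autoscaler.

```python
import math


def desired_instances(current_load_rps: float,
                      capacity_per_instance_rps: float,
                      min_instances: int = 1,
                      max_instances: int = 20,
                      target_utilization: float = 0.7) -> int:
    """Return the instance count needed to serve the load at the target utilization.

    All parameters are hypothetical knobs for this sketch.
    """
    if capacity_per_instance_rps <= 0:
        raise ValueError("capacity_per_instance_rps must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Each instance is only allowed to run at a fraction of its capacity,
    # which leaves headroom for sudden spikes.
    usable_capacity = capacity_per_instance_rps * target_utilization
    needed = math.ceil(current_load_rps / usable_capacity)
    # Clamp to the configured bounds so the scaler never goes to zero or runs away.
    return max(min_instances, min(max_instances, needed))


# Load rises from 50 to 500 requests per second; each machine handles 100.
print(desired_instances(50, 100))   # low load: one instance is enough
print(desired_instances(500, 100))  # higher load: more instances are requested
```

A real elastic system would also smooth the load signal and wait between scaling actions so that it does not oscillate; this sketch leaves that out.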
What are the three design principles for software systems in terms of maintainability?
What is Operability?
Make it easy for operation teams to keep the system running.
Data systems can do the following to make routine tasks easy e.g.
What is Simplicity?
Make it easy for new engineers to understand the system by removing as much complexity as possible.
What is Evolvability?
Make it easy for engineers to make changes to the system in the future.
What are functional and nonfunctional requirements?
What is the difference between Latency and response time?
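Response time is what the client sees, so it is always measured on the client side. Besides the time spent processing the request, it includes network delays and queueing delays.
Latency is the duration that a request spends waiting to be handled.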
When we measure response time, what is a better metric than average response time? And why?
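Percentiles are a better metric than the average (mean) response time, because a percentile tells you what fraction of requests, and hence roughly how many users, actually experienced a given delay. The mean does not describe what a typical user experienced.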
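A small worked example may make the difference concrete. The sample response times below are made up for illustration, and the nearest-rank `percentile` helper is a hypothetical name defined only for this sketch.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number percentiles give exact ranks.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up response times in milliseconds: nine fast requests and one outlier.
response_times_ms = [30, 32, 35, 38, 40, 42, 45, 50, 60, 1500]

mean_ms = statistics.mean(response_times_ms)
median_ms = percentile(response_times_ms, 50)
p99_ms = percentile(response_times_ms, 99)

print(f"mean={mean_ms} ms, median={median_ms} ms, p99={p99_ms} ms")
```

In this data set the single outlier pulls the mean far above what nine out of ten requests experienced, while the median and the 99th percentile each say something specific about a share of the requests.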
What are the common percentile measures used for response time?
Give an example of using the 99.9th percentile as a measure of response time. And why is it not common practice?
Amazon uses the 99.9th percentile for response time requirements for internal services, because the customers with the slowest requests often have the most data on their accounts and are therefore among the most valuable customers.
What accounts for a large part of response times at high percentiles? And why are high percentiles not common practice?
Queueing delays often account for a large part of the response time at high percentiles.
Optimizing very high percentiles is expensive and yields diminishing benefit.
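The queueing effect can be sketched with a tiny simulation of a single worker that handles requests in arrival order. The function name and the arrival and service times are invented for this example, and network delay is ignored, so the "response time" here is only waiting time plus service time.

```python
def simulate_fifo(arrivals_ms, service_ms):
    """Simulate one worker serving requests first-in, first-out.

    Returns a list of (waiting_time, response_time) tuples, one per request.
    """
    results = []
    server_free_at = 0
    for arrival, service in zip(arrivals_ms, service_ms):
        # A request starts only when it has arrived AND the worker is free.
        start = max(arrival, server_free_at)
        finish = start + service
        results.append((start - arrival, finish - arrival))
        server_free_at = finish
    return results


# Five requests arrive 10 ms apart; the second one is slow (200 ms of work).
arrivals = [0, 10, 20, 30, 40]
service = [5, 200, 5, 5, 5]
results = simulate_fifo(arrivals, service)

for i, (wait, response) in enumerate(results):
    print(f"request {i}: waited {wait} ms, response time {response} ms")
```

The requests behind the slow one need only 5 ms of work each, yet their response times are dominated by time spent waiting in the queue. This is the effect that inflates the high percentiles.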
What are SLOs and SLAs?
Service level objectives (SLOs) and service level agreements (SLAs) are contracts that define the expected performance and availability of a service. They set expectations for clients of the service and allow customers to demand a refund if the SLA is not met.
For example, an SLA may state that the service is considered up if its median response time is under 200 ms and its 99th percentile is under 1 s.
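A check against such an SLA could look roughly like the sketch below. The function name `meets_sla` and the sample data are assumptions for illustration. The percentile uses the default interpolation of Python's `statistics.quantiles`, whereas a real SLA would spell out the exact measurement method and time window.

```python
import statistics


def meets_sla(samples_ms, median_limit_ms=200, p99_limit_ms=1000):
    """Return True if the median and 99th percentile are both under their limits.

    The default limits mirror the example SLA: median < 200 ms, p99 < 1 s.
    """
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    median_ms = statistics.median(samples_ms)
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile.
    p99_ms = statistics.quantiles(samples_ms, n=100)[98]
    return median_ms < median_limit_ms and p99_ms < p99_limit_ms


# Made-up measurement windows, in milliseconds.
healthy = [100] * 98 + [150, 900]
slow_tail = [100] * 98 + [150, 5000]
slow_median = [300] * 100

print(meets_sla(healthy))      # both limits respected
print(meets_sla(slow_tail))    # tail latency breaks the p99 limit
print(meets_sla(slow_median))  # typical request is already too slow
```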
What is reliability?
Reliability: The system should work correctly (performing the correct function at the desired level of performance) even in the face of adversity.
What is Scalability?
Scalability: As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
What is maintainability?
Maintainability: People should be able to work on the system productively in the future.
How can human errors be reduced?