Intro Flashcards

Question 1

Q

Approaches for coping with load?

Answer

A

Scaling up or vertical scaling
Scaling out or horizontal scaling
Using Elastic systems in case of unpredictable load.

Question 2

Q

Scaling up or vertical scaling?

Answer

A

Moving to a more powerful machine

Question 3

Q

Scaling out or horizontal scaling?

Answer

A

Distributing the load across multiple smaller machines

Question 4

Q

What are Elastic systems? And when are they useful?

Answer

A

Systems that automatically add computing resources when they detect a load increase.
Quite useful if load is unpredictable.
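
As a rough illustration of the idea, here is a minimal sketch of the decision an elastic system might make. The function name, the per-instance capacity, and the instance limits are all hypothetical; real autoscalers use richer signals and smoothing.

```python
import math

def desired_instances(current_load_rps, capacity_per_instance_rps,
                      min_instances=1, max_instances=20):
    """Return how many instances are needed for the current load.

    All parameters are illustrative assumptions, not a real autoscaler API.
    """
    needed = math.ceil(current_load_rps / capacity_per_instance_rps)
    # Clamp to the allowed range so the fleet never scales to zero or without bound.
    return max(min_instances, min(max_instances, needed))

# Load (requests per second) rising and then falling again.
for load in (50, 450, 1200, 300):
    print(load, desired_instances(load, capacity_per_instance_rps=100))
```

The same function scales the fleet back down when load drops, which is what makes the approach attractive for unpredictable load.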

Question 5

Q

What are the three design principles for software systems in terms of maintainability?

Answer

A

Operability
Simplicity
Evolvability

Question 6

Q

What is Operability?

Answer

A

Make it easy for operations teams to keep the system running smoothly.

Data systems can make routine tasks easy by, for example:

Providing visibility into the runtime behavior and internals of the system, with good monitoring.
Providing good support for automation and integration with standard tools.
Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”).
Self-healing where appropriate, but also giving administrators manual control over the system state when needed.

Question 7

Q

What is Simplicity?

Answer

A

Make it easy for new engineers to understand the system by removing as much complexity as possible.

Question 8

Q

What is Evolvability?

Answer

A

Make it easy for engineers to make changes to the system in the future.

Question 9

Q

What are functional and nonfunctional requirements?

Answer

A

Functional requirements: what the application should do
Nonfunctional requirements: general properties like security, reliability, compliance, scalability, compatibility and maintainability.

Question 10

Q

What is the difference between latency and response time?

Answer

A

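Response time is what the client sees: the time to process the request (the service time) plus network delays and queueing delays. It should be measured on the client side.
Latency is the duration that a request spends waiting to be handled.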

Question 11

Q

When we measure response time, what is a better metric than average response time? And why?

Answer

A

Percentiles are a better metric than the average response time, because a percentile tells you how many users actually experienced a given delay, whereas the average may not match what any user experienced.

Median (50th percentile or p50): half of user requests are served in less than the median response time, and the other half take longer.
The 95th, 99th and 99.9th percentiles (p95, p99 and p999) are good for figuring out how bad your outliers are.
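
A small sketch of the difference, using made-up response times and a simple nearest-rank percentile (the `percentile` helper and the sample data are assumptions for illustration, not a standard library API):

```python
import math
import statistics

def percentile(sorted_times, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    rank = math.ceil(p * len(sorted_times) / 100)
    return sorted_times[max(rank, 1) - 1]

# Hypothetical response times in milliseconds: 98 fast requests and 2 slow outliers.
response_times_ms = sorted([100] * 98 + [5000, 9000])

mean_ms = statistics.mean(response_times_ms)
p50 = percentile(response_times_ms, 50)
p95 = percentile(response_times_ms, 95)
p99 = percentile(response_times_ms, 99)

print("mean:", mean_ms)  # 238, a delay that no request actually had
print("p50:", p50)       # 100
print("p95:", p95)       # 100
print("p99:", p99)       # 5000
```

The mean is pulled up by two outliers and describes nobody's experience, while the percentiles show that most requests are fast and a small tail is very slow.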

Question 12

Q

What are the common percentile measures used for response time?

Answer

A

Median (50th percentile or p50). Half of user requests are served in less than the median response time.
The 95th, 99th and 99.9th percentiles (p95, p99 and p999) are good for figuring out how bad your outliers are.

Question 13

Q

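Give an example of using the 99.9th percentile as a measure of response time. Why is the 99.99th percentile not common practice?

Answer

A

Amazon states response time requirements for internal services in terms of the 99.9th percentile, because the customers with the slowest requests often have the most data on their accounts, which makes them the most valuable customers.
Optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed too expensive for too little benefit.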

Question 14

Q

What accounts for a large part of response times at high percentiles? And why is optimizing for high percentiles not common practice?

Answer

A

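Queueing delays often account for a large part of the response time at high percentiles.
Reducing response times at very high percentiles is expensive: they are easily affected by random events outside your control, and the benefits diminish.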
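
To see why queueing matters, here is a minimal sketch of a single first-in-first-out server. The function name and the arrival and service times are invented for illustration.

```python
def response_times(arrivals, service_times):
    """Simulate one FIFO server and return each request's response time.

    arrivals and service_times are parallel lists in milliseconds (hypothetical data).
    """
    server_free_at = 0
    results = []
    for arrival, service in zip(arrivals, service_times):
        start = max(arrival, server_free_at)      # wait if the server is still busy
        server_free_at = start + service
        results.append(server_free_at - arrival)  # queueing delay + service time
    return results

# One slow request (100 ms) followed by four fast ones (5 ms each).
print(response_times([0, 10, 20, 30, 40], [100, 5, 5, 5, 5]))
# -> [100, 95, 90, 85, 80]
```

The four fast requests each need only 5 ms of service, yet they see response times of 80 to 95 ms because they queue behind the slow one.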

Question 15

Q

What are SLOs and SLAs?

Answer

A

Service level objectives (SLOs) and service level agreements (SLAs) are contracts that define the expected performance and availability of a service. These metrics set expectations for clients of the service and allow customers to demand a refund if the SLA is not met.

For example, an SLA may state that the service is considered up if its median response time is less than 200 ms and its 99th percentile is under 1 s.
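
A check against thresholds like these (median under 200 ms, 99th percentile under 1 s) could be sketched as follows. The `meets_sla` function and the sample data are hypothetical, and the p99 uses a simple nearest-rank definition; a real monitoring system would typically compute this over a rolling time window.

```python
import math
import statistics

def meets_sla(response_times_ms, median_limit_ms=200, p99_limit_ms=1000):
    """Return True if the median and 99th percentile are both under their limits."""
    ordered = sorted(response_times_ms)
    median = statistics.median(ordered)
    # Nearest-rank 99th percentile.
    p99_rank = math.ceil(99 * len(ordered) / 100)
    p99 = ordered[max(p99_rank, 1) - 1]
    return median < median_limit_ms and p99 < p99_limit_ms

print(meets_sla([150] * 99 + [900]))         # True: median 150 ms, p99 150 ms
print(meets_sla([150] * 98 + [1200, 1500]))  # False: p99 is 1200 ms
```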

Question 16

Q

What is reliability?

Answer

A

Reliability: The system should work correctly (performing the correct function at the desired level of performance) even in the face of adversity.

Question 17

Q

What is Scalability?

Answer

A

Scalability: As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.

Question 18

Q

What is maintainability?

Answer

A

Maintainability: People should be able to work on the system productively in the future.

Question 19

Q

How can human errors be reduced?

Answer

A

Designing systems in a way that minimizes opportunities for error, through well-designed abstractions, APIs, and admin interfaces.
Decoupling the places where people make the most mistakes from the places where they can cause failures, e.g. by providing a fully featured non-production sandbox environment where people can explore and experiment safely, using real data, without affecting real users.
Testing thoroughly at all levels, from unit tests to whole-system integration tests and manual tests, with automated testing wherever possible.
Allowing quick and easy recovery from human errors, to minimize the impact of failure, e.g. by making it easy to roll back configuration changes and by rolling out new code gradually (so bugs do not affect all users).
Setting up detailed and clear monitoring, such as performance metrics and error rates.

(19 cards)