What are the three primary concerns for most software systems discussed in the source material?
Reliability, Scalability, and Maintainability.
What is the definition of Reliability in a software system?
The system should continue to work correctly even in the face of adversity such as hardware faults, software faults, or human error.
How is Scalability defined in the context of software systems?
As the system grows in data volume, traffic, or complexity, there should be reasonable ways of dealing with that growth.
What does Maintainability refer to in a software system?
Many different people should be able to work on the system productively over time, both maintaining current behavior and adapting it to new use cases.
In the Twitter example, what is the primary scaling challenge, rather than just tweet volume?
The fan-out, where each user follows many people and is followed by many people.
What is one approach to implementing Twitter’s home timeline that involves maintaining a cache for each user?
When a user posts a tweet, look up all their followers and insert the new tweet into each of their home timeline caches.
What is the primary performance measure for a batch processing system like Hadoop?
Throughput, which is the number of records processed per second or the total time to run a job.
In online systems, what is typically the more important performance metric compared to throughput?
The service’s response time, which is the time between a client sending a request and receiving a response.
What is the key difference between ‘latency’ and ‘response time’?
Response time is what the client sees (including network and queuing delays), while latency is the duration a request is waiting to be handled.
Why is the arithmetic mean not a very good metric for ‘typical’ response time?
It doesn’t tell you how many users actually experienced that delay, as it can be skewed by outliers.
What metric is usually better than the mean for understanding typical response time, and is also known as the 50th percentile (p50)?
The median.
What are percentiles often used for in service contracts that define expected performance and availability?
Service level objectives (SLOs) and service level agreements (SLAs).
The effect where a single slow backend call makes an entire end-user request slow is known as _____.
tail latency amplification
Distributing load across multiple machines is also known as a _____-nothing architecture.
shared
What term describes systems that can automatically add computing resources when they detect a load increase?
Elastic.
A software project mired in complexity is sometimes described as a _____.
big ball of mud
What is one of the best tools for removing accidental complexity in software?
Abstraction.
Systems that take a large amount of input data, run a job to process it, and produce output data are known as _____ systems.
batch processing
What type of system operates on events shortly after they happen, placing it between online and batch processing?
Stream processing systems.
According to the Unix philosophy, the output of every program is expected to become the _____ of another program.
input
What characteristic feature of Unix tools allows a shell user to wire up inputs and outputs, separating I/O wiring from program logic?
The use of standard input (stdin) and standard output (stdout).
In Hadoop MapReduce, what is each file or file block within the input directory considered, which can be processed by a separate map task?
A separate partition.
The ability to recover from buggy code by re-running a batch job on immutable input has been called _____ fault tolerance.
human
The principle of minimizing _____ is beneficial for Agile software development, as seen in the design of MapReduce jobs.
irreversibility