flashcards Flashcards by Abhishek Verma

What are the three primary concerns for most software systems discussed in the source material?

Reliability, Scalability, and Maintainability.

What is the definition of Reliability in a software system?

The system should continue to work correctly even in the face of adversity such as hardware faults, software faults, or human error.

How is Scalability defined in the context of software systems?

As the system grows in data volume, traffic, or complexity, there should be reasonable ways of dealing with that growth.

What does Maintainability refer to in a software system?

Many different people should be able to work on the system productively over time, both maintaining current behavior and adapting it to new use cases.

In the Twitter example, what is the primary scaling challenge, rather than just tweet volume?

The fan-out, where each user follows many people and is followed by many people.

What is one approach to implementing Twitter’s home timeline that involves maintaining a cache for each user?

When a user posts a tweet, look up all their followers and insert the new tweet into each of their home timeline caches.
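As a rough illustration of the approach on this card, the sketch below does the fan-out at write time. The in-memory dictionaries, the function names, and the 800-entry cache limit are assumptions made for the example, not details from the source.

```python
from collections import defaultdict, deque

# Hypothetical in-memory stand-ins for the follower graph and the per-user caches.
followers = defaultdict(set)                              # author -> ids of their followers
home_timelines = defaultdict(lambda: deque(maxlen=800))   # user -> recent tweets, newest first

def follow(follower, author):
    followers[author].add(follower)

def post_tweet(author, text):
    # Fan-out on write: one cache insert per follower at posting time.
    for user in followers[author]:
        home_timelines[user].appendleft((author, text))

def read_home_timeline(user):
    # Reads are cheap because the timeline was computed ahead of time.
    return list(home_timelines[user])

follow("alice", "bob")
follow("carol", "bob")
post_tweet("bob", "hello")
```

The cost moves to the write path: a user with many followers triggers many cache inserts per tweet.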

What is the primary performance measure for a batch processing system like Hadoop?

Throughput, which is the number of records processed per second or the total time to run a job.

In online systems, what is typically the more important performance metric compared to throughput?

The service’s response time, which is the time between a client sending a request and receiving a response.

What is the key difference between ‘latency’ and ‘response time’?

Response time is what the client sees: the time to process the request plus network and queueing delays. Latency is the duration a request spends waiting to be handled.

Why is the arithmetic mean not a very good metric for ‘typical’ response time?

It doesn’t tell you how many users actually experienced that delay, and a few outliers can skew it heavily.

What metric is usually better than the mean for understanding typical response time, and is also known as the 50th percentile (p50)?

The median.
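A small sketch of why the median describes the typical request better than the mean. The sample response times are made up for illustration, and the nearest-rank percentile helper is one common definition among several.

```python
import math
import statistics

# Hypothetical response times in milliseconds, with one slow outlier.
response_times_ms = [20, 22, 25, 27, 30, 31, 33, 35, 40, 2000]

mean = statistics.mean(response_times_ms)      # pulled far upward by the outlier
median = statistics.median(response_times_ms)  # close to what most requests saw

def percentile(values, p):
    # Nearest-rank percentile: the smallest sample such that at least
    # p percent of all samples are less than or equal to it.
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

p50 = percentile(response_times_ms, 50)
p99 = percentile(response_times_ms, 99)
```

Here nine of the ten requests finished in 40 ms or less, yet the mean is above 200 ms. Note that the nearest-rank p50 always returns an actual sample, so it can differ slightly from `statistics.median`, which averages the two middle values of an even-sized list.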

Percentiles are often used in which kinds of contracts that define the expected performance and availability of a service?

Service level objectives (SLOs) and service level agreements (SLAs).

The effect where a single slow backend call makes an entire end-user request slow is known as _____.

tail latency amplification
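The effect can be estimated with a short calculation. The sketch assumes that every backend call is slow independently with the same probability, which is a simplification introduced here and not a claim from the cards.

```python
def p_request_slow(p_backend_slow, n_backend_calls):
    # The end-user request is slow if at least one of its backend calls is slow.
    # Assumes calls are independent and equally likely to be slow.
    return 1 - (1 - p_backend_slow) ** n_backend_calls

one_call = p_request_slow(0.01, 1)
hundred_calls = p_request_slow(0.01, 100)
```

Under these assumptions, a 1% chance of a slow call becomes roughly a 63% chance of a slow request once the request needs 100 backend calls.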

Distributing load across multiple machines is also known as a _____-nothing architecture.

shared

What term describes systems that can automatically add computing resources when they detect a load increase?

Elastic.

A software project mired in complexity is sometimes described as a _____.

big ball of mud

What is one of the best tools for removing accidental complexity in software?

Abstraction.

Systems that take a large amount of input data, run a job to process it, and produce output data are known as _____ systems.

batch processing

What type of system operates on events shortly after they happen, placing it between online and batch processing?

Stream processing systems.

According to the Unix philosophy, the output of every program is expected to become the _____ of another program.

input

What characteristic feature of Unix tools allows a shell user to wire up inputs and outputs, separating I/O wiring from program logic?

The use of standard input (stdin) and standard output (stdout).

In Hadoop MapReduce, how is each file or file block in the input directory treated, so that a separate map task can process it?

As a separate partition.

The ability to recover from buggy code by re-running a batch job on immutable input has been called _____ fault tolerance.

human

The principle of minimizing _____ is beneficial for Agile software development, as seen in the design of MapReduce jobs.

irreversibility

What is a significant drawback of implementing complex processing jobs using the raw MapReduce APIs?

It is often hard and laborious, requiring implementation of algorithms like joins from scratch.

How do dataflow engines like Spark and Flink typically handle faults without writing all intermediate state to HDFS?

If intermediate state is lost, it is recomputed from other available data, such as a prior stage or the original input.

The _____ model of computation, also known as the Pregel model, is an optimization for batch processing graphs in an iterative style.

bulk synchronous parallel (BSP)

In the Pregel model, how is fault tolerance typically achieved?

By periodically checkpointing the state of all vertices at the end of an iteration to durable storage.

The idea of processing data incrementally as it becomes available over time, rather than in fixed batches, is the core concept of _____.

stream processing

In a log-based message broker, what mechanism is used to achieve load balancing across a group of consumers?

The broker assigns entire partitions to nodes in the consumer group.

A slow message holding up the processing of subsequent messages in the same partition is a form of _____-of-line blocking.

head

What problem can occur with dual writes to separate systems (e.g., a database and a search index) if there is no additional concurrency detection?

A race condition where writes arrive in different orders, leading to inconsistent data.

What is the process of periodically looking for log records with the same key, discarding duplicates, and keeping only the most recent update?

Log compaction.

In a log-structured storage engine, what does an update with a special null value, known as a tombstone, indicate?

It indicates that a key was deleted and should be removed during log compaction.
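The two cards above (log compaction and tombstones) can be sketched together. This is a minimal in-memory illustration: the log is assumed to be a list of `(key, value)` pairs in write order, and `None` stands in for the tombstone.

```python
TOMBSTONE = None  # assumed marker for "this key was deleted"

def compact(log):
    # Keep only the most recent value per key; later entries overwrite earlier ones.
    latest = {}
    for key, value in log:
        latest[key] = value
    # Drop keys whose most recent entry is a tombstone.
    return [(key, value) for key, value in latest.items() if value is not TOMBSTONE]

log = [("a", 1), ("b", 2), ("a", 3), ("b", TOMBSTONE), ("c", 4)]
compacted = compact(log)
```

After compaction only the latest value of `a` and the single value of `c` remain, and `b` is gone entirely.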

In _____, the application logic is explicitly built on the basis of immutable events written to an event log.

event sourcing

Pat Helland stated: 'The truth is the _____. The database is a cache of a subset of the log.'

log

In accounting, how are mistakes corrected in a ledger, illustrating the principle of immutability?

A new transaction is added that compensates for the mistake, rather than erasing or changing the original incorrect transaction.

For what reason might you need to 'rewrite history' in an immutable log, despite the principle of immutability?

For administrative reasons, such as privacy regulations requiring the deletion of personal information.

What is CEP, an approach developed in the 1990s for analyzing event streams?

Complex event processing, which allows specifying rules to search for certain patterns of events in a stream.

In stream analytics, the time interval over which you aggregate data is known as a _____.

window

What are the two options for handling 'straggler' events that arrive after a processing window has been declared complete?

1. Ignore the straggler events, or 2. Publish a correction with the result including the stragglers.

A _____ window has a fixed length, and every event belongs to exactly one window.

tumbling

What type of window has a fixed length but allows windows to overlap to provide smoothing?

A hopping window.
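To make the difference between the two window types concrete, the sketch below assigns an event timestamp to its window or windows. Integer timestamps in seconds, half-open windows `[start, end)`, and windows aligned to multiples of the size or hop are all assumptions of this example.

```python
def tumbling_window(ts, size):
    # Every timestamp falls into exactly one window of length `size`.
    start = ts - ts % size
    return (start, start + size)

def hopping_windows(ts, size, hop):
    # Windows of length `size` start every `hop` seconds, so they overlap
    # when hop < size. Return every window that contains `ts`.
    windows = []
    start = ts - ts % hop          # latest window start that is <= ts
    while start > ts - size:       # window still covers ts
        windows.append((start, start + size))
        start -= hop
    return sorted(windows)

one = tumbling_window(730, 300)
many = hopping_windows(730, 300, 60)
```

With a 5-minute window and a 1-minute hop, each event belongs to five overlapping windows; when the hop equals the size, the hopping window degenerates into a tumbling window.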

What type of join is required when enriching a stream of activity events with user profile information from a database?

A stream-table join.

In a table-table join where both inputs are database changelogs, what is the result?

A stream of changes to the materialized view of the join between the two tables.

To achieve exactly-once semantics in stream processing, all outputs and side effects must happen _____, or none of them must happen.

atomically

A _____ message broker assigns individual messages to consumers, and messages are deleted from the broker once acknowledged.

AMQP/JMS-style

What approach provides a unified query interface to a wide variety of underlying storage engines, also known as a polystore?

Federated databases.

The approach of unifying writes across disparate systems, for example through change data capture and event logs, is known as _____ databases.

unbundling

The design pattern of composing specialized storage and processing systems with application code is also known as the 'database _____' approach.

inside-out

What is the primary trade-off made by caches, indexes, and materialized views?

They shift the boundary between the read path and the write path, doing more work on writes to save effort on reads.

To suppress duplicate requests in the face of network timeouts, what end-to-end mechanism is required?

A unique request ID generated by the client and checked by the server.
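A minimal sketch of this mechanism, assuming a hypothetical deposit service that keeps its processed request IDs in memory. A real system would need to store the IDs durably and atomically with the effect of the request.

```python
import uuid

class Server:
    def __init__(self):
        self.balance = 0
        self.processed = {}  # request_id -> result that was returned the first time

    def deposit(self, request_id, amount):
        # A retried request carries the same ID, so it is recognized and not re-applied.
        if request_id in self.processed:
            return self.processed[request_id]
        self.balance += amount
        self.processed[request_id] = self.balance
        return self.balance

server = Server()
request_id = str(uuid.uuid4())           # generated once by the client
first = server.deposit(request_id, 10)
retry = server.deposit(request_id, 10)   # e.g. the first response was lost in a timeout
```

The retry returns the original result and the balance is only changed once.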

The _____ states that a function can be completely and correctly implemented only with the knowledge and help of the application at the endpoints.

end-to-end argument

What is a transaction that corrects a mistake made by a previous transaction called?

A compensating transaction.

What are the two requirements that the term 'consistency' often conflates, which are better considered separately?

Timeliness and integrity.

Why must data in systems be treated with humanity and respect?

Because many datasets are about people: their behavior, their interests, and their identity.

Replacing the word 'data' with '_____' can reveal the ethical implications of data collection practices.

surveillance

Data models have a profound effect not only on how software is written, but also on how we _____ the problem we are solving.

think about

The relational technique of splitting a document-like structure into multiple tables is known as _____.

shredding

Document databases are sometimes called 'schemaless', but a more accurate term is _____, because the code that reads the data assumes some structure.

schema-on-read

The traditional relational database approach where the schema is explicit and all written data must conform to it is called _____.

schema-on-write

What kind of query language specifies the pattern of the data you want, but not how to achieve that goal?

A declarative query language.

For highly interconnected data, what data model is often the most natural choice?

Graph models.

In a property graph model, what are the two fundamental object types?

Vertices (nodes) and edges (relationships).

What query language for property graphs uses patterns like `(person)-[:BORN_IN]->(location)` to find relationships?

Cypher.

What is a key advantage of graph databases for queries involving variable-length paths, such as finding locations `[:WITHIN*0..]` a country?

They can easily traverse a variable number of edges, which is difficult in SQL where the number of joins is fixed in advance.

The RDF data model represents all information in the form of _____, which are (subject, predicate, object) triples.

statements

What is the standard query language for triple-stores and the RDF data model?

SPARQL.

What does a database use to efficiently find the value for a particular key without scanning the entire dataset?

An index.

What is a common side effect of adding an index to a database?

It usually slows down writes, because the index also needs to be updated.

What is the simple, append-only data file that many databases use internally for writes?

A log.

In a log-structured storage engine, what is the in-memory balanced tree data structure that holds recent writes called?

A memtable.

When a memtable in an LSM-tree gets bigger than a threshold, it is written to disk as a ____ file.

SSTable (Sorted String Table)
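The memtable and SSTable cards can be illustrated with a toy store. This is a simplified sketch: a plain dictionary sorted at flush time stands in for the balanced tree, Python lists stand in for on-disk files, and the class name and threshold are invented for the example.

```python
import bisect

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}        # recent writes (stand-in for an in-memory balanced tree)
        self.sstables = []        # flushed segments, oldest first; each is sorted by key
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) > self.memtable_limit:
            # Flush: write the memtable out as a sorted segment and start a new one.
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Check segments from newest to oldest; binary search works because they are sorted.
        for table in reversed(self.sstables):
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)
db.put("c", 3)   # exceeds the limit, so the memtable is flushed
db.put("a", 9)   # a newer value shadows the one in the flushed segment
```

Reads consult the memtable first and then the segments from newest to oldest, so the most recent write for a key always wins.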

What is the most widely used indexing structure, which organizes data into fixed-size blocks or pages?

The B-tree.

In B-tree terminology, what is the number of references to child pages in one page of the tree called?

The branching factor.

When values are stored directly within an index, it is called a _____ index.

clustered

When an index only stores a reference to the location of a row in a separate heap file, it is called a ____ index.

non-clustered

What does OLTP stand for?

Online Transaction Processing.

What does OLAP stand for?

Online Analytical Processing.

A database used primarily for analytics, separate from OLTP systems, is often called a data _____.

warehouse

In a _____ storage layout, all the values from one row of a table are stored next to each other.

row-oriented

What storage layout is optimized for analytics queries by storing all the values of each column together?

Column-oriented storage.

What is a key advantage of column-oriented storage for analytics queries that only access a few columns?

It only needs to read and parse the data for the columns required by the query, saving a lot of work.

Column-oriented storage is highly effective for _____ because all values in a column have the same type and often low cardinality.

compression

A _____ is a precomputed summary of data, often used to speed up queries in a data warehouse.

materialized view or data cube

When newer code can read data that was written by older code, this is known as _____ compatibility.

backward

When older code can read data that was written by newer code, this is known as _____ compatibility.

forward

In binary encoding formats like Thrift and Protocol Buffers, what critical element identifies each field and cannot be changed without invalidating data?

The field tag number.

How does Avro support schema evolution without using field tags?

It requires the reader to know the exact schema with which the data was written (the writer's schema) and resolves differences against its own schema (the reader's schema).

What is a key advantage of Avro's schema evolution approach for use cases like dumping a relational database?

It is friendlier to dynamically generated schemas, as field tags don't need to be manually assigned.

The RPC model's attempt to make a remote network service look the same as a local function call is called _____.

location transparency

What fundamental problem arises from retrying a failed network request when only the response was lost?

The action may be performed multiple times unless a deduplication (idempotence) mechanism is used.

What is the primary mode of dataflow in a database?

A process writes encoded data, and another process reads it back at some unknown point in the future.

A message broker ensures that a message is delivered to one or more consumers of a named queue or _____.

topic

What concurrency model encapsulates logic in entities that have local state and communicate via asynchronous messages?

The actor model.

In snapshot isolation, an object is visible to a transaction if the transaction that created it had already _____ when the reader's transaction started.

committed

What is the defining characteristic of distributed systems?

The fact that partial failures can occur.

In an asynchronous packet network, if you send a request and don't receive a response, why is it impossible to know the reason?

You cannot distinguish whether the request was lost, the remote node is down, or the response was lost.

What is the usual way of detecting a fault in an asynchronous network?

A timeout.

A short timeout detects faults faster, but carries a higher risk of incorrectly declaring a node dead when it has only suffered a _____.

temporary slowdown

Networks like Ethernet and IP are _____-switched protocols, optimized for bursty traffic, which can lead to queueing and unbounded delays.

packet

What is the name of the protocol used to synchronize computer clocks over a network?

Network Time Protocol (NTP).

What type of clock is guaranteed to always move forward and is suitable for measuring durations like timeouts?

A monotonic clock.

Google's TrueTime API in Spanner explicitly reports the _____ on the local clock, providing an [earliest, latest] possible timestamp.

confidence interval

What must a node do to ensure its lease doesn't expire when acting as a leader?

It must periodically renew the lease before it expires.

What is the term for the CPU time a virtual machine loses while the hypervisor runs other virtual machines?

Steal time.

A _____ property is often informally defined as 'nothing bad happens.'

safety

A _____ property is often informally defined as 'something good eventually happens.'

liveness

To prevent a client with an expired lock from writing to a storage service, the lock service can issue a _____ that increases with each new lock.

fencing token
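A minimal sketch of how a storage service might check fencing tokens. The class names are hypothetical, and lease expiry is left out to keep the example short.

```python
import itertools

class LockService:
    def __init__(self):
        self._tokens = itertools.count(1)

    def acquire(self):
        # Each grant of the lock returns a strictly larger token.
        return next(self._tokens)

class StorageService:
    def __init__(self):
        self.max_token_seen = 0
        self.data = {}

    def write(self, key, value, token):
        # Reject writes carrying a token older than one already processed.
        if token < self.max_token_seen:
            raise PermissionError(f"stale fencing token {token}")
        self.max_token_seen = token
        self.data[key] = value

lock = LockService()
store = StorageService()
old_token = lock.acquire()   # client 1 gets the lock, then pauses past its lease
new_token = lock.acquire()   # client 2 gets the lock afterwards
store.write("file", "from client 2", new_token)
try:
    store.write("file", "from client 1", old_token)
    rejected = False
except PermissionError:
    rejected = True
```

The late write from the first client is refused because its token is lower than the one the storage service has already seen.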

The problem of getting all nodes in a distributed system to agree on something is known as _____.

consensus

What is the term for the situation where two nodes in a single-leader system both believe they are the leader?

Split brain.

What is the strongest consistency model, which makes a system appear as if there were only a single copy of the data?

Linearizability.

What does the 'recency guarantee' of linearizability ensure?

Once a new value has been written or read, all subsequent reads see that new value until it is overwritten again.

For what kinds of features is linearizability required?

Implementing distributed locks, leader election, and uniqueness guarantees (e.g., unique usernames).

In a system with single-leader replication, reads from the _____ or from synchronously updated followers have the potential to be linearizable.

leader

According to the CAP theorem, when a network partition occurs, a system must choose between _____ and total availability.

consistency (linearizability)

The reason many distributed databases drop linearizability is primarily for _____, not fault tolerance.

performance

What type of consistency ensures that if a system observes an effect, it must also have observed its cause?

Causal consistency.

Causality defines a _____ order, not a total order, because concurrent operations are incomparable.

partial

How does linearizability relate to causality?

Linearizability implies causality; any system that is linearizable will preserve causality correctly.

The principle of replicating a database by having every replica process the same writes in the same order is known as _____ replication.

state machine

What is the name of the classic algorithm for achieving atomic transaction commit across multiple nodes?

Two-phase commit (2PC).

What is the primary disadvantage of two-phase commit (2PC)?

It is a blocking protocol; if the coordinator fails, participants must wait for it to recover.

Any consensus algorithm requires at least a _____ of nodes to be functioning correctly in order to assure termination.

majority

Most consensus algorithms actually implement _____ broadcast, deciding on a sequence of values rather than a single value.

total order

Services like ZooKeeper and etcd use consensus to provide features like linearizable atomic operations and _____.

service discovery

The process of shredding can lead to _____ schemas and unnecessarily complicated application code for document-like data.

cumbersome

A change in an application often requires a change to the data it stores; the evolution of data formats over time is called ____.

schema evolution

The problem with a shared-memory architecture for scaling is that the cost grows faster than ____.

linearly

In a shared-nothing architecture, each machine or virtual machine running the database software is called a ____.

node

What are the two primary ways of distributing data across multiple nodes in a shared-nothing architecture?

Replication and partitioning.

Systems of record represent the primary authority or 'truth' for data, whereas _____ systems can be re-created from other sources.

derived data

What possible outcome of an RPC network request does not exist for a local function call?

It may return without a result due to a timeout, leaving the state of the request unknown.

What is the key difference between how JMS/AMQP brokers and log-based brokers handle message delivery and retention?

JMS/AMQP brokers delete messages after they are acknowledged, while log-based brokers retain messages on disk.

What is the purpose of a query optimizer in a database with a declarative query language like SQL?

To decide the most efficient way to execute a query, including which indexes and join methods to use.

In the Unix philosophy, one should use tools in preference to unskilled help to lighten a programming task, even if you have to ____ to build the tools.

detour

What is the primary benefit of the immutability in MapReduce jobs regarding fault tolerance?

If a task fails, it can be safely retried on the same immutable input without causing side effects.

Databases that store and index a wide range of data from scientific experiments, such as particle physics or genomics, often require ____ solutions.

custom

What is the term for a system systematically excluding a person from services based on algorithmic decision-making?

Algorithmic prison.

What is a major advantage of RESTful APIs over custom RPC protocols for public-facing services?

They are good for experimentation, supported by all platforms, and have a vast ecosystem of tools.

What is the primary difference between how a service and a batch processing system are triggered?

A service waits for a client request, while a batch processing system runs a scheduled job on a large set of input data.

To what does 'shredding' refer in the context of storing document-like structures in a relational database?

The process of splitting the hierarchical structure into multiple tables, linked by foreign keys.

What is a primary motivation for using a separate data warehouse for analytics instead of querying an OLTP system directly?

The data warehouse can be optimized for analytic access patterns, preventing interference with OLTP performance.

In column-oriented storage, what technique involves iterating over compressed data in a tight loop that is friendly to CPU caches?

Vectorized execution.

What is a key difference between change data capture (CDC) and event sourcing?

CDC extracts low-level changes from a mutable database, while event sourcing explicitly builds application logic around immutable, high-level events.

If a system uses _____, it is possible to reconstruct the application's state at any point in time by replaying the event log.

event sourcing

What type of transaction is used to correct a mistake, such as refunding an incorrect charge, without altering the original immutable record?

A compensating transaction.

In the context of the CAP theorem, what is a network partition?

A type of network fault where nodes are split into groups that cannot communicate with each other.

What is the main reason that CPU memory is not linearizable across multiple cores?

Performance; enforcing linearizability would require expensive synchronization that slows down memory access.

What is the purpose of a Lamport timestamp in distributed systems?

To generate a total ordering of events that is consistent with causality.
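A short sketch of a Lamport clock, assuming timestamps are `(counter, node_id)` pairs compared first by counter and then by node ID. The class and method names are illustrative.

```python
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        # Called for every local operation; returns that operation's timestamp.
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, timestamp):
        # Called when a message arrives: move forward to the largest counter seen.
        self.counter = max(self.counter, timestamp[0])

a = LamportClock("A")
b = LamportClock("B")

t1 = a.tick()        # A performs an operation and sends its timestamp to B
b.observe(t1)
t2 = b.tick()        # B's operation happens after it saw A's, so it must sort later
t3 = a.tick()        # concurrent with t2; the node ID breaks the tie
```

Because tuples compare element by element, `t1 < t2` reflects the causal dependency, while the order of the concurrent `t2` and `t3` is arbitrary but the same on every node.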

In two-phase commit (2PC), what is the state of a transaction on a participant node after it has voted 'yes' but before it has received the final decision from the coordinator?

The transaction is in-doubt or in a prepared state.

Why is it often impractical to run consensus algorithms with a very large number of voting nodes?

The communication overhead becomes terribly inefficient.

What feature of ZooKeeper allows a client to find out when another client fails or joins the cluster?

Ephemeral nodes and change notifications (watches).

What is a 'big ball of mud' in software engineering?

A term describing a software project that has become very complex and difficult to understand.

What type of join in a stream processor requires maintaining state for events from both input streams within a certain time window?

A stream-stream join.

What is the purpose of using idempotence in a distributed system?

To ensure that performing an operation multiple times has the same effect as performing it once, preventing issues from retries.

How does a log-based message broker achieve parallelism for consumers?

Through partitioning, where different partitions of a topic can be consumed by different consumer instances.

What is the primary advantage of a declarative query language for system maintainability?

It hides implementation details, allowing the database engine to be improved without changing application queries.

What does it mean for a system to be 'elastic'?

It can automatically add computing resources when it detects a load increase.

How can an application with a caching layer (like Memcached) separate from its main database become inconsistent?

If the application code fails to keep the caches and indexes in sync with the main database after a write.

In the end-to-end argument, why are low-level reliability features like TCP checksums not sufficient to ensure end-to-end correctness?

They don't protect against higher-level faults like software bugs or misconfigurations that can still corrupt data.

How can a stream processor use a partitioned log to enforce a uniqueness constraint, such as for usernames?

By routing all requests for a particular username to the same partition and processing them sequentially on a single thread.
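As a rough sketch of this idea, the example below hashes each username to a partition and lets one sequential processor per partition decide who gets the name. The hash choice, the partition count, and the class name are assumptions made for illustration.

```python
import hashlib

NUM_PARTITIONS = 4  # arbitrary for this example

def partition_for(username, num_partitions=NUM_PARTITIONS):
    # A stable hash sends every request for the same username to the same partition.
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

class PartitionProcessor:
    """Handles the requests of one partition one at a time, in log order."""

    def __init__(self):
        self.taken = set()

    def handle(self, username):
        if username in self.taken:
            return "rejected"
        self.taken.add(username)
        return "accepted"

processors = [PartitionProcessor() for _ in range(NUM_PARTITIONS)]

def claim(username):
    return processors[partition_for(username)].handle(username)

first = claim("alice")
second = claim("alice")   # conflicting request lands on the same partition
other = claim("bob")
```

Since all requests for one name are processed in sequence by a single processor, the first request wins deterministically and no cross-partition coordination is needed.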

What is the term for checking the integrity of data in a system to find corruption?

Auditing.

In a _____ architecture, multiple machines with independent CPUs and RAM store data on a shared array of disks.

shared-disk

The problem of losing data in a multi-leader or leaderless replication system due to concurrent writes is known as a ____.

write conflict

Why is the median (p50) a good metric for how long users typically have to wait for a response?

Because half of user requests are served in less than the median response time.

What is a 'runaway process' in the context of system reliability?

A process that uses up a shared resource like CPU time, memory, or disk space, potentially causing cascading failures.

In MapReduce, the process of bringing all related data (e.g., records with the same key) together in the same place is handled by ____.

partitioning, sorting, and merging between the map and reduce phases

A stream of database write events, produced by a leader and applied by followers, is known as a ____.

replication log

What is a 'noisy neighbor' in the context of public clouds and multi-tenant datacenters?

Another customer who is using a lot of shared resources, causing performance degradation and variable latency for your application.

What is a 'leap second' and why can it cause problems for software?

It is an extra second added to UTC to keep it aligned with solar time, which can crash systems that assume a minute is always 60 seconds long.

What are the four properties of a consensus algorithm?

Agreement, integrity, validity, and termination.

What is the 'termination' property of a consensus algorithm?

Every node that does not crash eventually decides on a value.

flashcards

(173 cards)