What is a distributed system?
A distributed system consists of multiple autonomous computers/processors connected via a network, collaborating to achieve a common goal. Unlike parallel systems, distributed systems feature separated memory, geographic distance, modularity, reliability, scalability, and often heterogeneity. Failures are partial—one node or link may go down without halting the whole system.
List and explain three fundamental challenges in distributed computing.
What are the main motivations for distributed systems?
Scalability, resource sharing, data sharing, reliability via redundancy, cost-effectiveness via commodity machines, geographic distribution, and resilience to local failures.
Explain the CAP theorem and its practical implications.
The CAP theorem states that a distributed data system can simultaneously provide only two of:
Why is latency a critical concern in distributed systems?
Network latency affects both consistency and availability; slow communication can mimic partitions, requiring systems to balance immediate response with up-to-date data, leading to fundamental trade-offs in system design.
How do distributed systems model computation and communication?
Applications consist of sets of processes communicating via channels (messages), with no assumptions about message ordering, delivery time, or shared state. Most models are asynchronous; messages have arbitrary delays, and each process maintains local state and reacts to events.
Distinguish between the “event” and “state” models for distributed computation.
Event models focus on the sequence and causality of events (message sends/receives, local computations), while state models focus on snapshots of system configuration. Event models are more flexible for asynchronous, unpredictable behaviors.
Explain Lamport’s “happened-before” relationship and its importance.
Lamport’s relation expresses potential causality: if event A could have affected event B (via local sequence or message transfer), then A “happened before” B. It enables reasoning about global system order, detecting concurrency, and tracking dependencies for algorithm design.
What are logical clocks, and how do they help track causality?
Logical clocks, like Lamport clocks, assign monotonically increasing timestamps to events. Causally related events always have increasing timestamps, allowing distributed systems to impose partial or total order for coordination and debugging.
What are vector clocks, and how do they extend logical clocks?
Vector clocks use arrays—each process maintains a vector of counters for all processes. When sending/receiving messages, vectors are updated. Vector clocks detect true causal concurrency and allow systems to order events, resolve conflicts, and observe distributed state more precisely.
What is the distributed mutual exclusion problem?
Like the critical section in concurrency, distributed mutual exclusion ensures that multiple distributed processes do not simultaneously enter a critical section. Safety (exclusion), liveness (eventual access), and fairness (order of access) must be guaranteed, but communication delays and failures complicate the solution.
Briefly describe Ricart-Agrawala’s decentralized mutual exclusion algorithm.
A process sends timestamped requests to all others; each responds with OK if not interested in the resource or has a higher timestamp. When a process receives OK from all, it enters the critical section. To release, a process sends OK to all pending requesters. The scheme ensures exclusion and fairness via timestamps and message passing.
What is leader election, and what does the Chang-Roberts algorithm do?
Leader election designates a unique coordinator among processes.
Chang-Roberts: In a ring, processes send election messages with their IDs. Each forwards higher IDs, swallows lower ones. Receiving its own election message declares leadership; leader status then propagates. The highest ID wins. Used for centralized coordination in algorithms.
Explain message ordering in distributed algorithms—causal, FIFO, and total ordering.
What is a global snapshot and what challenge does it solve?
A global snapshot records a consistent state of a distributed system across all processes, usually for debugging, deadlock detection, or checkpointing. The challenge is synchronizing such snapshots without pausing processors or missing in-transit messages.
Summarize the Chandy-Lamport snapshot algorithm.
Processes mark their local state and broadcast “marker” messages, recording incoming messages after marking (to capture in-transit traffic) until markers are received on all channels. After, the joint record describes a consistent, concurrent global state.
What is the distributed consensus problem, and why is it hard?
Consensus means all non-faulty nodes agree on a single value (e.g., commit/abort, leader). In asynchronous networks, consensus is impossible if any process can fail undetectably (FLP impossibility). Requires careful algorithm design under partial synchrony or using failure detectors.
How does Byzantine failure differ from crash/omission failure, and what’s the Byzantine Generals problem?
Byzantine failures involve arbitrary, malicious, or inconsistent process behavior—not just crashing or omitting messages. Byzantine Generals: requires loyal nodes to agree on a plan despite faulty, traitorous participants. Solvable only if number of correct nodes exceeds three times the faulty ones.
Name and explain a real consensus protocol (e.g., Paxos or Raft).
What is Service-Oriented Architecture (SOA), and what are its principle benefits?
SOA decomposes applications into loosely-coupled, network-accessible services with formal contracts.
Advantages: modularity, maintainability, independent deployment, cross-platform interoperability, and enhanced scalability. Services expose APIs, interact via messages, and can be scaled, reused, or replaced without affecting clients.
Compare SOAP-based and REST-based web services (SOA).
How do microservices extend SOA ideas for distributed application architecture?
Microservices break applications into independently deployable, single-responsibility services, each running in its own process and communicating via REST or lightweight messaging. Enables horizontal scaling, polyglot development, continuous deployment, and faster time-to-market.