Distributed Coordination Service
A Distributed Coordination Service is a system that helps multiple distributed applications or nodes work together correctly, especially in environments where failures, timing delays, and concurrency issues are common.
Why It’s Needed
In a distributed system (many servers working together), you need to coordinate shared tasks, such as:
a) Leader election: Choose one node as the master or coordinator.
b) Service discovery: Help services find each other dynamically.
c) Metadata coordination: Track who is doing what and where.
d) Distributed locking: Prevent two nodes from doing the same task at the same time.
Without coordination, systems might:
Duplicate work
Get out of sync
Corrupt shared resources
How It Works (Simple Example)
Leader Election Example:
5 microservices want to pick one leader
They all try to acquire a lock from ZooKeeper
Only one succeeds → becomes the leader
Others wait or act as followers
If the leader fails, another one is elected automatically
Common Coordination Services (example)
Apache ZooKeeper: Most widely used coordination service — strong consistency guarantees.
Flow: leader election with ZooKeeper, step by step
1️⃣ All service instances start
Service-1
Service-2
Service-3
Service-4
Service-5
Each instance connects to ZooKeeper.
2️⃣ All instances try to become leader (lock)
They try to create an ephemeral node, for example:
/leaders/inventory-cleanup
ZooKeeper guarantees:
Only one instance succeeds
That instance becomes the leader
Others become followers
✅ No race conditions
✅ No duplicates
3️⃣ Leader runs the cron job
Leader executes:
cleanup expired inventory
Followers do NOT run the cron job
Followers continue doing:
API handling
Kafka consumption
Normal service work
4️⃣ ✅ Followers set watchers on the leader's ephemeral node (so they are notified if it disappears)
5️⃣ Leader failure
If the leader:
Crashes
Loses network
Gets restarted
ZooKeeper:
Detects session loss
Deletes the leader’s ephemeral node automatically
6️⃣ ZooKeeper notifies watchers
ZooKeeper sends a notification to followers
Followers wake up
One of them becomes the new leader
New leader runs the next cron job
✅ Automatic failover
✅ No human intervention
✅ No split-brain
Important:
Followers register watchers once and ZooKeeper notifies them on change.
✅ Watcher is a notification mechanism inside Apache ZooKeeper
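The election-and-failover flow above can be sketched as a toy in-memory simulation. This is NOT the real ZooKeeper API (a real system would use a client library such as kazoo against a ZooKeeper ensemble); the `Registry` class, node paths, and service names are illustrative stand-ins that mimic ZooKeeper's two key guarantees: only one ephemeral-node creation succeeds, and deleting the node fires registered watchers.

```python
# Toy in-memory sketch of ZooKeeper-style leader election.
# "Registry" stands in for the ZooKeeper ensemble; not a real client API.

class Registry:
    def __init__(self):
        self.nodes = {}      # path -> owner (ephemeral nodes)
        self.watchers = {}   # path -> list of callbacks

    def create_ephemeral(self, path, owner):
        # Mimics ZooKeeper's guarantee: only the FIRST creator succeeds.
        if path in self.nodes:
            return False
        self.nodes[path] = owner
        return True

    def watch(self, path, callback):
        self.watchers.setdefault(path, []).append(callback)

    def session_lost(self, owner):
        # Simulate a crashed session: its ephemeral nodes are deleted
        # automatically and watchers are notified.
        for path in [p for p, o in self.nodes.items() if o == owner]:
            del self.nodes[path]
            for cb in self.watchers.pop(path, []):
                cb(path)

LEADER_PATH = "/leaders/inventory-cleanup"
zk = Registry()
events = []

# All five instances race to create the same ephemeral node; one wins.
candidates = ["svc-1", "svc-2", "svc-3", "svc-4", "svc-5"]
winners = [s for s in candidates if zk.create_ephemeral(LEADER_PATH, s)]

# Followers set a watcher on the leader node.
zk.watch(LEADER_PATH, lambda path: events.append("leader gone, re-elect"))

# Leader crashes -> ephemeral node removed -> followers notified.
zk.session_lost(winners[0])
```

After the simulated crash, `events` contains the re-election trigger and the lock path is free again, so a follower can retry `create_ephemeral` and become the new leader.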
Another example:
How ZooKeeper helps in service discovery
Service discovery = finding which service instances are currently alive and where they are (IP, port).
In dynamic systems:
Services scale up/down
IPs change
Containers restart
Hard-coding IPs does not work.
How ZooKeeper helps with service discovery
ZooKeeper acts as a central, consistent registry of who is alive.
Step-by-step flow (very simple)
1️⃣ Service registers itself
When a service instance starts, it registers itself in ZooKeeper by creating an ephemeral node.
Important:
Ephemeral node → auto-removed if service dies
Data contains IP + port
2️⃣ ZooKeeper now knows who is alive
ZooKeeper always has an up-to-date list of active service instances.
No heartbeats needed in your app.
3️⃣ Clients query ZooKeeper
A client service (e.g., Payment Service) asks:
“Who are the active Order Service instances?”
4️⃣ Clients set watchers (important)
Clients set a watcher on:
/services/order-service
Meaning:
“Notify me if instances are added or removed.”
5️⃣ Instance failure or scale event
Instance crashes or shuts down
ZooKeeper deletes its ephemeral node automatically
ZooKeeper notifies watchers
Clients update their local cache.
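The register / query / watch cycle can be sketched the same way. Again this is a toy in-memory model, not the real ZooKeeper client API; the path `/services/order-service` matches the example above, while `ServiceRegistry` and the instance names are illustrative.

```python
# Toy sketch of ZooKeeper-style service discovery: instances register
# ephemeral child nodes under a service path; clients list the children
# and watch for membership changes.

class ServiceRegistry:
    def __init__(self):
        self.children = {}   # service path -> {instance_id: "ip:port"}
        self.watchers = {}   # service path -> list of callbacks

    def register(self, parent, instance_id, address):
        self.children.setdefault(parent, {})[instance_id] = address
        self._notify(parent)

    def deregister(self, parent, instance_id):
        # In real ZooKeeper this happens automatically when the
        # instance's session dies (ephemeral node is deleted).
        self.children.get(parent, {}).pop(instance_id, None)
        self._notify(parent)

    def list_instances(self, parent):
        return dict(self.children.get(parent, {}))

    def watch(self, parent, callback):
        self.watchers.setdefault(parent, []).append(callback)

    def _notify(self, parent):
        for cb in self.watchers.get(parent, []):
            cb(self.list_instances(parent))

PATH = "/services/order-service"
registry = ServiceRegistry()
cache = {}  # the client's local view of live instances

def on_change(instances):
    cache.clear()
    cache.update(instances)

registry.watch(PATH, on_change)                       # client sets a watcher
registry.register(PATH, "order-1", "10.0.0.1:8080")   # instances come up
registry.register(PATH, "order-2", "10.0.0.2:8080")
registry.deregister(PATH, "order-1")                  # simulate a crash
```

After the simulated crash, the client's `cache` holds only the surviving instance, with no polling: the watcher callback updated it.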
Why this works so well
Problem → How ZooKeeper solves it
Dynamic IPs → Central registry
Crashes → Ephemeral nodes
No polling → Watchers
Consistency → Strong guarantees
Split-brain → Single source of truth
What ZooKeeper is NOT doing
❌ Routing traffic
❌ Load balancing
❌ Health checks
What is Service Discovery?
Service discovery is the mechanism by which a service finds the network location (IP/port) of another service at runtime.
In short:
Who is available?
Where are they running?
Service Discovery (core job)
“Which instances exist, and where are they?”
Service name → list of IP:port
Dynamic add/remove as instances change
So a load balancer (LB) integrates with service discovery to get up-to-date IPs in a dynamic environment, and it also performs health checks.
If a service is:
Stuck
Slow
Returning 500s
…but still connected → ZooKeeper thinks it’s alive.
So what kind of “health” is this?
Liveness (process/session alive) → ✅ Yes
Readiness (can serve traffic) → ❌ No
Distributed File Systems
A Distributed File System allows multiple computers (nodes) to store, access, and manage files together as if they are part of a single shared storage system, even though the data is physically spread across many machines.
Key Characteristics:
a) Unified namespace: Appears as a single file system to users and applications.
b) Data spread across nodes: Files are split and stored across multiple servers.
c) Scalability: Can store petabytes by adding more machines.
d) Fault tolerance: Files are replicated across nodes, so even if one fails, data isn’t lost.
Why Use a DFS?
To handle huge datasets (e.g., big data, media, backups)
To share files across machines in a cluster
To improve availability and reliability
To enable parallel processing
e.g., Hadoop HDFS. (Amazon S3 is often mentioned alongside it, but S3 is object storage, not a true DFS; see below.)
How it works:
Imagine you upload a 1 GB video file to a distributed file system:
It splits the file into chunks (e.g., 64 MB each)
Distributes these chunks across multiple servers
Replicates each chunk (e.g., 3 copies) for fault tolerance
When you access the file, it reads the chunks from wherever they are stored, automatically and transparently
What Actually Happens: (Detail)
File is split into chunks
For example, a 1 GB file might be split into 16 chunks of 64 MB each.
Chunks are stored on different nodes
DFS distributes these chunks across multiple machines (for load balancing and fault tolerance).
You request the file
From your point of view, you’re just opening or downloading video.mp4.
DFS automatically finds and retrieves the chunks
It uses a master node / metadata service to locate the chunks (e.g., “Chunk 1 is on Node A, Chunk 2 is on Node C…”).
The chunks are fetched in parallel to speed up the read.
Chunks are merged in order
DFS assembles the chunks back together in the correct sequence before passing the full file to the application or user.
This merging happens automatically — you don’t need to manage it manually.
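The split / place / reassemble cycle can be sketched with a few pure functions. This is a minimal illustration, not any real DFS implementation: chunk sizes are tiny here (real systems use large blocks, e.g. HDFS defaults to 128 MB, 64 MB in older versions), and the round-robin placement is a simplification of real replica-placement policies.

```python
# Minimal sketch of DFS-style chunking: split a file into fixed-size
# chunks, record replica placement in a metadata map, reassemble on read.

CHUNK_SIZE = 4  # bytes; illustratively tiny (HDFS uses 128 MB blocks)

def split(data, chunk_size=CHUNK_SIZE):
    # Slice the byte string into consecutive fixed-size chunks.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place(chunks, nodes, replicas=3):
    # Toy round-robin placement: chunk index -> list of replica nodes.
    # This is what a metadata/name node tracks ("Chunk 1 is on Node A...").
    placement = {}
    for i in range(len(chunks)):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement

def read(chunks):
    # The DFS client fetches chunks (possibly in parallel) and merges
    # them back in order, transparently to the application.
    return b"".join(chunks)

data = b"hello distributed file system"
chunks = split(data)
placement = place(chunks, ["nodeA", "nodeB", "nodeC", "nodeD"])
```

Reading the "file" back is just merging the chunks in index order, which is exactly the transparent reassembly step described above.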
DFS as a System
It is a software system designed to:
Split files into chunks
Distribute them across nodes
Handle replication, fault tolerance, and recovery
Track metadata (file-to-chunk mappings)
The system includes:
Storage nodes (where data chunks are stored)
Metadata/Name nodes (track where everything is)
Protocols for reading/writing files
DFS Service
├── Metadata servers (where is the data?)
└── Storage servers (store the chunks)
Example:
Hadoop Distributed File System.
S3 is object storage, not a DFS.
Object storage stores objects, where each object consists of:
Data
Metadata
a Unique ID (object key)
What each part means (simple)
1️⃣ Data
The actual content: image, video, log file, backup, etc.
2️⃣ Metadata
Information about the data
Examples: content type, size, creation time, custom tags, permissions
3️⃣ Unique ID (Object Key)
A globally unique identifier used to retrieve the object
In systems like Amazon S3, this is the object key (often looks like a path, but it’s just an ID)
Example:
Bucket: fleet-videos
Key: vehicles/1234/crash/clip001.mp4
➡️ That entire string is the unique ID, not a real directory path.
You read/write the whole object at once.
Suppose we have a 1-GB video. In object storage, it is stored as one object that contains:
the entire 1-GB video data,
metadata (such as content type, creation time, permissions, tags), and
a unique object key.
Users access the video by using this object key, which retrieves the whole 1-GB object.
How you access it
Via HTTP APIs (PUT / GET).
Objects are read or written as a whole (no in-place append).
Examples
Amazon S3
Google Cloud Storage
Azure Blob Storage
With S3, if you have a 1-GB object, you can read only a specific portion of it using a range request.
How it works (simple)
You can ask S3:
“Give me bytes 0–256 MB of this object.”
S3 will return only that part, not the entire 1-GB file.
This is done using an HTTP Range header, for example:
Range: bytes=0-268435455
What this is useful for
Parallel downloads (multiple ranges at once)
Resuming interrupted downloads.
Important clarification
You can read partial ranges,
But you cannot modify or append only part of the object.
Any update requires re-uploading the whole object.
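Computing the range headers for a parallel download is simple arithmetic. The helper below is a sketch (the function name `byte_ranges` is illustrative); the header format itself is standard HTTP, and with a real client like boto3 you would pass each string as the `Range` argument to `get_object`.

```python
# Split an object of a given size into HTTP Range headers for parallel
# download. Range byte positions are inclusive on both ends.

def byte_ranges(total_size, part_size):
    """Return 'bytes=start-end' Range headers covering the whole object."""
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + part_size, total_size) - 1  # inclusive end
        ranges.append(f"bytes={start}-{end}")
        start = end + 1
    return ranges

# A 1-GiB object fetched in 256-MiB parts -> four range requests,
# the first being "bytes=0-268435455" as in the example above.
parts = byte_ranges(1024 ** 3, 256 * 1024 ** 2)
```

Each part can then be fetched concurrently and written into the right offset of the output file, which is how parallel downloaders and resume-after-interruption work.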
Important:
Simple Mental Model
Object storage → “Store and retrieve objects by key”
DFS / File system → “Read and write files by path in blocks”
Difference between a database and S3
Database
Store structured data
Support queries, updates, transactions
Optimized for low latency
Object Storage (Amazon S3)
Store unstructured data (videos, images, logs, backups)
Optimized for durability and scale
Not designed for querying or frequent updates
✅ Use a Database when:
You need fast lookups or updates
You need queries, filters, joins
Data changes frequently
📌 Example: users, trips, orders, telemetry metadata
✅ Use S3 when:
You store large files
Data is write-once, read-many
You need massive scale and durability
📌 Example: dashcam videos, images, logs, backups
Why databases usually don’t store images/videos
Databases are optimized for:
Structured data
Fast queries
Frequent updates
Storing large binary files (images, videos) in a DB:
Slows down queries
Increases storage cost
Makes backups and replication heavy
Distributed Messaging Systems
Distributed Messaging Systems are systems that enable asynchronous communication between distributed components (such as microservices, applications, or servers) by sending messages through intermediaries (called brokers or queues). They decouple producers (senders) and consumers (receivers), improving scalability, reliability, and flexibility in large-scale systems.
Key Concepts:
Producer: Sends messages to a message broker
Consumer: Receives messages from a broker
Broker: Middleware that routes, stores, and manages messages
Topic/Queue: Logical channel through which messages are sent and received
Message: The data being transferred (e.g., JSON, XML, byte stream)
e.g., Apache Kafka (pub-sub), Google Pub/Sub (pub-sub)
Benefits
Loose Coupling – Producers and consumers evolve independently
Scalability – Easy to add producers/consumers
Reliability – Retry, acknowledgments, and durability
Asynchronous Processing – Non-blocking communication
Challenges
Ordering Guarantees – Ensuring the order of message processing
Real-World Examples
E-commerce: Order placed → Message to “Order Processing” topic
IoT: Sensors → Kafka → Stream processing
Kafka acts as the broker — it manages communication between producers and consumers.
It has:
Topic: Logical name to organize messages (used by producers and consumers)
Partition: Actual storage unit where messages are written in order
Producer: Sends messages to a topic
Consumer: Subscribes to a topic and reads from its partitions
Kafka flow:
a) Producer prepares the message
Producer sets:
The topic name (e.g., order_created)
Optionally a key (e.g., “user_123”) → used by the partitioner
The value (the actual data/message)
b) Partitioner decides which partition to use
The partitioner logic runs on the producer side, not on the Kafka broker.
Based on the key (if provided), the producer decides:
“I’ll send this to partition 1 of topic order_created.”
c) Producer sends the message to Kafka
The message is sent to the Kafka broker.
Kafka stores the message in that partition
d) Consumer subscribes to the topic
The consumer subscribes to order_created.
Kafka assigns one or more partitions of that topic to this consumer (based on consumer group logic).
Kafka does not push the message — the consumer pulls it.
e) Consumer reads from the assigned partition(s)
Consumer reads messages directly from the partition(s) assigned to it.
It keeps track of offsets — i.e., which messages it has already read.
Correct Kafka consumer flow (final)
1️⃣ Consumer fetches messages from its assigned partition(s)
2️⃣ Consumer processes messages
Business logic
DB writes
Side effects
3️⃣ Consumer commits the offset
Indicates processing is complete
4️⃣ Kafka stores the committed offset
In __consumer_offsets
What this guarantees
At-least-once delivery (default)
If consumer crashes before commit → messages are reprocessed
If consumer crashes after commit → messages are not reprocessed
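The at-least-once guarantee above follows directly from committing the offset only after processing. The toy simulation below (an in-memory list standing in for a partition log; no real Kafka client) shows why a crash before the commit causes the same message to be processed twice on restart.

```python
# Toy simulation of at-least-once delivery: the offset is committed
# only AFTER processing, so a crash before commit -> reprocessing.

log = ["msg0", "msg1", "msg2", "msg3"]  # one partition's append-only log
committed = 0                           # last committed offset (next to read)
processed = []                          # side effects (e.g., DB writes)

def poll(start, batch=2):
    # Fetch up to `batch` messages starting at the committed offset.
    return list(enumerate(log[start:start + batch], start))

def run_consumer(crash_before_commit=False):
    global committed
    for offset, msg in poll(committed):
        processed.append(msg)           # business logic runs first
        if crash_before_commit:
            return                      # crash: offset was never committed
        committed = offset + 1          # commit AFTER successful processing

run_consumer(crash_before_commit=True)  # processes msg0, then "crashes"
run_consumer()                          # restart resumes from offset 0
```

Because the first run crashed before committing, the restart re-fetches from offset 0 and `msg0` appears twice in `processed`: the duplicate is the price of never losing a message, which is why consumers should be idempotent.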
How to ensure ordering of data in Kafka.
Note: Kafka enforces one-consumer-per-partition within a consumer group to guarantee ordered consumption.
Example
Topic: 1 topic
Partitions: 3 (P0, P1, P2)
Consumer Group: 1 group
Consumers: 3 (C1, C2, C3)
Partition assignment
C1 → P0
C2 → P1
C3 → P2
What is allowed
C1 reads from P0
C2 reads from P1
C3 reads from P2
All in parallel
What is NOT allowed
❌ C2 reading from P0 while C1 is reading P0
❌ Two consumers reading the same partition in the same group
Important:
Different consumer groups can read from the same partition at the same time.
Example (very common)
Topic
order-events (3 partitions)
Consumer Groups
Group A → Order Processing
Group B → Analytics
Group C → Audit / Logging
Each group:
Reads all partitions
Has its own offsets
Processes independently
Partition 0 → Group A Consumer 1
→ Group B Consumer 1
→ Group C Consumer 1
No conflict.
Ordering guarantee
Ordering is preserved per partition per consumer group
Different groups do not affect each other.
Important: Example (correct behavior)
Topic
order-events
Partition 1: [msg0, msg1, msg2, msg3]
Consumer Groups
Group A → Order Processing
Group B → Analytics
Group C → Audit
Step-by-step
1️⃣ Message arrives in Partition 1
Partition 1: msg3 (OrderPlaced)
2️⃣ Group A consumer reads msg3
Processes order
Commits offset = 3
Kafka stores:
(Group A, order-events, partition 1) → offset 3
✅ Message still exists in Kafka
3️⃣ Group B consumer reads msg3
Processes analytics
Commits offset = 3
Kafka stores:
(Group B, order-events, partition 1) → offset 3
✅ Same message, different offset tracking
4️⃣ Group C consumer reads msg3
Same thing.
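The per-group offset bookkeeping in steps 2️⃣–4️⃣ can be sketched in a few lines. This is a simplified in-memory model (a real Kafka broker stores these commits in the internal __consumer_offsets topic); the group names match the example above.

```python
# Sketch of per-group offset tracking: the same partition log is read
# by several consumer groups, each committing its own offset, and the
# messages themselves are never deleted by a commit.

partition = ["msg0", "msg1", "msg2", "msg3"]  # order-events, partition 1
offsets = {}                                  # (group, partition) -> offset

def consume_all(group):
    # Resume from this group's committed offset (0 if none yet).
    start = offsets.get((group, 1), 0)
    for offset, msg in enumerate(partition[start:], start):
        # ... group-specific processing (orders / analytics / audit) ...
        offsets[(group, 1)] = offset + 1      # commit after processing
    return offsets.get((group, 1), 0)

for group in ["A", "B", "C"]:
    consume_all(group)
```

All three groups end up with their own committed offset while the partition log is untouched: committing only moves a group's read position, it never deletes data.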
🚫 Why messages are NOT deleted on commit
Kafka is not a queue
Kafka is a log / event store
Messages are deleted only when:
The retention policy expires (time-based or size-based), or
Log compaction removes older records for the same key (on compacted topics)
Kafka guarantees message ordering only within a single partition, not across partitions. Within a consumer group, only one consumer fetches messages from a given partition at a time; no two consumers in the same group can read from the same partition simultaneously.
Within a partition ✅ Strict ordering is preserved
Across partitions ❌ No ordering guarantee
Inside a Partition
Messages are stored in the exact order they are received.
When a consumer reads this partition, it reads messages in order (A → B → C).
🔀 Across Partitions
If the topic has multiple partitions (say 3), and you’re publishing without a key:
Messages go to different partitions (round-robin).
There is no global order across partitions.
Consumers assigned to each partition read in order within their partition, but there’s no guarantee that A → B → C will be processed in that sequence.
How to ensure ordering is maintained:
Use a key when producing messages (e.g., customer ID or order ID)
Kafka always sends messages with the same key to the same partition
This ensures ordering for that key is preserved.
i.e., instead of round-robin, the producer’s partitioner hashes the message key (Kafka’s default partitioner applies a murmur2 hash of the key modulo the partition count):
🔁 Messages with the same key will always be routed to the same partition.
This guarantees ordering for that key ✅.
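The key-to-partition rule is just deterministic hashing. The sketch below uses md5 as a stable stand-in for Kafka's actual murmur2 partitioner (the function name `partition_for` is illustrative); the point is only that the same key always maps to the same partition index.

```python
# Sketch of key-based partitioning: hash the key, take it modulo the
# partition count. Same key -> same partition -> per-key ordering.
# (Kafka's default Java partitioner uses murmur2; md5 is a stand-in.)

import hashlib

NUM_PARTITIONS = 3

def partition_for(key):
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Every message for "user_123" lands on the same partition, so all of
# that user's events are consumed in the order they were produced.
p1 = partition_for("user_123")
p2 = partition_for("user_123")
```

Note the trade-off: a hot key concentrates all its traffic on one partition, so key choice also determines load distribution.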
How Kafka ensures ordering in a single partition
Kafka guarantees ordering within a partition because of two combined rules:
1️⃣ Append-only log with offsets (broker-side)
Each partition is an append-only log
Messages are written sequentially
Kafka assigns a monotonically increasing offset
Example:
Partition P1:
offset 0 → msgA
offset 1 → msgB
offset 2 → msgC
Kafka never reorders messages inside a partition.
2️⃣ One consumer per partition per consumer group (consumer-side)
Kafka ensures:
Only ONE consumer in a consumer group can read from a given partition at a time
So:
No parallel reads of the same partition
An example with and without Kafka
Without Kafka:
E-commerce Order Created
Goal: When a customer places an order:
The inventory must be updated
An email must be sent
The order should be logged for analytics
What Happens:
Customer places an order
OrderService:
Calls Inventory Service: ✅ Success
Calls Email Service: ❌ Fails (e.g., timeout)
Calls Analytics: 🚫 Not reached
Problems:
❌ Inventory is updated, but the customer never gets the confirmation
❌ Analytics not logged: order data is lost for tracking
❌ No retry for email, unless you add retry logic manually
With Kafka (Asynchronous + Decoupled)
🔁 Consumer 1: Inventory Service
🔁 Consumer 2: Email Service
🔁 Consumer 3: Analytics Service
✅ What Happens:
OrderService publishes to topic order_created
Kafka stores the message
Inventory Service reads the message and updates the DB
Email Service is down → Kafka keeps the message, retries later
Analytics Service reads and logs it
No service blocks the others
Kafka offers scalability by scaling out brokers and partitions.
Kafka offers reliability by durably storing messages on disk in the partitions, so even if a consumer is down, the message is not lost and can be consumed later when the service comes back.