What is a heartbeat
Heartbeating is one of the mechanisms for detecting failures in a distributed system. If there is a central server, every other server periodically sends it a heartbeat message. If there is no central server, each server randomly chooses a set of peers and sends them a heartbeat message every few seconds. If no heartbeat arrives from a server for a while, the system can suspect that the server has crashed. If none arrives within a configured timeout period, the system treats the server as dead, stops sending requests to it, and starts working on its replacement.
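As a rough illustration of the receiving side, here is a minimal sketch of a monitor that remembers when each server was last heard from. The class name HeartbeatMonitor, the two thresholds (suspect_after, dead_after), and the status labels are assumptions made for this example, not part of any particular system.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each server (illustrative sketch)."""

    def __init__(self, suspect_after=3.0, dead_after=10.0):
        self.suspect_after = suspect_after  # seconds of silence before suspicion
        self.dead_after = dead_after        # seconds of silence before giving up
        self.last_seen = {}                 # server id -> time of last heartbeat

    def record_heartbeat(self, server_id, now=None):
        # Called whenever a heartbeat message arrives from server_id.
        self.last_seen[server_id] = time.monotonic() if now is None else now

    def status(self, server_id, now=None):
        # Classify a server by how long it has been silent.
        now = time.monotonic() if now is None else now
        if server_id not in self.last_seen:
            return "unknown"
        silence = now - self.last_seen[server_id]
        if silence >= self.dead_after:
            return "dead"
        if silence >= self.suspect_after:
            return "suspected"
        return "alive"


monitor = HeartbeatMonitor(suspect_after=3.0, dead_after=10.0)
monitor.record_heartbeat("server-1", now=100.0)
print(monitor.status("server-1", now=101.0))  # alive
print(monitor.status("server-1", now=104.0))  # suspected
print(monitor.status("server-1", now=111.0))  # dead
```

Passing the time in explicitly (the now argument) is only there to make the sketch easy to test; a real monitor would read the clock itself.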
What “no heartbeat” actually means
When a heartbeat is missing, several things could be true:
❌ The server crashed
🐢 The server is alive but very slow
🌐 There is a network partition
🔥 The server is overloaded and can’t send heartbeats
Because of this ambiguity, choosing a timeout is a trade-off:
Short timeout → faster failure detection, more false positives
Long timeout → slower detection, fewer false positives
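The trade-off can be made concrete with a small sketch. The gap values below are made-up numbers standing in for the time between consecutive heartbeats from a server that never actually crashed; any gap longer than the timeout would have been a false positive.

```python
def count_false_positives(gaps, timeout):
    """Count gaps between heartbeats from a live server that exceed the timeout."""
    return sum(1 for gap in gaps if gap > timeout)


# Hypothetical gaps (seconds) between heartbeats from a slow but healthy server.
gaps = [1.0, 1.1, 0.9, 4.2, 1.0, 2.5, 1.0, 6.0, 1.2]

for timeout in (2.0, 5.0, 10.0):
    # A real crash is detected no sooner than `timeout` seconds after it happens.
    print(f"timeout={timeout}s -> false positives: {count_false_positives(gaps, timeout)}")
```

With these numbers the 2-second timeout wrongly flags the server three times, the 5-second timeout once, and the 10-second timeout never, at the cost of waiting longer to notice a real crash.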
In distributed systems:
You cannot reliably distinguish a crashed node from a slow or partitioned node.
Heartbeats give only best-effort failure detection.
What is a checksum
A checksum is a small value calculated from data that’s used to verify the data hasn’t changed or been corrupted.
In a distributed system, data transferred between components can become corrupted due to disk faults, network issues, software bugs, or memory errors. To ensure that clients do not consume corrupted data, systems use checksums to verify data integrity.
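To do this, the system computes a checksum using a hash function (such as MD5, SHA-1, SHA-256, or SHA-512). A hash function takes input data of any size and produces a fixed-length value, called the checksum, which is usually written as a hexadecimal string. MD5 and SHA-1 still catch accidental corruption, but they are no longer considered safe against deliberate tampering.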
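A short sketch of the fixed-length property, using SHA-256 from Python's standard hashlib module. The helper name sha256_checksum is just a label chosen for this example.

```python
import hashlib


def sha256_checksum(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


small = sha256_checksum(b"hi")
large = sha256_checksum(b"x" * 1_000_000)

# Input size differs by six orders of magnitude, digest length does not.
print(len(small), len(large))

# Changing a single byte produces a different checksum.
print(sha256_checksum(b"hello") == sha256_checksum(b"hellp"))
```

Both digests are 64 hexadecimal characters long, and the final comparison prints False.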
How it works:
When data is stored, the system computes a checksum and stores it alongside the data.
When a client retrieves the data, it recomputes the checksum from the received data.
The client compares the computed checksum with the stored checksum.
If the checksums do not match, the data is considered corrupted.
The client can then retry the request or fetch the data from another replica.
This ensures that clients either receive correct data or an explicit error, rather than silently consuming corrupted data.
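The steps above might look like the following on the client side. This is a sketch under simplifying assumptions: replicas are modeled as a plain list of byte strings rather than network calls, and the names read_verified and CorruptDataError are invented for the example.

```python
import hashlib


class CorruptDataError(Exception):
    """Raised when no copy of the data matches the stored checksum."""


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def read_verified(replicas, expected_checksum):
    """Return the first replica copy whose checksum matches, else raise."""
    for data in replicas:
        if checksum(data) == expected_checksum:
            return data
        # Mismatch: treat this copy as corrupted and try the next replica.
    raise CorruptDataError("no replica returned data matching the stored checksum")


original = b"account balance: 100"
stored_checksum = checksum(original)

# The first replica returns a corrupted copy, the second returns the original.
replicas = [b"account balance: 900", original]
print(read_verified(replicas, stored_checksum))
```

The caller gets either verified data or an exception, never a silently corrupted value.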
How a checksum works end to end (roles clarified)
1️⃣ When data is saved (write path)
The server/storage node that receives the data:
Computes the checksum of the data
Stores the data
Stores the checksum alongside the data (as metadata)
The server therefore creates and stores the checksum at write time.
2️⃣ When data is read (read path)
The consumer of the data (which could be another server or the client):
Receives the data
Recomputes the checksum from the received data
Fetches the stored checksum
Compares the two
If:
✅ Checksums match → data is valid
❌ Checksums don’t match → data is corrupted → retry / error / fetch from another replica
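To show who does what, here is a minimal in-memory sketch of the two paths. StorageNode, StoredObject, and consume are hypothetical names for this example; a real storage system would persist the checksum as metadata on disk and return it over the network.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class StoredObject:
    data: bytes
    checksum: str  # metadata kept alongside the data


class StorageNode:
    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        # Write path: the node computes the checksum and stores it with the data.
        self._objects[key] = StoredObject(data, hashlib.sha256(data).hexdigest())

    def get(self, key: str) -> StoredObject:
        # Returns the data together with the checksum stored at write time.
        return self._objects[key]


def consume(node: StorageNode, key: str) -> bytes:
    # Read path: the consumer recomputes the checksum and compares the two.
    obj = node.get(key)
    if hashlib.sha256(obj.data).hexdigest() != obj.checksum:
        raise ValueError(f"checksum mismatch for {key!r}")
    return obj.data


node = StorageNode()
node.put("report", b"quarterly numbers")
print(consume(node, "report"))
```

On a mismatch the consumer raises an error here; retrying or reading from another replica would be the other options listed above.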