HeartBeat Flashcards

(2 cards)

1
Q

What is a heartbeat?

A

Heartbeating is one of the mechanisms for detecting failures in a distributed system. If there is a central server, all other servers periodically send it a heartbeat message. If there is no central server, each server picks a random set of peers and sends them a heartbeat message every few seconds. If no heartbeat arrives from a server for a while, the system suspects that the server may have crashed. If none arrives within a configured timeout period, the system concludes that the server is no longer alive, stops sending requests to it, and starts working on its replacement.
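The central-server case can be sketched as a small monitor that remembers when each server was last heard from. This is a minimal illustration, not a production detector: the class name `HeartbeatMonitor`, the two thresholds (`suspect_after`, `dead_after`), and the injectable `clock` are all assumptions made for the example.

```python
import time


class HeartbeatMonitor:
    """Hypothetical central monitor that tracks the last heartbeat per server."""

    def __init__(self, suspect_after=5.0, dead_after=15.0, clock=time.monotonic):
        self.suspect_after = suspect_after  # seconds of silence before suspicion
        self.dead_after = dead_after        # seconds of silence before declaring dead
        self.clock = clock                  # injectable so the logic can be tested
        self.last_seen = {}                 # server_id -> time of last heartbeat

    def record_heartbeat(self, server_id):
        """Called whenever a heartbeat message arrives from server_id."""
        self.last_seen[server_id] = self.clock()

    def status(self, server_id):
        """Classify a server by how long it has been silent."""
        if server_id not in self.last_seen:
            return "unknown"
        silence = self.clock() - self.last_seen[server_id]
        if silence >= self.dead_after:
            return "dead"
        if silence >= self.suspect_after:
            return "suspected"
        return "alive"
```

A caller would invoke `record_heartbeat` on every incoming message and poll `status` before routing requests; a server reported as "dead" would be removed from rotation and replaced.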

What “no heartbeat” actually means

When a heartbeat is missing, several things could be true:

❌ The server crashed

🐢 The server is alive but very slow

🌐 There is a network partition

🔥 The server is overloaded and can’t send heartbeats

Because these cases look identical from the outside, choosing a timeout is a trade-off:

Short timeout → faster failure detection, more false positives

Long timeout → slower detection, fewer false positives
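The trade-off can be made concrete with a toy calculation. The gap values below are invented for illustration: they represent the seconds between consecutive heartbeats from a server that is alive but occasionally slow. Any gap longer than the timeout would be wrongly treated as a failure.

```python
def false_positives(gaps, timeout):
    """Count heartbeat gaps from a live server that exceed the timeout."""
    return sum(1 for gap in gaps if gap > timeout)


# Hypothetical gaps (seconds) between heartbeats from a live but jittery server.
gaps = [1.0, 1.2, 0.9, 3.5, 1.1, 6.0, 1.0]

for timeout in (2.0, 5.0, 10.0):
    print(f"timeout={timeout}s false_positives={false_positives(gaps, timeout)}")
# timeout=2.0s false_positives=2
# timeout=5.0s false_positives=1
# timeout=10.0s false_positives=0
```

The 10-second timeout never misfires on this data, but a server that really crashed would also go unnoticed for up to 10 seconds.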

In distributed systems:

You cannot reliably distinguish a crashed node from a slow or partitioned node.

Heartbeats give best-effort failure detection: a missed heartbeat is evidence of a failure, not proof of one.

2
Q

What is a checksum?

A

A checksum is a small value calculated from data that’s used to verify the data hasn’t changed or been corrupted.

In a distributed system, data transferred between components can become corrupted due to disk faults, network issues, software bugs, or memory errors. To ensure that clients do not consume corrupted data, systems use checksums to verify data integrity.

To do this, the system computes a checksum using a hash function (such as MD5, SHA-1, SHA-256, or SHA-512). A hash function takes input data of any size and produces a fixed-length value, called the checksum.
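As a small illustration of the fixed-length property, the sketch below uses SHA-256 from Python's standard `hashlib` module. The helper name `checksum` is an assumption for this example.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 checksum of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


small = checksum(b"hi")
large = checksum(b"x" * 1_000_000)

# Both checksums are 64 hex characters, regardless of input size.
print(len(small), len(large))
```

Changing even one byte of the input produces a different checksum, which is what makes the comparison useful for detecting corruption.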

How it works:

When data is stored, the system computes a checksum and stores it alongside the data.

When a client retrieves the data, it recomputes the checksum from the received data.

The client compares the computed checksum with the stored checksum.

If the checksums do not match, the data is considered corrupted.

The client can then retry the request or fetch the data from another replica.

This ensures that clients either receive correct data or an explicit error, rather than silently consuming corrupted data.
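The steps above can be sketched with an in-memory dictionary standing in for the storage system. Everything here is hypothetical scaffolding (`store`, `put`, `get`, `CorruptedDataError`), and SHA-256 is just one of the hash functions the card mentions.

```python
import hashlib


class CorruptedDataError(Exception):
    """Raised when stored data no longer matches its checksum."""


store = {}  # key -> (data, checksum); stands in for a real storage node


def put(key: str, data: bytes) -> None:
    """Write path: compute the checksum and store it alongside the data."""
    store[key] = (data, hashlib.sha256(data).hexdigest())


def get(key: str) -> bytes:
    """Read path: recompute the checksum and compare it with the stored one."""
    data, stored_checksum = store[key]
    if hashlib.sha256(data).hexdigest() != stored_checksum:
        raise CorruptedDataError(f"checksum mismatch for {key!r}")
    return data
```

With this shape, a caller either gets the bytes that were written or an explicit `CorruptedDataError`, and can decide to retry or read from another replica.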

How a checksum works end to end (roles clarified)
1️⃣ When data is saved (write path)

The server/storage node that receives the data:

Computes the checksum of the data

Stores the data

Stores the checksum alongside the data (as metadata)

The server creates and stores the checksum at write time.

2️⃣ When data is read (read path)

The consumer of the data (which could be another server or the client):

Receives the data

Recomputes the checksum from the received data

Fetches the stored checksum

Compares the two

If:

✅ Checksums match → data is valid

❌ Checksums don’t match → data is corrupted → retry / error / fetch from another replica
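The "fetch from another replica" branch might look like the sketch below. It assumes each replica is a plain dictionary mapping a key to a `(data, checksum)` pair, which is a simplification chosen for the example; the function name `read_with_fallback` is also hypothetical.

```python
import hashlib


def read_with_fallback(replicas, key):
    """Return data from the first replica whose checksum verifies."""
    for replica in replicas:
        entry = replica.get(key)
        if entry is None:
            continue  # this replica does not hold the key
        data, stored_checksum = entry
        if hashlib.sha256(data).hexdigest() == stored_checksum:
            return data  # checksums match: data is valid
        # Checksums differ: treat this copy as corrupted and try the next one.
    raise IOError(f"no valid copy of {key!r} found on any replica")


good = b"payload"
good_sum = hashlib.sha256(good).hexdigest()
replica_a = {"k": (b"pay1oad", good_sum)}  # corrupted copy
replica_b = {"k": (good, good_sum)}        # intact copy

print(read_with_fallback([replica_a, replica_b], "k"))  # b'payload'
```

If every copy fails verification, the function raises an error rather than returning bad data, matching the "correct data or explicit error" guarantee.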
