How do we distribute data across a set of nodes?
By using data partitioning.
Data partitioning: the process of distributing data across a set of servers. It improves the scalability and performance of the system.
There are challenges when we try to distribute data:
a) How do we know on which node a particular piece of data will be stored?
b) When we add or remove nodes, how do we know what data will be moved from existing nodes to the new nodes? Additionally, how can we minimize data movement when nodes join or leave?
A simple approach: use a suitable hash function to map the data key (sharding key) to a number.
✅ Example:
Let’s say we have:
3 servers: Server 0, Server 1, Server 2
Keys: “user123”, “video789”, “order456”
We use a simple hash() function and then apply modulo operation:
Step-by-step:
Key = “user123”
hash(“user123”) = 123456 (value assumed for illustration)
123456 % 3 = 0 → goes to Server 0
Key = “video789”
hash(“video789”) = 654321
654321 % 3 = 0 → goes to Server 0
Key = “order456”
hash(“order456”) = 789123
789123 % 3 = 0 → goes to Server 0
All 3 keys went to Server 0 (each of the assumed hash values happens to be divisible by 3). This kind of skew happens when the hash values aren’t well-distributed or the number of servers is too small.
What happens if you add a server?
Now number_of_servers = 4
Most hash(key) % 4 values change, so most keys get reassigned to different servers. This is a big problem for scaling and fault tolerance: it causes massive data reshuffling.
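The reshuffling problem is easy to demonstrate. Below is a minimal sketch (key names and the use of MD5 as the hash are illustrative, not from the original example) that counts how many of 1,000 keys change servers when we go from 3 servers to 4:

```python
# Sketch of simple modulo-based partitioning, showing how many keys
# move when a server is added. Keys and hash choice are illustrative.
import hashlib

def server_for(key: str, num_servers: int) -> int:
    # Use a stable hash; Python's built-in hash() is salted per process.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_servers

keys = [f"user{i}" for i in range(1000)]

before = {k: server_for(k, 3) for k in keys}
after = {k: server_for(k, 4) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys moved")
```

With a uniform hash, a key keeps its server only when hash % 3 == hash % 4, which holds for roughly 1 in 4 keys, so about 75% of the data moves for a single added server.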
To solve this problem:
🛠️ Why Consistent Hashing is Better:
With consistent hashing, only a small number of keys are remapped when servers are added or removed, not all of them. Consistent hashing ensures stability during scaling up or down.
Consistent hashing arranges the data managed by a distributed system on a ring. The ring is divided into smaller, predefined ranges, and each node is assigned one of these ranges of data.
Eg:
Hash range: 1–100
No. of servers: 4
Data range 1–25 goes to server 1
Data range 26–50 goes to server 2
Data range 51–75 goes to server 3
Data range 76–100 goes to server 4
The hash generated from the key tells us the node where the data will be stored.
State   City        Zip
CA      Pleasanton  94588
WA      ABC         93823
TX      Austin      73301
HI      Honolulu    96801
Apply the hash function to the key (sharding key), i.e., State:
Hash(State)   City        Zip
79            Pleasanton  94588
3             ABC         93823
66            Austin      73301
29            Honolulu    96801
WA (hash 3, range 1–25) goes to server 1
HI (hash 29, range 26–50) goes to server 2
TX (hash 66, range 51–75) goes to server 3
CA (hash 79, range 76–100) goes to server 4
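The range lookup above can be sketched in a few lines (hash values and server names follow the example; a real system would compute the hash rather than hard-code it):

```python
# Minimal sketch of the range-based ring lookup from the example:
# hash range 1-100 split into four fixed ranges, one per server.
import bisect

# Upper bound of each range, in ring order, and its owning server.
range_ends = [25, 50, 75, 100]
servers = ["server 1", "server 2", "server 3", "server 4"]

def server_for(hash_value: int) -> str:
    # Find the first range whose upper bound is >= hash_value.
    idx = bisect.bisect_left(range_ends, hash_value)
    return servers[idx]

# Hash values from the table above (State is the sharding key).
for state, h in [("CA", 79), ("WA", 3), ("TX", 66), ("HI", 29)]:
    print(state, "->", server_for(h))
```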
The consistent hashing scheme described above handles adding or removing a node well, since only the neighboring node on the ring is affected. For example, when a node is removed, the next node becomes responsible for all of the keys stored on the outgoing node. However, this scheme can result in non-uniform data and load distribution. This problem can be solved with the help of virtual nodes.
Virtual Nodes
To avoid overloading one server (e.g., if too many keys go to Server A), we:
a) Assign multiple virtual positions (virtual nodes, or vnodes) to each server.
b) This helps spread the load more evenly.
Instead of placing just 1 point per physical server on the ring, we place multiple virtual points for each physical server.
So:
Server A → placed at hash positions 0°, 120°, 240°
Server B → placed at 30°, 150°, 270°
Server C → placed at 60°, 180°, 300°
Now each physical server owns multiple chunks of the ring.
So, for example, if a key hashes to 115°, it goes to the vnode at 120°, which belongs to Server A.
Why Virtual Nodes Help
a) Better Load Balancing: multiple vnodes ensure keys are spread more evenly, especially with uneven hash distributions.
b) Easier Scalability: when a server is added or removed, only a small portion of vnodes is redistributed, not huge segments.
c) Heterogeneous Servers: vnodes make it easier to maintain a cluster containing heterogeneous machines. We can assign a higher number of sub-ranges to a powerful server and a lower number to a less powerful one.
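The vnode lookup can be sketched as follows. Note an assumption: the degree positions from the example are used directly as ring positions; a real system would derive them by hashing vnode identifiers like "A-0", "A-1" with a uniform hash function.

```python
# Sketch of a ring with virtual nodes, using the degree positions from
# the example above as ring positions (illustrative, not hashed).
import bisect

vnodes = sorted([
    (0, "A"), (120, "A"), (240, "A"),
    (30, "B"), (150, "B"), (270, "B"),
    (60, "C"), (180, "C"), (300, "C"),
])
positions = [p for p, _ in vnodes]

def owner(key_position: int) -> str:
    # First vnode clockwise from the key; wrap around past 360 degrees.
    idx = bisect.bisect_left(positions, key_position) % len(vnodes)
    return vnodes[idx][1]

print(owner(115))  # key at 115° -> vnode at 120° -> Server A
```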
Using Replication:
Data replication: the process of making multiple copies of data and storing them on different servers. It improves the availability and durability of data across the system.
Replication ensures:
a) Fault tolerance (if one server goes down, data isn’t lost)
b) High availability (reads/writes can still be served)
c) Durability
Data replication using Consistent Hashing
When a key is hashed, it is assigned to the first node clockwise (its primary owner), and the data is then also replicated to the next N-1 virtual nodes on the ring.
✅ Example:
Let’s assume:
3 servers: A, B, and C
Each has 3 virtual nodes:
Server A → 0°, 120°, 240°
Server B → 30°, 150°, 270°
Server C → 60°, 180°, 300°
The replication factor is the number of nodes that will receive the copy of the same data. For example, a replication factor of two means there are two copies of each data item, where each copy is stored on a different node.
Replication factor = 3 (i.e., one primary + two replicas)
🔁 Replication Example:
Let’s say a key hashes to 115°
Step 1 (Primary):
115° → next clockwise node is 120° → Server A
Step 2 (Replica 1):
Next vnode after 120° → 150° → Server B
Step 3 (Replica 2):
Next vnode after 150° → 180° → Server C
Benefits:
a) Even if Server A goes down, the data is still available on B and C
b) With virtual nodes, data is evenly distributed.
What if a replica lands on the same physical server?
If the replication factor is 3 and two vnodes belong to the same server, the system can skip ahead to the next vnode owned by a different server, to ensure physical redundancy.
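The replica-placement walk, including the skip-to-a-unique-server rule, can be sketched on the vnode ring from the example (the degree positions are illustrative, as before):

```python
# Sketch of replica placement: walk clockwise from the key's position,
# collecting distinct physical servers until the replication factor is
# met, so each copy lands on a different machine.
import bisect

vnodes = sorted([
    (0, "A"), (120, "A"), (240, "A"),
    (30, "B"), (150, "B"), (270, "B"),
    (60, "C"), (180, "C"), (300, "C"),
])
positions = [p for p, _ in vnodes]

def replica_servers(key_position: int, replication_factor: int = 3) -> list:
    chosen = []
    idx = bisect.bisect_left(positions, key_position)
    for step in range(len(vnodes)):
        # Skip vnodes whose physical server was already chosen.
        server = vnodes[(idx + step) % len(vnodes)][1]
        if server not in chosen:
            chosen.append(server)
        if len(chosen) == replication_factor:
            break
    return chosen

print(replica_servers(115))  # vnodes at 120°, 150°, 180° -> ['A', 'B', 'C']
```

This matches the walk in the replication example above: the key at 115° gets its primary on Server A (120°) and replicas on Servers B (150°) and C (180°).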