Data Partitioning Flashcards

(22 cards)

1
Q

What is data partitioning?

A

Data partitioning is process of dividing a large database (DB) into smaller, more manageable parts called partitions. Each partition is independent and contains a subset of the overall data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

By what criteria do people typically partition data?

A

By data range, data size, or data type

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What are the three most popular partitioning methods?

A

Horizontal, vertical, and hybrid partitioning.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is horizontal partitioning?

A

Aka sharding, a complex “Nuclear Option” used when all other scaling strategies (like Read Replicas and Caching) are exhausted.

Splitting a large table by rows into smaller chunks (called shards) and distributing them across multiple servers.

Accessible by the shard key.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What are the tradeoffs of horizontal partitioning?

A

Pros: it’s scalable (adding more shards can increase capacity), performant (scoping requests to a single shard is faster because it has fewer rows), and fault tolerant (failures are isolated to one node).

Cons: it’s complex (it requires additional logic to target the right shard), multiple shard queries are complicated and slow, the design is subject to hotspots (disproportionately large shards) which require effortful rebalancing, there might be data duplication or inconsistencies

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is vertical partitionining?

A

Paritioning data based on columns rather that rows. For example, storing username and email in one shard but storing location and demographic info in another.

Can help optimize performance by reducing the amount of data that needs to be scanned, especially when certain columns are accessed more frequently than others.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is hybrid partitioning?

A

Hybrid data partitioning combines both horizontal and vertical partitioning techniques to partition data into multiple shards. This technique can help optimize performance by distributing the data evenly across multiple servers, while also minimizing the amount of data that needs to be scanned.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What are some common row based partitioning (sharding) strategies?

A
  1. Key or Hash based partitioning
  2. List partitioning
  3. Round Robin partitioning.
  4. Composite partitioning.
  5. Range partitioning.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is hash based partitioning?

A

When you apply a hash function to some key attributes of the entity you are storing that yields the partition number.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What are the trade offs of key or hash based partitioning?

A

Pro: It ensures a uniform allocation of data among servers.

Con: It fixes the total number of DB servers, since adding new servers means changing the hash function which would require redistribution of data and downtime for the service.

A workaround for this problem is to use ‘Consistent Hashing’.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is consistent hashing?

A

A technique used in distributed systems to efficiently distribute keys across nodes while minimizing key movement when nodes are added or removed. It maps both keys and nodes onto a circular hash ring, assigning each key to the next node in a clockwise direction.

Note, each server contains a replica of data. With virutal nodes, you spread this data across multiple servers.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is list based partitioning?

A

It is when each partition is assigned a list of values. Whenever we want to insert a new record, we will see which partition contains our key and then store it there.

For example, we can decide all users living in Iceland, Norway, Sweden, Finland, or Denmark will be stored in a partition for the Nordic countries.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is round robin partitioning?

A

A very simple strategy that ensures uniform data distribution. With ‘n’ partitions, the ‘i’ tuple is assigned to partition (i mod n).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is composite partitioning?

A

A combination any known partitioning schemes used to devise a new scheme. For example, first applying a list partitioning scheme and then a hash-based partitioning.

Consistent hashing could be considered a composite of hash and list partitioning where the hash reduces the key-space to a size that can be listed

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What are the problems of using joins across partitions living on a different server?

A

They’re often not feasible because they’re slow and hard to do across multiple servers.

One solution is to denormalize the data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Why is enforcing referential integrarity accross a partitioned database difficult?

A

Because most RDBMS do not support foreign key constraints across databases on different servers. This means foreign key constraints must be enforced in code and often cleaned with subsequent sql jobs.

17
Q

What are some reason s we might have to change our partitioning scheme (rebalance)?

A

The data distribution is not uniform - e.g. a lot of places with one zip code that won’t fit a partition.

There is a lot of load on a parition - e.g. there are too many requests being handled by the DB partition dedicated to user photos.

18
Q

How might you rebalance without incurring downtine?

A

By using directory-based paritioning.

19
Q

What is directory-based partitioning and what are it’s pros and cons.

A

Directory-Based Partitioning is a database partitioning technique where a separate mapping directory is used to determine where a specific piece of data should be stored.

Pro: Makes rebalancing possible without downtime
Con: Increases the complexity of the system and creating a single point of failure at the lookup service/database

20
Q

Why should you typically avoid sharding?

A

No DB level Cross-Shard Joins: Can no longer do JOIN User ON Orders. If User is on Shard A and their Orders are on Shard B, the database cannot join them. You have to do it in the application code (which is slow and complex).

No Global Transactions (ACID): You generally lose multi-row ACID transactions. If you need to update User A (Shard 1) and User B (Shard 2) atomically, you need complex “Two-Phase Commit” protocols.

The “Celebrity Problem”: If you shard by User_ID, and Justin Bieber (User_999) posts a tweet, millions of people rush to read from Shard 999. That single shard melts down.

21
Q

“We are designing Twitter. We decided to shard our Tweets database by User_ID so all of a user’s tweets live on one server. This works great for 99% of users. However, when Lady Gaga (100M followers) tweets, millions of people try to read from that one specific shard at the same time, causing it to crash. How do you fix this?”

A

*“The issue here isn’t storage; it’s read throughput on a ‘Hot Key’.

Identify Hot Users: We flag users with >1M followers as ‘VIPs’.

Hybrid Strategy:
* Normal Users: We continue to read from their specific shard.
* VIP Users: We cache their tweets aggressively in a distributed cache (Redis) with a high replication factor.

Extreme Case: If the cache is still overwhelmed, we can manually duplicate the VIP’s tweets across multiple database shards (randomly) to dilute the read load, although this complicates the write path.”

22
Q

“We have a User database sharded by User_ID. It works perfectly for looking up a user by their ID. But now, the Product Manager wants a feature to ‘Search Users by Name’ (e.g., Find all users named ‘John’). How do we implement this query efficiently?”

A

*“Since we sharded by ID, users named ‘John’ are scattered randomly across all 100 shards.

If we run a standard query, we have to hit every single shard (Scatter-Gather), wait for all 100 to answer, and merge the results. This has terrible latency and high failure probability.

The Fix: We need a Global Secondary Index (GSI). We create a separate table (or index service) specifically for names: (Name) -> (User_ID). .CREATE INDEX ON Users(Name)

We shard this index by Name and replicate that (rf 3 or something).

When we search for ‘John’, we hit the ‘Name Shard’, get the list of IDs, and then fetch the actual user data effectively.”

**Why horizontally partition the GSI?
**1. Storage Constraint: That index might literally exceed the hard drive space of a single server. And probably would in a FAANG context.

Solution: You Shard the index (A-M on Server 1, N-Z on Server 2) just to store it all.

  1. The “Write Limit” (The Real Bottleneck)
    This is the big one. If your application is popular enough to need sharding, you probably have massive Write Volume (e.g., 50,000 new users signing up per second).

Replication doesn’t help Writes: Adding 10 read-replicas doesn’t help the Leader server handle incoming writes. It actually makes it slower (because the Leader has to forward the data to the replicas).

NOTE: Standard Replication does NOT improve Write Throughput.

In fact, adding more replicas often makes your Write Throughput slower, not faster.