What is Sharding or Partitioning?
It is a process of dividing a large database (DB) into smaller, more manageable parts called partitions or shards. Each partition is independent and contains a subset of the overall data.
This improves scalability, performance, and manageability.
Types of Sharding or Partitioning:
Horizontal Sharding:
It involves dividing a database table into multiple partitions or shards, with each partition containing a subset of rows.
Data is split by rows.
All shards (servers) have the same schema (columns).
Each shard holds different records (rows).
Uses a sharding key (e.g., user_id, order_id) to decide where a row goes.
Scalability: Handles large volumes of data and user traffic by adding more servers or shards.
Performance: Each shard is typically assigned to a different database server, which allows for parallel processing and faster query execution. Each server now handles fewer rows, so there is less data to scan.
Maintainability: Simpler upgrades, since you can upgrade one shard at a time without bringing down the entire system.
Horizontal Sharding Techniques:
Range-Based: Rows are assigned based on a value range (e.g., user IDs 1–1000 on one shard, 1001–2000 on the next)
Pros: Simple
Cons: Uneven distribution is possible (popular ranges become hotspots)
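A minimal range-based router might look like this (the shard names and range boundaries are illustrative, not from any real system):

```python
# Hypothetical range-based shard routing table: each entry maps an
# inclusive upper bound of the user_id range to a shard name.
RANGES = [
    (1000, "shard_1"),   # user_id 1–1000
    (2000, "shard_2"),   # user_id 1001–2000
    (3000, "shard_3"),   # user_id 2001–3000
]

def route(user_id: int) -> str:
    """Return the shard that owns this user_id."""
    for upper_bound, shard in RANGES:
        if user_id <= upper_bound:
            return shard
    raise ValueError(f"user_id {user_id} is outside all configured ranges")

print(route(42))    # shard_1
print(route(1500))  # shard_2
```

Note how the routing table is trivial to understand (the "simple" pro), but nothing prevents most active users from landing in one range (the "uneven distribution" con).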
Hash-Based: Hash the sharding key and take it modulo the shard count (hash(key) % N), where N is the number of database servers.
Pros: Even distribution
Cons: Hard to add/remove nodes, i.e., adding new servers means changing the hash function, which requires redistributing the data and downtime for the service. A workaround for this problem is to use 'Consistent Hashing'.
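A quick sketch of why hash(key) % N hurts when N changes (key names and shard counts are made up for illustration; MD5 is used only as a stable hash):

```python
import hashlib

def shard_for(key: str, n_shards: int) -> int:
    # Use a stable hash; Python's built-in hash() is salted per process.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n_shards

keys = [f"user_{i}" for i in range(1000)]
before = {k: shard_for(k, 4) for k in keys}   # 4 servers
after = {k: shard_for(k, 5) for k in keys}    # add a 5th server
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved}/1000 keys must move to a different shard")
```

Going from N=4 to N=5 remaps roughly 80% of the keys, which is why naive hash-mod sharding needs a mass data migration whenever the cluster grows.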
Consistent Hashing: Uses a hash ring to distribute keys, allows dynamic scaling
Pros: Easy to scale nodes in/out
Cons: More complex logic
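A minimal hash-ring sketch, assuming MD5 as the hash function and virtual nodes to smooth the distribution (the class and node names are illustrative):

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring; vnodes smooth out the distribution."""

    def __init__(self, nodes, vnodes=100):
        # Place `vnodes` points on the ring for each physical node.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First vnode clockwise from the key's hash (wrapping around).
        idx = bisect.bisect(self._keys, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["db1", "db2", "db3"])
owner = ring.node_for("user_42")
# Adding a db4 later only remaps the keys that fall on db4's vnodes;
# most keys keep their owner, unlike hash(key) % N.
```

This is the "more complex logic" trade-off in code form: a sorted ring plus binary search replaces a one-line modulo.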
Geographic Sharding: Users/data split by region (e.g., US data on US servers, EU data on EU servers)
Pros: Simple; keeps data close to its users, reducing latency
Cons: Uneven distribution is possible, since some regions are densely populated and others are not.
Vertical Sharding: It involves dividing a database table into multiple partitions or shards, with each partition containing a subset of columns.
Data is split by columns.
Each server stores different columns of the same logical entity (e.g., User).
All parts are connected using a common key (usually the primary key like user_id).
There is no sharding key in the horizontal sense.
Here, sharding is done by grouping related features or responsibilities, e.g., Auth (username, password) in one shard and Profile (bio, profile_pic) in another.
Data is mapped to specific services/microservices, e.g., the Payments team owns billing_info, not the full user profile.
Original Table Columns
user_id, username, password, bio, profile_pic, last_login, analytics_data
After Vertical sharding:
Auth Table (for login service):
user_id, username, password
Profile Table (for user profile UI):
user_id, bio, profile_pic
Analytics Table (for tracking user activity):
user_id, last_login, analytics_data
This results in:
Login APIs don’t touch profile or analytics tables → faster queries.
You can scale or secure the auth service separately
Analytics DB can be optimized for write-heavy patterns
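The column split above can be sketched in application code (the record and column groups mirror the example tables; field values are illustrative):

```python
# Hypothetical wide user record before the vertical split.
user = {
    "user_id": 7, "username": "alice", "password": "<hash>",
    "bio": "hi there", "profile_pic": "a.png",
    "last_login": "2024-01-01", "analytics_data": {"visits": 3},
}

# Each vertical shard owns a column group; user_id is the shared key.
AUTH_COLUMNS = ("user_id", "username", "password")
PROFILE_COLUMNS = ("user_id", "bio", "profile_pic")
ANALYTICS_COLUMNS = ("user_id", "last_login", "analytics_data")

def project(row, columns):
    """Keep only the columns owned by one vertical shard."""
    return {c: row[c] for c in columns}

auth_row = project(user, AUTH_COLUMNS)
profile_row = project(user, PROFILE_COLUMNS)
analytics_row = project(user, ANALYTICS_COLUMNS)
# The login service reads auth_row only; joining the pieces back
# together requires matching on user_id.
```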
Performance: Improves query speed by reducing number of columns per query.
Maintainability: Each vertical shard contains different columns/features, often managed by separate services or modules. E.g.:
You can upgrade the auth service (and its DB schema) without touching profile or analytics services.
Scalability: Not typically, unless services are manually separated onto their own servers.
Important:
Can it scale to handle more users or records?
Horizontal : ✅ Yes — by adding shards
Vertical: ❌ Not directly
Can it scale feature-specific workloads?
Horizontal: ❌ No, since every shard has the same schema and serves all features
Vertical: ✅ Yes — e.g., scale only billing service
Does it allow infinite growth with new servers?
Horizontal: ✅ Yes
Vertical: ⚠️ Only to a point
Hybrid Partitioning: Combines both horizontal and vertical partitioning techniques to partition data into multiple shards.
E.g., consider a large e-commerce website that stores customer data in a database table. The website might partition the customer table horizontally by the customers' geographic location, and then partition each shard vertically by the type of data.
Step 1: Horizontal Partitioning (by location)
Split the Customer table by geographic region:
Shard 1 (US customers) → Server A
Shard 2 (EU customers) → Server B
Shard 3 (Asia customers) → Server C
Now each shard contains fewer rows, reducing query load per region.
Step 2: Vertical Partitioning (by type of data within each shard)
Inside each shard:
Auth Table: user_id, username, password
Profile Table: user_id, name, address, profile_pic
Order Table: user_id, order_id, order_history
Billing Table: user_id, credit_card_token, billing_address
Now each module or service can handle only the data it needs, improving performance and maintainability.
Note:
Step 1: Horizontal sharding by region/user/etc.
Step 2: Vertical partitioning by column group/type
Result: Distributed + modular + high-performance system
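The two-step lookup can be sketched as follows (server and table names follow the example above; the routing tables themselves are hypothetical):

```python
# Step 1: horizontal routing table, region -> regional shard.
REGION_SHARDS = {"US": "server_a", "EU": "server_b", "ASIA": "server_c"}

# Step 2: vertical routing table, service -> table within a shard.
SERVICE_TABLES = {
    "auth": "auth_table",
    "profile": "profile_table",
    "orders": "order_table",
    "billing": "billing_table",
}

def locate(region: str, service: str) -> tuple[str, str]:
    """Return (server, table) for a given user region and service."""
    return REGION_SHARDS[region], SERVICE_TABLES[service]

print(locate("EU", "billing"))  # ('server_b', 'billing_table')
```

Every query now needs both lookups, which hints at the routing complexity discussed next.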
Why not use Hybrid everywhere, since it covers everything?
a) High Complexity: Managing both horizontal and vertical shards together means dealing with routing logic, joins, service boundaries, and multiple databases.
b) Operational Overhead: More servers, more DBs, more services = more things to monitor, secure, upgrade, and troubleshoot.
c) Cross-partition joins are hard: Once you split data, joining data across horizontal shards and vertical services becomes very difficult and slow.
Common Problems of Data Partitioning
Joins and Denormalization: Performing joins on a database running on one server is straightforward, but once a database is partitioned and spread across multiple servers (horizontal sharding), it is often not feasible to perform joins that span database partitions. Such joins are not performance-efficient, since data has to be compiled from multiple servers and sent over the network.
A common workaround for this problem is to denormalize the database.
Using Joins across multiple shards:
Query all shards (expensive)
Pull the data into memory
Manually join in application code (slow, complex)
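The three steps above can be sketched with in-memory lists standing in for shards (all data here is illustrative):

```python
# In-memory stand-ins for two horizontal shards of an `orders` table
# and a separately stored `users` table.
ORDER_SHARDS = [
    [{"order_id": 1, "user_id": 10, "total": 50}],
    [{"order_id": 2, "user_id": 20, "total": 75}],
]
USERS = {10: {"user_id": 10, "name": "alice"},
         20: {"user_id": 20, "name": "bob"}}

def scatter_gather_join():
    # 1. Query all shards (expensive: every shard does work).
    orders = [row for shard in ORDER_SHARDS for row in shard]
    # 2–3. Pull the rows into memory and join in application code.
    return [
        {**order, "name": USERS[order["user_id"]]["name"]}
        for order in orders
    ]

joined = scatter_gather_join()
```

With real databases each shard query is a network round trip, and the application must also handle paging, sorting, and partial failures, which is why this pattern is slow and complex.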
Denormalization:
What: Duplicate data to avoid joins
Why: Improve read speed, simplify queries, support sharding
Risk:
Data duplication: Same data stored in multiple places → more storage
Data inconsistency: Harder to keep duplicates in sync during updates
More write complexity: Updating a user name? Now you must update it in every place it’s denormalized
How you do denormalization:
Create (or replace) a table that carries columns from two or more normalized tables, or keep the original table but add extra columns you would normally fetch via a join.
How Denormalization Leads to Data Inconsistency
🔁 Because you are duplicating the same data in multiple places, you now have to manually keep the copies in sync. Solutions: use database triggers (ties the sync logic to the DB), propagate updates through a message queue like Kafka, or accept inconsistency when stale data is okay for non-critical reads (e.g., logs, analytics).
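A small SQLite sketch of both the fast read path and the double-update risk (table and column names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users  (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         user_id INTEGER, total REAL);
    -- Denormalized read table: user_name is copied in,
    -- so reads need no join.
    CREATE TABLE orders_denorm (order_id INTEGER PRIMARY KEY,
                                user_name TEXT, total REAL);
""")
con.execute("INSERT INTO users VALUES (1, 'alice')")
con.execute("INSERT INTO orders VALUES (100, 1, 9.99)")
con.execute("INSERT INTO orders_denorm VALUES (100, 'alice', 9.99)")

# Fast read path: a single-table query, no join.
row = con.execute("SELECT user_name, total FROM orders_denorm").fetchone()

# The write-complexity risk: renaming the user needs TWO updates,
# and forgetting the second one leaves the copies inconsistent.
con.execute("UPDATE users SET name = 'alicia' WHERE user_id = 1")
con.execute("UPDATE orders_denorm SET user_name = 'alicia' "
            "WHERE order_id = 100")
```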
Rebalancing:
Rebalancing means redistributing data across shards or partitions to maintain even load, storage usage, and performance as your system evolves.
eg:
The data distribution is not uniform, e.g., there are a lot of places for a particular ZIP code that cannot fit into one database partition.
There is a lot of load on a partition, e.g., there are too many requests being handled by the DB partition dedicated to user photos.
In such cases, we either have to create more DB partitions or rebalance existing ones, which means changing the partitioning scheme and moving all existing data to new locations.
Risks:
a) Doing this without incurring downtime is extremely difficult.
b) Data consistency: we must ensure data isn't lost or duplicated during migration.
Indexes
An index is a data structure (usually a B-Tree or hash table) that stores column values along with pointers (or references) to the actual rows in the table, allowing the database to quickly locate rows without scanning the entire table.
Eg: Think of an index like the index in a book — instead of reading every page, you jump directly to the page where your topic is.
Why Use Indexes?
Benefits:
🔍 Faster queries: Reduces the amount of data scanned (especially with WHERE, JOIN, ORDER BY, GROUP BY)
📈 Better performance: Essential for large tables — improves speed for lookups and filters
🔄 Improved sorting: Helps speed up ORDER BY queries
Example:
Without Index:
SELECT * FROM users WHERE email = 'alice@example.com';
→ Scans every row in the users table = full table scan
With Index on email:
CREATE INDEX idx_email ON users(email);
→ Database uses the index to jump directly to the matching row = very fast
Trade-Offs of Indexing:
a) Slower writes: INSERT, UPDATE, DELETE become slower because indexes also need to be updated.
b) More storage: Indexes consume extra disk space.
c) Needs tuning: Too many indexes can hurt performance; you need the right ones for your query patterns.
Best Practices:
Index columns in WHERE, JOIN, ORDER BY: These benefit most from indexes
Don't over-index: too many indexes = slower writes and higher storage
Concept Summary
What:Data structure to speed up reads
Why:Avoid full table scans
Types:B-Tree, Hash, Composite, Full-Text, etc.
Trade-offs: Faster reads, but slower writes & more storage
Use for: WHERE, JOINs, ORDER BY, SELECT performance
Indexes apply to columns: the database checks which columns are used in WHERE, JOIN, ORDER BY, and GROUP BY clauses.
Note: Avoid indexes on tables that are written to often but rarely read, since every write must also update the index.