Design Distributed Message Queue
Non-functional requirements
- High throughput or low latency, configurable based on the use case
- Scalable, system should be distributed in nature
- Persistent and durable
Apache Kafka uses topics and partitions for efficient data processing. Topics are data categories to which records are published; consumers subscribe to these topics. For scalability and performance, topics are divided into partitions, allowing parallel data processing and fault tolerance. Partitions enable multiple consumers to read concurrently and are replicated across nodes for resilience against failures
Each partition is an ordered, immutable sequence of records, where order is maintained only within the partition, not across the entire topic. This partitioning mechanism enables Kafka to handle a high volume of data efficiently, as multiple producers can write to different partitions simultaneously, and multiple consumers can read from different partitions in parallel
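A minimal sketch of how a producer maps a keyed record to a partition: hash the key and take it modulo the partition count. Kafka's actual default partitioner uses murmur2 (and spreads keyless records differently); the hash and partition count below are only illustrative.

```python
# Sketch: mapping a record key to a partition (conceptual, not Kafka's exact
# murmur2-based partitioner). Records with the same key always land in the
# same partition, which is what preserves per-key ordering.
import hashlib

NUM_PARTITIONS = 8  # hypothetical partition count for the topic

def choose_partition(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

print(choose_partition(b"user-42"))   # same key -> same partition every time
print(choose_partition(b"user-1337"))
```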
At-least-once delivery (Kafka default)
Producer sends a message synchronously or asynchronously with a response callback, setting acks=1 or acks=all, to make sure messages are delivered to the broker. If the message delivery fails or times out, the producer keeps retrying.
Consumer fetches the message and commits the offset only after the data is successfully processed. If the consumer fails to process the message, it will re-consume the message so there won’t be data loss. On the other hand, if a consumer processes the message but fails to commit the offset to the broker, the message will be re-consumed when the consumer restarts, resulting in duplicates.
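A sketch of that at-least-once consumer pattern using the kafka-python client; the broker address, topic, group id, and process() function are assumptions for illustration.

```python
# Sketch of the at-least-once pattern with the kafka-python client.
# Assumes a broker at localhost:9092 and a hypothetical process() function.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",                      # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="order-workers",
    enable_auto_commit=False,      # commit manually, only after processing
)

def process(record):
    ...  # should be idempotent: a crash after processing but before the commit
         # means this record will be delivered and processed again

for record in consumer:
    process(record)
    consumer.commit()              # commit the offset only after successful processing
```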
Design Unique ID Generator
ID will be made up of:
- Sign bit: 1 bit, reserved for future use cases
- Timestamp: 41 bits, milliseconds since the Unix epoch or a custom epoch. 2 ^ 41 = 2,199,023,255,552 ms, which gives us ~69 years (2,199,023,255,552 ms / 1000 / 3600 / 24 / 365 ≈ 69.7 years)
- Data center ID: 5 bits, 2 ^ 5 = 32 data centers
- Machine ID: 5 bits, 2 ^ 5 = 32 machines / data center
- Sequence number: 12 bits. For each machine/process, the sequence number is incremented by 1. The number is reset every millisecond. 2 ^ 12 = 4096 possible values every millisecond
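A sketch of packing these fields into a 64-bit ID with bit shifts. The bit widths come from the list above; the custom epoch value is an assumption (any fixed past timestamp works).

```python
# Sketch: 41-bit timestamp | 5-bit datacenter | 5-bit machine | 12-bit sequence.
# The custom epoch is an assumed constant; the sign bit stays 0 for ~69 years.
import time

CUSTOM_EPOCH_MS = 1_288_834_974_657  # assumed custom epoch; any fixed past time works

def make_id(datacenter_id: int, machine_id: int, sequence: int) -> int:
    timestamp = int(time.time() * 1000) - CUSTOM_EPOCH_MS
    return (timestamp << 22) | (datacenter_id << 17) | (machine_id << 12) | sequence

print(make_id(datacenter_id=3, machine_id=7, sequence=0))
```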
Future discussions
- Clock synchronization. We're assuming each ID generation server has the same clock. This is not necessarily true across multiple machines. Consider Network Time Protocol (NTP) as a solution
Design Search Auto Complete
Non-Functional Requirements
- Fast response time
- Relevant results, sorted by popularity/frequency
- High availability
High Level Design
- Data gathering service: aggregates queries and asynchronously processes their historical frequency
- Real-time query service: given a prefix, return the top k results
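A minimal sketch of the real-time query path, assuming the data gathering service has already materialized a prefix-to-top-k mapping; the structure and names below are hypothetical.

```python
# Minimal sketch of the query service, assuming the ranking happened offline
# in the data gathering pipeline and was materialized as prefix -> top-k lists.
TOP_K_BY_PREFIX = {
    "be": ["best buy", "bee", "beer"],
    "bes": ["best buy", "best western"],
}

def autocomplete(prefix: str, k: int = 5) -> list[str]:
    # O(1) lookup at query time; no ranking work on the hot path
    return TOP_K_BY_PREFIX.get(prefix.lower(), [])[:k]

print(autocomplete("bes"))
```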
Design URL Shortener
Two primary components:
(1) GET /api/v1/<short_url>
return 301 with the long URL in the Location header
301 = permanent redirect; browsers cache it, so later requests skip our server
302 = temporary redirect; every request still hits our server, which is better for analytics
(2) POST /api/v1/url
longURL -> hash function -> hash value (short url)
The hash value is drawn from 62 characters [0-9, a-z, A-Z], so how long does it need to be?
62 ^ 1 = 62 possible values
62 ^ 2 = 3,844
62 ^ 7 = 3.5 trillion
We can use a well-known hash function (e.g., MD5, SHA-1), but we'll need to take the first 7 chars and perform a DB lookup to detect and resolve collisions
2nd approach: use base conversion to represent the same number in a different base. This needs a unique ID generator, but an auto-incrementing DB column can serve as one
base62(11157) -> 2TX
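A sketch of that base-62 conversion; with the 0-9a-zA-Z alphabet it reproduces the example above.

```python
# Sketch of base-62 conversion with the 0-9a-zA-Z alphabet, so base62(11157) == "2TX".
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62(n: int) -> str:
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

print(base62(11157))  # -> "2TX"
```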
Design a Rate Limiter
Rate Limiter: limits/throttles the number of client requests allowed to be sent over a time period
Non functional requirements
- Low latency
- Distributed
HTTP 429: Too Many Requests
Use response headers to communicate rate limit attributes to the client (e.g., remaining, limit, retry-after)
Types of rate limiting rules
- Number of requests / second
Algorithms for rate limiting:
- Token bucket. Simple, well understood, used by Stripe and Amazon. A bucket has a pre-defined capacity and tokens are replenished periodically (say once per second or once per minute, whatever the time period is). Each request consumes one token, so if there aren't enough tokens we drop the request (see the sketch after this list)
- How many buckets do we need? Examples include a bucket per IP address, a global bucket across all requests, or a bucket per endpoint (e.g., POST vs GET)
- Leaking bucket: requests enter a FIFO queue and are processed at a fixed rate; when the queue is full, new requests are dropped
- Fixed window counter: divide the timeline into fixed-size windows and keep a counter per window; requests over the limit within a window are dropped
- Sliding window
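A minimal in-process token bucket sketch (the first algorithm in the list). The capacity and refill rate are assumptions, and a distributed limiter would keep this state in Redis rather than in local memory.

```python
# Minimal token bucket sketch; capacity and refill rate are assumed values.
import time

class TokenBucket:
    def __init__(self, capacity: int = 10, refill_rate: float = 1.0):
        self.capacity = capacity          # max tokens the bucket can hold
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                      # not enough tokens: drop or 429 the request

bucket = TokenBucket()
print([bucket.allow() for _ in range(12)])  # first 10 pass, then throttled
```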
Counters should be stored in a distributed Redis cluster. Why Redis?
- Fast, in memory
- Supports time-based expiration
- Includes atomic INCR and EXPIRE
Why not a DB? Disk-based random access is too slow
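A sketch of a fixed-window counter on Redis using the atomic INCR and EXPIRE mentioned above (redis-py client; the limit, window size, and key scheme are assumptions).

```python
# Sketch: fixed-window rate limiting with Redis INCR + EXPIRE (redis-py client).
# In production you'd wrap the two calls in a Lua script to make them fully atomic.
import time
import redis

r = redis.Redis(host="localhost", port=6379)
LIMIT = 100          # assumed max requests per window
WINDOW_SECONDS = 60  # assumed window size

def is_allowed(client_id: str) -> bool:
    key = f"rate:{client_id}:{int(time.time()) // WINDOW_SECONDS}"
    count = r.incr(key)                 # atomic increment of this window's counter
    if count == 1:
        r.expire(key, WINDOW_SECONDS)   # let the counter expire with its window
    return count <= LIMIT
```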
Design a news feed
Understand scope of problem
Two primary workflows
1) Publishing to the feed
2) Building the news feed for a user and reading it
APIs
1) POST /api/v1/me/feed {content, auth}
2) GET /api/v1/me/feed
1) Fanout on write. News feed is pre-computed at write time
Pros: Fetching news feed is fast
Cons: For inactive users, producing the feed wastes resources. Also the hot-spot problem, in which servers are taxed fanning out posts from popular individuals with many followers
2) Fanout on read. Feed is generated at read time
Pros: Limits compute on servers, doesn’t waste resources for inactive users
Cons: Fetching the news feed is slow because it’s not pre-computed
A solution? Take a hybrid approach. Pre-compute the feed for most users, but for celebrities with many followers let their content be pulled on demand at read time
1) Fetch friend_ids (can use a graph db)
2) Send friends list and new post_id to kafka
3) Workers fetch the data from Kafka and append <user_id, post_id> rows to the feeds table. This feeds table serves as the news feed cache; we store only post_id so that we're not storing expensive content (e.g., images)
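A sketch of step 3, the fanout-on-write worker. A plain dict stands in for the feeds table/cache, and the per-user cap is an assumption; in the real system the message would come off Kafka and the rows would go into the cache.

```python
# Sketch of the fanout worker: append the new post_id to each friend's cached feed.
from collections import defaultdict

feeds_cache: dict[int, list[int]] = defaultdict(list)  # news feed cache: post IDs only

def fanout_on_write(friend_ids: list[int], post_id: int) -> None:
    for friend_id in friend_ids:
        feeds_cache[friend_id].insert(0, post_id)   # newest first
        del feeds_cache[friend_id][1000:]           # assumed cap per cached feed

fanout_on_write(friend_ids=[101, 102, 103], post_id=555)
print(feeds_cache[101])  # [555]
```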
Design a KV store
There is no perfect design; every design trades off among the following
- Read, write, and memory usage
- Consistency vs availability
We’ll design a system that can:
- Store big data
- Key/value pair is small: less than 10 KB
- High availability: Responds quickly, even during failures
- High scalability: Can scale to support larger data sets
Single server:
- Store everything in memory and use a hash table
- Optimizations that can be added:
1) Data compression
2) Only hold frequently used data, spill rest to disk
Distributed KV store
Data Partition
- Infeasible to fit entire data set on single server
- Partition data across multiple servers evenly but must try to minimize data movement when nodes are added/removed (consistent hashing, servers and keys use same hash function on ring)
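A minimal consistent hashing sketch: servers and keys share one hash function and sit on the same ring, and a key is owned by the first server clockwise from its position. Virtual nodes, which smooth out the distribution, are omitted here.

```python
# Sketch of a consistent hash ring: adding/removing a node only remaps the keys
# between that node and its neighbor, minimizing data movement.
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

servers = ["node-a", "node-b", "node-c"]
ring = sorted((ring_hash(s), s) for s in servers)
positions = [pos for pos, _ in ring]

def server_for(key: str) -> str:
    idx = bisect.bisect(positions, ring_hash(key)) % len(ring)  # first server clockwise
    return ring[idx][1]

print(server_for("user:42"))
```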
Data Replication
- Replicate data asynchronously over N servers where N is configurable
- Since data is replicated across multiple nodes, it must be kept in sync across replicas. A write quorum of size W means that for a write operation to be considered successful, it must be acknowledged by W replicas (a read quorum R is defined the same way for reads)
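A sketch of the W-of-N write path; the replica call below is a local stand-in for what would be an RPC to each replica that owns the key.

```python
# Sketch of a write quorum: the write succeeds only if at least W of N replicas ack it.
N = 3  # assumed replicas per key
W = 2  # assumed acks required for a successful write

def send_to_replica(replica: int, key: str, value: str) -> bool:
    return True  # stand-in: pretend the replica acknowledged the write

def quorum_write(key: str, value: str) -> bool:
    acks = sum(send_to_replica(r, key, value) for r in range(N))
    return acks >= W

print(quorum_write("user:42", "alice"))
```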
Failure detection
- Each node maintains a membership list
- Each node periodically sends heartbeats to random nodes, which in turn propagate to others
- If a node's heartbeat has not been received within a predefined period, that node is considered offline
Design Google Maps
What features to focus on?
- Location updates, navigation ETA, and map rendering
How large is the road data and do we have access to it?
- Yes. Several TBs
Consider traffic conditions for accurate time estimates
Consider different travel modes such as driving, walking, bus
Map Projection:
- The process of translating the points from a 3D globe to a 2D plane
- Many different projections exist, each with strengths and limitations; all of them distort the actual geometry in some way
Geohashing
- An encoding system that encodes a geographic area into a short string of letters and digits
- Recursively divides the grid into subgrid squares
- 0th division: a single square that represents the entire earth
- 1st division: the square is cut into 4 quadrants, labeled:
01 11
00 10
- 2nd division: cut each square into 4 again -> 16 squares now
- Geohashing can be used for map tiling
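A sketch of that recursive subdivision: at each level the current cell is split into four quadrants labeled with a longitude bit followed by a latitude bit, matching the 01 11 / 00 10 diagram above (real geohash then interleaves the bits and base32-encodes them).

```python
# Sketch of quadrant subdivision: each level appends "<lng_bit><lat_bit>" for the
# quadrant containing the point, so a longer string means a smaller cell.
def geohash_bits(lat: float, lng: float, levels: int = 6) -> str:
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    out = []
    for _ in range(levels):
        lng_mid = (lng_lo + lng_hi) / 2
        lat_mid = (lat_lo + lat_hi) / 2
        lng_bit = 1 if lng >= lng_mid else 0
        lat_bit = 1 if lat >= lat_mid else 0
        out.append(f"{lng_bit}{lat_bit}")
        lng_lo, lng_hi = (lng_mid, lng_hi) if lng_bit else (lng_lo, lng_mid)
        lat_lo, lat_hi = (lat_mid, lat_hi) if lat_bit else (lat_lo, lat_mid)
    return "".join(out)

print(geohash_bits(37.7749, -122.4194))  # San Francisco
```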
Map rendering
- Foundational concept is tiling
- World is broken up into smaller tiles and client only downloads relevant tiles for the area, then stitches them together
- Distinct sets of tiles at different zoom levels
- Client only needs tiles for the zoom level of the map viewport
- Prevents consuming too much bandwidth
- For example, a single tile could be 256x256 pixel image
Road data processing for navigation algorithms
- Most routing algos use A* pathfinding
- They operate on a graph data structure
- Intersections are nodes, roads are edges
- Routing algos can use a concept similar to map tiles: partition the road graph into subgraphs
- Load subgraphs on demand
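A compact A* sketch on such a graph: intersections are nodes with coordinates, roads are weighted edges, and straight-line distance is the heuristic. The tiny graph here is made up purely for illustration.

```python
# Sketch of A* routing: nodes are intersections, edges are roads with costs,
# heuristic is straight-line distance (admissible, so the result is optimal).
import heapq
import math

coords = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (2, 1)}                    # intersections
edges = {"A": [("B", 1.0)], "B": [("C", 1.0), ("D", 2.5)], "C": [("D", 1.0)], "D": []}

def heuristic(u: str, v: str) -> float:
    (x1, y1), (x2, y2) = coords[u], coords[v]
    return math.hypot(x2 - x1, y2 - y1)

def a_star(start: str, goal: str):
    frontier = [(heuristic(start, goal), 0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        for nxt, weight in edges[node]:
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(frontier,
                               (new_cost + heuristic(nxt, goal), new_cost, nxt, path + [nxt]))
    return None

print(a_star("A", "D"))  # (3.0, ['A', 'B', 'C', 'D'])
```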
Back of the envelope
- What are the storage requirements for the entire collection of map tile images?
Zoom 0: 1 tile
Zoom 1: 4 tiles
Zoom 2: 16 tiles
….
Zoom 21: ~4.4 trillion tiles (4^21)
After roughly 20 zoom levels, you hit practicality limits. Adding more doesn’t really help the user
Assume each tile is a 256x256-pixel compressed PNG, which means an image size of about 100 KB
Total size is roughly 440 PB. Keep in mind ~90% of the earth's surface is uninhabited, therefore roughly 50 PB in practice
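The arithmetic behind that estimate (the deepest zoom level dominates the total):

```python
# Back-of-the-envelope tile storage: zoom 21 alone gives the headline number.
tiles_at_zoom_21 = 4 ** 21          # ~4.4 trillion tiles
tile_size_bytes = 100_000           # 256x256 compressed PNG, ~100 KB
total = tiles_at_zoom_21 * tile_size_bytes
print(round(total / 1e15))          # ~440 PB for all zoom-21 tiles
print(round(total * 0.1 / 1e15))    # ~44 PB after dropping the ~90% uninhabited area
```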
3 big features
1. Location service
2. Navigation service
3. Map Rendering
Geocoding database
- DB which stores places and their corresponding lat/lng pairs
Design Google Drive
Non-functional requirements
- Reliable. Data loss is unacceptable
- Fast sync speeds
- Minimizing bandwidth usage
- Highly available
Design YouTube
What are the important features?
- Video upload
- Watch a video
- Do we need to support international users? Yes
- Encryption required? Yes
- Max file size for video upload? 1GB
- Can we leverage some existing cloud services? Yes
High level design
- Videos are stored in a CDN (AWS CloudFront, Cloudflare). When you press play, the video is streamed from the CDN
- API Servers. Everything else. Feed recommendation, generating video upload URL, updating meta DB, user signup, etc
Transcoding:
- Video encoding. Translate a video into other formats and bitrates to provide the best possible playback for a given device and bandwidth
- Two components
1) Container. Holds the video content, and its format is reflected in the file extension, e.g., .avi, .mov, .mp4
2) Codecs. Compression and decompression algorithms. Reduce the video size while preserving quality, eg H.264, VP9, HEVC
File Upload Flow
- Videos are uploaded to S3
- Transcoding servers fetch the video and transcode it into the best possible formats given device and bandwidth constraints
- Once transcoding complete, send to CDN
- Completion handler updates the metadata DB
Video Streaming Flow
- We’re streaming the video from the CDN, not downloading to local device
- We will use a standard streaming protocol (eg MPEG-DASH, Apple HLS, etc)
- We will serve the video from the edge which is closest to the user
Safety Optimization: Presigned upload URLs
1. Client makes an HTTP POST /upload to API servers to receive a pre-signed URL
2. Client uploads the video using the pre-signed URL to S3
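A sketch of step 1 on the API server side, assuming S3 with boto3; the bucket name, key scheme, and expiry are assumptions, and AWS credentials are assumed to be configured.

```python
# Sketch: API server generates a pre-signed S3 PUT URL that the client uploads to directly.
import boto3

s3 = boto3.client("s3")

def create_upload_url(video_id: str) -> str:
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "raw-videos", "Key": f"uploads/{video_id}.mp4"},  # assumed names
        ExpiresIn=3600,  # URL valid for one hour
    )
```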