Caching Flashcards

(19 cards)

1
Q

What is caching

A

Caching is a technique used to store a copy of data temporarily so future requests for that data can be served faster. It helps you make better use of resources.

Flow:
When data is requested by the application, it is first checked in the cache. If the data is found in the cache, it is returned to the application. If the data is not found in the cache, it is retrieved from its original source, stored in the cache for future use, and returned to the application.

Request comes in
→ Application needs data (e.g., product details, user session).

Check the cache (Cache Hit or Miss)
→ If data is found in cache → ✅ Cache Hit
→ If data is not found in cache → ❌ Cache Miss

If Cache Hit:
→ Return the cached data immediately (super fast!).

If Cache Miss:
→ Go to the original data source (e.g., database or API).
→ Fetch the data.
→ Store this data in the cache for future use.
→ Return the data to the application.
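The flow above can be sketched in a few lines of Python. This is a minimal illustration, not a real cache: `fetch_from_source` is a hypothetical stand-in for the database or API, and a plain dict plays the cache.

```python
# Minimal sketch of the request flow above. `fetch_from_source` is a
# stand-in for the real origin (database, API); the dict plays the cache.

cache = {}

def fetch_from_source(key):
    # Pretend this is an expensive database or API call.
    return f"value-for-{key}"

def get_data(key):
    if key in cache:                 # Cache Hit: return immediately
        return cache[key]
    value = fetch_from_source(key)   # Cache Miss: go to the origin
    cache[key] = value               # store for future requests
    return value
```

The first call for a key is a miss and populates the cache; every later call for the same key is a hit.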

2
Q

Benefits and Tradeoffs

A

Benefits:
Faster performance (low latency)

Reduced load on backend systems

Improved scalability

Trade-offs:
Can serve stale (outdated) data if not managed well.

Complexity in deciding what to cache, how long, and when to refresh

3
Q

Caching Terminology

A

Cache: It is a temporary storage location for data, designed for faster access.

Cache hit: When the requested data is present in the cache.

Cache Miss: When the requested data is absent in the cache and needs to be fetched from the original data source.

Cache Eviction: When the cache reaches its maximum capacity, it has to decide which data to remove to store new data. This is done based on cache eviction policies.

Cache Staleness: When the data in the cache is outdated compared to the original data source.

4
Q

Types of Caching

In Memory Caching

A

In-memory caching: It’s a technique where frequently accessed data is stored in memory (RAM) to make data retrieval extremely fast, much faster than disk storage.

Use case: This type of caching is commonly used for caching API responses and session data.

In-memory caching (like Redis or Memcached) is typically located on the web server, or on a dedicated cache server in the backend infrastructure, wherever the website is hosted.

Two common setups:
a) On the same web server: the web server has built-in caching (like in-memory object caching). Good for simple applications or smaller scale. Faster because it’s local to the process.

b) On a separate caching server (e.g., a Redis cluster): used in large-scale systems. Multiple web servers access the same centralized cache. Improves consistency and performance under high traffic.

What is Memcached?
Lightweight, simple key-value store
Stores strings and objects in memory
Best for simple caching use cases
Doesn’t support complex data types (just basic key-value)
Does not support data persistence to disk.

✅ Use it when: You need to cache database query results, session data, or API responses

What is Redis?
More powerful than Memcached
Supports advanced data types: strings, lists, sets, sorted sets, hashes, bitmaps, and more
Can also be used as a message broker, pub/sub system, and persistent store
Supports data persistence to disk (optional)
Has built-in replication

✅ Use it when: You want more than just key-value storage, or you need things like counters, real-time analytics, queues, leaderboards, or caching with TTL (expiration). This makes Redis suitable for:

User sessions

Rate limiting

Shopping carts

Leaderboards

Caches that must survive restarts

Queues

Pub/Sub with persistence

If Redis supports persistence, how does it avoid latency, given that it also needs to write data to disk?

Redis supports persistence without adding latency by writing to memory first and flushing to disk asynchronously. With AOF every second, Redis batches writes and uses OS-level buffering so only 1 second of data might be lost, but write performance remains in-memory fast.
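The two setups above (local vs. centralized cache) can be sketched with a toy key-value store standing in for Memcached/Redis. Everything here is illustrative: a real deployment would use a cache client library over the network, not a shared Python object.

```python
# Sketch of the two setups: a shared cache object that several "web
# servers" reference, standing in for a centralized Redis/Memcached
# instance reached over the network.

class KeyValueCache:
    """Memcached-style store: plain get/set on string keys."""
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

class WebServer:
    def __init__(self, shared_cache):
        self.cache = shared_cache   # all servers point at the same store

shared = KeyValueCache()
server_a = WebServer(shared)
server_b = WebServer(shared)

# A value written via one server is visible from the other, because the
# cache is centralized rather than per-server.
server_a.cache.set("session:42", "alice")
```

With a per-process cache, `server_b` would have its own store and would miss on `session:42`; the centralized setup is what keeps multiple web servers consistent.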

5
Q

Disk Caching

A

Disk caching stores data on the hard disk or SSD, which is slower than RAM but faster than fetching data from a remote source (like a database or external API).

It is especially useful when:
a) The data is too large to fit in memory.
b) The cached data needs to persist across restarts or failures.

Disk caching is commonly used for:
a) Database query results (to reduce expensive computations or joins).
b) File system data, like images, documents, or logs.
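A disk cache can be as simple as serialized files that survive a restart. This is a minimal sketch with illustrative names; real systems add locking, eviction, and size limits.

```python
# Minimal disk-cache sketch: results persisted as JSON files so they
# survive a process restart. The directory and keys are illustrative.
import json
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())  # stand-in for a real cache directory

def disk_cache_set(key, value):
    (CACHE_DIR / f"{key}.json").write_text(json.dumps(value))

def disk_cache_get(key):
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    return None                       # miss: caller falls back to the source

disk_cache_set("report-2024", {"rows": 3})
```

Reads are slower than RAM but avoid recomputing or re-fetching from the remote source, which is the trade-off the card describes.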

6
Q

Database Caching

A

Database caching refers to storing frequently accessed data, queries, or results in the database itself, reducing the need to access external storage.
Why it is used:
a)Improve query performance
b)Reduce latency for end users
c)Lower load on the database engine
d)Help scale for more concurrent users

Database caching techniques: database query caching and result set caching.

Database query caching: caches the full SQL query plan and its results. If the same query is repeated, the result is returned from cache instead of being reprocessed. Usually lives outside the DB, in a store such as Redis.

Result set caching: caches the results returned from a query execution, stored temporarily for reuse. It matches the exact query and usually lives inside the DB server. It caches the query as the key and stores the results as the value.

e.g.: The query string (“SELECT * FROM products WHERE category = ‘shoes’”) is used as the key.

The result of that query (e.g., a list of shoe products) is stored as the value.
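The key/value idea above can be shown directly: the SQL text is the key, the rows are the value. `run_query` is a hypothetical stand-in for real query execution.

```python
# Result-set caching sketch: the SQL string is the key, the rows are the
# value. `run_query` stands in for actual execution against the engine.

result_cache = {}

def run_query(sql):
    # Pretend this executes against the database.
    return [{"id": 1, "name": "running shoe"}]

def cached_query(sql):
    if sql in result_cache:          # exact query-string match
        return result_cache[sql]
    rows = run_query(sql)
    result_cache[sql] = rows
    return rows
```

Note the limitation this implies: even a trivially different query string (extra whitespace, different parameter order) is a different key and misses the cache.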

7
Q

Client Side Caching

A

Client-side caching refers to storing data on the user’s device (browser or app) so that future requests for the same data such as images, CSS, or JavaScript files can be served faster without reaching the server again.

How It Works:
When a user visits a website or makes an API call:
a)The server sends the requested data along with caching instructions (e.g., Cache-Control headers).
b) The client (usually a browser) stores this data in local storage areas like: Browser Cache, Local Storage / Session Storage, Cookies, and IndexedDB

Benefits:
a) Faster load times
b)Reduced server load
c)Better user experience

Note: All of the below are stored on the hard drive of the client device.

a) Browser Cache: stores static files (HTML, CSS, JavaScript, images) on your device’s hard disk or SSD.
b) Local Storage: stores data as key-value pairs, e.g., user preferences/settings such as language. This is persistent.
c) Session Storage: also stores key-value pairs. It is not persistent: even though it lives on the hard drive, the browser removes it automatically once the session ends (e.g., the tab is closed). Use case: temporarily saving form inputs.
d) Cookies: small pieces of data stored on the device’s disk, used mainly for session management and tracking. Example: an authentication token so the user stays logged in after closing the browser.

Session management flow:
a)User adds item to cart.
Example: “Nike Shoes - Size 10”
b) Server creates a session and assigns a unique session ID (e.g., xyz789).
c)Server saves the cart data in a backend store (like Redis, Memcached, or a database)
Example: { session_id: ‘xyz789’, cart: [‘Nike Shoes’] }
d) Server sends a response back to the browser with this header: Set-Cookie: session_id=xyz789; Path=/; HttpOnly
e) Browser stores the cookie: session_id=xyz789.
f) User closes the browser.
g) Days later, the user revisits; the browser sends session_id=xyz789 automatically.
h) Server uses the session ID to restore the cart.

All of the above storage is handled by the browser via client-side interaction only, except cookies. Cookies are the only browser storage mechanism that supports both:
Client-side + Server-side interaction
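The session flow above can be sketched server-side in Python. This is a toy model: a dict stands in for the backend store (Redis/Memcached/DB), `secrets.token_hex` generates the session ID, and the Set-Cookie header is built by hand rather than by a web framework.

```python
# Sketch of the session-cookie flow: server keeps cart data in a backend
# store and hands the browser only a session ID via Set-Cookie.
import secrets

session_store = {}   # stand-in for Redis, Memcached, or a database

def add_to_cart(item):
    session_id = secrets.token_hex(8)            # e.g. 'xyz789'
    session_store[session_id] = {"cart": [item]}
    # Header the server would send back to the browser:
    set_cookie = f"session_id={session_id}; Path=/; HttpOnly"
    return session_id, set_cookie

def restore_cart(session_id):
    # Days later: browser sends the cookie, server restores the cart.
    return session_store.get(session_id, {}).get("cart", [])

sid, header = add_to_cart("Nike Shoes - Size 10")
```

The browser only ever holds the opaque session ID; the cart itself never leaves the server-side store.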

8
Q

Server Side Caching

A

This type of caching occurs on the server, typically in web applications or other backend systems. Server-side caching can be used to store frequently accessed data, precomputed results, or intermediate processing results to improve the performance of the server.

Note: This caching can live both in the server’s memory (RAM) and on its hard disk (disk caching).

This caching is used on web servers, app servers, etc.

Examples:
a) Full-page caching: caches the entire HTML output of a page; used when the page does not change often (static).

b) Fragment caching: caches parts of a page (fragments), not the entire page, such as a product listing within a page.

c) Object caching: caches specific database query results or backend data objects in memory. E.g., caching a user profile.

9
Q

CDN Caching

A

A CDN = a distributed cluster of edge servers located around the world.

Content Delivery Network caching: CDN caching stores data on a distributed network of servers, reducing the latency of accessing data from remote locations. This type of caching is useful for data that is accessed from multiple locations around the world, such as images, videos, and other static assets. It caches and serves content from edge servers (servers close to the user) instead of your main server.

How it works:
a)User requests a file (e.g., image.jpg)
b)CDN checks if the file exists in its edge cache
c)✅ If yes → serves directly from cache (fast)
d)❌ If not → fetches from your origin server, caches it, and then serves it
e)Future requests from other users near that edge server are served from the cache.

Advantages:
a)Faster load times due to proximity
b)Lower bandwidth costs for the origin server
c)DDoS protection (many CDNs act as a security layer)
d)Higher availability and global performance

Note: edge servers store the cache both in memory and on hard disk.

Real examples: Cloudflare, Akamai, and Amazon CloudFront.

🔁 Step-by-step Flow (Assuming the site uses a CDN):
🧍‍♂️ 1. User requests www.example.com/video.mpg in the browser
This is a standard HTTP(S) request to load a video.

🌐 2. Browser initiates DNS resolution
It needs to convert www.example.com into an IP address.

It sends a request to the local DNS resolver (usually your ISP or a system-level DNS like Google DNS at 8.8.8.8).

🛰️ 3. DNS resolver contacts the CDN’s authoritative DNS
The domain www.example.com is managed by a CDN (e.g., Cloudflare, Akamai, etc.)

The resolver forwards the DNS query to the CDN’s DNS server.

📍 4. CDN DNS picks the best edge server
The CDN DNS:

Looks at your IP address to determine your geolocation.

Checks for nearby edge servers.

Filters based on:

Server health ✅

Load 🔁

Latency ⚡

It chooses the best edge server (e.g., one in Delhi) and responds with that edge server’s IP.

🔄 5. Browser connects to the edge server
Now your browser sends an HTTP(S) request (GET /video.mpg) to that edge server.

📦 6. Edge server checks its cache
If the video is already cached:

✅ It serves the video immediately — fast delivery!

If the video is not cached:

❌ It fetches it from the origin server, caches it locally, and then serves it to the user.

🔁 7. Next users nearby get it from cache
Future users in the same region will get the video directly from that edge cache, speeding things up even more.

10
Q

DNS Caching

A

DNS cache is a type of cache used in the Domain Name System (DNS) to store the results of DNS queries for a period of time.

Advantages:

a)Faster domain resolution
b)Reduced DNS server load
c)Improved user experience

DNS caching can occur at multiple levels:
Browser Cache
Operating System Cache
Router/Modem
ISP DNS Servers

How it works:
a)You visit www.example.com.
b)The browser checks its cache. If not found…
c)The OS checks its DNS cache. If not found…
d)The router checks its cache. If not found…
e)It asks the ISP’s DNS resolver, which may also cache the result.
f)If the domain is not cached anywhere, the full DNS resolution process happens:
From Root → TLD → Authoritative DNS.
g) Once resolved, the answer is cached at all these levels for future use.
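The layered lookup above can be sketched as a chain of caches asked in order, with a miss at every level falling through to full resolution. `full_resolution` and the IP address are illustrative stand-ins.

```python
# Sketch of the multi-level DNS cache: browser -> OS -> router -> ISP.
# Each level is a dict; `full_resolution` stands in for Root -> TLD ->
# Authoritative resolution.

def full_resolution(domain):
    return "93.184.216.34"   # made-up answer from the authoritative server

def resolve(domain, levels):
    """levels: ordered list of (name, cache_dict) pairs."""
    for name, cache in levels:
        if domain in cache:
            return cache[domain]          # answered from this level
    ip = full_resolution(domain)
    for name, cache in levels:            # cache at every level on the way back
        cache[domain] = ip
    return ip

browser, os_cache, router, isp = {}, {}, {}, {}
chain = [("browser", browser), ("os", os_cache),
         ("router", router), ("isp", isp)]
```

After the first resolution, every level holds the answer, so a repeat lookup is served by the very first cache checked.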

General Flow of DNS look up:

What Happens When You Access a Website — DNS Resolution (Simple View):
a)User Enters URL
You type www.amazon.com in your browser and hit Enter.
b)Browser Asks the OS for IP: The browser sends a request to the operating system:
“What’s the IP address for www.amazon.com?”

c) OS Sends DNS Query: The operating system sends a DNS query to the configured DNS server (usually your ISP’s DNS resolver or a public one like Google DNS 8.8.8.8).

d) DNS Resolver Starts Resolution: If the resolver doesn’t already know the IP, it performs the full DNS lookup:

a. Root Server

Asks: “Where can I find .com domains?”

Gets referred to .com TLD (Top level Domain) servers.

b. TLD Server (.com)

Asks: “Where can I find amazon.com?”

Gets referred to Amazon’s authoritative name server.

c. Authoritative DNS Server (Amazon)

Asks: “What is the IP for www.amazon.com?”

Reply: “It’s 176.32.103.205” (example).

e)IP is Returned to OS → Browser
The DNS resolver sends the IP address back to your operating system, which passes it to the browser.

f) Browser Uses IP to Connect
The browser initiates a TCP/HTTP (or HTTPS) connection to that IP address — and now the actual website starts loading.

11
Q

Difference between In Memory (RAM) and Disk Cache (Hard Drive)

A
  1. In-Memory Cache (RAM) – Fastest
    Where: Stored in the server’s RAM

Use when: You need ultra-fast access to frequently used data (e.g., user sessions, API results)

Examples: Redis, Memcached, or in-process cache (Map, LRUCache)

Pros: Extremely fast

Cons: Volatile (data lost on restart), limited by available memory. Not persistent.

🧠 2. Disk-Based Cache (Hard Disk/SSD) – Persistent
Where: Stored on the server’s hard drive

Use when: You need larger, more persistent caching (e.g., file caching, large result sets)

Examples: File system cache, CDN edge disk cache, Redis with persistence (RDB/AOF)

Pros: Can survive server restarts, supports more data

Cons: Slower than RAM

12
Q

Cache Invalidation

A

Cache Invalidation is the process of removing or updating stale (outdated) data from a cache to ensure users get the most accurate and up-to-date information.
E.g.: say a product price is updated in the DB; we must invalidate the old cache entry so users won’t see the stale or outdated info.

13
Q

Cache Invalidation Methods

A

a) Time-to-Live (TTL): Each cached item has an expiry time (e.g., 10 minutes). After the TTL expires, the cache refreshes the data from the origin server.

Use case: Best for data that can tolerate being slightly stale.
Example: Set TTL = 300s for weather API response.

Deletes Immediately: After Time out
Useful for: Auto expiry
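TTL is easy to see in code: each entry stores an expiry timestamp, and a read past that timestamp behaves as a miss. A minimal sketch, with illustrative keys and values:

```python
# TTL sketch: each entry stores (value, expiry); reads past the expiry
# are treated as a miss so the caller refetches from the origin.
import time

ttl_cache = {}

def ttl_set(key, value, ttl_seconds):
    ttl_cache[key] = (value, time.monotonic() + ttl_seconds)

def ttl_get(key):
    entry = ttl_cache.get(key)
    if entry is None:
        return None
    value, expires_at = entry
    if time.monotonic() >= expires_at:   # expired: treat as a miss
        del ttl_cache[key]
        return None
    return value

ttl_set("weather:delhi", {"temp_c": 31}, ttl_seconds=300)
```

Real caches (Redis `EXPIRE`, Memcached expiry) do this bookkeeping for you; the point is only that expiry turns a hit back into a miss.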

b) Purge: Forcefully removes specific cache content such as specific URL or, specific object.

Use case: When you know data is outdated (e.g., updated blog post), you purge the cache entry.

Example: PURGE /blog/123 → removes that specific page from cache.

Deletes Immediately: Yes
Useful for:Precise invalidation

c)Manual Invalidation/Refresh: Actively replaces a cached item with new data from the source. Fetches requested content from the origin server, even if cached content is available.

Use case: When you want to keep the cache entry fresh but not wait for it to expire.

How: Fetch from DB or API and update the cache.

Example: After updating a product, immediately refresh its cache with new data.

Deletes Immediately: Yes (replaces)
Useful for: Manual freshness

d) Ban: Marks cache entries as invalid based on specific criteria such as a URL pattern or header (but doesn’t delete them immediately). The next time an entry is requested, it is fetched from the origin and the cache is updated.

Use case: Used in reverse proxies like Varnish to invalidate content by pattern.

Example: BAN /products/* → next time it’s requested, fetch from origin and update cache.

Deletes Immediately: No
Useful for: Pattern based

e) Stale-while-revalidate: Serve stale content immediately, then update it in the background.

Use case: Improves performance and freshness without blocking the user.

How:
a)User A requests → gets stale data quickly
b)Meanwhile, the server fetches fresh data → updates cache
c) User B gets updated version.

Example: Common in: CDNs, service workers, HTTP cache headers.

Deletes Immediately: No
Useful for: Performance + freshness

14
Q

Three main cache invalidation strategies/Cache Write Strategies

A

1) Write Through Cache: A write-through cache is a caching strategy where every write/update from the application is immediately written to both:
a)The cache (e.g., Redis, Memcached)
b) The database (e.g., MySQL, MongoDB)
Only once both writes complete does it report success to the application.

Speed: Slower
Consistency: High

Pros:
a)High Consistency because both Db and Cache are in sync. No stale data in cache.
b) Read performance: Always hits fresh data from cache
c) No Risk of data loss.
Cons:
a) Introduces write latency, since every write goes to both the cache and the DB.
b) Higher infrastructure cost (more usage of cache + DB).

Use case:
a) When reads are frequent and you need fresh data always such as Profile data, Product data.
b) Ideal for read-heavy systems with frequent access to recently written data

2) Write Around Cache: In this, data is only written to the database, not to the cache during a write.
The cache is updated later, only when the data is read by the application after a cache miss.

Speed: Balanced
Consistency: stale reads

Pros:
a) Avoids cache pollution (no caching of rarely accessed data)
b) Fewer cache write operations, so less infra cost.
c) Simpler logic

Cons:
a) First Reads are slow, due to cache miss and then go to the DB.
b) Stale reads possible (if read occurs before cache is populated)

Use case:
a) Cold data, rarely accessed
b) Useful for write-heavy systems where read-after-write is rare

3) Write-Back/Behind Cache: In write-behind, the data is first written to the cache, and asynchronously written to the database later (usually in batches or after a delay). The cache itself is responsible for updating the DB.

Speed: Fast
Consistency: Eventual

Pros:
a) Very fast writes
b)Great for high write throughput
c) Reduces DB load

Cons:
a)Risk of data loss (if the app crashes before DB write)
b)Eventually consistent — cache and DB may be out of sync for a short time

Use Case:
a) Analytics data
b)Non-critical user activity logs
c) Telemetry, metrics
d)Great for high write throughput scenarios.

Important:

Strategy        Cache type          Who updates DB
Write-around    General-purpose     Application
Write-through   General-purpose     Application
Write-back      Specialized cache   Cache

A general-purpose cache exposes only simple APIs (GET & SET) plus eviction policies.

A specialized cache has DB fetching/updating capabilities; it can act as a persistence layer.
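The contrast between write-through and write-back can be sketched with two dicts standing in for the cache and the DB. The `flush` function models the asynchronous/batched write the cache performs in write-back; all names are illustrative.

```python
# Write-through vs. write-back sketch. Plain dicts stand in for the
# cache and the database.

cache, db = {}, {}
dirty = set()                      # keys written to cache but not yet flushed

def write_through(key, value):
    cache[key] = value
    db[key] = value                # both updated before reporting success

def write_back(key, value):
    cache[key] = value             # fast path: memory only
    dirty.add(key)                 # remember to flush to the DB later

def flush():
    """The asynchronous/batched flush the cache performs in write-back."""
    for key in list(dirty):
        db[key] = cache[key]
        dirty.discard(key)
```

The window where a key sits in `dirty` but not in `db` is exactly the write-back risk the card describes: a crash there loses the write, while write-through never has such a window.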

15
Q

Cache Read Strategies

A

Read Through Cache:

How it works:
a)The application always reads from the cache.
b)If data is not in cache, the cache itself fetches from the database, stores it, and returns to the app.

📌 Key Point:
The cache layer handles database access on a miss — app only talks to cache. This helps to maintain consistency b/w the cache and db, as the cache is always responsible for retrieving and updating the data.

Use Case:
a)Simplifies application code (no cache-check logic).
b)Good for centralized cache logic.
c) Where data retrieval from the DB is expensive and cache misses are relatively infrequent.

Downside:
Requires more sophisticated caching tools or middleware. In this pattern the cache and DB are tightly coupled.

Code complexity: moderate
Ideal for seamless data access pattern

Cache-Aside (a.k.a. Read-Aside / Lazy Loading): more frequently used.
🔧 How it works:
a)Application checks the cache first.
b)If hit: return data.
c)If miss: app fetches from database, writes it into cache, and returns the data.

📌 Key Point:
The application is responsible for reading from and writing to the cache. This gives better control over the caching process, as the application can decide when and how to update the cache.

✅ Use Case:
a)Fine-grained control over what is cached.
b)Works well when cache needs to be updated manually after DB writes.

📉 Downside:
a)Slightly more logic in the application.
b) Cache misses can result in more DB hits.

Code complexity:Higher
Ideal for Manual control, custom logic
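The difference between the two patterns is who touches the DB on a miss, which a short sketch makes concrete. The dict `db` is a stand-in for the real database; names are illustrative.

```python
# Read-through vs. cache-aside sketch. In read-through the cache object
# fetches from the DB itself; in cache-aside the application does.

db = {"user:1": "alice"}   # stand-in for the database

class ReadThroughCache:
    """App only talks to this object; it handles DB access on a miss."""
    def __init__(self, source):
        self.source = source
        self.data = {}

    def get(self, key):
        if key not in self.data:
            self.data[key] = self.source[key]   # cache fetches and stores
        return self.data[key]

def cache_aside_get(cache, key):
    """App checks the cache and, on a miss, fetches and populates it itself."""
    if key in cache:
        return cache[key]
    value = db[key]          # the application talks to the DB directly
    cache[key] = value
    return value
```

Note where the DB lookup lives: inside the cache class for read-through, inside application code for cache-aside. That placement is the whole distinction.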

16
Q

Cache Eviction Policies

A

First In First Out (FIFO):
Removes the oldest inserted item first.

Last In First Out (LIFO):
Removes the most recently inserted item first.

Least Recently Used (LRU):
Removes the item that hasn’t been used for the longest time.

Most Recently Used (MRU):
Removes the item that was used most recently.

Least Frequently Used (LFU):
Removes the item that is accessed the least number of times.

Random Replacement (RR):
Randomly selects an item to remove when space is needed.
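LRU, the most common of these policies, can be sketched with `collections.OrderedDict`: reads move a key to the "recent" end, and inserts beyond capacity evict from the "old" end.

```python
# LRU eviction sketch using OrderedDict: get() marks a key as most
# recently used; put() evicts the least recently used key at capacity.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used

lru = LRUCache(2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")        # 'a' is now most recently used
lru.put("c", 3)     # capacity exceeded: evicts 'b'
```

The other policies differ only in which entry `put` evicts: insertion order for FIFO, access count for LFU, a random choice for RR.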

18
Q

How cache can take down db ?

A

1️⃣ Problem: Repeated cache misses for NON-EXISTENT keys

(aka cache penetration)

Scenario:

Attacker (or buggy client) sends tons of requests for a key that doesn’t exist in DB.

Flow:

Request → Cache → miss

Cache → DB → “not found”

Next request → same thing again

Result: DB hammered by junk keys, cache not helping at all.

✅ Fixes:

Build a Bloom filter of all valid keys from the DB (or updated periodically).

On request:

If Bloom filter says “definitely not present” → drop or short-circuit (don’t hit cache/DB).

If Bloom filter says “might be present” → go to cache → DB if cache miss.

So the Bloom filter protects the DB from obviously invalid keys.

Explain Bloom filter:

A Bloom filter is:

An in-memory bit array + multiple hash functions

Used to answer one question very fast:

“Is this key definitely not in the database?”

Possible answers:

Definitely NOT present → safe to stop immediately

Maybe present → go check cache / DB

It never says “definitely present”.
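A minimal Bloom filter sketch, using a plain Python list as the bit array and hashes derived from `hashlib`. The sizes here are tiny and illustrative; real filters size the bit array and hash count from the expected key count and target false-positive rate.

```python
# Minimal Bloom filter: a bit array plus k hash positions per key,
# derived deterministically from sha256. Illustrative sizes only.
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False -> definitely NOT present (safe to short-circuit)
        # True  -> maybe present (go check cache / DB)
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:123")
```

Because `add` only ever sets bits, a key that was added can never read as absent (no false negatives); unrelated keys can occasionally collide on all positions, which is the false-positive case the card describes.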

1️⃣ Inside the application server (most common)
Client

App Server
├─ Bloom Filter (in memory)
├─ Cache (Redis)
└─ Database

Bloom filter is loaded into memory at app startup

Very fast (nanoseconds)

No network hop

Used before cache/DB calls

This is the most common design.

✅ Who writes the Bloom filter?

Developers do.

It is:

Part of application logic

Runs inside the app server process

Uses RAM

Checked before cache and DB calls

So yes — it is written, owned, and maintained by developers.

b) Negative caching (very common)

If DB says “not found for key X”, cache that “not found” for a short TTL:

Key: user:12345 → value: NULL / “NOT_FOUND” (TTL 30–60s)

Future requests hit the cache and never go to DB.

Explain:

🔄 How negative caching works (step-by-step)
Step 1: First request

Request: GET /users/999999

Cache lookup → MISS

DB lookup → NOT FOUND

Step 2: Cache the “NOT FOUND” result

Instead of doing nothing, you store:

Key: user:999999
Value: NOT_FOUND (or NULL)
TTL: 30–60 seconds

Step 3: Next requests

Another request for user:999999:

Cache lookup → HIT (NOT_FOUND)

Application returns 404 immediately

DB is never touched

This is often simpler than a Bloom filter.

c) Rate limiting / WAF

If it’s a clear abuse pattern:

Rate-limit by IP / token / key pattern

Use WAF or app firewall to block obviously bad traffic.

2️⃣ Problem: Many requests after a hot key expires

(aka cache stampede / thundering herd)

Scenario:

Hot key (or many hot keys) expire.

Suddenly hundreds/thousands of requests see a cache miss.

All of them go to DB → DB overwhelmed → can go down.

a) Per-key lock / single-flight (your idea)

For a given key K:

First request misses the cache → acquires a lock for K → goes to DB.

Other requests for K:

Either wait on that lock, or

Get the old stale value until refresh finishes.

When the DB result comes back:

Response is written to cache

Lock is released

Waiting requests now read from cache

Only one DB hit per key on expiry = DB safe.
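A single-process sketch of the per-key lock, using `threading`. The double check inside the lock is what guarantees only one DB hit; in a distributed system the lock would instead be something like a Redis `SET NX` lock, which this toy version does not cover.

```python
# Single-flight sketch: a per-key lock ensures only one thread refreshes
# an expired hot key; the rest wait and then read the refreshed value.
import threading

cache = {}
locks = {}
locks_guard = threading.Lock()
db_hits = 0

def load_from_db(key):
    global db_hits
    db_hits += 1                        # count real DB hits
    return f"fresh-{key}"

def get_with_single_flight(key):
    if key in cache:
        return cache[key]
    with locks_guard:                   # one lock object per key
        lock = locks.setdefault(key, threading.Lock())
    with lock:
        if key in cache:                # another thread already refreshed it
            return cache[key]
        cache[key] = load_from_db(key)  # only one DB hit per expiry
        return cache[key]

threads = [threading.Thread(target=get_with_single_flight, args=("hot",))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Ten concurrent requests for the same expired key produce exactly one DB load; the other nine are satisfied by the second cache check inside the lock.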

b) Stale-while-revalidate

When TTL expires, don’t immediately delete the cached value.

Mark it as stale, but still serve it to users.

In the background, one request (or a background worker) refreshes the value from DB.

So users never cause a DB storm.

c) Add randomized TTL (to avoid many keys expiring together)

If all keys have TTL = 600s, tons of them might expire at the exact same moment → mini-disaster.

Instead set:

TTL = base + random_jitter

e.g., 600–900 seconds

This spreads expiries over time → smoother DB load.
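The jitter itself is one line; base and spread values below match the 600–900s example above and are otherwise arbitrary.

```python
# Jittered-TTL sketch: instead of a fixed 600s for every key, each key
# gets 600s plus up to 300s of random jitter, spreading out expiries.
import random

def jittered_ttl(base=600, jitter=300):
    return base + random.uniform(0, jitter)

ttls = [jittered_ttl() for _ in range(100)]
```

Because the expiry times now differ per key, a fleet of keys cached at the same moment no longer expires at the same moment.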

19
Q

How Bloom Filters Are Kept in Sync with the DB?

A

1️⃣ Build once + update on writes (most common)

This is the standard approach.

How it works:

Initial build

Scan all existing primary keys from DB

Insert them into the Bloom filter

On every CREATE

When new data is written to DB, also insert the key into the Bloom filter.

On DELETE

Usually ignored (Bloom filters don’t support deletes)

Accept some false positives

Periodically rebuild the filter

Why this works:

Bloom filter only needs to answer:

“Could this key exist?”

False positives are acceptable

False negatives are NOT (and this approach avoids them)

2️⃣ Periodic rebuild (very common at scale)
Why rebuild?

Deletes

Data migrations

Long-running drift

Memory pressure

How:

Nightly / hourly job

Scan DB primary keys

Rebuild Bloom filter from scratch

Atomically swap old filter with new one

Benefit:

Simple

Safe

Keeps false-positive rate low

Used by many large systems.

3) Event-driven sync (CDC / messaging)

For high-scale or distributed systems.

Flow:

DB change (insert/delete/update)

CDC (Change Data Capture) stream or event published

Bloom filter updater consumes event

Bloom filter updated in near real-time

Examples:

MySQL binlog

Kafka CDC streams

DynamoDB Streams

Benefit:

Near real-time sync

No heavy DB scans

What happens if Bloom filter update fails?

Normal write flow
App
├─ Write to DB ✅
├─ Add key to Bloom filter ✅
└─ Return success

Failure case (Bloom filter add fails)
App
├─ Write to DB ✅
├─ Add key to Bloom filter ❌ (timeout / error)
└─ Return success anyway

Consequence:

Bloom filter may temporarily miss this key

A future read might:

Bloom filter says “definitely not present” ❌ (false negative risk)

⚠️ False negatives are dangerous, so we design to avoid them.

🛡️ How systems protect against false negatives
1️⃣ Update Bloom filter after DB write (but don’t depend on it)

DB commit happens first

Bloom filter update is best-effort

2️⃣ Periodic rebuild

Regular rebuild from DB keys

Fixes any missed inserts

Keeps Bloom filter accurate over time

⚠️ Why NOT make Bloom filter update transactional?

Because:

Bloom filters don’t support transactions

They may be in-memory or Redis-based

Making DB writes depend on Bloom filter:

Reduces availability

⚠️ Important clarification

“we can periodically check for the new keys in the DB?”

✔️ Conceptually yes, but
❌ In practice we usually rebuild from scratch, not just “check deltas”.

Why?

Because:

Bloom filters don’t support deletes

Full rebuild:

removes drift

resets false-positive rate

guarantees no false negatives

⏱ How often rebuilds happen (typical)
System size / pattern       Typical rebuild frequency
Small / low writes          Once per day
Medium traffic              Every 6–12 hours
High write volume           Every 1–3 hours
Extremely large systems     Rolling or continuous rebuild

It depends on:

Write rate

Acceptable false-positive rate

Cost of scanning DB

🎯 Interview-ready explanation (memorize this)

“Periodic rebuild means regenerating the Bloom filter from the database’s primary keys at fixed intervals. This ensures the filter stays accurate, fixes missed updates, and controls false positives. Most systems rebuild from scratch rather than tracking incremental changes, because Bloom filters don’t support deletes well.”