ACID Transactions
How does Delta Lake provide ACID transactions on a data lake?
Delta uses a transaction log (_delta_log) to track all changes:
Each write creates a new log entry (JSON + checkpoint files)
Transactions are atomic: either fully committed or not applied at all
Readers always see a consistent snapshot
💡 It uses optimistic concurrency control instead of locking.
Optimistic Concurrency Control
What is optimistic concurrency control in Delta Lake?
Instead of locking:
Multiple writers can attempt writes simultaneously
At commit time, Delta checks for conflicts
If a conflict is detected → the transaction fails and retries
💡 This improves scalability compared to traditional locking systems.
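The check-at-commit loop can be sketched in plain Python. This is a toy model, not Delta's actual implementation; `ToyTable` and its version counter are illustrative stand-ins for the table state tracked in the transaction log:

```python
# Toy optimistic concurrency control: a writer remembers the table version it
# read, does its work, and commits only if nobody else committed in between.
class ToyTable:
    def __init__(self):
        self.version = 0
        self.data = {}

    def try_commit(self, read_version, changes):
        # Conflict check at commit time: did another writer commit since we read?
        if read_version != self.version:
            return False          # conflict -> caller must retry
        self.data.update(changes)
        self.version += 1
        return True

def write_with_retry(table, changes, max_retries=3):
    for _ in range(max_retries):
        read_version = table.version        # optimistic read: no lock taken
        if table.try_commit(read_version, changes):
            return table.version
    raise RuntimeError("gave up after repeated conflicts")

table = ToyTable()
v = write_with_retry(table, {"a": 1})       # commits cleanly as version 1
```

A stale `read_version` (because another writer committed first) is exactly the conflict case from the bullets: the commit is refused and the caller retries against the new version.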
Transaction Log (_delta_log)
What is stored in the _delta_log and why is it important?
It contains:
File-level changes (add/remove files)
Schema metadata
Transaction history
💡 It enables:
ACID transactions
Time travel
Efficient reads (no need to scan full dataset)
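The file-level add/remove bookkeeping can be shown with a miniature log in plain Python (a toy sketch; real `_delta_log` entries are numbered JSON files plus Parquet checkpoints):

```python
import json

# Miniature _delta_log: each commit is one JSON entry listing "add" and
# "remove" file actions. The live snapshot is whatever replaying the log
# leaves behind -- no directory listing or full-dataset scan needed.
log = []  # stands in for 00000000.json, 00000001.json, ...

def commit(adds, removes=()):
    log.append(json.dumps({"add": list(adds), "remove": list(removes)}))

def snapshot():
    live = set()
    for entry in log:
        action = json.loads(entry)
        live |= set(action["add"])
        live -= set(action["remove"])
    return live

commit(["part-0001.parquet"])
commit(["part-0002.parquet"], removes=["part-0001.parquet"])  # a rewrite
print(snapshot())  # only part-0002.parquet is live
```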
Time Travel
What is Delta Lake time travel and when is it useful?
Allows querying previous versions of a table:
By version number or timestamp
Use cases:
Debugging data issues
Reproducibility (ML experiments)
Recovering from bad writes
💡 Example: restoring a table after an accidental overwrite.
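Time travel falls out of the log design: version N is simply the state after replaying the first N commits. A toy sketch (illustrative file names, not real Delta APIs):

```python
# Each commit bumps the version; older versions stay queryable as long as
# their data files have not been vacuumed away.
commits = [
    {"add": ["a.parquet"], "remove": []},             # version 1
    {"add": ["b.parquet"], "remove": []},             # version 2
    {"add": ["c.parquet"], "remove": ["a.parquet"]},  # version 3: an overwrite
]

def snapshot_as_of(version):
    live = set()
    for action in commits[:version]:   # replay only the first N commits
        live |= set(action["add"])
        live -= set(action["remove"])
    return live

print(sorted(snapshot_as_of(2)))  # ['a.parquet', 'b.parquet'] -- before the overwrite
print(sorted(snapshot_as_of(3)))  # ['b.parquet', 'c.parquet'] -- current state
```

Restoring after a bad write is then just "take the snapshot as of the last good version and commit it back".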
VACUUM
What does the VACUUM command do in Delta Lake?
Deletes old data files no longer referenced
Frees storage space
Tradeoff:
Removes ability to time travel beyond retention period
💡 Default retention = 7 days (important interview detail)
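The two conditions a file must meet before VACUUM touches it (unreferenced *and* past retention) can be sketched in plain Python (a toy model, not the real command):

```python
import datetime as dt

# Toy VACUUM: delete files that are (a) no longer referenced by the current
# snapshot and (b) older than the retention window (Delta's default: 7 days).
RETENTION = dt.timedelta(days=7)
now = dt.datetime(2024, 1, 15)

files = {  # path -> (still referenced by the snapshot?, last-modified time)
    "old_removed.parquet": (False, now - dt.timedelta(days=10)),
    "new_removed.parquet": (False, now - dt.timedelta(days=1)),   # kept: too recent
    "live.parquet":        (True,  now - dt.timedelta(days=30)),  # kept: referenced
}

def vacuum(files, now, retention=RETENTION):
    return {path for path, (referenced, mtime) in files.items()
            if not referenced and now - mtime > retention}

print(vacuum(files, now))  # only the old, unreferenced file is deletable
```

The retention check is what preserves recent time travel: a removed-but-recent file may still back an older version someone queries.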
Schema Enforcement
How does Delta enforce schema and why is it important?
Rejects writes that don't match the table schema
Prevents data corruption
💡 Example:
Writing a string into an integer column → fails
This ensures data quality at write time, not query time.
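A write-time schema check can be modeled in a few lines of plain Python (a toy validator; Delta's real enforcement works on Spark DataFrame schemas):

```python
# Toy schema enforcement: the table schema maps column -> type, and a write
# that does not conform is rejected before any data lands in the table.
schema = {"id": int, "name": str}

def validate(row, schema):
    for col, value in row.items():
        if col not in schema:
            raise ValueError(f"unknown column: {col}")
        if not isinstance(value, schema[col]):
            raise TypeError(f"{col}: expected {schema[col].__name__}, "
                            f"got {type(value).__name__}")

validate({"id": 1, "name": "ok"}, schema)          # conforming write: passes
try:
    validate({"id": "oops", "name": "x"}, schema)  # string into an int column
except TypeError as err:
    print("write rejected:", err)
```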
Schema Evolution
When should you enable schema evolution in Delta Lake?
Use when:
New columns are expected (e.g., ingestion pipelines)
Avoid when:
Schema must remain strict (e.g., curated datasets)
💡 Overuse can lead to messy, inconsistent schemas.
MERGE INTO (Upserts)
How does MERGE INTO work and why is it important?
It allows:
INSERT + UPDATE + DELETE in one operation
Used for:
CDC pipelines
Deduplication
Slowly changing dimensions
💡 Critical for building idempotent pipelines.
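The upsert semantics (match on a key → update, no match → insert) can be sketched with a toy in-memory target table; this is a simplification of `MERGE INTO` that omits the delete clause:

```python
# Toy upsert: match source rows to target rows on a key, update matches,
# insert the rest. Running it twice with the same source gives the same
# result -- which is what makes MERGE-based pipelines idempotent.
target = {1: {"id": 1, "name": "alice"},
          2: {"id": 2, "name": "bob"}}
source = [{"id": 2, "name": "bobby"},   # matched     -> UPDATE
          {"id": 3, "name": "carol"}]   # not matched -> INSERT

def merge(target, source, key="id"):
    for row in source:
        target[row[key]] = row          # upsert keyed on the match column
    return target

merge(target, source)
print(sorted(target))  # [1, 2, 3]
```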
Small File Problem
Why does Delta Lake suffer from the small file problem?
Because:
Distributed writes create many small files
Each file adds overhead (metadata + scheduling)
Impact:
Slower queries
Inefficient scans
💡 Common in streaming or frequent writes.
OPTIMIZE (File Compaction)
What does OPTIMIZE do in Delta Lake?
Combines small files into larger ones (~128MB ideal)
Improves read performance
💡 Often paired with ZORDER for better query speed.
Z-ORDER
When should you use partitioning vs Z-ORDER?
Partitioning → low-cardinality columns (e.g., date)
Z-ORDER → high-cardinality columns (e.g., user_id)
💡 Over-partitioning = too many small files (bad).
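The intuition behind Z-ordering is the classic bit-interleaving (Morton code) trick: build one sort key from several columns so rows that are close in *any* of those columns land near each other in file order. A toy sketch (not Delta's tuned implementation):

```python
# Interleave the bits of two column values into one Z-order sort key:
# x's bits go to even positions, y's bits to odd positions.
def z_value(x, y, bits=8):
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # even bit positions from x
        z |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions from y
    return z

rows = [(3, 7), (3, 6), (12, 1), (2, 7)]
rows.sort(key=lambda r: z_value(*r))
# Nearby (x, y) pairs now sit next to each other in the sorted order, so
# per-file min/max ranges stay tight for filters on either column.
```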
Data Skipping
How does Delta Lake avoid scanning unnecessary data?
Stores min/max statistics per file
Skips files that don't match query filters
💡 This reduces I/O significantly.
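File pruning with min/max statistics can be shown in a few lines of plain Python (toy stats; Delta keeps these per file in the transaction log):

```python
# Toy data skipping: keep per-file (min, max) for a column and prune every
# file whose range cannot contain the filter value -- before reading any data.
file_stats = {
    "part-1.parquet": (0, 99),     # (min, max) of some column, e.g. user_id
    "part-2.parquet": (100, 199),
    "part-3.parquet": (200, 299),
}

def files_to_scan(value, stats):
    return [path for path, (lo, hi) in stats.items() if lo <= value <= hi]

print(files_to_scan(150, file_stats))  # ['part-2.parquet'] -- 2 of 3 files skipped
```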
Updates and Deletes in Delta
How does Delta Lake handle UPDATE and DELETE operations internally?
Does NOT modify files in place
Creates new files with updated data
Marks old files as removed in transaction log
💡 This is called copy-on-write.
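The three bullets above can be sketched as a toy copy-on-write update (illustrative file names, not real Delta code):

```python
# Copy-on-write in miniature: an UPDATE writes a brand-new file with the
# changed rows and the commit swaps it in for the old one. The old file's
# bytes are untouched -- which is exactly what keeps time travel possible.
live_files = {"part-1.parquet": [("id", 1), ("id", 2)]}

def update(live, old_path, new_path, new_rows):
    committed = dict(live)
    committed[new_path] = new_rows   # "add" the rewritten file
    del committed[old_path]          # "remove" the old one from the snapshot
    return committed                 # (the old file still exists on disk)

live_files = update(live_files, "part-1.parquet", "part-2.parquet",
                    [("id", 1), ("id", 99)])
print(sorted(live_files))  # ['part-2.parquet']
```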
Copy-on-Write vs In-Place Updates
Why does Delta use copy-on-write instead of in-place updates?
Benefits:
Ensures immutability → easier consistency
Supports time travel
Simplifies distributed writes
Tradeoff:
More storage usage temporarily
Checkpointing in Delta
What is checkpointing in Delta Lake?
Periodically compacts transaction logs into Parquet files
Prevents long log replay times
💡 Improves read performance and scalability.
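The speedup comes from replacing a full log replay with "checkpoint + short tail". A toy sketch (real checkpoints are Parquet files written every N commits; this just folds entries in memory):

```python
# Toy checkpointing: materialize the state after the first `upto` commits so
# a reader replays only the tail instead of every JSON entry since version 0.
log = [
    {"add": ["a"], "remove": []},
    {"add": ["b"], "remove": ["a"]},
    {"add": ["c"], "remove": []},    # <- tail commits after the checkpoint
]

def replay(entries, start=None):
    live = set(start or ())
    for action in entries:
        live |= set(action["add"])
        live -= set(action["remove"])
    return live

def snapshot_with_checkpoint(log, checkpoint_at):
    checkpoint = replay(log[:checkpoint_at])       # one bulk, precomputed read
    return replay(log[checkpoint_at:], checkpoint) # then only the short tail

print(sorted(snapshot_with_checkpoint(log, 2)))  # ['b', 'c']
```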
Concurrent Writes
What happens if two jobs write to the same Delta table simultaneously?
Both attempt commit
One succeeds
Other fails with a conflict → must retry
💡 Ensures consistency without locking.
Delta vs Parquet
Why is Delta Lake better than raw Parquet files?
Parquet:
Just a file format
Delta:
Adds transaction layer + metadata
💡 Delta = Parquet + reliability + performance features.
Append vs Overwrite
What is the risk of using overwrite mode without Delta Lake?
Partial writes can corrupt data
No rollback mechanism
With Delta:
Overwrite is transactional → safe
Handling Late Data (Delta + Streaming)
How does Delta Lake help handle late-arriving data?
Supports upserts via MERGE
Works with watermarking in streaming
💡 Ensures correctness even with delayed events.
File Size Optimization
Why is ~128MB considered an optimal file size in Delta Lake?
Balances parallelism and overhead
Too small → too many tasks
Too large → less parallelism
💡 Sweet spot for Spark processing.
Delta Table Versioning
How are versions tracked in Delta Lake?
Each commit = new version number
Stored in _delta_log
💡 Enables time travel and auditing.
Change Data Feed (CDF)
What is Delta Change Data Feed and when is it used?
Tracks row-level changes (insert/update/delete)
Used for:
Incremental processing
Downstream system sync
💡 Useful for CDC pipelines.
Performance Bottlenecks
What are common causes of poor performance in Delta Lake tables?
Too many small files
Poor partitioning strategy
No Z-ORDER
Large shuffles
💡 Fix with OPTIMIZE, repartitioning, and query tuning.
When NOT to Use Delta Lake
Simple, read-only datasets
One-time batch processing
No need for updates or transactions
💡 Delta adds overhead; use it when you need reliability.