Databricks Interview Prep - Delta Lake Deep Dive Flashcards

(24 cards)

1
Q

ACID Transactions

How does Delta Lake provide ACID transactions on a data lake?

A

Delta uses a transaction log (_delta_log) to track all changes:
Each commit appends a new JSON entry to the log (with periodic Parquet checkpoints)
Transactions are atomic → either fully committed or not applied
Readers always see a consistent snapshot
👉 It uses optimistic concurrency control instead of locking.

2
Q

Optimistic Concurrency Control

What is optimistic concurrency control in Delta Lake?

A

Instead of locking:
Multiple writers can attempt writes simultaneously
At commit time, Delta checks for conflicts
If a conflict is detected → the transaction fails and retries
👉 This improves scalability compared to traditional locking systems.
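A minimal pure-Python sketch of the commit protocol (illustrative only, not Delta's actual code): each writer remembers the table version it read, and a commit succeeds only if that version is still current; a losing writer rereads and retries.

```python
# Toy optimistic-concurrency model: illustrative only, not Delta's real code.
class Table:
    def __init__(self):
        self.version = 0
        self.data = []

    def commit(self, read_version, rows):
        """Succeed only if nobody committed since we read; else signal conflict."""
        if read_version != self.version:
            return False              # conflict: another writer committed first
        self.data.extend(rows)
        self.version += 1             # each successful commit = new version
        return True

def write_with_retry(table, rows, max_retries=3):
    """Reread the current version and retry on conflict."""
    for _ in range(max_retries):
        read_version = table.version
        if table.commit(read_version, rows):
            return True
    return False

table = Table()
v = table.version                     # writer A reads version 0
write_with_retry(table, ["b"])        # writer B commits first -> version 1
assert table.commit(v, ["a"]) is False   # writer A's stale commit conflicts
assert write_with_retry(table, ["a"])    # retry with a fresh read succeeds
```

No locks are ever held: conflicts are detected only at commit time, which is what makes this scale across many concurrent writers.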

3
Q

Transaction Log (_delta_log)

What is stored in the _delta_log and why is it important?

A

It contains:
File-level changes (add/remove files)
Schema metadata
Transaction history
👉 It enables:
ACID transactions
Time travel
Efficient reads (no need to scan full dataset)
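The file-level bookkeeping can be sketched in plain Python (a toy model; the real log is JSON files with a richer schema): replaying add/remove actions in commit order yields exactly the set of files a reader should scan, with no full-dataset listing.

```python
# Illustrative model of _delta_log replay: each commit is a list of actions.
log = [
    [{"add": "part-000.parquet"}],                                  # version 0
    [{"add": "part-001.parquet"}],                                  # version 1
    [{"remove": "part-000.parquet"}, {"add": "part-002.parquet"}],  # version 2
]

def snapshot(log):
    """Replay all commits; the surviving 'add' files form the current snapshot."""
    files = set()
    for commit in log:
        for action in commit:
            if "add" in action:
                files.add(action["add"])
            if "remove" in action:
                files.discard(action["remove"])
    return files

print(sorted(snapshot(log)))   # readers scan only these files
```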

4
Q

Time Travel

What is Delta Lake time travel and when is it useful?

A

Allows querying previous versions of a table:
By version number or timestamp
Use cases:
Debugging data issues
Reproducibility (ML experiments)
Recovering from bad writes
👉 Example: restoring a table after accidental overwrite.
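Conceptually, time travel is just replaying the log up to a chosen version. A toy sketch (the action format here is invented for illustration):

```python
# Toy time travel: replay the log only up to the requested version.
log = [
    [{"add": "p0.parquet"}],      # version 0
    [{"add": "p1.parquet"}],      # version 1
    [{"remove": "p0.parquet"}],   # version 2 (accidental delete)
]

def files_as_of(log, version):
    """Rebuild the file set a reader would have seen at `version`."""
    files = set()
    for commit in log[: version + 1]:
        for action in commit:
            if "add" in action:
                files.add(action["add"])
            if "remove" in action:
                files.discard(action["remove"])
    return files

assert files_as_of(log, 1) == {"p0.parquet", "p1.parquet"}   # before the bad write
assert files_as_of(log, 2) == {"p1.parquet"}                 # current state
```

In Delta itself this corresponds to `SELECT * FROM tbl VERSION AS OF 1` (or `TIMESTAMP AS OF`) in SQL, or `spark.read.format("delta").option("versionAsOf", 1)` in PySpark.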

5
Q

VACUUM

What does the VACUUM command do in Delta Lake?

A

Deletes old data files no longer referenced
Frees storage space
Tradeoff:
Removes ability to time travel beyond retention period
👉 Default retention = 7 days (important interview detail)
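A toy model of the retention rule (not the real VACUUM code): a file is physically deleted only if it is both unreferenced by the current snapshot and older than the retention window.

```python
# Toy VACUUM: drop files that are unreferenced AND older than retention.
RETENTION_HOURS = 7 * 24   # Delta's default retention is 7 days

def vacuum(all_files, referenced, age_hours):
    """Return surviving files; unreferenced files past retention are deleted."""
    kept = []
    for f in all_files:
        if f in referenced or age_hours[f] < RETENTION_HOURS:
            kept.append(f)
    return kept

all_files = ["a.parquet", "b.parquet", "c.parquet"]
referenced = {"a.parquet"}   # still part of the current snapshot
age_hours = {"a.parquet": 200, "b.parquet": 100, "c.parquet": 300}

assert vacuum(all_files, referenced, age_hours) == ["a.parquet", "b.parquet"]
# c.parquet is gone -> time travel to versions that needed it no longer works
```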

6
Q

Schema Enforcement

How does Delta enforce schema and why is it important?

A

Rejects writes that don't match the table schema
Prevents data corruption
👉 Example:
Writing a string into an integer column → fails
This ensures data quality at write time, not query time.
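The write-time check can be illustrated with a small Python sketch (the `validate` helper and dict-based schema are invented for illustration):

```python
# Toy schema enforcement: reject a row whose columns or types don't match.
schema = {"id": int, "name": str}   # invented dict-based "table schema"

def validate(row, schema):
    """Raise on schema mismatch, as Delta rejects a bad write at commit time."""
    if set(row) != set(schema):
        raise ValueError(f"column mismatch: {sorted(row)} vs {sorted(schema)}")
    for col, expected in schema.items():
        if not isinstance(row[col], expected):
            raise ValueError(f"{col}: expected {expected.__name__}, "
                             f"got {type(row[col]).__name__}")

validate({"id": 1, "name": "alice"}, schema)         # matches -> accepted
try:
    validate({"id": "oops", "name": "bob"}, schema)  # string into int column
except ValueError as err:
    print("write rejected:", err)
```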

7
Q

Schema Evolution

When should you enable schema evolution in Delta Lake?

A

Use when:
New columns are expected (e.g., ingestion pipelines)
Avoid when:
Schema must remain strict (e.g., curated datasets)

👉 Overuse can lead to messy, inconsistent schemas.

8
Q

MERGE INTO (Upserts)

How does MERGE INTO work and why is it important?

A

It allows:
INSERT + UPDATE + DELETE in one operation
Used for:
CDC pipelines
Deduplication
Slowly changing dimensions

👉 Critical for building idempotent pipelines.
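A toy sketch of the upsert semantics on keyed rows (not Delta's implementation): update on match, insert when unmatched, delete when the source flags the row.

```python
# Toy MERGE semantics (illustrative only, not Delta's implementation).
def merge(target, source):
    """target/source: dict of key -> row; None in source marks a delete."""
    merged = dict(target)
    for key, row in source.items():
        if row is None:
            merged.pop(key, None)      # WHEN MATCHED ... THEN DELETE
        else:
            merged[key] = row          # UPDATE if matched, INSERT if not
    return merged

target = {1: "alice", 2: "bob"}
source = {2: "bobby", 3: "carol", 1: None}   # update 2, insert 3, delete 1
once = merge(target, source)
twice = merge(once, source)

assert once == {2: "bobby", 3: "carol"}
assert once == twice   # re-applying the same batch changes nothing: idempotent
```

In Delta SQL this is `MERGE INTO target USING source ON target.key = source.key WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT ...`; the idempotence shown above is why MERGE is the backbone of safe CDC replays.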

9
Q

Small File Problem

Why does Delta Lake suffer from the small file problem?

A

Because:
Distributed writes create many small files
Each file adds overhead (metadata + scheduling)
Impact:
Slower queries
Inefficient scans

👉 Common in streaming or frequent writes.

10
Q

OPTIMIZE (File Compaction)

What does OPTIMIZE do in Delta Lake?

A

Combines small files into larger ones (~128MB ideal)
Improves read performance
👉 Often paired with ZORDER for better query speed.
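The compaction idea can be sketched as greedy bin-packing of file sizes toward the target (illustrative only; the real OPTIMIZE rewrites Parquet files and can also Z-order them):

```python
# Toy compaction: greedily pack small files into ~128 MB output files.
TARGET_MB = 128

def compact(file_sizes_mb):
    """Group small files so each output file is close to the target size."""
    bins, current, total = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if total + size > TARGET_MB and current:
            bins.append(current)           # close this output file
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        bins.append(current)
    return bins

small_files = [10, 20, 30, 40, 50, 60, 70]   # MB
bins = compact(small_files)
print(len(small_files), "input files ->", len(bins), "output files")
```

Fewer, larger files means less per-file metadata and scheduling overhead at read time, which is exactly the small-file problem from the previous card.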

11
Q

Z-ORDER

When should you use partitioning vs Z-ORDER?

A

Partitioning → low-cardinality columns (e.g., date)
Z-ORDER → high-cardinality columns (e.g., user_id)

👉 Over-partitioning = too many small files (bad).

12
Q

Data Skipping

How does Delta Lake avoid scanning unnecessary data?

A

Stores min/max statistics per file
Skips files that don't match query filters

👉 This reduces I/O significantly.
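A toy version of the pruning logic (in real Delta the per-file min/max stats live in the transaction log):

```python
# Toy data skipping: per-file min/max stats let a query prune files up front.
files = {
    "part-0.parquet": {"min": 1,   "max": 100},
    "part-1.parquet": {"min": 101, "max": 200},
    "part-2.parquet": {"min": 201, "max": 300},
}

def files_to_scan(files, value):
    """Keep only files whose [min, max] range could contain the filter value."""
    return [f for f, s in files.items() if s["min"] <= value <= s["max"]]

# WHERE id = 150 -> only one of three files is actually read
assert files_to_scan(files, 150) == ["part-1.parquet"]
```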

13
Q

Updates and Deletes in Delta

How does Delta Lake handle UPDATE and DELETE operations internally?

A

Does NOT modify files in place
Creates new files with updated data
Marks old files as removed in transaction log

👉 This is called copy-on-write.
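A toy sketch of a copy-on-write UPDATE (file names and row format invented for illustration): the old file is never mutated; a rewritten copy is added and the old one is tombstoned in the log.

```python
# Toy copy-on-write UPDATE: rewrite the affected file, never mutate in place.
def update(snapshot, filename, old_row, new_row, new_filename):
    """Return (new_snapshot, log_actions): old file removed, rewritten copy added."""
    rows = [new_row if r == old_row else r for r in snapshot[filename]]
    new_snapshot = {k: v for k, v in snapshot.items() if k != filename}
    new_snapshot[new_filename] = rows
    actions = [{"remove": filename}, {"add": new_filename}]
    return new_snapshot, actions

snapshot = {"part-0.parquet": ["alice", "bob"]}
snapshot2, actions = update(snapshot, "part-0.parquet",
                            "bob", "bobby", "part-1.parquet")

assert snapshot["part-0.parquet"] == ["alice", "bob"]    # old file untouched
assert snapshot2["part-1.parquet"] == ["alice", "bobby"] # rewritten copy
assert actions == [{"remove": "part-0.parquet"}, {"add": "part-1.parquet"}]
```

Because the old file still physically exists until VACUUM runs, readers of older versions (time travel) keep working.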

14
Q

Copy-on-Write vs In-Place Updates

Why does Delta use copy-on-write instead of in-place updates?

A

Benefits:
Ensures immutability → easier consistency
Supports time travel
Simplifies distributed writes
Tradeoff:
More storage usage temporarily

15
Q

Checkpointing in Delta

What is checkpointing in Delta Lake?

A

Periodically compacts transaction logs into Parquet files
Prevents long log replay times

👉 Improves read performance and scalability.
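A toy sketch of why checkpoints help (by default Delta writes a checkpoint every 10 commits): a reader replays one precomputed snapshot plus a few recent commits instead of every commit since version 0.

```python
# Toy checkpointing: collapse the first N commits into a single snapshot.
def replay(commits, start=None):
    """Replay add/remove actions on top of an optional starting snapshot."""
    files = set(start or ())
    for commit in commits:
        for action in commit:
            if "add" in action:
                files.add(action["add"])
            if "remove" in action:
                files.discard(action["remove"])
    return files

log = [[{"add": f"p{i}.parquet"}] for i in range(12)]
log[10] = [{"remove": "p0.parquet"}, {"add": "p10.parquet"}]

checkpoint = replay(log[:10])          # versions 0-9 collapsed once, up front
fast = replay(log[10:], checkpoint)    # replay only the 2 newest commits
slow = replay(log)                     # full replay from version 0

assert fast == slow                    # same snapshot, far less log to read
```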

16
Q

Concurrent Writes

What happens if two jobs write to the same Delta table simultaneously?

A

Both attempt commit
One succeeds
The other fails with a conflict → must retry
👉 Ensures consistency without locking.

17
Q

Delta vs Parquet

Why is Delta Lake better than raw Parquet files?

A

Parquet:
Just a file format
Delta:
Adds transaction layer + metadata
👉 Delta = Parquet + reliability + performance features.

18
Q

Append vs Overwrite

What is the risk of using overwrite mode without Delta Lake?

A

Partial writes can corrupt data
No rollback mechanism
With Delta:
Overwrite is transactional → safe

19
Q

Handling Late Data (Delta + Streaming)

How does Delta Lake help handle late-arriving data?

A

Supports upserts via MERGE
Works with watermarking in streaming
👉 Ensures correctness even with delayed events.

20
Q

File Size Optimization

Why is ~128MB considered an optimal file size in Delta Lake?

A

Balances parallelism and overhead
Too small → too many tasks
Too large → less parallelism
👉 Sweet spot for Spark processing.

21
Q

Delta Table Versioning

How are versions tracked in Delta Lake?

A

Each commit = new version number
Stored in _delta_log
👉 Enables time travel and auditing.

22
Q

Change Data Feed (CDF)

What is Delta Change Data Feed and when is it used?

A

Tracks row-level changes (insert/update/delete)
Used for:
Incremental processing
Downstream system sync
👉 Useful for CDC pipelines.
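A toy consumer of the change feed (the `_change_type` values mirror CDF's insert / update_postimage / delete; the row format here is invented):

```python
# Toy change-feed consumer: apply row-level change records incrementally,
# as a downstream system syncing from Delta CDF would (illustrative only).
def apply_changes(state, changes):
    """Apply insert/update/delete change records to a key -> value state."""
    for change in changes:
        kind, key = change["_change_type"], change["key"]
        if kind == "delete":
            state.pop(key, None)
        else:                          # "insert" or "update_postimage"
            state[key] = change["value"]
    return state

changes = [
    {"_change_type": "insert", "key": 1, "value": "alice"},
    {"_change_type": "insert", "key": 2, "value": "bob"},
    {"_change_type": "update_postimage", "key": 2, "value": "bobby"},
    {"_change_type": "delete", "key": 1},
]
downstream = apply_changes({}, changes)
assert downstream == {2: "bobby"}
```

In Delta itself, CDF is enabled per table with the `delta.enableChangeDataFeed = true` table property, and changes are read with `table_changes(...)` in SQL or the `readChangeFeed` read option.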

23
Q

Performance Bottlenecks

What are common causes of poor performance in Delta Lake tables?

A

Too many small files
Poor partitioning strategy
No Z-ORDER
Large shuffles

👉 Fix with OPTIMIZE, repartitioning, and query tuning.

24
Q

When NOT to Use Delta Lake

When should you avoid Delta Lake?

A

Simple, read-only datasets
One-time batch processing
No need for updates or transactions
👉 Delta adds overhead; use it when you need reliability.