ACID Transactions
How does Delta Lake provide ACID transactions on a data lake?
Delta uses a transaction log (_delta_log) to track all changes:
Each write creates a new log entry (JSON + checkpoint files)
Transactions are atomic: either fully committed or not applied at all
Readers always see a consistent snapshot
💡 It uses optimistic concurrency control instead of locking.
Optimistic Concurrency Control
What is optimistic concurrency control in Delta Lake?
Instead of locking:
Multiple writers can attempt writes simultaneously
At commit time, Delta checks for conflicts
If a conflict is detected → the transaction fails and retries
💡 This improves scalability compared to traditional locking systems.
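The check-at-commit loop can be sketched in plain Python. This is a toy model, not Delta's actual implementation; `ToyTable` and its version counter are illustrative stand-ins for the table state tracked in the transaction log:

```python
# Toy optimistic concurrency control: a writer remembers the table version it
# read, does its work, and commits only if nobody else committed in between.
class ToyTable:
    def __init__(self):
        self.version = 0
        self.data = {}

    def try_commit(self, read_version, changes):
        # Conflict check at commit time: did another writer commit since we read?
        if read_version != self.version:
            return False          # conflict -> caller must retry
        self.data.update(changes)
        self.version += 1
        return True

def write_with_retry(table, changes, max_retries=3):
    for _ in range(max_retries):
        read_version = table.version        # optimistic read: no lock taken
        if table.try_commit(read_version, changes):
            return table.version
    raise RuntimeError("gave up after repeated conflicts")

table = ToyTable()
v = write_with_retry(table, {"a": 1})       # commits cleanly as version 1
```

A stale `read_version` (because another writer committed first) is exactly the conflict case from the bullets: the commit is refused and the caller retries against the new version.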
Transaction Log (_delta_log)
What is stored in the _delta_log and why is it important?
It contains:
File-level changes (add/remove files)
Schema metadata
Transaction history
💡 It enables:
ACID transactions
Time travel
Efficient reads (no need to scan full dataset)
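The file-level add/remove bookkeeping can be shown with a miniature log in plain Python (a toy sketch; real `_delta_log` entries are numbered JSON files plus Parquet checkpoints):

```python
import json

# Miniature _delta_log: each commit is one JSON entry listing "add" and
# "remove" file actions. The live snapshot is whatever replaying the log
# leaves behind -- no directory listing or full-dataset scan needed.
log = []  # stands in for 00000000.json, 00000001.json, ...

def commit(adds, removes=()):
    log.append(json.dumps({"add": list(adds), "remove": list(removes)}))

def snapshot():
    live = set()
    for entry in log:
        action = json.loads(entry)
        live |= set(action["add"])
        live -= set(action["remove"])
    return live

commit(["part-0001.parquet"])
commit(["part-0002.parquet"], removes=["part-0001.parquet"])  # a rewrite
print(snapshot())  # only part-0002.parquet is live
```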
Time Travel
What is Delta Lake time travel and when is it useful?
Allows querying previous versions of a table:
By version number or timestamp
Use cases:
Debugging data issues
Reproducibility (ML experiments)
Recovering from bad writes
💡 Example: restoring a table after an accidental overwrite.
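Time travel falls out of the log design: version N is simply the state after replaying the first N commits. A toy sketch (illustrative file names, not real Delta APIs):

```python
# Each commit bumps the version; older versions stay queryable as long as
# their data files have not been vacuumed away.
commits = [
    {"add": ["a.parquet"], "remove": []},             # version 1
    {"add": ["b.parquet"], "remove": []},             # version 2
    {"add": ["c.parquet"], "remove": ["a.parquet"]},  # version 3: an overwrite
]

def snapshot_as_of(version):
    live = set()
    for action in commits[:version]:   # replay only the first N commits
        live |= set(action["add"])
        live -= set(action["remove"])
    return live

print(sorted(snapshot_as_of(2)))  # ['a.parquet', 'b.parquet'] -- before the overwrite
print(sorted(snapshot_as_of(3)))  # ['b.parquet', 'c.parquet'] -- current state
```

Restoring after a bad write is then just "take the snapshot as of the last good version and commit it back".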
VACUUM
What does the VACUUM command do in Delta Lake?
Deletes old data files no longer referenced
Frees storage space
Tradeoff:
Removes ability to time travel beyond retention period
💡 Default retention = 7 days (important interview detail)
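The two conditions a file must meet before VACUUM touches it (unreferenced *and* past retention) can be sketched in plain Python (a toy model, not the real command):

```python
import datetime as dt

# Toy VACUUM: delete files that are (a) no longer referenced by the current
# snapshot and (b) older than the retention window (Delta's default: 7 days).
RETENTION = dt.timedelta(days=7)
now = dt.datetime(2024, 1, 15)

files = {  # path -> (still referenced by the snapshot?, last-modified time)
    "old_removed.parquet": (False, now - dt.timedelta(days=10)),
    "new_removed.parquet": (False, now - dt.timedelta(days=1)),   # kept: too recent
    "live.parquet":        (True,  now - dt.timedelta(days=30)),  # kept: referenced
}

def vacuum(files, now, retention=RETENTION):
    return {path for path, (referenced, mtime) in files.items()
            if not referenced and now - mtime > retention}

print(vacuum(files, now))  # only the old, unreferenced file is deletable
```

The retention check is what preserves recent time travel: a removed-but-recent file may still back an older version someone queries.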
Schema Enforcement
How does Delta enforce schema and why is it important?
Rejects writes that don't match the table schema
Prevents data corruption
💡 Example:
Writing a string into an integer column → fails
This ensures data quality at write time, not query time.
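A write-time schema check can be modeled in a few lines of plain Python (a toy validator; Delta's real enforcement works on Spark DataFrame schemas):

```python
# Toy schema enforcement: the table schema maps column -> type, and a write
# that does not conform is rejected before any data lands in the table.
schema = {"id": int, "name": str}

def validate(row, schema):
    for col, value in row.items():
        if col not in schema:
            raise ValueError(f"unknown column: {col}")
        if not isinstance(value, schema[col]):
            raise TypeError(f"{col}: expected {schema[col].__name__}, "
                            f"got {type(value).__name__}")

validate({"id": 1, "name": "ok"}, schema)          # conforming write: passes
try:
    validate({"id": "oops", "name": "x"}, schema)  # string into an int column
except TypeError as err:
    print("write rejected:", err)
```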
Schema Evolution
When should you enable schema evolution in Delta Lake?
Use when:
New columns are expected (e.g., ingestion pipelines)
Avoid when:
Schema must remain strict (e.g., curated datasets)
💡 Overuse can lead to messy, inconsistent schemas.
MERGE INTO (Upserts)
How does MERGE INTO work and why is it important?
It allows:
INSERT + UPDATE + DELETE in one operation
Used for:
CDC pipelines
Deduplication
Slowly changing dimensions
💡 Critical for building idempotent pipelines.
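The upsert semantics (match on a key → update, no match → insert) can be sketched with a toy in-memory target table; this is a simplification of `MERGE INTO` that omits the delete clause:

```python
# Toy upsert: match source rows to target rows on a key, update matches,
# insert the rest. Running it twice with the same source gives the same
# result -- which is what makes MERGE-based pipelines idempotent.
target = {1: {"id": 1, "name": "alice"},
          2: {"id": 2, "name": "bob"}}
source = [{"id": 2, "name": "bobby"},   # matched     -> UPDATE
          {"id": 3, "name": "carol"}]   # not matched -> INSERT

def merge(target, source, key="id"):
    for row in source:
        target[row[key]] = row          # upsert keyed on the match column
    return target

merge(target, source)
print(sorted(target))  # [1, 2, 3]
```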
Small File Problem
Why does Delta Lake suffer from the small file problem?
Because:
Distributed writes create many small files
Each file adds overhead (metadata + scheduling)
Impact:
Slower queries
Inefficient scans
💡 Common in streaming or frequent writes.
OPTIMIZE (File Compaction)
What does OPTIMIZE do in Delta Lake?
Combines small files into larger ones (~128MB ideal)
Improves read performance
💡 Often paired with ZORDER for better query speed.
Z-ORDER
When should you use partitioning vs Z-ORDER?
Partitioning → low-cardinality columns (e.g., date)
Z-ORDER → high-cardinality columns (e.g., user_id)
💡 Over-partitioning = too many small files (bad).
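The intuition behind Z-ordering is the classic bit-interleaving (Morton code) trick: build one sort key from several columns so rows that are close in *any* of those columns land near each other in file order. A toy sketch (not Delta's tuned implementation):

```python
# Interleave the bits of two column values into one Z-order sort key:
# x's bits go to even positions, y's bits to odd positions.
def z_value(x, y, bits=8):
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # even bit positions from x
        z |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions from y
    return z

rows = [(3, 7), (3, 6), (12, 1), (2, 7)]
rows.sort(key=lambda r: z_value(*r))
# Nearby (x, y) pairs now sit next to each other in the sorted order, so
# per-file min/max ranges stay tight for filters on either column.
```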
Data Skipping
How does Delta Lake avoid scanning unnecessary data?
Stores min/max statistics per file
Skips files that don't match query filters
💡 This reduces I/O significantly.
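File pruning with min/max statistics can be shown in a few lines of plain Python (toy stats; Delta keeps these per file in the transaction log):

```python
# Toy data skipping: keep per-file (min, max) for a column and prune every
# file whose range cannot contain the filter value -- before reading any data.
file_stats = {
    "part-1.parquet": (0, 99),     # (min, max) of some column, e.g. user_id
    "part-2.parquet": (100, 199),
    "part-3.parquet": (200, 299),
}

def files_to_scan(value, stats):
    return [path for path, (lo, hi) in stats.items() if lo <= value <= hi]

print(files_to_scan(150, file_stats))  # ['part-2.parquet'] -- 2 of 3 files skipped
```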
Updates and Deletes in Delta
How does Delta Lake handle UPDATE and DELETE operations internally?
Does NOT modify files in place
Creates new files with updated data
Marks old files as removed in transaction log
💡 This is called copy-on-write.
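The three bullets above can be sketched as a toy copy-on-write update (illustrative file names, not real Delta code):

```python
# Copy-on-write in miniature: an UPDATE writes a brand-new file with the
# changed rows and the commit swaps it in for the old one. The old file's
# bytes are untouched -- which is exactly what keeps time travel possible.
live_files = {"part-1.parquet": [("id", 1), ("id", 2)]}

def update(live, old_path, new_path, new_rows):
    committed = dict(live)
    committed[new_path] = new_rows   # "add" the rewritten file
    del committed[old_path]          # "remove" the old one from the snapshot
    return committed                 # (the old file still exists on disk)

live_files = update(live_files, "part-1.parquet", "part-2.parquet",
                    [("id", 1), ("id", 99)])
print(sorted(live_files))  # ['part-2.parquet']
```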
Copy-on-Write vs In-Place Updates
Why does Delta use copy-on-write instead of in-place updates?
Benefits:
Ensures immutability → easier consistency
Supports time travel
Simplifies distributed writes
Tradeoff:
More storage usage temporarily
Checkpointing in Delta
What is checkpointing in Delta Lake?
Periodically compacts transaction logs into Parquet files
Prevents long log replay times
💡 Improves read performance and scalability.
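The speedup comes from replacing a full log replay with "checkpoint + short tail". A toy sketch (real checkpoints are Parquet files written every N commits; this just folds entries in memory):

```python
# Toy checkpointing: materialize the state after the first `upto` commits so
# a reader replays only the tail instead of every JSON entry since version 0.
log = [
    {"add": ["a"], "remove": []},
    {"add": ["b"], "remove": ["a"]},
    {"add": ["c"], "remove": []},    # <- tail commits after the checkpoint
]

def replay(entries, start=None):
    live = set(start or ())
    for action in entries:
        live |= set(action["add"])
        live -= set(action["remove"])
    return live

def snapshot_with_checkpoint(log, checkpoint_at):
    checkpoint = replay(log[:checkpoint_at])       # one bulk, precomputed read
    return replay(log[checkpoint_at:], checkpoint) # then only the short tail

print(sorted(snapshot_with_checkpoint(log, 2)))  # ['b', 'c']
```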
Concurrent Writes
What happens if two jobs write to the same Delta table simultaneously?
Both attempt commit
One succeeds
Other fails with a conflict → must retry
💡 Ensures consistency without locking.
Delta vs Parquet
Why is Delta Lake better than raw Parquet files?
Parquet:
Just a file format
Delta:
Adds transaction layer + metadata
💡 Delta = Parquet + reliability + performance features.
Append vs Overwrite
What is the risk of using overwrite mode without Delta Lake?
Partial writes can corrupt data
No rollback mechanism
With Delta:
Overwrite is transactional → safe
Handling Late Data (Delta + Streaming)
How does Delta Lake help handle late-arriving data?
Supports upserts via MERGE
Works with watermarking in streaming
💡 Ensures correctness even with delayed events.
File Size Optimization
Why is ~128MB considered an optimal file size in Delta Lake?
Balances parallelism and overhead
Too small → too many tasks
Too large → less parallelism
💡 Sweet spot for Spark processing.
Delta Table Versioning
How are versions tracked in Delta Lake?
Each commit = new version number
Stored in _delta_log
💡 Enables time travel and auditing.
Change Data Feed (CDF)
What is Delta Change Data Feed and when is it used?
Tracks row-level changes (insert/update/delete)
Used for:
Incremental processing
Downstream system sync
💡 Useful for CDC pipelines.
Performance Bottlenecks
What are common causes of poor performance in Delta Lake tables?
Too many small files
Poor partitioning strategy
No Z-ORDER
Large shuffles
💡 Fix with OPTIMIZE, repartitioning, and query tuning.
When NOT to Use Delta Lake
Simple, read-only datasets
One-time batch processing
No need for updates or transactions
💡 Delta adds overhead; use it when you need reliability.