What is the MERGE operation in Delta Lake used for?
To perform upserts and deletes by matching records from a source dataset to a target Delta table based on a join condition and applying INSERT, UPDATE, or DELETE actions.
What is a typical use case for MERGE INTO with a Delta table?
Applying change data capture (CDC) feeds or incremental updates from source systems into a target table without full reloads.
How does a basic MERGE INTO statement look conceptually?
MERGE INTO target USING source ON join_condition WHEN MATCHED THEN UPDATE/DELETE WHEN NOT MATCHED THEN INSERT.
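The conceptual skeleton above can be filled in as a concrete upsert; a minimal sketch, where `customers`, `customer_updates`, and the columns `customer_id`, `email`, and `updated_at` are hypothetical names chosen for illustration:

```sql
-- Hypothetical upsert: `customers` is the target Delta table,
-- `customer_updates` the incoming batch, `customer_id` the business key.
MERGE INTO customers AS t
USING customer_updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET t.email = s.email, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, updated_at)
  VALUES (s.customer_id, s.email, s.updated_at);
```

Rows in `customer_updates` whose key already exists overwrite the matching target row; all other rows are inserted, and the whole operation commits as one transaction.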
Why is MERGE preferred over manual UPDATE + INSERT logic for upserts?
It encapsulates matching logic and multiple actions in a single atomic transaction, simplifying code and ensuring consistency.
What is upsert in the context of Delta Lake?
An operation that updates existing rows if they match a key and inserts new rows otherwise, commonly implemented via MERGE.
How can you implement an ‘insert-only’ incremental pattern with Delta?
By appending new records to the Delta table, using either simple append mode or MERGE with only NOT MATCHED THEN INSERT logic.
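An insert-only merge can be sketched as follows (table and key names are hypothetical); Delta SQL also supports `INSERT *` to copy all source columns when the schemas match:

```sql
-- Insert-only merge: new event_ids are appended, existing rows are
-- left untouched, which also deduplicates replayed source records.
MERGE INTO events AS t
USING new_events AS s
ON t.event_id = s.event_id
WHEN NOT MATCHED THEN INSERT *;
```

Compared with a plain append, this variant skips records whose key is already present, making reruns idempotent.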
What is CDC (Change Data Capture) in data engineering?
A pattern for capturing and applying incremental changes from a source system, such as inserts, updates, and deletes.
How can Delta Lake support CDC-style ingestion?
By using MERGE INTO against a Delta table with a CDC feed that encodes operation types and keys, or by using structured streaming to apply changes incrementally.
What is an example of a CDC merge pattern?
Match on business key, update non-key fields for ‘update’ events, insert new rows for ‘insert’ events, and delete rows or mark flags for ‘delete’ events.
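That pattern might look like the following sketch, assuming a hypothetical `changes` feed with an `op` column encoding the event type:

```sql
-- `changes` carries op = 'insert' | 'update' | 'delete' from the CDC feed.
MERGE INTO orders AS t
USING changes AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'delete' THEN
  DELETE
WHEN MATCHED AND s.op = 'update' THEN
  UPDATE SET t.status = s.status, t.amount = s.amount
WHEN NOT MATCHED AND s.op <> 'delete' THEN
  INSERT (order_id, status, amount)
  VALUES (s.order_id, s.status, s.amount);
```

The extra `AND` conditions on each clause route every change event to the matching action in a single transaction.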
Why is idempotency important for MERGE-based pipelines?
So that rerunning the merge with the same CDC data does not corrupt or duplicate target state, allowing safe retries and backfills.
What is the difference between UPDATE and MERGE in Delta Lake?
UPDATE modifies rows based on a filter, while MERGE joins source and target and can conditionally update, insert, or delete rows in a single operation.
When might you use DELETE directly on a Delta table?
When removing rows that meet certain conditions, such as GDPR requests, test data cleanup, or correcting known bad data ranges.
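For example (table and column names are illustrative), a targeted delete is a single statement:

```sql
-- Remove one subject's rows (e.g., a GDPR erasure request).
DELETE FROM customers WHERE customer_id = 'c-123';

-- Or drop a known-bad ingestion window.
DELETE FROM events
WHERE ingest_date BETWEEN '2024-01-01' AND '2024-01-02';
```

Because Delta files are immutable, the affected files are rewritten without the deleted rows and the change is recorded as a new transaction.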
What is OPTIMIZE in Delta Lake?
A command that rewrites small files into larger, more efficient files and can optionally reorder data (e.g., via ZORDER) to improve performance.
Why is OPTIMIZE important for long-lived Delta tables?
Ingestion and updates can create many small files and suboptimal layouts; OPTIMIZE consolidates and organizes files for better scan and data-skipping efficiency.
What trade-off does OPTIMIZE introduce?
It consumes compute and time to rewrite files, so it should be run with appropriate frequency based on table activity and query patterns.
What is ZORDER BY used for in OPTIMIZE?
To co-locate rows with similar values for specified columns, improving data skipping when queries filter on those columns.
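Combining compaction with clustering is a common maintenance step; a sketch, assuming a table `events` partitioned by `date` and frequently filtered on `user_id` and `country`:

```sql
-- Compact small files in recent partitions and cluster rows
-- on the columns most queries filter by.
OPTIMIZE events
WHERE date >= '2024-01-01'
ZORDER BY (user_id, country);
```

The optional `WHERE` clause restricts the rewrite to specific partitions, which keeps the compute cost of OPTIMIZE proportional to recent activity.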
When is it beneficial to ZORDER a Delta table?
When you consistently filter or join on certain columns (e.g., user_id, country, date) and want better clustering for those access paths.
What is partitioning in Delta tables?
Physically organizing table data into directories based on one or more partition columns to enable partition pruning and focused scans.
Why should you be careful when choosing partition columns for Delta tables?
Too many partitions or partitions with very high cardinality can create many small files; partitions should match common filters and produce balanced file sizes.
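A typical choice is a single low-cardinality column that appears in most filters, such as an ingestion date; a sketch with hypothetical names:

```sql
-- Partition on a date column commonly used in WHERE clauses;
-- high-cardinality keys like event_id stay unpartitioned.
CREATE TABLE events (
  event_id   STRING,
  user_id    STRING,
  payload    STRING,
  event_date DATE
)
USING DELTA
PARTITIONED BY (event_date);
```

Queries filtering on `event_date` then prune whole directories, while ZORDER (or similar clustering) can handle secondary columns like `user_id`.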
What is the difference between partitioning and ZORDER in Delta?
Partitioning creates directory-level segmentation on a column; ZORDER reorders data within the files of each partition to improve skipping on additional columns.
How does Delta Lake support streaming reads from a table?
A Delta table can be used as a structured streaming source, where new data committed to the table is incrementally processed as it appears.
How does Delta Lake support streaming writes to a table?
A structured streaming query can write to a Delta table as a sink, with each micro-batch or continuous update committed as a new transaction.
What are common patterns for using Delta with streaming?
Streaming ingestion into bronze Delta tables, streaming transformations into silver tables, and updating gold tables via streaming or batch merges.
Why is Delta well-suited for unified batch and streaming pipelines?
The transaction log and ACID semantics allow both batch and streaming jobs to read and write the same tables with consistent behavior.