What is the MERGE operation in Delta Lake used for?
To perform upserts and deletes by matching records from a source dataset to a target Delta table based on a join condition and applying INSERT, UPDATE, or DELETE actions.
What is a typical use case for MERGE INTO with a Delta table?
Applying change data capture (CDC) feeds or incremental updates from source systems into a target table without full reloads.
How does a basic MERGE INTO statement look conceptually?
MERGE INTO target USING source ON join_condition WHEN MATCHED THEN UPDATE/DELETE WHEN NOT MATCHED THEN INSERT.
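The conceptual skeleton above can be filled in as a concrete upsert; a minimal sketch, where `customers`, `customer_updates`, and the columns `customer_id`, `email`, and `updated_at` are hypothetical names chosen for illustration:

```sql
-- Hypothetical upsert: `customers` is the target Delta table,
-- `customer_updates` the incoming batch, `customer_id` the business key.
MERGE INTO customers AS t
USING customer_updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET t.email = s.email, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, updated_at)
  VALUES (s.customer_id, s.email, s.updated_at);
```

Rows in `customer_updates` whose key already exists overwrite the matching target row; all other rows are inserted, and the whole operation commits as one transaction.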
Why is MERGE preferred over manual UPDATE + INSERT logic for upserts?
It encapsulates matching logic and multiple actions in a single atomic transaction, simplifying code and ensuring consistency.
What is upsert in the context of Delta Lake?
An operation that updates existing rows if they match a key and inserts new rows otherwise, commonly implemented via MERGE.
How can you implement an ‘insert-only’ incremental pattern with Delta?
By appending new records to the Delta table, using either simple append mode or MERGE with only NOT MATCHED THEN INSERT logic.
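An insert-only merge can be sketched as follows (table and key names are hypothetical); Delta SQL also supports `INSERT *` to copy all source columns when the schemas match:

```sql
-- Insert-only merge: new event_ids are appended, existing rows are
-- left untouched, which also deduplicates replayed source records.
MERGE INTO events AS t
USING new_events AS s
ON t.event_id = s.event_id
WHEN NOT MATCHED THEN INSERT *;
```

Compared with a plain append, this variant skips records whose key is already present, making reruns idempotent.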
What is CDC (Change Data Capture) in data engineering?
A pattern for capturing and applying incremental changes from a source system, such as inserts, updates, and deletes.
How can Delta Lake support CDC-style ingestion?
By using MERGE INTO against a Delta table with a CDC feed that encodes operation types and keys, or by using structured streaming to apply changes incrementally.
What is an example of a CDC merge pattern?
Match on business key, update non-key fields for ‘update’ events, insert new rows for ‘insert’ events, and delete rows or mark flags for ‘delete’ events.
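That pattern might look like the following sketch, assuming a hypothetical `changes` feed with an `op` column encoding the event type:

```sql
-- `changes` carries op = 'insert' | 'update' | 'delete' from the CDC feed.
MERGE INTO orders AS t
USING changes AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'delete' THEN
  DELETE
WHEN MATCHED AND s.op = 'update' THEN
  UPDATE SET t.status = s.status, t.amount = s.amount
WHEN NOT MATCHED AND s.op <> 'delete' THEN
  INSERT (order_id, status, amount)
  VALUES (s.order_id, s.status, s.amount);
```

The extra `AND` conditions on each clause route every change event to the matching action in a single transaction.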
Why is idempotency important for MERGE-based pipelines?
So that rerunning the merge with the same CDC data does not corrupt or duplicate target state, allowing safe retries and backfills.
What is the difference between UPDATE and MERGE in Delta Lake?
UPDATE modifies rows based on a filter, while MERGE joins source and target and can conditionally update, insert, or delete rows in a single operation.
When might you use DELETE directly on a Delta table?
When removing rows that meet certain conditions, such as GDPR requests, test data cleanup, or correcting known bad data ranges.
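For example (table and column names are illustrative), a targeted delete is a single statement:

```sql
-- Remove one subject's rows (e.g., a GDPR erasure request).
DELETE FROM customers WHERE customer_id = 'c-123';

-- Or drop a known-bad ingestion window.
DELETE FROM events
WHERE ingest_date BETWEEN '2024-01-01' AND '2024-01-02';
```

Because Delta files are immutable, the affected files are rewritten without the deleted rows and the change is recorded as a new transaction.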
What is OPTIMIZE in Delta Lake?
A command that rewrites small files into larger, more efficient files and can optionally reorder data (e.g., via ZORDER) to improve performance.
Why is OPTIMIZE important for long-lived Delta tables?
Ingestion and updates can create many small files and suboptimal layouts; OPTIMIZE consolidates and organizes files for better scan and data-skipping efficiency.
What trade-off does OPTIMIZE introduce?
It consumes compute and time to rewrite files, so it should be run with appropriate frequency based on table activity and query patterns.
What is ZORDER BY used for in OPTIMIZE?
To co-locate rows with similar values for specified columns, improving data skipping when queries filter on those columns.
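Combining compaction with clustering is a common maintenance step; a sketch, assuming a table `events` partitioned by `date` and frequently filtered on `user_id` and `country`:

```sql
-- Compact small files in recent partitions and cluster rows
-- on the columns most queries filter by.
OPTIMIZE events
WHERE date >= '2024-01-01'
ZORDER BY (user_id, country);
```

The optional `WHERE` clause restricts the rewrite to specific partitions, which keeps the compute cost of OPTIMIZE proportional to recent activity.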
When is it beneficial to ZORDER a Delta table?
When you consistently filter or join on certain columns (e.g., user_id, country, date) and want better clustering for those access paths.
What is partitioning in Delta tables?
Physically organizing table data into directories based on one or more partition columns to enable partition pruning and focused scans.
Why should you be careful when choosing partition columns for Delta tables?
Too many partitions or partitions with very high cardinality can create many small files; partitions should match common filters and produce balanced file sizes.
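A typical choice is a single low-cardinality column that appears in most filters, such as an ingestion date; a sketch with hypothetical names:

```sql
-- Partition on a date column commonly used in WHERE clauses;
-- high-cardinality keys like event_id stay unpartitioned.
CREATE TABLE events (
  event_id   STRING,
  user_id    STRING,
  payload    STRING,
  event_date DATE
)
USING DELTA
PARTITIONED BY (event_date);
```

Queries filtering on `event_date` then prune whole directories, while ZORDER (or similar clustering) can handle secondary columns like `user_id`.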
What is the difference between partitioning and ZORDER in Delta?
Partitioning creates directory-level segmentation on a column; ZORDER reorders data within the files of each partition to improve skipping on additional columns.
How does Delta Lake support streaming reads from a table?
A Delta table can be used as a structured streaming source, where new data committed to the table is incrementally processed as it appears.
How does Delta Lake support streaming writes to a table?
A structured streaming query can write to a Delta table as a sink, with each micro-batch or continuous update committed as a new transaction.
What are common patterns for using Delta with streaming?
Streaming ingestion into bronze Delta tables, streaming transformations into silver tables, and updating gold tables via streaming or batch merges.
Why is Delta well-suited for unified batch and streaming pipelines?
The transaction log and ACID semantics allow both batch and streaming jobs to read and write the same tables with consistent behavior.