Delta Lake Flashcards

(39 cards)

1
Q

What is Delta Lake at a high level?

A

An open table storage format that adds ACID transactions, schema enforcement, and other data management features on top of files in cloud object storage.

2
Q

How does Delta Lake relate to Parquet files?

A

Delta Lake stores data in Parquet files but adds a transaction log and metadata layer that tracks versions and operations on those files.

3
Q

Where are Delta Lake tables typically stored on Databricks?

A

On cloud object storage (such as S3, ADLS, or GCS), accessed via DBFS or a catalog, with a _delta_log directory that holds transaction logs.

4
Q

What is the _delta_log directory in a Delta table?

A

A folder that contains JSON and checkpoint files describing all committed transactions, schema changes, and file-level metadata for the table.
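The action names below (`add`, `remove`, `metaData`) are real Delta log action types, but the entries themselves are simplified, hypothetical stand-ins for what the JSON commit files contain. A minimal sketch of how replaying the log yields the table's current file set:

```python
import json

# Simplified stand-ins for two commit files inside _delta_log/
# (real entries carry many more fields, e.g. partition values and stats).
commit_0 = [
    '{"metaData": {"schemaString": "id INT, value STRING"}}',
    '{"add": {"path": "part-000.parquet", "size": 1024}}',
]
commit_1 = [
    '{"add": {"path": "part-001.parquet", "size": 2048}}',
    '{"remove": {"path": "part-000.parquet"}}',
]

def live_files(commits):
    """Replay commits in order; surviving 'add' paths are the table's data files."""
    files = set()
    for commit in commits:
        for line in commit:
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

print(live_files([commit_0, commit_1]))  # {'part-001.parquet'}
```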

5
Q

What does it mean that Delta Lake provides ACID transactions?

A

Operations on Delta tables are atomic, consistent, isolated, and durable, ensuring readers see consistent snapshots and writes are either all applied or not at all.

6
Q

What is optimistic concurrency control in Delta Lake?

A

A mechanism where writers assume they will not conflict, but check the transaction log at commit time and retry or fail if conflicting changes have occurred.
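A toy simulation of that commit protocol, assuming a single shared log where a commit's position is its version number. (Real Delta additionally checks whether concurrent commits actually conflict, e.g. touched the same files, before failing.)

```python
class CommitConflict(Exception):
    """Raised when another writer committed after our snapshot was read."""

class ToyLog:
    def __init__(self):
        self.commits = []  # index in this list == version number

    def latest_version(self):
        return len(self.commits) - 1  # -1 means an empty table

    def try_commit(self, read_version, actions):
        # Optimistic check at commit time: no lock was ever taken.
        if self.latest_version() != read_version:
            raise CommitConflict(f"log moved past version {read_version}")
        self.commits.append(actions)
        return self.latest_version()

log = ToyLog()
v = log.latest_version()              # both writers read version -1
log.try_commit(v, ["add fileA"])      # writer 1 wins, creating version 0

try:
    log.try_commit(v, ["add fileB"])  # writer 2 is now stale
except CommitConflict:
    # Retry against the fresh version; the append succeeds on attempt two.
    log.try_commit(log.latest_version(), ["add fileB"])

print(log.latest_version())  # 1
```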

7
Q

Why is optimistic concurrency control well-suited to Delta on object storage?

A

It avoids centralized locks and coordinates through the transaction log, working with the eventual consistency and immutability of object storage.

8
Q

What is snapshot isolation in Delta Lake?

A

The guarantee that each query sees a consistent snapshot of the table as of a particular version, unaffected by concurrent writes that commit later.

9
Q

How does Delta Lake achieve snapshot isolation for readers?

A

Readers use the transaction log to build a view of which data files are valid for a particular table version and read only those files.
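A sketch of that replay with the version pinned at query start; truncating the same replay at an older version is also exactly what time travel does. The tuples here are simplified, hypothetical log actions:

```python
def files_as_of(commits, version):
    """Replay add/remove actions through commit index `version`."""
    files = set()
    for commit in commits[: version + 1]:
        for op, path in commit:
            if op == "add":
                files.add(path)
            else:  # "remove"
                files.discard(path)
    return files

commits = [
    [("add", "f1.parquet")],                           # version 0
    [("add", "f2.parquet"), ("remove", "f1.parquet")], # version 1
]

# A reader pinned at version 0 keeps seeing f1 even after version 1 commits.
print(files_as_of(commits, 0))  # {'f1.parquet'}
print(files_as_of(commits, 1))  # {'f2.parquet'}
```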

10
Q

What is schema enforcement in Delta Lake?

A

The requirement that data written to a Delta table matches its defined schema, preventing incompatible data from being appended by default.
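A hypothetical enforcement check in miniature. Real Delta validates the incoming DataFrame's schema against the table's declared schema, rather than inspecting Python rows, so this is only a conceptual sketch:

```python
# The table's declared schema: column name -> required Python type.
SCHEMA = {"id": int, "name": str}

def validate_row(row):
    """Reject rows whose columns or types don't match the declared schema."""
    if set(row) != set(SCHEMA):
        raise ValueError(f"columns {sorted(row)} do not match {sorted(SCHEMA)}")
    for col, typ in SCHEMA.items():
        if not isinstance(row[col], typ):
            raise ValueError(f"column {col!r} must be {typ.__name__}")

validate_row({"id": 1, "name": "ok"})           # accepted
try:
    validate_row({"id": "oops", "name": "ok"})  # wrong type
except ValueError as err:
    print("rejected:", err)
```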

11
Q

Why is schema enforcement important in a data lake setting?

A

It prevents silent corruption or drifting of schemas that would otherwise be easy when writing raw Parquet files directly to object storage.

12
Q

What is schema evolution in Delta Lake?

A

The ability to change a table’s schema over time (e.g., adding columns) in a controlled way, tracked via the transaction log.

13
Q

What is a common safe form of schema evolution in Delta tables?

A

Adding new nullable columns or fields while leaving existing columns compatible with prior versions.

14
Q

Why must schema evolution be intentional and not automatic?

A

Uncontrolled changes can break downstream readers, so explicit evolution helps maintain compatibility and governance.

15
Q

What is the difference between a Delta table and a raw Parquet folder?

A

A Delta table has a transaction log that tracks adds/removes and schema changes, enabling ACID operations and time travel; a Parquet folder is just files without table semantics.

16
Q

What does ‘table version’ mean in Delta Lake?

A

An integer that identifies a specific committed state of the table, incremented with each transaction.

17
Q

How does time travel work in Delta tables?

A

By reading the table as of a previous version or timestamp, using the transaction log to reconstruct that historical snapshot.

18
Q

What are typical use cases for Delta time travel?

A

Auditing historical data, debugging changes, recovering from accidental modifications, and reproducing past experiments.

19
Q

What is a ‘path-based’ Delta table?

A

A table defined directly on a storage path, accessed via its filesystem location rather than via a catalog name.

20
Q

What is a ‘managed’ or ‘catalog-registered’ Delta table?

A

A Delta table registered in a metastore or Unity Catalog with a logical name, where the catalog tracks its location, schema, and permissions.

21
Q

Why are catalog-registered Delta tables often preferable in production?

A

They provide consistent naming, governance, permissions, and lineage, decoupling logical access from raw storage paths.

22
Q

What is a Delta Lake ‘transaction’?

A

A set of operations (such as adding or removing files, schema changes) that are committed as a single atomic change to the table’s log.

23
Q

How is a Delta transaction committed in the log?

A

By writing a new JSON entry in the _delta_log and possibly updating checkpoints, describing which files are added or removed and any metadata changes.

24
Q

What is a checkpoint in the Delta transaction log?

A

A Parquet file that periodically summarizes the log up to a version, allowing faster reconstruction of table state than replaying all JSON logs.
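A sketch of the idea, assuming the checkpoint is simply the materialized file set at some version (real checkpoints are Parquet files carrying full add and metadata records): readers load it and replay only the commits that come after it.

```python
def replay(commits, start_state=frozenset()):
    """Apply add/remove actions on top of an existing state."""
    state = set(start_state)
    for commit in commits:
        for op, path in commit:
            if op == "add":
                state.add(path)
            else:  # "remove"
                state.discard(path)
    return state

commits = [
    [("add", "a.parquet")],                            # version 0
    [("add", "b.parquet"), ("remove", "a.parquet")],   # version 1
    [("add", "c.parquet")],                            # version 2
]

checkpoint_at_1 = replay(commits[:2])        # state summarized up to version 1
fast = replay(commits[2:], checkpoint_at_1)  # replay only the tail
assert fast == replay(commits)               # same result, fewer files scanned
print(sorted(fast))  # ['b.parquet', 'c.parquet']
```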

25
Q

Why are checkpoints important for large, long-lived Delta tables?

A

They let readers reconstruct table state quickly, avoiding the cost of scanning many individual log files on every read.

26
Q

What is data skipping in Delta Lake?

A

An optimization where statistics about columns in each file (such as min/max) are stored and used to avoid reading files that cannot satisfy query predicates.

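A minimal sketch with hypothetical per-file min/max statistics like those Delta records in each add action:

```python
# Hypothetical per-file statistics for an 'id' column.
file_stats = [
    {"path": "a.parquet", "min_id": 0,   "max_id": 99},
    {"path": "b.parquet", "min_id": 100, "max_id": 199},
    {"path": "c.parquet", "min_id": 200, "max_id": 299},
]

def files_for_range(stats, lo, hi):
    """Keep only files whose [min, max] interval can overlap lo..hi."""
    return [s["path"] for s in stats if s["max_id"] >= lo and s["min_id"] <= hi]

# A filter like WHERE id BETWEEN 150 AND 160 touches only one of three files:
print(files_for_range(file_stats, 150, 160))  # ['b.parquet']
```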
27
Q

How does data skipping improve performance?

A

By allowing the engine to skip entire files whose statistics show they do not contain values needed for a query, reducing I/O and CPU work.

28
Q

What is Z-ordering (ZORDER) in Delta Lake?

A

A technique to co-locate related values in the same set of files, improving data skipping effectiveness for specific columns or queries.

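One way to build intuition is a Morton (Z-order) key, which interleaves the bits of two values; Delta's actual ZORDER implementation is more involved (it interleaves range IDs rather than raw values), so treat this as a conceptual sketch only:

```python
def morton_key(x, y, bits=8):
    """Interleave the bits of x and y into one Z-order sort key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x occupies even bit slots
        key |= ((y >> i) & 1) << (2 * i + 1)  # y occupies odd bit slots
    return key

points = [(x, y) for x in range(4) for y in range(4)]
points.sort(key=lambda p: morton_key(*p))

# Chunking the sorted points into "files" keeps values that are close in
# BOTH dimensions inside the same chunk, so min/max stats on either column
# stay tight and data skipping works for filters on either one.
print(points[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```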
29
Q

When might you use ZORDER on a Delta table?

A

When you frequently filter or join on certain columns and want to improve clustering and skipping for those columns.

30
Q

What is VACUUM in Delta Lake?

A

A maintenance command that removes old data files that are no longer referenced by the transaction log after a retention period.

31
Q

Why is VACUUM important for Delta tables?

A

To prevent build-up of unused files and control storage costs, while respecting retention settings for time travel and rollback.

32
Q

What is the retention period in Delta VACUUM?

A

The duration for which old file versions are kept to support time travel and rollback; files older than this can be removed by VACUUM.

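A sketch of the retention check, with hypothetical removal timestamps (the real command derives these from the log's remove actions):

```python
from datetime import datetime, timedelta

# Hypothetical tombstones: files no longer referenced, with removal times.
tombstones = [
    ("old-1.parquet", datetime(2023, 12, 20)),  # removed 19 days before `now`
    ("old-2.parquet", datetime(2024, 1, 5)),    # removed 3 days before `now`
]

def vacuum_candidates(tombstones, now, retention):
    """Only files removed longer ago than the retention window may be deleted."""
    return [path for path, removed_at in tombstones if now - removed_at > retention]

now = datetime(2024, 1, 8)
print(vacuum_candidates(tombstones, now, timedelta(days=7)))
# ['old-1.parquet'] -- old-2 must stay so recent versions remain readable
```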
33
Q

Why must the retention period not be set too low?

A

Too low a retention period can eliminate the ability to time-travel or recover from recent mistakes, reducing safety and auditability.

34
Q

What is the default file format used by Delta tables?

A

Parquet for data files, with JSON commit files and Parquet checkpoints for the transaction log.

35
Q

How do Delta tables support both batch and streaming workloads?

A

The transaction log makes it possible to read and write Delta tables in batch jobs and as streaming sources/sinks with consistent semantics.

36
Q

Why is it beneficial to use the same Delta tables for batch and streaming?

A

It unifies historical and real-time data, reducing duplication and ensuring consistent semantics across processing modes.

37
Q

What happens if multiple writers update a Delta table at the same time?

A

Each writer attempts an optimistic commit; if a conflicting commit is detected, one writer must retry or fail to preserve ACID guarantees.

38
Q

Why is it important for a data engineer to understand Delta Lake internals conceptually?

A

Because it informs how to design robust pipelines, reason about performance and concurrency, and debug issues involving versions or schema changes.

39
Q

In one sentence, what is the core mental model for Delta Lake on Databricks?

A

Delta Lake turns folders of Parquet files on object storage into ACID tables with transaction logs, schema control, and time travel, forming the foundation of the lakehouse.