What is Delta Lake at a high level?
An open table format that adds ACID transactions, schema enforcement, and other data management features on top of files in cloud object storage.
How does Delta Lake relate to Parquet files?
Delta Lake stores data in Parquet files but adds a transaction log and metadata layer that tracks versions and operations on those files.
Where are Delta Lake tables typically stored on Databricks?
On cloud object storage (such as S3, ADLS, or GCS), accessed via DBFS or a catalog, with a _delta_log directory that holds transaction logs.
What is the _delta_log directory in a Delta table?
A folder that contains JSON and checkpoint files describing all committed transactions, schema changes, and file-level metadata for the table.
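The shape of those JSON files can be sketched in a few lines. This is a simplified, hypothetical commit file, not the full Delta protocol: real commit files hold one JSON action per line with more fields, but the "add"/"remove" structure below follows the same idea.

```python
import json

# Simplified sketch of one commit file in _delta_log (e.g. 0...01.json).
# One JSON action per line; the "add" and "remove" keys mirror the Delta
# protocol's actions, with most fields omitted for brevity.
commit_lines = """\
{"add": {"path": "part-00000-a1.parquet", "size": 1024, "dataChange": true}}
{"remove": {"path": "part-00000-old.parquet", "dataChange": true}}
"""

actions = [json.loads(line) for line in commit_lines.splitlines()]
added = [a["add"]["path"] for a in actions if "add" in a]
removed = [a["remove"]["path"] for a in actions if "remove" in a]
print(added)    # ['part-00000-a1.parquet']
print(removed)  # ['part-00000-old.parquet']
```

Replaying these add/remove actions across all commit files yields the table's current file-level state.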
What does it mean that Delta Lake provides ACID transactions?
Operations on Delta tables are atomic, consistent, isolated, and durable, ensuring readers see consistent snapshots and writes are either all applied or not at all.
What is optimistic concurrency control in Delta Lake?
A mechanism where writers assume they will not conflict, but check the transaction log at commit time and retry or fail if conflicting changes have occurred.
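The commit-time check can be modeled as a put-if-absent on the next version number. This is a toy model, not the real implementation: the "log" is a dict, and two writers race to create the same version, just as Delta writers race to create the next numbered log file.

```python
# Toy model of optimistic concurrency control: committing is a
# put-if-absent on the next version number in the log.
log = {0: ["initial commit"]}

def try_commit(log, next_version, actions):
    """Commit succeeds only if no other writer created this version first."""
    if next_version in log:
        return False  # conflict detected at commit time; caller must retry
    log[next_version] = actions
    return True

# Writers A and B both read version 0 and race to commit version 1.
assert try_commit(log, 1, ["A's changes"]) is True
assert try_commit(log, 1, ["B's changes"]) is False  # B loses the race
# B re-reads the log, checks for logical conflicts, and retries at version 2.
assert try_commit(log, 2, ["B's changes"]) is True
```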
Why is optimistic concurrency control well-suited to Delta on object storage?
It avoids centralized locks, coordinating writers through the transaction log instead; this suits object storage, where data files are immutable and there is no native file-locking primitive.
What is snapshot isolation in Delta Lake?
The guarantee that each query sees a consistent snapshot of the table as of a particular version, unaffected by concurrent writes that commit later.
How does Delta Lake achieve snapshot isolation for readers?
Readers use the transaction log to build a view of which data files are valid for a particular table version and read only those files.
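That log replay can be sketched in a few lines of plain Python. This is an illustration of the idea, not the Delta protocol itself: each committed version lists add/remove actions, and replaying them in order yields the set of data files a reader should scan.

```python
# Toy replay of a transaction log to compute the live file set.
log = [
    [("add", "f1.parquet"), ("add", "f2.parquet")],    # version 0
    [("remove", "f1.parquet"), ("add", "f3.parquet")], # version 1
]

files = set()
for actions in log:  # versions in commit order
    for action, path in actions:
        files.add(path) if action == "add" else files.discard(path)

print(sorted(files))  # ['f2.parquet', 'f3.parquet']
```

Because commits are immutable once written, a reader that picked up the log at version 1 keeps seeing exactly this file set, even if version 2 commits mid-query.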
What is schema enforcement in Delta Lake?
The requirement that data written to a Delta table matches its defined schema, preventing incompatible data from being appended by default.
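A write-time check in the spirit of schema enforcement can be sketched as follows. The schema representation and `validate_row` helper are hypothetical simplifications; Delta performs this validation on Spark DataFrame schemas, not Python dicts.

```python
# Toy schema enforcement: reject rows with undeclared columns or wrong types.
table_schema = {"id": int, "name": str}

def validate_row(row, schema):
    for col, value in row.items():
        if col not in schema:
            raise ValueError(f"column {col!r} not in table schema")
        if value is not None and not isinstance(value, schema[col]):
            raise TypeError(f"column {col!r} expects {schema[col].__name__}")

validate_row({"id": 1, "name": "ok"}, table_schema)         # accepted
try:
    validate_row({"id": 2, "extra": "oops"}, table_schema)  # rejected
except ValueError as e:
    print(e)
```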
Why is schema enforcement important in a data lake setting?
It prevents silent corruption or drifting of schemas that would otherwise be easy when writing raw Parquet files directly to object storage.
What is schema evolution in Delta Lake?
The ability to change a table’s schema over time (e.g., adding columns) in a controlled way, tracked via the transaction log.
What is a common safe form of schema evolution in Delta tables?
Adding new nullable columns or fields while keeping existing columns unchanged, so readers and files from prior versions remain compatible.
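Why this is safe can be shown with a small sketch (schemas as plain column lists, a hypothetical `read_row` helper): rows written before the change are simply read with None for the new column.

```python
# Sketch of additive schema evolution: a new nullable column is appended,
# and files written under the old schema stay readable.
schema_v1 = ["id", "name"]
schema_v2 = schema_v1 + ["email"]  # new nullable column

old_row = {"id": 1, "name": "a"}  # written under schema_v1
new_row = {"id": 2, "name": "b", "email": "b@example.com"}

def read_row(row, schema):
    # Columns missing from an old row resolve to None.
    return {col: row.get(col) for col in schema}

print(read_row(old_row, schema_v2))  # {'id': 1, 'name': 'a', 'email': None}
```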
Why must schema evolution be intentional and not automatic?
Uncontrolled changes can break downstream readers, so explicit evolution helps maintain compatibility and governance.
What is the difference between a Delta table and a raw Parquet folder?
A Delta table has a transaction log that tracks adds/removes and schema changes, enabling ACID operations and time travel; a Parquet folder is just files without table semantics.
What does ‘table version’ mean in Delta Lake?
An integer that identifies a specific committed state of the table, incremented with each transaction.
How does time travel work in Delta tables?
By reading the table as of a previous version or timestamp, using the transaction log to reconstruct that historical snapshot.
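Resolving a timestamp to a version can be sketched like this (illustrative integers standing in for commit timestamps, not Delta's actual resolution logic): "as of timestamp T" picks the latest version committed at or before T, and the reader then replays the log up to that version.

```python
# Sketch of timestamp-based time travel: map a requested timestamp to the
# latest version committed at or before it.
commit_times = {0: 1000, 1: 2000, 2: 3500}  # version -> commit timestamp

def version_as_of(commit_times, ts):
    candidates = [v for v, t in commit_times.items() if t <= ts]
    if not candidates:
        raise ValueError("no commit at or before this timestamp")
    return max(candidates)

assert version_as_of(commit_times, 2100) == 1  # v1 committed at 2000
assert version_as_of(commit_times, 3500) == 2  # exact match allowed
```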
What are typical use cases for Delta time travel?
Auditing historical data, debugging changes, recovering from accidental modifications, and reproducing past experiments.
What is a ‘path-based’ Delta table?
A table defined directly on a storage path, accessed via its filesystem location rather than via a catalog name.
What is a ‘managed’ or ‘catalog-registered’ Delta table?
A Delta table registered in a metastore or Unity Catalog with a logical name, where the catalog tracks its location, schema, and permissions.
Why are catalog-registered Delta tables often preferable in production?
They provide consistent naming, governance, permissions, and lineage, decoupling logical access from raw storage paths.
What is a Delta Lake ‘transaction’?
A set of operations (such as adding or removing files, schema changes) that are committed as a single atomic change to the table’s log.
How is a Delta transaction committed in the log?
By atomically writing the next numbered JSON file to _delta_log (and periodically a checkpoint), describing which files are added or removed and any metadata changes.
What is a checkpoint in the Delta transaction log?
A Parquet file that periodically summarizes the log up to a version, allowing faster reconstruction of table state than replaying all JSON logs.
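The speed-up from checkpoints can be sketched as "load a summary, then replay only the tail". This toy model uses Python sets; real checkpoints are Parquet files containing the full set of live actions as of their version.

```python
# Toy checkpoint: start from a summarized file set at some version, then
# apply only the commits after it instead of replaying the whole log.
commits = {
    0: [("add", "f1")],
    1: [("add", "f2")],
    2: [("remove", "f1"), ("add", "f3")],
    3: [("add", "f4")],
}
checkpoint_version, checkpoint_files = 2, {"f2", "f3"}  # summary through v2

def snapshot(commits, checkpoint_version, checkpoint_files):
    files = set(checkpoint_files)
    for v in sorted(c for c in commits if c > checkpoint_version):
        for action, path in commits[v]:
            files.add(path) if action == "add" else files.discard(path)
    return files

print(sorted(snapshot(commits, checkpoint_version, checkpoint_files)))
# ['f2', 'f3', 'f4']
```

Only commit 3 is replayed here; without the checkpoint, all four commits would have to be read.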