What is CDF and how is it enabled on a table?
Where are the changes tracked for a table and where are they stored?
What are the columns added to a table with CDF enabled?
delta.enableChangeDataFeed = true
A _change_data subdirectory is added to the table's directory, where these changes are tracked in parquet files.
_change_type
_commit_version
_commit_timestamp
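As a rough illustration (plain Python, not Spark; the row shapes are simplified and the helper name is made up), an UPDATE on a CDF-enabled table is recorded as a preimage/postimage pair carrying the three extra columns:

```python
# Toy model of the rows CDF emits for one UPDATE in one commit
# (simplified; real change rows live in parquet files under _change_data
# and carry the full table schema).
def change_rows_for_update(old_row, new_row, version, ts):
    return [
        {**old_row, "_change_type": "update_preimage",
         "_commit_version": version, "_commit_timestamp": ts},
        {**new_row, "_change_type": "update_postimage",
         "_commit_version": version, "_commit_timestamp": ts},
    ]

rows = change_rows_for_update(
    {"id": 1, "name": "old"}, {"id": 1, "name": "new"},
    version=5, ts="2024-01-01T00:00:00Z",
)
print([r["_change_type"] for r in rows])
# → ['update_preimage', 'update_postimage']
```

Inserts and deletes appear as single rows with `_change_type` of `insert` or `delete`.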
How can CDF be accessed for a table?
SELECT * FROM table_changes("<target_table>", "<target_version>")
What situations should CDF be used and not used?
When are these changes lost?
Use when changes include updates and deletes and only a small % of records change per batch.
Don't use for append-only or full-overwrite workloads.
VACUUM drops CDF data
Static table vs Streaming table
Static tables can contain updates and deletes. Streaming tables must be append-only; once a table has been updated, it can no longer be used as a streaming source.
How can the no deletes rule on streaming tables be managed?
spark.readStream
    .option("ignoreDeletes", True)
    .table("my_table")
What is Delta Caching?
Copies of remote data files are cached on the current cluster's local storage, so subsequent queries over the same data run faster.
What happens on a table commit?
What are transaction log checkpoints? How often are they created?
Where are they placed?
Each commit writes a JSON file that advances the state of the table.
Checkpoints are parquet files that capture the entire state of the table, written every ten commits. They allow Spark to skip ahead to the latest checkpoint without replaying each commit file.
my_table/_delta_log/01.json … 10.json, 10.checkpoint.parquet
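A minimal sketch of how a reader uses the checkpoint (plain Python, with simplified file names; real commit files are zero-padded 20-digit versions):

```python
# Given the file names in _delta_log, a reader loads the newest checkpoint
# and then replays only the JSON commits written after it.
def files_to_read(log_files):
    checkpoints = [f for f in log_files if f.endswith(".checkpoint.parquet")]
    if not checkpoints:
        # No checkpoint yet: replay every commit file in order.
        return sorted(f for f in log_files if f.endswith(".json"))
    latest = max(checkpoints, key=lambda f: int(f.split(".")[0]))
    latest_version = int(latest.split(".")[0])
    newer_commits = sorted(
        f for f in log_files
        if f.endswith(".json") and int(f.split(".")[0]) > latest_version
    )
    return [latest] + newer_commits

log = [f"{v:02d}.json" for v in range(1, 13)] + ["10.checkpoint.parquet"]
print(files_to_read(log))
# → ['10.checkpoint.parquet', '11.json', '12.json']
```

Twelve commits reduce to one checkpoint read plus two commit replays, which is the point of checkpointing.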
What stats are collected for each new data file?
How are these stats used?
1. # of records
For the first 32 columns:
2. Min value
3. Max value
4. Null count
Leveraged at query time for file skipping based on predicate ranges, and to answer some aggregate queries from metadata alone.
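The file-skipping idea can be sketched in plain Python (illustrative only; the stats dictionary stands in for the per-file min/max stored in the transaction log):

```python
# File skipping: with per-file min/max stats for a column, a query like
# "WHERE x = 7" only needs files whose [min, max] range could contain 7.
def files_to_scan(file_stats, value):
    return [name for name, (lo, hi) in file_stats.items() if lo <= value <= hi]

stats = {"part-0": (1, 5), "part-1": (6, 10), "part-2": (11, 20)}
print(files_to_scan(stats, 7))
# → ['part-1']
```

Two of the three files are pruned without being opened, which is why keeping frequently filtered columns inside the first 32 matters.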
Are nested field columns (e.g. arrays/structs) included in the 32-column rule?
Yes, each nested field counts toward the 32 columns.
How should high-cardinality string fields be organized within a table?
Move them outside the first 32 columns so stats collection is not wasted on them.
How does the VACUUM command deal with files in the _delta_log directory?
How does Databricks clean up log files?
How is the data retention interval changed?
How does this affect Time Travel?
They are ignored; VACUUM only removes data files, never the transaction log.
When a new checkpoint is written, Databricks removes log entries older than the log retention interval (default 30 days).
delta.logRetentionDuration
The log retention interval caps how far back Time Travel can reach.
What is Auto Optimize?
How is it implemented?
Instead of the normal 1 GB target file size used by OPTIMIZE, writes target 128 MB files. Once written, smaller files are compacted into 128 MB files where possible.
Table Creation or Alters:
SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true)
Spark Session:
spark.databricks.delta.optimizeWrite.enabled
spark.databricks.delta.autoCompact.enabled
What is a Deletion Vector?
(Optional) Storage optimization feature that captures DELETE, UPDATE, and MERGE ops in small files separate from the parquet data. This defers data file rewrites until OPTIMIZE or another operation rewrites the data file.
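The mechanism can be sketched in plain Python (a toy model; real deletion vectors are compressed bitmaps keyed to row positions in a specific parquet file):

```python
# Deletion-vector sketch: deleted row positions are tracked in a small
# side structure; the data file itself is untouched until a rewrite
# (e.g. OPTIMIZE) materializes the deletes.
def read_with_dv(rows, deletion_vector):
    # Readers filter out deleted positions on the fly.
    return [r for i, r in enumerate(rows) if i not in deletion_vector]

def compact(rows, deletion_vector):
    # Rewrite the file without the deleted rows; the vector is dropped.
    return read_with_dv(rows, deletion_vector), set()

data = ["a", "b", "c", "d"]
dv = {1, 3}                     # rows 1 and 3 deleted logically
print(read_with_dv(data, dv))   # → ['a', 'c']
data, dv = compact(data, dv)
```

Deletes stay cheap at write time because only the small vector is written; the cost of rewriting the large data file is paid once, at compaction.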