Databricks Features Flashcards

(13 cards)

1
Q

What is CDF and how is it enabled on a table?

Where are the changes tracked for a table and where are they stored?

What are the columns added to a table with CDF enabled?

A

ALTER TABLE my_table SET TBLPROPERTIES (delta.enableChangeDataFeed = true)

A _change_data subdirectory is added to the table's directory, where row-level changes are tracked in Parquet files.

_change_type
_commit_version
_commit_timestamp
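The three metadata columns can be illustrated with a plain-Python sketch (no Spark needed) of what CDF emits for a single UPDATE; the column names and `_change_type` values are real, the row data is made up:

```python
# Sketch: the CDF rows produced when one row is updated in commit 5.
# An UPDATE emits two rows: a before image and an after image.
def cdf_rows_for_update(key, old_value, new_value, version, ts):
    base = {"id": key, "_commit_version": version, "_commit_timestamp": ts}
    return [
        {**base, "value": old_value, "_change_type": "update_preimage"},
        {**base, "value": new_value, "_change_type": "update_postimage"},
    ]

rows = cdf_rows_for_update(1, "a", "b", 5, "2024-01-01 00:00:00")
# Inserts emit a single "insert" row; deletes emit a single "delete" row.
```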

2
Q

How can CDF be accessed for a table?

A

SELECT * FROM table_changes('<target_table>', <starting_version>)

3
Q

What situations should CDF be used and not used?
When are these changes lost?

A

Use when changes include updates and deletes and only a small percentage of records change per batch.

Don't use for append-only or full-overwrite workloads.

VACUUM removes the underlying CDF data files.

4
Q

Static table vs Streaming table

A

Static tables can contain updates and deletes. Streaming tables must be append-only; once a table has been updated, it can no longer be used as a streaming source without extra options.

5
Q

How can the no deletes rule on streaming tables be managed?

A

spark.readStream
  .option("ignoreDeletes", True)
  .table("my_table")

6
Q

What is Delta Caching?

A

Copies of remote data files are cached on the local storage of the cluster's worker nodes, so subsequent reads of the same data are served locally and run faster.

7
Q

What happens on a table commit?

What are transaction log checkpoints? How often are they created?

Where are they placed?

A

Each commit writes a JSON file to the _delta_log directory to advance the state of the table.

Checkpoints are Parquet files that save the entire state of the table, written every ten commits. They allow Spark to skip ahead without reading each individual commit file.

my_table/_delta_log/01.json ... 10.json, 10.checkpoint.parquet
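The skip-ahead behavior can be sketched in plain Python (a simulation of the idea, not Delta's actual reader): find the newest checkpoint in the log listing, then replay only the JSON commits written after it.

```python
import re

def files_to_read(log_files):
    """Given _delta_log filenames, return the newest checkpoint (if any)
    plus only the JSON commit files written after it."""
    checkpoints = sorted(
        (int(m.group(1)), f)
        for f in log_files
        if (m := re.match(r"(\d+)\.checkpoint\.parquet$", f))
    )
    commits = sorted(
        (int(m.group(1)), f)
        for f in log_files
        if (m := re.match(r"(\d+)\.json$", f))
    )
    if not checkpoints:
        return [f for _, f in commits]  # no checkpoint yet: read every commit
    cp_version, cp_file = checkpoints[-1]
    return [cp_file] + [f for v, f in commits if v > cp_version]

log = [f"{v:02d}.json" for v in range(1, 13)] + ["10.checkpoint.parquet"]
# files_to_read(log) -> ["10.checkpoint.parquet", "11.json", "12.json"]
```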

8
Q

What stats are collected for each new data file?

How are these stats used?

A

1. Number of records
For the first 32 columns:
2. Min value
3. Max value
4. Null count

Leveraged for data skipping: files whose stats cannot match a query's filters are skipped entirely.
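How the min/max stats enable data skipping can be sketched in plain Python (a simulation of the idea, not the engine itself): for an equality filter, any file whose [min, max] range cannot contain the value is never read.

```python
def files_to_scan(file_stats, column, value):
    """Keep only files whose min/max range for `column` could contain `value`."""
    return [
        name
        for name, stats in file_stats.items()
        if stats[column]["min"] <= value <= stats[column]["max"]
    ]

stats = {
    "part-0.parquet": {"id": {"min": 1,   "max": 100}},
    "part-1.parquet": {"id": {"min": 101, "max": 200}},
    "part-2.parquet": {"id": {"min": 201, "max": 300}},
}
# A query like WHERE id = 150 only needs to read part-1.parquet.
```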

9
Q

Are nested field columns (e.g. arrays/structs) included in the 32-column rule?

How should high cardinality string fields be organized within a table?

A

Yes; each nested field counts toward the limit.

Move them outside the first 32 columns so stats are not collected on them.

10
Q

How does the VACUUM command deal with files in the _delta_log directory?

How does Databricks clean up log files?

How is the data retention interval changed?

How does this affect Time Travel?

A

They are ignored.

When a new checkpoint is created, Databricks checks for log entries older than the log retention interval (default 30 days) and removes them.

delta.logRetentionDuration

The log retention interval is the upper bound for time travel.

11
Q

What is Auto Optimize?

How is it implemented?

A

Instead of the ~1 GB target file size used by OPTIMIZE, 128 MB files are written (optimized writes). After a write, smaller files are compacted toward 128 MB where possible (auto compaction).

Table creation or ALTER:
SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true)

Spark session:
spark.databricks.delta.optimizeWrite.enabled
spark.databricks.delta.autoCompact.enabled

12
Q

What is a Deletion Vector?

A

(Optional) Storage optimization feature that records DELETE, UPDATE, and MERGE operations in small files separate from the Parquet data. This defers data file rewrites until OPTIMIZE or another rewrite of the data files.
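The idea can be sketched in plain Python (a simulation, not the real file format): instead of rewriting the Parquet file, a DELETE records the deleted row positions in a small sidecar set, which readers apply as a mask until OPTIMIZE rewrites the file.

```python
def delete_rows(deletion_vector, positions):
    """Record deletes in the sidecar vector; the data file is untouched."""
    deletion_vector.update(positions)

def read_file(rows, deletion_vector):
    """Apply the deletion vector as a mask at read time."""
    return [row for i, row in enumerate(rows) if i not in deletion_vector]

data_file = ["alice", "bob", "carol", "dave"]  # stands in for immutable Parquet contents
dv = set()
delete_rows(dv, {1, 3})                        # DELETE bob and dave
# read_file(data_file, dv) -> ["alice", "carol"]; data_file itself is unchanged
```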
