Auto Loader Flashcards

High-Level Points (24 cards)

1
Q

What are the three parts of the basic streaming architecture and where does Auto Loader sit?

A

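1. Stream
2. Storage layer (ADLS Gen2)
3. Delta Lake

Auto Loader sits between the storage layer and Delta Lake, picking up newly landed files and loading them into Delta tables.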

2
Q

What is Auto Loader?

A

Auto Loader is a proprietary Databricks API built on top of Spark Structured Streaming that simplifies and improves streaming data into Databricks.

3
Q

What is Spark Structured Streaming and how is it used by Auto Loader?

A

Spark Structured Streaming (SSS) is Spark's stream-processing engine. Auto Loader uses it to connect blob storage (the source) to the sink (the lakehouse).

4
Q

How is Auto Loader called in SSS?

A

Auto Loader is invoked by setting the source format at the start of the SSS read: .format("cloudFiles")
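
As a rough illustration of that call, the sketch below wraps an Auto Loader read in a helper. It assumes an existing `spark` session on Databricks; the helper names and the schema-location path are hypothetical, chosen only for this example.

```python
def autoloader_options(file_format, schema_location):
    """Build the reader options for an Auto Loader stream."""
    return {
        # Format of the incoming files, e.g. "json", "csv" or "parquet".
        "cloudFiles.format": file_format,
        # Folder where Auto Loader keeps the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
    }


def read_with_autoloader(spark, source_path, file_format, schema_location):
    """Return a streaming DataFrame; `spark` is an existing SparkSession."""
    return (
        spark.readStream
        .format("cloudFiles")  # this format name is what selects Auto Loader
        .options(**autoloader_options(file_format, schema_location))
        .load(source_path)
    )


# Hypothetical path, for illustration only.
opts = autoloader_options("json", "/Volumes/demo/schemas/orders")
```

Keeping the options in a plain dictionary makes them easy to inspect or reuse across pipelines before any stream is started.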

5
Q

What are checkpoints?

What do they protect against?

Where are they stored?

Best Practice for multiple pipelines?

A

Logs that track write progress in a Spark pipeline.

They provide fault tolerance against job failures, restarts, and infrastructure outages.

In a cloud storage folder, where progress is recorded as key-value pairs.

Give each pipeline its own checkpoint location.

6
Q

How are checkpoints specified in SSS?

A

.option("checkpointLocation", "/Volumes/checkpoint/location/folder")
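
One way to follow the "one checkpoint per pipeline" practice is to derive each location from the pipeline name. The sketch below assumes `df` is a streaming DataFrame; `checkpoint_path`, `start_stream`, and the pipeline names are hypothetical.

```python
def checkpoint_path(base_dir, pipeline_name):
    """Give every pipeline its own folder under a shared base directory."""
    return f"{base_dir.rstrip('/')}/{pipeline_name}"


def start_stream(df, base_dir, pipeline_name, target_table):
    """Start writing a streaming DataFrame to a table with its own checkpoint."""
    return (
        df.writeStream
        .option("checkpointLocation", checkpoint_path(base_dir, pipeline_name))
        .toTable(target_table)
    )


# Two pipelines sharing a base folder still get distinct checkpoint locations.
orders_cp = checkpoint_path("/Volumes/checkpoint/location/folder/", "orders")
returns_cp = checkpoint_path("/Volumes/checkpoint/location/folder", "returns")
```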

7
Q

What are watermarks?

What are their two common use cases?

A

Thresholds that let a stream handle late data: events arriving within the specified delay period are still processed, while older ones may be dropped.

1. Time-based aggregations (e.g. revenue per hour)

2. Joins between two streaming sources
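
A toy model in plain Python may help make the delay rule concrete. It is not how Spark implements watermarks (Spark updates the watermark once per micro-batch, not per event); `split_late_events` and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def split_late_events(event_times, delay):
    """Return (accepted, dropped) under a simplified per-event watermark.

    The watermark is the latest event time seen so far minus `delay`;
    an event older than the watermark is treated as too late.
    """
    max_seen = None
    accepted, dropped = [], []
    for ts in event_times:
        watermark = max_seen - delay if max_seen is not None else None
        if watermark is not None and ts < watermark:
            dropped.append(ts)
        else:
            accepted.append(ts)
            max_seen = ts if max_seen is None else max(max_seen, ts)
    return accepted, dropped


t0 = datetime(2024, 1, 1, 12, 0)
events = [
    t0,
    t0 + timedelta(minutes=20),
    t0 + timedelta(minutes=15),  # 5 minutes late: inside the 10-minute delay
    t0 + timedelta(minutes=5),   # 15 minutes late: outside the delay
]
accepted, dropped = split_late_events(events, timedelta(minutes=10))
```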

8
Q

What is an SSS window?

A

A fixed event-time interval (e.g. 5 minutes) into which rows are grouped for aggregation, alongside any other specified column(s). The window length is set independently of the watermark delay.
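
To show what grouping by a window means, here is a small plain-Python sketch that assigns a timestamp to a fixed-length (tumbling) window aligned to the Unix epoch. It only mimics the idea; the `tumbling_window` helper is hypothetical, and sliding windows are ignored.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def tumbling_window(ts, length):
    """Return the (start, end) of the fixed-length window containing `ts`."""
    offset = (ts - EPOCH) % length  # how far `ts` is into its window
    start = ts - offset
    return start, start + length


start, end = tumbling_window(datetime(2024, 1, 1, 12, 7), timedelta(minutes=5))
```

Every event whose timestamp falls in the same (start, end) pair lands in the same aggregation group.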

9
Q

How are watermarks specified in SSS?

How are windows specified in SSS?

A

.withWatermark("event_time", "10 minutes")  # event-time column, then allowed delay (example value)

from pyspark.sql.functions import window

.groupBy(
    window("event_time", "5 minutes"),  # event-time column, then window length
    "column_name")

10
Q

What are the 5 key advantages of Auto Loader?

A

1. No notification/queue services to set up manually
2. No scheduling
3. Exactly-once processing
4. Schema inference
5. Scales with the number of files rather than the number of directories (cost savings)

11
Q

How does Auto Loader deal with “bad” data?

A

Data that does not fit the expected schema is stored in the _rescued_data column of the target table.

12
Q

Name the 2 file discovery types used by Auto Loader

A

1. Directory listing mode (default)

2. File notification mode

13
Q

What is Directory Listing Mode?

A

File discovery method that periodically lists the files in the specified directory. When a new file is detected, processing is triggered.

14
Q

What is File Notification Mode?
What additions does it make to the streaming architecture?

A

File discovery method suited to high-volume streaming workloads.

It adds a notification service and a queue service to the architecture. Auto Loader creates and manages them, which requires granting it additional permissions.
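
Switching between the two discovery modes comes down to one reader option, `cloudFiles.useNotifications`. A minimal sketch follows; the `discovery_options` helper is hypothetical, and the extra cloud permissions still have to be granted separately.

```python
def discovery_options(use_notifications):
    """Choose file notification mode (True) or directory listing mode (False)."""
    return {"cloudFiles.useNotifications": "true" if use_notifications else "false"}


notification_mode = discovery_options(True)
listing_mode = discovery_options(False)  # same as leaving the option unset
```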

15
Q

What cloudFiles option controls the max amount of data processed in each micro-batch? How is it called?

A

.option("cloudFiles.maxBytesPerTrigger", "1g")
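
The byte limit can be paired with a file-count limit, `cloudFiles.maxFilesPerTrigger`. The sketch below builds both as reader options; `rate_limit_options` is a hypothetical helper and the values are examples only. Databricks documents the byte limit as a soft maximum, so a batch can exceed it.

```python
def rate_limit_options(max_bytes="1g", max_files=None):
    """Build per-micro-batch limits for an Auto Loader stream."""
    opts = {"cloudFiles.maxBytesPerTrigger": max_bytes}
    if max_files is not None:
        # Option values are passed as strings.
        opts["cloudFiles.maxFilesPerTrigger"] = str(max_files)
    return opts


bytes_only = rate_limit_options()
both_limits = rate_limit_options("500m", max_files=200)
```

The resulting dictionary can be passed to the reader with `.options(**both_limits)`.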

16
Q

What SSS read stream option allows malformed JSON or CSV records to be isolated? How is it called?

A

.option("badRecordsPath", "/path/to/quarantine")

17
Q

What SSS read stream option allows malformed JSON or CSV records to be isolated? How is it called?

A

.option("badRecordsPath", "/path/to/quarantine")

18
Q

What SSS read stream option filters based on a pattern? How is it called?

A

.option("pathGlobFilter", "*<search>*")
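
The effect of a glob pattern can be previewed in plain Python. This sketch uses `fnmatchcase` from the standard library as an approximation of the matching Spark applies to file names; the file names are invented for the example.

```python
from fnmatch import fnmatchcase


def matching_files(names, pattern):
    """Keep only the file names that match the glob pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


files = ["orders_2024.json", "orders_2024.csv", "returns_2024.json"]
json_only = matching_files(files, "*.json")
orders_only = matching_files(files, "*orders*")
```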

19
Q

What cloudFiles option determines how new columns are handled? How is it called?

A

.option("cloudFiles.schemaEvolutionMode", <mode>)

20
Q

What are the modes available for schema evolution with Auto Loader?

A

1. addNewColumns (default)
2. rescue
3. failOnNewColumns
4. none

21
Q

What is the behavior of the Auto Loader mode addNewColumns when schema changes occur? When is it not allowed?

A

The stream fails with UnknownFieldException. New columns are added to the end of the schema, and the next run uses the new schema.
It is not allowed when a schema is provided.

22
Q

What is the behavior of the Auto Loader mode none? When does it become the default?

A

New columns are ignored: the stream does not fail and the schema does not evolve. It becomes the default when a schema is provided.

23
Q

What is the behavior of the Auto Loader mode rescue?

A

The stream does not fail and the schema does not evolve. New columns are routed to the rescued data column.

24
Q

What is the behavior of the Auto Loader mode failOnNewColumns?

A

The stream fails and won’t restart until it is given an updated schema or the data containing the new columns is removed.
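
The four schema evolution modes can be compared side by side with a toy model in plain Python. This is only a sketch of the behavior described on the cards, not Auto Loader's implementation; `ingest`, `UnknownFieldError`, and the sample record are hypothetical.

```python
import json

MODES = {"addNewColumns", "rescue", "failOnNewColumns", "none"}


class UnknownFieldError(Exception):
    """Stand-in for the stream failure; not the real Spark exception class."""


def ingest(record, schema, mode):
    """Return (row, schema) for one record, or raise to model a stream failure."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    new_cols = [col for col in record if col not in schema]
    row = {col: record.get(col) for col in schema}
    if mode == "rescue":
        # Unexpected columns are kept as JSON instead of being lost.
        rescued = {col: record[col] for col in new_cols}
        row["_rescued_data"] = json.dumps(rescued) if rescued else None
        return row, schema
    if not new_cols or mode == "none":
        return row, schema  # new columns, if any, are silently ignored
    err = UnknownFieldError(f"new columns: {new_cols}")
    if mode == "addNewColumns":
        # The evolved schema is what the restarted stream would use.
        err.evolved_schema = schema + new_cols
    raise err


schema = ["id", "amount"]
record = {"id": 1, "amount": 9.5, "coupon": "SPRING"}
ignored_row, _ = ingest(record, schema, "none")
rescued_row, _ = ingest(record, schema, "rescue")
```

In this model, addNewColumns and failOnNewColumns both raise; only the former attaches an evolved schema for the next run.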