What are the three parts of the basic streaming architecture and where does Auto Loader sit?
1.Stream
2.Storage Layer (ADLS Gen 2)
3.Delta Lake
What is Autoloader?
Auto Loader is a proprietary API built on top of Spark Structured Streaming that simplifies and improves ingesting streaming data into Databricks
What is Spark Structured Streaming and how is it used by Auto Loader?
Spark Structured Streaming (SSS) is Spark's stream-processing engine; Auto Loader uses it to connect blob storage (the source) to the sink (the Lakehouse)
How is Auto Loader called in SSS?
Auto Loader is called by starting the SSS read syntax with .format("cloudFiles")
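Put together, a minimal Auto Loader read looks like the sketch below (the paths and file format are hypothetical; assumes a Databricks environment where `spark` is an active SparkSession):

```python
# Minimal Auto Loader read sketch -- paths are placeholders, not real locations.
df = (
    spark.readStream
        .format("cloudFiles")                        # invokes Auto Loader
        .option("cloudFiles.format", "json")         # format of the incoming files
        .option("cloudFiles.schemaLocation", "/Volumes/demo/schemas/orders")
        .load("/Volumes/demo/landing/orders")        # source directory to monitor
)
```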
What are checkpoints?
What do they protect against?
Where are they stored?
Best Practice for multiple pipelines?
Logs that track write progress in a Spark pipeline.
Fault tolerance against job failures, restarts, and infrastructure outages.
Cloud storage folder using key-value pairs.
Each pipeline should have a different location.
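The checkpoint idea can be sketched outside Spark. Below is a toy pure-Python stand-in (not Spark's actual mechanism): progress is committed to a small key-value file after each record, so a restart resumes where it left off instead of reprocessing everything.

```python
import json
import os
import tempfile

def process(records, checkpoint_path):
    """Double each record, persisting progress so a restart resumes where it
    left off -- a toy stand-in for Spark's checkpoint folder, not the real thing."""
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["offset"]          # resume from last committed offset
    out = []
    for i in range(done, len(records)):
        out.append(records[i] * 2)                 # the "work"
        with open(checkpoint_path, "w") as f:
            json.dump({"offset": i + 1}, f)        # commit progress after each record
    return out

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
first = process([1, 2, 3], ckpt)       # fresh run: processes all three records
second = process([1, 2, 3, 4], ckpt)   # "restart": only the unprocessed record
```

Because each pipeline writes its own offsets, sharing one checkpoint location between two pipelines would corrupt this bookkeeping, which is why each pipeline needs its own folder.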
How are checkpoints specified in SSS?
.option("checkpointLocation", "/Volumes/checkpoint/location/folder")
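In context, the option sits on the write side of the stream. A sketch (`df` is a streaming DataFrame; the paths and table name are hypothetical):

```python
# Writing the stream with a per-pipeline checkpoint location.
(
    df.writeStream
        .format("delta")
        .option("checkpointLocation", "/Volumes/demo/checkpoints/orders")  # unique per pipeline
        .trigger(availableNow=True)    # process all available files, then stop
        .table("main.demo.orders")
)
```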
What are watermarks?
What are their two common use cases?
Watermarks tell SSS how long to wait for late-arriving data: late data within the specified delay period is still handled, anything later is dropped.
1.Time-based aggregations (e.g. revenue per hour)
2.Joins between two streaming sources
What is a SSS window?
A time bucket over the specified event-time column(s) that establishes how data is grouped for processing; a window is finalized once the watermark passes its end.
How are watermarks specified in SSS?
How are windows specified in SSS?
.withWatermark("event_time", "10 minutes")  (event-time column first, allowed delay second)
from pyspark.sql.functions import window
.groupBy(
window("event_time", "5 minutes"),
"column_name")
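The two pieces combine into one aggregation. A sketch of the "revenue per hour" use case (`events` is a streaming DataFrame; the column names `event_time` and `amount` are hypothetical):

```python
from pyspark.sql.functions import window, sum as sum_

# Tolerate data up to 10 minutes late, grouped into 1-hour event-time windows.
revenue_per_hour = (
    events
        .withWatermark("event_time", "10 minutes")   # data later than this is dropped
        .groupBy(window("event_time", "1 hour"))     # hourly buckets
        .agg(sum_("amount").alias("revenue"))
)
```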
What are the 5 key advantages of Auto Loader?
1.No Notification/Queue Services
2.No scheduling
3.Exactly-once processing
4.Schema Inference
5.Scales with number of files instead of directories (cost savings)
How does AutoLoader deal with “bad” data?
Bad data is stored in the _rescued_data column of the target table
Name the 2 file discovery types used by Auto Loader
1.Directory Listing Mode (default)
2.File Notification Mode
What is Directory Listing Mode?
File discovery method that periodically lists files in the specified directory. When a new file is detected, a processing task is triggered.
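Conceptually, directory listing mode is just "list the directory, diff against files already seen, trigger work only for the new ones." A toy pure-Python sketch of that idea (nothing here reflects Auto Loader's internals):

```python
import os
import tempfile

def discover_new_files(directory, seen):
    """Return files not yet seen and record them -- a toy model of
    directory listing mode, not Auto Loader's actual implementation."""
    new = sorted(f for f in os.listdir(directory) if f not in seen)
    seen.update(new)
    return new

d = tempfile.mkdtemp()
open(os.path.join(d, "a.json"), "w").close()
seen = set()
first_scan = discover_new_files(d, seen)     # finds a.json
open(os.path.join(d, "b.json"), "w").close()
second_scan = discover_new_files(d, seen)    # finds only the new b.json
```

The cost of re-listing grows with directory size, which is why high-volume workloads move to file notification mode.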
What is File Notification Mode?
What additions does it make to the streaming architecture?
File discovery method for high-volume streaming tasks.
Auto Loader creates and manages cloud notification and queue services, which requires additional permissions.
What cloudFiles option controls the max amount of data processed in each micro-batch? How is it called?
.option("cloudFiles.maxBytesPerTrigger", "1g")
What SSS read stream option allows malformed JSON, CSV records to be isolated? How is it called?
.option("badRecordsPath", "/path/to/quarantine")
What SSS read stream option filters files based on a glob pattern? How is it called?
.option("pathGlobFilter", "*<search>*")
What cloudFiles option determines how new columns are handled? How is it called?
.option("cloudFiles.schemaEvolutionMode", "<mode>")
What are the modes available for schema evolution with autoloader?
1.addNewColumns (default)
2.rescue
3.failOnNewColumns
4.none
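For example, selecting rescue mode keeps the stream running and routes new columns into _rescued_data. A sketch (paths are hypothetical; assumes a Databricks `spark` session):

```python
# rescue mode: new columns never fail the stream; they land in _rescued_data.
df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", "/Volumes/demo/schemas/events")
        .option("cloudFiles.schemaEvolutionMode", "rescue")
        .load("/Volumes/demo/landing/events")
)
```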
What is the behavior for the autoloader mode addNewColumns when schema changes occur? When is it not allowed?
Stream fails with an UnknownFieldException. New columns are added to the end of the schema, and the next run proceeds with the updated schema.
Not allowed when a schema is explicitly provided.
What is the behavior of the autoloader mode none? When does it become the default?
New columns are ignored; the stream does not fail and does not capture new columns. It becomes the default when a schema is provided.
What is the behavior of the autoloader mode rescue?
No stream failure. New columns are routed to the _rescued_data column.
What is the behavior of the autoloader mode failOnNewColumns?
Stream fails and won't restart until it is provided with an updated schema or the new columns are removed.