What are the three parts of the basic streaming architecture and where does Auto Loader sit?
1.Stream
2.Storage Layer (ADLS Gen 2)
3.Delta Lake
What is Autoloader?
Auto Loader is a proprietary API built on top of Spark Structured Streaming that simplifies and improves ingesting streaming data into Databricks
What is Spark Structured Streaming and how is it used by Auto Loader?
Spark Structured Streaming (SSS) is Spark's stream-processing engine; Auto Loader uses it to connect blob storage (the source) to the sink (the Lakehouse)
How is Auto Loader called in SSS?
Auto Loader is called by starting the SSS read syntax with .format("cloudFiles")
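Put together, a minimal Auto Loader read looks like the sketch below (the paths and file format are hypothetical; assumes a Databricks environment where `spark` is an active SparkSession):

```python
# Minimal Auto Loader read sketch -- paths are placeholders, not real locations.
df = (
    spark.readStream
        .format("cloudFiles")                        # invokes Auto Loader
        .option("cloudFiles.format", "json")         # format of the incoming files
        .option("cloudFiles.schemaLocation", "/Volumes/demo/schemas/orders")
        .load("/Volumes/demo/landing/orders")        # source directory to monitor
)
```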
What are checkpoints?
What do they protect against?
Where are they stored?
Best Practice for multiple pipelines?
Logs that track write progress in a Spark pipeline.
Fault tolerance against job failures, restarts, and infrastructure outages.
Cloud storage folder using key-value pairs.
Each pipeline should have a different location.
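The checkpoint idea can be sketched outside Spark. Below is a toy pure-Python stand-in (not Spark's actual mechanism): progress is committed to a small key-value file after each record, so a restart resumes where it left off instead of reprocessing everything.

```python
import json
import os
import tempfile

def process(records, checkpoint_path):
    """Double each record, persisting progress so a restart resumes where it
    left off -- a toy stand-in for Spark's checkpoint folder, not the real thing."""
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["offset"]          # resume from last committed offset
    out = []
    for i in range(done, len(records)):
        out.append(records[i] * 2)                 # the "work"
        with open(checkpoint_path, "w") as f:
            json.dump({"offset": i + 1}, f)        # commit progress after each record
    return out

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
first = process([1, 2, 3], ckpt)       # fresh run: processes all three records
second = process([1, 2, 3, 4], ckpt)   # "restart": only the unprocessed record
```

Because each pipeline writes its own offsets, sharing one checkpoint location between two pipelines would corrupt this bookkeeping, which is why each pipeline needs its own folder.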
How are checkpoints specified in SSS?
.option("checkpointLocation", "/Volumes/checkpoint/location/folder")
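In context, the option sits on the write side of the stream. A sketch (`df` is a streaming DataFrame; the paths and table name are hypothetical):

```python
# Writing the stream with a per-pipeline checkpoint location.
(
    df.writeStream
        .format("delta")
        .option("checkpointLocation", "/Volumes/demo/checkpoints/orders")  # unique per pipeline
        .trigger(availableNow=True)    # process all available files, then stop
        .table("main.demo.orders")
)
```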
What are watermarks?
What are their two common use cases?
Watermarks tell SSS how long to wait for late-arriving data: late data within the specified delay period is still handled, anything later is dropped.
1.Time-based aggregations (e.g. revenue per hour)
2.Joins between two streaming sources
What is a SSS window?
A time bucket over the specified event-time column(s) that establishes how data is grouped for processing; a window is finalized once the watermark passes its end.
How are watermarks specified in SSS?
How are windows specified in SSS?
.withWatermark("event_time", "10 minutes")  (event-time column first, allowed delay second)
from pyspark.sql.functions import window
.groupBy(
window("event_time", "5 minutes"),
"column_name")
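The two pieces combine into one aggregation. A sketch of the "revenue per hour" use case (`events` is a streaming DataFrame; the column names `event_time` and `amount` are hypothetical):

```python
from pyspark.sql.functions import window, sum as sum_

# Tolerate data up to 10 minutes late, grouped into 1-hour event-time windows.
revenue_per_hour = (
    events
        .withWatermark("event_time", "10 minutes")   # data later than this is dropped
        .groupBy(window("event_time", "1 hour"))     # hourly buckets
        .agg(sum_("amount").alias("revenue"))
)
```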
What are the 5 key advantages of Auto Loader?
1.No Notification/Queue Services
2.No scheduling
3.Exactly-once processing
4.Schema Inference
5.Scales with number of files instead of directories (cost savings)
How does AutoLoader deal with “bad” data?
Bad data is stored in the _rescued_data column of the target table
Name the 2 file discovery types used by Auto Loader
1.Directory Listing Mode (default)
2.File Notification Mode
What is Directory Listing Mode?
File discovery method that periodically lists files in the specified directory. When a new file is detected, a processing task is triggered.
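Conceptually, directory listing mode is just "list the directory, diff against files already seen, trigger work only for the new ones." A toy pure-Python sketch of that idea (nothing here reflects Auto Loader's internals):

```python
import os
import tempfile

def discover_new_files(directory, seen):
    """Return files not yet seen and record them -- a toy model of
    directory listing mode, not Auto Loader's actual implementation."""
    new = sorted(f for f in os.listdir(directory) if f not in seen)
    seen.update(new)
    return new

d = tempfile.mkdtemp()
open(os.path.join(d, "a.json"), "w").close()
seen = set()
first_scan = discover_new_files(d, seen)     # finds a.json
open(os.path.join(d, "b.json"), "w").close()
second_scan = discover_new_files(d, seen)    # finds only the new b.json
```

The cost of re-listing grows with directory size, which is why high-volume workloads move to file notification mode.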
What is File Notification Mode?
What additions does it make to the streaming architecture?
File discovery method for high-volume streaming tasks.
Auto Loader creates and manages cloud notification and queue services, which requires additional permissions.
What cloudFiles option controls the max amount of data processed in each micro-batch? How is it called?
.option("cloudFiles.maxBytesPerTrigger", "1g")
What SSS read stream option allows malformed JSON, CSV records to be isolated? How is it called?
.option("badRecordsPath", "/path/to/quarantine")
What SSS read stream option filters files based on a glob pattern? How is it called?
.option("pathGlobFilter", "*<search>*")
What cloudFiles option determines how new columns are handled? How is it called?
.option("cloudFiles.schemaEvolutionMode", "<mode>")
What are the modes available for schema evolution with autoloader?
1.addNewColumns (default)
2.rescue
3.failOnNewColumns
4.none
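For example, selecting rescue mode keeps the stream running and routes new columns into _rescued_data. A sketch (paths are hypothetical; assumes a Databricks `spark` session):

```python
# rescue mode: new columns never fail the stream; they land in _rescued_data.
df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", "/Volumes/demo/schemas/events")
        .option("cloudFiles.schemaEvolutionMode", "rescue")
        .load("/Volumes/demo/landing/events")
)
```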
What is the behavior for the autoloader mode addNewColumns when schema changes occur? When is it not allowed?
Stream fails with an UnknownFieldException. New columns are added to the end of the schema, and the next run proceeds with the updated schema.
Not allowed when a schema is explicitly provided.
What is the behavior of the autoloader mode none? When does it become the default?
New columns are ignored; the stream does not fail and does not capture new columns. It becomes the default when a schema is provided.
What is the behavior of the autoloader mode rescue?
No stream failure. New columns are routed to the _rescued_data column.
What is the behavior of the autoloader mode failOnNewColumns?
Stream fails and won't restart until it is provided with an updated schema or the new columns are removed.