Intro & Auto Loader Flashcards

(32 cards)

1
Q

What is Unity Catalog?

A

➡️ Unified governance solution for all data assets — tables, files, machine learning models — across workspaces.
➡️It provides the following at the account level, instead of managing permissions per workspace:
✅Centralized access control
✅Lineage
✅Auditing
✅Data discovery

2
Q

What is a metastore?

A

➡️A central repository that stores metadata about your data assets

3
Q

What are data assets?

A

✅Tables (Managed or External)
✅Views
✅Volumes (unstructured or semi-structured files in cloud storage)
✅Functions
✅Machine Learning Models

4
Q

What is the difference between Unity Catalog and a metastore?

A

➡️A metastore is like a database that stores the metadata of data assets
➡️Unity Catalog is like a management system built on top of the metastore

5
Q

What are managed tables?

A

➡️Databricks controls both the metadata and the underlying data files
➡️If you drop the table, both metadata and data are deleted

 ⚒️CREATE TABLE sales.orders (id INT, amount DOUBLE);
6
Q

What are external tables?

A

➡️ Databricks stores only the metadata, while the actual data files remain in an external location that you manage
➡️When you drop the table, only the metadata is removed — the data files stay in place.

 ⚒️CREATE TABLE sales.retail.orders_external (
       order_id    BIGINT,
       customer_id BIGINT)
   USING DELTA
   LOCATION 'abfss://raw-data@storageaccountname.dfs.core.windows.net/bronze/orders';
7
Q

When is data created in the DBFS location?

A

➡️When not using Unity Catalog — this was the behavior of the legacy Hive Metastore, where managed table data was stored in DBFS

8
Q

What is a workspace?

A

➡️A collaborative environment in Databricks where users run notebooks, jobs, and queries, and manage resources.
➡️Environment separation – e.g., Dev, Test, Prod.

9
Q

Databricks Hierarchy

A

✅Account
✅Region
✅Metastore
✅Workspace
✅Catalog
✅Schema
✅Table / View / Function / Volume

10
Q

How many workspaces can we create in one Databricks account?

A

Multiple workspaces — there is no fixed limit

11
Q

Can we create multiple workspaces in the same region?

A

Yes — multiple workspaces in the same region share that region's metastore (1 metastore per account per region):

Account: Company_Databricks

Region: East US
Metastore: eastus_metastore
├── Workspace: Finance_Dev (shares eastus_metastore)
├── Workspace: Finance_Prod (shares eastus_metastore)
└── Workspace: Sales_Analytics (shares eastus_metastore)

Region: West Europe
Metastore: weu_metastore
├── Workspace: Marketing_Dev (shares weu_metastore)
└── Workspace: Marketing_Prod (shares weu_metastore)

Region: Central India
Metastore: ci_metastore
├── Workspace: R&D_Analytics (shares ci_metastore)
├── Workspace: R&D_Test (shares ci_metastore)
└── Workspace: R&D_Prod (shares ci_metastore)

12
Q

What's the purpose of External Data Access in Databricks?

A

➡️Enables external engines and clients to access Unity Catalog data (including support for external tables)
➡️Catalog -> Settings -> Metastore -> Enable External Data Access

13
Q

What is Auto Loader?

A

➡️Incrementally ingest new data files from cloud storage into Delta Lake tables without having to repeatedly scan the entire directory.
➡️Tracks already-processed files in a checkpoint (a RocksDB-backed metadata log), so each file is ingested exactly once.

14
Q

Which file formats does Auto Loader support?

A

✅CSV
✅JSON
✅Parquet
✅Avro
✅ORC
✅Text
✅Binary files (binaryFile)

15
Q

How is Auto Loader built on Spark Structured Streaming?

A

➡️Built on top of Spark Structured Streaming
➡️Auto Loader leverages the Structured Streaming APIs to watch a directory and process files as they arrive.

16
Q

What are file discovery modes in Auto Loader?

A

➡️ Two main modes for file detection, both implemented on top of Spark Structured Streaming:
✅File Notification Mode
➡️Uses cloud-native event services (e.g., Azure Event Grid, AWS SQS/SNS, Google Cloud Pub/Sub).
➡️When a file lands in cloud storage, an event is queued and picked up by the stream, so the job immediately knows there's new data.
✅Directory Listing Mode
➡️Incrementally lists the directory, with optimizations to avoid full scans.
➡️Uses a persisted checkpoint of already-processed files to avoid duplicates.

17
Q

How does incremental processing work in Auto Loader?

A

➡️Auto Loader maintains checkpoint locations in DBFS or external storage.
➡️The checkpoint stores:
✅Offsets (list of processed files)
✅Schema information
✅Progress logs

18
Q

What is RocksDB in Auto Loader?

A

➡️A high-performance embedded key-value database used to store and manage state during streaming ingestion.
➡️Stores file metadata (file path, modification time, size, etc.)
➡️Optimized for key-value lookups
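To illustrate why a key-value store suits this workload, here is a toy sketch with a plain dict standing in for RocksDB (class and method names are hypothetical), giving constant-time "have we already ingested this file?" checks:

```python
class FileStateStore:
    """Toy stand-in for Auto Loader's RocksDB state store:
    maps file path -> metadata for O(1) dedup lookups."""

    def __init__(self):
        self._store = {}  # path -> {"mtime": ..., "size": ...}

    def mark_processed(self, path, mtime, size):
        self._store[path] = {"mtime": mtime, "size": size}

    def is_processed(self, path):
        return path in self._store

    def metadata(self, path):
        return self._store.get(path)
```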

19
Q

What are cloudFiles options?

A

✅cloudFiles.format — json|csv|parquet|avro|binaryFile
✅cloudFiles.schemaLocation — required to track the inferred schema
✅cloudFiles.useNotifications — true|false (use cloud event notifications)
✅cloudFiles.schemaEvolutionMode — addNewColumns|rescue|failOnNewColumns|none
✅cloudFiles.includeExistingFiles — true|false (process existing files when starting)
✅recursiveFileLookup — true|false (scan nested directories)
✅cloudFiles.maxFilesPerTrigger / cloudFiles.maxBytesPerTrigger — throttle batch size
✅cloudFiles.useIncrementalListing — controls incremental directory listing; usually left at the default

20
Q

Simple Auto Loader — CSV (Directory listing mode)

A

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("cloudFiles.schemaLocation", schema_location)
    .load(source_path))

(df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .toTable("bronze_orders"))  # or .save(target_delta_path)

21
Q

JSON with Schema Evolution (add new columns automatically)

A

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_location)
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # allow adding new columns
    .load(source_path))

(df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .option("mergeSchema", "true")  # important — allows the Delta table schema to be updated
    .outputMode("append")
    .toTable("bronze_orders_json"))

22
Q

Large-scale ingestion — File Notification Mode (recommended at scale)

A

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", schema_location)
    .option("cloudFiles.useNotifications", "true")  # enable notification mode
    .load(source_path))

(df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .toTable("bronze_parquet"))

23
Q

Partitioned Writes & Performance Options

A

(df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .partitionBy("order_date")  # partition column
    .option("mergeSchema", "true")
    .trigger(processingTime="1 minute")  # optional; default is as-fast-as-possible micro-batches
    .toTable("orders_bronze_parted"))

24
Q

Exactly-once / Idempotent upserts into Delta (foreachBatch + MERGE)

A

def upsert_to_delta(microbatch_df, batch_id):
    if microbatch_df.rdd.isEmpty():
        return
    # register the micro-batch as a temp view for the SQL MERGE
    microbatch_df.createOrReplaceTempView("micro_orders")

    # Merge into the existing Delta table, keyed on order_id (example)
    microbatch_df.sparkSession.sql("""
      MERGE INTO silver.orders AS target
      USING micro_orders AS source
      ON target.order_id = source.order_id
      WHEN MATCHED THEN UPDATE SET *
      WHEN NOT MATCHED THEN INSERT *
    """)

query = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_location)
    .load(source_path)
    .writeStream
    .foreachBatch(upsert_to_delta)
    .option("checkpointLocation", checkpoint_path)
    .start())

25
Q

External table (create a table referencing files) — SQL + Auto Loader writes

A

-- create an external table (registered in Unity Catalog) pointing to a path
CREATE TABLE finance.sales_orders
USING DELTA
LOCATION 'abfss://tables@storageaccount.dfs.core.windows.net/sales/orders_delta';
26
Q

Handling corrupt/malformed records

A

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("badRecordsPath", "abfss://bad@.../bad_records/")  # store bad rows
    .option("cloudFiles.schemaLocation", schema_location)
    .load(source_path))
27
Q

Recursive file lookup (if files are nested in subdirectories)

A

.option("recursiveFileLookup", "true")
28
Q

Sample: Full end-to-end robust pipeline (JSON → Delta with evolution + notification + merge)

A

source = "abfss://raw@storageaccount.dfs.core.windows.net/orders/"
schema_loc = "abfss://schema@.../orders_schema/"
ckpt = "abfss://checkpoints@.../orders_ckpt/"
silver_table = "silver.orders"

raw_stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_loc)
    .option("cloudFiles.useNotifications", "true")  # notification mode
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load(source))

def upsert(micro_df, batch_id):
    if micro_df.rdd.isEmpty():
        return
    micro_df.createOrReplaceTempView("incoming")
    micro_df.sparkSession.sql(f"""
      MERGE INTO {silver_table} AS tgt
      USING incoming AS src
      ON tgt.order_id = src.order_id
      WHEN MATCHED THEN UPDATE SET *
      WHEN NOT MATCHED THEN INSERT *
    """)

stream = (raw_stream.writeStream
    .foreachBatch(upsert)
    .option("checkpointLocation", ckpt)
    .start())
29
Q

What is inferSchema?

A

➡️Tells Auto Loader to automatically detect the schema (column names, types, and structure) of incoming data files, instead of you defining it explicitly.
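As a rough illustration of what type inference does (a toy sketch, not Spark's actual algorithm; the function name is hypothetical), a column's type can be chosen as the narrowest one that parses every sampled value:

```python
def infer_column_type(values):
    """Pick the narrowest type (int -> double -> string)
    that parses every sampled string value."""
    def fits(cast, v):
        try:
            cast(v)
            return True
        except ValueError:
            return False

    if all(fits(int, v) for v in values):
        return "int"
    if all(fits(float, v) for v in values):
        return "double"
    return "string"
```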
30
Q

What is cloudFiles?

A

➡️cloudFiles is the special format name you use when reading streaming data from cloud storage (such as AWS S3, Azure Blob Storage/ADLS, or Google Cloud Storage).
31
Q

What is the checkpoint folder?

A

➡️Stores the state of your streaming job:
✅offsets/
✅commits/
✅sources/
✅metadata/
✅rocksDB/ — stores state in RocksDB for incremental file discovery
✅schema/
32
Q

What is the data folder?

A

➡️The target location where the stream writes partitioned Parquet data files together with the _delta_log transaction log.