Intro & Auto Loader Flashcards

(32 cards)

1
Q

What is Unity Catalog?

A

➡️ Unified governance solution for all data assets — tables, files, machine learning models — across workspaces.
➡️It provides the following at the account level, instead of managing permissions per workspace:
✅Centralized access control
✅Lineage
✅Auditing
✅Data discovery

2
Q

What is a metastore?

A

➡️A central repository that stores metadata about your data assets

3
Q

What are data assets?

A

✅Tables (Managed or External)
✅Views
✅Volumes (unstructured or semi-structured files in cloud storage)
✅Functions
✅Machine Learning Models

4
Q

What is the difference between Unity Catalog and a metastore?

A

➡️A metastore is like a database that stores the metadata of data assets
➡️Unity Catalog is like a management system built on top of the metastore

5
Q

What are managed tables?

A

➡️Databricks controls both the metadata and the underlying data files
➡️If you drop the table, both metadata and data are deleted

 ⚒️CREATE TABLE sales.orders (id INT, amount DOUBLE);
6
Q

What are external tables?

A

➡️ Databricks stores only the metadata, while the actual data files remain in an external location that you manage
➡️When you drop the table, only the metadata is removed — the data files stay in place.

 ⚒️CREATE TABLE sales.retail.orders_external (
       order_id    BIGINT,
       customer_id BIGINT)
   USING DELTA
   LOCATION 'abfss://raw-data@storageaccountname.dfs.core.windows.net/bronze/orders';
7
Q

When is data created in the DBFS location?

A

➡️When not using Unity Catalog — this was the behavior of the legacy Hive Metastore, where managed table data was stored in DBFS

8
Q

What is a workspace?

A

➡️A collaborative environment in Databricks where users run notebooks, jobs, and queries, and manage resources.
➡️Environment separation – e.g., Dev, Test, Prod.

9
Q

Databricks Hierarchy

A

✅Account
✅Region
✅Metastore
✅Workspace
✅Catalog
✅Schema
✅Table / View / Function / Volume

10
Q

How many workspaces can we create in one Databricks account?

A

Multiple workspaces — there is no fixed limit

11
Q

Can we create multiple workspaces in the same region?

A

Yes — multiple workspaces in the same region share that region's metastore (1 metastore per account per region):

Account: Company_Databricks

Region: East US
Metastore: eastus_metastore
├── Workspace: Finance_Dev (shares eastus_metastore)
├── Workspace: Finance_Prod (shares eastus_metastore)
└── Workspace: Sales_Analytics (shares eastus_metastore)

Region: West Europe
Metastore: weu_metastore
├── Workspace: Marketing_Dev (shares weu_metastore)
└── Workspace: Marketing_Prod (shares weu_metastore)

Region: Central India
Metastore: ci_metastore
├── Workspace: R&D_Analytics (shares ci_metastore)
├── Workspace: R&D_Test (shares ci_metastore)
└── Workspace: R&D_Prod (shares ci_metastore)

12
Q

What's the purpose of External Data Access in Databricks?

A

➡️Enables external engines and clients to access Unity Catalog data (including support for external tables)
➡️Catalog -> Settings -> Metastore -> Enable External Data Access

13
Q

What is Auto Loader?

A

➡️Incrementally ingest new data files from cloud storage into Delta Lake tables without having to repeatedly scan the entire directory.
➡️Tracks already-processed files in a checkpoint (a RocksDB-backed metadata log), so each file is ingested exactly once.

14
Q

Which file formats does Auto Loader support?

A

✅CSV
✅JSON
✅Parquet
✅Avro
✅ORC
✅Text
✅Binary files (binaryFile)

15
Q

How is Auto Loader built on Spark Structured Streaming?

A

➡️Built on top of Spark Structured Streaming
➡️Auto Loader leverages the Structured Streaming APIs to watch a directory and process files as they arrive.

16
Q

What are file discovery modes in Auto Loader?

A

➡️ Two main modes for file detection, both implemented on top of Spark Structured Streaming:
✅File Notification Mode
➡️Uses cloud-native event services (e.g., Azure Event Grid, AWS SQS/SNS, Google Cloud Pub/Sub).
➡️When a file lands in cloud storage, an event is queued and picked up by the stream, so the job immediately knows there's new data.
✅Directory Listing Mode
➡️Incrementally lists the directory, with optimizations to avoid full scans.
➡️Uses a persisted checkpoint of already-processed files to avoid duplicates.

17
Q

How does incremental processing work in Auto Loader?

A

➡️Auto Loader maintains checkpoint locations in DBFS or external storage.
➡️The checkpoint stores:
✅Offsets (list of processed files)
✅Schema information
✅Progress logs

18
Q

What is RocksDB in Auto Loader?

A

➡️A high-performance embedded key-value database used to store and manage state during streaming ingestion.
➡️Stores file metadata (file path, modification time, size, etc.)
➡️Optimized for key-value lookups
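To illustrate why a key-value store suits this workload, here is a toy sketch with a plain dict standing in for RocksDB (class and method names are hypothetical), giving constant-time "have we already ingested this file?" checks:

```python
class FileStateStore:
    """Toy stand-in for Auto Loader's RocksDB state store:
    maps file path -> metadata for O(1) dedup lookups."""

    def __init__(self):
        self._store = {}  # path -> {"mtime": ..., "size": ...}

    def mark_processed(self, path, mtime, size):
        self._store[path] = {"mtime": mtime, "size": size}

    def is_processed(self, path):
        return path in self._store

    def metadata(self, path):
        return self._store.get(path)
```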

19
Q

What are cloudFiles options?

A

✅cloudFiles.format — json|csv|parquet|avro|binaryFile
✅cloudFiles.schemaLocation — required to track the inferred schema
✅cloudFiles.useNotifications — true|false (use cloud event notifications)
✅cloudFiles.schemaEvolutionMode — addNewColumns|rescue|failOnNewColumns|none
✅cloudFiles.includeExistingFiles — true|false (process existing files when starting)
✅recursiveFileLookup — true|false (scan nested directories)
✅cloudFiles.maxFilesPerTrigger / cloudFiles.maxBytesPerTrigger — throttle batch size
✅cloudFiles.useIncrementalListing — controls incremental directory listing; usually left at the default

20
Q

Simple Auto Loader — CSV (Directory listing mode)

A

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("cloudFiles.schemaLocation", schema_location)
    .load(source_path))

(df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .toTable("bronze_orders"))  # or .save(target_delta_path)

21
Q

JSON with Schema Evolution (add new columns automatically)

A

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_location)
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # allow adding new columns
    .load(source_path))

(df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .option("mergeSchema", "true")  # important — allows the Delta table schema to be updated
    .outputMode("append")
    .toTable("bronze_orders_json"))

22
Q

Large-scale ingestion — File Notification Mode (recommended at scale)

A

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", schema_location)
    .option("cloudFiles.useNotifications", "true")  # enable notification mode
    .load(source_path))

(df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .toTable("bronze_parquet"))

23
Q

Partitioned Writes & Performance Options

A

(df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .partitionBy("order_date")  # partition column
    .option("mergeSchema", "true")
    .trigger(processingTime="1 minute")  # optional; default is as-fast-as-possible micro-batches
    .toTable("orders_bronze_parted"))

24
Q

Exactly-once / Idempotent upserts into Delta (foreachBatch + MERGE)

A

def upsert_to_delta(microbatch_df, batch_id):
    if microbatch_df.rdd.isEmpty():
        return
    # register the micro-batch as a temp view for the SQL MERGE
    microbatch_df.createOrReplaceTempView("micro_orders")

    # Merge into the existing Delta table, keyed on order_id (example)
    microbatch_df.sparkSession.sql("""
      MERGE INTO silver.orders AS target
      USING micro_orders AS source
      ON target.order_id = source.order_id
      WHEN MATCHED THEN UPDATE SET *
      WHEN NOT MATCHED THEN INSERT *
    """)

query = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_location)
    .load(source_path)
    .writeStream
    .foreachBatch(upsert_to_delta)
    .option("checkpointLocation", checkpoint_path)
    .start())

25
Q

External table (create a table referencing files) — SQL + Auto Loader writes

A

-- create an external table (registered in Unity Catalog) pointing to a path
CREATE TABLE finance.sales_orders
USING DELTA
LOCATION 'abfss://tables@storageaccount.dfs.core.windows.net/sales/orders_delta';
26
Q

Handling corrupt/malformed records

A

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("badRecordsPath", "abfss://bad@.../bad_records/")  # store bad rows
    .option("cloudFiles.schemaLocation", schema_location)
    .load(source_path))
27
Q

Recursive file lookup (if files are nested in subdirectories)

A

.option("recursiveFileLookup", "true")
28
Q

Sample: Full end-to-end robust pipeline (JSON → Delta with evolution + notification + merge)

A

source = "abfss://raw@storageaccount.dfs.core.windows.net/orders/"
schema_loc = "abfss://schema@.../orders_schema/"
ckpt = "abfss://checkpoints@.../orders_ckpt/"
silver_table = "silver.orders"

raw_stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_loc)
    .option("cloudFiles.useNotifications", "true")  # notification mode
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load(source))

def upsert(micro_df, batch_id):
    if micro_df.rdd.isEmpty():
        return
    micro_df.createOrReplaceTempView("incoming")
    micro_df.sparkSession.sql(f"""
      MERGE INTO {silver_table} AS tgt
      USING incoming AS src
      ON tgt.order_id = src.order_id
      WHEN MATCHED THEN UPDATE SET *
      WHEN NOT MATCHED THEN INSERT *
    """)

stream = (raw_stream.writeStream
    .foreachBatch(upsert)
    .option("checkpointLocation", ckpt)
    .start())
29
Q

What is inferSchema?

A

➡️Tells Auto Loader to automatically detect the schema (column names, types, and structure) of incoming data files, instead of you defining it explicitly.
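As a rough illustration of what type inference does (a toy sketch, not Spark's actual algorithm; the function name is hypothetical), a column's type can be chosen as the narrowest one that parses every sampled value:

```python
def infer_column_type(values):
    """Pick the narrowest type (int -> double -> string)
    that parses every sampled string value."""
    def fits(cast, v):
        try:
            cast(v)
            return True
        except ValueError:
            return False

    if all(fits(int, v) for v in values):
        return "int"
    if all(fits(float, v) for v in values):
        return "double"
    return "string"
```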
30
Q

What is cloudFiles?

A

➡️cloudFiles is the special format name you use when reading streaming data from cloud storage (such as AWS S3, Azure Blob Storage/ADLS, or Google Cloud Storage).
31
Q

What is the checkpoint folder?

A

➡️Stores the state of your streaming job:
✅offsets/
✅commits/
✅sources/
✅metadata/
✅rocksDB/ — stores state in RocksDB for incremental file discovery
✅schema/
32
Q

What is the data folder?

A

➡️The target location where the stream writes partitioned Parquet data files together with the _delta_log transaction log.