What is Unity Catalog?
➡️ Unified governance solution for all data assets — tables, files, machine learning models — across workspaces.
➡️It provides the following at the account level, instead of managing permissions per workspace:
✅Centralized access control
✅Lineage
✅Auditing
✅Data discovery
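Centralized access control is expressed with standard SQL GRANTs against Unity Catalog objects; a minimal sketch (the catalog, schema, and group names here are hypothetical):

```sql
-- Grant browse + read access on one table to an account-level group
GRANT USE CATALOG ON CATALOG sales TO `data_analysts`;
GRANT USE SCHEMA ON SCHEMA sales.retail TO `data_analysts`;
GRANT SELECT ON TABLE sales.retail.orders TO `data_analysts`;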
What is a metastore?
➡️The central repository that stores metadata about your data assets
What are data assets?
✅Tables (Managed or External)
✅Views
✅Volumes (unstructured or semi-structured files in cloud storage)
✅Functions
✅Machine Learning Models
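Volumes, for instance, are created with SQL and then addressed by path; a hedged sketch, with hypothetical catalog/schema/volume names:

```sql
-- Managed volume for semi-structured files
CREATE VOLUME IF NOT EXISTS sales.retail.landing_files;
-- Files in the volume are then addressable under a /Volumes path
LIST '/Volumes/sales/retail/landing_files/';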
What is the difference between Unity Catalog and the metastore?
➡️The metastore is like a database that stores the metadata of data assets
➡️Unity Catalog is the management system built on top of the metastore
What are managed tables?
➡️Databricks controls both the metadata and the underlying data files
➡️If you drop the table, both metadata and data are deleted
⚒️CREATE TABLE sales.retail.orders (id INT, amount DOUBLE);
What are external tables?
➡️ Databricks stores only the metadata, while the actual data files remain in an external location that you manage
➡️When you drop the table, only the metadata is removed — the data files stay in place.
⚒️CREATE TABLE sales.retail.orders_external (
  order_id BIGINT,
  customer_id BIGINT)
USING DELTA
LOCATION 'abfss://raw-data@storageaccountname.dfs.core.windows.net/bronze/orders';
When is data created in the DBFS location?
➡️Only when Unity Catalog is not used: storing managed tables under DBFS was the behavior of the legacy Hive metastore, which is now deprecated.
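Before an external table like the one above can be created, Unity Catalog needs an external location (backed by a storage credential) that authorizes access to that path; a hedged sketch, where the credential, location, and group names are hypothetical:

```sql
-- External location mapping a cloud path to a storage credential
CREATE EXTERNAL LOCATION raw_bronze
URL 'abfss://raw-data@storageaccountname.dfs.core.windows.net/bronze'
WITH (STORAGE CREDENTIAL azure_mi_cred);

-- Allow a group to create external tables under that location
GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION raw_bronze TO `data_engineers`;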
What is a workspace?
➡️Collaboration environment in Databricks where users run notebooks, jobs, queries, and manage resources.
➡️Environment separation – e.g., Dev, Test, Prod.
Databricks Hierarchy
✅Account
✅Region
✅Metastore
✅Workspace
✅Catalog
✅Schema
✅Table / View / Function / Volume
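The catalog → schema → table levels of this hierarchy surface in SQL as the three-level namespace; for example (object names hypothetical):

```sql
USE CATALOG sales;
USE SCHEMA retail;
-- Equivalent fully qualified three-level reference:
SELECT * FROM sales.retail.orders LIMIT 10;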
How many workspaces can we create in one databricks account?
➡️Multiple; there is no fixed limit in the hierarchy itself, though account-level quotas may apply
Can we create multiple workspaces in the same region?
➡️Yes; all workspaces in a region attach to that region's single metastore:
Account: Company_Databricks
Region: East US
Metastore: eastus_metastore (1 metastore per account per region)
├── Workspace: Finance_Dev (shares eastus_metastore)
├── Workspace: Finance_Prod (shares eastus_metastore)
└── Workspace: Sales_Analytics (shares eastus_metastore)
Region: West Europe
Metastore: weu_metastore (1 metastore per account per region)
├── Workspace: Marketing_Dev (shares weu_metastore)
└── Workspace: Marketing_Prod (shares weu_metastore)
Region: Central India
Metastore: ci_metastore (1 metastore per account per region)
├── Workspace: R&D_Analytics (shares ci_metastore)
├── Workspace: R&D_Test (shares ci_metastore)
└── Workspace: R&D_Prod (shares ci_metastore)
What's the purpose of External Data Access in Databricks?
➡️Allows engines and tools outside Databricks to access Unity Catalog tables
➡️Catalog -> Settings -> Metastore -> Enable External Data Access
What is auto loader?
➡️Incrementally ingest new data files from cloud storage into Delta Lake tables without having to repeatedly scan the entire directory.
➡️Uses a checkpointed metadata store to track which files have already been processed.
Which file formats does Auto Loader support?
✅CSV
✅JSON
✅XML
✅Parquet
✅Avro
✅ORC
✅Text
✅Binary files (binaryFile)
How is Auto Loader built on Spark Streaming?
➡️Built on top of spark structured streaming
➡️ Auto Loader leverages Spark Structured Streaming APIs to watch a directory and process files as they arrive.
What are file discovery modes in Auto Loader?
➡️ Two main modes of file detection, both implemented as Spark Structured Streaming sources:
✅File Notification Mode
➡️Uses cloud-native event services (e.g., Azure Event Grid, AWS SQS/SNS, GCS Pub/Sub).
➡️When a file lands in cloud storage, an event is sent to Databricks, so the stream job immediately knows there’s new data.
✅Directory Listing Mode
➡️Incrementally lists the directory but with optimizations to avoid full scans.
➡️Uses a persisted checkpoint of already processed files to avoid duplicates.
How does incremental processing work in Auto Loader?
➡️Auto Loader maintains checkpoint locations in DBFS or external storage.
➡️The checkpoint stores
✅Offsets (list of processed files)
✅Schema Information
✅Progress Logs
What is RocksDB in Auto Loader?
➡️ High-performance embedded database to store and manage state information during streaming ingestion.
➡️Stores metadata (file path, modification time, size, etc.)
➡️Optimized for key-value lookups
What are cloudFiles options?
✅cloudFiles.format — json | csv | parquet | avro | binaryFile
✅cloudFiles.schemaLocation — required to track the inferred schema
✅cloudFiles.useNotifications — true | false (use cloud event notifications)
✅cloudFiles.schemaEvolutionMode — addNewColumns | rescue | failOnNewColumns | none
✅cloudFiles.includeExistingFiles — true | false (process existing files when starting)
✅recursiveFileLookup — true | false (scan nested directories)
✅cloudFiles.maxFilesPerTrigger / cloudFiles.maxBytesPerTrigger — rate limiting
✅cloudFiles.useIncrementalListing — auto | true | false (incremental directory listing)
Simple Auto Loader — CSV (Directory listing mode)
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("cloudFiles.inferColumnTypes", "true")  # Auto Loader's CSV type inference
    .option("cloudFiles.schemaLocation", schema_location)
    .load(source_path))

(df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .toTable("bronze_orders"))  # or .save(target_delta_path)
JSON with Schema Evolution (add new columns automatically)
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_location)
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # allow adding new columns
    .load(source_path))

(df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .option("mergeSchema", "true")  # important: allows the Delta table schema to be updated
    .outputMode("append")
    .toTable("bronze_orders_json"))
Large-scale ingestion — File Notification Mode (recommended at scale)
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", schema_location)
    .option("cloudFiles.useNotifications", "true")  # enable notification mode
    .load(source_path))

(df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .toTable("bronze_parquet"))
Partitioned Writes & Performance Options
(df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .partitionBy("order_date")  # partition column
    .option("mergeSchema", "true")
    .trigger(availableNow=True)  # optional: process all available files, then stop; omit for default micro-batch
    .toTable("orders_bronze_parted"))
Exactly-once / Idempotent upserts into Delta (foreachBatch + MERGE)
def upsert_to_delta(microbatch_df, batch_id):
    if microbatch_df.rdd.isEmpty():
        return
    # Expose the micro-batch as a temp view for the SQL MERGE
    microbatch_df.createOrReplaceTempView("micro_orders")
    # Merge into the existing Delta table, deduplicating by order_id (example)
    microbatch_df.sparkSession.sql("""
        MERGE INTO silver.orders AS target
        USING micro_orders AS source
        ON target.order_id = source.order_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

query = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_location)
    .load(source_path)
    .writeStream
    .foreachBatch(upsert_to_delta)
    .option("checkpointLocation", checkpoint_path)
    .start())