Batch vs Streaming
When should you choose streaming over batch processing?
Use streaming when:
Low latency is required (near real-time insights)
Continuous data arrival (events, logs, IoT)
Use batch when:
Latency is not critical
Data arrives in bulk
💡 Most systems today use hybrid (Lambda/Kappa-style) architectures.
Structured Streaming
What is Structured Streaming in Databricks?
A high-level API built on Spark
Treats streaming as incremental batch processing
💡 Same code works for batch and streaming → unified model.
Micro-Batch vs Continuous Processing
What is the difference between micro-batch and continuous processing?
Micro-batch: processes data in small intervals (default)
Continuous: processes data row-by-row (lower latency; experimental and rarely used)
💡 Most production systems use micro-batch for stability.
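The micro-batch model can be sketched in plain Python (illustrative names only, not Spark APIs; a count-based trigger stands in for Spark's time-based one):

```python
# Minimal sketch of the micro-batch model: buffered events are processed
# together at each trigger instead of one row at a time.
def run_micro_batches(events, batch_size):
    """Group incoming events into small batches and process each batch."""
    results = []
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:       # the "trigger" fires
            results.append(sum(batch))     # stand-in for a batch job
            batch = []
    if batch:                              # flush the final partial batch
        results.append(sum(batch))
    return results

print(run_micro_batches([1, 2, 3, 4, 5], batch_size=2))  # [3, 7, 5]
```

Continuous processing would instead handle each event as it arrives, trading throughput for latency.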
Checkpointing
Why is checkpointing critical in streaming pipelines?
Stores progress (offsets, state)
Enables recovery after failure
💡 Without checkpointing → duplicate or lost data.
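A minimal sketch of offset checkpointing, assuming a JSON file as the checkpoint store (Spark actually uses a checkpoint directory with write-ahead offset files, not this format):

```python
# Sketch of offset checkpointing: persist the last processed offset so a
# restart resumes where the previous run stopped.
import json
import os
import tempfile

def process_stream(records, checkpoint_path):
    """Process records, resuming from the checkpointed offset."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]
    processed = []
    for i, rec in enumerate(records):
        if i < offset:
            continue                          # already handled in a prior run
        processed.append(rec)
        with open(checkpoint_path, "w") as f:
            json.dump({"offset": i + 1}, f)   # commit progress
    return processed

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
print(process_stream(["a", "b", "c"], ckpt))        # ['a', 'b', 'c']
print(process_stream(["a", "b", "c", "d"], ckpt))   # ['d']  (resumes at offset 3)
```

After a crash, the second run skips everything the checkpoint already recorded, which is what prevents duplicates on restart.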
Exactly-Once Semantics
How does Databricks achieve exactly-once processing?
Checkpointing tracks processed data
Delta ensures transactional writes
💡 Combination ensures:
No duplicates
No data loss (in most scenarios)
Idempotent Pipelines
How do you design an idempotent data pipeline?
Techniques:
Use MERGE instead of INSERT
Deduplicate using keys
Track processed files/events
💡 Essential for retries and failure recovery.
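The upsert semantics that MERGE provides can be sketched in plain Python; replaying the same batch leaves the target unchanged, which is exactly the idempotency property:

```python
# Sketch of idempotent upsert semantics (what MERGE INTO provides):
# matched keys are updated, unmatched keys are inserted.
def merge(target, updates, key="id"):
    """Upsert rows into target keyed on `key`."""
    by_key = {row[key]: row for row in target}
    for row in updates:
        by_key[row[key]] = row        # update if present, insert otherwise
    return sorted(by_key.values(), key=lambda r: r[key])

target = [{"id": 1, "v": "old"}]
batch = [{"id": 1, "v": "new"}, {"id": 2, "v": "x"}]
once = merge(target, batch)
twice = merge(once, batch)            # retrying the same batch
print(once == twice)                  # True: safe to retry
```

A plain INSERT of the same batch would instead duplicate row 2 and conflict on row 1.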
Auto Loader
What is Auto Loader and when should you use it?
Incremental file ingestion tool (cloudFiles)
Efficiently detects new files
Use when:
Large-scale file ingestion (e.g., millions of files)
💡 Preferred over manual ingestion.
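The core idea, tracking which files have already been ingested, can be sketched as follows (Auto Loader itself uses cloud notifications and a RocksDB-backed ledger, not an in-memory set):

```python
# Sketch of incremental file discovery: only files not yet recorded in a
# processed-set are ingested, instead of re-listing and re-reading everything.
def discover_new_files(listing, processed):
    """Return files not seen before and record them as processed."""
    new_files = [f for f in listing if f not in processed]
    processed.update(new_files)
    return new_files

seen = set()
print(discover_new_files(["a.json", "b.json"], seen))            # ['a.json', 'b.json']
print(discover_new_files(["a.json", "b.json", "c.json"], seen))  # ['c.json']
```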
Auto Loader vs COPY INTO
What is the difference between Auto Loader and COPY INTO?
Auto Loader: streaming + scalable + incremental
COPY INTO: batch-oriented
💡 Auto Loader is better for:
Continuous ingestion
Large volumes
Schema Evolution in Streaming
How do you handle schema changes in streaming pipelines?
Enable schema evolution carefully
Validate schema changes
Use a schema registry if needed
💡 Uncontrolled evolution can break downstream pipelines.
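"Careful" evolution can be sketched as a validation step that accepts additive columns but rejects removed or re-typed ones (an illustrative helper, not a Databricks API):

```python
# Sketch of controlled schema evolution: new columns are accepted
# (additive change), but dropped or re-typed columns are rejected.
def check_schema(expected, incoming):
    """Return (ok, added_columns); fail on removed or re-typed columns."""
    removed = set(expected) - set(incoming)
    retyped = {c for c in expected if c in incoming and incoming[c] != expected[c]}
    if removed or retyped:
        return False, []
    return True, sorted(set(incoming) - set(expected))

expected = {"id": "int", "name": "string"}
print(check_schema(expected, {"id": "int", "name": "string", "email": "string"}))
# (True, ['email'])  -> safe additive change
print(check_schema(expected, {"id": "string", "name": "string"}))
# (False, [])        -> breaking change, reject
```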
Late-Arriving Data
How do you handle late-arriving data in streaming?
Use watermarking
Allow updates via MERGE
💡 Ensures correctness without infinite waiting.
Watermarking
What is watermarking in streaming?
Defines how long to wait for late data
Helps manage state and memory
💡 Tradeoff:
Too short → data loss
Too long → high memory usage
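Watermark semantics can be sketched in plain Python: events older than (max event time seen − delay) are dropped, which is what bounds the state a streaming job must keep:

```python
# Sketch of watermark semantics: late events falling behind the watermark
# (max event time seen minus the allowed delay) are dropped.
def apply_watermark(events, delay):
    """Split (event_time, value) pairs into kept and too-late events."""
    max_time = float("-inf")
    kept, dropped = [], []
    for t, value in events:
        max_time = max(max_time, t)
        if t >= max_time - delay:
            kept.append((t, value))
        else:
            dropped.append((t, value))   # too late: state already finalized
    return kept, dropped

events = [(10, "a"), (12, "b"), (5, "late"), (11, "c")]
kept, dropped = apply_watermark(events, delay=3)
print(kept)     # [(10, 'a'), (12, 'b'), (11, 'c')]
print(dropped)  # [(5, 'late')]  with delay=3, the watermark is 12 - 3 = 9
```

A larger `delay` would have kept the late event, at the cost of holding window state longer.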
Trigger Modes
What are trigger modes in Structured Streaming?
Fixed interval / processing time (e.g., every 1 min)
Once (a single micro-batch, then stop)
AvailableNow (process all available data in multiple batches, then stop)
💡 Controls frequency of execution.
Stateful vs Stateless Processing
What is the difference between stateful and stateless streaming?
Stateless → each record independent
Stateful → depends on past data (e.g., aggregations)
💡 Stateful requires more memory and checkpointing.
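The distinction in plain Python (in a real streaming job the running state below would also have to be checkpointed):

```python
# Stateless transform: looks at one record at a time, no memory of the past.
def stateless(record):
    return record.upper()

# Stateful transform: a running count per key, i.e. state that grows with
# the number of keys seen and must survive restarts.
def stateful_counts(records):
    counts = {}
    out = []
    for r in records:
        counts[r] = counts.get(r, 0) + 1
        out.append((r, counts[r]))
    return out

print([stateless(r) for r in ["a", "b"]])   # ['A', 'B']
print(stateful_counts(["a", "b", "a"]))     # [('a', 1), ('b', 1), ('a', 2)]
```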
Handling Duplicates
How do you handle duplicate data in pipelines?
Use unique keys
Deduplicate using window functions
Use MERGE logic
💡 Duplicate handling is critical in distributed systems.
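Key-based deduplication, equivalent in spirit to a ROW_NUMBER() window partitioned by key and ordered by timestamp, keeping only the latest row per key:

```python
# Sketch of key-based deduplication: keep the most recent row per key.
def dedupe_latest(rows, key="id", ts="ts"):
    """Keep only the row with the highest timestamp for each key."""
    latest = {}
    for row in rows:
        cur = latest.get(row[key])
        if cur is None or row[ts] > cur[ts]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])

rows = [
    {"id": 1, "ts": 1, "v": "stale"},
    {"id": 1, "ts": 2, "v": "fresh"},
    {"id": 2, "ts": 1, "v": "only"},
]
print(dedupe_latest(rows))
# [{'id': 1, 'ts': 2, 'v': 'fresh'}, {'id': 2, 'ts': 1, 'v': 'only'}]
```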
CDC (Change Data Capture)
How do you implement CDC pipelines in Databricks?
Use MERGE INTO
Track inserts/updates/deletes
Use Delta Change Data Feed if available
💡 Enables incremental updates instead of full reloads.
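A sketch of applying a CDC feed, where each change record carries an operation that is applied to the current table state (Delta's MERGE INTO does this transactionally; the record shape here is illustrative):

```python
# Sketch of applying change data capture events: insert/update become an
# upsert, delete removes the key from the table state.
def apply_changes(table, changes, key="id"):
    """Apply insert/update/delete change events to a keyed table."""
    state = {row[key]: row for row in table}
    for change in changes:
        op, row = change["op"], change["row"]
        if op == "delete":
            state.pop(row[key], None)
        else:                              # insert or update -> upsert
            state[row[key]] = row
    return sorted(state.values(), key=lambda r: r[key])

table = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
changes = [
    {"op": "update", "row": {"id": 1, "v": "a2"}},
    {"op": "delete", "row": {"id": 2}},
    {"op": "insert", "row": {"id": 3, "v": "c"}},
]
print(apply_changes(table, changes))
# [{'id': 1, 'v': 'a2'}, {'id': 3, 'v': 'c'}]
```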
Designing a File Ingestion Pipeline
How would you design a pipeline to ingest millions of files daily?
Use Auto Loader
Store raw data in Bronze
Enable schema evolution
Optimize with checkpointing
💡 Avoid listing all files manually → scalability issue.
Failure Handling
How do you handle failures in data pipelines?
Retry mechanisms
Idempotent design
Checkpointing
Logging and alerting
💡 Pipelines must be resilient by design.
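A minimal retry wrapper (illustrative; real schedulers add backoff and jitter). Combined with idempotent tasks, retrying a failed run is safe:

```python
# Sketch of a bounded retry mechanism: transient failures are retried a
# fixed number of times before surfacing the error for alerting.
def with_retries(task, max_attempts=3):
    """Run `task`, retrying on failure up to `max_attempts` times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts:
                raise                      # exhausted: surface for alerting
            print(f"attempt {attempt} failed ({exc}), retrying")

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(with_retries(flaky))  # "ok" after two failed attempts
```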
Incremental vs Full Load
Why is incremental processing preferred over full load?
Faster
Cheaper
Scales better
💡 Full load is only for:
Initial load
Rare reprocessing scenarios
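Incremental loading is often implemented as a high-water-mark filter; a sketch (the column name `updated_at` is an assumption for illustration):

```python
# Sketch of incremental loading: only rows newer than the last high-water
# mark are pulled, instead of re-reading the full source every run.
def incremental_load(source, last_watermark):
    """Return rows newer than the watermark, plus the new watermark."""
    new_rows = [r for r in source if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark

source = [{"id": 1, "updated_at": 100}, {"id": 2, "updated_at": 205}]
rows, wm = incremental_load(source, last_watermark=150)
print(rows, wm)  # [{'id': 2, 'updated_at': 205}] 205
```

The returned watermark is persisted and passed to the next run, so each run's cost scales with new data, not total data.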
Orchestration (Databricks Jobs)
How do you orchestrate pipelines in Databricks?
Use Jobs / Workflows
Define task dependencies
Schedule runs
💡 Supports:
Retry logic
Monitoring
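The dependency ordering at the heart of a workflow engine can be sketched as a small DAG runner (illustrative, not the Databricks Jobs API):

```python
# Sketch of dependency-ordered task execution: a task runs only after all
# of its upstream tasks have finished.
def run_dag(tasks, deps):
    """tasks: name -> callable; deps: name -> list of upstream task names."""
    done, order = set(), []
    def run(name):
        if name in done:
            return
        for up in deps.get(name, []):
            run(up)                       # ensure upstreams finished first
        tasks[name]()
        done.add(name)
        order.append(name)
    for name in tasks:
        run(name)
    return order

log = []
tasks = {
    "gold":   lambda: log.append("gold"),
    "bronze": lambda: log.append("bronze"),
    "silver": lambda: log.append("silver"),
}
deps = {"silver": ["bronze"], "gold": ["silver"]}
print(run_dag(tasks, deps))  # ['bronze', 'silver', 'gold']
```

Even though "gold" is listed first, dependencies force the bronze → silver → gold order.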
Bronze Layer Design
What is the best practice for Bronze layer ingestion?
Append-only
Minimal transformation
Store raw data
💡 Ensures replayability.
Silver Layer Design
What transformations belong in the Silver layer?
Data cleaning
Deduplication
Joins
💡 Creates a trusted dataset.
Gold Layer Design
What is the purpose of Gold layer tables?
Business-level aggregations
Optimized for BI queries
💡 Should be:
Clean
Fast
Easy to query
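The three layers can be sketched end to end in plain Python (toy data; real pipelines would use Delta tables for each layer):

```python
# Sketch of the medallion flow: Bronze keeps raw rows append-only,
# Silver cleans and deduplicates, Gold aggregates for BI.
raw = [
    {"id": 1, "amount": "10", "country": "US"},
    {"id": 1, "amount": "10", "country": "US"},   # duplicate event
    {"id": 2, "amount": "5",  "country": "DE"},
]

# Bronze: append-only, untransformed copy of the source
bronze = list(raw)

# Silver: cast types and deduplicate by key
silver = list({r["id"]: {**r, "amount": int(r["amount"])} for r in bronze}.values())

# Gold: business-level aggregate (revenue per country)
gold = {}
for r in silver:
    gold[r["country"]] = gold.get(r["country"], 0) + r["amount"]

print(gold)  # {'US': 10, 'DE': 5}
```

Because Bronze retains the raw duplicates, Silver and Gold can always be rebuilt from it if the cleaning logic changes.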
Data Quality Checks
Where should data quality checks be applied?
Primarily in Silver layer
Optionally in Bronze for critical validations
💡 Prevents bad data propagation.
Streaming vs Batch Cost Tradeoff
What is the cost tradeoff between streaming and batch?
Streaming → always running → higher cost
Batch → scheduled → cheaper
💡 Choose based on latency requirements.