Batch vs Streaming
When should you choose streaming over batch processing?
Use streaming when:
Low latency is required (near real-time insights)
Continuous data arrival (events, logs, IoT)
Use batch when:
Latency is not critical
Data arrives in bulk
💡 Most systems today use hybrid (Lambda/Kappa-style) architectures.
Structured Streaming
What is Structured Streaming in Databricks?
A high-level API built on Spark
Treats streaming as incremental batch processing
💡 Same code works for batch and streaming → unified model.
Micro-Batch vs Continuous Processing
What is the difference between micro-batch and continuous processing?
Micro-batch: processes data in small intervals (default)
Continuous: processes data row-by-row (lower latency; experimental and rarely used)
💡 Most production systems use micro-batch for stability.
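The micro-batch model can be sketched in plain Python (illustrative names only, not Spark APIs; a count-based trigger stands in for Spark's time-based one):

```python
# Minimal sketch of the micro-batch model: buffered events are processed
# together at each trigger instead of one row at a time.
def run_micro_batches(events, batch_size):
    """Group incoming events into small batches and process each batch."""
    results = []
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:       # the "trigger" fires
            results.append(sum(batch))     # stand-in for a batch job
            batch = []
    if batch:                              # flush the final partial batch
        results.append(sum(batch))
    return results

print(run_micro_batches([1, 2, 3, 4, 5], batch_size=2))  # [3, 7, 5]
```

Continuous processing would instead handle each event as it arrives, trading throughput for latency.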
Checkpointing
Why is checkpointing critical in streaming pipelines?
Stores progress (offsets, state)
Enables recovery after failure
💡 Without checkpointing → duplicate or lost data.
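A minimal sketch of offset checkpointing, assuming a JSON file as the checkpoint store (Spark actually uses a checkpoint directory with write-ahead offset files, not this format):

```python
# Sketch of offset checkpointing: persist the last processed offset so a
# restart resumes where the previous run stopped.
import json
import os
import tempfile

def process_stream(records, checkpoint_path):
    """Process records, resuming from the checkpointed offset."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]
    processed = []
    for i, rec in enumerate(records):
        if i < offset:
            continue                          # already handled in a prior run
        processed.append(rec)
        with open(checkpoint_path, "w") as f:
            json.dump({"offset": i + 1}, f)   # commit progress
    return processed

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
print(process_stream(["a", "b", "c"], ckpt))        # ['a', 'b', 'c']
print(process_stream(["a", "b", "c", "d"], ckpt))   # ['d']  (resumes at offset 3)
```

After a crash, the second run skips everything the checkpoint already recorded, which is what prevents duplicates on restart.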
Exactly-Once Semantics
How does Databricks achieve exactly-once processing?
Checkpointing tracks processed data
Delta ensures transactional writes
💡 Combination ensures:
No duplicates
No data loss (in most scenarios)
Idempotent Pipelines
How do you design an idempotent data pipeline?
Techniques:
Use MERGE instead of INSERT
Deduplicate using keys
Track processed files/events
💡 Essential for retries and failure recovery.
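The upsert semantics that MERGE provides can be sketched in plain Python; replaying the same batch leaves the target unchanged, which is exactly the idempotency property:

```python
# Sketch of idempotent upsert semantics (what MERGE INTO provides):
# matched keys are updated, unmatched keys are inserted.
def merge(target, updates, key="id"):
    """Upsert rows into target keyed on `key`."""
    by_key = {row[key]: row for row in target}
    for row in updates:
        by_key[row[key]] = row        # update if present, insert otherwise
    return sorted(by_key.values(), key=lambda r: r[key])

target = [{"id": 1, "v": "old"}]
batch = [{"id": 1, "v": "new"}, {"id": 2, "v": "x"}]
once = merge(target, batch)
twice = merge(once, batch)            # retrying the same batch
print(once == twice)                  # True: safe to retry
```

A plain INSERT of the same batch would instead duplicate row 2 and conflict on row 1.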
Auto Loader
What is Auto Loader and when should you use it?
Incremental file ingestion tool (cloudFiles)
Efficiently detects new files
Use when:
Large-scale file ingestion (e.g., millions of files)
💡 Preferred over manual ingestion.
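The core idea, tracking which files have already been ingested, can be sketched as follows (Auto Loader itself uses cloud notifications and a RocksDB-backed ledger, not an in-memory set):

```python
# Sketch of incremental file discovery: only files not yet recorded in a
# processed-set are ingested, instead of re-listing and re-reading everything.
def discover_new_files(listing, processed):
    """Return files not seen before and record them as processed."""
    new_files = [f for f in listing if f not in processed]
    processed.update(new_files)
    return new_files

seen = set()
print(discover_new_files(["a.json", "b.json"], seen))            # ['a.json', 'b.json']
print(discover_new_files(["a.json", "b.json", "c.json"], seen))  # ['c.json']
```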
Auto Loader vs COPY INTO
What is the difference between Auto Loader and COPY INTO?
Auto Loader: streaming + scalable + incremental
COPY INTO: batch-oriented
💡 Auto Loader is better for:
Continuous ingestion
Large volumes
Schema Evolution in Streaming
How do you handle schema changes in streaming pipelines?
Enable schema evolution carefully
Validate schema changes
Use a schema registry if needed
💡 Uncontrolled evolution can break downstream pipelines.
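"Careful" evolution can be sketched as a validation step that accepts additive columns but rejects removed or re-typed ones (an illustrative helper, not a Databricks API):

```python
# Sketch of controlled schema evolution: new columns are accepted
# (additive change), but dropped or re-typed columns are rejected.
def check_schema(expected, incoming):
    """Return (ok, added_columns); fail on removed or re-typed columns."""
    removed = set(expected) - set(incoming)
    retyped = {c for c in expected if c in incoming and incoming[c] != expected[c]}
    if removed or retyped:
        return False, []
    return True, sorted(set(incoming) - set(expected))

expected = {"id": "int", "name": "string"}
print(check_schema(expected, {"id": "int", "name": "string", "email": "string"}))
# (True, ['email'])  -> safe additive change
print(check_schema(expected, {"id": "string", "name": "string"}))
# (False, [])        -> breaking change, reject
```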
Late-Arriving Data
How do you handle late-arriving data in streaming?
Use watermarking
Allow updates via MERGE
💡 Ensures correctness without infinite waiting.
Watermarking
What is watermarking in streaming?
Defines how long to wait for late data
Helps manage state and memory
💡 Tradeoff:
Too short → data loss
Too long → high memory usage
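Watermark semantics can be sketched in plain Python: events older than (max event time seen − delay) are dropped, which is what bounds the state a streaming job must keep:

```python
# Sketch of watermark semantics: late events falling behind the watermark
# (max event time seen minus the allowed delay) are dropped.
def apply_watermark(events, delay):
    """Split (event_time, value) pairs into kept and too-late events."""
    max_time = float("-inf")
    kept, dropped = [], []
    for t, value in events:
        max_time = max(max_time, t)
        if t >= max_time - delay:
            kept.append((t, value))
        else:
            dropped.append((t, value))   # too late: state already finalized
    return kept, dropped

events = [(10, "a"), (12, "b"), (5, "late"), (11, "c")]
kept, dropped = apply_watermark(events, delay=3)
print(kept)     # [(10, 'a'), (12, 'b'), (11, 'c')]
print(dropped)  # [(5, 'late')]  with delay=3, the watermark is 12 - 3 = 9
```

A larger `delay` would have kept the late event, at the cost of holding window state longer.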
Trigger Modes
What are trigger modes in Structured Streaming?
Fixed interval / processing time (e.g., every 1 min)
Once (a single micro-batch, then stop)
AvailableNow (process all available data in multiple batches, then stop)
💡 Controls frequency of execution.
Stateful vs Stateless Processing
What is the difference between stateful and stateless streaming?
Stateless → each record independent
Stateful → depends on past data (e.g., aggregations)
💡 Stateful requires more memory and checkpointing.
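The distinction in plain Python (in a real streaming job the running state below would also have to be checkpointed):

```python
# Stateless transform: looks at one record at a time, no memory of the past.
def stateless(record):
    return record.upper()

# Stateful transform: a running count per key, i.e. state that grows with
# the number of keys seen and must survive restarts.
def stateful_counts(records):
    counts = {}
    out = []
    for r in records:
        counts[r] = counts.get(r, 0) + 1
        out.append((r, counts[r]))
    return out

print([stateless(r) for r in ["a", "b"]])   # ['A', 'B']
print(stateful_counts(["a", "b", "a"]))     # [('a', 1), ('b', 1), ('a', 2)]
```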
Handling Duplicates
How do you handle duplicate data in pipelines?
Use unique keys
Deduplicate using window functions
Use MERGE logic
💡 Duplicate handling is critical in distributed systems.
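Key-based deduplication, equivalent in spirit to a ROW_NUMBER() window partitioned by key and ordered by timestamp, keeping only the latest row per key:

```python
# Sketch of key-based deduplication: keep the most recent row per key.
def dedupe_latest(rows, key="id", ts="ts"):
    """Keep only the row with the highest timestamp for each key."""
    latest = {}
    for row in rows:
        cur = latest.get(row[key])
        if cur is None or row[ts] > cur[ts]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])

rows = [
    {"id": 1, "ts": 1, "v": "stale"},
    {"id": 1, "ts": 2, "v": "fresh"},
    {"id": 2, "ts": 1, "v": "only"},
]
print(dedupe_latest(rows))
# [{'id': 1, 'ts': 2, 'v': 'fresh'}, {'id': 2, 'ts': 1, 'v': 'only'}]
```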
CDC (Change Data Capture)
How do you implement CDC pipelines in Databricks?
Use MERGE INTO
Track inserts/updates/deletes
Use Delta Change Data Feed if available
💡 Enables incremental updates instead of full reloads.
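A sketch of applying a CDC feed, where each change record carries an operation that is applied to the current table state (Delta's MERGE INTO does this transactionally; the record shape here is illustrative):

```python
# Sketch of applying change data capture events: insert/update become an
# upsert, delete removes the key from the table state.
def apply_changes(table, changes, key="id"):
    """Apply insert/update/delete change events to a keyed table."""
    state = {row[key]: row for row in table}
    for change in changes:
        op, row = change["op"], change["row"]
        if op == "delete":
            state.pop(row[key], None)
        else:                              # insert or update -> upsert
            state[row[key]] = row
    return sorted(state.values(), key=lambda r: r[key])

table = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
changes = [
    {"op": "update", "row": {"id": 1, "v": "a2"}},
    {"op": "delete", "row": {"id": 2}},
    {"op": "insert", "row": {"id": 3, "v": "c"}},
]
print(apply_changes(table, changes))
# [{'id': 1, 'v': 'a2'}, {'id': 3, 'v': 'c'}]
```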
Designing a File Ingestion Pipeline
How would you design a pipeline to ingest millions of files daily?
Use Auto Loader
Store raw data in Bronze
Enable schema evolution
Optimize with checkpointing
💡 Avoid listing all files manually → scalability issue.
Failure Handling
How do you handle failures in data pipelines?
Retry mechanisms
Idempotent design
Checkpointing
Logging and alerting
💡 Pipelines must be resilient by design.
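A minimal retry wrapper (illustrative; real schedulers add backoff and jitter). Combined with idempotent tasks, retrying a failed run is safe:

```python
# Sketch of a bounded retry mechanism: transient failures are retried a
# fixed number of times before surfacing the error for alerting.
def with_retries(task, max_attempts=3):
    """Run `task`, retrying on failure up to `max_attempts` times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_attempts:
                raise                      # exhausted: surface for alerting
            print(f"attempt {attempt} failed ({exc}), retrying")

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(with_retries(flaky))  # "ok" after two failed attempts
```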
Incremental vs Full Load
Why is incremental processing preferred over full load?
Faster
Cheaper
Scales better
💡 Full load is only for:
Initial load
Rare reprocessing scenarios
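Incremental loading is often implemented as a high-water-mark filter; a sketch (the column name `updated_at` is an assumption for illustration):

```python
# Sketch of incremental loading: only rows newer than the last high-water
# mark are pulled, instead of re-reading the full source every run.
def incremental_load(source, last_watermark):
    """Return rows newer than the watermark, plus the new watermark."""
    new_rows = [r for r in source if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark

source = [{"id": 1, "updated_at": 100}, {"id": 2, "updated_at": 205}]
rows, wm = incremental_load(source, last_watermark=150)
print(rows, wm)  # [{'id': 2, 'updated_at': 205}] 205
```

The returned watermark is persisted and passed to the next run, so each run's cost scales with new data, not total data.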
Orchestration (Databricks Jobs)
How do you orchestrate pipelines in Databricks?
Use Jobs / Workflows
Define task dependencies
Schedule runs
💡 Supports:
Retry logic
Monitoring
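The dependency ordering at the heart of a workflow engine can be sketched as a small DAG runner (illustrative, not the Databricks Jobs API):

```python
# Sketch of dependency-ordered task execution: a task runs only after all
# of its upstream tasks have finished.
def run_dag(tasks, deps):
    """tasks: name -> callable; deps: name -> list of upstream task names."""
    done, order = set(), []
    def run(name):
        if name in done:
            return
        for up in deps.get(name, []):
            run(up)                       # ensure upstreams finished first
        tasks[name]()
        done.add(name)
        order.append(name)
    for name in tasks:
        run(name)
    return order

log = []
tasks = {
    "gold":   lambda: log.append("gold"),
    "bronze": lambda: log.append("bronze"),
    "silver": lambda: log.append("silver"),
}
deps = {"silver": ["bronze"], "gold": ["silver"]}
print(run_dag(tasks, deps))  # ['bronze', 'silver', 'gold']
```

Even though "gold" is listed first, dependencies force the bronze → silver → gold order.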
Bronze Layer Design
What is the best practice for Bronze layer ingestion?
Append-only
Minimal transformation
Store raw data
💡 Ensures replayability.
Silver Layer Design
What transformations belong in the Silver layer?
Data cleaning
Deduplication
Joins
💡 Creates a trusted dataset.
Gold Layer Design
What is the purpose of Gold layer tables?
Business-level aggregations
Optimized for BI queries
💡 Should be:
Clean
Fast
Easy to query
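The three layers can be sketched end to end in plain Python (toy data; real pipelines would use Delta tables for each layer):

```python
# Sketch of the medallion flow: Bronze keeps raw rows append-only,
# Silver cleans and deduplicates, Gold aggregates for BI.
raw = [
    {"id": 1, "amount": "10", "country": "US"},
    {"id": 1, "amount": "10", "country": "US"},   # duplicate event
    {"id": 2, "amount": "5",  "country": "DE"},
]

# Bronze: append-only, untransformed copy of the source
bronze = list(raw)

# Silver: cast types and deduplicate by key
silver = list({r["id"]: {**r, "amount": int(r["amount"])} for r in bronze}.values())

# Gold: business-level aggregate (revenue per country)
gold = {}
for r in silver:
    gold[r["country"]] = gold.get(r["country"], 0) + r["amount"]

print(gold)  # {'US': 10, 'DE': 5}
```

Because Bronze retains the raw duplicates, Silver and Gold can always be rebuilt from it if the cleaning logic changes.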
Data Quality Checks
Where should data quality checks be applied?
Primarily in Silver layer
Optionally in Bronze for critical validations
💡 Prevents bad data propagation.
Streaming vs Batch Cost Tradeoff
What is the cost tradeoff between streaming and batch?
Streaming → always running → higher cost
Batch → scheduled → cheaper
💡 Choose based on latency requirements.