Databricks Interview Prep - Streaming & Pipeline Design Flashcards

(25 cards)

1
Q

Batch vs Streaming

When should you choose streaming over batch processing?

A

Use streaming when:
Low latency is required (near real-time insights)
Continuous data arrival (events, logs, IoT)
Use batch when:
Latency is not critical
Data arrives in bulk
πŸ‘‰ Most systems today use hybrid (Lambda/Kappa-style) architectures.

2
Q

Structured Streaming

What is Structured Streaming in Databricks?

A

A high-level API built on Spark
Treats streaming as incremental batch processing
πŸ‘‰ Same code works for batch and streaming β†’ unified model.
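The unified model can be illustrated with a minimal sketch: the same pure transformation function is shared by the batch and streaming code paths, and only the reader changes. The record shape, paths, and names below are illustrative assumptions, not a specific Databricks setup.

```python
# Sketch of the unified batch/streaming model: the same business
# logic is reused in both modes. All names here are hypothetical.

def enrich(record: dict) -> dict:
    """Transformation shared by batch and streaming pipelines."""
    return {**record, "amount_usd": record["amount"] * record["fx_rate"]}

# Batch (hypothetical):
#   df = spark.read.json("/data/events/")
# Streaming (same logic, incremental source):
#   df = spark.readStream.json("/data/events/")
# In both cases the enrichment is applied identically, e.g. via a UDF
# or equivalent DataFrame expressions.

print(enrich({"amount": 10.0, "fx_rate": 1.1}))
```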

3
Q

Micro-Batch vs Continuous Processing

What is the difference between micro-batch and continuous processing?

A

Micro-batch: processes data in small intervals (default)
Continuous: processes data row-by-row (lower latency, less used)
πŸ‘‰ Most production systems use micro-batch for stability.

4
Q

Checkpointing

Why is checkpointing critical in streaming pipelines?

A

Stores progress (offsets, state)
Enables recovery after failure
πŸ‘‰ Without checkpointing β†’ duplicate or lost data.
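A minimal sketch of wiring this up: the key piece is the `checkpointLocation` option on the streaming write, where offsets and state are persisted and reused on restart. The path and table name below are hypothetical.

```python
# Sketch: collecting the writeStream options that make a streaming
# query recoverable. The checkpoint path is a hypothetical example.

def streaming_sink_options(checkpoint_path: str) -> dict:
    if not checkpoint_path:
        raise ValueError("a checkpoint location is required for recovery")
    return {
        # Offsets and state are persisted here and reused on restart.
        "checkpointLocation": checkpoint_path,
    }

opts = streaming_sink_options("/chk/orders_stream")
# Hypothetical usage:
#   df.writeStream.options(**opts).toTable("bronze.orders")
print(opts)
```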

5
Q

Exactly-Once Semantics

How does Databricks achieve exactly-once processing?

A

Checkpointing tracks processed data
Delta ensures transactional writes
πŸ‘‰ Combination ensures:
No duplicates
No data loss (in most scenarios)

6
Q

Idempotent Pipelines

How do you design an idempotent data pipeline?

A

Techniques:
Use MERGE instead of INSERT
Deduplicate using keys
Track processed files/events
πŸ‘‰ Essential for retries and failure recovery.
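The MERGE technique can be sketched as a small helper that builds a Delta `MERGE INTO` statement keyed on business identifiers; rerunning it produces the same result, which is what makes the pipeline idempotent. Table and column names are hypothetical; `UPDATE SET *` / `INSERT *` is Delta's shorthand for matching all columns.

```python
# Sketch of an idempotent upsert via Delta MERGE. Names are hypothetical.

def build_merge_sql(target: str, source: str, keys: list) -> str:
    """Build a MERGE statement keyed on the given columns."""
    cond = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    return (
        f"MERGE INTO {target} t USING {source} s ON {cond} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )

sql = build_merge_sql("silver.orders", "orders_updates", ["order_id"])
# Hypothetical execution: spark.sql(sql) -- safe to re-run after a retry.
print(sql)
```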

7
Q

Auto Loader

What is Auto Loader and when should you use it?

A

Incremental file ingestion tool (cloudFiles)
Efficiently detects new files
Use when:
Large-scale file ingestion (e.g., millions of files)
πŸ‘‰ Preferred over manual ingestion.
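A minimal sketch of the two options Auto Loader always needs: the source file format and a schema-tracking location. Paths are hypothetical; the `cloudFiles.*` option names are Auto Loader's real option keys.

```python
# Sketch: the core Auto Loader (cloudFiles) reader options.
# Paths below are hypothetical examples.

def autoloader_options(fmt: str, schema_location: str) -> dict:
    return {
        "cloudFiles.format": fmt,                      # e.g. "json", "csv", "parquet"
        "cloudFiles.schemaLocation": schema_location,  # where the inferred schema is tracked
    }

opts = autoloader_options("json", "/schemas/orders")
# Hypothetical usage:
#   spark.readStream.format("cloudFiles").options(**opts).load("/landing/orders/")
print(opts)
```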

8
Q

Auto Loader vs COPY INTO

What is the difference between Auto Loader and COPY INTO?

A

Auto Loader: streaming + scalable + incremental
COPY INTO: batch-oriented
πŸ‘‰ Auto Loader is better for:
Continuous ingestion
Large volumes
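For contrast, the batch-oriented side can be sketched as a small builder for a `COPY INTO` statement, which is idempotent per file but re-runs as a batch command rather than a stream. Table and path names are hypothetical.

```python
# Sketch of the batch-oriented alternative: COPY INTO.
# Table and path are hypothetical examples.

def build_copy_into(target: str, source_path: str, fileformat: str) -> str:
    return (
        f"COPY INTO {target} "
        f"FROM '{source_path}' "
        f"FILEFORMAT = {fileformat}"
    )

sql = build_copy_into("bronze.orders", "/landing/orders/", "JSON")
# Hypothetical execution: spark.sql(sql) on a schedule, vs. Auto Loader
# running continuously as a stream.
print(sql)
```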

9
Q

Schema Evolution in Streaming

How do you handle schema changes in streaming pipelines?

A

Enable schema evolution carefully
Validate schema changes
Use a schema registry if needed
πŸ‘‰ Uncontrolled evolution can break downstream pipelines.
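With Auto Loader, the "enable carefully" part maps to choosing an explicit `cloudFiles.schemaEvolutionMode`. A small sketch that validates the mode before using it (the mode names are Auto Loader's documented values):

```python
# Sketch: picking an explicit schema evolution mode for Auto Loader
# instead of relying on the default silently.

_VALID_MODES = {"addNewColumns", "rescue", "failOnNewColumns", "none"}

def schema_evolution_option(mode: str) -> dict:
    if mode not in _VALID_MODES:
        raise ValueError(f"unknown schema evolution mode: {mode}")
    return {"cloudFiles.schemaEvolutionMode": mode}

# "rescue" keeps unexpected columns in a rescued-data column rather than
# widening the schema -- a conservative choice for fragile downstreams.
print(schema_evolution_option("rescue"))
```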

10
Q

Late-Arriving Data

How do you handle late-arriving data in streaming?

A

Use watermarking
Allow updates via MERGE
πŸ‘‰ Ensures correctness without infinite waiting.

11
Q

Watermarking

What is watermarking in streaming?

A

Defines how long to wait for late data
Helps manage state and memory
πŸ‘‰ Tradeoff:
Too short β†’ data loss
Too long β†’ high memory usage
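The event-time windows that a watermark eventually closes can be sketched in pure Python: each event falls into a tumbling window, and the watermark decides how long that window's state is kept. In Spark this corresponds to something like `df.withWatermark("event_time", "15 minutes").groupBy(window("event_time", "10 minutes"))` (column names hypothetical).

```python
# Sketch of tumbling-window assignment, mirroring what
# groupBy(window(...)) computes for each event.
from datetime import datetime, timedelta

def tumbling_window(ts: datetime, width: timedelta) -> tuple:
    """Return the (start, end) of the event-time window containing ts."""
    epoch = datetime(1970, 1, 1)
    offset = (ts - epoch) % width        # distance into the current window
    start = ts - offset
    return start, start + width

# An event at 12:34 lands in the 12:30-12:40 window; the watermark
# controls how long after 12:40 that window still accepts late events.
print(tumbling_window(datetime(2024, 1, 1, 12, 34), timedelta(minutes=10)))
```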

12
Q

Trigger Modes

What are trigger modes in Structured Streaming?

A

Processing time (e.g., every 1 min)
Once (single batch-like run)
AvailableNow (process all available data, then stop)
πŸ‘‰ Controls frequency of execution.
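The three modes map to keyword arguments on `DataStreamWriter.trigger`. A small sketch that builds the right kwargs (the keyword names match the PySpark API; the mode labels are my own):

```python
# Sketch: mapping a trigger-mode label to DataStreamWriter.trigger kwargs.

def trigger_kwargs(mode: str, interval: str = "1 minute") -> dict:
    if mode == "processing_time":
        return {"processingTime": interval}
    if mode == "once":
        return {"once": True}
    if mode == "available_now":
        return {"availableNow": True}
    raise ValueError(f"unknown trigger mode: {mode}")

# Hypothetical usage:
#   df.writeStream.trigger(**trigger_kwargs("available_now")).toTable("gold.daily")
print(trigger_kwargs("processing_time", "5 minutes"))
```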

13
Q

Stateful vs Stateless Processing

What is the difference between stateful and stateless streaming?

A

Stateless β†’ each record independent
Stateful β†’ depends on past data (e.g., aggregations)
πŸ‘‰ Stateful requires more memory and checkpointing.

14
Q

Handling Duplicates

How do you handle duplicate data in pipelines?

A

Use unique keys
Deduplicate using window functions
Use MERGE logic
πŸ‘‰ Duplicate handling is critical in distributed systems.
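The window-function technique can be sketched as a builder for a "keep latest row per key" query using `ROW_NUMBER()`. Table and column names are hypothetical; `SELECT * EXCEPT (...)` is Databricks SQL syntax for dropping the helper column.

```python
# Sketch: deduplication keeping the newest row per key via ROW_NUMBER().
# Table/column names below are hypothetical.

def build_dedup_sql(table: str, keys: list, order_col: str) -> str:
    part = ", ".join(keys)
    return (
        "SELECT * EXCEPT (rn) FROM ("
        "SELECT *, ROW_NUMBER() OVER ("
        f"PARTITION BY {part} ORDER BY {order_col} DESC) AS rn "
        f"FROM {table}) WHERE rn = 1"
    )

sql = build_dedup_sql("bronze.orders", ["order_id"], "ingested_at")
# Hypothetical execution: spark.sql(sql)
print(sql)
```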

15
Q

CDC (Change Data Capture)

How do you implement CDC pipelines in Databricks?

A

Use MERGE INTO
Track inserts/updates/deletes
Use Delta Change Data Feed if available

πŸ‘‰ Enables incremental updates instead of full reloads.
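The Change Data Feed part can be sketched as two helpers: one enabling CDF on a Delta table via a table property, and one building the read options for consuming changes from a given version. The table name and version are hypothetical; the property and option names are Delta's documented ones.

```python
# Sketch: enabling and reading the Delta Change Data Feed.
# Table name and starting version are hypothetical examples.

def enable_cdf_sql(table: str) -> str:
    return (f"ALTER TABLE {table} SET TBLPROPERTIES "
            "(delta.enableChangeDataFeed = true)")

def cdf_read_options(starting_version: int) -> dict:
    return {
        "readChangeFeed": "true",
        "startingVersion": str(starting_version),
    }

print(enable_cdf_sql("silver.orders"))
# Hypothetical usage:
#   spark.read.format("delta").options(**cdf_read_options(5)).table("silver.orders")
```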

16
Q

Designing a File Ingestion Pipeline

How would you design a pipeline to ingest millions of files daily?

A

Use Auto Loader
Store raw data in Bronze
Enable schema evolution
Optimize with checkpointing

πŸ‘‰ Avoid listing all files manually β†’ scalability issue.

17
Q

Failure Handling

How do you handle failures in data pipelines?

A

Retry mechanisms
Idempotent design
Checkpointing
Logging and alerting
πŸ‘‰ Pipelines must be resilient by design.
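The retry mechanism above can be sketched as a small exponential-backoff wrapper; paired with idempotent writes (e.g. MERGE), a retried task cannot corrupt the output. This is a generic pattern, not a specific Databricks API.

```python
# Sketch: retry with exponential backoff for a transient failure.
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    """Call fn, retrying with exponential backoff; re-raise after the last attempt."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(with_retries(flaky))  # succeeds on the third attempt
```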

18
Q

Incremental vs Full Load

Why is incremental processing preferred over full load?

A

Faster
Cheaper
Scales better
πŸ‘‰ Full load is only for:
Initial load
Rare reprocessing scenarios

19
Q

Orchestration (Databricks Jobs)

How do you orchestrate pipelines in Databricks?

A

Use Jobs / Workflows
Define task dependencies
Schedule runs
πŸ‘‰ Supports:
Retry logic
Monitoring
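A hedged sketch of what a Workflows job with task dependencies looks like as a Jobs API payload. Field names follow the Jobs 2.1 API shape; the job name and notebook paths are hypothetical.

```python
# Sketch of a Databricks Workflows job definition with task dependencies,
# expressed as the payload dict for the Jobs API. All names/paths are
# hypothetical assumptions.
job = {
    "name": "daily_medallion_pipeline",
    "tasks": [
        {"task_key": "bronze_ingest",
         "notebook_task": {"notebook_path": "/Pipelines/bronze"}},
        {"task_key": "silver_clean",
         "depends_on": [{"task_key": "bronze_ingest"}],
         "notebook_task": {"notebook_path": "/Pipelines/silver"}},
        {"task_key": "gold_aggregate",
         "depends_on": [{"task_key": "silver_clean"}],
         "notebook_task": {"notebook_path": "/Pipelines/gold"}},
    ],
    "max_concurrent_runs": 1,  # avoid overlapping runs of the same pipeline
}
print([t["task_key"] for t in job["tasks"]])
```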

20
Q

Bronze Layer Design

What is the best practice for Bronze layer ingestion?

A

Append-only
Minimal transformation
Store raw data
πŸ‘‰ Ensures replayability.

21
Q

Silver Layer Design

What transformations belong in the Silver layer?

A

Data cleaning
Deduplication
Joins
πŸ‘‰ Creates a trusted dataset.

22
Q

Gold Layer Design

What is the purpose of Gold layer tables?

A

Business-level aggregations
Optimized for BI queries
πŸ‘‰ Should be:
Clean
Fast
Easy to query

23
Q

Data Quality Checks

Where should data quality checks be applied?

A

Primarily in Silver layer
Optionally in Bronze for critical validations
πŸ‘‰ Prevents bad data propagation.

24
Q

Streaming vs Batch Cost Tradeoff

What is the cost tradeoff between streaming and batch?

A

Streaming β†’ always running β†’ higher cost
Batch β†’ scheduled β†’ cheaper
πŸ‘‰ Choose based on latency requirements.

25
Q

End-to-End Pipeline Design

How would you design a robust end-to-end pipeline in Databricks?

A

Ingest with Auto Loader → Bronze
Clean + deduplicate → Silver
Aggregate → Gold
Use MERGE for idempotency
Add monitoring + retries
👉 Key principles: Scalability, Reliability, Maintainability