Databricks Interview Prep - Scenario-Based Questions Flashcards

(25 cards)

1
Q

End-to-End Pipeline Design

A

Design a pipeline to ingest and process millions of files daily.

  • Use Auto Loader for scalable ingestion
  • Store raw data in Bronze (append-only)
  • Clean + deduplicate in Silver (MERGE)
  • Aggregate in Gold

Add:

  • Checkpointing for reliability
  • OPTIMIZE for performance

πŸ‘‰ Key focus: scalability + idempotency + maintainability

2
Q

Debugging Slow Pipeline

A

A pipeline that used to run in 10 minutes now takes 1 hour. What do you check?

  • Data growth (volume increased?)
  • Small file problem
  • Data skew
  • Execution plan (shuffles?)

Fix:

  • OPTIMIZE + ZORDER
  • Repartition
  • Broadcast joins

πŸ‘‰ Always identify the biggest bottleneck first.

3
Q

Handling Duplicate Data

A

Your pipeline is producing duplicate records. How do you fix it?

  • Identify unique keys
  • Use MERGE instead of INSERT
  • Deduplicate using window functions

πŸ‘‰ Ensure pipeline is idempotent.

4
Q

Late-Arriving Data Scenario

A

Data arrives late in a streaming pipeline. How do you handle it?

  • Use watermarking
  • Allow updates via MERGE
  • Adjust latency vs accuracy tradeoff

πŸ‘‰ Balance:

  • Correctness vs performance
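Watermark behavior can be sketched in plain Python: track the maximum event time seen and drop events older than that maximum minus the allowed lateness. The 10-unit delay is an arbitrary choice here; in Structured Streaming this is what `withWatermark` configures per query:

```python
class Watermark:
    """Toy watermark: accept an event only if it is not older than
    (max event time seen so far) minus the allowed lateness."""
    def __init__(self, delay):
        self.delay = delay
        self.max_seen = float("-inf")

    def accept(self, event_time):
        self.max_seen = max(self.max_seen, event_time)
        return event_time >= self.max_seen - self.delay

wm = Watermark(delay=10)
results = [wm.accept(t) for t in [100, 105, 98, 80]]
# 100, 105: in order; 98 is late but within 105 - 10 = 95, so kept;
# 80 is below 95, so dropped as too late.
```

A larger delay keeps more late data (correctness) but holds state longer (cost/latency) — the tradeoff the card names.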
5
Q

Data Skew Issue

A

One Spark task is much slower than others. What’s happening?

Likely data skew:

  • One partition has much more data

Fix:

  • Salting keys
  • Repartitioning
  • Skew join optimization

πŸ‘‰ Common real-world bottleneck.

6
Q

Small File Problem in Production

A

Your Delta table has thousands of small files. What do you do?

  • Run OPTIMIZE (compaction)
  • Adjust ingestion batch size
  • Use proper partitioning

πŸ‘‰ Prevent future issues, not just fix current ones.

7
Q

Choosing Partition Strategy

A

How do you choose a partition column?

  • Based on query patterns
  • Low cardinality (e.g., date)

Avoid:

  • High-cardinality columns

πŸ‘‰ Partitioning should match access patterns, not guesswork.

8
Q

Streaming vs Batch Decision

A

How do you decide between streaming and batch?

  • Streaming β†’ low latency needed
  • Batch β†’ cost efficiency

πŸ‘‰ Most systems combine both.

9
Q

Handling Pipeline Failure

A

A pipeline fails midway. How do you ensure no data loss or duplication?

  • Use checkpointing
  • Design idempotent logic (MERGE)
  • Retry safely

πŸ‘‰ Recovery must be automatic and safe.

10
Q

CDC Pipeline Design

A

How would you design a CDC pipeline?

  • Capture changes (insert/update/delete)
  • Use MERGE INTO
  • Optionally use Delta Change Data Feed

πŸ‘‰ Avoid full reloads β†’ efficiency.

11
Q

Optimizing Join Performance

A

A join operation is very slow. How do you optimize it?

  • Use broadcast join if possible
  • Reduce data before join (filter early)
  • Check partitioning

πŸ‘‰ Joins are often the biggest bottleneck.

12
Q

Reprocessing Historical Data

A

You need to reprocess 6 months of data. How do you approach it?

  • Use Bronze as source of truth
  • Rebuild Silver/Gold

πŸ‘‰ Avoid re-ingestion β†’ faster and safer.

13
Q

Handling Schema Changes

A

A new column appears in incoming data. What do you do?

  • Enable schema evolution (if safe)
  • Validate downstream impact

πŸ‘‰ Schema changes should be controlled, not automatic everywhere.

14
Q

Designing for Scalability

A

How do you design a pipeline that scales over time?

  • Use distributed processing (Spark)
  • Avoid small files
  • Partition data properly

πŸ‘‰ Always assume data will grow.

15
Q

Ensuring Data Quality

A

How do you ensure data quality in pipelines?

  • Validate schema
  • Deduplicate
  • Apply rules in Silver layer

πŸ‘‰ Catch issues early before reaching Gold.

16
Q

Cost Optimization

A

How do you reduce cost in Databricks pipelines?

  • Use batch instead of streaming when possible
  • Optimize file sizes
  • Turn off idle clusters

πŸ‘‰ Balance performance vs cost.

17
Q

Choosing Between Databricks and Snowflake

A

When would you choose Databricks over Snowflake?

  • Complex transformations
  • Streaming pipelines
  • ML workloads

πŸ‘‰ Databricks = engineering flexibility
πŸ‘‰ Snowflake = SQL simplicity.

18
Q

Real-Time Analytics Design

A

How would you design a real-time analytics system?

  • Streaming ingestion (Auto Loader / Kafka)
  • Process with Structured Streaming
  • Store in Delta tables

πŸ‘‰ Ensure:

  • Low latency
  • Fault tolerance
19
Q

Handling Backfill + Streaming Together

A

How do you handle historical backfill while streaming is running?

  • Run batch backfill separately
  • Merge results into same table

πŸ‘‰ Avoid disrupting streaming pipeline.

20
Q

Monitoring Pipelines

A

How do you monitor pipeline health?

  • Logs + metrics
  • Job monitoring
  • Alerts on failures

πŸ‘‰ Observability is critical in production.

21
Q

Multi-Tenant Data Design

A

How do you design a system for multiple teams using the same platform?

  • Use Unity Catalog
  • Separate catalogs/schemas
  • Apply RBAC

πŸ‘‰ Ensures isolation + governance.

22
Q

Handling Large Tables

A

A table has billions of rows and queries are slow. What do you do?

  • Partition properly
  • Use ZORDER
  • Optimize file size

πŸ‘‰ Reduce scan size as much as possible.

23
Q

Designing Idempotent Streaming

A

How do you ensure idempotency in streaming pipelines?

  • Use checkpointing
  • Deduplicate records
  • Use MERGE

πŸ‘‰ Required for reliable streaming.

24
Q

Tradeoff: Performance vs Cost

A

How do you balance performance and cost?

  • Optimize only bottlenecks
  • Avoid over-engineering
  • Use right cluster size

πŸ‘‰ Don’t optimize everythingβ€”optimize what matters.

25
Q

Thinking Like a Senior Engineer

A

What distinguishes a senior data engineer in Databricks?

  • Focus on system design, not just code
  • Understand tradeoffs
  • Build reliable and scalable pipelines

👉 It’s about decision-making, not just implementation.