Databricks Interview Prep - Scenario-Based Questions Flashcards

(25 cards)

1
Q

End-to-End Pipeline Design

A

Design a pipeline to ingest and process millions of files daily.

  • Use Auto Loader for scalable ingestion
  • Store raw data in Bronze (append-only)
  • Clean + deduplicate in Silver (MERGE)
  • Aggregate in Gold

Add:

  • Checkpointing for reliability
  • OPTIMIZE for performance

πŸ‘‰ Key focus: scalability + idempotency + maintainability

2
Q

Debugging Slow Pipeline

A

A pipeline that used to run in 10 minutes now takes 1 hour. What do you check?

  • Data growth (volume increased?)
  • Small file problem
  • Data skew
  • Execution plan (shuffles?)

Fix:

  • OPTIMIZE + ZORDER
  • Repartition
  • Broadcast joins

πŸ‘‰ Always identify the biggest bottleneck first.

3
Q

Handling Duplicate Data

A

Your pipeline is producing duplicate records. How do you fix it?

  • Identify unique keys
  • Use MERGE instead of INSERT
  • Deduplicate using window functions

πŸ‘‰ Ensure pipeline is idempotent.

4
Q

Late-Arriving Data Scenario

A

Data arrives late in a streaming pipeline. How do you handle it?

  • Use watermarking
  • Allow updates via MERGE
  • Adjust latency vs accuracy tradeoff

πŸ‘‰ Balance:

  • Correctness vs performance
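Watermark behavior can be sketched in plain Python: track the maximum event time seen and drop events older than that maximum minus the allowed lateness. The 10-unit delay is an arbitrary choice here; in Structured Streaming this is what `withWatermark` configures per query:

```python
class Watermark:
    """Toy watermark: accept an event only if it is not older than
    (max event time seen so far) minus the allowed lateness."""
    def __init__(self, delay):
        self.delay = delay
        self.max_seen = float("-inf")

    def accept(self, event_time):
        self.max_seen = max(self.max_seen, event_time)
        return event_time >= self.max_seen - self.delay

wm = Watermark(delay=10)
results = [wm.accept(t) for t in [100, 105, 98, 80]]
# 100, 105: in order; 98 is late but within 105 - 10 = 95, so kept;
# 80 is below 95, so dropped as too late.
```

A larger delay keeps more late data (correctness) but holds state longer (cost/latency) — the tradeoff the card names.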
5
Q

Data Skew Issue

A

One Spark task is much slower than others. What’s happening?

Likely data skew:

  • One partition has much more data

Fix:

  • Salting keys
  • Repartitioning
  • Skew join optimization

πŸ‘‰ Common real-world bottleneck.

6
Q

Small File Problem in Production

A

Your Delta table has thousands of small files. What do you do?

  • Run OPTIMIZE (compaction)
  • Adjust ingestion batch size
  • Use proper partitioning

πŸ‘‰ Prevent future issues, not just fix current ones.

7
Q

Choosing Partition Strategy

A

How do you choose a partition column?

  • Based on query patterns
  • Low cardinality (e.g., date)

Avoid:

  • High-cardinality columns

πŸ‘‰ Partitioning should match access patterns, not guesswork.

8
Q

Streaming vs Batch Decision

A

How do you decide between streaming and batch?

  • Streaming β†’ low latency needed
  • Batch β†’ cost efficiency

πŸ‘‰ Most systems combine both.

9
Q

Handling Pipeline Failure

A

A pipeline fails midway. How do you ensure no data loss or duplication?

  • Use checkpointing
  • Design idempotent logic (MERGE)
  • Retry safely

πŸ‘‰ Recovery must be automatic and safe.

10
Q

CDC Pipeline Design

A

How would you design a CDC pipeline?

  • Capture changes (insert/update/delete)
  • Use MERGE INTO
  • Optionally use Delta Change Data Feed

πŸ‘‰ Avoid full reloads β†’ efficiency.

11
Q

Optimizing Join Performance

A

A join operation is very slow. How do you optimize it?

  • Use broadcast join if possible
  • Reduce data before join (filter early)
  • Check partitioning

πŸ‘‰ Joins are often the biggest bottleneck.

12
Q

Reprocessing Historical Data

A

You need to reprocess 6 months of data. How do you approach it?

  • Use Bronze as source of truth
  • Rebuild Silver/Gold

πŸ‘‰ Avoid re-ingestion β†’ faster and safer.

13
Q

Handling Schema Changes

A

A new column appears in incoming data. What do you do?

  • Enable schema evolution (if safe)
  • Validate downstream impact

πŸ‘‰ Schema changes should be controlled, not automatic everywhere.

14
Q

Designing for Scalability

A

How do you design a pipeline that scales over time?

  • Use distributed processing (Spark)
  • Avoid small files
  • Partition data properly

πŸ‘‰ Always assume data will grow.

15
Q

Ensuring Data Quality

A

How do you ensure data quality in pipelines?

  • Validate schema
  • Deduplicate
  • Apply rules in Silver layer

πŸ‘‰ Catch issues early before reaching Gold.

16
Q

Cost Optimization

A

How do you reduce cost in Databricks pipelines?

  • Use batch instead of streaming when possible
  • Optimize file sizes
  • Turn off idle clusters

πŸ‘‰ Balance performance vs cost.

17
Q

Choosing Between Databricks and Snowflake

A

When would you choose Databricks over Snowflake?

  • Complex transformations
  • Streaming pipelines
  • ML workloads

πŸ‘‰ Databricks = engineering flexibility
πŸ‘‰ Snowflake = SQL simplicity.

18
Q

Real-Time Analytics Design

A

How would you design a real-time analytics system?

  • Streaming ingestion (Auto Loader / Kafka)
  • Process with Structured Streaming
  • Store in Delta tables

πŸ‘‰ Ensure:

  • Low latency
  • Fault tolerance
19
Q

Handling Backfill + Streaming Together

A

How do you handle historical backfill while streaming is running?

  • Run batch backfill separately
  • Merge results into same table

πŸ‘‰ Avoid disrupting streaming pipeline.

20
Q

Monitoring Pipelines

A

How do you monitor pipeline health?

  • Logs + metrics
  • Job monitoring
  • Alerts on failures

πŸ‘‰ Observability is critical in production.

21
Q

Multi-Tenant Data Design

A

How do you design a system for multiple teams using the same platform?

  • Use Unity Catalog
  • Separate catalogs/schemas
  • Apply RBAC

πŸ‘‰ Ensures isolation + governance.

22
Q

Handling Large Tables

A

A table has billions of rows and queries are slow. What do you do?

  • Partition properly
  • Use ZORDER
  • Optimize file size

πŸ‘‰ Reduce scan size as much as possible.

23
Q

Designing Idempotent Streaming

A

How do you ensure idempotency in streaming pipelines?

  • Use checkpointing
  • Deduplicate records
  • Use MERGE

πŸ‘‰ Required for reliable streaming.

24
Q

Tradeoff: Performance vs Cost

A

How do you balance performance and cost?

  • Optimize only bottlenecks
  • Avoid over-engineering
  • Use right cluster size

πŸ‘‰ Don’t optimize everythingβ€”optimize what matters.

25
Q

Thinking Like a Senior Engineer

A

What distinguishes a senior data engineer in Databricks?

  • Focus on system design, not just code
  • Understand tradeoffs
  • Build reliable and scalable pipelines

👉 It’s about decision-making, not just implementation.