Databricks Interview Prep - Miscellaneous Flashcards

(50 cards)

1
Q

What are the limitations of Medallion architecture in very large organizations?

A

Can lead to duplication, unclear ownership, and over-layering if not governed properly.

2
Q

How would you design a Lakehouse for both BI and ML workloads without conflicts?

A

Separate compute, shared Delta tables, and workload-specific clusters.

3
Q

Why is schema-on-read risky in large-scale systems?

A

Leads to inconsistent interpretations and late detection of data quality issues.

4
Q

What happens if your Bronze layer becomes corrupted?

A

Re-ingestion required; highlights importance of immutability and backup.

5
Q

How do you enforce consistency across multiple pipelines writing to the same table?

A

Use Delta transactions + standardized write patterns + governance controls.

6
Q

What happens internally when a Delta table grows to millions of files?

A

Metadata overhead increases; query planning slows; requires compaction + checkpointing.

7
Q

Why can VACUUM be dangerous in production?

A

It permanently deletes data, breaking time travel and recovery options.
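
As a hedged illustration (the table name `events` is a placeholder), Delta's `VACUUM` syntax makes the retention tradeoff explicit: shortening the window saves storage but destroys the files that older versions need.

```sql
-- Files no longer referenced and older than the retention window are
-- permanently deleted; time travel to versions needing them stops working.
VACUUM events RETAIN 168 HOURS;  -- 168 hours = 7 days, the default

-- Preview what would be deleted without deleting anything:
VACUUM events DRY RUN;
```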

8
Q

How would you debug a corrupted Delta table?

A

Inspect _delta_log, check last valid version, restore via time travel.
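
A sketch of the usual recovery sequence in Delta SQL (`events` and the version number are placeholders):

```sql
-- Find the last healthy version in the transaction log:
DESCRIBE HISTORY events;

-- Roll the table back to it:
RESTORE TABLE events TO VERSION AS OF 42;

-- Or restore to a point in time:
RESTORE TABLE events TO TIMESTAMP AS OF '2024-01-15 00:00:00';
```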

9
Q

Why might MERGE operations become slow at scale?

A

Large shuffles, file rewrites, and lack of partition pruning.
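
A hedged sketch of one mitigation: keeping the partition column in the join condition so Delta can prune files instead of scanning the whole target (`events`, `updates`, and all column names are illustrative):

```sql
MERGE INTO events AS t
USING updates AS s
  ON  t.event_id   = s.event_id
  AND t.event_date = s.event_date  -- partition column in the join condition
                                   -- enables pruning instead of a full scan
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```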

10
Q

What’s the tradeoff between frequent OPTIMIZE vs infrequent OPTIMIZE?

A

Frequent = better performance but higher cost; infrequent = cheaper but slower queries.
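
One common middle ground is scheduled compaction scoped to recent data only. A sketch in Delta SQL (table and column names are illustrative):

```sql
-- Compact small files and co-locate rows by a common filter column,
-- but only for recently written partitions to bound the cost.
OPTIMIZE events
WHERE event_date >= current_date() - INTERVAL 7 DAYS
ZORDER BY (user_id);
```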

11
Q

Why does increasing cluster size not always improve performance?

A

Bottlenecks like skew, shuffle, or I/O may dominate.

12
Q

What is the impact of too many partitions in a job?

A

Task overhead increases → slower execution.

13
Q

How does Spark decide task distribution across nodes?

A

One task is created per partition; the scheduler then assigns tasks to executor cores based on available slots and data locality.

14
Q

Why can caching sometimes make jobs slower?

A

Memory pressure → spills to disk → worse performance.

15
Q

What is a “stage retry” and why does it happen?

A

Spark retries failed stages due to task failures or node issues.

16
Q

Why is “exactly-once” difficult to guarantee in distributed systems?

A

Network failures, retries, and duplicate events complicate guarantees.
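
The standard workaround is at-least-once delivery combined with idempotent processing. A minimal pure-Python sketch of the idea (all names are illustrative; a real system would keep the seen-ID state in durable storage):

```python
# Sketch: at-least-once delivery plus idempotent handling gives
# effectively-once processing.
processed_ids = set()   # in production this would be durable state
results = []

def handle(event):
    """Process an event at most once, even if it is redelivered."""
    if event["id"] in processed_ids:
        return  # duplicate caused by a retry; safe to drop
    processed_ids.add(event["id"])
    results.append(event["value"])

# A retry redelivers event 1; the duplicate is ignored.
for e in [{"id": 1, "value": 10}, {"id": 2, "value": 20},
          {"id": 1, "value": 10}]:
    handle(e)

print(results)  # [10, 20]
```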

17
Q

How would you design a streaming pipeline that can handle sudden spikes in data?

A

Auto-scaling clusters + buffering + backpressure handling.

18
Q

What happens if checkpoint data is lost?

A

Pipeline loses state → risk of duplicates or data loss.
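
A toy pure-Python sketch of offset checkpointing shows both behaviors: resuming correctly while the checkpoint survives, and reprocessing (duplicates) once it is lost. All names are illustrative:

```python
# A consumer persists the last processed offset so a restart resumes
# where it left off instead of reprocessing from the beginning.
import json
import os
import tempfile

def run(source, ckpt_path, sink):
    """Process `source` from the last checkpointed offset onward."""
    start = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            start = json.load(f)["offset"]
    for i in range(start, len(source)):
        sink.append(source[i])
        with open(ckpt_path, "w") as f:   # commit progress after each event
            json.dump({"offset": i + 1}, f)

events = ["a", "b", "c", "d"]
ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
out = []
run(events[:2], ckpt, out)   # first run sees only the first two events
run(events, ckpt, out)       # restart resumes at offset 2: no duplicates
print(out)                   # ['a', 'b', 'c', 'd']

os.remove(ckpt)              # losing the checkpoint...
run(events, ckpt, out)       # ...forces a restart from offset 0: duplicates
print(out)                   # ['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd']
```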

19
Q

Why should streaming pipelines avoid heavy transformations?

A

Increases latency and resource usage.

20
Q

How do you ensure ordering of events in streaming systems?

A

Use event-time processing + watermarking (with limitations).
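
A minimal pure-Python sketch of the watermark idea: buffer out-of-order events and release them in event-time order once the watermark (max event time seen, minus an allowed lateness) has passed them. This loosely mirrors what watermarking enables in Structured Streaming; all names are illustrative:

```python
import heapq

def emit_in_order(events, allowed_lateness):
    """Yield (event_time, payload) pairs in event-time order.

    Events arriving after the watermark has passed their event time
    would be dropped or routed to a late-data path in a real system.
    """
    buffer, max_seen, out = [], 0, []
    for event_time, payload in events:
        max_seen = max(max_seen, event_time)
        heapq.heappush(buffer, (event_time, payload))
        watermark = max_seen - allowed_lateness
        # Safe to emit everything at or below the watermark:
        while buffer and buffer[0][0] <= watermark:
            out.append(heapq.heappop(buffer))
    out.extend(sorted(buffer))  # flush the remainder at end of stream
    return out

# Events arrive out of order as (event_time, payload):
arrived = [(1, "a"), (3, "c"), (2, "b"), (5, "e"), (4, "d")]
print(emit_in_order(arrived, allowed_lateness=2))
# [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]
```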

21
Q

How do you design tables for both analytics and operational queries?

A

Separate workloads or optimize with partitioning + indexing strategies.

22
Q

Why is denormalization common in analytics systems?

A

Reduces joins → improves query performance.

23
Q

What is the risk of over-normalization in a Lakehouse?

A

Increased joins → poor performance.

24
Q

How do you handle slowly changing dimensions efficiently?

A

Use MERGE with versioning logic.
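
A miniature pure-Python version of the Type 2 SCD logic that a Delta MERGE would implement (all field names are illustrative): close the current row and insert a new version when a tracked attribute changes.

```python
def scd2_merge(dim, updates, as_of):
    """Apply Type 2 SCD updates to a dimension.

    dim: list of dicts with key, attr, valid_from, valid_to, current.
    """
    for upd in updates:
        current = next((r for r in dim
                        if r["key"] == upd["key"] and r["current"]), None)
        if current and current["attr"] == upd["attr"]:
            continue                        # no change: nothing to do
        if current:                         # expire the old version
            current["valid_to"], current["current"] = as_of, False
        dim.append({"key": upd["key"], "attr": upd["attr"],
                    "valid_from": as_of, "valid_to": None, "current": True})
    return dim

dim = [{"key": 1, "attr": "NY", "valid_from": "2024-01-01",
        "valid_to": None, "current": True}]
scd2_merge(dim, [{"key": 1, "attr": "CA"}], as_of="2024-06-01")
# dim now holds two rows for key 1: the expired NY row and a current CA row.
```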

25
Q

When would you choose wide tables vs narrow tables?

A

Wide for analytics, narrow for flexibility and reuse.

26
Q

What is the biggest risk of poor data governance?

A

Data leaks and compliance violations.

27
Q

How do you prevent unauthorized data access in shared environments?

A

RBAC + row/column-level security + auditing.
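
In Unity Catalog this combination can be sketched as follows (table, group, and function names are placeholders):

```sql
-- Grant table access to a group, then restrict which rows it can see.
GRANT SELECT ON TABLE sales TO `analysts`;

-- Row filter: a SQL UDF returning TRUE for rows the caller may see.
CREATE OR REPLACE FUNCTION region_filter(region STRING)
  RETURN is_account_group_member('admins') OR region = 'US';

ALTER TABLE sales SET ROW FILTER region_filter ON (region);
```
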
28
Q

Why is lineage critical in enterprise systems?

A

Enables impact analysis and debugging.

29
Q

How do you manage access across multiple teams with different needs?

A

Use hierarchical permissions and role-based groups.

30
Q

What’s the challenge of governance in streaming pipelines?

A

Continuous data flow makes enforcement and auditing harder.

31
Q

What is the difference between job clusters and all-purpose clusters in practice?

A

Job clusters are ephemeral and cost-efficient; all-purpose are interactive but costly.

32
Q

Why should you avoid long-running clusters?

A

Higher cost and potential resource waste.

33
Q

How do you debug a failed Databricks job in production?

A

Check logs, Spark UI, and execution plan.

34
Q

What are common causes of cluster instability?

A

Memory pressure, skew, excessive shuffle.

35
Q

How do you manage dependencies in Databricks jobs?

A

Use workflows and task dependencies.

36
Q

What is the biggest hidden cost in Databricks pipelines?

A

Inefficient jobs (shuffle-heavy, small files, idle clusters).

37
Q

How do you identify cost inefficiencies?

A

Monitor job duration, cluster usage, and resource utilization.

38
Q

Why can over-partitioning increase cost?

A

Too many files → more compute overhead.

39
Q

How do you balance compute vs storage costs?

A

Optimize data layout to reduce compute usage.

40
Q

When should you scale up vs scale out?

A

Scale up for memory-heavy tasks; scale out for parallel workloads.

41
Q

Why is “just use Spark” not always the right solution?

A

Simpler tools may be more efficient for small workloads.

42
Q

What are the risks of over-engineering data pipelines?

A

Increased complexity, maintenance cost, and slower delivery.

43
Q

How do you decide between real-time and near-real-time systems?

A

Based on business requirements vs cost.

44
Q

Why is data engineering more about tradeoffs than tools?

A

Every decision impacts performance, cost, and reliability.

45
Q

What is the hardest part of scaling data systems?

A

Managing complexity and ensuring reliability over time.

46
Q

Describe a time when a pipeline failed in production—what did you learn?

A

Focus on root cause, prevention, and improvement.

47
Q

How do you prioritize performance vs correctness?

A

Correctness always first, then optimize.

48
Q

How do you communicate tradeoffs to non-technical stakeholders?

A

Translate into cost, speed, and business impact.

49
Q

How do you keep pipelines maintainable over time?

A

Modular design, documentation, and standardization.

50
Q

What would you improve first in a poorly designed data platform?

A

Observability + reliability before performance tuning.