Databricks Interview Prep: Architecture & Fundamentals Flashcards

Databricks Interview Prep (18 cards)

1
Q

Lakehouse Concept

What problem does the Lakehouse architecture solve compared to traditional Data Lakes and Data Warehouses?

A

Lakehouse combines the strengths of both systems:
Data Lakes → cheap storage but lack reliability (no ACID, poor governance)
Data Warehouses → strong governance and performance but expensive and less flexible
Lakehouse (via Delta Lake) adds:
ACID transactions on data lake storage
Schema enforcement and evolution
Support for both BI (SQL) and ML workloads

👉 It eliminates the need to maintain separate systems (no duplication between lake + warehouse).

2
Q

Lakehouse vs Data Warehouse

When would you choose a traditional Data Warehouse over a Lakehouse?

A

A Data Warehouse is preferable when:
Workloads are purely structured and BI-focused
You need extremely predictable performance (low variability under concurrency)
You have minimal need for raw/unstructured data
Lakehouse is better when:
You need to handle batch + streaming + ML in one platform
You want flexibility with semi/unstructured data
Cost optimization is important

👉 In practice, Lakehouse is more versatile, but DW can still win in simplicity and stability.

3
Q

Medallion Architecture

What is the Medallion architecture and why is it used?

A

It structures data into layers:
Bronze → raw ingested data (append-only, minimal processing)
Silver → cleaned, validated, joined data
Gold → business-level aggregates for analytics
Benefits:
Improves data quality progressively
Makes pipelines modular and easier to debug
Supports reprocessing without re-ingestion

👉 It enforces separation of concerns in data pipelines.
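The layer responsibilities above can be sketched in plain Python. This is a toy in-memory analogue (in Databricks each layer would be a Delta table written via Spark, and the field names here are made up for illustration):

```python
# Toy sketch of the Medallion flow: Bronze keeps everything, Silver
# validates, Gold aggregates. Illustrative only, not Spark code.

def to_bronze(raw_records):
    # Bronze: append-only, keep rows as-is (even invalid ones)
    return list(raw_records)

def to_silver(bronze):
    # Silver: validate and clean (drop rows missing required fields)
    return [r for r in bronze
            if r.get("user_id") and r.get("amount") is not None]

def to_gold(silver):
    # Gold: business-level aggregate (total spend per user)
    totals = {}
    for r in silver:
        totals[r["user_id"]] = totals.get(r["user_id"], 0) + r["amount"]
    return totals

raw = [{"user_id": "a", "amount": 10},
       {"user_id": None, "amount": 5},   # invalid row survives only in Bronze
       {"user_id": "a", "amount": 7}]
gold = to_gold(to_silver(to_bronze(raw)))
# gold == {"a": 17}
```

Because Bronze retains the invalid row, Silver and Gold can be rebuilt with different validation rules without re-ingesting anything.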

4
Q

Why Not Transform Directly to Gold?

Why shouldn’t you transform raw data directly into Gold tables?

A

Skipping layers causes:
Loss of traceability (hard to debug issues)
Reprocessing becomes expensive (must re-ingest data)
Poor data quality control
Silver acts as a reusable, trusted intermediate layer.

👉 Without it, pipelines become fragile and tightly coupled.

5
Q

Separation of Storage and Compute

Why is separating storage and compute important in Databricks architecture?

A

Storage (e.g., S3, ADLS) is cheap and scalable
Compute (clusters) can be scaled independently
Benefits:
Cost efficiency (don’t pay compute when idle)
Independent scaling (e.g., large storage, small compute)
Better concurrency handling

👉 This is a key advantage over traditional on-prem systems.

6
Q

Role of Delta Lake

Why is Delta Lake essential to the Databricks Lakehouse?

A

Delta Lake adds reliability to data lakes:
ACID transactions
Schema enforcement
Time travel/versioning
Efficient updates (MERGE, DELETE, UPDATE)
Without Delta:
Data lakes are just “dumb storage” (no guarantees)

👉 Delta transforms a Data Lake into a transactional system.
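The time travel idea can be illustrated with a toy in-memory commit log. Real Delta Lake stores JSON commit files under `_delta_log/` and reconstructs snapshots from them; this simplified analogue just records full table states per version:

```python
# Toy analogue of a transaction log enabling versioning / "time travel".
# Not the actual Delta format; illustrative only.

class ToyDeltaLog:
    def __init__(self):
        self._commits = []  # each commit = full table state after a write

    def commit(self, table_state):
        self._commits.append(list(table_state))

    def read(self, version=None):
        # version=None -> latest snapshot; an integer -> old snapshot
        idx = -1 if version is None else version
        return self._commits[idx]

log = ToyDeltaLog()
log.commit([{"id": 1, "v": "old"}])   # version 0
log.commit([{"id": 1, "v": "new"}])   # version 1
latest = log.read()                   # current state
as_of_v0 = log.read(version=0)        # "time travel" to version 0
```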

7
Q

Schema Enforcement vs Schema Evolution

What’s the difference between schema enforcement and schema evolution?

A

Schema enforcement: Rejects data that doesn’t match schema → ensures data quality
Schema evolution: Allows controlled schema changes (e.g., new columns)
👉 Best practice:
Use enforcement in production pipelines
Enable evolution carefully (e.g., Auto Loader scenarios)
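The distinction can be shown with a minimal Python sketch: enforcement rejects rows with unexpected columns, while evolution widens the schema to accept them. (Delta applies this at the table level; the function names here are invented for illustration.)

```python
# Sketch: enforcement vs evolution on a column-name "schema".

schema = {"id", "name"}

def write_enforced(row, schema):
    # Schema enforcement: reject rows with columns outside the schema
    extra = set(row) - schema
    if extra:
        raise ValueError(f"schema enforcement: unexpected columns {extra}")
    return row

def write_evolving(row, schema):
    # Schema evolution: accept new columns and extend the schema
    schema |= set(row)
    return row

write_enforced({"id": 1, "name": "a"}, schema)         # conforms: accepted
try:
    write_enforced({"id": 2, "email": "x@y"}, schema)  # rejected
except ValueError:
    pass
write_evolving({"id": 2, "email": "x@y"}, schema)
# schema is now {"id", "name", "email"}
```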

8
Q

Batch vs Streaming (Conceptual)

What is the key architectural difference between batch and streaming pipelines?

A

Batch → processes finite data at scheduled intervals
Streaming → processes data continuously as it arrives
In Databricks:
Both use the same engine (Structured Streaming)
👉 The real difference is latency requirements, not technology.
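The "same logic, different arrival pattern" point can be sketched in plain Python (this mirrors how Structured Streaming reuses batch transformations; it is not Spark code):

```python
# Sketch: identical transformation applied batch-style (one pass over a
# finite input) and streaming-style (one record at a time as it arrives).

def transform(record):
    return {**record, "amount_usd": record["amount_cents"] / 100}

records = [{"amount_cents": 250}, {"amount_cents": 99}]

# Batch: finite input, processed in one pass
batch_out = [transform(r) for r in records]

# Streaming: records consumed incrementally as they arrive
stream_out = []
for r in iter(records):          # stand-in for an arriving stream
    stream_out.append(transform(r))

assert batch_out == stream_out   # same logic, different arrival pattern
```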

9
Q

Idempotency in Data Pipelines

What does idempotency mean in data engineering and why is it important?

A

Running the same job multiple times produces the same result.
Why important:
Handles retries safely
Prevents duplicate data
Ensures pipeline reliability
👉 Common techniques:
MERGE INTO
Deduplication keys
Checkpointing
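The MERGE-based technique can be shown with a keyed upsert in plain Python (a toy analogue of `MERGE INTO` semantics, not Delta itself):

```python
# Sketch of an idempotent load: upsert by key, so re-running the same
# batch (e.g., after a retry) leaves the table unchanged.

def merge_into(table, updates, key="id"):
    # table: dict keyed by id; matched rows updated, new rows inserted
    for row in updates:
        table[row[key]] = row
    return table

table = {}
load = [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}]

merge_into(table, load)
merge_into(table, load)   # retry of the same batch: safe, no duplicates

assert len(table) == 2
```

An append-only write of the same batch twice would instead produce four rows, which is exactly what idempotency guards against.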

10
Q

Data Lake vs Delta Lake

What are the limitations of a traditional Data Lake compared to Delta Lake?

A

Traditional Data Lake issues:
No ACID guarantees → data corruption risk
No schema enforcement → messy data
Poor performance (no indexing or metadata optimization)
Delta Lake solves:
Reliability (transactions)
Performance (data skipping, indexing)
Manageability (time travel)

11
Q

Metadata Management

Why is metadata critical in Databricks architecture?

A

Metadata enables:
Query optimization (data skipping, pruning)
Governance (permissions, lineage)
Efficient file tracking
In Delta:
Metadata is stored in the _delta_log directory
👉 Without metadata, performance and governance collapse.

12
Q

Data Skipping (Conceptual)

What is data skipping and why does it matter?

A

Delta stores per-file statistics (min/max values per column), allowing queries to:
Skip irrelevant files
Reduce I/O
👉 This is why properly organized data (partitioning, Z-Ordering) is critical.
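A toy model of file pruning makes the mechanism concrete (Delta keeps comparable per-file statistics in its transaction log; the file names and ranges here are invented):

```python
# Sketch of data skipping: per-file min/max stats let a query prune files
# whose value range cannot possibly match the predicate.

files = [
    {"path": "part-0", "min_id": 0,   "max_id": 99},
    {"path": "part-1", "min_id": 100, "max_id": 199},
    {"path": "part-2", "min_id": 200, "max_id": 299},
]

def files_to_scan(files, target_id):
    # Skip any file whose [min, max] range excludes the target value
    return [f["path"] for f in files
            if f["min_id"] <= target_id <= f["max_id"]]

# WHERE id = 150 -> only one of three files needs to be read
assert files_to_scan(files, 150) == ["part-1"]
```

Partitioning and Z-Ordering matter precisely because they keep these per-file ranges narrow and non-overlapping, so more files can be skipped.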

13
Q

Small File Problem (Conceptual)

Why are small files a problem in Databricks?

A

Each file adds metadata overhead
Spark struggles with too many files
Leads to slower queries and job execution
👉 Solution:
OPTIMIZE (file compaction)
Control ingestion batch size
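Conceptually, compaction bin-packs many small files into fewer files near a target size. A hedged sketch (OPTIMIZE does this over real Parquet files with more sophisticated planning; sizes here are illustrative):

```python
# Sketch of file compaction: greedily pack small files into bins of
# roughly a target size (e.g., fewer, larger files for Spark to read).

TARGET_MB = 128

def compact(file_sizes_mb, target=TARGET_MB):
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current_size + size > target and current:
            bins.append(current)           # close the full bin
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

small_files = [8] * 32          # 32 files of 8 MB each (256 MB total)
compacted = compact(small_files)
assert len(compacted) == 2      # down from 32 files to 2
```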

14
Q

Data Pipeline Layers Responsibility

What responsibility should NOT belong to the Bronze layer?

A

Bronze should NOT:
Perform heavy transformations
Apply business logic
Clean aggressively
👉 It should remain:
Raw
Append-only
Replayable

15
Q

Tradeoff: Flexibility vs Governance

What is the main tradeoff when using a Lakehouse architecture?

A

Flexibility (schema evolution, raw data storage)
vs
Governance (strict schemas, controlled access)
👉 Poor control → “data swamp”
👉 Too strict → lose flexibility
Balance is key.

16
Q

Databricks vs Snowflake (Architecture Level)

At a high level, how does Databricks differ from Snowflake architecturally?

A

Databricks: Spark-based, flexible, supports ML + streaming + engineering
Snowflake: SQL-first, optimized for BI and structured data
👉 Databricks = engineering platform
👉 Snowflake = analytics warehouse

17
Q

Spark vs Traditional Engines?

Why does Databricks use Spark instead of traditional query engines?

A

Spark provides:
Distributed processing
Unified batch + streaming
Support for large-scale transformations
👉 It’s designed for big data workloads, not just SQL queries.

18
Q

Reprocessing Strategy

How does the Lakehouse architecture support data reprocessing?

A

Bronze keeps raw data
Silver/Gold can be rebuilt anytime
👉 This allows:
Fixing bugs
Updating logic
Recomputing historical data