Lakehouse Concept
What problem does the Lakehouse architecture solve compared to traditional Data Lakes and Data Warehouses?
Lakehouse combines the strengths of both systems:
Data Lakes → cheap storage but lack reliability (no ACID, poor governance)
Data Warehouses → strong governance and performance but expensive and less flexible
Lakehouse (via Delta Lake) adds:
ACID transactions on data lake storage
Schema enforcement and evolution
Support for both BI (SQL) and ML workloads
👉 It eliminates the need to maintain separate systems (no duplication between lake + warehouse).
Lakehouse vs Data Warehouse
When would you choose a traditional Data Warehouse over a Lakehouse?
A Data Warehouse is preferable when:
Workloads are purely structured and BI-focused
You need highly predictable performance (low variability under concurrent load)
Minimal need for raw/unstructured data
Lakehouse is better when:
You need to handle batch + streaming + ML in one platform
You want flexibility with semi/unstructured data
Cost optimization is important
👉 In practice, Lakehouse is more versatile, but DW can still win in simplicity and stability.
Medallion Architecture
What is the Medallion architecture and why is it used?
It structures data into layers:
Bronze → raw ingested data (append-only, minimal processing)
Silver → cleaned, validated, joined data
Gold → business-level aggregates for analytics
Benefits:
Improves data quality progressively
Makes pipelines modular and easier to debug
Supports reprocessing without re-ingestion
👉 It enforces separation of concerns in data pipelines.
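The layering above can be sketched in plain Python over in-memory records (a toy illustration; in Databricks each layer would be a Delta table, and all field and function names here are made up for the example):

```python
# Minimal Bronze -> Silver -> Gold sketch using lists of dicts
# instead of Delta tables (names and fields are illustrative).

bronze = [  # raw, append-only: keep everything, even bad rows
    {"order_id": "1", "amount": "10.5", "country": "US"},
    {"order_id": "2", "amount": "bad", "country": "US"},   # malformed
    {"order_id": "3", "amount": "4.0", "country": "DE"},
]

def to_silver(rows):
    """Clean and validate: drop rows whose amount cannot be parsed."""
    out = []
    for r in rows:
        try:
            out.append({**r, "amount": float(r["amount"])})
        except ValueError:
            pass  # a real pipeline would quarantine/log this row
    return out

def to_gold(rows):
    """Business-level aggregate: revenue per country."""
    totals = {}
    for r in rows:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'US': 10.5, 'DE': 4.0}
```

Note how the malformed row survives in Bronze (replayable) but is filtered before Silver, so Gold can trust its inputs.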
Why Not Transform Directly to Gold?
Why shouldn’t you transform raw data directly into Gold tables?
Skipping layers causes:
Loss of traceability (hard to debug issues)
Expensive reprocessing (every logic change forces a full rebuild from raw)
Weak data quality control
Silver acts as a reusable, trusted intermediate layer.
👉 Without it, pipelines become fragile and tightly coupled.
Separation of Storage and Compute
Why is separating storage and compute important in Databricks architecture?
Storage (e.g., S3, ADLS) is cheap and scalable
Compute (clusters) can be scaled independently
Benefits:
Cost efficiency (don’t pay for compute when clusters are idle)
Independent scaling (e.g., large storage, small compute)
Better concurrency handling
👉 This is a key advantage over traditional on-prem systems.
Role of Delta Lake
Why is Delta Lake essential to the Databricks Lakehouse?
Delta Lake adds reliability to data lakes:
ACID transactions
Schema enforcement
Time travel/versioning
Efficient updates (MERGE, DELETE, UPDATE)
Without Delta:
Data lakes are just “dumb storage” (no guarantees)
👉 Delta transforms a Data Lake into a transactional system.
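The versioning idea can be illustrated with a toy table that keeps a snapshot per commit (an assumption-laden sketch: real Delta records actions in `_delta_log` rather than full snapshots, and the class name is invented):

```python
# Toy versioned "table": each committed write appends a full snapshot,
# mimicking Delta's time travel. Real Delta logs commit actions in
# _delta_log instead of storing whole snapshots.

class VersionedTable:
    def __init__(self):
        self._versions = []               # version N -> snapshot of rows

    def commit(self, rows):
        self._versions.append(list(rows))  # atomic: all-or-nothing
        return len(self._versions) - 1     # new version number

    def read(self, version=None):
        if version is None:                # default: latest version
            version = len(self._versions) - 1
        return self._versions[version]     # "VERSION AS OF" semantics

t = VersionedTable()
t.commit([{"id": 1, "status": "new"}])
t.commit([{"id": 1, "status": "shipped"}])
print(t.read())            # latest: [{'id': 1, 'status': 'shipped'}]
print(t.read(version=0))   # time travel: [{'id': 1, 'status': 'new'}]
```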
Schema Enforcement vs Schema Evolution
What’s the difference between schema enforcement and schema evolution?
Schema enforcement: Rejects data that doesn’t match schema → ensures data quality
Schema evolution: Allows controlled schema changes (e.g., new columns)
👉 Best practice:
Use enforcement in production pipelines
Enable evolution carefully (e.g., Auto Loader scenarios)
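The two behaviors can be contrasted in a small sketch (illustrative only; the `write` function and its `allow_evolution` flag are inventions mirroring Delta's `mergeSchema` idea, not a real API):

```python
# Enforcement rejects rows that don't match the schema;
# evolution widens the schema in a controlled way instead.

schema = {"id": int, "name": str}

def write(rows, schema, allow_evolution=False):
    for row in rows:
        extra = set(row) - set(schema)
        if extra:
            if not allow_evolution:
                # enforcement: reject mismatched data outright
                raise ValueError(f"schema mismatch: unexpected {extra}")
            for col in extra:              # evolution: add new columns
                schema[col] = type(row[col])
        for col, typ in schema.items():    # type check known columns
            if col in row and not isinstance(row[col], typ):
                raise ValueError(f"bad type for {col}")
    return schema

write([{"id": 1, "name": "a"}], schema)               # passes enforcement
schema = write([{"id": 2, "name": "b", "email": "b@x"}],
               schema, allow_evolution=True)          # schema evolves
print(sorted(schema))  # ['email', 'id', 'name']
```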
Batch vs Streaming (Conceptual)
What is the key architectural difference between batch and streaming pipelines?
Batch → processes finite data at scheduled intervals
Streaming → processes data continuously as it arrives
In Databricks:
Both use the same engine (Structured Streaming)
👉 The real difference is latency requirements, not technology.
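A toy sketch of that unified model: the same transformation applied once to the full dataset (batch) and incrementally to micro-batches (streaming) yields the same result. This is pure Python, not Structured Streaming, and the micro-batch size of 2 is arbitrary:

```python
# Same logic, batch vs micro-batch: only the cadence differs.

events = [1, 2, 3, 4, 5, 6]

def transform(xs):
    return [x * 2 for x in xs]

batch_result = transform(events)           # process everything at once

stream_result = []
for i in range(0, len(events), 2):         # process micro-batches of 2
    stream_result.extend(transform(events[i:i + 2]))

assert batch_result == stream_result       # identical output
print(stream_result)  # [2, 4, 6, 8, 10, 12]
```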
Idempotency in Data Pipelines
What does idempotency mean in data engineering and why is it important?
Running the same job multiple times produces the same result.
Why important:
Handles retries safely
Prevents duplicate data
Ensures pipeline reliability
👉 Common techniques:
MERGE INTO
Deduplication keys
Checkpointing
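The key-based upsert idea behind MERGE INTO can be sketched in a few lines (a toy in-memory version; a real pipeline would run `MERGE INTO` against a Delta table keyed the same way):

```python
# Idempotent upsert keyed on a business key: running the same batch
# twice leaves the target unchanged, so retries are safe.

target = {}  # business key -> row

def upsert(batch):
    for row in batch:
        target[row["id"]] = row   # update if present, insert if not

batch = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
upsert(batch)
first = dict(target)
upsert(batch)            # retry with the same input
assert target == first   # same result -> idempotent, no duplicates
print(len(target))       # 2
```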
Data Lake vs Delta Lake
What are the limitations of a traditional Data Lake compared to Delta Lake?
Traditional Data Lake issues:
No ACID guarantees → data corruption risk
No schema enforcement → messy data
Poor performance (no file-level statistics or metadata optimization)
Delta Lake solves:
Reliability (transactions)
Performance (data skipping, Z-ordering)
Manageability (time travel)
Metadata Management
Why is metadata critical in Databricks architecture?
Metadata enables:
Query optimization (data skipping, pruning)
Governance (permissions, lineage)
Efficient file tracking
In Delta:
Metadata is stored in the _delta_log transaction log
👉 Without metadata, performance and governance collapse.
Data Skipping (Conceptual)
What is data skipping and why does it matter?
Delta stores per-file column statistics (min/max values) in the transaction log, allowing queries to:
Skip irrelevant files
Reduce I/O
👉 This is why properly organized data (partitioning, ZORDER) is critical.
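The pruning logic can be sketched directly (a simplified model: real Delta keeps min/max stats per column per file in the transaction log; the file names and date column here are made up):

```python
# Data skipping sketch: per-file min/max stats let a query touch only
# the files whose value range could contain matching rows.

files = [
    {"path": "f1", "min_date": "2024-01-01", "max_date": "2024-01-31"},
    {"path": "f2", "min_date": "2024-02-01", "max_date": "2024-02-29"},
    {"path": "f3", "min_date": "2024-03-01", "max_date": "2024-03-31"},
]

def files_to_scan(files, date):
    """Keep only files whose [min, max] range may contain `date`."""
    return [f["path"] for f in files
            if f["min_date"] <= date <= f["max_date"]]

print(files_to_scan(files, "2024-02-15"))  # ['f2'] -- f1 and f3 skipped
```

This is also why partitioning and ZORDER matter: they cluster related values together so the min/max ranges are narrow and more files can be skipped.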
Small File Problem (Conceptual)
Why are small files a problem in Databricks?
Each file adds metadata overhead
File listing and task scheduling slow down when Spark faces too many files
Leads to slower queries and job execution
👉 Solution:
OPTIMIZE (file compaction)
Control ingestion batch size
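Conceptually, compaction bin-packs many small files into a few files near a target size (a greedy sketch of what OPTIMIZE does; the 128 MB target and greedy strategy are illustrative assumptions, not Databricks' actual algorithm):

```python
# Compaction sketch: group small files into bins of roughly target_mb.

def compact(file_sizes_mb, target_mb=128):
    """Greedily pack file sizes into bins that stay under target_mb."""
    bins, current, used = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if used + size > target_mb and current:
            bins.append(current)          # bin is full: start a new one
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        bins.append(current)
    return bins

small_files = [4] * 64            # 64 files of 4 MB each
compacted = compact(small_files)  # packed into ~128 MB files
print(len(small_files), "->", len(compacted))  # 64 -> 2
```

Fewer, larger files mean less per-file metadata and fewer tasks for Spark to schedule.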
Data Pipeline Layers Responsibility
What responsibility should NOT belong to the Bronze layer?
Bronze should NOT:
Perform heavy transformations
Apply business logic
Clean aggressively
👉 It should remain:
Raw
Append-only
Replayable
Tradeoff: Flexibility vs Governance
What is the main tradeoff when using a Lakehouse architecture?
Flexibility (schema evolution, raw data storage)
vs
Governance (strict schemas, controlled access)
👉 Poor control → “data swamp”
👉 Too strict → lose flexibility
Balance is key.
Databricks vs Snowflake (Architecture Level)
At a high level, how does Databricks differ from Snowflake architecturally?
Databricks: Spark-based, flexible, supports ML + streaming + engineering
Snowflake: SQL-first, optimized for BI and structured data
👉 Databricks = engineering platform
👉 Snowflake = analytics warehouse
Spark vs Traditional Engines
Why does Databricks use Spark instead of traditional query engines?
Spark provides:
Distributed processing
Unified batch + streaming
Support for large-scale transformations
👉 It’s designed for big data workloads, not just SQL queries.
Reprocessing Strategy
How does the Lakehouse architecture support data reprocessing?
Bronze keeps raw data
Silver/Gold can be rebuilt anytime
👉 This allows:
Fixing bugs
Updating logic
Recomputing historical data
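A tiny sketch of the pattern: because Bronze retains the raw history, a downstream aggregate can be rebuilt with corrected logic without touching the source system (all names and the bug scenario are illustrative):

```python
# Reprocessing sketch: fix a bug in Gold logic and rebuild from Bronze,
# with no re-ingestion from the upstream source required.

bronze = [  # raw history, kept append-only
    {"id": 1, "amount_cents": 1050},
    {"id": 2, "amount_cents": 400},
]

def build_gold(rows, cents_per_unit=100):
    """Total revenue, converting cents to currency units."""
    return sum(r["amount_cents"] for r in rows) / cents_per_unit

gold_v1 = build_gold(bronze, cents_per_unit=10)  # buggy conversion factor
gold_v2 = build_gold(bronze)                     # fixed: rebuilt from Bronze
print(gold_v1, gold_v2)  # 145.0 14.5
```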