What are the limitations of Medallion architecture in very large organizations?
It can lead to data duplication, unclear ownership, and over-layering if not governed properly.
How would you design a Lakehouse for both BI and ML workloads without conflicts?
Share Delta tables as the single source of truth, but isolate workloads on separate, workload-specific compute clusters.
Why is schema-on-read risky in large-scale systems?
It leads to inconsistent interpretations across consumers and late detection of data quality issues.
What happens if your Bronze layer becomes corrupted?
Re-ingestion from the source is required; this highlights the importance of immutability and backups.
How do you enforce consistency across multiple pipelines writing to the same table?
Use Delta transactions + standardized write patterns + governance controls.
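Delta's transactions use optimistic concurrency: each writer commits against the table version it read, and a commit that lands on a stale version is rejected and retried. A minimal pure-Python sketch of that check-and-commit protocol (all class and function names here are illustrative, not Delta's API):

```python
import threading

class VersionedTable:
    """Toy table with optimistic-concurrency commits, mimicking the
    atomic check-and-commit step of the Delta transaction log."""

    def __init__(self):
        self.version = 0
        self.rows = []
        self._lock = threading.Lock()  # stands in for atomic log-file creation

    def read(self):
        return self.version, list(self.rows)

    def commit(self, read_version, new_rows):
        """Succeed only if no other writer committed since read_version."""
        with self._lock:
            if self.version != read_version:
                return False  # conflict: caller must re-read and retry
            self.rows.extend(new_rows)
            self.version += 1
            return True

def write_with_retry(table, new_rows, max_retries=5):
    """Standardized write pattern: read, attempt commit, retry on conflict."""
    for _ in range(max_retries):
        version, _ = table.read()
        if table.commit(version, new_rows):
            return True
    return False

table = VersionedTable()
write_with_retry(table, ["a"])
write_with_retry(table, ["b"])
print(table.version, table.rows)  # 2 ['a', 'b']
```

The key property: two pipelines writing concurrently can never both commit against the same version, so the table always reflects a serializable history.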
What happens internally when a Delta table grows to millions of files?
Metadata overhead increases; query planning slows; requires compaction + checkpointing.
Why can VACUUM be dangerous in production?
It permanently deletes data files outside the retention window, breaking time travel to older versions and removing recovery options.
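The danger is specifically to files that the current version no longer references but old versions still do: once they are past the retention window (Delta's default is 7 days / 168 hours), VACUUM removes them and time travel to those versions fails. A pure-Python model of the selection rule (illustrative, not Delta's implementation):

```python
from datetime import datetime, timedelta

def vacuum_candidates(data_files, retention_hours, now):
    """Return files a cleanup would delete: not referenced by the CURRENT
    table version and older than the retention window. Deleting them
    breaks time travel to any old version that still points at them."""
    cutoff = now - timedelta(hours=retention_hours)
    return [f for f in data_files
            if not f["referenced_by_current"] and f["modified"] < cutoff]

now = datetime(2024, 1, 10, 12, 0)
files = [
    {"path": "part-000.parquet", "referenced_by_current": True,
     "modified": now - timedelta(days=30)},     # safe: still live
    {"path": "part-001.parquet", "referenced_by_current": False,
     "modified": now - timedelta(hours=2)},     # safe: inside retention
    {"path": "part-002.parquet", "referenced_by_current": False,
     "modified": now - timedelta(days=8)},      # deleted: old versions lost
]
doomed = vacuum_candidates(files, retention_hours=168, now=now)
print([f["path"] for f in doomed])  # ['part-002.parquet']
```

This is why shortening the retention below the default requires explicitly disabling Delta's safety check: a too-small window can delete files that concurrent readers or time-travel queries still need.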
How would you debug a corrupted Delta table?
Inspect the _delta_log, identify the last valid version, and restore to it via time travel (RESTORE).
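The _delta_log is just a directory of zero-padded JSON commit files, so the first debugging step, finding the last version whose commit parses cleanly, can be sketched in pure Python (the fake log built below simulates a commit truncated mid-write):

```python
import json, os, tempfile

def last_valid_version(delta_log_dir):
    """Scan a _delta_log directory for commit files (<version>.json,
    zero-padded to 20 digits) and return the highest version whose
    JSON actions all parse cleanly."""
    valid = -1
    for name in sorted(os.listdir(delta_log_dir)):
        if not name.endswith(".json"):
            continue
        try:
            version = int(name[:-5])
            with open(os.path.join(delta_log_dir, name)) as f:
                for line in f:
                    json.loads(line)  # each action is one JSON line
            valid = max(valid, version)
        except (ValueError, json.JSONDecodeError):
            continue  # corrupted or foreign file: skip it
    return valid

# Build a fake log: versions 0-1 are fine, version 2 was truncated mid-write.
log = tempfile.mkdtemp()
for v, body in [(0, '{"commitInfo": {}}'),
                (1, '{"add": {"path": "p"}}'),
                (2, '{"add": {"path":')]:      # truncated commit
    with open(os.path.join(log, f"{v:020d}.json"), "w") as f:
        f.write(body)
print(last_valid_version(log))  # 1
```

Once the last good version is known, the actual recovery in Delta is `RESTORE TABLE t TO VERSION AS OF <n>`, provided VACUUM has not already removed the data files that version references.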
Why might MERGE operations become slow at scale?
Large shuffles, file rewrites, and lack of partition pruning.
What’s the tradeoff between frequent OPTIMIZE vs infrequent OPTIMIZE?
Frequent = better performance but higher cost; infrequent = cheaper but slower queries.
Why does increasing cluster size not always improve performance?
Bottlenecks like skew, shuffle, or I/O may dominate.
What is the impact of too many partitions in a job?
Task overhead increases → slower execution.
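The effect shows up even in a toy cost model: work splits evenly across tasks, tasks run in waves limited by the core count, and every task pays a fixed scheduling/serialization overhead, so past some point extra partitions only add overhead. An illustrative calculation (the overhead figure is an assumption, not a measured Spark constant):

```python
import math

def job_time(work_seconds, num_partitions, cores, per_task_overhead=0.2):
    """Toy model: tasks run in waves of `cores`; every task pays a
    fixed scheduling overhead on top of its slice of the work."""
    waves = math.ceil(num_partitions / cores)
    task_time = work_seconds / num_partitions + per_task_overhead
    return waves * task_time

for n in (8, 64, 512, 4096):
    print(n, round(job_time(work_seconds=100, num_partitions=n, cores=8), 2))
```

With 8 cores, 4096 partitions takes roughly 9x longer than 64 in this model, purely from per-task overhead. Real jobs also need enough partitions to absorb skew and fit tasks in memory, which is why the practical sweet spot is usually a small multiple of the core count rather than the minimum.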
How does Spark decide task distribution across nodes?
Based on the number of input partitions and the available cluster resources (executors and cores).
Why can caching sometimes make jobs slower?
Memory pressure → spills to disk → worse performance.
What is a “stage retry” and why does it happen?
Spark retries failed stages due to task failures or node issues.
Why is “exactly-once” difficult to guarantee in distributed systems?
Network failures, retries, and duplicate events complicate guarantees.
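In practice "exactly-once" is usually built as at-least-once delivery plus an idempotent sink: every event carries a unique ID, and replayed IDs are dropped instead of applied twice. A minimal sketch of that deduplication pattern (function and field names are illustrative):

```python
def process_exactly_once(events, sink, seen_ids):
    """At-least-once delivery + idempotent writes = effectively-once.
    Replays of an already-applied event ID are skipped, not re-applied."""
    for event in events:
        if event["id"] in seen_ids:
            continue  # duplicate from a retry: drop it
        sink.append(event["value"])
        seen_ids.add(event["id"])

sink, seen = [], set()
batch = [{"id": "e1", "value": 10}, {"id": "e2", "value": 20}]
process_exactly_once(batch, sink, seen)
process_exactly_once(batch, sink, seen)  # a network retry redelivers the batch
print(sum(sink))  # 30, not 60
```

The hard part in a distributed system is making the `seen_ids` check and the write atomic and durable together; this is exactly what transactional sinks (e.g., Delta with Structured Streaming's idempotent writes) provide.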
How would you design a streaming pipeline that can handle sudden spikes in data?
Auto-scaling clusters + buffering + backpressure handling.
What happens if checkpoint data is lost?
The pipeline loses its offsets and state → risk of duplicates or data loss on restart.
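A checkpoint is essentially a durable record of how far the pipeline has read; losing it forces the consumer to pick a restart position blind. A toy simulation of resuming from a checkpointed offset, and of what happens when the checkpoint is gone (all names are illustrative):

```python
def run(source, checkpoint, sink):
    """Resume from the checkpointed offset, process, and re-checkpoint."""
    start = checkpoint.get("offset", 0)   # no checkpoint: start from 0
    for offset in range(start, len(source)):
        sink.append(source[offset])
        checkpoint["offset"] = offset + 1

source = ["a", "b", "c", "d"]
sink, ckpt = [], {}
run(source[:2], ckpt, sink)   # first run sees only the first two events
run(source, ckpt, sink)       # restart WITH checkpoint: resumes cleanly
print(sink)                   # ['a', 'b', 'c', 'd'] -- each event once

sink2 = []
run(source, {}, sink2)        # checkpoint lost: replays from offset 0,
print(sink2)                  # so 'a' and 'b' would be processed twice
```

Starting from the latest offset instead of 0 avoids the duplicates but silently drops everything in between, which is why checkpoint storage should be as durable as the data itself.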
Why should streaming pipelines avoid heavy transformations?
Heavy transformations increase per-batch latency and resource usage.
How do you ensure ordering of events in streaming systems?
Use event-time processing + watermarking (with limitations).
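The core idea: buffer out-of-order events and only emit them, sorted by event time, once the watermark (maximum event time seen minus the allowed lateness) has passed them; events arriving behind the watermark are dropped as too late. A pure-Python toy of that mechanism (this simplifies Spark's per-window semantics into a single ordered stream):

```python
def emit_in_order(events, watermark_delay):
    """Emit events in event-time order using a watermark.
    `events` arrive in arrival order as (event_time, value) pairs."""
    buffer, out, dropped = [], [], []
    max_seen = float("-inf")
    for ts, value in events:
        max_seen = max(max_seen, ts)
        watermark = max_seen - watermark_delay
        if ts < watermark:
            dropped.append((ts, value))   # too late: emitting would break order
            continue
        buffer.append((ts, value))
        ready = [e for e in buffer if e[0] <= watermark]
        buffer = [e for e in buffer if e[0] > watermark]
        out.extend(sorted(ready))
    out.extend(sorted(buffer))            # flush remainder at end of stream
    return out, dropped

events = [(1, "a"), (3, "c"), (2, "b"), (9, "e"), (4, "d"), (1, "z")]
ordered, late = emit_in_order(events, watermark_delay=5)
print(ordered)  # [(1,'a'), (2,'b'), (3,'c'), (4,'d'), (9,'e')]
print(late)     # [(1, 'z')] -- arrived behind the watermark
```

The limitation mentioned above is visible here: ordering is only guaranteed within the lateness bound, and anything later than that bound is dropped (or, in real systems, routed to a late-data path). A larger delay tolerates more disorder but increases latency and buffer size.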
How do you design tables for both analytics and operational queries?
Separate workloads or optimize with partitioning + indexing strategies.
Why is denormalization common in analytics systems?
Reduces joins → improves query performance.
What is the risk of over-normalization in a Lakehouse?
Increased joins → poor performance.
How do you handle slowly changing dimensions efficiently?
Use MERGE with versioning logic (e.g., SCD Type 2 with effective/end dates).
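The SCD Type 2 pattern is: when a key's value changes, expire the current row and append a new current row, so full history is preserved. A pure-Python sketch of what a single Delta MERGE would accomplish (row layout and field names are illustrative):

```python
from datetime import date

def scd2_merge(dimension, updates, as_of):
    """SCD Type 2 upsert: expire the current row for each changed key and
    append a new current row. valid_to=None marks the current version."""
    for key, value in updates.items():
        current = next((r for r in dimension
                        if r["key"] == key and r["valid_to"] is None), None)
        if current and current["value"] == value:
            continue                      # unchanged: nothing to do
        if current:
            current["valid_to"] = as_of   # expire the old version
        dimension.append({"key": key, "value": value,
                          "valid_from": as_of, "valid_to": None})

dim = [{"key": "cust1", "value": "NY",
        "valid_from": date(2023, 1, 1), "valid_to": None}]
scd2_merge(dim, {"cust1": "LA", "cust2": "SF"}, as_of=date(2024, 1, 1))
print(len(dim))  # 3: expired NY row, current LA row, current SF row
```

In Delta this is typically one MERGE with `WHEN MATCHED ... UPDATE` to close the old row and an insert for the new version; doing it in a single transaction is what keeps history consistent under concurrent writes.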