Lazy Evaluation
What is lazy evaluation in Spark and why is it important?
Spark does not execute transformations immediately.
It builds a logical execution plan (DAG) and only runs when an action is triggered.
Benefits:
Optimizes execution (combines steps, removes redundancies)
Reduces unnecessary computation
💡 Example: multiple filters can be merged into one execution step.
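The same idea can be illustrated in plain Python (an analogy, not Spark itself): a generator pipeline defines work lazily and nothing executes until a terminal operation pulls results, just as Spark defers transformations until an action.

```python
# Analogy in plain Python (not Spark): generators are also lazy.
calls = []

def expensive(x):
    calls.append(x)          # track when work actually happens
    return x * 2

data = range(5)
# "Transformation": builds the pipeline, computes nothing yet.
pipeline = (expensive(x) for x in data if x % 2 == 0)

assert calls == []           # lazy: no computation so far
result = list(pipeline)      # "action": triggers the whole pipeline
assert result == [0, 4, 8]
assert calls == [0, 2, 4]    # only the surviving rows were computed
```

As in Spark, the filter and the map run in a single pass over the data rather than as separate materialized steps.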
Transformations vs Actions
Transformations (map, filter, join) are lazy and only build up the plan; actions (count, collect, write) trigger execution.
DAG (Directed Acyclic Graph)
What is a DAG in Spark?
A DAG represents the sequence of transformations:
Nodes → operations
Edges → data flow
Spark uses DAG to:
Optimize execution
Determine stages and tasks
💡 Understanding the DAG helps debug performance issues.
Narrow vs Wide Transformations
Narrow: no shuffle (e.g., filter, map)
Wide: requires shuffle (e.g., join, groupBy)
💡 Wide transformations are expensive → major performance bottleneck.
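A plain-Python sketch (partitions modeled as lists, not real Spark) of why map is narrow but groupBy is wide:

```python
# Minimal sketch: narrow vs wide, with partitions as plain lists.
partitions = [[1, 2, 3], [4, 5, 6]]

# Narrow: each output partition depends on exactly one input partition,
# so no data moves between partitions.
mapped = [[x * 10 for x in p] for p in partitions]

# Wide: grouping needs rows with the same key co-located, so every input
# partition may send rows to every output partition (a shuffle).
def shuffle_by_key(parts, num_out, key):
    out = [[] for _ in range(num_out)]
    for p in parts:
        for x in p:
            out[hash(key(x)) % num_out].append(x)  # network transfer in real Spark
    return out

shuffled = shuffle_by_key(mapped, 2, key=lambda x: x // 30)
assert mapped == [[10, 20, 30], [40, 50, 60]]
assert shuffled == [[10, 20, 60], [30, 40, 50]]
```

The nested loop in `shuffle_by_key` is exactly the all-to-all data movement that makes wide transformations costly.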
Shuffle
What is a shuffle and why is it expensive?
Shuffle involves:
Data redistribution across nodes
Disk I/O + network transfer
💡 Causes:
High latency
Memory pressure
Potential failures
How to Reduce Shuffle
How can you reduce shuffle in Spark jobs?
Use broadcast joins
Filter early
Repartition wisely
Avoid unnecessary groupBy
💡 Minimizing shuffle = biggest performance win.
Broadcast Join
When should you use a broadcast join?
When one table is small enough to fit in memory
Benefits:
Avoids shuffle
Faster join execution
💡 Spark automatically broadcasts tables below the configured threshold (spark.sql.autoBroadcastJoinThreshold, 10MB by default).
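A plain-Python sketch of the mechanism. In real PySpark the hint is `large_df.join(F.broadcast(small_df), "id")`; here the small table is simply copied to every "worker" as a dict so the join runs map-side:

```python
# Sketch of a broadcast (map-side) join, with partitions as plain lists.
small_table = {1: "US", 2: "DE"}          # small dimension table, fits in memory
large_partitions = [                       # big fact table, split across workers
    [(1, 100), (2, 200)],
    [(1, 300), (3, 400)],
]

def map_side_join(partition, lookup):
    # Each worker joins locally against its own copy of the small table:
    # no rows move between partitions (no shuffle).
    return [(k, v, lookup[k]) for k, v in partition if k in lookup]

joined = [map_side_join(p, small_table) for p in large_partitions]
assert joined == [[(1, 100, "US"), (2, 200, "DE")], [(1, 300, "US")]]
```

Note the inner-join semantics: key 3 has no match and is dropped, and no partition ever needed rows from another partition.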
Repartition vs Coalesce
What is the difference between repartition and coalesce?
Repartition: full shuffle → evenly distributes data
Coalesce: reduces partitions without full shuffle
💡 Use:
Repartition → increase or rebalance
Coalesce → reduce partitions efficiently
Partitioning Strategy
How do you choose a good partitioning strategy?
Use columns frequently used in filters
Prefer low-cardinality columns (e.g., date)
Avoid:
High-cardinality → too many partitions
💡 Poor partitioning = slow queries + small file problem.
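Why cardinality matters: partitioning writes one directory per distinct value (Hive-style layout), so the distinct count of the column directly becomes the directory/file count. A plain-Python sketch with hypothetical columns:

```python
# Sketch: partition-column cardinality controls the number of directories.
rows = [
    {"event_date": "2024-01-01", "user_id": "u1"},
    {"event_date": "2024-01-01", "user_id": "u2"},
    {"event_date": "2024-01-02", "user_id": "u3"},
]

def partition_dirs(rows, column):
    # One Hive-style directory per distinct value of the partition column.
    return sorted({f"{column}={r[column]}" for r in rows})

# Low cardinality (date): few directories, and filters on event_date
# can prune whole directories.
assert partition_dirs(rows, "event_date") == [
    "event_date=2024-01-01", "event_date=2024-01-02",
]

# High cardinality (user_id): one directory per user -> small-file problem.
assert len(partition_dirs(rows, "user_id")) == 3
```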
Skewed Data
What is data skew and why is it a problem?
When some partitions have much more data than others:
One task becomes slow
Others finish early
💡 Leads to stragglers → job slowdown
Handling Data Skew
How do you handle skewed data in Spark?
Salting keys
Using skew join optimization
Repartitioning
Filtering out heavy keys
💡 Skew is a very common real-world issue.
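Salting is the least obvious technique of the four, so here is a plain-Python sketch: a hot key gets a random suffix so its rows spread over several partitions, at the cost of a second aggregation step that strips the salt and merges partial results.

```python
# Sketch of key salting for a skewed aggregation.
import random

rows = [("A", 1)] * 6 + [("B", 1)] * 2   # "A" is the skewed (hot) key
NUM_SALTS = 3

def salted_key(key):
    # "A" becomes "A#0", "A#1" or "A#2", which hash to different partitions.
    return f"{key}#{random.randrange(NUM_SALTS)}"

salted = [(salted_key(k), v) for k, v in rows]

# Stage 1: aggregate per salted key (runs in parallel per partition).
partial = {}
for k, v in salted:
    partial[k] = partial.get(k, 0) + v

# Stage 2: strip the salt and combine the (few, small) partial results.
final = {}
for k, v in partial.items():
    base = k.split("#")[0]
    final[base] = final.get(base, 0) + v

assert final == {"A": 6, "B": 2}
```

The final answer is identical to the unsalted aggregation; only the distribution of work changed.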
Caching
When should you cache data in Spark?
Cache when:
Data is reused multiple times
Computation is expensive
Avoid when:
Data is used once
Memory is limited
💡 Over-caching can degrade performance.
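An analogy in plain Python (memoization standing in for `df.cache()`): caching pays off only when the result is actually reused.

```python
# Analogy: functools.lru_cache plays the role of df.cache().
from functools import lru_cache

compute_calls = 0

@lru_cache(maxsize=None)
def expensive_transform(x):
    global compute_calls
    compute_calls += 1       # stands in for an expensive recomputation
    return x * x

# Reused twice: the second access is free, like a cached DataFrame.
expensive_transform(10)
expensive_transform(10)
assert compute_calls == 1    # computed once, served from cache once

# Used once: caching bought nothing but memory -- the "avoid" case above.
expensive_transform(11)
assert compute_calls == 2
```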
Memory Management
What happens if Spark runs out of memory?
Tasks fail (OOM errors)
Jobs may retry or crash
💡 Causes:
Large shuffles
Too much caching
Skewed partitions
File Size Impact
How does file size impact Spark performance?
Too small → too many tasks (overhead)
Too large → less parallelism
💡 Optimal size ≈ 128MB per file.
Adaptive Query Execution (AQE)
What is Adaptive Query Execution (AQE)?
Spark dynamically adjusts execution at runtime:
Changes join strategy
Coalesces partitions
Handles skew
💡 Improves performance without manual tuning.
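The relevant Spark 3.x settings (a config sketch; `spark` is an existing SparkSession, and AQE is already on by default in recent versions):

```python
# AQE configuration (Spark 3.x).
spark.conf.set("spark.sql.adaptive.enabled", "true")                     # master switch
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions at runtime
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions during joins
```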
Join Strategies
What join strategies does Spark use?
Broadcast join
Sort-merge join
Shuffle hash join
💡 Spark chooses based on data size and configuration.
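The size-based part of that choice is configurable (a config sketch; `spark` is an existing SparkSession):

```python
# Tables below this threshold are auto-broadcast; default is 10MB.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))  # raise to 50MB
# Setting it to "-1" disables automatic broadcasting entirely.
```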
Sort-Merge Join
When does Spark use sort-merge join?
Large datasets
No broadcast possible
💡 Requires shuffle + sorting → expensive.
Jobs Become Slow Over Time
Why does a Spark job that was fast initially become slow later?
Common reasons:
Data volume growth
Increasing small files
Data skew
Poor partitioning
💡 Pipelines must be continuously optimized.
Filter Pushdown
What is filter pushdown?
Filters are applied at the data source level:
Reduces data read
Improves performance
💡 Especially effective with Parquet/Delta.
Predicate Pushdown vs Data Skipping
What is the difference between predicate pushdown and data skipping?
Predicate pushdown → filter at storage layer
Data skipping → skip files based on metadata
💡 Both reduce I/O but operate differently.
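Data skipping can be sketched in plain Python: per-file min/max statistics (as kept in Parquet footers and Delta metadata) let the reader skip files whose value range cannot match the filter. File names and stats below are made up for illustration.

```python
# Sketch: skip files using per-file min/max statistics.
files = {
    "part-0": {"min": 1,   "max": 100},
    "part-1": {"min": 101, "max": 200},
    "part-2": {"min": 201, "max": 300},
}

def files_to_read(stats, lo, hi):
    # Keep only files whose [min, max] range overlaps the filter range.
    return [f for f, s in stats.items() if s["max"] >= lo and s["min"] <= hi]

# Filter: value BETWEEN 150 AND 180 -> only one of three files is read.
assert files_to_read(files, 150, 180) == ["part-1"]
assert files_to_read(files, 50, 250) == ["part-0", "part-1", "part-2"]
```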
Execution Plan Debugging
How do you debug a slow Spark query?
Check execution plan (explain())
Look for shuffles
Identify skew
Analyze stages/tasks
💡 This is a must-have interview skill.
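The typical first step in code (a sketch; `df` is an existing DataFrame):

```python
# Print the query plan; "Exchange" nodes in the output are shuffles.
df.explain("formatted")
```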
Stages vs Tasks
What is the difference between stages and tasks?
Stage → group of operations without shuffle
Task → unit of work per partition
💡 More partitions = more tasks.
Parallelism
What determines parallelism in Spark?
Number of partitions
Cluster resources
💡 Too few partitions → underutilization
💡 Too many → overhead
When NOT to Repartition
When should you avoid repartitioning?
When data is already well distributed
When unnecessary shuffle would be introduced
💡 Repartitioning blindly = performance degradation.
End-to-End Optimization Thinking
How do you approach optimizing a slow Databricks pipeline?
Step-by-step:
Check data size & growth
Identify shuffles
Optimize joins (broadcast if possible)
Fix partitioning
Compact files (OPTIMIZE)
Enable AQE
💡 Always optimize the biggest bottleneck first.