Databricks Interview Prep - Spark & Performance Optimization Flashcards

(24 cards)

1
Q

Lazy Evaluation

What is lazy evaluation in Spark and why is it important?

A

Spark does not execute transformations immediately.
It builds a logical execution plan (DAG) and only runs when an action is triggered.
Benefits:
Optimizes execution (combines steps, removes redundancies)
Reduces unnecessary computation

👉 Example: multiple filters can be merged into one execution step.

2
Q

DAG (Directed Acyclic Graph)

What is a DAG in Spark?

A

A DAG (Directed Acyclic Graph) represents the sequence of transformations:
Nodes → operations
Edges → data flow
Spark uses the DAG to:
Optimize execution
Determine stages and tasks

👉 Understanding the DAG helps debug performance issues.

3
Q

Narrow vs Wide Transformations

What is the difference between narrow and wide transformations?

A

Narrow: each output partition depends on a single input partition; no shuffle (e.g., filter, map)
Wide: output partitions depend on many input partitions; requires shuffle (e.g., join, groupBy)
👉 Wide transformations are expensive → a major performance bottleneck.

4
Q

What is Shuffle?

What is Shuffle?

A

Shuffle involves:
Data redistribution across nodes
Disk I/O + network transfer
👉 Consequences:
High latency
Memory pressure
Potential failures

5
Q

How to Reduce Shuffle

How can you reduce shuffle in Spark jobs?

A

Use broadcast joins
Filter early
Repartition wisely
Avoid unnecessary groupBy
👉 Minimizing shuffle = biggest performance win.
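A sketch of "filter early" (the `events` and `users` DataFrames and their columns are assumed examples): shrinking the large side before the join means far fewer rows cross the shuffle boundary.

```python
from pyspark.sql import functions as F

# `events` (large) and `users` are assumed example DataFrames.
recent = events.filter(F.col("event_date") >= "2024-01-01")  # shrink first
joined = recent.join(users, "user_id")                       # smaller shuffle
```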

6
Q

Broadcast Join

When should you use a broadcast join?

A

When one table is small enough to fit in each executor's memory
Benefits:
Avoids shuffle
Faster join execution

👉 Spark automatically broadcasts small tables below spark.sql.autoBroadcastJoinThreshold (default 10 MB).
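A hedged sketch (the `facts` and `dims` DataFrames and the join key are illustrative) showing both the explicit hint and the auto-broadcast threshold:

```python
from pyspark.sql.functions import broadcast

# `facts` (large) and `dims` (small) are assumed example DataFrames.
joined = facts.join(broadcast(dims), "dim_id")   # explicit broadcast hint

# Auto-broadcast threshold in bytes (default 10 MB); set to -1 to disable:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
```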

7
Q

Repartition vs Coalesce

What is the difference between repartition and coalesce?

A

Repartition: full shuffle → evenly distributes data
Coalesce: reduces partitions without a full shuffle
👉 Use:
Repartition → increase or rebalance partitions
Coalesce → reduce partitions efficiently
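A minimal sketch (partition counts and the `customer_id` column are illustrative choices, not fixed rules):

```python
# Full shuffle: rebalance into 200 partitions hashed by customer_id
df_rebalanced = df.repartition(200, "customer_id")

# No full shuffle: merge existing partitions down to 10 (can only reduce)
df_compact = df.coalesce(10)
```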

8
Q

Partitioning Strategy

How do you choose a good partitioning strategy?

A

Use columns frequently used in filters
Prefer low-cardinality columns (e.g., date)
Avoid:
High-cardinality columns → too many partitions
👉 Poor partitioning = slow queries + the small-file problem.
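A write-side sketch (the column name and output path are illustrative): partitioning output by a low-cardinality column that queries commonly filter on.

```python
# Partition output by a low-cardinality column used in query filters
(df.write
   .partitionBy("event_date")
   .mode("overwrite")
   .parquet("/tmp/events"))
```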

9
Q

Skewed Data

What is data skew and why is it a problem?

A

When some partitions have much more data than others:
One task becomes slow
Others finish early

👉 Leads to stragglers → job slowdown

10
Q

Handling Data Skew

How do you handle skewed data in Spark?

A

Salting keys
Using skew join optimization
Repartitioning
Filtering out heavy keys

👉 Skew is a very common real-world issue.
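A salting sketch (the `large_df`/`small_df` DataFrames, the `key` column, and the bucket count are all assumptions): the skewed side gets a random salt appended to its key, and the small side is replicated once per salt bucket so every salted key still finds a match.

```python
from pyspark.sql import functions as F

N = 8  # number of salt buckets: a tuning choice, not a fixed rule

# Large, skewed side: append a random bucket id to the join key
large_salted = large_df.withColumn(
    "salt", (F.rand() * N).cast("int")
).withColumn("salted_key", F.concat_ws("_", "key", "salt"))

# Small side: replicate each row once per bucket so every salt matches
buckets = spark.range(N).withColumnRenamed("id", "salt")
small_salted = small_df.crossJoin(buckets).withColumn(
    "salted_key", F.concat_ws("_", "key", "salt")
)

joined = large_salted.join(small_salted, "salted_key")
```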

11
Q

Caching

When should you cache data in Spark?

A

Cache when:
Data is reused multiple times
Computation is expensive
Avoid when:
Data is used once
Memory is limited

👉 Over-caching can degrade performance.
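A minimal caching sketch (the aggregation is a stand-in for any expensive, reused computation):

```python
expensive = df.groupBy("key").count()  # stand-in for a costly computation
expensive.cache()       # lazy marker: nothing materialized yet
expensive.count()       # first action populates the cache
expensive.show(5)       # subsequent actions are served from the cache
expensive.unpersist()   # release memory once the data is no longer reused
```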

12
Q

Memory Management

What happens if Spark runs out of memory?

A

Tasks fail (OOM errors)
Jobs may retry or crash
👉 Causes:
Large shuffles
Too much caching
Skewed partitions

13
Q

File Size Impact

How does file size impact Spark performance?

A

Too small → too many tasks (overhead)
Too large → less parallelism
👉 Optimal size ≈ 128 MB per file.

14
Q

Adaptive Query Execution (AQE)

What is Adaptive Query Execution (AQE)?

A

Spark dynamically adjusts execution at runtime:
Changes join strategy
Coalesces partitions
Handles skew

👉 Improves performance without manual tuning.
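The relevant settings (Spark 3.x; AQE is enabled by default since Spark 3.2):

```python
# Adaptive Query Execution: runtime re-optimization
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```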

15
Q

Join Strategies

What join strategies does Spark use?

A

Broadcast join
Sort-merge join
Shuffle hash join
👉 Spark chooses based on data size and configuration.

16
Q

Sort-Merge Join

When does Spark use sort-merge join?

A

Large datasets
No broadcast possible
👉 Requires shuffle + sorting → expensive.

17
Q

Jobs Become Slow Over Time

Why does a Spark job that was fast initially become slow later?

A

Common reasons:
Data volume growth
Increasing small files
Data skew
Poor partitioning

👉 Pipelines must be continuously optimized.

18
Q

Filter Pushdown

What is filter pushdown?

A

Filters are applied at the data source level:
Reduces data read
Improves performance
👉 Especially effective with Parquet/Delta.
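A quick way to verify pushdown (the path and column are illustrative): with a Parquet source, the scan node of the plan lists the pushed predicates.

```python
from pyspark.sql import functions as F

df = spark.read.parquet("/tmp/events")                   # path is illustrative
recent = df.filter(F.col("event_date") >= "2024-01-01")
recent.explain()  # the Parquet scan node should show PushedFilters: [...]
```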

19
Q

Predicate Pushdown vs Data Skipping

What is the difference between predicate pushdown and data skipping?

A

Predicate pushdown → filter at the storage layer
Data skipping → skip files based on metadata
👉 Both reduce I/O but operate differently.

20
Q

Execution Plan Debugging

How do you debug a slow Spark query?

A

Check execution plan (explain())
Look for shuffles
Identify skew
Analyze stages/tasks
👉 This is a must-have interview skill.
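The starting point in practice (a sketch; `df` stands for the slow query's DataFrame):

```python
df.explain(mode="formatted")  # Spark 3+: numbered, readable physical plan
# Each "Exchange" operator in the output is a shuffle boundary;
# then compare stage/task durations in the Spark UI to spot skew.
```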

21
Q

Stages vs Tasks

What is the difference between stages and tasks?

A

Stage → group of operations without a shuffle
Task → unit of work per partition
👉 More partitions = more tasks.

22
Q

Parallelism

What determines parallelism in Spark?

A

Number of partitions
Cluster resources
👉 Too few partitions → underutilization
👉 Too many → overhead
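The two knobs in practice (a sketch; `df` is any example DataFrame):

```python
# Partition count after wide transformations (shuffles); default is 200
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Inspect the current partition count of a DataFrame
print(df.rdd.getNumPartitions())
```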

23
Q

When NOT to Repartition

When should you avoid repartitioning?

A

When data is already well distributed
When unnecessary shuffle would be introduced
👉 Repartitioning blindly = performance degradation.

24
Q

End-to-End Optimization Thinking

How do you approach optimizing a slow Databricks pipeline?

A

Step-by-step:
Check data size & growth
Identify shuffles
Optimize joins (broadcast if possible)
Fix partitioning
Compact files (OPTIMIZE)
Enable AQE
👉 Always optimize the biggest bottleneck first.
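The file-compaction step as a sketch (Delta Lake / Databricks SQL; the table and column names are illustrative):

```python
# Compact small files and co-locate rows on a frequently filtered column
spark.sql("OPTIMIZE my_table ZORDER BY (customer_id)")
```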