Lazy Evaluation
What is lazy evaluation in Spark and why is it important?
Spark does not execute transformations immediately.
It builds a logical execution plan (DAG) and only runs when an action is triggered.
Benefits:
Optimizes execution (combines steps, removes redundancies)
Reduces unnecessary computation
💡 Example: multiple filters can be merged into one execution step.
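The same idea can be illustrated in plain Python (an analogy, not Spark itself): a generator pipeline defines work lazily and nothing executes until a terminal operation pulls results, just as Spark defers transformations until an action.

```python
# Analogy in plain Python (not Spark): generators are also lazy.
calls = []

def expensive(x):
    calls.append(x)          # track when work actually happens
    return x * 2

data = range(5)
# "Transformation": builds the pipeline, computes nothing yet.
pipeline = (expensive(x) for x in data if x % 2 == 0)

assert calls == []           # lazy: no computation so far
result = list(pipeline)      # "action": triggers the whole pipeline
assert result == [0, 4, 8]
assert calls == [0, 2, 4]    # only the surviving rows were computed
```

As in Spark, the filter and the map run in a single pass over the data rather than as separate materialized steps.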
Transformations vs Actions
Transformations (map, filter, join) are lazy and only build up the plan; actions (count, collect, write) trigger execution.
DAG (Directed Acyclic Graph)
What is a DAG in Spark?
A DAG represents the sequence of transformations:
Nodes → operations
Edges → data flow
Spark uses DAG to:
Optimize execution
Determine stages and tasks
💡 Understanding the DAG helps debug performance issues.
Narrow vs Wide Transformations
Narrow: no shuffle (e.g., filter, map)
Wide: requires shuffle (e.g., join, groupBy)
💡 Wide transformations are expensive → major performance bottleneck.
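A plain-Python sketch (partitions modeled as lists, not real Spark) of why map is narrow but groupBy is wide:

```python
# Minimal sketch: narrow vs wide, with partitions as plain lists.
partitions = [[1, 2, 3], [4, 5, 6]]

# Narrow: each output partition depends on exactly one input partition,
# so no data moves between partitions.
mapped = [[x * 10 for x in p] for p in partitions]

# Wide: grouping needs rows with the same key co-located, so every input
# partition may send rows to every output partition (a shuffle).
def shuffle_by_key(parts, num_out, key):
    out = [[] for _ in range(num_out)]
    for p in parts:
        for x in p:
            out[hash(key(x)) % num_out].append(x)  # network transfer in real Spark
    return out

shuffled = shuffle_by_key(mapped, 2, key=lambda x: x // 30)
assert mapped == [[10, 20, 30], [40, 50, 60]]
assert shuffled == [[10, 20, 60], [30, 40, 50]]
```

The nested loop in `shuffle_by_key` is exactly the all-to-all data movement that makes wide transformations costly.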
Shuffle
What is a shuffle and why is it expensive?
Shuffle involves:
Data redistribution across nodes
Disk I/O + network transfer
💡 Causes:
High latency
Memory pressure
Potential failures
How to Reduce Shuffle
How can you reduce shuffle in Spark jobs?
Use broadcast joins
Filter early
Repartition wisely
Avoid unnecessary groupBy
💡 Minimizing shuffle = biggest performance win.
Broadcast Join
When should you use a broadcast join?
When one table is small enough to fit in memory
Benefits:
Avoids shuffle
Faster join execution
💡 Spark automatically broadcasts tables below the configured threshold (spark.sql.autoBroadcastJoinThreshold, 10MB by default).
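A plain-Python sketch of the mechanism. In real PySpark the hint is `large_df.join(F.broadcast(small_df), "id")`; here the small table is simply copied to every "worker" as a dict so the join runs map-side:

```python
# Sketch of a broadcast (map-side) join, with partitions as plain lists.
small_table = {1: "US", 2: "DE"}          # small dimension table, fits in memory
large_partitions = [                       # big fact table, split across workers
    [(1, 100), (2, 200)],
    [(1, 300), (3, 400)],
]

def map_side_join(partition, lookup):
    # Each worker joins locally against its own copy of the small table:
    # no rows move between partitions (no shuffle).
    return [(k, v, lookup[k]) for k, v in partition if k in lookup]

joined = [map_side_join(p, small_table) for p in large_partitions]
assert joined == [[(1, 100, "US"), (2, 200, "DE")], [(1, 300, "US")]]
```

Note the inner-join semantics: key 3 has no match and is dropped, and no partition ever needed rows from another partition.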
Repartition vs Coalesce
What is the difference between repartition and coalesce?
Repartition: full shuffle → evenly distributes data
Coalesce: reduces partitions without full shuffle
💡 Use:
Repartition → increase or rebalance
Coalesce → reduce partitions efficiently
Partitioning Strategy
How do you choose a good partitioning strategy?
Use columns frequently used in filters
Prefer low-cardinality columns (e.g., date)
Avoid:
High-cardinality → too many partitions
💡 Poor partitioning = slow queries + small file problem.
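Why cardinality matters: partitioning writes one directory per distinct value (Hive-style layout), so the distinct count of the column directly becomes the directory/file count. A plain-Python sketch with hypothetical columns:

```python
# Sketch: partition-column cardinality controls the number of directories.
rows = [
    {"event_date": "2024-01-01", "user_id": "u1"},
    {"event_date": "2024-01-01", "user_id": "u2"},
    {"event_date": "2024-01-02", "user_id": "u3"},
]

def partition_dirs(rows, column):
    # One Hive-style directory per distinct value of the partition column.
    return sorted({f"{column}={r[column]}" for r in rows})

# Low cardinality (date): few directories, and filters on event_date
# can prune whole directories.
assert partition_dirs(rows, "event_date") == [
    "event_date=2024-01-01", "event_date=2024-01-02",
]

# High cardinality (user_id): one directory per user -> small-file problem.
assert len(partition_dirs(rows, "user_id")) == 3
```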
Skewed Data
What is data skew and why is it a problem?
When some partitions have much more data than others:
One task becomes slow
Others finish early
💡 Leads to stragglers → job slowdown
Handling Data Skew
How do you handle skewed data in Spark?
Salting keys
Using skew join optimization
Repartitioning
Filtering out heavy keys
💡 Skew is a very common real-world issue.
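Salting is the least obvious technique of the four, so here is a plain-Python sketch: a hot key gets a random suffix so its rows spread over several partitions, at the cost of a second aggregation step that strips the salt and merges partial results.

```python
# Sketch of key salting for a skewed aggregation.
import random

rows = [("A", 1)] * 6 + [("B", 1)] * 2   # "A" is the skewed (hot) key
NUM_SALTS = 3

def salted_key(key):
    # "A" becomes "A#0", "A#1" or "A#2", which hash to different partitions.
    return f"{key}#{random.randrange(NUM_SALTS)}"

salted = [(salted_key(k), v) for k, v in rows]

# Stage 1: aggregate per salted key (runs in parallel per partition).
partial = {}
for k, v in salted:
    partial[k] = partial.get(k, 0) + v

# Stage 2: strip the salt and combine the (few, small) partial results.
final = {}
for k, v in partial.items():
    base = k.split("#")[0]
    final[base] = final.get(base, 0) + v

assert final == {"A": 6, "B": 2}
```

The final answer is identical to the unsalted aggregation; only the distribution of work changed.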
Caching
When should you cache data in Spark?
Cache when:
Data is reused multiple times
Computation is expensive
Avoid when:
Data is used once
Memory is limited
💡 Over-caching can degrade performance.
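An analogy in plain Python (memoization standing in for `df.cache()`): caching pays off only when the result is actually reused.

```python
# Analogy: functools.lru_cache plays the role of df.cache().
from functools import lru_cache

compute_calls = 0

@lru_cache(maxsize=None)
def expensive_transform(x):
    global compute_calls
    compute_calls += 1       # stands in for an expensive recomputation
    return x * x

# Reused twice: the second access is free, like a cached DataFrame.
expensive_transform(10)
expensive_transform(10)
assert compute_calls == 1    # computed once, served from cache once

# Used once: caching bought nothing but memory -- the "avoid" case above.
expensive_transform(11)
assert compute_calls == 2
```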
Memory Management
What happens if Spark runs out of memory?
Tasks fail (OOM errors)
Jobs may retry or crash
💡 Causes:
Large shuffles
Too much caching
Skewed partitions
File Size Impact
How does file size impact Spark performance?
Too small → too many tasks (overhead)
Too large → less parallelism
💡 Optimal size ≈ 128MB per file.
Adaptive Query Execution (AQE)
What is Adaptive Query Execution (AQE)?
Spark dynamically adjusts execution at runtime:
Changes join strategy
Coalesces partitions
Handles skew
💡 Improves performance without manual tuning.
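The relevant Spark 3.x settings (a config sketch; `spark` is an existing SparkSession, and AQE is already on by default in recent versions):

```python
# AQE configuration (Spark 3.x).
spark.conf.set("spark.sql.adaptive.enabled", "true")                     # master switch
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions at runtime
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions during joins
```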
Join Strategies
What join strategies does Spark use?
Broadcast join
Sort-merge join
Shuffle hash join
💡 Spark chooses based on data size and configuration.
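The size-based part of that choice is configurable (a config sketch; `spark` is an existing SparkSession):

```python
# Tables below this threshold are auto-broadcast; default is 10MB.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))  # raise to 50MB
# Setting it to "-1" disables automatic broadcasting entirely.
```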
Sort-Merge Join
When does Spark use sort-merge join?
Large datasets
No broadcast possible
💡 Requires shuffle + sorting → expensive.
Jobs Become Slow Over Time
Why does a Spark job that was fast initially become slow later?
Common reasons:
Data volume growth
Increasing small files
Data skew
Poor partitioning
💡 Pipelines must be continuously optimized.
Filter Pushdown
What is filter pushdown?
Filters are applied at the data source level:
Reduces data read
Improves performance
💡 Especially effective with Parquet/Delta.
Predicate Pushdown vs Data Skipping
What is the difference between predicate pushdown and data skipping?
Predicate pushdown → filter at storage layer
Data skipping → skip files based on metadata
💡 Both reduce I/O but operate differently.
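Data skipping can be sketched in plain Python: per-file min/max statistics (as kept in Parquet footers and Delta metadata) let the reader skip files whose value range cannot match the filter. File names and stats below are made up for illustration.

```python
# Sketch: skip files using per-file min/max statistics.
files = {
    "part-0": {"min": 1,   "max": 100},
    "part-1": {"min": 101, "max": 200},
    "part-2": {"min": 201, "max": 300},
}

def files_to_read(stats, lo, hi):
    # Keep only files whose [min, max] range overlaps the filter range.
    return [f for f, s in stats.items() if s["max"] >= lo and s["min"] <= hi]

# Filter: value BETWEEN 150 AND 180 -> only one of three files is read.
assert files_to_read(files, 150, 180) == ["part-1"]
assert files_to_read(files, 50, 250) == ["part-0", "part-1", "part-2"]
```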
Execution Plan Debugging
How do you debug a slow Spark query?
Check execution plan (explain())
Look for shuffles
Identify skew
Analyze stages/tasks
💡 This is a must-have interview skill.
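The typical first step in code (a sketch; `df` is an existing DataFrame):

```python
# Print the query plan; "Exchange" nodes in the output are shuffles.
df.explain("formatted")
```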
Stages vs Tasks
What is the difference between stages and tasks?
Stage → group of operations without shuffle
Task → unit of work per partition
💡 More partitions = more tasks.
Parallelism
What determines parallelism in Spark?
Number of partitions
Cluster resources
💡 Too few partitions → underutilization
💡 Too many → overhead
When NOT to Repartition
When should you avoid repartitioning?
When data is already well distributed
When unnecessary shuffle would be introduced
💡 Repartitioning blindly = performance degradation.
End-to-End Optimization Thinking
How do you approach optimizing a slow Databricks pipeline?
Step-by-step:
Check data size & growth
Identify shuffles
Optimize joins (broadcast if possible)
Fix partitioning
Compact files (OPTIMIZE)
Enable AQE
💡 Always optimize the biggest bottleneck first.