Databricks Runtime & Performance Tuning Flashcards

(31 cards)

1
Q

What is Databricks Runtime (DBR)?

A

A curated, optimized Spark distribution provided by Databricks that bundles specific Spark versions, libraries, and performance improvements.

2
Q

Why should you pin a specific DBR version instead of always using the latest?

A

Pinning ensures reproducibility and stability; upgrading automatically to the latest version can introduce unexpected behavior changes or incompatibilities.
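
In practice the version is pinned in the cluster definition; a minimal fragment using Databricks Clusters API field names (the version, node type, and worker count shown are only examples):

```json
{
  "cluster_name": "etl-nightly",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 4
}
```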

3
Q

What is Photon in Databricks?

A

A next-generation query engine built in C++ that accelerates SQL and DataFrame workloads, particularly for Delta and Parquet data.

4
Q

When can enabling Photon be beneficial?

A

For SQL-heavy, scan-heavy, and aggregation-heavy workloads where you want lower latency and better throughput on supported operations.
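
Photon is selected per cluster rather than in code; in the Clusters API this is the `runtime_engine` field (values shown are illustrative):

```json
{
  "spark_version": "13.3.x-scala2.12",
  "runtime_engine": "PHOTON"
}
```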

5
Q

Why is it important to match DBR type (e.g., ‘Standard’ vs ‘Photon’) to workload characteristics?

A

Some workloads benefit more from Photon and vectorized execution, while others may rely on libraries or features not fully optimized by Photon.

6
Q

What is adaptive query execution (AQE) in Spark/Databricks?

A

A runtime optimization that adjusts query execution plans on the fly based on observed statistics, such as changing join strategies or shuffle partitions.

7
Q

How can AQE improve performance?

A

By coalescing small shuffle partitions, dynamically switching join types (e.g., to broadcast), and handling skew more intelligently at runtime.
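
These behaviors map to a handful of Spark SQL settings; a configuration fragment (AQE is on by default in recent DBR releases, so this is mostly useful for being explicit or re-enabling it):

```python
spark.conf.set("spark.sql.adaptive.enabled", "true")                     # AQE master switch
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge tiny shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed join partitions
```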

8
Q

Why is input file layout important for performance on Databricks?

A

Poorly organized data can lead to many small files, skewed partitions, and inefficient scans, while well-laid-out Delta/Parquet data enables skipping and parallelism.

9
Q

What are some Delta-specific optimizations that affect performance?

A

OPTIMIZE to compact small files, ZORDER BY for better clustering, and proper partitioning to enable partition pruning.
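
A sketch of these commands, assuming a Delta table named `events` (the table name, ZORDER column, and partition column are illustrative):

```sql
-- Compact small files and co-locate rows with similar user_id values
OPTIMIZE events ZORDER BY (user_id);

-- If events is partitioned by event_date, this query prunes to one partition
SELECT count(*) FROM events WHERE event_date = '2024-01-01';
```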

10
Q

What is data skipping in Delta and why does it matter for performance?

A

Data skipping uses file-level statistics (e.g., min/max per column) to avoid scanning files that cannot satisfy filter predicates, reducing I/O.
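
A toy illustration in plain Python (not the Delta implementation): each "file" carries per-column min/max stats, and for a predicate like `value > 50`, any file whose max is at most 50 can be skipped without any I/O.

```python
# Hypothetical file-level statistics, as Delta records them per data file.
files = [
    {"path": "part-0.parquet", "min": 0,  "max": 40},
    {"path": "part-1.parquet", "min": 35, "max": 80},
    {"path": "part-2.parquet", "min": 75, "max": 120},
]

def files_to_scan(files, lower_bound):
    """Return only the files whose stats can satisfy `value > lower_bound`."""
    return [f["path"] for f in files if f["max"] > lower_bound]

print(files_to_scan(files, 50))  # part-0.parquet is skipped entirely
```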

11
Q

How does choosing partition columns affect performance in Databricks?

A

Good partition keys align with common filters and produce balanced partition sizes; poor keys can create hotspots or many small files.

12
Q

What is spark.sql.shuffle.partitions and why is it important?

A

A configuration controlling the default number of shuffle partitions; tuning it helps avoid too many tiny partitions or too few large ones.
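
A rough sizing heuristic (an assumption for illustration, not an official formula) is to target roughly 100–200 MB of shuffle data per partition:

```python
def suggested_shuffle_partitions(shuffle_bytes: int,
                                 target_partition_bytes: int = 128 * 1024**2) -> int:
    """Ceiling-divide total shuffle size by a target per-partition size."""
    return max(1, -(-shuffle_bytes // target_partition_bytes))

# ~50 GiB of shuffle data at ~128 MiB per partition -> 400 partitions,
# which you would then apply via
# spark.conf.set("spark.sql.shuffle.partitions", "400")
print(suggested_shuffle_partitions(50 * 1024**3))  # 400
```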

13
Q

When might you reduce spark.sql.shuffle.partitions from its default?

A

When your dataset is modest or your cluster is small, to avoid overhead from scheduling an excessive number of tiny tasks.

14
Q

When might you increase spark.sql.shuffle.partitions?

A

When processing very large datasets on large clusters to provide enough parallelism and avoid huge single partitions.

15
Q

How do broadcast joins help performance in Databricks?

A

They replicate small tables to all executors so that large tables can be joined locally, avoiding expensive shuffles of both sides.

16
Q

What configuration can influence when Spark chooses broadcast joins?

A

spark.sql.autoBroadcastJoinThreshold, which sets the maximum size of a table that Spark will consider broadcasting automatically.
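
A sketch assuming a notebook with DataFrames `facts` and `dims` (both names illustrative):

```python
# Default threshold is 10 MB; raising it (here to 50 MB) lets larger
# dimension tables qualify for automatic broadcast. -1 disables it.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

# A broadcast can also be requested explicitly, regardless of the threshold:
from pyspark.sql.functions import broadcast
result = facts.join(broadcast(dims), "key")
```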

17
Q

Why is SELECT * a performance anti-pattern in production?

A

It reads all columns even when only a subset is needed, increasing I/O, CPU, and network usage and making schema evolution riskier.

18
Q

What is column pruning and how does it work in Spark SQL?

A

An optimization where the engine reads only the columns required by the query, skipping unnecessary data in columnar formats.

19
Q

Why is predicate pushdown important for performance?

A

It pushes filter operations down to the data scan layer so fewer rows are read from storage, reducing I/O and compute costs.
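
Both column pruning and predicate pushdown can be observed in a query plan; a sketch assuming a Delta source at an illustrative path:

```python
# Only user_id and amount are read (column pruning), and the amount
# predicate should appear as a pushed filter in the scan node of the plan.
df = spark.read.format("delta").load("/mnt/data/events")
df.select("user_id", "amount").filter("amount > 100").explain()
```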

20
Q

How does caching interact with Databricks Runtime performance?

A

Caching frequently reused DataFrames can greatly speed up repeated queries, but over-caching can exhaust memory and cause spills.

21
Q

What is a good rule of thumb for caching in Databricks?

A

Cache only intermediate results that are reused multiple times, and unpersist them when they are no longer needed.
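
A sketch assuming an existing DataFrame `df` and a dimension table `dims` (names illustrative):

```python
enriched = df.join(dims, "user_id")   # expensive and reused below
enriched.cache()                      # materialized lazily, on the first action

by_country = enriched.groupBy("country").count()
recent = enriched.filter("event_date >= '2024-01-01'")

enriched.unpersist()                  # release executor memory once both are done
```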

22
Q

What are common signs that a job is I/O-bound on Databricks?

A

High scan times, low CPU utilization, lots of time spent in ‘scan’ stages, and performance improving mainly with better layout or pruning.

23
Q

What are common signs that a job is CPU-bound?

A

High CPU utilization, long-running compute-heavy stages (joins, aggregations), and limited improvement from changing storage layout.

24
Q

How can you identify skew issues using the Spark UI?

A

By looking for stages where a few tasks take much longer and process much more data than others, indicating uneven partition sizes.

25
Q

What runtime strategies help mitigate data skew?

A

Key salting, pre-aggregation, using broadcast joins for small tables, and redesigning join keys or partitioning schemes.

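Salting can be illustrated without Spark at all; a toy Python sketch in which one hot key is spread over several synthetic keys (in Spark you would append a random salt column, e.g. with `rand()`, and join on the composite key):

```python
from itertools import cycle

# One skewed key is split across NUM_SALTS synthetic keys, so its rows
# can be shuffled to several tasks instead of landing in a single one.
NUM_SALTS = 4
salts = cycle(range(NUM_SALTS))

rows = ["hot_user"] * 8          # all rows share one skewed key
salted = [f"{key}#{next(salts)}" for key in rows]

print(sorted(set(salted)))  # ['hot_user#0', 'hot_user#1', 'hot_user#2', 'hot_user#3']
```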
26
Q

Why is it useful to examine explain() plans regularly?

A

They reveal where scans, shuffles, and joins occur, helping spot unnecessary complexity, missed filters, or suboptimal join orders.

27
Q

How can cluster autoscaling settings impact runtime performance?

A

Scaling down too aggressively can cause task churn and warm-up overhead as nodes are removed and re-added; scaling up too conservatively can leave the cluster underpowered under load.

28
Q

Why are long-running all-purpose clusters more prone to performance drift?

A

They accumulate library changes, cached state, and possible misconfigurations; ephemeral job clusters often provide more predictable performance.

29
Q

What is the relationship between DBR version and library compatibility?

A

Each DBR version supports specific Spark/Python/Scala versions and library versions; using incompatible libraries can cause runtime errors or degraded performance.

30
Q

What is a good process for runtime and performance tuning on Databricks?

A

Start with a pinned DBR and modest cluster, measure job behavior via Spark UI and metrics, then iteratively tune partitions, layout (Delta optimizations), joins, and cluster settings based on evidence.

31
Q

In one sentence, what is the core mental model for Databricks Runtime & Performance Tuning?

A

Use the optimized runtime and Delta features, then refine partitions, joins, and cluster settings based on actual job profiles instead of guessing, so you minimize shuffles and I/O while avoiding skew and memory issues.