What is Databricks Runtime (DBR)?
A curated, optimized Spark distribution provided by Databricks that bundles specific Spark versions, libraries, and performance improvements.
Why should you pin a specific DBR version instead of always using the latest?
Pinning ensures reproducibility and stability; automatically tracking the latest runtime can introduce unexpected behavior changes or library incompatibilities.
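As a sketch, a pinned runtime appears in the cluster definition via the Clusters API `spark_version` field; the version string, cluster name, and node type below are illustrative:

```json
{
  "cluster_name": "etl-nightly",
  "spark_version": "14.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 4
}
```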
What is Photon in Databricks?
A next-generation query engine built in C++ that accelerates SQL and DataFrame workloads, particularly for Delta and Parquet data.
When can enabling Photon be beneficial?
For SQL-heavy, scan-heavy, and aggregation-heavy workloads where you want lower latency and better throughput on supported operations.
Why is it important to match DBR type (e.g., ‘Standard’ vs ‘Photon’) to workload characteristics?
Some workloads benefit more from Photon and vectorized execution, while others may rely on libraries or features not fully optimized by Photon.
What is adaptive query execution (AQE) in Spark/Databricks?
A runtime optimization that adjusts query execution plans on the fly based on observed statistics, such as changing join strategies or shuffle partitions.
How can AQE improve performance?
By coalescing small shuffle partitions, dynamically switching join types (e.g., to broadcast), and handling skew more intelligently at runtime.
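On recent DBR versions AQE is enabled by default; as a sketch (assuming the `spark` session that Databricks notebooks provide), the relevant settings can be made explicit:

```python
# AQE master switch plus the two runtime behaviors described above;
# on recent Spark/DBR versions these already default to true.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge tiny shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions
```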
Why is input file layout important for performance on Databricks?
Poorly organized data can lead to many small files, skewed partitions, and inefficient scans, while well-laid-out Delta/Parquet data enables skipping and parallelism.
What are some Delta-specific optimizations that affect performance?
OPTIMIZE to compact small files, ZORDER BY for better clustering, and proper partitioning to enable partition pruning.
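A minimal sketch of file compaction plus clustering, assuming a Delta table exists and a notebook-provided `spark` session; the table and column names are illustrative:

```python
# Compact small files and co-locate rows by a commonly filtered column.
# `events` and `user_id` are hypothetical names.
spark.sql("OPTIMIZE events ZORDER BY (user_id)")
```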
What is data skipping in Delta and why does it matter for performance?
Data skipping uses file-level statistics (e.g., min/max per column) to avoid scanning files that cannot satisfy filter predicates, reducing I/O.
How does choosing partition columns affect performance in Databricks?
Good partition keys align with common filters and produce balanced partition sizes; poor keys can create hotspots or many small files.
What is spark.sql.shuffle.partitions and why is it important?
A configuration controlling the default number of shuffle partitions (200 out of the box); tuning it helps avoid too many tiny partitions or too few large ones.
When might you reduce spark.sql.shuffle.partitions from its default?
When your dataset is modest or your cluster is small, to avoid overhead from scheduling an excessive number of tiny tasks.
When might you increase spark.sql.shuffle.partitions?
When processing very large datasets on large clusters to provide enough parallelism and avoid huge single partitions.
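Both adjustments are a single config change; the values below are illustrative, and the snippet assumes the notebook-provided `spark` session:

```python
# Small job on a small cluster: fewer, larger shuffle partitions.
spark.conf.set("spark.sql.shuffle.partitions", "64")

# Large job on a big cluster: more partitions for parallelism.
# spark.conf.set("spark.sql.shuffle.partitions", "2000")
```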
How do broadcast joins help performance in Databricks?
They replicate small tables to all executors so that large tables can be joined locally, avoiding expensive shuffles of both sides.
What configuration can influence when Spark chooses broadcast joins?
spark.sql.autoBroadcastJoinThreshold, which sets the maximum size of a table that Spark will consider broadcasting automatically.
Why is SELECT * a performance anti-pattern in production?
It reads all columns even when only a subset is needed, increasing I/O, CPU, and network usage and making schema evolution riskier.
What is column pruning and how does it work in Spark SQL?
An optimization where the engine reads only the columns required by the query, skipping unnecessary data in columnar formats.
Why is predicate pushdown important for performance?
It pushes filter operations down to the data scan layer so fewer rows are read from storage, reducing I/O and compute costs.
How does caching interact with Databricks Runtime performance?
Caching frequently reused DataFrames can greatly speed up repeated queries, but over-caching can exhaust memory and cause spills.
What is a good rule of thumb for caching in Databricks?
Cache only intermediate results that are reused multiple times, and unpersist them when they are no longer needed.
What are common signs that a job is I/O-bound on Databricks?
High scan times, low CPU utilization, lots of time spent in ‘scan’ stages, and performance improving mainly with better layout or pruning.
What are common signs that a job is CPU-bound?
High CPU utilization, long-running compute-heavy stages (joins, aggregations), and limited improvement from changing storage layout.
How can you identify skew issues using the Spark UI?
By looking for stages where a few tasks take much longer and process much more data than others, indicating uneven partition sizes.