Cluster & Compute Management Flashcards

(44 cards)

1
Q

What is a Databricks cluster at a high level?

A

A set of compute resources (driver and workers) configured with a runtime and libraries, used to run Spark jobs, notebooks, and SQL workloads.

2
Q

What are the two main types of clusters in Databricks?

A

All-purpose clusters for interactive use and job clusters for scheduled or one-off job runs.

3
Q

When are all-purpose clusters typically used?

A

For interactive development, exploratory analysis, and collaborative notebook work.

4
Q

When are job clusters typically used?

A

For production jobs and workflows, where each run gets a clean, ephemeral cluster with defined configuration.

5
Q

Why are job clusters recommended for production workloads?

A

They avoid interference between users, ensure reproducible environments, and can be sized exactly for each job’s needs.

6
Q

What is the driver node’s role on a Databricks cluster?

A

It runs the Spark driver process, coordinates execution, and usually holds the notebook state or main application logic.

7
Q

What is a worker node’s role on a Databricks cluster?

A

It runs executor processes that perform actual data processing tasks on partitions and store intermediate data.

8
Q

What does autoscaling mean for a Databricks cluster?

A

The ability to automatically increase or decrease the number of worker nodes based on workload demand within configured bounds.

9
Q

Why is autoscaling useful?

A

It can improve resource utilization and cost efficiency by scaling up under load and scaling down when idle or lightly loaded.

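
To make this concrete, here is a toy sketch in plain Python (not Databricks' actual scaling algorithm; the function and parameter names are invented for illustration) of how an autoscaler might choose a worker count within configured bounds:

```python
import math

def pick_worker_count(pending_tasks: int, min_workers: int, max_workers: int,
                      tasks_per_worker: int = 8) -> int:
    """Toy autoscaling decision: provision enough workers for the backlog,
    clamped to the configured [min_workers, max_workers] range."""
    desired = math.ceil(pending_tasks / tasks_per_worker)
    return max(min_workers, min(desired, max_workers))

# Scales up under load, but never past the configured bounds.
print(pick_worker_count(pending_tasks=100, min_workers=2, max_workers=8))  # → 8
print(pick_worker_count(pending_tasks=0, min_workers=2, max_workers=8))    # → 2
```

The real service also considers factors like scale-down cooldowns, but the clamp-to-bounds behavior is the part the min/max settings control.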
10
Q

What is a reasonable upper limit on autoscaling tied to?

A

Expected peak concurrency and data volume; too high a limit can cause excessive costs or strain downstream systems.

11
Q

What is a cluster policy in Databricks?

A

An admin-defined set of constraints and defaults on cluster configurations to enforce best practices and cost controls.

12
Q

Why are cluster policies important in larger organizations?

A

They standardize runtimes, node types, autoscaling ranges, and security settings, reducing misconfiguration and cost surprises.

13
Q

What are some common parameters controlled by cluster policies?

A

Instance families and sizes, autoscaling bounds, runtime versions, spot vs on-demand usage, and permission scopes.

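
As a concrete sketch, a policy definition is a JSON document mapping cluster attributes to constraints. The snippet below builds one as a Python dict; the attribute names and constraint types mirror the Databricks policy format as commonly documented, but the specific values are illustrative, not a vetted production policy:

```python
import json

# Sketch of a cluster policy: attribute path -> constraint.
# "fixed" pins a value, "allowlist" restricts choices, "range" bounds numbers.
# Values here (runtime, node types, limits) are illustrative assumptions.
policy = {
    "spark_version": {"type": "fixed", "value": "13.3.x-scala2.12"},
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    "autoscale.min_workers": {"type": "range", "minValue": 1, "maxValue": 4},
    "autoscale.max_workers": {"type": "range", "minValue": 1, "maxValue": 16},
    "autotermination_minutes": {"type": "fixed", "value": 30},
}

print(json.dumps(policy, indent=2))
```

A real policy would be created through the admin UI or the Cluster Policies API; the point of the sketch is that every knob a user could misconfigure can be pinned, bounded, or restricted.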
14
Q

What is Databricks Runtime (DBR)?

A

A curated and optimized Spark runtime with bundled libraries and performance improvements provided by Databricks.

15
Q

Why is it important to choose a specific DBR version rather than ‘latest’?

A

Pinning versions ensures reproducibility and avoids unexpected behavior changes when runtimes are updated.

16
Q

What is the difference between CPU and GPU clusters in Databricks?

A

CPU clusters use standard compute nodes; GPU clusters use nodes with GPUs for accelerated ML, DL, or compute-intensive tasks.

17
Q

When might you choose a GPU cluster?

A

For training deep learning models or running other GPU-accelerated libraries, where the workload benefits from massive parallelism.

18
Q

What is spot (or preemptible) capacity in cluster configuration?

A

Discounted instances that can be reclaimed by the cloud provider, offering lower cost at the expense of potential interruptions.

19
Q

Why is mixing spot and on-demand nodes sometimes advantageous?

A

It can lower cost while retaining some resiliency, as critical tasks can run on on-demand nodes while extra capacity uses cheaper spot nodes.

20
Q

What is cluster auto-termination?

A

A setting that automatically terminates a cluster after a period of inactivity, preventing idle compute charges.

21
Q

Why is auto-termination essential for all-purpose clusters?

A

Interactive clusters are easy to forget; auto-termination avoids paying for unused resources when work stops.

22
Q

What is the relationship between cluster size and parallelism?

A

Larger clusters (more cores) can run more tasks in parallel, but require enough data and proper partitioning to be fully utilized.
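
The arithmetic behind this is simple enough to sketch in plain Python (the function name and the one-task-per-core assumption are illustrative):

```python
import math

def task_waves(num_partitions: int, workers: int, cores_per_worker: int) -> int:
    """How many sequential 'waves' of tasks a stage needs: one task per
    partition, with at most (workers * cores_per_worker) running at once."""
    total_cores = workers * cores_per_worker
    return math.ceil(num_partitions / total_cores)

# 200 partitions on 4 workers x 8 cores = 32 slots -> 7 waves.
print(task_waves(200, workers=4, cores_per_worker=8))   # → 7
# With only 16 partitions, a 128-core cluster finishes in 1 wave
# and most cores sit idle: parallelism is capped by partition count.
print(task_waves(16, workers=16, cores_per_worker=8))   # → 1
```
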

23
Q

Why can a cluster that is too large be inefficient?

A

If workloads and data sizes are too small, many cores sit idle, wasting cost without speeding up jobs significantly.

24
Q

What is executor memory used for?

A

Holding data partitions, cached DataFrames, and intermediate results during computations.

25

Q

What happens when executors run out of memory?

A

Spark spills data to disk, which slows down jobs, or may trigger out-of-memory errors and task failures if severe.

26

Q

What is the effect of very small partitions on performance?

A

They increase task-scheduling overhead, leading to less efficient use of the cluster and longer runtimes.

27

Q

What is the effect of very large partitions on performance?

A

They can cause skew, long-running tasks, and higher memory pressure, risking spills or failures on specific executors.

28

Q

What configuration controls partition sizes in Spark?

A

Settings like spark.sql.shuffle.partitions and the input partitioning of source data influence the number and size of partitions.

29

Q

Why should spark.sql.shuffle.partitions be tuned?

A

The default (200) may be too high or too low for your workload; tuning it avoids tiny-task overhead on small shuffles and oversized partitions on large ones.

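
A common sizing heuristic can be sketched in plain Python (the ~128 MB target is a widely used rule of thumb, not a Databricks-mandated value, and the function name is invented):

```python
import math

def suggested_shuffle_partitions(shuffle_bytes: int,
                                 target_partition_bytes: int = 128 * 1024**2) -> int:
    """Rule of thumb: size shuffle partitions toward ~128 MB each."""
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))

# A 10 GB shuffle at ~128 MB per partition suggests 80 partitions,
# well below the Spark default of 200 for this data volume.
print(suggested_shuffle_partitions(10 * 1024**3))  # → 80
```

In a notebook, the result would then typically be applied with `spark.conf.set("spark.sql.shuffle.partitions", str(n))`; on recent runtimes, adaptive query execution can also coalesce shuffle partitions automatically.
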
30

Q

What is data skew and how does it show up on a cluster?

A

Uneven distribution of data across partitions; it appears as a few tasks taking much longer than others and underutilized workers.

31

Q

What strategies help mitigate skew on Databricks?

A

Salting keys, pre-aggregation, using broadcast joins where appropriate, and redesigning partitioning or join keys.

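
Salting can be illustrated in plain Python (a toy sketch with invented names; in real Spark code the other side of the join must be expanded with every salt value so keys still match):

```python
import random
from collections import Counter

def salted_key(key: str, num_salts: int, rng: random.Random) -> str:
    """Append a random salt suffix so one hot key spreads across
    num_salts buckets instead of landing in a single partition."""
    return f"{key}_{rng.randrange(num_salts)}"

rng = random.Random(42)
# 1000 rows all share the hot key "US"; salting spreads them over 8 buckets.
salted = [salted_key("US", 8, rng) for _ in range(1000)]
counts = Counter(salted)
print(len(counts))                  # → 8 distinct salted keys (all buckets hit)
print(max(counts.values()) < 1000)  # → True: no bucket holds every row
```

Each bucket now becomes its own task, so the work for the hot key runs in parallel instead of on one straggler executor.
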
32

Q

Why is reading from Delta tables often faster than raw Parquet on Databricks?

A

Delta’s transaction log, statistics, and data skipping enable more efficient file pruning and query planning than plain Parquet directories.

33

Q

What is Spark caching and how does it relate to cluster memory?

A

Caching stores DataFrames in executor memory for reuse; if memory is limited, caching may cause spills or eviction of useful data.

34

Q

When is it appropriate to cache a DataFrame on Databricks?

A

When it is reused multiple times across actions and fits comfortably within available memory without crowding out other workloads.

35

Q

Why should unnecessary caching or persist calls be removed?

A

They consume memory and can degrade performance if cached data is rarely reused or too large.

36

Q

What is the purpose of 'explain' plans for performance tuning?

A

They show the physical operations (scans, filters, joins, shuffles) and help identify bottlenecks or unnecessary work.

37

Q

What metrics or tools can be used to analyze job performance on Databricks?

A

Spark UI, Databricks job run details, cluster metrics dashboards, and logs for tasks, executors, and stages.

38

Q

Why is the Spark UI important for tuning?

A

It reveals stage runtimes, shuffle volumes, skewed tasks, and executor utilization, guiding where to optimize code or configuration.

39

Q

How can you reduce data scanned per query in Databricks?

A

By using partition pruning, data skipping, ZORDER on Delta tables, and selecting only necessary columns.

40

Q

Why should SELECT * be avoided in production ETL and analytics?

A

It reads more columns than needed, increases I/O and network usage, and makes schema evolution riskier.

41

Q

What is a good practice for isolating heavy workloads on Databricks?

A

Run them on dedicated job clusters or separate SQL warehouses so they do not impact interactive or latency-sensitive workloads.

42

Q

How does cluster concurrency affect performance?

A

Too many concurrent jobs can cause resource contention, leading to longer runtimes and higher failure rates; concurrency limits help maintain stability.

43

Q

What is a reasonable approach to cluster sizing on Databricks?

A

Start with a modest configuration, measure utilization and runtimes, and adjust size and autoscaling based on empirical evidence rather than guesswork.

44

Q

In one sentence, what is the core mental model for cluster and performance tuning on Databricks?

A

Treat clusters as flexible, experiment-driven resources: tune size, autoscaling, partitions, and storage layout so Spark jobs fully use the cluster without wasting compute or suffering from skew and spills.