What is a Databricks cluster at a high level?
A set of compute resources (driver and workers) configured with a runtime and libraries, used to run Spark jobs, notebooks, and SQL workloads.
What are the two main types of clusters in Databricks?
All-purpose clusters for interactive use and job clusters for scheduled or one-off job runs.
When are all-purpose clusters typically used?
For interactive development, exploratory analysis, and collaborative notebook work.
When are job clusters typically used?
For production jobs and workflows, where each run gets a clean, ephemeral cluster with defined configuration.
Why are job clusters recommended for production workloads?
They avoid interference between users, ensure reproducible environments, and can be sized exactly for each job’s needs.
What is the driver node’s role on a Databricks cluster?
It runs the Spark driver process, coordinates execution, and usually holds the notebook state or main application logic.
What is a worker node’s role on a Databricks cluster?
It runs executor processes that perform actual data processing tasks on partitions and store intermediate data.
What does autoscaling mean for a Databricks cluster?
The ability to automatically increase or decrease the number of worker nodes based on workload demand within configured bounds.
Why is autoscaling useful?
It can improve resource utilization and cost efficiency by scaling up under load and scaling down when idle or lightly loaded.
What is a reasonable upper limit on autoscaling tied to?
Expected peak concurrency and data volume; too high a limit can cause excessive costs or strain downstream systems.
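Autoscaling bounds are set in the cluster specification. A minimal sketch of such a spec, expressed as a Python dict in the shape of a Databricks Clusters API request (the cluster name, node type, and runtime version here are illustrative assumptions, not values from this deck):

```python
# Hypothetical cluster spec fragment showing autoscaling bounds.
# Field names follow the Databricks Clusters API; concrete values are
# illustrative assumptions.
cluster_spec = {
    "cluster_name": "etl-autoscaling-demo",   # hypothetical name
    "spark_version": "13.3.x-scala2.12",      # example pinned DBR version
    "node_type_id": "i3.xlarge",              # example node type
    "autoscale": {
        "min_workers": 2,   # floor: parallelism that is always available
        "max_workers": 8,   # cap tied to expected peak load and cost budget
    },
}

# A sanity check one might run before submitting the spec:
assert (cluster_spec["autoscale"]["min_workers"]
        <= cluster_spec["autoscale"]["max_workers"])
```

Keeping `max_workers` close to the true expected peak is what prevents the runaway-cost and downstream-strain issues described above.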
What is a cluster policy in Databricks?
An admin-defined set of constraints and defaults on cluster configurations to enforce best practices and cost controls.
Why are cluster policies important in larger organizations?
They standardize runtimes, node types, autoscaling ranges, and security settings, reducing misconfiguration and cost surprises.
What are some common parameters controlled by cluster policies?
Instance families and sizes, autoscaling bounds, runtime versions, spot vs on-demand usage, and permission scopes.
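A policy is defined as a set of rules over cluster attributes. A minimal sketch, written as a Python dict mirroring the Databricks policy JSON (rule types such as "fixed", "range", and "allowlist" follow the documented policy language; the specific values are illustrative assumptions):

```python
# Hypothetical cluster policy definition: each key is a cluster attribute
# path, each value a rule constraining it. Values are illustrative.
policy_definition = {
    # Pin the runtime so every cluster under this policy is reproducible.
    "spark_version": {"type": "fixed", "value": "13.3.x-scala2.12"},
    # Restrict instance choices to an approved set.
    "node_type_id": {"type": "allowlist",
                     "values": ["i3.xlarge", "i3.2xlarge"]},
    # Cap autoscaling to control cost.
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    # Force auto-termination for interactive clusters.
    "autotermination_minutes": {"type": "fixed", "value": 60},
}
```

Users creating clusters under this policy can only choose values the rules allow, which is how misconfiguration and cost surprises are reduced.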
What is Databricks Runtime (DBR)?
A curated and optimized Spark runtime with bundled libraries and performance improvements provided by Databricks.
Why is it important to choose a specific DBR version rather than ‘latest’?
Pinning versions ensures reproducibility and avoids unexpected behavior changes when runtimes are updated.
What is the difference between CPU and GPU clusters in Databricks?
CPU clusters use standard compute nodes; GPU clusters use nodes with GPUs for accelerated ML, DL, or compute-intensive tasks.
When might you choose a GPU cluster?
For training deep learning models or running other GPU-accelerated workloads that benefit from massive parallelism.
What is spot (or preemptible) capacity in cluster configuration?
Discounted instances that can be reclaimed by the cloud provider, offering lower cost at the expense of potential interruptions.
Why is mixing spot and on-demand nodes sometimes advantageous?
It can lower cost while retaining resiliency: the driver and critical workers run on on-demand nodes while extra capacity uses cheaper spot instances.
What is cluster auto-termination?
A setting that automatically terminates a cluster after a period of inactivity, preventing idle compute charges.
Why is auto-termination essential for all-purpose clusters?
Interactive clusters are easy to forget; auto-termination avoids paying for unused resources when work stops.
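Auto-termination is a single field on the cluster spec. A minimal sketch (the field name follows the Clusters API; the cluster name and timeout are illustrative assumptions):

```python
# Hypothetical all-purpose cluster spec fragment: terminate after 60
# idle minutes. A value of 0 would disable auto-termination.
interactive_cluster = {
    "cluster_name": "team-exploration",   # hypothetical name
    "autotermination_minutes": 60,
}
```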
What is the relationship between cluster size and parallelism?
Larger clusters (more cores) can run more tasks in parallel, but require enough data and proper partitioning to be fully utilized.
Why can a cluster that is too large be inefficient?
If the workload or data volume is small, many cores sit idle, increasing cost without meaningfully speeding up the job.
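The sizing intuition above can be checked with back-of-the-envelope arithmetic: tasks run in "waves" of at most one task per core. All numbers below are illustrative assumptions:

```python
# How many waves of tasks a cluster needs for a given partition count.
workers = 8
cores_per_worker = 4
partitions = 200        # e.g. a typical spark.sql.shuffle.partitions value

slots = workers * cores_per_worker   # tasks that can run at once
waves = -(-partitions // slots)      # ceiling division

print(slots, waves)   # 32 slots, so 200 partitions run in 7 waves
```

If `partitions` were only 16, half the 32 slots would sit idle for the whole job, which is exactly the "too large to be efficient" case.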
What is executor memory used for?
Holding data partitions, cached DataFrames, and intermediate results during computations.
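Executor memory is sized through standard Spark properties. A sketch of a Spark configuration fragment (`spark.executor.memory` and `spark.memory.fraction` are standard Spark properties; the heap size is an illustrative assumption):

```python
# Hypothetical spark_conf fragment sizing executor memory.
spark_conf = {
    "spark.executor.memory": "8g",   # JVM heap per executor (assumed value)
    # Fraction of the heap (minus a reserved slice) shared by execution
    # memory and storage memory (cached DataFrames); 0.6 is Spark's
    # documented default.
    "spark.memory.fraction": "0.6",
}
```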