What is Apache Spark at a high level?
A distributed data processing engine that executes computations in parallel across a cluster, supporting batch, streaming, SQL, and ML workloads.
How does Databricks relate to Spark?
Databricks provides a managed, optimized Spark runtime with additional services like notebooks, jobs, Delta Lake, and governance tooling.
What is a Spark application on Databricks?
A program (notebook, job, script) that uses Spark to process data on a Databricks cluster.
What are the two main node roles in a Spark cluster?
The driver node, which coordinates the application, and worker nodes, which execute tasks on partitions of the data.
What is the driver in Spark?
The process that runs the main application code, builds the logical plan, and schedules tasks on worker nodes.
What is an executor in Spark?
A process running on a worker node that executes tasks and holds data partitions in memory or on disk.
What is a Spark RDD (Resilient Distributed Dataset)?
An immutable, distributed collection of records that can be processed in parallel and recomputed from lineage information.
Why are RDDs rarely used directly in Databricks for most workloads now?
Higher-level APIs like DataFrames and Spark SQL provide better optimization, safety, and convenience for structured data.
What is a Spark DataFrame?
A distributed, tabular data structure with named columns and a schema, similar to a table in a relational database.
How do DataFrames relate to Spark SQL?
Spark SQL queries operate on DataFrames, and DataFrames can be registered as temporary views to be queried with SQL.
Why are DataFrames usually preferred over RDDs in Databricks?
They enable the Catalyst optimizer to plan and optimize queries, often yielding better performance with less code.
What is a transformation in Spark?
An operation that defines a new dataset from an existing one, such as select, filter, map, or join, without triggering execution.
What is an action in Spark?
An operation that triggers execution and returns a result to the driver or writes data out, such as count, collect, show, or write.
What does it mean that Spark has lazy evaluation?
Transformations build a logical plan but nothing is actually executed until an action is called, allowing global optimization.
What is a DAG (Directed Acyclic Graph) in Spark?
A graph of stages and operations that represents the logical execution plan of the transformations leading up to an action.
Why does Spark use DAGs instead of a fixed MapReduce pattern?
DAGs allow more complex multi-stage workflows and better optimization across chained transformations.
What is a stage in Spark execution?
A set of tasks that can be executed without reshuffling data, separated by shuffle boundaries in the DAG.
What is a task in Spark execution?
A unit of work that processes one partition of the data for a specific stage on an executor.
What is a partition in Spark?
A chunk of the dataset that is processed as a unit by a single task on one executor, supporting data parallelism.
Why is the number and size of partitions important?
Too few partitions underutilize the cluster; too many create overhead in scheduling and task management.
What is a shuffle in Spark?
A data movement operation where records are redistributed across partitions, typically for joins, groupBy, or aggregations by key.
Why are shuffles expensive operations?
They involve network transfer, disk I/O for intermediate data, and additional coordination, often dominating job runtime.
How can you reduce shuffles in Spark jobs?
By avoiding unnecessary groupBy/join operations, pre-partitioning data appropriately, reusing existing partitioning where possible, and broadcasting small tables in joins.
What is a wide transformation?
A transformation where each output partition depends on data from multiple input partitions, such as groupByKey or join, which usually triggers a shuffle.