Spark Core Concepts Flashcards

(43 cards)

1
Q

What is Apache Spark at a high level?

A

A distributed data processing engine that executes computations in parallel across a cluster, supporting batch, streaming, SQL, and ML workloads.

2
Q

How does Databricks relate to Spark?

A

Databricks provides a managed, optimized Spark runtime with additional services like notebooks, jobs, Delta Lake, and governance tooling.

3
Q

What is a Spark application on Databricks?

A

A program (notebook, job, script) that uses Spark to process data on a Databricks cluster.

4
Q

What are the two main node roles in a Spark cluster?

A

The driver node, which coordinates the application, and worker nodes, which execute tasks on partitions of the data.

5
Q

What is the driver in Spark?

A

The process that runs the main application code, builds the logical plan, and schedules tasks on worker nodes.

6
Q

What is an executor in Spark?

A

A process running on a worker node that executes tasks and holds data partitions in memory or on disk.

7
Q

What is a Spark RDD (Resilient Distributed Dataset)?

A

An immutable, distributed collection of records that can be processed in parallel and recomputed from lineage information.

8
Q

Why are RDDs rarely used directly in Databricks for most workloads now?

A

Higher-level APIs like DataFrames and Spark SQL provide better optimization, safety, and convenience for structured data.

9
Q

What is a Spark DataFrame?

A

A distributed, tabular data structure with named columns and a schema, similar to a table in a relational database.

10
Q

How do DataFrames relate to Spark SQL?

A

Spark SQL queries operate on DataFrames, and DataFrames can be registered as temporary views to be queried with SQL.

11
Q

Why are DataFrames usually preferred over RDDs in Databricks?

A

They enable the Catalyst optimizer to plan and optimize queries, often yielding better performance with less code.

12
Q

What is a transformation in Spark?

A

An operation that defines a new dataset from an existing one, such as select, filter, map, or join, without triggering execution.

13
Q

What is an action in Spark?

A

An operation that triggers execution and returns a result to the driver or writes data out, such as count, collect, show, or write.

14
Q

What does it mean that Spark has lazy evaluation?

A

Transformations build a logical plan but nothing is actually executed until an action is called, allowing global optimization.

15
Q

What is a DAG (Directed Acyclic Graph) in Spark?

A

A graph of stages and operations that represents the logical execution plan of transformations leading up to an action.

16
Q

Why does Spark use DAGs instead of a fixed MapReduce pattern?

A

DAGs allow more complex multi-stage workflows and better optimization across chained transformations.

17
Q

What is a stage in Spark execution?

A

A set of tasks that can be executed without reshuffling data, separated by shuffle boundaries in the DAG.

18
Q

What is a task in Spark execution?

A

A unit of work that processes one partition of the data for a specific stage on an executor.

19
Q

What is a partition in Spark?

A

A chunk of the dataset that is processed as a unit by a single task on one executor, supporting data parallelism.

20
Q

Why is the number and size of partitions important?

A

Too few partitions underutilize the cluster; too many create overhead in scheduling and task management.

21
Q

What is a shuffle in Spark?

A

A data movement operation where records are redistributed across partitions, typically for joins, groupBy, or aggregations by key.

22
Q

Why are shuffles expensive operations?

A

They involve network transfer, disk I/O for intermediate data, and additional coordination, often dominating job runtime.

23
Q

How can you reduce shuffles in Spark jobs?

A

By avoiding unnecessary groupBy/join operations, pre-partitioning data appropriately, and reusing partitioning where possible.

24
Q

What is a wide transformation?

A

A transformation that requires data from many partitions, such as groupByKey, and usually triggers a shuffle.

25

Q

What is a narrow transformation?

A

A transformation like map or filter where each output partition depends on data from a single input partition, avoiding shuffles.
26

Q

What is caching (persisting) in Spark?

A

Storing intermediate DataFrames or RDDs in memory (and optionally disk) so they can be reused without recomputation.
27

Q

Why is caching useful in Databricks notebooks?

A

Interactive analysis often reuses the same intermediate data; caching speeds up subsequent actions on that data.
28

Q

What is the difference between `cache()` and `persist()` in Spark?

A

`cache()` uses a default storage level (memory-only for RDDs, memory-and-disk for DataFrames), while `persist()` allows specifying different storage levels (e.g., disk-only).
29

Q

Why should caching be used judiciously?

A

Caching too many or very large datasets can exhaust executor memory and lead to spills or eviction of useful data.
30

Q

What is a broadcast variable or broadcast join?

A

A mechanism to send a small dataset to all executors so that large datasets can be joined locally without shuffling both sides.
31

Q

When is a broadcast join appropriate in Databricks?

A

When one side of the join is small enough to fit in memory on each executor, significantly reducing shuffle cost.
32

Q

What is the Catalyst optimizer?

A

Spark SQL’s query optimizer that analyzes logical plans, applies rules, and generates optimized physical execution plans.
33

Q

Why is understanding Catalyst helpful even if you write only SQL/DataFrames?

A

It explains why certain query patterns are faster, how filters and projections are pushed down, and when joins or shuffles occur.
34

Q

What is Tungsten in the Spark runtime?

A

An optimization project focusing on memory management and code generation for efficient binary processing of data.
35

Q

How does Databricks enhance Spark’s optimizer/runtime?

A

Through runtime improvements, cost-based optimizations, and Delta Lake-specific features like data skipping and file pruning.
36

Q

What is a SparkSession in Databricks?

A

The entry point for Spark functionality, exposed as `spark` in Databricks notebooks and used to create DataFrames, run SQL, and manage configs.
37

Q

How do you typically create a DataFrame from a Delta or Parquet table in Databricks?

A

By using `spark.read.format("delta").load(path)` (or `"parquet"` for Parquet files), or `spark.table("db.table_name")` for a registered table.
38

Q

What is the difference between `DataFrame.show()` and `DataFrame.collect()`?

A

`show()` prints a limited number of rows for inspection, while `collect()` retrieves all results to the driver, which can be dangerous for large datasets.
39

Q

Why is calling `collect()` on large datasets a pitfall?

A

It can overwhelm driver memory and crash the application; large results should be written to storage or inspected with limited samples.
40

Q

What is the effect of using `display()` in Databricks notebooks?

A

It triggers an action similar to `show()` but with richer visualization; it still executes a job under the hood.
41

Q

How can you inspect the physical plan of a DataFrame or SQL query in Databricks?

A

By using `df.explain()` or `EXPLAIN` in SQL, optionally with `EXTENDED` or `CODEGEN` flags to see detailed plans.
42

Q

Why is examining `explain()` output useful for engineers?

A

It reveals where scans, filters, joins, and shuffles occur, guiding performance tuning and schema or query changes.
43

Q

What is a good mental model for Spark on Databricks from a data engineer’s perspective?

A

Write transformations in DataFrames/SQL, let Spark build a DAG, be conscious of partitions and shuffles, cache deliberately, and always think about what triggers actions and where data moves.