Break down how a Spark job works:
1) Jobs are created whenever a Spark action (e.g., count(), collect(), write()) is called on a dataset (RDD or DataFrame). Transformations (e.g., map(), filter()) create a logical plan but do not execute any computation until an action triggers the job.
2) The driver is the central process that coordinates the Spark job.
The SparkSession, created in the driver, is the entry point for a Spark application and provides access to all of Spark’s functionality.
3) Spark builds a logical plan of the job by analyzing the chain of transformations; the logical plan is then optimized and converted into a physical plan, which breaks the job down into stages and tasks.
4) A Spark job is broken down into stages at shuffle boundaries. Each stage is a set of transformations that can be pipelined together without moving data across partitions. Each stage is further broken down into tasks, the smallest units of work, each executing on a single partition of the data.
- Tasks are distributed to different worker nodes (executors) for parallel execution.
5) When an operation requires data to move across partitions (e.g., join, groupBy), Spark performs a shuffle, which redistributes data across the cluster and marks a stage boundary.
6) Executors are processes on the worker nodes that execute tasks assigned to them by the driver.
- Executors communicate with the driver to report the status and outcome of their tasks
7) Once all tasks in a job are completed, results are either collected to the driver or written to a specified output
8) To optimize performance, Spark lets you cache (persist) intermediate data in memory or on disk, so later actions in the same application reuse it instead of recomputing it.
What is the difference between batch processing and real-time processing?
Batch: data is collected and processed in large chunks at scheduled intervals rather than immediately; higher latency, simpler to implement.
Real-time: data is processed as soon as it arrives; used when immediate action matters, e.g., stock-trading systems or live analytics on streams; low latency but more complex to implement.
What is a broadcast join?
The smaller dataset is broadcast to all executor nodes, so each executor holds a local copy and can join it against its partitions of the larger dataset in parallel, without shuffling the large dataset.
What is Hadoop?
Hadoop is an open-source framework for distributed storage and batch processing of large datasets on clusters of commodity hardware. Its core components are HDFS (distributed file system), MapReduce (batch processing engine), and YARN (resource manager). Spark can run on YARN and read from HDFS, but replaces MapReduce with faster, largely in-memory processing.
In what cases will the Spark driver die due to OOM?
Common causes: calling collect() (or toPandas()) on a large dataset, broadcasting a table that is too large to fit in driver memory, very large task results or accumulators, and tracking metadata for an excessive number of partitions/tasks.
How are clusters set up (Databricks)?
Cluster types: all-purpose (interactive) clusters for ad-hoc analysis and notebooks, and job clusters created for a scheduled job and terminated when it finishes.
Cluster modes: Standard, High Concurrency, and Single Node.