Break down how a Spark job works:
1) Jobs are created whenever a Spark action (e.g., count(), collect(), write()) is called on a dataset (RDD or DataFrame). Transformations (e.g., map(), filter()) create a logical plan but do not execute any computation until an action triggers the job.
2) The driver is the central process that coordinates the Spark job.
The SparkSession, created in the driver, is the entry point for a Spark application and provides access to all of Spark’s functionality.
3) Spark builds a logical plan of the job by analyzing the chain of transformations; the logical plan is then optimized and converted into a physical plan, which breaks the job down into stages and tasks.
4) A Spark job is broken down into stages at shuffle boundaries. Each stage is a set of transformations that can be pipelined together without moving data across partitions. Each stage is further broken down into tasks, the smallest units of work, each executing on a single partition of the data.
- Tasks are distributed to different worker nodes (executors) for parallel execution.
5) When an operation requires data to move across partitions (e.g., join, groupBy), Spark performs a shuffle, which redistributes data across the cluster and marks a stage boundary.
6) Executors are processes on the worker nodes that execute tasks assigned to them by the driver.
- Executors communicate with the driver to report the status and outcome of their tasks
7) Once all tasks in a job are completed, results are either collected to the driver or written to a specified output
8) To optimize performance, Spark lets you cache (persist) intermediate data in memory or on disk, so later actions in the same application reuse it instead of recomputing it.
What is the difference between batch processing and real-time processing?
Batch: data is collected and processed in large chunks at scheduled intervals rather than immediately; higher latency, simpler to implement.
Real-time: data is processed as soon as it arrives; used when immediate action matters, e.g., stock-trading systems or live analytics on streams; low latency but more complex to implement.
What is a broadcast join?
The smaller dataset is broadcast to all executor nodes, so each executor holds a local copy and can join it against its partitions of the larger dataset in parallel, without shuffling the large dataset.
What is Hadoop?
Hadoop is an open-source framework for distributed storage and batch processing of large datasets on clusters of commodity hardware. Its core components are HDFS (distributed file system), MapReduce (batch processing engine), and YARN (resource manager). Spark can run on YARN and read from HDFS, but replaces MapReduce with faster, largely in-memory processing.
In what cases will the Spark driver die due to OOM?
Common causes: calling collect() (or toPandas()) on a large dataset, broadcasting a table that is too large to fit in driver memory, very large task results or accumulators, and tracking metadata for an excessive number of partitions/tasks.
How are clusters set up (Databricks)?
Cluster types: all-purpose (interactive) clusters for ad-hoc analysis and notebooks, and job clusters created for a scheduled job and terminated when it finishes.
Cluster modes: Standard, High Concurrency, and Single Node.