The performance gain in Distributed Computing lies in splitting the data into smaller chunks and _ processing them on _ machine(s).
(iteratively / in parallel)
(single machine / several machines)
Code (containing the business logic) for processing is brought to the place where the data chunk is actually stored. This technique is called "Data Parallel Processing".
The MapReduce paradigm breaks down a Data Parallel Processing problem into three kinds of stages.
Name them.
Map, Shuffle, and Reduce
There can be multiple Map stages followed by multiple Reduce stages. However, a Reduce
stage only starts after all of the Map stages have been completed.
The Map stage takes the input dataset, splits it into _ .
(key, value) pairs
The Shuffle stage takes the (key, value) pairs from the Map stage and _ them so that pairs with the same key end up together.
sorts/groups
Sorting/Grouping is based on keys.
Note: a group for a given key (even a non-trivial one with freq > 1) is never split across two processing cores; all pairs with the same key land on the same core.
The Reduce stage takes the resultant groups of (key, value) pairs from the Shuffle stage and _ them (groups) to produce the final result.
reduces or aggregates
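The three stages above can be sketched in plain Python (a hypothetical single-machine simulation, not actual Hadoop code) with a word count, the classic MapReduce example:

```python
from collections import defaultdict

def map_stage(lines):
    # Map: split the input into (key, value) pairs with count 1
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_stage(pairs):
    # Shuffle: group pairs so that pairs with the same key end up together
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_stage(groups):
    # Reduce: aggregate each group to produce the final result
    return {key: sum(values) for key, values in groups.items()}

lines = ["to be or not", "to be"]
result = reduce_stage(shuffle_stage(map_stage(lines)))
# result == {"to": 2, "be": 2, "or": 1, "not": 1}
```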
Commodity Hardware used to store data chunks is prone to failures.
MapReduce achieves resiliency to hardware failures by saving the results of every stage to _ , which offers
* replication
* checkpointing
* re-execution
as resiliency mechanisms.
HDFS
(Hadoop Distributed File System)
How does "replication of blocks" in HDFS help make MapReduce resilient?
Input and intermediate data chunks are stored in HDFS, which replicates blocks (default 3 copies) across different machines. If one disk fails, another copy exists elsewhere.
How does "checkpointing intermediate results" in HDFS help make MapReduce resilient?
Each stage’s output is written to disk so that if a node fails mid‑computation, the job can restart from the last completed stage instead of reprocessing everything.
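A toy sketch of that checkpointing idea in plain Python (hypothetical file layout, not the actual HDFS/MapReduce mechanism): each stage writes its output to a file, and on restart the job reuses any checkpoint it finds instead of recomputing:

```python
import json, os, tempfile

def run_stages(stages, data, checkpoint_dir):
    # Run each stage in order; resume from existing checkpoints after a crash.
    for i, stage in enumerate(stages):
        path = os.path.join(checkpoint_dir, f"stage_{i}.json")
        if os.path.exists(path):
            with open(path) as f:          # stage finished before a crash:
                data = json.load(f)        # reuse its checkpointed output
            continue
        data = stage(data)
        with open(path, "w") as f:         # checkpoint this stage's result
            json.dump(data, f)
    return data

with tempfile.TemporaryDirectory() as d:
    stages = [lambda xs: [x * 2 for x in xs],   # stage 1: double
              lambda xs: [x + 1 for x in xs]]   # stage 2: increment
    out = run_stages(stages, [1, 2, 3], d)
# out == [3, 5, 7]
```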
How does "task re-execution" help make MapReduce resilient?
If a mapper/reducer task fails, the MapReduce framework reassigns it to another node that has a replica of the input data.
The round-trip to HDFS (disk) after every stage makes MapReduce relatively _ at
processing data because of the _ I/O performance of physical disks in general.
(slow/fast)
To overcome the slow performance of MapReduce, Spark keeps intermediate results in memory ( _ than disk I/O), while still falling back to disk when needed for resiliency.
Additionally, Spark offers much more _ APIs for expressing data transformations compared to MapReduce (Hadoop).
(slower/faster)
(low-level/high-level)
A _ is a group of computers (machines) all working together as a single unit to solve a distributed computing problem.
cluster
The primary machine of a Cluster, called the _ , takes care of the orchestration and management of the Cluster.
Master Node
The secondary machines in a cluster that actually perform the task (data processing code)
are called _ .
Worker Nodes
At any time, a Cluster has exactly one Master node and one or more Worker nodes.
In Spark, Resilient Distributed Dataset (RDD) is _ data structure that is _ across the cluster, residing in memory (RAM) of _ .
(mutable/immutable)
(localized/distributed)
(single machine / several machines)
An RDD consists of _ , which are logical divisions of an RDD, with a few of them residing on each machine/node of the cluster.
partitions
_ are used to manipulate the data stored within the partitions of an RDD.
Hint: They operate on RDDs.
Higher-order Functions
map, flatMap, reduce, fold, filter, reduceByKey, join, and union to name a few.
A higher-order function accepts a _ as a parameter: an inner function that defines the actual business logic that transforms the data, and is applied to each partition of the RDD in parallel.
lambda
Note: the lambda (transformation) is applied to all partitions of the RDD at once (in parallel).
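To illustrate the idea in plain Python (a hypothetical sketch, not PySpark): partitions are modeled as plain lists, and a higher-order function takes the lambda as a parameter and applies it to every partition ("parallelism" here is just a loop):

```python
# Partitions modeled as plain lists on a single machine.
partitions = [[1, 2, 3], [4, 5], [6]]

def map_partitions(partitions, fn):
    # Higher-order function: takes the lambda `fn` as a parameter
    # and applies it to every element of every partition.
    return [[fn(x) for x in part] for part in partitions]

doubled = map_partitions(partitions, lambda x: x * 2)
# doubled == [[2, 4, 6], [8, 10], [12]]
```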
Given SparkContext sc, write PySpark program to read a text file as an RDD and split each word based on a delimiter such as a whitespace. Transform each word into tuple with default count = 1.
linesRDD = sc.textFile("/path/to/file")           # RDD of lines
wordsRDD = linesRDD.flatMap(lambda s: s.split())  # split each line into words
tuplesRDD = wordsRDD.map(lambda s: (s, 1))        # (word, 1) pairs

At any given point in time, an RDD has information about all the individual operations performed on it, going back all the way to the data source itself.
How is this “lineage information” useful for fault tolerance?
If any Executors are lost due to any failures and one or more of its partitions are lost, RDD can easily recreate the lost partitions from the source data by making use of the lineage information, thus making it Resilient to failures.
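A toy sketch of the idea (hypothetical plain Python, not Spark's actual implementation): an "RDD" that remembers its source and its chain of transformations can recompute any lost partition on demand:

```python
class TinyRDD:
    """Toy RDD: remembers its source and the chain of transformations (lineage)."""

    def __init__(self, source_partitions, lineage=None):
        self.source = source_partitions      # original input data
        self.lineage = lineage or []         # transformations applied so far
        self.partitions = self._compute(range(len(source_partitions)))

    def _compute(self, indices):
        # Recompute the requested partitions from the source via the lineage.
        out = {}
        for i in indices:
            part = self.source[i]
            for fn in self.lineage:
                part = [fn(x) for x in part]
            out[i] = part
        return out

    def map(self, fn):
        # A transformation just extends the lineage; data flows from the source.
        return TinyRDD(self.source, self.lineage + [fn])

    def recover(self, lost_index):
        # An executor died: rebuild the lost partition from source + lineage.
        self.partitions[lost_index] = self._compute([lost_index])[lost_index]

rdd = TinyRDD([[1, 2], [3, 4]]).map(lambda x: x * 10)
rdd.partitions[1] = None   # simulate losing a partition
rdd.recover(1)
# rdd.partitions[1] == [30, 40]
```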
List the 3 major components of the Apache Spark architecture.
Driver, Executors, and the Cluster Manager
Each Executor process is launched on _ node of the Spark cluster.
(master/worker)
worker
In both deploy modes (client and cluster), Executors still run on the worker nodes of the cluster (close to the distributed data).
The deploy mode is specified via the --deploy-mode flag when we submit a Spark job through the spark-submit command.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my_app.py   (or my_scala_app.jar)
Spark supports running on a variety of cluster managers, depending upon your scale, infrastructure, and operational needs.
The concept of a master node (for the Driver) is externalized when Spark runs on
1. YARN
2. Kubernetes
as the cluster manager.
Name the cluster manager for each of the following clusters:
1. Standalone
2. YARN
3. Kubernetes
1. Spark's built-in Standalone Master
2. YARN ResourceManager
3. Kubernetes (its control plane schedules the Driver and Executor pods)