What is an SLA in the context of ETL?
Give an example.
SLA = Service Level Agreement
Example of SLA:
“Your transformed data must be available by 6 in the morning.”
The Data Processing Framework should keep latency under control, even as data volume scales.
i.e., Latency should not scale linearly with data _ .
volume
This can be achieved through vertical and horizontal scaling.
The Data Processing Framework should be resilient to transient network failures.
To this end, the restarting of a failed job should be _ .
automatic
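As one concrete instance, Spark (discussed below) retries failed tasks automatically; the retry budget is a configuration knob. A minimal sketch, with a hypothetical app name, showing where that knob is set:

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: Spark restarts failed tasks on its own; the retry budget
// is controlled by spark.task.maxFailures (default 4).
val spark = SparkSession.builder()
  .appName("ResilientApp")                 // hypothetical app name
  .master("local[*]")
  .config("spark.task.maxFailures", "4")   // retries before the job is failed
  .getOrCreate()
```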
The Data Processing Framework should support integration (ingestion) with a diverse set of data sources.
Name some.
e.g., files (CSV, JSON, Parquet), relational databases via JDBC, Kafka, HDFS, cloud object stores (S3)
Similarly, the DP Framework should support writing to a diverse set of data targets.
Data Processing Framework should support all kinds of data _ :
* batch
* streaming
* near real-time
ingestion
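In Spark (covered below), batch and streaming ingestion of the same source can be sketched as follows; the directory `data/events/` is a hypothetical example and must exist for this to run:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("IngestModes").master("local[*]").getOrCreate()

// Batch ingestion: read a (hypothetical) directory of JSON files once
val batchDF = spark.read.json("data/events/")

// Streaming ingestion: continuously pick up new files in the same layout
// (streaming file sources require an explicit schema)
val streamDF = spark.readStream.schema(batchDF.schema).json("data/events/")
```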
Additionally, the DP Framework should be easy to upgrade (with backward compatibility), easy to customize and extend, and operationally smooth enough to hide implementation complexities.
Spark is an in-memory, distributed data processing framework, which is an improvement over the _ framework.
map-reduce
RDD full form?
Resilient Distributed Dataset
RDD is an immutable data structure. It doesn’t contain the data itself, only the lineage information.
Partitions of RDD are distributed across the _ of the cluster.
nodes
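A small sketch of partitioning in local mode (the app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PartitionsDemo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Ask Spark to split the data into 4 partitions
val data = sc.parallelize(1 to 100, 4)
println(data.getNumPartitions)  // 4
```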
When an RDD is created or transformed, the associated _ (dependency) information is updated and kept in-memory by Spark.
lineage
Lineage information is essential to avoid reprocessing source RDD if one of the partitions of RDD gets lost due to failure.
The operations on RDD can be classified into two categories.
Name them.
How do operations on RDD relate with lineage?
1. Transformation _ the lineage.
2. Action _ the result/outcome of all updates in the lineage.
updates / materializes (computes)
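This can be seen directly in a small sketch (local mode, illustrative app name): a transformation only extends the lineage, which `toDebugString` prints, and an action finally computes:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LineageDemo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val base    = sc.parallelize(1 to 5, 2)
val doubled = base.map(_ * 2)      // transformation: extends the lineage, computes nothing
println(doubled.toDebugString)     // inspect the recorded lineage (dependency chain)
val sum = doubled.reduce(_ + _)    // action: materializes the result
println(sum)                       // 30
```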
Partitions of RDD are distributed across the nodes of the cluster.
Each partition of RDD is processed _ by one task of the executor core.
in parallel
After shuffling, the number of partitions of the RDD _ .
(stay same/can change)
can change
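A minimal sketch of this (local mode, illustrative names): `reduceByKey` takes an optional partition count for its shuffled output:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ShuffleDemo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
println(pairs.getNumPartitions)            // 4 before the shuffle
val reduced = pairs.reduceByKey(_ + _, 8)  // wide transformation with 8 output partitions
println(reduced.getNumPartitions)          // 8 after the shuffle
```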
Lineage information is essential to avoid reprocessing entire source RDD if one of the partitions of RDD gets lost due to failure.
When one partition of an RDD is lost, Spark launches a new task that uses the RDD’s lineage information to traverse back and read only the _ (avoiding the need to reprocess the entire RDD), then rebuilds the partition using the transformations stored in the lineage.
source partition
Adding an extra transformation to an RDD is lazy; at execution time it adds no new stage boundary (and little extra cost), unless it causes a _ .
shuffle
Transformations that cause shuffling are known as _ .
“wide transformations”
Thus, narrow transformations are pipelined within a stage and do not add a significant performance penalty.
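The stage boundary is visible in the lineage itself. A sketch (local mode, illustrative names): the narrow steps are pipelined, and the wide one introduces a `ShuffledRDD`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StagesDemo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val nums   = sc.parallelize(1 to 10, 2)
val narrow = nums.map(_ + 1).filter(_ % 2 == 0) // narrow: pipelined in the same stage
val wide   = narrow.groupBy(_ % 3)              // wide: shuffle => stage boundary
println(wide.toDebugString)                     // lineage shows a ShuffledRDD boundary
```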
The Spark Driver program is responsible for instantiating a _ , which provides a single, unified entry point to all of Spark’s functionality (APIs).
An entry point is the main object through which you access Spark’s capabilities, such as:
1. Core RDD APIs
2. DataFrame/Dataset APIs
3. SQL queries
4. Streaming
5. Hive support
SparkSession
It represents a single Spark application, i.e., one Spark application typically has one SparkSession (and exactly one underlying SparkContext).
SparkSession is responsible for sending extra JAR files and _ to executors (worker nodes).
configuration
Write Scala code to create a SparkSession.
Specify app name: “SparkApp”.
Specify master node as “local”.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName("SparkApp")
.master("local[*]")
.getOrCreate()
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings with 10 partitions.
path = “hdfs:/path/to/file”
// Access SparkContext from SparkSession
val sc = spark.sparkContext

// File path can be HDFS, local FS, or any Hadoop-supported URI
val filePath = args(0) // e.g. "hdfs://namenode:8020/path/to/file.txt"
                       // or "file:///path/to/localfile.txt"

// Read file into RDD[String] with 10 partitions
val textRDD = sc.textFile(filePath, minPartitions = 10)

// Example action: count lines
val lineCount = textRDD.count()
println(s"Total lines in file: $lineCount")

// Stop SparkSession
spark.stop()
Add a JAR dependency for all tasks to be executed on this SparkContext in the future.
// Access SparkContext from SparkSession
val sc = spark.sparkContext

// JAR file path can be HDFS, local FS, or any Hadoop-supported URI
val filePath = args(0)
sc.addJar(filePath)
Note: If a JAR is added during execution, it will not be available until the next TaskSet starts.
_ provides resources to your Spark Application.
Cluster Manager
resources like network (bandwidth), CPU, RAM
Cluster Manager is not a part of the Spark Application.
It’s an external service used by Spark Application to acquire resources on the cluster.
Name a few examples of Cluster Managers.
e.g., Spark Standalone, Hadoop YARN, Apache Mesos, Kubernetes
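The choice of cluster manager shows up in the master URL passed to the SparkSession builder. A sketch; the host/port values below are placeholders, not real endpoints:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder master URLs for different cluster managers:
//   local[*]                     - run in the local JVM (no cluster manager)
//   spark://master-host:7077     - Spark Standalone
//   yarn                         - Hadoop YARN
//   k8s://https://api-host:6443  - Kubernetes
val spark = SparkSession.builder()
  .appName("ClusterApp")                 // hypothetical app name
  .master("spark://master-host:7077")    // placeholder Standalone master
  .getOrCreate()
```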
The Driver node connects to the Cluster Manager to request _ for executors, until the SparkSession acquires them.
resources