What is an SLA in the context of ETL?
Give an example.
SLA = Service Level Agreement
Example of SLA:
“Your transformed data must be available by 6 in the morning.”
The Data Processing Framework should keep latency under control, even as data volume scales.
i.e., Latency should not scale linearly with data _ .
volume
This can be achieved through vertical and horizontal scaling.
The Data Processing Framework should be resilient to transient network failures.
To this end, the restarting of a failed job should be _ .
automatic
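As one concrete instance, Spark (discussed below) retries failed tasks automatically; the retry budget is a configuration knob. A minimal sketch, with a hypothetical app name, showing where that knob is set:

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: Spark restarts failed tasks on its own; the retry budget
// is controlled by spark.task.maxFailures (default 4).
val spark = SparkSession.builder()
  .appName("ResilientApp")                 // hypothetical app name
  .master("local[*]")
  .config("spark.task.maxFailures", "4")   // retries before the job is failed
  .getOrCreate()
```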
The Data Processing Framework should support integration (ingestion) with a diverse set of data sources.
Name some.
e.g., files (CSV, JSON, Parquet), relational databases via JDBC, Kafka, HDFS, cloud object stores (S3)
Similarly, the DP Framework should support writing to a diverse set of data targets.
Data Processing Framework should support all kinds of data _ :
* batch
* streaming
* near real-time
ingestion
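In Spark (covered below), batch and streaming ingestion of the same source can be sketched as follows; the directory `data/events/` is a hypothetical example and must exist for this to run:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("IngestModes").master("local[*]").getOrCreate()

// Batch ingestion: read a (hypothetical) directory of JSON files once
val batchDF = spark.read.json("data/events/")

// Streaming ingestion: continuously pick up new files in the same layout
// (streaming file sources require an explicit schema)
val streamDF = spark.readStream.schema(batchDF.schema).json("data/events/")
```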
Additionally, the DP Framework should be easy to upgrade (with backward compatibility), easy to customize and extend, and operationally smooth enough to hide implementation complexities.
Spark is an in-memory, distributed data processing framework, which is an improvement over the _ framework.
map-reduce
RDD full form?
Resilient Distributed Dataset
RDD is an immutable data structure. It doesn’t contain the data itself, only the lineage information.
Partitions of RDD are distributed across the _ of the cluster.
nodes
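A small sketch of partitioning in local mode (the app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PartitionsDemo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Ask Spark to split the data into 4 partitions
val data = sc.parallelize(1 to 100, 4)
println(data.getNumPartitions)  // 4
```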
When an RDD is created or transformed, the associated _ (dependency) information is updated and kept in-memory by Spark.
lineage
Lineage information is essential to avoid reprocessing source RDD if one of the partitions of RDD gets lost due to failure.
The operations on RDD can be classified into two categories.
Name them.
How do operations on RDD relate with lineage?
1. Transformation _ the lineage.
2. Action _ the result/outcome of all updates in the lineage.
updates / materializes (computes)
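This can be seen directly in a small sketch (local mode, illustrative app name): a transformation only extends the lineage, which `toDebugString` prints, and an action finally computes:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LineageDemo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val base    = sc.parallelize(1 to 5, 2)
val doubled = base.map(_ * 2)      // transformation: extends the lineage, computes nothing
println(doubled.toDebugString)     // inspect the recorded lineage (dependency chain)
val sum = doubled.reduce(_ + _)    // action: materializes the result
println(sum)                       // 30
```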
Partitions of RDD are distributed across the nodes of the cluster.
Each partition of RDD is processed _ by one task of the executor core.
in parallel
After shuffling, the number of partitions of the RDD _ .
(stay same/can change)
can change
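A minimal sketch of this (local mode, illustrative names): `reduceByKey` takes an optional partition count for its shuffled output:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ShuffleDemo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
println(pairs.getNumPartitions)            // 4 before the shuffle
val reduced = pairs.reduceByKey(_ + _, 8)  // wide transformation with 8 output partitions
println(reduced.getNumPartitions)          // 8 after the shuffle
```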
Lineage information is essential to avoid reprocessing entire source RDD if one of the partitions of RDD gets lost due to failure.
When one partition of an RDD is lost, Spark launches a new task that uses the RDD’s lineage information to traverse back and read only the _ (avoiding the need to reprocess the entire RDD), then rebuilds the partition using the transformations stored in the lineage.
source partition
Adding an extra transformation to an RDD is lazy; at execution time it adds no new stage boundary (and little extra cost), unless it causes a _ .
shuffle
Transformations that cause shuffling are known as _ .
“wide transformations”
Thus, narrow transformations are pipelined within a stage and do not add a significant performance penalty.
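The stage boundary is visible in the lineage itself. A sketch (local mode, illustrative names): the narrow steps are pipelined, and the wide one introduces a `ShuffledRDD`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StagesDemo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val nums   = sc.parallelize(1 to 10, 2)
val narrow = nums.map(_ + 1).filter(_ % 2 == 0) // narrow: pipelined in the same stage
val wide   = narrow.groupBy(_ % 3)              // wide: shuffle => stage boundary
println(wide.toDebugString)                     // lineage shows a ShuffledRDD boundary
```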
The Spark Driver program is responsible for instantiating a _ , which provides a single, unified entry point to all of Spark’s functionality (APIs).
An entry point is the main object through which you access Spark’s capabilities, such as:
1. Core RDD APIs
2. DataFrame/Dataset APIs
3. SQL queries
4. Streaming
5. Hive support
SparkSession
It represents a single Spark application, i.e., one Spark application typically has one SparkSession (and exactly one underlying SparkContext).
SparkSession is responsible for sending extra JAR files and _ to executors (worker nodes).
configuration
Write Scala code to create a SparkSession.
Specify app name: “SparkApp”.
Specify master node as “local”.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName("SparkApp")
.master("local[*]")
.getOrCreate()
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings with 10 partitions.
path = “hdfs:/path/to/file”
// Access SparkContext from SparkSession
val sc = spark.sparkContext

// File path can be HDFS, local FS, or any Hadoop-supported URI
val filePath = args(0) // e.g. "hdfs://namenode:8020/path/to/file.txt"
                       // or "file:///path/to/localfile.txt"

// Read file into RDD[String] with 10 partitions
val textRDD = sc.textFile(filePath, minPartitions = 10)

// Example action: count lines
val lineCount = textRDD.count()
println(s"Total lines in file: $lineCount")

// Stop SparkSession
spark.stop()
Add a JAR dependency for all tasks to be executed on this SparkContext in the future.
// Access SparkContext from SparkSession
val sc = spark.sparkContext

// JAR file path can be HDFS, local FS, or any Hadoop-supported URI
val filePath = args(0)
sc.addJar(filePath)
Note: If a JAR is added during execution, it will not be available until the next TaskSet starts.
_ provides resources to your Spark Application.
Cluster Manager
resources like network (bandwidth), CPU, RAM
Cluster Manager is not a part of the Spark Application.
It’s an external service used by Spark Application to acquire resources on the cluster.
Name a few examples of Cluster Managers.
e.g., Spark Standalone, Hadoop YARN, Apache Mesos, Kubernetes
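The choice of cluster manager shows up in the master URL passed to the SparkSession builder. A sketch; the host/port values below are placeholders, not real endpoints:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder master URLs for different cluster managers:
//   local[*]                     - run in the local JVM (no cluster manager)
//   spark://master-host:7077     - Spark Standalone
//   yarn                         - Hadoop YARN
//   k8s://https://api-host:6443  - Kubernetes
val spark = SparkSession.builder()
  .appName("ClusterApp")                 // hypothetical app name
  .master("spark://master-host:7077")    // placeholder Standalone master
  .getOrCreate()
```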
The Driver node connects to the Cluster Manager to request _ for executors, until the SparkSession acquires them.
resources