What are the features of Spark RDD?
A. In-memory computation
B. Lazy evaluation
C. Fault tolerance
D. All of the above
D. All of the above
How many SparkContexts can be active per JVM?
A. Two
B. One
C. Unlimited
B. One
(Only one SparkContext may be active per JVM. You must stop() the active
SparkContext before creating a new one.)
Which of the following is not a transformation?
A. flatMap(func)
B. map(func)
C. reduce(func)
D. filter(func)
C. reduce(func) (reduce(func) is an action operation; it aggregates the elements
of the dataset using a function func (which takes two arguments and returns one).
The function should be commutative and associative so that it can be computed
correctly in parallel.)
A. flatMap(func) (flatMap(func) is a transformation operation and it’s similar to
map(func), but each input item can be mapped to 0 or more output items (so func
should return a Seq rather than a single item).)
B. map(func) (map(func) is a transformation operation and returns a new distributed
dataset formed by passing each element of the source through a function func.)
D. filter(func) (filter(func) is a transformation and returns a new dataset formed
by selecting those elements of the source on which func returns true.)
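The difference can be sketched in plain Python (a local analogy, not the PySpark API): map, flatMap, and filter each yield a new collection, while reduce collapses the data to a single value with a commutative, associative function.

```python
from functools import reduce

data = ["hello world", "spark rdd"]

# map(func): exactly one output element per input element
mapped = [len(s) for s in data]

# flatMap(func): func returns a sequence, and the results are flattened
flat_mapped = [word for s in data for word in s.split(" ")]

# filter(func): keep only the elements where func returns True
filtered = [s for s in data if "spark" in s]

# reduce(func): the action analogue -- aggregates to one value; the
# function must be commutative and associative to parallelize correctly
total = reduce(lambda a, b: a + b, mapped)
```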
Which one of the following operations does NOT trigger an eager evaluation?
A. take(n)
B. collect()
C. count()
D. join(otherDataset,[numPartitions])
D. join(otherDataset, [numPartitions]) (join(otherDataset,
[numPartitions]) is a transformation operation and won’t trigger evaluation
until an action is called.)
A. take(n) (take(n) is an action operation and returns an array with the first n
elements of the dataset.)
B. collect() (collect() is an action and returns all the elements of the dataset as an
array at the driver program. This is usually useful after a filter or other operation that
returns a sufficiently small subset of the data.)
C. count() (count() is an action and returns the number of elements in the dataset.)
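The eager/lazy split can be illustrated with plain-Python generators (an analogy, not Spark itself): building the pipeline computes nothing, and a take(n)-style action forces only the first n elements.

```python
from itertools import islice

computed = []

def square(x):
    computed.append(x)  # track which elements were actually evaluated
    return x * x

# Defining the pipeline is lazy: nothing has been computed yet
pipeline = (square(x) for x in range(1_000_000))

# A take(3)-like action forces evaluation of only the first 3 elements
first_three = list(islice(pipeline, 3))
```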
Which of the following commands will NOT generate a shuffle of data from each
executor across the cluster?
A. collect()
B. map(func)
C. repartition(numPartitions)
D. distinct([numPartitions])
B. map(func)
(map() is a narrow transformation: each output partition depends on a single
input partition, so no data moves between executors. repartition() and
distinct() redistribute records across partitions and therefore shuffle data
across the cluster, and collect() pulls every partition's data back to the
driver.)
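A rough sketch of why this holds, in plain Python with a hypothetical helper name: map() can run partition-by-partition, while distinct() must first co-locate equal elements, which is exactly what a shuffle does.

```python
def shuffle_by_hash(elems, num_partitions):
    """Redistribute elements so equal values land in the same partition."""
    parts = [[] for _ in range(num_partitions)]
    for e in elems:
        parts[hash(e) % num_partitions].append(e)
    return parts

input_partitions = [[1, 2, 2], [3, 1, 4]]

# map() is narrow: each output partition is built from one input partition,
# so no data crosses partition boundaries
mapped = [[x * 10 for x in part] for part in input_partitions]

# distinct() is wide: equal elements may start in different partitions, so
# they must be shuffled together before local deduplication is correct
shuffled = shuffle_by_hash([x for part in input_partitions for x in part], 2)
deduped = [sorted(set(part)) for part in shuffled]
```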
What is Apache Spark? How does it compare with Apache Hadoop?
Apache Spark is an open-source distributed general-purpose cluster-computing framework. It
provides an interface for programming entire clusters with implicit data parallelism and fault
tolerance.
Apache Spark vs. Apache Hadoop: Hadoop MapReduce reads and writes intermediate
results to disk, which slows down computation, whereas Spark keeps data in memory
where possible. Spark can also run on top of Hadoop (for example, using HDFS for
storage and YARN for resource management) and generally provides much better
computational speed.
What is an RDD?
A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark: an immutable, distributed collection of objects. Each RDD is divided into logical partitions, forming a fault-tolerant collection of elements that can be operated on in parallel, and lost partitions can be recomputed automatically in case of failure.
How does Spark work?
A driver program builds a DAG of transformations on distributed datasets; Spark splits the DAG into stages of tasks, and executors run those tasks in parallel across the cluster, keeping intermediate data in memory where possible.
Explain Lazy Evaluation in Spark
Transformations on RDDs are computed lazily: Spark will not begin executing until it sees an action called. Internally, Spark records metadata indicating that the operation has been requested, and can then decide the best way to perform the recorded series of transformations. Lazy evaluation reduces the number of passes over the data on disk. Dependencies between RDDs are logged in a lineage graph.
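A minimal sketch of this behaviour (a toy class, not real Spark): transformations only append to a recorded lineage, and the work is replayed when an action such as collect() is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them lazily."""

    def __init__(self, data, lineage=()):
        self._data = data
        self._lineage = lineage  # metadata about requested operations

    def map(self, func):  # transformation: nothing executes yet
        return LazyDataset(self._data, self._lineage + (("map", func),))

    def filter(self, pred):  # transformation: nothing executes yet
        return LazyDataset(self._data, self._lineage + (("filter", pred),))

    def collect(self):  # action: replay the recorded lineage now
        out = self._data
        for op, f in self._lineage:
            out = [f(x) for x in out] if op == "map" else [x for x in out if f(x)]
        return out

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 2).filter(lambda x: x > 4)
# At this point only the lineage exists; no element has been transformed
result = ds.collect()
```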
Which transformations will cause a shuffle and communication across nodes in Spark?
Assume that all wide transformations (for example, repartition(), distinct(), groupByKey(), reduceByKey(), and join()) will trigger a shuffle.
Which transformations are examples of a narrow dependency?
Narrow transformations are those in which each partition of the parent RDD
is used by at most one partition of the child RDD (for example, map(),
filter(), and flatMap()).
What is the fundamental data structure in Spark?
RDDs
In Spark programming, the default storage level of cache() is
a. MEMORY_ONLY
b. MEMORY_AND_DISK
c. DISK_ONLY
d. MEMORY_ONLY_SER
a. MEMORY_ONLY