What are the features of Spark RDD?
A. In-memory computation
B. Lazy evaluation
C. Fault tolerance
D. All of the above
D. All of the above
How many SparkContexts can be active per JVM?
A. Two
B. One
C. Unlimited
B. One
(Only one SparkContext may be active per JVM. You must stop() the active
SparkContext before creating a new one.)
Which of the following is not a transformation?
A. flatMap(func)
B. map(func)
C. reduce(func)
D. filter(func)
C. reduce(func) (reduce(func) is an action operation; it aggregates the elements
of the dataset using a function func (which takes two arguments and returns one).
The function should be commutative and associative so that it can be computed
correctly in parallel.)
A. flatMap(func) (flatMap(func) is a transformation operation and it’s similar to
map(func), but each input item can be mapped to 0 or more output items (so func
should return a Seq rather than a single item).)
B. map(func) (map(func) is a transformation operation and returns a new distributed
dataset formed by passing each element of the source through a function func.)
D. filter(func) (filter(func) is a transformation and returns a new dataset formed
by selecting those elements of the source on which func returns true.)
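The difference can be sketched in plain Python (a local analogy, not the PySpark API): map, flatMap, and filter each yield a new collection, while reduce collapses the data to a single value with a commutative, associative function.

```python
from functools import reduce

data = ["hello world", "spark rdd"]

# map(func): exactly one output element per input element
mapped = [len(s) for s in data]

# flatMap(func): func returns a sequence, and the results are flattened
flat_mapped = [word for s in data for word in s.split(" ")]

# filter(func): keep only the elements where func returns True
filtered = [s for s in data if "spark" in s]

# reduce(func): the action analogue -- aggregates to one value; the
# function must be commutative and associative to parallelize correctly
total = reduce(lambda a, b: a + b, mapped)
```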
Which one of the following operations does NOT trigger an eager evaluation?
A. take(n)
B. collect()
C. count()
D. join(otherDataset,[numPartitions])
D. join(otherDataset, [numPartitions]) (join(otherDataset,
[numPartitions]) is a transformation operation and won’t trigger evaluation
until an action is called.)
A. take(n) (take(n) is an action operation and returns an array with the first n
elements of the dataset.)
B. collect() (collect() is an action and returns all the elements of the dataset as an
array at the driver program. This is usually useful after a filter or other operation that
returns a sufficiently small subset of the data.)
C. count() (count() is an action and returns the number of elements in the dataset.)
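The eager/lazy split can be illustrated with plain-Python generators (an analogy, not Spark itself): building the pipeline computes nothing, and a take(n)-style action forces only the first n elements.

```python
from itertools import islice

computed = []

def square(x):
    computed.append(x)  # track which elements were actually evaluated
    return x * x

# Defining the pipeline is lazy: nothing has been computed yet
pipeline = (square(x) for x in range(1_000_000))

# A take(3)-like action forces evaluation of only the first 3 elements
first_three = list(islice(pipeline, 3))
```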
Which of the following commands will NOT generate a shuffle of data from each
executor across the cluster?
A. collect()
B. map(func)
C. repartition(numPartitions)
D. distinct([numPartitions])
B. map(func)
(map() is a narrow transformation: each output partition depends on a single
input partition, so no data moves between executors. repartition() and
distinct() redistribute records across partitions and therefore shuffle data
across the cluster, and collect() pulls every partition's data back to the
driver.)
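A rough sketch of why this holds, in plain Python with a hypothetical helper name: map() can run partition-by-partition, while distinct() must first co-locate equal elements, which is exactly what a shuffle does.

```python
def shuffle_by_hash(elems, num_partitions):
    """Redistribute elements so equal values land in the same partition."""
    parts = [[] for _ in range(num_partitions)]
    for e in elems:
        parts[hash(e) % num_partitions].append(e)
    return parts

input_partitions = [[1, 2, 2], [3, 1, 4]]

# map() is narrow: each output partition is built from one input partition,
# so no data crosses partition boundaries
mapped = [[x * 10 for x in part] for part in input_partitions]

# distinct() is wide: equal elements may start in different partitions, so
# they must be shuffled together before local deduplication is correct
shuffled = shuffle_by_hash([x for part in input_partitions for x in part], 2)
deduped = [sorted(set(part)) for part in shuffled]
```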
What is Apache Spark? How does it compare with Apache Hadoop?
Apache Spark is an open-source distributed general-purpose cluster-computing framework. It
provides an interface for programming entire clusters with implicit data parallelism and fault
tolerance.
Apache Spark vs. Apache Hadoop: Hadoop MapReduce reads and writes intermediate
results to disk, which slows down computation, whereas Spark keeps data in memory
where possible. Spark can also run on top of Hadoop (for example, using HDFS for
storage and YARN for resource management) and generally provides much better
computational speed.
What is an RDD?
A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark: an immutable, distributed collection of objects. Each RDD is divided into logical partitions, forming a fault-tolerant collection of elements that can be operated on in parallel, and lost partitions can be recomputed automatically in case of failure.
How does Spark work?
A driver program builds a DAG of transformations on distributed datasets; Spark splits the DAG into stages of tasks, and executors run those tasks in parallel across the cluster, keeping intermediate data in memory where possible.
Explain Lazy Evaluation in Spark
Transformations on RDDs are computed lazily: Spark will not begin executing until it sees an action called. Internally, Spark records metadata indicating that the operation has been requested, and can then decide the best way to perform the recorded series of transformations. Lazy evaluation reduces the number of passes over the data on disk. Dependencies between RDDs are logged in a lineage graph.
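A minimal sketch of this behaviour (a toy class, not real Spark): transformations only append to a recorded lineage, and the work is replayed when an action such as collect() is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them lazily."""

    def __init__(self, data, lineage=()):
        self._data = data
        self._lineage = lineage  # metadata about requested operations

    def map(self, func):  # transformation: nothing executes yet
        return LazyDataset(self._data, self._lineage + (("map", func),))

    def filter(self, pred):  # transformation: nothing executes yet
        return LazyDataset(self._data, self._lineage + (("filter", pred),))

    def collect(self):  # action: replay the recorded lineage now
        out = self._data
        for op, f in self._lineage:
            out = [f(x) for x in out] if op == "map" else [x for x in out if f(x)]
        return out

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 2).filter(lambda x: x > 4)
# At this point only the lineage exists; no element has been transformed
result = ds.collect()
```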
Which transformations will cause a shuffle and communication across nodes in Spark?
Assume that all wide transformations (for example, repartition(), distinct(), groupByKey(), reduceByKey(), and join()) will trigger a shuffle.
Which transformations are examples of a narrow dependency?
Narrow transformations are those in which each partition of the parent RDD
is used by at most one partition of the child RDD (for example, map(),
filter(), and flatMap()).
What is the fundamental data structure in Spark?
RDDs
In Spark programming, the default storage level of cache() is
a. MEMORY_ONLY
b. MEMORY_AND_DISK
c. DISK_ONLY
d. MEMORY_ONLY_SER
a. MEMORY_ONLY