For iterative map/reduce implementations, what negatively affects performance (2)
What is Apache Spark?
an open-source, distributed framework designed for big data processing and analytics that takes advantage of in-memory processing
How is data handled with in-memory processing? (2)
In what circumstance is in-memory processing suitable?
What is an in-memory processing approach?
the technique of storing and manipulating data that is stored in the computer’s main memory (RAM), as opposed to traditional disk-based storage
Where do Spark components run?
In Java virtual machines
What is the Spark cluster manager?
Spark’s standalone cluster manager, YARN (which is mainly suited for Hadoop), or Mesos (which is more generic); it keeps track of the resources available
What do Spark applications consist of? (2)
How are Spark applications run? (5)
What is an RDD?
a partitioned collection of records
How do you control Spark applications?
through a driver process called the SparkSession
How are RDDs created? (2)
What are the two types of Spark operation?
What is a transformation?
A lazy operation to build an RDD from another RDD
What is an action? (3)
An operation that takes an RDD and returns a result to the driver/HDFS
Describe how operations are evaluated lazily in Spark (3)
When is it necessary to use low-level API? (3)
Describe the “micro-batch” approach
accumulates small batches of input data and then processes them in parallel
What are Spark low-level APIs?
APIs to write applications that operate on RDDs directly
What are the three execution modes?
Which is the most common mode for executing Spark programs?
Cluster mode
What is the difference between client and cluster mode?
In cluster mode, the cluster manager launches the driver process and executor processes on worker nodes inside the cluster, meaning the cluster manager is responsible for maintaining all Spark Application–related processes
In client mode, the driver process runs on the same machine that submits the application, meaning the client machine is responsible for maintaining the Spark driver process while the cluster manager maintains the executor processes
What is a narrow operation?
operations which are applied to each record independently, e.g. map, flatMap
What is a wide operation?
operations which involve records from multiple partitions (and are costly, since they require shuffling data between partitions), e.g. reduceByKey, groupByKey