What is Spark?
An open-source cluster computing framework
What does Spark do?
What do RDDs do?
They represent a collection of elements distributed across a cluster of machines, with an API that resembles a Scala collection.
What are some attributes of RDDs?
What are the types of operations on RDDs?
What does RDD stand for?
Resilient Distributed Dataset
What are the common transformations?
map, flatMap, filter
What are the common actions?
collect, take, reduce, fold, aggregate
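The transformations and actions listed above can be sketched roughly as follows. This is a minimal illustration, assuming a local Spark installation; the object name `RddOpsSketch`, the `run` helper, the app name, and the sample strings are all made up for the example.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of Spark itself.
object RddOpsSketch {
  // Returns (total characters across all words, number of words).
  def run(spark: SparkSession): (Int, Int) = {
    val sc = spark.sparkContext
    val lines = sc.parallelize(Seq("spark is fast", "rdds are lazy"))

    // Transformations are lazy: each one only describes a new RDD.
    val words = lines.flatMap(_.split(" "))   // one element per word
    val lengths = words.map(_.length)         // String => Int
    val longWords = words.filter(_.length > 3)

    // Actions trigger the computation and return results to the driver.
    val everyWord = words.collect()           // Array[String]
    val firstTwo = longWords.take(2)          // at most two elements
    val total = lengths.reduce(_ + _)
    val totalViaFold = lengths.fold(0)(_ + _) // same sum, with a zero value

    // aggregate may return a different type than the elements: (sum, count).
    val (sum, count) = lengths.aggregate((0, 0))(
      (acc, n) => (acc._1 + n, acc._2 + 1),   // fold an element into a partition result
      (a, b) => (a._1 + b._1, a._2 + b._2)    // merge two partition results
    )

    println(s"words=${everyWord.length} firstTwo=${firstTwo.mkString(",")}")
    println(s"reduce=$total fold=$totalViaFold")
    (sum, count)
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption for illustration only.
    val spark = SparkSession.builder()
      .appName("rdd-ops-sketch")
      .master("local[*]")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

Nothing is computed until one of the actions runs; the `flatMap`, `map`, and `filter` calls only build up the lineage.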
What is the main aspect of Pair RDDs?
They hold key-value pairs and take the form RDD[(K, V)], so their elements can be iterated over and looked up by key. Key-based operations such as joins are defined on Pair RDDs.
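A small sketch of what this looks like in practice, assuming a local Spark installation. The object name `PairRddSketch`, the `run` helper, and the sample data are hypothetical; `join`, `reduceByKey`, and `lookup` are the Pair RDD operations being illustrated.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of Spark itself.
object PairRddSketch {
  // Returns (inner-join result, per-key counts, ages found for "alice").
  def run(spark: SparkSession): (Map[String, (Int, String)], Map[String, Int], Seq[Int]) = {
    val sc = spark.sparkContext

    // An RDD of tuples is a Pair RDD: RDD[(String, Int)] here.
    val ages = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
    val cities = sc.parallelize(Seq(("alice", "Oslo"), ("carol", "Rome")))

    // Inner join on the key: only keys present on both sides survive.
    val joined = ages.join(cities)            // RDD[(String, (Int, String))]

    // reduceByKey combines all values that share a key.
    val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1))).reduceByKey(_ + _)

    // lookup returns every value stored under the given key.
    val aliceAges = ages.lookup("alice")

    (joined.collect().toMap, counts.collect().toMap, aliceAges)
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption for illustration only.
    val spark = SparkSession.builder()
      .appName("pair-rdd-sketch")
      .master("local[*]")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

The results are converted to maps before returning because the order of elements after a shuffle is not something to rely on.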
What are the 3 Spark partitioning schemes?
What is the difference between narrow and wide partition dependencies?
How does persistence work in the Spark framework (e.g. as Java objects, serialized, or on the file system)?
What are the two main dataset abstractions used by Spark SQL?
How can DataFrames and Datasets be created?
How can you access a column in a DataFrame?
df("column_name")
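For context, here is a minimal sketch of that accessor in use, assuming a local Spark installation. The object name `ColumnAccessSketch`, the `run` helper, and the sample columns `name` and `age` are made up for the example; the alternative accessors shown (`df.col`, `col`, `$"..."`) are other ways Spark's Scala API offers to refer to a column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical example object; not part of Spark itself.
object ColumnAccessSketch {
  // Returns the names of people older than 26 in a small hard-coded DataFrame.
  def run(spark: SparkSession): Array[String] = {
    import spark.implicits._                  // enables toDF, $"..." and .as[String]

    val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

    // Each of these yields a Column referring to "age".
    val viaApply = df("age")                  // the form shown above
    val viaMethod = df.col("age")
    val viaFunction = col("age")
    val viaDollar = $"age"

    // Columns are used to build expressions for select, filter, and so on.
    df.select(viaMethod, viaFunction, viaDollar).show()
    df.filter(viaApply > 26).select(df("name")).as[String].collect()
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption for illustration only.
    val spark = SparkSession.builder()
      .appName("column-access-sketch")
      .master("local[*]")
      .getOrCreate()
    try println(run(spark).mkString(", ")) finally spark.stop()
  }
}
```

`df("age")` and `df.col("age")` are bound to that particular DataFrame, while `col("age")` and `$"age"` are resolved against whichever DataFrame the expression is later applied to.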
What are some common operations on DataFrames?