What does reduceByKey do in Spark
It merges the values for each key with an associative function, combining map-side before the shuffle
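A minimal plain-Python sketch of what this card describes, not Spark code. The helper `reduce_by_key` and the list-of-lists stand-in for partitions are assumptions made for illustration. In PySpark the corresponding call would be `rdd.reduceByKey(func)`.

```python
from operator import add


def reduce_by_key(partitions, func):
    """Model of reduceByKey: combine inside each partition, then merge across partitions."""
    # Map-side combine: reduce values per key within each partition first.
    combined = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        combined.append(local)

    # Stand-in for the shuffle: only one partial result per key per partition moves.
    result = {}
    for local in combined:
        for key, value in local.items():
            result[key] = func(result[key], value) if key in result else value
    return result


# Two hypothetical partitions of (key, value) pairs.
partitions = [[("a", 1), ("b", 2), ("a", 3)], [("a", 4), ("b", 5)]]
totals = reduce_by_key(partitions, add)
print(totals)  # {'a': 8, 'b': 7}
```

Because partial results are merged in two stages, the function should be associative and commutative, as addition is here.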
What is the purpose of flatMap in Spark
It expands one input element into zero or more outputs
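A small plain-Python sketch contrasting map and flatMap semantics. The helper `flat_map` is an illustrative stand-in, not part of Spark.

```python
from itertools import chain


def flat_map(func, items):
    """Apply func to each item and flatten the resulting sequences by one level."""
    return list(chain.from_iterable(func(item) for item in items))


lines = ["hello world", "", "spark"]

# map keeps one output per input, so the empty line yields an empty list.
mapped = [line.split() for line in lines]
print(mapped)  # [['hello', 'world'], [], ['spark']]

# flatMap flattens: the empty line contributes zero outputs.
words = flat_map(str.split, lines)
print(words)  # ['hello', 'world', 'spark']
```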
What does collect do in Spark
It returns all elements of the RDD to the driver, so it is only safe when the result fits in driver memory
What does take do in Spark
It returns the first N elements from the RDD
How is a text file loaded into an RDD in Spark
By calling sc.textFile with the file path
What is a transformation in Spark
It defines a new RDD from an existing one lazily
What is an action in Spark
It triggers execution of the DAG and returns a result
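The lazy transformation versus action distinction can be imitated with Python's built-in lazy `map`. This is only an analogy under that assumption, not how Spark schedules work. The `log` list is there purely to show when the function actually runs.

```python
log = []


def square(x):
    log.append(x)  # record every call so we can see when work happens
    return x * x


numbers = [1, 2, 3]

# Like a transformation: builds a lazy pipeline, nothing is computed yet.
squares = map(square, numbers)
ran_before_action = list(log)
print(ran_before_action)  # []

# Like an action: consuming the pipeline forces the computation.
result = list(squares)
print(result)  # [1, 4, 9]
print(log)     # [1, 2, 3]
```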
What does repartition do in Spark
It reshuffles the data into a new number of partitions
What does cache do in Spark
It stores the RDD in memory for faster reuse
What is a shuffle in Spark
It is a redistribution of data across partitions, usually moving records between executors over the network
What is the purpose of broadcasting in Spark
It ships read-only data to each executor once instead of sending a copy with every task
What does mapValues do in Spark
It transforms only the values of key-value pairs and leaves the keys unchanged
What kind of join is performed by join in Spark on two pair RDDs
An inner join
What is a broadcast join in Spark SQL
A join where a small table is broadcast to all executors
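A plain-Python sketch of the idea behind a broadcast hash join with inner-join semantics. The function name and the sample tables are hypothetical. A dictionary stands in for the broadcast copy of the small table, and a list of lists stands in for the partitions of the large one.

```python
def broadcast_hash_join(large_partitions, small_table):
    """Inner join: each partition of the large side probes a local copy of the small side."""
    # Build the lookup once; in Spark this copy would be shipped to every executor.
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)

    joined = []
    for part in large_partitions:
        for key, value in part:
            # Keys missing from the small table produce no output (inner join).
            for small_value in lookup.get(key, []):
                joined.append((key, (value, small_value)))
    return joined


orders = [[(1, "book"), (4, "lamp")], [(1, "pen"), (2, "mug")]]
users = [(1, "ann"), (2, "bob"), (3, "cy")]
joined = broadcast_hash_join(orders, users)
print(joined)  # [(1, ('book', 'ann')), (1, ('pen', 'ann')), (2, ('mug', 'bob'))]
```

The large side never moves in this sketch, which is the point of broadcasting: only the small table is copied.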
What storage system does MapReduce depend on
HDFS distributed storage
What does the mapper phase do in MapReduce
It processes each input record and emits intermediate key-value pairs
What does the reducer phase do in MapReduce
It aggregates values for each key
What is the shuffle in MapReduce
It groups intermediate values by key between map and reduce
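The map, shuffle, and reduce phases described in the cards above can be sketched as a single-process word count. The function names are illustrative and this is not the Hadoop API.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit an intermediate (word, 1) pair for every word."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle phase: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: aggregate all values for one key."""
    return key, sum(values)


lines = ["to be or not", "to be"]
intermediate = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(key, values) for key, values in shuffle(intermediate).items())
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```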
Why is MapReduce slow for iterative algorithms
Because each iteration runs as a separate job that writes its output to disk and the next iteration reads it back
What is the significance of the HDFS block size
By default it determines the input split size, and so how many mappers run
What is the key idea of PageRank
It assigns importance scores based on link structure
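A minimal power-iteration sketch of PageRank. The damping factor 0.85, the fixed iteration count, and the even spreading of rank from dangling nodes are conventional choices assumed here, not taken from the card. The graph is a hypothetical three-page example, and every link target is assumed to appear as a key.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each node to the list of nodes it links to."""
    nodes = list(links)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node starts with the "random jump" share.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node, outs in links.items():
            if outs:
                # Split this node's rank evenly among the pages it links to.
                share = damping * ranks[node] / len(outs)
                for target in outs:
                    new_ranks[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_ranks[target] += damping * ranks[node] / n
        ranks = new_ranks
    return ranks


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 3))
```

Page c ends up ranked highest because both a and b link to it, and b lowest because it only receives half of a's rank.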
What does an adjacency list represent in a graph
It lists neighbors of each node
What is BFS used for in graph analysis
It computes shortest paths in unweighted graphs
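A short sketch tying the two cards above together: the graph is stored as an adjacency list (a dict from node to neighbors), and BFS computes hop counts from a source. The sample graph and the function name are made up for illustration.

```python
from collections import deque


def bfs_distances(graph, source):
    """Return the number of edges on a shortest path from source to each reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            # The first visit is along a shortest path because BFS explores level by level.
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


# Adjacency list: each node maps to the nodes it points to.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
distances = bfs_distances(graph, "a")
print(distances)  # {'a': 0, 'b': 1, 'c': 1, 'd': 2, 'e': 3}
```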
What is cosine similarity used for
It measures similarity between vectors by the cosine of the angle between them, ignoring magnitude
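A small sketch of the formula, cos(u, v) = (u · v) / (|u| |v|), using only the standard library. Raising an error for zero vectors is a choice made here, since the measure is undefined in that case.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Same direction, different length: similarity is (approximately) 1.
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])

# Perpendicular vectors: similarity is 0.
orthogonal = cosine_similarity([1, 0], [0, 1])

print(round(same_direction, 6), orthogonal)
```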