What applications are best suited for GraphX or big graph processing?
Applications involving extremely large graphs that cannot fit on a single machine, such as:
Web graphs (pages + hyperlinks, billions of edges)
Knowledge graphs (entities + relationships)
Social networks (hundreds of millions of users and connections)
Why does big graph processing need distributed systems like GraphX?
Big graphs often contain hundreds of millions of vertices and billions of edges, making them too large to store or process on a single machine. Distributed graph frameworks handle large-scale analytics efficiently
What are the two major types of graph partitioning?
Edge Cut:
Each vertex goes to one partition.
Edges may be replicated across partitions.
Vertex Cut:
Each edge is assigned to one partition.
Vertices may be replicated.
GraphX uses Vertex Cut to reduce communication cost
Which graph partitioning method does GraphX use, and why?
GraphX uses Vertex Cut, because assigning edges to partitions and replicating vertices helps minimize inter-partition communication, which is crucial for scalable graph analytics.
What are basic vertex operations in GraphX?
filter: keep only vertices satisfying a predicate
mapValues: modify vertex attributes
diff: remove vertices that exist in another vertex set
What are basic edge operations in GraphX?
mapValues: modify edge values
reverse: flip direction of edges (u → v becomes v → u)
How are graph vertices and edges represented in GraphX?
GraphX uses the Property Graph Model:
VertexRDD[VD] stores (VertexID, VertexData)
EdgeRDD[ED] stores (SourceID, DestinationID, EdgeData
What is Pregel-style graph processing?
Each vertex sends messages along outgoing edges.
Each vertex updates its data by aggregating incoming messages.
Steps repeat until a stopping condition is reached (e.g., no updates).
What is the Pregel API in GraphX used for?
vprog: how a vertex updates its value
sendMsg: how vertices send messages
mergeMsg: how messages are combined
It returns a new graph after each iteration.
What are examples of Pregel-style graph algorithms shown in the PDF?
Connected Components: propagate smallest vertex ID
Single-Source Shortest Path (SSSP): propagate distance updates
PageRank: propagate rank contributions
All follow the message-passing, iterative Pregel model.
What is the general flow of a Pregel superstep?
Vertex receives messages
Vertex updates its state (vprog)
Vertex sends new messages (sendMsg)
System merges them (mergeMsg)
Repeat until convergence or iteration limit