What are the complexities of Data and Analytics (4V Data)
What are the complexities of Data and Analytics (4I Analysis)
What are the challenges of large-scale data processing
● Large clusters/clouds have 100s/1000s of servers
■ Extremely high performance/throughput possible
■ Problem: Highly parallel environment
■ Developers don't want to deal with concurrency issues or fault tolerance
● Needed: Suitable abstraction layer for developers
What are the three Requirements for Data-Intensive Applications as an abstraction layer?
What is Map and Reduce?
MapReduce is a Programming Model.
Map and Reduce are second-order functions that take first-order functions provided by the developer as input. They operate on a key-value model, meaning data is passed as KV pairs in all phases.
What are the Signature and Guarantees of the Map Function?
Signature: (k1, v1) → list(k2, v2)
● Guarantees to the first-order function
■ First-order function is invoked once for each KV pair
■ Can produce [0,*] KV pairs as output
● Useful for projections, selections, ...
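As an illustrative sketch (hypothetical names, not the original Google API), the map signature (k1, v1) → list(k2, v2) for the word count example could look like this: one invocation per input KV pair, emitting zero or more output pairs.

```python
# Hypothetical first-order map function for word count.
# Input KV pair: (document_name, document_text)
# Output: a list of (word, 1) pairs -- [0,*] pairs per invocation.
def word_count_map(k1, v1):
    return [(word, 1) for word in v1.split()]

pairs = word_count_map("doc1", "to be or not to be")
# pairs == [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
```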
What are the Signature and Guarantees of the Reduce Function?
Signature: (k2, list(v2)) → list(k3, v3)
● Guarantees to the first-order function
■ All KV-pairs with the same key are presented to the same invocation of the first-order function
● Useful for aggregation, grouping, …
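A matching sketch for the reduce signature (k2, list(v2)) → list(k3, v3), again with hypothetical names for the word count example: all values for one key arrive in a single invocation, here aggregated by summation.

```python
# Hypothetical first-order reduce function for word count.
# Input: (word, list_of_counts) -- all values for one key in one invocation.
# Output: a list of (word, total) pairs.
def word_count_reduce(k2, values):
    return [(k2, sum(values))]

result = word_count_reduce("be", [1, 1])
# result == [("be", 2)]
```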
Name the five steps of Map and Reduce.
Explain how MapReduce works. (Example Word Count)
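A minimal single-process sketch of the word count example (hypothetical helper names; a real MapReduce framework distributes these phases across many nodes): the map phase emits (word, 1) pairs, the shuffle phase groups all values by key, and the reduce phase sums each group.

```python
from collections import defaultdict

# Map: emit (word, 1) for every word in the input text.
def word_count_map(k1, v1):
    return [(word, 1) for word in v1.split()]

# Reduce: sum all counts for one word.
def word_count_reduce(k2, values):
    return [(k2, sum(values))]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: invoke map_fn once per input KV pair.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle/group phase: collect all values with the same key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase: one reduce_fn invocation per distinct key.
    output = []
    for k2, values in sorted(groups.items()):
        output.extend(reduce_fn(k2, values))
    return output

docs = [("doc1", "to be or not to be"), ("doc2", "be happy")]
print(run_mapreduce(docs, word_count_map, word_count_reduce))
# [('be', 3), ('happy', 1), ('not', 1), ('or', 1), ('to', 2)]
```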
Explain the Map function?
■ First-order function provided by user
■ Specifies what happens to the data in the job's map phase
Explain the Mapper?
■ A process running on a worker node
■ Invokes map function for each KV pair
Explain the Reduce function?
■ First-order function provided by user
■ Specifies what happens to the data in the job's reduce phase
Explain the Reducer?
■ Process invoking reduce function on grouped data
How does the Distributed Execution of MapReduce work?
What are the Limitations of MapReduce?
What are three scenarios of possible MapReduce faults?
How can a Mapper fault be corrected?
■ Master detects failure through missing status report
■ Mapper is restarted on a different node, re-reads data from GFS
How can a Reducer fault be corrected?
■ Again, detected through missing status report
■ Reducer is restarted on different node, pulls intermediate results for its partition from mappers again
How can an entire worker node fault be corrected?
■ Master re-schedules lost mappers and reducers
■ Finished mappers may be restarted to recompute lost intermediate results
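The failure-handling cards above can be sketched as a tiny master-side routine (hypothetical names and timeout value, not the original implementation): workers report status periodically, and tasks on workers whose last report is too old are re-queued so they can be restarted on a different node.

```python
# Seconds without a status report before a worker is presumed dead (assumed value).
HEARTBEAT_TIMEOUT = 10.0

def find_dead_workers(last_report, now):
    """Return workers whose last status report is older than the timeout."""
    return [w for w, t in last_report.items() if now - t > HEARTBEAT_TIMEOUT]

def reschedule(tasks_by_worker, dead_workers, pending_queue):
    """Move every task of a dead worker back into the pending queue,
    so the master can restart it on a different node."""
    for worker in dead_workers:
        pending_queue.extend(tasks_by_worker.pop(worker, []))
    return pending_queue

last_report = {"worker-1": 100.0, "worker-2": 95.0}
tasks = {"worker-1": ["map-0"], "worker-2": ["map-1", "reduce-0"]}
dead = find_dead_workers(last_report, now=108.0)   # worker-2 missed its report
print(reschedule(tasks, dead, pending_queue=[]))   # ['map-1', 'reduce-0']
```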
Explain the master-worker pattern in MapReduce.
● Master
■ Responsible for job scheduling
■ Monitoring worker nodes, detecting dead nodes
■ Load balancing
● Workers
■ Executing map and reduce functions
■ Storing input/output data (in traditional setup)
■ Periodically report availability to master node
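One of the master's duties above, load balancing, can be sketched as follows (a hypothetical greedy strategy for illustration, not the actual scheduler): each pending task goes to the worker currently running the fewest tasks.

```python
# Sketch of greedy load balancing on the master (hypothetical names):
# assign each pending task to the least-loaded worker.
def assign_tasks(pending, workers):
    load = {w: 0 for w in workers}   # tasks currently assigned per worker
    assignment = []
    for task in pending:
        worker = min(load, key=load.get)  # pick the least-loaded worker
        assignment.append((task, worker))
        load[worker] += 1
    return assignment

print(assign_tasks(["map-0", "map-1", "map-2"], ["w1", "w2"]))
# [('map-0', 'w1'), ('map-1', 'w2'), ('map-2', 'w1')]
```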