What are the complexities of Data and Analytics (4V Data)
What are the complexities of Data and Analytics (4I Analysis)
What are the challenges of large-scale data processing
● Large clusters/clouds have 100s/1000s of servers
■ Extremely high performance/throughput possible
■ Problem: Highly parallel environment
■ Developers don't want to deal with concurrency issues or fault tolerance
● Needed: Suitable abstraction layer for developers
What are the three Requirements for Data-Intensive Applications as an abstraction layer?
What is Map and Reduce?
MapReduce is a Programming Model.
Map and Reduce are second-order functions that take first-order functions provided by the developer as input. They operate on a key-value model, meaning data is passed as KV pairs in all phases.
What are the Signature and Guarantees of the Map Function?
Signature: (k1, v1) → list(k2, v2)
● Guarantees to the first-order function
■ First-order function is invoked once for each KV pair
■ Can produce [0,*] KV pairs as output
● Useful for projections, selections, ...
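As an illustrative sketch (hypothetical names, not the original Google API), the map signature (k1, v1) → list(k2, v2) for the word count example could look like this: one invocation per input KV pair, emitting zero or more output pairs.

```python
# Hypothetical first-order map function for word count.
# Input KV pair: (document_name, document_text)
# Output: a list of (word, 1) pairs -- [0,*] pairs per invocation.
def word_count_map(k1, v1):
    return [(word, 1) for word in v1.split()]

pairs = word_count_map("doc1", "to be or not to be")
# pairs == [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
```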
What are the Signature and Guarantees of the Reduce Function?
Signature: (k2, list(v2)) → list(k3, v3)
● Guarantees to the first-order function
■ All KV-pairs with the same key are presented to the same invocation of the first-order function
● Useful for aggregation, grouping, …
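A matching sketch for the reduce signature (k2, list(v2)) → list(k3, v3), again with hypothetical names for the word count example: all values for one key arrive in a single invocation, here aggregated by summation.

```python
# Hypothetical first-order reduce function for word count.
# Input: (word, list_of_counts) -- all values for one key in one invocation.
# Output: a list of (word, total) pairs.
def word_count_reduce(k2, values):
    return [(k2, sum(values))]

result = word_count_reduce("be", [1, 1])
# result == [("be", 2)]
```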
Name the five steps of Map and Reduce.
Explain how MapReduce works. (Example Word Count)
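A minimal single-process sketch of the word count example (hypothetical helper names; a real MapReduce framework distributes these phases across many nodes): the map phase emits (word, 1) pairs, the shuffle phase groups all values by key, and the reduce phase sums each group.

```python
from collections import defaultdict

# Map: emit (word, 1) for every word in the input text.
def word_count_map(k1, v1):
    return [(word, 1) for word in v1.split()]

# Reduce: sum all counts for one word.
def word_count_reduce(k2, values):
    return [(k2, sum(values))]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: invoke map_fn once per input KV pair.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle/group phase: collect all values with the same key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase: one reduce_fn invocation per distinct key.
    output = []
    for k2, values in sorted(groups.items()):
        output.extend(reduce_fn(k2, values))
    return output

docs = [("doc1", "to be or not to be"), ("doc2", "be happy")]
print(run_mapreduce(docs, word_count_map, word_count_reduce))
# [('be', 3), ('happy', 1), ('not', 1), ('or', 1), ('to', 2)]
```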
Explain the Map function?
■ First-order function provided by user
■ Specifies what happens to the data in the job's map phase
Explain the Mapper?
■ A process running on a worker node
■ Invokes map function for each KV pair
Explain the Reduce function?
■ First-order function provided by user
■ Specifies what happens to the data in the job's reduce phase
Explain the Reducer?
■ Process invoking reduce function on grouped data
How does the Distributed Execution of MapReduce work?
What are the Limitations of MapReduce?
What are three scenarios of possible MapReduce faults?
How can a Mapper fault be corrected?
■ Master detects failure through missing status report
■ Mapper is restarted on a different node, re-reads data from GFS
How can a Reducer fault be corrected?
■ Again, detected through missing status report
■ Reducer is restarted on different node, pulls intermediate results for its partition from mappers again
How can an entire worker node fault be corrected?
■ Master re-schedules lost mappers and reducers
■ Finished mappers may be restarted to recompute lost intermediate results
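The failure-handling cards above can be sketched as a tiny master-side routine (hypothetical names and timeout value, not the original implementation): workers report status periodically, and tasks on workers whose last report is too old are re-queued so they can be restarted on a different node.

```python
# Seconds without a status report before a worker is presumed dead (assumed value).
HEARTBEAT_TIMEOUT = 10.0

def find_dead_workers(last_report, now):
    """Return workers whose last status report is older than the timeout."""
    return [w for w, t in last_report.items() if now - t > HEARTBEAT_TIMEOUT]

def reschedule(tasks_by_worker, dead_workers, pending_queue):
    """Move every task of a dead worker back into the pending queue,
    so the master can restart it on a different node."""
    for worker in dead_workers:
        pending_queue.extend(tasks_by_worker.pop(worker, []))
    return pending_queue

last_report = {"worker-1": 100.0, "worker-2": 95.0}
tasks = {"worker-1": ["map-0"], "worker-2": ["map-1", "reduce-0"]}
dead = find_dead_workers(last_report, now=108.0)   # worker-2 missed its report
print(reschedule(tasks, dead, pending_queue=[]))   # ['map-1', 'reduce-0']
```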
Explain the master-worker pattern in MapReduce.
● Master
■ Responsible for job scheduling
■ Monitoring worker nodes, detecting dead nodes
■ Load balancing
● Workers
■ Executing map and reduce functions
■ Storing input/output data (in traditional setup)
■ Periodically report availability to master node
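One of the master's duties above, load balancing, can be sketched as follows (a hypothetical greedy strategy for illustration, not the actual scheduler): each pending task goes to the worker currently running the fewest tasks.

```python
# Sketch of greedy load balancing on the master (hypothetical names):
# assign each pending task to the least-loaded worker.
def assign_tasks(pending, workers):
    load = {w: 0 for w in workers}   # tasks currently assigned per worker
    assignment = []
    for task in pending:
        worker = min(load, key=load.get)  # pick the least-loaded worker
        assignment.append((task, worker))
        load[worker] += 1
    return assignment

print(assign_tasks(["map-0", "map-1", "map-2"], ["w1", "w2"]))
# [('map-0', 'w1'), ('map-1', 'w2'), ('map-2', 'w1')]
```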