What is Moore’s Law?
The number of transistors in an integrated circuit doubles about every two years
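Worked example of the doubling rule. This is only a rough sketch: it assumes an exact two-year doubling period, and the function name and the starting count of 1,000 are illustrative, not real chip data.

```python
def projected_transistors(start_count: float, years: float,
                          doubling_period: float = 2.0) -> float:
    """Project a transistor count, assuming exact doubling every `doubling_period` years."""
    return start_count * 2 ** (years / doubling_period)


# Illustrative starting value only: 1,000 transistors today.
for years in (2, 10, 20):
    print(years, projected_transistors(1_000, years))
```

Ten doublings (20 years) multiply the count by 2^10 = 1024.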
Why was there a switch from faster to parallel execution?
Paradigm shift: single processors stopped getting faster because of physical limits
-> speed of light, atomic boundaries, limited 3D layering
What is Hadoop?
What do we need to write for MapReduce?
How do we store large files with the Hadoop Distributed File System?
What are the components of the Hadoop Architecture? (overview)
What is the infrastructure of YARN?
Resource Manager (1/cluster): assigns cluster resources to applications
Node Manager (many/cluster): monitors its node
App Master (1/application): manages one application (e.g. a MapReduce job)
Container: runs one task (map, reduce, …)
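A toy model of how the four roles relate. This is a sketch only: every class and method name below is hypothetical and none of it is the YARN API. It also assumes a node's capacity can be counted in simple "slots".

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """One per node: tracks how many containers the node can still host."""
    name: str
    free_slots: int


@dataclass
class Container:
    """A slice of one node's resources in which a single task runs."""
    node: str
    task: str


@dataclass
class ResourceManager:
    """One per cluster: hands out containers on nodes with free capacity."""
    nodes: list

    def allocate(self, task: str) -> Container:
        for node in self.nodes:
            if node.free_slots > 0:
                node.free_slots -= 1
                return Container(node=node.name, task=task)
        raise RuntimeError("no free capacity in the cluster")


@dataclass
class AppMaster:
    """One per application: asks the Resource Manager for one container per task."""
    app_name: str
    containers: list = field(default_factory=list)

    def run(self, rm: ResourceManager, tasks: list) -> list:
        self.containers = [rm.allocate(t) for t in tasks]
        return self.containers


rm = ResourceManager(nodes=[NodeManager("node-1", 2), NodeManager("node-2", 1)])
app = AppMaster("wordcount")
placed = app.run(rm, ["map-0", "map-1", "reduce-0"])
print([(c.task, c.node) for c in placed])
```

The point of the split: the Resource Manager only decides where work may run, while the App Master decides what work its own application needs.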
What are the shortcomings of MapReduce?
What should be counted when calculating algorithmic complexity?
What is Google MapReduce?
A framework for processing LOADS of data
-> framework’s job: fault tolerance, scaling & coordination
-> programmer’s job: write the program in MapReduce form
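The division of labour can be sketched with a word count in plain Python. Assumptions: `run_mapreduce` is a hypothetical single-machine stand-in for the framework (no fault tolerance, scaling or coordination), and `map_words` / `reduce_counts` are illustrative names for the two functions the programmer would write.

```python
from collections import defaultdict


def map_words(_key, line):
    """Programmer-supplied map: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Programmer-supplied reduce: sum the counts collected for one word."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Stand-in for the framework: apply map, group by key, apply reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


lines = [(0, "to be or not to be"), (1, "to do")]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts)
```

Only the two small functions are specific to the problem; everything in `run_mapreduce` is what the real framework would take over.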
What are the 2 main components of Hadoop?
HDFS - big data storage
MapReduce - big data processing
How can you tell that the Hadoop Architecture is inspired by LISP (list processing)?
Functional programming:
* Immutable data
* Pure functions (no side effects): map, reduce
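Both ideas in a few lines of plain Python, as a minimal sketch (the data and the `square` function are illustrative):

```python
from functools import reduce

readings = (3, 8, 1, 5)  # a tuple, so the input cannot be modified in place


def square(x):
    """Pure function: the result depends only on the argument, nothing is changed."""
    return x * x


squared = tuple(map(square, readings))               # new collection; `readings` is untouched
total = reduce(lambda acc, x: acc + x, squared, 0)   # fold all values into one result

print(readings, squared, total)
```

Because `square` has no side effects, the calls could run in any order or on different machines and still give the same result.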
What is the difference between a job and a task tracker in the Hadoop Architecture?
Job tracker: in charge of managing the resources of the cluster
-> first point of contact when a client submits a job
-> one per cluster
Task tracker: does the actual processing
-> usually tied to one or more specific data nodes
What are the 3 functions in Google MapReduce? (2 primary, one optional)
How does the Map function work in Google’s MapReduce?
Map each <key, value> pair of the input list onto 0, 1, or more pairs of type <key2, value2> in the output list
-> Map to 0 elements in the output = filtering
-> Map to more than 1 element in the output = distribution
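The 0 / 1 / many cases can be sketched as follows. `map_record` and the sample records are hypothetical; the only assumption is that a blank line should be dropped and any other line split into words.

```python
def map_record(key, value):
    """Map one <key, value> pair onto zero or more <key2, value2> pairs."""
    if not value.strip():
        return []  # 0 outputs: the record is filtered out
    # one output pair per word, so a long line is distributed over several keys
    return [(word, key) for word in value.split()]


records = [(1, "big data"), (2, "   "), (3, "spark")]
mapped = [pair for key, value in records for pair in map_record(key, value)]
print(mapped)
```

Record 2 disappears (filtering), record 1 turns into two pairs (distribution), and record 3 maps to exactly one pair.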
How does the Reduce function work in Google’s MapReduce?
[summarizing]
Combine the <key, value> pairs of the input list into an aggregate output value
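A minimal sketch of the summarizing step, assuming the values have already been grouped per key. `reduce_max` and the temperature readings are made up for illustration; any aggregate (sum, count, max, …) fits the same shape.

```python
def reduce_max(key, values):
    """Reduce: collapse every value seen for one key into a single aggregate."""
    return key, max(values)


grouped = {"berlin": [21, 25, 19], "oslo": [12, 15]}  # hypothetical readings per key
summary = dict(reduce_max(city, temps) for city, temps in grouped.items())
print(summary)
```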
What does the Shuffle function do in Google’s MapReduce?
[consolidating relevant records]
* Helps the pipeline between the map and reduce steps
* Channels each partial result to the right or most appropriate reduce node
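One way to picture the routing, as a sketch: hash each key to pick a reduce node, so that every value for a key lands on the same node. The hash-modulo rule and the names `partition` and `shuffle` are assumptions for illustration, not a description of any particular implementation.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer for a key; crc32 keeps the choice stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapped_pairs, num_reducers):
    """Group values by key and route every key to exactly one reducer."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        reducers[partition(key, num_reducers)][key].append(value)
    return [dict(r) for r in reducers]


pairs = [("to", 1), ("be", 1), ("to", 1), ("do", 1), ("be", 1)]
routed = shuffle(pairs, num_reducers=2)
print(routed)
```

Each entry of `routed` is the input one reduce node would see: complete value lists for the keys it owns, and nothing for keys owned by other nodes.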
What is YARN short for?
Yet Another Resource Negotiator
Explain the Hadoop eco-system.
Hadoop is useful on its own, but its main power shows when it is combined with other technologies.
(Ex.: Pig, Hive, Kafka)
What is Apache Spark?
What are RDDs?
Resilient Distributed Datasets
-> immutable distributed collection of objects
-> fault tolerant
-> used in every Spark component
How do you create new RDDs?
-> from storage, by loading data
-> from other RDDs, by using transformations
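A pure-Python toy to make the two routes concrete. `ToyRDD` is hypothetical and is not the Spark API. It runs eagerly on one machine, whereas real RDDs are distributed and evaluated lazily. In PySpark the corresponding calls would be `SparkContext.textFile` or `SparkContext.parallelize` for loading, and `RDD.map` / `RDD.filter` for transformations.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable data plus a record of how it was built."""

    def __init__(self, items, lineage):
        self._items = tuple(items)  # stored as a tuple so it cannot be changed in place
        self.lineage = lineage      # the steps needed to rebuild this dataset

    @classmethod
    def from_storage(cls, path, read_lines):
        """Create a dataset from storage; `read_lines` is a caller-supplied reader."""
        return cls(read_lines(path), (f"load({path})",))

    def map(self, fn):
        """Transformation: returns a NEW dataset and leaves this one unchanged."""
        return ToyRDD((fn(x) for x in self._items), self.lineage + ("map",))

    def filter(self, pred):
        """Transformation: keeps only the items for which `pred` is true."""
        return ToyRDD((x for x in self._items if pred(x)), self.lineage + ("filter",))

    def collect(self):
        return list(self._items)


fake_storage = {"numbers.txt": ["1", "2", "3", "4"]}  # hypothetical in-memory "file"
raw = ToyRDD.from_storage("numbers.txt", fake_storage.__getitem__)
evens = raw.map(int).filter(lambda n: n % 2 == 0)
print(raw.collect(), evens.collect(), evens.lineage)
```

The recorded lineage hints at why immutability helps fault tolerance: a lost dataset can be rebuilt by replaying the same steps from storage.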
What are DataFrames?
A way to organize the data in named columns.
Similar to a table in a relational database
-> immutable once constructed
-> enable distributed computations
How can you construct data frames?