What is a distributed file system?
A long-term storage system that holds large amounts of data and allows multiple processes to access it
How are files stored in a DFS?
Files are split into chunks, and the chunks are stored separately. Typically, each chunk is replicated and the replicas are kept on different racks for fault tolerance
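The chunking and rack-aware replication described above can be sketched as follows. This is a minimal illustration, not how any particular DFS does it: the function names, the round-robin placement rule, and the tiny chunk size in the usage note are all assumptions made for the example.

```python
def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Split a file's bytes into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks: int, racks: dict[str, list[str]], replication: int = 3):
    """Assign each chunk to `replication` nodes, each on a different rack.

    `racks` maps a rack name to the node names it contains (hypothetical layout).
    Placement is simple round-robin; real systems use richer policies.
    """
    rack_names = sorted(racks)
    if replication > len(rack_names):
        raise ValueError("need at least as many racks as replicas")
    placement = {}
    for chunk in range(num_chunks):
        replicas = []
        for r in range(replication):
            # Consecutive rack indices are distinct, so no rack is reused per chunk.
            rack = rack_names[(chunk + r) % len(rack_names)]
            nodes = racks[rack]
            replicas.append((rack, nodes[chunk % len(nodes)]))
        placement[chunk] = replicas
    return placement
```

For example, `split_into_chunks(b"abcdefghij", 4)` yields three chunks, and `place_replicas(3, {"rack0": ["n0", "n1"], "rack1": ["n2", "n3"], "rack2": ["n4", "n5"]}, replication=2)` puts the two copies of every chunk on two different racks, so losing one rack never loses a chunk.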
What are the advantages of a DFS?
What is the cluster architecture for a DFS?
Each node is made up of a CPU, memory, and a disk
Nodes are organized into racks
Several racks are linked by a switch to provide fault tolerance between racks
Several switches are linked by a backbone switch to provide fault tolerance between switches
What are the speeds of switches in a cluster architecture?
A rack switch provides about 1 Gbps between any pair of nodes in a rack
A backbone switch provides 2-10 Gbps between racks
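To get a feel for what these link speeds mean, the short calculation below estimates how long it takes to move a dataset over a single link. It is a rough sketch: the 1 TB dataset size is an assumed example, decimal units are used (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second), and protocol overhead and contention are ignored.

```python
def transfer_seconds(num_bytes: float, link_gbps: float) -> float:
    """Ideal time to push `num_bytes` through a link of `link_gbps` gigabits per second."""
    bits = num_bytes * 8
    return bits / (link_gbps * 1e9)


one_tb = 1e12  # assumed example dataset size, in bytes
for gbps in (1, 2, 10):
    seconds = transfer_seconds(one_tb, gbps)
    print(f"{gbps} Gbps: {seconds:.0f} s ({seconds / 3600:.2f} h)")
```

At 1 Gbps the transfer takes 8000 seconds, a little over two hours, which is why copying data across the network can dominate the cost of a computation.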
What is a commodity cluster?
Low-cost distributed computers that can be assembled into a cluster architecture. They are less specialized but affordable
What are some common failures in commodity clusters?
How can we solve the issue of network bottlenecks when using commodity clusters?
What is a big data programming model?
A layer of programmability on top of a distributed file system
What are the requirements of a big data programming model?
1. Must support big data operations: fast access and distribution of computation to nodes
2. Must handle fault tolerance: replicate partitions and recover files when needed
3. Must enable scaling out by adding more racks
What is map reduce?
A big data programming model that applies an operation to every element (map) and then performs a summarizing operation on the results (reduce)
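A single-process sketch can make the two phases concrete. The code below runs word count, used here as an assumed example task, through a map step, a group-by-key step, and a reduce step. All function names are hypothetical, and the sketch leaves out everything a real cluster adds (distribution across nodes, fault tolerance, intermediate files); the explicit grouping step stands in for the shuffle that a map reduce system performs between the two phases.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def group_by_key(pairs):
    """Gather all values that share a key, as the shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Summarize each key's list of values into a single result."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def map_reduce(records, map_fn, reduce_fn):
    return reduce_phase(group_by_key(map_phase(records, map_fn)), reduce_fn)


def word_count_map(line):
    # Emit (word, 1) for every word in the line.
    return [(word.lower(), 1) for word in line.split()]


def word_count_reduce(word, counts):
    # Sum the 1s emitted for this word.
    return sum(counts)
```

Calling `map_reduce(["the cat", "The dog the"], word_count_map, word_count_reduce)` gives a count of 3 for "the" and 1 each for "cat" and "dog".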
What are the challenges of big data programming models that map reduce overcomes?
Describe how the map reduce algorithm performs the word count task
What is map reduce a bad tool for?
What are the components of a distributed file system?
What is the master node (name node)?
What is a client library?
What is a chunk server (data node)?
A node that stores file chunks, typically 16-64 MB each. Each chunk is replicated, and the replicas should be kept on different racks
What does the map reduce environment take care of?
Where is data stored during the map reduce process?
What are the possible states a task can be in?
What happens when a map task completes its work?
How are node failures detected?
The master node pings the workers periodically. A worker that does not respond within a certain amount of time is marked as failed
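Ping-based detection can be sketched as a small bookkeeping class on the master. This is an illustration under stated assumptions: a worker is treated as failed once it has not answered a ping within a timeout, the class name and timeout value are hypothetical, and timestamps are passed in explicitly so the logic is easy to follow and test.

```python
class FailureDetector:
    """Tracks the last time each worker answered a ping from the master."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s  # assumed threshold, not a standard value
        self.last_reply = {}

    def register(self, worker: str, now: float) -> None:
        # A newly registered worker is treated as having just replied.
        self.last_reply[worker] = now

    def record_reply(self, worker: str, now: float) -> None:
        # Called whenever a worker answers a ping.
        self.last_reply[worker] = now

    def failed_workers(self, now: float) -> list[str]:
        # Workers whose last reply is older than the timeout are presumed failed.
        return sorted(
            worker
            for worker, last in self.last_reply.items()
            if now - last > self.timeout_s
        )
```

With a 10 second timeout, if workers "a" and "b" register at time 0 and only "a" replies at time 8, then `failed_workers(12)` returns `["b"]`.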
How are failures of map nodes handled?