Where does data come from?
How would billions of TB-sized files be stored?
vs
How would millions of PB-sized files be stored?
Object Storage - key-value model (like in S3): a flat namespace of objects (suited to lots of files, but single files aren't very large)
vs
File Storage - File System (brings back the file hierarchy natively) + Block Storage (supports block-based storage natively)
Fault Tolerance in Clusters: How likely are local disks and clusters to fail?
How to achieve fault tolerance and robustness on clusters?
What kind of ways do we want to read/write large files in DFS?
Furthermore, this must be supported even with hundreds of concurrent users reading and writing to and from the same file system.
Random Access is very difficult in clusters
What performance requirements do we have for DFS?
The priority is throughput (how fast we can read/write in bulk), NOT latency
This is more consistent with the full-scan pattern we use
How do we solve the capacity-throughput discrepancy?
The solution for the capacity-throughput discrepancy is to parallelize
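A back-of-the-envelope sketch of why parallelizing closes the capacity-throughput gap. The disk throughput and data size below are assumed round numbers for illustration, not figures from the notes:

```python
# Assumed numbers: ~100 MB/s per disk, 1 PB of data to full-scan.
DISK_THROUGHPUT_MB_S = 100        # assumed per-disk sequential read speed
DATA_MB = 1_000_000_000           # 1 PB expressed in MB

def scan_time_hours(num_disks: int) -> float:
    """Time to full-scan the data when reads are spread over num_disks."""
    return DATA_MB / (DISK_THROUGHPUT_MB_S * num_disks) / 3600

print(f"1 disk:     {scan_time_hours(1):,.0f} h")     # ~2,778 h (months)
print(f"1000 disks: {scan_time_hours(1000):,.1f} h")  # ~2.8 h
```

A single disk would take months to scan a petabyte; spreading the same scan over a thousand disks brings it down to hours, which is the whole point of a distributed file system.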
How do we solve the throughput-latency discrepancy?
The solution for the throughput-latency discrepancy is batch processing
What 3 components is Hadoop primarily composed of?
Describe the model for distributed file systems?
Separating the logical model and physical storage
File system (logical model): In distributed file systems, we have a file hierarchy (unlike the key-value model in object storage which is flat).
Block storage (physical storage): In distributed file systems, we have block storage (unlike object storage (S3) where we have a blackbox model).
**Terminology:** HDFS: Block, GFS: Chunk. We thus have a hierarchy of files where each file is associated with a chain of blocks.
Why do we use blocks in DFS?
What is the size of a block in DFS?
64-128MB
Why is a DFS Block 64-128MB?
Due to the throughput-latency discrepancy, we want larger blocks (for small blocks, the latency would outweigh the transfer time). Further, large blocks mean fewer blocks are read per file. The block size in distributed file systems is 64MB - 128MB. This is a good compromise - not too many blocks for big files, but also small enough to have several blocks on one machine.
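The latency-vs-transfer-time trade-off can be made concrete with a small calculation. The seek latency and throughput below are assumed ballpark values, not numbers from the notes:

```python
# Assumed numbers: ~10 ms overhead per block request, ~100 MB/s transfer.
SEEK_LATENCY_MS = 10          # assumed per-block seek/request overhead
THROUGHPUT_MB_PER_S = 100     # assumed sequential read throughput

def efficiency(block_mb: float) -> float:
    """Fraction of time spent actually transferring data (vs. seeking)."""
    transfer_ms = block_mb / THROUGHPUT_MB_PER_S * 1000
    return transfer_ms / (transfer_ms + SEEK_LATENCY_MS)

for size in (0.001, 1, 128):   # 1 KB, 1 MB, 128 MB blocks
    print(f"{size:>7} MB block: {efficiency(size):.1%} useful time")
```

With tiny (KB-scale) blocks almost all time goes to latency; at 128 MB the seek overhead is amortized to under 1% of each read, which is why DFS blocks are this large.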
What architecture does HDFS use?
HDFS uses a centralised architecture: the namenode (has the names of the files) is the master and the datanodes (have the actual data) are the workers.
How is data actually stored in HDFS?
The file is divided into 128MB chunks. The chunks are then stored in datanodes. Each chunk is replicated 3 times (the # of replicas can be specified).
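A minimal sketch of this chunking-plus-replication step. The round-robin placement here is a simplification (real HDFS uses rack-aware placement), and the node names are made up:

```python
import itertools

BLOCK_MB = 128
REPLICATION = 3  # HDFS default; the number of replicas can be specified

def plan_blocks(file_size_mb: int, datanodes: list) -> list:
    """Split a file into 128 MB blocks and pick 3 datanodes per block.
    Placement is round-robin here; real HDFS is rack-aware."""
    num_blocks = -(-file_size_mb // BLOCK_MB)  # ceiling division
    rotation = itertools.cycle(datanodes)
    plan = []
    for block_id in range(num_blocks):
        plan.append((block_id, [next(rotation) for _ in range(REPLICATION)]))
    return plan

for block_id, replicas in plan_blocks(300, ["dn1", "dn2", "dn3", "dn4"]):
    print(block_id, replicas)
```

A 300 MB file yields three blocks (two full 128 MB blocks plus a 44 MB remainder), each mapped to three datanodes.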
What three activities is the NameNode responsible for?
What are the DataNodes responsible for?
Blocks of data are stored on these local disks. Datanodes are responsible for failure detection. Each datanode has its own local view over its disks - proximity to hardware facilitates disk failure detection. Each block has a block ID (64-bit). It is also possible to access blocks at a sub-block granularity to request parts of a block.
How do DataNodes and NameNodes interact?
The datanode always initiates the connection.
* Registration
* Heartbeat: datanode tells namenode every 3s that it’s still alive
* Block Report: every 6h, datanode sends full list of blocks (not the contents of the blocks) to the namenode
* Block Received
The datanode protocol handles datanode-namenode communication.
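The heartbeat side of this protocol can be sketched as liveness tracking on the namenode. The 30 s staleness threshold below is an assumption for the sketch (real HDFS derives its dead-node timeout from the heartbeat and recheck intervals):

```python
import time

HEARTBEAT_INTERVAL_S = 3          # datanodes report every 3 s
DEAD_AFTER_S = 30                 # assumed threshold for this sketch

class NameNodeView:
    """Toy namenode-side liveness tracking driven by heartbeats."""
    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, datanode_id, now=None):
        self.last_seen[datanode_id] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        now = now if now is not None else time.time()
        return [dn for dn, t in self.last_seen.items()
                if now - t > DEAD_AFTER_S]

nn = NameNodeView()
nn.heartbeat("dn1", now=0)
nn.heartbeat("dn2", now=0)
nn.heartbeat("dn1", now=25)       # dn2 goes silent
print(nn.dead_nodes(now=40))      # → ['dn2']
```

Because datanodes always initiate the connection, the namenode never probes them; it only notices silence, which is why missed heartbeats are the failure signal.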
What is the Client Protocol?
The client sends metadata operations (e.g. create directory, delete directory, write file, append to file, read file, delete file) and the namenode responds with the datanode locations and the block IDs
Can DataNodes connect to one another?
Yes! DataNodes are also capable of communicating with each other by forming replication pipelines. A pipeline happens whenever a new HDFS file is created. The client does not send a copy of the block to all the destination DataNodes, but only to the first one. This first DataNode is then responsible for creating the pipeline and propagating the block to its counterparts.
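The pipeline idea above can be sketched as: the client hands the block to the first datanode only, and each datanode forwards it downstream. This is a toy in-memory model, not the real wire protocol:

```python
def pipeline_write(block: bytes, datanodes: list) -> dict:
    """Toy replication pipeline: the client sends the block only to the
    first datanode; each datanode stores it and forwards to the next."""
    stored = {}

    def receive(node_index: int) -> None:
        node = datanodes[node_index]
        stored[node] = block           # write to local disk (simulated)
        if node_index + 1 < len(datanodes):
            receive(node_index + 1)    # forward downstream in the pipeline

    receive(0)                         # client contacts only the head node
    return stored

result = pipeline_write(b"...", ["dn1", "dn2", "dn3"])
print(sorted(result))                  # → ['dn1', 'dn2', 'dn3']
```

The payoff is bandwidth: the client uploads the block once instead of three times, and each network link in the pipeline carries the block exactly once.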
How are files read from HDFS?
How is a file written in HDFS?
All acks must be returned before moving on to the next block
How does replication physically happen in HDFS?
A cluster contains multiple racks and a rack contains multiple nodes
How do we define distance in a cluster?
Like a tree!
Two nodes in the same rack have a distance of 2 (one edge from the first node to the rack, and one from the rack to the other node).
Two nodes in different racks have a distance of 4 (two edges up to the common ancestor and two edges back down).
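The tree-distance rule above can be written out as a small function over (rack, node) pairs; the rack/node names are illustrative:

```python
def distance(node_a: tuple, node_b: tuple) -> int:
    """Tree distance between (rack, node) pairs: 0 for the same node,
    2 within a rack, 4 across racks (edges up to and down from the
    closest common ancestor)."""
    rack_a, name_a = node_a
    rack_b, name_b = node_b
    if (rack_a, name_a) == (rack_b, name_b):
        return 0
    if rack_a == rack_b:
        return 2
    return 4

print(distance(("r1", "n1"), ("r1", "n2")))  # same rack      → 2
print(distance(("r1", "n1"), ("r2", "n1")))  # different rack → 4
```

HDFS uses this distance for rack-aware replica placement: keeping one replica off-rack survives a rack failure, while keeping two on the same rack keeps write traffic cheap.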