Where does data come from?
How would billions of TB-sized files be stored?
vs
How would millions of PB-sized files be stored?
Object Storage - key-value model (like in S3): a flat namespace of objects (suited to lots of files, but single files aren't very large)
vs
File Storage - File System (brings back the file hierarchy natively) + Block Storage (supports block-based storage natively)
Fault Tolerance in Clusters: How likely are local disks and clusters to fail?
How to achieve fault tolerance and robustness on clusters?
What kind of ways do we want to read/write large files in DFS?
Furthermore, this must be supported even with hundreds of concurrent users reading and writing to and from the same file system.
Random Access is very difficult in clusters
What performance requirements do we have for DFS?
The priority is throughput (how fast we can read/write in bulk), NOT latency
This is more consistent with the full-scan pattern we use
How do we solve the capacity-throughput discrepancy?
The solution for the capacity-throughput discrepancy is to parallelize
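A back-of-the-envelope sketch of why parallelizing closes the capacity-throughput gap. The disk throughput and data size below are assumed round numbers for illustration, not figures from the notes:

```python
# Assumed numbers: ~100 MB/s per disk, 1 PB of data to full-scan.
DISK_THROUGHPUT_MB_S = 100        # assumed per-disk sequential read speed
DATA_MB = 1_000_000_000           # 1 PB expressed in MB

def scan_time_hours(num_disks: int) -> float:
    """Time to full-scan the data when reads are spread over num_disks."""
    return DATA_MB / (DISK_THROUGHPUT_MB_S * num_disks) / 3600

print(f"1 disk:     {scan_time_hours(1):,.0f} h")     # ~2,778 h (months)
print(f"1000 disks: {scan_time_hours(1000):,.1f} h")  # ~2.8 h
```

A single disk would take months to scan a petabyte; spreading the same scan over a thousand disks brings it down to hours, which is the whole point of a distributed file system.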
How do we solve the throughput-latency discrepancy?
The solution for the throughput-latency discrepancy is batch processing
What 3 components is Hadoop primarily composed of?
Describe the model for distributed file systems?
Separating the logical model and physical storage
File system (logical model): In distributed file systems, we have a file hierarchy (unlike the key-value model in object storage which is flat).
Block storage (physical storage): In distributed file systems, we have block storage (unlike object storage (S3) where we have a blackbox model).
**Terminology:** HDFS: Block, GFS: Chunk. We thus have a hierarchy of files where each file is associated with a chain of blocks.
Why do we use blocks in DFS?
What is the size of a block in DFS?
64-128MB
Why is a DFS Block 64-128MB?
Due to the throughput-latency discrepancy, we want larger blocks (for small blocks, the latency would outweigh the transfer time). Further, large blocks mean fewer blocks are read per file. The block size in distributed file systems is 64MB - 128MB. This is a good compromise - not too many blocks for big files, but also small enough to have several blocks on one machine.
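The latency-vs-transfer-time trade-off can be made concrete with a small calculation. The seek latency and throughput below are assumed ballpark values, not numbers from the notes:

```python
# Assumed numbers: ~10 ms overhead per block request, ~100 MB/s transfer.
SEEK_LATENCY_MS = 10          # assumed per-block seek/request overhead
THROUGHPUT_MB_PER_S = 100     # assumed sequential read throughput

def efficiency(block_mb: float) -> float:
    """Fraction of time spent actually transferring data (vs. seeking)."""
    transfer_ms = block_mb / THROUGHPUT_MB_PER_S * 1000
    return transfer_ms / (transfer_ms + SEEK_LATENCY_MS)

for size in (0.001, 1, 128):   # 1 KB, 1 MB, 128 MB blocks
    print(f"{size:>7} MB block: {efficiency(size):.1%} useful time")
```

With tiny (KB-scale) blocks almost all time goes to latency; at 128 MB the seek overhead is amortized to under 1% of each read, which is why DFS blocks are this large.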
What architecture does HDFS use?
HDFS uses a centralised architecture: the namenode (has the names of the files) is the master and the datanodes (have the actual data) are the workers.
How is data actually stored in HDFS?
The file is divided into 128MB chunks. The chunks are then stored in datanodes. Each chunk is replicated 3 times (the # of replicas can be specified).
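A minimal sketch of this chunking-plus-replication step. The round-robin placement here is a simplification (real HDFS uses rack-aware placement), and the node names are made up:

```python
import itertools

BLOCK_MB = 128
REPLICATION = 3  # HDFS default; the number of replicas can be specified

def plan_blocks(file_size_mb: int, datanodes: list) -> list:
    """Split a file into 128 MB blocks and pick 3 datanodes per block.
    Placement is round-robin here; real HDFS is rack-aware."""
    num_blocks = -(-file_size_mb // BLOCK_MB)  # ceiling division
    rotation = itertools.cycle(datanodes)
    plan = []
    for block_id in range(num_blocks):
        plan.append((block_id, [next(rotation) for _ in range(REPLICATION)]))
    return plan

for block_id, replicas in plan_blocks(300, ["dn1", "dn2", "dn3", "dn4"]):
    print(block_id, replicas)
```

A 300 MB file yields three blocks (two full 128 MB blocks plus a 44 MB remainder), each mapped to three datanodes.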
What three activities is the NameNode responsible for?
What are the DataNodes responsible for?
Blocks of data are stored on these local disks. Datanodes are responsible for failure detection. Each datanode has its own local view over its disks - proximity to hardware facilitates disk failure detection. Each block has a block ID (64-bit). It is also possible to access blocks at a sub-block granularity to request parts of a block.
How do DataNodes and NameNodes interact?
The datanode always initiates the connection.
* Registration
* Heartbeat: datanode tells namenode every 3s that it’s still alive
* Block Report: every 6h, datanode sends full list of blocks (not the contents of the blocks) to the namenode
* Block Received
The datanode protocol handles datanode-namenode communication.
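The heartbeat side of this protocol can be sketched as liveness tracking on the namenode. The 30 s staleness threshold below is an assumption for the sketch (real HDFS derives its dead-node timeout from the heartbeat and recheck intervals):

```python
import time

HEARTBEAT_INTERVAL_S = 3          # datanodes report every 3 s
DEAD_AFTER_S = 30                 # assumed threshold for this sketch

class NameNodeView:
    """Toy namenode-side liveness tracking driven by heartbeats."""
    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, datanode_id, now=None):
        self.last_seen[datanode_id] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        now = now if now is not None else time.time()
        return [dn for dn, t in self.last_seen.items()
                if now - t > DEAD_AFTER_S]

nn = NameNodeView()
nn.heartbeat("dn1", now=0)
nn.heartbeat("dn2", now=0)
nn.heartbeat("dn1", now=25)       # dn2 goes silent
print(nn.dead_nodes(now=40))      # → ['dn2']
```

Because datanodes always initiate the connection, the namenode never probes them; it only notices silence, which is why missed heartbeats are the failure signal.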
What is the Client Protocol?
The client sends metadata operations (e.g. create directory, delete directory, write file, append to file, read file, delete file) and the namenode responds with the datanode locations and the block IDs
Can DataNodes connect to one another?
Yes! DataNodes are also capable of communicating with each other by forming replication pipelines. A pipeline happens whenever a new HDFS file is created. The client does not send a copy of the block to all the destination DataNodes, but only to the first one. This first DataNode is then responsible for creating the pipeline and propagating the block to its counterparts.
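The pipeline idea above can be sketched as: the client hands the block to the first datanode only, and each datanode forwards it downstream. This is a toy in-memory model, not the real wire protocol:

```python
def pipeline_write(block: bytes, datanodes: list) -> dict:
    """Toy replication pipeline: the client sends the block only to the
    first datanode; each datanode stores it and forwards to the next."""
    stored = {}

    def receive(node_index: int) -> None:
        node = datanodes[node_index]
        stored[node] = block           # write to local disk (simulated)
        if node_index + 1 < len(datanodes):
            receive(node_index + 1)    # forward downstream in the pipeline

    receive(0)                         # client contacts only the head node
    return stored

result = pipeline_write(b"...", ["dn1", "dn2", "dn3"])
print(sorted(result))                  # → ['dn1', 'dn2', 'dn3']
```

The payoff is bandwidth: the client uploads the block once instead of three times, and each network link in the pipeline carries the block exactly once.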
How are files read from HDFS?
How is a file written in HDFS?
All acks must be returned before moving on to the next block
How does replication physically happen in HDFS?
A cluster contains multiple racks and a rack contains multiple nodes
How do we define distance in a cluster?
Like a tree!
Two nodes in the same rack have a distance of 2 (one edge from the first node to the rack, and one from the rack to the other node).
Two nodes in different racks have a distance of 4 (two edges up to the common ancestor and two edges back down).
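The tree-distance rule above can be written out as a small function over (rack, node) pairs; the rack/node names are illustrative:

```python
def distance(node_a: tuple, node_b: tuple) -> int:
    """Tree distance between (rack, node) pairs: 0 for the same node,
    2 within a rack, 4 across racks (edges up to and down from the
    closest common ancestor)."""
    rack_a, name_a = node_a
    rack_b, name_b = node_b
    if (rack_a, name_a) == (rack_b, name_b):
        return 0
    if rack_a == rack_b:
        return 2
    return 4

print(distance(("r1", "n1"), ("r1", "n2")))  # same rack      → 2
print(distance(("r1", "n1"), ("r2", "n1")))  # different rack → 4
```

HDFS uses this distance for rack-aware replica placement: keeping one replica off-rack survives a rack failure, while keeping two on the same rack keeps write traffic cheap.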