Distributed File System
An abstraction that makes data stored across multiple machines appear as a unified storage system
Goals of DFS
Hadoop Distributed File System (4)
Blocks
Smallest unit of storage that can be read or written
- default size: 64 MB (Hadoop 1), 128 MB (Hadoop 2+)
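A minimal sketch of how a file maps onto fixed-size blocks (illustrative code, not the HDFS implementation; the function name and return shape are assumptions):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2+ default

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, offset, length) for each block of a file."""
    blocks = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be short
        blocks.append((index, offset, length))
        offset += length
        index += 1
    return blocks

# A 300 MB file occupies three blocks; the last holds only the remaining 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Note that the last block only occupies as much space as the data it holds, which is why large block sizes don't waste space on small tails.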
How does HDFS ensure data consistency?
Write-once, Read-many model
Hadoop default Fault-tolerance
Blocks replicated with a factor of three so data remains accessible if one machine fails (hardware failure = NORM!)
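A toy sketch of placing three replicas of each block on distinct datanodes (an assumption for illustration; real HDFS placement is rack-aware, not round-robin):

```python
import itertools

def place_replicas(block_ids, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes, round-robin.

    Assumes len(datanodes) >= replication so consecutive picks are distinct.
    """
    placement = {}
    ring = itertools.cycle(datanodes)
    for block in block_ids:
        placement[block] = [next(ring) for _ in range(replication)]
    return placement

placement = place_replicas(["blk_1", "blk_2"], ["dn1", "dn2", "dn3", "dn4"])
```

With four datanodes, losing any single machine still leaves two live replicas of every block.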
Namenode (NN) (2)
Master in master-slave architecture
1. Store metadata about the location of specific blocks
2. Control client access to data
Datanodes (DN)
The slaves: they store and process the actual data, and send periodic “heartbeats” to update the master
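A sketch of how the master can use heartbeats to decide which datanodes are live (illustrative; the ~10.5-minute default comes from HDFS's heartbeat recheck settings, and the function name is an assumption):

```python
def live_datanodes(last_heartbeat, now, timeout=10.5 * 60):
    """Return datanodes whose last heartbeat arrived within the timeout.

    last_heartbeat: dict mapping datanode id -> timestamp (seconds).
    HDFS marks a datanode dead after roughly 10.5 minutes by default.
    """
    return {dn for dn, t in last_heartbeat.items() if now - t <= timeout}

# dn2 has been silent for 700 s (> 630 s timeout), so only dn1 is live.
live = live_datanodes({"dn1": 650, "dn2": 0}, now=700)
```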
The big issue with Master-Slave architecture
Single Point of Failure (SPOF): there is only one namenode maintaining the filesystem tree
High Availability and how it’s achieved with Master-Slave Architecture
A system that keeps operating despite faults
Two separate machines as NNs:
- 1 Active State
- 1 Standby State
Besides HA, another reason it’s beneficial to configure additional name nodes
The number of blocks in the system is limited by the RAM of the NN, since it stores metadata about every block in memory
Erasure Coding
A way to store less redundant data by splitting it into smaller data cells (“stripes”), with parity cells as backup pieces to help recover data.
If you lose some data cells, you can still rebuild them thanks to the parity cells. (Data cells are, of course, split across nodes.)
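A minimal sketch of parity-based recovery using byte-wise XOR, which tolerates the loss of one cell (RAID-5 style; real HDFS erasure coding uses Reed-Solomon, e.g. RS(6,3), which tolerates multiple losses):

```python
from functools import reduce

def xor_parity(cells):
    """Compute one parity cell as the byte-wise XOR of equal-length cells."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*cells))

def recover_cell(surviving_cells, parity):
    """Rebuild the single missing data cell: XOR the survivors with parity."""
    return xor_parity(surviving_cells + [parity])

data = [b"abc", b"def", b"ghi"]          # three data cells
parity = xor_parity(data)                # one parity cell
rebuilt = recover_cell([data[0], data[2]], parity)  # lost data[1], rebuilt it
```

One parity cell per three data cells is ~1.33x storage overhead, versus 3x for triple replication, which is the storage saving the card describes.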
Pros of erasure coding (2)
Cons of Erasure Coding (3)
For which types of datasets does erasure coding work best?
Those with low I/O activity (cold data, not HOT or interactive)
HDFS cons in general (2)
Standard data formats (3)
3 benefits of big data specific file formats (BCS)
Row-oriented file format
Stores data in rows, so it’s easy to read or write a full record of fields.
Good application for Row oriented format
OLTP
Updating a customer’s profile, adding a new order (needs access to all fields of the record)
Column-oriented file format
Stores data by column, which makes compression easier because values of the same type are stored together (type-specific encodings).
Easier to access specific fields of the data for aggregation (reduced I/O for analytical queries)
Good application for column oriented format
OLAP
Average salary of employees by department (get all instances for one field with one query!)
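The two layouts can be sketched side by side (illustrative Python; field names and values are made up):

```python
records = [
    {"name": "Ada", "dept": "Eng", "salary": 120},
    {"name": "Bo",  "dept": "Eng", "salary": 100},
    {"name": "Cy",  "dept": "Ops", "salary": 90},
]

# Row-oriented: one tuple per record; fetching a full record is one read (OLTP).
row_store = [tuple(r.values()) for r in records]

# Column-oriented: one list per field; scanning a single field skips the rest (OLAP).
col_store = {field: [r[field] for r in records] for field in records[0]}

# The OLAP query above touches only the "salary" column, not every record.
avg_salary = sum(col_store["salary"]) / len(col_store["salary"])
```

Note that each salary list entry sits next to other integers of the same type, which is what enables the type-specific encodings mentioned on the column-oriented card.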
Row-oriented file formats examples (3)
SequenceFiles (key-value pairs, designed for MapReduce)
Apache Thrift (good communication between programs)
Apache Avro (schema evolution)
Column-oriented file formats examples (2)
ORC (first columnar format on Hadoop, designed for Hive - SQL on Hadoop)
Apache Parquet (schema evolution)