Distributed File System
An abstraction that makes data stored across multiple machines appear as a unified storage system
Goals of DFS
Hadoop Distributed File System (4)
Blocks
Smallest unit of storage that can be read or written
- default size: 64 MB (Hadoop 1), 128 MB (Hadoop 2+)
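A minimal sketch of how a file maps onto fixed-size blocks (illustrative code, not the HDFS implementation; the function name and return shape are assumptions):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2+ default

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, offset, length) for each block of a file."""
    blocks = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be short
        blocks.append((index, offset, length))
        offset += length
        index += 1
    return blocks

# A 300 MB file occupies three blocks; the last holds only the remaining 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Note that the last block only occupies as much space as the data it holds, which is why large block sizes don't waste space on small tails.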
How does HDFS ensure data consistency?
Write-once, Read-many model
Hadoop default Fault-tolerance
Blocks replicated with a factor of three so data remains accessible if one machine fails (hardware failure = NORM!)
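A toy sketch of placing three replicas of each block on distinct datanodes (an assumption for illustration; real HDFS placement is rack-aware, not round-robin):

```python
import itertools

def place_replicas(block_ids, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes, round-robin.

    Assumes len(datanodes) >= replication so consecutive picks are distinct.
    """
    placement = {}
    ring = itertools.cycle(datanodes)
    for block in block_ids:
        placement[block] = [next(ring) for _ in range(replication)]
    return placement

placement = place_replicas(["blk_1", "blk_2"], ["dn1", "dn2", "dn3", "dn4"])
```

With four datanodes, losing any single machine still leaves two live replicas of every block.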
Namenode (NN) (2)
Master in master-slave architecture
1. Store metadata about the location of specific blocks
2. Control client access to data
Datanodes (DN)
The slaves: they store and process the actual data, and send periodic “heartbeats” to update the master
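A sketch of how the master can use heartbeats to decide which datanodes are live (illustrative; the ~10.5-minute default comes from HDFS's heartbeat recheck settings, and the function name is an assumption):

```python
def live_datanodes(last_heartbeat, now, timeout=10.5 * 60):
    """Return datanodes whose last heartbeat arrived within the timeout.

    last_heartbeat: dict mapping datanode id -> timestamp (seconds).
    HDFS marks a datanode dead after roughly 10.5 minutes by default.
    """
    return {dn for dn, t in last_heartbeat.items() if now - t <= timeout}

# dn2 has been silent for 700 s (> 630 s timeout), so only dn1 is live.
live = live_datanodes({"dn1": 650, "dn2": 0}, now=700)
```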
The big issue with Master-Slave architecture
Single Point of Failure (SPOF): there is only one namenode maintaining the filesystem tree
High Availability and how it’s achieved with Master-Slave Architecture
A system that keeps operating despite faults
Two separate machines as NNs:
- 1 Active State
- 1 Standby State
Besides HA, another reason it’s beneficial to configure additional name nodes
The number of blocks in the system is limited by the RAM of the NN, since it stores metadata about every block in memory
Erasure Coding
A way to store less redundant data by splitting it into smaller data cells (“stripes”), with parity cells as backup pieces to help recover data.
If you lose some data cells, you can still rebuild them thanks to the parity cells. (Data cells are, of course, split across nodes.)
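A minimal sketch of parity-based recovery using byte-wise XOR, which tolerates the loss of one cell (RAID-5 style; real HDFS erasure coding uses Reed-Solomon, e.g. RS(6,3), which tolerates multiple losses):

```python
from functools import reduce

def xor_parity(cells):
    """Compute one parity cell as the byte-wise XOR of equal-length cells."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*cells))

def recover_cell(surviving_cells, parity):
    """Rebuild the single missing data cell: XOR the survivors with parity."""
    return xor_parity(surviving_cells + [parity])

data = [b"abc", b"def", b"ghi"]          # three data cells
parity = xor_parity(data)                # one parity cell
rebuilt = recover_cell([data[0], data[2]], parity)  # lost data[1], rebuilt it
```

One parity cell per three data cells is ~1.33x storage overhead, versus 3x for triple replication, which is the storage saving the card describes.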
Pros of erasure coding (2)
Cons of Erasure Coding (3)
For which types of datasets does erasure coding work best?
Those with low I/O activity (cold data, not HOT or interactive)
HDFS cons in general (2)
Standard data formats (3)
3 benefits of big data specific file formats (BCS)
Row-oriented file format
Stores data in rows, so it’s easy to read or write a full record of fields.
Good application for Row oriented format
OLTP
Updating a customer’s profile, adding a new order (needs access to all fields of the record)
Column-oriented file format
Stores data by column, which makes compression easier because values of the same type are stored together (type-specific encodings).
Easier to access specific fields of the data for aggregation (reduced I/O for analytical queries)
Good application for column oriented format
OLAP
Average salary of employees by department (get all instances for one field with one query!)
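The two layouts can be sketched side by side (illustrative Python; field names and values are made up):

```python
records = [
    {"name": "Ada", "dept": "Eng", "salary": 120},
    {"name": "Bo",  "dept": "Eng", "salary": 100},
    {"name": "Cy",  "dept": "Ops", "salary": 90},
]

# Row-oriented: one tuple per record; fetching a full record is one read (OLTP).
row_store = [tuple(r.values()) for r in records]

# Column-oriented: one list per field; scanning a single field skips the rest (OLAP).
col_store = {field: [r[field] for r in records] for field in records[0]}

# The OLAP query above touches only the "salary" column, not every record.
avg_salary = sum(col_store["salary"]) / len(col_store["salary"])
```

Note that each salary list entry sits next to other integers of the same type, which is what enables the type-specific encodings mentioned on the column-oriented card.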
Row-oriented file formats examples (3)
SequenceFiles (key-value pairs, designed for MapReduce)
Apache Thrift (good communication between programs)
Apache Avro (schema evolution)
Column-oriented file formats examples (2)
ORC (first columnar format on Hadoop, designed for Hive - SQL on Hadoop)
Apache Parquet (schema evolution)