What is Big Data?
Big Data is the field that deals with analyzing complex and very large sets of data (up to thousands of exabytes) that cannot be handled by conventional software.
For example, Netflix gathers user behaviour data from its more than 100 million customers, which it analyzes to recommend movies and shows.
Big Data categorizes data into 3 types:
- Structured: data arranged in a fixed, relational row-column format, such as data in databases or Excel sheets, data gathered from medical devices, GPS etc.
- Unstructured: data with no predefined format, such as text documents, images, audio and video.
- Semi-structured: data that does not fit a rigid schema but carries organizational markers, such as JSON or XML files.
What is Hadoop?
It is an open source software platform for distributed storage and processing of very large data sets on computer clusters built from commodity hardware.
It is basically a solution framework to process and analyze Big Data and is mainly written in Java. It is not just one project, but a set of projects (ecosystem).
For fault tolerance, HDFS replicates each block of data; with the default replication factor of 3, a block is stored on three nodes: two on one rack and one on a different rack.
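The placement rule above can be sketched in a few lines of Python. This is a toy model with hypothetical rack and node names, not Hadoop's actual implementation:

```python
import random

def place_replicas(racks, local_rack):
    """Toy sketch of HDFS-style block placement with replication 3:
    one replica on the writer's rack, two together on a different rack."""
    remote_rack = random.choice([r for r in racks if r != local_rack])
    first = random.choice(racks[local_rack])
    second, third = random.sample(racks[remote_rack], 2)
    return [first, second, third]

# racks maps rack id -> list of DataNode names (hypothetical layout)
racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"], "rack3": ["dn5", "dn6"]}
print(place_replicas(racks, "rack1"))
```

Spreading replicas across racks means the data survives not just a single disk failure but the loss of an entire rack.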
Hadoop is made up of two parts: MapReduce and Hadoop Distributed File System (HDFS). MapReduce handles processing and HDFS handles storage of data.
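The MapReduce model can be illustrated with the classic word-count example. The small Python sketch below simulates the three phases in a single process (map emits (word, 1) pairs, shuffle groups them by word, reduce sums the counts); real Hadoop distributes these phases across the cluster:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big cluster", "data cluster data"]
pairs = [p for d in docs for p in map_phase(d)]
print(reduce_phase(shuffle(pairs)))  # {'big': 2, 'data': 3, 'cluster': 2}
```

Because each document can be mapped independently and each word reduced independently, the work parallelizes naturally across machines.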
What are the four Vs of Big Data?
- Volume: deals with the scale of data.
- Velocity: deals with the analysis of streaming data.
- Veracity: deals with the degree of accuracy of a data set.
- Variety: deals with different forms of data: medical data (patient information, x-rays etc.), data from social media (posts, images, videos), data generated by GPS etc.
What is the need for Big Data solutions like Hadoop?
With traditional databases, we would scale vertically (adding more power to a single machine) to store the data, but this brings problems such as higher disk seek times, hardware failures and longer processing times.
Also, traditional databases store relational/structured data, whereas Big Data largely consists of unstructured data.
As Hadoop supports storage and processing of huge volumes of unstructured data with horizontal scaling, it is a good solution.
Explain the two parts of Hadoop.
Hadoop consists of two parts: MapReduce and Hadoop Distributed File System (HDFS). These two exist on every machine on which we are storing the data.
MapReduce is the processing part of Hadoop.
HDFS stores all the data in files and directories, scaling out to many petabytes. Users interact with HDFS through shell commands.
The MapReduce server on each machine is called the TaskTracker and is responsible for launching tasks on that machine. The HDFS server on each machine is called the DataNode; it stores blocks of data and provides access to them.
A TaskTracker and a DataNode together make up a single worker machine. To build a cluster, we replicate this TaskTracker-plus-DataNode pattern on several machines, each adding to our storage and processing capacity. A JobTracker and a NameNode co-ordinate all the TaskTrackers and DataNodes in the cluster.
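As a rough illustration, the cluster layout described above can be modelled as plain data (all host names here are hypothetical):

```python
# Every worker machine runs both a TaskTracker (compute) and a DataNode
# (storage); a single JobTracker and NameNode co-ordinate the whole cluster.
cluster = {
    "masters": {"job_tracker": "jt-host", "name_node": "nn-host"},
    "workers": [
        {"host": f"worker-{i}", "task_tracker": True, "data_node": True}
        for i in range(1, 4)
    ],
}
print(len(cluster["workers"]))  # 3
```

Adding capacity means appending another worker entry, which is exactly what horizontal scaling looks like in practice.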
Explain JobTracker.
MapReduce needs a co-ordinator for all the TaskTrackers running on multiple machines.
This co-ordinator is called the JobTracker and is responsible for accepting users' jobs, dividing them into tasks and assigning each task to an individual TaskTracker. The TaskTrackers then run the tasks and report their status to the JobTracker.
The JobTracker is also responsible for noticing if a TaskTracker disappears because of software/hardware failure. It then automatically reassigns that TaskTracker's tasks to another TaskTracker.
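This failure-handling behaviour can be sketched as a toy Python function (hypothetical tracker and task names; the real JobTracker logic is far more involved, relying on periodic heartbeats):

```python
def reassign_tasks(assignments, live_trackers):
    """Toy sketch: move tasks whose TaskTracker stopped responding
    onto the remaining live trackers, round-robin."""
    healthy = [t for t in assignments if t in live_trackers]
    result = {t: list(tasks) for t, tasks in assignments.items() if t in live_trackers}
    i = 0
    for tracker, tasks in assignments.items():
        if tracker in live_trackers:
            continue
        for task in tasks:  # redistribute the dead tracker's tasks
            result[healthy[i % len(healthy)]].append(task)
            i += 1
    return result

assignments = {"tt1": ["map-0"], "tt2": ["map-1", "reduce-0"], "tt3": ["map-2"]}
print(reassign_tasks(assignments, live_trackers={"tt1", "tt3"}))
# {'tt1': ['map-0', 'map-1'], 'tt3': ['map-2', 'reduce-0']}
```

Here tt2 has failed, so its two tasks are spread over the surviving trackers; the job completes without any user intervention.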
Explain NameNode.
HDFS needs a co-ordinator for all the DataNodes on multiple machines.
This co-ordinator is called NameNode and is responsible for keeping the location information of stored data.
When a client writes to HDFS, it talks to the NameNode, is told which DataNodes to store the data on, and then writes the data directly to those DataNodes.
When a client reads from HDFS, it talks to the NameNode, is told where the data is stored, and then reads directly from those DataNodes.
The actual data never flows through the NameNode, only the information about where the data is located.
Similar to the JobTracker, the NameNode is responsible for noticing when a DataNode has disappeared and automatically re-replicating the data it held.
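The read/write paths above can be illustrated with a toy NameNode in Python (hypothetical names; this models only the metadata lookup, and simplifies by tracking whole files rather than blocks):

```python
class NameNode:
    """Toy sketch: the NameNode only tracks *where* data lives;
    the bytes themselves flow between the client and the DataNodes."""

    def __init__(self):
        self.locations = {}  # filename -> list of DataNode names

    def allocate(self, filename, datanodes):
        """Write path: record placement; client then writes to these nodes."""
        self.locations[filename] = datanodes
        return datanodes

    def locate(self, filename):
        """Read path: client asks here, then reads from the nodes directly."""
        return self.locations[filename]

nn = NameNode()
nn.allocate("/logs/part-0", ["dn1", "dn2", "dn5"])
print(nn.locate("/logs/part-0"))  # ['dn1', 'dn2', 'dn5']
```

Keeping only small metadata on the NameNode is what lets a single co-ordinator serve a cluster holding petabytes of data.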
What are the key characteristics of Hadoop?
It is open source, runs on clusters of commodity hardware, scales horizontally by adding machines, and is fault tolerant: data is replicated across nodes and failed tasks are automatically reassigned.
Explain Hadoop ecosystem.
Hadoop is not a single project, but a set of multiple projects (an ecosystem) such as:
- HDFS: distributed storage
- MapReduce: distributed processing
- YARN: resource management
- Hive: SQL-like querying
- Pig: high-level data-flow scripting
- HBase: NoSQL database on top of HDFS
- Sqoop: data transfer between Hadoop and relational databases
- ZooKeeper: co-ordination service
Explain YARN.
Yet Another Resource Negotiator (YARN) is a Hadoop ecosystem module introduced in Hadoop 2.0. It provides resource management and is responsible for monitoring and managing workloads across the cluster. It is also called the operating system of Hadoop.
There are 4 components of YARN:
- ResourceManager: the master daemon that arbitrates resources among all applications in the cluster.
- NodeManager: the per-machine agent that launches and monitors containers.
- ApplicationMaster: a per-application process that negotiates resources from the ResourceManager and works with the NodeManagers to run tasks.
- Container: a bundle of resources (CPU, memory) on a single node in which a task runs.