What is the NameNode
The term in the HDFS layer of the Hadoop framework for the metadata server on the master node. The NameNode runs on the master node, which also typically hosts the ResourceManager / JobTracker daemon.
Works alongside YARN to track which DataNodes (slave nodes) have available resources in HDFS
What are the 5 pillars of Hadoop?
1) Data Management
2) Data Access
3) Data Governance and Integration
4) Security
5) Operations
When should you not use the Hadoop framework?
❑ Low-latency data access: quick access to small parts of data.
❑ Multiple data modifications: Hadoop is a better fit only if we are primarily concerned with reading data, not writing it.
❑ Lots of small files: Hadoop is a better fit in scenarios where we have a few large files.
Where is job tracker stored
on the Namenode
What does the Master node hold
NameNode (HDFS) and ResourceManager (Map-Reduce)
where is Yarn located
Yet another resource negotiator (YARN) is located on the name-node
What are the largest challenges (per the powerpoint) facing the big data space?
❑ Lack of skilled staff
❑ Data governance issues – With so much data available, it becomes even more critical to have a framework in place for deciding what data belongs in the system. However, just 30% of the companies surveyed by TDWI said that data governance teams were heavily involved in Big Data management.
❑ Organizational readiness – As with business intelligence, successfully analyzing Big Data takes more than just installing software and other tools. The entire organization needs to be on the same page, and there must be a clearly articulated strategy built around actual business goals.
What are the 7 Hadoop file formats?
What is YARN?
A framework for job scheduling and cluster
resource management. It is the resource-management layer of
Hadoop.
What is MapReduce? Is it the storage or processing layer of Hadoop?
A YARN-based system for parallel processing of large data sets. It is the data processing layer of Hadoop.
What is the Hadoop HDFS get syntax?
hdfs dfs -get [-crc] {source} {local destination}
❑ Hadoop HDFS get Command Description
This HDFS fs command copies the file or directory in HDFS identified by the source to the local file system path identified by the local destination. If the source is a pattern, every matching file in HDFS is copied to the local destination. (Concatenating the matching files into one single merged local file is done by the related getmerge command.)
❑ Hadoop HDFS get Command Example:
hdfs dfs -get /user/dataflair/dir2/sample /home/dataflair/Desktop
What are the read/write file commands in Hadoop?
hdfs dfs -text {file_name} # prints the file (decompressing if needed) as text
hdfs dfs -cat /hadoop/test # cat command: print file contents
hdfs dfs -appendToFile {local source} {destination} # appends a local file to an HDFS file
How do you copy a file between the local file system and HDFS?
hdfs dfs -copyFromLocal {local source} {HDFS destination} # local -> HDFS
hdfs dfs -put {local source} {HDFS destination} # local -> HDFS
hdfs dfs -get {HDFS source} {local destination} # HDFS -> local
hdfs dfs -copyToLocal {HDFS source} {local destination} # HDFS -> local
Create a directory in a specified HDFS location. This command does not fail even if the directory already exists.
hdfs dfs -mkdir -p {destination e.g: /hadoop2}
What are the three stages of MapReduce? what order do they go in?
A MapReduce program executes in three stages, in order: the map stage, the shuffle stage, and the reduce stage.
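The three stages can be sketched locally with a plain shell pipeline (a simplified word-count sketch, not a real cluster job; the sample input and the use of sort/awk to stand in for Hadoop's shuffle and reduce are illustrative assumptions):

```shell
# Word count via the three MapReduce stages, simulated locally.
# Map stage: split each line into words and emit <word, 1> pairs.
# Shuffle stage: `sort` brings identical keys together, as Hadoop's
#   shuffle does between mappers and reducers.
# Reduce stage: sum the counts for each word.
printf 'hello world\nhello hadoop\n' \
  | tr -s ' ' '\n' \
  | awk '{print $0 "\t1"}' \
  | sort \
  | awk -F '\t' '{count[$1] += $2} END {for (w in count) print w, count[w]}' \
  | sort
# -> hadoop 1
#    hello 2
#    world 1
```

This is essentially the Hadoop Streaming model: any program that reads lines on stdin and writes key/value lines on stdout can act as a mapper or reducer, with the framework performing the shuffle in between.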