Midterm 1 Flashcards

(24 cards)

1
Q

Ensuring the authenticity of online product reviews that may contain typos and sarcastic comments is a challenge of which Big Data aspect: volume, velocity, veracity, or variety?

A

Veracity

2
Q

A sports analytics platform processing data from video feeds, player statistics, and sensor-equipped equipment exemplifies which Big Data characteristic: volume, velocity, veracity, or variety?

A

Variety

3
Q

In a distributed system, what is the benefit of shipping computation to data?

A

Reducing network congestion and latency: the program shipped to the data is far smaller than the data it would otherwise have to pull across the network.

4
Q

True or False: In HDFS, the information about the replication factor of a file is stored in the Namenode.

A

True

5
Q

Data transfer of replicas is pipelined. What does this mean?

A

The process writes data to the first data node, which transfers a copy to the next node, which transfers a copy to the next one, until the replication factor is met.
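A minimal Python sketch of this pipeline (not part of the original card; the node names and 3-replica chain are illustrative assumptions, and real HDFS forwards packet by packet rather than whole blocks):

```python
def pipelined_write(block, nodes, replication_factor=3):
    """Simulate HDFS pipelined replication: the first node stores the
    block and forwards a copy to the next node in the pipeline, which
    does the same, until replication_factor replicas exist."""
    pipeline = nodes[:replication_factor]
    stored = {}
    for node in pipeline:
        stored[node] = block  # this node persists its replica, then forwards
    return stored

replicas = pipelined_write("block_0001", ["dn1", "dn2", "dn3", "dn4"])
print(sorted(replicas))  # ['dn1', 'dn2', 'dn3']
```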

6
Q

Which class in the HDFS API is used to store a path to a file or directory?

A

Path

7
Q

What type of data is not stored by the Namenode in HDFS?

A

The actual data blocks

8
Q

Given an HDFS cluster with 16 data nodes, a replication factor of 3, and disk storage capacity of 15TB on each node. This HDFS cluster has a capacity of ____ TB.

A

80. (16 data nodes × 15 TB = 240 TB of raw disk; with replication factor 3, the usable capacity is 240 / 3 = 80 TB.)
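The calculation, sketched in Python (not part of the original card):

```python
def hdfs_capacity_tb(num_nodes, disk_tb_per_node, replication_factor):
    """Usable HDFS capacity: raw disk divided by the replication factor,
    since every block is stored replication_factor times."""
    return num_nodes * disk_tb_per_node / replication_factor

print(hdfs_capacity_tb(16, 15, 3))  # 80.0
```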

9
Q

With 12 machines, each having 1 TB of disk space, and a replication factor of 3, what is the total HDFS storage capacity?

A

4 TB. (12 × 1 TB = 12 TB of raw disk; 12 / 3 = 4 TB usable.)

10
Q

Consider a cluster of one NameNode and 10 DataNodes. The replication factor is 3 and the block size is 256MB. The NameNode is writing a file of size 5GB. What is the expected number of block replicas of that file stored on the NameNode?

A

0. The NameNode stores only metadata (file names, block locations, replication factor); block replicas are stored only on DataNodes.

11
Q

Consider an HDFS with block size = 256MB and a MapReduce job for word counting (counting the number of words in a given input).

There is a document with size 2GB. How many mappers will be created by the word counting MapReduce job?

Note: 1GB = 1024MB

A

8. (2 GB = 2,048 MB; 2,048 / 256 = 8 blocks, and one mapper is created per block.)
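The arithmetic, sketched in Python (not part of the original card; ceiling division is used because a final partial block still gets its own mapper):

```python
import math

def num_mappers(file_size_mb, block_size_mb):
    """One map task per HDFS block (input split), including a
    partially filled last block."""
    return math.ceil(file_size_mb / block_size_mb)

print(num_mappers(2 * 1024, 256))  # 8
```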

12
Q

Consider an HDFS cluster with one NameNode and 9 DataNodes. The default HDFS block size is 128 MB and the replication factor is two. The total storage capacity of each DataNode is 10 TB.

A

The total raw capacity of HDFS storage is 9 × 10 TB = 90 TB (with replication factor 2, this holds 45 TB of unique data).

Data Node #1 is writing a file of size 16 GB. The network traffic incurred across the cluster while writing this file is 16 GB: the first replica of each block is written locally on Data Node #1, so only the second replica crosses the network.

The total number of block replicas created in HDFS as a result of writing this 16GB file is 256: 16 GB = 16,384 MB; 16,384 / 128 = 128 blocks; 128 blocks × 2 replicas = 256.

The expected number of block replicas stored on Data Node #2 is 16: the 128 remote replicas are spread evenly across the other 8 DataNodes, so 128 / 8 = 16.
Note: 1GB = 1,024 MB and 1TB = 1,024GB
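These calculations, sketched in Python (not part of the original card; a simplified model that assumes remote replicas spread evenly over the other DataNodes):

```python
def write_stats(file_gb, block_mb, replication, num_datanodes,
                writer_is_datanode=True):
    """Block and traffic accounting for one HDFS file write."""
    file_mb = file_gb * 1024
    blocks = file_mb // block_mb              # 16 GB / 128 MB = 128 blocks
    replicas = blocks * replication           # each block stored `replication` times
    # If the writer is a DataNode, the first replica is local: only the
    # remaining (replication - 1) copies of each block cross the network.
    network_copies = replication - 1 if writer_is_datanode else replication
    traffic_gb = file_gb * network_copies
    # Remote replicas spread evenly over the other DataNodes.
    per_other_node = blocks * network_copies / (num_datanodes - 1)
    return blocks, replicas, traffic_gb, per_other_node

print(write_stats(16, 128, 2, 9))  # (128, 256, 16, 16.0)
```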

13
Q

True/False: The following function is valid to use in big-data programs.

function(int x) {
    return x + currentTimeInMillis;
}

A

False. It is non-deterministic: the result depends on the current time, not only on the input x.

14
Q

Is the following function valid to use in big-data programs?

current_max = 0
def update_max(number):
    current_max = max(current_max, number)
    return current_max

A

No, it is stateful and the result depends on external state.
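A stateless alternative (not from the card): in the MapReduce model, a running maximum is expressed as a pure reduction over the values rather than a mutation of global state:

```python
from functools import reduce

def pure_max(values):
    """Pure function: the output depends only on the input, so it can
    run on any worker, in any order, any number of times."""
    return reduce(lambda a, b: a if a >= b else b, values)

print(pure_max([3, 7, 2]))  # 7
```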

15
Q

True or False: Spark can cache intermediate results in memory, which is not possible in Hadoop MapReduce

A

True

16
Q

What is the role of the reduce function in MapReduce?

A

To combine intermediate values associated with the same key.

17
Q

Assessing the trustworthiness and accuracy of moderate-size user-generated content stored in a structured format on a public forum deals with which aspect of Big Data?

A

Veracity (the trustworthiness and accuracy of the data)

18
Q

When is the combine function useful in a MapReduce job?

A

When you want to reduce the amount of data transferred between the map and reduce phases.
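A small Python sketch (not part of the original card) of what a combiner buys: pre-aggregating map output on each node shrinks what must be shuffled to the reducers:

```python
from collections import Counter

# Map output on one node before the shuffle: one (word, 1) pair per word.
map_output = [("apple", 1), ("apple", 1), ("lime", 1), ("apple", 1)]

# Combiner: pre-sum values per key locally, so fewer records cross the network.
combined = Counter()
for word, count in map_output:
    combined[word] += count

print(len(map_output), len(combined))  # 4 records shrink to 2
print(dict(combined))                  # {'apple': 3, 'lime': 1}
```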

19
Q

True or False: The combine function in MapReduce is run after the reduce phase.

A

False. The combiner runs on map output, between the map and reduce phases.

20
Q

True/False The following map and reduce functions are compatible to use with Hadoop.

Map(String document, int count) {
    for each String word w in document:
        return (w, 1);
}

Reduce(String word, int[] values) {
    return (word, sum(values))
}

21
Q

True/False The following map and combine functions are compatible to use with Hadoop.

Map(String user, String pageView) {
    return (user, length(pageView));
}

Combine(String key, int[] values) {
    return (key, sum(values));
}

22
Q

Consider the following MapReduce program and input file. Let’s call the first column “fruit” and the second column “frequency”.

Note: The String#split function splits a string around the given separator and returns an array of values. The array is zero-based.

Input File
Apple,1
Orange,2
Apple,1
Lemon,3
Orange,2
Lime,3
Apple,2

Program
Map(String line) {
    String[] parts = line.split(",");
    context.write(parts[0], Integer.parseInt(parts[1]));
}

Reduce(String key, int values[]) {
    int s = 0;
    for (int value : values) {
        s += value;
    }
    context.write(key, s);
}
How many intermediate records are created for the input file given above?

A

7. The map function emits exactly one intermediate (fruit, frequency) record per input line, and the input file has 7 lines.
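A Python re-creation of the program (not from the card) to check the record counts:

```python
from collections import defaultdict

input_lines = ["Apple,1", "Orange,2", "Apple,1", "Lemon,3",
               "Orange,2", "Lime,3", "Apple,2"]

# Map: one intermediate (fruit, frequency) record per input line.
intermediate = []
for line in input_lines:
    parts = line.split(",")
    intermediate.append((parts[0], int(parts[1])))

# Shuffle + Reduce: sum frequencies per fruit.
totals = defaultdict(int)
for fruit, freq in intermediate:
    totals[fruit] += freq

print(len(intermediate))  # 7 intermediate records
print(dict(totals))       # {'Apple': 4, 'Orange': 4, 'Lemon': 3, 'Lime': 3}
```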