Midterm 1 Flashcards

(24 cards)

1
Q

Ensuring the authenticity of online product reviews that may contain typos and sarcastic comments is a challenge of which Big Data aspect: volume, velocity, veracity, or variety?

A

Veracity

2
Q

A sports analytics platform processing data from video feeds, player statistics, and sensor-equipped equipment exemplifies which Big Data characteristic: volume, velocity, veracity, or variety?

A

Variety

3
Q

In a distributed system, what is the benefit of shipping computation to data?

A

Reducing network congestion and latency: the program shipped to the data is far smaller than the data it would otherwise have to pull across the network.

4
Q

True or False: In HDFS, the information about the replication factor of a file is stored in the Namenode.

A

True

5
Q

Data transfer of replicas is pipelined. What does this mean?

A

The process writes data to the first data node, which transfers a copy to the next node, which transfers a copy to the next one, until the replication factor is met.
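A minimal Python sketch of this pipeline (not part of the original card; the node names and 3-replica chain are illustrative assumptions, and real HDFS forwards packet by packet rather than whole blocks):

```python
def pipelined_write(block, nodes, replication_factor=3):
    """Simulate HDFS pipelined replication: the first node stores the
    block and forwards a copy to the next node in the pipeline, which
    does the same, until replication_factor replicas exist."""
    pipeline = nodes[:replication_factor]
    stored = {}
    for node in pipeline:
        stored[node] = block  # this node persists its replica, then forwards
    return stored

replicas = pipelined_write("block_0001", ["dn1", "dn2", "dn3", "dn4"])
print(sorted(replicas))  # ['dn1', 'dn2', 'dn3']
```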

6
Q

Which class in the HDFS API is used to store a path to a file or directory?

A

Path

7
Q

What type of data is not stored by the Namenode in HDFS?

A

The actual data blocks

8
Q

Given an HDFS cluster with 16 data nodes, a replication factor of 3, and disk storage capacity of 15TB on each node. This HDFS cluster has a capacity of ____ TB.

A

80. (16 data nodes × 15 TB = 240 TB of raw disk; with replication factor 3, the usable capacity is 240 / 3 = 80 TB.)
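The calculation, sketched in Python (not part of the original card):

```python
def hdfs_capacity_tb(num_nodes, disk_tb_per_node, replication_factor):
    """Usable HDFS capacity: raw disk divided by the replication factor,
    since every block is stored replication_factor times."""
    return num_nodes * disk_tb_per_node / replication_factor

print(hdfs_capacity_tb(16, 15, 3))  # 80.0
```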

9
Q

With 12 machines, each having 1 TB of disk space, and a replication factor of 3, what is the total HDFS storage capacity?

A

4 TB. (12 × 1 TB = 12 TB of raw disk; 12 / 3 = 4 TB usable.)

10
Q

Consider a cluster of one NameNode and 10 DataNodes. The replication factor is 3 and the block size is 256MB. The NameNode is writing a file of size 5GB. What is the expected number of block replicas of that file stored on the NameNode?

A

0. The NameNode stores only metadata (file names, block locations, replication factor); block replicas are stored only on DataNodes.

11
Q

Consider an HDFS with block size = 256MB and a MapReduce job for word counting (counting the number of words in a given input).

There is a document with size 2GB. How many mappers will be created by the word counting MapReduce job?

Note: 1GB = 1024MB

A

8. (2 GB = 2,048 MB; 2,048 / 256 = 8 blocks, and one mapper is created per block.)
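The arithmetic, sketched in Python (not part of the original card; ceiling division is used because a final partial block still gets its own mapper):

```python
import math

def num_mappers(file_size_mb, block_size_mb):
    """One map task per HDFS block (input split), including a
    partially filled last block."""
    return math.ceil(file_size_mb / block_size_mb)

print(num_mappers(2 * 1024, 256))  # 8
```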

12
Q

Consider an HDFS cluster with one NameNode and 9 DataNodes. The default HDFS block size is 128 MB and the replication factor is two. The total storage capacity of each DataNode is 10 TB.

A

The total raw capacity of HDFS storage is 9 × 10 TB = 90 TB (with replication factor 2, this holds 45 TB of unique data).

Data Node #1 is writing a file of size 16 GB. The network traffic incurred across the cluster while writing this file is 16 GB: the first replica of each block is written locally on Data Node #1, so only the second replica crosses the network.

The total number of block replicas created in HDFS as a result of writing this 16GB file is 256: 16 GB = 16,384 MB; 16,384 / 128 = 128 blocks; 128 blocks × 2 replicas = 256.

The expected number of block replicas stored on Data Node #2 is 16: the 128 remote replicas are spread evenly across the other 8 DataNodes, so 128 / 8 = 16.
Note: 1GB = 1,024 MB and 1TB = 1,024GB
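These calculations, sketched in Python (not part of the original card; a simplified model that assumes remote replicas spread evenly over the other DataNodes):

```python
def write_stats(file_gb, block_mb, replication, num_datanodes,
                writer_is_datanode=True):
    """Block and traffic accounting for one HDFS file write."""
    file_mb = file_gb * 1024
    blocks = file_mb // block_mb              # 16 GB / 128 MB = 128 blocks
    replicas = blocks * replication           # each block stored `replication` times
    # If the writer is a DataNode, the first replica is local: only the
    # remaining (replication - 1) copies of each block cross the network.
    network_copies = replication - 1 if writer_is_datanode else replication
    traffic_gb = file_gb * network_copies
    # Remote replicas spread evenly over the other DataNodes.
    per_other_node = blocks * network_copies / (num_datanodes - 1)
    return blocks, replicas, traffic_gb, per_other_node

print(write_stats(16, 128, 2, 9))  # (128, 256, 16, 16.0)
```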

13
Q

True/False: The following function is valid to use in big-data programs.

function(int x) {
    return x + currentTimeInMillis;
}

A

False. It is non-deterministic: the result depends on the current time, not only on the input x.

14
Q

Is the following function valid to use in big-data programs?

current_max = 0
def update_max(number):
    current_max = max(current_max, number)
    return current_max

A

No, it is stateful and the result depends on external state.
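A stateless alternative (not from the card): in the MapReduce model, a running maximum is expressed as a pure reduction over the values rather than a mutation of global state:

```python
from functools import reduce

def pure_max(values):
    """Pure function: the output depends only on the input, so it can
    run on any worker, in any order, any number of times."""
    return reduce(lambda a, b: a if a >= b else b, values)

print(pure_max([3, 7, 2]))  # 7
```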

15
Q

True or False: Spark can cache intermediate results in memory, which is not possible in Hadoop MapReduce

A

True

16
Q

What is the role of the reduce function in MapReduce?

A

To combine intermediate values associated with the same key.

17
Q

Assessing the trustworthiness and accuracy of moderate-size user-generated content stored in a structured format on a public forum deals with which aspect of Big Data?

A

Veracity (the trustworthiness and accuracy of the data)

18
Q

When is the combine function useful in a MapReduce job?

A

When you want to reduce the amount of data transferred between the map and reduce phases.
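A small Python sketch (not part of the original card) of what a combiner buys: pre-aggregating map output on each node shrinks what must be shuffled to the reducers:

```python
from collections import Counter

# Map output on one node before the shuffle: one (word, 1) pair per word.
map_output = [("apple", 1), ("apple", 1), ("lime", 1), ("apple", 1)]

# Combiner: pre-sum values per key locally, so fewer records cross the network.
combined = Counter()
for word, count in map_output:
    combined[word] += count

print(len(map_output), len(combined))  # 4 records shrink to 2
print(dict(combined))                  # {'apple': 3, 'lime': 1}
```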

19
Q

True or False: The combine function in MapReduce is run after the reduce phase.

A

False. The combiner runs on map output, between the map and reduce phases.

20
Q

True/False The following map and reduce functions are compatible to use with Hadoop.

Map(String document, int count) {
    for each String word w in document:
        return (w, 1);
}

Reduce(String word, int[] values) {
    return (word, sum(values))
}

21
Q

True/False The following map and combine functions are compatible to use with Hadoop.

Map(String user, String pageView) {
    return (user, length(pageView));
}

Combine(String key, int[] values) {
    return (key, sum(values));
}

22
Q

Consider the following MapReduce program and input file. Let’s call the first column “fruit” and the second column “frequency”.

Note: The String#split function splits a string around the given separator and returns an array of values. The array is zero-based.

Input File
Apple,1
Orange,2
Apple,1
Lemon,3
Orange,2
Lime,3
Apple,2

Program
Map(String line) {
    String[] parts = line.split(",");
    context.write(parts[0], Integer.parseInt(parts[1]));
}

Reduce(String key, int values[]) {
    int s = 0;
    for (int value : values) {
        s += value;
    }
    context.write(key, s);
}
How many intermediate records are created for the input file given above?

A

7. The map function emits exactly one intermediate (fruit, frequency) record per input line, and the input file has 7 lines.
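A Python re-creation of the program (not from the card) to check the record counts:

```python
from collections import defaultdict

input_lines = ["Apple,1", "Orange,2", "Apple,1", "Lemon,3",
               "Orange,2", "Lime,3", "Apple,2"]

# Map: one intermediate (fruit, frequency) record per input line.
intermediate = []
for line in input_lines:
    parts = line.split(",")
    intermediate.append((parts[0], int(parts[1])))

# Shuffle + Reduce: sum frequencies per fruit.
totals = defaultdict(int)
for fruit, freq in intermediate:
    totals[fruit] += freq

print(len(intermediate))  # 7 intermediate records
print(dict(totals))       # {'Apple': 4, 'Orange': 4, 'Lemon': 3, 'Lime': 3}
```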