Ensuring the authenticity of online product reviews that could involve typos and sarcastic comments is a challenge of which Big Data aspect? Volume, velocity, veracity, or variety?
Veracity
A sports analytics platform processing data from video feeds, player statistics, and sensor-equipped equipment exemplifies which Big Data characteristic? Volume, velocity, veracity, or variety?
Variety
In a distributed system, what is the benefit of shipping computation to data?
Reducing network congestion and latency
True or False: In HDFS, the information about the replication factor of a file is stored in the NameNode.
True
Data transfer of replicas is pipelined. What does this mean?
The process writes data to the first data node, which transfers a copy to the next node, which transfers a copy to the next one, until the replication factor is met.
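A toy Python model of this pipeline (illustrative only — the function and node names are made up, not the HDFS wire protocol):

```python
# Toy model of HDFS's pipelined replica write: the client streams a block to
# the first DataNode, which stores a copy and forwards the stream to the next
# node in the chain, until the replication factor is met.

def pipelined_write(block, nodes, replication_factor):
    """Return the DataNodes holding a replica after the pipelined write."""
    pipeline = nodes[:replication_factor]  # nodes chosen for this block
    holders = []
    for node in pipeline:  # each hop stores the block, then forwards downstream
        holders.append(node)
    return holders

print(pipelined_write("block-0", ["dn1", "dn2", "dn3", "dn4"], 3))
# ['dn1', 'dn2', 'dn3']
```

The key point the model captures: the client sends the data once, and the DataNodes themselves relay it down the chain, spreading the transfer load across the cluster.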
Which class in the HDFS API is used to store a path to a file or directory?
Path
What type of data is not stored by the Namenode in HDFS?
The actual data blocks
Given an HDFS cluster with 16 data nodes, a replication factor of 3, and disk storage capacity of 15TB on each node. This HDFS cluster has a capacity of ____ TB.
80 TB. Raw capacity = 16 nodes × 15 TB = 240 TB; with a replication factor of 3, every block is stored three times, so usable capacity = 240 TB / 3 = 80 TB.
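The arithmetic can be checked in a couple of lines of Python (a sketch; the variable names are mine):

```python
# Usable HDFS capacity = total raw disk capacity / replication factor.
nodes, disk_tb, replication = 16, 15, 3
raw_tb = nodes * disk_tb           # 16 * 15 = 240 TB of raw disk
usable_tb = raw_tb // replication  # every block is stored 3 times
print(usable_tb)                   # 80
```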
With 12 machines, each having 1 TB of disk space, and a replication factor of 3, what is the total HDFS storage capacity?
4 TB (12 machines × 1 TB = 12 TB raw; 12 TB / 3 replicas = 4 TB usable).
Consider a cluster of one NameNode and 10 DataNodes. The replication factor is 3 and the block size is 256 MB. The NameNode is writing a file of size 5 GB. What is the expected number of block replicas of that file stored on the NameNode?
0. The NameNode stores only metadata (the file-system namespace and block locations); the actual data blocks are always stored on DataNodes.
Consider a HDFS with block size = 256MB and a MapReduce job for word counting (count the number of words for a given input).
There is a document with size 2GB. How many mappers will be created by the word counting MapReduce job?
Note: 1GB = 1024MB
8
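The mapper count follows from the split size, which by default equals the HDFS block size (a quick Python check; names are mine):

```python
import math

# One mapper is created per input split; default split size = HDFS block size.
file_mb  = 2 * 1024   # 2 GB document (1 GB = 1024 MB)
block_mb = 256
mappers = math.ceil(file_mb / block_mb)
print(mappers)        # 8
```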
Consider an HDFS cluster with one NameNode and 9 DataNodes. The default HDFS block size is 128 MB and the replication factor is two. The total storage capacity of each DataNode is 10 TB.
The total capacity of HDFS storage is 90 TB (9 DataNodes × 10 TB each). 45 TB is incorrect: that is the usable capacity after dividing by the replication factor of 2, not the total capacity.
Data Node #1 is writing a file of size 16 GB. The amount of network traffic incurred on the network medium of the entire cluster while writing this file is 16 GB: the first replica of every block is stored locally on Data Node #1 with no network transfer, so only the second replicas (16 GB in total) travel over the network.
The total number of block replicas created in HDFS as a result of writing this 16 GB file is 256 (16 GB / 128 MB = 128 blocks, each stored twice).
The expected number of block replicas stored on Data Node #2 is 16: Data Node #1 holds the first replica of all 128 blocks, and the 128 second replicas are spread evenly across the remaining 8 DataNodes (128 / 8 = 16).
Note: 1GB = 1,024 MB and 1TB = 1,024GB
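All three answers above can be verified with a short Python sketch (variable names are mine; the numbers come from the question):

```python
import math

# Cluster from the question: 9 DataNodes, 128 MB blocks, replication factor 2.
file_mb, block_mb, replication, datanodes = 16 * 1024, 128, 2, 9

blocks = math.ceil(file_mb / block_mb)      # 16 GB / 128 MB = 128 blocks
replicas = blocks * replication             # 128 * 2 = 256 replicas in total
network_gb = (replication - 1) * 16         # the writer keeps the first replica locally
per_other_node = blocks // (datanodes - 1)  # 128 second replicas over 8 other nodes
print(blocks, replicas, network_gb, per_other_node)  # 128 256 16 16
```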
True/False: The following function is valid to use in big-data programs.
function(int x) {
    return x + currentTimeInMillis;
}
False. The function reads currentTimeInMillis, an external value that changes between calls, so it is not deterministic (not a pure function).
Is the following function valid to use in big-data programs?
current_max = 0
def update_max(number):
    global current_max
    current_max = max(current_max, number)
    return current_max
No, it is stateful: the result depends on the external variable current_max, which changes between calls, so the function is not pure.
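A stateless version of the same computation (a sketch) takes all of its input as arguments and touches no external state, which makes it deterministic and safe to run in parallel:

```python
def max_of(numbers):
    # Pure: the result depends only on the argument, never on outside state.
    return max(numbers)

print(max_of([3, 7, 2]))  # 7
```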
True or False: Spark can cache intermediate results in memory, which is not possible in Hadoop MapReduce
True
What is the role of the reduce function in MapReduce?
To combine intermediate values associated with the same key.
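For example, in word counting the reduce step receives a word and the list of counts emitted for it by the mappers and sums them (a Python sketch, not actual Hadoop API code):

```python
def reduce_word_count(word, counts):
    # Combine all intermediate values that share the same key.
    return (word, sum(counts))

print(reduce_word_count("spark", [1, 1, 1]))  # ('spark', 3)
```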
Assessing the trustworthiness and accuracy of moderate-size user-generated content stored in a structured format on a public forum deals with which aspect of Big Data?
Veracity
When is the combine function useful in a MapReduce job?
When you want to reduce the amount of data transferred between the map and reduce phases.
True or False: The combine function in MapReduce is run after the reduce phase.
False
True/False The following map and reduce functions are compatible to use with Hadoop.
Map(String document, int count) {
    for each String word w in document:
        return (w, 1);
}
Reduce(String word, int[] values) {
    return (word, sum(values));
}
True. The map output type (String, int) matches the reduce input type (String, int[]).
True/False The following map and combine functions are compatible to use with Hadoop.
Map(String user, String pageView) {
    return (user, length(pageView));
}
Combine(String key, int[] values) {
    return (key, sum(values));
}
True. The combiner's input and output types, (String, int[]) → (String, int), are consistent with the map output type (String, int).
Consider the following MapReduce program and input file. Let’s call the first column “fruit” and the second column “frequency”.
Note: The String#split function splits a string around the given separator and returns an array of values. The array is zero-based.
Input File
Apple,1
Orange,2
Apple,1
Lemon,3
Orange,2
Lime,3
Apple,2
Program
Map(String line) {
    String[] parts = line.split(",");
    context.write(parts[0], Integer.parseInt(parts[1]));
}
Reduce(String key, int[] values) {
    int s = 0;
    for (int value : values) {
        s += value;
    }
    context.write(key, s);
}
How many intermediate records are created for the input file given above?
7. The Map function emits exactly one intermediate record per input line, and the input file has 7 lines.
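A Python stand-in for the program above makes the count easy to verify (the pseudocode is translated; variable names are mine):

```python
from collections import defaultdict

lines = ["Apple,1", "Orange,2", "Apple,1", "Lemon,3",
         "Orange,2", "Lime,3", "Apple,2"]

# Map phase: one intermediate (fruit, frequency) record per input line.
intermediate = []
for line in lines:
    fruit, freq = line.split(",")
    intermediate.append((fruit, int(freq)))
print(len(intermediate))  # 7 intermediate records

# Shuffle + Reduce: group values by key, then sum them.
grouped = defaultdict(list)
for key, value in intermediate:
    grouped[key].append(value)
totals = {key: sum(values) for key, values in grouped.items()}
print(totals)  # {'Apple': 4, 'Orange': 4, 'Lemon': 3, 'Lime': 3}
```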