Briefly summarize the two Spark streaming paradigms.
DStreams (Spark Streaming): the original micro-batch API built on RDDs, where a stream is represented as a sequence of RDDs. Structured Streaming: the newer API built on DataFrames/Datasets, which treats a stream as an unbounded table and benefits from the Catalyst optimizer.
When using DStreams, how can we make Spark incrementally undo reduce operations on older batches when new batches come in? What are the advantages, and give one scenario in which this would be useful.
We can supply an invertible reduce function alongside the standard reduce function in .reduceByKeyAndWindow:
lines.flatMap(…).mapToPair(…)
    .reduceByKeyAndWindow(
        (i1, i2) -> i1 + i2,  // reduce: fold new values into the window
        (i1, i2) -> i1 - i2,  // inverse reduce: subtract values leaving the window
        Durations.seconds(1000));
Here (i1, i2) -> i1 - i2 is the invertible (inverse) reduce function, which tells Spark how to incrementally remove the contribution of batches that fall out of the window. This way we do not have to recompute the whole window; we only compute the results for new arrivals and subtract the old ones. Note that supplying an inverse reduce function requires checkpointing to be enabled.
This saves a great deal of computation when the sliding window spans many batches. For example, with a batch size of 1 second and a sliding window of 24 hours, you do not have to recompute the entire 24 hours' worth of batches after every second; you only add the newest second and forget the oldest one.
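The add/subtract idea can be sketched in plain Java, without Spark (the class and method names here are illustrative, not Spark API): keep the batches currently in the window, fold each new batch into the running totals, and subtract the oldest batch once it slides out.

```java
import java.util.*;

// Plain-Java sketch of the incremental (invertible) window technique:
// instead of re-reducing every batch in the window, we add the newest
// batch's per-key counts and subtract the oldest batch's counts.
class IncrementalWindowCount {
    private final int windowSize;                           // window length in batches
    private final Deque<Map<String, Integer>> batches = new ArrayDeque<>();
    private final Map<String, Integer> totals = new HashMap<>();

    IncrementalWindowCount(int windowSize) { this.windowSize = windowSize; }

    // Called once per batch interval with the new batch's per-key counts.
    Map<String, Integer> addBatch(Map<String, Integer> batch) {
        batches.addLast(batch);
        batch.forEach((k, v) -> totals.merge(k, v, Integer::sum));       // reduce
        if (batches.size() > windowSize) {
            Map<String, Integer> oldest = batches.removeFirst();
            oldest.forEach((k, v) -> totals.merge(k, -v, Integer::sum)); // inverse reduce
            totals.values().removeIf(v -> v == 0);
        }
        return totals;
    }
}
```

Each batch interval costs work proportional to one batch, not to the whole window.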
Which of these reduce functions are invertible?
A. Count
B. Addition to a set
C. Addition to a multiset
D. Keeping only the most frequent items
Briefly explain each one.
A. Count: Yes, we can increment the count on arrival and decrement it when it falls out of the window.
B. Addition to a set: No. Sets don't allow duplicates, so we cannot tell whether an item was previously added more than once, which means we cannot safely remove it when one occurrence leaves the window.
C. Addition to a multiset: Yes, multisets track duplicates, so we can simply remove one occurrence of the element.
D. Keeping only the most frequent items: Not straightforward, but possible with specialized algorithms. Tracking frequencies over a stream is expensive and requires auxiliary data structures and/or algorithms.
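The set-versus-multiset distinction can be made concrete with a plain-Java sketch (a HashMap used as a multiset; all names are illustrative, not Spark API):

```java
import java.util.*;

// Multiset modeled as value -> occurrence count. Adding is invertible:
// removing just decrements the count, so duplicate information survives.
class MultisetDemo {
    static Map<String, Integer> ms = new HashMap<>();

    static void add(String x)    { ms.merge(x, 1, Integer::sum); }  // reduce
    static void remove(String x) {                                  // inverse reduce
        ms.merge(x, -1, Integer::sum);
        ms.values().removeIf(c -> c == 0);
    }
}
```

With a plain `Set<String>`, two adds of `"x"` collapse into one element, so a single remove deletes `"x"` entirely even though it was added twice; the multiset keeps the second occurrence.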
Which output modes are there in Structured Streaming?
Append (only new result rows are written to the sink), Update (only rows that changed since the last trigger are written), and Complete (the entire result table is written on every trigger).
Provide two downsides of DStreams.
- The API works directly on RDDs, so queries do not benefit from the Catalyst optimizer and Tungsten execution engine.
- DStreams operate purely on processing time; there is no built-in support for event time, watermarks, or late data.
Which types of sliding windows are there in Spark Structured Streaming?
Tumbling windows (fixed-size, non-overlapping), sliding windows (fixed-size, overlapping), and session windows (dynamically sized, closed after a gap of inactivity). An example using session windows:
Dataset<Row> events = ... // streaming DataFrame of schema { timestamp: Timestamp, userId: String }

// Group the data by session window and userId, and compute the count of each group
Dataset<Row> sessionizedCounts = events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(
        session_window(col("timestamp"), "5 minutes"),
        col("userId"))
    .count();

What determines the performance of Structured Streaming?
When is DStream preferred over Structured Streaming?
When you have legacy code that already performs operations on RDDs and it would take a lot of time to make it compatible with DataFrames/Datasets.
Explain what BlinkDB does and which mechanism is important.
BlinkDB/Taster answer queries very fast by returning approximations. Instead of running a query over the whole dataset, they run it over a cleverly chosen sample. For each query you can specify either a target execution time, or an error bound with a confidence interval; these parameters influence each other.
These types of queries are called Interactive Queries, and can be useful for IoT applications.
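The core mechanism, sampling, can be illustrated with a plain-Java sketch (this is not the BlinkDB/Taster API; the class name, dataset, and sample size are made up): estimate an aggregate from a small uniform sample and scale it up. A larger sample takes longer but shrinks the error, which is exactly the time/accuracy trade-off these systems expose.

```java
import java.util.*;

// Approximate aggregation by uniform sampling (illustrative only):
// estimate the sum of a large dataset from a small random sample,
// scaling the sample sum by datasetSize / sampleSize.
class ApproxQuery {
    static double approxSum(long[] data, int sampleSize, long seed) {
        Random rng = new Random(seed);  // fixed seed for reproducibility
        long sampleSum = 0;
        for (int i = 0; i < sampleSize; i++)
            sampleSum += data[rng.nextInt(data.length)];
        return (double) sampleSum * data.length / sampleSize;
    }
}
```

Scanning 1,000 sampled rows instead of all 10,000 gives an answer roughly within a few percent of the true sum, at a tenth of the cost.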
How does Taster differ from BlinkDB?
Samples are dropped and loaded in an online fashion: samples that are no longer relevant are evicted and no longer used by the query.
Briefly explain the following Spark DStream Streaming commands:
- transform
- transformToPair
- join
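In Spark, transform applies an arbitrary RDD-to-RDD function to every micro-batch of a DStream, transformToPair does the same but yields a pair (keyed) DStream, and join inner-joins two pair DStreams batch by batch on their keys. The per-batch semantics can be modeled in plain Java, without Spark (all names below are illustrative; a batch stands in for an RDD):

```java
import java.util.*;
import java.util.function.*;

// Plain-Java model of one micro-batch to illustrate the per-batch
// semantics of the three DStream operations (no Spark dependency).
class DStreamModel {
    // transform: apply an arbitrary batch-to-batch function
    static List<String> transform(List<String> batch,
                                  Function<List<String>, List<String>> f) {
        return f.apply(batch);
    }

    // transformToPair: arbitrary function producing a keyed batch
    // (here: word counts within the batch)
    static Map<String, Integer> transformToPair(List<String> batch) {
        Map<String, Integer> out = new HashMap<>();
        for (String w : batch) out.merge(w, 1, Integer::sum);
        return out;
    }

    // join: for each key present in both keyed batches, pair up the values
    static Map<String, int[]> join(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, int[]> out = new HashMap<>();
        for (String k : a.keySet())
            if (b.containsKey(k)) out.put(k, new int[]{a.get(k), b.get(k)});
        return out;
    }
}
```

In real Spark code these operations run once per batch interval, on the RDD backing that interval's micro-batch.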