New Structured Streaming Flashcards

(16 cards)

1
Q

What is the concept behind streaming in big data (according to the PDF)?

A

Streaming processes data continuously as it arrives instead of waiting for a complete dataset. Structured Streaming extends the DataFrame API to handle data in real time.

2
Q

When should you use streaming?

A

Use streaming when data arrives continuously (e.g., sensor feeds, logs, tweets, crime/incident data, weather, moving objects) and needs to be processed in near-real-time.

3
Q

What are the two main types of stream processing in Spark?

A

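Micro-batching (Structured Streaming default) – processes incoming data as a series of small batch jobs at short intervals, with end-to-end latencies as low as about 100 ms.

Continuous processing (experimental) – processes data one record at a time, with latencies as low as about 1 ms.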

4
Q

How does micro-batching work?

A

Spark groups incoming data into small batches (e.g., every few hundred milliseconds to a few seconds) and processes each batch like a small DataFrame job.
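
As a rough illustration of the idea, the plain-Python sketch below (not Spark code) groups records into micro-batches by arrival time. The `micro_batches` helper, the 1-second trigger interval, and the sample records are all assumptions made for this example.

```python
from collections import defaultdict


def micro_batches(records, trigger_interval):
    """Group (arrival_time, value) records into batches by trigger interval.

    Hypothetical helper: batch k holds the records that arrived in
    [k * trigger_interval, (k + 1) * trigger_interval).
    """
    batches = defaultdict(list)
    for arrival_time, value in records:
        batch_id = int(arrival_time // trigger_interval)
        batches[batch_id].append(value)
    # Return the batches in the order they would be processed.
    return [batches[k] for k in sorted(batches)]


# Made-up records: (arrival time in seconds, payload).
records = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d"), (2.9, "e")]
batches = micro_batches(records, trigger_interval=1.0)

# Each batch is then processed like a small, bounded dataset.
batch_sizes = [len(batch) for batch in batches]
```

In Spark itself the interval is set on the stream writer with a trigger such as `trigger(processingTime="1 second")`.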

5
Q

How does continuous streaming work?

A

Every record is processed immediately as it arrives, giving lower latency but weaker fault-tolerance guarantees (at-least-once, rather than the exactly-once guarantee of micro-batching).

6
Q

What are the output modes in Structured Streaming?

A

Complete: Outputs the entire result table every batch.

Append: Outputs only new rows.

Update: Outputs newly added and updated rows.
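
The difference between the three modes can be sketched with a plain-Python simulation of a running word count (not Spark code). The `run_batches` helper and the sample words are assumptions for illustration. Append is shown on the raw input rows, because aggregated counts can still change; in Spark, append on an aggregation additionally requires a watermark.

```python
def run_batches(batches):
    """Simulate a running count and record what each output mode would emit.

    Hypothetical helper, not Spark's API. Returns one dict per batch.
    """
    counts = {}
    outputs = []
    for batch in batches:
        changed = set()
        for word in batch:
            counts[word] = counts.get(word, 0) + 1
            changed.add(word)
        outputs.append({
            # Complete: the entire result table after this batch.
            "complete": dict(counts),
            # Update: only keys that are new or whose count changed.
            "update": {w: counts[w] for w in sorted(changed)},
            # Append: only the rows that arrived in this batch.
            "append": list(batch),
        })
    return outputs


outputs = run_batches([["cat", "dog"], ["cat"], ["owl"]])
```

After the second batch, complete mode still re-emits the unchanged `dog` row, while update mode emits only the changed `cat` row. In Spark the mode is selected on the stream writer with `outputMode("complete")`, `outputMode("append")`, or `outputMode("update")`.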

7
Q

When is complete output mode typically used?

A

With aggregation queries, when debugging or when the entire result table is small enough to output every batch.

8
Q

What is append output mode used for?

A

When the system should output only rows that are newly added and will not change afterward.

9
Q

What is update output mode?

A

Outputs both new rows and rows whose values have changed since the last batch.

10
Q

What are the main types of windows in streaming?

A

Tumbling windows
Sliding windows
Session windows

11
Q

What is a tumbling window?

A

A fixed-size, non-overlapping window (e.g., every 10 minutes)
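
A minimal plain-Python sketch of tumbling-window assignment, assuming timestamps are integer seconds and windows are aligned to time 0. The `tumbling_window` helper is hypothetical and only illustrates the arithmetic.

```python
def tumbling_window(ts, size):
    """Return the [start, end) tumbling window that contains timestamp ts."""
    start = (ts // size) * size
    return (start, start + size)


TEN_MINUTES = 600  # window size in seconds

first = tumbling_window(125, TEN_MINUTES)      # falls in the first window
boundary = tumbling_window(600, TEN_MINUTES)   # a boundary starts a new window
late_in_window = tumbling_window(1199, TEN_MINUTES)
```

Every timestamp lands in exactly one window. In Spark the equivalent grouping uses the `window` function with only a window duration, for example `window(col("timestamp"), "10 minutes")`.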

12
Q

What is a sliding window?

A

A fixed-size window that advances by a slide interval (e.g., 10 min window sliding every 5 min). Windows overlap.
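
The overlap can be sketched in plain Python, assuming integer-second timestamps and windows aligned to time 0. The `sliding_windows` helper is hypothetical.

```python
def sliding_windows(ts, size, slide):
    """Return every [start, end) window of the given size and slide containing ts."""
    windows = []
    # Latest window start that is not after ts.
    start = (ts // slide) * slide
    # Walk backwards while the window still covers ts (start + size > ts).
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


# 10-minute windows sliding every 5 minutes (values in seconds).
at_7_min = sliding_windows(420, size=600, slide=300)
at_15_min = sliding_windows(900, size=600, slide=300)
```

With a 10-minute size and a 5-minute slide, each event belongs to two windows. In Spark this corresponds to passing a slide duration to `window`, for example `window(col("timestamp"), "10 minutes", "5 minutes")`.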

13
Q

What is a session window?

A

A window defined by periods of activity separated by gaps of inactivity. The gap can be a fixed duration or computed dynamically from the data.
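
A minimal plain-Python sketch of session grouping with a fixed gap, assuming integer-second timestamps for a single key. The `session_windows` helper and the sample events are hypothetical.

```python
def session_windows(timestamps, gap):
    """Group timestamps into sessions separated by at least `gap` of inactivity.

    Returns one (start, end) pair per session, where end is the last
    event time plus the gap.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            # Still within the gap of the previous event: extend the session.
            sessions[-1].append(ts)
        else:
            # Too much inactivity (or first event): start a new session.
            sessions.append([ts])
    return [(s[0], s[-1] + gap) for s in sessions]


# Made-up event times in seconds, with a 5-minute inactivity gap.
sessions = session_windows([0, 120, 200, 1000, 1100], gap=300)
```

Unlike tumbling and sliding windows, the window length here depends on the data. Spark 3.2 and later provide `session_window` for this, for example `session_window(col("timestamp"), "5 minutes")`.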

14
Q

What problem does watermarking solve?

A

Helps Spark handle late or out-of-order data by specifying how much lateness to tolerate before finalizing results.

15
Q

How is watermarking applied?

A

Developers call .withWatermark("timestamp", "10 minutes") so Spark knows that data arriving more than 10 minutes behind the latest event time it has seen can be dropped.
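
The rule can be sketched in plain Python as watermark = latest event time seen minus the allowed delay. This is a simplification: the `apply_watermark` helper is hypothetical and updates the watermark after every record, whereas Spark advances it between micro-batches and applies it to stateful operations such as windowed aggregations.

```python
def apply_watermark(event_times, delay):
    """Split event times into accepted and dropped using a simple watermark.

    Simplified model: the watermark is (max event time seen so far) - delay,
    and any record older than the watermark is dropped.
    """
    max_seen = None
    accepted, dropped = [], []
    for ts in event_times:
        watermark = None if max_seen is None else max_seen - delay
        if watermark is not None and ts < watermark:
            dropped.append(ts)  # too late: older than the watermark
        else:
            accepted.append(ts)
            max_seen = ts if max_seen is None else max(max_seen, ts)
    return accepted, dropped


# Made-up event times in seconds, in arrival order, with a 10-minute delay.
accepted, dropped = apply_watermark([1000, 1300, 900, 2000, 1350, 1450], delay=600)
```

The record at 900 is out of order but still within the allowed delay, so it is kept. The record at 1350 arrives after the watermark has advanced to 1400, so it is dropped.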

16
Q

Why is watermarking needed in streaming systems?

A

Input data may not be totally ordered, especially when combining multiple ordered streams; watermarking prevents waiting indefinitely for late data
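
A small plain-Python sketch of how this happens: two streams that are each ordered by event time lose that ordering once they are interleaved by arrival time. The helpers and the sample records are hypothetical.

```python
def merge_by_arrival(*streams):
    """Interleave streams of (arrival_time, event_time) records by arrival time."""
    merged = sorted((rec for s in streams for rec in s), key=lambda rec: rec[0])
    return [event_time for _, event_time in merged]


def is_sorted(values):
    """Return True if values are in non-decreasing order."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Each stream is ordered by event time, but stream_b is delivered with a delay.
stream_a = [(1, 100), (2, 110), (5, 120)]
stream_b = [(3, 90), (4, 105), (6, 115)]

merged = merge_by_arrival(stream_a, stream_b)
```

Because the merged event times are no longer ordered, a windowed aggregation cannot tell from the data alone when a window is complete; the watermark supplies that cutoff.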