What is the concept behind streaming in big data (according to the PDF)?
Streaming processes data continuously as it arrives instead of waiting for full datasets. Structured Streaming extends the DataFrame API to handle data in real time.
When should you use streaming?
Use streaming when data arrives continuously (e.g., sensor feeds, logs, tweets, crime/incident data, weather, moving objects) and needs to be processed in near-real-time.
What are the two main types of stream processing in Spark?
Micro-batching (the Structured Streaming default) – groups incoming records into small batches processed at short trigger intervals, with end-to-end latencies around ~100 ms.
Continuous streaming – processes data one record at a time, with lower latency (as low as ~1 ms).
How does micro-batching work?
Spark groups incoming data into small batches (e.g., triggered every few hundred milliseconds or seconds) and processes each batch like a small DataFrame job.
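A minimal plain-Python sketch of the idea (not Spark's engine): here batches are cut by record count for simplicity, whereas Spark cuts them by trigger interval, and each batch is then handled as an ordinary small job.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an unbounded iterator into small fixed-size batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Each batch is processed like a small DataFrame job (here: a sum):
events = range(7)                # stands in for an arriving stream
results = [sum(b) for b in micro_batches(events, 3)]
print(results)  # [0+1+2, 3+4+5, 6] -> [3, 12, 6]
```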
How does continuous streaming work?
Every record is processed immediately as it arrives, giving lower latency but requiring stricter guarantees.
What are the output modes in Structured Streaming?
Complete: Outputs the entire result table every batch.
Append: Outputs only new rows.
Update: Outputs newly added and updated rows.
When is complete output mode typically used?
When debugging or when the entire output is small enough to print each batch.
What is append output mode used for?
When the system should only output rows that have been newly added and will not change in later batches.
What is update output mode?
Outputs both new rows and rows whose values have changed since the last batch.
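The three modes can be contrasted with a toy running word count in plain Python. This is a simplified model of the result table, not Spark's engine; in real Spark, append mode on an aggregation additionally requires a watermark so rows can be finalized.

```python
from collections import Counter

result = Counter()          # the "result table"

def process_batch(batch):
    """Apply one micro-batch and compute what each output mode would emit."""
    before = dict(result)
    result.update(batch)
    complete = dict(result)                          # entire result table
    update = {w: c for w, c in result.items()
              if before.get(w) != c}                 # new + changed rows
    append = {w: c for w, c in update.items()
              if w not in before}                    # brand-new rows only
    return complete, append, update

c1, a1, u1 = process_batch(["a", "b"])
c2, a2, u2 = process_batch(["b", "c"])
# After batch 2:
# complete -> {'a': 1, 'b': 2, 'c': 1}   (whole table every batch)
# append   -> {'c': 1}                   (only brand-new rows)
# update   -> {'b': 2, 'c': 1}           (new and changed rows)
```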
What are the main types of windows in streaming?
Tumbling windows
Sliding windows
Session windows
What is a tumbling window?
A fixed-size, non-overlapping window (e.g., every 10 minutes).
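A short illustrative helper (not a Spark API) showing the bucketing rule: with 10-minute tumbling windows, each timestamp falls into exactly one half-open bucket.

```python
def tumbling_window(ts_minutes, size=10):
    """Return the (start, end) of the single window containing ts_minutes."""
    start = (ts_minutes // size) * size
    return (start, start + size)

assert tumbling_window(3)  == (0, 10)
assert tumbling_window(12) == (10, 20)
assert tumbling_window(10) == (10, 20)   # boundaries are half-open
```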
What is a sliding window?
A fixed-size window that advances by a slide interval (e.g., 10 min window sliding every 5 min). Windows overlap.
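A companion sketch (again a plain-Python illustration, not Spark code): because the slide interval is shorter than the window size, one timestamp belongs to several overlapping windows.

```python
def sliding_windows(ts, size=10, slide=5):
    """All (start, end) windows containing timestamp ts.
    Windows overlap whenever slide < size."""
    starts = []
    s = (ts // slide) * slide        # latest window start <= ts
    while s > ts - size:             # window must still cover ts
        starts.append(s)
        s -= slide
    return sorted((s, s + size) for s in starts)

# A 12-minute timestamp lands in two overlapping 10-minute windows:
assert sliding_windows(12) == [(5, 15), (10, 20)]
```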
What is a session window?
A window defined by periods of activity separated by gaps of inactivity. The gap is configurable or dynamic.
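A minimal sketch of the grouping rule with a fixed (non-dynamic) gap: consecutive events closer together than the gap extend the current session, and a larger gap starts a new one.

```python
def session_windows(timestamps, gap=5):
    """Group sorted event times into sessions separated by > gap minutes
    of inactivity (fixed gap; Spark also supports dynamic gaps)."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)   # within the gap: extend session
        else:
            sessions.append([t])     # inactivity gap exceeded: new session
    return sessions

# Events at 1, 3, 12, 14, 30 with a 5-minute gap form three sessions:
assert session_windows([1, 3, 12, 14, 30]) == [[1, 3], [12, 14], [30]]
```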
What problem does watermarking solve?
Helps Spark handle late or out-of-order data by specifying how much lateness to tolerate before finalizing results.
How is watermarking applied?
Developers call .withWatermark("timestamp", "10 minutes") on the streaming DataFrame, so Spark knows that data arriving more than 10 minutes behind the latest event time seen so far can be dropped.
Why is watermarking needed in streaming systems?
Input data may not be totally ordered, especially when combining multiple ordered streams; watermarking prevents waiting indefinitely for late data
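The rule can be sketched in plain Python (a simplified model of Spark's behavior, not its implementation): track the maximum event time seen, and drop events that fall more than the allowed delay behind it instead of waiting for them indefinitely.

```python
class Watermark:
    """Toy event-time watermark with a fixed lateness tolerance (minutes)."""
    def __init__(self, delay):
        self.delay = delay
        self.max_event_time = float("-inf")

    def accept(self, event_time):
        """Advance the watermark and report whether the event is kept."""
        self.max_event_time = max(self.max_event_time, event_time)
        return event_time >= self.max_event_time - self.delay

wm = Watermark(delay=10)
assert wm.accept(30)        # on time; watermark is now 30 - 10 = 20
assert wm.accept(25)        # late, but within the 10-minute tolerance
assert not wm.accept(15)    # more than 10 minutes late -> dropped
```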