What is the concept behind streaming in big data (according to the PDF)?
Streaming processes data continuously as it arrives instead of waiting for full datasets. Structured Streaming extends the DataFrame API to handle data in real time.
When should you use streaming?
Use streaming when data arrives continuously (e.g., sensor feeds, logs, tweets, crime/incident data, weather, moving objects) and needs to be processed in near-real-time.
What are the two main types of stream processing in Spark?
Micro-batching (the Structured Streaming default) – groups incoming records into small batches processed at short trigger intervals, with end-to-end latencies around ~100 ms.
Continuous streaming – processes data one record at a time, with lower latency (as low as ~1 ms).
How does micro-batching work?
Spark groups incoming data into small batches (e.g., triggered every few hundred milliseconds or seconds) and processes each batch like a small DataFrame job.
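A minimal plain-Python sketch of the idea (not Spark's engine): here batches are cut by record count for simplicity, whereas Spark cuts them by trigger interval, and each batch is then handled as an ordinary small job.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an unbounded iterator into small fixed-size batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Each batch is processed like a small DataFrame job (here: a sum):
events = range(7)                # stands in for an arriving stream
results = [sum(b) for b in micro_batches(events, 3)]
print(results)  # [0+1+2, 3+4+5, 6] -> [3, 12, 6]
```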
How does continuous streaming work?
Every record is processed immediately as it arrives, giving lower latency but requiring stricter guarantees.
What are the output modes in Structured Streaming?
Complete: Outputs the entire result table every batch.
Append: Outputs only new rows.
Update: Outputs newly added and updated rows.
When is complete output mode typically used?
When debugging or when the entire output is small enough to print each batch.
What is append output mode used for?
When the system should only output rows that have been newly added and will not change in later batches.
What is update output mode?
Outputs both new rows and rows whose values have changed since the last batch.
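The three modes can be contrasted with a toy running word count in plain Python. This is a simplified model of the result table, not Spark's engine; in real Spark, append mode on an aggregation additionally requires a watermark so rows can be finalized.

```python
from collections import Counter

result = Counter()          # the "result table"

def process_batch(batch):
    """Apply one micro-batch and compute what each output mode would emit."""
    before = dict(result)
    result.update(batch)
    complete = dict(result)                          # entire result table
    update = {w: c for w, c in result.items()
              if before.get(w) != c}                 # new + changed rows
    append = {w: c for w, c in update.items()
              if w not in before}                    # brand-new rows only
    return complete, append, update

c1, a1, u1 = process_batch(["a", "b"])
c2, a2, u2 = process_batch(["b", "c"])
# After batch 2:
# complete -> {'a': 1, 'b': 2, 'c': 1}   (whole table every batch)
# append   -> {'c': 1}                   (only brand-new rows)
# update   -> {'b': 2, 'c': 1}           (new and changed rows)
```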
What are the main types of windows in streaming?
Tumbling windows
Sliding windows
Session windows
What is a tumbling window?
A fixed-size, non-overlapping window (e.g., every 10 minutes).
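A short illustrative helper (not a Spark API) showing the bucketing rule: with 10-minute tumbling windows, each timestamp falls into exactly one half-open bucket.

```python
def tumbling_window(ts_minutes, size=10):
    """Return the (start, end) of the single window containing ts_minutes."""
    start = (ts_minutes // size) * size
    return (start, start + size)

assert tumbling_window(3)  == (0, 10)
assert tumbling_window(12) == (10, 20)
assert tumbling_window(10) == (10, 20)   # boundaries are half-open
```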
What is a sliding window?
A fixed-size window that advances by a slide interval (e.g., 10 min window sliding every 5 min). Windows overlap.
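A companion sketch (again a plain-Python illustration, not Spark code): because the slide interval is shorter than the window size, one timestamp belongs to several overlapping windows.

```python
def sliding_windows(ts, size=10, slide=5):
    """All (start, end) windows containing timestamp ts.
    Windows overlap whenever slide < size."""
    starts = []
    s = (ts // slide) * slide        # latest window start <= ts
    while s > ts - size:             # window must still cover ts
        starts.append(s)
        s -= slide
    return sorted((s, s + size) for s in starts)

# A 12-minute timestamp lands in two overlapping 10-minute windows:
assert sliding_windows(12) == [(5, 15), (10, 20)]
```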
What is a session window?
A window defined by periods of activity separated by gaps of inactivity. The gap is configurable or dynamic.
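A minimal sketch of the grouping rule with a fixed (non-dynamic) gap: consecutive events closer together than the gap extend the current session, and a larger gap starts a new one.

```python
def session_windows(timestamps, gap=5):
    """Group sorted event times into sessions separated by > gap minutes
    of inactivity (fixed gap; Spark also supports dynamic gaps)."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)   # within the gap: extend session
        else:
            sessions.append([t])     # inactivity gap exceeded: new session
    return sessions

# Events at 1, 3, 12, 14, 30 with a 5-minute gap form three sessions:
assert session_windows([1, 3, 12, 14, 30]) == [[1, 3], [12, 14], [30]]
```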
What problem does watermarking solve?
Helps Spark handle late or out-of-order data by specifying how much lateness to tolerate before finalizing results.
How is watermarking applied?
Developers call .withWatermark("timestamp", "10 minutes") on the streaming DataFrame, so Spark knows that data arriving more than 10 minutes behind the latest event time seen so far can be dropped.
Why is watermarking needed in streaming systems?
Input data may not be totally ordered, especially when combining multiple ordered streams; watermarking prevents waiting indefinitely for late data
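The rule can be sketched in plain Python (a simplified model of Spark's behavior, not its implementation): track the maximum event time seen, and drop events that fall more than the allowed delay behind it instead of waiting for them indefinitely.

```python
class Watermark:
    """Toy event-time watermark with a fixed lateness tolerance (minutes)."""
    def __init__(self, delay):
        self.delay = delay
        self.max_event_time = float("-inf")

    def accept(self, event_time):
        """Advance the watermark and report whether the event is kept."""
        self.max_event_time = max(self.max_event_time, event_time)
        return event_time >= self.max_event_time - self.delay

wm = Watermark(delay=10)
assert wm.accept(30)        # on time; watermark is now 30 - 10 = 20
assert wm.accept(25)        # late, but within the 10-minute tolerance
assert not wm.accept(15)    # more than 10 minutes late -> dropped
```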