What is streaming data in the context of data engineering?
A continuous flow of records generated over time, often as events such as clicks, sensor readings, or log entries.
How does streaming processing differ from batch processing?
Streaming processes data continuously as it arrives with low latency, while batch processes finite chunks at scheduled intervals.
What is an event in an event-driven system?
A record representing something that happened at a specific time, often including a key, timestamp, and payload.
What is a message broker or log-based streaming system?
A system that accepts, stores, and delivers ordered streams of messages or events to consumers, often partitioned for scalability.
What is a topic in a streaming system?
A named stream or category of messages that producers write to and consumers read from.
What is a partition in a topic?
An ordered, append-only sequence of messages that forms a shard of a topic for parallelism and scaling.
Why are partitions used in streaming systems?
To distribute load across brokers and consumers and enable parallel reads and writes for scalability.
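The key-to-partition mapping can be sketched in a few lines. This is an illustrative hash-mod scheme, not the hashing any particular broker actually uses; the point is that a stable hash keeps all events for a key in one partition, preserving per-key ordering.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    # Stable hash so the same key always maps to the same partition.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Events with the same key always land in the same partition,
# so consumers see them in the order they were written.
p = partition_for("user-42", 6)
assert p == partition_for("user-42", 6)
assert 0 <= p < 6
```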
What is an offset in a partition?
A monotonically increasing position that uniquely identifies a message’s location within a partition.
What is a consumer group?
A set of consumers that coordinate to share the partitions of a topic so that each partition is consumed by only one member at a time.
Why are consumer groups useful?
They allow horizontal scaling of consumption while providing a way to process each message once per group.
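A minimal sketch of how a group might split partitions among its members, assuming a simple static round-robin; real systems do this dynamically through a coordinator that rebalances when members join or leave.

```python
def assign_partitions(partitions, consumers):
    """Round-robin each partition to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Six partitions shared by a two-member group: each partition is
# owned by exactly one consumer, so the group reads each message once.
assignment = assign_partitions(range(6), ["c1", "c2"])
```

Adding a third consumer to the group shrinks each member's share; adding more consumers than partitions leaves the extras idle, which is why partition count caps a group's parallelism.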
What is at-most-once delivery semantics?
Messages are delivered zero or one time; they are never delivered more than once but may be lost.
What is at-least-once delivery semantics?
Messages are delivered one or more times; they are not lost but may be processed more than once.
What is exactly-once processing semantics (conceptually)?
Guaranteeing that each message’s effect is applied logically once, even if the underlying system uses retries or duplicates.
Why is exactly-once processing hard to achieve in practice?
It often requires coordination across storage, compute, and sinks, idempotent operations, or transactional guarantees across systems.
Why is at-least-once delivery commonly used in streaming pipelines?
It prioritizes durability and correctness, accepting duplicates that can be handled with idempotent logic or deduplication.
What is idempotent processing in streaming?
Designing consumers so that processing the same message multiple times has the same effect as processing it once.
What techniques help implement idempotent processing?
Using unique event IDs and upserts, tracking processed offsets or IDs, and designing sinks to ignore duplicates based on keys.
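The upsert-by-ID technique can be shown with a toy in-memory sink (the event shape and field names are illustrative). Because the sink is keyed on a unique event ID, redelivering the same event under at-least-once semantics has no extra effect.

```python
sink: dict = {}

def process(event: dict) -> None:
    # Upsert keyed by the unique event ID: reprocessing the same event
    # rewrites the same row instead of creating a duplicate.
    sink[event["id"]] = {"value": event["value"]}

process({"id": "e1", "value": 10})
process({"id": "e1", "value": 10})  # at-least-once redelivery: harmless
assert len(sink) == 1
```

The same idea maps onto real sinks as `INSERT ... ON CONFLICT DO UPDATE` in a database or a put by primary key in a key-value store.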
What is event time?
The time at which an event actually occurred, as recorded in the event payload.
What is processing time?
The time at which an event is processed by the system, which may lag behind event time due to delays or reordering.
Why is distinguishing event time and processing time important?
Because analysis and windows should usually be based on when events happened, not when they were processed, especially with late data.
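The distinction is easy to see with a toy hourly bucketing function (timestamps here are made up for illustration): an event that occurred just before the hour but arrived just after it belongs in the earlier bucket.

```python
from datetime import datetime

def hour_bucket(ts: datetime) -> datetime:
    # Truncate a timestamp to the start of its hour.
    return ts.replace(minute=0, second=0, microsecond=0)

event_time = datetime(2024, 1, 1, 11, 59, 58)       # when it happened
processing_time = datetime(2024, 1, 1, 12, 0, 5)    # when it arrived, 7s later

# Bucketing on event time keeps the event in the 11:00 hour it belongs to;
# bucketing on processing time would wrongly shift it into the 12:00 hour.
assert hour_bucket(event_time) == datetime(2024, 1, 1, 11, 0)
assert hour_bucket(processing_time) == datetime(2024, 1, 1, 12, 0)
```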
What is late-arriving data in streaming systems?
Events that arrive after the system has already processed or closed the time window corresponding to their event time.
What is a watermark in stream processing?
A marker indicating that the system believes it has seen all events up to a certain event time, used to decide when to close windows.
Why are watermarks used?
To balance waiting for late data against providing timely results by defining when windows can be considered complete.
What is a window in streaming analytics?
A finite time range over which events are grouped for aggregation, such as 5-minute or hourly windows.
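Windows and watermarks can be tied together in a short sketch. This assumes 5-minute tumbling windows and a watermark that trails the maximum observed event time by a fixed allowed lateness; the constants and event shape are illustrative, and real engines track far more state.

```python
from collections import defaultdict

WINDOW = 300      # tumbling window size in seconds (5 minutes)
LATENESS = 60     # watermark trails max event time by 60s (assumed bound)

windows = defaultdict(list)   # window start time -> buffered events
max_event_time = 0

def on_event(event_time: int, payload):
    """Buffer an event; return any windows the watermark now lets us close."""
    global max_event_time
    windows[event_time // WINDOW * WINDOW].append(payload)
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - LATENESS
    # A window is complete once its end falls at or before the watermark.
    done = [start for start in windows if start + WINDOW <= watermark]
    return [(start, windows.pop(start)) for start in sorted(done)]

on_event(10, "a")              # opens window [0, 300); watermark too low to close it
closed = on_event(450, "b")    # watermark advances to 390, closing [0, 300)
assert closed == [(0, ["a"])]
```

Events later than the assumed lateness bound would arrive after their window has closed; engines then either drop them or emit corrections, which is the trade-off watermarks encode.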