What is streaming data in the context of data engineering?
A continuous flow of records generated over time, often as events such as clicks, sensor readings, or log entries.
How does streaming processing differ from batch processing?
Streaming processes data continuously as it arrives with low latency, while batch processes finite chunks at scheduled intervals.
What is an event in an event-driven system?
A record representing something that happened at a specific time, often including a key, timestamp, and payload.
What is a message broker or log-based streaming system?
A system that accepts, stores, and delivers ordered streams of messages or events to consumers, often partitioned for scalability.
What is a topic in a streaming system?
A named stream or category of messages that producers write to and consumers read from.
What is a partition in a topic?
An ordered, append-only sequence of messages that forms a shard of a topic for parallelism and scaling.
Why are partitions used in streaming systems?
To distribute load across brokers and consumers and enable parallel reads and writes for scalability.
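The key-to-partition mapping can be sketched in a few lines. This is an illustrative hash-mod scheme, not the hashing any particular broker actually uses; the point is that a stable hash keeps all events for a key in one partition, preserving per-key ordering.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    # Stable hash so the same key always maps to the same partition.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Events with the same key always land in the same partition,
# so consumers see them in the order they were written.
p = partition_for("user-42", 6)
assert p == partition_for("user-42", 6)
assert 0 <= p < 6
```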
What is an offset in a partition?
A monotonically increasing position that uniquely identifies a message’s location within a partition.
What is a consumer group?
A set of consumers that coordinate to share the partitions of a topic so that each partition is consumed by only one member at a time.
Why are consumer groups useful?
They allow horizontal scaling of consumption while providing a way to process each message once per group.
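A minimal sketch of how a group might split partitions among its members, assuming a simple static round-robin; real systems do this dynamically through a coordinator that rebalances when members join or leave.

```python
def assign_partitions(partitions, consumers):
    """Round-robin each partition to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Six partitions shared by a two-member group: each partition is
# owned by exactly one consumer, so the group reads each message once.
assignment = assign_partitions(range(6), ["c1", "c2"])
```

Adding a third consumer to the group shrinks each member's share; adding more consumers than partitions leaves the extras idle, which is why partition count caps a group's parallelism.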
What is at-most-once delivery semantics?
Messages are delivered zero or one time; they are never delivered more than once but may be lost.
What is at-least-once delivery semantics?
Messages are delivered one or more times; they are not lost but may be processed more than once.
What is exactly-once processing semantics (conceptually)?
Guaranteeing that each message’s effect is applied logically once, even if the underlying system uses retries or duplicates.
Why is exactly-once processing hard to achieve in practice?
It often requires coordination across storage, compute, and sinks, idempotent operations, or transactional guarantees across systems.
Why is at-least-once delivery commonly used in streaming pipelines?
It prioritizes durability and correctness, accepting duplicates that can be handled with idempotent logic or deduplication.
What is idempotent processing in streaming?
Designing consumers so that processing the same message multiple times has the same effect as processing it once.
What techniques help implement idempotent processing?
Using unique event IDs and upserts, tracking processed offsets or IDs, and designing sinks to ignore duplicates based on keys.
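The upsert-by-ID technique can be shown with a toy in-memory sink (the event shape and field names are illustrative). Because the sink is keyed on a unique event ID, redelivering the same event under at-least-once semantics has no extra effect.

```python
sink: dict = {}

def process(event: dict) -> None:
    # Upsert keyed by the unique event ID: reprocessing the same event
    # rewrites the same row instead of creating a duplicate.
    sink[event["id"]] = {"value": event["value"]}

process({"id": "e1", "value": 10})
process({"id": "e1", "value": 10})  # at-least-once redelivery: harmless
assert len(sink) == 1
```

The same idea maps onto real sinks as `INSERT ... ON CONFLICT DO UPDATE` in a database or a put by primary key in a key-value store.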
What is event time?
The time at which an event actually occurred, as recorded in the event payload.
What is processing time?
The time at which an event is processed by the system, which may lag behind event time due to delays or reordering.
Why is distinguishing event time and processing time important?
Because analysis and windows should usually be based on when events happened, not when they were processed, especially with late data.
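The distinction is easy to see with a toy hourly bucketing function (timestamps here are made up for illustration): an event that occurred just before the hour but arrived just after it belongs in the earlier bucket.

```python
from datetime import datetime

def hour_bucket(ts: datetime) -> datetime:
    # Truncate a timestamp to the start of its hour.
    return ts.replace(minute=0, second=0, microsecond=0)

event_time = datetime(2024, 1, 1, 11, 59, 58)       # when it happened
processing_time = datetime(2024, 1, 1, 12, 0, 5)    # when it arrived, 7s later

# Bucketing on event time keeps the event in the 11:00 hour it belongs to;
# bucketing on processing time would wrongly shift it into the 12:00 hour.
assert hour_bucket(event_time) == datetime(2024, 1, 1, 11, 0)
assert hour_bucket(processing_time) == datetime(2024, 1, 1, 12, 0)
```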
What is late-arriving data in streaming systems?
Events that arrive after the system has already processed or closed the time window corresponding to their event time.
What is a watermark in stream processing?
A marker indicating that the system believes it has seen all events up to a certain event time, used to decide when to close windows.
Why are watermarks used?
To balance waiting for late data against providing timely results by defining when windows can be considered complete.
What is a window in streaming analytics?
A finite time range over which events are grouped for aggregation, such as 5-minute or hourly windows.
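Windows and watermarks can be tied together in a short sketch. This assumes 5-minute tumbling windows and a watermark that trails the maximum observed event time by a fixed allowed lateness; the constants and event shape are illustrative, and real engines track far more state.

```python
from collections import defaultdict

WINDOW = 300      # tumbling window size in seconds (5 minutes)
LATENESS = 60     # watermark trails max event time by 60s (assumed bound)

windows = defaultdict(list)   # window start time -> buffered events
max_event_time = 0

def on_event(event_time: int, payload):
    """Buffer an event; return any windows the watermark now lets us close."""
    global max_event_time
    windows[event_time // WINDOW * WINDOW].append(payload)
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - LATENESS
    # A window is complete once its end falls at or before the watermark.
    done = [start for start in windows if start + WINDOW <= watermark]
    return [(start, windows.pop(start)) for start in sorted(done)]

on_event(10, "a")              # opens window [0, 300); watermark too low to close it
closed = on_event(450, "b")    # watermark advances to 390, closing [0, 300)
assert closed == [(0, ["a"])]
```

Events later than the assumed lateness bound would arrive after their window has closed; engines then either drop them or emit corrections, which is the trade-off watermarks encode.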