Data streaming
processing continuously generated data in near-real time so organizations can react instantly to changes
2 reasons data streaming is important in modern big data systems
5 common use cases
Approximations
Only a portion of the dataset can be kept in memory, so when the data volume is too large it must be approximated
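One widely used approximation under a fixed memory budget is reservoir sampling, which keeps a uniform random sample of a stream using constant memory. A minimal sketch (the stream contents and sample size here are illustrative):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown
    length, using O(k) memory regardless of how many items arrive."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), k=10)
print(sample)  # 10 items drawn uniformly from the million-item stream
```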
3 types of Data Streaming Models
4 main layers of streaming architecture (CADM)
Collection tier
Analysis tier
Data Access tier
Message queuing tier (sits between the collection and analysis tiers)
2 other optional layers of streaming architecture
In-memory storage (supports analysis)
Long-term storage (keeps data for future batch processing)
What is collection tier made of?
Multiple edge servers that receive data from external sources (normally TCP/IP-based, often over HTTP, with payloads frequently in JSON format)
Producer-broker-consumer concept
Producer is collection tier
Broker is messaging queue (many across nodes)
Consumer is Analysis tier
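The producer-broker-consumer flow can be sketched with an in-process queue standing in for the message queuing tier (event fields and sizes here are illustrative, not from any specific broker):

```python
import queue
import threading

broker = queue.Queue(maxsize=100)   # stand-in for the message queuing tier
processed = []

def producer():
    """Collection tier: pushes incoming events onto the broker."""
    for i in range(5):
        broker.put({"event_id": i, "payload": f"reading-{i}"})
    broker.put(None)                # sentinel: no more events

def consumer():
    """Analysis tier: pulls events off the broker and processes them."""
    while True:
        event = broker.get()
        if event is None:
            break
        processed.append(event["event_id"])

t = threading.Thread(target=producer)
t.start()
consumer()
t.join()
print(processed)  # [0, 1, 2, 3, 4]
```

In a real system the broker is a separate service replicated across nodes, so producers and consumers can scale and fail independently.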
3 main types of message delivery semantics
At-most-once, at-least-once, exactly-once
Analysis tier
Heart of the streaming architecture; adopts the continuous query model and uses algorithms designed specifically for streaming problems
Continuous query model
A query that is issued once and then continuously executed against incoming data, often maintaining state (results are regularly pushed to the client)
Security system - query filters sensor data for human movement
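The security-system example can be sketched as a generator: the query is defined once and then evaluated against every event as it arrives (sensor names and the threshold are made up for illustration):

```python
def movement_query(sensor_stream, threshold=0.8):
    """A 'continuous query': issued once, then run against each event
    as it arrives, pushing matches to the client."""
    for reading in sensor_stream:
        if reading["motion_score"] > threshold:   # filter: human movement
            yield reading

# Simulated sensor feed; in a real system this stream is unbounded
feed = [
    {"sensor": "door",  "motion_score": 0.2},
    {"sensor": "hall",  "motion_score": 0.9},
    {"sensor": "attic", "motion_score": 0.95},
]

alerts = list(movement_query(iter(feed)))
print([a["sensor"] for a in alerts])  # ['hall', 'attic']
```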
Windowing
Carry out analyses on a per-window basis instead of a simple per-item basis
Sliding windows
define the analysis interval based on time
Fixed windows
analyze last 5 minutes of data every 5 minutes
Overlapping windows
analyze last 5 minutes of data every 2 minutes
Sampling windows
analyze last 2 minutes of data every 5 minutes
taking samples, not all of it!
data-driven windows
process only when session is active, and x time after it ends
window lengths are not known ahead of time
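The time-based window types above differ only in how window length relates to the firing period, which a small sketch can show (timestamps and values are illustrative):

```python
def windows(events, length, period):
    """Group (timestamp, value) events into time-based windows.
    length == period -> fixed windows
    length >  period -> overlapping windows
    length <  period -> sampling windows (gaps between samples)"""
    end = max(t for t, _ in events)
    start = 0
    out = []
    while start <= end:
        out.append([v for t, v in events if start <= t < start + length])
        start += period
    return out

events = [(0, "a"), (1, "b"), (3, "c"), (4, "d"), (6, "e")]

print(windows(events, length=2, period=2))  # fixed: [['a', 'b'], ['c'], ['d'], ['e']]
print(windows(events, length=4, period=2))  # overlapping: each event may land in two windows
```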
Event time vs Stream time
Event time - when event actually occurs (gold standard of windowing)
stream time - when event enters the streaming system
The difference between them is called skew
Watermark
Captures progress of event time completeness as processing time progresses
Perfect watermark
guarantees that no data older than the watermark will ever arrive
Heuristic watermark
estimates progress based on information available about the input stream (faster, but late data may still arrive)
Allowed Lateness
policy for accepting late data, since it is common
ex: accept data up to 5 minutes late; anything later is discarded
note - the higher the tolerance, the longer data must be buffered
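A toy sketch of a heuristic watermark plus allowed lateness for a single window (the window bounds, lateness value, and event stream are all illustrative):

```python
ALLOWED_LATENESS = 5   # accept data up to 5 time units late

def process(stream):
    """Buffer events for one window [0, 60); a heuristic watermark (the
    max event time seen so far) decides when arriving data counts as
    late. Late-but-allowed events are kept; anything later is dropped."""
    window_end = 60
    kept, discarded = [], []
    watermark = 0
    for event_time, value in stream:            # (event time, payload) in arrival order
        watermark = max(watermark, event_time)  # heuristic watermark
        if event_time < window_end:             # event belongs to this window
            if watermark <= window_end + ALLOWED_LATENESS:
                kept.append(value)              # on time, or tolerably late
            else:
                discarded.append(value)         # too late: window already finalized
    return kept, discarded

stream = [(10, "a"), (55, "b"), (62, "c"), (58, "d"), (70, "e"), (59, "f")]
kept, discarded = process(stream)
print(kept, discarded)  # ['a', 'b', 'd'] ['f']
```

Here "d" arrives after the watermark passes the window end but within the lateness bound, so it is kept; "f" arrives after the bound and is discarded.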
Accumulation strategy
defines how results from successive firings of the same window relate (e.g., discarding, accumulating, or accumulating and retracting)
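Two commonly discussed accumulation modes, discarding and accumulating, can be sketched over the partial sums a window produces each time it fires (the pane values are illustrative):

```python
def panes(results, mode):
    """Given per-firing partial sums for one window, emit what the
    window reports at each firing under two accumulation modes."""
    emitted, running = [], 0
    for r in results:
        if mode == "discarding":
            emitted.append(r)        # each firing reports only the new data
        elif mode == "accumulating":
            running += r
            emitted.append(running)  # each firing reports the running total
    return emitted

pane_sums = [3, 4, 2]   # e.g., an on-time firing followed by two late firings
print(panes(pane_sums, "discarding"))    # [3, 4, 2]
print(panes(pane_sums, "accumulating"))  # [3, 7, 9]
```

The choice matters downstream: accumulating panes can overwrite earlier results, while discarding panes must be summed by the consumer.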