Windowing in Stream Processing: Timing is Everything
In stream processing (Kafka Streams, Flink, Spark Streaming), you rarely want to aggregate data from the beginning of time. Instead, you want to perform calculations over specific time intervals. This is achieved through Windowing.
1. Tumbling Windows (Fixed-Size, No Overlap)
A Tumbling Window is a fixed-size, non-overlapping, and contiguous time interval.
- How it works: If you have a 1-minute tumbling window, data from
12:00:00to12:00:59lands in one bucket, and12:01:00starts a completely new bucket. - Best for: Reporting hourly sales, counting daily active users, or any discrete periodic reporting.
2. Sliding Windows (Overlapping)
A Sliding Window has a fixed size but "slides" forward by a smaller increment (the "slide").
- How it works: A 1-hour window with a 5-minute slide. You'll get a result every 5 minutes covering the last 60 minutes of data.
- Best for: Calculating a "Moving Average" of stock prices or detecting a spike in error rates over the last 15 minutes updated every minute.
3. Session Windows (Activity-Based)
Unlike fixed windows, Session Windows are defined by periods of activity followed by periods of inactivity (the "gap").
- How it works: A session window starts when a user event arrives. It stays open as long as new events arrive within the "session gap" (e.g., 30 minutes). If no events arrive for 30 minutes, the window closes.
- Best for: Website session analysis (tracking what a user does in one visit) or grouping together related sensor readings.
4. Hopping Windows (Alias for Sliding)
In some systems (like Kafka Streams), Sliding Windows are called Hopping Windows when the "hop" (slide) is larger than the window size, though this is rare in practice.
5. Handling Late Data: Watermarks
In a distributed system, data can arrive out of order. Watermarks tell the stream processor how long to wait for late-arriving data before "closing" a window and emitting the final result.
Summary
Choosing the right window type is critical for the accuracy of your real-time analytics. Use Tumbling for discrete periods, Sliding for continuous monitoring, and Session for user-centric activity tracking.
