When building real-time monitoring systems, the choice between general-purpose stream processing and purpose-built solutions isn't just technical; it's strategic. We evaluated Apache Flink and found that while it excels at data pipelines, observability workloads have fundamentally different requirements: trend detection matters more than exact counts. That insight motivated us to build a purpose-built observability platform. In this post, we cover the platform's key features, including its custom window operators, and discuss the design considerations behind them.
The Core Insight: Different Goals, Different Designs
General-purpose stream processing prioritizes correctness guarantees with configurable latency, a perfect fit for ETL pipelines where eventual consistency is acceptable and exact results matter. Observability systems need real-time responsiveness with bounded memory; for alerting, correlating database latency spikes with concurrent API timeouts and cache evictions in real time can prevent customer-impacting escalations.
1. Watermark-Free Architecture
This is where observability requirements diverge most fundamentally from data pipelines.
The Flink Approach
Watermarks signal event-time progress, declaring that all events up to time T should have arrived. This ensures correctness but adds latency, because operators wait for the watermark before evaluating windows.
Flink Watermark Model (Event-time Driven)
Event Stream ──▶ [Buffer] ──▶ [Wait for watermark T] ──▶ [Emit Window]
│
└── Late events trigger retractions
Latency ≥ configured tolerance
The Chiron Approach
By dividing windows into fine-grained rolling sub-windows, Chiron processes early signals instantly while still accepting delayed events into prior sub-windows. This delivers real-time results without sacrificing accuracy, ideal for observability, where detecting correlated anomalies (e.g., API latency rising as DB I/O stalls) minutes earlier can prevent cascading outages.
Chiron Time-Aligned Buckets:
Rolling Sub-Window Execution (No Watermarks)
Wall Clock: 10:00:00 ── 10:00:15 ── 10:00:30 ── 10:00:45
Buckets: [B1]───────[B2]───────[B3]───────[B4]
Alignment: Always anchored to :00, :15, :30, :45
Event Stream ─▶ [Update Buckets Immediately] ─▶ [Slide Window Forward]
└─ Late events merge into prior sub-windows
Latency: Instant (No watermark coordination, continuous recompute)
Impact: We do not wait for every event to arrive. In observability, early detection matters more than perfect counts. We flag trends (for example, "CPU is spiking"). Late arrivals still update the windows and can trigger alerts, but a minute head start often prevents outages.
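The rolling sub-window idea above can be sketched in a few lines. This is a minimal illustration, not Chiron's actual implementation: the class name, the 15-second bucket width, and the sum aggregation are all assumptions made for the example.

```python
from collections import defaultdict

BUCKET_SECONDS = 15  # sub-window width; illustrative choice

class RollingWindow:
    """Sum over a sliding window built from time-aligned sub-window buckets.

    Events update their bucket immediately; a late event simply lands in an
    older bucket, so no watermark coordination or buffering is needed.
    """
    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.buckets = defaultdict(float)  # bucket start -> aggregated value

    def _bucket_start(self, ts):
        # Anchor buckets to wall-clock boundaries (:00, :15, :30, :45)
        return ts - (ts % BUCKET_SECONDS)

    def add(self, ts, value):
        # Update the matching bucket immediately, even for late events
        self.buckets[self._bucket_start(ts)] += value

    def query(self, now):
        # Slide forward: drop buckets older than the window, sum the rest
        cutoff = self._bucket_start(now) - self.window_seconds
        self.buckets = defaultdict(
            float, {s: v for s, v in self.buckets.items() if s > cutoff})
        return sum(self.buckets.values())

w = RollingWindow(window_seconds=60)
w.add(1000, 5.0)        # on-time event
w.add(1010, 3.0)
w.add(1002, 2.0)        # "late" event merges into an earlier bucket
print(w.query(now=1015))  # 10.0 -- all events counted, no waiting
```

Note how the late event at timestamp 1002 is folded into its original bucket rather than triggering a retraction: the next query simply reflects the updated total.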
2. Adaptive Window Granularity
The Challenge: Production observability systems run thousands of monitoring rules simultaneously, from 30-second high-frequency checks to 7-day trend analysis. A general-purpose streaming platform like Flink requires manual tuning of slide intervals for each window size: 15-second slides for 30-second windows, 1-minute slides for 1-hour windows, 1-hour slides for 1-week windows. This configuration overhead scales linearly with rule count.
Our Approach: Chiron automatically optimizes internal bucket granularity based on total time span. A 30-minute window uses 15-second buckets, a 2-hour window switches to 1-minute buckets, and a 1-week window uses 1-hour buckets.
Adaptive Bucket Selection by Window Span
Fine-grained →────┬────→ Coarse-grained
│
More precision │ Less memory usage
│
Automatic adjustment = zero config
This matters because in production environments with thousands of rules across multi-tenant deployments, automatic optimization eliminates manual tuning while ensuring predictable memory usage. For a 1-week window, 1-hour buckets need only 168 entries instead of the 40,320 that fixed 15-second buckets would require, translating directly into infrastructure cost savings at scale.
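The span-to-granularity mapping can be sketched as a simple lookup. The thresholds below are assumptions; the post only gives three example pairings (30 minutes → 15-second buckets, 2 hours → 1-minute buckets, 1 week → 1-hour buckets), so the cutoffs in between are illustrative.

```python
def bucket_seconds(window_seconds: int) -> int:
    """Pick an internal bucket width from the total window span.

    Thresholds are illustrative guesses consistent with the examples in
    the text, not Chiron's actual tuning table.
    """
    if window_seconds <= 60 * 60:        # up to 1 hour: fine-grained
        return 15
    if window_seconds <= 24 * 60 * 60:   # up to 1 day: medium granularity
        return 60
    return 60 * 60                       # multi-day trends: coarse buckets

# Bucket count (and so memory) stays bounded regardless of window span:
for span in (30 * 60, 2 * 60 * 60, 7 * 24 * 60 * 60):
    b = bucket_seconds(span)
    print(span, b, span // b)  # window span, bucket width, bucket count
```

The point of the mapping is the last column: whether the window is 30 minutes or a week, the number of live buckets stays in the low hundreds, which is what makes memory usage predictable across thousands of rules.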
3. Observability Native Functionalities
Generic aggregations don't understand monitoring patterns. Prometheus-style counters reset on service restart. Different percentiles need different precision. These domain-specific behaviors require custom logic in general-purpose frameworks, increasing development time and error risk.
Counter Reset Detection
Chiron's rate operator automatically handles counter resets by tracking min/max values per bucket. When calculating a rate, it takes the difference between the maximum in the last bucket and the minimum in the first bucket; if that difference is negative, a counter reset occurred between the event timestamps. This detects service restarts without custom logic.
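A minimal sketch of this reset-aware rate calculation, under stated assumptions: `buckets` is a list of per-bucket `(min, max)` counter samples, oldest first, and a counter is assumed to restart near zero. The recovery path (re-accumulating growth bucket by bucket) is one plausible way to handle a detected reset, not necessarily Chiron's exact method.

```python
def rate(buckets, interval_seconds):
    """Per-second rate over a window of monotonically increasing counters.

    buckets: list of (min, max) pairs per sub-window, oldest first.
    The basic rate is (last max - first min) / interval; a negative delta
    signals a counter reset, handled by summing per-bucket growth instead.
    """
    delta = buckets[-1][1] - buckets[0][0]
    if delta < 0:
        # Counter reset detected: rebuild total growth, treating each
        # drop between buckets as a restart from (near) zero.
        delta = 0.0
        prev_max = buckets[0][0]
        for lo, hi in buckets:
            if lo < prev_max:       # reset occurred before this bucket
                delta += hi         # counter restarted near zero
            else:
                delta += hi - prev_max
            prev_max = hi
    return delta / interval_seconds

# Counter climbs from 200 to 500, service restarts, counter climbs to 120:
print(rate([(200, 500), (0, 120)], interval_seconds=60))  # 7.0, not negative
```

Without the reset check, the naive delta here would be 120 − 200 = −80, which a dashboard would render as a false "throughput drop"; the reset-aware path recovers the true growth of 300 + 120 = 420 over the minute.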
Adaptive Compression
Chiron adjusts TDigest compression based on the query: P50 uses lower compression (50x), which is sufficient near the dense median, while P99 uses higher compression (300x) because tail values are sparse. Typical relative error is ~1-2% for P50 and ~2-5% for P99, depending on the data distribution.
Memory: O(δ) clusters
Each cluster stores: Centroid | Weight | Mean | Variance
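One way to realize query-driven compression is a mapping from the requested quantile to a TDigest compression parameter. The function below is purely illustrative: only the two endpoints (50x at P50, ~300x at P99) come from the text, and the linear interpolation between them is an assumption.

```python
def tdigest_compression(quantile: float) -> int:
    """Map a requested quantile to a TDigest compression parameter.

    Illustrative: anchored at 50x for the median and ~300x near the tails,
    with linear interpolation in between (an assumption, not Chiron's rule).
    """
    tail_distance = abs(quantile - 0.5) / 0.5  # 0 at the median, 1 at extremes
    return int(50 + tail_distance * 255)

print(tdigest_compression(0.50))  # 50  -> cheap, accurate enough mid-range
print(tdigest_compression(0.99))  # 299 -> ~300, more clusters for the tail
```

Since memory is O(δ) in the number of clusters, this lets median queries stay cheap while tail queries pay for extra precision only when they ask for it.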
Probabilistic Algorithms for Scale
Observability queries often involve massive cardinality: tracking millions of unique users, endpoints, or trace IDs. Chiron uses probabilistic data structures optimized for monitoring, such as HyperLogLog for unique counts: ~1% error with kilobytes of memory versus gigabytes for exact counting.
HyperLogLog Cardinality Estimation
Estimate: αₘ × m² / Σ(2^(-M[j]))
where:
m = 2^lgK → number of registers
M[j] = max rank (position of the first 1-bit) seen by register j
αₘ ≈ 0.7213 / (1 + 1.079/m)
Memory: O(2^lgK) bytes
Error: ≈ 1.04 / √m → lgK=12 ⇒ ~4 KB, ~1.6% error
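The estimator above can be demonstrated end to end with a toy HyperLogLog. This sketch uses SHA-256 as the hash and omits the small- and large-range corrections of the full algorithm; it is a teaching example, not production code.

```python
import hashlib

LG_K = 12
M = 1 << LG_K  # m = 2^lgK = 4096 registers, ~4 KB at one byte each

def _hash64(item: str) -> int:
    # 64-bit hash derived from SHA-256 (any good 64-bit hash works)
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")

def hll_add(registers, item):
    h = _hash64(item)
    j = h & (M - 1)              # low lgK bits pick the register
    w = h >> LG_K                # remaining 52 bits give the rank
    rank = (64 - LG_K) - w.bit_length() + 1  # position of the first 1-bit
    registers[j] = max(registers[j], rank)

def hll_estimate(registers):
    alpha = 0.7213 / (1 + 1.079 / M)
    return alpha * M * M / sum(2.0 ** -r for r in registers)

registers = [0] * M
for i in range(100_000):
    hll_add(registers, f"user-{i}")

est = hll_estimate(registers)
print(abs(est - 100_000) / 100_000 < 0.05)  # True: within a few percent
```

With lgK=12 the structure is a fixed 4 KB regardless of how many distinct items flow through it, which is exactly the bounded-memory property the section describes.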
Impact: Operators work correctly out of the box, no custom scripts or post-processing needed. Counter resets during service restarts no longer create false "throughput drops." Percentiles auto-tune for precision, so P99 latency spikes are detected without noise from P50 trends. HyperLogLog-based unique counting lets teams instantly see "active users per endpoint" or "distinct error codes per service" in real time using bounded memory. This domain expertise eliminates entire classes of operational bugs.
What This Means in Practice
Why Purpose-Built Wins
As observability scales to millions of entities and heterogeneous signals, a purpose-built engine like Chiron becomes essential: engineered for sub-second detection, adaptive accuracy, and domain semantics that understand counters, percentiles, and traces natively.
The result: a platform that is not just optimized for monitoring, but fundamentally designed to deliver the fastest, most reliable context for resolving outages today.
Chiron is a streaming-first observability platform built from the ground up to deliver the root cause, not just another alert, helping teams dramatically reduce MTTR while keeping TCO in check. Book a demo to see how Chiron strengthens SLA reliability across real-time, mission-critical systems.
About the author:
Devansh Saxena is a Founding Engineer at Chiron, building observability-native stream processing systems. Previously, he worked at Yugabyte and Nutanix, bringing deep expertise in distributed systems and database technologies to Chiron's platform.