alertsprocessor: Real-Time Alert Evaluation in OpenTelemetry
Hey everyone! Today, we're diving deep into an exciting new component proposed for OpenTelemetry Collector Contrib: the alertsprocessor. This processor is designed to level up your monitoring and alerting by evaluating alerting rules in real time. Let's break down what it is, why it's a game-changer, and how you can start using it.
What Is the alertsprocessor?
The alertsprocessor is built for real-time alert evaluation using a sliding-window approach. Think of it as a vigilant watchdog, constantly monitoring incoming telemetry over a short, configurable time frame (typically around 5 seconds). This rapid assessment allows for immediate detection of anomalies and issues, significantly reducing Mean Time to Detect (MTTD). The processor takes metrics, logs, and traces as inputs and emits synthetic metrics that describe the state and transitions of alerts. For immediate notifications, it also supports an optional HTTP notifier that dispatches alerts to a webhook endpoint.
Key Features and Benefits
- Real-Time Evaluation: The processor operates on a sliding window, continuously monitoring telemetry data and triggering alerts as soon as anomalies are detected. Immediate detection is critical for maintaining system health and minimizing downtime.
- Sliding Window Analysis: The sliding window keeps only the most recent data in scope, making short-term analysis efficient. You catch issues as they happen rather than waiting for longer batch processes, and the view of system performance stays accurate and up to date.
- Multi-Telemetry Input: It handles metrics, logs, and traces, so alerts can be triggered from any signal type. This holistic view is crucial for identifying complex issues that span multiple telemetry domains.
- Synthetic Metrics: It emits synthetic metrics that describe alert states and transitions, giving you a clear, actionable view of system health. These metrics integrate easily into existing monitoring dashboards and alerting systems.
- HTTP Notifier: The optional HTTP notifier dispatches instant alerts via webhooks, so the right people are notified the moment an issue is detected, whether through Alertmanager or platforms such as PagerDuty and Slack.
- Fast Guardrails and SLO Checks: Quickly establish guardrails and Service Level Objective (SLO) checks close to the data source. For example, you can set up rules like “Alert if there are ≥5 ERROR timeouts in 5s” or “Alert if ≥3 spans exceed 500ms by service/span.”
- Horizontal Scaling and Isolation: Pair it with the routingconnector for clean horizontal scaling. Running one alertsprocessor per routed group (e.g., team, namespace, service) keeps alert engines isolated and resource usage efficient, which is ideal for large environments where multiple teams or services share the same infrastructure.
- Kubernetes-Friendly Operation: Inline rules in the Collector config make it Kubernetes-friendly, eliminating the need for a PV or ConfigMap just to mount rules. Deployment and management stay simple because rules live directly in the Collector configuration.
- Improved MTTD: The primary goal of the alertsprocessor is to reduce Mean Time to Detect (MTTD) by enabling faster detection and quicker recovery from failures.
Use Cases
The alertsprocessor shines in several key scenarios:
- Fast Guardrails and SLO Checks: Imagine a guardrail that alerts you the moment there are five or more error timeouts within a 5-second window, or whenever three or more spans exceed 500ms for a particular service or span. The alertsprocessor makes these checks fast and straightforward.
- Clean Horizontal Scaling & Isolation: By pairing the alertsprocessor with the routingconnector, you can achieve clean horizontal scaling and isolation. Run one alertsprocessor for each routed group, such as a team, namespace, or service, and your alert engines stay segregated and manageable.
- Kubernetes-Friendly Operation: The processor is designed to play nicely with Kubernetes. You can define rules inline within your Collector configuration, which means no more juggling Persistent Volumes (PVs) or ConfigMaps just to mount your rules.
Resource Considerations
Keep in mind that sliding_window.duration directly impacts resource usage: a larger window means more CPU and RAM consumption, so keep the window as small as practical for your use case.
Example Configurations
Let's dive into some practical examples to see how you can configure the alertsprocessor.
Minimal Inline-Rules Setup
This setup demonstrates a basic configuration where rules are defined directly within the processor configuration. It's perfect for getting started quickly and testing out the processor's capabilities.
processors:
  alertsprocessor:
    sliding_window:
      duration: 5s        # ⚠ Larger window ⇒ more CPU/RAM
    evaluation:
      interval: 15s
      timeout: 10s
    statestore:
      instance_id: demo
      external_labels: { group: demo, source: collector }
    notifier:
      url: http://alertmanager:9093/api/v2/alerts
    rules:
      - id: high_error_logs
        name: HighErrorLogs
        signal: logs
        for: 0s
        labels: { severity: error }
        logs:
          severity_at_least: ERROR
          body_contains: "timeout"
          group_by: ["service.name"]
          count_threshold: 5
      - id: slow_spans
        name: SlowSpans
        signal: traces
        for: 5s
        labels: { severity: warning }
        traces:
          latency_ms_gt: 500
          status_not_ok: false
          group_by: ["service.name","span.name"]
          count_threshold: 3

service:
  pipelines:
    logs:   { receivers: [otlp], processors: [alertsprocessor], exporters: [debug] }
    traces: { receivers: [otlp], processors: [alertsprocessor], exporters: [debug] }
In this example, we define two rules:
- high_error_logs: Triggers when five or more log entries with severity ERROR or higher contain the word "timeout" within the 5-second window, grouped by service.name.
- slow_spans: Triggers when three or more spans exceed 500ms of latency within the 5-second window, grouped by service.name and span.name. With status_not_ok set to false, span status is not used as an additional filter here.
Multi-Group Topology with routingconnector
For more complex environments, consider using the routingconnector to segregate alert engines. This is particularly useful when you want to apply different rules to different groups or services.
connectors:
  routing:
    default_pipelines:
      logs: [logs/default]
      traces: [traces/default]
    table:
      - statement: route() where resource.attributes["k8s.namespace.name"] == "payments"
        pipelines:
          logs: [logs/payments]
          traces: [traces/payments]

processors:
  alertsprocessor/payments:
    sliding_window: { duration: 5s }
    evaluation: { interval: 15s }
    statestore:
      instance_id: payments-engine
      external_labels: { group: payments, source: collector }
    rules:
      # rules for the "payments" group …

service:
  pipelines:
    logs/in:         { receivers: [otlp], exporters: [routing] }
    traces/in:       { receivers: [otlp], exporters: [routing] }
    logs/payments:   { receivers: [routing], processors: [alertsprocessor/payments], exporters: [debug] }
    traces/payments: { receivers: [routing], processors: [alertsprocessor/payments], exporters: [debug] }
Here, the routingconnector routes telemetry based on the k8s.namespace.name resource attribute. Telemetry from the payments namespace is routed to a dedicated alertsprocessor/payments instance, so you can define rules specific to the payments group and keep its alerts tailored to its requirements. To add more groups, extend the routing table and pipelines following the same pattern, as sketched below.
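For illustration, here is a minimal sketch of what a second routed group could look like. The checkout namespace, the logs/checkout and traces/checkout pipelines, and the alertsprocessor/checkout instance are hypothetical names that simply mirror the payments example above.

connectors:
  routing:
    table:
      - statement: route() where resource.attributes["k8s.namespace.name"] == "payments"
        pipelines:
          logs: [logs/payments]
          traces: [traces/payments]
      # hypothetical second group: route the "checkout" namespace to its own engine
      - statement: route() where resource.attributes["k8s.namespace.name"] == "checkout"
        pipelines:
          logs: [logs/checkout]
          traces: [traces/checkout]

processors:
  alertsprocessor/checkout:
    sliding_window: { duration: 5s }
    statestore:
      instance_id: checkout-engine
      external_labels: { group: checkout, source: collector }
    rules:
      # rules for the "checkout" group

service:
  pipelines:
    logs/checkout:   { receivers: [routing], processors: [alertsprocessor/checkout], exporters: [debug] }
    traces/checkout: { receivers: [routing], processors: [alertsprocessor/checkout], exporters: [debug] }

Each routed group gets its own rules, state, and external_labels, so a noisy group cannot affect another group's alert engine.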
Emitted Synthetic Metrics
The alertsprocessor emits several synthetic metrics that provide insight into alert states and transitions. These metrics are designed to be remote-write friendly, making them easy to integrate with existing monitoring systems.
- otel_alert_state{rule_id, signal, ...} = 0|1 (gauge): Indicates the current state of an alert rule. A value of 1 means the alert is active, while 0 means it is inactive.
- otel_alert_transitions_total{rule_id, from, to, signal, ...} (counter): Counts transitions between alert states, providing a historical view of alert activity.
- alertsprocessor_evaluation_duration_seconds (self-telemetry gauge): Measures the duration of the alert evaluation process itself, helping you monitor the performance of the alertsprocessor and spot bottlenecks.
These metrics can be used to build dashboards and alerts that provide real-time visibility into system health. For example, you can alert on the otel_alert_state metric to be notified when a critical rule fires, or use otel_alert_transitions_total to track how often rules flap and identify trends over time. A sketch of what that might look like on the Prometheus side follows.
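Assuming the synthetic series are remote-written into Prometheus, downstream rules over them might look like the sketch below. The group name, alert names, thresholds, and the for duration are illustrative assumptions; only the otel_alert_state and otel_alert_transitions_total series names and the rule_id label come from the proposal.

# prometheus-rules.yaml (hypothetical downstream rules built on the emitted series)
groups:
  - name: otel-alertsprocessor-demo
    rules:
      # Page when the high_error_logs rule from the minimal example is firing.
      - alert: OTelAlertActive
        expr: max by (rule_id, signal) (otel_alert_state{rule_id="high_error_logs"}) == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Alert rule {{ $labels.rule_id }} is active on signal {{ $labels.signal }}"
      # Flag flapping rules: more than 10 state transitions in 15 minutes.
      - alert: OTelAlertFlapping
        expr: sum by (rule_id) (increase(otel_alert_transitions_total[15m])) > 10
        labels:
          severity: warning
        annotations:
          summary: "Alert rule {{ $labels.rule_id }} transitioned more than 10 times in 15m"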
Telemetry Data Types Supported
The alertsprocessor is versatile and supports various telemetry data types:
Consumes:
- Logs: Evaluated against rules.
- Traces: Evaluated against rules.
- Metrics: Evaluated against rules.
Produces (via pipeline):
- Metrics: Synthetic alert state & transition series (see the export sketch below).
Side Load (optional):
- HTTP notifications to a configured endpoint.
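The proposal states that the synthetic series are produced via the pipeline, but exactly how they surface depends on the final implementation. One possible wiring, assuming the processor can also sit in a metrics pipeline and inject its synthetic series there, is sketched below; the prometheusremotewrite endpoint is a placeholder.

exporters:
  debug: {}
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push    # placeholder endpoint for the synthetic series

service:
  pipelines:
    # Logs and traces feed the alert rules...
    logs:    { receivers: [otlp], processors: [alertsprocessor], exporters: [debug] }
    traces:  { receivers: [otlp], processors: [alertsprocessor], exporters: [debug] }
    # ...and a metrics pipeline carries the synthetic alert series downstream.
    metrics: { receivers: [otlp], processors: [alertsprocessor], exporters: [prometheusremotewrite] }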
Behavioral Notes
It's worth noting a few behavioral nuances:
- Non-String Log Bodies: If a log record's Body is non-string, the processor logs a WARN and stringifies it for body_contains matching, in line with the OTel logs data model.
- Filtering and Grouping: Both resource attributes and record/span attributes can be used for filtering and group_by operations, providing flexible rule configurations (see the sketch after this list).
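To illustrate the second point, here is a hypothetical rule that groups on a resource attribute and a span attribute together. The rule id, the http.route attribute, and the thresholds are illustrative; only the field names already shown in the earlier examples are assumed.

rules:
  - id: slow_checkout_routes        # hypothetical rule for illustration
    name: SlowCheckoutRoutes
    signal: traces
    for: 5s
    labels: { severity: warning }
    traces:
      latency_ms_gt: 500
      status_not_ok: false
      # group by a resource attribute and a span attribute together
      group_by: ["k8s.namespace.name", "http.route"]
      count_threshold: 3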
Code Owner and Sponsor
The primary code owner for this component is @aarvee11. Currently, there is no specified sponsor for this project.
Additional Context and Future Work
Key Details
- Component Name: alertsprocessor (contrib)
- Stability: Development
- Rules Sources: Inline config under processors.<name>.rules (preferred) and/or file globs via rule_files.include.
- Cardinality Safeguards: Allowlist/blocklist and max-value-length trimming for labels on emitted metrics (a rough sketch of both items follows this list).
- Storm Control & Statestore: Initial scaffolding present; future enhancements may add adaptive cadence and external state sync.
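To make the rules-sources and cardinality items concrete, here is a rough sketch of what such configuration could look like. The rule_files.include key is named in the proposal, but the glob path and the label_limits block (allow, block, max_value_length) are hypothetical placeholders for whatever the final schema defines.

processors:
  alertsprocessor:
    # Inline rules (preferred), as in the earlier examples, can be combined
    # with rule files loaded by glob.
    rule_files:
      include: ["/etc/otelcol/alert-rules/*.yaml"]    # hypothetical path and glob
    # Hypothetical field names for the cardinality safeguards described above.
    label_limits:
      allow: ["service.name", "k8s.namespace.name"]   # labels allowed on emitted metrics
      block: ["user.id"]                              # labels to drop
      max_value_length: 120                           # trim longer label values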
Future Enhancements
The development team has exciting plans for the future, including:
- Pluggable State Backends / HA Sync: This will enhance the reliability and scalability of the processor by allowing it to use external state backends and synchronize state across multiple instances.
- Richer Notification Templating: More flexible and customizable notification templates will allow for more informative and actionable alerts.
Tip
If you're excited about this component, give the issue a 👍 on GitHub to help prioritize its development. And, of course, share your thoughts and context in the comments (avoiding bare +1 or me too replies) so we can all learn and improve together!
Conclusion
The alertsprocessor is shaping up to be a powerful addition to the OpenTelemetry ecosystem, offering real-time alert evaluation and synthetic metrics generation. Its versatility and Kubernetes-friendly design make it a strong fit for modern monitoring setups. Whether you're looking to implement fast guardrails, scale your alerting infrastructure, or simply improve your MTTD, the alertsprocessor has you covered. So go ahead, give it a try, and let us know what you think!