Troubleshooting OpenTelemetry Kube Stack Out-of-Order Samples
Introduction
This article addresses the common issue of "out of order samples" errors encountered when using the `opentelemetry-kube-stack` chart to export metrics to Amazon Managed Service for Prometheus (AMP). These errors typically appear when the OpenTelemetry Collector scrapes metrics from Kubernetes components such as kubelet, cadvisor, and probes. The sections below analyze the problem, walk through the likely causes and troubleshooting steps, and offer practical recommendations for configuring the OpenTelemetry Collector so that metric collection and export stay reliable.
The error "out of order sample" occurs when a time series database, like Prometheus, receives data points with timestamps that are earlier than the most recent data point it has stored for that particular series. This is a common issue in distributed systems where time synchronization and data delivery can be challenging. In the context of Kubernetes monitoring with OpenTelemetry, these errors can arise due to various reasons, including clock skew between nodes, inconsistent scrape intervals, misconfigured batch processing, or issues with the remote write endpoint. Understanding these potential causes is crucial for effective troubleshooting.
This article will guide you through the process of diagnosing and resolving these errors, ensuring that your metrics are accurately collected and exported to your monitoring backend. We will cover the common scenarios, configurations, and best practices to help you maintain a robust and reliable monitoring system. By the end of this guide, you should have a clear understanding of how to identify and fix out-of-order sample errors in your OpenTelemetry Kube Stack environment, enabling you to monitor your Kubernetes clusters effectively.
Problem Description
The primary issue is the occurrence of "out of order samples" errors when using the `opentelemetry-kube-stack` chart, specifically when exporting metrics scraped from kubelet, cadvisor, and probes to Amazon Managed Service for Prometheus (AMP). The errors manifest as HTTP 400 responses whose messages indicate that the timestamps of incoming samples are older than the previously recorded timestamps for the same metric series. AMP rejects these samples, which leads to gaps in monitoring data. The errors are observed across multiple metric series and endpoints, suggesting a systemic issue rather than an isolated incident.
These errors are particularly concerning because they can lead to incomplete or inaccurate monitoring data, which can hinder the ability to effectively diagnose and resolve issues in your Kubernetes environment. The sporadic nature of these errors can also make them difficult to troubleshoot, as they may not always be consistently reproducible. Therefore, a systematic approach to identifying and addressing the root cause is essential. The observed errors point to a fundamental issue in how the metrics are being collected, processed, or exported, necessitating a detailed investigation of the OpenTelemetry Collector configuration, scrape targets, and the interaction with the remote write endpoint.
The errors are not limited to a single metric or endpoint; they affect metrics from kubelet, cadvisor, and probes alike, which points to a general configuration or environmental issue rather than a problem with any particular metric. All affected scrape configs use the same 30s scrape interval, which rules out varying scrape frequencies, and verification through the TargetAllocator shows that each target is scraped only once, which rules out duplicate scrapes. That narrows the likely causes to timing issues, batch processing behavior, or the interaction with the remote write endpoint.
Environment Details
The environment in which these errors are occurring is a critical factor in understanding the potential causes. The setup includes:
- Chart: opentelemetry-kube-stack v0.6.2
- OpenTelemetry Collector Version: 0.129.0
- Kubernetes Version: EKS 1.30
- Deployment Environment: AWS EKS
- Remote Write Target: Amazon Managed Service for Prometheus (AMP)
- Deployment Mode: DaemonSet (one collector per node)
- TargetAllocator: Enabled (with Prometheus CRs disabled)
This is a distributed collection architecture: each node in the cluster runs its own OpenTelemetry Collector instance, giving full coverage of node-level metrics. Running on EKS and exporting to AMP adds AWS-specific considerations such as SigV4 authentication and network latency. The DaemonSet mode spreads the scrape load across nodes and avoids a single bottleneck, but it also means every collector instance must be configured and behave consistently, since they all write into the same AMP workspace. The TargetAllocator dynamically distributes scrape targets so that each target is scraped by exactly one collector, which prevents duplicate scrapes and keeps the data consistent, at the cost of an additional component that must be configured correctly.
The versions of the `opentelemetry-kube-stack` chart (v0.6.2) and the OpenTelemetry Collector (0.129.0) also matter. Newer releases may contain relevant bug fixes or performance improvements, while older releases may carry known issues, so confirming that the versions in use are current, or at least within a stable range, is a worthwhile troubleshooting step. The Kubernetes version (EKS 1.30) is relevant as well, since different Kubernetes versions can behave differently or have compatibility issues with the OpenTelemetry components.
Configuration
The configuration of the OpenTelemetry Collector is a crucial aspect to examine when troubleshooting out-of-order samples errors. The configuration provided includes exporters, extensions, processors, receivers, and service definitions. Here's a breakdown of the key components:
- Exporters: The `prometheusremotewrite` exporter is configured to send metrics to Amazon Managed Service for Prometheus (AMP). It uses AWS SigV4 authentication and includes settings for the endpoint, external labels, resource-to-telemetry conversion, and timeout.
- Extensions: The `sigv4auth` extension handles AWS SigV4 authentication for the Prometheus remote write exporter.
- Processors: The `batch` processor batches metrics before they are sent to the exporter, with settings for `send_batch_max_size`, `send_batch_size`, and `timeout`. The `resourcedetection/env` processor detects and adds resource attributes based on environment variables.
- Receivers: The `otlp` receiver accepts metrics via gRPC and HTTP, while the `prometheus` receiver scrapes metrics from configured targets.
- Service: The service defines the metrics pipeline, which includes the `prometheus` receiver, the `resourcedetection/env` and `batch` processors, and the `prometheusremotewrite` exporter. It also configures telemetry settings for the collector's own logs and metrics.
config:
exporters:
debug: {}
prometheusremotewrite:
auth:
authenticator: sigv4auth
endpoint: >-
https://aps-workspaces.[REGION].amazonaws.com/workspaces/[WORKSPACE_ID]/api/v1/remote_write
external_labels:
cluster: hyperion-dev
resource_to_telemetry_conversion:
enabled: true
target_info:
enabled: false
timeout: 10s
extensions:
sigv4auth:
assume_role: null
processors:
batch:
send_batch_max_size: 1500
send_batch_size: 1000
timeout: 1s
resourcedetection/env:
detectors:
- env
override: false
timeout: 2s
receivers:
otlp:
protocols:
grpc:
endpoint: '0.0.0.0:4317'
http:
endpoint: '0.0.0.0:4318'
prometheus:
config:
scrape_configs:
# Note: The kubelet, cadvisor, and probes scrape configs are identical to
# charts/opentelemetry-kube-stack/examples/prometheus-otel/kubelet_scrape_configs.yaml
# and have been removed here for brevity
service:
extensions:
- sigv4auth
pipelines:
metrics:
exporters:
- prometheusremotewrite
processors:
- resourcedetection/env
- batch
receivers:
- prometheus
telemetry:
logs:
encoding: json
level: info
metrics:
readers:
- pull:
exporter:
prometheus:
host: 0.0.0.0
port: 8888
The scrape configurations for kubelet, cadvisor, and probes are based on the examples provided in the `opentelemetry-kube-stack` chart. They define how metrics are scraped from the respective endpoints, including relabeling rules and scrape intervals, and they must be consistent and appropriate for the targets. Misconfigurations here can produce timing issues or data inconsistencies that end up as out-of-order samples: incorrect relabeling rules can cause the same metric to be scraped under different labels, creating duplicate time series and conflicts, while inconsistent scrape intervals cause metrics to be collected at varying frequencies that may not align before the data reaches the remote write endpoint.
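For orientation, the sketch below shows what a minimal kubelet scrape job can look like when written directly in the `prometheus` receiver. It is an illustrative configuration assuming standard Prometheus node service discovery and in-cluster service account credentials; the chart's `kubelet_scrape_configs.yaml` example includes additional relabeling plus the cadvisor and probes jobs, and should be treated as the reference.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        # Illustrative kubelet job only; the chart's example file is more complete.
        - job_name: kubelet
          scheme: https
          scrape_interval: 30s        # keep identical across kubelet, cadvisor, and probes
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: true
          authorization:
            credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          kubernetes_sd_configs:
            - role: node
          relabel_configs:
            # Give every series a stable node label derived from service discovery.
            - source_labels: [__meta_kubernetes_node_name]
              target_label: node
```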
The `batch` processor settings, `send_batch_max_size`, `send_batch_size`, and `timeout`, determine how metrics are batched before being handed to the exporter. Poorly chosen values can delay or fragment delivery: a very short timeout flushes partial batches before a scrape has been fully collected, while a very long timeout holds metrics back so that they reach the remote write endpoint after more recent data and appear out of order. Tuning these settings (see the Batch Processor Configuration step below) is therefore part of ensuring timely, consistent delivery to the remote write endpoint.
Error Details and Observations
The error manifests as HTTP 400 responses with "out of order sample" messages. A typical error message looks like this:
Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): maxFailure (quorum) on a given error family, addr=[REDACTED]:9095 state=ACTIVE zone=[REGION], rpc error: code = Code(400) desc = user=[REDACTED]: err: out of order sample. timestamp=2025-07-04T02:24:13.571Z, series={__name__="kubelet_runtime_operations_duration_seconds_bucket", cluster="hyperion-dev", endpoint="https-metrics", instance="[REDACTED]", job="kubelet", k8s_node_name="[REDACTED]", le="0.078125", metrics_path="/metrics", node="[REDACTED]", operation_type="stop_podsandbox", server_address="[REDACTED]", server_port="", service_instance_id="[REDACTED]", service_name="kubelet", url_scheme="https"}
Key observations about the errors include:
- The error affects multiple different metric series, indicating a systemic issue rather than a problem with a specific metric.
- The metrics are from kubelet, cadvisor, and probes endpoints, suggesting a common factor in how these metrics are being scraped or processed.
- All three scrape configs use the same relabeling pattern and scrape interval (30s), ruling out inconsistencies in scrape configurations as the primary cause.
- Verification through TargetAllocator confirms that each target is only being scraped once, eliminating the possibility of duplicate scrapes causing out-of-order errors.
- Running as DaemonSet ensures one collector per node, which should distribute the load and prevent bottlenecks, but it also means that the configuration and behavior of each collector instance must be consistent.
The error message provides valuable information about the context in which the error occurred. It includes the timestamp of the out-of-order sample, the metric series name, and the labels associated with the metric. This information can be used to identify the specific metrics that are experiencing issues and to trace the flow of data from the scrape target to the remote write endpoint. The fact that the error mentions a quorum failure might indicate issues with how samples are being batched or sent to the remote write endpoint. Quorum failures typically occur when a majority of the replicas in a distributed system fail to process a request, suggesting that there may be problems with the reliability or consistency of the data being sent.
Troubleshooting Steps and Solutions
To address the "out of order samples" errors, several troubleshooting steps can be taken. Here are some potential causes and solutions:
1. Clock Synchronization
Problem: Clock skew between nodes can cause timestamps to appear out of order. If the clocks on the nodes running the OpenTelemetry Collectors are not synchronized, the timestamps on the metrics they collect may be inaccurate, leading to out-of-order errors. This is especially common in distributed systems where nodes may drift over time if not properly synchronized. Clock skew can also be exacerbated by network latency or other environmental factors that affect the timing of data transmission.
Solution: Ensure proper NTP (Network Time Protocol) configuration on all nodes. NTP is a protocol designed to synchronize the clocks of computers over a network. By configuring NTP on all nodes in the Kubernetes cluster, you can ensure that their clocks are synchronized, minimizing the risk of clock skew. Most Linux distributions include NTP clients by default, but you may need to configure them to use a reliable NTP server. Cloud providers like AWS also offer NTP services that can be used to synchronize the clocks of instances running in their environment. Regularly monitoring the clock synchronization status can also help in detecting and addressing any potential issues proactively.
2. Scrape Interval Consistency
Problem: Inconsistent scrape intervals can lead to metrics being collected at varying frequencies, which can cause out-of-order errors. If some collectors are scraping targets more frequently than others, the resulting data may have inconsistent timestamps, leading to conflicts when the data is written to the remote write endpoint. This is particularly problematic when dealing with high-cardinality metrics, where even slight variations in scrape intervals can result in a large number of out-of-order samples.
Solution: Verify and enforce a consistent scrape interval across all collectors and scrape jobs. Define the interval explicitly rather than relying on defaults, and keep the configuration centralized, as the chart already does by rendering a single collector config for the whole DaemonSet, so that drift cannot creep in through per-job overrides. Periodically review the rendered scrape configurations to confirm they still agree.
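One way to do this is to pin the interval in the receiver's global section so that individual jobs inherit it, as in the hedged sketch below (job contents elided; the values shown are assumptions to adapt to your setup).

```yaml
receivers:
  prometheus:
    config:
      global:
        scrape_interval: 30s   # single cluster-wide default inherited by every job
        scrape_timeout: 10s    # must stay below the scrape interval
      scrape_configs:
        - job_name: kubelet
          # no per-job scrape_interval override, so the global 30s applies
        - job_name: cadvisor
          # likewise inherits the global interval
```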
3. Batch Processor Configuration
Problem: The batch processor settings can significantly impact the order and timing of metric delivery. Incorrectly configured batching can lead to delays or inconsistencies in the delivery of metrics, potentially contributing to out-of-order errors. For example, if the batch timeout is too short, metrics may be sent before they are fully collected, resulting in partial batches being sent out of order. On the other hand, if the batch timeout is too long, metrics may be delayed, causing them to arrive at the remote write endpoint with timestamps that appear out of order compared to more recent data.
Solution: Adjust the `batch` processor settings: `send_batch_max_size`, `send_batch_size`, and `timeout`. Experiment to find values that suit your environment. The `timeout` controls how long a partially filled batch waits before it is flushed, so raising it from the configured 1s keeps the samples from a single scrape together at the cost of some delivery latency, while `send_batch_size` and `send_batch_max_size` balance batching efficiency against the size of each remote write request. Monitor the batch processor's own telemetry (batch counts, batch sizes, export latency) to guide the tuning, and keep the capacity of the remote write endpoint in mind, since the optimal settings depend on the backend.
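As a starting point, the sketch below shows one plausible tuning; the numbers are assumptions to adjust against your own scrape volume and AMP ingestion limits, not recommended values.

```yaml
processors:
  batch:
    send_batch_size: 2000       # flush once this many data points are buffered
    send_batch_max_size: 3000   # hard cap on the size of any single batch
    timeout: 5s                 # flush a partially filled batch after this long
```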
4. Resource to Telemetry Conversion
Problem: The `resource_to_telemetry_conversion` setting can sometimes be a factor. When enabled, it converts resource attributes into metric labels; those labels determine series identity on the backend, so any change or inconsistency in resource attributes translates directly into changes in the series being written. It also increases cardinality, which is more noticeable with complex resource attributes or high metric volumes.
Solution: Try disabling `resource_to_telemetry_conversion`. If disabling it resolves the issue, consider alternative ways to attach resource context, such as the `resourcedetection` processor or labels added directly in the scrape configurations. If you do need the conversion, keep an eye on its effect on metric cardinality, since adding many labels can strain the monitoring backend.
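A quick way to test this is to flip the flag on the existing exporter configuration, as in the sketch below (endpoint and labels carried over from the configuration above).

```yaml
exporters:
  prometheusremotewrite:
    auth:
      authenticator: sigv4auth
    endpoint: https://aps-workspaces.[REGION].amazonaws.com/workspaces/[WORKSPACE_ID]/api/v1/remote_write
    external_labels:
      cluster: hyperion-dev
    # Temporarily disable the conversion to see whether the resource-derived
    # labels are involved in the out-of-order rejections.
    resource_to_telemetry_conversion:
      enabled: false
    timeout: 10s
```

Note that removing these labels changes series identity, so compare the results carefully before settling on either setting.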
5. Remote Write Endpoint
Problem: The remote write endpoint itself might have limitations or issues that cause out-of-order errors. Amazon Managed Service for Prometheus (AMP), like any time series database, has specific requirements for the order and timing of incoming data. If the remote write endpoint is experiencing performance issues or is configured with strict timestamp validation rules, it may reject metrics with timestamps that appear out of order, even if they are only slightly out of sync. Network latency or other connectivity issues between the collectors and the remote write endpoint can also contribute to this problem.
Solution: Check the health and performance of the remote write path. Monitor the error rates and latency of remote write requests from the collector, confirm that there are no network connectivity problems between the collectors and AMP, and verify that the workspace is not hitting its ingestion quotas. Because AMP is a managed service, its ingest-side validation and ordering rules cannot be tuned from the user side, so review the AWS documentation for AMP remote write best practices and focus on collector-side adjustments.
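On the collector side, the `prometheusremotewrite` exporter exposes queue and retry options that influence how requests reach AMP. The sketch below is an assumption-laden example (field names as documented for recent opentelemetry-collector-contrib releases; values are illustrative): reducing the number of queue consumers serializes sends from each collector, which removes one potential source of reordering at the cost of throughput.

```yaml
exporters:
  prometheusremotewrite:
    auth:
      authenticator: sigv4auth
    endpoint: https://aps-workspaces.[REGION].amazonaws.com/workspaces/[WORKSPACE_ID]/api/v1/remote_write
    timeout: 10s
    # Retries help with transient failures; AMP returns out-of-order samples as
    # permanent 400 errors, so retrying those specific requests will not help.
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 10s
    # A single consumer sends batches one at a time, preserving their order.
    remote_write_queue:
      enabled: true
      queue_size: 10000
      num_consumers: 1
```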
6. Kubelet Metrics Scraping
Problem: Scraping kubelet metrics can be particularly prone to timing issues due to the dynamic nature of Kubernetes and the high volume of metrics exposed by kubelet. Kubelet metrics include detailed information about the state and performance of pods, containers, and nodes, which can change rapidly. The timing of these metrics can be affected by various factors, such as pod scheduling, container restarts, and node resource utilization. If the collectors are not able to scrape these metrics consistently and in a timely manner, it can lead to out-of-order errors.
Solution: Ensure that the kubelet metrics are being scraped efficiently and consistently. Verify that the scrape targets are correctly configured and that the collectors have sufficient resources to handle the load. Consider adjusting the scrape interval or using relabeling rules to reduce the volume of metrics being scraped. You may also need to optimize the performance of the collectors by increasing their CPU and memory limits or by distributing the scrape load across multiple collectors. Monitoring the performance of the collectors and the kubelet endpoints can help in identifying any bottlenecks or issues that may be contributing to out-of-order errors. It is also important to consider the impact of the scrape configuration on the overall performance of the Kubernetes cluster, as excessive scraping can lead to resource contention and performance degradation.
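If scrape volume is part of the problem, `metric_relabel_configs` can drop series you do not need before they ever leave the collector. The sketch below is illustrative: the metric names are common high-cardinality cadvisor series, and the rest of the job is assumed to match the chart's example.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: cadvisor
          scrape_interval: 30s
          # service discovery, TLS, and relabeling as in the chart's example file
          metric_relabel_configs:
            # Drop a few commonly unused, high-cardinality cadvisor series.
            - source_labels: [__name__]
              regex: container_(network_tcp_usage_total|network_udp_usage_total|tasks_state)
              action: drop
```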
7. Honor Timestamps
Problem: The `honor_timestamps` setting in the Prometheus receiver controls how sample timestamps are handled. When `honor_timestamps` is `true`, the receiver keeps the timestamps provided by the scrape target; if those timestamps are inaccurate or inconsistent, out-of-order errors can follow. When it is `false`, the receiver stamps samples with the time of the scrape instead, which mitigates problems caused by target-provided timestamps at the cost of some timestamp precision. cadvisor in particular is known to attach explicit timestamps to many container metrics, which makes it a common trigger for this class of error.
Solution: Experiment with both `honor_timestamps: true` and `honor_timestamps: false`. If `false` resolves the issue, the scrape targets are likely supplying problematic timestamps and the underlying source is worth investigating. If `true` resolves it, the collector may not be processing timestamps correctly, and adjusting the collector configuration or upgrading to a newer collector release may help. Weigh the trade-off between timestamp accuracy and ordering consistency, and watch the effect on the metric data before committing to either setting.
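The setting lives on each scrape job in the `prometheus` receiver. The sketch below shows it applied to a cadvisor job, a reasonable first candidate given cadvisor's explicit timestamps; the remaining job settings are assumed to match the chart's example.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: cadvisor
          # Use the collector's scrape time instead of the timestamps exposed by
          # the target; this can avoid out-of-order rejections at the cost of a
          # small shift in sample timestamps.
          honor_timestamps: false
          scrape_interval: 30s
          # remaining job settings as in the chart's example file
```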
Conclusion
Troubleshooting "out of order samples" errors in an OpenTelemetry Kube Stack environment requires a systematic approach. By understanding the potential causes, clock synchronization issues, scrape interval inconsistencies, batch processor misconfiguration, resource-to-telemetry conversion, remote write endpoint limitations, kubelet scraping behavior, and the `honor_timestamps` setting, you can diagnose and resolve these errors and keep your monitoring data complete and reliable. Work through the steps above in order, make one change at a time, and regularly review your configuration and the collector's own telemetry to catch regressions early.
By following these guidelines, you can keep metrics flowing to your monitoring backend without interruption and retain the visibility you need to manage your Kubernetes environment effectively. Continuous monitoring and proactive troubleshooting are key: as the environment evolves, revisit these recommendations, adjust the configuration as needed, and stay current with best practices from the OpenTelemetry and Kubernetes communities. Addressing out-of-order samples is ultimately about data integrity; with a well-configured and maintained pipeline you can identify issues early, optimize resource utilization, and keep your applications running smoothly.