Phase 0.2 Building A Python Replay Engine And Toxiproxy Failure Injection
Hey guys! Let's dive into Phase 0.2 of our project, which lays the groundwork for the Python replay engine and Toxiproxy failure injection by instrumenting our services A, B, and C with OpenTelemetry, so we can capture their behavior and establish a performance baseline. We'll be discussing the tasks, acceptance criteria, and the overall goal of this phase in detail. So, buckle up and let's get started!
Goal: Instrument Services A/B/C with OpenTelemetry and Produce a First Capture + Baseline
The primary goal for this phase is to instrument our services A, B, and C with OpenTelemetry. This means adding the necessary OpenTelemetry components to these services so that we can collect and export telemetry data. Think of it as giving our services the ability to report on their internal operations. We want to capture a representative sample of their behavior and establish a baseline performance. This baseline will serve as a reference point for future comparisons, especially when we start injecting failures and observing how the services react.
Why is this important, you ask? Well, by having a clear baseline, we can quickly identify performance degradations or unexpected behavior when things go wrong. Imagine you're a doctor monitoring a patient's vital signs. You need to know what's normal before you can identify what's abnormal. Similarly, we need to know how our services behave under normal conditions before we can effectively diagnose issues when failures occur. OpenTelemetry is our stethoscope in this scenario, allowing us to listen to the heartbeat of our services.
To achieve this, we'll be using a combination of tools and techniques. We'll add the OpenTelemetry SDK to each service, set up an OTel Collector to aggregate the telemetry data, and export the traces to Jaeger for visualization. We'll also write the captured data to NDJSON files and compute a baseline using metrics like p50, p95, and error rates. This baseline will be stored in a YAML file for easy access and comparison. It's a multi-step process, but each step is essential for building a robust and observable system.
This phase sets the foundation for more advanced testing and analysis in later stages. By having a well-defined capture and baseline, we'll be able to confidently introduce failures using Toxiproxy and observe their impact on our services. It's like setting up a controlled experiment where we can systematically study the behavior of our system under different conditions. The data we collect in this phase will be invaluable for identifying bottlenecks, improving performance, and ensuring the resilience of our services. So, let's make sure we get it right!
Tasks
To achieve our goal of instrumenting services A/B/C with OpenTelemetry and producing a first capture and baseline, we have several key tasks to tackle. These tasks are designed to build upon each other, creating a clear path from initial instrumentation to a usable baseline. Let's break down each task in detail:
Add OpenTelemetry SDK to service-a, service-b, service-c
The first step is to integrate the OpenTelemetry SDK into our services A, B, and C. This involves adding the necessary libraries and code to each service to enable the collection of telemetry data. Think of it as installing sensors on our services so they can report their internal activities. We need to ensure that the SDK is properly configured to capture the relevant information, such as request timings, error rates, and resource utilization. This task is fundamental because without the SDK, we won't be able to collect any telemetry data at all. It's the foundation upon which all subsequent steps are built.
We'll need to consider the specific programming languages and frameworks used by each service when integrating the SDK. Each language has its own OpenTelemetry SDK with specific installation and configuration procedures. For example, if service A is written in Python, we'll use the OpenTelemetry Python SDK. If service B is in Java, we'll use the Java SDK, and so on. This might involve adding dependencies to our project's build files, importing necessary modules in our code, and configuring the SDK to export data to our OTel Collector. It's a bit like learning a new language for each service, but the payoff is huge in terms of observability.
The configuration of the SDK is also crucial. We need to specify how the SDK should collect and export data. This might involve setting sampling rates, configuring resource attributes (like service name and version), and defining the endpoint for the OTel Collector. A well-configured SDK will ensure that we're capturing the right data without overwhelming our system with unnecessary information. It's a balancing act between capturing enough data for meaningful analysis and minimizing the overhead on our services.
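To make this concrete, here's a minimal sketch of what the setup could look like for a Python service such as service-a. It assumes the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc packages and an OTel Collector reachable at otel-collector:4317; the endpoint, port, and handler function are placeholders, not the actual service code:

```python
# Minimal OpenTelemetry setup sketch for a Python service (e.g. service-a).
# Assumes opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc are installed.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify this service in every span it emits.
resource = Resource.create({"service.name": "service-a", "service.version": "0.1.0"})

provider = TracerProvider(resource=resource)
# Send spans to the OTel Collector's OTLP/gRPC receiver (default port 4317; hostname assumed).
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("service-a")

# Example: wrap a request handler in a span (placeholder handler, not real service code).
def handle_request(payload):
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("payload.size", len(payload))
        return {"status": "ok"}
```

The BatchSpanProcessor buffers spans and exports them in batches, which keeps per-request overhead low; sampling can be layered on later if the trace volume becomes a problem.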
Stand up OTel Collector (observability/otel-collector-config.yaml)
Once we've instrumented our services with the OpenTelemetry SDK, we need a central place to collect and process the telemetry data. That's where the OTel Collector comes in. The OTel Collector is a vendor-agnostic service that receives, processes, and exports telemetry data. It acts as a hub, aggregating data from multiple services and forwarding it to various backends, such as Jaeger. Think of it as a post office for our telemetry data, sorting and routing it to the right destinations.
Setting up the OTel Collector involves deploying the collector service and configuring it to receive data from our services and export it to Jaeger. We'll be using a configuration file (observability/otel-collector-config.yaml) to define the collector's behavior. This file specifies the receivers (where the collector gets data from), processors (how the collector transforms data), and exporters (where the collector sends data). It's like writing the rules for our telemetry post office, defining how it should handle the incoming and outgoing mail.
The configuration file will typically include settings for the OTLP (OpenTelemetry Protocol) receiver, which is the standard protocol for sending data to the collector. It will also include settings for the Jaeger exporter, which tells the collector how to send data to Jaeger. We might also configure processors to filter or transform the data before it's exported. For example, we might use a processor to add additional attributes to the data or to sample traces to reduce the volume of data being exported.
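As a rough sketch (not the final config), observability/otel-collector-config.yaml could look something like the following. It assumes OTLP in from the services and OTLP out to Jaeger, since recent Jaeger versions accept OTLP natively and newer collector builds have dropped the dedicated jaeger exporter; the hostnames and ports are typical docker-compose defaults and may differ in our setup:

```yaml
# Hypothetical sketch of observability/otel-collector-config.yaml.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}            # batch spans before export to reduce outbound calls

exporters:
  otlp/jaeger:         # recent Jaeger versions ingest OTLP directly on 4317
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
```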
Standing up the OTel Collector is a critical step because it provides the infrastructure for collecting and processing our telemetry data. Without the collector, our services would be sending data into the void. It's the central nervous system of our observability setup, ensuring that data flows smoothly from our services to our analysis tools.
Export traces to Jaeger; verify in UI
With the OTel Collector in place, the next step is to export the collected traces to Jaeger and verify that they are visible in the Jaeger UI. Jaeger is a distributed tracing system that allows us to visualize the flow of requests through our services. Think of it as a map of our system, showing how requests travel from service to service. By exporting traces to Jaeger, we can gain insights into the performance and behavior of our services.
This task involves configuring the OTel Collector to send traces to Jaeger and then accessing the Jaeger UI to view the traces. We'll need to ensure that the Jaeger endpoint is correctly configured in the OTel Collector's configuration file. Once the collector is sending data to Jaeger, we can access the Jaeger UI through a web browser and search for traces related to our services. It's like opening a map and seeing the roads and highways that connect different cities.
Verifying that the traces are visible in the Jaeger UI is crucial for confirming that our OpenTelemetry setup is working correctly. If we can see the traces, it means that the SDKs in our services are capturing data, the OTel Collector is receiving and processing the data, and Jaeger is storing and displaying the data. It's a complete end-to-end verification of our telemetry pipeline. If we don't see the traces, we'll need to troubleshoot the individual components of the pipeline to identify the issue.
In the Jaeger UI, we can explore the traces in detail, examining the spans (individual operations) that make up each trace. We can see the timings for each span, the services involved, and any errors that occurred. This information is invaluable for identifying performance bottlenecks and diagnosing issues. It's like zooming in on the map to see the traffic conditions on a particular road.
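Besides eyeballing the UI, we can script a quick check against Jaeger's HTTP query API (the same API the UI itself uses; it's internal and unofficial, so treat this as a convenience). The port and service name below are assumptions from a default setup:

```python
# Quick scripted check against Jaeger's internal HTTP query API.
# Assumes the Jaeger query port 16686 is exposed on localhost.
import requests

JAEGER_URL = "http://localhost:16686"  # assumed from a typical docker-compose setup

resp = requests.get(
    f"{JAEGER_URL}/api/traces",
    params={"service": "service-a", "limit": 20},
    timeout=10,
)
resp.raise_for_status()
traces = resp.json().get("data", [])
print(f"Found {len(traces)} recent traces for service-a")

# Each trace lists its processes; an A -> B -> C flow should involve all three services.
for trace in traces[:3]:
    services = {p["serviceName"] for p in trace.get("processes", {}).values()}
    print(sorted(services))
```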
Write capture to NDJSON: data/captures/capture_001.json
Now that we have our telemetry data flowing into Jaeger, we need to capture a snapshot of this data for later analysis and replay. This involves writing the traces to an NDJSON (Newline Delimited JSON) file. Think of it as taking a photograph of our system's behavior at a specific point in time. This capture will serve as the basis for our baseline and for future comparisons when we inject failures.
Writing the capture to NDJSON involves configuring a tool or script to query Jaeger for the traces and write them to a file in NDJSON format. NDJSON is a simple and efficient format for storing JSON data, where each line represents a separate JSON object. This makes it easy to process the data line by line, which is useful for large datasets. It's like organizing our photographs into a neatly labeled album.
We'll be writing the capture to a file named data/captures/capture_001.json. This file will contain a series of JSON objects, each representing a trace. Each trace will include information about the spans, services, and timings involved in a particular request. This data is a rich source of information about our system's behavior.
The capture should include a representative sample of requests to our services. We want to capture a variety of different request types and scenarios to ensure that our baseline is accurate. The goal is to capture >20 requests to provide a sufficient dataset for analysis. It's like taking multiple photographs from different angles to get a complete picture.
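One way to write the capture, sketched under the same assumptions as the check above (Jaeger's unofficial query API on localhost:16686, service-a as the entry point), is a small script like this; the real tooling may differ:

```python
# Sketch: pull recent traces from Jaeger and write them as NDJSON,
# one trace per line, to data/captures/capture_001.json.
import json
from pathlib import Path

import requests

JAEGER_URL = "http://localhost:16686"              # assumed Jaeger query endpoint
OUT_PATH = Path("data/captures/capture_001.json")

resp = requests.get(
    f"{JAEGER_URL}/api/traces",
    params={"service": "service-a", "limit": 50},  # ask for more than the >20 we need
    timeout=10,
)
resp.raise_for_status()
traces = resp.json().get("data", [])

OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
with OUT_PATH.open("w") as f:
    for trace in traces:
        f.write(json.dumps(trace) + "\n")          # NDJSON: one JSON object per line

print(f"Wrote {len(traces)} traces to {OUT_PATH}")
```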
Compute baseline (p50, p95, error_rate) → data/baselines/normal_baseline.yaml
With our capture written to NDJSON, the next step is to compute a baseline from this data. The baseline represents the normal performance of our services under typical conditions. Think of it as establishing a reference point for comparison. We'll be computing metrics such as p50 (median), p95 (95th percentile), and error rate to characterize the performance of our services.
Computing the baseline involves writing a script or using a tool to analyze the captured data and calculate these metrics. We'll need to parse the NDJSON file, extract the relevant data (e.g., request durations, error codes), and calculate the p50, p95, and error rate for each service. This is a bit like analyzing our photographs to identify key features and patterns.
The p50 (median) represents the typical response time: half of the requests are faster and half are slower. The p95 (95th percentile) is the response time that 95% of requests come in under; it's a good indicator of tail latency, the slow outliers that the median hides. The error rate is the fraction of requests that result in an error, a measure of the reliability of our services.
We'll be storing the baseline in a YAML file named data/baselines/normal_baseline.yaml. YAML is a human-readable data serialization format that is commonly used for configuration files. The YAML file will contain the p50, p95, and error rate for each service, allowing us to easily compare these metrics against future captures. It's like writing down the key features we identified in our photographs so we can easily compare them to future photographs.
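Here's a sketch of how the baseline could be computed from the capture above. It assumes Jaeger-style trace JSON (span durations in microseconds, errors recorded as an error tag), and the key names p50_ms/p95_ms/error_rate are our own choice, so adjust to whatever the capture actually contains:

```python
# Sketch: compute p50/p95/error_rate per service from the NDJSON capture and
# write data/baselines/normal_baseline.yaml. Assumes Jaeger-style trace JSON.
import json
from collections import defaultdict
from pathlib import Path

import yaml  # pip install pyyaml

CAPTURE = Path("data/captures/capture_001.json")
BASELINE = Path("data/baselines/normal_baseline.yaml")


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, round(pct / 100.0 * (len(ordered) - 1)))
    return ordered[idx]


durations_ms = defaultdict(list)  # service -> span durations in ms
errors = defaultdict(int)         # service -> error span count
totals = defaultdict(int)         # service -> total span count

with CAPTURE.open() as f:
    for line in f:
        trace = json.loads(line)
        processes = trace.get("processes", {})
        for span in trace.get("spans", []):
            service = processes.get(span.get("processID"), {}).get("serviceName", "unknown")
            totals[service] += 1
            durations_ms[service].append(span.get("duration", 0) / 1000.0)  # µs -> ms
            if any(t.get("key") == "error" and t.get("value") for t in span.get("tags", [])):
                errors[service] += 1

baseline = {
    service: {
        "p50_ms": round(percentile(samples, 50), 2),
        "p95_ms": round(percentile(samples, 95), 2),
        "error_rate": round(errors[service] / totals[service], 4),
    }
    for service, samples in durations_ms.items()
}

BASELINE.parent.mkdir(parents=True, exist_ok=True)
with BASELINE.open("w") as f:
    yaml.safe_dump(baseline, f, sort_keys=True)
```

The resulting YAML would then have one entry per service, for example service-a with its own p50_ms, p95_ms, and error_rate values, which is exactly the shape we want to compare against future captures.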
Update README with run steps
Finally, we need to update the README with the steps required to run our OpenTelemetry setup, so that others (and our future selves) can easily reproduce this work. Think of it as writing a user manual for our observability system. The README should explain how to start the services, the OTel Collector, and Jaeger, how to generate traffic to the services, and how to access the Jaeger UI to view the traces. It's like providing a roadmap for others to follow.
The new section should be clear, concise, and easy to follow: the commands for bringing up the stack, the steps for generating traffic, and the URL for the Jaeger UI. The goal is to make reproducing our results as painless as possible, so that anyone can read the instructions and operate the system.
Acceptance Criteria
To ensure that we've successfully completed Phase 0.2, we need to meet certain acceptance criteria. These criteria define the minimum requirements for our OpenTelemetry setup and ensure that we've achieved our goal of instrumenting services A/B/C and producing a first capture and baseline. Let's break down each criterion:
Jaeger shows A→B→C spans for a sample flow
The first acceptance criterion is that Jaeger should display spans showing the flow of requests from service A to service B to service C. This confirms that our OpenTelemetry SDKs are correctly capturing traces and that the traces are being properly exported to Jaeger. Think of it as verifying that our map shows the correct route between cities. If we can see the A→B→C spans in Jaeger, it means that our services are communicating with each other and that the telemetry data is accurately reflecting this communication.
To verify this criterion, we'll need to generate a sample flow of requests that traverse all three services. This might involve sending a specific type of request to service A that then invokes service B, which in turn invokes service C. Once we've generated the flow, we can access the Jaeger UI and search for traces related to this flow. If we see spans that show the A→B→C sequence, we've met this criterion. It's like checking our GPS to make sure we're on the right path.
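A tiny traffic generator is enough for this; the sketch below assumes service-a exposes an HTTP endpoint at a placeholder URL and path, so swap in whatever service-a actually serves:

```python
# Sketch: drive a sample A -> B -> C flow by repeatedly calling service-a.
# The base URL and /process path are placeholders for the real service-a API.
import requests

SERVICE_A_URL = "http://localhost:8000"  # hypothetical port for service-a

for i in range(25):  # comfortably more than the >20 requests we need
    try:
        resp = requests.get(f"{SERVICE_A_URL}/process", params={"request_id": i}, timeout=5)
        print(i, resp.status_code)
    except requests.RequestException as exc:
        print(i, "failed:", exc)
```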
This criterion is important because it confirms that our basic OpenTelemetry setup is working correctly. If we can't see the A→B→C spans, it means that there's a problem with our instrumentation or with the flow of data to Jaeger. We'll need to troubleshoot the setup to identify the issue and ensure that we're capturing the necessary telemetry data.
data/captures/capture_001.json created with >20 requests
The second acceptance criterion is that the data/captures/capture_001.json file should be created and contain data for more than 20 requests. This ensures that we've captured a sufficient amount of data to compute a meaningful baseline. Think of it as taking enough photographs to create a representative album. If we have data for more than 20 requests, we'll have a better understanding of the typical performance of our services.
To verify this criterion, we'll need to generate traffic to our services and then run the script or tool that writes the traces to the NDJSON file. Once the file is created, we can open it and count the number of JSON objects (traces) it contains. If the count is greater than 20, we've met this criterion. It's like counting the photographs in our album to make sure we have enough.
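Because NDJSON stores one trace per line, the count check can be a few lines of Python (the path matches the capture step above):

```python
# Count captured traces: one JSON object per non-empty line in the NDJSON file.
with open("data/captures/capture_001.json") as f:
    count = sum(1 for line in f if line.strip())
print(f"{count} traces captured")
assert count > 20, "acceptance criterion not met: need more than 20 captured requests"
```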
This criterion is important because a baseline computed from only a handful of requests might not be representative of the typical performance of our services. Capturing more than 20 requests gives us a more trustworthy picture of the system's normal behavior.
data/baselines/normal_baseline.yaml has p50/p95/error_rate
The final acceptance criterion is that the data/baselines/normal_baseline.yaml file should be created and contain the p50, p95, and error rate metrics for each service. This confirms that we've successfully computed a baseline from our captured data. Think of it as writing down the key features of our photographs so we can easily compare them to future photographs. If the YAML file contains these metrics, we'll have a reference point for evaluating the performance of our services.
To verify this criterion, we'll need to run the script or tool that computes the baseline from the NDJSON file and writes it to the YAML file. Once the file is created, we can open it and check that it contains the p50, p95, and error rate metrics for each service. If the metrics are present and appear to be reasonable, we've met this criterion. It's like checking our notes to make sure we've recorded all the important details.
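A quick scripted check works here too; the key names below follow the baseline sketch earlier (p50_ms, p95_ms, error_rate), so adjust them if the real file uses different ones:

```python
# Verify that every service in the baseline has the three required metrics.
import yaml  # pip install pyyaml

with open("data/baselines/normal_baseline.yaml") as f:
    baseline = yaml.safe_load(f)

REQUIRED = {"p50_ms", "p95_ms", "error_rate"}
for service, metrics in baseline.items():
    missing = REQUIRED - set(metrics)
    assert not missing, f"{service} is missing {missing}"
    print(service, metrics)
```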
This criterion is important because it ensures that we have a usable baseline for future comparisons. The p50, p95, and error rate metrics provide a snapshot of the typical performance of our services. We can use these metrics to identify performance degradations or unexpected behavior when we inject failures or make changes to our system. It's like having a benchmark against which we can measure our progress.
Links
For more detailed information and resources, you can refer to the following links:
- Project Flow Draft: docs/flow-draft.md (This document outlines the overall flow of the project and provides context for this phase.)
- Docker Compose: docker-compose.yml (This file defines the services and their dependencies for our project, making it easy to set up and run the environment.)
These links provide valuable resources for understanding the project and the context of this phase. The Project Flow Draft gives a high-level overview of the project, while the Docker Compose file provides the details for setting up the environment. Make sure to check them out for a more comprehensive understanding.
Branch
All the work for this phase will be done in the feat/obs-otel-capture branch. This ensures that our changes are isolated from the main codebase and allows us to easily track our progress. It's like working in a separate room so we don't disturb the others. By working in a dedicated branch, we can keep our changes organized and avoid conflicts with other developers, which makes it easier to collaborate and maintain a clean codebase. So, let's get coding in the feat/obs-otel-capture branch! We've got this, guys!