Troubleshooting And Fixing Octopets Backend 5xx Errors Aligning IaC

by StackCamp Team

Introduction

Hey guys! Today, we're diving deep into a recent incident involving the Octopets backend, where we encountered those dreaded 5xx errors. These errors can be a real headache, disrupting service and frustrating users. In this article, we'll walk through the entire process of investigating, mitigating, and resolving the issue, as well as ensuring our Infrastructure as Code (IaC) is properly aligned with our scaled container resources. This isn't just about fixing a problem; it's about learning and improving our systems to prevent future occurrences. We'll cover everything from the initial incident context to proposed code fixes and IaC adjustments. So, grab your favorite beverage, and let's get started!

Incident Context: Octopets API 500 Errors

The incident kicked off with reports of 500 errors plaguing the Octopets API. For those not in the know, a 500 error is a generic server error, meaning something went wrong on the server's end, but it couldn't be more specific. This particular instance occurred in the ca5ce512-88e1-44b1-97c6-22caf84fb2b0 subscription, within the rg-octopets-lisbon resource group. The affected resource was identified as /subscriptions/ca5ce512-88e1-44b1-97c6-22caf84fb2b0/resourceGroups/rg-octopets-lisbon/providers/Microsoft.App/containerapps/octopetsapi. Understanding the context is crucial because it sets the stage for our investigation. Knowing the environment, resource group, and specific resource allows us to narrow our focus and efficiently pinpoint the root cause. Imagine trying to find a needle in a haystack without knowing which haystack you're looking in! This initial information is our map, guiding us through the troubleshooting process.

Investigation Findings: Unraveling the Mystery

Our investigation began by examining the application's state. We found the Octopets API was running, utilizing ingress port 8080, and operating on a consumption workload profile. The latest revision was octopetsapi--0000010, which later updated to --0000011 after scaling the resources.

Next, we delved into the logs, scrutinizing the most recent 200 lines. The logs showed the application had started successfully, but we also noticed Entity Framework Core (EF Core) warnings. These warnings, specifically Microsoft.EntityFrameworkCore.Model.Validation[10620], flagged collection properties (Listing.AllowedPets and Listing.Amenities) that use value converters without value comparers. While these warnings didn't immediately scream "500 error," they were definitely worth noting as potential underlying problems. Interestingly, we didn't find any application stack traces or explicit 5xx error logs within the sampled window. This meant the errors weren't being directly logged, adding a layer of complexity to our investigation.

Finally, we turned to metrics from the past 30 minutes. CpuPercentage hovered near 0% until we scaled the resources, then rose into the low single digits, peaking at around 4%. MemoryPercentage showed a baseline of about 6%, modestly increasing post-scale to around 20% at 17:21 UTC. Analyzing requests by status, we observed sparse traffic with some 404 errors, but crucially, 500 errors were reported as 0 in the last window. This was puzzling, as it contradicted the initial reports, but it also suggested the issue might be intermittent or tied to specific conditions. The lack of clear 5xx errors in the logs and metrics pushed us to consider resource constraints as a possible culprit, leading us to the next step: mitigation.

Mitigation Performed: Scaling for Relief

To address the potential resource constraints, we decided to scale the container resources using the Azure CLI. We figured that if the application was running out of CPU or memory, increasing these resources might alleviate the 500 errors. The command we used was:

az containerapp update -g rg-octopets-lisbon -n octopetsapi --subscription ca5ce512-88e1-44b1-97c6-22caf84fb2b0 --cpu 1 --memory 2Gi --min-replicas 2 --max-replicas 5

This command updates the Octopets API container app within the specified resource group and subscription. We increased the CPU from 0.25 to 1 core and the memory from 0.5Gi to 2Gi. We kept the scale configuration the same (2–5 replicas) to ensure we had enough instances to handle traffic. Previously, the application was running on minimal resources, so we suspected this might be the bottleneck. After executing the command, we verified the changes in the Azure portal. The platform confirmed the resources were now at 1 CPU and 2Gi. Immediate telemetry indicated low CPU and memory percentages, suggesting the scaling was effective. Most importantly, we observed no recent 5xx errors in the Requests metric. This was a positive sign, indicating our mitigation might have resolved the immediate issue. We also prepared a rollback command in case we needed to revert the changes:

az containerapp update -g rg-octopets-lisbon -n octopetsapi --subscription ca5ce512-88e1-44b1-97c6-22caf84fb2b0 --cpu 0.25 --memory 0.5Gi --min-replicas 2 --max-replicas 5

Having a rollback plan is always a good practice, allowing us to quickly undo changes if they cause unexpected problems. While scaling the resources seemed to have addressed the immediate 5xx errors, it was crucial to understand the underlying cause and prevent future occurrences. This led us to propose code fixes and IaC adjustments.

Proposed Code Fixes: Strengthening the Foundation

Our investigation highlighted several areas in the codebase that could be improved to prevent future 5xx errors and enhance application stability. We identified three key areas for code fixes:

1. Address EF Core Collection Properties

The EF Core warnings we saw in the logs pointed to a potential issue with how collection properties were being handled. Specifically, the Listing.AllowedPets and Listing.Amenities properties were using value converters without a corresponding ValueComparer. This can lead to problems with equality comparisons and change tracking, potentially causing unexpected behavior. To address this, we proposed defining and assigning appropriate ValueComparer instances for these properties. A ValueComparer tells EF Core how to compare values within a collection, ensuring that changes are tracked correctly. Here’s an example of how you might implement this:

new ValueComparer<List<string>>(
    (c1, c2) => c1.SequenceEqual(c2),
    c => c.Aggregate(0, (h, v) => HashCode.Combine(h, v.GetHashCode())),
    c => c.ToList())

This snippet creates a ValueComparer tailored to a List<string>: SequenceEqual handles equality comparisons, the aggregate expression produces a hash code from the elements, and ToList creates the snapshot EF Core uses to detect changes. Applying this fix ensures that EF Core can properly manage these collection properties, reducing the risk of data-related issues.
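To put the comparer to work, it has to be attached wherever the value converter is configured. Below is a sketch of what that could look like in OnModelCreating; note that the JSON-free comma-separated converter, and the exact shape of the Listing entity, are assumptions for illustration — the warnings only tell us a converter exists without a matching comparer, not what that converter is:

```csharp
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    // Hypothetical converter: the real Octopets converter may differ.
    var listConverter = new ValueConverter<List<string>, string>(
        v => string.Join(',', v),
        v => v.Split(',', StringSplitOptions.RemoveEmptyEntries).ToList());

    // The comparer from the snippet above, reused for both properties.
    var listComparer = new ValueComparer<List<string>>(
        (c1, c2) => c1.SequenceEqual(c2),
        c => c.Aggregate(0, (h, v) => HashCode.Combine(h, v.GetHashCode())),
        c => c.ToList());

    // HasConversion accepts the comparer alongside the converter,
    // which silences warning 10620 and fixes change tracking.
    modelBuilder.Entity<Listing>()
        .Property(l => l.AllowedPets)
        .HasConversion(listConverter, listComparer);

    modelBuilder.Entity<Listing>()
        .Property(l => l.Amenities)
        .HasConversion(listConverter, listComparer);
}
```

With the comparer in place, mutating a list in memory (e.g., adding an amenity) is detected as a change and persisted, instead of being silently missed because EF Core compared list references.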

2. Strengthen Error Handling

Generic 500 errors are frustrating because they don't provide specific information about what went wrong. To improve our error handling, we proposed several enhancements. First, validate request parameters and return 400 (Bad Request) or 422 (Unprocessable Entity) status codes for invalid inputs. This prevents the application from attempting to process faulty data, which can lead to errors. Similarly, return 404 (Not Found) errors for missing resources instead of throwing exceptions. This provides clearer feedback to the client and helps in debugging. Another crucial step is to wrap dependency calls with resilient policies using libraries like Polly. Polly allows you to implement timeouts, retries, and circuit breakers, making your application more resilient to transient failures in external services. Finally, ensure consistent exception-to-response mapping in middleware. This means defining a clear strategy for how exceptions are translated into HTTP responses, providing consistent and informative error messages to the client.
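To make the middleware piece concrete, here's a sketch of consistent exception-to-response mapping. The exception types and status-code choices are illustrative, not taken from the Octopets codebase:

```csharp
public class ExceptionMappingMiddleware
{
    private readonly RequestDelegate _next;
    private readonly ILogger<ExceptionMappingMiddleware> _logger;

    public ExceptionMappingMiddleware(RequestDelegate next,
        ILogger<ExceptionMappingMiddleware> logger)
    {
        _next = next;
        _logger = logger;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        try
        {
            await _next(context);
        }
        catch (Exception ex)
        {
            // Map known exception types to specific status codes;
            // anything unexpected still returns 500, but is logged
            // with full details so it surfaces in the log stream.
            var status = ex switch
            {
                KeyNotFoundException => StatusCodes.Status404NotFound,
                ArgumentException => StatusCodes.Status400BadRequest,
                _ => StatusCodes.Status500InternalServerError
            };
            _logger.LogError(ex, "Unhandled exception mapped to {Status}", status);
            context.Response.StatusCode = status;
            await context.Response.WriteAsJsonAsync(new { error = ex.Message });
        }
    }
}
```

Registered early in the pipeline (app.UseMiddleware<ExceptionMappingMiddleware>()), this guarantees every exception becomes a logged, deliberate HTTP response — which would have given us the stack traces that were conspicuously absent from the logs during this incident.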

3. Add Pagination Limits and Guards

List endpoints, which return collections of data, can become a performance bottleneck if not properly managed. To prevent expensive queries, we proposed adding default pagination limits and guards on these endpoints. Pagination limits restrict the number of items returned in a single request, preventing the server from being overwhelmed by large datasets. Guards, such as checks on the requested page size and offset, can further protect against abuse and ensure efficient query execution. By implementing these measures, we can prevent performance degradation and ensure list endpoints remain responsive even under heavy load.
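As a minimal sketch of such a guard, assuming a minimal-API style endpoint (the route, AppDbContext, and Listing entity are hypothetical names for illustration):

```csharp
const int MaxPageSize = 100; // hard cap, regardless of what the client asks for

app.MapGet("/listings", async (AppDbContext db, int page = 1, int pageSize = 20) =>
{
    // Guard: reject malformed paging parameters up front with a 400
    // instead of letting them reach the database.
    if (page < 1 || pageSize < 1)
        return Results.BadRequest("page and pageSize must be positive.");

    // Clamp the page size so a single request can never force an
    // unbounded query.
    pageSize = Math.Min(pageSize, MaxPageSize);

    var items = await db.Listings
        .OrderBy(l => l.Id)              // stable ordering makes paging deterministic
        .Skip((page - 1) * pageSize)
        .Take(pageSize)
        .ToListAsync();

    return Results.Ok(items);
});
```

The key point is that the defaults apply even when the client sends no paging parameters at all, so the worst-case query cost is bounded by design rather than by client goodwill.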

IaC Drift Check: Aligning Infrastructure and Code

Infrastructure as Code (IaC) is the practice of managing and provisioning infrastructure through code rather than manual processes. It's crucial for maintaining consistency and repeatability in our deployments. However, IaC can drift over time if changes made in the runtime environment aren't reflected in the codebase. In our case, the repository indicated an azd-managed infrastructure, with potential manifests (e.g., containerapp.tmpl.yaml) generated via azd infra synth. We observed that the current runtime was 1 CPU and 2Gi, while the IaC might not reflect these values. To address this, we recommended performing an IaC drift check: comparing the current runtime configuration with the configuration defined in our IaC templates. If there's a mismatch, we need to take action to align them. We proposed two options:

  1. Persist infra templates (azd infra synth) and update the container app resource settings (CPU: 1, memory: 2Gi, minReplicas: 2, maxReplicas: 5) in the Bicep/manifests. This ensures our IaC accurately reflects the current runtime configuration.
  2. Adjust the runtime back to the original values if desired. This might be appropriate if the scaling was a temporary measure or if we want to maintain a consistent configuration across environments.

Regardless of the chosen approach, aligning IaC with the runtime is essential for preventing configuration drift and ensuring our infrastructure is managed consistently.
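For the first option, the relevant portion of the persisted manifest would need to carry the scaled values. The fragment below is an illustrative shape only — the exact structure of containerapp.tmpl.yaml depends on the azd version that generated it, so treat the surrounding keys as assumptions and map the four values into whatever your synthesized template actually contains:

```yaml
# Illustrative fragment of containerapp.tmpl.yaml reflecting the scaled
# runtime, so the IaC matches what is actually deployed.
properties:
  template:
    containers:
      - name: octopetsapi
        resources:
          cpu: 1
          memory: 2Gi
    scale:
      minReplicas: 2
      maxReplicas: 5
```

Once committed, a subsequent deploy from IaC will no longer silently revert the mitigation — which is exactly the failure mode a drift check is meant to catch.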

Next Steps: Ensuring Long-Term Stability

To ensure the long-term stability of the Octopets API, we outlined several next steps:

  1. Implement EF Core ValueComparer fixes: We need to apply the proposed ValueComparer fixes to the Listing.AllowedPets and Listing.Amenities properties. This involves modifying the codebase and adding unit/integration tests to verify the changes. Thorough testing is crucial to ensure the fixes don't introduce new issues.
  2. Review error handling middleware: We need to review our error handling middleware to ensure it appropriately maps exceptions to HTTP status codes. This will provide more informative error messages to clients and simplify debugging.
  3. Persist and align IaC templates: We need to persist our IaC templates and align them with the current scaled configuration. This will prevent configuration drift and ensure our infrastructure is managed consistently.
  4. Monitor for recurrence: We need to continuously monitor the Octopets API for recurrence of 5xx errors. If they reappear, we need to capture detailed error logs to facilitate further investigation. Adding robust logging and telemetry is essential for surfacing exceptions and identifying the root cause of issues.

Conclusion

So, there you have it, guys! We've walked through the entire process of investigating, mitigating, and resolving the Octopets backend 5xx errors. From the initial incident context to the proposed code fixes and IaC adjustments, we've covered a lot of ground. Remember, troubleshooting isn't just about fixing the immediate problem; it's about learning and improving our systems to prevent future issues. By addressing the EF Core warnings, strengthening our error handling, aligning our IaC, and continuously monitoring our application, we can build a more robust and reliable system. And that's what it's all about, right? Keep learning, keep improving, and keep those Octopets running smoothly!