SQS Provider Fails To Recover After Queue Deletion Outside Terraform
Hey guys! Ever run into a situation where you've got your infrastructure all set up with Terraform, and then someone goes rogue and deletes a queue directly from the AWS console or CLI? It's a pain, right? Well, this article dives into a tricky issue where the AWS provider for Terraform doesn't quite handle this scenario as smoothly as we'd like. We're going to break down the problem, look at the error messages, and see why Terraform isn't automatically recreating the queue when it should. So, buckle up, and let's get into it!
The Problem: SQS Queue Deletion Outside Terraform
So, here’s the deal. Imagine you've defined your Simple Queue Service (SQS) queue in your Terraform configuration. Everything's running smoothly, and your applications are happily sending and receiving messages. But then, someone (maybe accidentally, maybe not!) deletes the queue directly from the AWS console or using the AWS CLI. What happens when you run terraform apply again? Ideally, Terraform should detect that the queue is missing and recreate it, right? Well, in some cases, it doesn't, and that's what we're here to discuss.
Expected Behavior vs. Actual Behavior
Ideally, Terraform should detect that the queue has been deleted outside of its management and offer to recreate it. This is what we expect from infrastructure-as-code tools – they should reconcile the desired state (defined in our configuration) with the actual state in the cloud. If there's a discrepancy, Terraform should take action to correct it.
However, the actual behavior is that terraform apply fails with an error. This error typically indicates that Terraform has timed out while waiting for the state to become 'success'. What's happening under the hood? Terraform is hitting the GetQueueAttributes endpoint multiple times, but it's receiving a QueueDoesNotExist error each time. It's like Terraform is stuck in a loop, repeatedly asking for something that isn't there anymore, and eventually giving up.
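To make that loop concrete, here's a tiny shell model of it. This is only an illustration and not the provider's actual code: fake_get_queue_attributes and retry_until_timeout are made-up names, and the timeout and polling interval are just parameters picked for the example.

```shell
#!/bin/sh
# Illustrative model only; this is NOT the AWS provider's real code.

# Hypothetical stand-in for the SQS GetQueueAttributes call against a
# queue that has already been deleted: it always fails.
fake_get_queue_attributes() {
  echo "QueueDoesNotExist: The specified queue does not exist." >&2
  return 1
}

# Retry a command until it succeeds or the time budget runs out.
# $1: timeout in seconds, $2: pause between attempts in seconds,
# remaining arguments: the command to run.
retry_until_timeout() {
  timeout=$1
  interval=$2
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  last_state="none"
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if "$@" 2>/dev/null; then
      echo "state: success"
      return 0
    fi
    # Every failure is treated as retryable, so a missing queue is
    # never recognized as "gone" and the loop just keeps asking.
    last_state="retryableerror"
    sleep "$interval"
  done
  echo "timeout while waiting for state to become 'success' (last state: '$last_state', timeout: ${timeout}s)"
  return 1
}
```

Running retry_until_timeout 20 1 fake_get_queue_attributes burns through the whole time budget and then gives up with a timeout message, which is the shape of the failure described above. A loop that treated "queue does not exist" as a final answer would return right away instead.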
Diving Deeper into the Error
To really understand what's going on, let's look at a specific error message. You might see something like this in your console:
aws_sqs_queue.this: Refreshing state... [id=https://sqs.us-east-1.amazonaws.com/012345678901/MyQueue]
╷
│ Error: reading SQS Queue (https://sqs.us-east-1.amazonaws.com/012345678901/MyQueue): timeout while waiting for state to become 'success' (last state: 'retryableerror', timeout: 20s)
│
│ with aws_sqs_queue.this,
│ on queue.tf line 1, in resource "aws_sqs_queue" "this":
│ 1: resource "aws_sqs_queue" "this" {
This error tells us a few things. First, Terraform is trying to refresh the state of the aws_sqs_queue resource. It's using the queue's URL as the ID, which makes sense. However, it's timing out while waiting for the state to become 'success'. The last state it saw was 'retryableerror', which suggests that Terraform is encountering an error it thinks it can retry, but it eventually gives up after 20 seconds. This timeout indicates that Terraform is not correctly handling the QueueDoesNotExist error and initiating the recreation of the queue.
Reproducing the Issue: A Step-by-Step Guide
Okay, so how can you reproduce this issue yourself? It's actually pretty straightforward. Let's walk through the steps:
- Apply the Configuration: First, you need a basic Terraform configuration that defines an SQS queue. Here’s an example:

    resource "aws_sqs_queue" "this" {
      name = "MyQueue"
    }

    output "sqs_queue_url" {
      value = aws_sqs_queue.this.url
    }

  Apply this configuration using terraform apply. This will create the SQS queue in your AWS account.
- Delete the Queue: Now, the fun part. Go to the AWS console or use the AWS CLI to delete the queue. For example, using the CLI, you'd run:

    aws sqs delete-queue --queue-url https://...

  Replace https://... with the actual URL of your queue. This action simulates a real-world scenario where a resource is deleted outside of Terraform.
- Attempt to Apply Again: Finally, run terraform apply again. This is where you should see the error. Terraform will try to refresh the state of the queue, fail to find it, and eventually time out, giving you the error we discussed earlier.
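If you'd rather run the whole thing in one go, the three steps can be sketched as a small shell function. It assumes the example configuration (with its sqs_queue_url output) sits in the current directory, that terraform init has already been run, and that your AWS credentials are set up. The function name reproduce_sqs_drift is made up for this example, and -auto-approve is added only so the script doesn't stop to ask for confirmation.

```shell
#!/bin/sh
# Reproduction sketch. Assumes the example configuration is in the
# current directory and AWS credentials are already configured.
reproduce_sqs_drift() {
  # Step 1: create the queue through Terraform.
  terraform apply -auto-approve || return 1

  # Read the queue URL from the sqs_queue_url output of the example.
  queue_url=$(terraform output -raw sqs_queue_url) || return 1

  # Step 2: delete the queue behind Terraform's back.
  aws sqs delete-queue --queue-url "$queue_url" || return 1

  # Step 3: apply again. On an affected provider version this is the
  # run that is expected to end in the timeout error.
  terraform apply -auto-approve
}
```

Only try this in a sandbox account, since it really does create and delete a queue.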
Why This Matters
This issue highlights a crucial aspect of infrastructure-as-code: state management. Terraform relies on its state file to understand the current state of your infrastructure. When a resource is deleted outside of Terraform, the state file becomes out of sync. Ideally, Terraform should be able to detect this and take corrective action. However, in this case, the AWS provider isn't correctly handling the QueueDoesNotExist error, leading to a timeout instead of a recreation.
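Until the provider handles this on its own, one possible way to bring the state file back in line by hand is to confirm the queue is really gone and then drop it from state, so the next apply plans a fresh create. Treat this as a sketch under assumptions rather than an official fix: forget_queue_if_missing is a made-up helper name, and it assumes the AWS CLI error text contains either NonExistentQueue or QueueDoesNotExist (the two names that show up in the provider's debug output).

```shell
#!/bin/sh
# Workaround sketch, not an official fix.
# $1: Terraform resource address, e.g. aws_sqs_queue.this
# $2: URL of the queue as recorded in state
forget_queue_if_missing() {
  address=$1
  queue_url=$2

  # Capture only stderr; the command's exit status decides the branch.
  if err=$(aws sqs get-queue-attributes --queue-url "$queue_url" \
             --attribute-names QueueArn 2>&1 >/dev/null); then
    echo "queue still exists; state left untouched"
    return 0
  fi

  case "$err" in
    *NonExistentQueue*|*QueueDoesNotExist*)
      # Assumption: either string in the error means the queue is gone.
      terraform state rm "$address"
      ;;
    *)
      # Credentials, network, or other problems: do not touch state.
      echo "unexpected error, state left untouched: $err" >&2
      return 1
      ;;
  esac
}
```

You'd call it as forget_queue_if_missing aws_sqs_queue.this followed by your queue URL, then run terraform apply. Note that it deliberately refuses to touch state on any error it doesn't recognize.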
Debugging the Issue: Peeking Under the Hood
To really get a handle on what's going wrong, let's dive into the debug logs. Terraform's debug logging can give us a ton of insight into the API calls it's making and the responses it's getting.
Enabling Debug Logging
To enable debug logging, you can set the TF_LOG environment variable to DEBUG. For example:
export TF_LOG=DEBUG
terraform apply
This will produce a lot of output, but it's incredibly valuable for troubleshooting.
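Since the output is so long, it can help to send it to a file instead of the terminal. Terraform reads the TF_LOG_PATH environment variable for that. Here's a small wrapper as a sketch; the name run_with_debug_log and the log file name used below are just choices made for this example.

```shell
#!/bin/sh
# Run a command with Terraform debug logging written to a file.
# $1: path of the log file, remaining arguments: the command to run.
run_with_debug_log() {
  log_file=$1
  shift
  # The two variables are set only for this one command, so the rest of
  # the shell session keeps its normal logging settings.
  TF_LOG=DEBUG TF_LOG_PATH="$log_file" "$@"
}
```

For example, run_with_debug_log ./tf-debug.log terraform apply leaves the debug output in tf-debug.log, ready to be searched afterwards.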
Analyzing the Logs
In the debug logs, you'll see Terraform making GetQueueAttributes calls to the SQS API. You'll also see the API returning a QueueDoesNotExist error. The key thing to look for is how the provider handles this error. In the problematic scenario, the provider seems to treat this as a retryable error and keeps trying for a certain period before timing out.
Here's a snippet of what you might see in the logs:
2025-07-25T11:17:25.198+1000 [DEBUG] provider.terraform-provider-aws_v6.4.0_x5: HTTP Request Sent: ...
2025-07-25T11:17:26.265+1000 [DEBUG] provider.terraform-provider-aws_v6.4.0_x5: HTTP Response Received: ... {"__type":"com.amazonaws.sqs#QueueDoesNotExist","message":"The specified queue does not exist."}...
2025-07-25T11:17:26.265+1000 [DEBUG] provider.terraform-provider-aws_v6.4.0_x5: request failed with unretryable error https response error StatusCode: 400, RequestID: d9904739-b462-51f9-8bb9-27c569042cf9, AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist.: ...
You'll notice that even though the API returns a QueueDoesNotExist error, the provider initially flags it as an