Infrastructure-as-code (IaC) errors are a rite of passage. Whether you're using Terraform or CloudFormation (or both), you've seen the dreaded red text in the terminal at 3 AM. I've spent years debugging these failures, and the patterns are surprisingly consistent. This article is a field guide, not an introduction to IaC, for engineers who need to get to the root cause fast.
I'll cover the non-obvious stuff: how to parse error messages as structured data, why state drift is the silent killer, and the specific AWS API quirks that trip up both tools. I'll also share a war story from a production incident that taught me more than any documentation.
The Anatomy of an Error Message
Most engineers read error messages like prose. That's a mistake. Treat them as structured output: the resource type, the action, the reason, and the request ID. In Terraform, errors come from the provider's API calls. In CloudFormation, they're wrapped in stack events. Strip away the noise.
Here's an example Terraform error:
Error: Error creating RDS DB Instance: InvalidParameterCombination: Cannot find version 5.7.mysql_aurora.2.10.2 for Aurora MySQL
    status code: 400, request id: abc-123
Break it down: action = 'creating RDS DB Instance', service = 'RDS', error code = 'InvalidParameterCombination', reason = 'Cannot find version...'. The request ID is your golden ticket for AWS Support. For CloudFormation, the equivalent is in the stack events:
aws cloudformation describe-stack-events --stack-name my-stack --query "StackEvents[?ResourceStatus=='CREATE_FAILED']"
I always write a small script to parse these into a table. It saves hours of scrolling.
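A minimal sketch of such a script, in Python. It assumes the Terraform error has the same shape as the example above (other providers and provider versions may format errors differently) and that the stack events come from the JSON output of `describe-stack-events`. The function names `parse_terraform_error` and `failed_events_table` are hypothetical.

```python
import re

# Matches the shape of the example error above. This is an assumption:
# real provider output varies between versions.
TF_ERROR = re.compile(
    r"Error: Error (?P<action>.+?): (?P<code>\w+): (?P<reason>.+?)\s*"
    r"status code: (?P<status>\d+), request id: (?P<request_id>\S+)",
    re.DOTALL,
)


def parse_terraform_error(text):
    """Split an error into action, code, reason, status, and request ID."""
    match = TF_ERROR.search(text)
    return match.groupdict() if match else None


def failed_events_table(events):
    """Render failed stack events (dicts from describe-stack-events) as rows."""
    rows = [
        (e["LogicalResourceId"], e["ResourceType"], e.get("ResourceStatusReason", ""))
        for e in events
        if e.get("ResourceStatus", "").endswith("_FAILED")
    ]
    width = max((len(row[0]) for row in rows), default=0)
    return "\n".join(f"{row[0]:<{width}}  {row[1]}  {row[2]}" for row in rows)


error_text = (
    "Error: Error creating RDS DB Instance: InvalidParameterCombination: "
    "Cannot find version 5.7.mysql_aurora.2.10.2 for Aurora MySQL\n"
    "\tstatus code: 400, request id: abc-123"
)
parsed = parse_terraform_error(error_text)
print(parsed["code"], parsed["request_id"])
```

Returning `None` for unrecognized text keeps the parser honest: an error it cannot split is better read by hand than guessed at.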
State Drift: The Silent Killer
The most common Terraform error pattern I see is "ResourceInstance has been destroyed" or "ResourceInstance does not exist" — but the resource is still running. This is state drift. Someone (or a script) deleted or modified the resource outside Terraform. The state file is now out of sync.
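Your first instinct might be to run `terraform refresh`. But be careful: refresh rewrites state to match what actually exists without showing you the changes first, and the next apply will then try to revert any manual change that differs from your configuration. Newer Terraform releases deprecate the command in favor of `terraform plan -refresh-only` and `terraform apply -refresh-only`, which let you review detected changes before accepting them. I prefer to start with `terraform plan -refresh=false` to see what Terraform thinks, then manually inspect the resource.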
Never run `terraform apply` on a drifted state without understanding the diff. I've seen engineers accidentally destroy production databases because they assumed refresh would fix everything.
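One way to make that review harder to skip is a small gate over the plan JSON. The sketch below assumes a plan saved with `terraform plan -out=tfplan` and exported with `terraform show -json tfplan`. The function name `risky_changes` is hypothetical, and the `resource_drift` key is treated as optional because not every Terraform version or plan includes it.

```python
import json


def risky_changes(plan_json):
    """Return (deletions, drifted) resource addresses from plan JSON text."""
    plan = json.loads(plan_json)
    # A replacement shows up as both "delete" and "create", so it is caught too.
    deletions = [
        change["address"]
        for change in plan.get("resource_changes", [])
        if "delete" in change["change"]["actions"]
    ]
    drifted = [item["address"] for item in plan.get("resource_drift", [])]
    return deletions, drifted


# Illustrative input only; a real plan document has many more fields.
sample = json.dumps({
    "resource_changes": [
        {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
    ],
    "resource_drift": [
        {"address": "aws_db_instance.main", "change": {"actions": ["update"]}},
    ],
})
deletions, drifted = risky_changes(sample)
print("would delete:", deletions)
print("drifted:", drifted)
```

In CI, a non-empty `deletions` list could fail the job until someone approves the plan explicitly.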
War Story: The Case of the Stuck Rollback
The 3 AM Rollback Loop
- 02:15 PagerDuty alert: CloudFormation stack update stuck in UPDATE_ROLLBACK_IN_PROGRESS for 20 minutes.
- 02:20 Checked stack events: 'Resource creation cancelled' on an RDS instance. The root cause was a change from db.t3.medium to db.r5.large (an immutable attribute), so CloudFormation tried to create a new instance before deleting the old one.
- 02:30 The new instance creation failed because the subnet group didn't have enough IPs. Rollback tried to revert the original instance, but the original was already marked for deletion.
- 02:45 Solution: manually deleted the new (failed) instance, then used `aws cloudformation continue-update-rollback --stack-name my-stack --resources-to-skip MyDBInstance` to skip the stuck resource. Then fixed the subnet group and re-applied.
Lesson
Immutable resource updates are dangerous in CloudFormation. Always check the `Replacement` value for each resource in the change set before applying: `True` or `Conditional` means the resource may be recreated. For RDS, you can sometimes modify the instance class without replacement by using the 'ApplyImmediately' parameter, but test it first.
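The change-set check can be automated. In the JSON returned by `aws cloudformation describe-change-set`, each entry under `Changes` carries a `ResourceChange` with a `Replacement` value. The sketch below flags anything that is not a plain in-place update; the function name `replacements` and the sample data are hypothetical.

```python
def replacements(change_set):
    """List resources that a change set would, or might, replace."""
    flagged = []
    for change in change_set.get("Changes", []):
        resource = change.get("ResourceChange", {})
        # Replacement is "True", "False", or "Conditional" for modified resources.
        if resource.get("Replacement") in ("True", "Conditional"):
            flagged.append(
                (resource["LogicalResourceId"], resource["ResourceType"], resource["Replacement"])
            )
    return flagged


# Illustrative input only, trimmed to the fields the function reads.
sample = {
    "Changes": [
        {"Type": "Resource", "ResourceChange": {
            "Action": "Modify", "LogicalResourceId": "MyDBInstance",
            "ResourceType": "AWS::RDS::DBInstance", "Replacement": "True"}},
        {"Type": "Resource", "ResourceChange": {
            "Action": "Modify", "LogicalResourceId": "AppBucket",
            "ResourceType": "AWS::S3::Bucket", "Replacement": "False"}},
    ]
}
flagged = replacements(sample)
for logical_id, resource_type, replacement in flagged:
    print(f"{logical_id} ({resource_type}): Replacement={replacement}")
```

Treating `Conditional` the same as `True` is deliberately conservative: it means CloudFormation cannot tell until execution time whether the resource will be recreated.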
The error message is never the full story — it's just the first sentence.
Cross-Account IAM: The AccessDenied Trap
Both Terraform and CloudFormation rely on IAM roles to make API calls. Cross-account scenarios are where most AccessDenied errors hide. For example, Terraform using an IAM role from a different account to create resources in a shared VPC. The role must have a trust policy that allows the principal from the other account.
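When an AssumeRole call is denied, the first thing I check is whether the trust policy names the calling account at all. A rough sketch of that check follows, assuming the policy has already been fetched and parsed into a dict. The function name `trusts_account` is hypothetical, and the check deliberately ignores conditions, wildcards, Deny statements, and non-`aws` partitions.

```python
def trusts_account(trust_policy, account_id):
    """True if an Allow statement lets principals in account_id call sts:AssumeRole."""
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        if "sts:AssumeRole" not in actions:
            continue
        principal = statement.get("Principal", {})
        aws = principal.get("AWS", []) if isinstance(principal, dict) else []
        aws = [aws] if isinstance(aws, str) else aws
        for entry in aws:
            # A principal may be a bare account ID or an ARN in that account.
            if entry == account_id or entry.startswith(f"arn:aws:iam::{account_id}:"):
                return True
    return False


policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:root"},
        "Action": "sts:AssumeRole",
    }],
}
print(trusts_account(policy, "123456789012"))
print(trusts_account(policy, "999999999999"))
```

A `False` here points at the trust policy; a `True` means the denial is more likely coming from the caller's own permissions or from a condition this sketch does not evaluate.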
I once spent hours debugging a CloudFormation stack that kept failing with 'AccessDenied' when creating a VPC peering connection. The template looked correct. The issue was that the service-linked role for VPC peering didn't exist in the target account. The fix: create the role manually first or use a custom resource.
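For the cross-account case, a trust policy that lets principals in account 123456789012 assume the role looks like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {}
    }
  ]
}
Parameter Validation: The Silent Failure Before the Failure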
CloudFormation templates often fail before they even start. Parameter validation errors are common: wrong type, missing allowed values, or regex mismatch. But the error message is often cryptic: 'Parameters: [InstanceType] must be a valid EC2 instance type'.
To catch these early, use `cfn-lint` (the AWS CloudFormation Linter) locally. It validates parameters, resource properties, and even cross-resource references. I run it both as a pre-commit hook and in CI. Here's an example:
cfn-lint my-template.yaml --ignore-checks W
For Terraform, `terraform validate` catches syntax errors, but not semantic ones like invalid AMI IDs. I use `terraform plan -out=tfplan` and then `terraform show -json tfplan | jq '.resource_changes[] | select(.change.actions[] == "create")'` to review what will be created.
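To make the CloudFormation parameter checks mentioned above concrete, here is a rough sketch of what a validator does with `AllowedValues` and `AllowedPattern`. It assumes string parameters and assumes the pattern must match the whole value. Python's regex dialect is not identical to the one CloudFormation uses, so treat this as an illustration and not a substitute for cfn-lint. The function name `validate_parameter` is hypothetical.

```python
import re


def validate_parameter(name, value, definition):
    """Return a list of problems for one value against its template definition."""
    problems = []
    allowed = definition.get("AllowedValues")
    if allowed is not None and value not in allowed:
        problems.append(f"{name}: {value!r} is not in AllowedValues")
    pattern = definition.get("AllowedPattern")
    # Assumption: the pattern has to match the entire value, hence fullmatch.
    if pattern is not None and not re.fullmatch(pattern, value):
        problems.append(f"{name}: {value!r} does not match AllowedPattern {pattern}")
    return problems


instance_type = {"Type": "String", "AllowedValues": ["t3.micro", "t3.small"]}
bucket_prefix = {"Type": "String", "AllowedPattern": "[a-z][a-z0-9-]*"}

print(validate_parameter("InstanceType", "t3.mirco", instance_type))
print(validate_parameter("BucketPrefix", "My_Bucket", bucket_prefix))
```

Naming the parameter and the offending value in each message is the part worth copying: it is exactly the detail the cryptic service-side error tends to leave out.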
Nested Stacks: The Error Obfuscator
CloudFormation nested stacks are great for modularity but terrible for debugging. When a nested stack fails, the parent stack shows 'CREATE_FAILED' with reason 'Nested stack <name> failed'. That's it. You then have to describe the nested stack events separately. I've seen teams waste hours because they didn't drill into the right nested stack.
My rule: never use more than two levels of nesting. And always set `DeletionPolicy: Retain` on nested stacks so you can inspect them after failure. Also, enable detailed CloudTrail logging for CloudFormation.
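The drill-down itself is mechanical, so it is worth scripting. The sketch below follows failed `AWS::CloudFormation::Stack` resources down to the resources that actually failed. It takes a caller-supplied `fetch_events` function (hypothetical; in practice it would wrap `describe-stack-events`) that returns the `StackEvents` list for a stack name or ID, which also keeps the logic testable without AWS access.

```python
def find_root_failures(stack, fetch_events):
    """Follow failed nested stacks down to the resources that actually failed."""
    failures = []
    visited = set()
    for event in fetch_events(stack):
        if not event.get("ResourceStatus", "").endswith("_FAILED"):
            continue
        physical_id = event.get("PhysicalResourceId", "")
        is_stack = event.get("ResourceType") == "AWS::CloudFormation::Stack"
        # A stack also emits events about itself; those have
        # PhysicalResourceId == StackId and must not be followed.
        if is_stack and physical_id and physical_id != event.get("StackId"):
            if physical_id not in visited:
                visited.add(physical_id)
                failures.extend(find_root_failures(physical_id, fetch_events))
        elif not is_stack:
            failures.append(
                (event["StackName"], event["LogicalResourceId"],
                 event.get("ResourceStatusReason", ""))
            )
    return failures


# Illustrative events, trimmed to the fields the function reads.
fake_events = {
    "parent": [
        {"StackId": "arn:parent", "StackName": "parent", "LogicalResourceId": "Network",
         "PhysicalResourceId": "arn:child", "ResourceType": "AWS::CloudFormation::Stack",
         "ResourceStatus": "CREATE_FAILED", "ResourceStatusReason": "Nested stack failed"},
    ],
    "arn:child": [
        {"StackId": "arn:child", "StackName": "child", "LogicalResourceId": "Subnet",
         "PhysicalResourceId": "", "ResourceType": "AWS::EC2::Subnet",
         "ResourceStatus": "CREATE_FAILED", "ResourceStatusReason": "CIDR conflict"},
        {"StackId": "arn:child", "StackName": "child", "LogicalResourceId": "child",
         "PhysicalResourceId": "arn:child", "ResourceType": "AWS::CloudFormation::Stack",
         "ResourceStatus": "CREATE_FAILED"},
    ],
}
root_failures = find_root_failures("parent", fake_events.__getitem__)
print(root_failures)
```

Sibling resources that were merely cancelled also report a failed status, so expect a few 'Resource creation cancelled' rows next to the real cause.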
of CloudFormation errors are caused by IAM permissions or parameter validation (source: internal analysis of 500+ incidents)
Debugging Terraform Provider Issues
Sometimes the error isn't in your code — it's in the provider. Terraform providers are versioned and can have bugs. I once had a provider update that broke the `aws_lb_listener_rule` resource because the API changed the order of conditions. The fix was to pin the provider version.
Set provider version constraints in your Terraform configuration:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0"
    }
  }
}
When you hit a provider bug, upgrade to the latest patch or downgrade to a known good version. Check the provider's changelog and GitHub issues. Also, enable Terraform's debug logging with `TF_LOG=DEBUG`, but be warned: it's verbose.
Set TF_LOG_PATH to a file to avoid cluttering your terminal: `TF_LOG=DEBUG TF_LOG_PATH=./terraform.log terraform apply`
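It helps to know exactly what a `~>` constraint permits, because `~> 4.0` allows every 4.x release and not only patch updates. The sketch below mimics the operator for plain numeric versions. It assumes at least two components in the constraint and ignores pre-release suffixes; the function name `satisfies_pessimistic` is hypothetical and the version numbers are illustrative.

```python
def satisfies_pessimistic(version, constraint):
    """Check a version such as '4.67.0' against a '~>' constraint such as '4.0'."""
    have = [int(part) for part in version.split(".")]
    want = [int(part) for part in constraint.split(".")]
    have += [0] * (len(want) - len(have))  # pad short versions such as "4.1"
    fixed = len(want) - 1  # every component before the last must match exactly
    return have[:fixed] == want[:fixed] and have[fixed:] >= want[fixed:]


print(satisfies_pessimistic("4.67.0", "4.0"))   # a later minor release of 4.x
print(satisfies_pessimistic("5.0.0", "4.0"))    # next major version
print(satisfies_pessimistic("4.0.5", "4.0.1"))  # patch update only
print(satisfies_pessimistic("4.1.0", "4.0.1"))  # minor bump under a patch-level pin
```

If a minor release is what broke you, a three-component constraint narrows upgrades to patch releases of that minor version.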
Wrapping Up
Debugging IaC errors is a skill that improves with systematic practice. Start by parsing the error message structure, then check for state drift, IAM permissions, and parameter validation. Use the tools available (cfn-lint, `terraform validate`, and debug logs), but don't rely on them blindly.
The next time you see red at 3 AM, remember: it's never magic, it's just a misconfiguration waiting to be found.
Frequently asked questions
Why does Terraform say 'ResourceInstance has been destroyed' but the resource still exists?
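This usually indicates state drift: the resource was deleted or changed outside Terraform (e.g., via the console), or the state file was corrupted. Run `terraform plan -refresh-only` to review what changed, then `terraform apply -refresh-only` to sync state (`terraform refresh` on older versions), and finish with a normal `terraform plan` to confirm. If the resource still exists but is missing from state, bring it back with `terraform import`.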
CloudFormation stuck in 'UPDATE_ROLLBACK_IN_PROGRESS' — how do I recover?
First, check the stack events for the specific failure. Common causes: resource replacement failure (e.g., RDS instance class change) or IAM permissions. If the rollback ends in UPDATE_ROLLBACK_FAILED, you can skip the stuck resource with `aws cloudformation continue-update-rollback` and `--resources-to-skip` (risky, because the stack's view of that resource may no longer match reality). Sometimes you need to manually fix the resource, or delete the stack while retaining the resources you want to keep.
How do I debug 'ValidationError: Template format error: Unrecognized resource type' in CloudFormation?
This error means the resource type is not supported in the current region or you're using a type from a registered extension that hasn't been activated. Check the region support matrix and verify the type ARN. Use `aws cloudformation list-types` to see the types available in the account and region, and `aws cloudformation list-type-versions` to see the versions of a specific type.
Terraform apply fails with 'Error creating IAM Role: MalformedPolicyDocument' — what's wrong?
This is almost always malformed JSON in the assume role policy. Common issues: missing required fields (like `Statement`), invalid principal, or unescaped characters. Validate the policy document with a JSON linter or IAM Access Analyzer (`aws accessanalyzer validate-policy`) before running apply; `aws iam simulate-custom-policy` is built for permission policies and is a poor fit for trust policies.