LEARN · DEBUGGING GUIDE

Debugging CloudFormation Stack Rollback: Finding the Real Reason

CloudFormation rollbacks hide the actual error behind a generic message. You need to dig into resource statuses, CloudTrail logs, and stack policies to find the real cause. Here's exactly how.

Intermediate · Cloud · 9 min read

What this usually means

CloudFormation rollbacks are designed to be safe, but the error reporting is notoriously opaque. The rollback reason you see in the console is often a generic message like 'Resource creation cancelled' or 'The following resource(s) failed to create' without specifics. The real error lives in the resource-level status reason of the first resource that failed. Resources that never started creating, because a dependency failed first, carry no reason of their own. In many cases the actual failure is not in the resource you suspect: it is a prerequisite such as a VPC or IAM role whose failure event sits further back in the history, or, on updates, a stack policy that blocked the change. Size limits (stack policy body, template body) are a different failure mode: they normally reject the request at submission rather than trigger a rollback, so rule them out early. The worst cases involve nested stacks, where the parent stack's events give little clue about what failed inside the child and you have to read the child stack's events directly.

( 01 ) Fast diagnosis

The first ten minutes — establish facts before touching code.

  1. Run 'aws cloudformation describe-stack-events --stack-name <stack-id>' and look for 'CREATE_FAILED' or 'UPDATE_FAILED'. Focus on the ResourceStatusReason field of the earliest failure, not just the status.
  2. Check CloudTrail > Event history with the event source filter set to 'cloudformation.amazonaws.com' and look for 'CreateStack' or 'UpdateStack' events with an errorCode.
  3. If using nested stacks, describe each nested stack's events separately. The parent stack's events only summarize the nested stack's failure.
  4. Inspect stack policies: 'aws cloudformation get-stack-policy --stack-name <name>'. Look for a policy that denies, or fails to allow, updates to the resources you are changing.
  5. Check quotas: 'aws cloudformation describe-account-limits' reports account-level limits such as the stack count. Also compare your inputs against size limits like the stack policy body (16,384 bytes) and the template body (51,200 bytes when passed inline; templates uploaded to S3 have a higher limit).
  6. Look for async resource failures: if a resource like Lambda or RDS takes too long to stabilize, CloudFormation may time out and roll back even if the resource would eventually have succeeded.
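
The first step above can be sketched in Python as a filter over the JSON that 'aws cloudformation describe-stack-events --output json' prints. This is a minimal sketch rather than a complete tool: the function name and the sample events are illustrative, and only the field names (StackEvents, Timestamp, LogicalResourceId, ResourceStatus, ResourceStatusReason) come from the CLI output.

```python
import json


def failed_events(describe_output: str) -> list[dict]:
    """Return *_FAILED events, oldest first, from describe-stack-events JSON."""
    events = json.loads(describe_output)["StackEvents"]
    failed = [e for e in events if e["ResourceStatus"].endswith("_FAILED")]
    # The CLI lists newest events first. Timestamps in one consistent
    # ISO 8601 format sort correctly as strings, so sort ascending.
    return sorted(failed, key=lambda e: e["Timestamp"])


# Illustrative sample, not real CloudFormation output.
SAMPLE = json.dumps({"StackEvents": [
    {"Timestamp": "2024-01-01T10:03:00Z", "LogicalResourceId": "AppServer",
     "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "Resource creation cancelled"},
    {"Timestamp": "2024-01-01T10:02:00Z", "LogicalResourceId": "AppRole",
     "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "(hypothetical) error text from the service"},
    {"Timestamp": "2024-01-01T10:01:00Z", "LogicalResourceId": "AppVpc",
     "ResourceStatus": "CREATE_COMPLETE"},
]})

for event in failed_events(SAMPLE):
    print(event["Timestamp"], event["LogicalResourceId"],
          event.get("ResourceStatusReason", ""))
```

Reading the failures oldest first puts the likely root cause at the top and the cancellations it triggered underneath.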

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • CloudFormation console > Stacks > Events: read every row, especially the status reason for CREATE_FAILED/UPDATE_FAILED entries
  • CloudTrail > Event history: filter by CloudFormation and look for an 'errorCode' such as 'ValidationError', 'InsufficientCapabilitiesException', or 'LimitExceededException'
  • AWS CLI: 'aws cloudformation describe-stack-events --stack-name <id> --output json', then parse the JSON to get the full error messages that the console may truncate
  • Nested stacks: 'aws cloudformation describe-stack-resources --stack-name <parent> --query "StackResources[?ResourceType=='AWS::CloudFormation::Stack'].PhysicalResourceId" --output text', then describe events on each physical ID
  • Stack policy: if a resource update is denied by policy, most of the rollback events only say 'Resource update cancelled', so check the stack policy document for Deny statements
  • CloudWatch Logs for custom resource Lambdas: if a custom resource fails, its Lambda logs are often the only place with the real error. Find the 'CREATE_FAILED' custom resource in the stack events, then open the log group of the Lambda behind it
  • Service Quotas console > AWS CloudFormation: check whether you are near the limits for stack count or StackSets

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

  • Dependency failures: a resource that the suspect resource depends on (via DependsOn or an intrinsic function such as Ref or Fn::GetAtt) failed first, so the dependent resource never even starts creating
  • Stack policy blocks update: a Deny statement, or the default deny that applies once any stack policy is set, prevents CloudFormation from modifying a resource, and the update rolls back
  • Async resource timeout: resources like RDS, Redshift, or ElastiCache take too long to become available and CloudFormation gives up waiting (the limit varies by resource type; custom resources default to 1 hour). Other resources still in progress are then marked 'Resource creation cancelled'
  • Nested stack failure hidden: the parent stack rolls back but the nested stack's events are not listed among the parent's events, so you must inspect each nested stack separately
  • IAM permissions gap: CloudFormation lacks permission to call a service API (e.g., ec2:CreateSecurityGroup). The failing resource's status reason may be a terse or encoded authorization message, while sibling resources only show 'Resource creation cancelled'
  • Template validation passed but runtime fails: a parameter value such as an AMI ID is well-formed at submission, but the AMI has been deregistered or belongs to a different region when the resource is created

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

  • Add explicit DependsOn to force serial creation and expose which dependency fails, then fix that dependency's configuration
  • Stack policy: a stack policy cannot be removed once set, so either pass a temporary override with '--stack-policy-during-update-body' or update the policy with an Allow statement for 'LogicalResourceId/<id>' of the resource you're updating, then re-attempt the update
  • Timeouts: most resource types, including AWS::RDS::DBInstance, have no timeout property. Use the settings that do exist: 'TimeoutInMinutes' on a nested AWS::CloudFormation::Stack, the CreationPolicy timeout on resources that wait for signals, or 'ServiceTimeout' on custom resources
  • For nested stacks, implement a manual rollback strategy: before retrying, describe each nested stack's events and fix the root cause in the nested template
  • Give CloudFormation a service role (passed with '--role-arn') that has the permissions the failing resource needs, and make sure the role's trust policy allows cloudformation.amazonaws.com
  • Use cfn-lint or checkov to catch template errors before deployment. These are static checks, so verify runtime facts (e.g., AMI availability, subnet CIDR conflicts) separately against the target account and region

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

  • Re-run the stack creation/update and poll the stack events with 'aws cloudformation describe-stack-events --stack-name <id> --query "StackEvents[?ResourceStatus=='CREATE_IN_PROGRESS']"' to confirm no resource stays in progress beyond the expected time
  • After the fix, check the previously failing resource: its status should be 'CREATE_COMPLETE' (or 'UPDATE_COMPLETE') with an empty status reason
  • Verify CloudTrail shows no errorCode for the stack operation. The error fields live inside the 'CloudTrailEvent' JSON string, so run 'aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=CreateStack --query "Events[].CloudTrailEvent" --output json | jq -r ".[] | fromjson | select(.errorCode != null) | .errorCode"' and confirm the output is empty
  • If the issue was a stack policy, run 'aws cloudformation describe-stack-resources --stack-name <id>' and confirm the resources you changed show UPDATE_COMPLETE
  • For nested stacks, check each nested stack's status and events to ensure they are all in CREATE_COMPLETE or UPDATE_COMPLETE state
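
The final check can be sketched as a small predicate over stack statuses, for example ones collected from 'aws cloudformation describe-stacks' for the parent and each nested stack. This is a minimal sketch: the function name and the sample mapping are hypothetical, and the status strings are the ones named above.

```python
# Only these two statuses count as a successful end state here.
HEALTHY = {"CREATE_COMPLETE", "UPDATE_COMPLETE"}


def unhealthy_stacks(statuses: dict[str, str]) -> dict[str, str]:
    """Return the stacks whose status is not a successful terminal state."""
    return {name: s for name, s in statuses.items() if s not in HEALTHY}


# Hypothetical statuses keyed by stack name.
statuses = {
    "vpc-stack": "UPDATE_COMPLETE",
    "app-stack": "UPDATE_COMPLETE",
    "db-stack": "UPDATE_ROLLBACK_COMPLETE",
}
problems = unhealthy_stacks(statuses)
print(problems or "all stacks healthy")
```

Matching on exact status strings matters: UPDATE_ROLLBACK_COMPLETE also ends in COMPLETE, yet it means the update failed and was rolled back.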

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

  • Do not rely solely on the console error message. It can truncate and hide details, so always cross-check with the CLI or CloudTrail
  • Do not delete and recreate a failed stack without first saving its stack events and stack ID, or you lose easy access to the error history
  • Do not assume the failing resource in the console is the root cause; it's often a downstream effect of a dependency failure
  • Do not skip stack policy inspection, because a Deny in the policy is a common cause of 'Resource update cancelled'
  • Do not ignore nested stacks as separate entities; treat each nested stack as a full stack that needs independent debugging
  • Do not increase timeouts without understanding why the resource is slow, since that may mask a deeper issue like insufficient capacity

( 07 ) War story

The Case of the Silent Nested Stack Rollback

Senior DevOps Engineer · AWS CloudFormation with nested stacks (parent: prod-infra, children: vpc-stack, app-stack, db-stack)

Timeline

  1. 14:00 Pushed CloudFormation template changes to update prod-infra stack
  2. 14:02 Console shows UPDATE_ROLLBACK_COMPLETE for prod-infra, no error message
  3. 14:05 Checked stack events: only generic 'Resource update cancelled' for nested stacks
  4. 14:10 Described each nested stack's events: vpc-stack shows UPDATE_COMPLETE, app-stack shows UPDATE_COMPLETE, db-stack shows UPDATE_FAILED with reason 'DB instance creation timed out'
  5. 14:15 Checked RDS console: DB instance was in 'creating' state for 45 minutes, then failed with 'insufficient capacity in this AZ'
  6. 14:20 Realized the template specified a single AZ without Multi-AZ, and that AZ had no capacity
  7. 14:30 Updated template to use Multi-AZ deployment, re-ran stack update
  8. 14:45 Stack update completed successfully; all nested stacks showed UPDATE_COMPLETE

The trouble started when I pushed a routine update to our prod-infra CloudFormation stack. The stack had three nested stacks: vpc, app, and db. Almost immediately, the parent stack went into UPDATE_ROLLBACK_COMPLETE. The console showed nothing useful—just 'Resource update cancelled' for each nested stack. I felt that sinking feeling: without a clear error, I'd have to dig.

I ran the AWS CLI to describe stack events on the parent stack. Still generic. Then I remembered: nested stacks hide their events. I described each child stack's events. VPC and app were fine. The db-stack showed 'UPDATE_FAILED' with a reason that said 'DB instance creation timed out'. That was the clue I needed.

I checked the RDS console. The DB instance was in 'creating' state for 45 minutes before failing with 'insufficient capacity in this AZ'. Our template specified a single AZ without Multi-AZ. That AZ was out of capacity. I updated the template to use Multi-AZ, which spreads across two AZs, and re-ran the update. This time it worked. The lesson: never trust the parent stack error message; always descend into nested stacks individually.

Root cause

The db-stack nested stack failed because the specified Availability Zone had insufficient RDS capacity, causing a timeout. The parent stack only showed 'Resource update cancelled', hiding the real reason.

The fix

Modified the DB stack template to enable Multi-AZ deployment, which distributes the DB instance across two AZs, avoiding single-AZ capacity issues.

The lesson

Always inspect nested stack events individually. The parent stack error message is almost useless for diagnosing the actual failure.

( 08 ) How CloudFormation Rollback Messages Are Generated

When a resource fails to create or update, CloudFormation records the error returned by the service API call as that resource's status reason. Resources that were still in progress at that moment are aborted and get a generic reason such as 'Resource creation cancelled', and the console may shorten long messages.

The actual error is stored in the 'ResourceStatusReason' field of the stack event. You can retrieve it using the CLI: 'aws cloudformation describe-stack-events --stack-name <id> --query "StackEvents[?ResourceStatus=='CREATE_FAILED'].ResourceStatusReason"'. This returns the full, untruncated message. For nested stacks, you must query each child stack's events separately.
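
Because the generic cancellation messages outnumber the real error, it can help to skip them and take the earliest remaining failure. The sketch below assumes the events have already been loaded from the CLI's JSON output into a list of dicts; the function name and sample events are illustrative, and the two cancellation strings are the ones quoted in this guide.

```python
from typing import Optional

# Generic reasons given to resources that were aborted, not the culprit.
CANCELLED = ("Resource creation cancelled", "Resource update cancelled")


def root_cause(events: list[dict]) -> Optional[dict]:
    """Return the earliest failed event whose reason is not a cancellation."""
    candidates = [
        e for e in events
        if e["ResourceStatus"].endswith("_FAILED")
        and not e.get("ResourceStatusReason", "").startswith(CANCELLED)
    ]
    # ISO 8601 timestamps in one consistent format compare correctly as text.
    return min(candidates, key=lambda e: e["Timestamp"], default=None)


# Illustrative sample, not real CloudFormation output.
events = [
    {"Timestamp": "2024-01-01T10:05:00Z", "LogicalResourceId": "Api",
     "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "Resource creation cancelled"},
    {"Timestamp": "2024-01-01T10:04:00Z", "LogicalResourceId": "Database",
     "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "(hypothetical) error returned by the service"},
    {"Timestamp": "2024-01-01T10:03:00Z", "LogicalResourceId": "Network",
     "ResourceStatus": "CREATE_COMPLETE"},
]

cause = root_cause(events)
if cause:
    print(cause["LogicalResourceId"], "-", cause["ResourceStatusReason"])
```

If the function returns nothing for a parent stack, that is a hint to repeat the search inside each nested stack.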

( 09 ) Stack Policies: The Silent Rollback Trigger

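A stack policy can explicitly deny updates to specific resources. Once any stack policy is set, updates to resources it does not explicitly allow are denied as well. When CloudFormation attempts to update a resource the policy denies, it fails that resource and rolls back the update. The denied resource's status reason typically begins with 'Action denied by stack policy', but every other resource in the update only shows 'Resource update cancelled', which is usually what you notice first.

To diagnose, retrieve the stack policy: 'aws cloudformation get-stack-policy --stack-name <name>'. Look for 'Deny' statements whose Resource matches 'LogicalResourceId/<id>' (or a wildcard) or whose Condition matches the resource type, and check that an Allow statement covers the resource you are changing. If the policy is too restrictive, update it to allow the necessary modifications. Remember that stack policies only affect updates, not creation.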
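
Scanning a policy for Deny statements can be sketched as follows. This is a simplified sketch under stated assumptions: the input is the policy document itself (the StackPolicyBody string from 'get-stack-policy'), only Effect and Resource are examined, and the function name, logical IDs, and sample policy are hypothetical.

```python
import json
from fnmatch import fnmatchcase


def blocking_statements(policy_json: str, logical_id: str) -> list[dict]:
    """Return Deny statements whose Resource pattern covers logical_id.

    Simplification: Action, NotResource, and Condition (for example a
    ResourceType condition) are ignored, so treat hits as leads only.
    """
    target = f"LogicalResourceId/{logical_id}"
    hits = []
    for stmt in json.loads(policy_json).get("Statement", []):
        resources = stmt.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        # fnmatchcase gives case-sensitive '*' wildcard matching.
        if stmt.get("Effect") == "Deny" and any(
            fnmatchcase(target, pattern) for pattern in resources
        ):
            hits.append(stmt)
    return hits


# Hypothetical policy: allow everything except replacing one database.
POLICY = json.dumps({"Statement": [
    {"Effect": "Allow", "Action": "Update:*",
     "Principal": "*", "Resource": "*"},
    {"Effect": "Deny", "Action": "Update:Replace",
     "Principal": "*", "Resource": "LogicalResourceId/ProductionDatabase"},
]})

for stmt in blocking_statements(POLICY, "ProductionDatabase"):
    print("Denied by:", stmt["Action"], "on", stmt["Resource"])
```

An empty result does not prove the update is allowed, since a resource with no matching Allow statement is denied by default.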

( 10 ) Async Resource Timeouts and How to Handle Them

Resources like RDS, Redshift, and ElastiCache have asynchronous creation processes. CloudFormation waits for the resource to reach a stable state (e.g., 'available' for RDS). If the resource takes longer than CloudFormation is willing to wait (the limit depends on the resource type; custom resources default to 1 hour), CloudFormation fails the resource and rolls back. Resources that were still in progress at that point are reported as 'Resource creation cancelled'.

To verify, check the resource's own console or API to see if it's stuck in a transient state. Few resource types let you change the wait: 'TimeoutInMinutes' exists on nested AWS::CloudFormation::Stack resources and 'ServiceTimeout' on custom resources, but AWS::RDS::DBInstance has no such property. Always investigate why the resource is slow, since it could be a capacity issue or a misconfiguration.
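
To see which resource consumed the wait, the stack events can be reduced to a per-resource duration. The sketch below assumes events loaded from the CLI's JSON output; the function names and sample events are illustrative.

```python
from datetime import datetime, timedelta


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; older Pythons reject a trailing 'Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def time_in_progress(events: list[dict]) -> dict[str, timedelta]:
    """Map each logical ID to the time between its first *_IN_PROGRESS
    event and the first terminal event after it. Resources that are
    still in progress are omitted."""
    started: dict[str, datetime] = {}
    ended: dict[str, datetime] = {}
    for e in sorted(events, key=lambda e: parse_ts(e["Timestamp"])):
        rid, ts = e["LogicalResourceId"], parse_ts(e["Timestamp"])
        if e["ResourceStatus"].endswith("_IN_PROGRESS"):
            started.setdefault(rid, ts)
        elif rid in started:
            ended.setdefault(rid, ts)
    return {rid: ended[rid] - started[rid] for rid in ended}


# Illustrative sample, not real CloudFormation output.
events = [
    {"Timestamp": "2024-01-01T10:45:00Z", "LogicalResourceId": "Database",
     "ResourceStatus": "CREATE_FAILED"},
    {"Timestamp": "2024-01-01T10:00:30Z", "LogicalResourceId": "Bucket",
     "ResourceStatus": "CREATE_COMPLETE"},
    {"Timestamp": "2024-01-01T10:00:05Z", "LogicalResourceId": "Bucket",
     "ResourceStatus": "CREATE_IN_PROGRESS"},
    {"Timestamp": "2024-01-01T10:00:00Z", "LogicalResourceId": "Database",
     "ResourceStatus": "CREATE_IN_PROGRESS"},
]

durations = time_in_progress(events)
for rid, spent in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
    print(rid, spent)
```

A resource that sat in progress far longer than its siblings before failing is the one to look up in its own service console.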

( 11 ) Using CloudTrail to Find the Real Error

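CloudTrail records the API calls CloudFormation makes on your behalf. When a resource fails, the corresponding service API call (e.g., 'CreateDBInstance') is recorded under that service's event source (e.g., rds.amazonaws.com) with an error code, and lists cloudformation.amazonaws.com as the caller. The stack operations themselves ('CreateStack', 'UpdateStack') appear under the event source 'cloudformation.amazonaws.com'. Look for 'errorCode' fields in both; they often reveal the exact service error behind a generic status reason.

The error code sits inside the 'CloudTrailEvent' field of each result, which is a JSON string, so parse it with jq: 'aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=CreateStack --query "Events[].CloudTrailEvent" --output json | jq -r ".[] | fromjson | select(.errorCode != null) | [.eventTime, .errorCode, .errorMessage] | @tsv"'. This lists the errors recorded for stack creation. For updates, replace 'CreateStack' with 'UpdateStack'.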
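
The same extraction can be sketched in Python for cases where jq is not available. It assumes the full JSON output of 'aws cloudtrail lookup-events --output json' as input; the function name and the sample record are illustrative, while the field names (Events, CloudTrailEvent, errorCode, errorMessage, eventTime, eventSource, eventName) are the ones CloudTrail uses.

```python
import json


def cloudtrail_errors(lookup_output: str) -> list[dict]:
    """Extract records that carry an errorCode from lookup-events JSON."""
    errors = []
    for event in json.loads(lookup_output).get("Events", []):
        # CloudTrailEvent is itself a JSON document encoded as a string.
        record = json.loads(event["CloudTrailEvent"])
        if record.get("errorCode"):
            errors.append({
                "time": record.get("eventTime"),
                "source": record.get("eventSource"),
                "name": record.get("eventName"),
                "error": record["errorCode"],
                "message": record.get("errorMessage"),
            })
    return errors


# Illustrative sample, not real CloudTrail output.
SAMPLE = json.dumps({"Events": [
    {"EventName": "CreateStack", "CloudTrailEvent": json.dumps({
        "eventTime": "2024-01-01T10:00:00Z",
        "eventSource": "cloudformation.amazonaws.com",
        "eventName": "CreateStack",
        "errorCode": "ValidationError",
        "errorMessage": "(hypothetical) message text",
    })},
    {"EventName": "CreateStack", "CloudTrailEvent": json.dumps({
        "eventTime": "2024-01-01T11:00:00Z",
        "eventSource": "cloudformation.amazonaws.com",
        "eventName": "CreateStack",
    })},
]})

for err in cloudtrail_errors(SAMPLE):
    print(err["time"], err["name"], err["error"], err["message"])
```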

Frequently asked questions

Why does my CloudFormation stack show 'ROLLBACK_COMPLETE' but no resources are in FAILED state?

After a rollback, no resources are left in a FAILED state because the rollback itself cleans up: failed resources are deleted (on create) or returned to their previous state (on update). The failure is still recorded in the stack events, often many rows back, so page through the full event history with the CLI and find the earliest FAILED event. If the stack is in UPDATE_ROLLBACK_COMPLETE, also check the stack policy, because a policy that denies the change fails the update almost immediately (stack policies apply only to updates). Another possibility is a dependency failure, where the resource you are watching never started creating because a prerequisite failed first.

How do I find the real error in a nested stack rollback?

The parent stack's events usually show only a generic reason, such as 'Resource update cancelled', for the nested stack resource. You must describe each nested stack individually. First, list the parent stack's resources to get the physical IDs of the nested stacks: 'aws cloudformation list-stack-resources --stack-name <parent> --query "StackResourceSummaries[?ResourceType=='AWS::CloudFormation::Stack'].PhysicalResourceId" --output text'. Then describe the events for each physical ID: 'aws cloudformation describe-stack-events --stack-name <nested-id>'. The nested stack's events will contain the actual failure reason.
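
The first half of that procedure can be sketched in Python against the JSON from 'aws cloudformation list-stack-resources --output json', whose resource list is held under the StackResourceSummaries key. The function name is illustrative, and the ARNs in the sample are placeholders.

```python
import json

NESTED_TYPE = "AWS::CloudFormation::Stack"


def nested_stack_ids(list_output: str) -> list[str]:
    """Return physical IDs of nested stacks from list-stack-resources JSON."""
    summaries = json.loads(list_output)["StackResourceSummaries"]
    return [
        r["PhysicalResourceId"]
        for r in summaries
        # A nested stack that never started creating has no physical ID.
        if r["ResourceType"] == NESTED_TYPE and r.get("PhysicalResourceId")
    ]


# Illustrative sample with placeholder ARNs, not real output.
SAMPLE = json.dumps({"StackResourceSummaries": [
    {"LogicalResourceId": "DbStack", "ResourceType": NESTED_TYPE,
     "PhysicalResourceId":
         "arn:aws:cloudformation:us-east-1:123456789012:stack/db-EXAMPLE/1"},
    {"LogicalResourceId": "LogBucket", "ResourceType": "AWS::S3::Bucket",
     "PhysicalResourceId": "example-log-bucket"},
    {"LogicalResourceId": "AppStack", "ResourceType": NESTED_TYPE},
]})

for stack_id in nested_stack_ids(SAMPLE):
    # Each ID can be passed as --stack-name to describe-stack-events.
    print(stack_id)
```

Nested stacks can themselves contain nested stacks, so the same lookup may need to be repeated on each ID it returns.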

What does 'Resource creation cancelled' mean?

This message appears when CloudFormation abandons a resource that was still being created. The usual cause is that another resource in the same stack failed (a sibling, or a resource it DependsOn), so the cancelled resource is a symptom, not the cause. A timeout or a service API error on the failing resource leads to the same cancellation of everything else in progress. Stack policies are not a cause, because they apply only to updates. To differentiate, find the earliest FAILED event in the stack, then check CloudTrail for the exact API error and look at the resource's own console for its state.

Can I prevent rollbacks and see the error instead?

Yes, you can disable rollback on failure with the '--disable-rollback' flag, which both 'aws cloudformation create-stack' and 'aws cloudformation update-stack' accept. The stack then stops in CREATE_FAILED or UPDATE_FAILED with the failed resources left in place for inspection. The '--rollback-configuration' option is unrelated: it sets CloudWatch alarm triggers for rollback and does not disable it. Be cautious: a failed stack left in a partially created state can incur costs and leave resources orphaned. Use this only for debugging in non-production environments.

How do I debug a custom resource that fails silently?

Custom resources rely on a Lambda function to send a response back to CloudFormation. If the Lambda fails before responding, CloudFormation waits for the timeout (default 1 hour) and then fails the resource. Check the Lambda function's CloudWatch Logs for stack traces. Also verify that the function sends its response with an HTTP PUT to the pre-signed 'ResponseURL' from the request (no CloudFormation IAM permission is involved, but the function needs outbound network access), and that the response body is valid JSON with the required fields: 'Status', 'PhysicalResourceId', 'StackId', 'RequestId', and 'LogicalResourceId'. You can test the Lambda independently with a test event in the console.
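
A response that CloudFormation accepts can be sketched with the standard library alone. This is a minimal sketch, not AWS's own helper: the function names and the sample event are hypothetical, and the empty Content-Type header is an assumption modeled on what the cfn-response helper sends.

```python
import json
import urllib.request


def build_response(event: dict, status: str, physical_id: str,
                   reason: str = "", data=None) -> bytes:
    """Build the JSON body CloudFormation expects from a custom resource."""
    if status not in ("SUCCESS", "FAILED"):
        raise ValueError("Status must be SUCCESS or FAILED")
    body = {
        "Status": status,
        # A reason is required when the status is FAILED.
        "Reason": reason or "See CloudWatch Logs for details",
        "PhysicalResourceId": physical_id,
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }
    return json.dumps(body).encode("utf-8")


def send_response(event: dict, body: bytes) -> None:
    """PUT the body to the pre-signed URL CloudFormation supplied."""
    request = urllib.request.Request(
        event["ResponseURL"], data=body, method="PUT",
        # Assumption: an empty Content-Type stops urllib from adding a
        # default one that the pre-signed URL was not signed for.
        headers={"Content-Type": ""},
    )
    with urllib.request.urlopen(request) as response:
        response.read()


# Hypothetical request event with placeholder values.
sample_event = {
    "StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/demo/1",
    "RequestId": "example-request-id",
    "LogicalResourceId": "MyCustomResource",
    "ResponseURL": "https://example.invalid/presigned",
}
print(build_response(sample_event, "FAILED", "my-resource-id",
                     reason="(hypothetical) setup step failed").decode())
```

Wrapping the handler in a try/except that always calls send_response with FAILED on error avoids the hour-long wait described above.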