Terraform State Drift Debug Guide

What this usually means

This pattern typically indicates that someone or some process modified the resource directly through the cloud provider's console, CLI, or API, bypassing Terraform. It can also happen when another IaC tool or an automated script touches the same resource. The root cause is usually a gap in governance: humans making emergency changes, auto-scaling or recovery scripts that alter resources, or team members not realizing that Terraform owns the resource lifecycle. Once drift occurs, Terraform still treats the configuration as the source of truth, so the next apply will attempt to revert or overwrite those changes, potentially causing outages if not handled carefully.

( 01 ) Fast diagnosis

The first ten minutes: establish facts before touching code.

1. Run 'terraform plan' and look for attributes marked '# forces replacement' or resources marked 'will be updated in-place' that you didn't change in code.
2. Save the plan with 'terraform plan -out=tfplan', then run 'terraform show tfplan' to see the exact attribute differences (e.g., security group rules, instance type).
3. Check cloud provider activity logs (AWS CloudTrail, Azure Activity Log, GCP Audit Logs) for events that modified the resource outside Terraform.
4. Compare what Terraform has recorded, using 'terraform state show <resource>', against the actual resource configuration returned by the provider's describe API.
5. Look for any recent manual changes in your change management system or incident tickets that correspond to the drifted resource.

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

- Terraform plan output (especially the '~' lines showing attribute changes)
- Terraform state file (terraform.tfstate) or remote backend state (e.g., S3, Azure Storage)
- Cloud provider activity logs (AWS CloudTrail, Azure Monitor Activity Log, GCP Cloud Audit Logs)
- CI/CD pipeline logs for any automated apply or refresh steps
- Version control history for your Terraform configuration files (git log)
- Monitoring dashboards for resource changes (e.g., AWS Config, Azure Policy, GCP Asset Inventory)
- Any automation scripts or runbooks that might modify resources (e.g., Lambda functions, Ansible playbooks)

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

- Ad-hoc manual changes via cloud console for troubleshooting or emergency fixes
- Auto-scaling or recovery processes that modify resource attributes (e.g., security groups, tags)
- Another IaC tool (e.g., CloudFormation, ARM templates) managing the same resource
- Incorrect Terraform configuration that doesn't match the desired state (e.g., missing lifecycle rules)
- State file corruption or stale state due to concurrent modifications or failed applies
- Resource adoption (terraform import) without proper alignment of existing configuration

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

- Run 'terraform apply -refresh-only' to update the state file with current resource attributes without changing any infrastructure
- Use 'terraform state rm' to remove a resource from state, then 'terraform import' with the correct config if needed
- Add lifecycle 'ignore_changes' to attributes that are expected to be modified externally (e.g., tags, user_data)
- Implement CI/CD guardrails: require all changes to go through Terraform, disable manual edits via IAM policies
- Set up drift detection with tools like Terraform Cloud, Atlantis, or custom scripts that alert on plan diffs
- For critical resources, use 'prevent_destroy = true' in lifecycle to avoid accidental deletion

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

- Run 'terraform plan' after applying the fix; it should show no changes if the config, the state, and the real resources are aligned
- Check the state file for the resource using 'terraform state show <resource>' and confirm attributes match the config
- Re-run the same plan with 'terraform plan -detailed-exitcode'; exit code 0 means no changes, 2 means a diff exists, and 1 means the plan itself failed
- Verify cloud provider activity logs show no unauthorized modifications after the fix
- Perform a full 'terraform apply' in a non-production environment to ensure idempotency
- Set up a scheduled drift detection job (e.g., daily terraform plan) and monitor for zero diffs
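
The exit-code check above can be wrapped in a small script for a scheduled job. This is a minimal sketch, not a definitive implementation: the function names `classify_plan_exit` and `check_drift` are made up for illustration, and it assumes the `terraform` binary is on the PATH and the working directory is already initialized.

```python
import subprocess
from typing import List, Optional


def classify_plan_exit(code: int) -> str:
    """Map the exit code of 'terraform plan -detailed-exitcode' to a label."""
    if code == 0:
        return "no-changes"       # config, state, and real resources agree
    if code == 2:
        return "changes-present"  # a diff exists: drift or unapplied config
    return "error"                # 1 (or anything else): the plan failed


def check_drift(workdir: str, command: Optional[List[str]] = None) -> str:
    """Run the plan in workdir and classify the result.

    The command is overridable so the function can be exercised
    without Terraform installed (an assumption made for testing).
    """
    cmd = command or [
        "terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color",
    ]
    completed = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
    return classify_plan_exit(completed.returncode)
```

A scheduled job could call `check_drift(".")` and alert on anything other than "no-changes". Treating "error" as alert-worthy is a deliberate choice in this sketch: a plan that cannot run tells you nothing about drift.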

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

- Running 'terraform apply' blindly without reviewing the plan; this can revert external changes and cause outages
- Editing the state file by hand or with 'terraform state' commands without understanding the implications
- Using 'lifecycle ignore_changes' too broadly, which hides legitimate drift and defeats the purpose of IaC
- Assuming drift is harmless; some changes (like security group rules) can expose vulnerabilities
- Ignoring the root cause: if you only fix the state but don't address why the manual change happened, it will recur
- Forgetting to update the Terraform config to reflect desired permanent changes after reconciling drift

( 07 ) War story

Production Security Group Drift Causes Outage

Senior DevOps Engineer
AWS, Terraform, GitHub Actions, S3 remote state

Timeline

09:15 On-call engineer receives alert: production web servers unreachable
09:20 Engineer logs into AWS console, notices security group for web tier has wrong ingress rules
09:25 Engineer manually adds correct ingress rule via console to restore access
09:30 Service restored; engineer notes to fix Terraform later
10:00 Another engineer runs terraform apply for unrelated change; plan shows removal of the manually added rule
10:05 Apply runs and removes the manual rule; web servers go down again
10:10 PagerDuty alert fires again; team scrambles
10:20 Senior engineer identifies drift: manual change vs. Terraform state mismatch
10:30 Team runs terraform apply -refresh-only to update state, then applies the correct config

The incident started when an automated deployment script inadvertently removed a necessary security group rule from our production web tier. The on-call engineer, under pressure to restore service, jumped into the AWS console and added the rule manually. The fix worked, but no one updated the Terraform configuration or the state file. The manual change became a ticking time bomb.

Within the hour, a different engineer ran a routine terraform apply for an unrelated DNS change. The plan showed that Terraform would remove the manually added security group rule, because the rule existed in AWS but not in the Terraform configuration. The engineer, seeing a diff they didn't understand but assuming it was correct, approved the apply. Instantly, the web tier lost the critical ingress rule and the site went down again.

I was called in as the senior engineer. I quickly checked the AWS CloudTrail logs and saw the manual change. I ran 'terraform plan' and saw the discrepancy. We immediately did a 'terraform apply -refresh-only' to update the state to match reality (including the manual rule), then updated the Terraform config to include the rule properly, and reapplied. The root cause was a lack of process: manual changes without updating IaC, and no guardrails preventing blind applies. We later added IAM policies to disable console edits for critical resources and set up automated drift detection in CI/CD.

Root cause

Ad-hoc manual change to a security group via AWS console, not reflected in Terraform state or configuration, followed by a subsequent terraform apply that reverted the manual fix.

The fix

1. Run 'terraform apply -refresh-only' to sync state with actual resources.
2. Update Terraform config to include the desired rules.
3. Run 'terraform apply' to enforce the correct state.
4. Implement IAM policies to prevent console edits and add drift detection to CI/CD.

The lesson

Never make manual changes to resources managed by Terraform without immediately updating the state and config. Always review terraform plan diffs carefully, especially after an incident. Automate drift detection to catch mismatches before they cause outages.

( 08 ) Understanding State Drift Mechanism

Terraform maintains a state file that maps your configuration to real-world resources. When a resource is modified outside of Terraform (via console, CLI, or API), the state file becomes stale: it no longer reflects the actual resource attributes. On the next 'terraform plan', Terraform first refreshes an in-memory copy of the state from the provider API, then compares the refreshed values against the configuration (your desired values). Wherever reality differs from config it reports a diff and proposes changing the resource back to match the config. This is drift.

Drift can be intentional (e.g., emergency fix) or accidental (e.g., an Auto Scaling group changing its desired capacity). The critical point is that Terraform's default behavior is to overwrite external changes, which can cause service disruption if not handled carefully. Understanding the refresh phase is key: when you run 'terraform plan', Terraform refreshes state by default (unless you pass -refresh=false) by querying the provider API. During a plan, the refreshed values exist only for the duration of that command; they are written back to the state file only when you run 'terraform apply' or 'terraform apply -refresh-only'.
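
The three-way relationship between config, recorded state, and the real resource can be illustrated with a toy model. This is only a sketch of the reasoning, not how Terraform is implemented: the function `classify` and its labels are invented here, and it assumes every attribute is set explicitly in the config (no computed attributes).

```python
from typing import Any, Dict


def classify(config: Dict[str, Any],
             state: Dict[str, Any],
             actual: Dict[str, Any]) -> Dict[str, str]:
    """Label each attribute of one resource by comparing three views of it."""
    result = {}
    for attr in sorted(set(config) | set(state) | set(actual)):
        desired = config.get(attr)   # what the configuration asks for
        recorded = state.get(attr)   # what the state file last recorded
        real = actual.get(attr)      # what the provider API reports now
        if real != recorded and real != desired:
            result[attr] = "drift: plan proposes reverting to config"
        elif real != recorded:
            result[attr] = "drift: matches config, only state is stale"
        elif real != desired:
            result[attr] = "config change: plan proposes update"
        else:
            result[attr] = "in sync"
    return result


# Hypothetical example: someone resized the instance in the console.
example = classify(
    config={"instance_type": "t3.small", "monitoring": True},
    state={"instance_type": "t3.small", "monitoring": True},
    actual={"instance_type": "t3.large", "monitoring": True},
)
```

Here `example["instance_type"]` is the "reverting to config" label and `example["monitoring"]` is "in sync", which mirrors why an unreviewed apply undoes an emergency console change.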

( 09 ) Diagnostic Commands and Techniques

The first step is to isolate which resources have drifted. Use 'terraform plan -out=tfplan' and then 'terraform show -json tfplan' to get a machine-readable diff. Look for resources whose planned actions are an in-place update or a replacement (destroy and recreate). For each drifted resource, run 'terraform state show <resource_address>' to see the values recorded in state, and compare them with your configuration file. For example, if an AWS security group rule that is defined in your configuration was deleted outside Terraform, it appears in the plan as an addition; a rule added by hand typically appears as a removal when the rules are managed inline on the security group.

Cloud provider audit logs are gold. In AWS, use CloudTrail via CLI: 'aws cloudtrail lookup-events --lookup-attributes AttributeKey=ResourceName,AttributeValue=<resource-id>'. This shows who made the change, from where, and when. In GCP, use 'gcloud logging read' with a filter that selects the Cloud Audit Logs entries for the affected resource. Cross-reference the timestamp with incident reports or deployment windows. Also check if any automated scripts (e.g., Lambda functions) have permissions to modify the resource; drift often comes from automation that isn't tracked in IaC.
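
The machine-readable plan can be summarized with a few lines of Python. This is a minimal sketch: the function name `summarize_plan` and the sample data are made up for illustration, and it relies on the `resource_drift` and `resource_changes` arrays of the plan JSON, each entry carrying an `address` and a `change.actions` list.

```python
import json
from typing import Dict, List


def summarize_plan(plan_json: str) -> Dict[str, List[str]]:
    """List resources changed outside Terraform and the actions the plan proposes."""
    plan = json.loads(plan_json)
    # resource_drift: changes Terraform detected during refresh.
    drifted = [r["address"] for r in plan.get("resource_drift", [])]
    planned = []
    for r in plan.get("resource_changes", []):
        actions = r["change"]["actions"]
        if actions == ["no-op"]:
            continue  # nothing to do for this resource
        planned.append(r["address"] + ": " + "/".join(actions))
    return {"drifted": drifted, "planned": planned}


# Made-up sample shaped like 'terraform show -json tfplan' output.
SAMPLE_PLAN = json.dumps({
    "resource_drift": [
        {"address": "aws_security_group.web", "change": {"actions": ["update"]}},
    ],
    "resource_changes": [
        {"address": "aws_security_group.web", "change": {"actions": ["update"]}},
        {"address": "aws_instance.app", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
    ],
})

summary = summarize_plan(SAMPLE_PLAN)
# summary["drifted"] == ["aws_security_group.web"]
# summary["planned"] == ["aws_security_group.web: update",
#                        "aws_instance.app: delete/create"]
```

In practice the input would come from a file written by 'terraform show -json tfplan > plan.json'. A "delete/create" entry is the one to read most carefully, since it means the resource will be replaced.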
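
Once the CloudTrail lookup output is saved as JSON, filtering out the events made by Terraform's own role narrows the search quickly. The sketch below is illustrative only: the function name, the role name `terraform-ci`, and the sample events are all hypothetical. It assumes the CLI's JSON output shape, where each entry in `Events` carries a `CloudTrailEvent` field holding the full event as a JSON string.

```python
import json
from typing import Dict, List


def non_terraform_events(lookup_output: str, terraform_role: str) -> List[Dict[str, str]]:
    """Return events whose caller ARN does not mention the Terraform role."""
    findings = []
    for event in json.loads(lookup_output).get("Events", []):
        detail = json.loads(event["CloudTrailEvent"])  # nested JSON string
        arn = detail.get("userIdentity", {}).get("arn", "")
        if terraform_role in arn:
            continue  # change came through the pipeline, not drift
        findings.append({
            "time": detail.get("eventTime", ""),
            "event": detail.get("eventName", ""),
            "who": arn,
            "source_ip": detail.get("sourceIPAddress", ""),
        })
    return findings


# Made-up sample: one console change by a user, one change by the CI role.
SAMPLE_LOOKUP = json.dumps({"Events": [
    {"CloudTrailEvent": json.dumps({
        "eventTime": "2024-01-01T09:25:00Z",
        "eventName": "AuthorizeSecurityGroupIngress",
        "sourceIPAddress": "203.0.113.10",
        "userIdentity": {"arn": "arn:aws:iam::123456789012:user/oncall"},
    })},
    {"CloudTrailEvent": json.dumps({
        "eventTime": "2024-01-01T10:05:00Z",
        "eventName": "RevokeSecurityGroupIngress",
        "sourceIPAddress": "198.51.100.7",
        "userIdentity": {"arn": "arn:aws:sts::123456789012:assumed-role/terraform-ci/run"},
    })},
]})

suspects = non_terraform_events(SAMPLE_LOOKUP, "terraform-ci")
```

With the sample above, `suspects` contains only the console change by the on-call user. Matching on a substring of the ARN is a simplification; a stricter check would compare the full role ARN.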

( 10 ) Reconciliation Strategies Without Downtime

Once drift is confirmed, the safest approach is to update the Terraform configuration to match the desired state (which includes the external changes if they are valid) and then run 'terraform apply -refresh-only' to sync state. This ensures the state file correctly represents reality without making any resource changes. After that, a normal 'terraform plan' should show no diff. If the external change is undesirable, reverse it: review the 'terraform plan' output to confirm it only reverts the unwanted change, then run 'terraform apply' to enforce the configuration. Reverting by hand via the console or API first also works, but it adds one more untracked change.

For resources that must not be disrupted (e.g., production databases), use 'terraform plan -target=<resource>' to scope operations. If a resource has been deleted outside Terraform, 'terraform apply' will try to recreate it, which may or may not be desired. If you don't want it recreated, remove the resource block from the configuration; 'terraform state rm' only clears the stale state entry and does not stop Terraform from creating a resource that is still declared. Always test in a non-production environment first. For complex drifts, consider using 'terraform import' to bring existing resources under management with the correct config.

( 11 ) Preventing Drift at Scale

Prevention starts with governance. Use IAM policies to restrict write access to resources outside of Terraform. In AWS, you can use Service Control Policies (SCPs) to deny modifications unless they come from a specific IAM role used by Terraform. For example, add a condition that checks 'aws:ViaAWSService' or 'aws:PrincipalArn' matches your CI/CD role. This forces all changes through Terraform.
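
A deny statement of the kind described above can be generated and reviewed as data. This is a hedged sketch rather than a recommended policy: the function name, the Sid, and the role name `terraform-ci` are hypothetical, and the action list only covers security group rule edits. It uses the `ArnNotLike` operator on `aws:PrincipalArn` so that every principal except the Terraform role is denied.

```python
import json
from typing import Any, Dict, Iterable


def deny_outside_terraform(role_arn_pattern: str, actions: Iterable[str]) -> Dict[str, Any]:
    """Build a policy document denying the actions to everyone but one role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyChangesOutsideTerraform",
            "Effect": "Deny",
            "Action": list(actions),
            "Resource": "*",
            # Deny applies when the caller is NOT the Terraform role.
            "Condition": {"ArnNotLike": {"aws:PrincipalArn": role_arn_pattern}},
        }],
    }


policy = deny_outside_terraform(
    "arn:aws:iam::*:role/terraform-ci",  # hypothetical CI/CD role
    [
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:RevokeSecurityGroupIngress",
        "ec2:AuthorizeSecurityGroupEgress",
        "ec2:RevokeSecurityGroupEgress",
    ],
)
policy_json = json.dumps(policy, indent=2)
```

A policy like this also blocks emergency console fixes, so it is worth pairing with a documented break-glass role and testing in a non-production account before attaching it broadly.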

Implement automated drift detection in your CI/CD pipeline. After every apply, run a follow-up 'terraform plan' and fail the pipeline if there are any diffs. Use tools like Terraform Cloud's drift detection, Atlantis, or a simple cron job that runs 'terraform plan' and alerts on changes. Also, educate your team: any emergency fix must be followed by a ticket to update Terraform config and state. The 'lifecycle' meta-argument can be used sparingly—for example, 'ignore_changes' on tags that are modified by external systems, but avoid overuse as it hides drift.

( 12 ) Advanced: Handling Concurrent Modifications and State Locking

When multiple team members or automation processes modify the same resource, state conflicts can arise. Terraform supports state locking via backends like S3 (with DynamoDB) or Azure Storage. Ensure locking is enabled and your CI/CD system uses it. If a lock is not released (e.g., due to a failed apply), you may need to force unlock: 'terraform force-unlock <lock_id>'. Be cautious: forcing an unlock while another operation is still running can lead to state corruption.

Another advanced scenario is when external changes happen between a plan and an apply. For example, you run 'terraform plan', see no issues, but by the time you apply, someone manually changes the resource. To mitigate, use 'terraform plan -out=tfplan' and then 'terraform apply tfplan'. Applying a saved plan executes exactly the actions that were reviewed without re-planning, and Terraform rejects the saved plan as stale if the state has changed since the plan was created. It does not run a new refresh first, so keep the window between plan and apply short. For critical resources, consider using 'precondition' and 'postcondition' checks (Terraform 1.2+) to validate assumptions before and after apply. In Terraform 1.5+, you can also use 'check' blocks for custom assertions.

Frequently asked questions

What is the difference between 'terraform refresh' and 'terraform apply -refresh-only'?

'terraform refresh' is a deprecated command that updates the state file with current resource attributes without changing any infrastructure. 'terraform apply -refresh-only' is the modern replacement; it does the same but is safer because it shows the detected changes and asks for confirmation before writing them to state, whereas 'terraform refresh' writes them immediately. Always use 'terraform apply -refresh-only' instead of 'terraform refresh'.

Can I use 'lifecycle { ignore_changes }' to prevent drift?

No, 'ignore_changes' tells Terraform to ignore specific attribute changes during plan/apply, but it does not prevent the drift from happening. It merely suppresses the diff. It's useful for attributes that are expected to be modified externally (e.g., tags from an auto-tagging system), but it's not a solution for drift prevention. Overusing it can hide legitimate issues.

How do I revert an external change that is not desired?

Run 'terraform plan' and confirm that the only proposed changes revert the unwanted modification, then run 'terraform apply' so Terraform restores the configured values. Alternatively, undo the change by hand via the cloud provider console or API and run 'terraform plan' to confirm the diff is gone. If the external change is complex, you may need to update your Terraform config to reflect the correct state and then apply.

What should I do if the state file is corrupted or out of sync?

If the state file is corrupted, restore from a backup if you have one. Terraform backends like S3 support versioning—enable it. If the state is out of sync but not corrupted, use 'terraform apply -refresh-only' to sync. For severe corruption, you may need to rebuild state using 'terraform import' for each resource. Always keep backups and use remote state with locking.

How can I detect drift automatically?

Set up a scheduled job (e.g., cron, CI pipeline trigger) that runs 'terraform plan -detailed-exitcode' and checks the exit code. If the exit code is 2 (meaning a diff exists), send an alert. Tools like Terraform Cloud have built-in drift detection. You can also build scheduled plans on top of tools like Atlantis or Terragrunt, or write custom scripts that parse the plan output and notify via Slack or email.
