Drift Detection, State Surgery, and Refactoring

No matter how disciplined your team is, reality eventually drifts from your codebase. During high-severity production incidents, an engineer might log into the AWS Management Console directly to increase a server's size or open a security group port.

When this happens, your infrastructure has drifted: the real resources no longer match what your code and state file describe.

If you run terraform apply after a manual change has been made, one of two bad things can happen:

Terraform will attempt to overwrite the manual change, reverting the resource to the configuration in your code and potentially re-introducing the incident.
The apply command may fail because the real resource is in a condition Terraform did not expect, for example a rule that already exists.

To maintain integrity, a DevOps engineer must master Drift Detection, Resource Importing, and State Surgery.

Step 1: Automated Drift Detection

To detect drift before it causes issues, you can run a scheduled, read-only plan:

# Detect drift between state and actual AWS resources
terraform plan -refresh-only

If drift has occurred, the plan lists each object that changed. The output below is abridged and illustrative:

Note: Objects have changed outside of Terraform

Terraform detected the following changes made outside of Terraform since the last "terraform apply" which may have affected this plan:

  # aws_security_group.app has changed
  ~ ingress {
      ~ cidr_blocks = [
          + "0.0.0.0/0",  # Manual port opening detected!
        ]
        from_port   = 22
        to_port     = 22
    }
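For a scheduled job, an exit code is easier to act on than text output. The following is a minimal sketch that relies on the -detailed-exitcode flag of terraform plan (0 = no changes, 1 = error, 2 = changes present). The function names classify_plan_exit and check_drift are hypothetical, and the sketch assumes that exit code 2 also signals detected drift when the flag is combined with -refresh-only, which is worth confirming on your Terraform version.

```shell
#!/usr/bin/env bash

# Map the exit status of `terraform plan -detailed-exitcode` to a label.
# Documented codes: 0 = success, no changes; 1 = error; 2 = success, changes present.
classify_plan_exit() {
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Hypothetical wrapper for a cron job or CI schedule.
# Runs a read-only plan and prints one of: in-sync, drift, error.
check_drift() {
  local rc=0
  terraform plan -refresh-only -detailed-exitcode -input=false -no-color >/dev/null || rc=$?
  classify_plan_exit "$rc"
}
```

A scheduler could call check_drift from the Terraform working directory and raise an alert whenever it prints anything other than in-sync.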

Resolution Strategy:

Revert: If the manual change was a temporary fix or a mistake, run terraform apply to overwrite it and restore the configuration defined in your code.
Reconcile: If the change was valid (e.g. a permanent scaling decision), update your HCL code to match the new values, then run terraform plan and confirm that it reports no changes.
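The two strategies map onto different commands. The sketch below only prints the commands for a chosen strategy so they can be reviewed before anything runs. The function name resolution_commands is hypothetical, and the use of terraform apply -refresh-only to record the real values in state during reconciliation is a suggested addition, not something the steps above require.

```shell
#!/usr/bin/env bash

# Print the Terraform commands for a drift resolution strategy.
# "revert" and "reconcile" mirror the two strategies described in the text.
resolution_commands() {
  case "$1" in
    revert)
      # Re-apply the configuration, overwriting the manual change.
      echo "terraform apply"
      ;;
    reconcile)
      # Run these after editing the HCL to match reality:
      # record the real values in state, then confirm the plan is clean.
      echo "terraform apply -refresh-only"
      echo "terraform plan -detailed-exitcode"
      ;;
    *)
      echo "unknown strategy: $1" >&2
      return 1
      ;;
  esac
}
```

For example, resolution_commands reconcile prints the two commands to run once the code has been updated.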

Step 2: Importing Untracked Cloud Assets

Often, you need to bring resources that were created by hand, for example in an older AWS account, under Terraform management without destroying or recreating them.

First, write a blank placeholder resource block in your code:

# main.tf

# Placeholder for existing manual S3 bucket
resource "aws_s3_bucket" "legacy_assets" {
  # Leave arguments empty during import stage
}

Now, map the existing AWS resource to this placeholder using its cloud identifier:

# Command: terraform import [resource_type].[resource_name] [aws_identifier]
terraform import aws_s3_bucket.legacy_assets my-manually-created-bucket-name

Terraform connects to AWS, reads the resource's current attributes, and records them in your state file (the remote one, if you use a remote backend). The import command does not write any HCL for you. Next, run terraform plan to see which arguments your placeholder is missing, and update your code until terraform plan reports no changes, which means your configuration matches the real resource.
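Terraform 1.5 and later also offer a declarative alternative to the CLI import: an import block in configuration, optionally combined with terraform plan -generate-config-out to draft the HCL. The sketch below writes such a block from the shell. The function name write_import_block and the file names imports.tf and generated.tf are hypothetical. Note that config generation expects the target resource block not to exist yet, so it replaces the placeholder approach rather than combining with it.

```shell
#!/usr/bin/env bash

# Write an import block that maps an existing cloud object to a resource address.
# Arguments: <resource_address> <cloud_id> <output_file>
write_import_block() {
  local address="$1" id="$2" out="$3"
  cat > "$out" <<EOF
import {
  to = ${address}
  id = "${id}"
}
EOF
}

# Usage (assumes Terraform 1.5+ and valid AWS credentials):
#   write_import_block aws_s3_bucket.legacy_assets my-manually-created-bucket-name imports.tf
#   terraform plan -generate-config-out=generated.tf   # drafts HCL for the bucket
#   terraform apply                                    # performs the import
```

The generated file is a starting point that still needs review before it is committed.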

Step 3: Refactoring State (State Surgery)

When you refactor code (e.g. rename a resource or move it into a module), Terraform sees the old address disappear and a new one appear, so by default it plans to destroy the old resource and create a new one under the new name. For databases or load balancers, this causes serious and entirely avoidable downtime.

To rename a resource without destroying it, we perform State Surgery using the CLI:

# original.tf
resource "aws_s3_bucket" "old_name" { ... }

# After refactoring, we want it to be named:
resource "aws_s3_bucket" "new_name" { ... }

Before applying, rename the resource directly inside the state file:

# Command: terraform state mv [old_path] [new_path]
terraform state mv aws_s3_bucket.old_name aws_s3_bucket.new_name

Successfully moved 1 object(s).
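Because terraform state mv edits the state directly, it can help to take a backup and preview the move first. The following is a sketch of that routine using terraform state pull and the -dry-run flag of terraform state mv. The function name safe_state_mv and the backup file naming are hypothetical.

```shell
#!/usr/bin/env bash

# Back up the current state, preview the move, then perform it.
# Arguments: <old_address> <new_address> [backup_dir]
safe_state_mv() {
  local from="$1" to="$2" dir="${3:-.}"
  local backup="${dir}/pre-mv-$(date +%Y%m%d-%H%M%S).tfstate"

  # `terraform state pull` prints the current state to stdout.
  terraform state pull > "$backup" || return 1

  # -dry-run reports what would be moved without changing anything.
  terraform state mv -dry-run "$from" "$to" || return 1

  # Perform the real move only if the preview succeeded.
  terraform state mv "$from" "$to"
}

# Usage:
#   safe_state_mv aws_s3_bucket.old_name aws_s3_bucket.new_name
```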

Now, run terraform plan. Terraform should report no changes, because the S3 bucket itself was not modified in AWS; only the address it is tracked under in the state file was updated.
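Terraform 1.1 and later can record the same rename declaratively with a moved block, which lets the change go through code review and apply in every workspace that uses the configuration. The sketch below appends such a block from the shell. The function name write_moved_block and the file name moved.tf are hypothetical, and this is offered as an alternative to the CLI command, not as a required extra step.

```shell
#!/usr/bin/env bash

# Append a moved block recording a rename, so the next plan treats it as a move
# instead of a destroy-and-create.
# Arguments: <old_address> <new_address> <tf_file>
write_moved_block() {
  local from="$1" to="$2" out="$3"
  cat >> "$out" <<EOF
moved {
  from = ${from}
  to   = ${to}
}
EOF
}

# Usage (assumes Terraform 1.1+):
#   write_moved_block aws_s3_bucket.old_name aws_s3_bucket.new_name moved.tf
#   terraform plan   # should describe the resource as moved, not replaced
```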

By using these state operations, you can refactor and expand your infrastructure in a controlled way, with far less risk of accidentally destroying a resource.

Next Steps

We have reached the culmination of our DevOps pathway. In the final lesson, we will deploy our Capstone Project: designing and executing a fully automated, zero-downtime Blue-Green Infrastructure Upgrade on AWS.
