Terraform Monitoring: Drift Detection and Deploy Checks

terraform apply completed successfully. Zero errors. The infrastructure change is live.

Twenty minutes later, users start seeing 502 errors. Terraform state shows the load balancer, target group, and security groups exactly as declared, but the new security group rules are blocking traffic from the load balancer to its targets on port 443.

Terraform tells you what it changed. Monitoring tells you whether what it changed actually works.

This guide covers how to monitor infrastructure-as-code deployments so you catch drift, validate changes, and detect the failures that Terraform's exit code cannot.

Why Terraform Needs Monitoring

Terraform manages infrastructure declaratively. You define the desired state, and Terraform makes it happen. But several things can go wrong that Terraform cannot detect, or cannot fully report:

Apply succeeds, service breaks: a configuration change (new security group, DNS record, certificate) can be valid and apply cleanly but still be operationally wrong.
Drift between applies: someone changes infrastructure manually through the console, and Terraform state no longer matches reality.
Partial applies: Terraform creates some resources but fails on others. The apply exits with an error, but the resources it did create are already live, leaving the infrastructure in an inconsistent state.
Propagation delays: DNS changes, certificate provisioning, and load balancer registration take time. Terraform reports success before the change is fully live.
Dependency timing: a database is created but not yet accepting connections when the application tries to connect.

Terraform ensures your infrastructure is configured correctly. Monitoring ensures it is working correctly.

What to Monitor After Terraform Applies

1) Endpoint Reachability

After any infrastructure change, verify that user-facing endpoints still work:

HTTP/HTTPS checks on all public URLs
API endpoint validation with auth headers
Health endpoint checks on internal services
DNS resolution for any changed or new domains

This is the single most important post-apply check. If users can reach the service and get correct responses, the apply was successful in the ways that matter.
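The reachability checks above can also run from a deploy script. The following is a minimal Python sketch using only the standard library. It polls an endpoint until it returns a 2xx status or a retry budget runs out, which copes with the propagation delays described earlier better than a single request does. The function names, attempt count, and delay are illustrative assumptions, not part of Terraform or any monitoring product's API.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # urllib.error.URLError (and its subclass HTTPError, raised for 4xx/5xx)
        # derives from OSError, as do DNS failures, refused connections,
        # TLS errors, and timeouts.
        return False


def wait_until_reachable(
    url: str,
    attempts: int = 10,
    delay: float = 6.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Probe url until it succeeds and return the attempt number.

    Raises TimeoutError after `attempts` consecutive failures.
    """
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay)  # wait before the next probe, not after the last one
    raise TimeoutError(f"{url} not reachable after {attempts} attempts")
```

For example, `wait_until_reachable("https://api.example.com/health")` returns the attempt on which the endpoint first answered. If the endpoint never answers, it raises `TimeoutError`, which a CI step can turn into a failed job. The `probe` and `sleep` parameters let you exercise the retry logic without a network.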

2) SSL and DNS State

Terraform frequently manages certificates and DNS records. Monitor:

SSL certificate validity and expiry on all domains
DNS resolution returning expected IP addresses or CNAMEs
Certificate chain completeness (intermediate certificates)
HTTPS redirect behavior

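DNS and certificate changes are a frequent cause of post-Terraform outages because propagation is not instant. Resolvers keep serving cached records until their TTL expires, and managed certificates can take minutes to issue and validate.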
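Certificate expiry can be checked with Python's standard `ssl` module. This sketch assumes the endpoint serves TLS on port 443, and `fetch_not_after` and `days_remaining` are hypothetical helper names. `ssl.create_default_context()` verifies both the chain and the hostname. As a result, a missing intermediate certificate or a not-yet-provisioned certificate typically surfaces as an `ssl.SSLCertVerificationError` from `fetch_not_after` rather than as a date.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the verified leaf certificate's notAfter, e.g. 'Jan  1 00:00:00 2030 GMT'."""
    context = ssl.create_default_context()  # verifies chain and hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_remaining(not_after: str, now: Optional[float] = None) -> int:
    """Whole days from `now` (epoch seconds, default: current time) until expiry."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return int((expires - current) // 86400)
```

A scheduled job might alert when `days_remaining(fetch_not_after("www.example.com"))` falls below a chosen threshold, for example 14 days.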

3) Network Connectivity

Security groups, firewalls, VPC configurations, and load balancer rules are common Terraform resources. After changes:

TCP port checks on critical services (databases, caches, message queues)
Verify load balancer health check targets are passing
Confirm services can reach their dependencies
Test cross-VPC or cross-region connectivity if changed
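A TCP port check only confirms that a handshake completes, and that is exactly what a security group or firewall change affects. The following is a minimal standard-library sketch; the helper names and the 3-second timeout are assumptions. Run it from inside the network whose rules you want to test, because a check from the public internet cannot see private database or cache ports.

```python
import socket
from typing import Iterable, List, Tuple


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP handshake to host:port completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out (rules that silently drop packets
        # usually show up as a timeout), or host not resolvable.
        return False


def check_ports(targets: Iterable[Tuple[str, int]], timeout: float = 3.0) -> List[Tuple[str, int]]:
    """Return the (host, port) pairs from `targets` that are not reachable."""
    return [(host, port) for host, port in targets if not tcp_port_open(host, port, timeout)]
```

An empty result from something like `check_ports([("db.internal", 5432), ("cache.internal", 6379)])` means every handshake succeeded. These hostnames are placeholders.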

4) Service Health Post-Apply

Beyond reachability, validate that services are functioning correctly:

Content validation on key pages (not just 200 status)
Response time within expected thresholds
Background job completion (heartbeat monitoring)
Queue processing and worker health

A service can be reachable but broken if its configuration references a resource that Terraform changed.
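Content and latency checks can be combined in a single evaluation. This sketch assumes a hypothetical page that should contain a known string (for example, a heading) and a latency budget you choose; `fetch` and `evaluate` are illustrative names. Keeping the evaluation separate from the HTTP call makes the rules easy to test.

```python
import time
import urllib.error
import urllib.request
from typing import List, Tuple


def fetch(url: str, timeout: float = 10.0) -> Tuple[int, str, float]:
    """Return (status, body, elapsed_ms). Connection and TLS errors propagate."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status, raw = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; keep their status and body.
        status, raw = err.code, err.read()
    elapsed_ms = (time.monotonic() - start) * 1000
    return status, raw.decode("utf-8", errors="replace"), elapsed_ms


def evaluate(status: int, body: str, elapsed_ms: float,
             must_contain: str, max_ms: float) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    if not 200 <= status < 300:
        problems.append(f"unexpected status {status}")
    if must_contain not in body:
        problems.append(f"expected text {must_contain!r} missing")
    if elapsed_ms > max_ms:
        problems.append(f"slow response: {elapsed_ms:.0f} ms > {max_ms:.0f} ms")
    return problems
```

For example, `evaluate(*fetch("https://www.example.com/"), must_contain="Checkout", max_ms=800)` returns an empty list when the page is healthy.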

Infrastructure Drift Detection

Drift is when the actual state of infrastructure diverges from what Terraform state says it should be. Common causes:

Manual changes through cloud provider console
Automated processes modifying resources outside Terraform
Another team's Terraform workspace changing shared resources
Cloud provider auto-updates or maintenance

How monitoring catches drift

External monitoring detects the symptoms of drift as they happen, while terraform plan only reports the difference the next time it runs:

Drift Type	Terraform Detection	Monitoring Detection
Security group rule added manually	Next `plan` shows diff	Immediate — port check fails or succeeds unexpectedly
DNS record changed in console	Next `plan` shows diff	Immediate — resolution check returns wrong IP
Certificate replaced outside Terraform	Next `plan` shows diff	Immediate — SSL check detects new cert or expiry change
Auto-scaling changes instance count	Depends on lifecycle rules	Response time monitoring detects capacity changes
Database parameter changed manually	May not show in plan	Latency or error rate monitoring detects behavior change

Monitoring catches the impact of drift as soon as the next check runs. terraform plan catches the drift itself only on its next run, which may be hours or days later.

Continuous monitoring as a drift signal

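When your monitoring detects an unexpected change (new SSL cert, different DNS response, changed response content), it may indicate infrastructure drift. Investigate with terraform plan, or with terraform plan -refresh-only, which shows only how Terraform state would change to match the real infrastructure.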
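One way to turn monitoring into a drift signal is to snapshot what an endpoint exposes externally (resolved addresses and the certificate fingerprint). You then diff that snapshot against a baseline recorded after the last known-good apply. This sketch assumes you store the baseline yourself (for example, as JSON), and `observe` and `diff_snapshots` are hypothetical names. Some changes are benign: cloud load balancers rotate IP addresses and managed certificates renew automatically. Treat a diff as a prompt to run terraform plan, not as proof of drift.

```python
import hashlib
import socket
import ssl
from typing import Dict, List, Mapping


def observe(host: str, port: int = 443, timeout: float = 10.0) -> Dict[str, object]:
    """Capture externally visible facts about an endpoint."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    addresses = sorted({info[4][0] for info in infos})  # unique IPs, stable order
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)  # raw DER bytes of the leaf cert
    return {"addresses": addresses, "cert_sha256": hashlib.sha256(der).hexdigest()}


def diff_snapshots(baseline: Mapping, current: Mapping) -> List[str]:
    """Describe every key whose value differs between two observations."""
    changes = []
    for key in sorted(set(baseline) | set(current)):
        if baseline.get(key) != current.get(key):
            changes.append(f"{key}: {baseline.get(key)!r} -> {current.get(key)!r}")
    return changes
```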

Terraform CI/CD Pipeline Monitoring

Many teams run Terraform through CI/CD pipelines. Monitor the pipeline itself:

Heartbeat monitoring for scheduled applies

If you run Terraform on a schedule (e.g., hourly drift detection plans), use heartbeat monitoring to verify the pipeline completes:

Pipeline sends heartbeat signal after successful terraform plan or apply
If the heartbeat does not arrive on schedule, alert
Catches pipeline infrastructure failures, credential expirations, and scheduler issues
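For scheduled drift-detection plans, `terraform plan -detailed-exitcode` distinguishes the three outcomes: exit code 0 means no changes, 1 means an error, and 2 means the plan contains changes. The following is a sketch of a wrapper that pings the heartbeat only when the plan itself ran. A missing heartbeat then points to a broken pipeline, while a `"drift"` result points to a real diff. The function names are hypothetical, and the heartbeat URL is a placeholder. Exit code 2 also covers configuration changes that have not been applied yet, not only drift.

```python
import subprocess
import urllib.request

# Exit codes of `terraform plan -detailed-exitcode`:
# 0 = succeeded with an empty diff, 1 = error, 2 = succeeded with a non-empty diff.
PLAN_NO_CHANGES = 0
PLAN_CHANGES = 2


def classify_plan_exit(code: int) -> str:
    """Map a plan exit code to 'in-sync', 'drift', or 'error'."""
    if code == PLAN_NO_CHANGES:
        return "in-sync"
    if code == PLAN_CHANGES:
        return "drift"
    return "error"


def run_drift_check(workdir: str, heartbeat_url: str) -> str:
    """Run a plan in workdir; ping the heartbeat only if the plan itself completed."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    status = classify_plan_exit(result.returncode)
    if status != "error":
        # The pipeline is healthy even if drift was found, so a missed heartbeat
        # means the job did not run or could not plan (credentials, state lock, etc.).
        urllib.request.urlopen(heartbeat_url, timeout=10).close()
    else:
        print(result.stderr)
    return status
```

A cron job or CI schedule might call `run_drift_check("./infra", "https://heartbeat.web-alert.io/your-id")` and open a ticket whenever it returns `"drift"`.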

Post-apply validation step

Add a validation step after terraform apply:

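- name: Apply infrastructure
  run: terraform apply -auto-approve -input=false

- name: Validate endpoints
  run: |
    # Retry instead of a fixed sleep: propagation time varies from apply to apply.
    # --retry-all-errors (curl 7.71.0+) also retries DNS, connection, and TLS
    # errors, not just transient HTTP codes; 10 retries x 6 s waits about a minute.
    curl -sf -o /dev/null --retry 10 --retry-delay 6 --retry-all-errors https://api.example.com/health || exit 1
    curl -sf -o /dev/null --retry 10 --retry-delay 6 --retry-all-errors https://www.example.com/ || exit 1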

- name: Signal deploy complete
  run: curl -fsS --retry 3 https://heartbeat.web-alert.io/your-id

This closes the loop between "Terraform says it worked" and "the infrastructure actually works."

Apply duration monitoring

Track how long terraform apply takes. A sudden increase may indicate:

State file lock contention
Provider API rate limiting
Large-scale resource recreation
Dependency resolution issues
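Apply duration can be compared against recent history. This is a minimal sketch that assumes you already record each apply's wall-clock duration in seconds (for example, from CI job timings). The 2x-median threshold and five-sample minimum are arbitrary starting points. The check uses the median rather than the mean so that a single unusually slow run does not inflate the baseline.

```python
import statistics
from typing import Sequence


def is_slow_apply(history_s: Sequence[float], latest_s: float,
                  factor: float = 2.0, min_samples: int = 5) -> bool:
    """Flag an apply that took more than `factor` times the median of recent applies."""
    if len(history_s) < min_samples:
        return False  # not enough history to define "normal" yet
    return latest_s > factor * statistics.median(history_s)
```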

Common Terraform Failure Modes

Failure Mode	What Happens	Monitoring Detection
Security group blocks traffic	Service unreachable on specific ports	TCP port check + HTTP check failure
DNS record wrong or stale	Users hit wrong endpoint or get errors	DNS resolution check + content validation
Certificate not yet provisioned	HTTPS fails with SSL error	SSL check failure
Load balancer target unhealthy	502/503 errors for some requests	HTTP check + error rate monitoring
Database connection string changed	App cannot connect to database	Health endpoint check + error rate
IAM permission removed	Service cannot access dependencies	Content validation + error monitoring
Auto-scaling policy misconfigured	Service cannot handle traffic	Response time monitoring
Partial apply leaves inconsistent state	Some resources created, others failed	Endpoint-level checks across all services
Provider API timeout during apply	Resources in unknown state	Post-apply endpoint validation

Practical Setup for Terraform Teams

Minimum viable monitoring

For small teams running Terraform:

HTTP check on every public endpoint after each apply
SSL check on all domains managed by Terraform
DNS check on all DNS records managed by Terraform
Heartbeat for Terraform CI/CD pipeline completion
Response time alert to catch performance regressions from infrastructure changes
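One way to keep the check list in sync with Terraform is to generate it from `terraform output -json`. That command prints each output as an object with `sensitive`, `type`, and `value` fields. This sketch assumes a hypothetical naming convention in which endpoint outputs end in `_url`; adjust it to your modules. It skips sensitive outputs because `terraform output -json` prints their values in plain text.

```python
import json
from typing import List
from urllib.parse import urlparse


def endpoints_from_outputs(outputs_json: str, suffix: str = "_url") -> List[str]:
    """Collect HTTP(S) URLs from Terraform outputs whose names end with `suffix`."""
    outputs = json.loads(outputs_json)
    urls = []
    for name, output in sorted(outputs.items()):
        # Skip non-endpoint outputs and never copy secrets into monitoring config.
        if not name.endswith(suffix) or output.get("sensitive"):
            continue
        value = output.get("value")
        values = value if isinstance(value, list) else [value]  # string or list outputs
        for item in values:
            if isinstance(item, str) and urlparse(item).scheme in ("http", "https"):
                urls.append(item)
    return urls
```

Feed it the result of `subprocess.run(["terraform", "output", "-json"], capture_output=True, text=True, check=True).stdout` and register an HTTP check for each returned URL.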

Comprehensive monitoring

For teams with complex multi-environment Terraform:

All of the above, plus:
TCP port checks on databases, caches, and internal services
Content validation on key endpoints (not just status codes)
Per-environment monitoring (staging and production separately)
Post-apply smoke test step in CI/CD pipeline
Multi-region checks for globally distributed infrastructure
Alerting per workspace/environment so the right team is notified

How Webalert Helps

Webalert validates that your Terraform-managed infrastructure works from the user's perspective:

HTTP/HTTPS checks every minute from global regions — verify endpoints after every apply
SSL monitoring — catch certificate provisioning failures and expiry drift
DNS monitoring — detect record changes and propagation issues
TCP port checks — validate database, cache, and service connectivity
Content validation — confirm responses are correct, not just reachable
Response time tracking — detect performance regressions from infrastructure changes
Heartbeat monitoring — verify Terraform CI/CD pipelines complete on schedule
Multi-channel alerts — Email, SMS, Slack, Discord, Teams, webhooks
Status pages — communicate infrastructure incidents to users

See features and pricing for details.

Summary

terraform apply success does not guarantee working infrastructure.
Monitor endpoints, SSL, DNS, and connectivity after every Terraform apply.
Use heartbeat monitoring to verify Terraform CI/CD pipelines complete.
External monitoring catches infrastructure drift impact in real time.
Add a post-apply validation step to close the loop between configuration and reality.
Start with HTTP, SSL, and DNS checks on all Terraform-managed endpoints.

Terraform defines your infrastructure. Monitoring proves it works.

Validate every Terraform apply automatically

Start monitoring with Webalert →

See features and pricing. No credit card required.