Cloud Infrastructure Monitoring: AWS, Azure, and GCP Uptime Best Practices

Webalert Team
March 3, 2026
13 min read

Your cloud provider's SLA promises 99.99% uptime, which works out to roughly 52 minutes of downtime a year. That sounds like practically nothing.

But "cloud uptime" measures whether the provider's infrastructure is available — not whether your application running on that infrastructure is available. AWS can be perfectly healthy while your EC2 instance is unreachable, your RDS database is out of connections, or your S3 bucket policy is blocking requests.

Cloud infrastructure monitoring means watching your services from the outside, the same way your users experience them, regardless of what your provider's status page says.

This guide covers what to monitor, how cloud-specific failures differ from traditional hosting, and practical monitoring strategies for AWS, Azure, and GCP deployments.


Why Cloud Monitoring Is Different

Shared responsibility means shared blame

Every major cloud provider operates on a shared responsibility model: they're responsible for the infrastructure (hardware, network, hypervisor), and you're responsible for everything you deploy on it (OS, application, configuration, data).

When your application goes down, the root cause falls into one of three buckets:

  1. Cloud provider issue — Regional outage, service degradation, network problem. Rare but impactful.
  2. Your configuration — Security group blocking traffic, misconfigured load balancer, exhausted instance limits. Common and entirely your responsibility.
  3. Your application — Memory leak, unhandled exception, database connection exhaustion. Most common of all.

Internal cloud monitoring tools (CloudWatch, Azure Monitor, GCP Cloud Monitoring) are great at telling you about provider-level issues and resource metrics. But they can't tell you what your users actually experience. For that, you need external monitoring.

The provider's status page isn't enough

Cloud provider status pages are notoriously slow to update. AWS, Azure, and GCP have all had incidents where their status pages showed "all green" while customers experienced significant outages.

Reasons:

  • Status pages report on services globally or regionally, not on your specific resources
  • Updates are written by humans and go through review processes
  • Partial degradations often don't trigger status page updates
  • Your specific failure mode might not affect enough customers to register

External monitoring gives you your own source of truth, independent of the provider.


What to Monitor in the Cloud

1. Public-facing endpoints (HTTP/HTTPS)

The most important check: can your users reach your application?

What to monitor:

  • Your primary domain (https://yourapp.com)
  • API endpoints (https://api.yourapp.com/health)
  • Any public-facing services (webhooks, OAuth callbacks, CDN-served assets)

Why it matters in cloud: Load balancers (AWS ALB, Azure Application Gateway, Google Cloud Load Balancing) can silently fail to route traffic. Auto-scaling groups might scale to zero. CDN distributions might serve stale error pages. An HTTP check from outside your cloud network catches all of these.

Configuration:

  • Check every 1 minute
  • Verify status code (200) and response body (content validation)
  • Check from multiple regions to catch cloud-region-specific issues
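
As a rough illustration of the status-plus-content check described above, here is a minimal Python sketch using only the standard library. The names `CheckResult`, `evaluate_response`, and `check_endpoint` are made up for this example, and the 10-second timeout is an assumption, not a recommendation from any provider.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class CheckResult:
    ok: bool
    reason: str


def evaluate_response(status: int, body: str,
                      expected_status: int = 200,
                      must_contain: str = "") -> CheckResult:
    """Pass only if the status code matches and the body contains the marker."""
    if status != expected_status:
        return CheckResult(False, f"unexpected status {status}")
    if must_contain and must_contain not in body:
        return CheckResult(False, "expected content missing")
    return CheckResult(True, "ok")


def check_endpoint(url: str, must_contain: str = "",
                   timeout: float = 10.0) -> CheckResult:
    """Fetch the URL and validate it; network errors count as failures."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate_response(resp.status, body, must_contain=must_contain)
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses raise HTTPError; report the status code.
        return CheckResult(False, f"unexpected status {exc.code}")
    except (urllib.error.URLError, OSError) as exc:
        # DNS failures, refused connections, and timeouts end up here.
        return CheckResult(False, f"request failed: {exc}")
```

A call such as `check_endpoint("https://api.yourapp.com/health", must_contain="ok")` would fail on a 200 response that carries an error page, which a status-only check would miss. Running it from a single machine only tests one vantage point; the multi-region part still needs probes in several locations.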

2. SSL certificates

Cloud-managed certificates (AWS ACM, Azure App Service managed certificates, Google-managed certificates) handle renewal automatically, until they don't.

Common cloud SSL failures:

  • ACM certificate renewal fails because the DNS validation (CNAME) record was deleted
  • Azure managed certificate renewal fails due to custom domain misconfiguration
  • GCP-managed cert doesn't cover a recently added subdomain
  • Certificate attached to a load balancer that was recreated by Terraform/CloudFormation

Configuration:

  • Monitor certificate expiry (alert at 30, 14, and 7 days)
  • Monitor certificate validity (chain, domain match)
  • Check all subdomains, not just the primary domain
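
The expiry thresholds above can be sketched in a few lines of Python. This is an illustrative example, not how any particular monitoring product works: `fetch_not_after`, `days_until_expiry`, and `expiry_alert` are hypothetical names, and the default TLS context is assumed to be an acceptable stand-in for "chain and domain match" validation.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the certificate's notAfter string.

    The default context verifies the chain and hostname, so an invalid
    certificate raises ssl.SSLError instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` (timezone-aware) and the notAfter timestamp."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def expiry_alert(days_left: int, thresholds=(7, 14, 30)):
    """Return the tightest threshold that has been crossed, or None."""
    for limit in sorted(thresholds):
        if days_left <= limit:
            return limit
    return None
```

To cover every subdomain, you would loop `fetch_not_after` over each hostname you serve rather than only the primary domain.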

3. DNS resolution

DNS is the first point of failure for any cloud-hosted application. Cloud DNS services (Route 53, Azure DNS, Cloud DNS) are highly reliable but still have failure modes.

Cloud-specific DNS risks:

  • Route 53 health check routing fails, sending traffic to an unhealthy endpoint
  • Azure Traffic Manager profile misconfigured after infrastructure change
  • DNS records not updated after IP change (e.g., Elastic IP reassignment)
  • Terraform/IaC deployment overwrites DNS records incorrectly

Configuration:

  • Monitor DNS resolution for all critical domains
  • Verify records resolve to expected IPs
  • Check from multiple regions
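
One way to express "records resolve to expected IPs" is to compare the resolved set against an allow-list. The sketch below is an assumption-laden example: `resolve_ipv4` and `compare_records` are hypothetical helpers, the addresses in the usage note come from documentation ranges, and `socket.getaddrinfo` only asks the local resolver, so it does not replace checks from multiple regions.

```python
import socket


def resolve_ipv4(hostname: str) -> set[str]:
    """Resolve A records using the local resolver."""
    infos = socket.getaddrinfo(hostname, None,
                               family=socket.AF_INET, type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def compare_records(resolved: set[str], expected: set[str]) -> dict:
    """Report addresses that are unexpected or missing.

    The check passes when at least one address resolves and every
    resolved address is on the expected list.
    """
    return {
        "ok": bool(resolved) and resolved <= expected,
        "unexpected": sorted(resolved - expected),
        "missing": sorted(expected - resolved),
    }
```

For example, `compare_records(resolve_ipv4("yourapp.com"), {"203.0.113.10", "203.0.113.11"})` would flag a record that an IaC run overwrote with a different address. Treating a partial match as passing is a design choice that tolerates round-robin DNS; tighten it if you expect every record on every lookup.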

4. TCP port connectivity

Cloud security groups, network ACLs, and firewall rules are the most common cause of "it works from inside the VPC but not from outside." TCP port monitoring detects connectivity issues that HTTP checks might not catch.

Ports to monitor:

  • 443 (HTTPS) — Your main application
  • 5432 or 3306 (PostgreSQL/MySQL) — If database is publicly accessible (or via bastion/VPN endpoint)
  • 6379 (Redis) — If cache is externally accessible
  • 22 (SSH) — Bastion host or jump server
  • Custom ports for any public-facing services

Cloud-specific risks:

  • Security group rule removed or modified during deployment
  • Network ACL change blocked inbound traffic
  • VPC peering or transit gateway misconfigured
  • Instance replaced by auto-scaling with different security group
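
A TCP port check boils down to attempting a handshake with a timeout. Here is a minimal Python sketch; `port_is_open` is a hypothetical helper name and the 5-second timeout is an assumption.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """True if a TCP handshake completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

When debugging, the failure type is often a useful hint: a timeout tends to mean packets are being dropped by a security group, network ACL, or firewall, while an immediate refusal usually means the host is reachable but nothing is listening on that port.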

5. Response time and latency

Cloud applications can be "up" but unusably slow. Response time monitoring catches degradation before it becomes an outage.

Cloud-specific causes of latency:

  • Instance on degraded hardware, or contending with a noisy neighbor on a shared host
  • Cross-region database queries (application in us-east-1, database in eu-west-1)
  • Cold starts on serverless functions (Lambda, Azure Functions, Cloud Functions)
  • Auto-scaling hasn't caught up with traffic spike
  • CDN cache miss rate increased after deployment

Configuration:

  • Track response time trends over time
  • Alert on sustained increases (not just spikes)
  • Compare response times across regions to detect cloud-region-specific degradation
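
"Alert on sustained increases, not spikes" can be sketched as a window test against a baseline. The function names below are hypothetical, and the defaults (3x baseline over 5 samples, which is 5 minutes at 1-minute checks) are assumptions borrowed from the alerting tiers later in this guide.

```python
from statistics import median


def baseline_from_history(history_ms: list[float]) -> float:
    """Use the median, which is less sensitive to isolated spikes than the mean."""
    return median(history_ms)


def sustained_increase(samples_ms: list[float], baseline_ms: float,
                       factor: float = 3.0, window: int = 5) -> bool:
    """True if the last `window` samples all exceed factor x baseline."""
    if len(samples_ms) < window:
        return False
    return all(s > factor * baseline_ms for s in samples_ms[-window:])
```

A single 900 ms response among 100 ms responses does not trip this check, but five slow responses in a row do. Running the same test per region makes it easy to see when only one region has degraded.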

6. Background jobs and scheduled tasks

Cloud schedulers (EventBridge, Azure Logic Apps, Cloud Scheduler) are reliable, but the tasks they trigger can fail silently.

What to monitor:

  • Cron jobs that process data, send emails, or clean up resources
  • Scheduled backups
  • Queue processors and workers
  • Periodic health or integrity checks

Configuration:

  • Use heartbeat monitoring — your job pings a URL when it completes successfully
  • If the heartbeat is missed, you get an alert
  • This catches both "the scheduler didn't fire" and "the job failed"
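
Both sides of heartbeat monitoring fit in a short Python sketch. Everything named here is hypothetical: `run_with_heartbeat` is the job-side wrapper, `heartbeat_missed` is the monitor-side test, and the heartbeat URL is whatever your monitoring tool gives you. The grace period is an assumption to allow for jobs that run slightly long.

```python
import urllib.request
from datetime import datetime, timedelta


def run_with_heartbeat(job, heartbeat_url: str, timeout: float = 10.0) -> None:
    """Job side: ping only after the job returns without raising."""
    job()  # an exception here skips the ping, so the monitor will alert
    urllib.request.urlopen(heartbeat_url, timeout=timeout).close()


def heartbeat_missed(last_ping: datetime, now: datetime,
                     interval: timedelta, grace: timedelta) -> bool:
    """Monitor side: a ping is overdue once interval + grace has elapsed."""
    return now - last_ping > interval + grace
```

Because the ping happens only after success, the monitor cannot tell "never started" from "started and crashed". That is the point: either way, the work did not get done.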

Cloud Provider-Specific Monitoring

AWS Monitoring Checklist

Resource | What to Monitor | How
EC2 / ECS / EKS | Application reachable, response time | HTTP check on public endpoint
ALB / NLB | Traffic routing, health check status | HTTP check through load balancer URL
RDS / Aurora | Connectivity, query performance | TCP port check + API response time
S3 (static hosting) | Content accessible, correct responses | HTTP check with content validation
CloudFront | CDN serving correct content | HTTP check from multiple regions
Route 53 | DNS resolving correctly | DNS monitoring
ACM certificates | Validity, expiry | SSL monitoring
Lambda (via API GW) | Function responding, cold start latency | HTTP check on API Gateway endpoint
EventBridge + Lambda | Scheduled tasks completing | Heartbeat monitoring

AWS-specific gotcha: When an Auto Scaling group replaces instances, the new instances might have different configurations if the launch template was modified. External HTTP monitoring catches this immediately — the response changes or breaks.

Azure Monitoring Checklist

Resource | What to Monitor | How
App Service / VMs | Application reachable, response time | HTTP check on public endpoint
Application Gateway | Traffic routing, WAF not blocking legitimate traffic | HTTP check through gateway URL
Azure SQL / Cosmos DB | Connectivity, query performance | TCP port check + API response time
Blob Storage (static) | Content accessible | HTTP check with content validation
Azure CDN / Front Door | Serving correct content, latency | HTTP check from multiple regions
Azure DNS | Records resolving correctly | DNS monitoring
Managed certificates | Validity, expiry | SSL monitoring
Azure Functions (via APIM) | Function responding | HTTP check on API endpoint
Logic Apps (successor to Azure Scheduler) | Scheduled workflows completing | Heartbeat monitoring

Azure-specific gotcha: Azure App Service has a "warm-up" behavior after deployments and scaling events. Your app might return 200 but with significantly higher latency for the first few minutes. Response time monitoring with threshold alerts catches this degradation.

GCP Monitoring Checklist

Resource | What to Monitor | How
Compute Engine / GKE / Cloud Run | Application reachable, response time | HTTP check on public endpoint
Cloud Load Balancing | Traffic routing, backend health | HTTP check through LB URL
Cloud SQL / Firestore | Connectivity, query performance | TCP port check + API response time
Cloud Storage (static) | Content accessible | HTTP check with content validation
Cloud CDN | Serving correct content | HTTP check from multiple regions
Cloud DNS | Records resolving correctly | DNS monitoring
Google-managed certificates | Validity, expiry | SSL monitoring
Cloud Functions / Cloud Run | Function responding, cold start latency | HTTP check on endpoint
Cloud Scheduler | Scheduled tasks completing | Heartbeat monitoring

GCP-specific gotcha: Cloud Run scales to zero by default. The first request after scale-down incurs a cold start. If your monitoring check interval is longer than the scale-down window, every check triggers a cold start and shows elevated latency. Use 1-minute checks to keep the service warm, or set a minimum instance count.


Multi-Cloud and Hybrid Monitoring

If you run infrastructure across multiple providers or have a hybrid cloud/on-premises setup, monitoring becomes even more important.

Cross-provider dependencies

Your application might use AWS for compute, Cloudflare for CDN, and a third-party payment API. A failure in any one of these breaks the user experience. Monitor each dependency independently:

  • HTTP check on your primary application
  • HTTP check on your CDN-served assets
  • HTTP check on critical third-party APIs
  • DNS monitoring for your domain (which might be on a different provider than your hosting)

The single pane of glass

When infrastructure spans multiple providers, you can't rely on any single provider's monitoring dashboard. You need an independent, external tool that monitors everything from the user's perspective — regardless of where it's hosted.

This is where external uptime monitoring is essential: it doesn't care whether your application runs on AWS, Azure, GCP, or a server under your desk. It checks the URL and tells you if it works.


Infrastructure as Code and Monitoring

If you manage infrastructure with Terraform, CloudFormation, Pulumi, or similar tools, your monitoring should be part of that process.

Why IaC matters for monitoring

Infrastructure changes are among the most common causes of cloud outages. A Terraform apply that modifies a security group, a CloudFormation update that recreates a load balancer, a Pulumi deployment that changes DNS records: any of these can break your application.

Having monitoring in place means you catch these breaks immediately after deployment, not when a user reports a problem hours later.

The deployment monitoring pattern

  1. Deploy infrastructure change (Terraform apply, CloudFormation update)
  2. External monitor checks endpoint (within 1 minute)
  3. If check fails → Alert fires → Team investigates and rolls back
  4. If check passes → Deployment confirmed successful

This pattern works regardless of your IaC tool, CI/CD pipeline, or cloud provider. The external monitor is the final validation that your change didn't break anything.
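
If you want the same validation inside a CI/CD pipeline, the four steps above can be approximated with a small polling loop. This Python sketch is one possible shape, not a prescribed implementation: `verify_deployment` is a hypothetical name, `check` is any callable that returns True when the endpoint is healthy, and the defaults (5 attempts, 60 seconds apart, 3 consecutive passes) are assumptions.

```python
import time
from typing import Callable


def verify_deployment(check: Callable[[], bool], attempts: int = 5,
                      interval_s: float = 60.0, required_passes: int = 3,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True once `required_passes` consecutive checks succeed.

    Returns False if that never happens within `attempts` checks, which
    the caller can treat as the signal to alert and roll back.
    """
    streak = 0
    for attempt in range(attempts):
        streak = streak + 1 if check() else 0
        if streak >= required_passes:
            return True
        if attempt < attempts - 1:
            sleep(interval_s)
    return False
```

Requiring several consecutive passes, rather than one, guards against declaring success while old instances are still draining and new ones have not yet taken traffic. The `sleep` parameter exists so the loop can be tested without waiting.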


Alerting Strategy for Cloud Infrastructure

Tier 1: Immediate response (page on-call)

  • Primary application endpoint down
  • API returning 5xx errors
  • SSL certificate invalid or expired
  • DNS resolution failing

Channels: SMS + phone call + Slack/Discord

Tier 2: Urgent but not immediate (alert team channel)

  • Response time >3x baseline sustained for 5+ minutes
  • Single-region check failing (others passing)
  • Background job heartbeat missed
  • Certificate expiring within 7 days

Channels: Slack/Discord + email

Tier 3: Informational (log for review)

  • Response time slightly elevated
  • Single check failure (recovered on next check)
  • Certificate expiring within 30 days
  • DNS TTL changed

Channels: Email or monitoring dashboard
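
The three tiers can be written down as routing rules. The Python sketch below covers two of the signals (check failures and certificate expiry); the names `Tier`, `tier_for_check_failure`, and `tier_for_certificate` are hypothetical. Two simplifications are assumptions of this example: any partial-region failure is treated as Tier 2, and a first, unconfirmed failure is treated as Tier 3.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    PAGE = 1     # Tier 1: SMS + phone call + Slack/Discord
    URGENT = 2   # Tier 2: Slack/Discord + email
    INFO = 3     # Tier 3: email or monitoring dashboard


def tier_for_check_failure(regions_failing: int, regions_total: int,
                           consecutive_failures: int) -> Tier:
    """Route an endpoint failure by how widespread and persistent it is."""
    if consecutive_failures < 2:
        return Tier.INFO      # may recover on the next check
    if regions_failing < regions_total:
        return Tier.URGENT    # some regions still pass
    return Tier.PAGE          # down from everywhere


def tier_for_certificate(days_left: int) -> Optional[Tier]:
    """Route certificate expiry; None means no alert is needed yet."""
    if days_left <= 0:
        return Tier.PAGE
    if days_left <= 7:
        return Tier.URGENT
    if days_left <= 30:
        return Tier.INFO
    return None
```

Keeping the rules in one place like this makes it easier to review who gets woken up and why.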

Avoid cloud-monitoring alert storms

Cloud infrastructure can generate enormous volumes of alerts during a regional incident. Hundreds of checks fail simultaneously, each triggering its own alert. Strategies to manage this:

  • Group alerts by service — One alert for "Production API is down" instead of separate alerts for each endpoint
  • Use escalation policies — First alert goes to Slack. If not acknowledged in 5 minutes, SMS. If not acknowledged in 10 minutes, phone call.
  • Require consecutive failures — Alert after 2-3 consecutive check failures, not on the first one. This filters transient blips.
  • Use maintenance windows — During planned deployments, suppress alerts to avoid false positives
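
Two of these strategies, consecutive-failure filtering and timed escalation, are simple enough to sketch in Python. The function names and the `ESCALATION` table are hypothetical; the 0, 5, and 10 minute steps mirror the example policy in the list above.

```python
def should_alert(history: list[bool], required_failures: int = 3) -> bool:
    """`history` holds check results, newest last; True means the check passed.

    Alert only when the most recent `required_failures` checks all failed.
    """
    recent = history[-required_failures:]
    return len(recent) == required_failures and not any(recent)


# (minutes without acknowledgement, channel to notify)
ESCALATION = [(0, "slack"), (5, "sms"), (10, "phone")]


def channels_due(minutes_unacknowledged: float) -> list[str]:
    """Channels that should have been notified by this point."""
    return [channel for after, channel in ESCALATION
            if minutes_unacknowledged >= after]
```

With 1-minute checks, `required_failures=3` delays the first alert by about two minutes in exchange for filtering transient blips. Whether that trade is worth it depends on how costly a false page is for your team.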

How Webalert Monitors Cloud Infrastructure

Webalert provides external monitoring that's independent of your cloud provider — giving you an unbiased view of what your users actually experience:

  • HTTP/HTTPS monitoring — Check any endpoint from multiple global regions every 1 minute
  • SSL monitoring — Catch certificate issues before they become outages, even with cloud-managed certs
  • DNS monitoring — Verify resolution works correctly, independent of your cloud DNS provider
  • TCP port monitoring — Confirm database, cache, and service ports are reachable through firewalls and security groups
  • Content validation — Verify correct responses, not just status codes — catch misconfigured deployments
  • Response time tracking — Detect performance degradation from noisy neighbors, cold starts, or scaling delays
  • Heartbeat monitoring — Confirm cloud-scheduled jobs and serverless functions run on time
  • Multi-region checks — Detect cloud-region-specific failures that single-location checks miss
  • On-call and escalation — Route alerts to the right person with tiered notification channels
  • Status pages — Keep users informed during cloud incidents

Your cloud provider monitors their infrastructure. Webalert monitors your application on their infrastructure.

See features and pricing for the full details.


Summary

  • Your cloud provider's uptime isn't your uptime. Their infrastructure can be healthy while your application is broken.
  • External monitoring is essential. It checks what your users experience, independent of provider dashboards and status pages.
  • Monitor the full stack: HTTP endpoints, SSL, DNS, TCP ports, response time, and scheduled tasks.
  • Each cloud provider has specific failure modes — auto-scaling misconfigurations, security group changes, cold starts, certificate renewal failures. Know yours.
  • Infrastructure deployments are a leading risk. External monitoring is your final validation that a change didn't break anything.
  • Layer your alerts: Immediate for outages, urgent for degradation, informational for trends.

Cloud makes infrastructure easier to provision. It doesn't make it easier to keep running. That's what monitoring is for.


