Cloud Infrastructure Monitoring: AWS, Azure, and GCP Uptime Best Practices

Your cloud provider's SLA promises 99.99% uptime. That sounds like practically zero downtime.

But "cloud uptime" measures whether the provider's infrastructure is available — not whether your application running on that infrastructure is available. AWS can be perfectly healthy while your EC2 instance is unreachable, your RDS database is out of connections, or your S3 bucket policy is blocking requests.
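To put the 99.99% figure in perspective, here is a small sketch that converts an availability percentage into a downtime budget. It is plain arithmetic, assuming a 365-day year and a 30-day month; the function name is illustrative and not part of any provider's tooling.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Minutes of downtime allowed in `period_days` at the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for pct in (99.9, 99.95, 99.99):
        per_year = downtime_budget_minutes(pct, 365)
        per_month = downtime_budget_minutes(pct, 30)
        print(f"{pct}%: {per_year:.1f} min/year, {per_month:.1f} min/month")
```

At 99.99%, the budget works out to roughly 52.6 minutes per year, or about 4.3 minutes per 30-day month. That budget describes the provider's infrastructure only; it says nothing about your application.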

Cloud infrastructure monitoring means watching your services from the outside, the same way your users experience them, regardless of what your provider's status page says.

This guide covers what to monitor, how cloud-specific failures differ from traditional hosting, and practical monitoring strategies for AWS, Azure, and GCP deployments.

Why Cloud Monitoring Is Different

Shared responsibility means shared blame

Every major cloud provider operates on a shared responsibility model: they're responsible for the infrastructure (hardware, network, hypervisor), and you're responsible for everything you deploy on it (OS, application, configuration, data).

When your application goes down, the root cause falls into one of three buckets:

Cloud provider issue — Regional outage, service degradation, network problem. Rare but impactful.
Your configuration — Security group blocking traffic, misconfigured load balancer, exhausted instance limits. Common and entirely your responsibility.
Your application — Memory leak, unhandled exception, database connection exhaustion. Most common of all.

Internal cloud monitoring tools (CloudWatch, Azure Monitor, Google Cloud Monitoring) are great at telling you about provider-level issues and resource metrics. But they can't tell you what your users actually experience. For that, you need external monitoring.

The provider's status page isn't enough

Cloud provider status pages are notoriously slow to update. AWS, Azure, and GCP have all had incidents where their status pages showed "all green" while customers experienced significant outages.

Reasons:

Status pages report on services globally or regionally, not on your specific resources
Updates are written by humans and go through review processes
Partial degradations often don't trigger status page updates
Your specific failure mode might not affect enough customers to register

External monitoring gives you your own source of truth, independent of the provider.

What to Monitor in the Cloud

1. Public-facing endpoints (HTTP/HTTPS)

The most important check: can your users reach your application?

What to monitor:

Your primary domain (https://yourapp.com)
API endpoints (https://api.yourapp.com/health)
Any public-facing services (webhooks, OAuth callbacks, CDN-served assets)

Why it matters in cloud: Load balancers (ALB, Azure Application Gateway, Google Cloud Load Balancing) can silently fail to route traffic. Auto-scaling groups might scale to zero. CDN distributions might serve stale error pages. An HTTP check run from outside your cloud network catches all of these.

Configuration:

Check every 1 minute
Verify status code (200) and response body (content validation)
Check from multiple regions to catch cloud-region-specific issues
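As a rough illustration of the configuration above, the sketch below performs one HTTP check and validates both the status code and the response body. It uses only the Python standard library. The function names and the `must_contain` parameter are assumptions made for this example, not the API of any monitoring product, and a real monitor would run this from several regions on a schedule.

```python
import urllib.error
import urllib.request


def evaluate_response(status: int, body: str,
                      expected_status: int = 200,
                      must_contain: str = "") -> tuple[bool, str]:
    """Decide whether a response counts as healthy."""
    if status != expected_status:
        return False, f"unexpected status {status}"
    if must_contain and must_contain not in body:
        return False, f"body is missing {must_contain!r}"
    return True, "ok"


def check_endpoint(url: str, must_contain: str = "",
                   timeout: float = 10.0) -> tuple[bool, str]:
    """Fetch `url` once and validate the status code and body content."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate_response(resp.status, body, must_contain=must_contain)
    except urllib.error.HTTPError as exc:  # 4xx and 5xx responses
        return False, f"unexpected status {exc.code}"
    except OSError as exc:  # DNS failure, connection refused, timeout
        return False, f"request failed: {exc}"
```

A call such as `check_endpoint("https://api.yourapp.com/health", must_contain="ok")` returns a `(healthy, reason)` pair. The content check is what catches a load balancer or CDN that answers 200 with an error page. Note that `urlopen` follows redirects, so the status examined is that of the final response.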

2. SSL certificates

Cloud-managed certificates (AWS ACM, Azure App Service Managed Certificates, Google-managed certificates) handle renewal automatically, until they don't.

Common cloud SSL failures:

ACM certificate validation fails because DNS verification record was deleted
Azure managed certificate renewal fails due to custom domain misconfiguration
GCP-managed cert doesn't cover a recently added subdomain
Certificate attached to a load balancer that was recreated by Terraform/CloudFormation

Configuration:

Monitor certificate expiry (alert at 30, 14, and 7 days)
Monitor certificate validity (chain, domain match)
Check all subdomains, not just the primary domain
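The expiry thresholds above can be sketched with Python's standard `ssl` module. This is a minimal example under stated assumptions: the function names are illustrative, and the 30/14/7-day thresholds are taken from the configuration list. Connecting with the default context also validates the chain and the hostname, so an invalid certificate surfaces as an exception rather than a date.

```python
import socket
import ssl
import time
from typing import Optional

ALERT_THRESHOLDS_DAYS = (7, 14, 30)  # tightest first; values from the list above


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter date (getpeercert() format)."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def crossed_threshold(days_left: float) -> Optional[int]:
    """Return the tightest threshold (in days) that has been crossed, if any."""
    for threshold in ALERT_THRESHOLDS_DAYS:
        if days_left <= threshold:
            return threshold
    return None


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Fetch the notAfter field of the certificate served for `host`.

    The default context verifies the chain and hostname, so an invalid
    certificate raises ssl.SSLCertVerificationError here.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

To cover every subdomain, you would loop over a list of hostnames and call `crossed_threshold(days_until_expiry(fetch_not_after(host)))` for each one.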

3. DNS resolution

DNS resolution is the first step of every request, so a DNS failure takes down a cloud-hosted application before any other component is reached. Cloud DNS services (Route 53, Azure DNS, Cloud DNS) are highly reliable but still have failure modes.

Cloud-specific DNS risks:

Route 53 health check routing fails, sending traffic to an unhealthy endpoint
Azure Traffic Manager profile misconfigured after infrastructure change
DNS records not updated after IP change (e.g., Elastic IP reassignment)
Terraform/IaC deployment overwrites DNS records incorrectly

Configuration:

Monitor DNS resolution for all critical domains
Verify records resolve to expected IPs
Check from multiple regions
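One way to verify that records resolve to the expected IPs is sketched below. It relies on the system resolver through `socket.getaddrinfo`, so results may come from a local cache; a dedicated DNS monitor would typically query from several regions. The function names and the example addresses (from the 203.0.113.0/24 documentation range) are assumptions for illustration.

```python
import socket


def resolve_ips(hostname: str) -> set[str]:
    """Resolve a hostname to its set of IP addresses via the system resolver."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def compare_records(resolved: set[str], expected: set[str]) -> dict[str, set[str]]:
    """Report addresses that are missing or unexpected relative to `expected`."""
    return {
        "missing": expected - resolved,
        "unexpected": resolved - expected,
    }
```

For example, `compare_records(resolve_ips("yourapp.com"), {"203.0.113.10"})` returns empty sets when the record matches. A non-empty `unexpected` set is the signal to look for after an Elastic IP reassignment or an IaC run that rewrote records.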

4. TCP port connectivity

Cloud security groups, network ACLs, and firewall rules are the most common cause of "it works from inside the VPC but not from outside." TCP port monitoring detects connectivity issues that HTTP checks might not catch.

Ports to monitor:

443 (HTTPS) — Your main application
5432 or 3306 (PostgreSQL/MySQL) — If database is publicly accessible (or via bastion/VPN endpoint)
6379 (Redis) — If cache is externally accessible
22 (SSH) — Bastion host or jump server
Custom ports for any public-facing services

Cloud-specific risks:

Security group rule removed or modified during deployment
Network ACL change blocked inbound traffic
VPC peering or transit gateway misconfigured
Instance replaced by auto-scaling with different security group
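A TCP port check is little more than an attempted handshake. The sketch below shows the idea with the standard library; the function name and default timeout are illustrative choices.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP handshake with host:port completes within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or DNS failure
        return False
```

When interpreting failures, the kind of error can be informative: a blocked security group or firewall rule typically shows up as a timeout because packets are dropped silently, while "connection refused" usually means the host was reached but nothing is listening on the port. This sketch folds both into `False` for simplicity.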

5. Response time and latency

Cloud applications can be "up" but unusably slow. Response time monitoring catches degradation before it becomes an outage.

Cloud-specific causes of latency:

Instance running on degraded hardware or competing with a noisy neighbor for shared resources
Cross-region database queries (application in us-east-1, database in eu-west-1)
Cold starts on serverless functions (Lambda, Azure Functions, Cloud Functions)
Auto-scaling hasn't caught up with traffic spike
CDN cache miss rate increased after deployment

Configuration:

Track response time trends
Alert on sustained increases (not just spikes)
Compare response times across regions to detect cloud-region-specific degradation
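Alerting on sustained increases rather than single spikes can be sketched as a check over the most recent samples. The defaults below (3x baseline over 5 consecutive samples, which is 5 minutes at 1-minute checks) mirror the Tier 2 rule later in this guide; using the median as the baseline is an assumption of this example, chosen because one slow request does not move it.

```python
from statistics import median


def baseline_from_history(history_ms: list[float]) -> float:
    """Use the median of past samples as a spike-resistant baseline."""
    return median(history_ms)


def sustained_increase(samples_ms: list[float], baseline_ms: float,
                       factor: float = 3.0, window: int = 5) -> bool:
    """True if each of the last `window` samples exceeds factor * baseline."""
    if len(samples_ms) < window:
        return False
    return all(s > factor * baseline_ms for s in samples_ms[-window:])
```

A single 900 ms response against a 100 ms baseline does not trigger this; five slow checks in a row do.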

6. Background jobs and scheduled tasks

Cloud schedulers (EventBridge, Azure Logic Apps, Cloud Scheduler) are reliable, but the tasks they trigger can fail silently.

What to monitor:

Cron jobs that process data, send emails, or clean up resources
Scheduled backups
Queue processors and workers
Periodic health or integrity checks

Configuration:

Use heartbeat monitoring — your job pings a URL when it completes successfully
If the heartbeat is missed, you get an alert
This catches both "the scheduler didn't fire" and "the job failed"
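The heartbeat pattern above might look like this inside a scheduled job. The heartbeat URL is hypothetical (a monitoring service would normally issue one per job), and the wrapper name is an assumption for this sketch. The key property is that the ping is sent only when the job finishes without raising.

```python
import urllib.request
from typing import Callable

# Hypothetical URL; substitute the one issued by your heartbeat monitor.
HEARTBEAT_URL = "https://heartbeat.example.com/ping/nightly-backup"


def send_ping(url: str = HEARTBEAT_URL, timeout: float = 10.0) -> None:
    """Signal success by requesting the heartbeat URL."""
    with urllib.request.urlopen(url, timeout=timeout):
        pass


def run_with_heartbeat(job: Callable[[], None],
                       ping: Callable[[], None] = send_ping) -> bool:
    """Run `job`; ping only if it finishes without raising."""
    try:
        job()
    except Exception as exc:
        print(f"job failed, heartbeat withheld: {exc}")
        return False
    try:
        ping()
    except OSError as exc:
        print(f"job succeeded but heartbeat could not be sent: {exc}")
    return True
```

If the scheduler never fires, the wrapper never runs and no ping arrives. If the job raises, the ping is withheld. Either way the monitor sees a missed heartbeat.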

Cloud Provider-Specific Monitoring

AWS Monitoring Checklist

Resource	What to Monitor	How
EC2 / ECS / EKS	Application reachable, response time	HTTP check on public endpoint
ALB / NLB	Traffic routing, health check status	HTTP check through load balancer URL
RDS / Aurora	Connectivity, query performance	TCP port check + API response time
S3 (static hosting)	Content accessible, correct responses	HTTP check with content validation
CloudFront	CDN serving correct content	HTTP check from multiple regions
Route 53	DNS resolving correctly	DNS monitoring
ACM certificates	Validity, expiry	SSL monitoring
Lambda (via API GW)	Function responding, cold start latency	HTTP check on API Gateway endpoint
EventBridge + Lambda	Scheduled tasks completing	Heartbeat monitoring

AWS-specific gotcha: When an Auto Scaling group replaces instances, the new instances might have different configurations if the launch template was modified. External HTTP monitoring catches this immediately — the response changes or breaks.

Azure Monitoring Checklist

Resource	What to Monitor	How
App Service / VMs	Application reachable, response time	HTTP check on public endpoint
Application Gateway	Traffic routing, WAF not blocking legit traffic	HTTP check through gateway URL
Azure SQL / Cosmos DB	Connectivity, query performance	TCP port check + API response time
Blob Storage (static)	Content accessible	HTTP check with content validation
Azure CDN / Front Door	Serving correct content, latency	HTTP check from multiple regions
Azure DNS	Records resolving correctly	DNS monitoring
Managed certificates	Validity, expiry	SSL monitoring
Azure Functions (via APIM)	Function responding	HTTP check on API endpoint
Azure Logic Apps	Scheduled workflows completing	Heartbeat monitoring

Azure-specific gotcha: Azure App Service has a "warm-up" behavior after deployments and scaling events. Your app might return 200 but with significantly higher latency for the first few minutes. Response time monitoring with threshold alerts catches this degradation.

GCP Monitoring Checklist

Resource	What to Monitor	How
Compute Engine / GKE / Cloud Run	Application reachable, response time	HTTP check on public endpoint
Cloud Load Balancing	Traffic routing, backend health	HTTP check through LB URL
Cloud SQL / Firestore	Connectivity, query performance	TCP port check + API response time
Cloud Storage (static)	Content accessible	HTTP check with content validation
Cloud CDN	Serving correct content	HTTP check from multiple regions
Cloud DNS	Records resolving correctly	DNS monitoring
Google-managed certificates	Validity, expiry	SSL monitoring
Cloud Functions / Cloud Run	Function responding, cold start latency	HTTP check on endpoint
Cloud Scheduler	Scheduled tasks completing	Heartbeat monitoring

GCP-specific gotcha: Cloud Run scales to zero by default. The first request after scale-down incurs a cold start. If your monitoring check interval is longer than the scale-down window, every check triggers a cold start and shows elevated latency. Use 1-minute checks to keep the service warm, or set a minimum instance count.

Multi-Cloud and Hybrid Monitoring

If you run infrastructure across multiple providers or have a hybrid cloud/on-premise setup, monitoring becomes even more important.

Cross-provider dependencies

Your application might use AWS for compute, Cloudflare for CDN, and a third-party payment API. A failure in any one of these breaks the user experience. Monitor each dependency independently:

HTTP check on your primary application
HTTP check on your CDN-served assets
HTTP check on critical third-party APIs
DNS monitoring for your domain (which might be on a different provider than your hosting)

The single pane of glass

When infrastructure spans multiple providers, you can't rely on any single provider's monitoring dashboard. You need an independent, external tool that monitors everything from the user's perspective — regardless of where it's hosted.

This is where external uptime monitoring is essential: it doesn't care whether your application runs on AWS, Azure, GCP, or a server under your desk. It checks the URL and tells you if it works.

Infrastructure as Code and Monitoring

If you manage infrastructure with Terraform, CloudFormation, Pulumi, or similar tools, your monitoring should be part of that process.

Why IaC matters for monitoring

Infrastructure changes are among the most common causes of cloud outages. A Terraform apply that modifies a security group, a CloudFormation update that recreates a load balancer, or a Pulumi deployment that changes DNS records can each break your application.

Having monitoring in place means you catch these breaks immediately after deployment, not when a user reports a problem hours later.

The deployment monitoring pattern

Deploy infrastructure change (Terraform apply, CloudFormation update)
External monitor checks endpoint (within 1 minute)
If check fails → Alert fires → Team investigates and rolls back
If check passes → Deployment confirmed successful

This pattern works regardless of your IaC tool, CI/CD pipeline, or cloud provider. The external monitor is the final validation that your change didn't break anything.
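If you want the pipeline itself to act on the result, the pattern can be sketched as a post-deploy gate that polls an external check and reports whether the deployment looks healthy. Everything here is an assumption for illustration: the function name, the requirement of three consecutive passes, and the 60-second interval. The `check` callable stands in for whatever external probe you use.

```python
import time
from typing import Callable


def verify_deployment(check: Callable[[], bool],
                      required_passes: int = 3,
                      max_attempts: int = 10,
                      interval_s: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `check` after a deploy; succeed after `required_passes` consecutive passes."""
    streak = 0
    for attempt in range(max_attempts):
        if check():
            streak += 1
            if streak >= required_passes:
                return True
        else:
            streak = 0  # a failure resets the run of passes
        if attempt < max_attempts - 1:
            sleep(interval_s)
    return False
```

A CI/CD step could call this after `terraform apply` and trigger a rollback when it returns `False`. Requiring consecutive passes guards against declaring success on one lucky request while instances are still being replaced.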

Alerting Strategy for Cloud Infrastructure

Tier 1: Immediate response (page on-call)

Primary application endpoint down
API returning 5xx errors
SSL certificate invalid or expired
DNS resolution failing

Channels: SMS + phone call + Slack/Discord

Tier 2: Urgent but not immediate (alert team channel)

Response time >3x baseline sustained for 5+ minutes
Single-region check failing (others passing)
Background job heartbeat missed
Certificate expiring within 7 days

Channels: Slack/Discord + email

Tier 3: Informational (log for review)

Response time slightly elevated
Single check failure (recovered on next check)
Certificate expiring within 30 days
DNS TTL changed

Channels: Email or monitoring dashboard
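The three tiers can be expressed as a small classification function. This is a sketch, not a complete policy: the field names are invented for the example, the 1.5x cutoff for "slightly elevated" is an assumption (the tiers above give no number), and a few conditions such as DNS TTL changes are left out for brevity.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    PAGE = 1    # SMS + phone call + Slack/Discord
    URGENT = 2  # Slack/Discord + email
    INFO = 3    # email or dashboard


@dataclass
class CheckState:
    endpoint_down: bool = False          # primary endpoint failing or returning 5xx
    dns_failing: bool = False
    cert_invalid: bool = False
    cert_days_left: Optional[float] = None
    single_region_failing: bool = False  # one region fails, others pass
    heartbeat_missed: bool = False
    latency_ratio: float = 1.0           # current response time / baseline
    minutes_elevated: int = 0


def classify(state: CheckState) -> Optional[Tier]:
    """Map the current check state to an alert tier, or None if all is well."""
    cert = state.cert_days_left
    expired = cert is not None and cert <= 0
    if state.endpoint_down or state.dns_failing or state.cert_invalid or expired:
        return Tier.PAGE
    if (state.single_region_failing or state.heartbeat_missed
            or (cert is not None and cert <= 7)
            or (state.latency_ratio > 3 and state.minutes_elevated >= 5)):
        return Tier.URGENT
    if (cert is not None and cert <= 30) or state.latency_ratio > 1.5:
        return Tier.INFO
    return None
```

Evaluating the most severe tier first means a state that matches several rules is always routed to the loudest applicable channel.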

Avoid cloud-monitoring alert storms

Cloud infrastructure can generate enormous volumes of alerts during a regional incident. Hundreds of checks fail simultaneously, each triggering its own alert. Strategies to manage this:

Group alerts by service — One alert for "Production API is down" instead of separate alerts for each endpoint
Use escalation policies — First alert goes to Slack. If not acknowledged in 5 minutes, SMS. If not acknowledged in 10 minutes, phone call.
Require consecutive failures — Alert after 2-3 consecutive check failures, not on the first one. This filters transient blips.
Use maintenance windows — During planned deployments, suppress alerts to avoid false positives

How Webalert Monitors Cloud Infrastructure

Webalert provides external monitoring that's independent of your cloud provider — giving you an unbiased view of what your users actually experience:

HTTP/HTTPS monitoring — Check any endpoint from multiple global regions every 1 minute
SSL monitoring — Catch certificate issues before they become outages, even with cloud-managed certs
DNS monitoring — Verify resolution works correctly, independent of your cloud DNS provider
TCP port monitoring — Confirm database, cache, and service ports are reachable through firewalls and security groups
Content validation — Verify correct responses, not just status codes — catch misconfigured deployments
Response time tracking — Detect performance degradation from noisy neighbors, cold starts, or scaling delays
Heartbeat monitoring — Confirm cloud-scheduled jobs and serverless functions run on time
Multi-region checks — Detect cloud-region-specific failures that single-location checks miss
On-call and escalation — Route alerts to the right person with tiered notification channels
Status pages — Keep users informed during cloud incidents

Your cloud provider monitors their infrastructure. Webalert monitors your application on their infrastructure.

See features and pricing for the full details.

Summary

Your cloud provider's uptime isn't your uptime. Their infrastructure can be healthy while your application is broken.
External monitoring is essential. It checks what your users experience, independent of provider dashboards and status pages.
Monitor the full stack: HTTP endpoints, SSL, DNS, TCP ports, response time, and scheduled tasks.
Each cloud provider has specific failure modes — auto-scaling misconfigurations, security group changes, cold starts, certificate renewal failures. Know yours.
Infrastructure deployments are a leading risk. External monitoring is your final validation that a change didn't break anything.
Layer your alerts: Immediate for outages, urgent for degradation, informational for trends.

Cloud makes infrastructure easier to provision. It doesn't make it easier to keep running. That's what monitoring is for.

Monitor what your users experience, not what your cloud dashboard shows

Start monitoring free with Webalert →

See features and pricing. No credit card required.

