gRPC Monitoring: Health Checks for Microservices

Webalert Team
May 4, 2026
11 min read

Your microservices are deployed. Pods are healthy. Liveness probes are passing. Your service mesh dashboard is mostly green.

Then a single gRPC method starts returning RESOURCE_EXHAUSTED to one upstream caller, but only when the request payload exceeds 4MB. Or a streaming RPC quietly stops sending messages while the connection stays open. Or one client hits a stale binding to a pod that's already been replaced and gets UNAVAILABLE for two minutes before reconnecting.

None of these failure modes show up in HTTP-style uptime monitoring. gRPC has its own protocol semantics, status codes, and failure patterns — and most monitoring tools were built for the REST world.

This guide covers what to monitor on a gRPC service, how to do it across the four call patterns (unary, server streaming, client streaming, bidirectional), and how to integrate gRPC monitoring with the rest of your microservice observability.


Why gRPC Needs Its Own Monitoring Approach

gRPC is not HTTP-with-a-different-syntax. It has its own protocol semantics that demand specific monitoring:

  • HTTP/2 framing — Everything runs over HTTP/2; a single dropped connection can fail many multiplexed in-flight RPCs at once
  • Status codes are different — gRPC uses OK, CANCELLED, DEADLINE_EXCEEDED, UNAVAILABLE, RESOURCE_EXHAUSTED, etc., not 200/500
  • Long-lived connections — Unlike most HTTP, gRPC connections are pooled and reused; failures can be sticky
  • Four call patterns — Unary, server streaming, client streaming, bidirectional streaming each fail differently
  • Deadlines, not timeouts — gRPC propagates deadlines across hops; a misconfigured deadline cascades through your whole call graph
  • Binary protocol — You can't just curl an endpoint; checks need a real gRPC client
  • Service mesh interaction — Envoy, Istio, Linkerd, and others terminate, retry, and shape gRPC traffic

A standard HTTP/200 check on a gRPC port will tell you if the TCP socket accepts connections. It won't tell you whether your service is actually serving RPCs.


What to Monitor

1) gRPC Health Checking Protocol

gRPC defines a standard health checking protocol — a service called grpc.health.v1.Health with Check and Watch methods. This is the canonical way to monitor gRPC service health.

  • Check — A single unary RPC that returns SERVING, NOT_SERVING, or UNKNOWN
  • Watch — A server-streaming RPC that pushes status changes
  • Per-service status — A single server can host multiple services; the health protocol supports per-service health

Almost every gRPC framework (gRPC-Go, gRPC-Java, grpc-python, .NET, etc.) provides a built-in implementation. Use it. Custom HTTP /health endpoints next to your gRPC server are a code smell — you end up monitoring HTTP while users actually use gRPC.

2) Per-Method Status Code Distribution

gRPC has 17 standardized status codes. Each tells you something different:

Code | Meaning | What it usually indicates
OK | Success | All good
CANCELLED | Client cancelled | Often deadline-related
UNKNOWN | Unclassified error | Often an unhandled server exception
INVALID_ARGUMENT | Bad request data | Input validation; often a client-side bug
DEADLINE_EXCEEDED | Took too long | Backend slow or deadline too tight
NOT_FOUND | Resource missing | Often expected, but spikes signal a problem
ALREADY_EXISTS | Conflict | Duplicate writes, race conditions
PERMISSION_DENIED | Auth issue | Often a client cred or role problem
RESOURCE_EXHAUSTED | Quota/limit hit | Rate limit, memory cap, payload too large
FAILED_PRECONDITION | State invalid | Client retry won't help
ABORTED | Concurrency conflict | Retry with backoff usually works
OUT_OF_RANGE | Value out of valid range | Pagination or seek past the end of data
UNIMPLEMENTED | Method missing | Client/server version mismatch
INTERNAL | Server bug | The closest to a 500
UNAVAILABLE | Service down | The closest to a 503; retryable
DATA_LOSS | Unrecoverable | Serious; investigate
UNAUTHENTICATED | Missing/bad creds | Auth path problem

Track the distribution per method, not in aggregate. A surge in RESOURCE_EXHAUSTED on one method is a leading indicator for a backend running out of capacity — long before HTTP-style monitoring would notice.
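As a sketch of what per-method tracking means in practice, here is a plain-Python counter with a hypothetical alert threshold (the method names and the 5% threshold are illustrative, not from the original):

```python
from collections import Counter, defaultdict

class MethodStatusTracker:
    """Track gRPC status code counts per fully-qualified method."""

    def __init__(self, alert_threshold: float = 0.05):
        self.counts = defaultdict(Counter)      # method -> Counter of codes
        self.alert_threshold = alert_threshold  # max tolerated fraction per code

    def record(self, method: str, code: str) -> None:
        self.counts[method][code] += 1

    def error_rate(self, method: str, code: str) -> float:
        total = sum(self.counts[method].values())
        return self.counts[method][code] / total if total else 0.0

    def alerts(self, code: str = "RESOURCE_EXHAUSTED"):
        """Methods where `code` exceeds the alert threshold."""
        return [m for m in self.counts
                if self.error_rate(m, code) > self.alert_threshold]
```

In a real deployment you would feed `record()` from a server interceptor and export the counters to your metrics backend rather than keeping them in-process.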

3) Latency Distribution Per Method

gRPC services often have wildly different latency characteristics across methods. Aggregate latency hides the slow methods:

  • p50, p95, p99 per method — The methods that show up at p99 are your candidates for optimization
  • Compare against deadlines — If p95 latency is approaching your default deadline, you're about to start seeing DEADLINE_EXCEEDED
  • Track over time — A method that gets slower over weeks is your earliest warning of N+1 queries, growing indices, or memory leaks
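The "gap to your deadline" idea above can be made concrete with a small headroom calculation (a sketch using nearest-rank percentiles; sample data and the headroom convention are assumptions):

```python
def percentile(samples, p):
    """Nearest-rank percentile; samples are durations in seconds."""
    s = sorted(samples)
    idx = max(0, int(round(p / 100 * len(s))) - 1)
    return s[idx]

def deadline_headroom(samples, deadline_s, p=95):
    """Fraction of the deadline still unused at the given percentile.

    A value approaching 0 means DEADLINE_EXCEEDED errors are imminent.
    """
    return 1.0 - percentile(samples, p) / deadline_s
```

Alerting when headroom at p95 drops below, say, 0.2 warns you before the first DEADLINE_EXCEEDED ever appears in the status-code distribution.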

4) Connection-Level Health

gRPC connections are pooled and long-lived. Connection-level metrics matter:

  • Active connection count per server
  • Connection establishment rate — A spike here means clients are reconnecting frequently
  • GOAWAY frame rate — Servers send GOAWAY before shutdown; bursts can indicate rolling deploys or pod churn
  • Keepalive ping success — gRPC keepalives detect dead connections; failures here precede cascade failures
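Client-side keepalive is configured through standard gRPC channel arguments; the option keys below are real, the values are assumptions to adapt to your network (and note that servers reject overly aggressive pings with a GOAWAY unless configured to permit them):

```python
# Illustrative keepalive tuning so dead connections are detected quickly.
KEEPALIVE_OPTIONS = [
    ("grpc.keepalive_time_ms", 30_000),          # ping after 30s of inactivity
    ("grpc.keepalive_timeout_ms", 5_000),        # declare the connection dead
                                                 # if no ping ack within 5s
    ("grpc.keepalive_permit_without_calls", 1),  # ping even with no active RPCs
]

# channel = grpc.insecure_channel("svc.internal:50051", options=KEEPALIVE_OPTIONS)
```

Exporting a counter of keepalive failures from clients gives you the "keepalive ping success" signal described above.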

5) Deadline Propagation

gRPC deadlines propagate across calls — a client setting a 5-second deadline on Service A means Service A has at most 5 seconds to call Service B and return. If you don't monitor this, you'll spend hours chasing red herrings:

  • Track deadline budget consumed at each hop
  • Alert on services that consume disproportionate deadline budget — Often the slowest, least-loved code path
  • Watch for cascading DEADLINE_EXCEEDED — A single slow downstream pulls your whole call graph into timeout territory
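In real gRPC the deadline propagates automatically through the call context; the arithmetic behind "budget consumed at each hop" can still be made explicit. A sketch (the 200ms local reserve is an assumption):

```python
import time

def remaining_budget(deadline_unix: float) -> float:
    """Seconds left until the propagated deadline expires."""
    return deadline_unix - time.time()

def downstream_timeout(deadline_unix: float, reserve_s: float = 0.2) -> float:
    """Timeout to pass to the next hop, reserving time for local work
    after the downstream call returns. Raises if the budget is spent."""
    budget = remaining_budget(deadline_unix) - reserve_s
    if budget <= 0:
        raise TimeoutError("deadline budget exhausted before downstream call")
    return budget
```

Logging `remaining_budget()` at entry and exit of each service is the per-hop budget tracking described above: the hop where the number drops the most is your slow path.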

6) Streaming RPC Health

Streaming RPCs (server, client, and bidi) need different monitoring than unary:

  • Stream open count — Active long-lived streams
  • Messages per stream — A stream with zero messages but an open connection is broken
  • Stream lifetime — Both anomalously short and anomalously long durations are suspicious
  • Cancellation rate — Spikes mean clients are giving up
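A "stream open but silent" detector is straightforward to sketch in plain Python (the 60-second idle limit and stream IDs are illustrative; in practice the events would come from stream interceptors):

```python
import time

class StreamWatch:
    """Flag streams that are open but have gone quiet."""

    def __init__(self, idle_limit_s: float = 60.0):
        self.idle_limit_s = idle_limit_s
        self.last_message = {}  # stream id -> timestamp of last message

    def on_open(self, stream_id: str, now: float = None) -> None:
        self.last_message[stream_id] = time.time() if now is None else now

    def on_message(self, stream_id: str, now: float = None) -> None:
        self.last_message[stream_id] = time.time() if now is None else now

    def on_close(self, stream_id: str) -> None:
        self.last_message.pop(stream_id, None)

    def stalled(self, now: float = None):
        now = time.time() if now is None else now
        return [sid for sid, ts in self.last_message.items()
                if now - ts > self.idle_limit_s]
```

Whether a stalled stream is a bug depends on the method: a market-data stream should never be silent for a minute, while a rare-event stream might legitimately be. Set the idle limit per method.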

7) TLS / mTLS

gRPC almost always runs over TLS, often with mutual TLS in service meshes:

  • Certificate expiry for both server and client certificates
  • Cert rotation success — Service mesh CAs (SPIRE, cert-manager) rotate certs; rotation failures cause UNAVAILABLE storms
  • TLS handshake failure rate — A spike here is often the first signal of a cert or trust-store problem
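Expiry checking reduces to date arithmetic once you have the certificate's notAfter field — for example, the string Python's ssl.getpeercert() returns uses a fixed format. A sketch (the 21-day warning window is an assumption):

```python
from datetime import datetime, timezone

def days_until_expiry(not_after: str, now: datetime = None) -> float:
    """not_after is a notAfter string as returned by ssl.getpeercert(),
    e.g. 'Jun  1 12:00:00 2027 GMT'."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).total_seconds() / 86400

def expiry_alert(not_after: str, warn_days: float = 21, now=None) -> bool:
    return days_until_expiry(not_after, now) < warn_days
```

For mesh-issued short-lived certs, alert on a fraction of the cert lifetime rather than a fixed day count, since the whole lifetime may be under 24 hours.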

8) Service Mesh Behavior

If you run gRPC behind Envoy/Istio/Linkerd:

  • Retry budget consumption — Service mesh retries can mask real failures; track them separately
  • Circuit breaker trips — Counts and durations
  • Load balancer steady-state — gRPC's HTTP/2 means a single client connection sticks to one pod unless the LB does subset balancing

Common gRPC Failure Modes

Failure | User impact | How to detect
Server up but health check returns NOT_SERVING | Clients route away from a healthy-looking pod | Health protocol monitoring
Single method returning RESOURCE_EXHAUSTED | Specific feature broken for some clients | Per-method status code tracking
Streaming RPC stops emitting messages | "Live" features silently freeze | Per-stream message rate alerts
Sticky connection to dead pod | Some clients see UNAVAILABLE for minutes | Connection establishment / GOAWAY metrics
Cert rotation failed mid-day | mTLS handshake failures cascade | Cert expiry + handshake failure rate
Deadline propagation misconfigured | Cascading DEADLINE_EXCEEDED across services | Per-hop deadline budget tracking
Service mesh retry storm | Apparent recovery hides root cause | Retry budget tracking
Protobuf schema mismatch after deploy | UNIMPLEMENTED on a previously working method | Per-method success rate
Large message exceeds maxInboundMessageSize | RESOURCE_EXHAUSTED for specific clients | Per-method, per-client error rates
Compression negotiation broken | Bandwidth blow-up, latency increase | Bytes-on-wire per method

Setting Up gRPC Monitoring

Quick start

  1. Health protocol checks — Run grpc.health.v1.Health/Check against each service from a synthetic monitor
  2. TLS / cert expiry monitoring on your gRPC endpoints
  3. Per-service success rate dashboard from server-side metrics
  4. Latency p95 alerts per method

Comprehensive setup

Add:

  1. Per-method status code distribution with alerts on non-OK, non-expected codes
  2. Per-method latency p95 and p99 with alerts on regressions
  3. Streaming-specific metrics (open streams, messages-per-stream, stream lifetime)
  4. Connection-level metrics (open connections, GOAWAY rate, keepalive failures)
  5. Deadline budget tracking at each hop
  6. Service mesh metrics (retry rate, circuit breaker state, traffic split)
  7. Synthetic checks that exercise both unary and streaming patterns
  8. Multi-region health checks for any externally-exposed gRPC service
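If your servers export the de-facto standard go-grpc-prometheus metrics, item 1 above can be sketched as a Prometheus alert rule. The metric and label names (grpc_server_handled_total, grpc_code, grpc_service, grpc_method) assume that exporter; the 5% threshold is illustrative:

```yaml
groups:
  - name: grpc
    rules:
      - alert: GrpcMethodErrorRateHigh
        # Fraction of handled RPCs per method finishing with a non-OK code.
        expr: |
          sum by (grpc_service, grpc_method) (
            rate(grpc_server_handled_total{grpc_code!="OK"}[5m])
          )
          /
          sum by (grpc_service, grpc_method) (
            rate(grpc_server_handled_total[5m])
          ) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Non-OK rate above 5% on {{ $labels.grpc_service }}/{{ $labels.grpc_method }}"
```

Exclude codes that are expected for a given method (for example NOT_FOUND on lookups) in the numerator rather than raising the global threshold.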

How gRPC Monitoring Connects to the Rest of Your Stack

gRPC monitoring isn't standalone. It needs to integrate with the broader observability story:

  • APM/tracing — OpenTelemetry has first-class gRPC support; spans should include the gRPC method, status code, and deadline budget
  • Metrics — Prometheus exposition for gRPC servers is standard; scrape per-method counters
  • Logs — Structured logs with the gRPC method, status code, and request ID for cross-service correlation
  • External monitoring — For externally-exposed gRPC APIs, run synthetic clients that look like real clients (see Microservices Monitoring: Health Checks Guide)

If you're running gRPC inside Kubernetes, the patterns in Kubernetes Monitoring: Health Checks and Pod Uptime and the load-shedding considerations in Microservices Monitoring apply directly. For comparison with REST, see REST API Monitoring: Endpoints, Errors, and Performance.


What to Do When gRPC Monitoring Fires

Health protocol returns NOT_SERVING:

  1. Check pod logs for startup or dependency failures
  2. Verify downstream dependencies are healthy
  3. Check whether the service is in graceful shutdown
  4. Look for recent deploys that might have broken initialization

Surge in UNAVAILABLE:

  1. Check pod restart rate / rolling deploy in progress
  2. Verify load balancer subset balancing is working
  3. Look at GOAWAY frame rate
  4. Check if a single client is hammering one pod (subset balancing problem)

Surge in DEADLINE_EXCEEDED:

  1. Check downstream service latency
  2. Look for slow database queries, lock contention
  3. Verify deadlines are sane (not too tight after a recent change)
  4. Check whether a single hot key is creating a cascading slowdown

Surge in RESOURCE_EXHAUSTED:

  1. Check for rate-limit hits — server-side or downstream
  2. Verify max message size limits for the method
  3. Look at memory pressure on the server
  4. Check for any quota changes recently applied

Streaming methods quiet:

  1. Verify keepalive pings are firing
  2. Check for client-side bugs in stream consumption
  3. Look at backend pub/sub or event source health
  4. Verify the stream isn't being held open without server-side activity

How Webalert Helps

Webalert provides external monitoring for the parts of your gRPC stack that are exposed to clients, plus the supporting infrastructure:

  • TLS / SSL monitoring — catch certificate issues on gRPC endpoints before clients see handshake failures
  • DNS monitoring — detect resolution failures that prevent clients from reaching your services
  • HTTP and TCP checks — monitor the load balancer fronting your gRPC services
  • Multi-region checks — confirm externally-exposed gRPC services are reachable globally
  • Webhook + heartbeat monitoring — pair with your Prometheus alerts for in-cluster gRPC metrics
  • Multi-channel alerts — Email, SMS, Slack, Discord, Microsoft Teams, webhooks
  • Status pages — communicate service issues to API consumers transparently
  • 5-minute setup — start monitoring the public face of your gRPC stack today

See features and pricing for details.


Summary

  • gRPC fails differently than REST. Use the standard gRPC health checking protocol, not custom HTTP endpoints next to your gRPC server.
  • Monitor per-method status code distribution, not just aggregate success rate. The 17 standardized codes each tell you something specific.
  • Track latency at p95 and p99 per method, and watch the gap to your deadlines.
  • Streaming RPCs need their own metrics: open stream count, messages per stream, stream lifetime, cancellation rate.
  • Watch connection-level signals (active connections, GOAWAY frames, keepalive success) to catch sticky failures.
  • Deadline propagation makes downstream issues cascade; track deadline budget at each hop.
  • TLS / mTLS rotation is a leading cause of gRPC outages; monitor cert expiry and handshake failure rates.

gRPC gives you a structured, performant protocol. Monitoring proves it's actually delivering for clients.


Catch gRPC failures the protocol-aware way

Start monitoring with Webalert →

See features and pricing. No credit card required.
