gRPC Status Code Error Debug

What this usually means

gRPC status codes are not just HTTP status codes in disguise. They encode specific protocol-level and application-level failures. The errors often stem from mismatched expectations between client and server: protobuf schema versions, service definitions, or transport configurations. Non-obvious culprits include proxies that don't support HTTP/2 trailers, load balancers that kill idle connections, or deadline propagation issues where a downstream service's timeout bubbles up as a different code. The error code itself is just the first clue—the real cause is usually deeper in the RPC lifecycle.

( 01 ) Fast diagnosis

The first ten minutes: establish facts before touching code.

1. Enable full gRPC debug logging. For C-core based libraries (C++, Python, Ruby): `GRPC_VERBOSITY=debug GRPC_TRACE=all ./your_binary 2>&1 | tee grpc_trace.log`. For gRPC-Go, set `GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=99` instead.
2. Capture a packet trace with tcpdump on all interfaces: `tcpdump -i any -s 0 -w grpc.pcap 'port 8080'` (payloads are only readable if the traffic is plaintext)
3. Use grpcurl to test the endpoint from the same client environment: `grpcurl -plaintext <server>:8080 list`
4. Check the server's gRPC reflection service to verify registered methods: `grpcurl -plaintext <server>:8080 describe <package.Service>`
5. Inspect proxy logs (Envoy, Nginx) for RST_STREAM or HTTP/2 GOAWAY frames

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

- Server-side gRPC access logs (often structured JSON with status code, method, duration)
- Client-side gRPC interceptor logs (request/response headers, trailers)
- Envoy admin endpoints /clusters and /config_dump for cluster and route configuration
- Kubernetes pod logs from the sidecar proxy (istio-proxy, linkerd-proxy) for upstream reset events
- Wireshark with a display filter such as `grpc`, or `http2.streamid == <id>` to follow one stream
- Protobuf descriptors exposed via reflection: `grpcurl -plaintext <server>:8080 describe <package.Service>`

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

- Protobuf schema mismatch: client and server were built from different .proto field numbers or service definitions
- Proxy or load balancer stripping gRPC trailers (e.g., legacy HTTP/1.1 proxies, or L7 load balancers that do not speak HTTP/2 to the backend)
- Deadline propagation: client deadline set too low, or an intermediate service overwrites it
- gRPC connection reuse with closed streams: client keeps using a channel that received GOAWAY
- Server sends trailers before client expects them (misordered headers due to proxy buffering)
- Missing server-side reflection: reflection-based tools such as grpcurl cannot discover methods and report UNIMPLEMENTED, even though clients with compiled stubs work

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

- Regenerate protobuf stubs on both client and server from the same .proto source
- Add a custom gRPC interceptor on the client to log full error details (grpc-status, grpc-message)
- Enable keepalive pings to detect dead connections, e.g. `grpc.keepalive_time_ms=10000` on the client, and make sure the server's keepalive enforcement policy permits that interval
- Set a sane default deadline on the client and propagate it by passing the incoming call context (or its remaining time) to outgoing calls
- Use a proxy that supports HTTP/2 trailers end to end (Envoy, HAProxy 2.0+, Nginx 1.13.10+ with `grpc_pass`)
- Verify the server's gRPC version supports the features used (e.g., client-side streaming, retry)

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

- Run the same call with grpcurl over the same network path; if it succeeds, the problem is most likely in the client code or its channel configuration
- Monitor the gRPC success rate metric (e.g., Prometheus `grpc_server_handled_total{grpc_code="OK"}`) after the fix
- Check the server's health check endpoint before and after: `/healthz` or `grpc.health.v1.Health/Check`
- Add a chaos monkey that kills connections and verify clients recover with exponential backoff
- Compare client and server protobuf descriptors: generate descriptor sets on both sides with `protoc --descriptor_set_out` and diff their decoded form
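
When checking that clients "recover with exponential backoff", it helps to know what a reasonable reconnect schedule looks like. The following is a minimal sketch, not any library's implementation. The default values (1 s initial delay, 1.6 multiplier, 120 s cap, 20% jitter) follow the gRPC connection backoff document as I understand it; treat them, and the function name `backoff_delays`, as assumptions.

```python
import random


def backoff_delays(attempts, initial=1.0, multiplier=1.6, max_delay=120.0,
                   jitter=0.2, rng=random.random):
    """Return the wait (in seconds) before each of `attempts` reconnect tries.

    Each base delay grows by `multiplier` up to `max_delay`; the returned
    value is spread uniformly within +/- `jitter` of that base so that many
    clients do not reconnect in lockstep.
    """
    delays = []
    base = initial
    for _ in range(attempts):
        spread = base * jitter
        # rng() is in [0, 1), so the offset is in [-spread, +spread).
        delays.append(base + (2.0 * rng() - 1.0) * spread)
        base = min(base * multiplier, max_delay)
    return delays
```

Comparing timestamps of reconnect attempts in the client trace against such a schedule shows quickly whether a client is backing off or hammering the server in a tight loop.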

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

- Assuming UNAVAILABLE always means the network is down; it could be a connection refused due to a port mismatch
- Disabling gRPC logging entirely in production; instead, use sampled logging or structured logs
- Setting an infinite deadline; always set a timeout, but make it realistic (e.g., 10s for unary, 30s for streaming)
- Relying solely on HTTP-level health checks; use the gRPC health check for accurate readiness
- Copying proto files manually without version control; always use a shared repo or package manager

( 07 ) War story

The Case of the Missing Trailers: gRPC UNAVAILABLE in Production

Senior Backend Engineer
Go 1.20, gRPC-Go 1.50, Envoy 1.24, Kubernetes on AWS

Timeline

10:15 Pager alert: 5xx errors spike on /api/v1/orders endpoint, gRPC status UNAVAILABLE
10:18 Checked client logs: 'rpc error: code = Unavailable desc = all SubConns are in TransientFailure'
10:22 Verified server pod is healthy via kubectl exec; grpcurl list works from within the pod
10:30 Examined Envoy sidecar logs: 'stream error: stream ID 3; INTERNAL_ERROR'
10:38 Dumped gRPC trace from client: 'Received RST_STREAM with error code 0' after sending request
10:45 Compared response headers: Envoy returned 200 OK but without grpc-status trailer
10:50 Found Envoy route config had 'upgrade_configs' for HTTP/2 but not 'grpc' prefix
10:55 Applied fix: added 'grpc' protocol to Envoy filter, rolled out the change
11:00 Errors dropped to zero, confirmed with monitoring dashboard

The alert came at 10:15 AM. Our orders service, which communicates via gRPC between microservices, started returning UNAVAILABLE errors for about 20% of requests. The client logs said 'all SubConns are in TransientFailure'—a classic gRPC connection error. My first instinct was to check if the server was actually down. But grpcurl from within the same pod worked perfectly, listing all services and even making a test call.

That ruled out a server crash. I moved to the network layer. We run Envoy sidecars in each pod for mTLS and traffic management. Envoy's logs showed 'INTERNAL_ERROR' on streams, but no clear reason. I enabled gRPC debug logging on the client with GRPC_TRACE=all. That's when I saw the client received an RST_STREAM frame with error code 0 (NO_ERROR) right after the server sent its initial response headers. The server thought it was done, but the client expected more data.

The root cause was that Envoy was stripping the gRPC status trailer. Our Envoy route configuration had an 'upgrade_configs' section for HTTP/2 upgrade, but it didn't include the 'grpc' content-type. So Envoy treated the gRPC response as a plain HTTP/2 response and terminated the stream after headers, discarding the trailer that contains the actual gRPC status code. The client then saw an incomplete response and reported UNAVAILABLE. The fix was to add a proper gRPC filter configuration in Envoy that understands trailers. After rolling out the config change, the errors vanished.

Root cause

Envoy sidecar misconfiguration: missing gRPC protocol filter caused trailers to be stripped, leading to client-side UNAVAILABLE errors.

The fix

Updated Envoy filter chain to include 'envoy.filters.http.grpc_http1_reverse_bridge' and properly handle gRPC trailers.

The lesson

gRPC relies heavily on trailers. Any proxy that manipulates HTTP/2 streams must be explicitly configured to preserve trailers. Always test with a proxy in the loop early in development.

( 08 ) gRPC Status Codes vs HTTP Status Codes

gRPC uses its own set of status codes (0–16) that are normally transmitted as HTTP/2 trailers, not as the HTTP status line. This is a common point of confusion. For example, a gRPC UNAVAILABLE (14) might be delivered over an HTTP 200 OK response. The actual error is in the 'grpc-status' trailer (or, when the server sends no response body at all, in a single trailers-only header block). Tools like curl or browser dev tools will not show these trailers by default; you need to use a gRPC-aware client or capture raw HTTP/2 frames.

When a proxy or load balancer doesn't understand trailers, it may buffer the response and send the trailers as part of the body, or simply drop them. This results in the client seeing a successful HTTP response but missing the gRPC status, causing it to default to UNKNOWN or INTERNAL. To debug, capture the raw HTTP/2 frames with tcpdump and inspect using Wireshark with 'grpc' filter.
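
The fallback behaviour can be made concrete with a small sketch. The code below is an illustration, not a library's actual logic: it lists the 17 status codes and shows how a client might derive a status when the `grpc-status` trailer is present, malformed, or missing. The HTTP fallback table follows the gRPC HTTP-to-status mapping document as I understand it, and the function name `effective_status` is hypothetical. Some libraries report INTERNAL rather than UNKNOWN for a 200 response without trailers.

```python
from enum import IntEnum


class StatusCode(IntEnum):
    OK = 0
    CANCELLED = 1
    UNKNOWN = 2
    INVALID_ARGUMENT = 3
    DEADLINE_EXCEEDED = 4
    NOT_FOUND = 5
    ALREADY_EXISTS = 6
    PERMISSION_DENIED = 7
    RESOURCE_EXHAUSTED = 8
    FAILED_PRECONDITION = 9
    ABORTED = 10
    OUT_OF_RANGE = 11
    UNIMPLEMENTED = 12
    INTERNAL = 13
    UNAVAILABLE = 14
    DATA_LOSS = 15
    UNAUTHENTICATED = 16


# Fallback used only when no grpc-status trailer arrived (assumed mapping).
_HTTP_FALLBACK = {
    400: StatusCode.INTERNAL,
    401: StatusCode.UNAUTHENTICATED,
    403: StatusCode.PERMISSION_DENIED,
    404: StatusCode.UNIMPLEMENTED,
    429: StatusCode.UNAVAILABLE,
    502: StatusCode.UNAVAILABLE,
    503: StatusCode.UNAVAILABLE,
    504: StatusCode.UNAVAILABLE,
}


def effective_status(http_status, trailers):
    """Derive a gRPC status from the HTTP status and a dict of trailers.

    `trailers` maps lower-cased trailer names to string values.
    """
    raw = trailers.get("grpc-status")
    if raw is None:
        # Trailer stripped or never sent: fall back to the HTTP status.
        return _HTTP_FALLBACK.get(http_status, StatusCode.UNKNOWN)
    try:
        return StatusCode(int(raw))
    except ValueError:
        # Non-numeric or out-of-range value.
        return StatusCode.UNKNOWN
```

For instance, `effective_status(200, {})` yields `StatusCode.UNKNOWN`, which is the symptom described above: a successful HTTP response with no usable gRPC status.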

( 09 ) Deadline Propagation and Timeout Bubbling

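gRPC deadlines are propagated from client to server via the 'grpc-timeout' request header, which carries the remaining time rather than an absolute timestamp. If an intermediate service (e.g., in a chain of microservices) receives a request with a deadline, it must carry that deadline into its own downstream calls. If it does not, the downstream call runs with whatever timeout the intermediate service applies by default: a shorter one produces DEADLINE_EXCEEDED errors that appear to come from upstream, and a longer or missing one lets work continue after the original caller has already given up.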
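
When reading `grpc-timeout` values out of logs or packet captures, the unit suffix is easy to misread: `m` means milliseconds and `M` means minutes. The sketch below decodes the header as I understand the gRPC-over-HTTP/2 format (up to 8 ASCII digits followed by one unit letter). The function name `parse_grpc_timeout` is hypothetical.

```python
# Nanoseconds per unit letter; note the case-sensitive 'M' vs 'm'.
_UNIT_NANOS = {
    "H": 3_600_000_000_000,  # hours
    "M": 60_000_000_000,     # minutes
    "S": 1_000_000_000,      # seconds
    "m": 1_000_000,          # milliseconds
    "u": 1_000,              # microseconds
    "n": 1,                  # nanoseconds
}


def parse_grpc_timeout(value):
    """Convert a grpc-timeout header value such as '500m' to seconds."""
    digits, unit = value[:-1], value[-1:]
    if not (1 <= len(digits) <= 8 and digits.isascii() and digits.isdigit()):
        raise ValueError(f"bad timeout digits in {value!r}")
    if unit not in _UNIT_NANOS:
        raise ValueError(f"bad timeout unit in {value!r}")
    return int(digits) * _UNIT_NANOS[unit] / 1e9
```

A server log showing `grpc-timeout: 1m` therefore means the caller allowed one millisecond, not one minute, which usually points at a client-side configuration mistake.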

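The fix is to always propagate the incoming call's deadline to outgoing calls. In Go, pass the handler's `ctx` (or a context derived from it) to the downstream stub call; gRPC-Go then sets 'grpc-timeout' from the context's deadline. `metadata.NewOutgoingContext(ctx, md)` is only needed to forward metadata; the deadline travels with `ctx` itself. In Python the deadline is not forwarded automatically: read `context.time_remaining()` in the servicer and pass it as `timeout=` on the downstream call. Monitor the actual deadline on each hop by logging the remaining time, for example `time.Until(deadline)` with the value returned by `ctx.Deadline()`.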
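
The bookkeeping behind deadline propagation can be sketched without any gRPC dependency. The class below is a hypothetical helper (`DeadlineBudget` is not part of any gRPC library): it fixes the deadline once when the request arrives and hands out the remaining time for each downstream call, optionally keeping a small reserve for local work.

```python
import time


class DeadlineBudget:
    """Tracks how much of an incoming request's deadline is left."""

    def __init__(self, timeout_s, now=time.monotonic):
        # `now` is injectable so the class can be tested with a fake clock.
        self._now = now
        self._deadline = now() + timeout_s

    def remaining(self):
        """Seconds left before the deadline, never negative."""
        return max(0.0, self._deadline - self._now())

    def timeout_for_downstream(self, reserve_s=0.0):
        """Timeout to pass to the next hop, minus a reserve for local work."""
        left = self.remaining() - reserve_s
        if left <= 0:
            raise TimeoutError("deadline exhausted before downstream call")
        return left
```

In a Python servicer, the initial value would come from `context.time_remaining()`, and the result of `timeout_for_downstream()` would be passed as `timeout=` on the downstream stub call. Logging `remaining()` at each hop makes it obvious where the budget is being spent.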

( 10 ) Connection Reuse and GOAWAY Frames

gRPC clients maintain a pool of HTTP/2 connections. When a server shuts down or restarts, it sends a GOAWAY frame to gracefully drain connections. The client should then stop sending new requests on that connection and create new ones. However, if the client's connection management is buggy (e.g., outdated gRPC library version), it may continue to reuse a connection that has been closed, resulting in UNAVAILABLE or INTERNAL errors.

To diagnose, look for 'GOAWAY' in gRPC traces: `grep GOAWAY grpc_trace.log`. Also check the client's channel connectivity state: in gRPC-Go, `conn.GetState()` on the `*grpc.ClientConn` returns one of IDLE, CONNECTING, READY, TRANSIENT_FAILURE, or SHUTDOWN. If it stays in TRANSIENT_FAILURE, the client is not reconnecting properly. The fix often involves upgrading the gRPC library or configuring keepalive pings to detect dead connections faster.
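
A plain `grep` tells you that GOAWAY frames occurred, but not why. The sketch below groups GOAWAY lines from a trace log by a recognizable reason string. The reason strings (`too_many_pings`, `max_age`, `max_idle`) are assumptions about what implementations commonly attach as debug data; exact log wording differs between gRPC libraries and versions, so adjust the list to what your traces actually contain.

```python
from collections import Counter

# Assumed debug strings; extend or replace to match your own trace output.
KNOWN_REASONS = ("too_many_pings", "max_age", "max_idle")


def summarize_goaway(lines):
    """Count GOAWAY mentions in trace lines, grouped by recognizable reason."""
    counts = Counter()
    for line in lines:
        if "GOAWAY" not in line.upper():
            continue
        # First known reason found in the line, else a catch-all bucket.
        reason = next((r for r in KNOWN_REASONS if r in line), "unclassified")
        counts[reason] += 1
    return counts
```

Passing an open file works because the function only iterates over lines, for example `summarize_goaway(open("grpc_trace.log"))`. A high `too_many_pings` count suggests the client's keepalive interval is more aggressive than the server allows, while `max_age` usually indicates routine connection recycling.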

( 11 ) Protobuf Schema Version Mismatch

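gRPC uses protobuf serialization. If the client and server were built from different .proto definitions, the failure depends on what differs. A different package, service, or method name changes the request path, so the server answers UNIMPLEMENTED (12). Different field numbers or types often do not fail the call at all: the receiver skips fields it does not recognise and leaves its own fields at their default values, or in some cases fails to parse the message and returns INTERNAL (13). This is especially tricky when using gRPC reflection: the server exposes its current service definition, but the client might use a cached or older version.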
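
Why a field-number mismatch tends to fail silently becomes clearer from the wire format: every field is prefixed with a tag that encodes `(field_number << 3) | wire_type`, and a reader skips tags it does not know. The following is a deliberately tiny sketch that handles only length-delimited string fields, not a real protobuf parser. All function names are hypothetical.

```python
def encode_varint(n):
    """Encode a non-negative int as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode a varint starting at buf[pos]; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def encode_string_field(field_number, text):
    """Serialize one string field (wire type 2 = length-delimited)."""
    payload = text.encode("utf-8")
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(payload)) + payload


def read_string_field(buf, wanted_field):
    """Return the string under wanted_field, or '' (the default) if absent."""
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type != 2:
            raise ValueError("this sketch only handles length-delimited fields")
        length, pos = decode_varint(buf, pos)
        value = buf[pos:pos + length]
        pos += length
        if field_number == wanted_field:
            return value.decode("utf-8")
        # Unknown field number: skipped without error, as real parsers do.
    return ""
```

If the client writes an order ID as field 1 and the server's schema expects it as field 2, `read_string_field(encode_string_field(1, "order-42"), 2)` returns an empty string. The call succeeds with status OK and the data is simply missing, which is why such mismatches rarely show up as a status code.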

To verify, use grpcurl to list the server's services and methods, then compare with the client's proto file. `grpcurl -plaintext <server>:8080 describe <package.Service>` prints the service definition as the server sees it, and the `-protoset-out <file>` flag saves the fetched descriptors for offline comparison. Under the hood, grpcurl calls the streaming method `grpc.reflection.v1alpha.ServerReflection/ServerReflectionInfo` with a `file_containing_symbol` request; the raw response carries serialized file descriptors rather than readable .proto text. Diff the described output against your local proto to catch mismatches.
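
The comparison itself is easy to script once both method lists are available as text. The sketch below assumes one fully-qualified method name per line, which is how `grpcurl ... list <package.Service>` prints them as far as I know; the client-side list would have to be produced from your own stubs or proto file. The function names and the sample method names in the usage note are hypothetical.

```python
def parse_listing(text):
    """Turn a newline-separated listing into a set of non-empty names."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def diff_methods(server_listing, client_listing):
    """Compare method names known to the server and to the client."""
    server = parse_listing(server_listing)
    client = parse_listing(client_listing)
    return {
        # Calling any of these would produce UNIMPLEMENTED.
        "missing_on_server": sorted(client - server),
        # Harmless, but a hint that the client stubs are older.
        "unused_by_client": sorted(server - client),
    }
```

For example, comparing a server listing of `orders.v1.Orders.Get` and `orders.v1.Orders.List` against a client that also expects `orders.v1.Orders.Cancel` reports `Cancel` under `missing_on_server`.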

( 12 ) Proxy Interference: The Envoy Case

Envoy is a common sidecar proxy for gRPC. It forwards HTTP/2 trailers as long as both the downstream listener and the upstream cluster speak HTTP/2. A frequent failure is an upstream cluster that is not configured for HTTP/2: Envoy then talks HTTP/1.1 to the gRPC server, and the call fails or loses its gRPC status on the way back.

The fix is to enable HTTP/2 on the upstream cluster (`http2_protocol_options`, or in newer releases `typed_extension_protocol_options` with `envoy.extensions.upstreams.http.v3.HttpProtocolOptions`). The bridge filters solve different problems: `envoy.filters.http.grpc_http1_bridge` lets HTTP/1.1 clients that cannot read trailers call a gRPC server, and `envoy.filters.http.grpc_http1_reverse_bridge` lets gRPC clients call an upstream that only speaks HTTP/1.1. Also raise or disable `stream_idle_timeout` (5 minutes by default) so that Envoy does not reset long-lived streams that are legitimately idle. For production, always test with a proxy in the loop.

Frequently asked questions

How do I get detailed error messages from gRPC status codes?

Use the gRPC error details feature. In Go, the server attaches structured details with `WithDetails()` on a `*status.Status`; on the wire they travel in the `grpc-status-details-bin` trailer, which gRPC libraries with rich error support parse automatically. On the client, convert the error with `status.FromError(err)` (or `status.Convert(err)`) and call `Details()` on the result. This returns protobuf messages like `RetryInfo`, `DebugInfo`, or `BadRequest`.

Why do I see DEADLINE_EXCEEDED even when the server responds quickly?

This often happens when the client's deadline is set too low, or the deadline is not propagated correctly through intermediate services. Check the `grpc-timeout` header in the server logs. If the server sees a very short timeout (e.g., 1ms), the request is canceled before processing. Also, if the server makes downstream calls, it must propagate the remaining deadline. Use context.WithDeadline and log the deadline on each hop.

What does the UNIMPLEMENTED status code actually mean?

UNIMPLEMENTED (12) means the server does not recognize the method being called. This can happen if the method name is misspelled, the service is not registered on the server, or the client and server were generated from different service definitions. Check the server's registered services via reflection: `grpcurl -plaintext server:8080 list`. Also verify that the package, service, and method names in the client's proto file match the server's exactly, since together they form the request path.

How do I debug gRPC errors in a Kubernetes environment?

First, get into the pod: `kubectl exec -it <pod> -c <app-container> -- /bin/sh`. Then use grpcurl to test the service from inside the pod. Check Envoy/Istio proxy logs: `kubectl logs <pod> -c istio-proxy | grep '503'`. Also, enable gRPC tracing by setting `GRPC_GO_LOG_SEVERITY_LEVEL=info` together with `GRPC_GO_LOG_VERBOSITY_LEVEL=99` (Go), or `GRPC_VERBOSITY=debug` together with `GRPC_TRACE=all` (C++ and other C-core based libraries). For network issues, run `tcpdump` in the pod or use a sidecar with a debug image.

Can a firewall or security group cause gRPC errors?

Yes. gRPC multiplexes calls over long-lived HTTP/2 connections. Some firewalls or NAT gateways silently drop idle connections after a timeout. If the client doesn't detect the closure, the next call fails with UNAVAILABLE. To mitigate, enable gRPC keepalive pings: set `grpc.keepalive_time_ms` to between 10000 and 30000 (10 to 30 seconds), and make sure the server's keepalive enforcement policy permits that interval, otherwise the server answers with GOAWAY (`too_many_pings`). Also ensure the firewall allows HTTP/2 (TLS ALPN negotiation for 'h2').
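
As a sketch of what such a keepalive configuration might look like for a Python (C-core based) client, the helper below builds the channel options list. The three option names are C-core channel arguments; the helper name `keepalive_channel_options` and its 10-second floor (taken from the range suggested above) are assumptions for illustration.

```python
def keepalive_channel_options(ping_interval_s=20, ping_timeout_s=5,
                              ping_when_idle=True):
    """Build channel options that enable client-side keepalive pings."""
    if ping_interval_s < 10:
        # Very frequent pings tend to be rejected by servers (too_many_pings).
        raise ValueError("ping interval below 10 s is likely to be rejected")
    return [
        # How often to send a keepalive ping.
        ("grpc.keepalive_time_ms", int(ping_interval_s * 1000)),
        # How long to wait for the ping ack before closing the connection.
        ("grpc.keepalive_timeout_ms", int(ping_timeout_s * 1000)),
        # Whether to ping even when no RPC is in flight.
        ("grpc.keepalive_permit_without_calls", 1 if ping_when_idle else 0),
    ]
```

With the `grpcio` package installed, the result would be passed as `grpc.insecure_channel(target, options=keepalive_channel_options())`. Pinging while idle is what keeps NAT and firewall state alive, but it is also the setting servers most often restrict, so the server side has to be configured to allow it.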
