LEARN · DEBUGGING GUIDE

GCP Pub/Sub Subscription Backlog Growing: Diagnosis and Fix

A growing Pub/Sub backlog usually means your subscriber can't keep up or messages are being nacked/reset. This guide cuts through the confusion with specific commands and metrics to pinpoint the bottleneck.

Intermediate · Cloud · 7 min read

What this usually means

A backlog grows when the rate of messages published exceeds the rate at which the subscriber acknowledges them, or when messages are repeatedly redelivered without being processed. The underlying cause is often a subscriber that is too slow (e.g., due to a bottleneck in processing logic), a misconfigured ack deadline causing unnecessary redeliveries, or a flow control setting that limits throughput. It can also be caused by a subscriber crashing or hanging, leaving messages unacked. Non-obvious causes include a single slow message blocking batch processing, or a subscriber that is overwhelmed by a sudden spike in publish rate.

( 01 ) Fast diagnosis

The first ten minutes — establish facts before touching code.

  1. Check the subscription's 'oldest_unacked_message_age' in Cloud Monitoring: a value that keeps climbing means messages are not being acked as fast as they arrive.
  2. Look at 'ack_message_count' and 'mod_ack_deadline_message_count': a high mod-ack rate suggests messages are being nacked or having their deadlines extended.
  3. Examine subscriber logs for errors like 'DEADLINE_EXCEEDED', 'RESOURCE_EXHAUSTED', or 'UNAVAILABLE'.
  4. Check the subscriber's CPU/memory usage and processing rate (e.g., messages processed per second).
  5. Verify the subscription's 'ack_deadline_seconds' setting: the default is 10s, which is too short for slow processing and causes redeliveries.
  6. Compare the publish rate (topic/send_message_operation_count) with the subscriber's acknowledge rate (subscription/ack_message_count). If publish > ack, the backlog grows.
( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • Cloud Monitoring > Metrics Explorer: subscription/num_undelivered_messages, subscription/oldest_unacked_message_age
  • Pub/Sub subscription details page in GCP Console (check ack deadline, message retention duration)
  • Subscriber application logs (e.g., Cloud Logging, formerly Stackdriver Logging) for processing errors or timeouts
  • Subscriber's compute resource metrics (CPU, memory, network), especially if running on GKE/Compute Engine
  • Pub/Sub push subscription endpoint logs (if using push, check HTTP response codes from your webhook)
  • Cloud Monitoring alerting policies that trigger on backlog metrics
( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

  • Subscriber processing logic is too slow (e.g., blocking I/O, heavy computation, external API calls)
  • Ack deadline too short: messages are redelivered before processing completes, causing duplicate work
  • Subscriber is crashing or restarting frequently, losing progress on unacked messages
  • Flow control settings in subscriber client (e.g., max outstanding messages) are too low, throttling throughput
  • A sudden spike in publish rate overwhelms the subscriber (e.g., batch job or DDoS)
  • Push subscriber endpoint returns non-2xx status, causing Pub/Sub to retry push deliveries
  • Message size is large (e.g., >1MB), causing network or deserialization bottlenecks
( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

  • Increase ack deadline to match or exceed your max processing time (e.g., from 10s to 60s or more)
  • Optimize subscriber processing: add caching, batch operations, use async I/O, or scale horizontally
  • Implement flow control in subscriber: use 'max_outstanding_messages' and 'max_outstanding_bytes' to match capacity
  • Enable exactly-once delivery if redelivered duplicates cause extra work (a subscription-level setting for pull subscriptions, independent of message ordering)
  • Scale subscribers: add more instances or increase concurrency by opening more StreamingPull streams per process (the setting name depends on the client library, e.g., NumGoroutines in Go or setParallelPullCount in Java)
  • If push subscriber, ensure endpoint can handle load and returns 200/204 quickly; consider switching to pull with flow control
( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

  • Monitor 'num_undelivered_messages' and 'oldest_unacked_message_age' to ensure they decrease over time
  • Check that the subscriber ack rate (ack_message_count) now matches or exceeds the publish rate
  • Verify subscriber logs show no more 'DEADLINE_EXCEEDED' errors and processing times are within ack deadline
  • Run a load test with expected peak publish rate and confirm backlog remains near zero
  • Check subscriber resource utilization: if CPU and memory stay comfortably below their limits while the backlog drains, the current capacity is sufficient
  • For push subscriptions, verify endpoint response times and the proportion of success responses
( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

  • Increasing ack deadline without checking actual processing time: this can hide the real problem
  • Blindly scaling subscribers without investigating the bottleneck (e.g., if it's a shared database, more subscribers may make it worse)
  • Setting flow control limits too high, causing memory pressure or connection limits
  • Ignoring message ordering: if ordering is required, you can't parallelize easily
  • Ignoring the message retention duration (7 days at most for a subscription): unacked messages older than it are discarded, so a backlog that grows too old loses data
  • Using synchronous pull with long poll timeout but no flow control: this can cause idle connections
( 07 ) War story

The Silent Backlog: A 10-Second Ack Deadline Nightmare

Site Reliability Engineer · GCP Pub/Sub (StreamingPull), Go subscriber on GKE, Cloud Monitoring

Timeline

  1. 14:00 Alert fires: 'oldest_unacked_message_age' > 300s for critical subscription
  2. 14:01 Engineer checks Cloud Monitoring: backlog_bytes at 2GB and growing, ack_rate = 50 msg/s, publish_rate = 200 msg/s
  3. 14:03 Subscriber CPU at 30%, memory 40%: not saturated. No errors in logs.
  4. 14:05 Checks ack_deadline: default 10s. Subscriber processes each message in ~5s on average, but some take 12s due to external API calls.
  5. 14:06 Realizes messages taking >10s are being redelivered, causing duplicate processing and effectively halving throughput.
  6. 14:08 Increases ack_deadline to 30s via gcloud command.
  7. 14:10 Backlog growth stops; ack_rate climbs to 150 msg/s within minutes.
  8. 14:15 Backlog begins to drain; oldest unacked message age drops below 60s.
  9. 14:20 System stable; backlog cleared. Root cause documented.

I was on call when a backlog alert woke me up. The dashboard showed 2GB of undelivered messages and growing. The subscriber service on GKE looked healthy – low CPU, no OOM kills. My first instinct was that we had a publish spike, and indeed the publish rate was 200 msg/s vs ack rate of 50 msg/s. But the subscriber wasn't saturated.

I checked the ack deadline: default 10 seconds. Our subscriber processes messages in about 5 seconds on average, but I knew some messages triggered an external API call that could take 12 seconds. That meant those slow messages were getting redelivered before they could be acked. The duplicates were eating up processing time, effectively halving our throughput.

I increased the ack deadline to 30 seconds using gcloud pubsub subscriptions update. Within minutes, the ack rate jumped to 150 msg/s and the backlog started draining. The fix was simple, but the lesson was clear: always measure your p99 processing time and set the ack deadline accordingly. We also added a metric for per-message processing time to catch this earlier.

Root cause

Ack deadline (10s) was shorter than the p99 processing time (12s) due to slow external API calls, causing message redeliveries and duplicate processing.

The fix

Increased ack deadline to 30s using gcloud pubsub subscriptions update SUBSCRIPTION_ID --ack-deadline=30.

The lesson

Always set ack deadline based on your p99 processing time, not average. Monitor per-message latency to catch slow outliers.
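The lesson can be turned into a small calculation. The sketch below is illustrative only: it takes a list of measured per-message processing times, computes a nearest-rank p99, and suggests an ack deadline clamped to the 10s to 600s range a subscription accepts. The function names and the 1.5x headroom factor are assumptions made for this example, not part of any Pub/Sub API.

```python
import math

MIN_ACK_DEADLINE_S = 10    # lower bound for a subscription's ack deadline
MAX_ACK_DEADLINE_S = 600   # upper bound for a subscription's ack deadline


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_ack_deadline(processing_seconds, headroom=1.5):
    """Suggest an ack deadline from measured processing times.

    headroom is a hypothetical safety factor on top of p99; tune it
    to how spiky your slow path is.
    """
    p99 = percentile(processing_seconds, 99)
    suggested = math.ceil(p99 * headroom)
    return min(MAX_ACK_DEADLINE_S, max(MIN_ACK_DEADLINE_S, suggested))


if __name__ == "__main__":
    # Shape of the incident: most messages take ~5s, a few take 12s.
    samples = [5.0] * 95 + [12.0] * 5
    print("average:", sum(samples) / len(samples))
    print("p99:", percentile(samples, 99))
    print("suggested ack deadline:", suggest_ack_deadline(samples))
```

With the incident's distribution the average stays close to 5s, which is why a 10s deadline looked safe, while the p99 sits at 12s and exposes the problem.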

( 08 ) Understanding Pub/Sub Flow Control and Backlog Dynamics

Pub/Sub delivers messages either by pull (the subscriber requests them) or by push (Pub/Sub sends them to an HTTP endpoint). In both cases a message must be acknowledged within the ack deadline; if it is not, it becomes available for redelivery. A backlog grows when the publish rate exceeds the effective ack rate, which is limited by subscriber throughput and by how much work is wasted on redeliveries after expired deadlines.

Flow control in the client library (e.g., max outstanding messages) can artificially cap throughput. If flow control limits are set too low, the subscriber will not pull enough messages, causing backlog. Conversely, if they are set too high, memory usage can balloon. Comparing subscription/streaming_pull_response_count with the backlog can help: a flat response rate while the backlog grows is consistent with the client holding back because of flow control.
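One way to pick flow control limits is Little's law: messages in flight are roughly throughput multiplied by processing time. The sketch below is a rough sizing aid under that assumption. The helper names and the 1.25 headroom factor are made up for this example. The `subscribe_with_flow_control` function uses the Python client's `FlowControl(max_messages, max_bytes)` type and needs the google-cloud-pubsub package; its import is deferred so the sizing helpers run without it.

```python
import math


def flow_control_limits(target_msgs_per_sec, avg_processing_s,
                        avg_msg_bytes, headroom=1.25):
    """Estimate (max_messages, max_bytes) for a target throughput.

    Little's law: in-flight ~= throughput x time in system.
    headroom is a hypothetical margin for bursts.
    """
    in_flight = target_msgs_per_sec * avg_processing_s
    max_messages = math.ceil(in_flight * headroom)
    max_bytes = max_messages * avg_msg_bytes
    return max_messages, max_bytes


def throughput_cap(max_messages, avg_processing_s):
    """Upper bound on msg/s that a given max_messages limit allows."""
    return max_messages / avg_processing_s


def subscribe_with_flow_control(project_id, subscription_id, callback,
                                max_messages, max_bytes):
    """Start a streaming pull with explicit flow control limits.

    Requires google-cloud-pubsub; project_id and subscription_id are
    placeholders supplied by the caller.
    """
    from google.cloud import pubsub_v1  # deferred: optional dependency

    subscriber = pubsub_v1.SubscriberClient()
    path = subscriber.subscription_path(project_id, subscription_id)
    flow_control = pubsub_v1.types.FlowControl(
        max_messages=max_messages, max_bytes=max_bytes
    )
    # Returns a StreamingPullFuture; the caller decides how long to block.
    return subscriber.subscribe(path, callback=callback,
                                flow_control=flow_control)


if __name__ == "__main__":
    # 200 msg/s at ~5s per message needs about 1000 messages in flight.
    print(flow_control_limits(200, 5, 2048))
    # A limit of 100 outstanding messages caps the same workload at 20 msg/s.
    print(throughput_cap(100, 5))
```

The limit is a ceiling on leased messages, not a guarantee of concurrency: the subscriber still needs enough worker threads or goroutines to process that many messages at once.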

( 09 ) Diagnosing with Cloud Monitoring Metrics

Key metrics: subscription/num_undelivered_messages (backlog count), subscription/oldest_unacked_message_age (age of the oldest unacked message), subscription/ack_message_count (rate of acks), and topic/send_message_operation_count (rate of publishes, recorded on the topic rather than the subscription). If publish > ack, backlog grows. Also check subscription/mod_ack_deadline_message_count: high values indicate frequent deadline extensions or nacks, often because processing takes too long.

Use Metrics Explorer to create a ratio: (publish_rate - ack_rate) / publish_rate. If positive, backlog grows. Also look at subscriber instance metrics (CPU, memory) to see if they are saturated. A common pitfall: the subscriber might be waiting on external resources (e.g., database write) – check those too.
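The ratio above is easy to compute by hand once you have the two rates. The following sketch, with hypothetical function names, evaluates it and also estimates how long a backlog would take to drain at a given ack rate. The rates are the ones from the war story, and the backlog size is an invented figure for illustration.

```python
def backlog_growth_ratio(publish_rate, ack_rate):
    """(publish_rate - ack_rate) / publish_rate; positive means growth."""
    if publish_rate <= 0:
        raise ValueError("publish_rate must be positive")
    return (publish_rate - ack_rate) / publish_rate


def seconds_to_drain(backlog_messages, publish_rate, ack_rate):
    """Time to clear the backlog, or None if it never drains."""
    net_drain_rate = ack_rate - publish_rate
    if net_drain_rate <= 0:
        return None
    return backlog_messages / net_drain_rate


if __name__ == "__main__":
    # During the incident: 200 msg/s published, 50 msg/s acked.
    print(backlog_growth_ratio(200, 50))       # 0.75
    print(seconds_to_drain(90_000, 200, 50))   # None: still growing
    # A subscriber acking 350 msg/s clears 90,000 messages in 10 minutes.
    print(seconds_to_drain(90_000, 200, 350))  # 600.0
```

A ratio of 0.75 means three out of every four published messages are being added to the backlog rather than processed.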

( 10 ) Push Subscriptions: The Silent Backlog from HTTP Errors

For push subscriptions, every non-2xx response from your endpoint causes Pub/Sub to retry with exponential backoff. If your endpoint is slow or returns errors, messages pile up. Check the push endpoint's response codes in Cloud Logging. Also, push subscriptions have a configurable 'push endpoint' and 'ack deadline' – the deadline is the time Pub/Sub waits for a 200/204 before retrying.

If your endpoint can't keep up, consider switching to pull with a more efficient subscriber. Push is simpler but less controllable. Use subscription/push_request_count, broken down by its response code or response class label, together with subscription/push_request_latencies.
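When working from exported endpoint logs rather than the metric, a short script can summarize how many deliveries Pub/Sub would treat as acknowledged. This sketch assumes the set of acknowledging status codes is 102, 200, 201, 202, and 204, which is how the push documentation describes it at the time of writing. Verify that against the current docs, and treat the function names as hypothetical.

```python
from collections import Counter

# Status codes treated as an acknowledgement for push delivery
# (assumption taken from the push subscription documentation).
PUSH_ACK_STATUS_CODES = frozenset({102, 200, 201, 202, 204})


def push_ack_fraction(status_codes):
    """Fraction of push responses that count as acks, or None if empty."""
    codes = list(status_codes)
    if not codes:
        return None
    acked = sum(1 for code in codes if code in PUSH_ACK_STATUS_CODES)
    return acked / len(codes)


def failing_codes(status_codes):
    """Count the responses that would trigger a retry, by status code."""
    return Counter(
        code for code in status_codes if code not in PUSH_ACK_STATUS_CODES
    )


if __name__ == "__main__":
    observed = [200, 204, 500, 503, 500, 200, 429, 200]
    print(push_ack_fraction(observed))
    print(failing_codes(observed).most_common())
```

A falling ack fraction alongside a rising backlog points at the endpoint rather than at Pub/Sub, and the per-code counts show whether the failures are errors (5xx) or throttling (429).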

( 11 ) Scaling Subscribers: Horizontal vs Vertical

If the subscriber is stateless and message ordering is not required, horizontal scaling is straightforward. Add more subscriber instances; Pub/Sub will distribute messages across them. However, if ordering is required, you must partition by ordering key; each partition is processed by one subscriber at a time, limiting parallelism.

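Vertical scaling (more CPU/memory per instance) helps if the bottleneck is CPU or memory. But often the bottleneck is I/O (e.g., database writes). In that case, scaling horizontally may overload the downstream. To increase concurrency within a single subscriber process, raise the number of StreamingPull streams; the setting name and its default depend on the client library (for example, NumGoroutines in the Go client's ReceiveSettings or setParallelPullCount on the Java Subscriber builder).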
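The point about downstream limits can be made concrete with a back-of-the-envelope model. The sketch below is a simplification with hypothetical names and numbers: it assumes each instance processes at a fixed rate and that a shared downstream (such as a database) has a hard capacity.

```python
import math


def effective_throughput(instances, per_instance_rate, downstream_capacity):
    """Throughput is capped by whichever is smaller: subscribers or downstream."""
    return min(instances * per_instance_rate, downstream_capacity)


def instances_needed(publish_rate, per_instance_rate, downstream_capacity):
    """Instances required to match publish_rate, or None if the
    downstream cannot absorb that rate no matter how many you add."""
    if publish_rate > downstream_capacity:
        return None
    return math.ceil(publish_rate / per_instance_rate)


if __name__ == "__main__":
    # Four instances at 50 msg/s each, but the database tops out at 150 msg/s.
    print(effective_throughput(4, 50, 150))  # 150
    # Matching 200 msg/s is impossible until the database is faster.
    print(instances_needed(200, 50, 150))    # None
    print(instances_needed(200, 50, 400))    # 4
```

When `instances_needed` returns None, adding subscribers only moves the queue from Pub/Sub into the downstream system.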

( 12 ) Non-Obvious Cause: Message Size and Serialization

Large messages (>1MB) can cause network bandwidth issues or deserialization bottlenecks. Pub/Sub has a 10MB limit per message. If your messages are large, the time to transmit and process them increases. This can cause ack deadline expiration. Monitor 'message_size' in logs or use a custom metric. Consider compressing or splitting large messages.

Also, if the subscriber uses synchronous pull and processes messages one by one, a single large message can block the pipeline. Use streaming pull and async processing to mitigate.
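Compression is often the cheapest of these mitigations to try. The sketch below uses Python's standard zlib module and marks compressed payloads with a message attribute so the subscriber knows when to decompress. The attribute name "content-encoding" and the 1024-byte threshold are assumptions made for this example, not Pub/Sub conventions.

```python
import zlib


def compress_payload(data: bytes, min_size: int = 1024):
    """Return (payload, attributes) ready to publish.

    Small payloads, and payloads that do not shrink, are sent unchanged.
    """
    if len(data) < min_size:
        return data, {}
    compressed = zlib.compress(data, 6)
    if len(compressed) >= len(data):
        return data, {}
    # Hypothetical attribute name; publisher and subscriber must agree on it.
    return compressed, {"content-encoding": "zlib"}


def decompress_payload(payload: bytes, attributes: dict) -> bytes:
    """Inverse of compress_payload, driven by the message attributes."""
    if attributes.get("content-encoding") == "zlib":
        return zlib.decompress(payload)
    return payload


if __name__ == "__main__":
    original = b'{"sensor": "a1", "value": 42}' * 50_000
    payload, attrs = compress_payload(original)
    print(len(original), "->", len(payload), attrs)
    assert decompress_payload(payload, attrs) == original
```

Compression trades CPU for bandwidth, so it helps most when the subscriber is network-bound or deserialization is cheap relative to transfer time.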

Frequently asked questions

What is the difference between ack deadline and message retention duration?

Ack deadline is the time Pub/Sub waits for a subscriber to acknowledge a message before making it available for redelivery. Message retention duration is how long Pub/Sub keeps unacked messages before discarding them. If your backlog grows beyond retention duration, messages are lost. Retention can be set from 10 minutes to 7 days.

How do I increase the ack deadline on an existing subscription?

Use gcloud: gcloud pubsub subscriptions update SUBSCRIPTION_ID --ack-deadline=SECONDS (max 600). Or via the GCP Console: select subscription, edit, change ack deadline. You can also update it programmatically via the client library.

Why is my subscriber not pulling messages even though backlog exists?

Possible reasons: flow control limits (max outstanding messages) are too low, subscriber is not running, or the subscription is paused. Check subscriber logs for 'RESOURCE_EXHAUSTED' or 'DEADLINE_EXCEEDED' errors. Also ensure the subscriber is using the correct subscription ID and has permissions.

Can I set different ack deadlines per message?

No, ack deadline is a subscription-level property. However, you can use modifyAckDeadline to extend the deadline for specific messages that are still being processed. This is useful for long-running tasks. But be careful: it can mask performance issues.

What happens if I set ack deadline too high?

If a subscriber crashes, messages will not be redelivered until the ack deadline expires. This increases latency for recovery. Also, messages that are never acked will stay in the subscription longer, consuming storage. Set ack deadline just above your p99 processing time.