LEARN · DEBUGGING GUIDE

Background job stuck: how to debug jobs that never complete

You enqueue a job. A worker picks it up. Hours later, the job is still 'in progress'. It is not failing. It is not completing. It is stuck.

Advanced · Observability/performance debugging

What this usually means

A stuck background job is one that started but cannot finish. The worker process is alive and the job is assigned to it, but something is blocking progress: an infinite loop, a deadlock on a database row or an application lock, a network call that hangs because it has no timeout, or a wait on an external resource that is unavailable. Unlike a crashed worker (which the queue would detect and reassign), a stuck worker looks healthy to the queue. It just never finishes.

( 01 ) Fast diagnosis

The first ten minutes: establish facts before touching code.

  1. Check the job's progress. Does the job have a heartbeat or progress log? When was the last update?
  2. Check the worker process. Is it using CPU? If CPU is near 0%, the worker is most likely blocked waiting on I/O or a lock (database, network, mutex). If one core is pinned near 100%, suspect a busy loop.
  3. Check for an infinite loop. Look at the job code: is there a while(true) or unbounded recursion?
  4. Check for a deadlock. Is the job waiting on a database lock, a distributed lock, or a mutex?
  5. Check external dependencies. Is the job calling an external API that is hanging with no timeout?
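
Step 2 can be turned into a rough rule of thumb. The sketch below assumes you can sample the worker's cumulative CPU time twice (for example from ps, top, or your metrics system) and classifies what happened in between. The function name and both thresholds are illustrative assumptions, not values from any particular queue or monitoring tool.

```python
def classify_worker(cpu_before_s: float, cpu_after_s: float, wall_s: float) -> str:
    """Classify a worker from two samples of its cumulative CPU time.

    cpu_before_s / cpu_after_s: CPU seconds consumed so far at each sample.
    wall_s: wall-clock seconds between the two samples.
    """
    if wall_s <= 0:
        raise ValueError("wall_s must be positive")

    # Fraction of one core used during the sampling window.
    utilisation = (cpu_after_s - cpu_before_s) / wall_s

    # Thresholds are assumptions for illustration; tune them for your workload.
    if utilisation < 0.02:
        return "blocked"   # idle: probably waiting on I/O, a lock, or the network
    if utilisation > 0.90:
        return "spinning"  # busy: probably a tight loop or heavy computation
    return "working"
```

A "blocked" result points you at steps 4 and 5, a "spinning" result at step 3. Neither is proof on its own: a healthy job can also wait on I/O or burn CPU, so compare against the job's last progress update.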

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • Job queue dashboard: job status, start time, last heartbeat
  • Worker process metrics: CPU usage, memory, open connections
  • Job code: look for unbounded loops, synchronous waits, lock acquisition
  • Database: check for open transactions and locks held by the worker's connection
  • External API calls: do they have timeouts? Are any hanging?
  • Job logs: the last log entry before the job got stuck
  • Queue configuration: visibility timeout, max retries, job timeout
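
If your dashboard does not surface heartbeat age directly, you can compute it from the job records. This is a minimal sketch that assumes each job exposes a status and a last-heartbeat timestamp; the JobRecord shape and the silent_jobs helper are hypothetical and stand in for whatever your queue actually stores.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class JobRecord:
    # Hypothetical record shape; map these fields onto your queue's schema.
    job_id: str
    status: str            # e.g. "queued", "in_progress", "done", "failed"
    started_at: float      # seconds, same clock as `now`
    last_heartbeat: float  # seconds, same clock as `now`


def silent_jobs(
    jobs: List[JobRecord], now: float, max_silence_s: float
) -> List[Tuple[str, float]]:
    """Return (job_id, seconds since last heartbeat) for in-progress jobs
    that have been silent longer than max_silence_s, longest silence first."""
    stale = [
        (job.job_id, now - job.last_heartbeat)
        for job in jobs
        if job.status == "in_progress" and now - job.last_heartbeat > max_silence_s
    ]
    return sorted(stale, key=lambda item: item[1], reverse=True)
```

Jobs at the top of this list are the first ones to cross-check against worker metrics and job logs.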

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

  • Infinite loop in job code: a while loop whose condition never becomes false
  • External API call with no timeout: the worker hangs waiting for a response that never comes
  • Database or lock deadlock: the worker is waiting for a lock that another process holds and never releases
  • Job is waiting for a condition that will never be met: for example, it depends on another job that has already failed
  • Worker process is alive but its event loop is blocked by a synchronous CPU-intensive operation
  • Visibility timeout is too long: the queue assumes the worker is still making progress and does not redeliver the job
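
Several of these causes share one shape: a wait with no upper bound. A minimal sketch of the bounded alternative is below. The names wait_until and WaitTimeout are made up for illustration; the point is only that every polling loop carries a deadline and fails loudly when it passes.

```python
import time
from typing import Callable


class WaitTimeout(Exception):
    """Raised when the awaited condition does not become true in time."""


def wait_until(
    condition: Callable[[], bool], timeout_s: float, poll_s: float = 0.5
) -> None:
    """Poll `condition` until it returns True or `timeout_s` elapses."""
    # time.monotonic() is unaffected by system clock adjustments.
    deadline = time.monotonic() + timeout_s
    while not condition():
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout_s} seconds")
        time.sleep(poll_s)
```

A job that waits on another job would call something like wait_until(lambda: upstream_is_done(), timeout_s=600), where upstream_is_done is your own check. When the upstream job has failed, the wait ends with an exception instead of holding the worker indefinitely.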

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

  • Add a job execution timeout: if the job runs longer than N minutes, kill it and mark it as failed
  • Add a heartbeat mechanism: the job reports progress every N seconds, and the queue requeues the job if the heartbeat stops
  • Set timeouts on every external call: database queries, HTTP requests, lock acquisition
  • Break long-running jobs into smaller steps with checkpointing so they can resume after a restart
  • Add a dead job detector: a separate process that finds jobs stuck for longer than expected and alerts or requeues them
  • Monitor job duration and alert if any job exceeds a threshold
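
The checkpointing pattern is the least obvious of these, so here is a minimal sketch. It assumes the job processes a list of items in order and that a local JSON file is an acceptable place to record progress. In a real system the checkpoint would more likely live in your database or job store. All function names and the file format are illustrative.

```python
import json
import os
import tempfile
from typing import Callable, List


def load_checkpoint(path: str) -> int:
    """Return the index of the next item to process (0 if no checkpoint exists)."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return int(json.load(f)["next_index"])
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, next_index: int) -> None:
    """Write the checkpoint via a temp file and rename, so a crash
    never leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)


def run_job(
    items: List[str], handle: Callable[[str], None], checkpoint_path: str
) -> None:
    """Process items in order, resuming from the last saved checkpoint."""
    start = load_checkpoint(checkpoint_path)
    for index in range(start, len(items)):
        handle(items[index])
        # Record progress only after the item has been handled.
        save_checkpoint(checkpoint_path, index + 1)
```

If the worker dies between handling an item and saving the checkpoint, that item is handled again on resume, so handle should be idempotent.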

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

  • Set a short timeout on a test job. Verify the job is marked as failed after the timeout.
  • Run a job that simulates a hanging external call. Verify it times out and does not block the worker.
  • Check the queue dashboard: stuck jobs should be requeued and eventually succeed or fail definitively.
  • Monitor job duration distribution: no job should stay 'in progress' longer than the maximum expected duration.
  • Simulate a worker crash mid-job. Verify the queue reassigns the job to another worker.
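
The hanging-call check is easy to reproduce locally with the standard library. The sketch below starts a TCP server on a free loopback port that accepts connections and never replies, then calls it with a timeout. The helper names are hypothetical, and a real job would use its own HTTP or database client, but the behaviour to verify is the same: the call returns control within the timeout.

```python
import socket
import threading
from typing import List, Tuple


def start_hanging_server() -> Tuple[socket.socket, int]:
    """Listen on a free local port, accept connections, and never reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(5)
    held: List[socket.socket] = []  # keep connections open so clients see silence, not EOF

    def accept_forever() -> None:
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # the listening socket was closed
            held.append(conn)

    threading.Thread(target=accept_forever, daemon=True).start()
    return server, server.getsockname()[1]


def call_with_timeout(port: int, timeout_s: float) -> str:
    """Send a request and wait for a reply, giving up after timeout_s."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout_s) as sock:
            sock.sendall(b"ping")
            sock.recv(1)  # blocks until data arrives or the timeout fires
            return "responded"
    except socket.timeout:
        return "timed_out"
```

Point a test job at the hanging server and check that it reports "timed_out" rather than sitting in progress. Removing the timeout argument reproduces the original bug, which is a useful sanity check that the test detects it.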

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

  • Not setting timeouts on external calls inside background jobs
  • Not having a job execution timeout: a stuck job occupies its worker forever
  • Not monitoring job duration: you will not know a job is stuck until users complain
  • Using a single queue with no prioritisation: stuck jobs can tie up every worker and starve all other jobs
  • Not having a retry or dead-letter mechanism for jobs that fail repeatedly
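
One way to close the missing execution timeout gap is to run each job in a child process that the worker can kill. This is a sketch, assuming the job can be started as a separate command. run_job_with_timeout and its return values are illustrative and not part of any queue library.

```python
import subprocess
import sys
from typing import List


def run_job_with_timeout(argv: List[str], timeout_s: float) -> str:
    """Run a job as a child process and enforce a hard time limit.

    subprocess.run kills the child and waits for it when the timeout
    expires, then raises TimeoutExpired.
    """
    try:
        completed = subprocess.run(argv, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "failed_timeout"
    return "succeeded" if completed.returncode == 0 else "failed"


if __name__ == "__main__":
    # A job that would otherwise hang for an hour is cut off after two seconds.
    hanging_job = [sys.executable, "-c", "import time; time.sleep(3600)"]
    print(run_job_with_timeout(hanging_job, timeout_s=2.0))
```

Only the direct child is killed. If the job spawns its own subprocesses, they need separate handling, for example by running the job in its own process group.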