Background Job Stuck — Debugging Guide | Buglyst Learn

What this usually means

A stuck background job is one that started but cannot finish. The worker process is alive and the job is assigned, but something is blocking progress: an infinite loop, a deadlock on a database lock or mutex, a network call hanging with no timeout, or a wait on an external resource that is unavailable. Unlike a crashed worker, whose job the queue would typically detect as abandoned and reassign, a stuck worker looks healthy to the queue. It just never finishes.

( 01 ) Fast diagnosis

The first ten minutes — establish facts before touching code.

1. Check the job's progress. Does the job have a heartbeat or progress log? When was the last update?
2. Check the worker process. Is it using CPU? If CPU is near 0%, the worker is probably blocked waiting on I/O or a lock (database, network, mutex). If one core is pinned near 100%, suspect an infinite loop or a CPU-bound operation.
3. Check for an infinite loop. Look at the job code for while(true) loops, loops whose exit condition may never be met, or unbounded recursion.
4. Check for a deadlock. Is the job waiting on a database lock, a distributed lock, or a mutex?
5. Check external dependencies. Is the job calling an external API that is hanging with no timeout?
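
If the worker is a Python process, one way to see where it is blocked is to dump the stack of every thread. The sketch below uses the standard library's faulthandler module; the function names dump_stacks and install_stack_dump_signal are hypothetical, and workers in other runtimes need their own equivalent (a thread dump, a debugger attach, and so on).

```python
import faulthandler
import signal
import tempfile


def dump_stacks() -> str:
    """Return the current stack of every thread in this process as text."""
    # faulthandler writes to a real file descriptor, so use a temp file.
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()


def install_stack_dump_signal() -> bool:
    """Dump all thread stacks to stderr when the process receives SIGUSR1.

    Returns False on platforms without SIGUSR1 (for example Windows).
    """
    if not hasattr(signal, "SIGUSR1"):
        return False
    faulthandler.register(signal.SIGUSR1, all_threads=True)
    return True
```

Assuming install_stack_dump_signal() was called at worker startup, sending SIGUSR1 to the stuck process (for example with kill -USR1 and its PID) prints the stacks without stopping the worker. A frame sitting in a socket read, a lock acquire, or the same loop body on every dump usually points at the cause.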

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

- Job queue dashboard: job status, start time, last heartbeat
- Worker process metrics: CPU usage, memory, open connections
- Job code: look for unbounded loops, synchronous waits, lock acquisition
- Database: check for open transactions and locks held by the worker's connection
- External API calls: do they have timeouts? Are they hanging?
- Job logs: the last log entry before the job got stuck
- Queue configuration: visibility timeout, max retries, job timeout

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

- Infinite loop in job code: a while loop whose condition never becomes false
- External API call with no timeout: the worker hangs waiting for a response that never comes
- Database or lock deadlock: the worker is waiting for a lock held by another process
- Job is waiting for a condition that will never be met: for example, it depends on another job that has failed
- Worker process is alive but its event loop is blocked by a synchronous, CPU-intensive operation
- Visibility timeout is too long: the queue keeps assuming the worker is still processing the job, so the job is not redelivered
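
The missing-timeout cause is easy to reproduce locally. The sketch below is a minimal Python illustration, not a definitive client: call_api and start_silent_server are hypothetical names, and the "API" is just a local socket that accepts connections and never replies.

```python
import http.client
import socket


def call_api(host: str, port: int, timeout_s: float) -> str:
    """Make a GET request; return 'ok' or 'timed_out' instead of hanging."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        conn.request("GET", "/")
        resp = conn.getresponse()  # blocks until the server replies
        resp.read()
        return "ok"
    except socket.timeout:
        return "timed_out"
    finally:
        conn.close()


def start_silent_server() -> socket.socket:
    """Listen on a local port but never reply, to simulate a hanging API."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(1)
    return srv
```

Without the timeout argument, getresponse() would wait indefinitely and the worker would look exactly like the stuck job described above. Note that a socket timeout applies to each blocking operation, not to the request as a whole, so a server that trickles data slowly can still hold a job for longer than timeout_s.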

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

- Add a job execution timeout: if the job runs longer than N minutes, kill it and mark it as failed
- Add a heartbeat mechanism: the job reports progress every N seconds, and the queue requeues the job if the heartbeat stops
- Set timeouts on every external call: database queries, HTTP requests, lock acquisition
- Break long-running jobs into smaller steps with checkpointing so they can resume after a restart
- Add a dead job detector: a separate process that finds jobs stuck for longer than expected and alerts on them or requeues them
- Monitor job duration and alert if any job exceeds a threshold
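
The heartbeat and dead job detector patterns fit together: the job updates a timestamp while it makes progress, and a separate check flags running jobs whose timestamp is too old. Here is a minimal in-memory sketch in Python. JobRecord, record_heartbeat, and find_stuck_jobs are hypothetical names, and a real system would keep these records in the queue's own store rather than a dict.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JobRecord:
    job_id: str
    status: str            # "running", "succeeded", or "failed"
    last_heartbeat: float  # seconds, from the same clock as `now` below


def record_heartbeat(jobs: Dict[str, JobRecord], job_id: str, now: float) -> None:
    """Called by the job every few seconds while it is making progress."""
    jobs[job_id].last_heartbeat = now


def find_stuck_jobs(jobs: Dict[str, JobRecord], now: float,
                    heartbeat_timeout_s: float) -> List[str]:
    """Return IDs of running jobs whose heartbeat is older than the timeout."""
    return [
        job.job_id
        for job in jobs.values()
        if job.status == "running"
        and now - job.last_heartbeat > heartbeat_timeout_s
    ]
```

Passing now in explicitly, rather than reading the clock inside the function, keeps the detector easy to test. What to do with the returned IDs (alert, requeue, or mark as failed) depends on whether the job is safe to run twice.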

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

- Set a short timeout on a test job. Verify the job is marked as failed after the timeout.
- Run a job that simulates a hanging external call. Verify it times out and does not block the worker.
- Check the queue dashboard: stuck jobs should be requeued and eventually succeed or fail definitively.
- Monitor the job duration distribution: no job should stay 'in progress' longer than the maximum expected duration.
- Simulate a worker crash mid-job. Verify the queue reassigns the job to another worker.
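
The first check can be scripted. The sketch below assumes the job runs in a child process, which is one common way to make a hard timeout enforceable, because a child process can be killed from outside while a thread generally cannot. run_job, HANGING_JOB, and QUICK_JOB are hypothetical names for illustration.

```python
import subprocess
import sys
from typing import List


def run_job(cmd: List[str], timeout_s: float) -> str:
    """Run a job in a child process.

    Returns 'succeeded', 'failed', or 'timed_out'.
    """
    try:
        # subprocess.run kills the child and raises if the timeout expires.
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timed_out"
    return "succeeded" if result.returncode == 0 else "failed"


# Hypothetical test jobs: one hangs for a minute, one finishes immediately.
HANGING_JOB = [sys.executable, "-c", "import time; time.sleep(60)"]
QUICK_JOB = [sys.executable, "-c", "pass"]
```

Calling run_job(HANGING_JOB, timeout_s=1.0) should return "timed_out" after about a second instead of a minute, and the caller stays free to pick up the next job. The same shape works for a test job inside a real queue: give it a deliberately short timeout and assert on the recorded status.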

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

- Not setting timeouts on external calls inside background jobs
- Not having a job execution timeout: a stuck job occupies its worker forever
- Not monitoring job duration: you will not know a job is stuck until users complain
- Using a single queue with no prioritisation: stuck jobs tie up workers and can starve every other job behind them
- Not having a retry or dead-letter mechanism for jobs that fail repeatedly
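
For the last item, a retry limit with a dead-letter list can be very small. This is a sketch under simple assumptions: process_with_dead_letter is a hypothetical name, the dead-letter store is a plain list, and there is no backoff between attempts, which most real queues would add.

```python
from typing import Callable, List, Tuple


def process_with_dead_letter(
    job_id: str,
    handler: Callable[[str], None],
    max_attempts: int,
    dead_letter: List[Tuple[str, str]],
) -> bool:
    """Try a job up to max_attempts times; park it in dead_letter if all fail."""
    last_error = ""
    for _ in range(max_attempts):
        try:
            handler(job_id)
            return True
        except Exception as exc:  # a real worker would catch narrower errors
            last_error = repr(exc)
    # Every attempt failed: record the job and its last error for inspection.
    dead_letter.append((job_id, last_error))
    return False
```

A job that lands in the dead-letter list stops consuming worker time and leaves a record of why it failed, so repeated failures become visible instead of silently recycling.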
