LEARN · DEBUGGING GUIDE

How to Debug a Failing Sidekiq Job That Keeps Retrying

A Sidekiq job that fails and retries is a symptom, not the problem. Here's how to find the real cause in under 10 minutes.

Intermediate · Ruby · 5 min read

What this usually means

A Sidekiq job that repeatedly fails and retries indicates an unhandled exception in the `perform` method. The job is re-enqueued with exponential backoff until it hits the default 25 retries (or your configured limit). Common causes are transient errors (network timeouts, deadlocks) that don't resolve, or permanent errors (missing records, invalid arguments) that will never succeed. The key is distinguishing between a job that should eventually succeed (transient) and one that will always fail (permanent) and should be discarded or fixed.

( 01 ) Fast diagnosis

The first ten minutes — establish facts before touching code.

  1. Check the Sidekiq web UI at /sidekiq/retries and note the error class and message
  2. Tail the Sidekiq log: `tail -f log/sidekiq.log | grep -E 'ERROR|WARN|fail'`
  3. Inspect job arguments in a Rails console: `Sidekiq::RetrySet.new.each { |j| puts j.args.inspect }`
  4. Re-run the job locally with the same args to reproduce: `MyWorker.new.perform(*args)`
  5. Check whether the error is transient (e.g., Net::OpenTimeout) or permanent (e.g., NoMethodError)
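
When the retry set holds hundreds of entries, grouping them by job class and error class shows which failure dominates. The sketch below is only an illustration: `summarize_retries` is a hypothetical helper and the sample payloads are made up. It relies on the `class`, `args`, and `error_class` keys that Sidekiq stores in a retried job's payload; in a Rails console you could pass it `Sidekiq::RetrySet.new.map(&:item)` instead of the sample array.

```ruby
# Group retry payloads by job class and error class, most frequent first.
# `entries` is an array of hashes shaped like Sidekiq job payloads.
def summarize_retries(entries)
  entries
    .group_by { |job| [job["class"], job["error_class"]] }
    .map do |(klass, error), jobs|
      { job: klass, error: error, count: jobs.size, sample_args: jobs.first["args"] }
    end
    .sort_by { |row| -row[:count] }
end

# Hypothetical sample data standing in for Sidekiq::RetrySet.new.map(&:item).
SAMPLE_RETRIES = [
  { "class" => "SendWelcomeEmailWorker", "args" => [101], "error_class" => "Net::ReadTimeout" },
  { "class" => "SendWelcomeEmailWorker", "args" => [102], "error_class" => "Net::ReadTimeout" },
  { "class" => "SyncProfileWorker", "args" => [nil], "error_class" => "NoMethodError" }
].freeze

summarize_retries(SAMPLE_RETRIES).each do |row|
  puts format("%-24s %-18s x%d  e.g. args=%p", row[:job], row[:error], row[:count], row[:sample_args])
end
```

The sample arguments printed for each group are usually enough to reproduce the failure locally.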

( 02 ) Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

  • Sidekiq web UI: /sidekiq/retries and /sidekiq/dead
  • Sidekiq log file: log/sidekiq.log (or stdout in production)
  • Application log: log/production.log for related errors
  • Error monitoring service (Sentry, Honeybadger, etc.) for exception frequency
  • Redis: `redis-cli -h <host> zcard retry` and `redis-cli -h <host> llen queue:default` to inspect retry-set and queue sizes (keys carry a prefix only if you configured a namespace; avoid `keys` on a production Redis)
  • Worker code: app/workers/your_worker.rb, checking for missing error handling
  • External service status (API, database) if the error is timeout or connection related

( 03 ) Common root causes

Practical causes, not theory. These are the things you will actually find.

  • Network timeout calling an external API that is down or slow
  • Database deadlock or ActiveRecord::Deadlocked due to concurrent writes
  • Missing database record that the job assumes exists (race condition)
  • Invalid job arguments (e.g., nil ID passed when a record ID is expected)
  • Unhandled exception in middleware (e.g., custom middleware raising an error)
  • Sidekiq process OOM-killed by the OS, leaving the job in 'running' limbo

( 04 ) Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

  • Cap retries with `sidekiq_options retry: 5` (exponential backoff is built in)
  • For transient errors, let the exception propagate so Sidekiq retries it, and tune the delay with `sidekiq_retry_in`; `retry_job wait: 10.seconds` is available only in ActiveJob classes
  • Add conditional logic to skip the job if prerequisite data is missing
  • Implement a circuit breaker for external API calls (e.g., using Semian)
  • Validate arguments at job entry and fail fast with `raise` if invalid
  • Use the `sidekiq-unique-jobs` gem to prevent duplicate processing
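
Two of these patterns, validating arguments at entry and skipping when prerequisite data is missing, can be sketched in plain Ruby. Everything named here is hypothetical: `WelcomeEmailJob`, the in-memory `USERS` lookup that stands in for a database query, and the `:sent` / `:skipped` return values. In a real worker the class would include Sidekiq's job module and `deliver` would call your mailer.

```ruby
# Hypothetical in-memory store standing in for a database lookup.
USERS = { 1 => "ada@example.com" }.freeze

class WelcomeEmailJob
  def perform(user_id)
    # Fail fast: a bad argument is a permanent error, so say exactly what was wrong.
    unless user_id.is_a?(Integer) && user_id.positive?
      raise ArgumentError, "user_id must be a positive Integer, got #{user_id.inspect}"
    end

    email = USERS[user_id]
    # Skip quietly when the record is missing; retrying cannot bring it back.
    return :skipped if email.nil?

    deliver(email)
    :sent
  end

  private

  # Placeholder for the real delivery call.
  def deliver(email)
    puts "Sending welcome email to #{email}"
  end
end

puts WelcomeEmailJob.new.perform(1).inspect
puts WelcomeEmailJob.new.perform(2).inspect
```

A raised `ArgumentError` is still retried like any other exception, so this pattern works best alongside a low retry limit.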

( 05 ) How to verify

A fix you cannot prove is a guess. Close the loop.

  • Deploy the fix, then re-enqueue a failed job from the Sidekiq web UI (retry button)
  • Monitor Sidekiq logs for success: `grep 'done' log/sidekiq.log`
  • Check Sidekiq stats: `Sidekiq::Stats.new.retry_size` should drop to zero
  • Run a load test that previously triggered the failure and confirm no retries
  • Verify error monitoring shows no new instances of the exception for that job type

( 06 ) Mistakes to avoid

Things that make this bug worse or harder to find.

  • Blindly increasing max retries without fixing the root cause
  • Rescuing all exceptions and swallowing errors silently
  • Ignoring job arguments: always log them for debugging
  • Assuming a retry will fix a permanent error (it won't)
  • Not using unique job constraints, causing duplicate processing on retry
  • Forgetting to test the fix by re-running failed jobs from the dead set

( 07 ) War story

The Silent Email Outage: A Sidekiq Retry Storm

Senior Backend Engineer · Ruby 3.2, Rails 7, Sidekiq 7, Redis 7, Postgres 15, SendGrid API

Timeline

  1. 09:15 PagerDuty alert: 'High Sidekiq retry count' on production
  2. 09:17 Check Sidekiq web UI: 1,200 retries for SendWelcomeEmailWorker
  3. 09:20 Tail sidekiq.log: Net::ReadTimeout from SendGrid API
  4. 09:25 Check SendGrid status page: reported degraded performance
  5. 09:30 Pause Sidekiq: `systemctl stop sidekiq` to stop retry storm
  6. 09:35 Add rescue block with retry limit and exponential backoff
  7. 09:40 Deploy fix, restart Sidekiq, re-enqueue dead jobs
  8. 09:50 All emails sent successfully, retry count back to zero

At 9:15 AM, PagerDuty woke me up with a 'High Sidekiq retry count' alert. I opened the Sidekiq web UI and saw 1,200 retries for SendWelcomeEmailWorker. The retry queue was growing fast. I tailed the Sidekiq log and saw Net::ReadTimeout exceptions from the SendGrid API every few seconds. My first instinct was to check whether SendGrid was down, so I visited their status page and saw 'Degraded Performance' reported 10 minutes earlier. This was a transient API outage, but our job had no retry limit of its own, so under Sidekiq's default of 25 retries every failed job would keep retrying for weeks, consuming resources.

I decided to stop Sidekiq immediately to prevent further load. I ran `systemctl stop sidekiq` on the affected server. Then I reviewed the worker code. It had a simple `HTTParty.post` call with no timeout handling and no retry configuration. I added handling for Net::ReadTimeout and Net::OpenTimeout that waits 30 seconds between attempts (a `sidekiq_retry_in` block), with a max of 5 retries using `sidekiq_options retry: 5`. Once retries were exhausted, I logged a warning and discarded the job; we could re-trigger welcome emails later.

I deployed the fix, restarted Sidekiq, and went to the Dead job set in the UI. I selected all dead SendWelcomeEmailWorker jobs and clicked 'Retry'. Within minutes, the jobs succeeded as SendGrid recovered. I monitored the logs for the next hour—no new retries. The fix reduced our retry queue from 1,200 to 0. The lesson: always set retry limits and handle transient failures gracefully.

Root cause

Unhandled Net::ReadTimeout from the SendGrid API, with no worker-level retry limit, so every failed job kept retrying under the default 25-retry policy.

The fix

Added rescue for network timeouts with exponential backoff and a max retry count of 5.

The lesson

Always configure retry limits and rescue transient errors. Don't let a downstream outage cascade into a retry storm.

( 08 ) Understanding Sidekiq Retry Mechanism

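Sidekiq automatically retries failed jobs with exponential backoff. The delay before each retry is roughly `(retry_count ** 4) + 15` seconds plus a small random jitter, so the first retries come after about 15s, 16s, 31s, 96s, and 4.5 minutes, and the final delay is almost 4 days. With the default maximum of 25 retries, the whole schedule spans about 20 days. Each retry increments the 'retry_count' in the job payload.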
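
To get a feel for how quickly the delays grow, the arithmetic can be reproduced in a few lines. This sketch assumes the default formula of `(retry_count ** 4) + 15` seconds and leaves out the random jitter Sidekiq adds, so real delays run slightly longer. `base_delay` is an illustrative name, not a Sidekiq method.

```ruby
DEFAULT_MAX_RETRIES = 25

# Seconds waited before the next attempt when `retry_count` retries
# have already happened (jitter omitted).
def base_delay(retry_count)
  (retry_count**4) + 15
end

delays = (0...DEFAULT_MAX_RETRIES).map { |count| base_delay(count) }

puts "First five delays (s): #{delays.first(5).inspect}"
puts "Last delay (days):     #{(delays.last / 86_400.0).round(1)}"
puts "Total span (days):     #{(delays.sum / 86_400.0).round(1)}"
```

The practical consequence is that a job failing on a permanent error stays in the retry set for roughly three weeks unless you lower the limit.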

When a job exhausts retries, it moves to the 'Dead' set (default 10,000 jobs max). Dead jobs can be manually retried via the web UI or API. The retry mechanism is implemented in Sidekiq::JobRetry (versions before Sidekiq 5 used the Sidekiq::Middleware::Server::RetryJobs middleware).

( 09 ) Diagnosing Transient vs Permanent Failures

Transient failures: network timeouts, database deadlocks, temporary service unavailability. These may succeed on retry. Permanent failures: NoMethodError, ActiveRecord::RecordNotFound, invalid arguments. These will never succeed and should be discarded or fixed.

To differentiate, look at the exception class. Transient: Net::ReadTimeout, ActiveRecord::Deadlocked, Redis::TimeoutError. Permanent: NoMethodError, ArgumentError, ActiveRecord::RecordNotFound. Also check the error message: if it references missing data or code bugs, it's permanent. Use `Sidekiq::RetrySet.new.each { |j| puts j["error_class"] }` to categorize.
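
The classification above can be written down as a small lookup. This is a sketch, not a complete taxonomy: the two lists contain only the classes named in this section, `classify_error` is a hypothetical helper, and it works on strings because the job payload stores the error class as a string. Anything not listed comes back as `:unknown` and still needs a human decision.

```ruby
require "set"

TRANSIENT_ERRORS = Set[
  "Net::ReadTimeout", "Net::OpenTimeout",
  "ActiveRecord::Deadlocked", "Redis::TimeoutError"
].freeze

PERMANENT_ERRORS = Set[
  "NoMethodError", "ArgumentError", "ActiveRecord::RecordNotFound"
].freeze

# Returns :transient, :permanent, or :unknown for an error class name.
def classify_error(error_class)
  return :transient if TRANSIENT_ERRORS.include?(error_class)
  return :permanent if PERMANENT_ERRORS.include?(error_class)

  :unknown
end

%w[Net::ReadTimeout NoMethodError Errno::ECONNRESET].each do |name|
  puts "#{name}: #{classify_error(name)}"
end
```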

( 10 ) Advanced Retry Strategies

Instead of relying on the default of 25 retries, use `sidekiq_options retry: 5` to set a lower limit. For more control, define a `sidekiq_retries_exhausted` block to log or notify when retries are exhausted. Example: `sidekiq_retries_exhausted do |msg, ex| Rails.logger.warn(...); end`.

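For transient errors in ActiveJob classes, use `retry_job` with a custom wait: `rescue Net::OpenTimeout => e; retry_job wait: 10.seconds; end`. Native Sidekiq jobs have no `retry_job`; there, let the exception propagate and set the delay with a `sidekiq_retry_in` block. Either approach gives you fine-grained control. Also consider using the sidekiq-unique-jobs gem to prevent duplicate job execution during retries.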
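
A custom delay policy for a native Sidekiq job boils down to a block that takes the retry count and the exception and returns a number of seconds, with `nil` falling back to the default backoff. The sketch below expresses one such policy as a plain proc so it can be exercised without Sidekiq or Redis. The name `NETWORK_RETRY_DELAY` and the 30-second step are assumptions for illustration; inside a job class it would be attached with `sidekiq_retry_in(&NETWORK_RETRY_DELAY)`.

```ruby
require "net/http" # defines Net::OpenTimeout and Net::ReadTimeout

# Hypothetical policy: wait a little longer after each network timeout,
# and return nil for everything else so the default backoff applies.
NETWORK_RETRY_DELAY = proc do |count, exception|
  case exception
  when Net::OpenTimeout, Net::ReadTimeout
    30 * (count + 1) # 30s, 60s, 90s, ...
  end
end

puts NETWORK_RETRY_DELAY.call(0, Net::ReadTimeout.new).inspect
puts NETWORK_RETRY_DELAY.call(2, Net::OpenTimeout.new).inspect
puts NETWORK_RETRY_DELAY.call(0, ArgumentError.new("bad id")).inspect
```

Keeping the policy in a standalone proc also makes it easy to unit test, which is harder once the logic is buried in a rescue block.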

( 11 ) Monitoring and Alerting for Retry Storms

Set up Prometheus metrics with sidekiq-prometheus-exporter to track retry queue size. Alert when the retry set holds more than 100 jobs or grows by more than 10 jobs per minute. Use `Sidekiq::Stats.new.retry_size` in custom health checks.
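
For a custom health check, the two alert conditions reduce to a comparison of two samples of the retry set size. The following is a minimal sketch using the thresholds from the text. `retry_storm?` is a hypothetical helper, and the sample sizes are passed in as plain integers so the logic can run without Redis; in an application they would come from `Sidekiq::Stats.new.retry_size`.

```ruby
RETRY_SIZE_LIMIT = 100     # jobs in the retry set
GROWTH_LIMIT_PER_MIN = 10  # jobs added per minute

# `previous` and `current` are retry-set sizes sampled `interval_seconds` apart.
def retry_storm?(previous:, current:, interval_seconds:)
  growth_per_min = (current - previous) * 60.0 / interval_seconds
  current > RETRY_SIZE_LIMIT || growth_per_min > GROWTH_LIMIT_PER_MIN
end

puts retry_storm?(previous: 20, current: 25, interval_seconds: 60)   # small, slow growth
puts retry_storm?(previous: 20, current: 50, interval_seconds: 60)   # growing 30 per minute
puts retry_storm?(previous: 150, current: 150, interval_seconds: 60) # large but flat
```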

Log job arguments and exceptions with structured logging (e.g., Sidekiq's built-in JSON log formatter) to quickly pinpoint problematic jobs. Example log line: `{"worker":"SendEmailWorker","args":[123],"error":"Net::ReadTimeout","retry_count":5}`.
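
A log line in that shape can be produced with the standard library alone. This sketch assumes a job payload hash with the `class`, `args`, and `retry_count` keys; `retry_log_line` is a hypothetical helper, and where you call it (server middleware, an error handler, or the job itself) depends on your setup.

```ruby
require "json"
require "net/http" # defines Net::ReadTimeout

# Build one JSON log line from a job payload hash and the exception raised.
def retry_log_line(job, exception)
  JSON.generate(
    "worker" => job["class"],
    "args" => job["args"],
    "error" => exception.class.name,
    "retry_count" => job.fetch("retry_count", 0)
  )
end

job = { "class" => "SendEmailWorker", "args" => [123], "retry_count" => 5 }
puts retry_log_line(job, Net::ReadTimeout.new)
```

If job arguments can contain personal data, log identifiers rather than full values.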

Frequently asked questions

Why does my Sidekiq job keep retrying even though I fixed the error?

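Jobs already in the retry set run the fixed code only when their next scheduled retry comes around, which can be hours or days away, and only if the Sidekiq process was restarted with the new code. Jobs in the dead set are never retried automatically. To re-run them immediately, use the Sidekiq web UI or `Sidekiq::RetrySet.new.retry_all` (and `Sidekiq::DeadSet.new.retry_all` for dead jobs).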

How do I prevent a job from retrying at all?

Set `sidekiq_options retry: false` in the worker class. A failed job is then discarded without being retried or saved to the dead set. Alternatively, if you want to skip retries but keep the failed job for inspection, set `sidekiq_options retry: 0`, which sends it straight to the dead set.

What is the difference between 'retry' and 'dead' in Sidekiq?

The retry set holds jobs that will be retried later. After exhausting retries, jobs move to the dead set. Dead jobs are not retried automatically but can be manually retried or deleted. The dead set has a configurable maximum size (default 10,000).

Can I configure different retry strategies per worker?

Yes. Use `sidekiq_options retry: 5` in the worker class for a simple limit. For custom backoff, define a `sidekiq_retry_in` block that returns the delay in seconds, or, in ActiveJob classes, use `retry_job` with a custom wait inside a rescue block.

How do I inspect the error that caused a retry?

In the Sidekiq web UI, click on a retry job to see its error class and message (plus a backtrace if the job enables the `backtrace` option). Programmatically: `Sidekiq::RetrySet.new.each { |j| puts j["error_class"], j["error_message"] }`. Also check your sidekiq.log for the error details.