What this usually means
Visibility timeout is the period during which a message is hidden from other consumers after one consumer receives it. If your worker takes longer than the visibility timeout to process and delete the message, SQS makes it visible again. The next poll (by the same or a different consumer) picks it up, leading to duplicate processing. The root cause is almost always a mismatch between the configured timeout and the actual processing time, either because the timeout was left at the default (30s) without measurement, or because processing time spiked due to a slow dependency.
The first ten minutes — establish facts before touching code.
1. Check the SQS queue's VisibilityTimeout value via AWS Console or CLI: `aws sqs get-queue-attributes --queue-url <URL> --attribute-names VisibilityTimeout`.
2. Measure the 99th percentile processing time for your worker. Add logging around the message handler: log start time, end time, and message ID.
3. Correlate duplicate messages with their original processing time. Look for messages where processing time > visibility timeout.
4. Increase visibility timeout to 2x the 99th percentile processing time and observe if duplicates drop to near zero.
5. Enable SQS dead-letter queue (DLQ) with maxReceiveCount = 3 to capture messages that fail repeatedly.
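Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, assuming you have already parsed handler durations out of your worker logs into a dict keyed by message ID; the sample values and the `durations` name are hypothetical, and the percentile uses the simple nearest-rank method rather than whatever your APM tool computes.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def exceeded_timeout(durations_by_id, visibility_timeout_s):
    """Message IDs whose handler ran longer than the visibility timeout."""
    return sorted(
        mid for mid, secs in durations_by_id.items() if secs > visibility_timeout_s
    )

# Hypothetical handler durations (seconds) keyed by message ID, e.g. parsed from worker logs.
durations = {"msg-a": 12.0, "msg-b": 18.5, "msg-c": 45.0, "msg-d": 9.0}
visibility_timeout_s = 30  # the value reported by get-queue-attributes

p99 = percentile(list(durations.values()), 99)
print(f"p99={p99}s timeout={visibility_timeout_s}s")
print("ran past the timeout:", exceeded_timeout(durations, visibility_timeout_s))
```

Any message ID in the second list is a candidate for redelivery and worth cross-checking against your duplicate reports.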
The specific files, logs, configs, and dashboards that usually own this bug.
- AWS Console → SQS → Queue → Monitoring tab: Look at ApproximateNumberOfMessagesVisible and ApproximateAgeOfOldestMessage.
- CloudWatch Metrics: SQS queue metrics for NumberOfMessagesReceived and NumberOfMessagesDeleted. ApproximateReceiveCount is not a queue metric; it is a per-message attribute you request on ReceiveMessage.
- Worker application logs: Look for message IDs that appear multiple times with different receipt handles.
- Application performance monitoring (APM) for processing time percentiles (p99, p99.9).
- SQS queue attributes: VisibilityTimeout, ReceiveMessageWaitTimeSeconds, and RedrivePolicy.
- Code: The message handler's delete call. Ensure it's called after successful processing and that exceptions are caught.
Practical causes, not theory. These are the things you will actually find.
- VisibilityTimeout set too low (e.g., 30 seconds default) for actual processing time (e.g., 45 seconds p99).
- Worker processing time spikes due to external API latency, database contention, or large payloads.
- Multiple consumers share the queue and none of them extends the timeout per message with ChangeMessageVisibility, so whichever worker polls next picks up a message that is still being processed.
- Message handler does not delete the message on success, relying on automatic deletion after visibility timeout (wrong).
- The delete is attempted after the visibility timeout has lapsed (processing took longer than the timeout), by which point another consumer may already hold the message under a newer receipt handle.
- Receiving messages in batches with a short visibility timeout. The clock starts for every message in the batch as soon as ReceiveMessage returns, so messages handled last can reappear before the worker reaches them. Long polling by itself does not consume visibility time.
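The delete-on-success cause is easiest to see in code. The sketch below assumes a client object exposing `delete_message` with the same keyword arguments as boto3's SQS client; `FakeSqs`, `handle`, and the sample messages are hypothetical names used only so the example runs without AWS access.

```python
class FakeSqs:
    """In-memory stand-in for an SQS client; it only records delete calls."""
    def __init__(self):
        self.deleted = []

    def delete_message(self, QueueUrl, ReceiptHandle):
        self.deleted.append(ReceiptHandle)

def handle(sqs, queue_url, message, process):
    """Process one message and delete it only if processing succeeded."""
    try:
        process(message["Body"])
    except Exception as exc:
        # Do not delete: the message becomes visible again after the timeout
        # and its receive count moves it toward the DLQ if it keeps failing.
        print(f"processing failed for {message['MessageId']}: {exc}")
        return False
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return True

def always_fails(body):
    raise RuntimeError("payment API timed out")

sqs = FakeSqs()
ok_msg = {"MessageId": "m-1", "ReceiptHandle": "rh-1", "Body": "order 1"}
bad_msg = {"MessageId": "m-2", "ReceiptHandle": "rh-2", "Body": "order 2"}
handle(sqs, "queue-url", ok_msg, lambda body: None)
handle(sqs, "queue-url", bad_msg, always_fails)
print(sqs.deleted)
```

Only the receipt handle of the message that processed cleanly ends up in `sqs.deleted`; the failed one is left for SQS to redeliver.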
Concrete fix directions. Pick the one that matches your root cause.
- Set VisibilityTimeout to a generous multiple of your p99 processing time as a safety margin, for example 6x: if p99 is 30s, set it to 180s.
- Call the ChangeMessageVisibility API to extend the timeout while the message is being processed (e.g., a heartbeat every 60s, as long as that interval is shorter than the timeout being extended).
- Use a dead-letter queue with maxReceiveCount = 5 to isolate poison pills that consistently exceed timeout.
- Refactor processing to be idempotent: use a database unique constraint or idempotency key to handle duplicates safely.
- Reduce processing time by offloading heavy work to async jobs or scaling workers.
- Use FIFO queues with a deduplication ID if ordering and deduplicated sends are required (but watch out for throughput limits). FIFO deduplication does not stop redelivery when a consumer outlives the visibility timeout, so the timeout and idempotency fixes above still apply.
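The timeout and DLQ settings from the list above boil down to two queue attributes. The following sketch builds them as the string-valued dict that SQS expects; the `queue_attributes` helper, the 6x margin, and the DLQ ARN are illustrative assumptions, not values from a real account.

```python
import json
import math

MAX_VISIBILITY_TIMEOUT_S = 12 * 60 * 60  # SQS upper bound: 12 hours

def queue_attributes(p99_seconds, margin, dlq_arn, max_receive_count):
    """Build VisibilityTimeout and RedrivePolicy attributes from a measured p99."""
    timeout = min(math.ceil(p99_seconds * margin), MAX_VISIBILITY_TIMEOUT_S)
    return {
        "VisibilityTimeout": str(timeout),
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": str(max_receive_count)}
        ),
    }

# Hypothetical inputs: p99 of 30s, 6x margin, placeholder DLQ ARN.
attrs = queue_attributes(30, 6, "arn:aws:sqs:us-east-1:123456789012:orders-dlq", 5)
print(attrs["VisibilityTimeout"])
print(attrs["RedrivePolicy"])

# With boto3 installed and credentials configured, the dict could be applied as:
#   boto3.client("sqs").set_queue_attributes(QueueUrl=queue_url, Attributes=attrs)
```

Keeping the calculation in one place makes it easy to re-run when your p99 changes instead of hand-editing the console value.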
A fix you cannot prove is a guess. Close the loop.
- Deploy the fix and monitor CloudWatch metrics: NumberOfMessagesDeleted should closely match NumberOfMessagesReceived minus DLQ moves.
- Run a canary with a known message that takes exactly the p99 processing time and verify it is not duplicated.
- Check application logs for duplicate message IDs; they should drop to zero.
- Verify that the ApproximateReceiveCount attribute returned with each received message is 1 for the vast majority of messages.
- Simulate a slow dependency (e.g., add a sleep in a test environment) and confirm that heartbeats extend visibility correctly.
Things that make this bug worse or harder to find.
- Setting visibility timeout to a very large value (e.g., 12 hours): this delays retries if a worker crashes; use heartbeats instead.
- Relying solely on increasing timeout without monitoring processing time percentiles: you might miss long-tail spikes.
- Forgetting to delete the message after processing: the message will be redelivered after timeout, causing duplicates.
- Assuming a late delete is safe: once the timeout expires, another consumer may already have received the message, and a delete with the stale receipt handle is not guaranteed to remove it or even to return an error.
- Not testing heartbeats thoroughly: if a heartbeat fails silently, the message reappears.
- Using FIFO queues without understanding that they have throughput limits (300 TPS per API action without batching) and that an in-flight message holds back later messages in the same message group.
The Double-Dipping Delivery Service
Timeline
- 09:15 Alert: 'Duplicate order processed' from customer support. They see two charges for one order.
- 09:20 Check SQS queue: VisibilityTimeout = 30s. Worker p99 processing time from logs: 45s.
- 09:25 Find duplicate DynamoDB entries with same orderId but different processedAt timestamps (15s apart).
- 09:30 Look at worker code: no ChangeMessageVisibility call. Deletion happens after 45s, but timeout expired at 30s.
- 09:40 Set VisibilityTimeout to 180s and deploy. Also add heartbeat: extend 60s every 30s.
- 09:50 Monitor for 10 minutes: duplicate count drops to zero. Logs show no more repeated message IDs.
- 10:00 Add dead-letter queue with maxReceiveCount=3 to catch any future misbehaving messages.
- 10:30 Postmortem: root cause was default timeout not matching actual processing time; no heartbeat mechanism.
I got paged at 9:15 AM. Customer support reported that a single order was charged twice. I immediately checked our order processing pipeline: an SQS queue feeds a Node.js worker on ECS. The worker reads a message, processes payment, writes to DynamoDB, and deletes the message. I pulled the order's message ID from logs and found it appeared twice in the worker logs, with different receipt handles, 15 seconds apart.
I checked the SQS queue configuration: VisibilityTimeout was set to the default 30 seconds. Then I looked at our worker's processing time metrics: p99 was 45 seconds. That was the smoking gun. The worker took 45 seconds to process, but after 30 seconds the message became visible again. Another worker picked it up, processed it again, and we got two DynamoDB records. The original worker eventually deleted the message, but the second worker had already committed the duplicate.
The fix was straightforward: I set VisibilityTimeout to 180 seconds (4x p99) and added a heartbeat that calls ChangeMessageVisibility every 30 seconds to extend the timeout while processing. I also added a dead-letter queue to catch messages that exceed maxReceiveCount. After deploying, duplicates stopped immediately. The lesson: never trust default timeouts; measure your actual processing time and add heartbeats for long-running jobs.
Root cause
SQS visibility timeout (30s) was less than the worker's p99 processing time (45s), causing message redelivery before processing completed.
The fix
Increased VisibilityTimeout to 180s and implemented a heartbeat that extends visibility by 60s every 30s. Added DLQ with maxReceiveCount=3.
The lesson
Always measure p99 processing time and set visibility timeout with a comfortable margin. For variable processing times, use heartbeats to extend the lock dynamically.
When a consumer calls ReceiveMessage, SQS marks that message as 'in flight' and hides it from other consumers for the duration of the visibility timeout. If the consumer deletes the message before the timeout expires, the message is gone. If the timeout expires without a delete or extend, SQS makes the message visible again for another consumer.
The key insight: the visibility timeout is not a processing deadline — it's a lease. You must either renew the lease (ChangeMessageVisibility) or release it (DeleteMessage) before it expires. Many engineers treat it as a simple timeout and set it to a large value, but that delays retries if the worker crashes. The correct pattern is a moderate timeout with periodic heartbeats.
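The lease behaviour can be checked with a toy model before touching a real queue. The sketch below is a simplified, single-message simulation driven by an explicit clock in whole seconds; `LeaseQueue` and `simulate` are hypothetical names, and real SQS behaviour (receipt handles, approximate timing, multiple messages) is deliberately left out.

```python
class LeaseQueue:
    """Toy model of one message's visibility lease."""
    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.invisible_until = 0
        self.deleted = False
        self.receive_count = 0

    def receive(self, now):
        """Hand out the message if it is visible; start a new lease."""
        if self.deleted or now < self.invisible_until:
            return False
        self.receive_count += 1
        self.invisible_until = now + self.visibility_timeout
        return True

    def extend(self, now, timeout):
        """Like ChangeMessageVisibility: the new timeout counts from the call time."""
        self.invisible_until = now + timeout

    def delete(self):
        self.deleted = True

def simulate(processing_time, visibility_timeout, heartbeat_every=None, extend_by=0):
    """Worker A processes for processing_time seconds while worker B polls every second."""
    queue = LeaseQueue(visibility_timeout)
    queue.receive(0)  # worker A takes the lease at t=0
    for t in range(1, processing_time + 1):
        if heartbeat_every and t % heartbeat_every == 0 and t < processing_time:
            queue.extend(t, extend_by)  # worker A renews the lease
        queue.receive(t)  # worker B polls
    queue.delete()  # worker A finishes and releases the lease
    return queue.receive_count

print("no heartbeat:", simulate(45, 30))
print("heartbeat every 20s:", simulate(45, 30, heartbeat_every=20, extend_by=30))
```

With a 30-second timeout and 45 seconds of work, the model delivers the message twice without a heartbeat and once with it, which is the same pattern as the incident described earlier.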
Setting a very large visibility timeout (e.g., 1 hour) might seem safe, but if the worker crashes mid-processing, the message will not be redelivered for that entire hour. This increases latency for retries. Instead, set a moderate timeout (e.g., 60 seconds) and use ChangeMessageVisibility to extend it every 30 seconds while processing is ongoing.
Implementing heartbeats in your worker: after receiving a message, start a periodic timer (e.g., setInterval in Node.js) that calls ChangeMessageVisibility with a new timeout. If the worker crashes, the timer stops, and the message becomes visible once the most recently set timeout runs out. This gives you fast failure detection while preventing premature redelivery.
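The same idea in Python, as a rough equivalent of the setInterval approach: a background thread that keeps extending the lease until the handler finishes. This is a sketch under assumptions: `Heartbeat` and `RecordingSqs` are hypothetical names, and the client is any object with a `change_message_visibility` method taking the same keyword arguments as boto3's SQS client.

```python
import threading
import time

class Heartbeat:
    """Context manager that extends a message's visibility on a fixed interval."""
    def __init__(self, sqs, queue_url, receipt_handle, interval_s=30, extend_to_s=60):
        self._sqs = sqs
        self._queue_url = queue_url
        self._receipt_handle = receipt_handle
        self._interval_s = interval_s
        self._extend_to_s = extend_to_s
        self._stop = threading.Event()
        # Daemon thread: if the worker process dies, the heartbeat dies with it.
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout (send a beat) and True once stopped.
        while not self._stop.wait(self._interval_s):
            try:
                self._sqs.change_message_visibility(
                    QueueUrl=self._queue_url,
                    ReceiptHandle=self._receipt_handle,
                    VisibilityTimeout=self._extend_to_s,
                )
            except Exception as exc:
                # Log loudly; a silent heartbeat failure means a surprise redelivery.
                print(f"heartbeat failed: {exc}")

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()
        return False

class RecordingSqs:
    """Stand-in client that records heartbeat calls instead of calling AWS."""
    def __init__(self):
        self.calls = []

    def change_message_visibility(self, **kwargs):
        self.calls.append(kwargs)

client = RecordingSqs()
with Heartbeat(client, "queue-url", "rh-1", interval_s=0.05, extend_to_s=60):
    time.sleep(0.3)  # stands in for slow message processing
print(f"heartbeats sent: {len(client.calls)}")
```

In a real worker the `with` block would wrap the handler, and the DeleteMessage call would follow once the block exits cleanly. Keep the interval comfortably shorter than the timeout you extend to, as in the 30s/60s pairing used in this guide.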
Even with correct visibility timeout and heartbeats, network issues or bugs can still cause duplicates. Always design your message processing to be idempotent. For example, in DynamoDB, use a conditional put with a unique order ID — if the record already exists, the put fails and you can safely ignore the duplicate.
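Idempotency keys: store a unique identifier (e.g., message ID) in a database with a TTL. Before processing, claim the key with an atomic conditional write rather than a separate read followed by a write, so two workers racing on the same message cannot both pass the check. If the key is already present, skip processing. This is a robust last line of defense against duplicates from any cause.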
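A minimal sketch of the unique-constraint approach follows. It uses SQLite's primary key as a stand-in for the DynamoDB conditional put described above, purely so the example runs locally; the `processed_orders` table, `process_once`, and the `charge` callback are hypothetical names.

```python
import sqlite3

def process_once(conn, order_id, charge):
    """Run charge(order_id) only if this order ID has not been claimed before."""
    try:
        with conn:  # commits on success, rolls back if anything inside raises
            conn.execute(
                "INSERT INTO processed_orders (order_id) VALUES (?)", (order_id,)
            )
            charge(order_id)
    except sqlite3.IntegrityError:
        # The key already exists: this delivery is a duplicate, so skip it.
        return False
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_orders (order_id TEXT PRIMARY KEY)")

charged = []
print(process_once(conn, "order-42", charged.append))  # first delivery
print(process_once(conn, "order-42", charged.append))  # redelivered duplicate
print(charged)
```

The insert and the check are a single atomic statement, so two workers cannot both claim the same order. Note that a database rollback cannot undo an external side effect such as a card charge, so for payments it is worth also passing the same key to the payment provider if it supports idempotent requests.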
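Set up CloudWatch alarms on redelivery. ApproximateReceiveCount is a per-message attribute returned by ReceiveMessage, not a built-in CloudWatch queue metric, so have the worker publish it as a custom metric. For example, alarm if the average receive count exceeds 1.2 over 5 minutes. This will catch visibility timeout issues before they cause widespread duplicates.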
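Because ApproximateReceiveCount arrives as a string attribute on each received message, one way to get an alarmable number is to average it in the worker. The sketch below assumes messages shaped like the entries of a boto3 `receive_message` response requested with that attribute; the batch contents and `average_receive_count` are hypothetical.

```python
ALARM_THRESHOLD = 1.2  # the example threshold from the text

def average_receive_count(messages):
    """Average ApproximateReceiveCount across a batch of received messages."""
    counts = [int(m["Attributes"]["ApproximateReceiveCount"]) for m in messages]
    return sum(counts) / len(counts) if counts else 0.0

# Hypothetical batch: two first-time deliveries and one redelivery.
batch = [
    {"MessageId": "m-1", "Attributes": {"ApproximateReceiveCount": "1"}},
    {"MessageId": "m-2", "Attributes": {"ApproximateReceiveCount": "1"}},
    {"MessageId": "m-3", "Attributes": {"ApproximateReceiveCount": "2"}},
]

avg = average_receive_count(batch)
print(f"average receive count: {avg:.2f}, alarm: {avg > ALARM_THRESHOLD}")

# The value could then be published as a custom metric, for example with
# boto3's cloudwatch.put_metric_data, and alarmed on over a 5-minute period.
```

Averaging per batch is noisy, so in practice you would emit one data point per message or per batch and let the alarm aggregate over the evaluation window.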
Log message IDs and receipt handles at key points: receive, start processing, end processing, delete. Correlate them with processing time. A simple script can detect messages that appear multiple times: sort logs by message ID and check for duplicates.
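The "simple script" can be as small as the sketch below. It assumes your logs have already been parsed into dicts with `event`, `message_id`, and `receipt_handle` fields; those field names and the sample records are hypothetical and should be adapted to your log format.

```python
from collections import defaultdict

def find_redeliveries(log_records):
    """Map each message ID to its receive count when it was received more than once."""
    handles = defaultdict(set)
    for record in log_records:
        if record["event"] == "receive":
            handles[record["message_id"]].add(record["receipt_handle"])
    return {mid: len(seen) for mid, seen in handles.items() if len(seen) > 1}

# Hypothetical parsed log lines.
logs = [
    {"event": "receive", "message_id": "m-1", "receipt_handle": "rh-a"},
    {"event": "receive", "message_id": "m-2", "receipt_handle": "rh-b"},
    {"event": "receive", "message_id": "m-1", "receipt_handle": "rh-c"},
    {"event": "delete", "message_id": "m-1", "receipt_handle": "rh-a"},
    {"event": "delete", "message_id": "m-2", "receipt_handle": "rh-b"},
]

print(find_redeliveries(logs))
```

Counting distinct receipt handles rather than log lines avoids false positives when the same receive is logged more than once.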
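The receipt handle is tied to a specific receive of a message, not to the message itself. ChangeMessageVisibility does not issue a new handle; keep using the one returned by the ReceiveMessage call that delivered the message to you. If the message is redelivered, that receive gets a new receipt handle and the old one is stale: calls made with it may fail (for example with 'ReceiptHandleIsInvalid') or may have no effect.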
Do not call ChangeMessageVisibility in a tight loop without backoff. Each call is a billable SQS API request. A 30-second interval is reasonable. Also, ensure your heartbeat logic handles errors gracefully: if a heartbeat fails, log it, and still try to process and delete the message.
Frequently asked questions
What is the default visibility timeout in SQS?
The default visibility timeout for a standard queue is 30 seconds. For FIFO queues, it's also 30 seconds. You can set it anywhere from 0 seconds to 12 hours. Always configure it based on your actual processing time.
Can I change visibility timeout after a message is received?
Yes, using the ChangeMessageVisibility API. You can extend or reduce the timeout for a specific message using its receipt handle. This is useful for heartbeats.
What happens if I delete a message after visibility timeout has expired?
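If the visibility timeout has expired, the receipt handle is no longer a reliable way to delete the message. The message may already have been received by another consumer under a new receipt handle. Depending on the queue type, a delete with the old handle may return an error such as 'ReceiptHandleIsInvalid' or may be accepted without removing the message. Treat a late delete as a sign that duplicate processing may have occurred.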
Does SQS guarantee exactly-once delivery?
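Standard SQS queues guarantee at-least-once delivery. FIFO queues preserve ordering and drop repeated sends that carry the same message deduplication ID within the 5-minute deduplication interval, which AWS describes as exactly-once processing. A consumer that outlives the visibility timeout can still see the message again, so FIFO does not replace a correct timeout or idempotent handlers. FIFO also has throughput limits (300 TPS without batching). For high throughput, use standard queues with idempotent processing.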
How do I monitor for duplicate processing?
Track the ApproximateReceiveCount attribute on received messages, for example by publishing it from the worker as a custom CloudWatch metric. A value consistently above 1 indicates redelivery. Also, log message IDs and set up an alert when the same ID appears more than once within a short time window.