intermediate By StackPractices

Dead Letter Queues

Handle failed messages gracefully with dead letter queues, retry policies, and poison pill detection in message-driven architectures.

Topics: messaging

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

Dead letter queues (DLQs) capture messages that fail processing after repeated attempts in message-driven systems. Without them, failed messages would block the queue or be lost entirely. A well-designed DLQ system distinguishes between poison pills (permanently bad messages) and transient failures, enabling operators to replay, inspect, or discard problematic messages without impacting the main processing flow.

When to Use

Use this resource when:

  • Message consumers encounter unrecoverable errors (malformed payloads, missing references)
  • You need to prevent one bad message from blocking an entire queue partition
  • Operations teams require visibility into failed messages for manual intervention
  • Compliance requires audit trails of all processed and failed messages, backed by a data retention policy

Solution

SQS DLQ Configuration (AWS CLI)

# Create main queue and DLQ
aws sqs create-queue --queue-name orders-queue
aws sqs create-queue --queue-name orders-dlq

# Get queue URLs
QUEUE_URL=$(aws sqs get-queue-url --queue-name orders-queue --query 'QueueUrl' --output text)
DLQ_URL=$(aws sqs get-queue-url --queue-name orders-dlq --query 'QueueUrl' --output text)
DLQ_ARN=$(aws sqs get-queue-attributes --queue-url $DLQ_URL --attribute-names QueueArn --query 'Attributes.QueueArn' --output text)

# Set redrive policy: send to DLQ after 3 failed receives
aws sqs set-queue-attributes \
  --queue-url "$QUEUE_URL" \
  --attributes '{
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"'"$DLQ_ARN"'\",\"maxReceiveCount\":3}"
  }'
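The nested quoting in the redrive policy is easy to get wrong, because SQS expects the policy as a JSON document serialized into a string. A minimal Python sketch of building that attribute map is shown below; the helper name `build_redrive_attributes` and the example ARN are hypothetical, and `maxReceiveCount` is passed as a string only because the boto3 documentation examples do so.

```python
import json


def build_redrive_attributes(dlq_arn: str, max_receive_count: int = 3) -> dict:
    """Return an SQS attribute map whose RedrivePolicy value is a JSON string."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        # Passed as a string here, following the boto3 documentation examples.
        "maxReceiveCount": str(max_receive_count),
    }
    # json.dumps handles the inner escaping that is error-prone by hand.
    return {"RedrivePolicy": json.dumps(policy)}


# Hypothetical ARN, for illustration only.
attributes = build_redrive_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq")
print(attributes["RedrivePolicy"])
```

With boto3 installed, the resulting dictionary could be passed as the `Attributes` argument of the SQS client's `set_queue_attributes` call, alongside `QueueUrl`.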

RabbitMQ Dead Letter Exchange (Python + pika)

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# DLX and DLQ
channel.exchange_declare(exchange='orders.dlx', exchange_type='direct')
channel.queue_declare(queue='orders-dlq', durable=True)
channel.queue_bind(queue='orders-dlq', exchange='orders.dlx', routing_key='failed')

# Main queue: rejected messages, and messages that sit unconsumed past the TTL, are routed to the DLX
args = {
    'x-dead-letter-exchange': 'orders.dlx',
    'x-dead-letter-routing-key': 'failed',
    'x-message-ttl': 300000  # 5 minutes
}
channel.queue_declare(queue='orders', durable=True, arguments=args)

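# Reject without requeue so the broker routes the failed message to the DLX
def on_message(ch, method, properties, body):
    try:
        process_order(body)  # hypothetical application handler, not defined here
    except Exception:
        ch.basic_reject(delivery_tag=method.delivery_tag, requeue=False)
    else:
        ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue='orders', on_message_callback=on_message)
channel.start_consuming()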
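Because the main queue above also sets a TTL, a message can arrive in the DLQ either because a consumer rejected it or because it expired. RabbitMQ records this in the `x-death` header it adds to dead-lettered messages. The sketch below summarizes that header; the function name `dead_letter_reasons` is hypothetical, and it assumes the headers arrive as a plain dictionary (as pika exposes them on `properties.headers`).

```python
from typing import Any, List, Mapping, Optional, Tuple


def dead_letter_reasons(
    headers: Optional[Mapping[str, Any]],
) -> List[Tuple[Optional[str], Optional[str], int]]:
    """Summarize x-death entries as (queue, reason, count) tuples."""
    if not headers:
        return []
    summary = []
    # x-death is a list with one entry per (queue, reason) combination.
    for entry in headers.get("x-death") or []:
        summary.append(
            (entry.get("queue"), entry.get("reason"), int(entry.get("count", 0)))
        )
    return summary


# Example headers shaped like those RabbitMQ attaches to a dead-lettered message.
sample_headers = {
    "x-death": [
        {"queue": "orders", "reason": "rejected", "count": 1, "exchange": ""},
    ]
}
print(dead_letter_reasons(sample_headers))
```

A DLQ consumer could log these tuples so operators can tell expired messages apart from rejected ones.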

Kafka Dead Letter Topic (Node.js + KafkaJS)

const { Kafka } = require('kafkajs');

const kafka = new Kafka({ brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'order-processors' });
const producer = kafka.producer();

// processOrder is the application's own handler and is not defined here.
// CommonJS has no top-level await, so the setup runs inside an async function.
async function run() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'orders', fromBeginning: false });
  await producer.connect();

  await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
      try {
        // message.value is a Buffer; decode it before parsing
        await processOrder(JSON.parse(message.value.toString()));
      } catch (err) {
        // Forward to the dead letter topic with error metadata
        await producer.send({
          topic: 'orders-dlq',
          messages: [{
            key: message.key,
            value: message.value,
            headers: {
              'error.type': err.name,
              'error.message': err.message,
              'original.topic': topic,
              'original.partition': String(partition),
              'original.offset': String(message.offset),
              // Hard-coded here; record the real attempt count if you retry first
              'retry.count': '3'
            }
          }]
        });
      }
    }
  });
}

run().catch(console.error);

Explanation

DLQ trigger conditions:

| Condition | When to DLQ | Action |
|---|---|---|
| Max retries exceeded | After N failed attempts | Move to DLQ |
| Unparseable message | Invalid JSON, schema mismatch | Move immediately |
| Missing dependency | Referenced record doesn’t exist | Retry, then DLQ |
| Business rule violation | Order for non-existent product | Move immediately |
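The trigger conditions above can be sketched as a small decision function. This is an illustration under stated assumptions, not a prescribed implementation: `MissingDependencyError`, `BusinessRuleError`, `Action`, and `decide` are hypothetical names, and `attempt` is taken to be the 1-based number of attempts made so far.

```python
import json
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    DEAD_LETTER = "dead_letter"


class MissingDependencyError(Exception):
    """Hypothetical: a referenced record was not found yet and may appear later."""


class BusinessRuleError(Exception):
    """Hypothetical: the message violates a business rule and can never succeed."""


def decide(error: Exception, attempt: int, max_attempts: int = 3) -> Action:
    """Map a processing failure to retry or dead-letter."""
    # Permanent failures: retrying cannot help, so dead-letter immediately.
    if isinstance(error, (json.JSONDecodeError, BusinessRuleError)):
        return Action.DEAD_LETTER
    # Anything else is treated as possibly transient until the budget is spent.
    if attempt >= max_attempts:
        return Action.DEAD_LETTER
    return Action.RETRY


print(decide(MissingDependencyError("customer 42 not found"), attempt=1))
print(decide(BusinessRuleError("unknown product"), attempt=1))
```

Treating unknown exception types as transient is a design choice; a stricter consumer might dead-letter anything it does not recognize.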

DLQ monitoring:

  • Depth alerting: DLQ > 10 messages triggers PagerDuty
  • Age alerting: Message in DLQ > 24 hours needs investigation
  • Replay tooling: Admin UI to reprocess or purge DLQ messages
  • Correlation: Link DLQ message to original trace ID. See distributed tracing.
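The depth and age checks above could be evaluated by a small function like the following sketch. `DlqSnapshot` and `evaluate_alerts` are hypothetical names, the thresholds mirror the figures in the list, and how the snapshot is collected (broker API, metrics system) and where alerts are sent is left to the caller.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class DlqSnapshot:
    depth: int                               # messages currently in the DLQ
    oldest_enqueued_at: Optional[datetime]   # None when the DLQ is empty


def evaluate_alerts(
    snapshot: DlqSnapshot,
    now: datetime,
    max_depth: int = 10,
    max_age: timedelta = timedelta(hours=24),
) -> List[str]:
    """Return a human-readable alert for each threshold that is exceeded."""
    alerts = []
    if snapshot.depth > max_depth:
        alerts.append(f"DLQ depth {snapshot.depth} exceeds {max_depth}")
    if snapshot.oldest_enqueued_at is not None:
        age = now - snapshot.oldest_enqueued_at
        if age > max_age:
            alerts.append(f"Oldest DLQ message is {age} old (limit {max_age})")
    return alerts


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
snapshot = DlqSnapshot(depth=12, oldest_enqueued_at=now - timedelta(hours=30))
for alert in evaluate_alerts(snapshot, now):
    print(alert)
```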

Variants

| Broker | DLQ Mechanism | Configuration |
|---|---|---|
| AWS SQS | Redrive policy | maxReceiveCount + target ARN |
| RabbitMQ | Dead letter exchange | x-dead-letter-exchange |
| Kafka | Consumer-managed | Separate topic + producer logic |
| Azure Service Bus | Built-in dead-letter subqueue | maxDeliveryCount; optional forwardDeadLetteredMessagesTo |
| Google Pub/Sub | Dead letter topic | deadLetterPolicy.maxDeliveryAttempts |

Best Practices

  • Set reasonable retry counts: 3-5 attempts usually balance recovery time against queue pressure
  • Include full context in DLQ: Original headers, retry count, error type, and stack trace
  • Separate DLQs by severity: Validation errors vs. infrastructure failures need different handling
  • Monitor DLQ depth as a metric: It’s a leading indicator of system health. See metrics collection.
  • Automate replay with caution: Replay after fixing the bug; replaying blindly amplifies failures
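To illustrate the "full context" practice above, here is a minimal sketch of wrapping a failed message in a JSON envelope before it is written to the DLQ. The function name `build_dlq_record` and the field names are assumptions made for this example, and the original headers are assumed to be JSON-serializable.

```python
import json
import traceback
from datetime import datetime, timezone
from typing import Mapping, Optional


def build_dlq_record(
    body: bytes,
    headers: Mapping[str, str],
    error: BaseException,
    retry_count: int,
    trace_id: Optional[str] = None,
) -> str:
    """Serialize the failed message together with its failure context."""
    record = {
        "original_body": body.decode("utf-8", errors="replace"),
        "original_headers": dict(headers),
        "retry_count": retry_count,
        "error_type": type(error).__name__,
        "error_message": str(error),
        "stack_trace": "".join(
            traceback.format_exception(type(error), error, error.__traceback__)
        ),
        "trace_id": trace_id,  # lets operators correlate with the original trace
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record)


try:
    raise ValueError("bad payload")
except ValueError as exc:
    print(build_dlq_record(b'{"id": 1}', {"source": "orders"}, exc, retry_count=3))
```

Keeping the original body as text makes the record easy to inspect; binary payloads might need base64 encoding instead.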

Common Mistakes

  1. No DLQ at all: Failed messages silently disappear or block consumers forever
  2. Infinite retry loops: Requeuing without a max count creates perpetual processing. Use retry with exponential backoff.
  3. Ignoring DLQ messages: The DLQ becomes a dumping ground that nobody monitors
  4. No dead-letter reason: Operators can’t distinguish “bad JSON” from “database down”
  5. Shared DLQ for all topics: One poison pill from topic A doesn’t belong with topic B’s failures
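As a guard against the infinite retry loop described in item 2, a bounded retry with exponential backoff might look like the sketch below. The names `retry_with_backoff` and `RetriesExhausted` and the default delays are assumptions for illustration; the `sleep` parameter exists only so the wait can be replaced in tests.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    """Raised when the retry budget is spent; the caller should dead-letter."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or max_attempts is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as err:
            if attempt == max_attempts:
                raise RetriesExhausted(f"gave up after {attempt} attempts") from err
            # Exponential backoff capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```

A consumer could catch `RetriesExhausted` and move the message to the DLQ, keeping the retry budget and the dead-letter decision in one place.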

Frequently Asked Questions

Q: Should I automatically replay DLQ messages? A: Only after identifying and fixing the root cause. Blind replay wastes resources and may re-trigger the same error.

Q: How long should I keep DLQ messages? A: Longer than your incident response SLA. 7-14 days is typical; archive to cheap storage beyond that.

Q: What’s the difference between a DLQ and a retry queue? A: Retry queues hold messages for later reprocessing. DLQs hold messages that have exhausted all retries.
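To make the replay advice in the first answer concrete, the following sketch replays DLQ messages one at a time and stops once failures reach a limit, so an unfixed bug does not consume the whole backlog. `replay_with_guard` and its parameters are hypothetical; fetching messages from the broker and deleting the ones that succeed are left to the caller.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

M = TypeVar("M")


def replay_with_guard(
    messages: Iterable[M],
    handler: Callable[[M], None],
    max_failures: int = 3,
) -> Tuple[List[M], List[M], List[M]]:
    """Replay in order; stop when failures reach max_failures.

    Returns (succeeded, failed, not_attempted).
    """
    succeeded: List[M] = []
    failed: List[M] = []
    pending = list(messages)
    while pending:
        if len(failed) >= max_failures:
            break  # the root cause is probably not fixed; leave the rest alone
        message = pending.pop(0)
        try:
            handler(message)
            succeeded.append(message)
        except Exception:
            failed.append(message)
    return succeeded, failed, pending


def example_handler(message: str) -> None:
    # Hypothetical handler: fails for any message that starts with "bad".
    if message.startswith("bad"):
        raise ValueError(message)


print(replay_with_guard(["ok1", "bad1", "bad2", "ok2"], example_handler, max_failures=2))
```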