Skip to content
SP StackPractices
intermediate By StackPractices

Message Queues — RabbitMQ, Kafka, and SQS Deep Dive

A comprehensive guide to message queues: when to use RabbitMQ, Kafka, or SQS. Covers patterns, throughput, ordering, and operational considerations.

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

Message queues are the nervous system of distributed systems. They decouple producers from consumers, absorb traffic spikes, and enable async workflows that would be impossible with synchronous HTTP alone. But choosing the wrong queue or using it incorrectly turns a resilience tool into a source of data loss, ordering bugs, and operational nightmares. This guide compares RabbitMQ, Kafka, and AWS SQS — the three most common choices — and explains when to use each, how to configure them, and what pitfalls to avoid.

When to Use

Use this guide when:

  • You need to choose between RabbitMQ, Kafka, and SQS for a new architecture
  • You are experiencing message loss, duplicate processing, or ordering issues in your queue system
  • Your team is designing event-driven microservices and needs to understand queue patterns

Solution

Comparison Matrix

ConcernRabbitMQKafkaAWS SQS
Throughput10K–50K msg/s1M+ msg/sUnlimited (auto-scales)
OrderingPer-queue (FIFO via plugin)Per-partitionFIFO queues available
PersistenceDisk + memoryDurable by design14 days max retention
Delivery ModelPush to consumerPull by consumerPull by consumer
ReplayNo (unless DLQ)Yes (any offset)No
RoutingExchange + routing keysTopic partitionsNo native routing
Operational ModelSelf-hosted / managedSelf-hosted / managedFully managed
Best ForTask queues, RPCEvent streaming, logsSimple decoupling

RabbitMQ Task Queue Example

# Producer
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='task_queue', durable=True)

message = "Process order #12345"
channel.basic_publish(
    exchange='',
    routing_key='task_queue',
    body=message,
    properties=pika.BasicProperties(delivery_mode=2)  # persistent
)
connection.close()

# Consumer
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='task_queue', durable=True)
channel.basic_qos(prefetch_count=1)  # fair dispatch

def callback(ch, method, properties, body):
    print(f"Received {body.decode()}")
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue='task_queue', on_message_callback=callback)
channel.start_consuming()

Kafka Producer/Consumer

from kafka import KafkaProducer, KafkaConsumer
import json

# Producer
producer = KafkaProducer(
    bootstrap_servers=['kafka:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    acks='all',  # wait for all replicas
    retries=3
)
producer.send('orders', {'order_id': '12345', 'amount': 99.99})
producer.flush()

# Consumer
consumer = KafkaConsumer(
    'orders',
    bootstrap_servers=['kafka:9092'],
    group_id='order-processors',
    auto_offset_reset='earliest',
    enable_auto_commit=False  # manual commit for at-least-once
)

for message in consumer:
    process_order(message.value)
    consumer.commit()  # commit after successful processing

SQS Producer/Consumer (Boto3)

import boto3

sqs = boto3.client('sqs')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789/my-queue'

# Send
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody='Process order #12345',
    MessageAttributes={
        'OrderType': {'StringValue': 'premium', 'DataType': 'String'}
    }
)

# Receive
response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,  # long polling
    VisibilityTimeout=300
)

for msg in response.get('Messages', []):
    process_message(msg['Body'])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])

Dead Letter Queue Pattern

# Pseudo-code for DLQ routing
MAX_RETRIES = 3

def process_with_dlq(message):
    try:
        process(message)
        ack(message)
    except Exception as e:
        retry_count = get_retry_count(message)
        if retry_count >= MAX_RETRIES:
            send_to_dlq(message, reason=str(e))
        else:
            nack_with_delay(message, delay=2 ** retry_count)

Explanation

The fundamental difference between these systems is push vs pull and persistence model. RabbitMQ pushes messages to consumers and keeps them in memory (with disk backup); this gives low latency but limits throughput. Kafka consumers pull from durable logs; this enables replay and massive throughput but introduces pull latency. SQS is a managed pull system with no replay but infinite scalability.

Ordering is a common source of confusion. RabbitMQ guarantees order within a single queue only if you have one consumer. Kafka guarantees order within a partition, but if you scale to multiple partitions, order is only preserved per partition key. SQS standard queues offer no ordering; FIFO queues do but at lower throughput.

The dead letter queue is not a luxury — it is a requirement. Without it, poison messages (messages that always fail processing) will block your queue indefinitely. Every production queue system should have a DLQ with alerting.

Variants

PatternQueue TypeUse Case
Work QueueRabbitMQ, SQSDistribute tasks among workers
Pub/SubRabbitMQ (fanout), KafkaBroadcast events to multiple consumers
Event SourcingKafkaImmutable event log as system of record
Request/ReplyRabbitMQ (RPC over queues)Async request-response
Scheduled JobsRabbitMQ (delayed plugin), SQSDelayed or recurring task execution
Priority QueueRabbitMQProcess high-priority messages first

Best Practices

  1. Always use durable queues and persistent messages in production; memory-only queues lose data on restart
  2. Set visibility timeouts longer than your max processing time; SQS will re-deliver mid-processing messages
  3. Implement idempotent consumers; at-least-once delivery means you will process duplicates
  4. Monitor consumer lag (Kafka) or approximate age of oldest message (SQS); lag is your canary
  5. Use schema validation (Avro, JSON Schema) before publishing; invalid messages are expensive to debug in production

Common Mistakes

  1. Treating queues as databases; queues are for transient messaging, not long-term storage
  2. Not handling message ordering explicitly; assuming global order when only per-partition/per-queue order exists
  3. Using auto-commit in Kafka without understanding the at-most-once vs at-least-once trade-off
  4. Not configuring DLQs; one poison message can block an entire queue
  5. Ignoring backpressure; if consumers are slower than producers, your queue grows until it crashes

Frequently Asked Questions

When should I choose Kafka over RabbitMQ?

Choose Kafka when you need event streaming, replay, or very high throughput (100K+ messages/sec). Kafka treats the log as the primary abstraction, making it ideal for event sourcing, log aggregation, and real-time analytics. Choose RabbitMQ when you need complex routing (exchanges, routing keys), RPC patterns, or priority queues. RabbitMQ is more flexible for traditional messaging patterns; Kafka is better for data pipelines.

How do I prevent duplicate message processing?

Use idempotent consumers: design your processing logic so that processing the same message twice produces the same result. Store a processed-message ID in a database with a unique constraint. Alternatively, use exactly-once semantics where supported (Kafka transactions, SQS FIFO with deduplication). But remember: exactly-once has performance overhead and complexity. Idempotency is simpler and more reliable.

What is consumer lag and how do I fix it?

Consumer lag is the difference between the newest message in the queue and the oldest unprocessed message. High lag means your consumers are slower than your producers. Fix it by: (1) scaling out consumers (add more instances), (2) optimizing processing time (profile your consumer code), (3) reducing message size, or (4) splitting into more partitions (Kafka) or queues (RabbitMQ). If lag is chronic, your architecture may need redesign — queues absorb spikes, not sustained overload.