Skip to content
SP StackPractices
advanced

Event-Driven Architecture — Queues, Topics, and Streams

A practical guide to event-driven architecture: events vs commands, message brokers, patterns like CQRS and Saga, and when to choose async over sync.

Event-Driven Architecture

Introduction

Event-driven architecture (EDA) is a pattern where services communicate by producing and consuming events rather than direct calls. It decouples producers from consumers, enables scalability, and naturally supports temporal decoupling — consumers do not need to be online when events are produced.

Events vs Commands

Understanding the difference is fundamental to designing EDA correctly.

EventCommand
IntentSomething happenedDo something
DirectionBroadcast (many may listen)Directed (one handler)
ExampleOrderPlacedChargeCustomer
Failure handlingConsumers handle their own retriesSender must know if it failed
CouplingLooseTighter
# Event: the order service announces an order was placed
def publish_order_placed(order):
    bus.publish("orders.placed", {
        "order_id": order.id,
        "user_id": order.user_id,
        "total": order.total
    })

# Command: the order service tells the payment service to charge
# (Do this only when the payment service MUST process it)
def charge_customer(payment_request):
    payment_service.charge(payment_request)  # synchronous or command queue

Rule of thumb: Prefer events. Use commands only when the action must happen and the caller needs to know the result.

Message Broker Types

Queues (Point-to-Point)

One message → one consumer. Good for distributing work.

import pika

# Producer
channel.basic_publish(exchange='', routing_key='email_queue', body=message)

# Consumer (one of many workers)
channel.basic_consume(queue='email_queue', on_message_callback=process_email)

Use for: background jobs, task queues, load leveling.

Topics (Publish-Subscribe)

One message → many consumers. Good for broadcasting.

# Kafka: one event, multiple consumer groups
producer.send('orders', order_event)

# Consumer group A: sends confirmation email
consumer_a.subscribe(['orders'])

# Consumer group B: updates analytics warehouse
consumer_b.subscribe(['orders'])

Use for: fan-out, event sourcing, cross-service notifications.

Streams

Ordered, replayable, durable log of events.

FeatureQueueTopicStream
DurabilityUntil consumedUntil all groups consumeRetained by policy (days)
OrderingWithin queueWithin partitionWithin partition
ReplayNoNoYes
ParallelismMultiple consumersConsumer groupsConsumer groups

Use streams when: you need replay, ordering guarantees, or event sourcing.

Core Patterns

1. Event Notification

The simplest pattern: one service notifies others that something happened.

Order Service ──OrderPlaced──> Email Service (send confirmation)
               ──OrderPlaced──> Analytics Service (record metrics)
               ──OrderPlaced──> Inventory Service (reserve stock)

Trade-off: Consumers are responsible for fetching data they need. The event is a notification, not a payload.

2. Event-Carried State Transfer

The event carries the data consumers need, eliminating extra queries.

{
  "event_type": "OrderPlaced",
  "order_id": "ord-123",
  "user_id": "usr-456",
  "items": [
    {"sku": "A1", "qty": 2, "price": 10.00}
  ],
  "total": 20.00,
  "timestamp": "2024-06-12T10:00:00Z"
}

Trade-off: Events are larger and may carry data consumers do not need. Versioning becomes important as schemas evolve.

3. CQRS (Command Query Responsibility Segregation)

Separate read and write models. Writes go to the command model; reads come from optimized read models populated by events.

┌──────────────┐    OrderPlaced event     ┌──────────────┐
│  Command     │ ───────────────────────> │  Read Model  │
│  Model       │                          │  (Elastic)   │
│  (PostgreSQL)│                          │  for search  │
└──────────────┘                          └──────────────┘

When to use: Read and write patterns differ significantly (e.g., relational writes, search-optimized reads).

4. Saga Pattern

Manage distributed transactions using a sequence of local transactions, each publishing an event that triggers the next.

Order Service: create order → publish OrderCreated
Payment Service: charge card → publish PaymentProcessed
Inventory Service: reserve stock → publish InventoryReserved
Shipping Service: create shipment → publish ShipmentCreated

Compensating transactions undo previous steps if a later step fails:

def on_payment_failed(event):
    # Compensate: cancel the order
    order_service.cancel(event.order_id)
    inventory_service.release(event.order_id)

When to use: Long-running business processes that span multiple services.

When to Choose Async Over Sync

Sync (REST/gRPC)Async (Events)
Real-time response neededEventual consistency acceptable
Tight coupling acceptableLoose coupling required
Simple failure modes acceptableComplex failure handling acceptable
Low latency criticalThroughput and resilience critical

Best Practices

  • Design events as facts, not instructionsOrderPlaced, not ProcessOrder
  • Include correlation IDs — trace a request across services and time
  • Make consumers idempotent — at-least-once delivery means events may be processed twice
  • Version your eventsOrderPlacedV1, OrderPlacedV2 to support gradual migration
  • Monitor consumer lag — lagging consumers are a sign of scaling or performance issues
  • Use dead letter queues — failed messages should not block the queue; analyze them separately

Common Mistakes

  • Treating events as commands — events announce facts; they do not demand action
  • Not handling duplicate delivery — assume at-least-once and design for idempotency
  • Ignoring consumer lag until it is a crisis — monitor and alert on lag metrics
  • Building custom message brokers — use proven systems (Kafka, RabbitMQ, NATS, AWS SNS/SQS)
  • Using events for simple request/response — adds unnecessary complexity

Frequently Asked Questions

How do I debug an event-driven system?

Use distributed tracing (OpenTelemetry, Jaeger) and correlation IDs. Log every event produced and consumed with the same trace ID. Build a “trace viewer” that shows the path of a request across services.

What if a consumer is down when an event is published?

With durable message brokers (Kafka, persistent queues), events are retained. The consumer catches up when it comes back online. Set retention policies based on your recovery time objectives.

Should every microservice communication be async?

No. Use sync for real-time queries and when the caller needs an immediate answer. Use async for background work, notifications, and decoupling. A healthy system uses both.