Skip to content
SP StackPractices
intermediate

Bulkhead Pattern

Isolate different parts of an application into pools so that a failure in one does not cascade to others. A resilience pattern for fault containment.

Topics: design

Bulkhead Pattern

Overview

The Bulkhead Pattern is a resilience pattern that isolates different parts of an application into separate pools, ensuring that a failure in one part does not cascade to others. Named after the watertight compartments (bulkheads) in ships, this pattern limits the scope of failures by allocating dedicated resources to different components or clients.

When to Use

Use the Bulkhead Pattern when:

  • You have multiple independent components that share a thread pool or connection pool
  • A slow or failing component should not consume all available resources
  • You need to ensure critical operations always have dedicated capacity
  • You want to degrade gracefully by isolating failures to specific subsystems
  • Examples: microservices with different SLAs, API gateways, multi-tenant systems

Solution

Python

import threading
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from queue import Queue

class Bulkhead:
    def __init__(self, name: str, max_workers: int, queue_size: int = 10):
        self.name = name
        self.executor = ThreadPoolExecutor(max_workers=max_workers)
        self.queue = Queue(maxsize=queue_size)

    def execute(self, func, timeout: float = 5.0):
        if self.queue.full():
            raise RuntimeError(f"Bulkhead '{self.name}' queue full — rejecting request")
        self.queue.put(1)
        try:
            future = self.executor.submit(func)
            return future.result(timeout=timeout)
        finally:
            self.queue.get()

# Usage: separate bulkheads for critical and non-critical operations
critical = Bulkhead("critical", max_workers=4)
background = Bulkhead("background", max_workers=2)

def slow_operation():
    time.sleep(3)
    return "done"

# Critical ops always have capacity
try:
    result = critical.execute(slow_operation, timeout=5)
    print(result)
except FutureTimeout:
    print("Critical operation timed out")

# Background ops are limited — won't starve critical pool
try:
    result = background.execute(slow_operation, timeout=5)
    print(result)
except RuntimeError as e:
    print(e)

JavaScript

class Bulkhead {
  constructor(name, maxConcurrent) {
    this.name = name;
    this.maxConcurrent = maxConcurrent;
    this.running = 0;
    this.queue = [];
  }

  async execute(fn, timeoutMs = 5000) {
    if (this.running >= this.maxConcurrent) {
      throw new Error(`Bulkhead '${this.name}' at capacity — rejecting request`);
    }

    this.running++;
    try {
      const timeout = new Promise((_, reject) =>
        setTimeout(() => reject(new Error("Timeout")), timeoutMs)
      );
      return await Promise.race([fn(), timeout]);
    } finally {
      this.running--;
    }
  }
}

// Usage
const critical = new Bulkhead("critical", 4);
const background = new Bulkhead("background", 2);

async function slowOperation() {
  await new Promise(r => setTimeout(r, 3000));
  return "done";
}

// Critical ops
async function runCritical() {
  try {
    const result = await critical.execute(slowOperation, 5000);
    console.log(result);
  } catch (e) {
    console.log("Critical failed:", e.message);
  }
}

runCritical();

Java

import java.util.concurrent.*;

public class Bulkhead {
    private final String name;
    private final Semaphore semaphore;

    public Bulkhead(String name, int maxConcurrent) {
        this.name = name;
        this.semaphore = new Semaphore(maxConcurrent);
    }

    public <T> T execute(Callable<T> task, long timeoutMs) throws Exception {
        if (!semaphore.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS)) {
            throw new RuntimeException("Bulkhead '" + name + "' at capacity — rejecting request");
        }
        try {
            return task.call();
        } finally {
            semaphore.release();
        }
    }
}

// Usage
Bulkhead critical = new Bulkhead("critical", 4);
Bulkhead background = new Bulkhead("background", 2);

String result = critical.execute(() -> {
    Thread.sleep(3000);
    return "done";
}, 5000);

Explanation

The Bulkhead Pattern separates resources into isolated pools:

  • Thread/Connection Pools: Each component gets its own pool instead of sharing one
  • Semaphores/Queues: Limit concurrent operations per component
  • Rejection: When a pool is exhausted, new requests are rejected rather than queued indefinitely

This ensures that a runaway consumer (e.g., a background job) cannot consume all threads, leaving nothing for critical operations (e.g., user-facing API calls).

Variants

VariantDescriptionUse Case
Thread Pool IsolationSeparate thread pools per componentHystrix-style command isolation
Connection Pool IsolationSeparate DB/cache connections per serviceMulti-tenant databases
Semaphore IsolationLightweight, same-thread isolationWhen thread overhead is too high
Process IsolationSeparate OS processes or containersMaximum fault containment

Best Practices

  • Always reserve capacity for critical paths — don’t let background jobs starve user requests
  • Monitor pool saturation — track how often each bulkhead rejects or times out
  • Combine with Circuit Breaker — if a bulkhead is constantly saturated, the breaker should trip
  • Use semaphores instead of thread pools when thread creation is expensive or limited
  • Document and enforce SLAs per bulkhead so teams know the capacity boundaries

Common Mistakes

  • Using a single shared pool for everything, allowing one slow component to freeze the app
  • Setting pool sizes too small, causing unnecessary rejections under normal load
  • Not monitoring or alerting on bulkhead saturation
  • Isolating too granularly, creating resource fragmentation
  • Forgetting that queueing also consumes memory — bounded queues are essential

Frequently Asked Questions

Q: What is the difference between Bulkhead and Circuit Breaker? A: Bulkhead isolates resources to prevent one failure from affecting others. Circuit Breaker stops sending requests to a failing service. They complement each other: Bulkhead contains the blast radius, Circuit Breaker stops the bleeding.

Q: Should I use thread pools or semaphores for bulkheads? A: Thread pools provide the strongest isolation but have higher overhead. Semaphores are lighter and run on the calling thread — use them when you need many concurrent bulkheads or want to avoid thread starvation.