Throttling Pattern

Overview

The Throttling Pattern controls the rate at which a system processes requests or consumes resources to prevent overload and ensure fair resource allocation. Instead of accepting all incoming requests immediately, the system limits the rate based on capacity, user tiers, or resource availability.

Throttling helps prevent cascading failures by keeping downstream services and shared resources from being overwhelmed. It is commonly used in APIs, message consumers, database connections, and third-party integrations where unbounded throughput could degrade service or cause runaway costs.

When to Use

Protecting downstream services from traffic spikes
Enforcing API rate limits for consumers
Preventing database connection pool exhaustion
Managing costs with metered third-party APIs
Ensuring fair resource allocation in multi-tenant systems
Mitigating denial-of-service attacks or accidental abuse

When to Avoid

Internal services within the same trust boundary with predictable load
Systems where any request rejection violates business requirements
When the bottleneck is not request rate but data size or complexity
Latency-sensitive paths where throttling adds unacceptable delay

Solution

Python (Token Bucket)

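import threading
import time
from dataclasses import dataclass, field


class RateLimitExceeded(Exception):
    """Raised when a request is rejected by the throttle."""


@dataclass
class TokenBucket:
    capacity: int        # maximum tokens the bucket can hold (burst size)
    refill_rate: float   # tokens added per second
    tokens: float = field(init=False)
    last_refill: float = field(init=False)
    lock: threading.Lock = field(default_factory=threading.Lock, init=False, repr=False)

    def __post_init__(self):
        # Start full so an initial burst of up to `capacity` requests is allowed.
        self.tokens = float(self.capacity)
        # Monotonic clock: unaffected by wall-clock adjustments.
        self.last_refill = time.monotonic()

    def acquire(self, tokens: int = 1) -> bool:
        """Take `tokens` from the bucket; return False without blocking if too few remain."""
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            # Refill in proportion to elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
            self.last_refill = now

            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False


class ThrottledAPI:
    def __init__(self):
        # Bursts of up to 10 requests, sustained rate of 2 requests/second.
        self.bucket = TokenBucket(capacity=10, refill_rate=2)

    def call(self, endpoint: str, data: dict) -> dict:
        if not self.bucket.acquire():
            raise RateLimitExceeded("Rate limit exceeded. Try again later.")

        # Process request
        return {"status": "success", "endpoint": endpoint}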

Java (Guava RateLimiter)

import com.google.common.util.concurrent.RateLimiter;
import org.springframework.stereotype.Service;

@Service
public class ThrottledService {
    private final RateLimiter limiter = RateLimiter.create(10.0); // 10 permits/second

    public String processRequest(String request) {
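        // Blocks the calling thread until a permit is available.
        // Use limiter.tryAcquire() instead to reject immediately when no permit is free.
        limiter.acquire();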
        return "Processed: " + request;
    }
}

JavaScript (Sliding Window Log)

class SlidingWindowThrottle {
    constructor(windowMs, maxRequests) {
        this.windowMs = windowMs;
        this.maxRequests = maxRequests;
        this.requests = new Map();
    }

    isAllowed(clientId) {
        const now = Date.now();
        const windowStart = now - this.windowMs;

        // Keep only the timestamps that still fall inside the rolling window.
        const recent = (this.requests.get(clientId) || []).filter(t => t > windowStart);
        const allowed = recent.length < this.maxRequests;

        if (allowed) {
            recent.push(now);
        }

        // Store the pruned log; drop the key once a client has no recent requests.
        if (recent.length > 0) {
            this.requests.set(clientId, recent);
        } else {
            this.requests.delete(clientId);
        }
        return allowed;
    }
}

Explanation

Throttling algorithms balance fairness and efficiency:

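Token bucket: Tokens are added at a fixed rate, up to a maximum capacity. Each request consumes tokens. Allows short bursts while maintaining a long-term average rate.
Leaky bucket: Requests enter a fixed-size queue and leak out at a constant rate. Smooths traffic but drops requests that arrive when the queue is full.
Fixed window: Counts requests in fixed time intervals and resets the counter at each boundary. Simple, but a client can send up to twice the limit in a short span that straddles a boundary.
Sliding window: Tracks exact timestamps within a rolling window. More accurate than a fixed window, at the cost of storing one timestamp per request.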
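The leaky bucket is the one algorithm above without an implementation in the Solution section. The following is a minimal sketch of the queue-based form, not a definitive implementation. The class name LeakyBucket, the offer/drain methods, and the injectable clock parameter are assumptions made for illustration.

```python
import time
from collections import deque


class LeakyBucket:
    """Leaky bucket as a bounded queue drained at a constant rate."""

    def __init__(self, capacity, leak_rate, clock=time.monotonic):
        self.capacity = capacity    # maximum number of queued requests
        self.leak_rate = leak_rate  # requests released per second
        self.clock = clock          # injectable clock, so tests can control time
        self.queue = deque()
        self.last_leak = clock()

    def offer(self, request):
        """Enqueue a request; return False (drop it) if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        if not self.queue:
            # Restart the drain timer so idle time does not turn into a burst.
            self.last_leak = self.clock()
        self.queue.append(request)
        return True

    def drain(self):
        """Return the queued requests that are due since the last release."""
        now = self.clock()
        due = int((now - self.last_leak) * self.leak_rate)
        released = []
        while self.queue and len(released) < due:
            released.append(self.queue.popleft())
        if due:
            # Advance by whole requests only, keeping fractional progress.
            self.last_leak += due / self.leak_rate
        return released
```

A worker would call drain() periodically and process whatever it returns. Unlike the token bucket, output never exceeds leak_rate, even right after an idle period.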

Variants

Variant	Behavior	Best For
Token bucket	Bursts allowed up to capacity	APIs needing burst tolerance
Leaky bucket	Constant outflow rate	Smoothing traffic to downstream
Fixed window	Reset counter per interval	Simple implementations
Sliding window	Rolling time window	Accurate per-client rate limits
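To make the fixed window row concrete, here is a small sketch of a per-client counter that resets each interval. The class name FixedWindowCounter and the clock parameter are hypothetical, chosen for illustration only.

```python
import time


class FixedWindowCounter:
    """Per-client counter that resets at the start of each interval."""

    def __init__(self, limit, window_s, clock=time.time):
        self.limit = limit        # requests allowed per interval
        self.window_s = window_s  # interval length in seconds
        self.clock = clock        # injectable clock, so tests can control time
        self.counters = {}        # client_id -> (window_index, count)

    def is_allowed(self, client_id):
        window = int(self.clock() // self.window_s)
        seen_window, count = self.counters.get(client_id, (window, 0))
        if seen_window != window:
            count = 0  # a new interval has started: reset the counter
        allowed = count < self.limit
        self.counters[client_id] = (window, (count + 1) if allowed else count)
        return allowed
```

With limit=5 and window_s=60, a client that sends five requests at second 59 and five more at second 60 has all ten accepted. This is the boundary burst that the sliding window variant avoids.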

Best Practices

Return 429 Too Many Requests with Retry-After header for HTTP APIs
Differentiate between user tiers with different limits
Monitor rejection rates as an early warning signal
Implement exponential backoff with jitter in clients that are throttled
Consider distributed rate limiting for multi-instance deployments
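The first and fourth practices above can be sketched together: a server-side helper that derives a Retry-After value from token bucket state, and a client-side delay with exponential backoff and jitter. The function names throttle_response and backoff_delay, and the base and cap defaults, are assumptions for illustration rather than part of any framework.

```python
import math
import random


def throttle_response(tokens_available, tokens_needed, refill_rate):
    """Return (status_code, headers) for a request checked against a token bucket."""
    if tokens_available >= tokens_needed:
        return 200, {}
    # Seconds until the bucket has refilled enough, rounded up to whole seconds.
    deficit = tokens_needed - tokens_available
    retry_after = max(1, math.ceil(deficit / refill_rate))
    return 429, {"Retry-After": str(retry_after)}


def backoff_delay(attempt, retry_after=None, base=0.5, cap=30.0):
    """Client-side wait before retry number `attempt` (0-based), in seconds."""
    # Exponential growth capped at `cap`, with full jitter to spread out retries.
    delay = random.uniform(0, min(cap, base * 2 ** attempt))
    # Never retry sooner than the server asked for.
    if retry_after is not None:
        delay = max(delay, float(retry_after))
    return delay
```

The jitter matters when many clients are throttled at once. Without it they tend to retry in lockstep and hit the limit again together.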

Common Mistakes

Throttling without communicating limits to clients
Using the same limits for all users regardless of tier
Not handling clock skew in distributed systems
Forgetting to clean up expired entries in window-based algorithms
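The last mistake is easy to make with a sliding window log, because idle clients are never looked up again and their entries linger. One possible approach, sketched here under assumed names (SlidingWindowLog, sweep), is a periodic pass that prunes every log and drops the empty ones.

```python
import time
from collections import deque


class SlidingWindowLog:
    """Sliding window log with an explicit sweep for idle clients."""

    def __init__(self, window_s, max_requests, clock=time.monotonic):
        self.window_s = window_s
        self.max_requests = max_requests
        self.clock = clock  # injectable clock, so tests can control time
        self.logs = {}      # client_id -> deque of request timestamps

    def _evict(self, log, now):
        # Drop timestamps that have fallen out of the rolling window.
        cutoff = now - self.window_s
        while log and log[0] <= cutoff:
            log.popleft()

    def is_allowed(self, client_id):
        now = self.clock()
        log = self.logs.setdefault(client_id, deque())
        self._evict(log, now)
        if len(log) >= self.max_requests:
            return False
        log.append(now)
        return True

    def sweep(self):
        """Remove clients with no requests left in the window; return how many."""
        now = self.clock()
        removed = 0
        for client_id in list(self.logs):
            log = self.logs[client_id]
            self._evict(log, now)
            if not log:
                del self.logs[client_id]
                removed += 1
        return removed
```

Calling sweep() from a timer or background task keeps memory proportional to the number of recently active clients rather than every client ever seen.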

Real-World Examples

GitHub API

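GitHub's REST API enforces rate limits of 5,000 requests per hour for authenticated users and 60 requests per hour per IP address for unauthenticated requests. Exceeding a limit returns 403 or 429, and the X-RateLimit-Reset header indicates when the limit resets.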
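A client can use these headers to decide how long to pause. The sketch below assumes the response headers are available as a plain dict and that X-RateLimit-Reset holds a UTC epoch timestamp in seconds, alongside an X-RateLimit-Remaining count. The function name seconds_until_reset is hypothetical.

```python
import time
from typing import Optional


def seconds_until_reset(headers: dict, now: Optional[float] = None) -> float:
    """Return how long to wait before the next request, based on rate limit headers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {k.lower(): v for k, v in headers.items()}
    remaining = int(normalized.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0  # quota left: no need to wait
    reset_at = normalized.get("x-ratelimit-reset")
    if reset_at is None:
        return 0.0  # no reset time advertised: caller falls back to its own backoff
    now = time.time() if now is None else now
    return max(0.0, int(reset_at) - now)
```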

AWS API Gateway

API Gateway supports throttling at the account, stage, and method levels using the token bucket algorithm, with a burst limit that absorbs short traffic spikes.

Frequently Asked Questions

Q: What is the difference between throttling and backpressure? A: Throttling rejects or delays requests at the entry point. Backpressure signals upstream to slow down production. They are often used together.

Q: How do I throttle across multiple servers? A: Keep the token counts or request logs in a shared store such as Redis, so that every instance enforces the same limit.
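As a rough sketch of that answer, the function below keeps a fixed-window counter in a shared store. It assumes the store exposes incr(key) and expire(key, seconds) with the semantics of the Redis INCR and EXPIRE commands. InMemoryStore is a hypothetical stand-in so the example runs without a server, and the key format is illustrative.

```python
import time


class InMemoryStore:
    """Hypothetical stand-in exposing the two calls used below: incr and expire."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.values = {}
        self.expiry = {}

    def incr(self, key):
        # Treat an expired key as absent, then increment atomically (single process).
        if key in self.expiry and self.clock() >= self.expiry[key]:
            self.values.pop(key, None)
            self.expiry.pop(key, None)
        self.values[key] = self.values.get(key, 0) + 1
        return self.values[key]

    def expire(self, key, seconds):
        if key not in self.values:
            return False
        self.expiry[key] = self.clock() + seconds
        return True


def is_allowed(store, client_id, limit, window_s, now=None):
    """Fixed-window check whose counter lives in a store shared by all instances."""
    now = time.time() if now is None else now
    window = int(now // window_s)
    key = f"throttle:{client_id}:{window}"  # one counter per client per window
    count = store.incr(key)
    if count == 1:
        # First hit in this window: let the key expire once the window is over.
        store.expire(key, int(window_s) * 2)
    return count <= limit
```

Every application instance would pass the same store client, so the count is global rather than per process. Note that the increment and the expiry here are two separate calls, so a production version would likely want to make them atomic.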

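Q: Should I queue or reject throttled requests? A: For user-facing APIs, reject with 429 so the client can retry. For background processing, queue the work and make the resulting delay visible, for example through queue depth or an estimated wait time.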