Throttling Pattern
Limit the rate at which a system processes requests or consumes resources to prevent overload, ensure fair usage, and maintain predictable performance under varying load.
Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.
Throttling Pattern
Overview
The Throttling Pattern controls the rate at which a system processes requests or consumes resources to prevent overload and ensure fair resource allocation. Instead of accepting all incoming requests immediately, the system limits the rate based on capacity, user tiers, or resource availability.
Throttling prevents cascading failures by ensuring downstream services and shared resources are not overwhelmed. It is commonly used in APIs, message consumers, database connections, and third-party integrations where unbounded throughput could cause service degradation or cost explosion.
When to Use
- Protecting downstream services from traffic spikes
- Enforcing API rate limits for consumers
- Controlling database connection pool exhaustion
- Managing costs with metered third-party APIs
- Ensuring fair resource allocation in multi-tenant systems
- Preventing DDoS or accidental abuse
When to Avoid
- Internal services within the same trust boundary with predictable load
- Systems where any request rejection violates business requirements
- When the bottleneck is not request rate but data size or complexity
- Latency-sensitive paths where throttling adds unacceptable delay
Solution
Python (Token Bucket)
import time
import threading
from dataclasses import dataclass
@dataclass
class TokenBucket:
capacity: int
refill_rate: float
tokens: float = 0
last_refill: float = 0
lock: threading.Lock = None
def __post_init__(self):
self.tokens = self.capacity
self.last_refill = time.time()
self.lock = threading.Lock()
def acquire(self, tokens: int = 1) -> bool:
with self.lock:
now = time.time()
elapsed = now - self.last_refill
self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
self.last_refill = now
if self.tokens >= tokens:
self.tokens -= tokens
return True
return False
class ThrottledAPI:
def __init__(self):
self.bucket = TokenBucket(capacity=10, refill_rate=2)
def call(self, endpoint: str, data: dict) -> dict:
if not self.bucket.acquire():
raise RateLimitExceeded("Rate limit exceeded. Try again later.")
# Process request
return {"status": "success", "endpoint": endpoint}
class RateLimitExceeded(Exception):
pass
Java (Guava RateLimiter)
import com.google.common.util.concurrent.RateLimiter;
import org.springframework.stereotype.Service;
@Service
public class ThrottledService {
private final RateLimiter limiter = RateLimiter.create(10.0); // 10 permits/second
public String processRequest(String request) {
limiter.acquire(); // Blocks until permit available
return "Processed: " + request;
}
}
JavaScript (Sliding Window Log)
class SlidingWindowThrottle {
constructor(windowMs, maxRequests) {
this.windowMs = windowMs;
this.maxRequests = maxRequests;
this.requests = new Map();
}
isAllowed(clientId) {
const now = Date.now();
const windowStart = now - this.windowMs;
if (!this.requests.has(clientId)) {
this.requests.set(clientId, []);
}
const clientRequests = this.requests.get(clientId);
const recent = clientRequests.filter(t => t > windowStart);
if (recent.length < this.maxRequests) {
recent.push(now);
this.requests.set(clientId, recent);
return true;
}
this.requests.set(clientId, recent);
return false;
}
}
Explanation
Throttling algorithms balance fairness and efficiency:
- Token bucket: Tokens are added at a fixed rate. Requests consume tokens. Allows short bursts while maintaining long-term average rate.
- Leaky bucket: Requests enter a fixed-size queue and leak out at a constant rate. Smooths traffic but drops overflow.
- Fixed window: Count requests in time windows. Simple but allows burst at window boundaries.
- Sliding window: More accurate by tracking exact timestamps within a rolling window.
Variants
| Variant | Behavior | Best For |
|---|---|---|
| Token bucket | Bursts allowed up to capacity | APIs needing burst tolerance |
| Leaky bucket | Constant outflow rate | Smoothing traffic to downstream |
| Fixed window | Reset counter per interval | Simple implementations |
| Sliding window | Rolling time window | Accurate per-client rate limits |
Best Practices
- Return
429 Too Many RequestswithRetry-Afterheader for HTTP APIs - Differentiate between user tiers with different limits
- Monitor rejection rates as an early warning signal
- Implement backoff for clients that are throttled
- Consider distributed rate limiting for multi-instance deployments
Common Mistakes
- Throttling without communicating limits to clients
- Using same limits for all users regardless of tier
- Not handling clock skew in distributed systems
- Forgetting to clean up expired entries in window-based algorithms
Real-World Examples
GitHub API
GitHub enforces rate limits per authenticated user (5000 requests/hour) and per IP (60 requests/hour). Exceeding limits returns 403 with X-RateLimit-Reset header.
AWS API Gateway
API Gateway supports throttling at account, stage, and method levels using token bucket algorithms, with burst capacity for traffic spikes.
Frequently Asked Questions
Q: What is the difference between throttling and backpressure? A: Throttling rejects or delays requests at the entry point. Backpressure signals upstream to slow down production. They are often used together.
Q: How do I throttle across multiple servers? A: Use a shared store (Redis) to maintain token counts or request logs across instances.
Q: Should I queue or reject throttled requests? A: For user-facing APIs, reject with 429. For background processing, queue with visible delay.