Batch Processing Patterns
Design robust batch processing pipelines for large datasets with retry logic, idempotency, and observability.
Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.
Overview
Batch processing is the backbone of data pipelines, ETL workflows, and report generation. Unlike stream processing, batch jobs process bounded datasets in chunks, making them simpler to reason about but requiring careful attention to idempotency, fault tolerance, and observability.
When to Use
Use this resource when:
- Processing large datasets that do not fit in memory
- Building ETL pipelines for data warehouses
- Generating nightly reports or aggregations
- Migrating data between systems with downtime windows
Solution
Resilient Batch Pipeline (Python)
import logging
from typing import Callable, List, Iterator
class BatchProcessor:
def __init__(self, batch_size: int = 1000, max_retries: int = 3):
self.batch_size = batch_size
self.max_retries = max_retries
self.processed = 0
self.failed = []
def process(
self,
items: Iterator[dict],
handler: Callable[[List[dict]], None]
) -> dict:
batch = []
for item in items:
batch.append(item)
if len(batch) >= self.batch_size:
self._execute(batch, handler)
batch = []
if batch:
self._execute(batch, handler)
return {"processed": self.processed, "failed": len(self.failed)}
def _execute(self, batch: List[dict], handler: Callable):
for attempt in range(self.max_retries):
try:
handler(batch)
self.processed += len(batch)
return
except Exception as e:
logging.warning(f"Batch failed (attempt {attempt + 1}): {e}")
if attempt == self.max_retries - 1:
self.failed.extend(batch)
Idempotent Job Tracking (SQL)
CREATE TABLE job_runs (
job_id VARCHAR(64) PRIMARY KEY,
started_at TIMESTAMP NOT NULL DEFAULT NOW(),
completed_at TIMESTAMP,
status VARCHAR(20) CHECK (status IN ('running', 'completed', 'failed')),
checksum VARCHAR(64)
);
-- Before starting, check if already completed
SELECT * FROM job_runs WHERE job_id = 'daily_report_2025_01_15' AND status = 'completed';
Explanation
A production batch pipeline needs three properties:
- Idempotency: Running the same job twice must produce the same result. Use job IDs and checksums to skip already-processed work.
- Fault tolerance: Individual batch failures should not crash the entire job. Implement retry with exponential backoff and a dead-letter queue.
- Observability: Track progress, throughput, and errors. Emit metrics for processed items, latency, and failure rates.
Chunking strategy: Size batches to balance memory usage and throughput. Too small = overhead; too large = OOM risk.
Variants
| Pattern | Use Case | Trade-off |
|---|---|---|
| Chunked processing | Large files, memory limits | Simpler, higher latency |
| Parallel workers | CPU-bound transformations | Complex, needs coordination |
| MapReduce | Distributed aggregation | Scales horizontally |
| Change Data Capture | Incremental sync | Requires source support |
Best Practices
- Design for idempotency: Every job must be safely retryable
- Log everything: Job start, end, and every batch outcome
- Use transactions: Wrap batch writes in database transactions
- Monitor queue depth: Alert when pending batches exceed thresholds
- Implement circuit breakers: Stop retrying if downstream is unhealthy
Common Mistakes
- Not handling partial failures: A batch of 1000 where 1 fails needs individual retry
- Ignoring memory limits: Loading entire datasets into RAM crashes the process
- Missing checkpointing: A 6-hour job that fails at 5:55 must restart from scratch
- Silent data loss: Errors logged but not surfaced to operators
- No rollback strategy: Failed jobs leave the database in an inconsistent state
Frequently Asked Questions
Q: How large should each batch be? A: Start with 100-1000 items. Benchmark with your data and memory constraints.
Q: Should I use a job queue like Celery or a cron job? A: Use Celery/Redis for distributed systems and cron for single-node, simple pipelines.
Q: How do I handle schema changes mid-pipeline? A: Version your job logic and data schemas. Run old and new versions in parallel during migration.
Related Resources
Caching & Memoization
How to cache expensive computations and API responses using in-memory, LRU, and distributed caches across Python, JavaScript, and Java.
RecipeValidate and Sanitize User Input Data
How to validate, sanitize, and constrain user input data at the application boundary using schemas, type checking, and validation libraries.
RecipeDate Formatting
How to parse, format, and manipulate dates across timezones using Python, JavaScript, and Java.
RecipeDeep Clone Objects in JavaScript
How to create deep copies of JavaScript objects and arrays correctly, handling circular references, Dates, Maps, Sets, and custom classes.
RecipeFlatten and Unflatten Nested Objects
How to convert nested objects to flat key-value pairs and back again, with dot-notation, bracket notation, and custom separator support.