intermediate By StackPractices

Log Aggregation — Centralize, Search, and Analyze Logs at Scale

A practical guide to log aggregation: structured logging, shipping strategies, retention policies, and building searchable log pipelines with ELK, Loki, and cloud-native solutions.

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

Log aggregation collects logs from all services, systems, and infrastructure into a centralized, searchable platform. It transforms scattered text files into a queryable observability signal, enabling fast debugging, security auditing, and operational visibility across distributed systems.

This guide covers structured logging, shipping strategies, storage optimization, and platform selection.

When to Use

  • You operate more than 5 services and need to correlate logs across them
  • Debugging requires grepping through multiple servers or containers
  • You need log-based alerting for errors and anomalies
  • Security or compliance requires centralized audit logs
  • Your current logging is ad-hoc and inconsistent across teams

Core Concepts

Concept               | Description
Structured Logging    | Outputting logs as JSON or key-value pairs instead of free text
Log Shipper           | Agent that reads local logs and forwards them to a central store
Index                 | Searchable storage partition organized by time or source
Retention Policy      | Rules that determine how long logs are kept before deletion
Log Parsing           | Extracting fields from raw log lines at ingest or query time
Hot/Warm/Cold Storage | Tiered storage based on access frequency and age
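
To make the "Log Parsing" row concrete, here is a minimal sketch of ingest-time field extraction. The line layout, the regular expression, and the `parse_line` helper are assumptions chosen for illustration; they do not reproduce any shipper's built-in parser.

```python
import re
from typing import Optional

# Assumed line layout: "<ISO timestamp> <LEVEL> <service> <free-text message>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>[\w-]+)\s+(?P<message>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Return the extracted fields, or None when the line does not match."""
    match = LINE_PATTERN.match(line.strip())
    return match.groupdict() if match else None


record = parse_line("2024-05-01T12:00:00Z ERROR orders-service payment declined")
print(record)
```

Lines that do not match return `None`, so a pipeline can route them to a fallback index rather than drop them silently. Structured logging at the source avoids this parsing step altogether.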

Log Aggregation Architectures

┌──────────┐   ┌──────────┐   ┌──────────┐
│  App 1   │   │  App 2   │   │  App N   │
│ (stdout) │   │ (stdout) │   │ (stdout) │
└────┬─────┘   └────┬─────┘   └────┬─────┘
     │              │              │
     └──────────────┼──────────────┘
                    │
          ┌─────────▼─────────┐
          │    Log Shipper    │  (Filebeat, Fluent Bit, Vector)
          │ (Parse + Enrich)  │
          └─────────┬─────────┘
                    │
         ┌──────────┴──────────┐
         │                     │
 ┌───────▼───────┐       ┌─────▼─────┐
 │    Indexer    │       │  Object   │
 │(Elasticsearch,│       │  Storage  │
 │     Loki)     │       │ (S3/GCS)  │
 └───────┬───────┘       └───────────┘
         │
 ┌───────▼────────┐
 │   Dashboard    │
 │(Kibana/Grafana)│
 └────────────────┘

Step-by-Step Log Aggregation Setup

1. Adopt Structured Logging

Make logs machine-parseable from the source:

# Example: Python structured logging with structlog
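import logging
import sys

import structlog

# Route stdlib logging to stdout at INFO. Without this, filter_by_level drops
# info() calls because the root logger defaults to WARNING.
logging.basicConfig(format="%(message)s", stream=sys.stdout, level=logging.INFO)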

structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.UnicodeDecoder(),
        structlog.processors.JSONRenderer()
    ],
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    wrapper_class=structlog.stdlib.BoundLogger,
    cache_logger_on_first_use=True,
)

logger = structlog.get_logger()

# Structured log output
logger.info(
    "payment_processed",
    payment_id="pay-123",
    amount=99.99,
    currency="USD",
    user_id="user-456",
    duration_ms=145
)
# Output (abridged): {"payment_id": "pay-123", "amount": 99.99, ..., "event": "payment_processed", "level": "info", ...}

// Example: Node.js structured logging with pino
const pino = require('pino');
const logger = pino({ level: 'info' });

// Pass fields as the first argument and the message as the second;
// snake_case keys keep field names consistent across services
logger.info({
  payment_id: 'pay-123',
  amount: 99.99,
  currency: 'USD',
  user_id: 'user-456',
  duration_ms: 145
}, 'payment_processed');

Structured logging best practices:

  • Always log as JSON in production
  • Use consistent field names (snake_case recommended)
  • Include trace_id, span_id, and request_id in every log
  • Add contextual fields (user_id, tenant_id, request_path) at request start
  • Never log PII or secrets

2. Ship Logs to Central Store

Choose and configure a log shipper:

Shipper    | Best For             | Pros                    | Cons
Filebeat   | ELK stack            | Mature, rich modules    | Elastic-centric; heavier than Fluent Bit
Fluent Bit | Kubernetes, embedded | Lightweight, fast       | Smaller plugin ecosystem than Fluentd
Vector     | High throughput      | Rust-based, performant  | Smaller ecosystem
Promtail   | Loki                 | Native Loki integration | Loki-only

# Example: Fluent Bit configuration for Kubernetes
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        # Handles both Docker JSON and containerd/CRI-O log formats
        multiline.parser  docker, cri
        DB                /var/log/flb_kube.db

    [FILTER]
        Name              kubernetes
        Match             kube.*
        Merge_Log         On
        Keep_Log          Off

    [OUTPUT]
        Name              loki
        Match             kube.*
        Host              loki.monitoring.svc
        Labels            job=fluentbit

# Example: Filebeat configuration for Elasticsearch
filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log
    fields:
      service: myapp
      environment: production
    fields_under_root: true
    json.keys_under_root: true
    json.add_error_key: true

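output.elasticsearch:
  hosts: ["https://elasticsearch:9200"]
  index: "myapp-logs-%{+yyyy.MM.dd}"

# A custom index name also requires matching template settings
setup.template.name: "myapp-logs"
setup.template.pattern: "myapp-logs-*"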

Shipping best practices:

  • Use backpressure-aware shippers that won’t crash the host
  • Add metadata (host, service, environment) at the shipper level
  • Buffer locally to survive temporary network outages
  • Use TLS for all log transport

3. Design Retention and Storage

Balance cost with queryability:

Storage Tier  | Retention   | Query Speed | Cost
Hot           | 1-7 days    | Instant     | High
Warm          | 7-30 days   | Seconds     | Medium
Cold (S3/GCS) | 30-365 days | Minutes     | Low
Archive       | 1-7 years   | Batch only  | Very low

# Example: Elasticsearch ILM (Index Lifecycle Management) policy
PUT _ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": { "require": { "data": "cold" } }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}

Retention rules:

  • Error logs: Keep longer (90+ days) than access logs (30 days)
  • Security/audit logs: Keep 1-7 years based on compliance requirements
  • Debug logs: Keep only in hot storage (1-3 days)
  • Archive to object storage before deletion for compliance

4. Query and Analyze Logs

Make your aggregated logs actionable:

# Example: KQL (Kibana Query Language)

# Find errors in a specific service
service.name:orders-service and level:error

# Find slow requests (>1s)
duration_ms > 1000

# Find requests for a specific user
user_id:user-123

# Count errors by service: KQL only filters documents, so apply this filter
# and group by service.name in a visualization (or use ES|QL to aggregate)
service.name:* and level:error

# Find exceptions in the last hour
@timestamp >= now-1h and exception.class:*

# Example: LogQL (Grafana Loki)

# Search for errors in a service
{job="orders-service"} |= "ERROR"

# Count errors per minute
sum(rate({job="orders-service"} |= "ERROR" [1m]))

# Find slow database queries
{job="orders-service"} |= "duration_ms" | json | duration_ms > 500

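# Extract payment amounts (line_format rewrites each log line to the amount)
{job="payment-service"} |= "payment_processed" | json | line_format "{{.amount}}"

# Graph the total payment amount per 5 minutes
sum(sum_over_time({job="payment-service"} |= "payment_processed" | json | unwrap amount [5m]))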

Query best practices:

  • Learn the query language of your chosen platform (KQL, LogQL, SPL)
  • Save common queries as dashboards or alerts
  • Use log-based metrics for dashboards (faster than raw log queries)
  • Correlate logs with traces using trace_id fields

5. Build Log-Based Alerts

Detect issues from log patterns:

# Example: Loki ruler alert rules (Prometheus-style rule file)
groups:
  - name: log_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate({job=~".+"} |= "ERROR" [5m])) by (job)
          /
          sum(rate({job=~".+"} [5m])) by (job)
          > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"

      - alert: SlowPayments
        expr: |
          avg_over_time(
            {job="payment-service"} |= "payment_processed" | json | unwrap duration_ms [10m]
          ) by (job) > 500
        labels:
          severity: warning

Alert patterns:

  • Error rate spike: % of error logs / total logs > threshold
  • New error pattern: Count of unique exception types increased
  • Missing logs: Log volume dropped below expected baseline
  • Security event: Pattern matching known attack signatures

Best Practices

  • Standardize log levels. Use ERROR, WARN, INFO, DEBUG consistently across all services.
  • Include correlation IDs. Every log must have trace_id, span_id, or request_id for cross-service debugging.
  • Avoid logging in tight loops. Batch or skip loop logs to prevent log flooding.
  • Sample high-volume logs. Not every request needs full debug logging.
  • Monitor the pipeline. Alert if log shipping falls behind or storage fills up.
  • Document your schema. Teams need to know which fields are available for queries.

Common Mistakes

  • Unstructured logs everywhere. Parsing text logs at ingest time is fragile and slow.
  • No retention strategy. Without lifecycle policies, storage grows without bound and costs rise with every day of retained logs.
  • Over-logging. Debug-level logs in production overwhelm the pipeline and hide real issues.
  • Missing context. Logs without service name, environment, or trace IDs are nearly useless.
  • Ignoring backpressure. Log shippers that crash under load create blind spots.

Variants

  • Cloud-native: AWS CloudWatch Logs, Google Cloud Logging, Azure Monitor Logs (managed, but vendor-specific)
  • Open-source stack: ELK (Elasticsearch, Logstash, Kibana) or PLG (Promtail, Loki, Grafana)
  • Enterprise: Splunk, Datadog, Sumo Logic (rich features, higher cost)
  • Edge aggregation: Local log aggregation before central shipping (reduces network cost)

FAQ

Q: Should I use ELK or Loki?
ELK is more mature and feature-rich, and Elasticsearch indexes the full text of every log. Loki indexes only labels, which makes it simpler and cheaper at scale, and it integrates natively with Grafana. Choose ELK for complex search; choose Loki for cost-efficient observability.

Q: How do I handle multi-line logs (stack traces)?
Use log shippers with multiline parsing (Filebeat multiline.pattern, Fluent Bit multiline.parser) or log directly as JSON with the stack trace as a single field.
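
The idea behind multiline parsing is to treat any line that does not look like the start of a new event as a continuation of the previous one. A minimal sketch, assuming every event starts with an ISO-8601 date; the `EVENT_START` pattern and `fold_multiline` helper are hypothetical and stand in for a shipper's multiline settings:

```python
import re
from typing import Iterable, List

# Assumption: every new event begins with a date such as "2024-05-01".
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2}")


def fold_multiline(lines: Iterable[str]) -> List[str]:
    """Join continuation lines (such as stack frames) onto the preceding event."""
    events: List[str] = []
    for line in lines:
        if EVENT_START.match(line) or not events:
            events.append(line)  # start of a new event
        else:
            events[-1] += "\n" + line  # continuation of the previous event
    return events


raw_lines = [
    "2024-05-01T12:00:00Z ERROR request failed",
    "Traceback (most recent call last):",
    '  File "app.py", line 1, in <module>',
    "ValueError: bad input",
    "2024-05-01T12:00:01Z INFO request served",
]
events = fold_multiline(raw_lines)
print(len(events))
```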

Q: How much does log aggregation cost?
At high scale, log storage is often your largest observability cost. Use sampling, aggressive retention policies, and tiered storage to control costs.

Q: Can I use logs for metrics?
Yes. Most platforms support log-based metrics (counting log lines over time, extracting numeric fields). This avoids dual instrumentation but is less efficient than dedicated metrics.

Conclusion

Log aggregation transforms scattered application output into a unified debugging and auditing platform. By adopting structured logging, choosing the right shipping strategy, and designing smart retention policies, you build an observability foundation that scales with your infrastructure.