Skip to content
SP StackPractices
intermediate By StackPractices

Parse Log Files

How to parse and analyze server log files using Python, Java, and JavaScript.

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

Server logs are a goldmine for debugging, security auditing, and performance analysis. Common formats include Apache Combined Log, Nginx access logs, JSON Lines, and syslog. Parsing these programmatically enables automated monitoring, anomaly detection, and custom analytics dashboards.

When to Use

Use this resource when:

  • Analyzing web server logs to identify 404 errors, slow requests, or attack patterns
  • Building log aggregation pipelines for centralized observability platforms
  • Extracting metrics from application logs for custom dashboards
  • Automating security audits by scanning for suspicious IP addresses or user agents

Solution

Python

import re
from collections import Counter

# Parse Apache/Nginx combined log format
log_pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
)

with open('access.log', 'r') as f:
    for line in f:
        match = log_pattern.match(line)
        if match:
            print(match.group('ip'), match.group('status'))
# Count HTTP status codes with Counter
status_counts = Counter()
with open('access.log', 'r') as f:
    for line in f:
        match = log_pattern.match(line)
        if match:
            status_counts[match.group('status')] += 1
print(status_counts)

JavaScript

const fs = require('fs');
const readline = require('readline');

const logPattern = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) ([^"]+)" (\d{3}) (\S+)/;

async function parseLogFile(path) {
    const stream = fs.createReadStream(path);
    const rl = readline.createInterface({ input: stream });
    const statusCounts = {};

    for await (const line of rl) {
        const match = logPattern.exec(line);
        if (match) {
            const [, ip, time, method, path, proto, status] = match;
            statusCounts[status] = (statusCounts[status] || 0) + 1;
        }
    }
    return statusCounts;
}

Java

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogParser {
    private static final Pattern LOG_PATTERN = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[(\\d{2}/\\w{3}/\\d{4}:\\d{2}:\\d{2}:\\d{2} [+-]\\d{4})\\] " +
        "\"(\\S+) (\\S+) ([^\"]+)\" (\\d{3}) (\\S+)"
    );

    public static void main(String[] args) throws IOException {
        try (BufferedReader br = new BufferedReader(new FileReader("access.log"))) {
            String line;
            while ((line = br.readLine()) != null) {
                Matcher m = LOG_PATTERN.matcher(line);
                if (m.find()) {
                    System.out.println(m.group(1) + " " + m.group(6));
                }
            }
        }
    }
}

Explanation

Log parsing follows a common pattern: read line by line, match against a regular expression or grammar, extract named groups, and aggregate results. Streaming is essential because server logs can reach gigabytes per day.

The Apache Combined Log Format is the de facto standard: host ident authuser [date] "request" status bytes [referer] [user-agent]. JSON Lines (ndjson) is increasingly common in modern applications because it is self-describing and trivial to parse with JSON.parse().

Variants

FormatPatternBest Tool
Apache/NginxRegex + streamingPython re, Node streams
JSON LinesJSON.parse()Any language, trivial parsing
SyslogRFC 3164/5424 grammarsyslog-parser libraries
CSV logscsv.reader / csv-parserStandard CSV tools
Custom applicationNamed regex groupsLanguage-specific regex

Best Practices

  • Stream large files line by line instead of loading the entire log into memory
  • Use named regex groups ((?P<name>...)) for self-documenting parsers
  • Normalize timestamps to UTC immediately to avoid timezone confusion
  • Handle malformed lines gracefully by logging errors and continuing, not crashing
  • Index parsed results into Elasticsearch, ClickHouse, or SQLite for fast querying

Common Mistakes

  • Parsing multi-line stack traces as separate log entries: Use readline carefully or switch to structured logging
  • Not escaping special regex characters: Log paths may contain ?, &, and % that break naive patterns
  • Hardcoding log paths: Accept paths via CLI args or environment variables
  • Ignoring log rotation: Implement file tailing or use existing tools like logrotate + rsyslog
  • Running regex on unbounded input: Pre-compile patterns and set reasonable line length limits

Frequently Asked Questions

What is the best format for application logs?

JSON Lines (ndjson) is the modern standard. Each log entry is a self-contained JSON object on its own line, making parsing trivial and eliminating the need for complex regex. Use structured logging libraries like pino (JS), structlog (Python), or Logback JSON (Java).

How do I parse logs in real time?

Use tail -f or language-specific file tailing libraries (Python pygtail, Node tail). Alternatively, ship logs to a message queue (Kafka, Redis Streams) and process them with consumers.

How do I detect anomalies in logs?

After parsing, aggregate by status code, response time percentiles, and error rate per endpoint. Set thresholds (e.g., >1% 5xx errors) and alert via PagerDuty or Slack. For advanced detection, feed parsed features into an ML model or use tools like the ELK stack with anomaly detection plugins.