Read Large Files

Overview

Reading multi-gigabyte files into memory at once can crash applications or cause severe performance degradation. This recipe shows memory-efficient techniques to process large files line by line or in chunks across Python, JavaScript, and Java.

When to Use

Use this resource when:

Processing log files, CSV dumps, or datasets larger than available RAM
Building ETL pipelines that ingest massive files
Streaming file contents to avoid blocking the event loop or heap

Solution

Python

# Line-by-line streaming (memory-efficient)
with open('large-file.log', 'r', encoding='utf-8') as f:
    for line in f:
        process(line)

# Chunked binary reading
chunk_size = 1024 * 1024  # 1 MB
with open('large-file.bin', 'rb') as f:
    while chunk := f.read(chunk_size):
        process(chunk)

JavaScript

const fs = require('fs');
const readline = require('readline');

// Line-by-line with readline (Node.js)
const stream = fs.createReadStream('large-file.log');
const rl = readline.createInterface({ input: stream });

for await (const line of rl) {
    console.log(line);
}

// Chunked reading
const readable = fs.createReadStream('large-file.bin', { highWaterMark: 1024 * 1024 });
readable.on('data', chunk => process(chunk));

Java

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class LargeFileReader {
    // Line-by-line
    public void readLines(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                process(line);
            }
        }
    }

    // Memory-mapped chunk reading
    public void readChunks(String path) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocateDirect(1024 * 1024);
            while (channel.read(buffer) > 0) {
                buffer.flip();
                process(buffer);
                buffer.clear();
            }
        }
    }

    private void process(Object data) {}
}

Explanation

Line-by-line streaming keeps only one line in memory at a time, making it ideal for text logs and CSV files. Chunked reading processes fixed-size byte blocks, suitable for binary data or when you need to control buffer size precisely. Memory-mapped files (Java) let the OS handle paging directly, often faster for random access but uses virtual address space.

Variants

Technology	Approach	Notes
Python	`mmap` module	Maps file to memory; OS handles paging
JavaScript	Web Streams API	`ReadableStream.getReader()` in browsers
Java	`Files.lines()`	Lazy Stream; auto-closes

Best Practices

Always use with (Python), try-with-resources (Java), or pipe error handling (JS) to prevent file descriptor leaks
Choose buffer sizes based on average line length; 1 MB is a sane default
For CSV/JSONL, parse incrementally rather than loading the entire structure
Monitor memory usage with OS tools to verify streaming behavior
Use mmap when you need random access without loading the whole file

Common Mistakes

Calling read() or readFileSync() on large files loads everything into RAM
Forgetting to handle encoding errors, which crash streams mid-file
Using too small a chunk size, causing excessive system call overhead
Not closing file handles, leading to “too many open files” errors
Ignoring backpressure when piping to slow consumers

Frequently Asked Questions

How large is “large”?

Any file approaching or exceeding your process heap/RAM (e.g., >500 MB on a 2 GB container). Streaming is cheap; always prefer it for files over a few megabytes.

Does line-by-line work for binary files?

No. Binary files should use chunked byte reading. Line-based approaches assume newline delimiters and text encoding.

Is memory-mapped faster than streaming?

For sequential access, usually not dramatically. Memory mapping shines for random access patterns or when multiple processes share the same file.

Overview

When to Use

Solution

Python

JavaScript

Java

Explanation

Variants

Best Practices

Common Mistakes

Frequently Asked Questions

How large is “large”?

Does line-by-line work for binary files?

Is memory-mapped faster than streaming?

File Upload Validation

Generate PDFs

Process Large Files with Streams

Abstract Factory Pattern

Adapter Pattern

Overview

When to Use

Solution

Python

JavaScript

Java

Explanation

Variants

Best Practices

Common Mistakes

Frequently Asked Questions

How large is “large”?

Does line-by-line work for binary files?

Is memory-mapped faster than streaming?

Related Resources

File Upload Validation

Generate PDFs

Process Large Files with Streams

Abstract Factory Pattern

Adapter Pattern