Parse Excel Files | StackPractices

Q: How do I read large Excel files without running out of memory?

Use openpyxl with read_only=True in Python, FastExcel streaming in Java, or process files in chunks with xlsx in Node.js. Another approach is converting to CSV first and then streaming the CSV, though you lose formatting and formulas.

Q: Can I preserve formatting when writing Excel files?

Yes. openpyxl supports fonts, fills, borders, and number formats via the openpyxl.styles module. Apache POI has CellStyle and Font classes. xlsx (SheetJS) supports styles in its Pro version; the community version is limited to raw data.

Overview

Excel files (.xlsx) remain the dominant format for business reporting, data exports, and financial modeling. Parsing Excel programmatically enables automated data ingestion, report generation, and validation pipelines. This recipe covers reading, writing, and formatting spreadsheets across Python, JavaScript, and Java.

When to Use

Use this resource when:

Importing data from business users who deliver Excel files instead of CSV
Generating formatted reports with multiple sheets, charts, and styling
Validating data exports against source-of-truth databases
Converting legacy Excel-based workflows into automated pipelines

Solution

Python

# openpyxl is the standard for modern .xlsx files
# pip install openpyxl
from openpyxl import load_workbook

wb = load_workbook('data.xlsx')
ws = wb.active

for row in ws.iter_rows(min_row=2, values_only=True):
    print(row)

# Writing Excel files
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = "Sales"
ws.append(["Product", "Quantity", "Price"])
ws.append(["Widget", 100, 19.99])
wb.save('output.xlsx')

JavaScript

// xlsx (SheetJS) is the most popular Excel library for Node.js
// npm install xlsx
import xlsx from 'xlsx';

const workbook = xlsx.readFile('data.xlsx');
const sheet = workbook.Sheets[workbook.SheetNames[0]];
const data = xlsx.utils.sheet_to_json(sheet);
console.log(data);

// Writing Excel files
import xlsx from 'xlsx';

const ws = xlsx.utils.aoa_to_sheet([['Name', 'Age'], ['Alice', 30]]);
const wb = xlsx.utils.book_new();
xlsx.utils.book_append_sheet(wb, ws, 'People');
xlsx.writeFile(wb, 'output.xlsx');

Java

// Apache POI is the standard for Excel in Java
// Maven: org.apache.poi:poi-ooxml
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import java.io.FileInputStream;

public class ExcelParser {
    public static void main(String[] args) throws Exception {
        try (Workbook wb = new XSSFWorkbook(new FileInputStream("data.xlsx"))) {
            Sheet sheet = wb.getSheetAt(0);
            for (Row row : sheet) {
                for (Cell cell : row) {
                    System.out.print(cell.toString() + "\t");
                }
                System.out.println();
            }
        }
    }
}

Explanation

Excel files are ZIP archives containing XML files that follow the Open XML specification. Libraries abstract this complexity into sheet, row, and cell APIs. openpyxl (Python) supports modern .xlsx features like charts, images, and conditional formatting. xlsx (JS) is lightweight and supports both reading and writing in browser and Node.js. Apache POI (Java) is the enterprise standard but has a heavier memory footprint.

When reading, decide between values_only=True (Python) or sheet_to_json (JS) to get plain values, versus accessing cell objects for formatting, formulas, and metadata. For large files (>10k rows), use streaming readers to avoid loading the entire workbook into memory.

Variants

Technology	Library	Approach	Notes
Python	`openpyxl`	`load_workbook()`	Full feature support, read/write .xlsx
Python	`pandas`	`read_excel()`	Fast for data analysis, wraps openpyxl
Python	`xlrd`	`open_workbook()`	Legacy .xls only, not recommended for .xlsx
JavaScript	`xlsx` (SheetJS)	`readFile()` / `writeFile()`	Browser + Node, supports CSV/JSON export
Java	`Apache POI`	`XSSFWorkbook`	Enterprise standard, supports formulas and charts
Java	`FastExcel`	Streaming API	Low-memory alternative to POI for large files

Best Practices

Use read_only=True in openpyxl for large read-only files to reduce memory usage
Validate sheet names before accessing them; user-provided files may have unexpected names
Handle merged cells explicitly: Libraries often return None for merged cell ranges except the top-left cell
Prefer .xlsx over .xls: The modern format has larger row limits and better compression
Close file handles with context managers (with in Python, try-with-resources in Java)

Common Mistakes

Using xlrd for .xlsx files: xlrd dropped .xlsx support in version 2.0; use openpyxl instead
Not handling formula cells: Formula cells return 0 or #VALUE! unless evaluated or cached
Ignoring data types: Excel stores dates as serial numbers; convert them explicitly to datetime
Loading entire workbooks into memory: For files > 50 MB, use streaming APIs to avoid OOM errors
Hard-coding column indexes: Use header row mapping (e.g., {'Name': 0, 'Age': 1}) to survive column reordering

Frequently Asked Questions

How do I read large Excel files without running out of memory?

Use openpyxl with read_only=True in Python, FastExcel streaming in Java, or process files in chunks with xlsx in Node.js. Another approach is converting to CSV first and then streaming the CSV, though you lose formatting and formulas.

Can I preserve formatting when writing Excel files?

Yes. openpyxl supports fonts, fills, borders, and number formats via the openpyxl.styles module. Apache POI has CellStyle and Font classes. xlsx (SheetJS) supports styles in its Pro version; the community version is limited to raw data.

How do I handle dates correctly when parsing Excel?

Excel stores dates as floating-point serial numbers (days since 1900 or 1904). openpyxl returns datetime objects when data_only=True is set and values are cached. pandas automatically converts date columns if parse_dates is specified. In Java, use DataFormatter to render cell values as strings, then parse with DateTimeFormatter.