Parse Excel Files
How to read and write Excel (.xlsx) files in Python, Java, and JavaScript.
Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.
Overview
Excel files (.xlsx) remain the dominant format for business reporting, data exports, and financial modeling. Parsing Excel programmatically enables automated data ingestion, report generation, and validation pipelines. This recipe covers reading, writing, and formatting spreadsheets across Python, JavaScript, and Java.
When to Use
Use this resource when:
- Importing data from business users who deliver Excel files instead of CSV
- Generating formatted reports with multiple sheets, charts, and styling
- Validating data exports against source-of-truth databases
- Converting legacy Excel-based workflows into automated pipelines
Solution
Python
# openpyxl is the standard for modern .xlsx files
# pip install openpyxl
from openpyxl import load_workbook
wb = load_workbook('data.xlsx')
ws = wb.active
for row in ws.iter_rows(min_row=2, values_only=True):
print(row)
# Writing Excel files
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
ws.title = "Sales"
ws.append(["Product", "Quantity", "Price"])
ws.append(["Widget", 100, 19.99])
wb.save('output.xlsx')
JavaScript
// xlsx (SheetJS) is the most popular Excel library for Node.js
// npm install xlsx
import xlsx from 'xlsx';
const workbook = xlsx.readFile('data.xlsx');
const sheet = workbook.Sheets[workbook.SheetNames[0]];
const data = xlsx.utils.sheet_to_json(sheet);
console.log(data);
// Writing Excel files
import xlsx from 'xlsx';
const ws = xlsx.utils.aoa_to_sheet([['Name', 'Age'], ['Alice', 30]]);
const wb = xlsx.utils.book_new();
xlsx.utils.book_append_sheet(wb, ws, 'People');
xlsx.writeFile(wb, 'output.xlsx');
Java
// Apache POI is the standard for Excel in Java
// Maven: org.apache.poi:poi-ooxml
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import java.io.FileInputStream;
public class ExcelParser {
public static void main(String[] args) throws Exception {
try (Workbook wb = new XSSFWorkbook(new FileInputStream("data.xlsx"))) {
Sheet sheet = wb.getSheetAt(0);
for (Row row : sheet) {
for (Cell cell : row) {
System.out.print(cell.toString() + "\t");
}
System.out.println();
}
}
}
}
Explanation
Excel files are ZIP archives containing XML files that follow the Open XML specification. Libraries abstract this complexity into sheet, row, and cell APIs. openpyxl (Python) supports modern .xlsx features like charts, images, and conditional formatting. xlsx (JS) is lightweight and supports both reading and writing in browser and Node.js. Apache POI (Java) is the enterprise standard but has a heavier memory footprint.
When reading, decide between values_only=True (Python) or sheet_to_json (JS) to get plain values, versus accessing cell objects for formatting, formulas, and metadata. For large files (>10k rows), use streaming readers to avoid loading the entire workbook into memory.
Variants
| Technology | Library | Approach | Notes |
|---|---|---|---|
| Python | openpyxl | load_workbook() | Full feature support, read/write .xlsx |
| Python | pandas | read_excel() | Fast for data analysis, wraps openpyxl |
| Python | xlrd | open_workbook() | Legacy .xls only, not recommended for .xlsx |
| JavaScript | xlsx (SheetJS) | readFile() / writeFile() | Browser + Node, supports CSV/JSON export |
| Java | Apache POI | XSSFWorkbook | Enterprise standard, supports formulas and charts |
| Java | FastExcel | Streaming API | Low-memory alternative to POI for large files |
Best Practices
- Use
read_only=Truein openpyxl for large read-only files to reduce memory usage - Validate sheet names before accessing them; user-provided files may have unexpected names
- Handle merged cells explicitly: Libraries often return
Nonefor merged cell ranges except the top-left cell - Prefer
.xlsxover.xls: The modern format has larger row limits and better compression - Close file handles with context managers (
within Python, try-with-resources in Java)
Common Mistakes
- Using
xlrdfor .xlsx files:xlrddropped .xlsx support in version 2.0; useopenpyxlinstead - Not handling formula cells: Formula cells return
0or#VALUE!unless evaluated or cached - Ignoring data types: Excel stores dates as serial numbers; convert them explicitly to
datetime - Loading entire workbooks into memory: For files > 50 MB, use streaming APIs to avoid OOM errors
- Hard-coding column indexes: Use header row mapping (e.g.,
{'Name': 0, 'Age': 1}) to survive column reordering
Frequently Asked Questions
How do I read large Excel files without running out of memory?
Use openpyxl with read_only=True in Python, FastExcel streaming in Java, or process files in chunks with xlsx in Node.js. Another approach is converting to CSV first and then streaming the CSV, though you lose formatting and formulas.
Can I preserve formatting when writing Excel files?
Yes. openpyxl supports fonts, fills, borders, and number formats via the openpyxl.styles module. Apache POI has CellStyle and Font classes. xlsx (SheetJS) supports styles in its Pro version; the community version is limited to raw data.
How do I handle dates correctly when parsing Excel?
Excel stores dates as floating-point serial numbers (days since 1900 or 1904). openpyxl returns datetime objects when data_only=True is set and values are cached. pandas automatically converts date columns if parse_dates is specified. In Java, use DataFormatter to render cell values as strings, then parse with DateTimeFormatter.