Skip to content
SP StackPractices
beginner By StackPractices

Parse Excel Files

How to read and write Excel (.xlsx) files in Python, Java, and JavaScript.

Topics: data

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

Excel files (.xlsx) remain the dominant format for business reporting, data exports, and financial modeling. Parsing Excel programmatically enables automated data ingestion, report generation, and validation pipelines. This recipe covers reading, writing, and formatting spreadsheets across Python, JavaScript, and Java.

When to Use

Use this resource when:

  • Importing data from business users who deliver Excel files instead of CSV
  • Generating formatted reports with multiple sheets, charts, and styling
  • Validating data exports against source-of-truth databases
  • Converting legacy Excel-based workflows into automated pipelines

Solution

Python

# openpyxl is the standard for modern .xlsx files
# pip install openpyxl
from openpyxl import load_workbook

wb = load_workbook('data.xlsx')
ws = wb.active

for row in ws.iter_rows(min_row=2, values_only=True):
    print(row)
# Writing Excel files
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = "Sales"
ws.append(["Product", "Quantity", "Price"])
ws.append(["Widget", 100, 19.99])
wb.save('output.xlsx')

JavaScript

// xlsx (SheetJS) is the most popular Excel library for Node.js
// npm install xlsx
import xlsx from 'xlsx';

const workbook = xlsx.readFile('data.xlsx');
const sheet = workbook.Sheets[workbook.SheetNames[0]];
const data = xlsx.utils.sheet_to_json(sheet);
console.log(data);
// Writing Excel files
import xlsx from 'xlsx';

const ws = xlsx.utils.aoa_to_sheet([['Name', 'Age'], ['Alice', 30]]);
const wb = xlsx.utils.book_new();
xlsx.utils.book_append_sheet(wb, ws, 'People');
xlsx.writeFile(wb, 'output.xlsx');

Java

// Apache POI is the standard for Excel in Java
// Maven: org.apache.poi:poi-ooxml
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import java.io.FileInputStream;

public class ExcelParser {
    public static void main(String[] args) throws Exception {
        try (Workbook wb = new XSSFWorkbook(new FileInputStream("data.xlsx"))) {
            Sheet sheet = wb.getSheetAt(0);
            for (Row row : sheet) {
                for (Cell cell : row) {
                    System.out.print(cell.toString() + "\t");
                }
                System.out.println();
            }
        }
    }
}

Explanation

Excel files are ZIP archives containing XML files that follow the Open XML specification. Libraries abstract this complexity into sheet, row, and cell APIs. openpyxl (Python) supports modern .xlsx features like charts, images, and conditional formatting. xlsx (JS) is lightweight and supports both reading and writing in browser and Node.js. Apache POI (Java) is the enterprise standard but has a heavier memory footprint.

When reading, decide between values_only=True (Python) or sheet_to_json (JS) to get plain values, versus accessing cell objects for formatting, formulas, and metadata. For large files (>10k rows), use streaming readers to avoid loading the entire workbook into memory.

Variants

TechnologyLibraryApproachNotes
Pythonopenpyxlload_workbook()Full feature support, read/write .xlsx
Pythonpandasread_excel()Fast for data analysis, wraps openpyxl
Pythonxlrdopen_workbook()Legacy .xls only, not recommended for .xlsx
JavaScriptxlsx (SheetJS)readFile() / writeFile()Browser + Node, supports CSV/JSON export
JavaApache POIXSSFWorkbookEnterprise standard, supports formulas and charts
JavaFastExcelStreaming APILow-memory alternative to POI for large files

Best Practices

  • Use read_only=True in openpyxl for large read-only files to reduce memory usage
  • Validate sheet names before accessing them; user-provided files may have unexpected names
  • Handle merged cells explicitly: Libraries often return None for merged cell ranges except the top-left cell
  • Prefer .xlsx over .xls: The modern format has larger row limits and better compression
  • Close file handles with context managers (with in Python, try-with-resources in Java)

Common Mistakes

  • Using xlrd for .xlsx files: xlrd dropped .xlsx support in version 2.0; use openpyxl instead
  • Not handling formula cells: Formula cells return 0 or #VALUE! unless evaluated or cached
  • Ignoring data types: Excel stores dates as serial numbers; convert them explicitly to datetime
  • Loading entire workbooks into memory: For files > 50 MB, use streaming APIs to avoid OOM errors
  • Hard-coding column indexes: Use header row mapping (e.g., {'Name': 0, 'Age': 1}) to survive column reordering

Frequently Asked Questions

How do I read large Excel files without running out of memory?

Use openpyxl with read_only=True in Python, FastExcel streaming in Java, or process files in chunks with xlsx in Node.js. Another approach is converting to CSV first and then streaming the CSV, though you lose formatting and formulas.

Can I preserve formatting when writing Excel files?

Yes. openpyxl supports fonts, fills, borders, and number formats via the openpyxl.styles module. Apache POI has CellStyle and Font classes. xlsx (SheetJS) supports styles in its Pro version; the community version is limited to raw data.

How do I handle dates correctly when parsing Excel?

Excel stores dates as floating-point serial numbers (days since 1900 or 1904). openpyxl returns datetime objects when data_only=True is set and values are cached. pandas automatically converts date columns if parse_dates is specified. In Java, use DataFormatter to render cell values as strings, then parse with DateTimeFormatter.