Safely Extract Zip Files with Python
How to extract and validate zip archives securely using Python's zipfile module.
Overview
Extracting zip files is a routine task, but doing it safely requires validation. Malicious archives can contain path traversal entries (../../etc/passwd) or zip bombs that exhaust disk space. Python’s zipfile module gives you the tools to extract safely if you check entries before writing.
When to Use
- You need to extract user-uploaded zip files
- You are processing archives from untrusted sources
- You want to validate zip contents before extraction (file count, total size)
- You need to extract specific files from an archive without unpacking everything
Solution
Basic extraction
import zipfile

with zipfile.ZipFile("archive.zip", "r") as zf:
    zf.extractall("output_dir")
Safe extraction with path traversal protection
import os
import zipfile

def safe_extract(zip_path, extract_to):
    # Resolve the extraction root once so every comparison uses the same base
    root = os.path.realpath(extract_to)
    with zipfile.ZipFile(zip_path, "r") as zf:
        for member in zf.namelist():
            # Resolve the target path
            target = os.path.realpath(os.path.join(root, member))
            # Ensure the target is inside the extraction directory
            if os.path.commonpath([root, target]) != root:
                raise ValueError(f"Path traversal detected: {member}")
        # Only extract after validation passes
        zf.extractall(root)

safe_extract("archive.zip", "output_dir")
Validate before extracting
import zipfile

def validate_zip(zip_path, max_files=1000, max_total_size_mb=500):
    with zipfile.ZipFile(zip_path, "r") as zf:
        files = zf.namelist()
        if len(files) > max_files:
            raise ValueError(f"Too many files: {len(files)} (max {max_files})")
        total_size = sum(info.file_size for info in zf.infolist())
        if total_size > max_total_size_mb * 1024 * 1024:
            raise ValueError(f"Archive too large: {total_size / 1024 / 1024:.1f}MB")
        # Check for suspicious entries
        for member in files:
            # Compare whole path components so names like "notes..txt" are not rejected
            parts = member.replace("\\", "/").split("/")
            if member.startswith(("/", "\\")) or ".." in parts:
                raise ValueError(f"Unsafe path in archive: {member}")
    return True

if validate_zip("archive.zip"):
    with zipfile.ZipFile("archive.zip", "r") as zf:
        zf.extractall("output_dir")
Extract specific files only
import zipfile

with zipfile.ZipFile("archive.zip", "r") as zf:
    # List all files
    for name in zf.namelist():
        print(name)
    # Extract only .csv files
    csv_files = [f for f in zf.namelist() if f.endswith(".csv")]
    for f in csv_files:
        zf.extract(f, "csv_output/")
Extract to memory without writing to disk
import zipfile

with zipfile.ZipFile("archive.zip", "r") as zf:
    with zf.open("data.json") as f:
        content = f.read()
        # Process content directly without writing to disk
        print(content[:200])
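Reading an entry into memory still decompresses it, so an unbounded read() can exhaust RAM instead of disk. The following is a minimal sketch of a capped read; the helper name read_member_limited and the 10 MB default are assumptions chosen for illustration, not part of the recipe above.

```python
import zipfile

def read_member_limited(zip_path, member, max_bytes=10 * 1024 * 1024):
    """Read one archive member into memory, refusing entries above max_bytes."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        info = zf.getinfo(member)  # raises KeyError if the member is missing
        # Cheap early rejection based on the size declared in the archive metadata
        if info.file_size > max_bytes:
            raise ValueError(f"{member} declares {info.file_size} bytes (max {max_bytes})")
        with zf.open(info) as f:
            # Read at most one byte past the limit so an oversized entry is detectable
            data = f.read(max_bytes + 1)
    if len(data) > max_bytes:
        raise ValueError(f"{member} exceeds {max_bytes} bytes when decompressed")
    return data
```

A call such as read_member_limited("archive.zip", "data.json") returns the bytes of the entry, or raises ValueError when the entry is larger than the cap.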
Explanation
The zipfile module reads archive metadata (file names, sizes, compression) without extracting. Use this to validate before writing anything to disk.
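Path traversal attacks work by including entries like ../../etc/passwd in the archive. An extractor that naively joins each entry name onto the output directory writes files to those paths. Current versions of zipfile strip leading slashes and ".." components inside extract() and extractall(), but an explicit check is still worthwhile: it rejects the archive instead of silently renaming entries, and it keeps protecting you if the extraction code is later replaced. The safe extraction function checks that every resolved path stays inside the target directory.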
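The containment check can be exercised against a deliberately crafted archive. The sketch below is one way to do that; find_escaping_members is a hypothetical helper (not from the recipe above) that reports offending names instead of raising, which is convenient for logging or tests.

```python
import os
import zipfile

def find_escaping_members(zip_path, extract_to):
    """Return member names whose resolved target would land outside extract_to."""
    root = os.path.realpath(extract_to)
    escaping = []
    with zipfile.ZipFile(zip_path, "r") as zf:
        for name in zf.namelist():
            target = os.path.realpath(os.path.join(root, name))
            try:
                inside = os.path.commonpath([root, target]) == root
            except ValueError:
                # Raised on Windows when the two paths sit on different drives
                inside = False
            if not inside:
                escaping.append(name)
    return escaping
```

An empty list means every entry resolves inside the extraction directory; any returned name is a reason to reject the archive.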
Zip bombs are archives that decompress to enormous sizes (e.g., a 42 KB file that expands to 4.5 PB). Check file_size for each entry and sum the values before extracting.
Variants
| Approach | Safety | Use When |
|---|---|---|
| extractall() | Minimal (built-in name sanitization only) | Trusted archives only |
| Safe extract with path check | High | User uploads |
| Validate + extract | Highest | Untrusted sources |
| Extract to memory | High | Processing without disk I/O |
Guidelines
- Never call extractall() on untrusted archives without validation.
- Check total uncompressed size before extracting to avoid zip bombs.
- Resolve paths with os.path.realpath() to catch symlink-based traversal.
- Use zf.open() to read files into memory when you do not need them on disk.
- Set a file count limit. Legitimate archives rarely contain 10,000 files.
Common Mistakes
- Calling extractall() directly on user uploads. This is the most common zip extraction vulnerability.
- Not checking file_size (uncompressed). A 1MB zip can contain entries that expand to GBs.
- Trusting member.startswith("..") checks alone. Symlinks and absolute paths can bypass simple string checks.
- Forgetting to handle password-protected archives. zf.extractall(pwd=b"secret") raises RuntimeError on wrong passwords.
- Not closing the ZipFile context. Use with to ensure the file handle is released.
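Because string checks on names cannot see symlink entries, it can help to inspect each entry's mode bits as well. The sketch below assumes the archive was created on a Unix-like system, where the upper 16 bits of external_attr carry the file mode; find_symlink_members is a hypothetical helper name.

```python
import stat
import zipfile

def find_symlink_members(zip_path):
    """Return names of entries whose Unix mode bits mark them as symbolic links."""
    links = []
    with zipfile.ZipFile(zip_path, "r") as zf:
        for info in zf.infolist():
            # For archives created on Unix, the upper 16 bits of external_attr hold the file mode
            mode = info.external_attr >> 16
            if stat.S_ISLNK(mode):
                links.append(info.filename)
    return links
```

Rejecting any archive for which this returns a non-empty list is a conservative policy for user uploads, since legitimate uploads rarely need symlinks.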
Frequently Asked Questions
How do I extract a password-protected zip?
Pass the password as bytes: zf.extractall("output", pwd=b"mypassword"). For AES-encrypted zips, install pyzipper instead of using the stdlib zipfile.
How do I detect a zip bomb?
Check the compression ratio. If uncompressed size is more than 100x the compressed size, treat it as suspicious. Also set a hard limit on total uncompressed size (e.g., 500MB).
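The ratio check can be sketched with the compress_size and file_size attributes that zipfile exposes for each entry. The function name check_compression_ratio and its defaults (100x ratio, 500 MB total) are assumptions that mirror the numbers in the answer above; tune them for your data, since highly repetitive but legitimate files can exceed 100x.

```python
import zipfile

def check_compression_ratio(zip_path, max_ratio=100, max_total_size=500 * 1024 * 1024):
    """Raise ValueError if any entry or the whole archive looks like a zip bomb."""
    total = 0
    with zipfile.ZipFile(zip_path, "r") as zf:
        for info in zf.infolist():
            total += info.file_size
            # compress_size can be 0 (for example for directory entries); skip to avoid dividing by zero
            if info.compress_size > 0:
                ratio = info.file_size / info.compress_size
                if ratio > max_ratio:
                    raise ValueError(f"Suspicious ratio {ratio:.0f}x for {info.filename}")
    if total > max_total_size:
        raise ValueError(f"Declared size {total} bytes exceeds limit {max_total_size}")
    return total
```

This sketch only inspects the top-level entries. It does not look inside nested archives, so archives extracted recursively need the same check at every level.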
Can I extract .tar.gz files with zipfile?
No. Use the tarfile module for tar archives. It has a similar API: tarfile.open("file.tar.gz", "r:gz").
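As with zip files, it is safer to inspect a tar archive before unpacking it, because tar entries can carry the same path traversal tricks. A minimal sketch that lists members without extracting; list_tar_members is a hypothetical helper name.

```python
import tarfile

def list_tar_members(tar_path):
    """Return (name, size) pairs for a gzip-compressed tar archive without extracting it."""
    with tarfile.open(tar_path, "r:gz") as tf:
        return [(member.name, member.size) for member in tf.getmembers()]
```

Recent Python releases also accept a filter argument on tarfile's extractall(), for example filter="data", which rejects many unsafe entries during extraction.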
How do I create a zip file in Python?
import zipfile

with zipfile.ZipFile("output.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("file1.txt")
    zf.write("file2.txt")