Blob Storage — S3, GCS, and Azure Blob Patterns for Engineers
A practical guide to cloud blob storage: bucket design, access control, lifecycle policies, multipart uploads, presigned URLs, and cost optimization patterns for S3, Google Cloud Storage, and Azure Blob.
Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.
Overview
Blob (object) storage is the dominant way to store unstructured data in the cloud: images, videos, documents, backups, and logs. Unlike filesystems or block storage, object storage treats each file as an independent object with metadata, accessed via HTTP APIs. It is infinitely scalable, durable, and cost-effective — but requires different design patterns than traditional storage.
This guide covers bucket design, access patterns, security, lifecycle management, and multi-cloud considerations.
When to Use
- You store files >1MB that do not need random access (images, videos, PDFs)
- You need durable, redundant storage without managing disks or RAID
- Your storage volume exceeds what a single server can handle
- You want to decouple storage from compute (stateless services)
- You need to share files between services, regions, or organizations
- Cost per gigabyte is a primary concern
When NOT to Use
- You need frequent small random reads/writes (databases, transactional data)
- You need POSIX filesystem semantics (directories, symlinks, file locking)
- Latency requirements are <10ms consistently (use SSD/block storage)
- You need to modify objects in-place (objects are immutable; rewrite required)
Core Concepts
| Concept | Description |
|---|---|
| Bucket | A container for objects with its own policies and configuration |
| Object | A file stored with metadata, a unique key, and a version ID |
| Key | The unique identifier (path-like string) for an object within a bucket |
| Presigned URL | A time-limited URL that grants temporary access without credentials |
| Multipart Upload | Splitting large files into parts for parallel upload and resumability |
| Lifecycle Policy | Rules that transition or delete objects based on age |
Provider Comparison
| Feature | AWS S3 | Google Cloud Storage | Azure Blob |
|---|---|---|---|
| Durability | 99.999999999% (11 nines) | 99.999999999% | 99.999999999% |
| Availability | 99.99% | 99.95% (multi-regional) | 99.99% (Hot) |
| Storage Classes | Standard, IA, Glacier, Deep | Standard, Nearline, Coldline, Archive | Hot, Cool, Cold, Archive |
| Min Billable Size | 128KB (IA) | N/A | N/A |
| Multipart Min | 5MB (except last) | N/A (composite objects) | 4MB (block) |
| Presigned URLs | Yes | Yes | Yes (SAS tokens) |
| Event Notifications | S3 Events, SNS, SQS | Cloud Pub/Sub | Event Grid |
| Static Website | Native support | Native support | Native support |
Step-by-Step Blob Storage Implementation
1. Design Your Bucket Structure
Organize objects to support access patterns and lifecycle management:
s3://myapp-production/
├── uploads/
│ ├── raw/ # Unprocessed user uploads
│ │ └── 2024/06/25/uuid-original.jpg
│ ├── processed/ # Resized, compressed versions
│ │ └── 2024/06/25/uuid-800x600.jpg
│ └── temp/ # Processing in progress
├── documents/
│ ├── invoices/ # Financial documents
│ └── contracts/ # Legal documents
├── backups/
│ └── database/ # Daily database dumps
├── logs/
│ └── application/ # Application log files
└── public/ # Static website assets
├── images/
├── css/
└── js/
Naming best practices:
| Pattern | Example | Purpose |
|---|---|---|
| Date prefix | logs/2024/06/25/app.log | Lifecycle rules by date |
| UUID filename | uploads/raw/a1b2c3d4.jpg | Avoid conflicts, enable distribution |
| Derived variants | uuid-thumb.jpg, uuid-full.jpg | Multiple sizes/formats |
| Version prefix | backups/v2.3.1/dump.sql | Software version correlation |
# Example: Generate structured object keys
import uuid
from datetime import datetime
def generate_upload_key(user_id, filename):
"""Generate S3 key with date prefix and UUID."""
now = datetime.utcnow()
file_uuid = uuid.uuid4().hex[:12]
extension = filename.split('.')[-1].lower()
return f"uploads/raw/{now:%Y/%m/%d}/{user_id}/{file_uuid}.{extension}"
# Result: uploads/raw/2024/06/25/12345/a1b2c3d4e5f6.jpg
2. Implement Secure Access
Never distribute long-term credentials. Use IAM roles, presigned URLs, and bucket policies:
# Example: Generate presigned URL for temporary access (Python/Boto3)
import boto3
from botocore.exceptions import ClientError
s3 = boto3.client('s3')
def generate_upload_url(bucket, key, expiration=300):
"""Generate a presigned URL for direct browser upload."""
try:
url = s3.generate_presigned_url(
'put_object',
Params={
'Bucket': bucket,
'Key': key,
'ContentType': 'image/jpeg'
},
ExpiresIn=expiration
)
return url
except ClientError as e:
raise
def generate_download_url(bucket, key, expiration=3600):
"""Generate a presigned URL for temporary download access."""
try:
url = s3.generate_presigned_url(
'get_object',
Params={'Bucket': bucket, 'Key': key},
ExpiresIn=expiration
)
return url
except ClientError as e:
raise
# Usage in API
@app.route('/upload-url', methods=['POST'])
def get_upload_url():
user_id = get_current_user_id()
filename = request.json['filename']
key = generate_upload_key(user_id, filename)
url = generate_upload_url('myapp-uploads', key)
return jsonify({'uploadUrl': url, 'key': key})
// Example: S3 bucket policy for CloudFront origin access
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "CloudFrontAccess",
"Effect": "Allow",
"Principal": {
"CanonicalUser": "CLOUDFRONT_OAI_CANONICAL_ID"
},
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::myapp-public/*"
}
]
}
# Example: Terraform for private bucket with versioning
resource "aws_s3_bucket" "uploads" {
bucket = "myapp-production-uploads"
}
resource "aws_s3_bucket_versioning" "uploads" {
bucket = aws_s3_bucket.uploads.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_public_access_block" "uploads" {
bucket = aws_s3_bucket.uploads.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
resource "aws_s3_bucket_server_side_encryption_configuration" "uploads" {
bucket = aws_s3_bucket.uploads.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256"
}
}
}
3. Upload Large Files with Multipart
For files >100MB, use multipart upload for reliability and performance:
# Example: Multipart upload with Python/Boto3
import boto3
from boto3.s3.transfer import TransferConfig
s3 = boto3.client('s3')
# Simple multipart (Boto3 handles splitting)
config = TransferConfig(
multipart_threshold=1024 * 25, # 25MB
max_concurrency=10,
multipart_chunksize=1024 * 25, # 25MB parts
use_threads=True
)
s3.upload_file(
'large-video.mp4',
'myapp-uploads',
'videos/large-video.mp4',
Config=config
)
# Manual multipart for resumable uploads
def multipart_upload(bucket, key, file_path, part_size=50*1024*1024):
"""Upload with resume capability."""
s3 = boto3.client('s3')
# Initiate multipart upload
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
upload_id = mpu['UploadId']
try:
parts = []
with open(file_path, 'rb') as f:
part_num = 1
while True:
data = f.read(part_size)
if not data:
break
response = s3.upload_part(
Bucket=bucket, Key=key,
UploadId=upload_id, PartNumber=part_num,
Body=data
)
parts.append({
'PartNumber': part_num,
'ETag': response['ETag']
})
part_num += 1
# Complete multipart upload
s3.complete_multipart_upload(
Bucket=bucket, Key=key, UploadId=upload_id,
MultipartUpload={'Parts': parts}
)
except Exception as e:
s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
raise
// Example: Multipart upload with AWS SDK v3 (Node.js)
import { Upload } from "@aws-sdk/lib-storage";
import { S3Client } from "@aws-sdk/client-s3";
import { createReadStream } from "fs";
const client = new S3Client({ region: "us-east-1" });
const upload = new Upload({
client,
params: {
Bucket: "myapp-uploads",
Key: "videos/large-file.mp4",
Body: createReadStream("./large-file.mp4"),
},
queueSize: 4, // Concurrent parts
partSize: 25 * 1024 * 1024, // 25MB parts
});
upload.on("httpUploadProgress", (progress) => {
console.log(`Uploaded ${progress.loaded}/${progress.total}`);
});
await upload.done();
4. Implement Lifecycle Policies
Automate cost optimization by transitioning or deleting old objects:
// S3 Lifecycle Policy: Transition to cheaper storage, then delete
{
"Rules": [
{
"ID": "RawUploadsLifecycle",
"Status": "Enabled",
"Filter": {
"Prefix": "uploads/raw/"
},
"Transitions": [
{
"Days": 30,
"StorageClass": "STANDARD_IA"
},
{
"Days": 90,
"StorageClass": "GLACIER"
}
],
"Expiration": {
"Days": 365
}
},
{
"ID": "TempCleanup",
"Status": "Enabled",
"Filter": {
"Prefix": "uploads/temp/"
},
"Expiration": {
"Days": 7
}
},
{
"ID": "LogArchive",
"Status": "Enabled",
"Filter": {
"Prefix": "logs/"
},
"Transitions": [
{
"Days": 30,
"StorageClass": "STANDARD_IA"
}
],
"NoncurrentVersionTransitions": [
{
"NoncurrentDays": 30,
"StorageClass": "GLACIER"
}
]
}
]
}
Lifecycle strategy by data type:
| Data Type | Hot (Standard) | Cool (IA/Nearline) | Cold (Glacier/Archive) | Delete |
|---|---|---|---|---|
| User uploads | 30 days | 30-90 days | 90-365 days | 1-2 years |
| Processed images | 90 days | 90-180 days | 1 year | 2 years |
| Database backups | 7 days | 7-30 days | 30-90 days | 90 days |
| Application logs | 7 days | 7-30 days | 30-90 days | 1 year |
| Temp/processing | Never | Never | Never | 7 days |
5. Optimize for Cost and Performance
| Optimization | Implementation | Savings |
|---|---|---|
| Storage class selection | Use IA/Coldline for infrequent access | 40-80% |
| Lifecycle transitions | Auto-move old data to cheaper tiers | 50-90% |
| Delete incomplete multipart | Abort incomplete uploads after 7 days | Prevents waste |
| Compress before upload | Gzip text files, WebP images | 30-70% |
| Use CloudFront/CDN | Cache frequently accessed objects | Reduces S3 egress by 80%+ |
| S3 Transfer Acceleration | For global uploads from distant clients | Faster uploads, minimal cost |
| Requester Pays | For public datasets | Offload bandwidth costs |
# Example: Compress before upload
import gzip
import boto3
s3 = boto3.client('s3')
def upload_compressed(bucket, key, data):
"""Upload gzip-compressed data with Content-Encoding header."""
compressed = gzip.compress(data.encode('utf-8'))
s3.put_object(
Bucket=bucket,
Key=key,
Body=compressed,
ContentEncoding='gzip',
ContentType='application/json'
)
Best Practices
- Never make buckets public. Use presigned URLs or CloudFront OAI for controlled access.
- Enable versioning on production buckets. Protects against accidental deletion and overwrites.
- Use server-side encryption by default. SSE-S3 or SSE-KMS depending on compliance needs.
- Implement object locking for compliance. WORM (Write Once Read Many) for regulatory data.
- Monitor with CloudTrail/CloudWatch. Track access patterns, costs, and unauthorized attempts.
- Use checksums for integrity. ETag, MD5, or SHA-256 verify data was not corrupted in transit.
Common Mistakes
- Storing small files individually. S3 has a minimum billable size. Batch small objects or use a database.
- Using blob storage as a filesystem. Listing prefixes is expensive. Store metadata in a database.
- No lifecycle policy. Production buckets accumulate years of unused data without automatic cleanup.
- Storing secrets in buckets. Use parameter stores or secret managers, not S3 objects.
- Ignoring egress costs. Serving large files directly from S3 to users is expensive. Use a CDN.
- No multipart for large files. Uploading a 10GB file as a single PUT is unreliable and slow.
Variants
- MinIO: Self-hosted S3-compatible object storage for on-premises or edge
- Ceph: Open-source distributed object store for private cloud
- Backblaze B2: Cost-effective S3-compatible alternative (1/4 the price)
- Cloudflare R2: Zero egress fee object storage, S3-compatible API
- NAS/SAN: Traditional block/file storage for applications needing POSIX semantics
FAQ
Q: Should I use one bucket or many? Use separate buckets for different environments (prod, staging, dev) and different security domains (public assets vs private uploads). Within an environment, use prefixes (folders) rather than many buckets.
Q: How do I handle millions of small files? Batch them into larger archive objects (tar, zip), use a database to track individual file metadata, or use an object store designed for small files (DynamoDB for metadata + S3 for blobs).
Q: What is the maximum file size? S3: 5TB (with multipart). GCS: 5TB. Azure: 4.75TB (Block Blob). For larger, split into chunks.
Q: How do I migrate from one provider to another?
Use tools like rclone, aws s3 sync, or cloud-native transfer services (AWS DataSync, Azure Data Box). For large migrations, consider physical data transfer appliances.
Conclusion
Blob storage is the backbone of modern cloud data architectures. By designing buckets for your access patterns, securing access with presigned URLs and IAM policies, automating lifecycle transitions, and optimizing large file uploads, you build a storage layer that scales infinitely while controlling costs.