Blob Storage — S3, GCS, and Azure Blob Patterns for Engineers

Overview

Blob (object) storage is the dominant way to store unstructured data in the cloud: images, videos, documents, backups, and logs. Unlike filesystems or block storage, object storage treats each file as an independent object with metadata, accessed via HTTP APIs. It is infinitely scalable, durable, and cost-effective — but requires different design patterns than traditional storage.

This guide covers bucket design, access patterns, security, lifecycle management, and multi-cloud considerations.

When to Use

You store files >1MB that do not need random access (images, videos, PDFs)
You need durable, redundant storage without managing disks or RAID
Your storage volume exceeds what a single server can handle
You want to decouple storage from compute (stateless services)
You need to share files between services, regions, or organizations
Cost per gigabyte is a primary concern

When NOT to Use

You need frequent small random reads/writes (databases, transactional data)
You need POSIX filesystem semantics (directories, symlinks, file locking)
Latency requirements are <10ms consistently (use SSD/block storage)
You need to modify objects in-place (objects are immutable; rewrite required)

Core Concepts

Concept	Description
Bucket	A container for objects with its own policies and configuration
Object	A file stored with metadata, a unique key, and a version ID
Key	The unique identifier (path-like string) for an object within a bucket
Presigned URL	A time-limited URL that grants temporary access without credentials
Multipart Upload	Splitting large files into parts for parallel upload and resumability
Lifecycle Policy	Rules that transition or delete objects based on age

Provider Comparison

Feature	AWS S3	Google Cloud Storage	Azure Blob
Durability	99.999999999% (11 nines)	99.999999999%	99.999999999%
Availability	99.99%	99.95% (multi-regional)	99.99% (Hot)
Storage Classes	Standard, IA, Glacier, Deep	Standard, Nearline, Coldline, Archive	Hot, Cool, Cold, Archive
Min Billable Size	128KB (IA)	N/A	N/A
Multipart Min	5MB (except last)	N/A (composite objects)	4MB (block)
Presigned URLs	Yes	Yes	Yes (SAS tokens)
Event Notifications	S3 Events, SNS, SQS	Cloud Pub/Sub	Event Grid
Static Website	Native support	Native support	Native support

Step-by-Step Blob Storage Implementation

1. Design Your Bucket Structure

Organize objects to support access patterns and lifecycle management:

s3://myapp-production/
├── uploads/
│   ├── raw/           # Unprocessed user uploads
│   │   └── 2024/06/25/uuid-original.jpg
│   ├── processed/     # Resized, compressed versions
│   │   └── 2024/06/25/uuid-800x600.jpg
│   └── temp/          # Processing in progress
├── documents/
│   ├── invoices/      # Financial documents
│   └── contracts/     # Legal documents
├── backups/
│   └── database/        # Daily database dumps
├── logs/
│   └── application/     # Application log files
└── public/              # Static website assets
    ├── images/
    ├── css/
    └── js/

Naming best practices:

Pattern	Example	Purpose
Date prefix	`logs/2024/06/25/app.log`	Lifecycle rules by date
UUID filename	`uploads/raw/a1b2c3d4.jpg`	Avoid conflicts, enable distribution
Derived variants	`uuid-thumb.jpg`, `uuid-full.jpg`	Multiple sizes/formats
Version prefix	`backups/v2.3.1/dump.sql`	Software version correlation

# Example: Generate structured object keys
import uuid
from datetime import datetime

def generate_upload_key(user_id, filename):
    """Generate S3 key with date prefix and UUID."""
    now = datetime.utcnow()
    file_uuid = uuid.uuid4().hex[:12]
    extension = filename.split('.')[-1].lower()
    return f"uploads/raw/{now:%Y/%m/%d}/{user_id}/{file_uuid}.{extension}"

# Result: uploads/raw/2024/06/25/12345/a1b2c3d4e5f6.jpg

2. Implement Secure Access

Never distribute long-term credentials. Use IAM roles, presigned URLs, and bucket policies:

# Example: Generate presigned URL for temporary access (Python/Boto3)
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')

def generate_upload_url(bucket, key, expiration=300):
    """Generate a presigned URL for direct browser upload."""
    try:
        url = s3.generate_presigned_url(
            'put_object',
            Params={
                'Bucket': bucket,
                'Key': key,
                'ContentType': 'image/jpeg'
            },
            ExpiresIn=expiration
        )
        return url
    except ClientError as e:
        raise

def generate_download_url(bucket, key, expiration=3600):
    """Generate a presigned URL for temporary download access."""
    try:
        url = s3.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucket, 'Key': key},
            ExpiresIn=expiration
        )
        return url
    except ClientError as e:
        raise

# Usage in API
@app.route('/upload-url', methods=['POST'])
def get_upload_url():
    user_id = get_current_user_id()
    filename = request.json['filename']
    key = generate_upload_key(user_id, filename)
    url = generate_upload_url('myapp-uploads', key)
    return jsonify({'uploadUrl': url, 'key': key})

// Example: S3 bucket policy for CloudFront origin access
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CloudFrontAccess",
            "Effect": "Allow",
            "Principal": {
                "CanonicalUser": "CLOUDFRONT_OAI_CANONICAL_ID"
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::myapp-public/*"
        }
    ]
}

# Example: Terraform for private bucket with versioning
resource "aws_s3_bucket" "uploads" {
  bucket = "myapp-production-uploads"
}

resource "aws_s3_bucket_versioning" "uploads" {
  bucket = aws_s3_bucket.uploads.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_public_access_block" "uploads" {
  bucket = aws_s3_bucket.uploads.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_server_side_encryption_configuration" "uploads" {
  bucket = aws_s3_bucket.uploads.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

3. Upload Large Files with Multipart

For files >100MB, use multipart upload for reliability and performance:

# Example: Multipart upload with Python/Boto3
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# Simple multipart (Boto3 handles splitting)
config = TransferConfig(
    multipart_threshold=1024 * 25,    # 25MB
    max_concurrency=10,
    multipart_chunksize=1024 * 25,    # 25MB parts
    use_threads=True
)

s3.upload_file(
    'large-video.mp4',
    'myapp-uploads',
    'videos/large-video.mp4',
    Config=config
)

# Manual multipart for resumable uploads
def multipart_upload(bucket, key, file_path, part_size=50*1024*1024):
    """Upload with resume capability."""
    s3 = boto3.client('s3')
    
    # Initiate multipart upload
    mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = mpu['UploadId']
    
    try:
        parts = []
        with open(file_path, 'rb') as f:
            part_num = 1
            while True:
                data = f.read(part_size)
                if not data:
                    break
                
                response = s3.upload_part(
                    Bucket=bucket, Key=key,
                    UploadId=upload_id, PartNumber=part_num,
                    Body=data
                )
                parts.append({
                    'PartNumber': part_num,
                    'ETag': response['ETag']
                })
                part_num += 1
        
        # Complete multipart upload
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={'Parts': parts}
        )
    except Exception as e:
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise

// Example: Multipart upload with AWS SDK v3 (Node.js)
import { Upload } from "@aws-sdk/lib-storage";
import { S3Client } from "@aws-sdk/client-s3";
import { createReadStream } from "fs";

const client = new S3Client({ region: "us-east-1" });

const upload = new Upload({
  client,
  params: {
    Bucket: "myapp-uploads",
    Key: "videos/large-file.mp4",
    Body: createReadStream("./large-file.mp4"),
  },
  queueSize: 4,        // Concurrent parts
  partSize: 25 * 1024 * 1024,  // 25MB parts
});

upload.on("httpUploadProgress", (progress) => {
  console.log(`Uploaded ${progress.loaded}/${progress.total}`);
});

await upload.done();

4. Implement Lifecycle Policies

Automate cost optimization by transitioning or deleting old objects:

// S3 Lifecycle Policy: Transition to cheaper storage, then delete
{
    "Rules": [
        {
            "ID": "RawUploadsLifecycle",
            "Status": "Enabled",
            "Filter": {
                "Prefix": "uploads/raw/"
            },
            "Transitions": [
                {
                    "Days": 30,
                    "StorageClass": "STANDARD_IA"
                },
                {
                    "Days": 90,
                    "StorageClass": "GLACIER"
                }
            ],
            "Expiration": {
                "Days": 365
            }
        },
        {
            "ID": "TempCleanup",
            "Status": "Enabled",
            "Filter": {
                "Prefix": "uploads/temp/"
            },
            "Expiration": {
                "Days": 7
            }
        },
        {
            "ID": "LogArchive",
            "Status": "Enabled",
            "Filter": {
                "Prefix": "logs/"
            },
            "Transitions": [
                {
                    "Days": 30,
                    "StorageClass": "STANDARD_IA"
                }
            ],
            "NoncurrentVersionTransitions": [
                {
                    "NoncurrentDays": 30,
                    "StorageClass": "GLACIER"
                }
            ]
        }
    ]
}

Lifecycle strategy by data type:

Data Type	Hot (Standard)	Cool (IA/Nearline)	Cold (Glacier/Archive)	Delete
User uploads	30 days	30-90 days	90-365 days	1-2 years
Processed images	90 days	90-180 days	1 year	2 years
Database backups	7 days	7-30 days	30-90 days	90 days
Application logs	7 days	7-30 days	30-90 days	1 year
Temp/processing	Never	Never	Never	7 days

5. Optimize for Cost and Performance

Optimization	Implementation	Savings
Storage class selection	Use IA/Coldline for infrequent access	40-80%
Lifecycle transitions	Auto-move old data to cheaper tiers	50-90%
Delete incomplete multipart	Abort incomplete uploads after 7 days	Prevents waste
Compress before upload	Gzip text files, WebP images	30-70%
Use CloudFront/CDN	Cache frequently accessed objects	Reduces S3 egress by 80%+
S3 Transfer Acceleration	For global uploads from distant clients	Faster uploads, minimal cost
Requester Pays	For public datasets	Offload bandwidth costs

# Example: Compress before upload
import gzip
import boto3

s3 = boto3.client('s3')

def upload_compressed(bucket, key, data):
    """Upload gzip-compressed data with Content-Encoding header."""
    compressed = gzip.compress(data.encode('utf-8'))
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=compressed,
        ContentEncoding='gzip',
        ContentType='application/json'
    )

Best Practices

Never make buckets public. Use presigned URLs or CloudFront OAI for controlled access.
Enable versioning on production buckets. Protects against accidental deletion and overwrites.
Use server-side encryption by default. SSE-S3 or SSE-KMS depending on compliance needs.
Implement object locking for compliance. WORM (Write Once Read Many) for regulatory data.
Monitor with CloudTrail/CloudWatch. Track access patterns, costs, and unauthorized attempts.
Use checksums for integrity. ETag, MD5, or SHA-256 verify data was not corrupted in transit.

Common Mistakes

Storing small files individually. S3 has a minimum billable size. Batch small objects or use a database.
Using blob storage as a filesystem. Listing prefixes is expensive. Store metadata in a database.
No lifecycle policy. Production buckets accumulate years of unused data without automatic cleanup.
Storing secrets in buckets. Use parameter stores or secret managers, not S3 objects.
Ignoring egress costs. Serving large files directly from S3 to users is expensive. Use a CDN.
No multipart for large files. Uploading a 10GB file as a single PUT is unreliable and slow.

Variants

MinIO: Self-hosted S3-compatible object storage for on-premises or edge
Ceph: Open-source distributed object store for private cloud
Backblaze B2: Cost-effective S3-compatible alternative (1/4 the price)
Cloudflare R2: Zero egress fee object storage, S3-compatible API
NAS/SAN: Traditional block/file storage for applications needing POSIX semantics

FAQ

Q: Should I use one bucket or many? Use separate buckets for different environments (prod, staging, dev) and different security domains (public assets vs private uploads). Within an environment, use prefixes (folders) rather than many buckets.

Q: How do I handle millions of small files? Batch them into larger archive objects (tar, zip), use a database to track individual file metadata, or use an object store designed for small files (DynamoDB for metadata + S3 for blobs).

Q: What is the maximum file size? S3: 5TB (with multipart). GCS: 5TB. Azure: 4.75TB (Block Blob). For larger, split into chunks.

Q: How do I migrate from one provider to another? Use tools like rclone, aws s3 sync, or cloud-native transfer services (AWS DataSync, Azure Data Box). For large migrations, consider physical data transfer appliances.

Conclusion

Blob storage is the backbone of modern cloud data architectures. By designing buckets for your access patterns, securing access with presigned URLs and IAM policies, automating lifecycle transitions, and optimizing large file uploads, you build a storage layer that scales infinitely while controlling costs.