intermediate By StackPractices

Blob Storage — S3, GCS, and Azure Blob Patterns for Engineers

A practical guide to cloud blob storage: bucket design, access control, lifecycle policies, multipart uploads, presigned URLs, and cost optimization patterns for S3, Google Cloud Storage, and Azure Blob.

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

Blob (object) storage is the dominant way to store unstructured data in the cloud: images, videos, documents, backups, and logs. Unlike filesystems or block storage, object storage treats each file as an independent object with metadata, accessed via HTTP APIs. It scales to effectively unlimited capacity, is highly durable, and is cost-effective, but it requires different design patterns than traditional storage.

This guide covers bucket design, access patterns, security, lifecycle management, and multi-cloud considerations.

When to Use

  • You store files >1MB that do not need random access (images, videos, PDFs)
  • You need durable, redundant storage without managing disks or RAID
  • Your storage volume exceeds what a single server can handle
  • You want to decouple storage from compute (stateless services)
  • You need to share files between services, regions, or organizations
  • Cost per gigabyte is a primary concern

When NOT to Use

  • You need frequent small random reads/writes (databases, transactional data)
  • You need POSIX filesystem semantics (directories, symlinks, file locking)
  • Latency requirements are <10ms consistently (use SSD/block storage)
  • You need to modify objects in-place (objects are immutable; rewrite required)

Core Concepts

| Concept | Description |
|---|---|
| Bucket | A container for objects with its own policies and configuration |
| Object | A file stored with metadata, a unique key, and (when versioning is enabled) a version ID |
| Key | The unique identifier (path-like string) for an object within a bucket |
| Presigned URL | A time-limited URL that grants temporary access without credentials |
| Multipart Upload | Splitting large files into parts for parallel upload and resumability |
| Lifecycle Policy | Rules that transition or delete objects based on age |

Provider Comparison

| Feature | AWS S3 | Google Cloud Storage | Azure Blob |
|---|---|---|---|
| Durability | 99.999999999% (11 nines) | 99.999999999% | 99.999999999% |
| Availability | 99.99% | 99.95% (multi-regional) | 99.99% (Hot) |
| Storage Classes | Standard, Standard-IA, Glacier, Glacier Deep Archive | Standard, Nearline, Coldline, Archive | Hot, Cool, Cold, Archive |
| Min Billable Size | 128KB (IA) | N/A | N/A |
| Multipart Min | 5MB per part (except last) | N/A (composite objects) | No minimum (blocks up to 4000 MiB) |
| Presigned URLs | Yes | Yes | Yes (SAS tokens) |
| Event Notifications | S3 Events, SNS, SQS | Cloud Pub/Sub | Event Grid |
| Static Website | Native support | Native support | Native support |

Step-by-Step Blob Storage Implementation

1. Design Your Bucket Structure

Organize objects to support access patterns and lifecycle management:

s3://myapp-production/
├── uploads/
│   ├── raw/           # Unprocessed user uploads
│   │   └── 2024/06/25/uuid-original.jpg
│   ├── processed/     # Resized, compressed versions
│   │   └── 2024/06/25/uuid-800x600.jpg
│   └── temp/          # Processing in progress
├── documents/
│   ├── invoices/      # Financial documents
│   └── contracts/     # Legal documents
├── backups/
│   └── database/        # Daily database dumps
├── logs/
│   └── application/     # Application log files
└── public/              # Static website assets
    ├── images/
    ├── css/
    └── js/

Naming best practices:

| Pattern | Example | Purpose |
|---|---|---|
| Date prefix | logs/2024/06/25/app.log | Lifecycle rules by date |
| UUID filename | uploads/raw/a1b2c3d4.jpg | Avoid conflicts, enable distribution |
| Derived variants | uuid-thumb.jpg, uuid-full.jpg | Multiple sizes/formats |
| Version prefix | backups/v2.3.1/dump.sql | Software version correlation |

# Example: Generate structured object keys
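import os
import uuid
from datetime import datetime, timezone

def generate_upload_key(user_id, filename):
    """Generate S3 key with date prefix and UUID."""
    now = datetime.now(timezone.utc)  # utcnow() is deprecated since Python 3.12
    file_uuid = uuid.uuid4().hex[:12]
    # splitext yields '' when the name has no extension; 'bin' is an assumed fallback
    extension = os.path.splitext(filename)[1].lstrip('.').lower() or 'bin'
    return f"uploads/raw/{now:%Y/%m/%d}/{user_id}/{file_uuid}.{extension}"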

# Result: uploads/raw/2024/06/25/12345/a1b2c3d4e5f6.jpg
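
The "derived variants" pattern from the table can be sketched in the same style. This is a minimal illustration, assuming the `uploads/raw/` and `uploads/processed/` prefixes from the bucket layout above; `derive_variant_key` is a hypothetical helper name, not part of any SDK.

```python
from pathlib import PurePosixPath

RAW_PREFIX = "uploads/raw/"              # assumed layout from this guide
PROCESSED_PREFIX = "uploads/processed"   # assumed layout from this guide


def derive_variant_key(raw_key: str, variant: str) -> str:
    """Map a raw upload key to the key of a processed variant."""
    if not raw_key.startswith(RAW_PREFIX):
        raise ValueError(f"not a raw upload key: {raw_key!r}")
    relative = PurePosixPath(raw_key[len(RAW_PREFIX):])
    # Keep the date/user path and add the variant label before the extension.
    variant_name = f"{relative.stem}-{variant}{relative.suffix}"
    return str(PurePosixPath(PROCESSED_PREFIX) / relative.parent / variant_name)


print(derive_variant_key("uploads/raw/2024/06/25/12345/a1b2c3d4e5f6.jpg", "800x600"))
# uploads/processed/2024/06/25/12345/a1b2c3d4e5f6-800x600.jpg
```

Keeping the date and user segments identical in both prefixes means a lifecycle rule or a listing by date works the same way for raw and processed objects.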

2. Implement Secure Access

Never distribute long-term credentials. Use IAM roles, presigned URLs, and bucket policies:

# Example: Generate presigned URLs for temporary access (Python/Boto3)
import boto3

s3 = boto3.client('s3')

def generate_upload_url(bucket, key, expiration=300, content_type='image/jpeg'):
    """Generate a presigned URL for direct browser upload.

    The client must send the same Content-Type header the URL was signed
    with, otherwise S3 rejects the request with a signature mismatch.
    """
    # Signing happens locally; no request is sent to S3 here.
    return s3.generate_presigned_url(
        'put_object',
        Params={
            'Bucket': bucket,
            'Key': key,
            'ContentType': content_type
        },
        ExpiresIn=expiration
    )

def generate_download_url(bucket, key, expiration=3600):
    """Generate a presigned URL for temporary download access."""
    return s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=expiration
    )


# Usage in an API (Flask)
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/upload-url', methods=['POST'])
def get_upload_url():
    user_id = get_current_user_id()  # placeholder: supplied by your auth layer
    filename = request.json['filename']
    key = generate_upload_key(user_id, filename)
    url = generate_upload_url('myapp-uploads', key)
    return jsonify({'uploadUrl': url, 'key': key})
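
A presigned PUT URL cannot cap the upload size. When that matters, a presigned POST with policy conditions is one option. The sketch below assumes a boto3 S3 client is passed in as `s3_client`; `build_post_conditions` and `create_upload_form` are hypothetical helper names, and the 10 MiB default is an arbitrary illustration.

```python
from typing import Any


def build_post_conditions(content_type: str, max_bytes: int) -> list[Any]:
    """Policy conditions that S3 checks before accepting the form upload."""
    return [
        {"Content-Type": content_type},           # form field must match exactly
        ["content-length-range", 1, max_bytes],   # reject empty or oversized bodies
    ]


def create_upload_form(s3_client, bucket: str, key: str,
                       content_type: str = "image/jpeg",
                       max_bytes: int = 10 * 1024 * 1024,
                       expiration: int = 300) -> dict:
    """Return {'url': ..., 'fields': {...}} for a browser multipart/form-data POST."""
    return s3_client.generate_presigned_post(
        Bucket=bucket,
        Key=key,
        Fields={"Content-Type": content_type},
        Conditions=build_post_conditions(content_type, max_bytes),
        ExpiresIn=expiration,
    )


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   form = create_upload_form(boto3.client("s3"), "myapp-uploads", "uploads/raw/x.jpg")
```

The browser then posts the returned fields plus the file to the returned URL; S3 rejects uploads that violate any condition.
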
// Example: S3 bucket policy for CloudFront origin access identity (OAI; AWS now recommends origin access control, OAC, for new distributions)
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CloudFrontAccess",
            "Effect": "Allow",
            "Principal": {
                "CanonicalUser": "CLOUDFRONT_OAI_CANONICAL_ID"
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::myapp-public/*"
        }
    ]
}
# Example: Terraform for private bucket with versioning
resource "aws_s3_bucket" "uploads" {
  bucket = "myapp-production-uploads"
}

resource "aws_s3_bucket_versioning" "uploads" {
  bucket = aws_s3_bucket.uploads.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_public_access_block" "uploads" {
  bucket = aws_s3_bucket.uploads.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_server_side_encryption_configuration" "uploads" {
  bucket = aws_s3_bucket.uploads.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
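
To confirm that a bucket actually ends up with all four public-access flags set, a small audit check can help. This is a sketch that assumes a boto3 S3 client is passed in; `missing_public_access_flags` is a hypothetical helper name.

```python
REQUIRED_FLAGS = (
    "BlockPublicAcls",
    "IgnorePublicAcls",
    "BlockPublicPolicy",
    "RestrictPublicBuckets",
)


def missing_public_access_flags(s3_client, bucket: str) -> list[str]:
    """Return the public-access-block flags that are not enabled on the bucket.

    Note: with a real client, get_public_access_block raises a ClientError
    when the bucket has no public access block configuration at all.
    """
    response = s3_client.get_public_access_block(Bucket=bucket)
    config = response.get("PublicAccessBlockConfiguration", {})
    return [flag for flag in REQUIRED_FLAGS if not config.get(flag, False)]


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   print(missing_public_access_flags(boto3.client("s3"), "myapp-production-uploads"))
```

An empty list means the bucket matches the Terraform configuration above.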

3. Upload Large Files with Multipart

For files >100MB, use multipart upload for reliability and performance:

# Example: Multipart upload with Python/Boto3
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# Simple multipart (Boto3 handles splitting)
config = TransferConfig(
    multipart_threshold=25 * 1024 * 1024,    # 25MB
    max_concurrency=10,
    multipart_chunksize=25 * 1024 * 1024,    # 25MB parts
    use_threads=True
)

s3.upload_file(
    'large-video.mp4',
    'myapp-uploads',
    'videos/large-video.mp4',
    Config=config
)

# Manual multipart (the building block for resumable uploads)
def multipart_upload(bucket, key, file_path, part_size=50*1024*1024):
    """Upload a file part by part; on any error the upload is aborted, not resumed."""
    s3 = boto3.client('s3')
    
    # Initiate multipart upload
    mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = mpu['UploadId']
    
    try:
        parts = []
        with open(file_path, 'rb') as f:
            part_num = 1
            while True:
                data = f.read(part_size)
                if not data:
                    break
                
                response = s3.upload_part(
                    Bucket=bucket, Key=key,
                    UploadId=upload_id, PartNumber=part_num,
                    Body=data
                )
                parts.append({
                    'PartNumber': part_num,
                    'ETag': response['ETag']
                })
                part_num += 1
        
        # Complete multipart upload
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={'Parts': parts}
        )
    except Exception:
        # Abort so the parts already uploaded stop accruing storage charges
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
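
To resume an upload rather than abort it, the client needs to know which parts S3 already holds. A minimal sketch, assuming the `UploadId` was persisted somewhere and a boto3 S3 client is passed in; `uploaded_parts` and `parts_to_upload` are hypothetical helper names.

```python
def uploaded_parts(s3_client, bucket: str, key: str, upload_id: str) -> dict[int, str]:
    """Return {part_number: etag} for every part S3 has already stored."""
    parts: dict[int, str] = {}
    kwargs = {"Bucket": bucket, "Key": key, "UploadId": upload_id}
    while True:
        page = s3_client.list_parts(**kwargs)
        for part in page.get("Parts", []):
            parts[part["PartNumber"]] = part["ETag"]
        if not page.get("IsTruncated"):
            return parts
        # Continue after the last part number of the current page.
        kwargs["PartNumberMarker"] = page["NextPartNumberMarker"]


def parts_to_upload(total_parts: int, already_uploaded: dict[int, str]) -> list[int]:
    """Part numbers (1-based) that still have to be sent."""
    return [n for n in range(1, total_parts + 1) if n not in already_uploaded]
```

A resuming client would upload only the missing part numbers, then pass the combined part list (sorted by part number) to `complete_multipart_upload`.
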
// Example: Multipart upload with AWS SDK v3 (Node.js)
import { Upload } from "@aws-sdk/lib-storage";
import { S3Client } from "@aws-sdk/client-s3";
import { createReadStream } from "fs";

const client = new S3Client({ region: "us-east-1" });

const upload = new Upload({
  client,
  params: {
    Bucket: "myapp-uploads",
    Key: "videos/large-file.mp4",
    Body: createReadStream("./large-file.mp4"),
  },
  queueSize: 4,        // Concurrent parts
  partSize: 25 * 1024 * 1024,  // 25MB parts
});

upload.on("httpUploadProgress", (progress) => {
  console.log(`Uploaded ${progress.loaded}/${progress.total}`);
});

await upload.done();
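
When choosing a part size by hand, S3's multipart limits apply: at most 10,000 parts, each between 5 MiB and 5 GiB (the last part may be smaller). The following sketch picks a part size that respects those limits; `choose_part_size` and `part_count` are hypothetical helper names, and the 25 MiB preference simply mirrors the examples above.

```python
MIN_PART_SIZE = 5 * 1024 * 1024          # 5 MiB minimum (except the last part)
MAX_PART_SIZE = 5 * 1024 * 1024 * 1024   # 5 GiB maximum per part
MAX_PARTS = 10_000                       # maximum parts per upload


def choose_part_size(file_size: int, preferred: int = 25 * 1024 * 1024) -> int:
    """Smallest part size >= preferred that keeps the upload within MAX_PARTS."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    needed = -(-file_size // MAX_PARTS)  # integer ceiling division
    part_size = max(preferred, needed, MIN_PART_SIZE)
    if part_size > MAX_PART_SIZE:
        raise ValueError("file is too large for a single multipart upload")
    return part_size


def part_count(file_size: int, part_size: int) -> int:
    """Number of parts needed to cover file_size."""
    return -(-file_size // part_size)


size = 100 * 1024 * 1024  # a 100 MiB file
print(choose_part_size(size), part_count(size, choose_part_size(size)))
# 26214400 4
```

For very large files the preferred size is silently raised, which avoids a failure at part 10,001 late in the upload.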

4. Implement Lifecycle Policies

Automate cost optimization by transitioning or deleting old objects:

// S3 Lifecycle Policy: Transition to cheaper storage, then delete
{
    "Rules": [
        {
            "ID": "RawUploadsLifecycle",
            "Status": "Enabled",
            "Filter": {
                "Prefix": "uploads/raw/"
            },
            "Transitions": [
                {
                    "Days": 30,
                    "StorageClass": "STANDARD_IA"
                },
                {
                    "Days": 90,
                    "StorageClass": "GLACIER"
                }
            ],
            "Expiration": {
                "Days": 365
            }
        },
        {
            "ID": "TempCleanup",
            "Status": "Enabled",
            "Filter": {
                "Prefix": "uploads/temp/"
            },
            "Expiration": {
                "Days": 7
            }
        },
        {
            "ID": "LogArchive",
            "Status": "Enabled",
            "Filter": {
                "Prefix": "logs/"
            },
            "Transitions": [
                {
                    "Days": 30,
                    "StorageClass": "STANDARD_IA"
                }
            ],
            "NoncurrentVersionTransitions": [
                {
                    "NoncurrentDays": 30,
                    "StorageClass": "GLACIER"
                }
            ]
        }
    ]
}
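
Incomplete multipart uploads also accrue storage charges until they are aborted, and the policy above does not cover them. A rule for that could be built and applied as sketched below, assuming a boto3 S3 client is passed in; `abort_incomplete_rule` and `apply_lifecycle` are hypothetical helper names.

```python
def abort_incomplete_rule(days: int = 7, prefix: str = "") -> dict:
    """Lifecycle rule that aborts multipart uploads left unfinished for `days`."""
    return {
        "ID": f"AbortIncompleteMultipart{days}d",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},  # an empty prefix matches the whole bucket
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
    }


def apply_lifecycle(s3_client, bucket: str, rules: list[dict]) -> None:
    """Replace the bucket's lifecycle configuration with `rules`.

    Caution: this call overwrites the existing configuration, so `rules`
    must contain every rule the bucket should keep.
    """
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules},
    )


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   apply_lifecycle(boto3.client("s3"), "myapp-production", [abort_incomplete_rule()])
```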

Lifecycle strategy by data type:

| Data Type | Hot (Standard) | Cool (IA/Nearline) | Cold (Glacier/Archive) | Delete |
|---|---|---|---|---|
| User uploads | 30 days | 30-90 days | 90-365 days | 1-2 years |
| Processed images | 90 days | 90-180 days | 1 year | 2 years |
| Database backups | 7 days | 7-30 days | 30-90 days | 90 days |
| Application logs | 7 days | 7-30 days | 30-90 days | 1 year |
| Temp/processing | 0-7 days | Never | Never | 7 days |
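
To reason about a schedule like the ones in the table, it can help to express it as data and ask which tier an object of a given age should be in. The sketch below encodes the user-uploads row with the thresholds from the lifecycle policy above (30, 90, and 365 days); `tier_for_age` and `USER_UPLOADS` are hypothetical names. Real lifecycle transitions run asynchronously, so actual tiering can lag these boundaries.

```python
from bisect import bisect_right
from typing import Optional

# (minimum age in days, storage class), sorted by age
USER_UPLOADS = [(0, "STANDARD"), (30, "STANDARD_IA"), (90, "GLACIER")]


def tier_for_age(age_days: int,
                 schedule: list[tuple[int, str]] = USER_UPLOADS,
                 expire_after_days: int = 365) -> Optional[str]:
    """Return the expected storage class, or None once the object should be expired."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days >= expire_after_days:
        return None
    thresholds = [days for days, _ in schedule]
    # Index of the last threshold that is <= age_days.
    index = bisect_right(thresholds, age_days) - 1
    return schedule[index][1]


for age in (0, 29, 30, 90, 365):
    print(age, tier_for_age(age))
# 0 STANDARD
# 29 STANDARD
# 30 STANDARD_IA
# 90 GLACIER
# 365 None
```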

5. Optimize for Cost and Performance

| Optimization | Implementation | Savings (approximate) |
|---|---|---|
| Storage class selection | Use IA/Coldline for infrequent access | 40-80% |
| Lifecycle transitions | Auto-move old data to cheaper tiers | 50-90% |
| Delete incomplete multipart | Abort incomplete uploads after 7 days | Prevents waste |
| Compress before upload | Gzip text files, WebP images | 30-70% |
| Use CloudFront/CDN | Cache frequently accessed objects | Can reduce S3 egress by 80%+ at high cache hit ratios |
| S3 Transfer Acceleration | For global uploads from distant clients | Faster uploads, at an added per-GB fee |
| Requester Pays | For public datasets | Offload bandwidth costs |

# Example: Compress before upload
import gzip
import boto3

s3 = boto3.client('s3')

def upload_compressed(bucket, key, data):
    """Upload gzip-compressed data with Content-Encoding header."""
    compressed = gzip.compress(data.encode('utf-8'))
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=compressed,
        ContentEncoding='gzip',
        ContentType='application/json'
    )

Best Practices

  • Never make buckets public. Use presigned URLs or CloudFront origin access (OAC, or the legacy OAI) for controlled access.
  • Enable versioning on production buckets. Protects against accidental deletion and overwrites.
  • Use server-side encryption by default. SSE-S3 or SSE-KMS depending on compliance needs.
  • Implement object locking for compliance. WORM (Write Once Read Many) for regulatory data.
  • Monitor with CloudTrail/CloudWatch. Track access patterns, costs, and unauthorized attempts.
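  • Use checksums for integrity. MD5 or SHA-256 checksums verify data was not corrupted in transit. An ETag equals the object's MD5 only for single-part uploads that are unencrypted or use SSE-S3, so do not treat it as a general-purpose checksum.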
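
As an illustration of the checksum practice, S3's `put_object` accepts a base64-encoded SHA-256 digest in the `ChecksumSHA256` parameter and rejects the upload if the received bytes do not match. A minimal sketch, assuming a boto3 S3 client is passed in; `sha256_b64` and `put_with_checksum` are hypothetical helper names.

```python
import base64
import hashlib


def sha256_b64(data: bytes) -> str:
    """Base64-encoded SHA-256 digest, the format S3 expects for ChecksumSHA256."""
    return base64.b64encode(hashlib.sha256(data).digest()).decode("ascii")


def put_with_checksum(s3_client, bucket: str, key: str, data: bytes) -> dict:
    """Upload `data` and let S3 verify it against the client-side digest."""
    return s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=data,
        ChecksumSHA256=sha256_b64(data),
    )


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   put_with_checksum(boto3.client("s3"), "myapp-uploads", "docs/report.pdf", payload)
```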

Common Mistakes

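  • Storing small files individually. Infrequent-access classes bill a minimum object size (128KB on S3 Standard-IA), and per-request charges add up. Batch small objects or use a database.
  • Using blob storage as a filesystem. Listing a large prefix is slow and billed per request, because results are paginated. Store metadata in a database.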
  • No lifecycle policy. Production buckets accumulate years of unused data without automatic cleanup.
  • Storing secrets in buckets. Use parameter stores or secret managers, not S3 objects.
  • Ignoring egress costs. Serving large files directly from S3 to users is expensive. Use a CDN.
  • No multipart for large files. Uploading a 10GB file as a single PUT is unreliable and slow.
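
To make the listing cost concrete: S3 returns at most 1,000 keys per list request, so summarizing a prefix with a million objects takes about a thousand sequential, billed requests. A sketch of such a scan, assuming a boto3 S3 client is passed in; `summarize_prefix` is a hypothetical helper name.

```python
def summarize_prefix(s3_client, bucket: str, prefix: str) -> tuple[int, int]:
    """Return (object_count, total_bytes) under a prefix.

    Every page is a separate list request, which is why doing this on a hot
    path is slow and why object metadata is better kept in a database.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    count = 0
    total_bytes = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages without matching keys have no "Contents" entry.
        for obj in page.get("Contents", []):
            count += 1
            total_bytes += obj["Size"]
    return count, total_bytes


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   print(summarize_prefix(boto3.client("s3"), "myapp-production", "logs/2024/06/"))
```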

Variants

  • MinIO: Self-hosted S3-compatible object storage for on-premises or edge
  • Ceph: Open-source distributed object store for private cloud
  • Backblaze B2: Cost-effective S3-compatible alternative (roughly 1/4 the per-GB storage price; verify against current pricing)
  • Cloudflare R2: Zero egress fee object storage, S3-compatible API
  • NAS/SAN: Traditional block/file storage for applications needing POSIX semantics

FAQ

Q: Should I use one bucket or many? Use separate buckets for different environments (prod, staging, dev) and different security domains (public assets vs private uploads). Within an environment, use prefixes (folders) rather than many buckets.

Q: How do I handle millions of small files? Batch them into larger archive objects (tar, zip) and use a database to track where each file lives, or split responsibilities: keep metadata in a database such as DynamoDB and store only the blobs in S3.
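
One way to batch small files while keeping them individually readable is to concatenate them into a single object and record each file's byte range in an index. The sketch below uses a plain concatenation format rather than tar or zip to keep the offsets obvious; `pack_blobs` and `range_header` are hypothetical helper names, and the index would live in your database.

```python
def pack_blobs(blobs: dict[str, bytes]) -> tuple[bytes, dict[str, tuple[int, int]]]:
    """Concatenate blobs and return (packed_bytes, {name: (offset, length)})."""
    index: dict[str, tuple[int, int]] = {}
    chunks: list[bytes] = []
    offset = 0
    for name, data in blobs.items():
        index[name] = (offset, len(data))
        chunks.append(data)
        offset += len(data)
    return b"".join(chunks), index


def range_header(offset: int, length: int) -> str:
    """HTTP Range value for one packed file; both ends are inclusive."""
    if length <= 0:
        raise ValueError("length must be positive")
    return f"bytes={offset}-{offset + length - 1}"


packed, index = pack_blobs({"a.txt": b"hello", "b.txt": b"world!"})
print(index["b.txt"], range_header(*index["b.txt"]))
# (5, 6) bytes=5-10
```

With boto3, the value from `range_header` can be passed as the `Range` argument of `get_object` to fetch a single file out of the packed object.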

Q: What is the maximum file size? S3: 5TB (with multipart). GCS: 5TB. Azure: about 190.7 TiB for block blobs on current API versions (4.75TB on older ones). For larger, split into chunks.

Q: How do I migrate from one provider to another? Use tools like rclone, aws s3 sync, or cloud-native transfer services (AWS DataSync, Azure Data Box). For large migrations, consider physical data transfer appliances.

Conclusion

Blob storage is the backbone of modern cloud data architectures. By designing buckets for your access patterns, securing access with presigned URLs and IAM policies, automating lifecycle transitions, and optimizing large file uploads, you build a storage layer that scales with your data while keeping costs under control.