Database Sharding — Horizontal Partitioning in Practice

Overview

Database sharding splits a single database into multiple smaller databases (shards) to distribute load and storage. When vertical scaling (bigger machines) becomes too expensive or hits physical limits, horizontal partitioning allows your database layer to grow by adding nodes rather than upgrading existing ones.

This guide covers when to shard, how to choose shard keys, query routing, rebalancing, and operational considerations.

When to Use

Your database exceeds 1TB of data and backup/restore times are unacceptable
Write throughput exceeds what a single node can handle (>5k writes/sec)
You have run out of CPU, memory, or I/O on your largest available instance
Read replicas cannot keep up with replication lag
Maintenance operations (index rebuilds, schema changes) take hours
You need geographic data distribution for compliance or latency

When NOT to Use

Your database is under 500GB — vertical scaling and read replicas are simpler
Your workload is read-heavy — read replicas and caching solve this without sharding
You have complex cross-shard joins — sharding makes them prohibitively expensive
Your team lacks operational experience with distributed databases
You have not exhausted query optimization and indexing improvements

Core Concepts

Concept	Description
Shard	A horizontal partition of data stored on a separate database node
Shard Key	The column(s) used to determine which shard stores a row
Routing	The logic that directs a query to the correct shard(s)
Hot Spot	A shard that receives disproportionately more load than others
Rebalancing	Moving data between shards to equalize load or storage
Global Table	A small table replicated to all shards for local joins

Sharding Architectures

┌──────────────┐
│  Application │
└──────┬───────┘
       │
  ┌────┴────┐
  │ Router  │  (Shard key → shard mapping)
  │ (Vitess)│
  └────┬────┘
       │
   ┌───┼───┐
   │   │   │
┌──▼─┐│┌─▼─┐│┌─▼──┐
│Shard││Shard││Shard│
│  0  ││  1  ││  2  │
└─────┘└─────┘└─────┘

Step-by-Step Sharding Implementation

1. Choose Your Shard Key

The shard key is the most important decision. A poor choice creates hot spots and defeats the purpose.

Good shard key characteristics:

High cardinality (many unique values)
Even distribution (no single value dominates)
Frequently used in WHERE clauses
Immutable or rarely changed

Use Case	Shard Key	Why
Multi-tenant SaaS	`tenant_id`	Natural isolation per customer
Social media	`user_id`	User data accessed together
E-commerce	`customer_id` or `order_id`	Orders and customer data co-located
Time-series	`timestamp` + `device_id`	Time-range queries hit few shards
Gaming	`player_id`	Player sessions and inventory together

-- Example: Hash-based sharding on user_id
-- Shard = hash(user_id) % number_of_shards

CREATE TABLE orders (
    order_id BIGINT,
    user_id BIGINT NOT NULL,
    amount DECIMAL(10,2),
    created_at TIMESTAMP,
    -- user_id is the shard key
    PRIMARY KEY (order_id, user_id)
);

# Example: Application-level shard routing
def get_shard_for_user(user_id):
    """Consistent hash routing."""
    return hash(user_id) % NUM_SHARDS

def get_shard_connection(user_id):
    shard = get_shard_for_user(user_id)
    return shard_connections[shard]

# Query execution
def get_user_orders(user_id):
    conn = get_shard_connection(user_id)
    return conn.query("SELECT * FROM orders WHERE user_id = %s", user_id)

Shard key anti-patterns:

Auto-increment IDs: Sequential inserts hit the same shard (monotonic write problem)
Low-cardinality keys: Gender, status, boolean — creates massive hot spots
Time-only keys: Recent data hits one shard (time-series need composite keys)
Frequently updated keys: Changing shard key requires moving data between shards

2. Implement Query Routing

Every query must know which shard(s) to hit:

# Example: Router middleware for sharded queries
class ShardRouter:
    def __init__(self, shards):
        self.shards = shards
    
    def route(self, query, params):
        """Route query to the appropriate shard(s)."""
        shard_key = self.extract_shard_key(query, params)
        
        if shard_key:
            # Single shard query
            shard = hash(shard_key) % len(self.shards)
            return [self.shards[shard]]
        else:
            # Scatter-gather: query all shards
            return self.shards
    
    def extract_shard_key(self, query, params):
        # Parse query to find shard key in WHERE clause
        if 'user_id' in params:
            return params['user_id']
        return None

# Single shard (fast)
orders = router.route("SELECT * FROM orders WHERE user_id = ?", {"user_id": 123})

# Multi-shard (slow, avoid in production)
all_orders = router.route("SELECT * FROM orders WHERE amount > ?", {"amount": 100})

Routing strategies:

Strategy	How It Works	Best For
Hash-based	`shard = hash(key) % N`	Even distribution, no metadata
Range-based	Shard 0: 1-1M, Shard 1: 1M-2M	Time-series, sequential access
Directory-based	Lookup table maps key → shard	Flexible, allows rebalancing
Consistent Hashing	Minimal redistribution on add/remove	Dynamic cluster sizing

3. Handle Cross-Shard Operations

Cross-shard queries are the biggest sharding pain point:

-- AVOID: Cross-shard JOIN (expensive)
SELECT u.name, o.amount
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.amount > 1000;
-- If users and orders are sharded differently, this requires
-- fetching data from multiple shards and joining in application

-- PREFER: Denormalize or application-level join
-- Application code:
high_value_orders = shard.query("SELECT user_id, amount FROM orders WHERE amount > 1000")
user_ids = [o.user_id for o in high_value_orders]
users = user_shard.query("SELECT id, name FROM users WHERE id IN %s", user_ids)
# Join in application memory

Cross-shard strategies:

Problem	Solution	Trade-off
Cross-shard JOINs	Denormalize, application join, or global tables	More storage, complexity
Aggregations (SUM, COUNT)	Pre-aggregate or use a data warehouse	Stale data, extra system
Unique constraints	Application-level check or UUID	Eventual consistency
Transactions	Saga pattern or avoid multi-shard TX	Complexity, no ACID
Auto-increment IDs	Snowflake IDs, UUID, or sequence tables	Coordination overhead

4. Plan for Rebalancing

Shards inevitably become uneven. Plan rebalancing from day one:

# Example: Rebalancing script (simplified)
def rebalance_shards():
    """Move data from overloaded shards to underloaded ones."""
    shard_sizes = [get_shard_size(i) for i in range(NUM_SHARDS)]
    avg_size = sum(shard_sizes) / NUM_SHARDS
    
    for shard_id in range(NUM_SHARDS):
        if shard_sizes[shard_id] > avg_size * 1.2:
            # This shard is overloaded
            excess = shard_sizes[shard_id] - avg_size
            target_shard = find_underloaded_shard()
            
            # Move a range of data
            move_data_range(shard_id, target_shard, excess)
    
def move_data_range(source, target, bytes_to_move):
    """Move data in batches to minimize downtime."""
    batch_size = 1000
    cursor = get_cursor(source)
    
    while bytes_moved < bytes_to_move:
        rows = cursor.fetchmany(batch_size)
        insert_into_shard(target, rows)
        delete_from_shard(source, rows)
        bytes_moved += estimate_size(rows)

Rebalancing approaches:

Approach	Downtime	Complexity	Use Case
Online rebalancing	None	High	Production systems (Vitess, Citus)
Dual-write migration	None	Medium	Gradual cutover with validation
Snapshot + replay	Brief read-only	Low	Small databases, maintenance window
Consistent hashing	None	Medium	Adding/removing shards dynamically

5. Use Sharding Middleware

Do not build your own shard router unless you have to:

Solution	Database	Type	Best For
Vitess	MySQL	Proxy/router	Large-scale MySQL (YouTube, Slack)
Citus	PostgreSQL	Extension	PostgreSQL sharding with minimal changes
MongoDB	MongoDB	Native	Document-based, flexible schema
CockroachDB	PostgreSQL-compatible	Native	Global distribution, strong consistency
TiDB	MySQL-compatible	Native	HTAP (hybrid transactional/analytical)
YugabyteDB	PostgreSQL/CQL-compatible	Native	Cloud-native, planet-scale

-- Example: Citus (PostgreSQL extension)
-- Convert a table to a distributed table

-- Add Citus extension
CREATE EXTENSION IF NOT EXISTS citus;

-- Create distributed table
SELECT create_distributed_table('orders', 'user_id');

-- Citus handles routing, rebalancing, and distributed queries
-- Most queries work unchanged
SELECT * FROM orders WHERE user_id = 123;  -- Routed to single shard

# Example: Vitess configuration snippet
# vschema.json defines sharding logic
{
  "sharded": true,
  "vindexes": {
    "hash": {
      "type": "hash"
    }
  },
  "tables": {
    "orders": {
      "column_vindexes": [
        {
          "column": "user_id",
          "name": "hash"
        }
      ]
    }
  }
}

Best Practices

Start with directory-based routing. It is easier to rebalance than hash-based routing.
Keep shards as large as possible. Fewer, larger shards are easier to manage than many small ones.
Design for the rebalancing event. It will happen. Have runbooks ready.
Avoid cross-shard transactions. Use sagas, outbox pattern, or design around the need.
Monitor shard balance. Alert when any shard exceeds 120% of average size or QPS.
Test with production-like data volumes. Small test datasets hide hot spot problems.
Plan your global tables. Small lookup tables (countries, currencies) should be replicated to all shards.

Common Mistakes

Sharding too early. Sharding adds massive complexity. Exhaust vertical scaling and read replicas first.
Poor shard key choice. A bad shard key is worse than no sharding. Test distribution with production data.
Ignoring cross-shard queries. Queries that worked on a single node fail or become slow after sharding.
No rebalancing plan. Uneven shards create hot spots that negate the benefits of sharding.
Losing ACID semantics. Multi-shard transactions require application-level coordination.
Underestimating operational overhead. Sharded databases are harder to backup, monitor, and troubleshoot.

Variants

Functional sharding: Split by domain (users db, orders db) rather than by row — simpler, no router needed
Zonal sharding: Shard by geography (EU data in EU shards) for compliance
Hybrid sharding: Shard large tables, replicate small tables — the most common pattern
Auto-sharding: Managed services (Amazon Aurora, Google Spanner, Azure Cosmos DB) handle sharding transparently

FAQ

Q: How many shards should I start with? Start with 4-8 shards. Fewer shards are easier to manage. You can split shards later ( Vitess, Citus support this).

Q: What is the difference between sharding and partitioning? Partitioning splits data within a single database instance. Sharding splits data across multiple independent instances. Partitioning is simpler but does not scale beyond one machine.

Q: Can I change my shard key later? Changing a shard key requires migrating all data. It is possible but painful. Invest in choosing the right key upfront.

Q: Do I need a shard router? Yes, unless you use a natively sharded database (MongoDB, CockroachDB, YugabyteDB). For PostgreSQL and MySQL, use Citus or Vitess.

Conclusion

Database sharding is a powerful but complex scaling strategy. By choosing the right shard key, implementing robust routing, and planning for rebalancing, you can scale your database layer horizontally. But shard only when necessary — the operational overhead is significant, and many workloads can be solved with simpler approaches.