Skip to content
SP StackPractices
advanced By StackPractices

Database Sharding — Horizontal Partitioning in Practice

A practical guide to database sharding: choosing shard keys, routing queries, rebalancing data, and avoiding common pitfalls when scaling beyond a single database node.

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

Database sharding splits a single database into multiple smaller databases (shards) to distribute load and storage. When vertical scaling (bigger machines) becomes too expensive or hits physical limits, horizontal partitioning allows your database layer to grow by adding nodes rather than upgrading existing ones.

This guide covers when to shard, how to choose shard keys, query routing, rebalancing, and operational considerations.

When to Use

  • Your database exceeds 1TB of data and backup/restore times are unacceptable
  • Write throughput exceeds what a single node can handle (>5k writes/sec)
  • You have run out of CPU, memory, or I/O on your largest available instance
  • Read replicas cannot keep up with replication lag
  • Maintenance operations (index rebuilds, schema changes) take hours
  • You need geographic data distribution for compliance or latency

When NOT to Use

  • Your database is under 500GB — vertical scaling and read replicas are simpler
  • Your workload is read-heavy — read replicas and caching solve this without sharding
  • You have complex cross-shard joins — sharding makes them prohibitively expensive
  • Your team lacks operational experience with distributed databases
  • You have not exhausted query optimization and indexing improvements

Core Concepts

ConceptDescription
ShardA horizontal partition of data stored on a separate database node
Shard KeyThe column(s) used to determine which shard stores a row
RoutingThe logic that directs a query to the correct shard(s)
Hot SpotA shard that receives disproportionately more load than others
RebalancingMoving data between shards to equalize load or storage
Global TableA small table replicated to all shards for local joins

Sharding Architectures

┌──────────────┐
│  Application │
└──────┬───────┘

  ┌────┴────┐
  │ Router  │  (Shard key → shard mapping)
  │ (Vitess)│
  └────┬────┘

   ┌───┼───┐
   │   │   │
┌──▼─┐│┌─▼─┐│┌─▼──┐
│Shard││Shard││Shard│
│  0  ││  1  ││  2  │
└─────┘└─────┘└─────┘

Step-by-Step Sharding Implementation

1. Choose Your Shard Key

The shard key is the most important decision. A poor choice creates hot spots and defeats the purpose.

Good shard key characteristics:

  • High cardinality (many unique values)
  • Even distribution (no single value dominates)
  • Frequently used in WHERE clauses
  • Immutable or rarely changed
Use CaseShard KeyWhy
Multi-tenant SaaStenant_idNatural isolation per customer
Social mediauser_idUser data accessed together
E-commercecustomer_id or order_idOrders and customer data co-located
Time-seriestimestamp + device_idTime-range queries hit few shards
Gamingplayer_idPlayer sessions and inventory together
-- Example: Hash-based sharding on user_id
-- Shard = hash(user_id) % number_of_shards

CREATE TABLE orders (
    order_id BIGINT,
    user_id BIGINT NOT NULL,
    amount DECIMAL(10,2),
    created_at TIMESTAMP,
    -- user_id is the shard key
    PRIMARY KEY (order_id, user_id)
);
# Example: Application-level shard routing
def get_shard_for_user(user_id):
    """Consistent hash routing."""
    return hash(user_id) % NUM_SHARDS

def get_shard_connection(user_id):
    shard = get_shard_for_user(user_id)
    return shard_connections[shard]

# Query execution
def get_user_orders(user_id):
    conn = get_shard_connection(user_id)
    return conn.query("SELECT * FROM orders WHERE user_id = %s", user_id)

Shard key anti-patterns:

  • Auto-increment IDs: Sequential inserts hit the same shard (monotonic write problem)
  • Low-cardinality keys: Gender, status, boolean — creates massive hot spots
  • Time-only keys: Recent data hits one shard (time-series need composite keys)
  • Frequently updated keys: Changing shard key requires moving data between shards

2. Implement Query Routing

Every query must know which shard(s) to hit:

# Example: Router middleware for sharded queries
class ShardRouter:
    def __init__(self, shards):
        self.shards = shards
    
    def route(self, query, params):
        """Route query to the appropriate shard(s)."""
        shard_key = self.extract_shard_key(query, params)
        
        if shard_key:
            # Single shard query
            shard = hash(shard_key) % len(self.shards)
            return [self.shards[shard]]
        else:
            # Scatter-gather: query all shards
            return self.shards
    
    def extract_shard_key(self, query, params):
        # Parse query to find shard key in WHERE clause
        if 'user_id' in params:
            return params['user_id']
        return None

# Single shard (fast)
orders = router.route("SELECT * FROM orders WHERE user_id = ?", {"user_id": 123})

# Multi-shard (slow, avoid in production)
all_orders = router.route("SELECT * FROM orders WHERE amount > ?", {"amount": 100})

Routing strategies:

StrategyHow It WorksBest For
Hash-basedshard = hash(key) % NEven distribution, no metadata
Range-basedShard 0: 1-1M, Shard 1: 1M-2MTime-series, sequential access
Directory-basedLookup table maps key → shardFlexible, allows rebalancing
Consistent HashingMinimal redistribution on add/removeDynamic cluster sizing

3. Handle Cross-Shard Operations

Cross-shard queries are the biggest sharding pain point:

-- AVOID: Cross-shard JOIN (expensive)
SELECT u.name, o.amount
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.amount > 1000;
-- If users and orders are sharded differently, this requires
-- fetching data from multiple shards and joining in application

-- PREFER: Denormalize or application-level join
-- Application code:
high_value_orders = shard.query("SELECT user_id, amount FROM orders WHERE amount > 1000")
user_ids = [o.user_id for o in high_value_orders]
users = user_shard.query("SELECT id, name FROM users WHERE id IN %s", user_ids)
# Join in application memory

Cross-shard strategies:

ProblemSolutionTrade-off
Cross-shard JOINsDenormalize, application join, or global tablesMore storage, complexity
Aggregations (SUM, COUNT)Pre-aggregate or use a data warehouseStale data, extra system
Unique constraintsApplication-level check or UUIDEventual consistency
TransactionsSaga pattern or avoid multi-shard TXComplexity, no ACID
Auto-increment IDsSnowflake IDs, UUID, or sequence tablesCoordination overhead

4. Plan for Rebalancing

Shards inevitably become uneven. Plan rebalancing from day one:

# Example: Rebalancing script (simplified)
def rebalance_shards():
    """Move data from overloaded shards to underloaded ones."""
    shard_sizes = [get_shard_size(i) for i in range(NUM_SHARDS)]
    avg_size = sum(shard_sizes) / NUM_SHARDS
    
    for shard_id in range(NUM_SHARDS):
        if shard_sizes[shard_id] > avg_size * 1.2:
            # This shard is overloaded
            excess = shard_sizes[shard_id] - avg_size
            target_shard = find_underloaded_shard()
            
            # Move a range of data
            move_data_range(shard_id, target_shard, excess)
    
def move_data_range(source, target, bytes_to_move):
    """Move data in batches to minimize downtime."""
    batch_size = 1000
    cursor = get_cursor(source)
    
    while bytes_moved < bytes_to_move:
        rows = cursor.fetchmany(batch_size)
        insert_into_shard(target, rows)
        delete_from_shard(source, rows)
        bytes_moved += estimate_size(rows)

Rebalancing approaches:

ApproachDowntimeComplexityUse Case
Online rebalancingNoneHighProduction systems (Vitess, Citus)
Dual-write migrationNoneMediumGradual cutover with validation
Snapshot + replayBrief read-onlyLowSmall databases, maintenance window
Consistent hashingNoneMediumAdding/removing shards dynamically

5. Use Sharding Middleware

Do not build your own shard router unless you have to:

SolutionDatabaseTypeBest For
VitessMySQLProxy/routerLarge-scale MySQL (YouTube, Slack)
CitusPostgreSQLExtensionPostgreSQL sharding with minimal changes
MongoDBMongoDBNativeDocument-based, flexible schema
CockroachDBPostgreSQL-compatibleNativeGlobal distribution, strong consistency
TiDBMySQL-compatibleNativeHTAP (hybrid transactional/analytical)
YugabyteDBPostgreSQL/CQL-compatibleNativeCloud-native, planet-scale
-- Example: Citus (PostgreSQL extension)
-- Convert a table to a distributed table

-- Add Citus extension
CREATE EXTENSION IF NOT EXISTS citus;

-- Create distributed table
SELECT create_distributed_table('orders', 'user_id');

-- Citus handles routing, rebalancing, and distributed queries
-- Most queries work unchanged
SELECT * FROM orders WHERE user_id = 123;  -- Routed to single shard
# Example: Vitess configuration snippet
# vschema.json defines sharding logic
{
  "sharded": true,
  "vindexes": {
    "hash": {
      "type": "hash"
    }
  },
  "tables": {
    "orders": {
      "column_vindexes": [
        {
          "column": "user_id",
          "name": "hash"
        }
      ]
    }
  }
}

Best Practices

  • Start with directory-based routing. It is easier to rebalance than hash-based routing.
  • Keep shards as large as possible. Fewer, larger shards are easier to manage than many small ones.
  • Design for the rebalancing event. It will happen. Have runbooks ready.
  • Avoid cross-shard transactions. Use sagas, outbox pattern, or design around the need.
  • Monitor shard balance. Alert when any shard exceeds 120% of average size or QPS.
  • Test with production-like data volumes. Small test datasets hide hot spot problems.
  • Plan your global tables. Small lookup tables (countries, currencies) should be replicated to all shards.

Common Mistakes

  • Sharding too early. Sharding adds massive complexity. Exhaust vertical scaling and read replicas first.
  • Poor shard key choice. A bad shard key is worse than no sharding. Test distribution with production data.
  • Ignoring cross-shard queries. Queries that worked on a single node fail or become slow after sharding.
  • No rebalancing plan. Uneven shards create hot spots that negate the benefits of sharding.
  • Losing ACID semantics. Multi-shard transactions require application-level coordination.
  • Underestimating operational overhead. Sharded databases are harder to backup, monitor, and troubleshoot.

Variants

  • Functional sharding: Split by domain (users db, orders db) rather than by row — simpler, no router needed
  • Zonal sharding: Shard by geography (EU data in EU shards) for compliance
  • Hybrid sharding: Shard large tables, replicate small tables — the most common pattern
  • Auto-sharding: Managed services (Amazon Aurora, Google Spanner, Azure Cosmos DB) handle sharding transparently

FAQ

Q: How many shards should I start with? Start with 4-8 shards. Fewer shards are easier to manage. You can split shards later ( Vitess, Citus support this).

Q: What is the difference between sharding and partitioning? Partitioning splits data within a single database instance. Sharding splits data across multiple independent instances. Partitioning is simpler but does not scale beyond one machine.

Q: Can I change my shard key later? Changing a shard key requires migrating all data. It is possible but painful. Invest in choosing the right key upfront.

Q: Do I need a shard router? Yes, unless you use a natively sharded database (MongoDB, CockroachDB, YugabyteDB). For PostgreSQL and MySQL, use Citus or Vitess.

Conclusion

Database sharding is a powerful but complex scaling strategy. By choosing the right shard key, implementing robust routing, and planning for rebalancing, you can scale your database layer horizontally. But shard only when necessary — the operational overhead is significant, and many workloads can be solved with simpler approaches.