Skip to content
SP StackPractices
beginner By StackPractices

Generate Test Data

How to generate realistic, deterministic test data with Faker, factory-boy, and type-aware generators for robust test suites in Python, JavaScript, and Java.

Topics: testing

Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.

Overview

Hardcoded test data (name = "John", email = "test@test.com") quickly becomes stale, fails to expose edge cases, and does not represent production data distributions. Generators produce realistic, varied, and deterministic data that makes tests more robust while reducing manual fixture maintenance.

When to Use

  • You maintain dozens of hardcoded test objects that drift from the production schema
  • Edge cases (empty strings, Unicode, very long values) are never tested because they are tedious to write
  • Integration tests need a database seeded with hundreds of realistic rows
  • You want tests to exercise validation rules with varied input distributions
  • Load tests need large volumes of plausible data

When NOT to Use

  • The test requires a very specific, known scenario — hardcode it explicitly
  • Determinism across runs is more important than data variety — seed the generator but keep values minimal
  • The data schema is extremely simple (2-3 fields) — a literal object is clearer
  • You are testing a Faker-like library itself — use controlled, predictable inputs

Step-by-Step Implementation

Python

from faker import Faker
from dataclasses import dataclass
from typing import List
import factory
from factory import Faker as FactoryFaker

fake = Faker()
Faker.seed(12345)  # Deterministic across runs

# Basic Faker usage
fake.name()        # 'John Smith'
fake.email()       # 'john.smith@example.com'
fake.ipv4()        # '192.168.1.45'
fake.uuid4()       # '550e8400-e29b-41d4-a716-446655440000'

# factory-boy for ORM objects
@dataclass
class User:
    id: int
    name: str
    email: str
    age: int
    is_active: bool

class UserFactory(factory.Factory):
    class Meta:
        model = User

    id = factory.Sequence(lambda n: n)
    name = FactoryFaker('name')
    email = FactoryFaker('email')
    age = factory.Faker('random_int', min=18, max=90)
    is_active = True

# Usage
user = UserFactory()           # Single instance
users = UserFactory.build_batch(100)  # 100 instances
admin = UserFactory(name="Admin User", age=30)

# Custom provider for domain-specific data
from faker.providers import BaseProvider

class ProductProvider(BaseProvider):
    def sku(self):
        categories = ['ELEC', 'BOOK', 'HOME', 'TOY']
        return f"{self.random_element(categories)}-{self.random_int(1000, 9999)}"

fake.add_provider(ProductProvider)
fake.sku()  # 'ELEC-4521'

# Deterministic dataset for property-based tests
import hypothesis.strategies as st

user_strategy = st.builds(
    User,
    id=st.integers(min_value=1),
    name=st.text(min_size=1, max_size=100),
    email=st.emails(),
    age=st.integers(min_value=0, max_value=120),
    is_active=st.booleans()
)

JavaScript

import { faker } from '@faker-js/faker';

// Seed for determinism
faker.seed(12345);

// Basic generators
faker.person.fullName();    // 'John Smith'
faker.internet.email();     // 'john.smith@example.com'
faker.number.int({ min: 18, max: 65 });  // 34

// Factory function
function createUser(overrides = {}) {
    return {
        id: faker.string.uuid(),
        name: faker.person.fullName(),
        email: faker.internet.email(),
        age: faker.number.int({ min: 18, max: 90 }),
        avatar: faker.image.avatar(),
        isActive: true,
        ...overrides
    };
}

// Generate batch
const users = Array.from({ length: 100 }, () => createUser());

// Domain-specific faker helpers
const createOrder = (overrides = {}) => ({
    id: faker.string.uuid(),
    customerId: faker.string.uuid(),
    items: Array.from({ length: faker.number.int({ min: 1, max: 5 }) }, () => ({
        sku: `SKU-${faker.string.alphanumeric(6).toUpperCase()}`,
        quantity: faker.number.int({ min: 1, max: 10 }),
        price: faker.commerce.price({ min: 5, max: 500 })
    })),
    status: faker.helpers.arrayElement(['pending', 'paid', 'shipped', 'delivered']),
    createdAt: faker.date.past(),
    ...overrides
});

// Deterministic data for snapshots
faker.seed(42);
const snapshotUser = createUser({ name: 'Snapshot User' });

Java

import net.datafaker.Faker;
import java.util.List;
import java.util.stream.IntStream;

public class TestDataGenerator {
    private static final Faker faker = new Faker();

    public static User createUser() {
        return User.builder()
            .id(faker.number().randomNumber())
            .name(faker.name().fullName())
            .email(faker.internet().emailAddress())
            .age(faker.number().numberBetween(18, 90))
            .isActive(true)
            .build();
    }

    public static List<User> createUsers(int count) {
        return IntStream.range(0, count)
            .mapToObj(i -> createUser())
            .toList();
    }

    // JUnit 5 parameterized with generated data
    public static Stream<Arguments> emailProvider() {
        return Stream.generate(() -> Arguments.of(faker.internet().emailAddress()))
            .limit(50);
    }
}

// Instancio for type-aware generation
import org.instancio.Instancio;
import org.instancio.Select;

User user = Instancio.of(User.class)
    .set(Select.field("role"), "admin")
    .generate(Select.field("age"), gen -> gen.ints().range(18, 90))
    .create();

List<User> users = Instancio.ofList(User.class).size(100).create();

Best Practices

  • Always seed your random generator. Without a seed, a test that fails on CI may pass locally because the data was different. Set Faker.seed() or faker.seed() in a global setup file.
  • Override specific fields for scenario tests. createUser({ role: 'admin' }) is clearer than hoping the random generator happens to produce an admin.
  • Use realistic distributions. A random age between 0 and 120 will mostly produce invalid data. Constrain ranges to match your domain (18-90 for adult users).
  • Generate data close to the test. A global users.json fixture file drifts from the schema. Generate programmatically so adding a new field updates all tests automatically.
  • Include edge cases intentionally. Explicitly test empty strings, maximum lengths, Unicode, and null values alongside happy-path generated data.

Common Mistakes

  • Unseeded random data. Tests fail intermittently because a random email happened to match a uniqueness constraint, or a random string happened to contain a SQL injection pattern.
  • Overly permissive ranges. faker.number.int() defaults to large ranges that may violate business rules (negative prices, 200-character names).
  • Mixing generated and hardcoded data inconsistently. Some tests use Faker, others use literals — the test suite has inconsistent coverage and developers do not know which to reach for.
  • Not regenerating static fixture files. Exporting a JSON fixture once and checking it into git means the data never exercises new validation rules added after the export.
  • Generators that depend on each other. createOrder() calling createUser() internally hides the user from the test, making assertions on the relationship impossible.