Find and Remove Duplicate Rows in SQL
Detect duplicate records in SQL tables using GROUP BY and HAVING, then remove them safely while keeping the canonical row.
Overview
Duplicate rows creep into tables through application bugs, import scripts, or race conditions. They waste space, distort analytics, and prevent you from adding the unique constraints you intended to enforce. Finding them means grouping by the columns that define uniqueness; removing them safely means keeping one canonical row per group and deleting the rest without losing related data.
When to Use
Use this resource when:
- You need to identify duplicate records in a table.
- A unique constraint violation prevents adding a required index.
- You are cleaning data after an import or migration.
- You want to deduplicate before enforcing a new primary key or unique index.
Solution
Find duplicates in PostgreSQL
```sql
-- Find duplicate emails in the users table
SELECT email, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Keep the oldest row and delete the rest
WITH duplicates AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at) AS rn
    FROM users
)
DELETE FROM users
WHERE id IN (
    SELECT id FROM duplicates WHERE rn > 1
);
```
Explanation
The first query groups rows by the column that should be unique and uses `HAVING COUNT(*) > 1` to return only the values that occur more than once. The second query uses a common table expression (CTE) with `ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at)`. Within each group of rows that share an email, the rows are numbered starting from 1, and the `DELETE` removes every row whose number is greater than 1. The `ORDER BY` clause determines which row is kept; here it is the oldest record. If two rows can share the same `created_at`, add a unique tie-breaker such as `id` to the `ORDER BY` so the choice is deterministic. Always run the `SELECT` version of the CTE before the `DELETE` to confirm what will be removed.
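A preview of the delete could look like the sketch below. It reuses the `users` table and its `id`, `email`, and `created_at` columns from the queries above; the `planned_action` and `group_size` names are illustrative, and `id` is assumed to be a unique tie-breaker.

```sql
-- Preview only: this statement deletes nothing.
WITH duplicates AS (
    SELECT id,
           email,
           created_at,
           -- id breaks ties between rows with the same created_at (assumption)
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at, id) AS rn,
           COUNT(*)     OVER (PARTITION BY email)                         AS group_size
    FROM users
)
SELECT id,
       email,
       created_at,
       CASE WHEN rn = 1 THEN 'keep' ELSE 'delete' END AS planned_action
FROM duplicates
WHERE group_size > 1          -- show only groups that contain duplicates
ORDER BY email, rn;
```

Every row marked `delete` here is a row the `DELETE` would remove, provided the same `PARTITION BY` and `ORDER BY` are used in both statements.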
Variants
| Database | Technique | Notes |
|---|---|---|
| PostgreSQL | ROW_NUMBER() OVER | Flexible and safe |
| MySQL 8+ | ROW_NUMBER() OVER | Same syntax as PostgreSQL |
| MySQL 5.7 | Self-join | Use MIN(id) to keep one row |
| SQLite | DELETE with IN subquery | Works with window functions in 3.25+ |
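For MySQL 5.7, which has no window functions, the self-join row in the table could be sketched as follows. It assumes the same `users` table and keeps the row with the lowest `id` in each email group; `survivors` and `keep_id` are illustrative names.

```sql
-- MySQL multi-table DELETE: remove every row that is not the MIN(id) of its email group.
DELETE u
FROM users AS u
JOIN (
    SELECT email, MIN(id) AS keep_id
    FROM users
    GROUP BY email
) AS survivors
  ON survivors.email = u.email
WHERE u.id <> survivors.keep_id;
```

Because the join compares emails with `=`, rows whose email is NULL never match and are left untouched.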
Best Practices
- Always preview before deleting. Run the CTE as a `SELECT` first to see which rows will be kept and which will be removed.
- Back up the table or use a transaction. A single bad `DELETE` can remove thousands of rows.
- Choose the canonical row with business logic. Whether the oldest, newest, or most complete record wins depends on the use case.
- Add a unique constraint after cleanup. This prevents duplicates from returning.
- Consider foreign keys. If child rows reference a duplicate, the delete fails (a foreign key with the default behavior), removes the children as well (`ON DELETE CASCADE`), or leaves orphans (no foreign key at all). Update the references to point at the kept row first.
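Taken together, the practices above might be combined as in this PostgreSQL sketch. The `orders` table with a `user_id` column, the `users_backup` copy, and the `user_dedup_map` temporary table are hypothetical names introduced for illustration; the sketch also assumes that rows with a NULL email should not be treated as duplicates.

```sql
-- Optional safety copy (hypothetical name).
CREATE TABLE users_backup AS SELECT * FROM users;

BEGIN;

-- Map each duplicate row to the row that will survive (oldest first, id as tie-breaker).
CREATE TEMP TABLE user_dedup_map ON COMMIT DROP AS
SELECT id AS duplicate_id, keep_id
FROM (
    SELECT id,
           FIRST_VALUE(id) OVER (PARTITION BY email ORDER BY created_at, id) AS keep_id
    FROM users
    WHERE email IS NOT NULL
) AS ranked
WHERE id <> keep_id;

-- Repoint child rows at the surviving parent before deleting anything.
UPDATE orders AS o
SET user_id = m.keep_id
FROM user_dedup_map AS m
WHERE o.user_id = m.duplicate_id;

-- Delete only the rows listed in the map.
DELETE FROM users AS u
USING user_dedup_map AS m
WHERE u.id = m.duplicate_id;

-- Check the reported row counts; run ROLLBACK instead of COMMIT if they look wrong.
COMMIT;
```

Building the map once and using it for both the `UPDATE` and the `DELETE` keeps the two statements in agreement about which row survives.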
Common Mistakes
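- Deleting without a WHERE clause. A missing `WHERE` turns the query into a table wipe.
- Keeping the wrong row. If the ordering is arbitrary, you may discard the most valuable duplicate.
- Ignoring NULL values. `GROUP BY` and `PARTITION BY` place all NULL keys in a single group, so rows with a NULL key are reported as duplicates of each other and would be deleted by the `ROW_NUMBER()` query. A self-join on `=` never matches them, because `NULL = NULL` is not true. Decide explicitly whether NULL keys count as duplicates.
- Running on production during peak traffic. Lock contention can block writes; use a batch approach or a low-traffic window.
- Forgetting to update related sequences. If you delete the row with the highest `id`, the sequence keeps counting from its current value and leaves a gap. That is normally harmless; reset the sequence only if something depends on contiguous ids.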
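The NULL case can be illustrated with the `users` table from the Solution section. This is a sketch that assumes a missing email should not count as a duplicate; if it should, leave the filter out.

```sql
-- GROUP BY puts every NULL email into one group, so this query can
-- return a row whose email is NULL and whose count is greater than 1.
SELECT email, COUNT(*) AS occurrences
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Excluding NULL keys keeps those rows out of the list of rows to delete.
WITH duplicates AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at, id) AS rn
    FROM users
    WHERE email IS NOT NULL
)
SELECT id
FROM duplicates
WHERE rn > 1;
```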
Frequently Asked Questions
Q: What if duplicates have different values in other columns?
A: Pick the canonical row by business rules. Then either merge the useful values from the other rows into it before deleting them, or simply keep the most complete or most recent row.
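One way to merge before deleting is sketched below for PostgreSQL (it relies on `DISTINCT ON`). The `phone` column is hypothetical and stands in for any column that may be empty on the kept row but filled on a duplicate; the kept row is assumed to be the oldest one.

```sql
WITH ranked AS (
    SELECT id,
           email,
           phone,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at, id) AS rn
    FROM users
    WHERE email IS NOT NULL
),
donors AS (
    -- One non-NULL phone per email, taken from the rows that will be deleted.
    SELECT DISTINCT ON (email) email, phone
    FROM ranked
    WHERE rn > 1
      AND phone IS NOT NULL
    ORDER BY email, rn
)
-- Fill the gap on the kept row only; existing values are never overwritten.
UPDATE users AS u
SET phone = d.phone
FROM ranked AS r
JOIN donors AS d ON d.email = r.email
WHERE u.id = r.id
  AND r.rn = 1
  AND u.phone IS NULL;
```

Run this before the `DELETE`, ideally in the same transaction, so the donor rows still exist when the values are copied.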
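Q: Can I delete duplicates in batches?
A: Yes. In PostgreSQL, add a `LIMIT` to the subquery, for example `WHERE id IN (SELECT id FROM duplicates WHERE rn > 1 LIMIT 1000)`, and run the delete repeatedly until it affects no rows. MySQL does not accept `LIMIT` inside an `IN` subquery, so a different batching approach is needed there.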
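A single batch in PostgreSQL could look like this sketch. The batch size of 1000 comes from the answer above; the `ORDER BY id` inside the subquery and the `id` tie-breaker are assumptions added to make each batch predictable.

```sql
-- Run repeatedly; stop when the statement reports DELETE 0.
WITH duplicates AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at, id) AS rn
    FROM users
)
DELETE FROM users
WHERE id IN (
    SELECT id
    FROM duplicates
    WHERE rn > 1
    ORDER BY id
    LIMIT 1000
);
```

Each run recomputes the row numbers, and the oldest row in every group is always number 1, so the same row survives no matter how many batches it takes.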
Q: How do I prevent duplicates from reappearing?
A: Add a unique constraint or unique index on the columns that define uniqueness, and handle duplicate key exceptions in your application.
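In PostgreSQL this might look like the sketch below. The index name `users_email_unique` and the sample values are illustrative, and the `INSERT` assumes that `id` is generated by a default.

```sql
-- Enforce uniqueness once the table is clean; this fails if duplicates remain.
CREATE UNIQUE INDEX users_email_unique ON users (email);

-- Let the database skip a duplicate insert instead of raising an error.
INSERT INTO users (email, created_at)
VALUES ('ada@example.com', NOW())
ON CONFLICT (email) DO NOTHING;
```

By default a PostgreSQL unique index still allows several rows with a NULL email, so add a `NOT NULL` constraint as well if the column must always be filled.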