Delete Duplicate Records in SQL

Below are a few scenarios I can think in the past where I’ve had to delete duplicate records.

It is strongly recommended you take a backup of your table prior to performing a DELETE. Test thoroughly and use these examples at your own risk.

Delete Using a CTE with row_number

This example demonstrates using a common table expression that utilizes a row_number by partitioning on the distinct values. This windowed function allows us to essentially group the unique values and provide an ordered sort for each of them. We then keep the record with rownum of 1 and delete the others.

[cc lang=”sql”]
IF OBJECT_ID(N’tempdb..#URL’) IS NOT NULL DROP TABLE #URL

— setup the base data
CREATE TABLE #URL(
url nvarchar(1024) NULL
);
INSERT #URL (url) VALUES (N’www.litwareinc.com/’);
GO
INSERT #URL (url) VALUES (N’www.treyresearch.net/’);
GO
INSERT #URL (url) VALUES (N’www.adatum.com/’);
GO
INSERT #URL (url) VALUES (N’www.northwindtraders.com/’);
GO
INSERT #URL (url) VALUES (N’www.wideworldimporters.com/’);
GO
INSERT #URL (url) VALUES (N’www.proseware.com/’);

— create duplicates
INSERT INTO #URL(url)
SELECT url FROM #URL

GO 3 — repeat this insert statement 3x

SELECT * FROM #URL;

BEGIN TRANSACTION;

— perform actual delete
WITH URL_cte(row_num, url)
AS
(
SELECT row_num = ROW_NUMBER() OVER (PARTITION BY url ORDER BY url),
url
FROM #URL
)
DELETE FROM URL_cte
WHERE row_num > 1; — anything that is not record 1

— check results
SELECT * FROM #URL;

— If results look good, execute the commit below
— COMMIT TRANSACTION
[/cc]

Delete for an Unconstrained Key

This scenario is when there is a key in the table that should be the primary key, however it was not constrained properly. Say we have a Customers table that has a CustomerID that was not enforced as being unique, and now, somehow duplicate records with the same CustomerID are in the table. What we first need in this table, is a column that will be unique. To do this, let’s create an alternate key and set it as an identity.

[cc lang=”sql”]
ALTER TABLE #Customers
ADD CustomerAltID bigint identity(1,1);
[/cc]

After executing, we now have a column named CustomerAltID that is truly unique in the table. We will use this column to help identify the records we want to retain.

First, let’s create an index to speed things up.

[cc lang=”sql”]

CREATE NONCLUSTERED INDEX [IX_NC_Customers_Alt] ON #Customers
(
[CustomerID] ASC
)
INCLUDE
(
CustomerAltID
);
[/cc]

Now let’s look at our delete statement.

[cc lang=”sql”]
DELETE
FROM #Customers
WHERE NOT EXISTS(
— These are the records
— we want to retain
SELECT CustomerID
,CustomerAltID
FROM #Customers c
WHERE c.CustomerAltID =
(
SELECT MAX(c2.CustomerAltID)
FROM #Customers c2
WHERE
c2.CustomerID = c.CustomerID
— Additional critera can be specified below
–AND c2.DateCreated > c.DateCreated
)
AND #Customers.CustomerID = c.CustomerID
AND #Customers.CustomerAltID = c.CustomerAltID
);

[/cc]

We are deleting all the records from Customer that are not in the following query (NOT EXISTS). We then do a correlated sub query to filter the CustomerAltID’s that we want to keep. We can also specify additional criteria in the correlated sub query to further filter the records we care about. In the example above we have a commented out line that would only return us the more recent record with the same CustomerID.

Delete for entirely duplicate row

The second situation I have run into is when a table contains records where the entire row is an exact duplicate of another row. This is also where a primary key comes in handy. Of course if you do not have one, you have to improvise. This example uses rownumber in order to create a pseudo primary key.

If you know the number of records that are supposed to be deleted, it’s a good idea to constrain the count using DELETE TOP (recordcount)

The Checksum function creates a single number from all the fields, then groups on that number. This is advantageous because instead of joining on many columns, you only have to join on one column. Please note, though highly unlikely (about 1 in 2 billion chance) the checksum function can return the same value for different row values.

[cc lang=”sql”]
SELECT
RowNum = ROW_NUMBER() OVER(order by entryid ASC)
,ChkSum = CHECKSUM(*)
INTO #HoldDuplicates
FROM dbo.Duplicate

DELETE
FROM #HoldDuplicates
WHERE RowNum NOT IN(
SELECT MAX(RowNum)
FROM #HoldDuplicates
GROUP BY ChkSum
)

SELECT *
FROM #HoldDuplicates
[/cc]

Now that we have only the good records we need in our #HoldDuplicates table, we can truncate the old table and insert the records we have retained. I recommend doing this in a transaction in case things go awry.

Featured Articles

 Site Author

  • Thanks for visiting!