There are two scenarios I can think in the past where I’ve had to delete duplicate records. With both these example below, I strongly recommend taking backups of your tables, and verifying what is actually going to be deleted prior to deleting. As a disclaimer, you are using these examples at your own risk.
Delete for an Unconstrained Key
The first scenario is when there is a key in the table that should be the primary key, however it was not constrained properly. Say we have a Customers table that has a CustomerID that was not enforced as being unique, and now, somehow duplicate records with the same CustomerID are in the table. What we first need in this table, is a column that will be unique. To do this, let’s create an alternate key and set it as an identity.
ALTER TABLE dbo.Customers ADD CustomerAltID bigint identity(1,1);
After executing, we now have a column named CustomerAltID that is truly unique in the table. We will use this column to help identify the records we want to retain.
First, let’s create an index to speed things up.
CREATE NONCLUSTERED INDEX [IX_NC_Customers_Alt] ON dbo.Customers ( [CustomerID] ASC ) INCLUDE ( CustomerAltID );
Now let’s look at our delete statement.
DELETE FROM dbo.Customers WHERE NOT EXISTS( -- These are the records -- we want to retain SELECT CustomerID ,CustomerAltID FROM dbo.Customers c WHERE c.CustomerAltID = ( SELECT MAX(c2.CustomerAltID) FROM dbo.Customers c2 WHERE c2.CustomerID = c.CustomerID -- Additional critera can be specified below --AND c2.DateCreated > c.DateCreated ) AND dbo.Customers.CustomerID = c.CustomerID AND dbo.Customers.CustomerAltID = c.CustomerAltID );
We are deleting all the records from Customer that are not in the following query (NOT EXISTS). We then do a correlated sub query to filter the CustomerAltID’s that we want to keep. We can also specify additional criteria in the correlated sub query to further filter the records we care about. In the example above we have a commented out line that would only return us the more recent record with the same CustomerID.
Delete for entirely duplicate row
The second situation I have run into is when a table contains records where the entire row is an exact duplicate of another row. This is also where a primary key comes in handy. Of course if you do not have one, you have to improvise. This example uses rownumber in order to create a pseudo primary key.
The Checksum function creates a single number from all the fields, then groups on that number. This is advantageous because instead of joining on many columns, you only have to join on one column. Please note, though highly unlikely (about 1 in 2 billion chance) the checksum function can return the same value for different row values.
SELECT RowNum = ROW_NUMBER() OVER(order by entryid ASC) ,ChkSum = CHECKSUM(*) INTO #HoldDuplicates FROM dbo.Duplicate DELETE FROM #HoldDuplicates WHERE RowNum NOT IN( SELECT MAX(RowNum) FROM #HoldDuplicates GROUP BY ChkSum ) SELECT * FROM #HoldDuplicates
Now that we have only the good records we need in our #HoldDuplicates table, we can truncate the old table and insert the records we have retained. I recommend doing this in a transaction in case things go awry.
Popular search terms:
