Microsoft SQL Server logo

How to clean up duplicates from MS SQL Database?

Reading Time: 4 minutes.

This article explains how to query and remove duplicate entries from a Microsoft SQL database using T-SQL. The same model works both for Azure Databases (in Azure SQL Query Editor) and databases on MS SQL Server (using SSMS – SQL Server Management Studio).

Background

Why bother?

That’s a (somewhat) valid question. A table in a SQL database can have quite a few rows as long as you have plenty of storage space available. The number of rows might (technically speaking) be limited by the maximum value of BigInt, which would be 9,223,372,036,854,775,807.

On Stack Overflow there’s an example of a table with over 24.5 billion rows in a live production system.

Bottom line: You won’t run out of row space anytime soon. Even if 24.5 billion rows would be the maximum (and it isn’t), if you were adding something to the database 10 times per second, you could keep that up for over 77 years. That gives you plenty of time to rethink the architecture (and do some clean up later, if need be 🙂)

However, managing data in a table this large becomes complicated. Running any queries against a table as big as this – especially, when you’re using an ORM like EF Core – requires you to put a lot of thought into it to avoid unnecessarily long execution times and performance bottlenecks.

The 24.5-billion row production database I mentioned earlier ended up getting restructured and older data simply truncated. Not because of any actual limitations or performance reasons – but because of the size of the backups!

In short – it’s not a good idea to have massive databases. Rather, you should either partition the tables or clean up the data. The maintainer from the thread on Stack Overflow actually mentions this:

If you plan on storing this much data in a single table: don’t. I highly recommend you consider table partitioning (either manually or with the built-in features if you’re running Enterprise edition). – – Deleting directly from a table with tens of billions of rows of data will a) take a HUGE amount of time and b) fill up the transaction log hundreds or thousands of times over.

Dan Bechard (on Stack Overflow – emphasis mine)

Well, and there’s obviously the glaring point of data sanitization. In my case, the rows were produced by a long undiscovered bug, that would in some cases store 2 identical rows apart from a random identifier.

That data wasn’t strictly speaking duplicates – but it was completely redundant and unnecessary. And after fixing the bug, I wanted to get rid of the unneeded data cluttering my databases :)

Okay – with all of the background stuff cleared, let’s move on

Solution

This’ll work in a fairly similar manner both in Azure Portal and SQL Management Studio, or probably whatever you use to query your databases.

First of all, open the query window. In SQL Management Studio, right-click your database and click “New Query”.

For a SQL database on Azure, it’s a similar idea – navigate to your database and select “Query editor” – it’s in preview on my tenant, as shown in the screenshot below:

Then, let’s query to find all duplicates rows! Essentially, we’ll need to define what’s a duplicate, query for them, and store the results as a temporary result set – a CTE, or Common Table Expression.

For me, the T-SQL looks something like this:

WITH CTE AS 
(
    SELECT 
        Property1, 
	Property2,
        ROW_NUMBER() OVER (
            PARTITION BY 
                Property1
            ORDER BY 
                Property1
        ) AS row_num
     FROM 
        [dbo].[TABLE]
)
SELECT DISTINCT Property1
FROM   CTE
WHERE row_num > 1

Running this query will simply gather all of the rows matching your query as a CTE, and then select each row with row_num – essentially, duplicate count – larger than 1.

But wait – we’ll actually want to get rid of the duplicates, right?

Well, that’s easily achieved! Since you already have the result set stored as a CTE, aptly named “CTE”, you can just query that to DELETE.. Well, the same ideas above. Anything with a row_num > 1.

So, an example of this below:

WITH CTE AS 
(
    SELECT 
        Property1, 
	Property2,
        ROW_NUMBER() OVER (
            PARTITION BY 
                Property1
            ORDER BY 
                Property1
        ) AS row_num
     FROM 
        [dbo].[TABLE]
)
DELETE
FROM   CTE
WHERE row_num > 1

That’s it. Hope this is a good example to adapt from for whatever your use case is. And as usual, if you encounter any issues, let me know in the comments below! :)

References

mm

Leave a Reply

avatar
5000
  Subscribe  
Notify of