Microsoft SQL Server logo

How to clean up duplicates from MS SQL Database?

This post was most recently updated on December 25th, 2020.

Reading Time: 4 minutes.

This article explains how to query and remove duplicate entries from a Microsoft SQL database using T-SQL. The same model works both for Azure Databases (in Azure SQL Query Editor) and databases on MS SQL Server (using SSMS – SQL Server Management Studio).

I had to do this a while back due to a bug in an API creating multiple entities with the same content – save for the Identity field. Fun!

Background

Table of Contents

Why bother?

That’s a (somewhat) valid question. A table in a SQL database can have quite a few rows as long as you have plenty of storage space available. The number of rows might (technically speaking) be limited by the maximum value of BigInt, which would be 9,223,372,036,854,775,807.

On Stack Overflow there’s an example of a table with over 24.5 billion rows in a live production system.

Bottom line: You won’t run out of row space anytime soon. Even if 24.5 billion rows would be the maximum (and it isn’t), if you were adding something to the database 10 times per second, you could keep that up for over 77 years. That gives you plenty of time to rethink the architecture (and do some clean up later, if need be 🙂)

However, managing data in a table this large becomes complicated. Running any queries against a table as big as this – especially, when you’re using an ORM like EF Core – requires you to put a lot of thought into it to avoid unnecessarily long execution times and performance bottlenecks.

The 24.5-billion row production database I mentioned earlier ended up getting restructured and older data simply truncated. Not because of any actual limitations or performance reasons – but because of the size of the backups!

In short – it’s not a good idea to have massive databases. Rather, you should either partition the tables or clean up the data. The maintainer from the thread on Stack Overflow actually mentions this:

If you plan on storing this much data in a single table: don’t. I highly recommend you consider table partitioning (either manually or with the built-in features if you’re running Enterprise edition). – – Deleting directly from a table with tens of billions of rows of data will a) take a HUGE amount of time and b) fill up the transaction log hundreds or thousands of times over.

Dan Bechard (on Stack Overflow – emphasis mine)

Well, and there’s obviously the glaring point of data sanitization. In my case, the rows were produced by a long undiscovered bug, that would in some cases store 2 (or more) identical rows apart from a random identifier.

That data wasn’t strictly speaking duplicates – but it was completely redundant and unnecessary. And after fixing the bug, I wanted to get rid of the unneeded data cluttering my databases :)

Okay – with all of the background stuff cleared, let’s move on!

Solution

This’ll work in a fairly similar manner both in Azure Portal and SQL Management Studio, or probably whatever you use to query your databases.

Time needed: 20 minutes.

How to remove duplicates in an MS SQL database using a Common Table Expression (CTE)?

  1. Create a new Query

    First of all, open the query window. In SQL Management Studio, right-click your database and click “New Query”.

    "New Query" in SQL Server Management Studio.“New Query” in SQL Server Management Studio.

    For a SQL database on Azure, it’s a similar idea – navigate to your database and select “Query editor” – it’s in preview on my tenant, as shown in the screenshot below:
    In Azure SQL, the "Query editor" is the equivalent of the "New Query" on SSMS.In Azure SQL, the “Query editor” is the equivalent of the “New Query” on SSMS.

  2. Form a Common Table Expression

    Then, let’s query to find all duplicates rows! Essentially, we’ll need to define what’s a duplicate, query for them, and store the results as a temporary result set – a CTE, or Common Table Expression.

    For me, the T-SQL looks something like this:

    WITH CTE
    AS (SELECT Property1, Property2, ROW_NUMBER()
    OVER (
    PARTITION BY Property1
    ORDER BY Property1 ) AS row_num
    FROM [dbo].[Table] )
    SELECT DISTINCT Property1
    FROM CTE WHERE row_num > 1


    Running this query will simply gather all of the rows matching your query as a CTE, and then select each row with row_num – essentially, duplicate count – larger than 1.

    But wait – we’ll actually want to get rid of the duplicates, right?

  3. (OPTIONAL) Query your CTE to remove the duplicate rows

    Well, that’s easily achieved! Since you already have the result set stored as a CTE, aptly named “CTE”, you can just query that to DELETE.. Well, the same ideas above. Anything with a row_num > 1.

    So, an example of this below:

    WITH cte
    AS (SELECT property1, property2, Row_number()
    OVER (
    partition BY property1
    ORDER BY property1 ) AS row_num
    FROM [dbo].[table])
    DELETE FROM cte
    WHERE row_num > 1


That’s it. Hope this is a good example to adapt from for whatever your use case is. And as usual, if you encounter any issues, let me know in the comments below! :)

References

mm
4 1 vote
Article Rating
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments