This post was most recently updated on July 31st, 2022.

4 min read.

This article explains how to query and remove duplicate entries from a Microsoft SQL database using T-SQL. The same model works both for Azure Databases (in Azure SQL Query Editor) and databases on MS SQL Server (using SSMS – SQL Server Management Studio).

I had to do this a while back due to a bug in an API creating multiple entities with the same content – save for the Identity field. Fun!

Background

Table of Contents

Why bother?

That’s a (somewhat) valid question. A table in a SQL database can have quite a few rows as long as you have plenty of storage space available. The number of rows might (technically speaking) be limited by the maximum value of BigInt, which would be 9,223,372,036,854,775,807.

On Stack Overflow, there’s an example of a table with over 24.5 billion rows in a live production system.

Bottom line: You won’t run out of row space anytime soon. Even if 24.5 billion rows would be the maximum (and it isn’t), if you were adding something to the database 10 times per second, you could keep that up for over 77 years. That gives you plenty of time to rethink the architecture (and do some clean-up later, if need be 🙂)

However, managing data in a table this large becomes complicated. Running any queries against a table as big as this – especially, when you’re using an ORM like EF Core – requires you to put a lot of thought into it to avoid unnecessarily long execution times and performance bottlenecks.

The 24.5-billion row production database I mentioned earlier ended up getting restructured and older data simply truncated. Not because of any actual limitations or performance reasons – but because of the size of the backups!

In short – it’s not a good idea to have massive databases. Rather, you should either partition the tables or clean up the data. The maintainer from the thread on Stack Overflow actually mentions this:

If you plan on storing this much data in a single table: don’t. I highly recommend you consider table partitioning (either manually or with the built-in features if you’re running Enterprise edition). – – Deleting directly from a table with tens of billions of rows of data will a) take a HUGE amount of time and b) fill up the transaction log hundreds or thousands of times over.
Dan Bechard (on Stack Overflow – emphasis mine)

Well, and there’s obviously the glaring point of data sanitization. In my case, the rows were produced by a long undiscovered bug, that would in some cases store 2 (or more) identical rows apart from a random identifier.

That data wasn’t strictly speaking duplicates – but it was completely redundant and unnecessary. And after fixing the bug, I wanted to get rid of the unneeded data cluttering my databases :)

Okay – with all of the background stuff cleared, let’s move on!

Solution

This will work in a fairly similar manner both in Azure Portal and SQL Management Studio, or probably whatever you use to query your databases.

Time needed: 20 minutes

How to remove duplicates in an MS SQL database using a Common Table Expression (CTE)?

Create a new Query
First of all, open the query window. In SQL Management Studio, right-click your database and click “New Query”.

^{“New Query” in SQL Server Management Studio.}

For a SQL database on Azure, it’s a similar idea – navigate to your database and select “Query editor” – it’s in preview on my tenant, as shown in the screenshot below:
^{In Azure SQL, the “Query editor” is the equivalent of the “New Query” on SSMS.}
Form a Common Table Expression
Then, let’s query to find all duplicated rows! Essentially, we’ll need to define what’s a duplicate, query for them, and store the results as a temporary result set – a CTE, or Common Table Expression.

For me, the T-SQL looks something like this:

WITH CTE AS (SELECT Property1, Property2, ROW_NUMBER() OVER ( PARTITION BY Property1 ORDER BY Property1 ) AS row_num FROM [dbo].[Table] ) SELECT DISTINCT Property1 FROM CTE WHERE row_num > 1

Running this query will simply gather all of the rows matching your query as a CTE, and then select each row with row_num – essentially, duplicate count – larger than 1.

But wait – we’ll actually want to get rid of the duplicates, right?
(OPTIONAL) Query your CTE to remove the duplicate rows
Well, that’s easily achieved! Since you already have the result set stored as a CTE, aptly named “CTE”, you can just query that to DELETE.. Well, the same ideas above. Anything with a row_num > 1.

So, an example of this below:
WITH cte AS (SELECT property1, property2, Row_number() OVER ( partition BY property1 ORDER BY property1 ) AS row_num FROM [dbo].[table]) DELETE FROM cte WHERE row_num > 1

That’s it. Hope this is a good example to adapt from for whatever your use case is. And as usual, if you encounter any issues, let me know in the comments below! :)

References

A very useful tutorial on CTEs

Author
Recent Posts

Antti K. Koskela

Developer / Speaker / Consultant at Koskila / Precio Fishbone / Norppandalotti Software Co / Alter - Experience Ideas Ltd

Antti Koskela is a proud digital native nomadic millennial full stack developer (is that enough funny buzzwords? That's definitely enough funny buzzwords!), who works as Solutions Architect for Precio Fishbone, building delightful Digital Workplaces.

He's been a developer from 2004 (starting with PHP and Java), and he's been working on .NET projects, Azure, Office 365, SharePoint and a lot of other stuff. He's also Microsoft MVP for Azure.

This is his personal professional (e.g. professional, but definitely personal) blog.

Latest posts by Antti K. Koskela (see all)

How to output all Environment Variables in Azure DevOps with PowerShell? - April 9, 2024
CSOM suddenly throwing exceptions when trying to access list contents in SharePoint? A weird fix. - April 2, 2024
“Predefined type ‘System.Object’ is not defined or imported” and other System namespace stuff missing in your solution? - March 26, 2024

4 1 vote

Article Rating

#SharePointProblems

Solutions are worthless unless shared! Antti K. Koskela's Personal Professional Blog

How to clean up duplicates from MS SQL Database?

Background

Solution

References

Posts Related to "How to clean up duplicates from MS SQL Database?":