MySQL MariaDB

SQL Server Delete Duplicate Rows

Duplicate values in a database can be an issue when performing highly accurate operations. They can lead to a single value being processed multiple times, tainting the result. Duplicate records also take up more space than necessary, leading to slow performance.

In this guide, you will understand how you can find and remove duplicate rows in an SQL Server database.

The Basics

Before we proceed further, what is a duplicate row? We can classify a row as a duplicate if it contains a similar name and value to another row on the table.

To illustrate how to find and remove duplicate rows in a database, let us start by creating sample data as shown in the queries below:

USE duplicatedb;
CREATE TABLE users(
                id INT IDENTITY(1,1) NOT NULL,
                username VARCHAR(20),
                email VARCHAR(55),
                phone BIGINT,
                states VARCHAR(20)
);
INSERT INTO users(username, email, phone, states)
VALUES ('zero', 'zero@mail.to', 6819693895, 'New York'),
           ('Gr33n', 'green@zap.tv', 9247563872, 'Colorado'),
           ('Shell', 'user@shell.io', 702465588, 'Texas'),
           ('dwell', 'dewll@zero.mail', 1452745985, 'New Mexico'),
           ('Gr33n', 'green@zap.tv', 9247563872, 'Colorado'),
           ('zero', 'zero@mail.to', 6819693895, 'New York');

In the example query above, we create a table containing user information. In the next clause block, we use the insert into the statement to add duplicate values to the users’ table.

Find Duplicate Rows

Once we have the sample data we need, let us check for duplicate values in the users’ table.  We can do this using the count function as:

SELECT username, email, phone, states, COUNT(*) AS count_value FROM users GROUP BY username, email, phone, states HAVING COUNT(*) > 1;

The above code snippet should return the duplicate rows in the database and how many times they appear on the table.

An example output is as shown:

Next, we remove the duplicate rows.

Delete Duplicate Rows

The next step is to remove duplicate rows. We can do this by using the delete query as shown in the example snippet below:

delete from users where id not in (select max(id) from users group by username, email, phone, states);

The query should affect the duplicate rows and keep the unique rows in the table.

We can view the table as:

SELECT * FROM users;

The resulting value is as shown:

Delete Duplicate Rows (JOIN)

You can also use a JOIN statement to remove duplicate rows from a table. An example sample query code is as shown below:

DELETE a FROM users an INNER JOIN
        (SELECT id, rank() OVER(partition BY username ORDER BY id) AS rank_ FROM users)
         b ON a.id=b.id WHERE b.rank_>1;

Keep in mind that using inner join to remove duplicates may take longer than others on an extensive database.

Delete Duplicate Row (row_number())

The row_number() function assigns a sequential number to the rows in a table. We can use this functionality to remove duplicates from a table.

Consider the example query below:

USE duplicatedb
DELETE T
FROM
(
SELECT *
, duplicate_rank = ROW_NUMBER() OVER (
             PARTITION BY id
             ORDER BY (SELECT NULL)
           )
FROM users
) AS T
WHERE duplicate_rank > 1

The query above should use the values returned from the row_number() function to remove the duplicates. A duplicate row will produce a value higher than 1 from the row_number() function.

Conclusion

Keeping your databases clean by removing duplicate rows from the tables is good. This helps to improve performance and storage space. Using the methods in this tutorial, you will clean your databases safely.

About the author

John Otieno

My name is John and am a fellow geek like you. I am passionate about all things computers from Hardware, Operating systems to Programming. My dream is to share my knowledge with the world and help out fellow geeks. Follow my content by subscribing to LinuxHint mailing list