Duplicate values in a database can be an issue when performing highly accurate operations. They can lead to a single value being processed multiple times, tainting the result. Duplicate records also take up more space than necessary, leading to slow performance.
In this guide, you will understand how you can find and remove duplicate rows in an SQL Server database.
The Basics
Before we proceed further, what is a duplicate row? We can classify a row as a duplicate if it contains a similar name and value to another row on the table.
To illustrate how to find and remove duplicate rows in a database, let us start by creating sample data as shown in the queries below:
CREATE TABLE users(
id INT IDENTITY(1,1) NOT NULL,
username VARCHAR(20),
email VARCHAR(55),
phone BIGINT,
states VARCHAR(20)
);
INSERT INTO users(username, email, phone, states)
VALUES ('zero', 'zero@mail.to', 6819693895, 'New York'),
('Gr33n', 'green@zap.tv', 9247563872, 'Colorado'),
('Shell', 'user@shell.io', 702465588, 'Texas'),
('dwell', 'dewll@zero.mail', 1452745985, 'New Mexico'),
('Gr33n', 'green@zap.tv', 9247563872, 'Colorado'),
('zero', 'zero@mail.to', 6819693895, 'New York');
In the example query above, we create a table containing user information. In the next clause block, we use the insert into the statement to add duplicate values to the users’ table.
Find Duplicate Rows
Once we have the sample data we need, let us check for duplicate values in the users’ table. We can do this using the count function as:
The above code snippet should return the duplicate rows in the database and how many times they appear on the table.
An example output is as shown:
Next, we remove the duplicate rows.
Delete Duplicate Rows
The next step is to remove duplicate rows. We can do this by using the delete query as shown in the example snippet below:
delete from users where id not in (select max(id) from users group by username, email, phone, states);
The query should affect the duplicate rows and keep the unique rows in the table.
We can view the table as:
The resulting value is as shown:
Delete Duplicate Rows (JOIN)
You can also use a JOIN statement to remove duplicate rows from a table. An example sample query code is as shown below:
(SELECT id, rank() OVER(partition BY username ORDER BY id) AS rank_ FROM users)
b ON a.id=b.id WHERE b.rank_>1;
Keep in mind that using inner join to remove duplicates may take longer than others on an extensive database.
Delete Duplicate Row (row_number())
The row_number() function assigns a sequential number to the rows in a table. We can use this functionality to remove duplicates from a table.
Consider the example query below:
DELETE T
FROM
(
SELECT *
, duplicate_rank = ROW_NUMBER() OVER (
PARTITION BY id
ORDER BY (SELECT NULL)
)
FROM users
) AS T
WHERE duplicate_rank > 1
The query above should use the values returned from the row_number() function to remove the duplicates. A duplicate row will produce a value higher than 1 from the row_number() function.
Conclusion
Keeping your databases clean by removing duplicate rows from the tables is good. This helps to improve performance and storage space. Using the methods in this tutorial, you will clean your databases safely.