MySQL MariaDB

How MySQL Delete Duplicate Rows

MySQL is a relational dataset that stores data in tables that have rows and columns. However, the data stored in the database may contain duplicate values caused by errors in the applications or users.

In this tutorial, we shall learn how to remove duplicate rows in a MySQL database to reduce database size and help to increase the server’s performance.

Before we continue, we are assuming:

  1. You have MySQL installed and running on your system
  2. You have root access to the database.
  3. You have access to a database for experimenting or testing

NOTE: If you need a sample database to try out the concepts provided in this guide, please consider the Sakila database or download a copy of the database used in this guide.

Resources are provided below:

Basic Usage

Before we begin, we will intentionally create a table containing duplicate values for testing purposes. The SQL queries to perform this action are below:

USE world;
DROP TABLE IF EXISTS users;
CREATE TABLE users (id INT PRIMARY KEY NOT NULL AUTO_INCREMENT, username VARCHAR(10) NOT NULL, full_name VARCHAR(20), email VARCHAR(255) NOT NULL);
INSERT INTO users (username, full_name, email) VALUES
    ("virgo", "Claude M. Mori", "[email protected]"),
    ("pulsa", "Tiffany G. Bailey", "[email protected]"),
    ("rocket", "Christopher S. Payton", "[email protected]"),
    ("darkmatter", "Patricia J. Fox", "[email protected]"),
    ("pwnc", "Faye H. Hartley", "[email protected]"),
    ("darkmatter", "Patricia J. Fox", "[email protected]"),
    ("rocket", "Christopher S. Payton", "[email protected]"),
    ("artemis", "Wesley C. Dillard", "[email protected]");

Feel free to modify the query above to fit your needs. You should also ensure that you have the database (world) created to avoid errors.

Now, if we get all the data inside the table and in order by username, we will see the duplicates we have as shown:

mysql> use world;
Database changed
mysql> SELECT * FROM users ORDER BY username;
+----+------------+-----------------------+-----------------------+
| id | username   | full_name             | email                 |
+----+------------+-----------------------+-----------------------+
|  8 | artemis    | Wesley C. Dillard     | [email protected]       |
|  4 | darkmatter | Patricia J. Fox       | [email protected]        |
|  6 | darkmatter | Patricia J. Fox       | [email protected]        |
|  2 | pulsa      | Tiffany G. Bailey     | [email protected] |
|  5 | pwnc       | Faye H. Hartley       | [email protected]   |
|  3 | rocket     | Christopher S. Payton | [email protected]        |
|  7 | rocket     | Christopher S. Payton | [email protected]        |
|  1 | virgo      | Claude M. Mori        | [email protected]      |
+----+------------+-----------------------+-----------------------+

As you can see from the table above, we have two duplicate values that make the database bigger for no reason and cause slow speeds.

Let us now learn how we can remove these values.

#1 – DELETE JOIN

One way to remove duplicate rows in a database is to use the MySQL DELETE JOIN statement. The query, however, uses ids to remove duplicate values.

For example, to remove the duplicate values in the users table above, we can input:

DELETE  table1 FROM users table1 INNER JOIN users table2 WHERE table1.id < table2.id AND table1.email = table2.email;

Once you execute the query above, you will remove the duplicate values as shown in the output below:

mysql> DELETE table1 FROM users table1 INNER JOIN users table2 WHERE table1.id < table2.id AND table1.email = table2.email;
Query OK, 2 rows affected (0.01 sec)
 
mysql> SELECT * FROM users ORDER BY username;
+----+------------+-----------------------+-----------------------+
| id | username   | full_name             | email                 |
+----+------------+-----------------------+-----------------------+
|  8 | artemis    | Wesley C. Dillard     | [email protected]       |
|  6 | darkmatter | Patricia J. Fox       | [email protected]        |
|  2 | pulsa      | Tiffany G. Bailey     | [email protected] |
|  5 | pwnc       | Faye H. Hartley       | [email protected]   |
|  7 | rocket     | Christopher S. Payton | [email protected]        |
|  1 | virgo      | Claude M. Mori        | [email protected]      |
+----+------------+-----------------------+-----------------------+

#2 – Row_Number() Function

The second method we can implement is to use MySQL row_number() function. This function is supported in MySQL version 8 and higher.

It works by assigning a sequential int value to each row, with rows containing duplicate values getting a value higher than 1.

To learn more about this function, use the resource provided below:

https://dev.mysql.com/doc/refman/8.0/en/window-function-descriptions.html#function_row-number

Consider the query below that returns the id of the rows with duplicate values:

SELECT id FROM (SELECT id, ROW_NUMBER() OVER (PARTITION BY username ORDER BY username) AS row_var FROM users) t1 WHERE row_var > 1;

Once you execute the above query, you should get the list of ids as shown in the output below:

 +----+
| id |
+----+
|  6 |
|  7 |
+----+
2 rows in set (0.01 sec)

If you want to remove the values, simply replace the SELECT statement with the DELETE statement as shown below:

DELETE FROM users WHERE id IN (SELECT id FROM ( SELECT id, ROW_NUMBER() OVER (PARTITION BY username ORDER BY username) AS row_var FROM users) t1 WHERE row_var > 1);

Finally, you can verify that the duplicate values are removed using the SELECT statement.

mysql> SELECT * from users ORDER BY username;
+----+------------+-----------------------+-----------------------+
| id | username   | full_name             | email                 |
+----+------------+-----------------------+-----------------------+
|  8 | artemis    | Wesley C. Dillard     | [email protected]       |
|  4 | darkmatter | Patricia J. Fox       | [email protected]        |
|  2 | pulsa      | Tiffany G. Bailey     | [email protected] |
|  5 | pwnc       | Faye H. Hartley       | [email protected]   |
|  3 | rocket     | Christopher S. Payton | [email protected]        |
|  1 | virgo      | Claude M. Mori        | [email protected]      |
+----+------------+-----------------------+-----------------------+

Conclusion

In this tutorial, we discussed the two methods of removing duplicate values from a database. Large databases, especially those in common use, may contain many duplicate values from external imports and other errors. Therefore, there is a need to keep purging duplicate values to ensure that applications perform optimally.

About the author

John Otieno

My name is John and am a fellow geek like you. I am passionate about all things computers from Hardware, Operating systems to Programming. My dream is to share my knowledge with the world and help out fellow geeks. Follow my content by subscribing to LinuxHint mailing list