In this tutorial, we shall learn how to remove duplicate rows in a MySQL database to reduce database size and help to increase the server’s performance.
Before we continue, we are assuming:
- You have MySQL installed and running on your system
- You have root access to the database.
- You have access to a database for experimenting or testing
NOTE: If you need a sample database to try out the concepts provided in this guide, please consider the Sakila database or download a copy of the database used in this guide.
Resources are provided below:
Basic Usage
Before we begin, we will intentionally create a table containing duplicate values for testing purposes. The SQL queries to perform this action are below:
DROP TABLE IF EXISTS users;
CREATE TABLE users (id INT PRIMARY KEY NOT NULL AUTO_INCREMENT, username VARCHAR(10) NOT NULL, full_name VARCHAR(20), email VARCHAR(255) NOT NULL);
INSERT INTO users (username, full_name, email) VALUES
("virgo", "Claude M. Mori", "[email protected]"),
("pulsa", "Tiffany G. Bailey", "[email protected]"),
("rocket", "Christopher S. Payton", "[email protected]"),
("darkmatter", "Patricia J. Fox", "[email protected]"),
("pwnc", "Faye H. Hartley", "[email protected]"),
("darkmatter", "Patricia J. Fox", "[email protected]"),
("rocket", "Christopher S. Payton", "[email protected]"),
("artemis", "Wesley C. Dillard", "[email protected]");
Feel free to modify the query above to fit your needs. You should also ensure that you have the database (world) created to avoid errors.
Now, if we get all the data inside the table and in order by username, we will see the duplicates we have as shown:
Database changed
mysql> SELECT * FROM users ORDER BY username;
+----+------------+-----------------------+-----------------------+
| id | username | full_name | email |
+----+------------+-----------------------+-----------------------+
| 8 | artemis | Wesley C. Dillard | [email protected] |
| 4 | darkmatter | Patricia J. Fox | [email protected] |
| 6 | darkmatter | Patricia J. Fox | [email protected] |
| 2 | pulsa | Tiffany G. Bailey | [email protected] |
| 5 | pwnc | Faye H. Hartley | [email protected] |
| 3 | rocket | Christopher S. Payton | [email protected] |
| 7 | rocket | Christopher S. Payton | [email protected] |
| 1 | virgo | Claude M. Mori | [email protected] |
+----+------------+-----------------------+-----------------------+
As you can see from the table above, we have two duplicate values that make the database bigger for no reason and cause slow speeds.
Let us now learn how we can remove these values.
#1 – DELETE JOIN
One way to remove duplicate rows in a database is to use the MySQL DELETE JOIN statement. The query, however, uses ids to remove duplicate values.
For example, to remove the duplicate values in the users table above, we can input:
Once you execute the query above, you will remove the duplicate values as shown in the output below:
Query OK, 2 rows affected (0.01 sec)
mysql> SELECT * FROM users ORDER BY username;
+----+------------+-----------------------+-----------------------+
| id | username | full_name | email |
+----+------------+-----------------------+-----------------------+
| 8 | artemis | Wesley C. Dillard | [email protected] |
| 6 | darkmatter | Patricia J. Fox | [email protected] |
| 2 | pulsa | Tiffany G. Bailey | [email protected] |
| 5 | pwnc | Faye H. Hartley | [email protected] |
| 7 | rocket | Christopher S. Payton | [email protected] |
| 1 | virgo | Claude M. Mori | [email protected] |
+----+------------+-----------------------+-----------------------+
#2 – Row_Number() Function
The second method we can implement is to use MySQL row_number() function. This function is supported in MySQL version 8 and higher.
It works by assigning a sequential int value to each row, with rows containing duplicate values getting a value higher than 1.
To learn more about this function, use the resource provided below:
https://dev.mysql.com/doc/refman/8.0/en/window-function-descriptions.html#function_row-number
Consider the query below that returns the id of the rows with duplicate values:
Once you execute the above query, you should get the list of ids as shown in the output below:
| id |
+----+
| 6 |
| 7 |
+----+
2 rows in set (0.01 sec)
If you want to remove the values, simply replace the SELECT statement with the DELETE statement as shown below:
Finally, you can verify that the duplicate values are removed using the SELECT statement.
+----+------------+-----------------------+-----------------------+
| id | username | full_name | email |
+----+------------+-----------------------+-----------------------+
| 8 | artemis | Wesley C. Dillard | [email protected] |
| 4 | darkmatter | Patricia J. Fox | [email protected] |
| 2 | pulsa | Tiffany G. Bailey | [email protected] |
| 5 | pwnc | Faye H. Hartley | [email protected] |
| 3 | rocket | Christopher S. Payton | [email protected] |
| 1 | virgo | Claude M. Mori | [email protected] |
+----+------------+-----------------------+-----------------------+
Conclusion
In this tutorial, we discussed the two methods of removing duplicate values from a database. Large databases, especially those in common use, may contain many duplicate values from external imports and other errors. Therefore, there is a need to keep purging duplicate values to ensure that applications perform optimally.