How to Remove Duplicates from a C++ Vector

Duplicate means one of two or more things that are the same. Consider the following vector:

vector<char> vtr = {'E', 'G', 'I', 'E', 'A', 'E', 'C', 'A', 'C'};

‘E’ occurs three times in different positions. ‘A’ occurs two times in different positions. ‘C’ occurs two times in different positions. So, ‘E’, ‘A’ and ‘C’ have duplicates. Each of the rest of the other characters occur once.

To remove duplicates in this vector, means to remove the duplicates of ‘E’, ‘A’ and ‘C’, and allow the first occurrence of each character, in its position. The result should be:

vector<char> vtr = {'E', 'G', 'I', 'A', 'C'};

There are two main ways of removing duplicates from a vector. One way is the direct or brute force way. In this way, the first element is checked against the rest of the elements, and any duplicate is removed. The second element is checked against the rest of the other elements on the right, removing duplicates. The same procedure is done for the third element, and the rest of the elements. This way usually takes too long. The other way is to maintain the original vector, and have a sorted copy of it. Remove duplicates from the sorted vector while making the copy of any duplicated element, as key in a map. Finally, scan through the original vector from the beginning to the end using the map to erase duplicates.

These two ways can be referred to as the brute-force method and the sort-and-compare method, respectively. This article illustrates both ways. Do not forget to include the vector library at the beginning of the program.

Removing a Vector Element

A vector element is removed with the vector erase member function. The syntax is:

constexpr iterator erase(const_iterator position);

The argument is an iterator that points to the element, to be removed.

Removing Duplicates by Brute Force

With this approach, the first element is compared with the rest of the elements on the right, one-by-one, and any duplicate is erased. The second element, if it was not erased, is compared with the rest of the other elements on the right, one-by-one, erasing duplicates. The same procedure is done for the third element, and the rest of the elements. This approach usually takes too long. The following code illustrates it with iterators:

vectorvtr = {'E', 'G', 'I', 'E', 'A', 'E', 'C', 'A', 'C'};
for (vector::iterator itei = vtr.begin(); itei<vtr.end(); itei++) {
char ch = *itei;
for (vector::iterator itej = itei+1; itej<vtr.end(); itej++) {
if (ch == *itej) {
vtr.erase(itej);
}
}
}

for (int i=0; i<vtr.size(); i++) {
cout<<vtr[i] << ' ';
}
cout<<endl;

These are iterator for-loops with one loop nested. The second separate for-loop is not part of the process. It is for printing out the result. There are two for-loops in the process. The inner for-loop would scan though the rest of the vector, comparing each element with the one pointed to by the outer for-loop. Note the statement,

vector<char>::iterator itej = itei+1;

in the parentheses of the inner for-loop.

Removing Duplicates by Sort-and-Compare

Notice from the above method that there is a lot of re-scanning (re-reading and comparing) from large sequence, to small sequence of elements of the one vector. If the whole vector is scanned once or twice or three times, this would probably mean less accesses of the elements compared to the above approach. Well, the whole vector can even be scanned four or more times, but not many times. This must not necessarily be done with the same vector. It can be done with copies of the vector.

With the second approach, the original vector is maintained while a sorted copy is made from it. The sorted vector is read through (scanned), erasing the duplicate of consecutive elements that occurred more than once. One iterator for-loop can achieve this, from the beginning to the end of the sorted vector, once through. While this reading and some erasing is taking place, for any element that occurs more than once, a copy of the element is made key in a map, and the corresponding value for this key, is given the value -1. This value of -1 will be changed to 1 to indicate a duplicate. Each value in the map is an indicator for duplicate of its key which might occur two or more times in the original vector.

If the sorted vector with duplicates removed was required, then the sorted vector is returned, and the work is done. If the order of the first occurrence of the vector elements has to be maintained, then the following sub-procedure has to take place (continue):

Re-read the original vector from the beginning. While reading, if a key does not occur in the map (map returns 0), allow that key in the original vector. This means that the key has no duplicate. If a key of the original vector occurs in the map, it means that is the first occurrence of duplicates for that element in the vector. Make the indicator value for the key in the map, 1. That indicator value now has the value, 1. Continue reading the rest of the elements in the original vector, and be checking for duplicate element with the map. If a key is found and the map key value is 1, then the current element is a duplicate. Remove the current element. (Remember that the first occurrence of a duplicate key turned the corresponding indicator value in the map from -1 to 1.) Continue giving a value of 1 for the map key indicators, removing the original current vector element that already has a corresponding 1 in the map off the original vector; until the end of the original vector is reached. The resulting original vector is the vector without any duplicate element, and in the order with first occurrences.

In order to code map in C++, the map (unordered_map) library has to be included. Since the sort() function in the algorithm library will be used, the algorithm library too, has to be included into the program. The heading for the program of this approach, should be:

#include <iostream>

#include <vector>

#include <unordered_map>

#include <algorithm>

using namespace std;

The first code segment in the C++ main function can be:

vector<char> vtrO = {'E', 'G', 'I', 'E', 'A', 'E', 'C', 'A', 'C'};

vector<char> vtr = vtrO;

sort(vtr.begin(), vtr.end());

unordered_map<char, int> mp;

The first statement defines the original vector. The second statement makes a copy of the original vector. The third statement sorts the copied vector. The fourth statement declares the map, without initialization. The next code segment in the C++ main function, can be:

for (vector::iterator iter = vtr.begin(); iter<vtr.end()-1; iter++) {
vector::iterator iter0 = iter; vector::iterator iter1 = iter + 1;
if (*iter0 == *iter1) {
mp[*iter1] = -1;
iter--;
vector::iterator iter2 = vtr.erase(iter1);
}
}

This code segment erases the duplicates from the sorted copied vector. While doing that, it creates the map entries. Note that in the parentheses of the for-loop, the iteration reaches the last-but-one element (and not the last element). This is because the current and next elements are involved in the code. Also note that when an element is to be erased, the iterator is retarded (decremented) by one step.

If the sorted vector without duplicates is what was required, then the following code would display the result:

for (int i=0; i<vtr.size(); i++) {
cout<<vtr[i] << ' ';
}
cout<<endl;

The next code segment uses the original vector and the map to erase the duplicates in the original vector:

for (vector::iterator iter = vtrO.begin(); iter<vtrO.end(); iter++) {
if (mp[*iter] == 1) {
vtrO.erase(iter);
iter--;
}
if (mp[*iter] == -1)
mp[*iter] = 1;
}

The reason for choosing, -1 and 1, instead of 0 and 1, is because the default (absent) value of this map is 0. This avoids the confusion with the elements that do not have duplicate at all. An ordinary for-loop as follows can print out the final (reduced) original vector:

for (int i=0; i<vtrO.size(); i++) {
cout<<vtrO[i] << ' ';
}
cout<<endl;

The input to the program is:

'E', 'G', 'I', 'E', 'A', 'E', 'C', 'A', 'C'

The output of the program is:

A C E G I

E G I A C

The first line of the output is the sorted input without duplicates. The second line is the input in the given order, with the duplicates removed.

Conclusion

In order to remove duplicates from a C++ vector, the brute-force method can be used. This method is usually slow. The reader is advised to use the sort-and-compare method, which is usually fast, for his/her commercial work. Both methods have been explained above.

How to Remove Duplicates from a C++ Vector

Removing a Vector Element

Removing Duplicates by Brute Force

Removing Duplicates by Sort-and-Compare

Conclusion

About the author

Chrysanthus Forcha