Remove the Duplicate Rows from the DataFrame in R

DataFrames often contain duplicate rows, which can lead to errors and inaccuracies in analysis. Removing them is a common data cleaning task in R: it reduces the dataset size, speeds up the analysis, and improves its accuracy. R provides several functions to discard duplicate rows from a DataFrame. In this article, we discuss several ways to do so.

Example 1: Using the unique() Function
The unique() function is the most common way to eliminate duplicate rows from a DataFrame. Its usage is shown in the following:

# Build a sample DataFrame; row 4 repeats row 1
data1 <- data.frame(Col1 = c(11, 25, 14, 11, 45),
                    Col2 = c(20, 15, 95, 20, 24),
                    Col3 = c(87, 32, 42, 87, 61))
data1
# Keep only the first occurrence of each row
data2 <- unique(data1)
data2

In the given R program, we create a “data1” DataFrame with three columns, “Col1”, “Col2”, and “Col3”, and populate them with the values in the respective vectors. Writing “data1” on a line by itself simply prints its data to the console. Then, we create a new DataFrame, “data2”, that contains only the unique rows of “data1”. In other words, if any row in “data1” is an exact duplicate of an earlier row, it is left out of “data2”. The remaining rows in “data2” keep the same order as in “data1”.

The output shows the original DataFrame, which contains a duplicate row, followed by the de-duplicated DataFrame, from which row “4” (a copy of row “1”) is removed.
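As a small follow-up sketch (not part of the original example), it can help to count the duplicates before removing them and to renumber the rows afterwards, since unique() keeps the original row names. The variable name “n_dupes” is our own choice for illustration.

```r
# Same sample data as in Example 1
data1 <- data.frame(Col1 = c(11, 25, 14, 11, 45),
                    Col2 = c(20, 15, 95, 20, 24),
                    Col3 = c(87, 32, 42, 87, 61))

# duplicated() flags every row that repeats an earlier row,
# so summing the flags counts the rows unique() will drop
n_dupes <- sum(duplicated(data1))
n_dupes

data2 <- unique(data1)

# The original row names survive, so "4" is missing here
rownames(data2)

# Optional: renumber the rows from 1
rownames(data2) <- NULL
data2
```

Resetting the row names is purely cosmetic, but it avoids confusion when rows are later selected by position.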

Example 2: Using the data.table Package
The DataFrame can be converted into a data.table, and the duplicate rows can then be removed from that data.table. The data.table package is designed to handle large datasets more efficiently than plain data frames.

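library(data.table)
# Sample DataFrame with repeated values in both columns
df <- data.frame(col = c('X', 'X', 'X', 'Y', 'Y', 'Y'),
                 val = c('Integer', 'Integer', 'String', 'Double', 'Boolean', 'Boolean'))
# Convert the DataFrame to a data.table
data <- data.table(df)
# Keep the first row for each distinct value of "val"
dt <- unique(data, by = "val")
dt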

In the provided R program, we load the data.table package at the beginning. Then, we create a “df” DataFrame with two columns: “col” and “val”. The values in each column are specified using vectors. Note that both columns contain repeated values. Next, we convert the “df” DataFrame to a data table using the data.table() function from the data.table package and assign the result to “data”. After that, we call the unique() function on the data table “data” with the “by” argument set to “val”. The “by” argument specifies that duplicates are identified based on the values in the “val” column only. As a result, just the first row for each distinct “val” value is retained.

Hence, the returned data table “dt” has the same columns as the original data table “data” but fewer rows, since the rows with repeated “val” values are removed.
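The following sketch illustrates how the choice of “by” changes the result. It assumes a reasonably recent version of data.table, where unique() compares all columns when “by” is omitted. The names “all_cols”, “by_col”, and “last_by_col” are ours, chosen only for illustration.

```r
library(data.table)

# Same values as in Example 2, built directly as a data.table
data <- data.table(col = c("X", "X", "X", "Y", "Y", "Y"),
                   val = c("Integer", "Integer", "String",
                           "Double", "Boolean", "Boolean"))

# No "by": whole rows are compared
all_cols <- unique(data)

# by = "col": keep the first row for each value of "col"
by_col <- unique(data, by = "col")

# fromLast = TRUE: keep the last row for each value of "col" instead
last_by_col <- unique(data, by = "col", fromLast = TRUE)

all_cols
by_col
last_by_col
```

Passing a character vector such as by = c("col", "val") would compare on that combination of columns.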

Example 3: Using the distinct() Function
Another popular approach is to use the distinct() function from the dplyr package, which returns a DataFrame with only the unique rows.

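library(dplyr)
# Sample DataFrame; the first three rows are identical
df1 <- data.frame(A = c("c1", "c1", "c1", "c1", "c2", "c2", "c3"),
                  B = c(1, 1, 1, 2, 2, 3, 5))
# With no columns named, distinct() compares whole rows
df2 <- df1 %>% distinct()
df2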

In the provided R program, we first load the “dplyr” package to use the distinct() function. Next, we build the “df1” DataFrame using the data.frame() function, where two columns are filled with values from vectors. Then, we create a new DataFrame, “df2”, which contains only the unique rows of “df1”. The pipe operator, represented by the symbol %>%, is used to chain functions together.

Here, it passes the “df1” DataFrame to the distinct() function, which produces a DataFrame that contains only the unique rows. Since no columns are given to the distinct() function, it considers all columns in the DataFrame when identifying the duplicates.

The output contains only the five distinct rows of the DataFrame; the two extra copies of the (“c1”, 1) row are dropped.
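Before dropping anything, it can be useful to see which rows are duplicated and how often. The following is a minimal sketch using dplyr's count() and filter(); the name “dupes” is our own and not part of the original example.

```r
library(dplyr)

# Same sample data as in Example 3
df1 <- data.frame(A = c("c1", "c1", "c1", "c1", "c2", "c2", "c3"),
                  B = c(1, 1, 1, 2, 2, 3, 5))

# count() tallies each (A, B) combination in a column named "n";
# keeping n > 1 leaves only the combinations that occur more than once
dupes <- df1 %>%
  count(A, B) %>%
  filter(n > 1)
dupes
```

Here, only the (“c1”, 1) combination is reported, with a count of 3.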

Example 4: Using the distinct() Function with the .keep_all Argument
The distinct() function has the .keep_all argument, which specifies whether to keep all the columns or only those that are used for the distinct operation. The .keep_all argument is FALSE by default, which means that only the columns used for the distinct operation are retained in the result.

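library(dplyr)
# Sample orders; several rows share the same id and cost
order_df <- data.frame(id = c(1, 1, 2, 3, 3),
                       order = c("laptop", "laptop", "mobile", "LED", "wire"),
                       cost = c(500, 500, 300, 400, 400))
# Compare on id and cost only, but keep every column in the result
res_df <- order_df %>% distinct(id, cost, .keep_all = TRUE)
res_df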

In the provided R program, we load the dplyr package to access its functionality. After that, we declare the “order_df” DataFrame, where data.frame() is used to define the “id”, “order”, and “cost” columns along with their values. Then, the distinct() function is applied to “order_df” using the pipe operator %>% to create a new DataFrame, “res_df”, that retains only the first row for each unique combination of values in the “id” and “cost” columns. Note that we pass “.keep_all = TRUE”, which specifies that all columns of the original DataFrame “order_df” should be retained in the output, rather than just the columns that are used to identify the unique rows.

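The output shows the DataFrame with the duplicated rows discarded: three rows remain, one for each unique combination of “id” and “cost”. The (3, “wire”, 400) row is dropped even though its “order” value is unique, because only “id” and “cost” are compared.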
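To make the effect of .keep_all concrete, the sketch below runs the same distinct() call with and without it. The names “keys_only” and “full_rows” are ours, used only for illustration.

```r
library(dplyr)

# Same sample data as in Example 4
order_df <- data.frame(id = c(1, 1, 2, 3, 3),
                       order = c("laptop", "laptop", "mobile", "LED", "wire"),
                       cost = c(500, 500, 300, 400, 400))

# Default (.keep_all = FALSE): only the compared columns are returned
keys_only <- order_df %>% distinct(id, cost)
names(keys_only)

# .keep_all = TRUE: all columns are returned, taken from the
# first row of each (id, cost) combination
full_rows <- order_df %>% distinct(id, cost, .keep_all = TRUE)
names(full_rows)
```

Both results have the same number of rows; they differ only in which columns are carried along.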

Example 5: Using the duplicated() Function
Additionally, we can use the duplicated() function of base R to remove the repeated rows from a DataFrame. The function takes a vector or a DataFrame as its argument and returns a logical vector that marks the duplicates, so it can identify the duplicate rows based on a subset of columns or all columns of the DataFrame.

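# Sample employee data; rows 5 and 6 repeat rows 1 and 2
MyData <- data.frame(emp = c("Marrie", "John", "David",
                             "Sam", "Marrie", "John"),
                     id = c(1, 2, 3, 4, 1, 2),
                     position = c("Manager", "Clerk", "Employee",
                                  "CEO", "Manager", "Clerk"))
# Keep the rows whose position has not appeared before
result <- MyData[!duplicated(MyData$position), ]
result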

In the provided R program, we have a “MyData” DataFrame with three columns: “emp”, “id”, and “position”. We pass the “position” column to the duplicated() function, which returns a logical vector: FALSE for the first occurrence of each value in the “position” column and TRUE for every subsequent occurrence. The exclamation point (!) negates this vector, so it becomes TRUE for first occurrences and FALSE for repeats. Using the negated vector as the row index of “MyData” keeps only the first row for each position, and the filtered DataFrame is stored in the “result” variable.

The output shows the filtered DataFrame, in which rows “5” and “6” (the repeated “Manager” and “Clerk” rows) are removed.
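The same idea extends to several columns and to keeping the last occurrence instead of the first. The following is a minimal sketch; “key_cols”, “first_kept”, and “last_kept” are illustrative names of our own.

```r
# Same sample data as in Example 5
MyData <- data.frame(emp = c("Marrie", "John", "David",
                             "Sam", "Marrie", "John"),
                     id = c(1, 2, 3, 4, 1, 2),
                     position = c("Manager", "Clerk", "Employee",
                                  "CEO", "Manager", "Clerk"))

# Compare on a combination of columns by passing a sub-DataFrame
key_cols <- c("emp", "position")
first_kept <- MyData[!duplicated(MyData[key_cols]), ]

# fromLast = TRUE scans from the bottom, so the last occurrence
# of each position is the one that is kept
last_kept <- MyData[!duplicated(MyData$position, fromLast = TRUE), ]

first_kept
last_kept
```

With this data, “first_kept” holds rows 1 to 4, while “last_kept” holds rows 3 to 6.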

Example 6: Using the group_by() and slice() Functions
An alternative way to eliminate the duplicate rows from a DataFrame is to use the group_by() function in conjunction with the slice() function, both from the dplyr package.

library(dplyr)

# Sample DataFrame; the values 2, 3, and 5 repeat in "col"
dataframe <- data.frame(col = c(1, 2, 2, 3, 3, 4, 5, 5),
                        var = c("var1", "var1", "var2", "var1", "var3", "var3", "var1", "var2"),
                        val = c("a", "b", "c", "d", "e", "f", "g", "h"))
# Group by "col" and keep the first row of each group
newDataframe <- dataframe %>% group_by(col) %>% slice(1)
newDataframe

In the provided R program, we load the dplyr package and use the data.frame() function to create the DataFrame, which is stored in the “dataframe” variable. After that, we use the group_by() function to group the rows of “dataframe” by the unique values in the “col” column. The grouped DataFrame is then piped to the slice() function, which returns the first row of each group. If a different row needs to be extracted, the number is changed accordingly in the slice() function. Note that the result is still grouped by “col”; the ungroup() function can be called afterwards to remove the grouping.

The output shows the resulting DataFrame, in which each row is the first occurrence of a unique value in the “col” column.
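As a sketch of how the row choice can be varied, the code below keeps the last row of each group by using n() inside slice(), and compares slice(1) with the distinct() approach from Example 4. The names “last_rows” and “first_rows” are ours, chosen for illustration.

```r
library(dplyr)

# Same sample data as in Example 6
dataframe <- data.frame(col = c(1, 2, 2, 3, 3, 4, 5, 5),
                        var = c("var1", "var1", "var2", "var1",
                                "var3", "var3", "var1", "var2"),
                        val = c("a", "b", "c", "d", "e", "f", "g", "h"))

# n() is the size of the current group, so slice(n()) keeps the last row;
# ungroup() returns a regular, ungrouped result
last_rows <- dataframe %>%
  group_by(col) %>%
  slice(n()) %>%
  ungroup()

# For the first row per group, distinct() with .keep_all = TRUE
# gives the same rows as group_by() followed by slice(1)
first_rows <- dataframe %>% distinct(col, .keep_all = TRUE)

last_rows
first_rows
```

The group_by() and slice() combination is the more flexible of the two, because any row position within a group can be requested.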

Conclusion

We have seen and used various functions of R to remove the duplicate rows of a DataFrame. These functions include unique() (for both data frames and data tables), distinct(), duplicated(), and group_by() with slice(). The best choice depends on which columns define a duplicate, which occurrence should be kept, and the size of the data.

About the author

Saeed Raza