
How to Handle and Impute Missing Data in R

Many datasets contain missing values, either because a value was never provided or because of a recording mistake. Data imputation is the process of substituting a plausible value for the missing information. Missing values need to be replaced or removed before the data can be interpreted and sound conclusions drawn. In R, a column’s missing values can be replaced in several ways, including with zero, the mean, or the median. This guide walks through code examples for handling and imputing missing data in R using RStudio.

Create a Dataset

As the sample dataset, we use a MongoDB collection that is later fetched into R. In the grid view, the collection shows a total of 10 records. The SALARY column holds some NA values, i.e. empty or missing entries.
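To follow along without a MongoDB server, a small data frame can stand in for the fetched collection. The sketch below is only an assumption about the data's shape: apart from the SALARY column name, every column name and value is hypothetical, chosen to give 10 rows with three missing salaries.

```r
# Hypothetical stand-in for the fetched collection: 10 rows, 3 missing salaries.
# Only the SALARY column name comes from the guide; everything else is made up.
data <- data.frame(
  NAME   = paste0("emp", 1:10),
  AGE    = c(25, 31, 28, 45, 39, 52, 23, 36, 41, 30),
  SALARY = c(50000, NA, 62000, 71000, NA, 80000, 45000, NA, 66000, 58000)
)

# Count the missing values per column before deciding how to handle them.
na_counts <- colSums(is.na(data))
print(na_counts)

# Row positions of the missing salaries.
missing_rows <- which(is.na(data$SALARY))
print(missing_rows)
```

Checking the per-column NA counts first makes it easier to choose between imputing values, dropping rows, or dropping a column.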

Example 1: Calculating the Mean for the Missing Data

Let’s begin with the first example of managing missing values in R data. We use RStudio and write the code in its source editor. Before moving forward, make sure that the “mongolite” and “dplyr” packages are installed: mongolite fetches the data from MongoDB, and dplyr is for data manipulation.

The library() function loads the mongolite and dplyr packages. The mongo() function connects to the “dummy” collection of the MongoDB “test” database using the provided connection string. The resulting connection object is stored in the “t” variable, and its find() method retrieves the documents of the collection into the “data” variable as a data frame, which is then printed.

To run the code, select it and click the “Run” button; the output displays the collection data in the RStudio console.

# Run the install lines once; they can be skipped on later runs.
install.packages("mongolite") # MongoDB client for R

install.packages("dplyr") # For data manipulation

library(mongolite)

library(dplyr)

t = mongo("dummy", url = "mongodb://127.0.0.1:27017/test")

data <- t$find()

data

The mean() function computes the mean of a specific column. Its first argument is the column itself, written as the “data” dataset followed by the “$” sign and the “SALARY” column name. The second argument, “na.rm = TRUE”, makes the function ignore the “NA” values in that column; without it, the result would be NA.

The next line uses R’s vectorized ifelse() function to check each entry of the “SALARY” column for “NA”. Where an entry is missing, it is replaced by the mean of the column; all other entries keep their original value.

The updated “data” dataset is then displayed. The output of this code shows that every “NA” value in the “SALARY” column is replaced by the mean, 61228.57.

mean(data$SALARY, na.rm = TRUE)

data$SALARY <- ifelse(is.na(data$SALARY), mean(data$SALARY, na.rm = TRUE), data$SALARY)

data
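The same replacement pattern works for the other fill values mentioned in the introduction (zero and the median). The following is a minimal sketch on a made-up salary vector; the fill_na() helper is a hypothetical name introduced here for illustration, not part of base R or any package.

```r
# Hypothetical salary vector with three missing entries (values are made up).
salary <- c(50000, NA, 62000, 71000, NA, 80000, 45000, NA, 66000, 58000)

# Hypothetical helper: replace every NA in x with the supplied fill value.
fill_na <- function(x, value) {
  x[is.na(x)] <- value
  x
}

# Three common single-value imputations.
mean_filled   <- fill_na(salary, mean(salary, na.rm = TRUE))
median_filled <- fill_na(salary, median(salary, na.rm = TRUE))
zero_filled   <- fill_na(salary, 0)

# Compare how each choice shifts the column mean.
print(c(mean   = mean(mean_filled),
        median = mean(median_filled),
        zero   = mean(zero_filled)))
```

Filling with the mean leaves the column mean unchanged but shrinks its spread, while filling with zero pulls the mean down sharply, so the choice should depend on what a missing salary plausibly represents.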

Example 2: Remove the Missing Data in R

Another way to handle missing data in an R dataset is to remove every row in which at least one column has a missing value.

For instance, we use the same “dummy” collection from our MongoDB database, connecting from RStudio with the connection string provided in the following code. Make sure to load the needed libraries before moving on to the main code.

After fetching the collection into the “data” variable via the find() method, we display the original collection on the RStudio console. After that, the complete.cases() function checks each row of the “data” dataset for missing values, i.e. NA. For a row with any missing value, it returns FALSE; for a complete row, it returns TRUE.

The rows returning FALSE are discarded while the rows returning TRUE remain in the dataset. The comma followed by nothing inside the brackets means that all columns of the “data” dataset are kept. The output displays the removal of three rows with an “NA” value in the “SALARY” column.

library(mongolite)

library(dplyr)

t = mongo("dummy", url = "mongodb://127.0.0.1:27017/test")

data <- t$find()

data

data <- data[complete.cases(data), ]

data
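Row removal can be tried on a local data frame as well. The sketch below uses made-up values (only the SALARY column name comes from this guide) and contrasts dropping rows with any NA against dropping only the rows where SALARY is missing.

```r
# Hypothetical data: one missing AGE and two missing SALARY values.
data <- data.frame(
  NAME   = paste0("emp", 1:6),
  AGE    = c(25, NA, 28, 45, 39, 52),
  SALARY = c(50000, 58000, NA, 71000, NA, 80000)
)

# One logical value per row: TRUE only if the row has no NA at all.
keep <- complete.cases(data)
print(keep)

# Keep complete rows; the empty slot after the comma keeps all columns.
all_complete <- data[keep, ]

# Keep rows where SALARY is present, even if another column is missing.
salary_complete <- data[!is.na(data$SALARY), ]

# na.omit() is a base R shortcut that also drops every row containing an NA.
same_as_omit <- na.omit(data)

print(nrow(all_complete))
print(nrow(salary_complete))
```

Restricting the check to one column preserves more rows, which matters when the other columns are not needed for the analysis.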

Example 3: Remove the Missing Data Column in R

Instead of deleting the rows with missing values, users can remove the whole column that contains them. For this, we utilize the same code structure that we previously used.

library(mongolite)

library(dplyr)

t = mongo("dummy", url = "mongodb://127.0.0.1:27017/test")

data <- t$find()

data

The only difference is the usage of the colSums() function. The following lines build up the expression step by step. The is.na() function checks every cell of the dataset and returns TRUE where the value is “NA”. The colSums() function then counts the TRUE values in each column, e.g. the “SALARY” column has three. Comparing these counts with 0 yields TRUE for the columns without missing values; only those columns are kept, and the columns where the count is not equal to 0 are removed from the dataset. In the end, the “SALARY” column is dropped from the dataset, as shown in the following:

is.na(data)

colSums(is.na(data))

colSums(is.na(data)) == 0

data <- data[, colSums(is.na(data)) == 0]

data
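One detail worth knowing when selecting columns this way: if only a single column survives, base R simplifies the result to a plain vector unless drop = FALSE is given. The following sketch illustrates this on made-up data (all values and the NAME and AGE columns are hypothetical).

```r
# Hypothetical data: only SALARY contains missing values.
data <- data.frame(
  NAME   = paste0("emp", 1:5),
  AGE    = c(25, 31, 28, 45, 39),
  SALARY = c(50000, NA, 62000, NA, 80000)
)

# Number of NA values per column, then a TRUE/FALSE flag per column.
na_per_column <- colSums(is.na(data))
keep_cols <- na_per_column == 0
print(keep_cols)

# drop = FALSE guarantees that the result stays a data frame.
cleaned <- data[, keep_cols, drop = FALSE]

# With only two columns, one of which is removed, the difference shows up.
two_cols  <- data[, c("AGE", "SALARY")]
collapsed <- two_cols[, colSums(is.na(two_cols)) == 0]               # plain vector
kept_df   <- two_cols[, colSums(is.na(two_cols)) == 0, drop = FALSE] # data frame

print(class(collapsed))
print(class(kept_df))
```

Adding drop = FALSE costs nothing when several columns remain and avoids surprises in later code that expects a data frame.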

Example 4: Imputation Using Mice

In R, the mice() function from the mice package performs multiple imputation, i.e. it estimates the value for a missing field from the other values in the dataset. In the following code, the mice library is used to substitute the missing values, while the mongolite library is utilized to connect to a MongoDB database. The connection is established using the mongodb://127.0.0.1:27017/test connection string. The next line of code pulls every record from the “dummy” collection of the “test” database.

library(mongolite)

library(mice)

t = mongo("dummy", url = "mongodb://127.0.0.1:27017/test")

data <- t$find()

data

The term “pmm” refers to “Predictive Mean Matching”, an imputation technique that predicts each missing value from the observed values of the other variables and then fills it with an observed value from a case whose prediction is close. Setting “m = 1” requests a single imputed dataset. The complete() function then returns the dataset with the NA entries filled in by the imputed values. The imputed data is displayed in the following:

imp <- mice(data, method = "pmm", m = 1)

final <- complete(imp)

final
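Because predictive mean matching involves random draws, repeated runs can give different imputed values. The sketch below, on made-up numeric data (the AGE and EXPERIENCE columns and all values are hypothetical), passes a seed to mice() so that the result is reproducible, and checks that every imputed salary is one of the observed salaries. It assumes the mice package is installed and skips the imputation otherwise.

```r
# Hypothetical numeric data: 10 rows, three missing salaries (values are made up).
data <- data.frame(
  AGE        = c(25, 31, 28, 45, 39, 52, 23, 36, 41, 30),
  EXPERIENCE = c(2, 9, 3, 15, 12, 20, 1, 6, 18, 8),
  SALARY     = c(50000, NA, 62000, 71000, NA, 80000, 45000, NA, 66000, 58000)
)

# Run only if mice is installed; the seed value 123 is an arbitrary choice.
if (requireNamespace("mice", quietly = TRUE)) {
  suppressPackageStartupMessages(library(mice))

  # seed makes the random donor selection repeatable; printFlag silences the log.
  imp   <- mice(data, method = "pmm", m = 1, seed = 123, printFlag = FALSE)
  final <- complete(imp)

  observed <- data$SALARY[!is.na(data$SALARY)]
  imputed  <- final$SALARY[is.na(data$SALARY)]

  print(imputed)
  # PMM borrows values from observed cases, so this is expected to be TRUE.
  print(all(imputed %in% observed))
}
```

With m = 1 only one completed dataset is produced; larger values of m create several completed datasets that can be inspected individually through the action argument of complete().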

Conclusion

In this guide, we discussed how missing data can be handled in a variety of ways. To support the discussion, we walked through code examples in R: replacing missing values with the mean, removing the rows or the columns that contain missing values, and using the mice() function to impute the data.

About the author

Saeed Raza
