
How to Implement Random Forest in R

A random forest is an ensemble of decision trees in R programming. To obtain more precise predictions, it constructs several decision trees and combines their results. While the forest is grown, the prediction error is estimated on the observations that each tree did not use for training; this OOB (Out-of-Bag) error estimate is given as a percentage for classification forests and as a mean squared error for regression forests. The random forest in R handles a large number of features and assists in determining the most significant ones.
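As a quick illustration of the OOB estimate, the following minimal sketch fits a classification forest on the built-in “iris” dataset and reads the error off the fitted object (it assumes the “randomForest” package is already installed, which is covered in the next section; the seed and number of trees are arbitrary choices):

library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

print(rf)                            # the printed summary reports the OOB estimate of the error rate
oob <- rf$err.rate[rf$ntree, "OOB"]  # OOB error after the last tree has been added
cat("OOB error estimate:", round(100 * oob, 2), "%\n")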

Installation of the “randomForest” Package in R

Before implementing a random forest on any of the specified data, we need to ensure that the “randomForest” package is installed in the R environment. To install the package, we use the following command in R:

install.packages("randomForest")

This installs the “randomForest” package into the R library directory. Then, we can use the randomForest() function to build and examine random forests.
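As an optional check (a minimal sketch; packageVersion() is a base R utility and not specific to this tutorial), the package can be loaded and its installed version confirmed before proceeding:

library(randomForest)            # load the installed package
packageVersion("randomForest")   # confirm which version is available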

Example 1: Implementation of the randomForest() Method in R

Implementing the randomForest() function is only possible once the “randomForest” package is installed in the system.

library(randomForest)

# inspect the structure of the built-in "cars" dataset
str(cars)

# fit a regression forest with "dist" as the response and "speed" as the predictor
output.forest <- randomForest(dist ~ speed, data = cars)

print(output.forest)

Here, we first load the “randomForest” package in R to establish the random forest for the dataset. In this case, we use the built-in “cars” dataset, whose column structure we display using the str() function.

Next, we create the forest for the “cars” dataset by calling the randomForest() function and storing the result in the “output.forest” variable. Inside the randomForest() function, we pass the formula to be applied, where the “dist” column is the response variable and the “speed” column is the predictor variable. Then, we set the data parameter to the “cars” dataset to obtain a regression forest for it. Finally, we print the stored results of “output.forest” using the print() function.

The output displays the type of random forest for the “cars” dataset, the number of trees, and the mean of squared residuals in detail in the following image:
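Once the forest is fitted, it can also be used for prediction with R’s generic predict() function. The following is a minimal sketch; the new speed values (and the “new_speeds” name) are made up purely for illustration:

# predict the stopping distances for two hypothetical speeds
new_speeds <- data.frame(speed = c(10, 20))
predict(output.forest, newdata = new_speeds)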

Example 2: Implementation of the randomForest() Method in R after Handling the NA Values

Sometimes, a dataset contains missing values, which leads to an error when performing the random forest implementation. So, we first locate the NA values and fill them with the column median values. After that, randomForest() can be applied to that dataset.

# a small data frame with NA values in the X1 and X3 columns
df <- data.frame(
  X1 = c(12, NA, 13, NA),
  X2 = c(5, 4, 2, 1),
  X3 = c(99, NA, NA, 31))

print(df)

library(randomForest)

str(df)

# count the rows that contain at least one missing value
sum(!complete.cases(df))

# replace the NA values in each column with that column's median
for(i in 1:ncol(df)) {
  df[ , i][is.na(df[ , i])] <- median(df[ , i], na.rm = TRUE)
}

df

# fit a regression forest that predicts X1 from the remaining columns
set.seed(1)
model <- randomForest(
  formula = X1 ~ .,
  data = df
)

model

Here, we manually create a data frame in the “df” variable using the data.frame() function and define it with three columns, “X1”, “X2”, and “X3”, which contain values along with some NA values. After that, we print the data frame by passing the “df” variable to the print() function.

Next, we load the “randomForest” package in the R environment using the library() function to build the random forest for this data frame. We also call the str() function to get the structure of the data frame object.

After that, we count the rows with missing values via the sum() function combined with complete.cases(). As we can see in the output, the sum() function returns the value 3, indicating that three rows contain missing values. We will, therefore, use the column medians to fill in the values that are missing in each column before fitting a random forest model.

We use a for-loop together with the is.na() function to find the NA values and replace them with the corresponding column median. Then, we display the “df” data frame, which no longer contains NA values in any of its rows.

Finally, we fit the random forest model by calling the randomForest() method, where the formula is set to X1 ~ . and the imputed data frame is passed to the data parameter. Then, we display the fitted random forest model in the output.

The following output displays the mean squared error for the forest model in the image:
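As a side note, the “randomForest” package also provides the na.roughfix() helper, which performs this median (and, for factors, mode) imputation in a single call. The following minimal sketch rebuilds the same small data frame under the illustrative name “df_na” so it does not overwrite the imputed “df” from the example above:

library(randomForest)

df_na <- data.frame(
  X1 = c(12, NA, 13, NA),
  X2 = c(5, 4, 2, 1),
  X3 = c(99, NA, NA, 31))

# numeric NAs are replaced by the column median, factor NAs by the most frequent level
print(na.roughfix(df_na))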

Example 3: Implementation of the randomForest() Method in R for Classification

Moreover, a dataset can also be classified using the random forest package.

library(randomForest)

# convert the response to a factor so that a classification forest is grown
mtcars$cyl <- as.factor(mtcars$cyl)

mtcars.rf <- randomForest(cyl ~ .,
                          data = mtcars,
                          importance = TRUE,
                          proximity = TRUE)

print(mtcars.rf)

plot(mtcars.rf)

We load the “randomForest” package and build the random forest for the classification of the “mtcars” dataset here. Since randomForest() only performs classification when the response is a factor, we first convert the “cyl” column with as.factor(). We then deploy the randomForest() function and set it with the “formula” parameter, the “data” parameter, and the “importance” parameter, which is defined with the TRUE value to specify that the variable importance measures are computed for the dataset’s features.

Then, the “proximity” parameter is also defined with the TRUE value so that the proximity matrix between the rows is computed. Next, we display the results of the random forest classification model in the following image:

Next, we plot the error against the number of trees that is obtained from the randomForest() function, which is rendered in the following image:
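Because proximity = TRUE was set, the fitted object also stores a proximity matrix between the rows. As a minimal sketch (not part of the original example; the colors and title are arbitrary choices), it can be visualized with the package’s MDSplot() helper, coloring the points by class:

# multidimensional scaling of the proximities, colored by the "cyl" classes
randomForest::MDSplot(mtcars.rf, fac = as.factor(mtcars$cyl), k = 2,
                      palette = c("red", "blue", "darkgreen"),
                      main = "Proximity-based MDS plot")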

Example 4: Implementation of the randomForest() Method in R to Create a Variable Importance Visualization

The variable importance graph uses the varImpPlot() function of the random forest package to plot each variable’s importance based on the mean decrease in accuracy and the mean decrease in the Gini index.

library(randomForest)

data(USArrests)

head(USArrests)

# grow 100 trees on the USArrests data; keep.forest = FALSE discards the forest
# itself, since only the importance values are needed for the plot
firstRand <- randomForest(USArrests, data = USArrests, ntree = 100,
                          keep.forest = FALSE, importance = TRUE)

randomForest::varImpPlot(firstRand, sort = FALSE,
                         main = "Importance Plot of Variables")

We load the USArrests data in this example and then fetch the top rows of the USArrests data using the head() function. After that, we create the “firstRand” variable, where the randomForest() function is performed over the USArrests data with the “ntree” parameter set to 100 to specify the number of trees, the “keep.forest” parameter set to FALSE so that the output object does not keep the forest itself, and the “importance” parameter set to TRUE so that the importance values are computed.

Then, we pass the forest results to varImpPlot() to render the plot of the important variables. For this, we call the varImpPlot() method with the “firstRand” input, the “sort” argument set to FALSE so that the variables are not sorted, and the main title for the plot specified with the “main” parameter.

The variable importance plot for the “USArrests” dataset is displayed in the following image:
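The numeric scores behind this plot can also be inspected directly with the importance() accessor from the same package; a minimal sketch:

# matrix of importance measures, one row per variable
imp <- randomForest::importance(firstRand)
print(round(imp, 2))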

Conclusion

Using these example implementations, we explored the random forest package and its functionality in R. We worked through different examples which include the regression random forest, handling the missing values before performing the random forest, the classification random forest, and the visualization of the important variables using a plot.
