
How to Implement Clustering Analysis in R

R is a widely used programming language for statistical computing and graphics, most often used to examine and visualize data. R also supports clustering analysis, which separates a large set of items into smaller groups so that objects with similar characteristics end up together. The objects in a cluster share similar traits. Clustering is used in data analysis and data mining to identify groups of comparable observations, with applications in fields such as medical science and marketing. In this tutorial, we will learn the different types and methods that we can use to cluster data in the R language.

Kinds of Clustering Analysis

  • K-means clustering
  • Hierarchical clustering
  • Spectral clustering
  • Density-based clustering
  • Ensemble clustering

Techniques for Clustering Analysis in R

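There are two techniques for clustering data in R. Soft clustering gives each observation a degree of membership in every cluster, whereas hard clustering assigns each observation to exactly one cluster: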

  • Soft clustering
  • Hard clustering
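The difference between the two techniques can be sketched on a small built-in dataset. This is only an illustration under some assumptions: it uses the “faithful” dataset (not used elsewhere in this tutorial), the base “kmeans()” function as a hard method, and “fanny()” from the “cluster” package as one possible soft (fuzzy) method. The choice of two clusters is arbitrary.

```r
# Hard vs. soft clustering on the built-in "faithful" dataset (illustrative choice).
library(cluster)  # assumed to be installed; provides fanny()

x <- scale(faithful)  # standardize both columns

# Hard clustering: every observation gets exactly one label.
set.seed(42)  # k-means starts from random centers
hard <- kmeans(x, centers = 2, nstart = 25)
head(hard$cluster)

# Soft clustering: every observation gets a membership degree per cluster.
soft <- fanny(x, k = 2)
head(soft$membership)  # each row sums to 1
```

In the hard result, each row has a single integer label. In the soft result, each row holds one membership value per cluster.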

Example 1: Clustering Analysis Using the kmeans() Function in R

We use the R command prompt here to cluster the data. Two packages are needed for this: “cluster” and “factoextra”. Here, we install the “factoextra” package and then load it with library(factoextra). Once the packages are installed, we can cluster the data easily.

install.packages("factoextra")

Next, we load our dataset. R ships with many built-in datasets, and we use the “airquality” dataset here. We store its data in the “df” variable.

Now, we remove the rows that contain missing values with the “na.omit()” function. After this, we standardize the dataset with the help of the “scale()” function. We then call the “kmeans()” function, passing “df” and setting “centers” to “4” and “nstart” to “25”, so that the algorithm looks for four clusters and keeps the best result of 25 random starts. We visualize the clusters with the help of “fviz_cluster()”. Because the data has more than two variables, this function plots the observations on the first two principal components, which serve as the “X-Y” coordinates.

We pass “km, data = df” to “fviz_cluster()”. We then call “kmeans()” again, this time with “centers” set to “5”, and call “fviz_cluster()” once more. When we run this code, a plot of the dataset’s clusters is displayed for each call.
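Before running the full listing, the two preprocessing steps can be checked on their own. The following is a minimal sketch that uses only base R; the variable names “raw”, “complete”, and “scaled” are chosen for illustration and do not appear in the tutorial's code.

```r
# Check what na.omit() and scale() do to the built-in airquality data.
raw <- airquality
nrow(raw)                     # number of rows before removing missing values

complete <- na.omit(raw)      # drop every row that contains at least one NA
nrow(complete)                # fewer rows remain
anyNA(complete)               # FALSE: no missing values are left

scaled <- scale(complete)     # center and scale each column; returns a matrix
round(colMeans(scaled), 10)   # column means are numerically 0
apply(scaled, 2, sd)          # column standard deviations are 1
```

Scaling matters for k-means because the algorithm relies on distances, so a column with large values would otherwise dominate the result.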

install.packages("factoextra")

library(factoextra)

df <- airquality

# Omitting any NA values

df <- na.omit(df)

# Scaling dataset

df <- scale(df)

km <- kmeans(df, centers = 4, nstart = 25)

# Visualize the clusters

fviz_cluster(km, data = df)

# Refit with five clusters

km <- kmeans(df, centers = 5, nstart = 25)

# Visualize the new clusters

fviz_cluster(km, data = df)

We can easily see the clusters of the dataset in the following image. As shown here, we can implement the clustering analysis in the R language like this. The final plot shows the clusters of data in five different colors.
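The example tries four and then five centers without saying how to choose between them. One common heuristic, which this tutorial does not cover, is the "elbow" plot: fit k-means for several values of k and look for the point where the total within-cluster sum of squares stops dropping quickly. The sketch below uses only base R; the range 1 to 8 and the seed are arbitrary assumptions.

```r
# Elbow heuristic for choosing the number of clusters (illustrative sketch).
df <- scale(na.omit(airquality))

set.seed(123)                 # make the random starts reproducible
k_values <- 1:8               # candidate numbers of clusters (arbitrary range)

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) {
  kmeans(df, centers = k, nstart = 25)$tot.withinss
})

plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The “factoextra” package also offers “fviz_nbclust()”, which can draw a similar plot, for example with method = "wss".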

Example 2: Clustering Analysis from a CSV File in R

In this second example, we learn how to read data from a CSV file and then apply hierarchical clustering to it. Hierarchical clustering arranges the data in a tree-shaped structure.

To read the “CSV” file, we use the “read.csv()” function. We pass it “file.choose()”, so when this code runs, a dialog lets us choose the “CSV” file whose data we want to read. The data is stored as the dataset in “df”.

After this, the “dist()” function computes a distance matrix from the data, which we store in “d”. The columns must be numeric for this to work. We then perform the hierarchical clustering by passing “d” to the “hclust()” function and store the result in “hc”. Finally, we draw the dendrogram for “hc” on the screen with the “plot()” function.
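Because “file.choose()” opens an interactive dialog, the reading step is hard to reproduce in a script. The following sketch shows a non-interactive variant under some assumptions: it first writes the built-in “iris” dataset to a temporary file, which stands in for your own CSV file, and then keeps only the numeric columns, since distance calculations need numeric data. The names “csv_path” and “numeric_df” are hypothetical.

```r
# Write a built-in dataset to a temporary CSV so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")  # stands in for the path to your own file
write.csv(iris, csv_path, row.names = FALSE)

# Read the file by path instead of choosing it in a dialog.
df <- read.csv(csv_path)
str(df)                                 # inspect the column types

# Keep only the numeric columns for the distance calculation.
numeric_df <- df[sapply(df, is.numeric)]
head(numeric_df)
```

With a real file, replace “csv_path” with the path to that file and skip the “write.csv()” step.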

df <- read.csv(file.choose())

d <- dist(as.matrix(df))

hc <- hclust(d)

plot(hc)

q() # Quit the R session when finished (optional)

The output is given after this code:

This is the cluster dendrogram, which shows the hierarchical clustering of the data that we read from the “CSV” file.
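A dendrogram shows the whole hierarchy, but it does not yet assign each observation to a group. One way to obtain such labels is to cut the tree with the base “cutree()” function. The sketch below is an illustration only: it uses the built-in “USArrests” dataset instead of a CSV file, and the choice of four groups is arbitrary.

```r
# Hierarchical clustering on a built-in dataset, then cutting the tree into groups.
d  <- dist(scale(USArrests))   # Euclidean distances on standardized columns
hc <- hclust(d)                # complete linkage is the default method

plot(hc)                       # draw the dendrogram

groups <- cutree(hc, k = 4)    # k = 4 is an arbitrary choice for illustration
table(groups)                  # number of observations in each group

rect.hclust(hc, k = 4, border = "red")  # outline the four groups on the plot
```

The “groups” vector is named by row, so it can be used to look up the group of any state.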

Example 3: Clustering Analysis Using the fviz_cluster() Function in R

We use the “library()” function in this last example. It loads an installed package so that its functions become available. Here, we use three packages, so we call “library()” once for each. First, we load “cluster”. Then, we load “factoextra”. In the third “library()” call, we load “gridExtra”.

After this, we use the “data()” function, which loads the built-in datasets that come with R packages. We access the “mtcars” dataset here, which is built into R. We store the data from this dataset in “d_frame”.

We now remove any rows with missing values from this dataset with the “na.omit()” function. The “scale()” function is then used to standardize the dataset. Next, we call the “kmeans()” function, passing “d_frame” and setting “centers” to “2” and “nstart” to “25”. We use “fviz_cluster()” to visualize the clusters. This function displays the observations on the first two principal components, which serve as the “X-Y” coordinates. We pass “kmeans2” and “data = d_frame” to the “fviz_cluster()” function.

library(cluster)

library(factoextra)

library(gridExtra)

data('mtcars')

d_frame <- mtcars

d_frame <- na.omit(d_frame) #Removing the missing values

d_frame <- scale(d_frame)

kmeans2 <- kmeans(d_frame, centers = 2, nstart = 25)

fviz_cluster(kmeans2, data = d_frame)

The following graphic clearly shows the dataset’s clusters, which confirms that we can perform the clustering analysis in R this way. The data clusters are displayed using two different colors:
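Besides the plot, the object that “kmeans()” returns can be inspected directly. The following sketch uses only base R and repeats the clustering step so that it runs on its own; the seed and the “profile” variable are assumptions added for illustration.

```r
# Inspect the k-means result for mtcars numerically (illustrative sketch).
d_frame <- scale(na.omit(mtcars))

set.seed(123)                  # make the random starts reproducible
kmeans2 <- kmeans(d_frame, centers = 2, nstart = 25)

kmeans2$size                   # number of cars in each cluster
head(kmeans2$cluster)          # cluster label for each car, named by row

# Average of the original (unscaled) variables within each cluster
profile <- aggregate(mtcars, by = list(cluster = kmeans2$cluster), FUN = mean)
profile
```

Comparing the rows of “profile” shows which variables differ most between the two clusters.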

Conclusion

This tutorial covered how to implement clustering analysis in the R language, the different kinds of clustering, and the techniques for clustering data. We learned how to use the “kmeans()” function and hierarchical clustering. We walked through three examples in which we clustered the data, applying the analysis to built-in datasets as well as to the data of a CSV file. We also showed the output of each example.

About the author

Saeed Raza
