
How to Implement Principal Component Analysis in R

Principal component analysis (PCA) reduces the number of variables in a dataset while retaining most of the variation that is present in the original data. Its primary goal is to shrink a large feature space, which saves on computational cost and makes the data easier to explore. The first principal component (PC1) points in the direction of the greatest variance in the dataset. The second principal component (PC2) captures the largest share of the remaining variance and is uncorrelated with PC1.
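To make these properties concrete, the following is a minimal sketch of PCA done by hand on the built-in “mtcars” data, assuming the variables are standardized first. The variable names (X, eig, scores, pc_var, pc_cor) are chosen for illustration only.

```r
# Standardize the built-in mtcars data (mean 0, variance 1 per column)
X <- scale(mtcars)

# Eigen decomposition of the correlation matrix
eig <- eigen(cor(mtcars))

# Project the standardized data onto the eigenvectors to get the scores
scores <- X %*% eig$vectors

# The variance of each score column equals the matching eigenvalue,
# ordered from the largest (PC1) to the smallest
pc_var <- apply(scores, 2, var)

# Scores of different components are uncorrelated (numerically zero)
pc_cor <- cor(scores[, 1], scores[, 2])
```

Comparing “pc_var” with “eig$values” shows that PC1 carries the largest variance, and “pc_cor” confirms that PC1 and PC2 are uncorrelated.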

Example 1: Using the princomp() Function

The built-in princomp() function offers an easy way to run a principal component analysis in R. Internally, princomp() performs an eigen decomposition of the covariance matrix, or of the correlation matrix when “cor = TRUE” is set.

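# Run PCA on the built-in "cars" dataset using the correlation matrix
pca1 <- princomp(cars, cor = TRUE)

# Standard deviations of the principal components
pca1$sdev
# Loadings as a plain matrix
unclass(pca1$loadings)
# First few principal component scores
head(pca1$scores)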

We create the “pca1” variable to hold the result of the princomp() function, which performs the principal component analysis on the given data frame. In our case, we use the “cars” data frame that is built into R. We pass “cars” to princomp() along with the “cor” parameter set to “TRUE” so that the analysis uses the correlation matrix, which is equivalent to standardizing the variables before the analysis.

After that, we retrieve the standard deviations of the principal components with “pca1$sdev”. Then, we call the unclass() function on “pca1$loadings” to strip the “loadings” class so that the loadings print as a plain R matrix. In the end, we get the first few principal component scores from “pca1$scores”.
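The standard deviations are most useful once they are converted into shares of the total variance. The following is a small sketch of that step; the names “variances” and “prop_var” are illustrative and not part of the princomp() result.

```r
pca1 <- princomp(cars, cor = TRUE)

# Variance captured by each component is the squared standard deviation
variances <- pca1$sdev^2

# Share of the total variance explained by each component
prop_var <- variances / sum(variances)
prop_var

# With cor = TRUE, the variances match the eigenvalues of the correlation matrix
eigen(cor(cars))$values
```

Calling summary(pca1) reports the same proportions together with their cumulative sum.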

The standard deviations, loadings, and first few scores for the “cars” dataset are shown in the following output:

Example 2: Using the prcomp() Function

We can also use prcomp(), another built-in R function, to perform the principal component analysis. It returns the results as an object of the “prcomp” class.

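# Loaded in the original example; prcomp() and biplot() are part of base R
library(tidyverse)

data("mtcars")

# Center and scale the variables, then run PCA
outcomes <- prcomp(mtcars, scale = TRUE)

# Flip the sign of the loadings and display them
outcomes$rotation <- -1*outcomes$rotation
outcomes$rotation

# Flip the sign of the scores to match, then show the first rows
outcomes$x <- -1*outcomes$x
head(outcomes$x)

# Biplot with arrows scaled to represent the loadings
biplot(outcomes, scale = 0)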

We load the “tidyverse” package first, although prcomp() and biplot() both belong to the “stats” package of base R and work without it. The data for this PCA is the built-in “mtcars” dataset, which we load with the data() function.

After that, we create the “outcomes” variable to store the result of prcomp(), with the “mtcars” dataset as its first argument. We also set the “scale” parameter, which R matches to the “scale.” argument of prcomp(), to “TRUE”. Together with the default centering, this standardizes each variable to a mean of 0 and a variance of 1 before the PCA.
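The effect of the scaling can be checked directly. The following sketch, with illustrative variable names, compares the built-in scaling against data that is standardized manually with scale(), and then looks at what happens when scaling is skipped.

```r
# PCA with built-in standardization
pca_scaled <- prcomp(mtcars, scale. = TRUE)

# PCA on data standardized manually with scale()
pca_manual <- prcomp(scale(mtcars))

# Both approaches give the same component standard deviations
same_sdev <- all.equal(pca_scaled$sdev, pca_manual$sdev)
same_sdev

# Without scaling, variables measured in large units dominate PC1
pca_raw <- prcomp(mtcars)
round(pca_raw$rotation[, 1], 2)
```

In the unscaled run, the PC1 loadings are concentrated on “disp” and “hp”, the two variables with the largest raw values, which is why scaling is usually recommended when the variables use different units.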

Next, we multiply the “rotation” component, which holds the loadings, by -1 to change its sign. The sign of each principal component is arbitrary, so flipping it does not change the analysis and can make the loadings easier to interpret. Then, we display the loadings using “outcomes$rotation” on the R console.

The output lists the loading of every variable on each principal component. Next, we turn to “outcomes$x”, which holds the principal component scores for every car in the dataset. We apply the same sign reversal to the scores by multiplying them by -1.
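To see why the sign reversal is harmless, the following sketch rebuilds the standardized data from the scores and loadings before and after the flip. The variable names are illustrative.

```r
pca <- prcomp(mtcars, scale. = TRUE)

# Reconstruct the standardized data from the scores and loadings
recon_original <- pca$x %*% t(pca$rotation)

# Flip the signs of both the loadings and the scores
flipped_rotation <- -1 * pca$rotation
flipped_x <- -1 * pca$x
recon_flipped <- flipped_x %*% t(flipped_rotation)

# Both reconstructions are identical
all.equal(recon_original, recon_flipped)

# The spread of each component is unchanged by the flip
all.equal(unname(apply(flipped_x, 2, sd)), pca$sdev)
```

The two sign changes cancel out in the product, so the flip changes only the orientation of the axes. Note that the loadings and the scores must be flipped together for this to hold.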

The following output displays the first few scores of the “mtcars” data:

In the end, we create a biplot of the principal component analysis that was performed on the “mtcars” dataset. We call biplot() with “outcomes” as its argument. Setting “scale” to 0 ensures that the arrows in the plot are scaled to represent the loadings.

The following biplot shows the principal component analysis results:

Example 3: Using the PCA() Function

We can also use the “FactoMineR” package to conduct the principal component analysis. It provides a PCA() function for multivariate data analysis and dimensionality reduction.

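library(FactoMineR)

# 2000 x 20 matrix of standard normal random numbers
RandomData <- replicate(20, rnorm(2000))

# PCA on all 20 columns, scaled to unit variance, with plots enabled
outcome.pca <- PCA(RandomData[, 1:20], scale.unit = TRUE, graph = TRUE)

# Summary of the components stored in the result
print(outcome.pca)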

We load the “FactoMineR” package with the library() function of R to get more detailed results from the principal component analysis. After that, we generate random data with the replicate() function, which calls rnorm(2000) 20 times. Each call draws 2000 values from a normal distribution with a mean of 0 and a standard deviation of 1, so the result is a matrix with 2000 rows and 20 columns.

After that, we define the “outcome.pca” variable to store the result of the PCA() function on the randomly generated data. Within PCA(), we pass the data and set “scale.unit” to “TRUE”, which scales every column of RandomData to unit variance before the PCA is performed.

We also set the “graph” parameter of PCA() to “T” (TRUE) so that the function draws its graphs.

The following graph represents the details of the principal component analysis using the PCA() method for the randomly generated data:

Lastly, we print the PCA result, which lists the components that are stored in the result object, such as the eigenvalues (“$eig”), the results for the variables (“$var”), and the results for the individuals (“$ind”).
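Because the columns of this dataset are independent random numbers, no component should stand out. The following sketch illustrates this with the base R prcomp() function as a stand-in for PCA(), so that it runs without extra packages. The seed and the variable names are assumptions made for the illustration.

```r
set.seed(42)  # assumed seed so the sketch is reproducible

RandomData <- replicate(20, rnorm(2000))
dim(RandomData)  # 2000 rows, 20 columns

# Base R stand-in for PCA(); eigenvalues are the squared standard deviations
pca_random <- prcomp(RandomData, scale. = TRUE)
eigenvalues <- pca_random$sdev^2

# Percentage of variance per component
pct_var <- 100 * eigenvalues / sum(eigenvalues)
round(pct_var, 1)
```

Each of the 20 components explains a similar share of the variance, close to 100 / 20 = 5 percent, which means that PCA cannot reduce the dimensionality of uncorrelated data in a meaningful way. In the FactoMineR result, the corresponding values are stored in “outcome.pca$eig”.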

Example 4: Using the Screeplot to Visualize the Principal Component Analysis in R

Suppose we want to visualize how much variance each principal component captures. In that case, we can use the screeplot() function to plot the variances against the component number as follows:

data(USArrests)

pc <- princomp(USArrests, cor = TRUE)
screeplot(pc, type = "line", main = "Scree Plot")

We load the “USArrests” dataset using the data() function. After that, we set the “pc” variable to the result of princomp() applied to the “USArrests” data frame. The “cor” parameter with the “TRUE” value makes the analysis use the correlation matrix, which standardizes the variables.

Then, we use the screeplot() function to plot the result of the principal component analysis, passing the “pc” variable as input. The “type” input is set to “line” to draw a line plot, and the “main” input sets the main title of the scree plot.
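The numbers behind the scree plot can also be inspected directly, which helps when deciding how many components to keep. The following is a minimal sketch; the 90 percent threshold and the variable names are illustrative assumptions, not a fixed rule.

```r
pc <- princomp(USArrests, cor = TRUE)

# The scree plot draws these variances (squared standard deviations)
variances <- pc$sdev^2

# Cumulative share of the variance explained
cum_prop <- cumsum(variances) / sum(variances)
round(cum_prop, 3)

# Smallest number of components that explains at least 90% of the variance
# (the 90% threshold is an illustrative assumption)
n_keep <- which(cum_prop >= 0.90)[1]
n_keep
```

With “cor = TRUE”, the variances add up to the number of variables, which is 4 for “USArrests”.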

We can visualize the scree plot of the principal component analysis in the following image:

Conclusion

This article explored the principal component analysis in R along with its implementation. We covered the princomp(), prcomp(), and PCA() functions, which perform the principal component analysis on the specified data. We also used a scree plot to visualize the variance that each principal component captures.

About the author

Saeed Raza
