
PCA in Sklearn

PCA (Principal Component Analysis) is a mathematical algorithm that transforms a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables, known as principal components. PCA is one of the most popular algorithms for dimensionality reduction. Karl Pearson invented PCA in 1901, describing it as finding the “lines and planes of closest fit to systems of points in space”.

We will discuss the details of PCA and its implementation using sklearn.

What is PCA?

Principal Component Analysis (PCA) is a data reduction method. It is used when you have many measurements for each case, but some of them are correlated with each other. Principal Components exploit that correlation to decrease the number of variables required to characterize each case adequately. Suppose, for example, that you record ten physical-performance scores for each athlete. A Principal Components analysis will likely reveal that, despite the ten measures, only three underlying abilities were being measured: sprinting, jumping, and throwing, three characteristics rather than ten. Principal Components would provide coefficients for each of the ten scores, indicating how much each score contributes to the new run, jump, and throw scores. The three composite scores would also tell you how much of the total variation they account for. Working with three variables is easier than working with ten, and if they account for the majority of the variation, you have captured essentially all of the information from the ten scores in three.
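To make this concrete, here is a minimal sketch using purely synthetic data (every variable name below is an assumption of this illustration, not part of the article's example): ten correlated scores are generated from three latent abilities, and PCA shows that three components capture nearly all of the variance.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Three hypothetical latent abilities for 200 athletes (synthetic data).
latent = rng.normal(size=(200, 3))

# Ten observed scores, each a random mixture of the three abilities plus noise.
mixing = rng.normal(size=(3, 10))
scores = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

pca = PCA()
pca.fit(scores)

# The first three components should account for nearly all of the variance.
print(pca.explained_variance_ratio_.round(3))
print(pca.explained_variance_ratio_[:3].sum())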

The technique becomes much more useful when you have hundreds of measurements. However, we may encounter one issue: some of our input features are correlated with each other. Depending on the strength of the correlation, this can mean we carry extra dimensions in our input data when we could capture the same amount of information with fewer. PCA provides a systematic technique for determining which combinations of features appear to explain more of the data's variance than others, and offers some guidance on reducing the number of dimensions in our input. This is not to say that PCA tells us which features are unnecessary; rather, it shows us how to combine features into a smaller subspace without losing (much) information. Reducing the number of dimensions before feeding data into ML algorithms has traditionally been useful because it reduces complexity and processing time. That said, PCA is not a panacea, but it is a fantastic tool when it does work.

Example 1

Consider a dataset D containing two-dimensional data that lies along the line y = x. This data is represented in two dimensions, with an x and a y value for each data point. PCA would identify the vector <1, 1> as the direction of maximum variance, and this vector would become the new x-axis. We can then represent dataset D in just one dimension. As a result, PCA is a dimensionality reduction technique focused on locating the directions of largest variance.
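A minimal sketch of this example (the specific data points are assumptions made for illustration): fitting PCA to points lying on y = x recovers a first component proportional to <1, 1>, which explains all of the variance.

import numpy as np
from sklearn.decomposition import PCA

# Points lying exactly on the line y = x.
D = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]], dtype=float)

pca = PCA(n_components=2)
pca.fit(D)

# First component is the direction <1, 1> (up to sign), about [0.707, 0.707].
print(pca.components_[0])
# The first component explains 100% of the variance, the second 0%.
print(pca.explained_variance_ratio_)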

Example 2

Suppose your data lies along a line in two dimensions. In that case, PCA quickly recognizes that your x and y are correlated and constructs a new orthogonal coordinate system so that the first coordinate captures as much of the variance as possible. As a result, the second component carries almost no information, and you can probably drop it from your models without causing much damage. In this way you have projected two dimensions down to one without losing much information. While you can do this visually in two dimensions, it is a little more difficult in n dimensions.
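A short sketch of this idea (the noisy line data is an assumption made for illustration): with n_components=1, PCA keeps only the dominant direction and projects each point to a single coordinate.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Noisy data scattered around the line y = x.
x = rng.uniform(0, 10, size=100)
X = np.column_stack([x, x + 0.1 * rng.normal(size=100)])

# Keep only the first principal component.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)

print(X.shape)                        # (100, 2)
print(X_reduced.shape)                # (100, 1)
print(pca.explained_variance_ratio_)  # close to [1.0]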

Features of PCA

Transform the data to a comparable scale. Some features in a dataset may span a large range (1 to 100) while others span a small one (0 to 1), which gives the large-scale features a greater impact on output predictions than the small-scale ones.

To identify the correlations between all features, compute the covariance matrix of the data.

Then, find the eigenvalues and eigenvectors of the covariance matrix. After that, sort the eigenvectors by decreasing eigenvalue and select the k eigenvectors with the largest eigenvalues.

Use this eigenvector matrix to project the samples onto the new subspace, as sketched in the example below.
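The steps above can be sketched directly in NumPy, here using the same small dataset as the sklearn example later in the article (note that sklearn's PCA centers the data but does not standardize it; the scaling step is the practitioner's responsibility):

import numpy as np

# Small two-dimensional dataset (same as the sklearn example below).
X = np.array([[1, 2], [2, 1], [3, 2], [2, 3], [4, 5], [5, 4]], dtype=float)

# Step 1: bring the features to a comparable scale (standardize).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: compute the covariance matrix of the standardized data.
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the covariance matrix,
# sorted by decreasing eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 4: keep the k eigenvectors with the largest eigenvalues (here k = 1)
# and project the samples onto the new subspace.
k = 1
X_projected = X_std @ eigenvectors[:, :k]

print(eigenvalues / eigenvalues.sum())  # fraction of variance per component
print(X_projected.shape)                # (6, 1)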

PCA can also be used to find out whether there are correlations between your variables. If you have three-dimensional data and a best-fit two-dimensional plane captures it accurately, then the values in the third dimension are likely to be a linear function of the first two, plus or minus some Gaussian noise.

If you need to transmit data, instead of sending n-dimensional data points you can use PCA to send m-dimensional coordinates in a best-fit subspace (plus the equation of the subspace). In other words, it can be used to compress data. If the fit is perfect, you lose no information; if it is close, you lose a little.
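A rough sketch of PCA as compression (the random low-rank data here is an assumption of this illustration): project onto a smaller subspace with transform, then rebuild the points with inverse_transform and measure how much is lost.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 5-dimensional data that really lives in a 2-dimensional subspace, plus noise.
basis = rng.normal(size=(2, 5))
X = rng.normal(size=(300, 2)) @ basis + 0.05 * rng.normal(size=(300, 5))

# Compress each point to 2 coordinates ...
pca = PCA(n_components=2)
X_compressed = pca.fit_transform(X)

# ... and reconstruct an approximation of the original 5-dimensional points.
X_restored = pca.inverse_transform(X_compressed)

# Average reconstruction error stays small because the fit is close.
print(np.mean((X - X_restored) ** 2))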

Because many machine learning algorithms work best when each variable adds new information, PCA is frequently used to find and remove redundant variables in a dataset.

It does not always detect real, non-linear redundancy. If you interpret PCA statistically, you have to make some assumptions about the underlying relationships between the variables and their noise. Still, it is a highly valuable tool, even when those assumptions are not ideal.

Implementing PCA in sklearn

import numpy as np
from sklearn.decomposition import PCA

# A small two-dimensional dataset.
X = np.array([[1, 2], [2, 1], [3, 2], [2, 3], [4, 5], [5, 4]])

# Keep both principal components so we can inspect them.
pca = PCA(n_components=2)
pca.fit(X)

# Fraction of the variance explained by each component.
print(pca.explained_variance_ratio_)
# Singular values corresponding to each component.
print(pca.singular_values_)

Output
[0.86153846 0.13846154]
[4.3204938  1.73205081]

Conclusion

This article discussed Principal Component Analysis and its implementation using sklearn. Sklearn is a popular Python library used to develop Machine Learning models. PCA mitigates the curse of dimensionality by projecting high-dimensional data into lower dimensions without losing much information. PCA can be implemented in sklearn using the PCA class from the sklearn.decomposition module.

About the author

Simran Kaur

Simran works as a technical writer. A graduate with an MS in Computer Science from the well-known CS hub, aka Silicon Valley, she is also an editor of the website. She enjoys writing about any tech topic, including programming, algorithms, cloud, data science, and AI. Travelling, sketching, and gardening are the hobbies that interest her.