
Spectral Clustering in Python

Clustering is a widely used machine learning task in which similar data points are grouped together to form clusters. It is used in applications like recommender systems, anomaly detection, and customer segmentation. In this article, we will walk through a modern clustering technique known as Spectral Clustering and its implementation in Python using the sklearn library.

What is Clustering?

Clustering is an unsupervised machine learning problem in which one must divide “m” observations into “k” clusters, with points in the same cluster being very similar and points in different clusters being very dissimilar. Problems like customer segmentation, recommender systems, and anomaly detection are solved with clustering. You may be familiar with the k-means clustering algorithm, in which we have no labels and must assign each data point to its cluster. The spectral clustering method pursues the same goal as k-means but with a graph-based approach. Picture, for example, three well-separated clusters, each gathering points that are similar to one another.

What is K-means Clustering?

K-means clustering involves identifying the K clusters in a dataset that are distinct from one another. Only independent variables are used to create the clusters. K-means clustering is an unsupervised learning algorithm: data points in the same cluster are quite similar, whereas data points in different clusters are very distinct. You begin with K random centers and assign each item to the one closest to it. The center of each collection is then recalculated, yielding new K centers. You keep doing this until the number of iterations reaches a predetermined threshold or the cluster centers barely move. The Elbow Method is commonly used to determine the value of K.
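To make this concrete, here is a minimal K-means sketch with the elbow heuristic using sklearn; the synthetic three-blob dataset and the range of K values tried are illustrative assumptions of ours, not from a real application.

# Minimal K-means sketch with the elbow heuristic.
# The synthetic data and K range below are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three hypothetical globular clusters centered at (0,0), (5,5), (0,5).
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# Elbow method: inertia (within-cluster sum of squares) versus K;
# pick the K where the curve stops dropping sharply.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)]
print(inertias)  # drops steeply up to K=3, then flattens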

Classification vs. Clustering

Classification is a supervised learning task, which means that you want the system to produce a known label. For example, if you constructed an image classifier, it would say, “this is a dog, this is a cat,” based on samples of dogs and cats you showed it.

Clustering is the result of unsupervised learning, which means you have seen a lot of samples but have not been given labels for them. For instance, we can use clustering to group customers of the same kind together and separate them from customers of different kinds. This is a widely encountered problem statement that is solved using clustering.

What is Spectral Clustering Algorithm?

Spectral Clustering is a modern clustering algorithm based on graph theory. It has outperformed several classic clustering approaches and is still evolving. This algorithm takes each data point as a graph node and uses graph partitioning to solve the clustering problem.

Working of Spectral Clustering

Creating a Graph Data Structure

You can visualize any dataset as a point cloud, with m points in n dimensions. You can make a graph out of those points, with the nodes being the points and the edges (with weights w) being weighted by how similar the points are. Once we have our data in the form of a graph, we can generate an adjacency matrix by simply entering the weight of the edge between nodes “i” and “j” in entry (i, j) of the matrix. This is an m x m symmetric matrix, conventionally named W.
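As a sketch of this step, we can build W with an RBF (Gaussian) similarity via sklearn, using the same toy points as the implementation section below; the gamma value is an illustrative assumption.

# Sketch of building the weighted adjacency matrix W with an RBF
# (Gaussian) similarity; the gamma value is an assumption.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[1, 1], [2, 1], [1, 0],
              [4, 7], [3, 5], [3, 6]], dtype=float)

W = rbf_kernel(X, gamma=0.5)  # W[i, j] = exp(-gamma * ||x_i - x_j||^2)
np.fill_diagonal(W, 0)        # no self-loops in the graph
print(W.shape)                # (6, 6); symmetric: W[i, j] == W[j, i]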

Projecting the Data

In this step, the data is projected into a lower-dimensional space so that similar points end up close to each other. The degree of node i is the total weight of the edges attached to it:

d_i = Σ_j w_ij

The degree matrix D collects these values on its diagonal:

D = diag(d_1, d_2, …, d_m)

The graph’s Laplacian can then be calculated as L = D − W. Now that we have the Laplacian of the graph, we can compute its spectrum: its eigenvectors, ordered from most significant to least significant. The smallest eigenvalues correspond to the least significant eigenvectors, and taking the eigenvectors associated with the “k” smallest eigenvalues gives a representation of each node of the graph, and hence of each point in the dataset, in “k” dimensions. This is a type of dimensionality reduction that isn’t linear.
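Continuing from the adjacency matrix W built above, a minimal sketch of the projection step might look as follows; the choice k = 2 is an assumption for illustration.

# Sketch of the projection step, continuing from W above; k = 2 is
# an illustrative assumption.
import numpy as np

d = W.sum(axis=1)           # degree of each node: d_i = sum_j w_ij
D = np.diag(d)              # degree matrix
L = D - W                   # graph Laplacian

# L is symmetric, so eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(L)
k = 2
embedding = eigvecs[:, :k]  # each row represents one data point in k dims
print(eigvals[:k])          # small values hint at well-separated clusters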

Clustering the Data

This step mostly entails clustering the reduced-dimensional data using K-Means or any other classic clustering technique. Each node is first assigned its row of the eigenvector matrix of the (normalized) graph Laplacian; these rows are then clustered using any standard method.

In an ideal scenario, you would expect your graph not to be fully connected, with a distinct connected component for each cluster. In practice, however, this is rarely the case: it depends on various things, including the data itself and how you design your adjacency graph. The better the clusters are separated, the more predictably spectral clustering behaves: the graph will have more than one connected component (ideally K, the number of clusters in the dataset), the first K eigenvalues will be zero, and running K-Means in the space created by taking the first K eigenvectors of the graph Laplacian will yield fairly satisfying results. The closer together the clusters are, the farther the eigenvalues move from 0, and the less cleanly the points in the eigenspace separate into distinct clusters.
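One practical consequence: a dense similarity graph is usually fully connected, so only the first eigenvalue is exactly zero, and a common heuristic (the eigengap heuristic, our addition here rather than anything from the text above) looks for the largest gap between consecutive eigenvalues to suggest K. A sketch, reusing eigvals from the snippet above:

# Eigengap heuristic sketch: suggest K at the largest gap between
# consecutive Laplacian eigenvalues. Searching only the first few
# gaps is an arbitrary assumption.
gaps = np.diff(eigvals)
k_estimate = int(np.argmax(gaps[:5])) + 1
print("suggested K:", k_estimate)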

K-means vs. Spectral Clustering

Consider a dataset whose clusters are not globular, for example points arranged in two concentric circles.

Even when the true number of clusters K is known to the algorithm, K-means will fail to cluster such data successfully. This is because K-means is a good clustering algorithm only for globular groups, where all cluster members are close to each other (in the Euclidean sense). Graph-clustering approaches such as spectral clustering, on the other hand, do not cluster data points directly in their native data space but instead build a similarity matrix whose (i, j)-th entry is some similarity measure between the i-th and j-th data points in your dataset.

In some ways, spectral clustering is more general (and powerful) than K-means: spectral clustering is applicable whenever K-means is (just use a simple Euclidean distance as the similarity measure), but the converse is not true. When choosing one of these strategies over the other, there are some practical concerns to keep in mind. With K-means, the input data matrix itself is factorized, whereas with spectral clustering, the Laplacian matrix (a matrix derived from the similarity matrix) is factorized.
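The contrast is easy to reproduce with sklearn’s built-in toy data. The following sketch uses make_circles, two concentric rings standing in for the example discussed above; all parameter values are illustrative.

# Sketch comparing K-means and spectral clustering on two concentric
# circles; all parameter values are illustrative assumptions.
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

X_c, y_c = make_circles(n_samples=300, factor=0.5, noise=0.05,
                        random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_c)
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        random_state=0).fit_predict(X_c)

# K-means cuts the rings in half; spectral clustering recovers them.
print("K-means ARI: ", adjusted_rand_score(y_c, km))   # close to 0
print("Spectral ARI:", adjusted_rand_score(y_c, sc))   # close to 1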

Implementing Spectral Clustering Using Python

Importing the Libraries

from sklearn.cluster import SpectralClustering
import numpy as np

Creating the Data

X = np.array([[1, 1], [2, 1], [1, 0],
              [4, 7], [3, 5], [3, 6]])

Note that in this example, we have used low-dimensional data. If you have higher-dimensional data, you can apply Principal Component Analysis (PCA) first to reduce its dimensionality, as sketched below.
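A hedged sketch of that PCA step might look like this; X_high and the 50-component target are hypothetical.

# Hypothetical PCA step for high-dimensional data; X_high and the
# 50-component target are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

X_high = np.random.RandomState(0).rand(200, 1000)  # hypothetical wide data
X_reduced = PCA(n_components=50, random_state=0).fit_transform(X_high)
print(X_reduced.shape)  # (200, 50); cluster this instead of X_high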

Initializing our Model

model = SpectralClustering(n_clusters=2,
                           assign_labels='discretize',
                           random_state=0).fit(X)

Get labels of each data point

print(model.labels_)

Output

[1 1 1 0 0 0]
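As a variant, the default affinity='rbf' can be swapped for a nearest-neighbors graph, which often suits manifold-shaped data better; the parameter values below are assumptions, not part of the original example.

# Variant sketch: build the similarity graph from nearest neighbors
# instead of the default RBF kernel. n_neighbors=3 and
# assign_labels='kmeans' are illustrative choices.
model_knn = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                               n_neighbors=3, assign_labels='kmeans',
                               random_state=0).fit(X)
print(model_knn.labels_)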

Advantages of Spectral Clustering

  • Spectral Clustering does not assume the shape of the data and performs well on many kinds of data distributions, whereas classical algorithms like K-means assume roughly spherical (globular) clusters.
  • It works pretty well when relations are roughly transitive (like similarity).
  • We don’t need the entire data set to cluster; a similarity/distance matrix, or perhaps just the Laplacian, will suffice, as the sketch below shows.
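For example, SpectralClustering accepts a precomputed affinity matrix directly; the hand-built 4 x 4 similarity matrix below is purely illustrative.

# Sketch of clustering from a precomputed similarity matrix alone;
# the hand-built 4 x 4 matrix is purely illustrative.
import numpy as np
from sklearn.cluster import SpectralClustering

S = np.array([[1.0, 0.9, 0.1, 0.0],   # points 0 and 1 are similar
              [0.9, 1.0, 0.0, 0.1],
              [0.1, 0.0, 1.0, 0.8],   # points 2 and 3 are similar
              [0.0, 0.1, 0.8, 1.0]])

labels = SpectralClustering(n_clusters=2, affinity='precomputed',
                            random_state=0).fit_predict(S)
print(labels)  # e.g. [0 0 1 1]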

Disadvantages of Spectral Clustering

  • Computing eigenvectors is the bottleneck; therefore, it’s expensive for really large datasets.
  • Does not work well with noisy datasets.
  • The number of clusters (K) needs to be decided beforehand.

Use Cases of Spectral Clustering

  • Image Segmentation
  • Customer Segmentation
  • Entity Resolution
  • Clustering of Protein Sequences

Conclusion

We saw how we can use spectral clustering to cluster our data points: we first represent the data points as a graph, project them into a lower-dimensional space, and then apply a traditional clustering technique to the reduced data. We also saw how easily this complex algorithm can be implemented in Python using a few lines of code.

About the author

Simran Kaur

Simran works as a technical writer. A graduate with an MS in Computer Science from the well-known CS hub of Silicon Valley, she is also an editor of the website. She enjoys writing about any tech topic, including programming, algorithms, cloud, data science, and AI. Travelling, sketching, and gardening are the hobbies that interest her.