Data Science

# K-Means Clustering

K-Means clustering is an unsupervised machine learning algorithm. If we compare the K-Means unsupervised clustering algorithm with the supervised algorithm, it is not required to train the model with the labeled data. K-Means algorithm is used to classify or group different objects based on their attributes or features into a K number of groups. Here, K is an integer number. The K-Means calculates the distance (using the distance formula) and then finds the minimum distance between the data points and the centroid cluster to classify the data.

Let’s understand the K-Means using the small example using the 4 objects, and each object has 2 attributes.

ObjectsName Attribute_X Attribute_Y
M1 1 1
M2 2 1
M3 4 3
M4 5 4

## K-Means to solve Numerical Example:

To solve the above numerical problem through K-Means, we have to follow the following steps:

The K-Means algorithm is very simple. First, we have to choose any random number of K and then choose the centroids or center of the clusters. To choose the centroids, we can choose any random number of objects for the initialization (depends upon the value of K).

The K-Means algorithm basic steps are as follows:

1. Continues to run till no objects move from their centroids (stable).
2. We first choose some centroids randomly.
3. Then, we determine the distance between each object and centroids.
4. Grouping the objects based on the minimum distance.

So, every object has two points as X and Y, and they represent on the graph space as following: So we initially choose the value of K=2 as random to solve our above problem.

Step 1: Initially, we choose the first two objects (1, 1) and (2, 1) as our centroids. The below graph is showing the same. We call these centroids C1 (1, 1) and C2 (2,1). Here, we can say C1 is group_1 and C2 is group_2. Step 2: Now, we will calculate each object data point to centroids using the Euclidean distance formula.

To calculate the distance, we use the following formula. We calculate the distance from objects to centroids, as shown in the below image. So, we calculated every object data point distance through the above distance method, finally got the distance matrix as given below:

## DM_0 =

 0 1 3.61 5 C1 = (1,1) cluster1 group_1 1 0 2.83 4.24 C2 = (2,1) cluster2 group_2
A B C D
1 2 4 5   X
1 1 3 4   Y

Now, we calculated each object’s distance value for each centroid. For example, the object points (1,1) have a distance value to c1 is 0 and c2 is 1.

As, from the above distance matrix, we find out that the object (1, 1) has a distance to cluster1 (c1) is 0 and to cluster2 (c2) is 1. So the object one is close to cluster1 itself.

Similarly, if we check the object (4, 3), the distance to cluster1 is 3.61 and to cluster2 is 2.83. So, the object (4, 3) will shift to cluster2.

Similarly, if you check for the object (2, 1), the distance to cluster1 is 1 and to cluster2 is 0. So, this object will shift to cluster2.

Now, according to their distance value, we group the points (object clustering).

## G_0 =

A B C D
1 0 0 0   group_1
0 1 1 1   group_2

Now, according to their distance value, we group the points (object clustering).

And finally, the graph will look like below after doing the clustering (G_0). Iteration_1: Now, we will calculate new centroids as initial groups changed because of the distance formula as shown in the G_0. So, the group_1 has only one object, so its value is still c1 (1,1), but the group_2 has 3 objects, so its new centroid value is So, new c1 (1,1) and c2 (3.66, 2.66)

Now, we again have to calculate all the distance to new centroids as we calculated before.

## DM_1 =

 0 1 3.61 5 C1 = (1,1) cluster1 group_1 3.14 2.36 0.47 1.89 C2 = (3.66,2.66) cluster2 group_2
A B C D
1 2 4 5   X
1 1 3 4   Y

Iteration_1 (Object clustering): Now, on behalf of the new distance matrix (DM_1) calculation, we cluster it according to that. So, we shift the M2 object from group_2 to group_1 as the rule of minimum distance to centroids, and the rest of the object will be the same. So new clustering will be as below.

## G_1 =

A B C D
1 1 0 0   group_1
0 0 1 1   group_2

Now, we have to calculate the new centroids again, as both the objects have two values.

So, new centroids will be So, after we get the new centroids, the clustering will look like below:

c1 = (1.5, 1)

c2 = (4.5, 3.5) Iteration_2: We repeat the step where we calculate the new distance of each object to new calculated centroids. So, after the calculation, we will get the following distance matrix for iteration_2.

## DM_2 =

 0.5 0.5 3.2 4.61 C1 = (1.5, 1) cluster1 group_1 4.3 3.54 0.71 0.71 C2 = (4.5, 3.5) cluster2 group_2

A          B               C            D

A B C D
1 2 4 5   X
1 1 3 4   Y

Again, we do the clustering assignments based on the minimum distance as we did before. So after doing that, we got the clustering matrix which is the same as G_1.

## G_2 =

 A B C D 1 1 0 0 group_1 0 0 1 1 group_2

As here, G_2 == G_1, so no further iteration is required, and we can stop here.

## K-Means Implementation using Python:

Now, we are going to implement the K-means algorithm in python. To implement the K-means, we are going to use the famous Iris dataset, which is open-source. This dataset has three different classes. This dataset has basically four features: Sepal length, sepal width, petal length, and petal width. The last column will tell the name of the class of that row like setosa.

The dataset looks like the below: For the python k-means implementation, we need to import the required libraries. So we import Pandas, Numpy, Matplotlib, and also KMeans from sklearn.clutser as given below: We are reading the Iris.csv dataset using the read_csv panda’s method and will display the top 10 results using the head method. Now, we are reading only those features of the dataset which we required to train the model. So we are reading all the four features of the datasets (sepal length, sepal width, petal length, petal width). For that, we passed the four index values [0, 1, 2, 3] into the iloc function of the panda’s data frame (df) as shown below: Now, we choose the number of clusters randomly (K=5). We create the object of the K-means class and then fit our x dataset into that for training and prediction as shown below: Now, we are going to visualize our model with the random K=5 value. We can clearly see five clusters, but it looks like it is not accurate, as shown below. So, our next step is to find out either the number of the clusters was accurate or not. And for that, we use the Elbow method. The Elbow method is used to find out the optimal number of the cluster for a particular dataset. This method will be used to find out whether the value of k=5 was correct or not as we are not getting clear clustering. So after that, we go to the following graph, which shows the value of K=5 is not correct because the optimal value falls between 3 or 4. Now, we are going to run the above code again with the number of clusters K=4 as shown below: Now, we are going to visualize the above K=4 new build clustering. The below screen shows that now the clustering is done through the k-means. ## Conclusion

So, we studied the K-means algorithm in both numerical and python code. We have also seen how we can find out the number of clusters for a particular dataset. Sometimes, the Elbow method cannot give the correct number of clusters, so in that case, there are several methods that we can choose.

The code for this blog, along with the dataset, is available at the following link https://github.com/shekharpandey89/k-means 