
KNN in Sklearn

β€œAn assumption is made by the KNN algorithm that new data and existing data are comparable. It makes use of nearest neighbors metrics for this. This article will discuss the KNN algorithm and sklearn’s implementation.”

What is KNN?

KNN is a supervised machine learning model, so it works with labeled data. To categorize a new data point, KNN looks at the classes of its "k" nearest neighbors, with proximity typically measured by Euclidean distance. The Euclidean distance between two points (a, b) and (x, y) is √((a − x)² + (b − y)²).
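The distance formula above can be sketched directly in Python (the point coordinates here are made up for illustration):

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two 2-D points p = (a, b) and q = (x, y)."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# The classic 3-4-5 right triangle: distance from (0, 0) to (3, 4) is 5
print(euclidean_distance((0, 0), (3, 4)))  # → 5.0
```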

Where is KNN Used?

KNN can be used to tackle a variety of problems. In classification, a new point is categorized simply by looking at the classes of its closest neighbors. KNN can also find the documents most similar to a given document, which is useful for checking for plagiarism, discovering mirror sites, and so on. In recommender systems, KNN can identify products most similar to one a user has already rated and predict whether the user will enjoy them. The same nearest-neighbor idea appears in many other applications, including clustering methods.
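The document-similarity use case can be sketched with sklearn's NearestNeighbors class; the "document" vectors below are made up for illustration (in a real system they might be TF-IDF features):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy document vectors: rows 0 and 1 are deliberately near-identical
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Ask for the 2 nearest neighbors of the first document
nn = NearestNeighbors(n_neighbors=2).fit(docs)
distances, indices = nn.kneighbors([docs[0]])

# The query document matches itself first, then its near-duplicate
print(indices)  # → [[0 1]]
```

A plagiarism checker would flag pairs whose distance falls below some threshold.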

Pros and Cons of KNN

 

Pros

    • A simple algorithm that relies only on a distance function (Euclidean, as discussed here) and the value of k, usually chosen as an odd integer to avoid ties.
    • Effective for small datasets.
    • Uses "lazy learning": the training dataset is simply stored and consulted at prediction time, so training is much faster than for models such as Support Vector Machines (SVMs) or Linear Regression.

Cons

    • Prediction on large datasets takes longer, since distances to all stored points must be computed.
    • Requires feature scaling; failing to scale the features leads to inaccurate predictions.
    • Sensitive to noisy data, which can cause over- or under-fitting.
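The feature-scaling concern can be addressed by scaling before fitting. A minimal sketch using sklearn's StandardScaler in a pipeline (the dataset parameters here are chosen only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Scaling first keeps any one large-valued feature from dominating the
# Euclidean distance computed by KNN
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X, y)

acc = model.score(X, y)  # training accuracy
print(acc)
```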

Implementing KNN in Sklearn

Importing the required methods and classes

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

 
Creating the dataset

X, y = make_classification(n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False)
print('Features are')
print(X)
print('Labels are')
print(y)

 
Output

Features are

array([[ 0.44229321,  0.08089276,  0.54077359, -1.81807763],
       [ 1.34699113,  1.48361993, -0.04932407,  0.2390336 ],
       [-0.54639809, -1.1629494 , -1.00033035,  1.67398571],
       ...,
       [ 0.8903941 ,  1.08980087, -1.53292105, -1.71197016],
       [ 0.73135482,  1.25041511,  0.04613506, -0.95837448],
       [ 0.26852399,  1.70213738, -0.08081161, -0.70385904]])

 
Labels are

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

 
Creating the model and making predictions

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)
print(model.predict([[0.5, 0.3, 0.2, 0.1]]))
print(model.predict_proba([[0.5, 0.3, 0.2, 0.1]]))

 
Output

[0]
[[0.8 0.2]]
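The model above was fit and queried on the full dataset. To estimate how well it generalizes, one common approach is a held-out test split; a sketch using sklearn's train_test_split (the 25% test size is an arbitrary choice for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same dataset parameters as above
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0, shuffle=False)

# Hold out 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

acc = accuracy_score(y_test, model.predict(X_test))
print(acc)
```

Trying several values of k on the held-out set is a simple way to pick a good neighbor count.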

 

Conclusion

We discussed the KNN algorithm, a supervised machine learning algorithm, and saw where it works well and where it may fail. We then walked through its implementation in sklearn.

About the author

Simran Kaur

Simran works as a technical writer. She holds an MS in Computer Science from a well-known institution in Silicon Valley and is also an editor of the website. She enjoys writing about any tech topic, including programming, algorithms, cloud, data science, and AI. Her hobbies include travelling, sketching, and gardening.