The KNN algorithm assumes that new data points are comparable to existing ones: it classifies a new point by consulting its nearest neighbors. This article discusses the KNN algorithm and its implementation in sklearn.
What is KNN?
KNN is a supervised machine learning model, so it uses labeled data. The algorithm categorizes a new data point according to the classes of its nearest “k” neighbors, measured by Euclidean distance. The Euclidean distance between two points (a, b) and (x, y) is √((a − x)² + (b − y)²).
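As a quick illustration, here is a minimal sketch of that distance calculation in plain Python (the function name is ours, not part of any library):

import math

def euclidean_distance(p, q):
    # Straight-line distance between two points of equal dimension
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Distance between (a, b) = (1, 2) and (x, y) = (4, 6)
print(euclidean_distance((1, 2), (4, 6)))  # 5.0, a 3-4-5 right triangle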
Where is KNN Used?
KNN can be used to tackle a variety of problems. In classification, a new point can be categorized simply by looking at the classes of its closest neighbors. KNN can also find the documents most similar to a given document, which is useful for checking for plagiarism, discovering mirror sites, and so on. In recommender systems, KNN can identify the products most similar to one a user hasn’t rated yet and use those neighbors to predict whether the user will enjoy it. There are numerous other applications, including clustering-style tasks.
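As a rough sketch of the document-similarity use case, sklearn’s NearestNeighbors can return the most similar items to a query. The toy vectors below are invented for illustration (in practice they might be TF-IDF rows):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy feature vectors standing in for documents
docs = np.array([[1.0, 0.0, 0.5],
                 [0.9, 0.1, 0.4],
                 [0.0, 1.0, 0.8]])

nn = NearestNeighbors(n_neighbors=2).fit(docs)
distances, indices = nn.kneighbors([[1.0, 0.0, 0.45]])
print(indices)    # positions of the two most similar "documents"
print(distances)  # their Euclidean distances from the query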
Pros and Cons of KNN
Pros
- A simple algorithm that requires only a distance function (Euclidean, as described above) and a value of k, typically an odd number to avoid ties.
- An approach that works well for small datasets.
- Uses “lazy learning”: there is no real training phase, since the training dataset is simply stored and consulted at prediction time. This makes training faster than for Support Vector Machines (SVMs) or Linear Regression.
Cons
- Prediction on large datasets takes longer, since every query must be compared against all stored training points.
- Requires feature scaling; without it, features with large ranges dominate the distance calculation and predictions become inaccurate (see the sketch after this list).
- Sensitive to noisy data, which can cause over- or under-fitting.
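Here is a minimal sketch of feature scaling with KNN, using sklearn’s StandardScaler in a pipeline; the toy data is invented for illustration:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Without scaling, the income column (range ~100000) would swamp
# the age column (range ~20) in the distance calculation.
X_train = [[25, 50000], [40, 52000], [30, 150000], [45, 148000]]  # [age, income]
y_train = [0, 0, 1, 1]

scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(X_train, y_train)
print(scaled_knn.predict([[35, 60000]]))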
Implementing KNN in Sklearn
Importing the required methods and classes
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
Creating the dataset
X, y = make_classification(n_samples=500, n_features=4)  # the original call was omitted from the article; these parameters are assumed from the output below
print('Features are')
print(X)
print('Labels are')
print(y)
Output
Features are
array([[ 1.34699113, 1.48361993, -0.04932407, 0.2390336 ],
[-0.54639809, -1.1629494 , -1.00033035, 1.67398571],
...,
[ 0.8903941 , 1.08980087, -1.53292105, -1.71197016],
[ 0.73135482, 1.25041511, 0.04613506, -0.95837448],
[ 0.26852399, 1.70213738, -0.08081161, -0.70385904]])
Labels are
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
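Each row of X holds four synthetic feature values for one sample, and y holds the matching binary class label for each row.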
Creating the model and making predictions
model = KNeighborsClassifier(n_neighbors=5)  # the instantiation was omitted from the article; k=5 (sklearn's default) is assumed here
model.fit(X, y)
print(model.predict([[0.5, 0.3, 0.2, 0.1]]))
print(model.predict_proba([[0.5, 0.3, 0.2, 0.1]]))
Output
[0]
[[0.8 0.2]]
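Assuming k = 5 as above, a probability of [0.8 0.2] means that 4 of the new point’s 5 nearest neighbors belong to class 0, which is why predict returns class 0.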
Conclusion
We discussed the KNN algorithm, a supervised machine learning algorithm, and saw where it shines and where it may fall short. We then walked through its implementation in Python with sklearn.