KNN (K-Nearest Neighbors) Classifier from Scratch

Amy Beisel
5 min read · Aug 24, 2020

Figure 1. A visual depiction of classification using the k-Nearest Neighbors algorithm

How does KNN classifier work?

The KNN classifier works on a very simple principle. Let's walk through it briefly using Figure 1. We have a dataset with two labels, Class A and Class B: Class A corresponds to the yellow points and Class B to the purple points. To make a prediction, the algorithm compares the input (the red star) to the existing data and measures how similar it is to each point. What happens when we take k = 3 nearest neighbors? What about k = 6? With k = 3, two of the nearest points belong to the purple class and one belongs to the yellow class. The majority vote is purple, so the predicted output is Class B. But with k = 6, four of the nearest points belong to the yellow class and two belong to the purple class. Now the majority vote is yellow, so the predicted label is Class A. Easy enough, right?!
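If you want to see just the voting step in isolation, here is a minimal sketch (the neighbor labels are made up purely for illustration) of how a majority vote can be taken with Python's Counter:

from collections import Counter

# hypothetical labels of the k nearest neighbors for k = 3 and k = 6
neighbors_k3 = ["purple", "purple", "yellow"]
neighbors_k6 = ["yellow", "yellow", "yellow", "yellow", "purple", "purple"]

# most_common(1) returns a list with the (label, count) pair that occurs most often
print(Counter(neighbors_k3).most_common(1)[0][0])   # purple -> Class B
print(Counter(neighbors_k6).most_common(1)[0][0])   # yellow -> Class A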

Maybe these real-world examples will help:

  • Marketing: imagine you purchase a baby stroller on Amazon, and the next day you receive an email suggesting a set of pacifiers you might be interested in. Amazon can target you with similar products, or with products bought by other customers whose buying habits are similar to yours.
  • Content recommendation: perhaps the most interesting example is Spotify and its acclaimed recommendation engine. KNN is used in many content recommendation engines; more powerful systems are now available (such as auto-encoders), but KNN illustrates one way such a recommendation system could be implemented.

Diving into the Code

In order to calculate what we did above, we need a distance function. There are many distance functions to choose from. In this case, we will be testing our from-scratch algorithm on the famous Iris dataset from scikit-learn, so we will use the Euclidean distance. It is calculated as follows:

d(x1, x2) = sqrt( Σ (x1_i − x2_i)² ), summing over i = 1, …, n

Take the square root of the sum over the n dimensions (features): for each dimension, we compute the squared difference between the two points' components. That's it! This is all we need to know in order to implement KNN.
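For example, with two made-up points (1, 2) and (4, 6), the distance is sqrt((1 − 4)² + (2 − 6)²) = sqrt(9 + 16) = 5. The same check in NumPy:

import numpy as np
print(np.sqrt(np.sum((np.array([1, 2]) - np.array([4, 6]))**2)))   # 5.0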

The best way to understand a model is to build one from scratch. All the following methods are defined in a k_nearest_neighbors class. Let’s get started.

1. Instantiate the class

We will only use the NumPy library and Python's built-in Counter (from the collections module) for the arithmetic and the majority vote.

import numpy as np
from collections import Counter

# Euclidean distance
def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2)**2))

class k_nearest_neighbors:
    def __init__(self, k):
        self.k = k

The euclidean_distance function lives outside our class. We could treat it as a global helper and put it in a separate file; in this case, we simply define it above the class and use it later. The class has an __init__ method that takes k, the number of nearest neighbors we want to consider, and stores it on the instance.
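As a quick sanity check, instantiating the class simply stores k on the object:

clf = k_nearest_neighbors(k=3)
print(clf.k)   # 3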

2. Fit the model

Next, we will fit the training samples and the training labels. Before we go on, let's take a look at the data. For this, we use the famous Iris dataset from the scikit-learn module. What are X_train and y_train? We generate training and test samples, along with training and test labels, using train_test_split. The data attribute gives us the features (X) and the target attribute gives us the labels (y).

from sklearn import datasets
from sklearn.model_selection import train_test_split
iris_data = datasets.load_iris()
data = iris_data.data
target = iris_data.target
# Train/Test splits
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3)
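Note that train_test_split shuffles the data, so the exact split (and the accuracy numbers later on) can vary from run to run. Passing a random_state makes the split reproducible, for example:

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3, random_state=42)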

Take a look at the iris dataset.

print(X_train.shape)
print(X_train[0])
print(y_train.shape)
print(y_train)
(105, 4)
[6.5 3. 5.2 2. ]
(105,)
[2 1 1 0 0 1 2 1 1 1 2 0 0 2 1 2 0 0 0 1 2 0 1 0 1 0 0 2 0 2 0 2 2 0 2 1 1 1 1 2 0 2 0 1 2 0 1 0 1 0 2 0 0 2 2 0 0 2 0 2 2 2 0 0 1 1 0 1 0 0 2 0 2 0 2 1 2 0 0 1 2 0 1 1 0 1 2 0 0 0 2 1 1 2 2 1 1 2 1 1 1 1 0 1 2]

In the knn_fit method, which is our training step, all we need to do is store the training samples and labels on self so we can use them later.

def knn_fit(self, X_train, y_train):
    self.X_train = X_train
    self.y_train = y_train

3. Implement a predict method

def knn_predict(self, X):
    predicted_labels = [self._predict(x) for x in X]
    return np.array(predicted_labels)

# helper method
def _predict(self, x):
    # compute distances from x to every training sample
    distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
    # get the indices and labels of the k nearest samples
    k_indices = np.argsort(distances)[:self.k]
    k_nearest_labels = [self.y_train[i] for i in k_indices]
    # majority vote: the most common class label wins
    majority_vote = Counter(k_nearest_labels).most_common(1)
    return majority_vote[0][0]

This is where we use the Euclidean distance and the samples we stored in knn_fit. The knn_predict(self, X) method accepts multiple samples, so we can pass in all of our test samples at once. The _predict helper handles a single new sample: it calculates the distances from that sample to every training sample, sorts the distances to get the k nearest samples and their labels, and then returns the most common class label, the majority vote.
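To make the helper method concrete, here is a small sketch of the same steps on a toy dataset (the points and labels below are made up purely for illustration):

import numpy as np
from collections import Counter

X_train_toy = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train_toy = np.array([0, 0, 1, 1])
x_new = np.array([1.1, 0.9])
k = 3

# distances from the new point to every training point
distances = [np.sqrt(np.sum((x_new - x_t)**2)) for x_t in X_train_toy]
# indices and labels of the k closest points
k_indices = np.argsort(distances)[:k]
k_nearest_labels = [int(y_train_toy[i]) for i in k_indices]
print(k_nearest_labels)                                 # [0, 0, 1]
print(Counter(k_nearest_labels).most_common(1)[0][0])   # majority vote: 0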

There you have it: the full implementation of the KNN algorithm. Let's try it out and see how our model performs compared to the KNN from scikit-learn using the Iris data.

Iris Flowers

Scikit-learn KNN Classifier

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

model_sklearn = KNeighborsClassifier(n_neighbors=3)
model_sklearn.fit(X_train, y_train)
predict = model_sklearn.predict(X_test)
print("Scikit-learn KNN classifier accuracy:{0:.3f}".format(accuracy_score(y_test, predict)))

Scikit-learn KNN classifier accuracy:0.978

Our k_nearest_neighbors classifier from scratch

import numpy as np
from KNN import k_nearest_neighbors
my_model = k_nearest_neighbors(k = 3)
my_model.knn_fit(X_train, y_train)
predictions = my_model.knn_predict(X_test)
acc = np.sum(predictions == y_test) / len(y_test)
print("My model from scratch KNN classifier accuracy:{0:.3}".format(acc))

My model from scratch KNN classifier accuracy:0.978

The accuracy of the two models is the same, meaning we successfully implemented a k-nearest-neighbors classifier from scratch.
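If you want to take the comparison a step further, a simple loop like the sketch below (reusing the same X_train/X_test split and imports from above) evaluates both models across several values of k:

for k in [1, 3, 5, 7, 9]:
    # our from-scratch model
    my_model = k_nearest_neighbors(k=k)
    my_model.knn_fit(X_train, y_train)
    my_acc = np.sum(my_model.knn_predict(X_test) == y_test) / len(y_test)

    # scikit-learn's model with the same number of neighbors
    sk_model = KNeighborsClassifier(n_neighbors=k)
    sk_model.fit(X_train, y_train)
    sk_acc = accuracy_score(y_test, sk_model.predict(X_test))

    print("k={}: from scratch {:.3f}, scikit-learn {:.3f}".format(k, my_acc, sk_acc))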

You can find the entire code and notebook with the comparison of the algorithms here.

References

sklearn.datasets.load_iris

scikit-learn Nearest Neighbors

sklearn.neighbors.KNeighborsClassifier

Euclidean Distance

Machine Learning Mastery

Let’s make a KNN Classifier from Scratch
