The K-nearest neighbors (KNN) classifier, or simply the nearest neighbor classifier, is a supervised machine learning algorithm that operates on spatial distance measurements. In this post, we investigate the theory behind it.

## Introduction

The K-nearest neighbors (KNN) classifier, or simply the nearest neighbor classifier, is a supervised machine learning algorithm. It is remarkably simple to implement, and yet performs an excellent job for basic classification tasks such as economic forecasting. It does not have a dedicated training phase; instead, it consults all the data while classifying a query data point. Consequently, K-nearest neighbor makes no assumptions about the underlying data. This characteristic suits the majority of real-world datasets, whose underlying patterns rarely follow a simple parametric form.

## What is the philosophy behind it?

The intuition behind the K-nearest neighbor classifier is very simple: the algorithm classifies a new data point based on its proximity to the different classes. It calculates the distance between the **query data point** (the unlabeled data point that is supposed to be classified) and its **K** nearest labeled data points. Ultimately, it assigns the query data point to the class label held by the majority of those **K** data points.

**Fig. 1** demonstrates the algorithm in practice. The test sample (red circle) is supposed to be classified as one of the green, blue, or yellow classes. With K = 3, the three closest points determine the classification outcome: the majority of them belong to the yellow class, so the algorithm assigns yellow (yellow star) as the test sample's class. With K = 1, the green circle is the nearest neighbor, and the test sample would be classified as green instead.
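The effect of K described above can be sketched in plain Python. The coordinates and labels below are made up for this illustration (they are not taken from Fig. 1): the point set is arranged so that the single nearest neighbor is green, while the majority of the three nearest neighbors is yellow.

```python
from collections import Counter
import math

# Hypothetical labeled points (coordinates and labels are assumptions
# for this sketch, chosen so K = 1 and K = 3 disagree).
points = [((1.5, 1.4), "green"),
          ((2.5, 2.0), "yellow"),
          ((2.6, 2.4), "yellow"),
          ((4.0, 4.0), "blue")]
query = (1.8, 1.6)  # the unlabeled test sample

def knn_label(query, points, k):
    # Rank labeled points by Euclidean distance to the query.
    ranked = sorted(points, key=lambda p: math.dist(query, p[0]))
    # Majority vote among the K closest labels.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_label(query, points, 1))  # green: the single nearest point wins
print(knn_label(query, points, 3))  # yellow: two of the three nearest are yellow
```

Changing K alone flips the predicted class, which is exactly the sensitivity Fig. 1 illustrates.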

## Benefits and limitations

Let’s focus on the benefits first:

- It is **simple to implement**, as we need only two things: the parameter K and the distance metric. Commonly used distance metrics are Euclidean distance (for continuous variables) and Hamming distance (for discrete variables).
- As there is *no training phase*, the evaluation is simple and fast. Note, however, that this excludes the computation required at query time to identify the **K** nearest points. The algorithm also *does not need any prior knowledge* regarding the data distribution.
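The two distance metrics mentioned above can be sketched as follows. This is a minimal stdlib-only implementation; the example feature vectors are made up for illustration.

```python
import math

def euclidean(a, b):
    # Euclidean distance: for continuous feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    # Hamming distance: count of positions where discrete features differ.
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
print(hamming(("red", "S", "cotton"), ("red", "M", "wool")))  # 2
```

Which metric is appropriate depends on the feature types: Euclidean distance assumes numeric coordinates, while Hamming distance only asks whether categorical features match.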

The algorithm also has the following drawbacks:

- The basic “majority voting” approach becomes misleading when there is a *skewed* **class distribution**. In other words, the proximity criterion cannot adequately capture the data distribution (**Fig. 2**): if the samples of one class are very frequent in the data, that class dominates the vote and K-nearest neighbor fails to observe any meaningful pattern.
- It is **computationally expensive**, since classifying each query requires computing its distance to every labeled point.
- K-nearest neighbor often fails when **encountering high-dimensional data**, as simple distance metrics struggle to distinguish points in high-dimensional spaces.

## The Algorithm in a Nutshell

To briefly summarize, the algorithm proceeds as follows:

- An unseen data point (the test sample) arrives, and the algorithm wants to classify it.
- The algorithm calculates the spatial distance between the unseen data point and all the labeled data points.
- The algorithm identifies the K nearest data samples.
- Considering those K data points and their associated categories, the category with the highest frequency is assigned to the unseen sample as its class.
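The four steps above can be written as one short function. This is a minimal sketch in plain Python; the toy labeled set and its two classes are assumptions for the usage example, not real data.

```python
from collections import Counter
import math

def classify(test_point, labeled_data, k):
    """Classify one unseen point with the steps listed above."""
    # Step 2: spatial distance from the unseen point to every labeled point.
    distances = [(math.dist(test_point, x), label) for x, label in labeled_data]
    # Step 3: keep the K nearest samples.
    nearest = sorted(distances)[:k]
    # Step 4: assign the most frequent category among the K neighbors.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy labeled set with made-up coordinates (two classes, "A" and "B").
data = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
        ((5, 5), "B"), ((5, 6), "B")]
print(classify((1, 1), data, 3))  # "A": all three nearest neighbors are class A
```

Note that there is no fitting step: the whole labeled set is kept and scanned at query time, which is exactly the “no training phase” property, and also the source of the computational cost discussed above.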

### Read More

To gain a better understanding of the Nearest Neighbor algorithm, please refer to the following resources:

- *k*-nearest neighbors algorithm by Wikipedia
- K Nearest Neighbors Classifier – A Medium.com post
- Machine Learning Basics with the K-Nearest Neighbors Algorithm

## Summary

In this post, we investigated the theory behind the K-nearest neighbor algorithm for classification. We reviewed its pros and cons and described how it works in practice. In future posts, we will demonstrate how to implement it, which will bring more insight into the algorithm's details.