The K-nearest neighbors (KNN) classifier, or simply Nearest Neighbor Classifier, is a supervised machine learning algorithm that operates on spatial distance measurements. In this post, we investigate the theory behind it.

Introduction

The K-nearest neighbors (KNN) classifier, or simply Nearest Neighbor Classifier, is a kind of supervised machine learning algorithm. K-Nearest Neighbor is remarkably simple to implement, and yet does an excellent job on basic classification tasks such as economic forecasting. It does not have a specific training phase; instead, it examines all the data at the moment it classifies a query data point. Hence, K-Nearest Neighbor makes no assumption about the underlying data, which is convenient because most real-world datasets do not follow a neat theoretical distribution.

What is the philosophy behind it?

The intuition behind the K Nearest Neighbor Classifier algorithm is very simple: the algorithm classifies a new data point based on its proximity to the existing classes. It calculates the distance between the query data point (the unlabeled data point that is to be classified) and the labeled data points, and selects the K nearest ones. Ultimately, it assigns the query data point to the class label held by the majority of those K data points.
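
To make this concrete, here is a minimal sketch using scikit-learn's KNeighborsClassifier on a handful of made-up 2-D points; the coordinates and labels are illustrative assumptions, not data from this post.

```python
# A minimal illustration of majority-vote classification with scikit-learn.
# The toy points and labels below are made up purely for demonstration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# A few labeled 2-D training points: class 0 clusters near the origin,
# class 1 clusters near (5, 5).
X_train = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
                    [5.0, 5.0], [5.5, 4.5], [4.5, 5.5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# K = 3: the query is assigned the label held by the majority of
# its three nearest labeled points.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

query = np.array([[0.8, 0.7]])   # unlabeled query point
print(knn.predict(query))        # -> [0]: its three nearest neighbors are class 0
```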

Fig. 1 demonstrates the algorithm in practice. The test sample (red circle) is supposed to be classified as one of the green, blue, or yellow classes. Considering K = 3, the three closest points determine the classification outcome; since the majority of these neighbors belong to the yellow class (yellow star), the algorithm assigns yellow as the test sample's class. Considering K = 1, the green circle is the nearest neighbor, so the test sample would be labeled green instead.

Figure 1: Visualization of an example of k-NN classification.

Benefits and limitations

Let’s focus on the benefits first:

  1. It is simple to implement, as we need only two things: the parameter K and a distance metric. Commonly used distance metrics are the Euclidean distance (for continuous variables) and the Hamming distance (for discrete variables); a short sketch of both appears after this list.
  2. As there is no training phase, setting up the model is simple and fast. However, this does not account for the computation required at query time to identify the K nearest points.
  3. The algorithm does not need any prior knowledge regarding the data distribution.
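
As mentioned in point 1, here is a minimal sketch of the two metrics; the function names and sample values are my own, chosen only for illustration.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length sequences of numbers
    (suitable for continuous features)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming_distance(a, b):
    """Hamming distance: the number of positions at which two
    equal-length sequences of discrete values differ."""
    return sum(x != y for x, y in zip(a, b))

print(euclidean_distance([0, 0], [3, 4]))                   # 5.0
print(hamming_distance(['a', 'b', 'c'], ['a', 'x', 'c']))   # 1
```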

There are drawbacks in the algorithm as follows:

  1. The basic “majority voting” approach becomes misleading when the class distribution is skewed. In other words, the proximity criterion alone cannot adequately capture the data distribution (Fig. 2). To elaborate, if samples of one class are far more frequent in the data, that class tends to dominate the neighborhood of any query point, and K-Nearest Neighbor fails to capture any meaningful pattern.
  2. It is computationally expensive to calculate the distance between an unseen sample and all the labeled points in order to find the K nearest ones. Essentially, the number of distance computations for each test sample equals the number of training samples (see the sketch after Fig. 2).
  3. K-Nearest Neighbor often fails on high-dimensional data because, with a simple distance metric, it is hard to distinguish points in high-dimensional space.
Figure 2: Visualization of a skewed data distribution, where k-Nearest Neighbor has trouble. The number of samples belonging to the blue class is much larger than that of the other class.
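
To make the cost described in drawback 2 concrete, the following sketch performs a brute-force neighbor search with NumPy: each query point requires a distance to every training sample. The array sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_features = 5_000, 100, 10   # illustrative sizes, not from the post
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))

# Brute-force neighbor search: every query point needs a distance to each
# of the n_train training samples, i.e. a full (n_test, n_train) matrix.
diff = X_test[:, None, :] - X_train[None, :, :]
dists = np.sqrt((diff ** 2).sum(axis=-1))        # shape: (n_test, n_train)

k = 5
nearest = np.argsort(dists, axis=1)[:, :k]       # indices of the k nearest training points
print(dists.shape, nearest.shape)                # (100, 5000) (100, 5)
```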

The algorithm in a Nutshell

To briefly summarize, the algorithm works as follows (a code sketch follows the list):

  • An unseen data point (test sample) arrives and the algorithm needs to classify it.
  • The algorithm calculates the spatial distance between the unseen data point and all the labeled data points.
  • The algorithm identifies the K nearest data samples.
  • Considering the K data points with their associated categories, the category with the highest frequency is assigned to the unseen sample as its class.
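
The steps above translate almost line for line into a small from-scratch sketch; the function and variable names are mine, and this is not meant as an optimized implementation.

```python
from collections import Counter
import math

def knn_classify(query, X_train, y_train, k=3):
    """Classify a single query point by majority vote among its k nearest
    labeled training points, following the steps listed above."""
    # 1. Compute the distance from the query to every labeled point.
    distances = [(math.dist(query, x), label) for x, label in zip(X_train, y_train)]
    # 2. Identify the k nearest samples.
    k_nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # 3. Return the most frequent class among those k neighbors.
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

# Tiny usage example with made-up 2-D points.
X_train = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0), (5.0, 5.0), (5.5, 4.5), (4.5, 5.5)]
y_train = ["green", "green", "green", "yellow", "yellow", "yellow"]
print(knn_classify((0.8, 0.7), X_train, y_train, k=3))   # -> "green"
```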

Summary

In this post, we have investigated the theory behind the K Nearest Neighbor algorithm for classification. We examined its pros and cons and described how it works in practice. In future posts, we will demonstrate how to implement it, which will bring more insight into the algorithm's details.
