The K-nearest neighbors (KNNs) classifier or simply Nearest Neighbor Classifier is a kind of supervised machine learning algorithms. K-Nearest Neighbor is remarkably simple to implement, and yet performs an excellent job for basic classification tasks such as economic forecasting. It doesn’t have a specific training phase. Instead, it observes all the data while classifying a query data point. Henceforth, K-Nearest Neighbor does not have any assumption about the underlying data. This characteristic aligns with the underlying pattern of the majority of real-world datasets.
What is the philosophy behind it?
The presentiment behind the K Nearest Neighbor Classifier algorithm is very simple: The algorithm classifies the new data point based on its proximity to different classes. The algorithm calculates the distance between the query data point (the unlabeled data point that supposed to be classified) and its K nearest labeled data points. Ultimately, it assigns the query data point to the class label that the majority of the K data points have.
Fig.1 demonstrates the algorithm in practice. The test sample (green) supposed to be classified as one of the blue or red classes. Considering K = 3, the three closest points determine the classification outcome. As the majority vote on the red category, then the algorithm assigns red as the test sample class.
Benefits and limitations
Let’s focus on the benefits first:
- It is simple to implement as we need only two things: parameter K and the distance metric. Ordinarily used distance metrics are Euclidean distance (continuous variables) and Hamming distance (discrete variables).
- As there is no training phase, the evaluation is simple and fast. However, this is regardless of the computation that
leads to identifying the K nearest points . Thealgorithm does not need any prior knowledge regarding the data distribution.
There are drawbacks in the algorithm as follows:
- The basic “majority voting” approach becomes misleading when there is a skewed class distribution. In other words, the proximity criterion cannot adequately identify the data distribution (Fig. 2). To more elaborate on this if the number of one class samples is very frequent in the data, K-Nearest Neighbor fails to observe any meaningful pattern.
- It is computationally expensive to calculate the distance between the unseen sample and all the points to find the K nearest one. Basically, the minimum number of computations for each test sample equals the number of data samples.
- K-Nearest Neighbor fails often in case of encountering high-dimensional data as using the simple distance metric, it is hard to distinguish high-dimensional data.
The algorithm in a Nutshell
Just to briefly summarize the algorithm as follows:
- An unseen data (test data) arrives and the algorithm wants to classify that.
- The algorithm calculates the spatial distance between the unseen data and all the labeled data points.
- The algorithm identifies the K nearest data samples.
- Considering the K data points with their associated categories, the category with the highest frequency is assigned to the unseen sample as its class.
To gain a better understanding of the Nearest Neighbor algorithm, please refer to the following resources:
- k-nearest neighbors algorithm by Wikipedia
- K Nearest Neighbors Classifier – A Medium.com post
- Machine Learning Basics with the K-Nearest Neighbors Algorithm
In this post, we have investigated the theory behind the K Nearest Neighbor algorithm for classification. We observed its pros and cons and described how it works in practice. In future posts, we demonstrate how to implement it. This is important as it brings more insight regarding the algorithm details.