K- Nearest Neighbors or also known as K-NN belong to the family of supervised machine learning algorithms which means we use labeled (Target Variable) data set to predict the class of new data point. The K-NN algorithm is a robust classifier which is often used as a benchmark for more complex classifiers such as Artificial Neural Network (ANN) or Support vector machine (SVM). Despite being considered simple K-NN can outperform more powerful classifiers and can be used in varieties of applications such as economic forecasting or classifying the new data entries.
The K-NN algorithms are non-parametric (no assumptions) and instance based algorithms. In other words, we can say K-NN algorithms have no modeling equation rather for every new data entry it reads entire dataset to predict the class.
K-NN Algorithm: To classify the new data point K-NN algorithm reads through whole dataset to find out K nearest neighbors. Once K nearest neighbors is identified K-NN algorithm summarizes the output of those nearest neighbors. For regression this might be the mean of the outputs for K nearest neighbors and for classification it would be the mode of output.
To determine which of the nearest neighbor can influence the result distance measure is used. The most popular distance measure used is Euclidean distance. Euclidean distance is calculated as the square root of the sum of squared differences between new data point and existing data points.
Euclidean Distance = √((x1 – x2)² + (y1 – y2)²)
There are few more popular distance measures:
⦁ Hamming Distance: Calculate the distance between binary vectors. In case of categorical variables the Hamming distance must be used.
⦁ Manhattan Distance: Calculate the distance between real vectors using the sum of their absolute difference.
⦁ Minkowski Distance: Generalization of Euclidean and Manhattan distance.
K-NN can be used both for regression and classification problems.
K-NN Regression: When K-NN is used for regression the output is predicted based on mean or median of the K-nearest neighbors
K-NN Classification: When K-NN is used for classification the output can be classified into the category with highest numbers of votes or we can say Mode is used as measure to classify the output.
Curse of Dimensionality: K-NN works well with small number of input variables but as the numbers of variables grow K-NN algorithm struggles to predict the output of new data point. The reason for this problem is that when the dimensionality (Input Variable) increases, the volume space increases so fast that available data becomes sparse. This sparsely distributed data is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality.
Steps followed in K-NN algorithm:
Step1: Identify the number K which will be used to decide the nearest neighbors
Step2: Take the K nearest neighbors of the new data point as per the chosen distance measure
Step3: Among these K neighbors, count the number of data points in each category
Step4: Assign the new data point to the category where we counted the most neighbors
Best practices to prepare data for K-NN:
⦁ Standardized data: Incase input variables have different scales then standardize the data e.g. while predicting affordability of Loan, Age would vary from 20 to 60 but Income will vary in thousands. Since variance in Income is much more than Age, so income would have higher influence on the distance calculated using K-NN
⦁ Address Missing Data: Missing data will mean that the distance between samples cannot be calculated. These samples could be excluded or the missing values could be imputed
⦁ Low Dimension: K-NN algorithm works very well with less number of input variables