SMOTE: In most real-world classification problems, data tend to display some degree of imbalance, i.e. the classes do not make up equal proportions of the data. When the difference in proportion between classes is small, most machine learning and statistical algorithms work fine, but as this difference grows, most algorithms tend to predict the majority class.
For example, let us suppose we are trying to predict loan defaulters from historical data. The most commonly proposed algorithm would be logistic regression. There is nothing wrong with the algorithm itself, but it is a well-established fact that maximum likelihood estimation of the logistic model suffers from small-sample bias. The degree of bias grows as the number of events we want to predict gets smaller; i.e., if the number of "Defaulters" in the historical data is very small, the fitted logistic model will be biased towards predicting "Non-Defaulters".
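A quick arithmetic sketch of why imbalance is dangerous: with very few defaulters, a model that collapses into always predicting the majority class still looks excellent on accuracy alone. (The 1% defaulter rate below is an illustrative assumption, not a figure from the text.)

```python
# A classifier that always predicts "non-defaulter" scores very high
# accuracy on imbalanced data while being useless on the class we care about.
n_total = 10_000
n_defaulters = 100                      # assume 1% minority class
n_non_defaulters = n_total - n_defaulters

# Correct on every non-defaulter, wrong on every defaulter.
accuracy = n_non_defaulters / n_total
recall_on_defaulters = 0 / n_defaulters

print(f"accuracy = {accuracy:.2%}")                      # 99.00%
print(f"recall on defaulters = {recall_on_defaulters:.2%}")  # 0.00%
```

High accuracy here is entirely an artifact of the class proportions, which is why re-balancing techniques such as SMOTE are needed.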
Imbalanced Classification Problem and Commonly Proposed Solutions: In binary classification, strongly imbalanced classes often lead to unsatisfactory results when predicting new observations, especially for the small class. In this context, imbalanced simply means that the number of observations of one class (the majority class) far exceeds the number of observations of the other class (the minority class, which is usually the positive class of interest). The most commonly used techniques are:
- Under-sampling methods: Elimination of randomly chosen cases of the majority class to decrease their effect on the classifier. All cases of the minority class are kept.
- Over-sampling methods: Generation of additional cases (copies, artificial observations) of the minority class to increase their effect on the classifier. All cases of the majority class are kept.
- Hybrid methods: A mixture of under- and over-sampling strategies.
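The two basic strategies above can be sketched in a few lines. This is a minimal illustration with plain Python lists, assuming the majority/minority split is already known; it is not tied to any particular library.

```python
import random

def undersample(majority, minority, seed=0):
    """Randomly keep only as many majority cases as there are minority cases;
    all minority cases are retained."""
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, len(minority))
    return kept_majority, minority

def oversample(majority, minority, seed=0):
    """Randomly duplicate minority cases until both classes are the same size;
    all majority cases are retained."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

majority = [[float(x), 0.0] for x in range(95)]   # 95 majority cases
minority = [[float(x), 1.0] for x in range(5)]    # 5 minority cases

maj_u, min_u = undersample(majority, minority)
maj_o, min_o = oversample(majority, minority)
print(len(maj_u), len(min_u))   # 5 5
print(len(maj_o), len(min_o))   # 95 95
```

Note that over-sampling here produces exact duplicates, which is precisely the overfitting risk discussed next.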
Disadvantages of Under-/Over-Sampling: Over-sampling the minority class can lead to model overfitting, since it introduces duplicate instances drawn from a pool of instances that is already small. Similarly, under-sampling the majority class can end up discarding instances that capture important differences between the two classes.
SMOTE: SMOTE (Synthetic Minority Oversampling Technique) is a powerful sampling method that goes beyond simple under- or over-sampling. The algorithm creates new instances of the minority class as convex combinations of neighboring instances.
How SMOTE Resolves the Rare-Events Problem: SMOTE synthetically generates new minority instances between existing minority instances. The new instances are not mere copies of existing minority cases; instead, for each minority case the algorithm samples the feature space of that case and its nearest neighbors, and then generates new instances that combine the features of the target case with features of its neighbors.
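The interpolation step at the heart of SMOTE is a one-liner: a new point is drawn uniformly on the segment between a minority instance and one of its minority neighbors, i.e. a convex combination of the two. A minimal sketch (function name `smote_point` is my own, not from any library):

```python
import random

def smote_point(x, neighbor, rng):
    """Interpolate: new = x + u * (neighbor - x) with u ~ Uniform(0, 1).
    The result lies on the line segment between the two minority instances,
    i.e. it is a convex combination of them."""
    u = rng.random()
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
x = [1.0, 2.0]
neighbor = [3.0, 6.0]
synthetic = smote_point(x, neighbor, rng)

# The synthetic point stays within the bounding box of its two parents.
assert 1.0 <= synthetic[0] <= 3.0 and 2.0 <= synthetic[1] <= 6.0
print(synthetic)
```

Because `u` is random, repeated calls produce distinct points rather than exact duplicates, which is what distinguishes SMOTE from plain over-sampling.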
This approach increases the feature-space coverage of the minority class and makes the samples more general. SMOTE takes the entire dataset as input, but it increases the percentage of only the minority cases. For example, suppose you have an imbalanced dataset where just 1% of the cases have the target value A (the minority class) and 99% of the cases have the value B. To increase the percentage of minority cases to twice the previous percentage, you would enter 200 for the SMOTE percentage in the module's properties.
Let us try to understand SMOTE with another example. Imagine SMOTE drawing line segments between existing minority instances that lie near one another. SMOTE then synthetically generates new instances along these lines, which increases the percentage of the minority class relative to the majority class.
Working of SMOTE Using R: The SMOTE() function handles unbalanced classification problems by generating a new "SMOTEd" data set in which the class imbalance is reduced.
The parameter K in the SMOTE() function specifies the number of nearest minority-class neighbors considered while synthesizing new instances.
For K = 1, only the closest minority-class neighbor of each instance is used for interpolation.
For K = 2, each synthetic instance is interpolated between the original case and one of its two nearest minority neighbors, chosen at random.
Note that increasing K does not by itself increase the number of synthetic instances; rather, it widens the pool of candidate neighbors, so the synthetic instances are spread over a larger region of the minority class.
There is another parameter, dup_size, which decides how many synthetic instances SMOTE() generates per existing minority instance, i.e. how many times it loops through the minority data.
Once we run the SMOTE() function with appropriate values of K and dup_size, the synthetic minority instances generated should be sufficient to correct the class imbalance.
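Putting K and dup_size together, the whole procedure can be sketched from scratch. This is a minimal Python illustration mimicking the R parameters described above (the function below is my own sketch, not the R SMOTE() implementation; real implementations differ in neighbor selection and tie-breaking):

```python
import random

def euclidean(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def smote(minority, K=5, dup_size=1, seed=0):
    """For each minority instance, pick one of its K nearest minority
    neighbors at random and interpolate between them; repeat dup_size
    times per instance, yielding len(minority) * dup_size new points."""
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        # K nearest minority neighbors of x (excluding x itself).
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: euclidean(x, p))[:K]
        for _ in range(dup_size):
            nb = rng.choice(neighbors)
            u = rng.random()
            synthetic.append([xi + u * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic

# Four minority points at the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, K=2, dup_size=3)
print(len(new_points))   # 4 instances * dup_size 3 = 12
```

Every synthetic point lies on a segment between two original minority points, so here all new coordinates stay inside the unit square.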
Shortcomings of SMOTE:
- Overgeneralization: SMOTE’s procedure is inherently dangerous since it blindly generalizes the minority area without regard to the majority class. This strategy is particularly problematic in the case of highly skewed class distributions since, in such cases, the minority class is very sparse with respect to the majority class, thus resulting in a greater chance of class mixture.
- Lack of Flexibility: The number of synthetic samples generated by SMOTE is fixed in advance, thus not allowing for any flexibility in the re-balancing rate.
Solution to the above problems:
Overgeneralization: In order to avoid overgeneralization, we propose to use three techniques:
- Testing for data sparsity
- Clustering the minority class
- 2-class (rather than 1-class) sample generalization
Lack of Flexibility: In order to avoid the lack of flexibility, we can run SMOTE() for different combinations of K and dup_size and choose the re-balancing rate that works best.
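A simple way to pick dup_size is to solve for the value that brings the classes close to parity. The sketch below assumes the semantics described earlier (each of the n_min minority cases gains dup_size synthetic copies, independent of K); check the documentation of the SMOTE implementation you actually use, since conventions vary.

```python
# Assumed semantics: after SMOTE, minority count = n_minority * (1 + dup_size).
# K does not change the count, only how widely the synthetic points spread.
n_majority, n_minority = 990, 10

for dup_size in range(1, 200):
    n_min_after = n_minority * (1 + dup_size)
    if n_min_after >= n_majority:
        print(f"dup_size = {dup_size} gives {n_min_after} minority cases "
              f"vs {n_majority} majority cases")
        break
# dup_size = 98 -> 10 * 99 = 990, i.e. exact parity
```

In practice one would then grid-search K over a small range (say 3 to 10) at this dup_size and compare downstream model performance on a held-out set.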