Distinct Multiple Learner-Based Ensemble SMOTEBagging (ML-ESB) Method for Classification of Binary Class Imbalance Problems

Dilip Singh Sisodia, Upasna Verma

Dilip Singh Sisodia, Department of Computer Science and Engineering, National Institute of Technology Raipur, Raipur, Chhattisgarh 492010, India
Upasna Verma, Department of Information Technology, National Institute of Technology Raipur, Raipur, Chhattisgarh 492010, India
Traditional classification algorithms often fail to learn from highly imbalanced datasets because training is dominated by samples from the majority class at the expense of the minority class. In this paper, a Multiple Learner-based Ensemble SMOTEBagging (ML-ESB) technique is proposed. The ML-ESB algorithm is a modified SMOTEBagging technique in which the ensemble of multiple instances of a single learner is replaced by multiple distinct classifiers. The proposed ML-ESB is designed to handle only the binary class imbalance problem. In ML-ESB, the ensemble of distinct classifiers includes Naïve Bayes, Support Vector Machine, Logistic Regression, and Decision Tree (C4.5). The performance of ML-ESB is evaluated on six binary imbalanced benchmark datasets using evaluation measures such as specificity, sensitivity, and area under the receiver operating characteristic curve. The obtained results are compared with those of the SMOTEBagging, SMOTEBoost, and cost-sensitive MCS algorithms at different imbalance ratios (IR). The ML-ESB algorithm outperformed the other methods on four datasets with high dimensionality and high class IR, whereas it showed moderate performance on the remaining two datasets, which have low dimensionality and small IR values.

Ensemble of classifiers; Area under the receiver operating characteristic curve; Classification; Class imbalance problem; Sensitivity; SMOTE; SMOTEBagging; SMOTEBoost; Specificity


Advances in data generation and acquisition tools and techniques have accelerated the growth and accessibility of raw data. This has opened new avenues for learning from historical data (Sisodia et al., 2018b). Existing machine learning algorithms perform well in many real-world applications where the classes have proportionate numbers of instances. However, when the instances are disproportionate (the class imbalance problem), the same learning algorithms face performance-related challenges. Therefore, in recent years, learning from imbalanced datasets has garnered significant attention from the machine learning community (Sun et al., 2018). In datasets with binary classes, the majority class, which has more samples than the minority class, overshadows the entire dataset (Collell et al., 2018). This class imbalance is further aggravated in critical real-world problems in which misclassifying minority class instances carries a high cost (Galar et al., 2012). Examples of such problems are the diagnosis of uncommon diseases (Nusantara et al., 2016), fraud detection (Oentaryo et al., 2014; Moepya et al., 2014), bankruptcy prediction (Fedorova et al., 2013), intrusion identification in remote sensors (Rodda & Erothi, 2016), and oil spill detection (Kubat et al., 1998). In datasets with imbalanced instances, learning algorithms are unable to appropriately represent the class distribution characteristics of the dataset. As a result, their predictions are not equally reliable across the classes of the dataset.
The techniques for dealing with imbalanced data can be categorized into three approaches: the data-level approach, the algorithmic approach, and the cost-sensitive learning approach. The data-level approach pre-processes the data before it is used for learning. Examples include random oversampling, random undersampling, the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al., 2002), and ADASYN (He et al., 2008). The algorithmic approach, also known as an ensemble of classifiers, improves classification accuracy by combining multiple classifiers (Chawla et al., 2003). This approach shows significantly improved performance compared with single classifiers. The cost-sensitive learning approach builds on the data-level approach, the algorithmic approach, or both.
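To make the data-level approach more concrete, the interpolation step at the core of SMOTE (Chawla et al., 2002) can be sketched as below. This is a simplified illustration rather than the implementation used in this paper: the function name `smote_sketch`, its parameters, and the toy data are hypothetical, and base samples are drawn at random instead of iterating over every minority sample with a fixed oversampling rate.

```python
import random


def euclidean_sq(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between a minority sample
    and one of its k nearest minority neighbours (hypothetical helper)."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: euclidean_sq(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (illustrative values only)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_synthetic=6, k=2)
```

Each synthetic point lies on the line segment between two existing minority samples, so the minority class grows without exact duplication, unlike random oversampling.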

The remainder of this paper is organized as follows. Section two briefly reviews research related to class imbalance. Section three describes the proposed approach in detail. Section four describes the evaluation measures used to assess and compare the proposed approach. Section five describes the datasets used in this study and presents the experimental results and discussion. Finally, section six summarizes the conclusions and future work.


In this paper, a modified SMOTEBagging technique called ML-ESB was presented to address the learning performance issues of imbalanced datasets. The performance of ML-ESB was evaluated on six binary imbalanced benchmark datasets using specificity and sensitivity, with the default class IR of each dataset and with two fixed IR values of 1:10 and 1:25 for all datasets. The experimental results showed that the ML-ESB algorithm performed significantly better on four datasets with comparatively large numbers of features and high class IR, and performed moderately on the remaining two datasets with small numbers of attributes and low IR. In the future, ML-ESB may be extended to handle multi-class imbalanced data classification.
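As a brief illustration of the measures named above, the sketch below computes sensitivity, specificity, and the class imbalance ratio from true and predicted labels. It assumes that the minority class is treated as the positive class, which is a common convention but an assumption here; the function name `binary_imbalance_metrics` and the toy labels are hypothetical and are not taken from the experiments.

```python
def binary_imbalance_metrics(y_true, y_pred, minority_label=1):
    """Sensitivity, specificity, and majority-to-minority ratio, with the
    minority class assumed to be the positive class (hypothetical helper)."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == minority_label:
            if pred == minority_label:
                tp += 1  # minority sample correctly detected
            else:
                fn += 1  # minority sample missed
        else:
            if pred == minority_label:
                fp += 1  # majority sample flagged as minority
            else:
                tn += 1  # majority sample correctly rejected
    n_minority = tp + fn
    n_majority = tn + fp
    if n_minority == 0 or n_majority == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "sensitivity": tp / n_minority,   # true positive rate
        "specificity": tn / n_majority,   # true negative rate
        "majority_per_minority": n_majority / n_minority,  # 10.0 means IR 1:10
    }


# Toy labels with an IR of 1:10 (2 minority, 20 majority samples)
y_true = [1, 1] + [0] * 20
y_pred = [1, 0] + [0] * 18 + [1, 1]
metrics = binary_imbalance_metrics(y_true, y_pred)
```

In this toy case the classifier is correct on 19 of 22 samples, yet it detects only one of the two minority samples, which is why sensitivity and specificity are reported separately rather than overall accuracy.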


