• Vol 10, No 4 (2019)
  • Electrical, Electronics, and Computer Engineering

Distinct Multiple Learner-Based Ensemble SMOTEBagging (ML-ESB) Method for Classification of Binary Class Imbalance Problems

Dilip Singh Sisodia, Upasna Verma

Corresponding email: dssisodia.cs@nitrr.ac.in


Cite this article as:
Sisodia, D.S., Verma, U., 2019. Distinct Multiple Learner-Based Ensemble SMOTEBagging (ML-ESB) Method for Classification of Binary Class Imbalance Problems. International Journal of Technology. Volume 10(4), pp. 721-730
Dilip Singh Sisodia, Department of Computer Science and Engineering, National Institute of Technology Raipur, Raipur, Chhattisgarh 492010, India
Upasna Verma, Department of Information Technology, National Institute of Technology Raipur, Raipur, Chhattisgarh 492010, India

Abstract

Traditional classification algorithms often fail to learn from highly imbalanced datasets because training is dominated by samples from the majority class at the expense of the minority class. In this paper, a Multiple Learner-based Ensemble SMOTEBagging (ML-ESB) technique is proposed. ML-ESB is a modified SMOTEBagging technique in which the ensemble of multiple instances of a single learner is replaced by multiple distinct classifiers. The proposed ML-ESB is designed to handle only the binary class imbalance problem. In ML-ESB, an ensemble of multiple distinct classifiers, namely Naïve Bayes, Support Vector Machine, Logistic Regression, and Decision Tree (C4.5), is used. The performance of ML-ESB is evaluated on six binary imbalanced benchmark datasets using evaluation measures such as specificity, sensitivity, and area under the receiver operating characteristic curve. The obtained results are compared with those of SMOTEBagging, SMOTEBoost, and cost-sensitive MCS algorithms for different imbalance ratios (IRs). The ML-ESB algorithm outperformed the other existing methods on four datasets with high dimensionality and class IR, whereas it showed moderate performance on the remaining two datasets with low dimensionality and small IR values.

Keywords: An ensemble of classifiers; Area under receiver operating curve; Classification; Class imbalance problem; Sensitivity; SMOTE; SMOTEBagging; SMOTEBoost; Specificity

Introduction

Advances in data generation and acquisition tools and techniques have accelerated the growth and accessibility of raw data, opening new avenues for learning from historical data (Sisodia et al., 2018b). Existing machine learning algorithms perform well on many real-world applications with proportionate class instances. However, when class instances are disproportionate (imbalanced problems), the same learning algorithms face performance-related challenges. Learning from imbalanced datasets has therefore garnered significant attention from the machine learning community in recent years (Sun et al., 2018). In datasets with binary classes, the majority class, having more samples than the minority class, overshadows the entire dataset (Collell et al., 2018). Class imbalance is further aggravated in critical real-world problems where misclassifying minority class instances carries a high cost (Galar et al., 2012). Examples of such problems include detection of uncommon diseases (Nusantara et al., 2016), fraud detection (Oentaryo et al., 2014; Moepya et al., 2014), bankruptcy prediction (Fedorova et al., 2013), network intrusion detection (Rodda & Erothi, 2016), and oil spill detection (Kubat et al., 1998). On such imbalanced datasets, learning algorithms are unable to appropriately represent the class distribution characteristics of the data and consequently produce unreliable predictions for the minority class. The techniques for dealing with imbalanced data can be categorized into three approaches: the data-level approach, the algorithmic approach, and the cost-sensitive learning approach. The data-level approach pre-processes the data before learning; examples include random oversampling, random undersampling, the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al., 2002), and ADASYN (He et al., 2008).
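To make the data-level idea concrete, the following is a minimal sketch of SMOTE's core interpolation step: each synthetic minority sample is placed on the line segment between a minority sample and one of its k nearest minority-class neighbours. This is an illustrative simplification, not the implementation used in the paper; the function name and parameters are our own.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples (SMOTE-style sketch).

    minority: list of feature tuples belonging to the minority class.
    Each synthetic point interpolates between a random minority sample
    and one of its k nearest minority-class neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x by squared Euclidean distance,
        # excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic
```

Because each new point lies between two existing minority samples, SMOTE densifies the minority region instead of merely duplicating points, which is what distinguishes it from random oversampling.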
The algorithmic approach, also known as an ensemble of classifiers, combines multiple classifiers to improve classification accuracy (Chawla et al., 2003), and generally performs significantly better than a single classifier. The cost-sensitive learning approach incorporates misclassification costs and may build on the data-level approach, the algorithmic approach, or both.
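The combination scheme behind such ensembles can be sketched as follows: several distinct learners are each trained on a class-balanced bootstrap of the data and their predictions are combined by majority vote. This is only a toy illustration of the general idea, not the paper's ML-ESB (which uses Naïve Bayes, SVM, Logistic Regression, and C4.5 with SMOTE resampling); the simple classifiers and function names below are illustrative stand-ins.

```python
import random
from collections import Counter

class NearestCentroid:
    """Predict the class whose training-set centroid is closest."""
    def fit(self, X, y):
        self.centroids = {}
        for c in set(y):
            pts = [xi for xi, yi in zip(X, y) if yi == c]
            self.centroids[c] = tuple(sum(col) / len(pts) for col in zip(*pts))
        return self
    def predict(self, x):
        return min(self.centroids,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(x, self.centroids[c])))

class KNN:
    """k-nearest-neighbour classifier with majority vote among neighbours."""
    def __init__(self, k):
        self.k = k
    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self
    def predict(self, x):
        idx = sorted(range(len(self.X)),
                     key=lambda j: sum((a - b) ** 2
                                       for a, b in zip(x, self.X[j])))[:self.k]
        return Counter(self.y[j] for j in idx).most_common(1)[0][0]

def balanced_bootstrap(X, y, rng):
    """Resample each class (with replacement) to the majority-class size,
    a crude stand-in for SMOTE-style oversampling."""
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n = max(len(v) for v in by_class.values())
    Xb, yb = [], []
    for c, pts in by_class.items():
        for _ in range(n):
            Xb.append(rng.choice(pts))
            yb.append(c)
    return Xb, yb

def ensemble_predict(X, y, x_new, seed=0):
    """Train distinct learners on balanced bootstraps; majority-vote."""
    rng = random.Random(seed)
    votes = []
    for learner in (NearestCentroid(), KNN(1), KNN(3)):
        Xb, yb = balanced_bootstrap(X, y, rng)
        votes.append(learner.fit(Xb, yb).predict(x_new))
    return Counter(votes).most_common(1)[0][0]
```

Using distinct learner types, as ML-ESB does, adds diversity to the ensemble: the members make different kinds of errors, so their majority vote is more robust than repeating one learner on different bootstraps.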

The remainder of this paper is organized as follows. Section two briefly reviews research related to class imbalance. Section three discusses the working of the proposed approach in detail. Section four describes the evaluation parameters used to measure and compare the performance of the proposed approach. Section five describes the datasets used in this study, along with experimental results and discussion. Section six summarizes the conclusion and future work.

Conclusion

In this paper, a modified SMOTEBagging technique called ML-ESB was presented to address the learning performance issues of imbalanced datasets. The performance of ML-ESB was evaluated on six binary benchmark imbalanced datasets using specificity and sensitivity, with the default class IR of each dataset and two fixed IR values of 1:10 and 1:25 for all datasets. The experimental results showed that the ML-ESB algorithm performed significantly better on four datasets with comparatively large numbers of features and high class IR, and moderately on the remaining two datasets with a small number of attributes and low IR. In the future, ML-ESB may be extended to handle multi-class imbalanced data classification.

References

Abolkarlou, N.A., Niknafs, A.A., Ebrahimpour, M.K., 2014. Ensemble Imbalance Classification: Using Data Preprocessing, Clustering Algorithm and Genetic Algorithm. In: Computer and Knowledge Engineering (ICCKE), 2014 4th International eConference on. pp. 171–176

Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., Garcia, S., Sánchez, L., Herrera, F., 2011. KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework. Journal of Multiple-Valued Logic and Soft Computing, Volume 17(2–3), pp. 255–287

Bauer, E., Kohavi, R., 1999. An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine Learning, Volume 36(1-2), pp. 105–139

Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P., 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, Volume 16, pp. 321–357

Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.W., 2003. SMOTEBoost: Improving Prediction of the Minority Class in Boosting. In: European Conference on Principles of Data Mining and Knowledge Discovery. pp. 107–119

Collell, G., Prelec, D., Patil, K.R., 2018. A Simple Plug-in Bagging Ensemble based on Threshold-moving for Classifying Binary and Multiclass Imbalanced Data. Neurocomputing, Volume 275, pp. 330–340

Fedorova, E., Gilenko, E., Dovzhenko, S., 2013. Bankruptcy Prediction for Russian Companies: Application of Combined Classifiers. Expert Systems with Applications, Volume 40(18), pp. 7285–7293

Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F., 2012. A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-based Approaches. IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews, Volume 42(4), pp. 463–484

Galar, M., Fernandez, A., Barrenechea, E., Herrera, F., 2013. EUSBoost: Enhancing Ensembles for Highly Imbalanced Data-sets by Evolutionary Undersampling. Pattern Recognition, Volume 46(12), pp. 3460–3471

Gong, J., Kim, H., 2017. RHSBoost: Improving Classification Performance in Imbalance Data. Computational Statistics and Data Analysis, Volume 111, pp. 1–13

Han, H., Wang, W.-Y., Mao, B.-H., 2005. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. Advances in Intelligent Computing, Volume 17(12), pp. 878–887

Hanifah, F.S., Wijayanto, H., Kurnia, A., 2015. SMOTE Bagging Algorithm for Imbalanced Dataset in Logistic Regression Analysis (Case: Credit of Bank X). Applied Mathematical Sciences, Volume 9(138), pp. 6857–6865

He, H., Bai, Y., Garcia, E.A., Li, S., 2008. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In: Proceedings of the International Joint Conference on Neural Networks, pp. 1322–1328

Huang, J., Ling, C.X., 2005. Using AUC and Accuracy in Evaluating Learning Algorithms. IEEE Transactions on Knowledge and Data Engineering, Volume 17(3), pp. 299–310

Krawczyk, B., Woźniak, M., Schaefer, G., 2014. Cost-sensitive Decision Tree Ensembles for Effective Imbalanced Classification. Applied Soft Computing Journal, Volume 14, Part C, pp. 554–562

Kubat, M., Holte, R.C., Matwin, S., 1998. Machine Learning for the Detection of Oil Spills in Satellite Radar Images. Machine Learning, Volume 30(2–3), pp. 195–215

Li, K., Zhang, W., Lu, Q., Fang, X., 2014. An Improved SMOTE Imbalanced Data Classification Method based on Support Degree. In: International Conference on Identification, Information and Knowledge in the Internet of Things, pp. 34–38

MATLAB (2012a), software package

Moepya, S.O., Akhoury, S.S., Nelwamondo, F.V., 2014. Applying Cost-sensitive Classification for Financial Fraud Detection under High Class-imbalance. In: 2014 IEEE International Conference on Data Mining Workshop, pp. 183–192

Moniz, N., Ribeiro, R.P., Cerqueira, V., Chawla, N.V., 2018. SMOTEBoost for Regression: Improving the Prediction of Extreme Values. In IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA). IEEE, pp. 150–159

Nusantara, A.C., Purwanti, E., Soelistiono, S., 2016. Classification of Digital Mammogram based on Nearest-Neighbor Method for Breast Cancer Detection. International Journal of Technology, Volume 7(1), pp. 71–77

Oentaryo, R., Lim, E-P., Finegold, M., Lo, D., Zhu, F., Phua, C., Cheu, E-Y., Yap, G-E., Sim, K., Nguyen, M.N., Perera, K., Neupane, B., Faisal, M., Aung, Z., Woon, W.L., Chen, W., Patel, D., Berrar, D., 2014. Detecting Click Fraud in Online Advertising: A Data Mining Approach. Journal of Machine Learning Research, Volume 15, pp. 99–140

Rodda, S., Erothi, U.S.R., 2016. Class Imbalance Problem in the Network Intrusion Detection Systems. In: 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), pp. 2685–2688

Sisodia, D.S., Reddy, N.K., Bhandari, S., 2018a. Performance Evaluation of Class Balancing Techniques for Credit Card Fraud Detection. In: IEEE International Conference on Power, Control, Signals and Instrumentation Engineering. IEEE, pp. 2747–2752

Sisodia, D.S., Singhal, R., Kandal, V., 2018b. Comparative Performance of Interestingness Measures to Identify Redundant and Non-informative Rules from Web Usage Data. International Journal of Technology, Volume 9(1), pp. 201–211

Sisodia, D.S., Verma, U., 2018. The Impact of Data Re-sampling on Learning Performance of Class Imbalanced Bankruptcy Prediction Models. International Journal on Electrical Engineering and Informatics, Volume 10(3), pp. 433–446

Sun, J., Lang, J., Fujita, H., Li, H., 2018. Imbalanced Enterprise Credit Evaluation with DTE-SBD: Decision Tree Ensemble based on SMOTE and Bagging with Differentiated Sampling Rates. Information Sciences, Volume 425, pp. 76–91

Wang, S., Yao, X., 2009. Diversity Analysis on Imbalanced Data Sets by using Ensemble Models. In: IEEE Symposium on Computational Intelligence and Data Mining. IEEE, pp. 324–331

Zhang, Y., Min, Z., Danling, Z., Gang, M., Daichuan, M., 2014. Improved SMOTEBagging and Its Application in Imbalanced Data Classification. In: 2013 IEEE Conference Anthology