Published: 29 July 2019
Journal: International Journal of Technology (IJtech), Vol. 10, No. 4 (2019)
DOI: https://doi.org/10.14716/ijtech.v10i4.1743
Dilip Singh Sisodia | Department of Computer Science and Engineering, National Institute of Technology Raipur, Raipur, Chhattisgarh 492010, India
Upasna Verma | Department of Information Technology, National Institute of Technology Raipur, Raipur, Chhattisgarh 492010, India
Traditional classification algorithms often fail to learn from highly imbalanced datasets because training is dominated by samples of the majority class at the expense of the minority class. In this paper, a Multiple Learners-based Ensemble SMOTEBagging (ML-ESB) technique is proposed. ML-ESB is a modified SMOTEBagging technique in which the ensemble of multiple instances of a single learner is replaced by multiple distinct classifiers. The proposed ML-ESB is designed to handle only the binary class imbalance problem. In ML-ESB, the ensemble of distinct classifiers comprises Naïve Bayes, Support Vector Machine, Logistic Regression, and Decision Tree (C4.5). The performance of ML-ESB is evaluated on six binary imbalanced benchmark datasets using measures such as specificity, sensitivity, and area under the receiver operating curve. The obtained results are compared with those of the SMOTEBagging, SMOTEBoost, and cost-sensitive MCS algorithms under different imbalance ratios (IR). ML-ESB outperformed the existing methods on four datasets with high dimensionality and class IR, while showing moderate performance on the remaining two datasets with low dimensionality and small IR values.
Keywords: An ensemble of classifiers; Area under receiver operating curve; Classification; Class imbalance problem; Sensitivity; SMOTE; SMOTEBagging; SMOTEBoost; Specificity
The advancement of data generation and acquisition tools and techniques has accelerated the growth and accessibility of raw data. This has, in turn, opened new avenues for learning from historical data (Sisodia et al., 2018b). Existing machine learning algorithms perform well in many real-world applications when class instances are proportionate. However, when instances are disproportionate (imbalanced problems), the same learning algorithms face performance-related challenges. Therefore, in recent years, learning from imbalanced datasets has garnered significant attention from the machine learning community (Sun et al., 2018). In datasets with binary classes, the majority class, having far more samples than the minority class, overshadows the entire dataset (Collell et al., 2018). The class imbalance problem is further aggravated in critical real-world applications where misclassifying minority class instances carries a high cost (Galar et al., 2012). Examples of such problems are diagnosis of uncommon diseases (Nusantara et al., 2016), fraud detection (Oentaryo et al., 2014; Moepya et al., 2014), bankruptcy prediction (Fedorova et al., 2013), intrusion detection in sensor networks (Rodda & Erothi, 2016), oil spill detection (Kubat et al., 1998), etc. On imbalanced datasets, learning algorithms are unable to appropriately represent the class distribution characteristics of the data and, as a result, produce unreliable predictions for the minority class. The techniques for dealing with imbalanced data can be categorized into three approaches: the data-level approach, the algorithmic approach, and the cost-sensitive learning approach. The data-level approach pre-processes the data before it is used for learning. Such approaches include random oversampling, random undersampling, the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al., 2002), ADASYN (He et al., 2008), etc.
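The core idea of SMOTE referenced above (Chawla et al., 2002) is that each synthetic minority sample is an interpolation between a minority seed point and one of its k nearest minority-class neighbours. The following is a simplified NumPy sketch of that idea, not the reference implementation; the function name and parameters are ours:

```python
import numpy as np

def smote_sample(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    each randomly chosen seed point and one of its k nearest minority-class
    neighbours (the core SMOTE idea, simplified)."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    # pairwise distances within the minority class; exclude self-matches
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    k = min(k, len(minority) - 1)
    neighbours = np.argsort(d, axis=1)[:, :k]     # k nearest minority neighbours
    seeds = rng.integers(0, len(minority), n_new)  # random seed point per sample
    picks = neighbours[seeds, rng.integers(0, k, n_new)]
    gaps = rng.random((n_new, 1))                  # interpolation factor in [0, 1)
    return minority[seeds] + gaps * (minority[picks] - minority[seeds])

# toy minority class in 2-D; six synthetic points are drawn between neighbours
X_min = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0], [2.5, 2.5]])
X_syn = smote_sample(X_min, n_new=6, rng=0)
```

Because every synthetic point is a convex combination of two minority points, the new samples stay inside the region occupied by the minority class rather than being mere duplicates, which is what distinguishes SMOTE from random oversampling.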
The algorithmic approach, also known as an ensemble of classifiers, improves classification accuracy by combining multiple classifiers (Chawla et al., 2003), and it shows significantly improved performance over single classifiers. The cost-sensitive learning approach incorporates misclassification costs and may be combined with the data-level approach, the algorithmic approach, or both.
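The combination step of such an ensemble can be sketched as a plurality vote over the predictions of several distinct base classifiers, which is the setting ML-ESB uses with Naïve Bayes, SVM, Logistic Regression, and C4.5. The sketch below is a minimal illustration only: the toy threshold rules stand in for trained base classifiers, and the function names are ours, not from the paper:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the labels predicted for one sample by the base classifiers
    into a single ensemble decision by plurality voting."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(classifiers, X):
    """Each classifier predicts a label for every sample; the per-sample
    votes are then combined by majority_vote."""
    per_clf = [clf(X) for clf in classifiers]   # one label list per classifier
    return [majority_vote(votes) for votes in zip(*per_clf)]

# toy stand-ins for four distinct trained classifiers on 1-D inputs
clfs = [lambda X: [1 if x > 0.5 else 0 for x in X],
        lambda X: [1 if x > 0.3 else 0 for x in X],
        lambda X: [0 for _ in X],
        lambda X: [1 if x > 0.7 else 0 for x in X]]
print(ensemble_predict(clfs, [0.1, 0.4, 0.9]))  # -> [0, 0, 1]
```

In a SMOTEBagging-style setting, each base classifier would additionally be trained on its own bootstrap sample rebalanced with SMOTE before this voting step is applied.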
The remainder of this paper is organized as follows. Section two briefly reviews research related to class imbalance. Section three discusses the working of the proposed approach in detail. Section four describes the evaluation parameters used to measure and compare the performance of the proposed approach. Section five describes the datasets used in this study along with the experimental results and discussion. Finally, section six summarizes the conclusion and future work.
In this paper, a modified SMOTEBagging technique called ML-ESB was presented to address the learning performance issues of imbalanced datasets. The performance of ML-ESB was evaluated on six binary benchmark imbalanced datasets using specificity and sensitivity, with the default class IR of each dataset and with two fixed IR values of 1:10 and 1:25 for all datasets. The experimental results showed that ML-ESB performed significantly better on four datasets with comparatively large numbers of features and high class IR, and performed moderately on the remaining two datasets with small numbers of attributes and low IR. In the future, ML-ESB may be extended to handle multi-class imbalanced data classification.
Abolkarlou, N.A.,
Niknafs, A.A., Ebrahimpour, M.K., 2014. Ensemble Imbalance Classification: Using
Data Preprocessing, Clustering Algorithm and Genetic Algorithm. In: 2014 4th International eConference on Computer and Knowledge Engineering (ICCKE), pp. 171–176
Alcalá-Fdez, J., Fernández, A.,
Luengo, J., Derrac, J., Garcia, S., Sánchez, L., Herrera, F., 2011. KEEL Data-Mining
Software Tool: Data Set Repository, Integration of Algorithms and Experimental
Analysis Framework. Journal of Multiple-Valued Logic and Soft Computing,
Volume 17(2–3), pp. 255–287
Bauer, E., Kohavi, R., 1999. An
Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting,
and Variants. Machine Learning, Volume 36(1-2), pp. 105–139
Chawla, N.V., Bowyer, K.W., Hall,
L.O., Kegelmeyer, W.P., 2002. SMOTE: Synthetic Minority Over-sampling Technique.
Journal of Artificial Intelligence Research, Volume 16, pp. 321–357
Chawla, N.V., Lazarevic, A., Hall,
L.O., Bowyer, K.W., 2003. SMOTEBoost: Improving Prediction of the Minority
Class in Boosting. In: European
Conference on Principles of Data Mining and Knowledge Discovery. pp.
107–119
Collell, G., Prelec, D., Patil, K.R.,
2018. A Simple Plug-in Bagging Ensemble based on Threshold-moving for
Classifying Binary and Multiclass Imbalanced Data. Neurocomputing, Volume
275, pp. 330–340
Fedorova, E., Gilenko, E., Dovzhenko,
S., 2013. Bankruptcy Prediction for Russian Companies: Application of Combined
Classifiers. Expert Systems with Applications, Volume 40(18), pp.
7285–7293
Galar, M., Fernandez, A.,
Barrenechea, E., Bustince, H., Herrera, F., 2012. A Review on Ensembles for the
Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-based Approaches. IEEE
Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews,
Volume 42(4), pp. 463–484
Galar, M., Fernandez, A.,
Barrenechea, E., Herrera, F., 2013. EUSBoost: Enhancing Ensembles for Highly
Imbalanced Data-sets by Evolutionary Undersampling. Pattern Recognition,
Volume 46(12), pp. 3460–3471
Gong, J., Kim, H., 2017. RHSBoost:
Improving Classification Performance in Imbalance Data. Computational
Statistics and Data Analysis, Volume 111, pp. 1–13
Han, H., Wang, W.-Y., Mao, B.-H.,
2005. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets
Learning. Advances in Intelligent Computing, Volume 17(12), pp. 878–887
Hanifah, F.S., Wijayanto, H., Kurnia,
A., 2015. SMOTE Bagging Algorithm for Imbalanced Dataset in Logistic Regression
Analysis (Case: Credit of Bank X). Applied Mathematical Sciences, Volume
9(138), pp. 6857–6865
He, H., Bai, Y., Garcia, E.A., Li, S.,
2008. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. Proceedings
of the International Joint Conference on Neural Networks, pp. 1322–1328
Huang, J., Ling, C.X., 2005. Using
AUC and Accuracy in Evaluating Learning Algorithms. IEEE Transactions on
Knowledge and Data Engineering, Volume 17(3), pp. 299–310
Krawczyk, B., Wozniak, M., Schaefer,
G., 2014. Cost-sensitive Decision Tree Ensembles for Effective Imbalanced
Classification. Applied Soft Computing Journal, Volume 14(Part C), pp.
554–562
Kubat, M., Holte, R.C., Matwin, S.,
1998. Machine Learning for the Detection of Oil Spills in Satellite Radar
Images. Machine Learning, Volume 30(2–3), pp. 195–215
Li, K., Zhang, W., Lu, Q., Fang, X., 2014.
An Improved SMOTE Imbalanced Data Classification Method based on Support
Degree. In: International Conference on Identification, Information and Knowledge
in the Internet of Things, pp. 34–38
MATLAB (2012a), Software package
Moepya, S.O., Akhoury, S.S.,
Nelwamondo, F.V., 2014. Applying Cost-sensitive Classification for Financial
Fraud Detection under High Class-imbalance. In:
2014 IEEE International Conference on
Data Mining Workshop, pp. 183–192
Moniz, N., Ribeiro, R.P., Cerqueira,
V., Chawla, N.V., 2018. SMOTEBoost for Regression: Improving the Prediction of
Extreme Values. In: IEEE 5th International Conference on Data
Science and Advanced Analytics (DSAA). IEEE, pp. 150–159
Nusantara, A.C., Purwanti, E.,
Soelistiono, S., 2016. Classification of Digital Mammogram based on
Nearest-Neighbor Method for Breast Cancer Detection. International Journal
of Technology, Volume 7(1), pp. 71–77
Oentaryo, R., Lim, E-P., Finegold, M.,
Lo, D., Zhu, F., Phua, C., Cheu, E-Y., Yap, G-E., Sim, K., Nguyen, M.N.,
Perera, K., Neupane, B., Faisal, M., Aung, Z., Woon, W.L., Chen, W., Patel, D.,
Berrar, D., 2014. Detecting Click
Fraud in Online Advertising: A Data Mining Approach. Journal of Machine Learning Research, Volume 15, pp. 99–140
Rodda, S., Erothi, U.S.R., 2016.
Class Imbalance Problem in the Network Intrusion Detection Systems. In: 2016 International Conference on
Electrical, Electronics, and Optimization Techniques (ICEEOT), pp. 2685–2688
Sisodia, D.S., Reddy, N.K., Bhandari,
S., 2018a. Performance Evaluation of Class Balancing Techniques for Credit Card
Fraud Detection. In: IEEE International Conference on Power,
Control, Signals and Instrumentation Engineering. IEEE, pp. 2747–2752
Sisodia, D.S., Singhal, R., Kandal,
V., 2018b. Comparative Performance of Interestingness Measures to Identify
Redundant and Non-informative Rules from Web Usage Data. International
Journal of Technology, Volume 9(1), pp. 201–211
Sisodia, D.S., Verma, U., 2018. The Impact
of Data Re-sampling on Learning Performance of Class Imbalanced Bankruptcy
Prediction Models. International Journal on Electrical Engineering and
Informatics, Volume 10(3), pp. 433–446
Sun, J., Lang, J., Fujita, H., Li,
H., 2018. Imbalanced Enterprise Credit Evaluation with DTE-SBD: Decision Tree
Ensemble based on SMOTE and Bagging with Differentiated Sampling Rates. Information
Sciences, Volume 425, pp. 76–91
Wang, S., Yao, X., 2009. Diversity Analysis on Imbalanced Data Sets by using Ensemble Models. In: IEEE Symposium on Computational Intelligence and Data Mining. IEEE, pp. 324–331
Zhang, Y., Min, Z., Danling, Z., Gang, M., Daichuan, M., 2014. Improved SMOTEBagging and Its Application in Imbalanced Data Classification. In: 2013 IEEE Conference Anthology