|Temitayo Matthew Fagbola||- KZN eSkills CoLab, Durban University of Technology, Durban 4000, South Africa - Department of Information Technology, Durban University of Technology, Durban, South Africa|
|Colin Surendra Thakur||- KZN eSkills CoLab, Durban University of Technology, Durban 4000, South Africa - Department of Information Technology, Durban University of Technology, Durban, South Africa|
|Oludayo Olugbara||Department of Information Technology, Durban University of Technology, Durban, South Africa|
News article classification is a growing area of interest in text classification because a single article often matches multiple categories. However, the weak reliability indices and ambiguities associated with the state-of-the-art classifiers commonly employed have limited success in this domain. In addition, classifier performance is highly sensitive to, and varies widely with, the nature of real-world datasets, which makes comparative evaluation necessary. In this paper, the accuracy and computational time efficiency of the Kolmogorov Complexity Distance Measure (KCDM) and an Artificial Neural Network (ANN) were experimentally evaluated on a prototype high-dimensional news article classification problem. A total of 2000 news articles from a dataset of 2225 British Broadcasting Corporation (BBC) news documents (including examples from sport, politics, entertainment, education and technology, and business) were used for categorical testing purposes. Porter’s algorithm was used for word stemming after tokenization and stop-word removal, and a Normalized Term Frequency–Inverse Document Frequency (NTF-IDF) technique was adopted for feature extraction. Experimental results revealed that ANN performed better in terms of accuracy, while the KCDM produced better results than ANN in terms of computational time efficiency.
Artificial neural network; Kolmogorov complexity distance measure; News article dataset; Text classification
In the domain of text classification, news article classification has become an area of significant interest due to the rapidly growing volume of news corpora on the World Wide Web (WWW). However, during classification, news articles often suffer from deep ambiguity because of their various matching categories and the weak reliability of most classification systems in use. These often result in the low efficiency and poor performance evident in many current approaches (Kaur & Bajaj, 2016; Birabadar & Raikar, 2017). In recent times, learning systems based on ANN and KCDM have been increasingly applied to classification tasks in high-dimensional problem spaces, including intrinsic plagiarism detection, image and speech recognition, identity and non-parametric testing, risk assessment, cellular automata classification, spam filtering, malicious URL detection, text and music classification, DNA analysis, radar signal classification, EEG classification, e-commerce product classification, etc. (Revolle et al., 2016; Oyewole & Olugbara, 2017; Abdalkafor, 2017; Haris et al., 2018). ANN has previously been identified as a good approach for dealing with large text classification problems (Lai et al., 2015). However, due to the complex structure of news article datasets, identifying an efficient and accurate classifier that best fits their classification remains an open problem. This makes it highly challenging to annotate topical news by category in an accurate and time-efficient manner.
Recently, a trend of user-aware big data analytic concepts has emerged, with tags such as “user-assisted classification”, “interactive classification”, “user-aware classification” and “user-centred classification”. These describe learning systems and/or big data analytic techniques that incorporate users’ feedback, reviews, ratings and personalized opinions into the classification process to improve the quality of classification decisions in an automated or semi-automated fashion (Donkers et al., 2018). Examples include personalized and sentiment-enhanced recommender systems (Yibo et al., 2018). However, this approach is best suited to unstructured data analysis (Donkers et al., 2018).
In the present work, the general aim was to compare the accuracy and time efficiency of the Kolmogorov Complexity Distance Measure (KCDM) and an Artificial Neural Network (ANN) in solving a prototype news article classification problem. In the experiments conducted, ANN and the KCDM were implemented in Microsoft Visual C#. A total of 2000 news articles were obtained from the publicly available BBC News article dataset. These were tokenized, stripped of stop-words and then stemmed using Porter’s algorithm. An NTF-IDF technique was used to extract and select relevant features before training and classification with the KCDM and ANN. The rest of this paper is organized as follows: in Section 2, relevant literature on multi-labelled text classification, ANN and the KCDM is discussed. In Section 3, the materials and method are presented. Section 4 discusses the results obtained, while Section 5 presents the conclusions and future directions. The major contributions of this work include:
In this paper, a performance comparison between two methods (ANN and the KCDM) for addressing the news article classification problem was conducted. The experimental results revealed that ANN was better in terms of accuracy, while the KCDM was better suited to developing time-efficient applications. In summary, this paper establishes the importance of conducting performance evaluation as a core part of choosing the best test routines during development, to ensure high overall reliability for deployed applications. Furthermore, this process can help identify the trade-offs associated with each algorithm and inform decisions on which algorithm to apply to a specific problem domain of interest, especially when developing fault-tolerant systems.
The major findings of our experiments are: (1) ANN can produce higher classification accuracy for large datasets than the KCDM. In all the experiments conducted, ANN yielded more true positives than the KCDM; (2) the time efficiency of ANN was very low compared to the KCDM. In the experiments, as the size of the testing set grew, the classification time of ANN also increased. With a testing set containing 1300 features, the classification time spent by ANN was approximately eight times that of the KCDM.
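Kolmogorov complexity is not computable, so distance measures built on it are usually approximated with a real compressor. As a rough illustration of that idea, the sketch below computes the normalized compression distance with Python's `zlib` and assigns a document the label of its nearest labelled example. The choice of `zlib`, the function names and the nearest-example rule are assumptions made for illustration only; the KCDM evaluated in this paper was implemented in C# and may be formulated differently.

```python
import zlib


def compressed_size(text: str) -> int:
    """Approximate the Kolmogorov complexity of text by its zlib-compressed length."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: smaller when x and y share more structure."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(document: str, labelled_examples: list[tuple[str, str]]) -> str:
    """Return the label of the (text, label) example closest to document under NCD."""
    _, nearest_label = min(labelled_examples, key=lambda pair: ncd(document, pair[0]))
    return nearest_label
```

Calling `classify(article, examples)` with `examples` as a list of `(text, label)` pairs needs no training phase, only one compression per comparison. This gives some intuition for why a compression-based measure can be cheap to run, although it says nothing about the timings reported above.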
In future work, emerging and other baseline classifiers such as AdaBoost, SVM, naïve Bayes and k-nearest neighbour could also be evaluated for news article classification in large multi-dimensional feature spaces. In addition, an ensemble of ANN and KCDM could be developed to realize an algorithm with improved classification accuracy and time efficiency.
Abdalkafor, A.S., 2017. Designing Offline Arabic Handwritten Isolated Character Recognition System using Artificial Neural Network Approach. International Journal of Technology, Volume 8(3), pp. 528–538
Cilibrasi, R.L., Vitanyi, P.M.B., 2007. The Google Similarity Distance. IEEE Transactions on Knowledge and Data Engineering, Volume 19(3), pp. 370–383
Chan, C., Sun, A., Lim, E.P., 2001. Automated Online News Classification with Personalization. In: Proceedings of the 4th International Conference of Asian Digital Library (ICADL2001), pp. 320–329
Donkers, T., Loepp, B., Ziegler, J., 2018. Explaining Recommendations by Means of User Reviews. In: Joint Proceedings of the ACM IUI 2018 Workshops, March 11, Tokyo, Japan, pp. 1–4
Fagbola, T.M., Olabiyisi, S.O., Egbetola, F.I., Oloyede, A., 2017. Review of Technical Approaches to Face Recognition in Unconstrained Scenes with Varying Pose and Illumination. Federal University, Oye-Ekiti (FUOYE). Journal of Engineering and Technology, Volume 2(1), pp. 1–8
Fagbola, T., Olabiyisi, S., Adigun, A., 2012. Hybrid GA-SVM for Efficient Feature Selection in E-Mail Classification. Journal of Computer Engineering and Intelligent Systems, Volume 3(3), pp. 17–28
Greene, D., Cunningham, P., 2006. Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 377–384
Gurmeet, K., Karan, B., 2016. News Classification and its Techniques: A Review. IOSR Journal of Computer Engineering (IOSR-JCE). Volume 18(1), pp. 22–26
Haris, A., Murdianto, B., Susattyo, R., Riyanto, A., 2018. Transforming Seismic Data into Lateral Sonic Properties using Artificial Neural Network: A Case Study of Real Data Set. International Journal of Technology. Volume 9(3), pp. 472–478
Huang, Y., 2009. Advances in Artificial Neural Networks-Methodological Development and Application. Algorithms, Volume 2(3), pp. 973–1007
Joho, H., Sanderson, M., 2007. Document Frequency and Term Specificity. In: Large Scale Semantic Access to Content (Text, Image, Video, and Sound), Paris, France: Le Centre de Hautes Etudes Internationales D'informatique Documentaire, pp. 350–359
Kaur, G., Bajaj, K., 2016. News Classification and its Techniques: A Review. IOSR Journal of Computer Engineering (IOSR-JCE). Volume 18(1), pp. 22–26
Khan, A., Baharudin, B., Lee, L.H., Khan, K., 2010. A Review of Machine Learning Algorithms for Text-Documents Classification. Journal of Advances in Information Technology, Volume 1(1), pp. 4–20
Haynes, K.E., Kulkarni, R., Stough, R.R., Laurie, S., 2010. Exploring a Region Classifier based on Kolmogorov Complexity. School of Public Policy, George Mason University, Available Online at http://ssrn.com/abstract=1499218
Kolmogorov, A.N., 1965. Three Approaches to the Quantitative Definition of Information. Problems of Information Transmission, Volume 1(1), pp. 3–11
Li, L., Mostafa, Y.S.A., 2006. Data Complexity in Machine Learning. Computer Science Technical Reports, 2006.003, California Institute of Technology, Pasadena, USA
May, R., Dandy, G., Maier, H., 2011. Review of Input Variable Selection Methods for Artificial Neural Networks. In: Suzuki, K. (ed.), Artificial Neural Networks: Methodological Advances and Biomedical Applications, University of Adelaide, Australia
Mandal, A.K., Sen, R., 2014. Supervised Learning Methods for Bangla Web Document Categorization. International Journal of Artificial Intelligence & Applications (IJAIA), Volume 5(5), pp. 93–105
Oloyede, A., Fagbola, T., Olabiyisi, S., Omidiora, E., Oladosu, J., 2016. Development of a Modified Local Binary Pattern-Gabor Wavelet Transform Aging Invariant Face Recognition System. In: Proceedings of ACM International Conference on Computing Research & Innovations, Nigeria, pp. 108–114
Oyewole, S.A., Olugbara, O.O., 2017. Product Image Classification using Eigen Colour Feature with Ensemble Machine Learning. Egyptian Informatics Journal, Volume 19(2), pp. 83–100
Revolle, M., le Bihan, N., Cayre, F., 2016. Algorithmic Information Theory for Automatic Classification. Grenoble, France: Gipsa-Lab, Université Grenoble Alpes, France
Birabadar, S., Raikar, M.M., 2017. Performance Analysis of Text Classifiers based on News Articles-A Survey. Indian Journal of Scientific Research, Volume 15(2), pp. 156–161
Lai, S., Xu, L., Liu, K., Zhao, J., 2015. Recurrent Convolutional Neural Networks for Text Classification. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2267–2273
Skjennum, P.L., 2016. Multilingual News Article Classification. Master Thesis, Department of Computer and Information Science, Norwegian University of Science and Technology, pp. 1–125
Van Meeuwen, F., 2013. Multi-label Text Classification of News Articles for ASDMedia. Master Thesis, Department of Information and Computing Sciences, Utrecht University
Wang, Y., Wang, X., 2005. A New Approach to Feature Selection in Text Classification. In: Proceedings of 4th International Conference on Machine Learning and Cybernetics, IEEE, Volume 6, pp. 3814–3819
Yibo, W., Mingming, W., Wei, X., 2018. A Sentiment-Enhanced Hybrid Recommender System for Movie Recommendation: A Big Data Analytics Framework. Wireless Communications and Mobile Computing, Volume 2018, pp. 1–9
Zhang, X., Zhao, J., LeCun, Y., 2016. Character-level Convolutional Networks for Text Classification. Ithaca, NY: Cornell University, arXiv:1509.01626v3