News Article Classification using Kolmogorov Complexity Distance Measure and Artificial Neural Network

Temitayo Matthew Fagbola, Colin Surendra Thakur, Oludayo Olugbara

Corresponding email: temitayo.fagbola@gmail.com


Cite this article as:
Fagbola, T.M., Thakur, C.S., Olugbara, O., 2019. News Article Classification using Kolmogorov Complexity Distance Measure and Artificial Neural Network. International Journal of Technology. Volume 10(4), pp. 710-720
Temitayo Matthew Fagbola - KZN eSkills CoLab, Durban University of Technology, Durban 4000, South Africa - Department of Information Technology, Durban University of Technology, Durban, South Africa
Colin Surendra Thakur - KZN eSkills CoLab, Durban University of Technology, Durban 4000, South Africa - Department of Information Technology, Durban University of Technology, Durban, South Africa
Oludayo Olugbara - Department of Information Technology, Durban University of Technology, Durban, South Africa

Abstract
News article classification is a growing area of interest in text classification because a single article often matches multiple categories. However, the weak reliability and ambiguity of the state-of-the-art classifiers commonly employed have limited success in this domain. In addition, classifier performance is highly sensitive to the varying nature of real-world datasets and differs widely across them, which makes comparative evaluation necessary. In this paper, the accuracy and computational time efficiency of the Kolmogorov Complexity Distance Measure (KCDM) and an Artificial Neural Network (ANN) were experimentally evaluated on a prototype high-dimensional news article classification problem. A total of 2,000 news articles from a dataset of 2,225 British Broadcasting Corporation (BBC) news documents (including examples from sport, politics, entertainment, education and technology, and business) were used for categorical testing. Porter’s algorithm was used for word stemming after tokenization and stop-word removal, and a Normalized Term Frequency–Inverse Document Frequency (NTF-IDF) technique was adopted for feature extraction. Experimental results revealed that ANN performed better in terms of accuracy, while the KCDM outperformed ANN in terms of computational time efficiency.

Keywords: Artificial neural network; Kolmogorov complexity distance measure; News article dataset; Text classification

Introduction

In the domain of text classification, news article classification has become an area of significant interest due to the rapidly growing volume of news corpora on the World Wide Web (WWW). However, during classification, news articles often suffer from deep ambiguity because they can match several categories and because most classification systems in use have weak reliability. This often results in the low efficiency and poor performance evident in many current approaches (Kaur & Bajaj, 2016; Birabadar & Raikar, 2017). In recent times, learning systems based on ANN and the KCDM have become increasingly common for classification tasks in high-dimensional problem spaces, including intrinsic plagiarism detection, image and speech recognition, identity and non-parametric testing, risk assessment, cellular automata classification, spam filtering, malicious URL detection, text and music classification, DNA analysis, radar signal classification, EEG classification and e-commerce product classification (Revolle et al., 2016; Oyewole & Olugbara, 2017; Abdalkafor, 2017; Haris et al., 2018). ANN has previously been identified as a good approach for dealing with large text classification problems (Lai et al., 2015). However, due to the complex structure of news article datasets, identifying an efficient and accurate classifier that best fits their classification remains an open problem. This makes it highly challenging to annotate topical news by category in an accurate and time-efficient manner.

Recently, user-aware big data analytic concepts with tags such as “user-assisted classification”, “interactive classification”, “user-aware classification” and “user-centred classification” have become increasingly evident. These refer to learning systems and/or big data analytic techniques that incorporate users’ feedback, reviews, ratings and personalized opinions into the classification process to improve the quality of classification decisions in an automated or semi-automated fashion (Donkers et al., 2018). Examples include personalized and sentiment-enhanced recommender systems (Yibo et al., 2018). However, this approach is best suited to unstructured data analysis (Donkers et al., 2018).

In the present work, the general aim was to compare the accuracy and time efficiency of the Kolmogorov Complexity Distance Measure (KCDM) and an Artificial Neural Network (ANN) in solving a prototype news article classification problem. In the experiments conducted, ANN and the KCDM were implemented in the Microsoft Visual C# language. A total of 2,000 news articles were obtained from the publicly available BBC News article dataset. These were pre-processed using Porter’s algorithm after tokenization and stop-word removal. An NTF-IDF technique was used to extract and select relevant features before training and classification with the KCDM and ANN. The rest of this paper is organized as follows: in Section 2, relevant literature on multi-labelled text classification, ANN and the KCDM is discussed. In Section 3, the materials and method are presented. Section 4 discusses the results obtained, while Section 5 presents the conclusions and future directions. The major contributions of this work include:

  1. Development of a classification method for a large corpus of news articles using ANN and the KCDM by combining Porter’s algorithm with an NTF-IDF technique.
  2. Experimental comparison of the performance of ANN and the KCDM on news article classification using accuracy and computational time efficiency as evaluation metrics.
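
The pre-processing and feature extraction steps described above can be sketched roughly as follows. This is a minimal Python illustration, not the authors' C# implementation. The stop-word list, the function names `tokenize` and `ntf_idf`, and the exact weighting (term frequency divided by the largest count in the document, idf = log(N/df), then scaling to unit length) are all assumptions, since this section does not spell out the NTF-IDF formula. Porter stemming is omitted to keep the sketch dependency-free; in practice a stemmer would be applied to each token before weighting.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; not the list used in the paper).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for"}


def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def ntf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed weighting: tf = count / max count in the document,
    idf = log(N / df), and each non-zero vector is scaled to unit Euclidean length.
    """
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        max_count = max(counts.values(), default=1)
        weights = {
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {term: w / norm for term, w in weights.items()}
        vectors.append(weights)
    return vectors
```

Under this weighting, a term that occurs in every document receives a weight of zero, while terms concentrated in a few documents dominate the vector, which is the behaviour a classifier needs to separate categories.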

Conclusion

In this paper, a performance comparison between two methods (ANN and the KCDM) for addressing the news article classification problem was conducted. The experimental results revealed that ANN was better in terms of accuracy, while the KCDM was better suited to developing time-efficient applications. In summary, this paper establishes the importance of conducting performance evaluation as a core part of choosing the best test routines during development, to ensure high overall reliability for deployed applications. Furthermore, this process can help identify the trade-offs associated with each algorithm and inform decisions on which algorithm to apply to a specific problem domain of interest, especially when developing fault-tolerant systems.

The major findings of our experiments are: (1) ANN can produce higher classification accuracy for large datasets than the KCDM. In all the experiments conducted, ANN yielded more true positives than the KCDM; (2) The time efficiency of ANN was very low compared to that of the KCDM. In the experiments, as the size of the testing set grew, the classification time of ANN also increased. With a testing set containing 1300 features, the classification time spent by ANN was approximately eight times that of the KCDM.
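
To make the compression-based side of this comparison concrete, the sketch below shows one common, computable approximation of a Kolmogorov complexity distance: the normalized compression distance, with zlib standing in for the uncomputable complexity K(x). This section does not state the exact form of the KCDM used in the experiments, so the formula, the choice of zlib, and the helper names `compressed_size`, `ncd` and `classify` are assumptions made for illustration only.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a computable stand-in for K(x)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size and xy is the concatenation.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(article: str, labelled_examples):
    """Return the label of the labelled example closest to the article under NCD."""
    best_label, _ = min(
        ((label, ncd(article, text)) for label, text in labelled_examples),
        key=lambda pair: pair[1],
    )
    return best_label
```

A classifier of this kind needs no training phase: each test article is compressed alongside labelled reference texts and assigned the label of the nearest one. Whether that accounts for the timing gap reported above is not something this sketch can establish.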

In future work, emerging and other baseline classifiers such as AdaBoost, SVM, naïve Bayes and k-nearest neighbour could also be evaluated for news article classification in a large multi-dimensional feature space. In addition, an ensemble of ANN and the KCDM could be developed to realize an algorithm with improved classification accuracy and time efficiency.

References

Abdalkafor, A.S., 2017. Designing Offline Arabic Handwritten Isolated Character Recognition System using Artificial Neural Network Approach. International Journal of Technology, Volume 8(3), pp. 528–538

Cilibrasi, R.L., Vitanyi, P.M.B., 2007. The Google Similarity Distance. IEEE Transactions on Knowledge and Data Engineering, Volume 19(3), pp. 370–383

Chan, C., Sun, A., Lim, E.P., 2001. Automated Online News Classification with Personalization. In: Proceedings of the 4th International Conference of Asian Digital Library (ICADL2001), pp. 320–329

Donkers, T., Loepp, B., Ziegler, J., 2018. Explaining Recommendations by Means of User Reviews. In: Joint Proceedings of the ACM IUI 2018 Workshops, March 11, Tokyo, Japan, pp. 1–4

Fagbola, T.M., Olabiyisi, S.O., Egbetola, F.I., Oloyede, A., 2017. Review of Technical Approaches to Face Recognition in Unconstrained Scenes with Varying Pose and Illumination. Federal University, Oye-Ekiti (FUOYE). Journal of Engineering and Technology, Volume 2(1), pp. 1–8

Fagbola, T., Olabiyisi S., Adigun A., 2012. Hybrid GA-SVM for Efficient Feature Selection in E-Mail Classification. Journal of Computer Engineering and Intelligent Systems, Volume 3(3), pp. 17–28

Greene, D., Cunningham, P., 2006. Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 377–384

Haris, A., Murdianto, B., Susattyo, R., Riyanto, A., 2018. Transforming Seismic Data into Lateral Sonic Properties using Artificial Neural Network: A Case Study of Real Data Set. International Journal of Technology. Volume 9(3), pp. 472–478

Huang, Y., 2009. Advances in Artificial Neural Networks-Methodological Development and Application. Algorithms, Volume 2(3), pp. 973–1007

Joho, H., Sanderson, M., 2007. Document Frequency and Term Specificity. In: Large Scale Semantic Access to Content (Text, Image, Video, and Sound), Paris, France: Le Centre de Hautes Etudes Internationales D'informatique Documentaire, pp. 350–359

Kaur, G., Bajaj, K., 2016. News Classification and its Techniques: A Review. IOSR Journal of Computer Engineering (IOSR-JCE). Volume 18(1), pp. 22–26

Khan, A., Baharudin, B., Lee, L.H., Khan, K., 2010. A Review of Machine Learning Algorithms for Text-Documents Classification. Journal of Advances in Information Technology, Volume 1(1), pp. 4–20

Haynes, K.E., Kulkarni, R., Stough, R.R., Laurie, S., 2010. Exploring a Region Classifier based on Kolmogorov Complexity. School of Public Policy, George Mason University, Available Online at http://ssrn.com/abstract=1499218

Kolmogorov, A.N., 1965. Three Approaches to the Quantitative Definition of Information. Problems of Information Transmission, Volume 1(1), pp. 3–11

Li, L., Mostafa, Y.S.A., 2006. Data Complexity in Machine Learning. Computer Science Technical Reports, 2006.003, California Institute of Technology, Pasadena, USA

May, R., Dandy, G., Maier, H., 2011. Review of Input Variable Selection Methods for Artificial Neural Networks. Suzuki, K. Edition, Artificial Neural Networks Methodological Advances and Biomedical Applications, University of Adelaide, Australia

Mandal, A.K., Sen, R., 2014. Supervised Learning Methods for Bangla Web Document Categorization. International Journal of Artificial Intelligence & Applications (IJAIA), Volume 5(5), pp. 93–105

Oloyede, A., Fagbola, T., Olabiyisi, S., Omidiora, E., Oladosu, J., 2016. Development of a Modified Local Binary Pattern-Gabor Wavelet Transform Aging Invariant Face Recognition System. In: Proceedings of ACM International Conference on Computing Research & Innovations, Nigeria, pp. 108–114

Oyewole, S.A., Olugbara, O.O., 2017. Product Image Classification using Eigen Colour Feature with Ensemble Machine Learning. Egyptian Informatics Journal, Volume 19(2), pp. 83–100

Revolle, M., le Bihan, N., Cayre, F., 2016. Algorithmic Information Theory for Automatic Classification. Grenoble, France: Gipsa-Lab, Université Grenoble Alpes

Birabadar, S., Raikar, M.M., 2017. Performance Analysis of Text Classifiers based on News Articles-A Survey. Indian Journal of Scientific Research, Volume 15(2), pp. 156–161

Lai, S., Xu, Liheng, Liu, K., Zhao, J., 2015. Recurrent Convolutional Neural Networks for Text Classification. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2267–2273

Skjennum, P.L., 2016. Multilingual News Article Classification. Master Thesis, Department of Computer and Information Science, Norwegian University of Science and Technology, pp. 1–125

Van Meeuwen, F., 2013. Multi-label Text Classification of News Articles for ASDMedia, Master Thesis, Department of Information and Computing Sciences, Utrecht University

Wang, Y., Wang, X., 2005. A New Approach to Feature Selection in Text Classification. In: Proceedings of 4th International Conference on Machine Learning and Cybernetics, IEEE, Volume 6, pp. 3814–3819

Yibo, W., Mingming, W., Wei, X., 2018. A Sentiment-Enhanced Hybrid Recommender System for Movie Recommendation: A Big Data Analytics Framework. Wireless Communications and Mobile Computing, Volume 2018, pp. 1–9

Zhang, X., Zhao, J., LeCun, Y., 2016. Character-level Convolutional Networks for Text Classification. Ithaca, NY: Cornell University, arXiv:1509.01626v3