News Article Classification using Kolmogorov Complexity Distance Measure and Artificial Neural Network

Temitayo Matthew Fagbola, Colin Surendra Thakur, Oludayo Olugbara

Fagbola, T.M., Thakur, C.S., Olugbara, O., 2019. News Article Classification using Kolmogorov Complexity Distance Measure and Artificial Neural Network. International Journal of Technology. Volume 10(4), pp. 710-720

Temitayo Matthew Fagbola - Department of Information Technology, Durban University of Technology, Durban, South Africa
Colin Surendra Thakur - Department of Information Technology, Durban University of Technology, Durban, South Africa
Oludayo Olugbara - Department of Information Technology, Durban University of Technology, Durban, South Africa

News article classification is a growing area of interest in text classification because a single article can match multiple categories. However, the weak reliability and the ambiguities associated with commonly used state-of-the-art classifiers have limited success in this domain. In addition, classifier performance is highly sensitive to the varying nature of real-world datasets and differs widely across them, which makes comparative evaluation necessary. In this paper, the accuracy and computational time efficiency of the Kolmogorov Complexity Distance Measure (KCDM) and an Artificial Neural Network (ANN) were experimentally evaluated on a prototype high-dimensional news article classification problem. A total of 2000 news articles from a dataset of 2225 British Broadcasting Corporation (BBC) news documents (including examples from sport, politics, entertainment, education and technology, and business) were used for categorical testing. Porter’s algorithm was used for word stemming after tokenization and stop-word removal, and a Normalized Term Frequency–Inverse Document Frequency (NTF-IDF) technique was adopted for feature extraction. Experimental results revealed that the ANN performed better in terms of accuracy, while the KCDM performed better in terms of computational time efficiency.

Artificial neural network; Kolmogorov complexity distance measure; News article dataset; Text classification


In the domain of text classification, news article classification has become an area of significant interest because of the rapidly growing volume of news content on the World Wide Web (WWW). However, news articles are often deeply ambiguous to classify, both because they can match several categories and because most classification systems in use are not sufficiently reliable. This has resulted in the low efficiency and poor performance evident in many current approaches (Kaur & Bajaj, 2016; Birabadar & Raikar, 2017). In recent times, learning systems based on ANN and KCDM have increasingly been applied to classification tasks in high-dimensional problem spaces, including intrinsic plagiarism detection, image and speech recognition, identity and non-parametric testing, risk assessment, cellular automata classification, spam filtering, malicious URL detection, text and music classification, DNA analysis, radar signal classification, EEG classification and e-commerce product classification (Revolle et al., 2016; Oyewole & Olugbara, 2017; Abdalkafor, 2017; Haris et al., 2018). In particular, ANN has previously been identified as a good approach for large text classification problems (Lai et al., 2015). However, because of the complex structure of news article datasets, identifying an efficient and accurate classifier that best fits them remains an open problem. This makes it highly challenging to annotate topical news by category in an accurate and time-efficient manner.

Recently, user-aware big data analytics concepts with tags such as “user-assisted classification”, “interactive classification”, “user-aware classification” and “user-centred classification” have become increasingly evident. These tags refer to learning systems and/or big data analytic techniques that incorporate user feedback, reviews, ratings and personalized opinions into the classification process to improve the quality of classification decisions in an automated or semi-automated fashion (Donkers et al., 2018). Examples of this type of approach include personalized and sentiment-enhanced recommender systems (Yibo et al., 2018). However, this approach is best suited to unstructured data analysis (Donkers et al., 2018).

In the present work, the general aim was to compare the accuracy and time efficiency of the Kolmogorov Complexity Distance Measure (KCDM) and an Artificial Neural Network (ANN) on a prototype news article classification problem. In the experiments conducted, the ANN and the KCDM were implemented in Microsoft Visual C#. A total of 2000 news articles were obtained from the publicly available BBC News article dataset. These were pre-processed by tokenization and stop-word removal, followed by stemming with Porter’s algorithm. An NTF-IDF technique was used to extract and select relevant features before training and classification with the KCDM and ANN. The rest of this paper is organized as follows: in Section 2, relevant literature on multi-labelled text classification, ANN and the KCDM is discussed. In Section 3, the materials and methods are presented. Section 4 discusses the results obtained, while Section 5 presents the conclusions and future directions. The major contributions of this work include:

  1. Development of a classification method for a large corpus of news articles using ANN and the KCDM by combining Porter’s algorithm with an NTF-IDF technique.
  2. Experimental comparison of the performance of ANN and the KCDM on news article classification using accuracy and computational time efficiency as evaluation metrics.
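
The preprocessing and feature-extraction steps described above can be illustrated with a minimal Python sketch (the original experiments used Visual C#, and this is not the authors' code). Several details are assumptions made only for illustration: the tiny stop-word list, the `stem` hook (an identity function here, where a Porter stemmer would be plugged in), and the exact NTF-IDF weighting, which is taken to be the term count divided by the largest term count in the document, multiplied by log(N / df). This section does not state the formula actually used.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "on"}


def tokenize(text, stem=lambda w: w):
    """Lowercase, keep alphabetic tokens, drop stop words, then apply `stem`.

    `stem` is a hypothetical hook: pass a Porter stemmer here to mirror the
    pipeline in the text. The default leaves words unchanged.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def ntf_idf(docs, stem=lambda w: w):
    """Return one {term: weight} dict per document.

    Assumed weighting: tf is the term count divided by the largest term
    count in the same document, and idf = log(N / df).
    """
    tokenized = [tokenize(d, stem) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        max_count = max(counts.values(), default=1)
        vectors.append({
            term: (count / max_count) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    "The striker scored in the final match",
    "The minister announced the final budget",
]
vectors = ntf_idf(docs)
```

Under this assumed weighting, a term that occurs in every document (here "final") receives a weight of zero, so only category-specific terms contribute to the feature vector.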


In this paper, a performance comparison between two methods (ANN and the KCDM) for addressing the news article classification problem was conducted. The experimental results revealed that ANN was better in terms of accuracy, while the KCDM was better suited to developing time-efficient applications. In summary, this paper establishes the importance of performance evaluation as a core part of choosing the best test routines during development, to ensure high overall reliability of deployed applications. Furthermore, this process can help identify the trade-offs associated with each algorithm and support decisions on which algorithm to apply to a specific problem domain, especially when developing fault-tolerant systems.

The major findings of our experiments are: (1) ANN can produce higher classification accuracy than the KCDM on large datasets. In all the experiments conducted, ANN yielded more true positives than the KCDM; (2) The time efficiency of ANN was very low compared with that of the KCDM. In the experiments, the classification time of ANN increased as the size of the testing set grew. With a testing set containing 1300 features, the classification time spent by ANN was approximately eight times that of the KCDM.
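
To make the two evaluation metrics concrete, the following Python sketch shows how accuracy and wall-clock classification time might be measured for a compression-based distance classifier. It is only an illustration under stated assumptions: this section does not give the KCDM formula, so the sketch substitutes the normalized compression distance (NCD), a widely used computable stand-in for Kolmogorov-complexity-based distance, with `zlib` as the compressor and a nearest-neighbour decision rule. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import time
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, used as a complexity estimate."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + b" " + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(text, labelled):
    """Return the label of the labelled example closest to `text` under NCD."""
    return min(labelled, key=lambda pair: ncd(text, pair[0]))[1]


def evaluate(test_set, labelled):
    """Return (accuracy, elapsed seconds) for classifying every test item."""
    start = time.perf_counter()
    predictions = [classify(text, labelled) for text, _ in test_set]
    elapsed = time.perf_counter() - start
    correct = sum(p == label for p, (_, label) in zip(predictions, test_set))
    return correct / len(test_set), elapsed


# Toy data for illustration only (not from the BBC dataset).
labelled = [
    ("The striker scored twice as the home side won the cup final", "sport"),
    ("The chancellor defended the budget in a heated parliament debate", "politics"),
]
test_set = [
    ("The home side won the cup final after the striker scored", "sport"),
    ("Parliament held a heated debate after the chancellor's budget", "politics"),
]
accuracy, seconds = evaluate(test_set, labelled)
```

The same `evaluate` harness could wrap any other classifier that exposes a comparable `classify` function, which is one way to obtain accuracy and timing figures on an equal footing.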

In future work, emerging and other baseline classifiers such as AdaBoost, SVM, naïve Bayes and k-nearest neighbour could also be evaluated for news article classification in large multi-dimensional feature spaces. In addition, an ensemble of ANN and KCDM could be developed to obtain an algorithm with improved classification accuracy and time efficiency.


