Published: 29 Jul 2019
Volume: IJTech, Vol 10, No 4 (2019)
DOI: https://doi.org/10.14716/ijtech.v10i4.2339
Temitayo Matthew Fagbola | KZN eSkills CoLab, Durban University of Technology, Durban 4000, South Africa; Department of Information Technology, Durban University of Technology, Durban, South Africa
Colin Surendra Thakur | KZN eSkills CoLab, Durban University of Technology, Durban 4000, South Africa; Department of Information Technology, Durban University of Technology, Durban, South Africa
Oludayo Olugbara | Department of Information Technology, Durban University of Technology, Durban, South Africa
News article classification is a growing area of interest in text classification because a single article can match multiple categories. However, the weak reliability indices and ambiguities associated with the state-of-the-art classifiers commonly employed have limited success in this domain. In addition, classifier performance is highly sensitive to the varying nature of real-world datasets and differs widely across them, which makes comparative evaluation necessary. In this paper, the accuracy and computational time efficiency of the Kolmogorov Complexity Distance Measure (KCDM) and an Artificial Neural Network (ANN) were experimentally evaluated on a prototype high-dimensional news article classification problem. A total of 2000 news articles from a dataset of 2225 British Broadcasting Corporation (BBC) news documents (including examples from sport, politics, entertainment, education and technology, and business) were used for categorical testing. Porter’s algorithm was used for word stemming after tokenization and stop-word removal, and a Normalized Term Frequency–Inverse Document Frequency (NTF-IDF) technique was adopted for feature extraction. Experimental results revealed that ANN performed better in terms of accuracy, while the KCDM performed better in terms of computational time efficiency.
Artificial neural network; Kolmogorov complexity distance measure; News article dataset; Text classification
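The preprocessing pipeline named in the abstract (tokenization, stop-word removal, stemming, then NTF-IDF weighting) can be illustrated with a minimal Python sketch. It is not the authors' C# implementation. Several details are assumptions, because this text does not specify them: the stop-word list is a tiny illustrative one, `crude_stem` is a simple suffix stripper standing in for Porter's algorithm, and "normalized" is interpreted here as term counts divided by the most frequent term's count, followed by scaling each document vector to unit length. All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption: the paper's list is not given).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to",
              "is", "are", "was", "were", "for", "with"}


def crude_stem(word):
    """Stand-in for Porter's algorithm: strips a few common suffixes only."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lower-case and tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def ntf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: (count / max count in document) * log(N / document
    frequency), with each document vector then scaled to unit L2 length.
    """
    tokenised = [preprocess(d) for d in docs]
    n_docs = len(tokenised)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for toks in tokenised for term in set(toks))
    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        max_count = max(counts.values(), default=1)
        weights = {
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: (w / norm if norm else 0.0)
                        for t, w in weights.items()})
    return vectors
```

Under this weighting, a term that occurs in every document receives a weight of zero, which is how common but uninformative words are suppressed before classification.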
In the domain of text classification, news article classification has become an area of significant interest due to the rapidly growing volume of news on the World Wide Web (WWW). However, during classification, news articles often suffer from deep ambiguity because they match several categories and because most classification systems in use have weak reliability. These problems often result in the low efficiency and poor performance evident in many current approaches (Kaur & Bajaj, 2016; Birabadar & Raikar, 2017). In recent times, learning systems based on ANN and the KCDM have increasingly been applied to classification tasks in high-dimensional problem spaces, including intrinsic plagiarism detection, image and speech recognition, identity and non-parametric testing, risk assessment, cellular automata classification, spam filtering, malicious URL detection, text and music classification, DNA analysis, radar signal classification, EEG classification and e-commerce product classification, among others (Revolle et al., 2016; Oyewole & Olugbara, 2017; Abdalkafor, 2017; Haris et al., 2018). ANN has previously been identified as a good approach for dealing with large text classification problems (Lai et al., 2015). However, due to the complex structure of news article datasets, identifying an efficient and accurate classifier that best fits their classification remains an open problem. This makes it highly challenging to annotate topical news by category in an accurate and time-efficient manner.
Recently, user-aware big data analytic concepts with tags like “user-assisted classification”, “interactive classification”, “user-aware classification” and “user-centred classification” have been emerging. These describe learning systems and/or big data analytic techniques that incorporate user feedback, reviews, ratings and personalized opinions into their classification process to improve the quality of classification decisions in an automated or semi-automated fashion (Donkers et al., 2018). Examples include personalized and sentiment-enhanced recommender systems (Yibo et al., 2018). However, this approach is best suited for unstructured data analysis (Donkers et al., 2018).
In the present work, the general aim was to compare the accuracy and time efficiency of the Kolmogorov Complexity Distance Measure (KCDM) and an Artificial Neural Network (ANN) in solving a prototype news article classification problem. In the experiments conducted, ANN and the KCDM were implemented in Microsoft Visual C#. A total of 2000 news articles were obtained from the publicly available BBC News article dataset. These were pre-processed using Porter’s algorithm after tokenization and stop-word removal. An NTF-IDF technique was used to extract and select relevant features before training and classification with the KCDM and ANN. The rest of this paper is organized as follows: in Section 2, relevant literature on multi-labelled text classification, ANN and the KCDM is discussed. In Section 3, materials and methods are presented. Section 4 discusses the results obtained, while Section 5 presents the conclusions with future directions. The major contributions of this work include:
In this paper, a performance comparison of two methods (ANN and the KCDM) for the news article classification problem was conducted. The experimental results revealed that ANN was better in terms of accuracy, while the KCDM was better suited to developing time-efficient applications. In summary, this paper establishes the importance of performance evaluation as a core part of choosing the best test routines during development, to ensure high overall reliability for deployed applications. Furthermore, this process can help identify the trade-offs associated with each algorithm and inform decisions on which algorithm to apply to a specific problem domain of interest, especially when developing fault-tolerant systems.
The major findings of our experiments are:
(1) ANN can produce higher classification accuracy than the KCDM for large datasets. In all the experiments conducted, ANN yielded more true positives than the KCDM; (2) The time efficiency of ANN was very low compared to the KCDM. In the experiments, ANN's classification time also increased as the size of the testing set grew. With a testing set containing 1300 features, the classification time of ANN was approximately eight times that of the KCDM.
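To make the accuracy and timing comparison concrete, the following minimal Python sketch shows one way a compression-based distance classifier could be built and timed. It is an illustration under stated assumptions, not the authors' method: the exact KCDM formulation and the C# implementation are not reproduced in this text, so the sketch substitutes the normalized compression distance with `zlib` as the compressor (a common computable proxy for Kolmogorov complexity) and a nearest-example decision rule. The function names are hypothetical.

```python
import time
import zlib


def compressed_size(text):
    """Byte length of the zlib-compressed text (proxy for Kolmogorov complexity)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x, y):
    """Normalized compression distance between two strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(document, labelled_examples):
    """Return the label of the labelled example closest to `document` under NCD.

    `labelled_examples` is a list of (example_text, label) pairs.
    """
    best_label, _ = min(
        ((label, ncd(document, example)) for example, label in labelled_examples),
        key=lambda pair: pair[1],
    )
    return best_label


def timed_accuracy(test_set, labelled_examples):
    """Classify every (text, label) pair in `test_set`.

    Returns (fraction classified correctly, elapsed seconds).
    """
    start = time.perf_counter()
    correct = sum(
        classify(doc, labelled_examples) == label for doc, label in test_set
    )
    elapsed = time.perf_counter() - start
    return correct / len(test_set), elapsed
```

Calling `timed_accuracy` with the same test set for two different classifiers gives the pair of figures (accuracy and classification time) on which a comparison of this kind rests. Absolute timings depend on the machine and the compressor, so only relative values measured on the same setup are meaningful.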
In future work, some emerging and other baseline classifiers such as AdaBoost, SVM, naïve Bayes and k-nearest neighbour could also be evaluated for news article classification in large multi-dimensional feature spaces. In addition, an ensemble of ANN and the KCDM could be developed to realize an algorithm with improved classification accuracy and time efficiency.
Abdalkafor, A.S., 2017.
Designing Offline Arabic Handwritten Isolated Character Recognition System
using Artificial Neural Network Approach. International Journal of
Technology, Volume 8(3), pp. 528–538
Cilibrasi, R.L., Vitanyi, P.M.B., 2007. The Google Similarity Distance. IEEE Transactions on Knowledge and Data Engineering, Volume 19(3), pp. 370–383
Chan, C., Sun, A., Lim, E.P., 2001. Automated Online
News Classification with Personalization. In:
Proceedings of the 4th International Conference of Asian Digital Library
(ICADL2001), pp. 320–329
Donkers, T., Loepp, B., Ziegler, J., 2018. Explaining Recommendations by Means
of User Reviews. In: Joint Proceedings of the ACM IUI 2018 Workshops, March
11, Tokyo, Japan, pp. 1–4
Fagbola, T.M., Olabiyisi, S.O., Egbetola, F.I., Oloyede, A., 2017. Review of Technical Approaches to Face
Recognition in Unconstrained Scenes with Varying Pose and Illumination. Federal
University, Oye-Ekiti (FUOYE). Journal of Engineering and Technology, Volume 2(1), pp. 1–8
Fagbola, T., Olabiyisi, S., Adigun, A., 2012. Hybrid GA-SVM for Efficient Feature Selection in E-Mail Classification. Journal of Computer Engineering and Intelligent Systems, Volume 3(3), pp. 17–28
Greene, D., Cunningham, P.,
2006. Practical Solutions to the Problem of Diagonal Dominance in Kernel
Document Clustering. In: Proceedings of the 23rd
International Conference on Machine Learning, pp. 377–384
Haris, A., Murdianto, B., Susattyo, R., Riyanto, A., 2018. Transforming Seismic Data into Lateral Sonic Properties using Artificial Neural Network: A Case Study of Real Data Set. International Journal of Technology, Volume 9(3), pp. 472–478
Huang, Y., 2009. Advances in Artificial Neural
Networks-Methodological Development and Application. Algorithms, Volume 2(3), pp. 973–1007
Joho, H., Sanderson, M., 2007. Document Frequency and Term Specificity. In: Large Scale Semantic Access to Content (Text, Image, Video, and Sound), Paris, France: Le Centre de Hautes Etudes Internationales D'informatique Documentaire, pp. 350–359
Kaur, G., Bajaj, K., 2016. News Classification and its Techniques: A Review. IOSR Journal of Computer Engineering (IOSR-JCE), Volume 18(1), pp. 22–26
Khan, A., Baharudin, B., Lee, L.H., Khan, K., 2010. A
Review of Machine Learning Algorithms for Text-Documents Classification. Journal of Advances in Information
Technology, Volume 1(1), pp. 4–20
Haynes, K.E., Kulkarni, R.,
Stough, R.R., Laurie, S., 2010. Exploring
a Region Classifier based on Kolmogorov Complexity. School of Public Policy, George Mason University, Available
Online at http://ssrn.com/abstract=1499218
Kolmogorov, A.N., 1965. Three Approaches to the
Quantitative Definition of Information. Problems of Information Transmission,
Volume 1(1), pp. 3–11
Li, L., Mostafa, Y.S.A., 2006. Data Complexity in Machine Learning. Computer Science Technical Reports, 2006.003, California Institute of Technology, Pasadena, USA
May, R., Dandy, G., Maier, H., 2011. Review of Input Variable Selection Methods for Artificial Neural Networks. In: Suzuki, K. (ed.), Artificial Neural Networks: Methodological Advances and Biomedical Applications, University of Adelaide, Australia
Mandal, A.K., Sen, R., 2014. Supervised Learning
Methods for Bangla Web Document Categorization. International Journal of Artificial Intelligence & Applications
(IJAIA), Volume 5(5), pp. 93–105
Oloyede, A., Fagbola, T., Olabiyisi, S., Omidiora, E., Oladosu, J., 2016. Development of a
Modified Local Binary Pattern-Gabor Wavelet Transform Aging Invariant Face
Recognition System. In: Proceedings of ACM International Conference on
Computing Research & Innovations, Nigeria, pp. 108–114
Oyewole, S.A., Olugbara, O.O., 2017. Product Image
Classification using Eigen Colour Feature with Ensemble Machine Learning. Egyptian Informatics Journal, Volume 19(2), pp. 83–100
Revolle, M., le Bihan, N., Cayre, F., 2016. Algorithmic Information Theory for Automatic Classification. Grenoble, France: Gipsa-Lab, Université Grenoble Alpes, France
Birabadar, S., Raikar, M.M.,
2017. Performance Analysis of Text Classifiers based on News Articles-A Survey.
Indian Journal of Scientific Research,
Volume 15(2), pp. 156–161
Lai, S., Xu, L., Liu, K., Zhao, J., 2015. Recurrent Convolutional Neural Networks for Text Classification. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2267–2273
Skjennum, P.L., 2016.
Multilingual
News Article Classification. Master
Thesis, Department of Computer and Information Science, Norwegian
University of Science and Technology, pp. 1–125
Van Meeuwen, F., 2013. Multi-label Text Classification
of News Articles for ASDMedia, Master
Thesis, Department of Information and Computing Sciences, Utrecht
University
Wang, Y., Wang, X., 2005. A New Approach to Feature Selection
in Text Classification. In: Proceedings
of 4th International Conference on Machine Learning and Cybernetics,
IEEE, Volume 6, pp. 3814–3819
Yibo, W., Mingming, W., Wei, X., 2018. A Sentiment-Enhanced Hybrid Recommender System for Movie Recommendation: A Big Data Analytics Framework. Wireless Communications and Mobile Computing, Volume 2018, pp. 1–9
Zhang, X., Zhao, J., LeCun, Y., 2016. Character-level Convolutional Networks for Text Classification. Ithaca, NY: Cornell University, arXiv:1509.01626v3