Clustering Narrow-Domain Scientific Text Using Unsupervised and Similarity-Based Approaches

Authors and Affiliations

Saiful Akbar, Anindya Prameswari Ekaputri, William Fu, Rahmah Khoirussyifa’ Nurdini, Salman Ma’arif Achsien, Benhard Sitohang

Corresponding email: saiful@itb.ac.id

Published: 22 Sep 2025
Volume: IJtech Vol 16, No 5 (2025)
DOI: https://doi.org/10.14716/ijtech.v16i5.7110

Cite this article as:

Akbar, S, Ekaputri, AP, Fu, W, Nurdini, RK, Achsien, SM & Sitohang, B 2025, ‘Clustering narrow-domain scientific text using unsupervised and similarity-based approaches’, International Journal of Technology, vol. 16, no. 5, pp. 1467-1483

Saiful Akbar	School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, 40132, Indonesia
Anindya Prameswari Ekaputri	School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, 40132, Indonesia
William Fu	School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, 40132, Indonesia
Rahmah Khoirussyifa’ Nurdini	School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, 40132, Indonesia
Salman Ma’arif Achsien	School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, 40132, Indonesia
Benhard Sitohang	School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, 40132, Indonesia

Abstract

Clustering the scientific papers published by an institution’s authors helps reveal fellow authors with similar interests and the research groups within the institution. In this study, we explore unsupervised clustering of scientific text to make the retrieval of similar works more efficient. Clustering papers from a specific domain poses two challenges. First, the list of non-discriminating words (stop words) grows, because more words become common to most of the documents; for example, a word such as "engineering" loses its discriminating power when most documents come from the engineering field. Second, similar terminology is used to express different concepts, such as "internet" vs. "internet of things". To address these challenges, we experimented with various text processing methods, including stemming, lemmatization, technical stop word removal, noun extraction, and n-gram phrase detection. The experiment was conducted on a corpus of faculty publications. Our methodology combined these text processing methods with latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF) topic models to cluster the documents and uncover latent topics within the corpus. The NMF model combined with lemmatization, technical stop word removal, noun extraction, and phrase detection was determined to be the optimal clustering pipeline. The pipeline yielded 11 clusters with the following evaluation scores: UMass of -2.493, CV of 0.681, NPMI of -0.136, and UCI of -4.491. It also improved the sample accuracy from 71.1% to 80.7% and generalized well to a different dataset. The resulting clusters fit our institution’s research groups, such as electrical power engineering, signal processing, and computer vision. Additionally, we provide a curated list of technical stop words that contributed to the effectiveness of our clustering results.
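
The kind of pipeline the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: the stop word list, the phrase table, the toy corpus, and all function names are hypothetical, lemmatization and noun extraction are omitted because they need an NLP library, and no coherence scores are computed. The sketch only shows how technical stop word removal and phrase detection feed a TF-IDF matrix, how NMF (here with the multiplicative updates of Lee and Seung) factorizes it, and how each document is assigned to its strongest topic.

```python
import math
import random
from collections import Counter

# Hypothetical examples only; the paper's curated stop word list is not reproduced here.
TECH_STOP_WORDS = {"engineering", "method", "study", "the", "of", "for",
                   "a", "in", "and", "using"}
PHRASES = {("internet", "of", "things"): "internet_of_things"}

def preprocess(text):
    """Merge known phrases into single tokens, then drop technical stop words."""
    tokens = text.lower().split()
    merged, i = [], 0
    while i < len(tokens):
        for phrase, joined in PHRASES.items():
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                merged.append(joined)
                i += len(phrase)
                break
        else:  # no phrase starts at position i
            merged.append(tokens[i])
            i += 1
    return [t for t in merged if t not in TECH_STOP_WORDS]

def tfidf_matrix(docs):
    """Build a dense document-term matrix with tf * (log(n / df) + 1) weights."""
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    rows = []
    for d in docs:
        tf = Counter(d)
        rows.append([tf[t] * (math.log(n / df[t]) + 1.0) for t in vocab])
    return rows, vocab

def transpose(a):
    return [list(r) for r in zip(*a)]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factorize V (docs x terms) into non-negative W (docs x k) and H (k x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h

def cluster(texts, k):
    """Assign each document to the topic with the largest weight in its row of W."""
    docs = [preprocess(t) for t in texts]
    v, vocab = tfidf_matrix(docs)
    w, h = nmf(v, k)
    labels = [max(range(k), key=lambda c: row[c]) for row in w]
    top_terms = [[vocab[j] for j in
                  sorted(range(len(vocab)), key=lambda j: -topic[j])[:3]]
                 for topic in h]
    return labels, top_terms, w

corpus = [  # toy documents, invented for illustration
    "power system stability and power transformer fault detection",
    "transformer fault diagnosis in a power distribution system",
    "image segmentation and object detection using computer vision",
    "computer vision method for image classification",
]
labels, top_terms, w = cluster(corpus, k=2)
print(labels)
print(top_terms)
```

In this sketch, `preprocess("engineering of the internet of things")` keeps only the merged token `internet_of_things`, which is how phrase detection keeps that concept distinct from plain "internet" once the domain-common word "engineering" has been removed.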

Keywords

Latent Dirichlet allocation; Narrow-domain; Non-negative matrix factorization; Text clustering; Text processing; Topic modelling

Supplementary Material

Filename	Description
R1-EECE-7110-20240901214205.docx	Supplementary File - DOCX, without revision (no revision is required)
