Published at: 22 Sep 2025
Volume: IJtech Vol 16, No 5 (2025)
DOI: https://doi.org/10.14716/ijtech.v16i5.7110
Akbar, S, Ekaputri, AP, Fu, W, Nurdini, RK, Achsien, SM & Sitohang, B 2025, ‘Clustering narrow-domain scientific text using unsupervised and similarity-based approaches’, International Journal of Technology, vol. 16, no. 5, pp. 1467-1483
Saiful Akbar | School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, 40132, Indonesia |
Anindya Prameswari Ekaputri | School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, 40132, Indonesia |
William Fu | School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, 40132, Indonesia |
Rahmah Khoirussyifa’ Nurdini | School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, 40132, Indonesia |
Salman Ma’arif Achsien | School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, 40132, Indonesia |
Benhard Sitohang | School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung, 40132, Indonesia |
Clustering the scientific papers published by an institution’s authors helps identify colleagues with similar interests and reveal research groups within the institution. In this study, we explore unsupervised clustering of scientific text to make the retrieval of similar works more efficient. Clustering papers from a specific domain poses two challenges. First, the list of non-discriminating words (stop words) grows, because more words become common across most documents; for example, a word such as “engineering” loses its discriminating power when most documents come from the engineering field. Second, similar terminology can express different concepts, such as “internet” vs. “internet of things”. To address these challenges, we experimented with various text processing methods, including stemming, lemmatization, technical stop word removal, noun extraction, and n-gram phrase detection. The experiment was conducted on a corpus of faculty publications. Our methodology combined these text processing methods with latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF) topic models to cluster the documents and uncover latent topics within the corpus. The NMF model combined with lemmatization, technical stop word removal, noun extraction, and phrase detection was found to be the best-performing clustering pipeline. This pipeline yielded 11 clusters with the following topic coherence scores: UMass of -2.493, CV of 0.681, NPMI of -0.136, and UCI of -4.491. It also improved the sample accuracy from 71.1% to 80.7% and generalized well to a different dataset. The resulting clusters match our institution’s research groups, such as electrical power engineering, signal processing, and computer vision. Additionally, we provide a curated list of technical stop words that contributed to the effectiveness of our clustering results.
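The two narrow-domain challenges described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy corpus, the phrase list, the helper names `merge_phrases` and `domain_stop_words`, and the document-frequency threshold `max_doc_ratio` are all assumptions made for illustration, and the study itself relies on a curated list of technical stop words rather than a frequency cutoff.

```python
import re
from collections import Counter

# Hypothetical toy corpus; the study uses a corpus of faculty publications.
DOCS = [
    "engineering study of internet of things sensor networks",
    "engineering approach to internet routing protocols",
    "engineering analysis of power grid stability",
    "computer vision for engineering inspection",
]

# Hypothetical phrase list. Joining multi-word terms into one token keeps
# "internet of things" distinct from plain "internet".
PHRASES = ["internet of things", "computer vision", "power grid"]


def merge_phrases(text, phrases):
    """Replace each known phrase with a single underscore-joined token."""
    # Longest phrases first, so a longer phrase is never split by a shorter one.
    for phrase in sorted(phrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, phrase.replace(" ", "_"), text)
    return text


def domain_stop_words(token_lists, max_doc_ratio=0.75):
    """Return tokens that occur in more than max_doc_ratio of the documents."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each token once per document
    n_docs = len(token_lists)
    return {tok for tok, df in doc_freq.items() if df / n_docs > max_doc_ratio}


tokenized = [merge_phrases(doc, PHRASES).split() for doc in DOCS]
stop_words = domain_stop_words(tokenized)
filtered = [[tok for tok in tokens if tok not in stop_words] for tokens in tokenized]

print(stop_words)
print(filtered[0])
```

In this toy corpus only "engineering" appears in more than 75% of the documents, so it is the only token flagged as a domain stop word. A frequency cutoff of this kind can at most suggest candidates; a curated list, as used in the study, still requires manual review.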
Latent Dirichlet allocation; Narrow-domain; Non-negative matrix factorization; Text clustering; Text processing; Topic modelling
Filename | Description |
---|---|
R1-EECE-7110-20240901214205.docx | Supplementary File - DOCX, without revision (no revision is required) |