EVALUATING CLUSTERING METHODS FOR SEMANTIC REPRESENTATION OF DISASTER NEWS USING BERT EMBEDDINGS AND HDBSCAN

Authors

  • Ariska Fitriyana Ningrum Universitas Muhammadiyah Semarang
  • Dannu Purwanto Universitas Muhammadiyah Semarang
  • Abdel Nasser Sharkawy South Valley University

DOI:

https://doi.org/10.33480/jitk.v11i3.7204

Keywords:

Natural Disasters News, Sentence BERT, Text Mining, Text Clustering

Abstract

Natural disasters occur frequently in Indonesia and demand fast, accurate monitoring and analysis of information from online news sources. This study aims to identify topic patterns related to natural disasters in Indonesia using news articles from Detik.com through a semantic clustering approach. A total of 1,000 articles were collected, preprocessed, and represented using the Sentence-BERT (SBERT) model to capture contextual relationships between sentences. The vector representations were then clustered using three methods: K-Means, Agglomerative Hierarchical Clustering, and HDBSCAN. The performance of each method was evaluated using the Silhouette Score, Davies–Bouldin (DB) Index, and Calinski–Harabasz (CH) Index. The results show that HDBSCAN achieved the best Silhouette Score (0.215) and DB Index (1.557), compared with Agglomerative (0.028 and 3.945) and K-Means (0.055 and 3.678). On the CH Index, where higher values indicate better-separated clusters, HDBSCAN scored 18.102, below Agglomerative (29.669) and K-Means (36.778). The HDBSCAN model also achieved the highest coherence score, 0.8669, indicating strong semantic consistency within clusters. Five coherent clusters emerged, representing major disaster themes: landslides, earthquakes, tornadoes, flash floods, and volcanic activity. Word clouds generated for each cluster reinforced the interpretation of these disaster topics. Overall, the combination of SBERT and HDBSCAN effectively groups news articles based on semantic similarity. These findings highlight the potential of Natural Language Processing (NLP) to enhance data-driven media monitoring, support early warning systems, and strengthen disaster communication and mitigation strategies in Indonesia.
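
The three internal validity indices named in the abstract can be made concrete with a minimal sketch. The code below is not the authors' implementation: it applies the standard textbook definitions of the Silhouette Score, DB Index, and CH Index to hypothetical 2-D toy points (standing in for SBERT embeddings), using Euclidean distance and two hand-made labelings, `good` and `bad`, which are assumptions introduced only for illustration.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def group_by_label(X, labels):
    """Map each cluster label to the list of its points."""
    clusters = {}
    for x, lab in zip(X, labels):
        clusters.setdefault(lab, []).append(x)
    return clusters

def silhouette_score(X, labels):
    """Mean of (b - a) / max(a, b) over all points; higher is better."""
    clusters = group_by_label(X, labels)
    total = 0.0
    for x, lab in zip(X, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the same cluster
        a = sum(dist(x, y) for y in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist(x, y) for y in pts) / len(pts)
                for other, pts in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(X)

def davies_bouldin_index(X, labels):
    """Mean over clusters of the worst (s_i + s_j) / d(c_i, c_j); lower is better."""
    clusters = group_by_label(X, labels)
    cents = {lab: centroid(pts) for lab, pts in clusters.items()}
    # s_i: mean distance of a cluster's points to its own centroid
    scatter = {lab: sum(dist(p, cents[lab]) for p in pts) / len(pts)
               for lab, pts in clusters.items()}
    worst = [max((scatter[i] + scatter[j]) / dist(cents[i], cents[j])
                 for j in clusters if j != i)
             for i in clusters]
    return sum(worst) / len(worst)

def calinski_harabasz_index(X, labels):
    """Between-cluster over within-cluster dispersion; higher is better."""
    clusters = group_by_label(X, labels)
    n, k = len(X), len(clusters)
    overall = centroid(X)
    between = within = 0.0
    for pts in clusters.values():
        c = centroid(pts)
        between += len(pts) * dist(c, overall) ** 2
        within += sum(dist(p, c) ** 2 for p in pts)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical toy data: two well-separated groups of three points each.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
good = [0, 0, 0, 1, 1, 1]  # labeling that matches the two groups
bad = [0, 1, 0, 1, 0, 1]   # labeling that mixes the two groups

for name, labels in (("good", good), ("bad", bad)):
    print(name,
          round(silhouette_score(X, labels), 3),
          round(davies_bouldin_index(X, labels), 3),
          round(calinski_harabasz_index(X, labels), 3))
```

On this toy data the labeling that matches the two groups scores higher on the Silhouette Score and CH Index and lower on the DB Index, which is the direction of "better" for each index. Because the three indices measure different things, they need not agree on a ranking of methods for real embeddings.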

Published

2026-02-11

How to Cite

[1] A. F. Ningrum, D. Purwanto, and A. N. Sharkawy, “EVALUATING CLUSTERING METHODS FOR SEMANTIC REPRESENTATION OF DISASTER NEWS USING BERT EMBEDDINGS AND HDBSCAN”, jitk, vol. 11, no. 3, pp. 784–794, Feb. 2026, doi: 10.33480/jitk.v11i3.7204.