
Published by:
Lembaga Penelitian Pengabdian Masyarakat (Institute for Research and Community Service), Universitas Nusa Mandiri
This work is distributed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Information Retrieval (IR) systems are pivotal for efficient data management, particularly in tasks involving name search and entity identification. This study evaluates how text preprocessing techniques, including case folding, phonetic normalization, and gender tagging, affect the performance of classical (TF-IDF, LSI) and CNN-based retrieval models for multilingual name matching. Using a dataset of 365,468 globally diverse names, the study implements a preprocessing pipeline with three components: Double Metaphone phonetic preprocessing (92% validation accuracy), gender disambiguation for unisex names (92% accuracy), and optimized n-gram tokenization for short names. Evaluation metrics include precision, recall, F1-score, and a novel Name Similarity Score (NSS), which combines orthographic and phonetic information. Results show that the full pipeline raises recall to 1.00, improves F1-score by 37%, and reduces false negatives by 63%. Key findings: TF-IDF achieves higher recall (0.98 vs. 0.85 for the CNN), LSI handles cultural variants effectively, and the CNN delivers the highest precision (0.91 vs. 0.70 for TF-IDF), particularly for unisex names. This work contributes both a scalable multilingual preprocessing framework and the NSS evaluation metric for robust name retrieval systems.
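To make the classical part of the pipeline concrete, the following is a minimal sketch of case folding, character n-gram tokenization, and TF-IDF ranking by cosine similarity. It is not the authors' implementation: the function names, the bigram size `n=2`, the `^`/`$` boundary markers, the toy name list, and the smoothed IDF formula (the same form scikit-learn uses by default) are all assumptions chosen for illustration. Double Metaphone encoding, gender tagging, and the NSS metric are deliberately left out.

```python
import math
from collections import Counter


def char_ngrams(name, n=2):
    """Case-fold a name and split it into character n-grams."""
    # Assumption: boundary markers give very short names extra n-grams.
    folded = f"^{name.casefold().strip()}$"
    return [folded[i:i + n] for i in range(len(folded) - n + 1)]


def normalize(vec):
    """Scale a sparse vector to unit length so a dot product is a cosine."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else vec


def build_index(names, n=2):
    """Return smoothed IDF weights and one unit-length TF-IDF vector per name."""
    docs = [Counter(char_ngrams(name, n)) for name in names]
    # Document frequency: iterating a Counter yields each distinct n-gram once.
    df = Counter(gram for doc in docs for gram in doc)
    idf = {g: math.log((1 + len(docs)) / (1 + c)) + 1.0 for g, c in df.items()}
    vectors = [normalize({g: tf * idf[g] for g, tf in doc.items()}) for doc in docs]
    return idf, vectors


def search(query, names, idf, vectors, n=2, top_k=3):
    """Rank indexed names by cosine similarity to the query."""
    counts = Counter(char_ngrams(query, n))
    # N-grams never seen in the index carry no weight and are skipped.
    q = normalize({g: tf * idf[g] for g, tf in counts.items() if g in idf})
    scored = [
        (sum(w * vec.get(g, 0.0) for g, w in q.items()), name)
        for name, vec in zip(names, vectors)
    ]
    return sorted(scored, reverse=True)[:top_k]


# Hypothetical toy data, not drawn from the study's dataset.
names = ["Muhammad", "Mohammed", "Maria", "Marie", "Li", "Lee"]
idf, vectors = build_index(names)
results = search("MOHAMMAD", names, idf, vectors)
for score, name in results:
    print(f"{name}: {score:.3f}")
```

Because the query and the indexed names pass through the same `casefold()` step, `"MOHAMMAD"` and `"mohammad"` produce identical rankings. The spelling variants `Muhammad` and `Mohammed` rank above unrelated names because they share most of their bigrams with the query. This is orthographic matching only. A phonetic key would be needed for variants that sound alike but share few characters.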
Copyright (c) 2025 Frizca Fellicita Marcelly, Irwansyah Saputra, Muhammad Bagus Andra

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
An author who publishes in the Pilar Nusa Mandiri: Journal of Computing and Information System agrees to the following terms:
