EVALUATING PREPROCESSING EFFECTS IN NAME RETRIEVAL USING CLASSICAL IR AND CNN-BASED MODELS

Frizca Fellicita Marcelly; Irwansyah Saputra; Muhammad Bagus Andra

doi:10.33480/pilar.v21i2.6884

Authors

Frizca Fellicita Marcelly, Universitas Nusa Mandiri
Irwansyah Saputra, Universitas Nusa Mandiri
Muhammad Bagus Andra, Universitas Nusa Mandiri

Keywords:

CNN, information retrieval, multilingual names, name retrieval, phonetic normalization

Abstract

Information Retrieval (IR) systems are pivotal for efficient data management, particularly in tasks involving name search and entity identification. This study evaluates how text preprocessing techniques, including case folding, phonetic normalization, and gender tagging, affect the performance of classical (TF-IDF, LSI) and CNN-based retrieval models for multilingual name matching. Using a dataset of 365,468 globally diverse names, the study implements a preprocessing pipeline with three components: Double Metaphone phonetic normalization (92% validation accuracy), gender disambiguation for unisex names (92% accuracy), and n-gram tokenization optimized for short names. Evaluation metrics include precision, recall, F1-score, and a novel Name Similarity Score (NSS) that combines orthographic and phonetic similarity. Results show that the full pipeline raises recall to 1.00, improves F1-score by 37%, and reduces false negatives by 63%. Among the models, TF-IDF achieves the highest recall (0.98 vs. 0.85 for the CNN), LSI handles cultural variants effectively, and the CNN delivers the highest precision (0.91 vs. 0.70 for TF-IDF), particularly for unisex names. This work contributes both a scalable multilingual preprocessing framework and the NSS evaluation metric for robust name retrieval systems.
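
To make the classical part of this setup concrete, the following is a minimal sketch of case folding, character n-gram tokenization, TF-IDF weighting, and precision/recall/F1 scoring. It is not the authors' implementation: the function names, the bigram size, the boundary markers `^` and `$`, the smoothed IDF formula, and the sample names are all assumptions chosen for illustration. Double Metaphone encoding, gender tagging, LSI, the CNN model, and the NSS metric are not shown, since the abstract does not specify them in enough detail.

```python
import math
from collections import Counter


def char_ngrams(name, n=2):
    """Case-fold a name and split it into boundary-padded character n-grams."""
    padded = f"^{name.casefold().strip()}$"  # '^' and '$' are assumed markers
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def normalize(vec):
    """Scale a sparse vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else vec


def build_index(names, n=2):
    """Return (idf, vectors): smoothed IDF weights and unit TF-IDF vectors."""
    docs = [Counter(char_ngrams(name, n)) for name in names]
    doc_freq = Counter(g for doc in docs for g in doc)
    # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1
    idf = {g: math.log((1 + len(names)) / (1 + c)) + 1.0
           for g, c in doc_freq.items()}
    vectors = [normalize({g: tf * idf[g] for g, tf in doc.items()})
               for doc in docs]
    return idf, vectors


def search(query, names, idf, vectors, n=2, top_k=3):
    """Rank indexed names by cosine similarity to the query."""
    counts = Counter(char_ngrams(query, n))
    q = normalize({g: tf * idf.get(g, 0.0) for g, tf in counts.items()})
    scored = [(sum(w * vec.get(g, 0.0) for g, w in q.items()), name)
              for name, vec in zip(names, vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical sample data, not drawn from the study's dataset.
    names = ["Catherine", "Katherine", "Kathryn", "Muhammad", "Mohammed", "Li"]
    idf, vectors = build_index(names)
    hits = search("CATHERINE", names, idf, vectors)
    print(hits)
    relevant = ["Catherine", "Katherine", "Kathryn"]
    print(precision_recall_f1([name for _, name in hits], relevant))
```

Because the query is case-folded before tokenization, "CATHERINE" and "Catherine" map to the same n-gram vector, which is the effect case folding is meant to have on recall. Spelling variants such as "Kathryn" share only some bigrams with the query, which is the gap a phonetic normalization step would be expected to narrow.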
