UNVEILING GENDER FROM INDONESIAN NAMES USING RANDOM FOREST AND LOGISTIC REGRESSION ALGORITHMS

  • Musthofa Galih Pradana Universitas Pembangunan Nasional Veteran Jakarta
  • Pujo Hari Saputro Universitas Sam Ratulangi
  • Dyah Listianing Tyas Universitas Prisma
Keywords: detection, gender, logistic regression, random forest, text classification

Abstract

Gender detection can be approached in several ways. Some approaches rely on image identification, for example based on faces or image shapes, while others rely on text or written data. Gender identification is useful in many aspects of life, starting with forms of address such as "ladies" and "gentlemen": addressing a person accurately makes them feel more appreciated. Identification and detection can be framed as class prediction over a predetermined set of gender labels. Names in every language have their own characteristics for indicating gender, and Indonesian names are notably diverse, with unique levels of variation. The purpose of this study is to test how well the algorithms classify names against these class labels. Two algorithms are applied: Random Forest and Logistic Regression. Both predicted the class with perfect accuracy on 6 experimental data items. On 526 experimental data items, the final accuracy was 0.94 for Logistic Regression and 0.93 for Random Forest, so Logistic Regression holds a narrow advantage in this case.
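The abstract does not state how names are turned into features, so the following is only a minimal sketch of the general idea: classifying names into two gender labels with logistic regression. It assumes character bigrams as features and uses a handful of made-up toy names; neither the features, the names, nor the helper functions (`char_ngrams`, `featurize`, `train`, `accuracy`) come from the study. Random Forest is not shown here.

```python
import math
from collections import Counter


def char_ngrams(name, n=2):
    """Character n-grams of a lowercased name, padded with boundary markers."""
    padded = "^" + name.lower().strip() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def featurize(name):
    """Sparse bag of character bigrams plus a bias term (assumed features)."""
    feats = Counter(char_ngrams(name))
    feats["__bias__"] = 1
    return feats


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, feats):
    """Probability of label 1 under the current weights."""
    z = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return sigmoid(z)


def train(names, labels, epochs=200, lr=0.1):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights = {}
    data = [(featurize(n), y) for n, y in zip(names, labels)]
    for _ in range(epochs):
        for feats, y in data:
            error = predict_proba(weights, feats) - y
            for f, v in feats.items():
                weights[f] = weights.get(f, 0.0) - lr * error * v
    return weights


def accuracy(weights, names, labels):
    """Fraction of names whose thresholded prediction matches the label."""
    correct = sum(
        (predict_proba(weights, featurize(n)) >= 0.5) == bool(y)
        for n, y in zip(names, labels)
    )
    return correct / len(names)


# Hypothetical toy data, NOT the study's dataset: 1 = female, 0 = male.
TRAIN_NAMES = ["Siti", "Dewi", "Putri", "Wati", "Budi", "Agus", "Joko", "Bambang"]
TRAIN_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

weights = train(TRAIN_NAMES, TRAIN_LABELS)
train_accuracy = accuracy(weights, TRAIN_NAMES, TRAIN_LABELS)
print(f"accuracy on the toy training names: {train_accuracy:.2f}")
```

Accuracy here is measured on the same toy names used for fitting, so it says nothing about generalisation. A comparison like the one reported in the abstract would need a held-out test split and the actual labelled dataset.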

Published
2024-09-30
How to Cite
Pradana, M., Saputro, P., & Tyas, D. (2024). UNVEILING GENDER FROM INDONESIAN NAMES USING RANDOM FOREST AND LOGISTIC REGRESSION ALGORITHMS. Jurnal Techno Nusa Mandiri, 21(2), 144–150. https://doi.org/10.33480/techno.v21i2.5537