ENHANCING MACHINE LEARNING ALGORITHM PERFORMANCE FOR PCOS DIAGNOSIS USING SMOTENC ON IMBALANCED DATA
DOI:
https://doi.org/10.33480/jitk.v11i1.6676
Keywords:
imbalanced data, machine learning algorithm, PCOS, SMOTENC
Abstract
Polycystic Ovarian Syndrome (PCOS) is one of the most common endocrine disorders in women of reproductive age, characterized by disruptions in hormonal regulation that can affect menstrual cycles, fertility, and physical appearance. Despite its high prevalence, PCOS is often diagnosed late or inaccurately, leading to inappropriate treatment and long-term health problems for patients. Machine learning can help improve the accuracy of PCOS diagnosis. However, a primary challenge is class imbalance in the dataset, where positive (PCOS) cases are often far fewer than negative cases. This imbalance can bias a model toward the majority class, making it less effective at predicting the actual condition of patients. In this study, the Synthetic Minority Over-sampling Technique for Nominal and Continuous (SMOTENC) is applied to address the imbalance and thereby improve the performance of the machine learning models employed. The evaluation results show that the accuracy of every model improved after applying SMOTENC: the K-Nearest Neighbors (KNN) algorithm increased from 81.6% to 89.8%, the Support Vector Machine (SVM) algorithm from 90.6% to 92.5%, the Naive Bayes algorithm from 70% to 82.3%, and the C4.5 algorithm from 99.6% to 99.7%. This research contributes to the development of diagnostic methods that are both more precise and more efficient.
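To make the oversampling step concrete, the following is a minimal sketch of the core SMOTENC idea in plain Python. It is not the authors' implementation, and the function name `smotenc_sketch`, the toy rows, and the parameter values are all hypothetical. It assumes the standard formulation: nearest neighbours are found with a Euclidean distance over the continuous features plus a penalty for each mismatching nominal feature (the median of the continuous features' standard deviations in the minority class), continuous values are interpolated between a sample and one of its neighbours, and nominal values take the most frequent value among the neighbours. The use of the population standard deviation is an assumption of this sketch.

```python
import math
import random
import statistics
from collections import Counter


def smotenc_sketch(minority, nominal_idx, n_new, k=3, seed=0):
    """Generate n_new synthetic rows from minority-class rows (illustrative only).

    minority    : list of rows (lists), all belonging to the minority class
    nominal_idx : positions of the nominal (categorical) features
    """
    rng = random.Random(seed)
    n_features = len(minority[0])
    nominal = sorted(set(nominal_idx))
    continuous = [j for j in range(n_features) if j not in nominal]

    # Penalty for a nominal mismatch: median of the standard deviations of
    # the continuous features (population std. dev. is an assumption here).
    stds = [statistics.pstdev([row[j] for row in minority]) for j in continuous]
    med = statistics.median(stds)

    def distance(a, b):
        # Squared differences on continuous features, med**2 per nominal mismatch.
        d2 = sum((a[j] - b[j]) ** 2 for j in continuous)
        d2 += sum(med ** 2 for j in nominal if a[j] != b[j])
        return math.sqrt(d2)

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda row: distance(base, row))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)

        new_row = [None] * n_features
        for j in continuous:
            # Continuous features: a point on the segment base -> partner.
            new_row[j] = base[j] + gap * (partner[j] - base[j])
        for j in nominal:
            # Nominal features: most frequent value among the k neighbours.
            votes = Counter(row[j] for row in neighbours)
            new_row[j] = votes.most_common(1)[0][0]
        synthetic.append(new_row)
    return synthetic


# Hypothetical toy rows: [continuous_a, continuous_b, nominal_flag].
# These values are made up for illustration and are not from the study's dataset.
minority = [
    [21.0, 4.1, "yes"],
    [23.5, 3.8, "yes"],
    [22.2, 4.6, "no"],
    [24.1, 4.0, "yes"],
    [20.8, 3.5, "yes"],
]
new_rows = smotenc_sketch(minority, nominal_idx=[2], n_new=4, k=3, seed=42)
for row in new_rows:
    print(row)
```

In practice, oversampling of this kind is normally applied to the training split only, so that synthetic rows do not leak into the data used to measure accuracy. A maintained implementation such as the `SMOTENC` class in the imbalanced-learn library would usually be preferable to a hand-written version like this one.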
License
Copyright (c) 2025 Rofiqoh Dewi, Ratna Sri hayati, Alfa Saleh, Dahri Yani Hakim Tanjung, Abwabul Jinan

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.