COMPARATIVE ANALYSIS OF CLASSIFICATION ALGORITHMS IN HANDLING IMBALANCED DATA WITH SMOTE OVERSAMPLING APPROACH

Authors

  • Agung Nugroho, Universitas Pelita Bangsa
  • Wiyanto, Universitas Pelita Bangsa
  • Donny Maulana, Universitas Pelita Bangsa

DOI:

https://doi.org/10.33480/jitk.v11i2.6956

Keywords:

classification, imbalanced data, logistic regression, random forest, SMOTE

Abstract

Most machine learning algorithms yield optimal results when trained on datasets with balanced class proportions, but their performance typically declines on data with significant class imbalance. To address this issue, this study applies the Synthetic Minority Oversampling Technique (SMOTE) to balance the class distribution before model training. Several classification algorithms were evaluated: Decision Tree, K-Nearest Neighbors, Logistic Regression, Support Vector Machine, and Random Forest. Experimental results show that the Random Forest model produced the highest accuracy (95.70%) and the best F1-score, demonstrating a well-balanced trade-off between precision and recall. In contrast, Logistic Regression achieved the highest recall (74.20%), indicating greater sensitivity in identifying positive instances despite a lower F1-score. These outcomes highlight the importance of choosing a classification method according to the specific evaluation goal, whether prioritizing accuracy, recall, or overall model balance.
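The SMOTE step described in the abstract generates synthetic minority-class samples by interpolating between a minority point and one of its k nearest minority neighbors. The following is a minimal pure-Python sketch of that interpolation idea on a toy 2-D dataset; it is an illustration only, not the authors' implementation (which would typically use a library such as imbalanced-learn), and the function name, dataset, and parameter values are all assumptions for the example.

```python
import random
from math import dist


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples (minimal SMOTE sketch).

    For each new sample: pick a random minority point, choose one of its
    k nearest minority neighbors, and interpolate at a random fraction
    of the distance between them.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x (excluding x itself)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation fraction in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nb)))
    return synthetic


# Toy imbalanced setting: only three minority points in 2-D.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]
new_points = smote_sketch(minority, n_new=4, k=2)
print(len(new_points))  # 4 synthetic samples
```

In practice one would use a maintained implementation (e.g. the `SMOTE` class from the imbalanced-learn package) and apply the oversampling only to the training folds, never to the test data, so that evaluation metrics are not inflated by synthetic samples.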



Published

2025-11-27

How to Cite

[1]
A. Nugroho, Wiyanto, and D. Maulana, “COMPARATIVE ANALYSIS OF CLASSIFICATION ALGORITHMS IN HANDLING IMBALANCED DATA WITH SMOTE OVERSAMPLING APPROACH”, jitk, vol. 11, no. 2, pp. 487–495, Nov. 2025.