IMPLEMENTATION OF ENSEMBLE TECHNIQUES FOR DIARRHEA CASES CLASSIFICATION OF UNDER-FIVE CHILDREN IN INDONESIA
Diarrhea is an endemic disease in Indonesia, characterized by three or more loose or liquid stools per day. According to WHO, diarrhea is the second leading cause of death among children under five. Records of under-five children with diarrhea are scarce, so the lack of information makes data analysis difficult. This difficulty can be addressed by rebalancing the data so that the class ratios are even. A popular rebalancing method is SMOTE. To handle imbalanced data and improve classification performance, this study combines SMOTE with several ensemble techniques and applies them to diarrhea cases among under-five children in Indonesia. The ensemble models used in this study are Random Forest, Adaptive Boosting, and XGBoost, with Decision Tree as the baseline method. The results show that all SMOTE-based methods perform competitively, with SMOTE-XGB achieving slightly higher accuracy (0.88), precision (0.96), and F1-score (0.86). Applying the SMOTE strategy improved recall, precision, and F1-score and gave a higher AUC for all methods (DT, RF, ADA, and XGB). This study is useful for addressing imbalance problems in official statistics data provided by BPS Statistics Indonesia.
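To make the rebalancing step concrete, the following is a minimal sketch of SMOTE-style oversampling for a binary problem, written with the Python standard library only. It is not the code used in this study: the function names (smote, rebalance), the toy data, and the choices of k and seed are assumptions made for illustration, and the sketch uses a simplified variant that picks base samples at random rather than looping over every minority sample.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between a minority sample
    and one of its k nearest minority-class neighbours (Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base sample.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def rebalance(X, y, minority_label=1, k=5, seed=0):
    """Oversample the minority class until both classes have equal size."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    n_needed = max(0, n_majority - len(minority))
    new_points = smote(minority, n_needed, k, seed) if n_needed else []
    return X + new_points, y + [minority_label] * len(new_points)


# Hypothetical toy data: 8 majority rows (label 0) and 3 minority rows (label 1).
X = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.1], [0.2, 0.4],
     [0.5, 0.3], [0.4, 0.6], [0.6, 0.5], [0.7, 0.2],
     [5.0, 5.0], [5.5, 4.8], [6.0, 5.2]]
y = [0] * 8 + [1] * 3

X_bal, y_bal = rebalance(X, y, minority_label=1, k=2, seed=42)
print(len(X_bal), y_bal.count(0), y_bal.count(1))
```

In a workflow like the one described above, the rebalanced training set would then be passed to a classifier such as Decision Tree, Random Forest, AdaBoost, or XGBoost. Oversampling is normally applied to the training split only, so that the evaluation metrics are computed on real, untouched observations.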
Copyright (c) 2021 Andriansyah Muqiit Wardoyo Saputra, Arie Wahyu Wijayanto
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.