• Sukmawati Anggraeni Putri (1*) Information Systems, STMIK Nusa Mandiri Jakarta

  • (*) Corresponding Author
Keywords: Software Defect Prediction, Information Gain, Naïve Bayes, SMOTE, Bagging


Accurate prediction of defect-prone code can help direct testing effort, reduce costs, and improve software quality. Many researchers have applied machine-learning and statistical algorithms to build software defect prediction models, and classification is among the most popular machine-learning approaches for this task. Naive Bayes is a simple classifier with good performance, producing an average probability of 71 percent; it learns faster than many other machine-learning methods and has a good reputation for prediction accuracy. The NASA MDP datasets are widely used by researchers developing software defect prediction models because they are publicly and freely available. However, these datasets suffer from class imbalance and attribute noise. This research therefore applies SMOTE (Synthetic Minority Over-sampling Technique), a sampling technique, together with Bagging, an ensemble method, to handle class imbalance, and uses Information Gain to select relevant attributes and so handle attribute noise. Experimental results show that the combined SMOTE, Bagging, and Information Gain model handles class imbalance and attribute noise in software defect prediction and increases prediction accuracy.
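The pipeline described above can be sketched in scikit-learn. This is a minimal illustration under stated assumptions, not the paper's implementation: mutual information stands in for Information Gain, a hand-rolled interpolation loop stands in for a full SMOTE library, the synthetic data stands in for a NASA MDP set, and all parameter values (`k=4` features, `k=5` neighbours, 25 bagged estimators) are arbitrary choices for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import NearestNeighbors

def simple_smote(X, y, minority=1, k=5, n_new=50, seed=None):
    """Minimal SMOTE sketch: create synthetic minority samples by
    interpolating between a minority point and one of its k nearest
    minority neighbours (after Chawla et al., 2002)."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i][rng.integers(1, k + 1)]  # skip position 0 (the point itself)
        gap = rng.random()
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    X_new = np.vstack([X, np.array(synthetic)])
    y_new = np.concatenate([y, np.full(n_new, minority)])
    return X_new, y_new

# Synthetic imbalanced "defect" data stands in for a NASA MDP dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 1.5).astype(int)

# 1. Attribute selection (mutual information approximates Information Gain).
X_sel = SelectKBest(mutual_info_classif, k=4).fit_transform(X, y)

# 2. Oversample the minority (defective) class up to the majority count.
n_new = int((y == 0).sum() - (y == 1).sum())
X_bal, y_bal = simple_smote(X_sel, y, minority=1, k=5, n_new=n_new, seed=1)

# 3. Bagging ensemble of Naive Bayes classifiers on the balanced data.
model = BaggingClassifier(GaussianNB(), n_estimators=25, random_state=0)
model.fit(X_bal, y_bal)
print("training accuracy:", model.score(X_bal, y_bal))
```

In a faithful reproduction, the oversampling and feature selection would be fitted inside each cross-validation training fold rather than on the full dataset, to avoid leaking information into the test folds.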



