FEATURE SELECTION COMPARATIVE PERFORMANCE FOR UNSUPERVISED LEARNING ON CATEGORICAL DATASET
DOI:
https://doi.org/10.33480/techno.v22i1.6512
Keywords:
Chi-Square Test, Dynamic Dependency Threshold, Feature Selection, Mutual Information, Unsupervised Learning
Abstract
In the era of big data, Knowledge Discovery in Databases (KDD) is vital for extracting insights from extensive datasets. This study investigates feature selection for clustering categorical data in an unsupervised learning context. Given that an insufficient number of features can impede the extraction of meaningful patterns, we evaluate two techniques—Chi-Square and Mutual Information—to refine a dataset derived from questionnaires on college library visitor characteristics. The original dataset, containing 24 items, was preprocessed and partitioned into five subsets: one via Chi-Square and four via Mutual Information using different dependency thresholds (a low-mid-high scheme and dynamic quartile thresholds: Q1toMax, Q2toMax, and Q3toMax). K-Means clustering was applied across nine variations of K (ranging from 2 to 10), with clustering performance assessed using the silhouette score and Davies-Bouldin Index (DBI). Results reveal that while the Mutual Information approach with a Q3toMax threshold achieves an optimal silhouette score at K=7, it retains only 4 features—insufficient for comprehensive analysis based on domain requirements. Conversely, the Chi-Square method retains 18 features and yields the best DBI at K=9, better capturing the intrinsic characteristics of the data. These findings underscore the importance of aligning feature selection techniques with both clustering quality and domain knowledge, and highlight the need for further research on optimal dependency threshold determination in Mutual Information.
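The quartile-based dependency thresholds named in the abstract (Q1toMax, Q2toMax, Q3toMax) can be illustrated with a minimal sketch. The abstract does not state how a feature's Mutual Information score is computed without class labels, so the sketch assumes each feature is scored by its average Mutual Information with every other feature. The toy columns, the function names, and that scoring rule are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from statistics import quantiles


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def feature_scores(columns):
    """ASSUMED scoring rule: mean MI of each feature with all other features."""
    names = list(columns)
    scores = {}
    for a in names:
        others = [mutual_information(columns[a], columns[b]) for b in names if b != a]
        scores[a] = sum(others) / len(others)
    return scores


def select_by_quartile(scores, quartile):
    """Keep features whose score lies between the chosen quartile and the maximum."""
    q1, q2, q3 = quantiles(list(scores.values()), n=4, method="inclusive")
    cutoff = {1: q1, 2: q2, 3: q3}[quartile]
    return [name for name, s in scores.items() if s >= cutoff]


# Hypothetical questionnaire answers (8 respondents, 4 categorical items).
columns = {
    "visit_freq":    ["low", "low", "low", "low", "high", "high", "high", "high"],
    "borrows_books": ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    "gender":        ["m", "f", "m", "f", "m", "f", "m", "f"],
    "seat_area":     ["a", "a", "b", "b", "a", "a", "b", "b"],
}

scores = feature_scores(columns)
for q in (1, 2, 3):
    print(f"Q{q}toMax keeps:", select_by_quartile(scores, q))
```

A higher quartile gives a stricter cutoff and so retains fewer features, which matches the trade-off reported in the abstract, where Q3toMax leaves only 4 features. Each retained subset would then be clustered and compared using the silhouette score and DBI.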
License
Copyright (c) 2025 Rachmad Fitriyanto, Mohamad Ardi

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
The copyright of every article in Techno Nusa Mandiri: Journal of Computing and Information Technology (TECHNO Nusa Mandiri) is held in full by the author under the Creative Commons CC BY-NC license. Authors retain all rights to their published works, including but not limited to the rights set out on this page. The author acknowledges that TECHNO Nusa Mandiri is the first publisher of the work under a Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC). Authors may enter into separate, non-exclusive arrangements to distribute the published version of the manuscript (for example, depositing it in an institutional repository or publishing it in a book), provided they acknowledge that the manuscript was first published in TECHNO Nusa Mandiri. The author guarantees that the article is original, was written by the stated author, has never been published before, contains no statements that violate the law, does not violate the rights of others, and is subject to copyright held exclusively by the author. If an article was prepared jointly by more than one author, the submitting author warrants that they have been authorized by all co-authors to agree to the copyright and license notices on their behalf, and agrees to notify the co-authors of the terms of this policy. TECHNO Nusa Mandiri will not be held responsible for anything that may arise from internal disputes among the authors.