Optimization of Classification Algorithms Performance with k-Fold Cross Validation

Authors

  • Moch. Anjas Aprihartha, Program Studi PJJ Informatika, Universitas Dian Nuswantoro
  • Idham Idham, Program Studi Sistem dan Teknologi Informasi, Universitas Muhammadiyah Mataram

DOI:

https://doi.org/10.29303/emj.v7i2.212

Keywords:

CART, k-fold cross validation, KNN, Naive Bayes, Logistic Regression

Abstract

Supervised learning is a method for prediction and classification. Supervised learning algorithms build a model from training data that includes both independent and dependent variables. Methods for building classification models include Logistic Regression, Naive Bayes, K-Nearest Neighbor (KNN), and decision trees. A classification algorithm's failure to generalize to new data can be attributed to overfitting or underfitting. K-fold cross-validation is a method that can help avoid overfitting or underfitting and produce an algorithm that performs well on new data. This study tests the Naive Bayes, K-Nearest Neighbor (KNN), Classification and Regression Tree (CART), and Logistic Regression methods with k-fold cross-validation on two different datasets. The values of k set for cross-validation are 2, 3, 5, 7, and 10. The analysis concluded that each classification algorithm performed best with 10-fold cross-validation. In DATA 1, the Naive Bayes algorithm has the highest average accuracy, 0.67 (67%), with an error rate of 0.33 (33%), followed by the CART algorithm, KNN, and finally logistic regression. In DATA 2, the KNN algorithm has the highest average accuracy, 0.66 (66%), with an error rate of 0.34 (34%), followed by the CART algorithm, Naive Bayes, and finally logistic regression, but the results can be a reference for predicting the growth direction of the accommodation and food service activities sector.
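
The evaluation procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy data, the helper names (k_fold_indices, knn_predict, cross_val_accuracy, make_toy_data), the choice of 3 neighbors, and the fixed random seeds are all assumptions made for illustration, and only a simple KNN classifier is shown rather than all four methods. The sketch splits the data into k folds, uses each fold once as the test set, and reports the average accuracy and the error rate (1 minus accuracy) for k = 2, 3, 5, 7, and 10.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into n_folds nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def knn_predict(train_X, train_y, x, n_neighbors=3):
    """Majority vote among the n_neighbors training points closest to x."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = Counter(train_y[i] for i in order[:n_neighbors])
    return votes.most_common(1)[0][0]


def cross_val_accuracy(X, y, n_folds, seed=0):
    """Mean accuracy over n_folds folds; each fold serves once as the test set."""
    folds = k_fold_indices(len(X), n_folds, seed)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(knn_predict(train_X, train_y, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def make_toy_data(n_per_class=30, seed=1):
    """Hypothetical two-class data: two overlapping Gaussian clouds in 2-D."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in ((0, 0.0), (1, 1.5)):
        for _ in range(n_per_class):
            X.append((rng.gauss(center, 1.0), rng.gauss(center, 1.0)))
            y.append(label)
    return X, y


# Assumed toy data, not the DATA 1 or DATA 2 sets used in the study.
X, y = make_toy_data()
results = {}
for n_folds in (2, 3, 5, 7, 10):  # the k values listed in the abstract
    accuracy = cross_val_accuracy(X, y, n_folds)
    results[n_folds] = accuracy
    print(f"k={n_folds:2d}  accuracy={accuracy:.2f}  error rate={1 - accuracy:.2f}")
```

The number of folds (n_folds) and the number of neighbors (n_neighbors) are kept as separate parameters because the k in k-fold cross-validation and the K in KNN are unrelated quantities. The accuracies printed for the toy data are not expected to match the values reported in the abstract.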

Published

2024-09-20

How to Cite

Aprihartha, M. A., & Idham, I. (2024). Optimization of Classification Algorithms Performance with k-Fold Cross Validation. EIGEN MATHEMATICS JOURNAL, 7(2), 61–66. https://doi.org/10.29303/emj.v7i2.212

Section

Articles