Enhancing Student Retention with Machine Learning: A Data-Driven Approach to Predicting College Student Persistence

Abstract

This study examined the application of machine learning (ML) models to predict college student persistence. Using a dataset of 8,776 student records spanning 7 years, 10 ML algorithms were evaluated, with a focus on Logistic Regression and Random Forest (RF). Results indicated that RF outperformed other models in accuracy and recall, particularly in identifying at-risk students. The use of the Synthetic Minority Oversampling Technique improved prediction for non-persistent students. Feature importance analysis revealed that cumulative resident terms, grade point average, financial factors, and engagement metrics were key predictors. Adjusting the prediction threshold further enhanced the identification of non-persistent students. Despite data limitations, the study provides actionable insights for improving student retention through data-driven strategies. Future research should refine feature selection, incorporate real-time data, and enhance predictive models to support institutional decision-making.
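The threshold adjustment mentioned above can be illustrated with a minimal sketch. It assumes a classifier that outputs a probability of non-persistence for each student. The function names, labels, and probabilities below are hypothetical and chosen only for illustration. They are not the study's code, data, or results.

```python
# Illustrative sketch only: the labels and probabilities are made up,
# not data or results from the study.

def classify(prob_non_persist, threshold):
    """Flag a student as non-persistent (1) when the predicted probability
    of non-persistence is at or above the threshold."""
    return [1 if p >= threshold else 0 for p in prob_non_persist]


def recall_precision(y_true, y_pred, positive=1):
    """Recall and precision for the positive (non-persistent) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Hypothetical model outputs: 1 = did not persist, 0 = persisted.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
prob_non_persist = [0.62, 0.41, 0.35, 0.45, 0.30, 0.22, 0.15, 0.10, 0.08, 0.05]

# Compare the default 0.5 cut-off with two lower thresholds.
for threshold in (0.5, 0.4, 0.3):
    y_pred = classify(prob_non_persist, threshold)
    recall, precision = recall_precision(y_true, y_pred)
    print(f"threshold={threshold:.1f} recall={recall:.2f} precision={precision:.2f}")
```

On this toy data, lowering the threshold flags more of the truly non-persistent students (higher recall) but also flags more students who persisted (lower precision). Any institution adjusting the threshold would need to weigh that trade-off.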

Keywords

student persistence, machine learning, retention prediction, educational data mining

