Abstract
This study examined the application of machine learning (ML) models to predict college student persistence. Using a dataset of 8,776 student records spanning 7 years, 10 ML algorithms were evaluated, with a focus on Logistic Regression and Random Forest (RF). Results indicated that RF outperformed other models in accuracy and recall, particularly in identifying at-risk students. The use of the Synthetic Minority Oversampling Technique improved prediction for non-persistent students. Feature importance analysis revealed that cumulative resident terms, grade point average, financial factors, and engagement metrics were key predictors. Adjusting the prediction threshold further enhanced the identification of non-persistent students. Despite data limitations, the study provides actionable insights for improving student retention through data-driven strategies. Future research should refine feature selection, incorporate real-time data, and enhance predictive models to support institutional decision-making.
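The threshold adjustment mentioned above can be illustrated with a minimal sketch. It assumes a classifier has already produced a probability of non-persistence for each student. The labels, scores, and the two cutoffs (0.5 and 0.3) are hypothetical values chosen for illustration, not data or settings from the study. Lowering the cutoff flags more students as at risk, which tends to raise recall for the non-persistent class at some cost in precision.

```python
# Toy illustration of prediction-threshold adjustment.
# All values below are hypothetical; none come from the study's dataset.

def confusion_counts(y_true, scores, threshold):
    """Count true positives, false positives, and false negatives,
    where the positive class (1) means 'did not persist'."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 1:
            fn += 1
    return tp, fp, fn

def recall_and_precision(y_true, scores, threshold):
    """Return (recall, precision) for the non-persistent class."""
    tp, fp, fn = confusion_counts(y_true, scores, threshold)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# 1 = non-persistent, 0 = persistent (hypothetical labels)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Hypothetical predicted probabilities of non-persistence
scores = [0.82, 0.55, 0.41, 0.33, 0.45, 0.28, 0.22, 0.15, 0.10, 0.05]

recall_default, precision_default = recall_and_precision(y_true, scores, 0.5)
recall_lowered, precision_lowered = recall_and_precision(y_true, scores, 0.3)

print(f"threshold 0.5: recall={recall_default:.2f}, precision={precision_default:.2f}")
print(f"threshold 0.3: recall={recall_lowered:.2f}, precision={precision_lowered:.2f}")
```

In this toy case the lower cutoff catches every non-persistent student but also flags one persistent student, which is the trade-off an institution would need to weigh when choosing a threshold.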
