Abstract
Over the past few years, machine learning (ML) has seen widespread application across diverse tasks, from simple regression to more complex problems such as text generation and image classification. As these techniques continue to evolve, one of the persistent challenges in data analysis remains the effective handling of missing values. To address this challenge, the R package VIM has recently been extended with ML-based methods, specifically XGBoost and transformer models, to enhance the imputation of missing data. Through a comprehensive simulation study, we evaluate the performance of these advanced imputation techniques against traditional donor-based and model-based approaches. The study draws on real-world data sets, including trade micro-data and housing registers, to simulate various missing data mechanisms (MCAR, MAR, and MNAR). Our results indicate that, while the XGBoost and transformer-based methods exhibit competitive performance, there are trade-offs between accuracy, potential bias, and computational cost. The findings underscore the potential of machine learning-based imputation to enhance data quality in large-scale tabular data sets, with implications for future developments in the VIM package, including the use of pre-trained language models. These advances hold promise for official statistics and other domains requiring robust imputation solutions.
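To give a concrete impression of the interface, the following sketch (not the authors' simulation code) contrasts a traditional donor-based method with an XGBoost-based one on the sleep data shipped with VIM; it assumes a recent VIM release in which xgboostImpute() is available, and the predictors are chosen only for illustration.

library(VIM)

# Mammal sleep data with genuine missing values, shipped with VIM
data(sleep, package = "VIM")

# Donor-based imputation: k-nearest-neighbour hot deck
sleep_knn <- kNN(sleep, k = 5)

# ML-based imputation: gradient-boosted trees (variable to impute ~ predictors);
# BodyWgt and BrainWgt are fully observed, so they are safe choices as predictors
sleep_xgb <- xgboostImpute(Dream ~ BodyWgt + BrainWgt, data = sleep)

# Both functions append logical "*_imp" indicator columns flagging imputed cells
head(sleep_xgb)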
