Abstract
Over the past few years, machine learning (ML) has seen widespread application across diverse tasks, from simple regression to more complex problems such as text generation and image classification. As these techniques continue to evolve, one of the persistent challenges in data analysis remains the effective handling of missing values. To address this challenge, the R package VIM has recently been extended with ML-based methods, specifically XGBoost and transformer models, to enhance the imputation of missing data. Through a comprehensive simulation study, we evaluate the performance of these advanced imputation techniques against traditional donor-based and model-based approaches. The study draws on real-world data sets, including trade micro-data and housing registers, to simulate various missing data mechanisms (MCAR, MAR, and MNAR). Our results indicate that, while the XGBoost and transformer-based methods exhibit competitive performance, there are trade-offs between accuracy, potential bias, and computational cost. The findings underscore the potential of machine learning-based imputation to enhance data quality in large-scale tabular data sets, with implications for future developments in the VIM package, including the use of pre-trained language models. These advances hold promise for official statistics and other domains requiring robust imputation solutions.
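To give a concrete impression of the interface, the following sketch (not the authors' simulation code) contrasts a traditional donor-based method with an XGBoost-based one on the sleep data shipped with VIM; it assumes a recent VIM release in which xgboostImpute() is available, and the predictors are chosen only for illustration.

library(VIM)

# Mammal sleep data with genuine missing values, shipped with VIM
data(sleep, package = "VIM")

# Donor-based imputation: k-nearest-neighbour hot deck
sleep_knn <- kNN(sleep, k = 5)

# ML-based imputation: gradient-boosted trees (variable to impute ~ predictors);
# BodyWgt and BrainWgt are fully observed, so they are safe choices as predictors
sleep_xgb <- xgboostImpute(Dream ~ BodyWgt + BrainWgt, data = sleep)

# Both functions append logical "*_imp" indicator columns flagging imputed cells
head(sleep_xgb)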
