Abstract
Artificial or simulated data are particularly useful for testing and benchmarking machine learning methods, for teaching exercises, and for setting up analysis workflows. They are also valuable when real data cannot be used for data protection reasons, or when the data must contain specific distributions or effects in order to test particular machine learning methods. This paper presents a generator for multivariate numerical data with arbitrary marginal distributions and, as far as possible, arbitrary correlations. The generator is implemented in the open-source statistical software R. It can also handle categorical variables if data are generated separately for each level of the categorical variable. In addition, outliers can be integrated. The use of the data generator is demonstrated with a concrete example.
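To make the idea of "arbitrary marginal distributions with prescribed correlations" concrete, the following minimal sketch uses a Gaussian copula, a common general technique. It is not taken from the paper, and it is written in Python with the standard library only rather than in R. The function names `correlated_sample` and `pearson`, the parameter `rho`, and the two example marginals (an exponential and a uniform distribution) are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist


def correlated_sample(n, rho, inv_cdf_x, inv_cdf_y, seed=0):
    """Draw n pairs whose marginals are given by two inverse CDFs.

    Assumption: dependence is induced with a Gaussian copula, where rho is
    the correlation of the underlying normal variables, not of the output.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    scale = math.sqrt(1.0 - rho * rho)
    eps = 1e-12  # keep uniforms strictly inside (0, 1)
    xs, ys = [], []
    for _ in range(n):
        # Step 1: two standard normal variables with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + scale * rng.gauss(0.0, 1.0)
        # Step 2: map each to a uniform value via the normal CDF.
        u1 = min(max(std_normal.cdf(z1), eps), 1.0 - eps)
        u2 = min(max(std_normal.cdf(z2), eps), 1.0 - eps)
        # Step 3: apply the inverse CDF of the desired marginal.
        xs.append(inv_cdf_x(u1))
        ys.append(inv_cdf_y(u2))
    return xs, ys


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical marginals: exponential with rate 0.5, uniform on [10, 20].
def exp_inv(u):
    return -math.log(1.0 - u) / 0.5


def unif_inv(u):
    return 10.0 + 10.0 * u


xs, ys = correlated_sample(5000, 0.7, exp_inv, unif_inv)
print(round(pearson(xs, ys), 3))
```

In this sketch the correlation of the transformed variables generally differs from `rho`, because the marginal transformations are nonlinear. This is one reason why a target correlation can only be reached "as far as possible" for a given pair of marginals.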
