Sage Journals: Discover world-class research

Abstract

With the rapid proliferation of social data, the prevalence of missing values has become increasingly common. Various factors, including human error and machine failure, contribute to the emergence of missing values in datasets. Datasets containing missing values not only consume storage space but also pose a significant obstacle to direct utilization, resulting in substantial resource wastage. Consequently, accurately imputing missing values has emerged as a focal point in research. Generative missing value imputation methods, leveraging generative models, have demonstrated notable efficacy in recent years by directly generating values for missing components based on observable data values. This paper introduces a novel generative method for missing value imputation based on a diffusion denoising model, termed the conditional diffusion model for missing value imputation (CDMVI). Specifically, CDMVI trains a conditional diffusion model using complete data samples (samples devoid of missing values) and subsequently utilizes the trained model to impute missing values in datasets. During the training stage (i.e., the forward process of the diffusion model), a subset of features is randomly selected from complete data samples, and varying levels of random noise are introduced as condition inputs to the noise predictor within the diffusion model. In the imputation stage (i.e., the backward process of the diffusion model), the missing segments of the data are initially replaced with random noise, serving as a guide for the diffusion model to generate complete samples. Experimental evaluations across multiple datasets demonstrate the competitive performance of our proposed CDMVI method.

Keywords

missing value imputation, diffusion model, denoising

1. Introduction

The significance of data quality in contemporary artificial intelligence is paramount, as it directly influences the efficacy of model training. In practical scenarios, the pervasive issue of missing values remains a common hindrance to data quality. The presence of missing values can be attributed to a variety of factors, including human errors, machine malfunctions, and other unforeseen circumstances (Baraldi & Enders, 2010). For instance, in the context of recording medical data, inaccuracies due to operator errors might hinder the recording of a patient’s blood pressure, or the omission of sensitive information could result from the imperative of safeguarding patient privacy. These diverse reasons collectively contribute to the emergence of missing values in the final dataset (Ibrahim et al., 2012). Failing to address the challenge of missing values can obstruct the resolution of problems reliant on such data, potentially leading to unforeseeable losses. As a result, recent research endeavors have been dedicated to addressing the predicament of missing values. Among the proposed solutions, imputing the missing parts emerges as the most rational approach, minimizing any wastage of crucial data. Numerous researchers have introduced and developed distinct missing value imputation techniques from various perspectives. The overarching principle revolves around judiciously estimating the distribution of missing values and predicting reasonable fill-in values, ensuring that the imputed samples align with the overall data distribution (Dong & Peng, 2013).

In recent years, various methods for imputation of missing values have been proposed. According to the literature (Jarrett et al., 2022), existing methods for imputation of missing values can be roughly divided into two categories: iterative methods and methods based on deep generative models. Iterative methods hinge on estimating the conditional distribution of a feature by utilizing all other available features. In each iteration, a conditional distribution estimator is trained to predict the value of each feature. This single-variable model is then employed recursively to impute missing values until the process converges, guided by a prespecified convergence criterion (Zheng & Charoenphakdee, 2022). This method has undergone extensive study and application (Khan & Hoque, 2020; Stekhoven & Bühlmann, 2012), with multiple imputation by chained equations (MICE; Van Buuren & Oudshoorn, 2000) standing out as one of the most renowned techniques. The MICE algorithm predicts missing values in a feature by establishing equations between the single feature and other observed features. It utilizes the predicted missing values as known variables to further predict the missing values in another single feature. Through iterative execution of these steps, the MICE algorithm converges to obtain the final imputed values for the missing part.

On the other hand, missing value imputation methods based on deep generative models operate by estimating the joint model of all features. By training a generative model, these methods generate plausible values for the missing part based on observed values (Li et al., 2019; Rezende et al., 2014). Yoon et al. (2018) proposed generative adversarial imputation nets (GAIN). The core idea behind GAIN is to generate estimates of the missing part using a generator and employ a discriminator to assess the disparity between the generated estimates and the actual values. This iterative optimization process refines both the generator and discriminator, ultimately yielding a more accurate imputation result for the missing data. Simultaneously, Gondara and Wang (2018) proposed multiple imputation using denoising autoencoders (MIDA) for missing value imputation. The MIDA model adopts an autoencoder architecture (Wang et al., 2016) and introduces noise to the input to enhance the model’s ability to reconstruct the original input from noisy data. Subsequently, Nazabal et al. (2020) proposed a missing value imputation model called HIVAE, which uses variational autoencoders (VAEs) as the main architecture. Other methods for imputation of missing values based on generative models include those of Dai et al. (2021), Li et al. (2019), and Yoon and Sull (2020).

In recent times, diffusion models (Cao et al., 2022; Chen et al., 2023; Yang et al., 2023), emerging as a noteworthy class of generative models, have demonstrated their effectiveness across diverse domains, including computer vision (Ho et al., 2020), time series data (Rasul et al., 2021), and natural language processing (Li et al., 2022). When compared to other generative models, diffusion models have exhibited notable prowess. These models incrementally introduce noise to the data through the forward process and systematically denoise the random Gaussian noise during the backward process to generate samples. Despite their demonstrated efficacy in various applications, diffusion models, as a relatively novel generative model, have seen limited exploration in the context of addressing the challenge of missing value imputation.

In this study, we introduce a diffusion-based framework, referred to as conditional diffusion model for missing value imputation (CDMVI), designed for the imputation of missing values in data. Specifically, CDMVI employs a diffusion model trained on complete samples (i.e., samples without missing values) to capture the original distribution of the data. In the forward process of the diffusion model, we systematically introduce random Gaussian noise to complete samples, generating perturbed samples. Concurrently, we implement distinct masking strategies to simulate missing values in the data. We input samples with missing values and perturbed samples with noise into a U-shaped network to predict the amount of noise added to the samples. Here, the samples with missing values serve as a conditioning factor for the diffusion model, enhancing the U-shaped network’s accuracy in predicting the noise added during the forward process. In the backward process of the diffusion model, we utilize random Gaussian noise and real samples with missing values as inputs. After multiple denoising iterations, complete samples corresponding to the samples with missing values are obtained. Consequently, we refer to the backward process as the missing value-filling process. Our contributions can be summarized as follows:

We introduce a framework, CDMVI, based on the conditional diffusion model, tailored for generative missing value imputation, with a specific emphasis on numerical data and multiple missing patterns.

We design a conditional generation model and seamlessly integrate the generated conditions into the noise predictor within the diffusion model. This integration facilitates the denoising process in the direction of missing value generation.

We conduct extensive numerical experiments on various real datasets under different missing mechanisms, demonstrating the efficacy of CDMVI in effectively addressing missing value imputation challenges.

2. Related Work

2.1. Missing Data

Dealing with missing values is a prevalent challenge in both data analysis and machine learning. This pertains to the presence of unobserved or invalid values within a dataset. Such missing values can arise due to various factors, including measurement errors, problems during data acquisition, intentional data gaps, or issues related to data quality. Table 1 illustrates scenarios where missing values are encountered in the dataset, with these instances being represented by the symbol “?.”

Table 1.
Missing Value Display.

Id	Sex	Age	Blood pressure	Blood glucose
1	?	28	108	5.4
2	M	?	141	?
3	F	74	?	5.7

2.2. Missing Mechanism

Missing data can be broadly categorized into three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR; Purwar & Singh, 2015). MCAR denotes missing data that is independent of both itself and other variables, meaning the occurrence of missing values for a variable is entirely random. For instance, in a street survey, participants may drop out midway for various reasons, resulting in incomplete data. Such missing values are considered completely random (Jerez et al., 2010). Handling MCAR is relatively straightforward, as missing values can be directly deleted without introducing estimation bias. Alternatively, appropriate filling methods can be applied to maximize the utilization of sample information, though this approach may result in some loss of data. On the other hand, MAR describes situations where data is missing in connection to other observable variables. For instance, in a test scenario, individuals failing to meet the minimum intelligence quotient requirement of 100 points may be excluded from subsequent personality tests. Here, missing values not only lead to information loss but can also introduce bias into the analysis results. Dealing with MAR requires more sophisticated techniques than MCAR, as directly deleting or filling missing values with averages may not be suitable (Cummings, 2013). Although MAR is more complex than MCAR, it is a more commonly encountered scenario. MNAR implies that missing data is only related to the variable itself. For example, in a company that recently hired 20 employees, six were dismissed during the probation period due to poor performance. In the subsequent performance evaluation after the probation period, the performance scores of the dismissed employees are missing (Bertsimas et al., 2018).
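The three mechanisms can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the function names, the toy two-column data, and the threshold of 100 (borrowed from the intelligence quotient example above) are hypothetical and are not taken from the paper's experiments. It follows the convention that a mask entry of 0 marks a missing value.

```python
import numpy as np

def mcar_mask(X, rate, rng):
    """MCAR: every entry is dropped independently with the same probability."""
    return (rng.random(X.shape) >= rate).astype(int)  # 1 = observed, 0 = missing

def mar_mask(X, target_col, driver_col, threshold):
    """MAR: target_col is missing whenever the *observed* driver_col is below threshold."""
    M = np.ones(X.shape, dtype=int)
    M[X[:, driver_col] < threshold, target_col] = 0
    return M

def mnar_mask(X, target_col, threshold):
    """MNAR: target_col is missing whenever its *own* value is below threshold."""
    M = np.ones(X.shape, dtype=int)
    M[X[:, target_col] < threshold, target_col] = 0
    return M

rng = np.random.default_rng(0)
# Toy data: column 0 plays the role of an IQ score, column 1 a later test score.
X = rng.normal(loc=100.0, scale=15.0, size=(1000, 2))

M_mcar = mcar_mask(X, rate=0.3, rng=rng)
M_mar = mar_mask(X, target_col=1, driver_col=0, threshold=100.0)
M_mnar = mnar_mask(X, target_col=1, threshold=100.0)
```

In the MAR mask the missingness of column 1 can be explained entirely from the fully observed column 0, whereas in the MNAR mask it depends on the very values that are removed.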

3. Missing Value Filling Method

Various effective techniques have been developed for addressing missing values, and in the subsequent discussion, we provide a brief overview of the current landscape of methods in this domain. We delve into several algorithms for comparison, exploring their relationships and offering our recommendations. One of the most widely used imputation methods is class mean imputation, where the entire sample is divided into classes derived from auxiliary variables. Within each imputation class, missing responses are imputed with the average value of the corresponding class. Despite its common usage, this method may introduce bias in estimating variance and covariance, and the class averages are sensitive to extreme values. The random imputation method involves selecting a respondent at random from the imputation class to fill in the missing value. While cost-effective and easy to implement, it tends to underestimate standard errors, leading to an overestimation of test statistics. Regression imputation predicts the missing value by establishing a regression equation with other variables. Although it may incur high computational costs, particularly with datasets featuring numerous variables, it remains a viable option. The k nearest neighbor (kNN) imputation method (Liu et al., 2015) fills in missing data with values computed from the k nearest observations. It identifies the k nearest neighbors of a data point in the feature space and predicts the missing value from the attribute values of these neighbors, using measures such as the average, median, or mode. Support vector machine and support vector regression are well-established learning-based missing value imputation techniques, catering to discrete/classification and continuous/numerical missing data imputation, respectively. These techniques utilize a kernel function to map the original feature space nonlinearly to a high-dimensional feature space, constructing a hyperplane for linear separation of data samples in the new feature space (Byun & Lee, 2003). Recently, there has been a surge in applying deep learning techniques to missing value imputation challenges. This includes methods leveraging graph neural networks (Zhong et al., 2023), diffusion models (Zheng & Charoenphakdee, 2022), or VAEs (Mattei & Frellsen, 2019). These techniques, with their nonlinear computation capabilities, prove adept at uncovering intricate correlations within data.
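To make two of the classical methods above concrete, the following sketch implements column-mean imputation and a simple kNN imputation. It is a minimal illustration, not the implementation used in the cited works: the function names are hypothetical, the kNN distance is assumed to be the root mean squared difference over features observed in both rows, and the neighbor values are combined by their average.

```python
import numpy as np

def mean_impute(X, M):
    """Fill each missing entry (M == 0) with the mean of the observed values
    in its column. Assumes every column has at least one observed value."""
    X_filled = X.copy()
    for j in range(X.shape[1]):
        observed = M[:, j] == 1
        X_filled[~observed, j] = X[observed, j].mean()
    return X_filled

def knn_impute(X, M, k=2):
    """Fill each missing entry with the average of that feature over the k
    nearest rows that observe it; falls back to the column mean if no row
    shares an observed feature with the incomplete row."""
    X_filled = X.copy()
    n, _ = X.shape
    for i in range(n):
        for j in np.where(M[i] == 0)[0]:
            candidates = []
            for r in range(n):
                shared = (M[i] == 1) & (M[r] == 1)
                if r != i and M[r, j] == 1 and shared.any():
                    dist = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
                    candidates.append((dist, X[r, j]))
            if candidates:
                candidates.sort(key=lambda c: c[0])
                X_filled[i, j] = np.mean([v for _, v in candidates[:k]])
            else:
                X_filled[i, j] = X[M[:, j] == 1, j].mean()
    return X_filled

# Toy example: the last row is missing its second feature (placeholder 0.0).
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [2.1, 0.0]])
M = np.array([[1, 1], [1, 1], [1, 1], [1, 0]])

X_mean = mean_impute(X, M)
X_knn = knn_impute(X, M, k=2)
```

In this toy case mean imputation fills the gap with the column average of the three observed values, while kNN imputation uses only the two rows whose first feature is closest to 2.1.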

4. Method

4.1. Problem Definition

Let the data matrix $X \in \mathbb{R}^{n \times d}$ denote the data with missing values, consisting of $n$ samples and $d$ features. The $j$th feature of the $i$th sample is denoted $x_{i j}$. A binary mask matrix $M \in \{0, 1\}^{n \times d}$ indicates the locations of missing values in the table, where $x_{i j}$ is missing if and only if $m_{i j} = 0$. In this paper, the goal of data imputation is to generate the missing value $x_{i j}$ wherever $m_{i j} = 0$.

4.2. Denoising Diffusion Probabilistic Models

This article employs the conditional diffusion model (Dhariwal & Nichol, 2021) as a generative method for the restoration of missing values in data. Similar to other generative models, the diffusion model discerns the distribution of training data through specific training methods. The fundamental concept behind the diffusion model involves systematically perturbing the structure within the data distribution using a forward Markov process. Subsequently, the model learns a backward process to recover this perturbed structure, resulting in a highly flexible and easily manageable generative model. During the training phase of the diffusion model, the forward process remains fixed, and the focus lies solely on training its backward process.

During the training process, the diffusion model defines a diffusion process that converts the sample $x_{0} \sim q (x)$ into Gaussian white noise $x_{T} \sim N (0, I)$ in $T$ steps. Each step in the forward process is given by,

q (x_{t} | x_{t - 1}) = N (x_{t}; \sqrt{1 - β_{t}} x_{t - 1}, β_{t} I)

(1)

where $\{β_{t} \in (0, 1)\}_{t = 1}^{T}$ is a predetermined parameter. Since each step $t$ in the forward process depends only on step $t - 1$, samples at any step can be obtained as long as $x_{0}$ is given.

\begin{aligned} q (x_{1 : T} | x_{0}) = \prod_{t = 1}^{T} q (x_{t} | x_{t - 1}) \end{aligned}

(2)

\begin{aligned} x_{t} = \sqrt{\bar{α_{t}}} x_{0} + \sqrt{1 - \bar{α_{t}}} z_{t} \end{aligned}

(3)

where $α_{t} = 1 - β_{t}$, $\bar{α}_{t} = \prod_{i = 1}^{t} α_{i}$, and $z_{t}$ is Gaussian noise. In this process, as $t$ increases, $x_{t}$ becomes closer to pure noise. When $T \to \infty$, $x_{T}$ is complete Gaussian noise. In addition, $β_{t}$ increases as $t$ increases, that is, $β_{1} < β_{2} < \dots < β_{T}$. The forward process is thus a noise-adding process, whereas the reverse process is a noise-removing inference process. If we can gradually obtain the distribution $q (x_{t - 1} | x_{t})$, we can restore the original distribution from the standard Gaussian distribution $x_{T} \sim N (0, I)$. However, $q (x_{t - 1} | x_{t})$ cannot be inferred directly, so we use a deep learning model $p_{θ}$ to predict this distribution:

p_{θ} (x_{t - 1} | x_{t}) = N (x_{t - 1}; μ_{θ} (x_{t}, t), Σ_{θ} (x_{t}, t))

(4)
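The closed-form forward step of equation (3) can be sketched in a few lines. This is an illustration only: the linear schedule from 1e-4 to 0.02 with $T = 1000$ is the common choice of Ho et al. (2020) and is assumed here, since the schedule used for CDMVI is not stated in this section; the function names are hypothetical.

```python
import numpy as np

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Increasing noise schedule beta_1 < ... < beta_T and its cumulative products.
    The linear form and its endpoints are assumptions, not values from the paper."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # alpha_bar_t = prod_{i<=t} alpha_i
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) directly via equation (3); t is 1-indexed."""
    z = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * z
    return x_t, z

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule(T=1000)
x0 = rng.random(16)                           # a toy sample scaled to [0, 1]
x_mid, _ = q_sample(x0, 500, alpha_bars, rng)
x_T, z_T = q_sample(x0, 1000, alpha_bars, rng)
```

Because $\bar{α}_{T}$ is close to zero under this schedule, `x_T` is almost identical to the noise `z_T`, which matches the statement that the final step is nearly pure Gaussian noise.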

The learning objective for the model in equation (4) is derived by considering the variational lower bound,

\begin{aligned} E [- \log p_{θ} (x_{0})] \leq E_{q} [- \log \frac{p_{θ} (x_{0 : T})}{q (x_{1 : T} | x_{0})}] \\ = E_{q} [- \log p (x_{T}) - \sum_{t \geq 1} \log \frac{p_{θ} (x_{t - 1} | x_{t})}{q (x_{t} | x_{t - 1})}] = L \end{aligned}

(5)

As extended by Ho et al. (2020), this loss can be further decomposed as,

\begin{aligned} E_{q} [\underbrace{D_{K L} (q (x_{T} | x_{0}) \| p (x_{T}))}_{L_{T}} + \sum_{t > 1} \underbrace{D_{K L} (q (x_{t - 1} | x_{t}, x_{0}) \| p_{θ} (x_{t - 1} | x_{t}))}_{L_{t - 1}} - \underbrace{\log p_{θ} (x_{0} | x_{1})}_{L_{0}}] \end{aligned}

(6)

Using Bayes' theorem, one can calculate the posterior $q (x_{t - 1} | x_{t}, x_{0})$ in terms of $β_{t}^{'}$ and $μ_{t}^{'} (x_{t}, x_{0})$, which are defined as follows:

\begin{aligned} α_{t} = 1 - β_{t} \end{aligned}

(7)

\begin{aligned} β_{t}^{'} := \frac{1 - {\bar{α}}_{t - 1}}{1 - {\bar{α}}_{t}} β_{t} \end{aligned}

(8)

\begin{aligned} μ_{t}^{'} (x_{t}, x_{0}) := \frac{\sqrt{{\bar{α}}_{t - 1}} β_{t}}{1 - {\bar{α}}_{t}} x_{0} + \frac{\sqrt{α_{t}} (1 - {\bar{α}}_{t - 1})}{1 - {\bar{α}}_{t}} x_{t} \end{aligned}

(9)

As reported by Ho et al. (2020), the best way to parametrize the model is to predict the cumulative noise $ε$ that is added to the current intermediate sample $x_{t}$. Thus, we obtain the following parametrization of the predicted mean $μ_{θ} (x_{t}, t)$:

μ_{θ} (x_{t}, t) = \frac{1}{\sqrt{α_{t}}} (x_{t} - \frac{β_{t}}{\sqrt{1 - {\bar{α}}_{t}}} ε_{θ} (x_{t}, t))

(10)
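As a numerical check of how equation (10) relates to equations (8) and (9), the sketch below computes the posterior mean from the clean sample and the mean obtained from the noise, and confirms that they coincide when the predicted noise equals the noise actually added. The schedule values and function names are assumptions made for the illustration.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_mean_var(x_t, x0, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0), equations (8) and (9).
    t is 1-indexed and must satisfy t >= 2."""
    a_bar_t, a_bar_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    beta_t, alpha_t = betas[t - 1], alphas[t - 1]
    var = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t
    mean = (np.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)) * x0 \
         + (np.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)) * x_t
    return mean, var

def mean_from_noise(x_t, eps, t):
    """Predicted mean of equation (10), given a noise estimate eps."""
    beta_t, alpha_t, a_bar_t = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
    return (x_t - beta_t / np.sqrt(1.0 - a_bar_t) * eps) / np.sqrt(alpha_t)

rng = np.random.default_rng(0)
t = 500
x0 = rng.random(8)
eps = rng.standard_normal(8)
x_t = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps

mu_post, var_post = posterior_mean_var(x_t, x0, t)
mu_eps = mean_from_noise(x_t, eps, t)
```

The agreement of `mu_post` and `mu_eps` follows from substituting $x_{0} = (x_{t} - \sqrt{1 - \bar{α}_{t}} ε) / \sqrt{\bar{α}_{t}}$ into equation (9), which is why a network that predicts the noise is sufficient to define the reverse step.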

Ho et al. (2020) found that predicting $ε$ worked best, especially when combined with a reweighted loss function:

L_{simple} = E_{t, x_{0}, ε} [{‖ ε - ε_{θ} (x_{t}, t) ‖}^{2}]

(11)

This objective can be seen as a reweighted form of the variational lower bound $L_{vlb}$ in equation (5) (without the terms affecting $Σ_{θ}$). Previous studies found that optimizing this reweighted objective resulted in much better sample quality than optimizing $L_{vlb}$ directly, and explained this by drawing a connection to generative score matching (Huang et al., 2021).

4.3. Training Conditional Diffusion Model With Missing Data

Traditionally, the training of the diffusion model involves instructing an Unet, also referred to as the noise predictor, to forecast the noise introduced to the samples. This Unet takes the sample $x_{t}$ and steps $t$ as input and produces an output representing the noise amount with dimensions identical to the sample. In this paper, we propose enhancing the guidance information provided to the noise predictor’s input to improve the accuracy of the Unet in predicting noise. The specific training process is illustrated in Figure 1. Initially, we introduce $t$ steps noise to the pristine sample $x_{0}$ to generate $x_{t}$ . Concurrently, we simulate the occurrence of missing values in real samples by removing certain features’ values, rendering them as missing. Leveraging the denoising characteristics of the diffusion model’s backward process, we treat the missing part of the sample as a noisy feature. Consequently, our approach involves using the visible features of the sample as conditions to guide the diffusion model in denoising the missing segment, ultimately yielding clean features. The conditional samples are derived by combining the masked samples with the noisy samples.

X_{c} = M \times X + (1 - M) \times X_{T}

(12)

where $X$ is the training sample (i.e., the clean sample), $M$ is the indicator matrix consisting of 0 and 1, and $X_{T}$ is the sample with $T$ steps of added noise. $X_{c}$ is the condition sample, with the same dimension as $X_{T}$. According to the training rule of the diffusion model, the amount of noise added differs across the samples in $X_{T}$. Therefore, the missing part of each sample contains a different amount of noise. We define a confidence level to indicate the credibility of the noisy features in $X_{c}$ in guiding the denoising process. It is defined as follows:

confidence = M_{i} + (\frac{T - t_{i}}{T}) \times (1 - M_{i})

(13)

where $M_{i}$ represents the $i$th row of the indicator matrix, $t_{i}$ represents the number of noise steps applied to the $i$th sample, and $T$ represents the total number of diffusion steps. It can be seen from equation (13) that the more noise steps are applied to a sample, the lower the confidence of its corresponding conditional sample in guiding the denoising process. Next, we rewrite equation (11) by adding additional guidance information to the input of the noise predictor in the diffusion model, resulting in

L_{simple} = E_{t, x_{0}, ε} [{‖ ε - ε_{θ} (x_{t}, t, x_{c}, confidence) ‖}^{2}]

(14)

Therefore, we can obtain a noise predictor $ε_{θ}$ with conditional guidance by training on a complete dataset.
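The construction of the inputs to the conditional noise predictor in equations (12) to (14) can be sketched as follows. This is one possible reading rather than the authors' code: it assumes that each sample $i$ receives its own step $t_{i}$, that the noisy sample used in equation (12) is that per-sample noisy version, and that the training mask is drawn completely at random. The function names and the linear schedule are hypothetical.

```python
import numpy as np

def training_inputs(X, T, alpha_bars, mask_rate, rng):
    """Build (x_t, t, x_c, confidence) and the target noise for one batch."""
    n, _ = X.shape
    t = rng.integers(1, T + 1, size=n)                      # per-sample step t_i in 1..T
    eps = rng.standard_normal(X.shape)                      # target noise
    a_bar = alpha_bars[t - 1][:, None]
    X_t = np.sqrt(a_bar) * X + np.sqrt(1.0 - a_bar) * eps   # equation (3)
    M = (rng.random(X.shape) >= mask_rate).astype(float)    # simulated mask, 0 = missing
    X_c = M * X + (1.0 - M) * X_t                           # equation (12)
    confidence = M + ((T - t) / T)[:, None] * (1.0 - M)     # equation (13)
    return X_t, t, X_c, confidence, eps, M

def simple_loss(eps, eps_pred):
    """Squared error between true and predicted noise, equation (14),
    averaged over the batch."""
    return float(np.mean(np.sum((eps - eps_pred) ** 2, axis=1)))

rng = np.random.default_rng(0)
T = 100
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))   # assumed schedule
X = rng.random((32, 10))                                    # toy complete samples in [0, 1]

X_t, t, X_c, confidence, eps, M = training_inputs(X, T, alpha_bars, 0.5, rng)
loss_untrained = simple_loss(eps, np.zeros_like(eps))       # stand-in for a network output
```

Observed features always carry confidence 1, while a masked feature carries $(T - t_{i}) / T$, so condition values taken from heavily noised samples are weighted as less trustworthy. In an actual implementation the zero array would be replaced by the output of the trained network $ε_{θ} (x_{t}, t, x_{c}, confidence)$.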

Figure 1.

The framework of our proposed conditional diffusion model for missing value imputation (CDMVI).

4.4. Missing Value Imputation

By minimizing the error between the predicted noise and the actual noise of the sample, we can train the parameters of the model $ε_{θ}$ . Next, we describe how to use the trained noise predictor to fill in missing values for samples containing missing values. The specific process is shown in Figure 2. Given a sample $x_{i}$ containing missing values and its corresponding row $M_{i}$ of the missing indicator matrix, we first fill in the missing parts with random noise to obtain the initial condition sample $x_{c}$ , written $x_{i}^{'}$ in equation (15).

x_{i}^{'} = x \times M_{i} + (1 - M_{i}) \times ε

(15)

where $ε \sim N (0, 1)$ is randomly sampled Gaussian noise with mean 0 and variance 1. Meanwhile, we randomly sample standard Gaussian noise as the noise sample $x_{T}$ at step $T$, with the same dimension as the sample $x_{i}$. Then, we calculate the initial confidence level according to equation (13) and determine the value of the step number $T$. The variables obtained above are simultaneously input into $ε_{θ}$ to obtain the predicted noise $ε^{'}$.

Figure 2.

Classification performance of different missing rates: (a) USPS, (b) MNIST, (c) First-order, (d) Satimage, (e) Fashion, and (f) Optdigits datasets.

After predicting the noise, we can obtain the mean and variance of $x_{T - 1}$ according to equations (8) and (9). This allows us to sample $x_{T - 1}$ and use it as the input for the next iteration. The conditional samples $x_{c}^{'}$ for the next iteration can be combined from the mask samples and $x_{T - 1}$ .

x_{c}^{'} = x \times M_{i} + (1 - M_{i}) \times x_{T - 1}

(16)

By iteratively repeating the aforementioned steps $T$ times, the randomly sampled Gaussian noise undergoes a gradual denoising process, transforming into clean samples. Specifically, the CDMVI method introduced in this paper represents a generative approach to missing value imputation. Leveraging the generative properties of the diffusion model, it interprets the missing section of the data as a feature significantly affected by noise. Consequently, the missing value imputation challenge is reframed as a denoising problem. Built upon the designed conditional diffusion model, CDMVI employs the visible features of the sample as guiding conditions during the diffusion model’s training. This strategic utilization allows for a more precise removal of added noise from the sample. In the imputation stage, as noise in the missing part gradually diminishes, the generated values for the missing section become increasingly meaningful. CDMVI considers the guiding value of the generated samples at each step, continuously updating conditional samples with the newly generated missing values. Ultimately, CDMVI adeptly eliminates noise from the missing section, achieving accurate filling of the missing values.
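The imputation stage described above can be outlined as a loop over the reverse steps. The sketch below is a hedged reconstruction from equations (8), (10), (13), (15), and (16), not the authors' implementation: `zero_noise_model` is a hypothetical stand-in for the trained predictor $ε_{θ}$, the schedule is assumed, and the reverse-step mean is taken from equation (10).

```python
import numpy as np

def impute(x, m, eps_model, betas, rng):
    """Fill the entries of x where m == 0 by iterative denoising.
    x: (d,) sample (missing entries may be NaN); m: (d,) mask with 1 = observed.
    eps_model(x_t, t, x_c, confidence) returns a noise estimate of shape (d,)."""
    T = len(betas)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    observed = m == 1

    x_t = rng.standard_normal(x.shape)                            # noise sample x_T
    x_c = np.where(observed, x, rng.standard_normal(x.shape))     # equation (15)
    for t in range(T, 0, -1):
        confidence = m + (T - t) / T * (1.0 - m)                  # equation (13)
        eps_hat = eps_model(x_t, t, x_c, confidence)
        mean = (x_t - betas[t - 1] / np.sqrt(1.0 - alpha_bars[t - 1]) * eps_hat) \
               / np.sqrt(alphas[t - 1])                           # equation (10)
        if t > 1:
            var = (1.0 - alpha_bars[t - 2]) / (1.0 - alpha_bars[t - 1]) * betas[t - 1]  # equation (8)
            x_t = mean + np.sqrt(var) * rng.standard_normal(x.shape)
        else:
            x_t = mean                                            # no noise at the final step
        x_c = np.where(observed, x, x_t)                          # equation (16)
    return np.where(observed, x, x_t)

def zero_noise_model(x_t, t, x_c, confidence):
    """Hypothetical placeholder for the trained network: predicts no noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)                               # assumed schedule
x = np.array([0.2, np.nan, 0.7, np.nan])
m = np.array([1.0, 0.0, 1.0, 0.0])
x_imputed = impute(x, m, zero_noise_model, betas, rng)
```

With the placeholder model the filled values are meaningless; the point of the sketch is the control flow, in which observed entries are never altered and the condition sample is refreshed with the latest denoised estimate at every step.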

5. Experiments

In this section, we perform an extensive set of experiments to assess the effectiveness of our proposed method. Initially, we evaluate multiple aspects of CDMVI using the USPS dataset. We qualitatively showcase the behavior of CDMVI across diverse missing patterns and architectural variations. Subsequently, we compare CDMVI against various baseline methods in the context of missing data imputation. This comparative analysis is conducted across six datasets, encompassing a spectrum of missing settings.

5.1. Datasets

In this paper, we evaluated CDMVI on six datasets that are widely used in machine learning and data science. These datasets have different characteristics and uses, covering different fields and problems. For all datasets, the feature value range of each sample was rescaled to $[0, 1]$ . Here is a detailed introduction to each dataset:

MNIST dataset: MNIST is a large dataset containing handwritten digits that are commonly used to train image processing systems. The dataset contains 60,000 training samples and 10,000 test samples, each of which is a $28 \times 28$ pixel handwritten digit image, all from 10 categories. Due to its simplicity and popularity, MNIST has become one of the classic datasets in the field of machine learning and deep learning.

Fashion-MNIST dataset: Fashion-MNIST is an image dataset designed as a replacement for the MNIST handwritten digit set. It contains 70,000 front-view product images from 10 categories, with 7000 $28 \times 28$ pixel grayscale images per category. This dataset can be used for image classification, image recognition, and other tasks, and is fully consistent with the size, format, and training/testing set division of MNIST.

USPS dataset: The USPS dataset is a handwritten digit image dataset that contains a large number of handwritten digit images and corresponding labels. Each image in the USPS dataset is a $16 \times 16$ pixel grayscale image that has been standardized and centered for processing and analysis by machine learning models.

First-order dataset: Given a theorem, the task in the first-order dataset is to predict which of five heuristics will give the fastest proof when used by a first-order prover. A sixth class corresponds to declining to attempt the proof when the theorem is too difficult. The dataset contains 6120 samples, each with 51 features, and all samples can be divided into six categories.

Satellite image dataset: The dataset consists of multispectral values of pixels in a $3 \times 3$ neighborhood and the classification of the center pixel in each neighborhood. The goal is to predict this classification based on the multispectral values. In the sample database, the categories of pixels are encoded as numbers. The interpretation of remote sensing scenes can be of great significance through the comprehensive interpretation of different types of spatial data and data with different resolutions, including multispectral and radar data, topographic maps, land use maps, etc. The dataset contains 6430 samples, each with 36 features, from six categories.

Optdigits dataset: The optdigits dataset contains 5620 samples, each of which contains 64 features, from 10 categories in total.

5.2. Comparison With Related Methods

Mean: The basic idea of this method is to use the average of other values in the dataset to fill in missing values. Specifically, for a set of samples containing missing values, calculate the average of each feature excluding the missing values, and replace the missing part with the average value corresponding to each feature as the fill value.

MICE (Zhang, 2016): It is constructed based on multiple imputation methods. For a variable with missing values, data from other variables are used to fit the variable, and the fitted prediction values are used to fill in the missing values of the variable.

MIDA (Gondara & Wang, 2018): A multiple-imputation framework based on a fully supervised denoising autoencoder model, in which multiple predictions are simulated by initializing the model with a different set of random weights at each run.

kNN (Zhang, 2012): It first finds the k nearest neighbors of each case based on the similarity between cases, and then fills in missing values by applying a function (such as the mean, median, or mode) to the values in these nearest neighbor cases.

Autoencoder (AE; Ng et al., 2011): During the encoding stage, the AE encodes the input data into a low-dimensional representation, often learning effective features of the data. During the decoding stage, the AE decodes the low-dimensional representation back into the original data space, generating a complete output with no missing values.

GAIN (Yoon et al., 2018): In GAIN, there is a generator and a discriminator. The generator is responsible for generating new data samples, while the discriminator is responsible for determining whether these generated data samples are similar to real data samples. First, the generator generates a fill-in value for each missing value. Then, the discriminator judges whether this fill-in value is similar to the true value. If it is similar, then this fill-in value will be retained; if not, it will be returned to the generator for adjustment.

Missing data importance weighted autoencoder (MIWAE; Mattei & Frellsen, 2019): The MIWAE model is a generalized version of the IWAE. IWAE is a generative model with the same architecture as the VAE, which introduces an importance-weighted strategy to optimize the objective function of the VAE. In IWAE, the encoder model uses multiple samples to approximate the posterior, which is more flexible for modeling complex posteriors. Unlike IWAE, the objective function of MIWAE only focuses on the observed part with single-value filling. Finally, the missing values of the original matrix $X$ are predicted by the trained MIWAE.

5.3. Evaluation Metrics

We compare the root mean square error (RMSE), aggregated over all attributes of the test set, which is calculated as

\begin{aligned} RMSE = \sqrt{E (\sum_{i = 1}^{n} \sum_{j = 1}^{m} {(x_{i j} - {\hat{x}}_{i j})}^{2})} \end{aligned}

(17)

Here, $m$ is the number of features and $n$ is the number of samples. $x_{i j}$ is the true value of the feature and ${\hat{x}}_{i j}$ is the value that the feature is filled with.
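A small computation illustrates the metric. Equation (17) does not state which entries the expectation runs over; the sketch below assumes it is the mean over the entries that were actually imputed (mask value 0), which is a common convention but is an assumption here. The function name is hypothetical.

```python
import numpy as np

def imputation_rmse(X_true, X_imputed, M):
    """RMSE between true and imputed values, taken over missing entries (M == 0)."""
    missing = M == 0
    return float(np.sqrt(np.mean((X_true[missing] - X_imputed[missing]) ** 2)))

X_true = np.array([[1.0, 2.0], [3.0, 4.0]])
X_imputed = np.array([[1.0, 4.0], [0.0, 4.0]])
M = np.array([[1, 0], [0, 1]])        # two entries were missing and then filled

rmse = imputation_rmse(X_true, X_imputed, M)
```

In this toy case the two imputed entries have errors of 2 and 3, so the value is $\sqrt{(4 + 9) / 2} = \sqrt{6.5}$.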

5.4. Experimental Setting

We conducted a total of 10 experiments, each comprising five cross-validation trials, wherein we employed RMSE as the performance metric. To illustrate the impact of filled features in machine learning tasks, we further conducted classification tasks on the samples with filled features. This involved utilizing supervised training classification models to assess classification accuracy. Default settings from respective papers were adhered to for all baseline methods. The datasets were partitioned into an 80% training subset and a 20% test subset. For missing value-filling algorithms necessitating training (e.g., GAIN, MIWAE, and CDMVI), the model underwent training on the training set. Subsequently, values of features in the test set were randomly deleted, and the trained model was then validated on the test set. In contrast, algorithms not requiring training (e.g., Mean, MICE, and kNN) had the training set and test set concatenated as input. During training, data feature values were scaled to $[0, 1]$ using the MinMax scaler. Subsequently, training subsets of incomplete datasets were simulated. Specifically, in the simulation based on the MCAR deletion mechanism, we randomly deleted feature values from the dataset at rates of 30%, 40%, 50%, 60%, 70%, and 80% for each training subset. Additionally, we designed distinct feature deletion mechanisms for image datasets, such as USPS, incorporating entire block deletions at image edges and centers. It is crucial to note that all experiments were conducted in Python, utilizing PyTorch 1.11.0, and executed on an NVIDIA GeForce RTX 3090 GPU.
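The block deletions for image datasets can be pictured with simple masks. The sketch below is only one possible reading of "entire block deletions at image edges and centers": the block sizes, the choice of the top edge, and the function names are hypothetical, since the section does not specify them. Masks are flattened so that they align with a $16 \times 16$ USPS image stored as a 256-dimensional feature vector.

```python
import numpy as np

def center_block_mask(height, width, block):
    """Mask with a block x block square of missing pixels (0) at the image center."""
    M = np.ones((height, width), dtype=int)
    top, left = (height - block) // 2, (width - block) // 2
    M[top:top + block, left:left + block] = 0
    return M.reshape(-1)

def edge_block_mask(height, width, rows):
    """Mask with the top `rows` rows of the image missing (0)."""
    M = np.ones((height, width), dtype=int)
    M[:rows, :] = 0
    return M.reshape(-1)

m_center = center_block_mask(16, 16, block=8)   # 64 of 256 pixels missing
m_edge = edge_block_mask(16, 16, rows=4)        # 64 of 256 pixels missing
```

Unlike MCAR deletion, these masks remove spatially contiguous regions, so an imputation method cannot rely on immediately neighboring observed pixels inside the block.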

5.5. RMSE Results

We employed six real datasets sourced from the UCI machine learning repository to quantitatively assess the imputation accuracy of our method. Tables 2 to 7 present the RMSE values of CDMVI alongside seven baseline methods, with the optimal outcomes highlighted in bold. Across the six tables, performance generally declines as the missing rate increases. Notably, our method outperformed the other approaches in the majority of cases. Specifically, averaged over the missing rates, the RMSE values of our method on the six datasets were 0.034, 0.125, 0.070, 0.058, 0.114, and 0.178, respectively. Compared with the best-performing baseline on each dataset, our approach reduced the average RMSE by 0.078, 0.016, 0.007, 0.003, 0.003, and 0.014; compared with the least effective baseline, the reductions were 0.284, 0.133, 0.087, 0.216, 0.184, and 0.122 on the respective datasets. Several factors contribute to the superior performance of our method. First, diffusion-based methods effectively learn data distributions and generate values that align with the distribution of the missing parts. Second, observable feature values provide conditional guidance for the diffusion process, enhancing the accuracy of the values generated for missing data. Third, our method uses the numerical values of the missing parts obtained at each denoising step as additional information.

Table 2.

RMSE on USPS Dataset for Different Missing Rates.

	USPS
Rate	0.3	0.4	0.5	0.6	0.7	0.8
Mean	0.272	0.271	0.271	0.271	0.271	0.271
MICE	0.050	0.081	0.118	0.168	0.227	0.275
MIDA	0.162	0.203	0.240	0.276	0.311	0.343
kNN	0.104	0.105	0.107	0.112	0.133	0.233
AE	0.090	0.096	0.103	0.113	0.126	0.143
GAIN	0.329	0.328	0.327	0.327	0.326	0.271
MIWAE	0.218	0.222	0.230	0.235	0.243	0.261
Our	0.019	0.020	0.022	0.028	0.042	0.070

Note. RMSE = root mean square error; MICE = multivariate imputation by chained equations; MIDA = multiple imputation using denoising autoencoders; kNN = k nearest neighbor; AE = autoencoder; GAIN = generative adversarial imputation nets; MIWAE = missing data importance weighted AE.
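The USPS summary figures quoted in the text can be checked directly from Table 2. The short script below averages each method over the six missing rates and measures the gap between our method and the best and worst baselines. The values are copied from Table 2; the variable names are illustrative.

```python
# RMSE values from Table 2 (USPS), one entry per missing rate 0.3 to 0.8.
table2 = {
    "Mean":  [0.272, 0.271, 0.271, 0.271, 0.271, 0.271],
    "MICE":  [0.050, 0.081, 0.118, 0.168, 0.227, 0.275],
    "MIDA":  [0.162, 0.203, 0.240, 0.276, 0.311, 0.343],
    "kNN":   [0.104, 0.105, 0.107, 0.112, 0.133, 0.233],
    "AE":    [0.090, 0.096, 0.103, 0.113, 0.126, 0.143],
    "GAIN":  [0.329, 0.328, 0.327, 0.327, 0.326, 0.271],
    "MIWAE": [0.218, 0.222, 0.230, 0.235, 0.243, 0.261],
    "Our":   [0.019, 0.020, 0.022, 0.028, 0.042, 0.070],
}

# Average RMSE of each method across the six missing rates.
averages = {name: sum(vals) / len(vals) for name, vals in table2.items()}

# Compare our method with the best and worst baseline by average RMSE.
baselines = {name: avg for name, avg in averages.items() if name != "Our"}
best = min(baselines, key=baselines.get)
worst = max(baselines, key=baselines.get)
gap_to_best = baselines[best] - averages["Our"]
gap_to_worst = baselines[worst] - averages["Our"]

print(f"ours: {averages['Our']:.4f}")
print(f"best baseline: {best}, gap {gap_to_best:.4f}")
print(f"worst baseline: {worst}, gap {gap_to_worst:.4f}")
```

On these values the average for our method is 0.0335, the best baseline by average is AE with a gap of about 0.078, and the worst is GAIN with a gap of about 0.2845, in line with the reported 0.034, 0.078, and 0.284.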

Table 3.

RMSE on MNIST Dataset for Different Missing Rates.

	MNIST
Rate	0.3	0.4	0.5	0.6	0.7	0.8
Mean	0.259	0.258	0.258	0.258	0.258	0.258
MICE	0.099	0.104	0.122	0.141	0.170	0.207
MIDA	0.168	0.177	0.201	0.234	0.274	0.317
kNN	0.157	0.159	0.161	0.166	0.179	0.229
AE	0.168	0.169	0.168	0.176	0.175	0.183
GAIN	0.217	0.229	0.257	0.269	0.287	0.291
MIWAE	0.201	0.223	0.244	0.267	0.290	0.295
Our	0.098	0.100	0.108	0.128	0.138	0.178

Table 4.

RMSE on First-Order Dataset for Different Missing Rates.

	First order
Rate	0.3	0.4	0.5	0.6	0.7	0.8
Mean	0.157	0.158	0.155	0.158	0.156	0.157
MICE	0.072	0.078	0.095	0.108	0.124	0.156
MIDA	0.010	0.111	0.128	0.149	0.160	0.192
kNN	0.047	0.055	0.057	0.102	0.128	0.140
AE	0.056	0.054	0.078	0.080	0.090	0.104
GAIN	0.153	0.154	0.155	0.156	0.156	0.157
MIWAE	0.111	0.124	0.130	0.142	0.152	0.152
Our	0.045	0.053	0.056	0.075	0.089	0.103

Table 5.

RMSE on Satimage Dataset for Different Missing Rates.

	Satimage
Rate	0.3	0.4	0.5	0.6	0.7	0.8
Mean	0.191	0.192	0.193	0.192	0.192	0.192
MICE	0.045	0.047	0.054	0.055	0.072	0.109
MIDA	0.180	0.216	0.246	0.285	0.323	0.395
kNN	0.046	0.048	0.071	0.119	0.153	0.149
AE	0.046	0.049	0.052	0.062	0.072	0.083
GAIN	0.190	0.191	0.191	0.193	0.194	0.196
MIWAE	0.062	0.071	0.074	0.085	0.086	0.094
Our	0.045	0.047	0.052	0.054	0.068	0.081

Table 6.

RMSE on Fashion Dataset for Different Missing Rates.

	Fashion
Rate	0.3	0.4	0.5	0.6	0.7	0.8
Mean	0.292	0.292	0.292	0.292	0.292	0.293
MICE	0.092	0.098	0.106	0.114	0.130	0.159
MIDA	0.180	0.222	0.263	0.304	0.321	0.383
kNN	0.138	0.138	0.140	0.142	0.143	0.154
AE	0.143	0.148	0.149	0.152	0.158	0.163
GAIN	0.292	0.293	0.297	0.299	0.301	0.303
MIWAE	0.283	0.284	0.289	0.293	0.297	0.299
Our	0.090	0.094	0.105	0.113	0.129	0.151

Table 7.

RMSE on Optdigits Dataset for Different Missing Rates.

	Optdigits
Rate	0.3	0.4	0.5	0.6	0.7	0.8
Mean	0.274	0.273	0.274	0.274	0.275	0.275
MICE	0.158	0.167	0.178	0.194	0.216	0.239
MIDA	0.243	0.265	0.291	0.321	0.336	0.344
kNN	0.138	0.151	0.186	0.268	0.293	0.291
AE	0.162	0.172	0.184	0.198	0.218	0.234
GAIN	0.272	0.272	0.272	0.274	0.277	0.279
MIWAE	0.222	0.234	0.274	0.264	0.278	0.304
Our	0.134	0.143	0.157	0.179	0.212	0.243

5.6. Classification of Imputed Data

In addition to calculating the RMSE of the filled values, we used the filled samples in a downstream task to show the effectiveness of the imputed values more directly. We selected classification as the downstream task. All six datasets chosen for this analysis have ground truth labels for classification. The classification results were obtained and compared as follows. First, we simulated missing values of the MCAR type with varying missing rates. Next, each imputation method was used to fill the datasets with missing values. Complete samples were then used as the training set to train a classifier, and the filled samples were used for testing and reporting the classification accuracy. Two additional baselines were included for reference: the classification accuracy on the real (complete) samples, which serves as a supervised benchmark, and filling the missing parts with zeros. This sampling process was repeated 10 times to allow statistical analysis of the imputation methods.
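The evaluation loop above can be sketched as follows. The paper does not name the classifier, so the nearest-centroid classifier, the synthetic two-class data, and every function name here are assumptions chosen to keep the example self-contained. Only the zero-filling and mean-filling baselines and the complete-sample reference come from the text.

```python
import numpy as np


def fit_centroids(x, y):
    """Nearest-centroid classifier (a stand-in for the unnamed classifier)."""
    classes = np.unique(y)
    centroids = np.stack([x[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(x, classes, centroids):
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]


def zero_fill(x, mask):
    return np.where(mask, 0.0, x)


def make_mean_fill(x_train):
    col_means = x_train.mean(axis=0)
    return lambda x, mask: np.where(mask, col_means, x)


def evaluate(impute, model, x_test, y_test, missing_rate, rng):
    """Delete test values under MCAR, fill them, and score the classifier."""
    mask = rng.random(x_test.shape) < missing_rate
    x_filled = impute(x_test, mask)
    return float((predict(x_filled, *model) == y_test).mean())


rng = np.random.default_rng(0)
n, d = 200, 10
# Synthetic placeholder data: two well-separated classes in [0, 1]-like range.
x = np.vstack([rng.normal(0.3, 0.1, (n, d)), rng.normal(0.7, 0.1, (n, d))])
y = np.repeat([0, 1], n)
order = rng.permutation(2 * n)
cut = int(0.8 * 2 * n)
x_train, y_train = x[order[:cut]], y[order[:cut]]
x_test, y_test = x[order[cut:]], y[order[cut:]]

model = fit_centroids(x_train, y_train)  # trained on complete samples
imputers = {
    "complete": lambda x, mask: x,  # reference: real samples, nothing filled
    "zero": zero_fill,
    "mean": make_mean_fill(x_train),
}
# Repeat the sampling 10 times and average, as in the text.
results = {
    name: float(np.mean([evaluate(f, model, x_test, y_test, 0.5, rng)
                         for _ in range(10)]))
    for name, f in imputers.items()
}
```

Any imputation method with the same `impute(x, mask)` signature can be added to `imputers` and compared against the two references.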

Figure 2 illustrates the performance of all missing value imputation methods on classification tasks. As expected, the classification accuracy on the complete dataset without missing values is the highest. Our method performs slightly below that reference but outperforms all other compared methods. Furthermore, as the missing rate increases, the downstream classification accuracy of all imputation methods declines. Specifically, our method achieved average classification accuracies of 0.940, 0.910, 0.492, 0.868, 0.806, and 0.869 on the six datasets, respectively. Compared with the classification accuracy on the complete samples, our method is lower by 0.005, 0.018, 0.059, 0.027, 0.010, and 0.101, respectively. Compared with the worst-performing method, our method is higher by 0.513, 0.260, 0.118, 0.606, 0.270, and 0.329, respectively.

Additionally, some methods perform worse than zero filling on certain datasets, such as the USPS, MNIST, and Fashion datasets. This suggests that the values these methods fill into the missing positions not only fail to improve classification performance but actually reduce the samples' utility. Upon further investigation, we noted that the datasets exhibiting this phenomenon are all image data, leading us to speculate that incorrect filling values may distort the extraction of image semantics and hinder the model's understanding of the image.

5.7. Qualitative Analysis

To visually illustrate the accuracy of our method in imputing missing values, we conducted a qualitative analysis of the visual outcomes of the various techniques on the USPS dataset. We introduced three distinct types of missing values: completely random missing, bar-shaped missing, and center missing. The filling results of all methods are compared in Figure 3. Visual inspection shows that our method fills in missing values more accurately than the other methods in all three scenarios, and its outputs closely approach the original images. This demonstrates that our approach generates visually clear and accurate imputations under diverse missing data patterns.
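The three missing patterns can be expressed as boolean masks over a 16 x 16 USPS image, as in the sketch below. The bar orientation, bar height, and block size are assumptions, since the text does not give them, and the function names are illustrative. Replacing the masked pixels with random noise follows the imputation stage described in the abstract.

```python
import numpy as np


def random_mask(shape, rate, rng):
    """Completely random missing: each pixel is dropped independently."""
    return rng.random(shape) < rate


def bar_mask(shape, start_row, height):
    """Bar-shaped missing: a horizontal band of rows (orientation assumed)."""
    mask = np.zeros(shape, dtype=bool)
    mask[start_row:start_row + height, :] = True
    return mask


def center_mask(shape, block):
    """Center missing: a square block in the middle of the image."""
    mask = np.zeros(shape, dtype=bool)
    top = (shape[0] - block) // 2
    left = (shape[1] - block) // 2
    mask[top:top + block, left:left + block] = True
    return mask


def fill_with_noise(image, mask, rng):
    """Replace missing pixels with Gaussian noise before imputation starts."""
    return np.where(mask, rng.standard_normal(image.shape), image)


rng = np.random.default_rng(0)
image = rng.random((16, 16))  # placeholder for a scaled USPS digit
patterns = {
    "random": random_mask(image.shape, 0.5, rng),
    "bar": bar_mask(image.shape, start_row=6, height=4),
    "center": center_mask(image.shape, block=8),
}
noisy_inputs = {name: fill_with_noise(image, m, rng) for name, m in patterns.items()}
```

Observed pixels pass through unchanged, so only the masked region differs between the three inputs.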

Figure 3.

Filling results of all methods on the USPS dataset under the three types of missing values.

6. Conclusion and Future Works

We introduce a missing value imputation method, termed CDMVI. In this approach, we incorporate conditional guidance into the training of the diffusion denoising model to judiciously leverage the samples acquired at each denoising step. Extensive numerical simulations have been conducted, demonstrating the robust imputation performance of our method. Moreover, our results indicate that CDMVI achieves competitive performance compared to other well-established imputation methods. Future research directions may explore the design of alternative training strategies for the diffusion model to streamline the denoising process and enhance imputation accuracy for missing data segments.

Footnotes

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.


