Abstract
Low response rates due to unit nonresponse have long been a pervasive problem in survey-based empirical research, and calibration is a popular method to adjust for the resulting bias. Typically, external information on the true population margins of some calibration variables is available, and sometimes on higher-order interactions as well. Weighting algorithms adjust the sample to these external benchmarks. It is generally assumed that even if the underlying missingness mechanism of the unit nonresponse is non-ignorable, weighting will at least alleviate the severity of the bias. We discuss data situations where weighting under a missing at random (MAR) assumption adjusts the sample correctly but still increases the bias for the analysis model, and we describe strategies for identifying auxiliary variables that are less susceptible to these unwanted effects.
1. Introduction
Weighting is fundamental to survey research. It allows researchers to address heterogeneous inclusion probabilities resulting from both the survey design and unit nonresponse. If all heterogeneity can be attributed to the survey design, the weights are fixed, known quantities. However, calibration weights for unit nonresponse are unknown and must be treated as estimates (see e.g., Little and Vartivarian 2003).
The problem of unit nonresponse has been recognized for decades (see e.g., Hansen and Dedrick 1938; Hansen and Hurwitz 1946), and comprehensive overviews are provided by, for example, Groves et al. (2002), Särndal and Lundström (2005), Groves (2006), Särndal (2007), Groves and Peytcheva (2008), Bethlehem (2011). Weighting methods to mitigate unit-nonresponse bias date back at least to iterative proportional fitting (Deming and Stephan 1940) and have been continuously refined over the years (see e.g., Brick 2013; Lundström and Särndal 1999). Nevertheless, the underlying principle has remained the same: calibrating the sample so that the means of adjustment cells match external information from the full sample or population. While Kott and Chang (2010) compare different strategies for calibrating unit-nonresponse bias, practical recommendations remain scarce in the literature, even though the problem has intensified due to ever-decreasing response rates (see e.g., Curtin et al. 2005).
Adjusting for unit nonresponse relies on a corresponding nonresponse model, which itself is based on rather strong assumptions. One assumption frequently made (either explicitly or implicitly) to handle missing data is ignorability (see e.g., Little and Rubin 2019). If, for instance, we observe a skewed age distribution in our sample, we typically calibrate the sample to the age distribution of the target population. In doing so, we implicitly assume that “Age” is the cause of the unit nonresponse and that conditioning on age removes any associated bias. In contrast, non-ignorability describes situations where it is impossible to fully eliminate nonresponse bias using the available information. In such cases, we might still observe a skewed age distribution, even though the (unknown) cause of nonresponse is merely correlated with it. These assumptions and their underlying mechanisms were first described by Rubin (1976) in the context of item nonresponse, but it took several decades for the concept to be generally adopted for both item and unit nonresponse problems.
Since ignorability cannot be disproved by the observed data, it often seems plausible to rely on this assumption even if it is unlikely to hold. This approach appears intuitively correct: if the factors causing nonresponse are correlated with the variables used in the nonresponse model, then controlling for these variables should, in theory, reduce estimator bias. Several publications support this strategy (see e.g., Deville 2000; Earp et al. 2012; Kott and Liao 2017), although Oh and Scheuren (1983) and Bethlehem (1988) had already noted that weighting can occasionally increase bias. Furthermore, Kreuter and Olson (2011) demonstrated in a simulation study that the choice of predictors significantly impacts the degree of bias reduction, with the potential for bias to actually increase.
In this paper, we investigate the validity of the ignorability assumption in the context of calibration. We introduce data scenarios where, contrary to intuition, bias becomes more severe. Finally, we propose strategies for identifying calibration variables that increase rather than reduce nonresponse bias.
The remainder of this paper is structured as follows: Section 2 describes the underlying assumptions for missing-data mechanisms and their application to our framework, which is detailed in Section 3. Section 4 provides simulation-based demonstrations and develops practical recommendations. The paper concludes with a brief summary of our findings in Section 5. Corresponding proofs and derivations can be found in the Appendix.
2. Missing Data Mechanisms for Unit Nonresponse
2.1. Historical Background
Rubin (1976) originally described missing data mechanisms from the perspective of item nonresponse, primarily because multiple imputation was developed to alleviate problems associated with item nonresponse. In the original notation, let
In contrast, missing not at random (MNAR) means that
2.2. (Non-)Ignorability in the Context of Sampling
Andridge et al. (2019) introduce a slight modification of MAR for unit nonresponse in survey sampling, where “missing” is replaced by “selection.” For instance, any calibration model implicitly assumes selection at random (SAR) to produce unbiased point estimators. To illustrate this point, we slightly modify Rubin’s original notation: while Rubin described data using a multivariate variable Y, we introduce a completely observed auxiliary variable X and hereafter refer to Y as the (univariate or multivariate) analysis variable.
A peculiarity of unit nonresponse problems is that they are not immediately recognizable as missing-data problems. A typical empirical survey-based dataset does not contain statistical units whose entire set of variables is missing (e.g., due to failed contact or refusal) such that their values are explicitly set to “NA” or another missing data indicator. Instead, these units are omitted entirely from the released data. The remaining sample of respondents is then often calibrated using external aggregated information, such as population margins from a census. Over time, this common practice has implicitly embedded the ignorability assumption into the calibration literature (see e.g., Brick 2013).
2.3. Missing Data Mechanisms Based on Unobserved Variables
As stated in Section 2.2, the original definition of missing-data mechanisms focused on information that was (at least theoretically) available within the data itself. In our research, we take this notion one step further by introducing a scenario in which missingness is governed by unobserved variables. While such a scenario is particularly plausible in unit nonresponse, it is also theoretically possible in item nonresponse settings. For instance, a common textbook example of MNAR is missing income values, where high (or low) income increases the probability of nonresponse. Although the underlying causes for missing values are rarely discussed in detail, one could imagine a scenario where the actual reason for missing income data is a latent trait such as “cautiousness.” Cautiousness might not be captured by the survey but is nonetheless related to income.
We introduce such a situation into our unit nonresponse framework, where the cause for (non-)participation lies outside the survey; we denote this unobserved variable as Z. We further denote the analysis variable as Y and the auxiliary (calibration) variable as X. The nonresponse-governing variable Z is not part of the survey; therefore, only X and Y are included in the indicator matrix R. The resulting dataset contains only
Calibration can only be based on X, since Z is not measured. Any calibration method assumes ignorability—that is, we assume
Moreover, calibration has the potential to introduce bias that would not have existed otherwise, as initially noted by Oh and Scheuren (1983). We aim to investigate the underlying statistical mechanisms to mitigate the risk of increasing bias in practical applications. To illustrate the problem, consider the following example: the unobserved variable Z is correlated with the calibration variable X, but not with the analysis variable Y. Specifically, imagine that “Headache” (Z) governs the propensity to participate in a survey. It is positively correlated with “Age” (X) but marginally uncorrelated with “Income” (Y). In this case, an analysis relying solely on the marginal distribution of Y would yield an unbiased estimator. However, since the propensity to have a headache is positively correlated with age, the marginal distribution of age in the sample will be skewed (as the probability of missingness increases with age). While it is tempting to calibrate by age groups, if “Age” and “Income” are not independent, this calibration will introduce bias that was not previously present. Although this may seem counterintuitive, it is mathematically entirely possible.
We can take this example further by demonstrating that calibration can also exacerbate existing bias. Assume that “Headache” (Z) is negatively correlated with “Income” (Y), while “Age” (X) and “Headache” remain positively correlated, as do “Age” and “Income.” In the uncalibrated sample, “Income” is overestimated due to its negative correlation with “Headache” (wealthier people with fewer headaches are overrepresented). Conversely, “Age” is underestimated because older people (who have more headaches) are underrepresented. Calibrating for “Age” will therefore further increase the overestimation of “Income,” as the wealthier older participants receive even higher weights. We emphasize that although the weighting is performed correctly according to standard procedures, the incorrect ignorability assumption leads to an increase in bias.
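The two “Headache” examples can be reproduced in a small simulation. The following is a minimal sketch rather than the model developed in Section 3: it assumes that X, Y, and Z are standardized and jointly normal, that the response probability follows a logistic function that decreases with Z, and that calibration uses simple linear weights. The correlation values and the function name `headache_bias` are illustrative choices, not taken from the paper.

```python
import numpy as np

def headache_bias(r_xy, r_xz, r_yz, n=200_000, seed=1):
    """Return (unweighted, weighted) bias of the respondent mean of Y."""
    rng = np.random.default_rng(seed)
    corr = np.array([[1.0, r_xy, r_xz],
                     [r_xy, 1.0, r_yz],
                     [r_xz, r_yz, 1.0]])
    # Columns: X ("Age"), Y ("Income"), Z ("Headache"), all standardized.
    x, y, z = rng.multivariate_normal(np.zeros(3), corr, size=n).T
    # Assumed response model: the more headache, the less likely a response.
    responds = rng.random(n) < 1.0 / (1.0 + np.exp(z))
    xr, yr = x[responds], y[responds]
    # Linear calibration: weights 1 + lam * (x - mean) reproduce the
    # population mean of X among the respondents.
    lam = (x.mean() - xr.mean()) / xr.var()
    w = 1.0 + lam * (xr - xr.mean())
    return yr.mean() - y.mean(), np.average(yr, weights=w) - y.mean()

# Example 1: Z is uncorrelated with Y, so only calibration can create bias.
first = headache_bias(r_xy=0.5, r_xz=0.5, r_yz=0.0)
# Example 2: Z is negatively correlated with Y, so bias exists before weighting.
second = headache_bias(r_xy=0.5, r_xz=0.5, r_yz=-0.3)
print("example 1 (unweighted, weighted):", first)
print("example 2 (unweighted, weighted):", second)
```

Under these assumptions, the first call returns an unweighted bias close to zero and a clearly positive weighted bias, while the second call returns a positive bias that grows after calibration.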
Using these “Headache” examples as a starting point, this paper derives a formal description of nonresponse and weighting to understand the conditions under which calibration fails. This requires an appropriate description of data transformations through both nonresponse and weighting. As we shall see, the associations between the three variables X, Y, and Z determine the resulting bias. The respective correlation matrix for
3. A Model of Unit Nonresponse and Weighting
3.1. Latent Population Level, Selection through Nonresponse, and Weighting
In this section, we introduce additional assumptions and constraints regarding the joint distribution
The expected impact of nonresponse on survey data and its subsequent calibration via weighting is captured using a linear model. In a linear framework, the effect of a change in variable A on variable B depends on the absolute change
In its basic form, the model employs three variables

Figure 1. A model describing the mechanisms of nonresponse and weighting.
In general, Z is unobservable and determines R, the response indicator. Y represents any variable of interest, and X is an auxiliary variable for which the true population mean is known. Each of the variables
Selection through nonresponse can be formalized as conditioning population-level variables
A feasible approach to modeling nonresponse is a logistic regression where a higher value of $z_i$ for unit i results in a higher probability of a positive response
In addition to reducing the sample size, nonresponse considerably affects the observed data structure. The multivariate distribution of the survey data
Consequently, survey-based research must account for the bias
For the purpose of describing biases in mean estimators rather than their standard errors, the specific weighting algorithm used is largely irrelevant, provided the weighted mean of the auxiliary variable matches the population target. However, because interval-scaled auxiliary variables provide a more straightforward starting point, simple algorithms such as iterative proportional fitting, which adjusts to categorical margins, are insufficient. Therefore, we assume the application of a GREG weighting algorithm (e.g., Bethlehem and Keller 1987; Deville et al. 1993; Rao and Singh 1997) capable of weighting for mean values (see Appendix B).
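To make the idea of weighting for mean values concrete, the following sketch computes linear calibration weights of the GREG type. It is not necessarily the algorithm given in Appendix B; the function name `greg_weights`, the equal default design weights, and the example ages are assumptions made for illustration.

```python
import numpy as np

def greg_weights(x, target_means, d=None):
    """Linear (GREG-type) calibration weights for known population means.

    x            : (n,) or (n, p) array of auxiliary variables for respondents
    target_means : population mean(s) of the auxiliary variable(s)
    d            : optional design weights (defaults to equal weights)
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n = x.shape[0]
    d = np.ones(n) if d is None else np.asarray(d, dtype=float)
    # The intercept column keeps the sum of the weights equal to the sum of d.
    a = np.column_stack([np.ones(n), x])
    totals = d.sum() * np.concatenate([[1.0], np.atleast_1d(target_means)])
    # Solve for lam so that the calibrated weights reproduce the targets exactly.
    lam = np.linalg.solve(a.T @ (d[:, None] * a), totals - a.T @ d)
    return d * (1.0 + a @ lam)

rng = np.random.default_rng(0)
age = rng.normal(loc=47.0, scale=15.0, size=1_000)   # hypothetical respondent ages
w = greg_weights(age, target_means=50.0)             # hypothetical population mean
print(np.average(age, weights=w))
```

Linear weights of this kind can become negative for extreme values of the auxiliary variable. This does not affect the weighted mean of X, which is the only property the bias argument relies on.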
3.2. Model Outline
Assuming a simple random sample, the model describes how nonresponse bias propagates from Z to X and Y, and how weighting-induced counter-bias further propagates from X to Y. While nonresponse is the root cause of the bias, the model applies to any unintended deviation from equal-probability sampling. Note that the model does not describe the impact of weighting on the precision of estimates; weighting can, however, inflate the variance of Y, as noted by Kish (1987) and Little and Rubin (2019).
The framework demonstrates that weighting can successfully remove all nonresponse bias in Y when the underlying response process is SAR. It also illustrates why the response rate alone cannot predict the extent of bias before weighting. Instead, the combination of selection bias and variance reduction in the unobserved propensity variable Z constitutes the primary cause. These two features of nonresponse are propagated to observable variables via the joint covariance matrix, which fully determines the nonresponse bias in any observed variable.
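The two features of nonresponse named above, the shift in the mean of Z and the reduction of its variance, can be illustrated numerically. The sketch below assumes a standard normal Z and a logistic response model without intercept, so that the response rate is about one half regardless of the slope. The slopes 0.5 and 2.0 and the function name `selection_effect` are illustrative assumptions.

```python
import numpy as np

def selection_effect(slope, n=200_000, seed=2):
    """Return (response rate, mean of Z, variance of Z) among respondents."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)                 # latent propensity variable Z
    p = 1.0 / (1.0 + np.exp(-slope * z))       # assumed logistic response model
    zr = z[rng.random(n) < p]                  # Z values of the respondents
    return zr.size / n, zr.mean(), zr.var()

for slope in (0.5, 2.0):
    rate, mean_z, var_z = selection_effect(slope)
    print(f"slope={slope}: response rate={rate:.3f}, "
          f"mean(Z)={mean_z:.3f}, var(Z)={var_z:.3f}")
```

Both settings yield nearly the same response rate, yet the steeper slope shifts the mean of Z further and reduces its variance more strongly, which is why the response rate alone says little about the bias.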
3.3. Bias in the Variable of Interest Y
In the following, let $\sigma_x$ denote the standard deviation of X and $r_{xy}$ the correlation between X and Y (likewise for other variables). For explanatory purposes (WLOG), we assume that the participation propensity is positively associated with Z. The variance of the response propensity variable Z is reduced by nonresponse because units with higher values of Z are more likely to be selected. Moreover, nonresponse disproportionately affects the lower tail of the Z-distribution, shifting the average upward. Hence,
The parameters δ and
This is a non-trivial result with implications that are not immediately obvious. We first analyze this by assuming SAR conditioned on X in Section 3.4, followed by the SNAR case in Section 3.5.
3.4. Weighting under SAR
The definition of SAR states that nonresponse does not affect Y when the data is controlled for X. That is, Z and Y are independent when conditioned on X, which, in a linear framework, implies
To justify SAR, the auxiliary variable(s) X must fully characterize the selection process. For example, in a landline survey, X would need to capture the probability of an individual being home, their willingness to answer the phone, and the internal household selection process. While sociodemographic variables may capture part of this process, they are unlikely to account for it entirely. Thus, SAR is a very strong and often unrealistic assumption, as it implies that all relevant causal influences on response behavior are captured by the available X.
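The SAR case can be checked in a short simulation. The sketch assumes standardized, jointly normal variables, for which conditional independence of Y and Z given X is equivalent to a vanishing partial correlation, that is, $r_{yz} = r_{xy} r_{xz}$. The logistic response model, the linear calibration weights, the chosen correlations, and both function names are illustrative assumptions.

```python
import numpy as np

def partial_corr_yz_given_x(r_xy, r_xz, r_yz):
    """Partial correlation of Y and Z after controlling for X."""
    return (r_yz - r_xy * r_xz) / np.sqrt((1 - r_xy**2) * (1 - r_xz**2))

def bias_before_after(r_xy, r_xz, r_yz, n=200_000, seed=3):
    """Return (unweighted, weighted) bias of the respondent mean of Y."""
    rng = np.random.default_rng(seed)
    corr = np.array([[1.0, r_xy, r_xz],
                     [r_xy, 1.0, r_yz],
                     [r_xz, r_yz, 1.0]])
    x, y, z = rng.multivariate_normal(np.zeros(3), corr, size=n).T
    # Assumed response model: participation becomes more likely as Z increases.
    responds = rng.random(n) < 1.0 / (1.0 + np.exp(-z))
    xr, yr = x[responds], y[responds]
    # Linear calibration weights that reproduce the population mean of X.
    lam = (x.mean() - xr.mean()) / xr.var()
    w = 1.0 + lam * (xr - xr.mean())
    return yr.mean() - y.mean(), np.average(yr, weights=w) - y.mean()

r_xy, r_xz = 0.5, 0.6
r_yz = r_xy * r_xz            # SAR: Z is related to Y only through X
print("partial correlation:", partial_corr_yz_given_x(r_xy, r_xz, r_yz))
print("bias (unweighted, weighted):", bias_before_after(r_xy, r_xz, r_yz))
```

In this setting the unweighted mean of Y is biased, whereas the weighted mean is approximately unbiased. Any other value of $r_{yz}$ leaves a residual bias after weighting.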
3.5. Weighting under SNAR
Under SNAR, conditional independence no longer holds. Z biases Y without being fully controllable via X, making it impossible to judge the impact of nonresponse and weighting with high precision. To describe the uncertainty under SNAR, we explore the potential space of the total bias
A tipping point is reached when the bias after weighting
This point defines the boundary between a decrease and an increase in bias due to weighting. If $r_{xy}$ and $r_{xz}$ have identical signs, weighting increases the bias of Y if and only if (see Appendix A “Determining the Critical Value for
Conversely, if $r_{xy}$ and $r_{xz}$ have opposite signs, weighting increases the absolute bias of Y if and only if:
As a first approximation, the term
Weighting always increases the bias of Y when the initial bias
Another detrimental mechanism occurs when the counter-bias pushes in the same direction as the initial bias, thereby compounding the total bias. This is akin to the second “Headache” example in Section 2.3. This happens when the product of the correlations is negative (
4. Illustration of Practical Relevance
4.1. Laboratory Examples
Figure 2 illustrates the biases of Y for six scenarios (A–F) with varying parameters

Figure 2. Six scenarios depicting bias before weighting
Bias values are only plotted for those $r_{yz}$ for which the resulting correlation matrix remains positive semidefinite. The shaded regions indicate the ranges where the absolute bias of the weighted mean is smaller than that of the unweighted mean. In the unshaded regions, weighting increases rather than reduces the absolute bias.
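The admissible range of $r_{yz}$ follows from the requirement that the 3×3 correlation matrix be positive semidefinite. For given $r_{xy}$ and $r_{xz}$, this is the standard interval $r_{xy} r_{xz} \pm \sqrt{(1 - r_{xy}^2)(1 - r_{xz}^2)}$. The sketch below computes this interval and verifies it through the eigenvalues of the matrix. The function names and the example values are illustrative and do not correspond to the scenarios in Figure 2.

```python
import numpy as np

def feasible_r_yz(r_xy, r_xz):
    """Interval of r_yz for which the correlation matrix is positive semidefinite."""
    center = r_xy * r_xz
    half_width = np.sqrt((1 - r_xy**2) * (1 - r_xz**2))
    return center - half_width, center + half_width

def is_psd(r_xy, r_xz, r_yz, tol=1e-10):
    """Check positive semidefiniteness via the smallest eigenvalue."""
    corr = np.array([[1.0, r_xy, r_xz],
                     [r_xy, 1.0, r_yz],
                     [r_xz, r_yz, 1.0]])
    return bool(np.linalg.eigvalsh(corr).min() >= -tol)

low, high = feasible_r_yz(0.5, 0.5)
print("feasible r_yz:", low, high)
print("inside the interval:", is_psd(0.5, 0.5, 0.2))
print("outside the interval:", is_psd(0.5, 0.5, -0.6))
```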
All biases are linear in $r_{yz}$, with
Scenarios A to C in Figure 2 demonstrate that the magnitude of bias
Finally, the extreme cases discussed in the previous section are reflected in Scenarios D and E at the leftmost limit of the graphs (the Fréchet-Hoeffding bounds, where δ = 0.408). In Scenario D, the weighted bias at $r_{yz} = 0.5$ is approximately two-thirds larger than in the unweighted case. In Scenario E, there is no initial bias at
4.2. Practical Examples
The fact that weighting for an auxiliary variable X can increase the bias of a dependent variable Y is counterintuitive. Previously, the prevailing assumption was that eliminating bias in an auxiliary variable would invariably propagate to any dependent variable, reducing bias there to some degree. To better understand the practical implications of counterproductive weighting, consider two illustrative examples.
First, imagine a landline survey where the response driver Z is characterized by “Openness” and “Accessibility”. The variable of interest, Y (“Gambling”), is largely determined by “Openness” and “Risk-Taking”. The auxiliary variable, X (“Age”), can be viewed as a construct of “Accessibility” minus “Risk-Taking”, as the likelihood of being accessible via landline increases with age, while the propensity for risk-taking generally decreases. Consequently, Y and X are both positively correlated with Z, while Y and X are negatively correlated.
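The landline example can be turned into a small simulation by treating “Openness”, “Accessibility”, and “Risk-Taking” as independent standard normal traits and composing Z, Y, and X as unweighted sums and differences. These distributional choices, the logistic response model, and the linear calibration weights are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
# Independent latent traits (assumed standard normal for illustration).
openness, accessibility, risk = rng.standard_normal((3, n))
z = openness + accessibility      # response driver Z
y = openness + risk               # variable of interest Y ("Gambling")
x = accessibility - risk          # auxiliary variable X ("Age")

r = np.corrcoef([x, y, z])
r_xy, r_xz, r_yz = r[0, 1], r[0, 2], r[1, 2]
print(f"r_xy={r_xy:.2f}, r_xz={r_xz:.2f}, r_yz={r_yz:.2f}")

# Assumed logistic response model driven by Z.
responds = rng.random(n) < 1.0 / (1.0 + np.exp(-z))
xr, yr = x[responds], y[responds]
# Linear calibration weights that reproduce the population mean of X.
lam = (x.mean() - xr.mean()) / xr.var()
w = 1.0 + lam * (xr - xr.mean())
bias_unweighted = yr.mean() - y.mean()
bias_weighted = np.average(yr, weights=w) - y.mean()
print(f"bias of Y: unweighted={bias_unweighted:.3f}, weighted={bias_weighted:.3f}")
```

With this construction, X and Y are each positively correlated with Z and negatively correlated with each other, and calibrating on X moves the mean of Y further away from its population value.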
Second, suppose one aims to estimate the average “Statistical Knowledge” (Y) of scientists. Authors are randomly contacted for an online test, and the researcher controls for X (“Number of Publications per Author”) to ensure “representativeness.” Let the primary driver for participation be Z (“Ambition”), as authors may use the test results to signal their proficiency. Assume that publication frequency is determined by a combination of knowledge and ambition, which are themselves statistically independent. Even if publication frequency is controlled for in the initial sampling, the observed mean
4.3. Continuous versus Dichotomous Variables
Categorical variables are frequently used as auxiliary variables; in many surveys, dichotomous or polytomous variables (e.g., sex, age class, region) dominate. Similarly, the analysis variable Y may be categorical. These cases can be addressed by applying transformations, such as McCall’s (1922) area transformation, to derive a standard normal variable $X_n$ from a dichotomous variable $X_d$. Instead of
4.4. Multiple Auxiliary Variables
In household or person surveys, a large number of auxiliary variables are often employed, with each category of a polytomous variable serving as a distinct dummy variable. Each auxiliary variable can contribute to the total bias
The question of an absolute upper limit for total bias may be secondary to the fact that weighting efficiency typically deteriorates long before reaching theoretical extremes. In conclusion, while bias accumulation is possible, it is limited. It is more likely that a few dominant auxiliary variables—specifically those furthest from their population means—drive the weighting procedure and should be the primary focus of the researcher.
4.5. Evaluating Sampling Quality
Sampling quality is typically defined by potential biases resulting from nonresponse or other deviations from probability sampling. Although the response rate ρ is a widely used and accessible metric, it is not a reliable indicator of bias, as it fails to capture the underlying biasing processes (see Section 3.2). Instead, bias is fully determined by
Theoretically, it is possible to estimate the bias parameters
4.6. Practical Advice
Although Z is unobserved and $r_{yz}$ is not identifiable, one can often deduce whether a specific auxiliary variable X might inflate bias. Researchers can often make reasonable assumptions about the nonresponse process based on expert knowledge of the data collection context. Hence, one may be able to assume the sign or even the expected range of the correlation $r_{yz}$. Since $r_{xy}$ can be approximated by
Two scenarios warrant particular caution: when $r_{yz}$ is assumed to be small but weighting imposes a strong counter-bias, or when the product of the three correlations (
5. Summary and Outlook
In this paper, we investigated the implications of incorrectly assuming ignorability in sample calibration. We demonstrated that the common practice of calibrating auxiliary variables to external benchmarks can be counterproductive in specific data scenarios. Using a parsimonious linear model, we outlined the mechanisms under which weighting becomes problematic and provided guidance on identifying detrimental auxiliary variables.
Despite these risks, we remain proponents of sample calibration, as weighting is an efficient method for incorporating population information. However, researchers must remain cognizant of the underlying missing-data assumptions. As Lohr (2023) emphasizes, over-reliance on nonresponse model assumptions can be hazardous.
Looking forward, employing multiple weighting schemes tailored to different analysis models could provide more flexibility, though this may be impractical for general-purpose datasets. A compromise could involve focusing calibration on variables central to the most common analytical needs. Counterintuitive results in science often serve as a catalyst for deeper understanding. Our findings can be extended to related fields, such as non-probability sampling, where similar assumptions regarding selection mechanisms are required.
Appendix A
Appendix B
Appendix C
Acknowledgements
The authors are grateful for helpful comments from five anonymous reviewers which substantially improved the quality of the paper.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Received: February 6, 2025
Accepted: January 19, 2026
