Abstract
Introducing a web option in interviewer-administered surveys could increase response rates and reduce costs. However, this requires careful assessment of the effects of mixed-mode designs on data quality and key measures, especially across important sociodemographic subgroups. In 2018 and 2020, the Health and Retirement Study (HRS) experimentally introduced web in a sequential mixed-mode design for panelists assigned to the telephone mode. Initial analyses found a limited number of mode effects on key outcome distributions and data quality measures. This paper extends this initial analysis by assessing possible heterogeneity in these effects among sociodemographic subgroups defined by race/ethnicity, sex, and others. We interact mode with each sociodemographic indicator in statistical models for each outcome. Overall, we found limited evidence of heterogeneity in the mode effects, with 3% of the 204 interaction terms we tested emerging as significant. For example, previous work showed that more household roster changes are reported in the web-first group, and we found that this was more pronounced for females and those with some college education. Although some heterogeneity in mode effects was observed across subgroups, the effects were generally too small to cause data quality concerns. We conclude with a discussion of broader considerations for survey researchers.
1. Introduction
Response rates in the U.S. are trending down (Williams and Brick 2018), while data collection costs continue to rise. The longitudinal Health and Retirement Study (HRS) provides an example of this trend, where biennial panel AAPOR RR6 response rates held steady from the 1994 to 2014 waves (ranging from 87% to 93%) before beginning to trend downward to 68% in 2022 (HRS Staff 2025). This puts pressure on the survey research community to develop innovative data collection strategies. One popular strategy is to sequentially introduce the web as a data collection mode to surveys that traditionally rely on costly interviewer-administered modes, such as telephone and in-person. Sequential mixed-mode data collection first invites respondents to participate via the least expensive mode, typically web. Those who do not initially respond are then invited to participate over the telephone or via an in-person interview. While this is an attractive cost-saving measure with the potential to improve sample composition, it is possible that the introduction of a new mode could lead to unintended mode effects on participation, survey data quality, and response distributions (Couper 2011; Kreuter 2013).
While longitudinal studies are generally affected by declining response rates, they are uniquely equipped to implement mixed-mode designs, because they typically collect a variety of contact information, including email addresses and cell phone numbers (Sastry and McGonagle 2022). These additional pieces of contact information can be used to invite panel members to complete web surveys via emails and text messages. Cabrera-Álvarez and Lynn (2024) found that while using text messages as part of a contact strategy did not directly increase response rates in wave 11 of the Understanding Society study, it increased responses among panel members for whom they had no physical or email addresses and those with inconsistent response patterns in previous waves.
Panel members may find a web survey to be a convenient option for participation. When participating via the web, respondents can respond according to their own schedule and at their own pace using as many sessions as needed, rather than having to coordinate a specific time to talk with an interviewer in-person or over the phone. Further, since they were previously empaneled, they may not need an interviewer-based interaction to persuade them to participate. For example, the Age 55 wave of the British National Child Development Survey (NCDS) experimentally introduced a sequential web-telephone design by randomly assigning study members either to the treatment group (sequential web-telephone) or the control (telephone only) group. The treatment group achieved a significantly higher response rate compared to the telephone only control group (83% vs. 78%, respectively; Goodman et al. 2022). Recent work has suggested that the implementation of mixed-mode designs in panel studies like this can maintain sample composition and reduce costs without affecting response rates (Bianchi et al. 2017).
Panel studies are also unique because differential nonresponse due to mode could be experienced in the wave in which the new mode is introduced and in future waves. As discussed above, the ability to participate over the web may meet the immediate convenience of a respondent, but may loosen their ties with the panel and affect future participation. It is possible that administering surveys over the phone would make a panelist more likely to participate in a future wave compared to a web survey. This assumes that the interaction with an interviewer reinforces a stronger connection to the study than a one-sided data collection on the web. An experiment was conducted in wave 5 of the Innovation Panel of Understanding Society (IP-US) that compared subsequent wave participation between a treatment group that received a sequential mixed-mode design (web to face-to-face) and a control group that received the standard face-to-face design. While the wave 6 response rate was significantly lower for the wave 5 treatment cases (52%) compared to the control cases (61%), that effect disappeared over waves 7, 8, and 9 (Gaia 2017), suggesting that any negative impact on future wave participation was not long-lasting.
The use of alternative modes could also affect data quality and response distributions. Potential mode effects that are relevant for the present study in large part stem from the presence or absence of an interviewer. On one hand, having an interviewer present could introduce social desirability bias, which occurs when respondents answer sensitive questions in a more socially acceptable or flattering light (Groves et al. 2009). Previous studies have shown that interviewers are associated with higher levels of socially desirable responding compared to self-administered modes. For example, a recent experiment conducted in the age 55 NCDS found more negative responses to self-rated health and wellness questions in the web-telephone design, when compared to telephone alone (Goodman et al. 2022). Likewise, IP-US found that 7% of their outcomes of interest had significant differences between their sequential web-CAPI assigned group compared to their CAPI-only assigned group (Jäckle 2016). Further, the presence of an interviewer has the potential to lead to motivated underreporting, specifically for household roster information. Motivated underreporting happens when respondents intentionally misreport household information to avoid being identified as eligible for a survey (Tourangeau et al. 2012). For example, Graber et al. (2022) found that respondents provided more personal identifiers in a web screener compared to an interviewer-administered screener.
On the other hand, interviewers have the ability to motivate and encourage respondents to give complete answers to actual survey questions after household screening has been completed, help respondents understand difficult or complex topics, and facilitate the collection of physical measurements. Web surveys have been associated with more non-differentiation compared to the telephone (Bowyer and Rogowski 2017; Fricker et al. 2005). Further, Goodman et al. (2022) and Jäckle et al. (2015) reported higher item-nonresponse rates among the sequential mixed-mode groups compared to the control groups where interviewer administration was used. While mixed-mode strategies may help to increase response rates or reduce costs in panel studies, their effects on responses to individual survey questions also need careful consideration (e.g., Cernat 2014; see also Cernat and Sakshaug 2021).
We add to the large body of literature on mode effects by extending a recent analysis of an experiment conducted by the HRS, which introduced a sequential mixed-mode design in the 2018 wave (Ofstedal et al. 2022). This study found some evidence of mode effects, including increased missing data rates among some topic domains, less precise asset reporting, and an increase in reported family composition changes in the mixed-mode group. However, the differences they detected were modest, and they concluded that the mixed-mode approach did not introduce substantial cause for concern. This previous analysis, similar to the vast majority of analyses of mode effects like those described above, focused on the overall effect of a mixed-mode design on data quality. This is an appropriate first step, but it is important to allow for the possibility of heterogeneity in mode effects across different subgroups of interest to survey researchers. The objective of this paper is to examine possible heterogeneity in the effects of a web-first versus telephone-only protocol on several measures of data quality and HRS survey measures across key sociodemographic subgroups.
Meaningful mode effects could be masked when important subgroups that are affected differently by alternative modes are lumped together for analysis. For example, Aquilino (1994) demonstrated that reports of illicit substance use were higher using paper SAQs compared to interviewer-administered modes for Blacks and Hispanics, but for Whites no mode effects were observed. This could be due to differences in the levels of confidentiality concerns and trust across different racial and ethnic groups. Similar findings were reported by Currivan et al. (2004) in the context of reports of smoking, where much larger differences in reported smoking behavior emerged when comparing self-administered and interviewer-administered modes for minoritized racial and ethnic groups. This likely reflects the larger social desirability bias that can emerge when using interviewer-administered modes in demographic subgroups with higher levels of confidentiality concerns and reduced trust. This motivates the consideration of alternative modes for these subgroups in an effort to increase data quality.
Studying heterogeneous mode effects is especially critical when specific subpopulations are of central interest to the study itself. For example, the HRS oversamples Hispanics and non-Hispanic Blacks to facilitate research on these specific populations in more detail and with greater statistical power. It is possible that a mixed-mode survey design protocol may work well for non-Hispanics, but not as well for Hispanics; this result would be missed in an overall mode effect analysis. Analyses of the type performed in this paper have the potential to inform future adaptive survey designs, where sequential mixed-mode protocols are efficiently deployed to those groups that are expected to provide the highest quality data when subjected to these protocols.
1.1. Research Questions
The initial findings from Ofstedal et al. (2022) motivated us to extend their analysis by assessing possible heterogeneity in these mode effects among sociodemographic subgroups that are important to the HRS. While their results suggest that there are only slight negative effects of introducing the web as a mode option for HRS panel data collection, it is possible that there are heterogeneous mode effects across subgroups that could impact the study’s representativeness. Furthermore, we also wanted to explore the potential impact of the web-first treatment on future wave participation (when in-person interviewer administration would be used, see section 2.1 for details on the HRS design). We therefore have the following research questions (RQ):
RQ1: Is there evidence of significant variability in mode effects across key sociodemographic subgroups?
RQ2: Does assignment to the sequential mixed-mode web-phone protocol affect future in-person wave participation, either overall or differently across subgroups?
While the existing literature provides some theoretical expectations with regard to heterogeneity in the effects of self-administered versus interviewer-administered modes across sociodemographic subgroups, empirical evidence of this heterogeneity (especially across different types of subgroups) is largely lacking. As a result, our analyses are largely exploratory and designed to motivate future research and design considerations in this area. After presenting our results, we conclude with a discussion of the practical implications of our findings for the HRS, and then discuss the broader implications of our work for the survey research field.
2. Methods
2.1. HRS Background
The HRS is a longitudinal study that began in 1992 and collects health and financial information from approximately 20,000 panel members every two years. The HRS is representative of adults over age 50 in the United States, and every six years a new cohort of panelists age 50 to 56 and their spouses is recruited via in-person data collection to ensure the panel’s representativeness. Once recruited, half of the HRS panelists are asked to participate by telephone and the other half in-person. Then, in the following wave, their assigned survey mode is switched. All HRS panelists are asked a core set of survey questions and those assigned to in-person data collection are also asked to provide physical measurements and complete a psychosocial questionnaire. In the 2018 wave, the HRS began experimentally administering the core survey via the web to a random subset of those panelists assigned to the telephone mode who met several additional eligibility criteria, including having internet access, speaking English, and not residing in a nursing home. Given initial promising results in 2018, the experiment was repeated in the 2020 wave. Data collection for the 2020 HRS took place from March 2020 through May 2021 (HRS Staff 2023). The final AAPOR RR6 response rates for the full panel in the 2018 HRS and the 2020 HRS were the same (74%; HRS Staff 2025).
2.2. Experimental Design
In 2018, 3,750 HRS panelists met the eligibility requirements for the experiment. To be eligible, panelists needed to have reported access to the internet in a previous wave, not reside in a nursing home, not have completed their most recent interview in Spanish, and not require a proxy to respond on their behalf. Of these, 60% were randomly assigned to the web-first treatment group, and the other 40% were assigned to the control group. Those assigned to the control group were pursued for a telephone interview using the standard HRS telephone contact protocols, with no option to complete via the web. The standard telephone protocol includes an advance letter with an $80 pre-paid incentive, along with telephone calls, emails, and texts from interviewers to schedule an interview over the phone. Those in the web-first treatment group received an advance letter with the $80 pre-paid incentive that included a URL to access the web-based survey, and if an email address was available, they were sent an email with the survey URL. Periodic email and postal mail reminders were sent to non-responders over the following six to twelve weeks. After this time, those who did not respond on web were contacted by interviewers to schedule a telephone interview.
This experiment was repeated in 2020, with 75% of the 4,625 eligible panelists randomly assigned to the web-first group and 25% assigned to the control group. The web-first and control protocols were similar in 2018 and 2020. Response rates were similar between the web-first and control groups in both 2018 and 2020. The response rate for the 2018 web-first group was 81% (79% of these completed via web), and the response rate for the 2020 web-first group was 87% (65% of these completed via web).
This design provides a strong framework for comparing the web-first protocol that includes telephone follow-up to the telephone-only protocol. We conduct an intent-to-treat (ITT) effect analysis because the treatment and control groups are identifiable in a transparent and credible way due to the random assignment of the experiment. Further, the ITT effect is practically relevant for surveys that allow for mode switching from the web to the telephone mode in the mixed-mode treatment arm. This means that we are evaluating two different data collection protocols, not isolating a “pure” mode effect. Therefore, we define our analytic groups using treatment assignment. The treatment group (those assigned to web with telephone follow-up) includes respondents who complete on the web or over the phone, and the control group (those assigned to telephone only) includes respondents who complete on the telephone.
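As an illustration of the intent-to-treat grouping described above, the following Python sketch groups respondents by assigned protocol rather than by the mode they actually used. The record fields (`assigned`, `mode_used`) and the toy records are hypothetical and are not HRS variables or data.

```python
from collections import Counter

# Hypothetical respondent records (not HRS data): the protocol each case was
# assigned to, and the mode in which the interview was actually completed.
respondents = [
    {"id": 1, "assigned": "web_first", "mode_used": "web"},
    {"id": 2, "assigned": "web_first", "mode_used": "phone"},  # completed by phone after web nonresponse
    {"id": 3, "assigned": "control", "mode_used": "phone"},
    {"id": 4, "assigned": "web_first", "mode_used": "web"},
    {"id": 5, "assigned": "control", "mode_used": "phone"},
]


def itt_groups(records):
    """Count respondents per analytic group, defined by assignment only."""
    return Counter(r["assigned"] for r in records)


def modes_within_group(records, group):
    """Count completion modes among respondents assigned to one group."""
    return Counter(r["mode_used"] for r in records if r["assigned"] == group)


itt = itt_groups(respondents)
web_first_modes = modes_within_group(respondents, "web_first")
print(itt)
print(web_first_modes)
```

In this sketch, a web-first case who completes by telephone stays in the web-first group, which is why the comparison is between two protocols and not between two pure modes.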
In expectation, the composition of our experimental groups should be balanced due to the random assignment of treatment and control. However, we did not achieve a 100% response rate in either group, and it is possible that our resulting analytic samples may not be balanced in terms of highly relevant covariates such as sociodemographic or household characteristics. To ensure that this expectation held, we assessed the sample composition balance between those who completed the survey in our treatment/control groups using chi-square tests of association between the treatment and several key sociodemographic characteristics (sex, race/ethnicity, study cohort, educational attainment and age in 2020), and t-tests for equal group means of two key household characteristics (total assets and number of children and household members); see Table 1. Despite one statistically significant finding related to experimental group and HRS cohort (p-value = .0465), suggesting that those in the War Babies cohort have a slightly higher response rate in the control group and those in the Early and Late Baby Boomer cohorts have a slightly higher response rate in the web-first group, this single result does not suggest that significant imbalances were introduced by differential survey response across the experimental groups.
Table 1. Sample Balance Among Treatment and Control Groups of 2020 Respondents.
We do not include one major substantive domain covered by the HRS, cognitive ability, in this analysis. Cognition measurement can certainly be affected by the mode of survey administration, and we direct readers who are interested in mode effects on cognitive functioning measures collected by the HRS to Domingue et al. (2023), who explore this topic in detail.
2.3. Outcome Measures
To assess the overall impact of the web-first treatment in the 2018 HRS, Ofstedal et al. (2022) identified several domains of outcome measures from the HRS survey, including financial assets, expectations about the future, health measures, and family composition. We extracted the same set of outcomes for this analysis using the 2020 HRS wave data (Health and Retirement Study 2023).
First, we look at eleven yes/no questions about ownership of financial assets and their corresponding follow-up questions about the value of each owned asset; see Table 2. For future expectations, we selected five items that are administered to all respondents that ask about the likelihood of a future event occurring; for example, “On a scale from 0 to 100, what is the percent chance that Congress will change the Medicare program sometime in the next ten years, so that it becomes less generous than now?” The thirty-four physical health items include physician-diagnosed conditions and general health problems. Our disability measures included both physical functioning (PF) limitations and Instrumental Activities of Daily Living (IADL) limitations. The psychological health measures consist of eight items from the Center for Epidemiologic Studies Depression Scale (CES-D; Steffick 2000), an overall life satisfaction item, and the stem question from the Short-Form Composite International Diagnostic Interview (Kessler et al. 1998). Finally, we include measures of family composition by investigating updates made to information for each member of the respondent’s family. This includes name, sex, year of birth, residence, coupleness, and relationship to the respondent for both household members and non-residential children.
Table 2. Selected Survey Items from the 2020 HRS.
This was a brief overview of the survey items included in this analysis. For a more thorough explanation of the rationale and calculation of each data quality and response distribution metric see Ofstedal et al. (2022).
2.4. Measures of Data Quality
We created data quality measures using items from the survey domains outlined in Table 2. To assess data quality, we calculated the following metrics for each experimental group (see column 1 of Table 3):
The percentage of items with missing values in each survey domain, and the percentage of respondents with a missing value for the survey item “Do you have a checking account?” We consider a higher percentage of missing items to be indicative of poorer data quality.
For items where the response is a value, we evaluate the quality of the answer given. For the value of financial assets, if respondents do not provide an exact value, follow-up questions are asked to obtain the range in which the value falls. Therefore, asset value responses can be classified as an exact value, a complete bracket, an incomplete bracket, or no value. We analyze the percentage of responses that fall into each classification. We assume that exact values represent better quality than complete or incomplete brackets, or a missing value. Future expectation responses (that take the form of probabilities between 0 and 100) are classified into three data quality categories: non-focal (values of 1–49, or 51–99), 50, and 0 or 100. In this context, non-focal values are thought to be indicative of higher quality (Fischhoff and Bruine de Bruin 1999; Hurd 2009). Again, the metric of interest is the percentage of responses that fall into each category among the six future expectation items.
The mean number of discrepancies between the number of physical conditions reported this wave versus a previous wave.
For family composition, we count the number of changes (additions, deletions, corrections) made to rostered household members, non-residential children, and the total of the two. We also created a 0/1 indicator for any changes made to the roster. Rosters are completed for each household, and in some circumstances both respondents from a household are asked to make changes to the roster. In an effort to capture all roster information provided by one or both household respondents, we sum the changes within each household. Therefore, the metrics of interest are the proportion of households with any changes (for household members, non-residential children, or in total) and the mean number of changes (for household members, non-residential children, or in total). Drawing on the literature on satisficing (Fricker et al. 2005; Krosnick 1991) and motivated underreporting (Graber et al. 2022; Tourangeau et al. 2012) as response strategies in surveys, especially those using self-administered modes, we assume that fewer reported roster changes and health conditions are indicative of lower data quality.
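The three-category classification of future expectation responses described in the list above can be sketched in a few lines of Python. This is a minimal illustration that assumes integer percent-chance responses with missing values already excluded; the function names and the example responses are hypothetical and do not come from the HRS data.

```python
from collections import Counter

CATEGORIES = ("non-focal", "50", "0 or 100")


def classify_expectation(value):
    """Classify one 0-100 percent-chance response into a data quality category."""
    if value in (0, 100):
        return "0 or 100"
    if value == 50:
        return "50"
    if 1 <= value <= 49 or 51 <= value <= 99:
        return "non-focal"
    raise ValueError(f"expected an integer percent chance from 0 to 100, got {value!r}")


def category_percentages(values):
    """Percentage of (non-missing) responses falling into each category."""
    counts = Counter(classify_expectation(v) for v in values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no responses to classify")
    return {cat: 100.0 * counts[cat] / total for cat in CATEGORIES}


# Hypothetical responses pooled across expectation items for one group.
responses = [0, 25, 50, 50, 75, 100, 10, 90]
percentages = category_percentages(responses)
print(percentages)
```

Computing these percentages separately for the web-first and control groups gives the quantities that are compared across protocols.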
Table 3. Measures of Data Quality and Response Distributions.
2.5. Comparisons of Response Distributions
In addition to data quality, we also compare distributions of survey responses within the domains of interest (see column 2 of Table 3). These include the percent that respond “Yes” to the financial asset question “Do you have a checking/savings account?”, mean responses for each future expectation, the mean number of endorsed items in each health domain, mean self-rated health, and overall mean life satisfaction. Our goal is to compare the response distributions in an exploratory fashion. We do not make any assumptions about the direction of potential mode effects on response distributions.
2.6. Sociodemographic Subgroups
To investigate possible heterogeneity of mode effects, we focus on six sociodemographic factors. These include race and ethnicity (Hispanic, non-Hispanic Black, non-Hispanic White/Other), sex (male, female), HRS cohort (AHEAD/CODA/HRS [born 1918–1941, recruited in 1992, 1993, and 1998], War Babies [born 1942–1947, recruited in 1998], Early Baby Boomers [born 1948–1953, recruited in 2004], Mid Baby Boomers [born 1954–1959, recruited in 2010], Late Baby Boomers [born 1960–1965, recruited in 2016]), education (HS or less, Some College <4-year degree, College 4-year degree, Master’s degree or more), respondent age in 2020 (22–59, 60–69, 70–79, 80+), and cognitive ability, assessed at a previous wave of data collection (normal, impaired, or demented; Langa et al. 2023).
We selected these factors because they are of key importance to the HRS and its research community. Furthermore, because they are stable characteristics (except for cognitive ability, which could decline, but is unlikely to improve), they can be used to target interventions in future waves of data collection if the results suggest that would be useful. As this is an exploratory study, we do not have specific expectations regarding heterogeneity in mode effects for each of these factors. As noted previously, however, a study by Aquilino (1994) found that differences in reports of sensitive behaviors for self-administered versus interviewer-administered modes were larger for ethnic minorities than for Whites. Although the outcomes examined in our paper are not sensitive for the most part, we might expect larger mode differences for Black and Hispanic respondents than for White respondents.
2.7. Statistical Analysis Plan
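To answer RQ1 (Is there evidence of heterogeneity in mode effects across sociodemographic subgroups?), we fit a series of models that included covariates for treatment group, sociodemographic group, and an interaction term between treatment and sociodemographic group. For each outcome $Y$, the model took the general form

$$g\big(E[Y_i]\big) = \beta_0 + \beta_1 T_i + \sum_{k} \beta_{2k} S_{ik} + \sum_{k} \beta_{3k}\, T_i S_{ik} \qquad (1)$$

In this model, $T_i$ indicates assignment of unit $i$ to the web-first treatment group, the $S_{ik}$ are indicators for the categories of the sociodemographic factor (with one reference category omitted), $g(\cdot)$ is a link function appropriate to the outcome type, and the interaction coefficients $\beta_{3k}$ capture heterogeneity in the mode effect across subgroups.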
For RQ2 (Does assignment to the sequential mixed-mode web-phone protocol affect future in-person wave participation, either overall or differently across subgroups?), we calculated the 2022 HRS panel AAPOR RR6 response rates by 2020 experimental group assignment (web-first vs. control group) and used chi-square tests to test for statistical significance. Given our ITT method of analysis, response rates were calculated among all cases in the 2020 experiment regardless of 2020 participation or mode. To test for possible heterogeneity among key sociodemographic subgroups, we employed an analogous method of checking for significant interaction terms using Equation (1) as described for RQ1. We then followed up those results with a stratified analysis comparing 2022 response rates by 2020 treatment group within sociodemographic subgroups, including chi-square tests for significance.
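A chi-square comparison of two response rates, of the kind described above for RQ2, can be sketched as follows. This minimal Python example assumes a Pearson chi-square test without continuity correction on a 2x2 table; the source does not state which variant was used in SAS. The counts are hypothetical and are not HRS data. With one degree of freedom, the p-value is obtained from the complementary error function.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Rows are experimental groups; columns are responded / did not respond.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a nonzero total")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For a chi-square variable with 1 d.f., P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: 790 of 1,000 web-first cases and 770 of 1,000 control
# cases responding in the following wave.
stat, p_value = chi_square_2x2(790, 210, 770, 230)
print(f"chi-sq = {stat:.3f}, p-value = {p_value:.4f}")
```

For the stratified analysis, the same function would be applied separately within each sociodemographic subgroup.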
All analyses are conducted without applying the HRS survey weights or accounting for the complex sample design of the HRS. This is because in this analysis we are testing the efficacy of the treatment among HRS panelists, and not attempting to make broader population inferences. In this analysis, 204 independent tests of the interaction term from each model were conducted using a likelihood ratio test, which we can think of as a test of the overall effect of the interaction between treatment group and subgroup. We account for multiple testing using the Holm-Bonferroni correction (Holm 1979). Based on this method, we order the p-values from the 204 independent likelihood ratio tests from smallest to largest and compare each to a critical alpha. The critical alpha for the smallest p-value is the same as in the Bonferroni procedure, α/k, where k is the number of tests; in our case, .05/204 = .000245. The critical alpha level for the second smallest p-value is α/(k−1), for the third smallest p-value it is α/(k−2), and so on. Testing stops at the first p-value that exceeds its critical alpha; that test and all tests with larger p-values are declared non-significant. We reiterate that this is largely an exploratory analysis of potentially heterogeneous mode effects. With this approach we can still catch potential moderation of the mode effects by different sociodemographic measures, while accounting for multiple testing.
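The step-down logic of the Holm-Bonferroni procedure can be illustrated with a short Python sketch. This is an illustrative implementation only, not the code used for the analysis (which was run in SAS); the function name and the example p-values are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the test is significant under Holm.

    P-values are visited from smallest to largest. The i-th smallest (i = 0, 1, ...)
    is compared to alpha / (k - i); the procedure stops at the first failure, and
    every remaining test is declared non-significant.
    """
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    significant = [False] * k
    for rank, idx in enumerate(order):
        critical_alpha = alpha / (k - rank)
        if p_values[idx] <= critical_alpha:
            significant[idx] = True
        else:
            break  # step-down: stop at the first non-rejection
    return significant


# Hypothetical p-values from four tests.
example = [0.01, 0.04, 0.03, 0.005]
decisions = holm_bonferroni(example)
print(decisions)

# Critical alpha for the smallest of 204 p-values, as in the text.
first_critical_alpha = 0.05 / 204
print(round(first_critical_alpha, 6))
```

In the example, the third smallest p-value (0.03) exceeds its critical alpha of 0.05/2 = 0.025, so it and the larger p-value (0.04) are both declared non-significant, even though 0.04 is below the unadjusted 0.05 level.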
All analyses were conducted in SAS 9.4.
3. Results
3.1. Research Question 1
We visually summarize the results for RQ1 across the sociodemographic groups. Each vertical panel is a subgroup, and each row represents an outcome from Table 3. Each dot is the pairwise difference between the web-first and control groups in the predicted mean or proportion based on the model in Equation (1). Star-shaped points represent pairwise differences that are significantly different from 0 at the 0.01 level based on a t test. Rows highlighted with light gray shading indicate outcomes where the Type 3 likelihood ratio (LR) test for the interaction is significant based on the Holm-Bonferroni correction. Details of these results can be found in Table 1 of the Supplemental Materials.
We tested 204 interaction effects for the 2020 HRS wave, and 6 (3%) were significant under the Holm-Bonferroni correction. The estimated model covariates and results from the Type 3 LR test from these six models can be found in Table 2 of the Supplemental Materials. Figure 1 illustrates differences among the outcomes by HRS cohort. The call-out box shows that significantly more Late Baby Boomers report complete bracket asset values (in lieu of exact values) in the web-first group, 7%, compared to 2% in the control group (p-value <.0001). We see the opposite effect for the oldest HRS cohorts and War Babies, who report fewer complete bracket asset values in the web-first group. Therefore, we have evidence that asset data quality is worse among the Late Baby Boomers in the web-first group, and slightly better for the older HRS cohorts. The mean number of reported IADL limitations is also higher in the web-first group (0.23) than the control group (0.11) for Late Baby Boomers (diff = 0.12, p-value = .0006). Conversely, Mid Baby Boomers in the web-first group reported an average of 0.13 IADL limitations compared to 0.21 IADL limitations in the control group (diff = −0.08, p-value = .0083).

Figure 1. Cohort-specific pairwise differences between web-first and control (telephone-only) estimates of means and percentages for key HRS outcomes (2020 HRS).
We note that there are rows in Figure 1 where there are significant pairwise differences that do not have light gray shading. This occurs when a pairwise difference is significant, but the overall interaction effect is not. For example, there can be significant pairwise differences that are in the same direction for all subgroups in the absence of a significant interaction. We also note that some significant pairwise differences are close to the zero line. These are significant (small) pairwise differences that most frequently occur in data quality metrics. For example, the mean rate of missing health values among Late Baby Boomers is 0.0027 in the web-first group and 0.0006 in the control group, a difference of 0.0021 (p-value <.001).
We provide a brief summary of the significant heterogeneity detected among the other subgroups; full figures for each set of sociodemographic subgroups can be found in Figures 1 to 5 of the Supplemental Materials. On average, females report more roster changes in the web-first group (1.8) than in the control group (1.3; p-value <.0001). No difference is observed for males, where the web-first and control groups each have an average of 1.9 roster changes (p-value = .9928). Those with some college education (web-first = 1.9, control = 0.9, p-value <.0001) or a four-year degree (web-first = 1.6, control = 1.3, p-value = .0053) report more roster changes in the web-first group than in the control group. A similar result was found specifically for child roster changes. These results suggest that more roster changes are reported in the web-first treatment, in particular for females and those with some college or a four-year degree. Finally, non-Hispanic Blacks in the web-first group report fewer CES-D symptoms than those in the control group (1.4 vs. 1.7, respectively; p-value = .0196). The opposite was observed for non-Hispanic Whites and Others, with a mean of 1.4 in the web-first group and 1.2 in the control group (p-value <.0001). We did not find substantial evidence of heterogeneity in mode effects by age in 2020 or previous cognitive ability, nor for any outcomes in the future expectations (subjective probability) domain or for the quality of health responses.
3.2. Research Question 2
Next, we assess whether assignment to the sequential web-phone protocol in the 2020 wave affected participation in the 2022 wave, which was conducted in-person. The 2022 response rate for those panelists assigned to web-first in 2020 was 79%, compared to 77% among those assigned to the control group in 2020 (Chi-sq p-value = .1583); see Table 4. Results from our interaction analysis show a marginally significant LR test result (chi-sq = 5.2, d.f. = 2, p-value = .074) for the interaction of race and ethnicity (Hispanic/Non-Hispanic Black/Non-Hispanic White and Other) with 2020 treatment group; see Table 3 in the Supplemental Materials. Investigating this further, we found a significant interaction between 2020 treatment group and a two-category variable measuring ethnicity only (Hispanic/Non-Hispanic; chi-sq = 4.3, d.f. = 1, p-value = .0381). While these results are at or only approaching the .05 level of statistical significance, they are practically relevant to HRS due to the study’s oversample of Hispanics. A consistent result is found in the stratified response rates; see Table 4. Among Hispanics, those assigned to the web-first treatment in 2020 had a higher response rate in 2022 (75%) than those assigned to the control group (64%; Chi-sq p-value = .0136). There were no significant interaction results or differences detected in the 2022 response rate by 2020 treatment group within subgroups defined by sex, HRS cohort, educational attainment, or age.
Table 4. 2022 HRS Panel Response Rate by 2020 Treatment Assignment.
p-Value <.05.
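As a rough check on how the reported likelihood-ratio statistics for RQ2 map to their p-values, the following Python sketch evaluates the upper tail of the chi-square distribution for 1 and 2 degrees of freedom using only the standard library. The helper name `chi2_sf` is hypothetical, and because the published statistics are rounded to one decimal, the results can only be expected to agree approximately with the reported p-values.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square statistic for df = 1 or 2."""
    if df == 1:
        # A chi-square(1) variable is a squared standard normal variable.
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        # A chi-square(2) variable is exponential with mean 2.
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")


# Statistics as reported in the text (rounded to one decimal place).
p_race_ethnicity = chi2_sf(5.2, 2)  # three-category race/ethnicity interaction
p_ethnicity = chi2_sf(4.3, 1)       # two-category ethnicity interaction

print(round(p_race_ethnicity, 3))
print(round(p_ethnicity, 3))
```

Both values land close to the p-values given in the text (.074 and .0381), which suggests the reported statistics and p-values are internally consistent.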
3.3. Detailed Roster Results
Given that half of the significant main and interaction effects emerging from this analysis were related to the family roster section, we drilled down into the nature of these changes. We looked individually at each of the six pieces of information gathered for each non-residential child and household member (name, sex, year of birth, coupleness, residence, and relationship to the respondent). Overall, more roster updates occurred in the web-first group, and we find that most of these updates are to coupleness and residence; see Table 5. The mean number of updates to coupleness status for the web-first group, 0.70, is significantly higher than for the control group, 0.55 (p-value = .0041). The mean number of updates to residence is also significantly higher in the web-first group, 0.69, compared to the control group, 0.55 (p-value = .0090). We expect these two pieces of information to change over time, whereas updates to name, year of birth, and sex are usually associated with fixing data entry errors.
Table 5. 2020 HRS Panel Household Family Composition Updates: Mean Number of Changes (SE).
p-Value <.01.
The web-first treatment therefore appears to be capturing more substantive family composition updates compared to the control group. We also compared the mean number of newly added persons to the family roster. More new family/household members were added, on average, in the web-first group, 0.49, than in the control group, 0.38 (p-value = .003). We also found evidence of one heterogeneous mode effect in terms of these roster updates across selected sociodemographic subgroups. Early and Mid Baby Boomer cohorts reported more new persons in the control group, whereas all other HRS cohorts reported more new persons in the web-first group.
4. Discussion
4.1. Summary of Results
For RQ1, we found evidence of heterogeneous mode effects in just 6 of the 204 interactions we tested, and the magnitudes of these effects on data quality and response distributions were relatively small. In some cases, the direction of the effect favored the web-first treatment group. For example, females on average reported 0.5 more household roster changes in the web-first group than in the control group, and those with some college education reported 1 additional roster change on average in the web-first group compared to controls. Further, while there is a small amount of evidence of heterogeneity in the mode effects across the subgroups analyzed, the mode effects identified within the subgroups were generally not large enough to raise concerns about the quality of the data collected in the HRS. For example, non-Hispanic Blacks reported 0.3 more CES-D symptoms on average in the control group, whereas Hispanics and non-Hispanic Whites and Others reported more in the web-first group (means higher by 0.3 and 0.2, respectively). This resulted in a significant interaction, but the effects themselves do not introduce practical concerns.
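For context on the 6-of-204 figure, the short Python sketch below computes the share of significant interactions and, purely as a reference point, the number of significant results one would expect if every null hypothesis were true. The per-test significance level of .05 is an assumption, the calculation ignores dependence among tests on related outcomes, and it is not an analysis reported by the study.

```python
tested = 204      # interaction terms tested, as reported in the text
significant = 6   # interactions found significant, as reported in the text
alpha = 0.05      # assumed per-test significance level

# Share of tested interactions that were significant.
share_significant = significant / tested

# Expected count of significant tests if all null hypotheses held
# (an expectation only; it does not require the tests to be independent).
expected_by_chance = alpha * tested

print(f"{share_significant:.1%}")
print(round(expected_by_chance, 1))
```

Under these assumptions, the observed count sits below the number expected by chance alone, which is in keeping with the reading that evidence of heterogeneity is limited.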
Among the significant interactions, half were from the household and adult child rostering section. When we investigated this further, we found that the nature of these changes was substantive (changes to coupleness and residence status) and not just corrections to name, birth year, or sex. More of these changes were made in the web-first group, particularly among females and those with some college education or a four-year degree. This suggests positive effects of the web-first treatment on data quality.
With regard to RQ2, we did not find evidence of a negative impact of the web-first treatment on future wave in-person participation. Our concern that talking to someone on the phone would make a person more likely to come back in the next wave than completing a web survey was not substantiated. Overall, the web-first treatment in the 2020 wave did not have a significant effect on the probability of responding to an in-person interview in the subsequent 2022 wave. When we looked within subgroups, we found that Hispanics responded at a higher rate after the web-first treatment, which is a positive finding because Hispanics are an important subpopulation in the HRS.
4.2. Implications for HRS
Based on the overall mode effects reported by Ofstedal et al. (2022) and the small number of heterogeneous mode effects found in this analysis, HRS considered the web-first protocol not harmful to data quality and continued to use web-first liberally, albeit experimentally, in the 2022 and 2024 panel data collections. This analysis did not explore the impacts of introducing web on the cognitive functioning outcomes that are also collected in the HRS. These outcomes have been shown to be affected by survey administration mode, and we recommend that those who are interested in those outcomes specifically review Domingue et al. (2023) for more information.
One major takeaway from our analysis is the beneficial impact of the web-first protocol on the number of roster changes reported, particularly among females. One future consideration for the HRS is how to incorporate this finding more broadly within the interviewer-assisted modes. Under the current in-person protocol, the interviewer reads pre-loaded information for each household member and adult child, and the respondent is asked to confirm or update it without the use of visual aids. The roster section in the web instrument is more visual: the respondent interacts with questions about the household roster on a screen and can see their responses as they are typed, which inherently acts as a visual aid. This could account for the increased data quality on the web. The HRS could consider adding a visual aid to the in-person roster section to help improve data quality in that mode.
Another possible future consideration is to have HRS introduce an adaptive design that is targeted to specific subgroups. One option would be to institute a “knock to nudge” protocol, where interviewers go in-person to encourage panel members to complete the web instrument for subgroups where the web-first protocol was advantageous (Siemiatkowska 2024; Tortoriello and Kastberg 2023). Another option is to tailor study materials and protocols to appeal to specific subgroups of the panel with a low propensity to participate (Lynn 2014, 2017). The basic idea is to use the rich set of previous wave data to identify subgroups of panel members with similar and well-defined characteristics that lend themselves to targeted treatment. Then one could vary design features, such as mode assignment or the wording of study materials, including invitation letters (Lynn 2016) or between-wave mailings (Fumagalli et al. 2013), based on those characteristics to encourage participation and maximize data quality.
While we did not find substantial evidence of heterogeneous mode effects in the 2018 and 2020 waves of HRS, we saw more roster changes and fewer specific asset values being reported by the newest HRS cohort in the web-first group. This finding suggests that mode effects may change over time as new cohorts are enrolled into the HRS. We will continue to monitor for possible heterogeneous mode effects in future HRS cohorts, which are of particular importance to the stability of the HRS data collection.
As a sensitivity analysis for the roster metrics, we modified how we calculated the number of household roster changes in the 3% of households where both responders made changes. Instead of summing all changes made by both household responders, we only used information from the first person within a household to report roster changes. The results were largely similar, and we report the summed changes within a household to include the maximum amount of information. In light of our sample balance check, which indicated possible imbalance among the HRS cohorts, we conducted another sensitivity analysis. We re-fit each of our models with study cohort included as a covariate to see whether the interaction effect between treatment group and a given demographic variable remained significant after accounting for study cohort. We found that the results remained consistent after making this change.
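The two household-level aggregation rules compared in the first sensitivity analysis can be sketched in Python as follows. The records, field order, and function names are hypothetical, and the sketch reads "the first person within a household to report roster changes" as the earliest responder with a non-zero number of changes, which is an assumption about the rule rather than the study's documented procedure.

```python
from collections import defaultdict

# Hypothetical records: (household_id, respondent_order, roster_changes).
# respondent_order 1 marks the first person in the household to respond.
records = [
    ("hh1", 1, 2),
    ("hh1", 2, 1),
    ("hh2", 1, 0),
    ("hh3", 1, 3),
    ("hh3", 2, 0),
    ("hh4", 1, 0),
    ("hh4", 2, 2),
]


def summed_changes(rows):
    """Main rule: add up changes from every responder in a household."""
    totals = defaultdict(int)
    for household, _order, changes in rows:
        totals[household] += changes
    return dict(totals)


def first_reporter_changes(rows):
    """Sensitivity rule: keep only the earliest responder who reported changes."""
    result = {}
    # Sorting the tuples orders rows by household, then by respondent order.
    for household, _order, changes in sorted(rows):
        # Keep the earliest non-zero report; households with no changes stay at 0.
        if result.get(household, 0) == 0:
            result[household] = changes
    return result


print(summed_changes(records))
print(first_reporter_changes(records))
```

In this toy data the two rules differ only for the household where both responders made changes, which mirrors why the choice affects only a small share of households.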
4.3. Broader Implications/Considerations for Survey Researchers
This work raises several points for future work and discussion for the broader survey research community. First, we encourage other survey methodologists to investigate mode effects both overall and across subgroups that are crucial for the objectives of a given study. This paper presents limited evidence of heterogeneous mode effects among HRS panelists, who are older U.S. adults. Mode effects may be quite different in other target populations, and could vary considerably more across important subgroups in those populations.
Second, we must consider the timing and the purpose of analyses of mode effects. Researchers are not able to test for main or heterogeneous mode effects in a study until after the data collection is finished, or a large portion of data has already been collected. In the cross-sectional survey context, analyses of mode effects are limited to describing the outcome of the data collection effort, or they could be incorporated into a weighting procedure if major mode effects are detected. In the panel study context, the results of mode effect analyses are particularly useful because they can also be used for designing protocols in subsequent data collection waves. As survey costs continue to rise and public willingness to participate in surveys declines, the pressure to achieve the same level of respondent participation as in previous waves remains. These forces point to the continued use of web as a cheaper mode for data collection. However, the impact of the web mode should continue to be monitored across different subgroups of interest to help studies maintain the highest levels of data quality. Our study focused on a small set of sociodemographic factors of key importance to HRS. Future studies that focus on other more substantive factors, such as health and economic status, would be greatly beneficial.
Finally, the following open question warrants further discussion among survey researchers: How do we determine what magnitudes of mode effects (overall or heterogeneous) are acceptable? In this specific study, about 3% of the interactions we tested pointed toward heterogeneous mode effects. Our results were mixed, in that some suggested data quality benefits of the web mode, and some suggested decreases in quality. The HRS found the outcome of this analysis reassuring in that, for the specific set of outcomes that we studied, we found the effects to be “modest,” but that judgment is not based on a widely accepted standard or benchmark.
Supplemental Material
Supplemental material, sj-docx-2-jof-10.1177_0282423X261429705 for Assessing the Heterogeneity in Mode Effects on Data Quality, Response Distributions, and Future Participation Across Sociodemographic Subgroups in a Mixed-Mode Panel Study by Heather M. Schroeder, Mary Beth Ofstedal and Brady T. West in Journal of Official Statistics
Supplemental material, sj-xls-1-jof-10.1177_0282423X261429705 for Assessing the Heterogeneity in Mode Effects on Data Quality, Response Distributions, and Future Participation Across Sociodemographic Subgroups in a Mixed-Mode Panel Study by Heather M. Schroeder, Mary Beth Ofstedal and Brady T. West in Journal of Official Statistics
Footnotes
Acknowledgements
The authors thank Abdelaziz Adawe for his careful assistance with managing the family roster data.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Institute on Aging of the National Institutes of Health, and the Social Security Administration [U01 AG009740].
Supplemental Material
Supplemental material for this article is available online.
Received: March 31, 2025
Accepted: February 17, 2026
References
