
Abstract

Aggregated relational data (ARD), derived from questions of the form “How many people do you know who [belong to subpopulation X]?” are widely used to estimate the size and composition of social networks, often adopting the network scale-up method (NSUM). However, their measurement properties are insufficiently studied. The authors address this gap by assessing (1) the test-retest reliability of a large set of ARD questions and NSUM-estimated network sizes and (2) the convergent validity of these network size estimates. This mixed-methods study involved a heterogeneous quota sample of 50 citizens in Barcelona, Spain, in 2023. Respondents were interviewed twice over a 10- to 15-day period, answering a series of ARD questions on each occasion. Qualitative debriefing provided valuable insights into their response behaviors. Our findings indicate that NSUM accurately ranked respondents’ network sizes but did not estimate their values consistently across measurements. Respondents gave lower answers in the second interview than in the first. In particular, the network sizes of people with large networks (“hubs”) fluctuated significantly. NSUM-estimated network size moderately correlated with estimates from the summation method and Facebook friend counts. The authors discuss the implications and provide practical recommendations for ARD item selection and the use of NSUM instruments.

Keywords

aggregated relational data, network scale-up method, reliability, social networks, acquaintanceship volume

Social scientists have long investigated how individuals’ relationships with different social groups reflect underlying population structures. Aggregated relational data (ARD) and the network scale-up method (NSUM) are effective tools for such research. ARD are responses to survey questions of the form “How many people do you know who [belong to subpopulation X]?” (Breza et al. 2020). When linked to respondents’ own subpopulation memberships, such questions reveal how well connected different segments of the population are. ARD questions form the core of NSUM (Bernard et al. 1989; Killworth, Johnsen et al. 1998; McCarty et al. 2001), which estimates the size of “hard-to-count” subpopulations (i.e., subpopulations that lack official statistics and are difficult to sample), such as undocumented migrants or lethal victims of a natural disaster. In this methodology, a random sample of the national population reports how many people they know who belong to the hard-to-count subpopulation, along with various subpopulations with known sizes. Responses for the known subpopulations help researchers estimate respondents’ network sizes (see the next section for more details). Responses for the unknown subpopulations are then used to infer their size on the basis of individuals’ estimated network sizes and the national population size.

NSUM has been used to estimate the size of subpopulations relevant to public health (e.g., men who have sex with men [Ezoe et al. 2012], women who have had abortions [Sully, Giorgio, and Anjur-Dietrich 2020]), criminal justice (e.g., victims of cybercrime [Breen, Herley, and Redmiles 2022], victims of trafficking [Li et al. 2023]), and disaster management (e.g., forcibly displaced people [Schroeder et al. 2019]), among other areas. Organizations such as the Joint United Nations Programme on HIV/AIDS and the World Health Organization have embraced these methods for gaining valuable information on vulnerable subpopulations (Shelton 2015; UNAIDS 2010; UNAIDS/WHO Working Group on Global HIV/AIDS and STI Surveillance 2010).

Beyond estimating hard-to-count subpopulations, ARD and NSUM have been used to assess individuals’ acquaintanceship network size (e.g., Feehan, Son, and Abdul-Quader 2022; Hofstra, Corten, and van Tubergen 2021; Lubbers, Molina, and Valenzuela-García 2019), network composition (Otero et al. 2022), and segregation (Breza et al. 2020; DiPrete et al. 2011; Zheng, Salganik, and Gelman 2006). These uses of ARD questions help us understand broad structural differences in social connectivity, such as between men and women or upper and lower social classes, and their implications, for instance, for mobilization of support, which affects individual outcomes like health and social mobility, and for collective outcomes such as societal cohesion. The growing interest in this type of data has fostered research into question-order and interviewer effects (Snidero et al. 2009) and the robustness of ARD instruments (Kunke et al. 2024). Researchers have improved estimation methods (e.g., Feehan and Salganik 2016; Feehan et al. 2016; Habecker, Dombrowski, and Khan 2015; Laga, Kunke, et al. 2024; Maltiel et al. 2015; McCormick, Salganik, and Zheng 2010) and implemented them in R packages (e.g., Laga, Bao, and Niu 2021; Maltiel et al. 2015).

Yet despite their growing popularity, the measurement properties of ARD questions and NSUM-estimated network sizes remain largely unknown, even though they are crucial for ensuring the outcomes’ robustness, credibility, and replicability. For instance, if individuals respond inconsistently to the same ARD questions on different occasions, these questions’ ability to accurately measure important social dimensions and their predictive power is questionable. To our knowledge, the test-retest reliability of individuals’ estimated network size and ARD responses has only been reported twice. Kazemzadeh et al. (2016) reported a substantial mean test-retest agreement of $\kappa = 0.79$ for the ARD scores of 30 students in Iran, interviewed twice over 10 days, with only one ARD question showing significant mean change. They did not report the test-retest reliability of the estimated network sizes. Vardanjani, Baneshi, and Haghdoost (2015) reported a test-retest correlation of $r = 0.81$ for their ARD questionnaire (n = 25, $r_{\min}$ = 0.77, $r_{\max}$ = 0.99), but they did not specify the time interval or assess the agreement in values over time. Neither study examined whether reliability varies by question order, question type, or respondent attributes (e.g., education). More work is also needed to establish how well NSUM-estimated network size aligns with other measures of network size.

In this article we address these gaps by (1) estimating the test-retest reliability of ARD and NSUM-estimated network size, (2) estimating the convergent validity of NSUM-estimated network sizes by providing “relationships between test scores and other measures intended to assess the same or similar constructs” (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education 2014:16–17), and (3) reaching a deeper understanding of response behaviors. To achieve this, we conducted a mixed-methods study in Spain in 2023 among 50 respondents, interviewed twice over 10 to 15 days. Both interviews included ARD questions regarding a wide range of subpopulations, as well as questions about respondents’ attributes and two related measures of individuals’ acquaintanceship network size. We qualitatively debriefed respondents about their response behaviors.

The study addresses four research questions (RQs): (1) How well do respondents understand and follow the instructions for ARD questions? (2) What are the test-retest reliability and agreement of ARD questions, and which types of questions are more and less reliable? (3) What are the test-retest reliability and agreement of acquaintanceship network size estimated using NSUM procedures, and which types of individuals have more and less reliable network size estimates? and (4) How well does the NSUM-estimated network size align with other measures of acquaintanceship network size?

RQs 1 and 4 are exploratory. Regarding RQ 2, we expected individuals’ ARD responses to be similar on the two measurements, indicating good reliability. However, if responses varied over time, we expected higher responses in the second measurement because of potential recall. This pattern would suggest good reliability in terms of ranking but lower agreement in the values. For RQ 3, as ARD questions about relatively rare names generate fewer biases than questions about other subpopulations (see next section), we expected name questions to have higher test-retest reliability than questions regarding other subpopulations (e.g., the number of people with particular occupations or religions). We also expected the first ARD items in the questionnaire to be more reliable than subsequent ones, given respondents’ limited attention span and difficulty retaining the definition of knowing over time (Habecker 2017). Finally, we expected that test-retest reliability would be higher among younger individuals, because of better memory, and among higher educated individuals, because of potentially greater consistency in understanding.

The Network Scale-Up Method

The NSUM was designed to estimate the size of hard-to-count subpopulations. Its inventors (Bernard et al. 1989) stipulated that the size of these subpopulations could be estimated on the basis of their prevalence in the acquaintanceship networks of a representative population sample and the total population size. Killworth, McCarty, et al. (1998) formalized this relationship as

$${\hat{N}}_{j} = N \frac{\sum_{i = 1}^{I} y_{i j}}{\sum_{i = 1}^{I} d_{i}}, \tag{1}$$

where ${\hat{N}}_{j}$ is the estimated size of subpopulation $j$, $y_{i j}$ is the number of persons respondent $i = \{1, \dots, I\}$ reports knowing in subpopulation $j$, $d_{i}$ is respondent $i$’s degree (i.e., the number of people $i$ knows in the total population), and $N$ is the size of the total population. Because individuals’ degrees ($d_{i}$) are typically unknown, Killworth, Johnsen, et al. (1998) proposed estimating them by asking respondents about a set of $K$ subpopulations with known sizes (e.g., known from national statistics institutes). For $j \in K$, the degree is estimated as

$${\hat{d}}_{i} = N \frac{\sum_{j \in K} y_{i j}}{\sum_{j \in K} N_{j}}, \tag{2}$$

where ${\hat{d}}_{i}$ is the estimated degree of individual $i$, $y_{i j}$ is the number of people respondent $i$ reports knowing in subpopulation $j = \{1, \dots, J\}$, and $N_{j}$ is the size of subpopulation $j$ in the total population. Equation (2) formalizes the “known population method” for estimating network size, or “back estimation.” Thus, NSUM instruments typically contain various ARD questions about known subpopulations and one or more questions about the unknown populations whose size researchers wish to estimate.
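As an illustration of equations (1) and (2), the following minimal Python sketch computes back-estimated degrees and a scaled-up subpopulation size. The function names and all numbers are hypothetical toy values chosen for easy arithmetic; they are not data from this study, and the sketch does not reproduce any particular software implementation.

```python
def estimate_degree(known_counts, known_sizes, total_population):
    """Equation (2): d_hat_i = N * sum_j y_ij / sum_j N_j over the known subpopulations."""
    if len(known_counts) != len(known_sizes):
        raise ValueError("one reported count is needed per known subpopulation")
    return total_population * sum(known_counts) / sum(known_sizes)


def estimate_subpopulation_size(unknown_counts, degrees, total_population):
    """Equation (1): N_hat_j = N * sum_i y_ij / sum_i d_i over all respondents."""
    if len(unknown_counts) != len(degrees):
        raise ValueError("one reported count is needed per respondent")
    return total_population * sum(unknown_counts) / sum(degrees)


# Hypothetical toy inputs (assumptions for illustration only).
N = 1_000_000                        # total population size
known_sizes = [1_000, 2_000, 2_000]  # sizes N_j of three known subpopulations
known_counts = [                     # y_ij: one row per respondent
    [1, 2, 2],
    [0, 1, 0],
    [3, 4, 3],
]

# Step 1: back-estimate each respondent's degree from the known subpopulations.
degrees = [estimate_degree(row, known_sizes, N) for row in known_counts]

# Step 2: scale up the reports about the unknown subpopulation.
unknown_counts = [2, 0, 6]           # reports about the hard-to-count group
n_hat = estimate_subpopulation_size(unknown_counts, degrees, N)

print(degrees)  # estimated degrees per respondent
print(n_hat)    # estimated size of the unknown subpopulation
```

Because equation (1) pools counts and degrees across respondents before dividing, respondents with large estimated degrees carry more weight in the size estimate than respondents with small ones.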

The method relies on three main assumptions. First, respondents are expected to know people independently of the subpopulations they belong to, for the subpopulations of interest. This assumption of random mixing would be violated if, for instance, respondents associate more with people from their own subpopulations than with others (i.e., homophily; McPherson, Smith-Lovin, and Cook 2001). Such violations are called barrier effects. Second, respondents are assumed to have full knowledge of each network member’s belonging to the subpopulations they are asked about. Violations of this assumption due to, for instance, not knowing network members well enough or network members’ hiding their attributes because of stigma are called transmission errors. Third, respondents are assumed to know their exact number of acquaintances in each subpopulation, which is easier for smaller than larger subpopulations. Violations of this assumption are called recall bias.

To estimate individuals’ degrees (and therefore the size of hard-to-count subpopulations) with minimal bias, McCormick et al. (2010) proposed using a set of relatively rare first names as subpopulations. By selecting names whose bearers collectively represent society in terms of gender, age cohort, and other attributes for which national statistics are available, barrier effects are minimized. First names are typically one of the first things people know about others (Morgan 2009), which reduces transmission errors. Furthermore, by selecting relatively rare names (McCormick et al. [2010] recommend names with a 0.1 percent to 0.2 percent population prevalence), respondents are more likely to accurately recall how many people they know with those names, reducing recall bias.

Subpopulations other than names have also been used, both as unknown subpopulations whose size is to be estimated (e.g., sex workers, homeless people) and as known subpopulations that help estimate acquaintanceship volume (e.g., occupations, demographics like “widow[er]s under 65”). Although they may have higher barrier effects than names, Zheng et al. (2006) and DiPrete et al. (2011) showed that instead of assuming random mixing, it is possible to estimate the extent to which observed variation exceeds expected variation on the basis of random mixing. This “overdispersion” indicates barrier effects and, more substantively, social segregation (DiPrete et al. 2011). Their proposal adds another substantive use to NSUM.

Killworth, Johnsen, et al. (1998) used a simple Maximum Likelihood estimator, and Zheng et al. (2006) proposed an improvement, using multilevel overdispersed Bayesian Poisson regression models to capture the variability in individual propensities to connect to different groups, thereby controlling for barrier effects. More advanced methods are available (see Laga et al. 2021), but they are too complex for our small, nonprobability sample.

Methodology

Sample

The data were collected between May and August 2023 for a pretest of a large survey.¹ A quota sample² of 50 individuals was drawn from the resident population of Barcelonès county in Spain, containing Barcelona and four adjacent municipalities. To ensure diversity in key variables that could affect response behaviors (e.g., comprehension and memory), we defined the quotas on the basis of the intersections of migration background (70 percent locals, 30 percent born abroad), gender (50 percent female, 50 percent male), and education level (50 percent lower to middle vocational education, 50 percent higher education). Each segment (e.g., men and women) included a range of ages. Because of some attrition caused by scheduling issues (n = 2) and nonresponse to the second invitation (n = 1), we calculated test-retest statistics for 47 respondents. Of these 47, 24 were women (51 percent); 16 were 18 to 34 years old, 16 were 35 to 54 years old, and 15 were 55 to 84 years old; and 60 percent were born in Spain and 40 percent abroad (32 percent of the latter were raised in Spain). Eighteen respondents completed secondary education or lower, 8 completed middle or higher vocational education, and 21 had university education.

Procedures

We interviewed respondents twice, 10 to 15 days apart (mean = 12.6 days; in one case, 19 days because of scheduling conflicts). The interval was based on Polit (2014), who reported one to two weeks, and Nunnally and Bernstein (1994), who suggested two weeks. Recommendations regarding time intervals vary widely, but the general advice is to choose a period long enough to avoid memory effects but short enough to prevent real changes in the underlying construct. Given the difficulty of recalling exact responses to 48 items and the instability of weak relationships (see Hidd et al. 2022), we chose 10 to 15 days.

Five field-workers with BSc or MSc degrees in social sciences conducted the interviews after receiving three training sessions. In the first interview, they administered the survey, including the ARD questions described below. At the end, respondents were debriefed qualitatively to understand response behaviors better. In the second interview, all ARD questions about first names and a selection of the other questions were readministered, with the exact same instructions. With respondents’ consent, both interviews were recorded to transcribe relevant parts and identify difficulties encountered during the interviews. Throughout the text, we denote the first and second interviews as $t_{1}$ and $t_{2}$ , respectively, corresponding to t = 1 and t = 2 in the regression models.

Respondents were informed about the study’s objectives and their right to skip questions and to withdraw at any time. They received modest compensation for both interviews, which the invitation had not mentioned. The ethics committee of the Autonomous University of Barcelona approved the procedures (ID 5675).

Measures

ARD

The first interview included 48 ARD questions for all respondents, preceded by the following introduction:

Now I will ask you about people you know in general. By knowing someone, we mean that you know that person BY SIGHT AND BY FIRST NAME, and that this person also knows you by sight and name. This includes both CLOSE RELATIONSHIPS, such as your partner, family, and friends, and LESS CLOSE RELATIONSHIPS, such as people you have met in your neighborhood, at work, or through other people, and even PEOPLE YOU DON’T KNOW WELL OR YOU DON’T LIKE. These people do not have to live near you. (INTERVIEWER, IF NECESSARY, ADD: “Please do not include people who are deceased, under the age of 18, or yourself.”) Please take as much time as you need to respond.

ARD questions are cognitively demanding because people tend to organize relationships in their memory by context (e.g., family, work; Fiske 1995), and ARD questions require cross-referencing these contexts. To aid respondents’ recall, we designed an A4-sized response card listing potential social contexts and the definition of knowing someone (see Supplemental Figure S1 in the online supplement). Interviewers placed this card on the table for respondents to glance at if they wished.

Interviewers then asked, “How many people of 18 years and older do you know in Spain who [belong to a certain subpopulation]?” Twelve questions concerned occupations with varying prestige (e.g., lawyers), 3 social statuses (e.g., homeless, unemployed), 7 origins and ethnicities (e.g., people with African ancestry who live in Spain [e.g., from Morocco, Senegal, or Gambia]), 3 religions (e.g., Muslims), 3 political orientations (e.g., people probably voting VOX [a national far-right party] and people probably voting CUP [a radical left-wing, anticapitalist Catalan proindependence party]), and 20 first names (e.g., women whose first name is Amparo). Before the names questions, interviewers repeated the definition of knowing someone. Interviewers used the “if necessary” instruction in only a few cases, when they detected or suspected that respondents deviated from the instructions.

The response format was numeric (0, 1, 2, . . .), but higher numbers were categorized into bins (11–20, 21–50, >50). We used three bins because an earlier survey in Spain had many cases in a single upper category, “11 or more,” for some questions (Lubbers forthcoming). Past NSUM research has asked respondents to provide exact numbers, either with top coding (e.g., Lubbers et al. 2019; Sully et al. 2020) or without it (e.g., Habecker 2017; Killworth, McCarty et al. 1998), or to select from predefined bins (e.g., DiPrete et al. 2011; Hofstra et al. 2021). Bins reduce respondent burden, especially when asking about large subpopulations (DiPrete et al. 2011). For analysis, we recoded the two closed bins to their approximate midpoints (“11 to 20 persons” to 15 and “21 to 50” to 35) and the open-ended category “>50” to 51.
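The recoding described above can be sketched as a small lookup. The bin label strings and the function name are assumptions made for this example; only the numeric values (15, 35, 51) come from the text.

```python
# Values assigned to the binned upper categories, as stated in the text.
# The label strings themselves are hypothetical.
BIN_VALUES = {"11-20": 15, "21-50": 35, ">50": 51}


def recode_response(response):
    """Return the analysis value for an exact count (0 to 10) or a bin label."""
    if isinstance(response, int):
        if not 0 <= response <= 10:
            raise ValueError("exact counts are only recorded from 0 to 10")
        return response
    return BIN_VALUES[response]


raw = [0, 7, "11-20", "21-50", ">50"]
recoded = [recode_response(r) for r in raw]
print(recoded)
```

Note that the value for the open-ended bin is a floor rather than a midpoint, so very large true counts are compressed toward 51.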

We used the same set of first names as a national survey conducted in Spain in 2021 (n = 1,500). The selection was based on national statistics of name prevalence by gender and birth cohort (in decades) in 2021 from Spain’s National Institute of Statistics. For the present study, we updated the prevalences using the most recent statistics (from January 1, 2022) available at the start of data collection. Name selection followed these principles: (1) each name had a low prevalence in the population according to national name statistics (0.06 percent to 0.26 percent in the updated statistics); (2) collectively, the bearers of these names represent, on a smaller scale, the gender × birth cohort distribution of Spain’s overall population; (3) names clearly associated with a specific social class, such as Cayetano or Borja for male upper-class names, were avoided; and (4) name variations were included to account for linguistic differences in Spain.

If individuals reported knowing no one on more than 10 first names, they were asked about four additional names twice as prevalent as the others, without being informed of this difference. Conversely, if they reported knowing five or more persons for each of 15 or more first names, they were asked about four additional names with half the prevalence as the others. These questions helped us discern over- or underreporting from having larger or smaller network sizes. Thresholds were chosen on the basis of previous survey data, where the number of zero responses was more evenly distributed than the number of high responses.
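The branching rule for the four additional names can be written as a short function. This is a sketch under the thresholds stated above (more than 10 zero responses; five or more persons on at least 15 names), applied to the 20 first-name responses; the function name and the returned labels are hypothetical.

```python
def follow_up_name_set(name_counts):
    """Decide which set of four additional names, if any, a respondent receives.

    name_counts holds the recoded answers to the 20 first-name questions.
    """
    zeros = sum(1 for c in name_counts if c == 0)
    high = sum(1 for c in name_counts if c >= 5)
    if zeros > 10:
        # Many zeros: ask about names twice as prevalent as the others.
        return "more_prevalent_names"
    if high >= 15:
        # Many high answers: ask about names half as prevalent as the others.
        return "less_prevalent_names"
    return None


sparse = [0] * 11 + [1] * 9    # 11 zero responses
dense = [5] * 15 + [1] * 5     # 15 responses of five or more
typical = [0] * 10 + [1] * 10  # exactly 10 zeros does not trigger the rule

print(follow_up_name_set(sparse), follow_up_name_set(dense), follow_up_name_set(typical))
```

With 20 name questions the two conditions cannot both hold, so the order of the checks does not affect the result.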

In the second interview, interviewers repeated the 20 name questions and a selection of 13 other ARD questions to estimate test-retest reliability (see Table 1 for the repeated subpopulations). These questions were preceded by the same introductions as at $t_{1}$ and used the same visual aid.

Table 1. Test-Retest Variability for ARD Questions and Network Scale-Up Method–Estimated Network Size.

Note: Name questions are ordered by gender and mean age of the name bearers in Spain (N = 47). The pound sign (#) denotes the position of the ARD question in the rank of all ARD items. ARD = aggregated relational data.

Based on national statistics.

The darker the cells in the test-retest reliability columns, the more reliable the item is (the color gradient on the right shows the full spectrum used).

*p < .05. **p < .01. ***p < .001.

NSUM Estimate of Acquaintanceship Network Size

We estimated individuals’ degrees at $t_{1}$ and $t_{2}$ on the basis of their responses to the ARD questions about first names and national statistics on the prevalence of names in Spain. We summed the prevalence statistics of the different variations of a name with the same (e.g., Miriam, Myriam) or similar pronunciations (e.g., Ismael, Ismail), some regional variations (e.g., Alfred, Alfredo), and variations where “Maria” (female) or “José” (male) precedes the name (Maria and José often precede another name in Spain, but are typically omitted in everyday speech for most selected names). Because the interviews were conducted orally, these variations were only included in the questionnaire if pronounced differently (e.g., “Alfredo or Alfred”) or if the contraction with Maria or José was typically used in daily speech (e.g., “Tomás or José Tomás”). As we enquired about acquaintances age 18 years and older, we used the prevalence of all birth cohorts until the 2000s and added half of the 2000–2009 cohort to approximate the total number of adult name bearers (in 2022, people born before 2005 were adults).
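The adult-bearer approximation can be illustrated as follows. This is a minimal sketch assuming name statistics are available as counts per birth decade; the data structure, function names, and all counts are hypothetical and do not come from the national statistics used in the study.

```python
def adult_bearers(cohort_counts):
    """Approximate the number of adult bearers of one name variant in 2022.

    cohort_counts maps the first year of a birth decade (e.g., 1990) to the
    number of bearers born in that decade.
    """
    total = 0.0
    for decade_start, count in cohort_counts.items():
        if decade_start < 2000:
            total += count         # cohorts before 2000: all adults
        elif decade_start == 2000:
            total += 0.5 * count   # 2000-2009: roughly half were born before 2005
        # cohorts from 2010 onward are minors and are excluded
    return total


def pooled_adult_bearers(variants):
    """Sum adult bearers over all variants treated as the same name."""
    return sum(adult_bearers(counts) for counts in variants.values())


# Hypothetical counts for two spelling variants of one name.
variants = {
    "Miriam": {1980: 1000, 1990: 2000, 2000: 3000, 2010: 500},
    "Myriam": {1980: 100, 1990: 200, 2000: 300, 2010: 50},
}
pooled = pooled_adult_bearers(variants)
print(pooled)
```

Halving the 2000–2009 cohort assumes births are spread evenly across the decade, which is an approximation.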

To estimate network size, we used Killworth, Johnsen, and colleagues’ (1998) Maximum Likelihood estimator, as implemented in the R package networkscaleup (Laga, Bao, and Niu 2024), and Zheng and colleagues’ (2006) barrier model, in the form

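$$
\begin{aligned}
y_{i j} &\sim \mathrm{NB}\left(e^{\alpha_{i} + \beta_{j}}, \omega_{j}\right) \\
\alpha_{i} &\sim N\left(\mu_{\alpha}, \sigma_{\alpha}^{2}\right) \\
\beta_{j} &\sim N\left(\mu_{\beta}, \sigma_{\beta}^{2}\right) \\
\frac{1}{\omega_{j}} &\sim U(0, 1),
\end{aligned}
\tag{3}
$$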

where $y_{i j}$ is the number of persons respondent $i$ knows in subpopulation $j$, $\alpha_{i}$ is respondent $i$’s “gregariousness” parameter (a random effect expressing the respondent’s propensity to form ties), $\beta_{j}$ is the prevalence parameter of subpopulation $j$, and $\omega_{j}$ is a subpopulation-specific overdispersion parameter. We computed degree estimates by rescaling the $\alpha_{i}$ values using the approach relying on all known subpopulations, as proposed by Laga et al. (2021, equation 5). This model specification follows Zheng and colleagues’ original specification. We estimated the model using Stan and the R package stansum (Bojanowski and Baum 2024) with four independent chains, each with 1,000 burn-in iterations, resulting in a total of 1,000 Markov-chain Monte Carlo samples (250 from each chain) from the posterior distribution.

By assessing the reliability of estimated network size across these two methods, our conclusions about NSUM reliability will be more robust to the estimation technique. Furthermore, this process helps us assess whether the added complexity of Zheng and colleagues’ (2006) model, which adjusts for biases, improves the reliability of NSUM estimates.

Individual Predictors

Respondents’ gender, age (categorized in maximum 10-year bins), completed education level, and country of birth (Spain or abroad) were measured at $t_{1}$ . For age, we used the midpoint of each category. Given the small sample size, we dichotomized education into lower or secondary education versus middle or higher vocational or university education. At $t_{2}$ , we asked respondents, “Since the first interview 10 to 14 days ago, have you had more social contact than usual, less social contact, or about the same as usual?” Two dummy variables discern more and less contact from “about the same.” We controlled for this variable because increased social interaction might remind respondents of contacts not considered at $t_{1}$ .

Item-Level Predictors

We used the following ARD-question attributes: type of subpopulation (first names, occupations/positions, world regions of birth, religions, voting), position in the item order (a ranked ratio-level variable), and for first names, gender and average age of the name bearers in the adult population of Spain as well as the name’s prevalence in Spain in per mille. Name characteristics (see Table 1) were derived from the previously mentioned name statistics.

Criterion Measures for Convergent Validity

At $t_{1}$ , we assessed two alternative measures of acquaintanceship network size. First, we estimated network size using the summation method (McCarty et al. 2001), which asks individuals how many people they know in various social circles. McCarty and colleagues (2001) view the NSUM and summation methods as alternative approaches to estimating acquaintanceship network size, concluding that “they yield very similar distributions” (p. 37). Given the many ARD questions, we limited our inquiry to five social circles (adapted from van Tubergen et al. 2016): (1) best friends and good friends, (2) family members including in-laws, (3) neighbors and people in the neighborhood, (4) people in respondents’ current work or educational setting, and (5) people respondents know from other organizations (e.g., church or mosque, associations). We asked respondents to avoid double-counting individuals. The five social circles are not exhaustive for all contexts, but we expect respondents’ sums of contacts in these circles to correlate with their NSUM-estimated network size. Network sizes estimated with the summation method range from 10 to 298 (mean = 89, s.d. = 58, median = 73.5), with two missing cases.³

As the NSUM and summation methods share method variance by asking similar questions, we additionally used a Facebook friend count. Although capturing slightly different interactions, this variable is often used as a proxy for network size (e.g., Arnaboldi et al. 2016; Brooks et al. 2011; Hofstra et al. 2021). Furthermore, Hampton et al. (2011) showed that 93 percent of Facebook ties are also known offline. We asked the 32 Facebook users in our sample to open their accounts and report their Facebook friends count (mean = 508, s.d. = 660, median = 289, valid n = 31).⁴ Interviewers verified whether respondents checked their accounts to retrieve this information (n = 24, mean = 629, s.d. = 707, median = 409.5) or not (n = 7, mean = 94, s.d. = 48, median = 100). All respondents who did not check their accounts responded in multiples of 10 (40, 50, 100, 120, 150); only one respondent who checked did so. Respondents who did not check their accounts used social media less regularly (with a median of “every two weeks”) than those who checked their accounts (median tied between “daily” and “two to three times a week”).

Qualitative Debriefing

To better understand individuals’ response behaviors, we qualitatively debriefed respondents after the first interview (see Habecker 2017). Interview prompts included how easy or difficult respondents found the ARD questions, which questions were more challenging, whether the definition of “knowing someone” was clear, and if they used any additional criteria to include or exclude people, such as people they barely knew or had not seen for a long time. For high responses, interviewers could ask how respondents arrived at this number and who those people were. Conversely, interviewers could select a subpopulation for which respondents had reported knowing no one and ask them to review the social contexts on the response card (Supplemental Figure S1) for confirmation. Because the debriefing was qualitative, interviewers had flexibility in wording questions to enhance comprehension. Interviewers could also adjust the question order, skip redundant questions on the basis of prior responses, or add clarification questions. Responses were transcribed and coded inductively (i.e., without using preconceived categories, following an exploratory approach).

Statistical Analyses

We used nonparametric methods because ARD and NSUM estimates are highly skewed count data with, in this case, binned (and potentially influential) upper categories. First, we present descriptive statistics for the test-retest reliability of ARD and NSUM-estimated network size. We distinguish between reliability (the extent to which respondents’ rank order was reproduced) and agreement (the extent to which their values were reproduced; Berchtold 2016). We used Spearman correlations for reliability and the paired samples Wilcoxon’s signed rank test for agreement. We created Bland-Altman plots (Bland and Altman 1986) to explore individual variability in the estimated network size’s reliability. Bland-Altman plots depict the relationship between the average of the two measurements and their difference (for each individual), revealing the stability of difference values with increasing average counts.
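The distinction between reliability and agreement can be made concrete with a small sketch that computes a Spearman correlation and the quantities shown in a Bland-Altman plot. The helper functions are written from scratch for illustration, the data are hypothetical, and the 1.96 standard deviation limits of agreement are the conventional Bland-Altman choice rather than something reported in this study. The Wilcoxon signed-rank test is omitted here.

```python
from statistics import mean, stdev


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bland_altman(t1, t2):
    """Per-person means and differences (t2 - t1), bias, and limits of agreement."""
    means = [(a + b) / 2 for a, b in zip(t1, t2)]
    diffs = [b - a for a, b in zip(t1, t2)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return means, diffs, bias, (bias - spread, bias + spread)


# Hypothetical degree estimates for five respondents at t1 and t2.
t1 = [100, 200, 300, 400, 1000]
t2 = [90, 180, 250, 390, 600]

rho = spearman(t1, t2)
means, diffs, bias, limits = bland_altman(t1, t2)
print(rho, bias, limits)
```

In this toy example the rank order is perfectly preserved (high reliability) while every value drops at retest and the largest network changes most (low agreement), which is the kind of pattern the two statistics are meant to separate.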

To understand how individual and item (or subpopulation) attributes predict the agreement of (1) ARD responses and (2) NSUM estimates of network size, we fit two-level negative binomial generalized linear models. For the agreement of ARD responses, we fit a longitudinal, cross-classified negative binomial random-intercept model where ARD responses at $t_{1}$ and $t_{2}$ are nested within individuals and also, separately, within ARD items (as illustrated in Supplemental Figure S2 in the online supplement):

$$
\begin{aligned}
y_{i j t} \mid \mu_{i j t} &\sim \mathrm{NB}\left(e^{\mu_{i j t}}, \omega\right) \\
\mu_{i j t} &= \gamma_{0} + \delta_{0} b_{t} + \sum_{g = 1}^{G} \gamma_{1 g} x_{g i} + \sum_{h = 1}^{H} \gamma_{2 h} z_{h j} + \sum_{g = 1}^{G} \delta_{1 g} b_{t} x_{g i} + \sum_{h = 1}^{H} \delta_{2 h} b_{t} z_{h j} + u_{i} + v_{j} \\
u_{i} &\sim N\left(0, \sigma_{u}^{2}\right) \\
v_{j} &\sim N\left(0, \sigma_{v}^{2}\right),
\end{aligned}
\tag{4}
$$

where the response $y_{i j t}$ of respondent $i = \{1, \dots, 47\}$ to item $j = \{1, \dots, 33\}$ at time $t = \{1, 2\}$ follows a negative binomial distribution with expected number of contacts $e^{\mu_{i j t}}$ (so that $\mu_{i j t}$ is the linear predictor on the log scale) and dispersion parameter $\omega$. The binary variable $b_{t}$, with value 0 if $t = 1$ (test) and 1 if $t = 2$ (retest), models the time effect and interacts with both the $g = \{1, \dots, G\}$ person-level variables ($x_{g i}$, first grouping factor) and the $h = \{1, \dots, H\}$ item-level variables ($z_{h j}$, second grouping factor) to capture their influences on the agreement between test and retest. $\gamma_{0}$ represents the global intercept, $\gamma_{1 g}$ represents the main effect of the $g$th person-level variable $x_{g i}$, and $\gamma_{2 h}$ represents the main effect of the $h$th item-level variable $z_{h j}$. $\delta_{0}$ represents the main effect of time, $\delta_{1 g}$ is the interaction effect of the $g$th person-level variable $x_{g i}$ with the binary time variable $b_{t}$, and $\delta_{2 h}$ is the interaction effect of the $h$th item-level variable $z_{h j}$ with the time variable $b_{t}$. Finally, $u_{i}$ and $v_{j}$ are, respectively, respondent- and item-specific residuals.
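To show how the time interactions in equation (4) encode test-retest agreement, the sketch below evaluates the linear predictor at test and retest for one respondent-item pair. All coefficient values are hypothetical and chosen only for illustration; they are not estimates from this study, and the function name and dictionary layout are assumptions of the example.

```python
import math


def linear_predictor(b_t, x, z, coef, u_i=0.0, v_j=0.0):
    """Linear predictor of equation (4) on the log scale.

    b_t: 0 for test, 1 for retest
    x:   person-level covariates; z: item-level covariates
    """
    mu = coef["gamma0"] + coef["delta0"] * b_t + u_i + v_j
    mu += sum(g * xv for g, xv in zip(coef["gamma1"], x))          # person main effects
    mu += sum(g * zv for g, zv in zip(coef["gamma2"], z))          # item main effects
    mu += b_t * sum(d * xv for d, xv in zip(coef["delta1"], x))    # time x person
    mu += b_t * sum(d * zv for d, zv in zip(coef["delta2"], z))    # time x item
    return mu


# Hypothetical coefficients with one person-level and one item-level covariate.
coef = {
    "gamma0": 0.5, "delta0": -0.2,
    "gamma1": [0.3], "gamma2": [-0.1],
    "delta1": [0.1], "delta2": [-0.05],
}
x, z = [1.0], [1.0]

mu_test = linear_predictor(0, x, z, coef)
mu_retest = linear_predictor(1, x, z, coef)

# Ratio of expected counts at retest versus test; the main effects and the
# random intercepts cancel, leaving only the time-related terms.
ratio = math.exp(mu_retest) / math.exp(mu_test)
print(mu_test, mu_retest, ratio)
```

Under this specification, perfect agreement for a given respondent-item profile corresponds to $\delta_{0} + \sum_{g} \delta_{1 g} x_{g i} + \sum_{h} \delta_{2 h} z_{h j} = 0$, that is, a retest-to-test ratio of 1.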

This model has random effects for individuals and items. The analysis uses 3,083 observations: responses from 47 respondents to 33 ARD items at two measurements each, excluding 19 missing responses (0.6 percent of the data). We used 6 individual-level variables (respondents’ gender, age, education, country of birth, and two dummies for relative change in social contact) and 5 item-level variables (question order and four dummies for item type) to predict test-retest agreement. To do so, we included the 11 interaction terms between these variables and the binary retest variable, capturing the test-retest difference. A separate model for only the name variables (n = 1,880) examined whether the name bearers’ gender, average age in the population, and population prevalence affect memorization, in addition to question-order and individual-level variables (i.e., 6 individual-level variables, 4 item-level variables, and 10 interactions).
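The observation counts reported above follow directly from the design, as this short check illustrates. The variable names are illustrative only; the figures are those given in the text.

```python
n_respondents = 47
n_items = 33        # all ARD items
n_name_items = 20   # name items only
n_waves = 2         # test and retest
n_missing = 19

possible = n_respondents * n_items * n_waves        # all possible responses
observed = possible - n_missing                      # responses analyzed
missing_pct = 100 * n_missing / possible             # share missing, in percent
name_obs = n_respondents * n_name_items * n_waves    # name-item model

# Interaction terms: one per predictor, each crossed with the retest dummy.
interactions_all = 6 + 5     # individual-level + item-level (all items)
interactions_names = 6 + 4   # individual-level + item-level (name items)

print(observed, round(missing_pct, 1), name_obs, interactions_all, interactions_names)
```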

The second analysis evaluates the agreement of the NSUM-estimated degrees using a longitudinal, two-level negative binomial regression with a random effect for respondents. Here, estimates of individuals’ degrees $d_{i t}$ (rounded to integers) are nested within individuals $i = 1, \ldots, 47$, yielding 94 observations (47 respondents with two measurements each, $t = 1, 2$). Parameters are as defined in equation (4). Again, we model the test-retest agreement by specifying a binary time variable $b_{t}$ ($b_{1} = 0$; $b_{2} = 1$) and interaction terms $b_{t} z_{h i}$ to model the effects of individual attributes $z_{h i}$ on agreement:

$$
\begin{aligned}
d_{it} \mid \mu_{it} &\sim \mathrm{NB}\left(e^{\mu_{it}}, \omega\right) \\
\mu_{it} &= \gamma_{0} + \delta_{0} b_{t} + \sum_{h=1}^{H} \gamma_{h} z_{hi} + \sum_{h=1}^{H} \delta_{h} b_{t} z_{hi} + u_{i} \\
u_{i} &\sim N\left(0, \sigma_{u}^{2}\right)
\end{aligned}
\tag{5}
$$

The fixed effects include the six individual attributes, the retest variable, and six interactions; models with fewer parameters can be found in Supplemental Table S4 in the online supplement. We fitted the models with the R package glmmTMB (Brooks et al. 2017).

Finally, we assessed the convergent validity of NSUM-estimated degrees, comparing them with degrees estimated with the summation method and Facebook friend counts using Spearman correlations. We also compared the scaling factor between Facebook friend counts and NSUM-estimated degrees with that of other studies.

Results

Comprehension and Response Behaviors

In the qualitative debriefing, respondents indicated they generally understood the questions and our definition of knowing someone by sight and name. For instance, respondent R06 said, “Yes, I think it’s good because by sight, there can be many more people but knowing your name will be a little more concrete, right?” and R21 said, “Yes . . . it seemed good to me because knowing someone’s name, you usually have a relationship that is not necessarily close, but of dialogue, of some kind, right?”

However, some respondents saw the criterion of knowing acquaintances’ names as a “barrier” (R13), having acquaintances whose names they did not know, such as the concierge they “talk to every day” (R13). Likewise, respondents may have forgotten some names. As R31 said, “Of many of the people I know, I forget the names. At a certain moment I don’t remember them, and I don’t know if I have to consider them acquaintances.”

Respondents occasionally included a person in a non-name ARD question whose name they did not know, such as R04: “That’s why [I included] the girl from China, I told you, I don’t know her name. She told me but I don’t remember, apart from that, I’m very bad [with names], I don’t remember people’s names.” Occasionally, respondents added other criteria. R31 excluded people with whom he had no “practical personal relationship,” and R06 only counted people she liked. Such personal criteria, deviating from instructions, may introduce interindividual variability.

Names were generally considered the easiest to respond to because they are “concrete” (R01). The inclusion criteria were clear, and their low prevalence facilitated recall: “if they tell you a name and no one comes to mind, you probably don’t know one” (R16). Regions of origin were also relatively easy for respondents to report, especially if “you can see that someone is, for example, South American” (R12). This comment suggests respondents may undercount people who do not fit their perception of a region’s “typical” inhabitants. Religions and jobs were considered the most difficult to report. However, one religious respondent found the ARD questions on religious affiliations the easiest, suggesting item salience varies across individuals. Even nonreligious respondents seemed better informed about their acquaintances’ religions than they initially thought, because “religious people normally manifest [their religion] in some way” (R28), such as by wearing veils (R30) or crucifix pendants (R28) or sending images of saints via WhatsApp (R06).

Some people doubted the inclusion criteria for jobs. For instance, at $t_{2}$ , R17 realized she had not considered her own high school teachers at $t_{1}$ when asked how many high school teachers she knew, and she asked whether she could include them. She did at $t_{2}$ . Similarly, R28 said she knew many people who had studied law but was unsure if they currently worked as lawyers.

When asked how they reached certain numbers, respondents reported counting the lower ones (“When they were few, I counted,” R36) and estimating higher ones, with estimation sometimes starting at numbers as low as four. R42, who reported knowing more than 50 lawyers, explained:

It was an estimate, but since the figures [were] brackets, it was more like, “okay I don’t have to give exact [numbers], I have a margin.” But within those limits, I think I have chosen the right one. Yes, I know there are more than 50 for sure. . . . [They are] from my career, teachers—there are many people, parents of my friends, friends of my parents; many, many people in our social circle are lawyers.

In general, however, most respondents were quite confident about their answers, even when reviewing large and small numbers they had reported. Yet some respondents needed time for reflection. R44 initially said he only knew two people from South America but arrived at “approximately 10” upon reflection in the debriefing, thinking about specific countries. “I spaced out,” he said. In such cases, we kept respondents’ original answers to the ARD questions for subsequent analyses; we did not correct them after debriefing.

In the second interview, some respondents changed their response behaviors, such as trying to remember their earlier answers (which was unlikely to succeed, given the many questions) or changing their inclusion criteria after having misinterpreted them at $t_{1}$. For instance, at $t_{1}$, R36 had not understood that the counted people should live in Spain, and R17 had interpreted the question about secondary-school teachers differently. Despite these changes, these two respondents’ network size estimates changed relatively little (see Figure 1).

Figure 1.

Distribution of estimated network sizes at both time points, using Killworth et al.'s (L) and Zheng et al.'s (R) models.

Seven respondents with low answers on ARD questions were routed to the names with double the prevalence (see “Measures: ARD”). Five gave higher answers on these items (on average, they reported 0.52 persons per question for the first 20 names, and 1.05 per question for the four names with double prevalence); the other two cases gave similar and lower answers. Only one respondent with a vast network received the four name questions with half the prevalence. This person mentioned, on average, 9.6 people on the first 20 names and 4.75 on the four names with halved prevalence. These patterns suggest high/low values mainly indicate larger/smaller networks, rather than a tendency to over- or underreport, providing evidence of the instrument’s reliability.
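The routing pattern can be checked with simple ratios. The sketch below uses only the averages reported above (for the five low-answer respondents who gave higher answers, and for the single respondent with a vast network); the variable names are illustrative.

```python
# Mean reported persons per name question, as given in the text.
low_first20 = 0.52    # low-answer respondents, first 20 names
low_doubled = 1.05    # same respondents, four names with double prevalence
hub_first20 = 9.6     # respondent with a vast network, first 20 names
hub_halved = 4.75     # same respondent, four names with halved prevalence

# If answers track prevalence, these ratios should be close to 2 and 0.5.
ratio_doubled = low_doubled / low_first20
ratio_halved = hub_halved / hub_first20
print(round(ratio_doubled, 2), round(ratio_halved, 2))
```

Ratios near 2 and 0.5 are what one would expect if reported counts scale with name prevalence rather than reflecting a fixed response style.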

Test-Retest Reliability of ARD Items

Table 1 presents test-retest reliability statistics for all ARD items. Counter to our expectation that recall might increase $t_{2}$ estimates, all names except Fátima and Alfredo had lower estimates at $t_{2}$ than at $t_{1}$. A t-test showed that the time in seconds respondents spent on the 20 names (questions and instructions) did not differ significantly between the measurements (mean at $t_{1}$ = 208, s.d. = 68; mean at $t_{2}$ = 206, s.d. = 95; $t = 0.167$, $df = 46$, $p > 0.05$).

Despite lower responses at $t_{2}$ than at $t_{1}$, six names showed strong ($0.70 \leq ρ < 0.80$) and 14 very strong ($0.80 \leq ρ < 1.00$) Spearman correlations, indicating high reliability in reproducing respondents’ rankings. The name with the lowest test-retest correlation, Alicia, had a relatively high population prevalence (0.25 percent), although equally prevalent names (e.g., Ricardo) were more reliable. Wilcoxon’s signed rank test indicated a significant location shift of the medians for six names. Thus, the ranking of respondents was reliably reproduced for these names, but their precise values were not.
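The distinction between reproducing rankings and reproducing values can be illustrated with a small sketch. The response vectors below are hypothetical, and the Spearman correlation is implemented from scratch (average ranks for ties, then a Pearson correlation of the ranks) so that the example has no dependencies.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical responses of six respondents to one name item.
t1 = [1, 2, 4, 6, 9, 15]
t2 = [0, 1, 2, 4, 6, 10]   # everyone answers lower, order unchanged

rho = spearman(t1, t2)
all_lower = all(b < a for a, b in zip(t1, t2))
print(rho, all_lower)
```

Here the rank correlation is perfect even though every respondent reports a lower number at retest, which is the pattern a signed-rank test on the paired differences is designed to detect.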

Average and median responses on most other ARD subpopulations also decreased from $t_{1}$ to $t_{2}$ (in approximately similar ratios as names), except for occupations and homelessness (see Table 1). ARD questions about occupations and religion were more reliable than questions about origins and voting, effectively reproducing both the order and the values reported at $t_{1}$ . The strong reliability of the religion items contrasts with the difficulty respondents reported in answering these items (see previous section). Voting items were challenging because they focused on relatively large subpopulations, which may have introduced recall bias. Furthermore, people may not disclose their party preferences to acquaintances, introducing transmission error. For the birth region items, the term ancestry may have been variably interpreted.

To understand whether respondent and item attributes affected test-retest agreement, we used two longitudinal, cross-classified negative binomial multilevel models (see Table 2). The first focused on the test-retest stability of items. Model 1 in Table 2, encompassing responses to all items, shows the origins and voting items received higher responses at $t_{1}$ than did other items (see also Supplemental Table S4 in the online supplement). Their high population prevalence may partially explain this, as well as, in the case of origins, the large share of respondents with migration backgrounds. Interaction effects indicate whether item and individual attributes were associated with changes in responses between $t_{1}$ and $t_{2}$ (i.e., lower test-retest reliability). Only one attribute significantly lowered item stability ($p < .05$): higher educated individuals were more likely than lower educated respondents to lower their answers from $t_{1}$ to $t_{2}$. Higher educated respondents may have initially overreported because of social desirability bias.

Table 2.

Longitudinal, Cross-Classified Negative Binomial Multilevel Models of Item Response.

Variable	Model 1 All Items (n = 3,083)		Model 2 Name Items (n = 1,880)
Variable	Coefficient	s.e.	Coefficient	s.e.
Intercept	−0.001	0.746	−0.874	0.544
Retest (binary)	0.396	0.302	0.216	0.407
Individual attributes, main effects
Female (binary)	−0.444	0.228	−0.411	0.223
Middle to higher education (binary)	0.454	0.235	0.635**	0.231
Age in years	0.005	0.007	0.009	0.007
Born in Spain (binary)	−0.038	0.240	−0.034	0.235
Change in contact since $t_{1}$ (reference: same as usual)
Less contact than usual since $t_{1}$ (dummy)	0.641*	0.307	0.615*	0.300
More contact than usual since $t_{1}$ (dummy)	0.021	0.271	−0.085	0.267
Item attributes, main effects
Item type (ref.: names)
Origin (dummy)	1.285*	0.590	—	—
Voting (dummy)	1.281*	0.566	—	—
Occupations and social statuses (dummy)	1.014	0.623	—	—
Religions (dummy)	0.456	0.539	—	—
Item order	−0.009	0.026	0.004	0.009
Is a male name (binary)	—	—	0.051	0.100
Mean age of name bearers	—	—	−0.015***	0.004
Prevalence of the name (in ‰)	—	—	0.014***	0.002
Individual attributes, effects on test-retest change
Retest × female	0.048	0.092	−0.087	0.102
Retest × middle to higher education	−0.219*	0.095	−0.274*	0.109
Retest × age in years	−0.005	0.003	−0.001	0.003
Retest × born in Spain	−0.118	0.094	−0.122	0.105
Retest × less contact than usual since $t_{1}$	−0.165	0.116	−0.021	0.130
Retest × more contact than usual since $t_{1}$	−0.008	0.110	0.205	0.125
Item attributes, effects on test-retest change
Retest × item type origin	−0.176	0.224	—	—
Retest × item type voting	−0.356	0.209	—	—
Retest × item type occupations and statuses	<−0.001	0.248	—	—
Retest × item type religions	−0.316	0.213	—	—
Retest × item order	−0.004	0.011	−0.001	0.009
Retest × is a male name	—	—	−0.042	0.098
Retest × mean age of name bearers	—	—	<−0.001	0.004
Retest × prevalence of the name	—	—	<0.001	0.002
Random effects	Variance		Variance
Respondents, $σ_{u}^{2}$	0.525		0.487
Items, $σ_{v}^{2}$	0.412		0.026
Dispersion parameter, $ω$	1.28		3.61

*p < .05. **p < .01. ***p < .001.
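As a reading aid for Table 2, the sketch below converts a coefficient and its standard error into a rate ratio and an approximate two-sided p value. It assumes a Wald test with a normal reference distribution, which may differ from how the reported significance levels were obtained; the function name is illustrative.

```python
import math

def wald_summary(coef, se):
    """Rate ratio, z statistic, and two-sided normal p value for a log-scale coefficient."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return {"rate_ratio": math.exp(coef), "z": z, "p": p}

# Retest x middle to higher education, Model 1 and Model 2 of Table 2.
model1 = wald_summary(-0.219, 0.095)
model2 = wald_summary(-0.274, 0.109)

for name, res in (("Model 1", model1), ("Model 2", model2)):
    print(name, round(res["rate_ratio"], 2), round(res["z"], 2), round(res["p"], 3))
```

Under these assumptions, both interactions fall between the .01 and .05 thresholds, in line with the single asterisk in the table, and the rate ratios indicate that the test-to-retest change of higher educated respondents was roughly 20 to 25 percent lower than that of lower educated respondents, all else being equal.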

Model 2 (Table 2), encompassing only responses to name items, examines how name attributes relate to test-retest agreement. The main effect of name prevalence (see also model 1, Supplemental Table S4) and its near-zero interaction effect with time reveal that respondents knew more people with more prevalent names in both interviews, confirming NSUM’s effectiveness. Furthermore, names more prevalent in older generations received lower responses, all else being equal, aligning with older people’s smaller networks. Interaction effects with measurement were generally nonsignificant, showing that test-retest stability was independent of the name bearers’ population prevalence, gender or age, or item position. The only significant interaction was with education.

Test-Retest Reliability of NSUM-Estimated Degrees

The name items formed the basis for estimating network sizes with Killworth, Johnsen, and colleagues’ (1998) and Zheng and colleagues’ (2006) NSUM models. As the bottom two rows of Table 1 indicate, respondents’ test-retest reliability was high for both methods (Killworth, Johnsen, et al.: Spearman’s $ρ = 0.93$, $p < 0.001$; Zheng et al.: $ρ = 0.92$, $p < 0.001$). Thus, individuals’ rankings at $t_{1}$ were excellently reproduced at $t_{2}$. Nonetheless, the estimated median degree was approximately 120 to 140 persons lower at $t_{2}$ than at $t_{1}$, slightly narrowing the degree distribution (see Figure 1, top panel). Wilcoxon’s signed rank tests indicate a significant difference in the median degree between interviews (see Table 1), implying a poor reproduction of precise values.

To better understand the test-retest reliability of NSUM-estimated network sizes, we plotted individuals’ size estimates from the two measurements (see Figure 1, bottom panel). Confidence intervals were wide, especially for higher estimates. Many respondents had higher estimates at $t_{1}$ than at $t_{2}$, but their confidence intervals of $t_{1}$ and $t_{2}$ overlapped. Four respondents (R02, R14, R15, and R46) had radically different estimates at $t_{2}$ than at $t_{1}$: all had high degrees at $t_{1}$ ($\hat{d}_{i1} > 1{,}000$), were perceived as honest by interviewers, and did not stand out in the measured demographic characteristics in any way.

Bland-Altman plots (see Figure 2) confirm that changes in the initially smaller estimated network sizes generally fell within confidence interval limits. The outliers were concentrated at the scale’s higher end. All 11 respondents with Killworth estimates above 1,250 at $t_{1}$ changed substantially between $t_{1}$ and $t_{2}$ , suggesting room for improvement in estimating the size of larger networks.

Figure 2.

Bland-Altman plots for the degree estimate (upper panel) and logged degree estimate (lower panel) using Killworth et al.'s (R) and Zheng et al.'s (L) model.
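For readers unfamiliar with Bland-Altman analysis, the following minimal sketch computes the quantities such a plot displays. The degree estimates are hypothetical, and the limits are assumed to be the conventional limits of agreement (mean difference ± 1.96 standard deviations of the differences), which may not match the exact limits used in Figure 2.

```python
import statistics

def bland_altman(t1, t2):
    """Mean difference, limits of agreement, and indices of points outside them."""
    diffs = [b - a for a, b in zip(t1, t2)]           # retest minus test
    pair_means = [(a + b) / 2 for a, b in zip(t1, t2)]  # x-axis of the plot
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    outside = [i for i, d in enumerate(diffs) if d < lower or d > upper]
    return {"bias": bias, "lower": lower, "upper": upper,
            "pair_means": pair_means, "outside": outside}

# Hypothetical degree estimates for ten respondents; the last one is a "hub".
t1 = [150, 220, 300, 380, 450, 520, 610, 700, 820, 1600]
t2 = [140, 230, 280, 370, 460, 490, 600, 680, 800, 900]

result = bland_altman(t1, t2)
print(result["bias"], round(result["lower"], 1), round(result["upper"], 1), result["outside"])
```

In this toy example only the respondent with the largest initial estimate falls outside the limits, mirroring the concentration of outliers at the higher end of the scale.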

To understand whether demographic variables explain individual variation in test-retest reliability, we ran longitudinal, two-level negative binomial regression models with respondents’ attributes as predictors (see Table 3). The significant positive main effect of education (see also model 1 in Supplemental Table S5 in the online supplement) and its negative interaction with retest show that higher educated respondents had larger networks than did lower educated respondents at $t_{1}$, a common finding for acquaintanceship networks (see Lubbers et al. 2019). However, higher educated respondents experienced a larger decrease in degrees between $t_{1}$ and $t_{2}$ than did lower educated respondents, whose estimates remained more stable over time, all else being equal (see Supplemental Table S5 for more parsimonious models). Supplemental Figure S3 visualizes this association between education and test-retest change on the basis of raw scores and a refitted model with random effects set to zero, including only all main effects and the interaction effect of retest and education (model 2 in Supplemental Table S5). The figure shows that higher educated individuals, regardless of initial network size, reported lower numbers at $t_{2}$ than at $t_{1}$. Despite the observed decrease from $t_{1}$ to $t_{2}$, higher educated individuals still had, on average, larger network estimates at $t_{2}$ compared with lower educated individuals.

Table 3.

Longitudinal, Two-Level Negative Binomial Regression Model of the Estimated Degree (Using Killworth, Johnsen, and Colleagues’ Model; $n = 94$ ).

Parameter	Model 1
Parameter	Coefficient	s.e.
Intercept	5.957***	.401
Retest (binary)	.121	.140
Individual attributes, main effects
Female (binary)	−.390	.221
Middle to higher education (binary)	.627**	.228
Age in years	.008	.007
Born in Spain (binary)	−.065	.233
Change in contact since $t_{1}$ (ref.: same as usual)
Less contact than usual since $t_{1}$ (dummy)	.665*	.300
More contact than usual since $t_{1}$ (dummy)	−.019	.263
Individual attributes, effects on test-retest change
Retest × female	−.113	.077
Retest × middle to higher education	−.251**	.079
Retest × age in years	−.001	.002
Retest × born in Spain	−.055	.081
Retest × less contact than usual since $t_{1}$	−.081	.104
Retest × more contact than usual since $t_{1}$	.098	.092
Random effects	Variance
Respondents, $σ_{u}^{2}$	.506
Dispersion parameter, $ω$	33.1

*p < .05. **p < .01. ***p < .001.
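The education effects in Table 3 can be translated from the log scale into multiplicative differences in estimated degree. This sketch uses only the two education coefficients from the table and holds all other covariates and the random effect equal between the compared respondents; the variable names are illustrative.

```python
import math

# Coefficients from Table 3 (log scale).
education_main = 0.627        # middle to higher education, main effect
education_x_retest = -0.251   # retest x middle to higher education

# Ratio of expected degree, higher vs. lower educated, at each interview.
gap_t1 = math.exp(education_main)
gap_t2 = math.exp(education_main + education_x_retest)

# How much more the higher educated changed from t1 to t2, relative to
# the lower educated (a ratio of retest-to-test ratios).
relative_change = math.exp(education_x_retest)

print(round(gap_t1, 2), round(gap_t2, 2), round(relative_change, 2))
```

On this reading, the expected degree of higher educated respondents exceeded that of otherwise comparable lower educated respondents at both interviews, but the gap narrowed at retest.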

Interviewer effects were evaluated separately (not shown here), but neither the main nor the interaction effects with retest were significant. This suggests interviewers did not influence the responses variably.

Convergent Validity

Finally, we examined the alignment between NSUM-estimated degrees at $t_{1}$ (using either Killworth’s or Zheng’s model) and degrees estimated using the summation method and Facebook friend counts (for Facebook users who checked their accounts, n = 24; respondents who did not check their accounts had a much lower association). All degree variables had right-skewed distributions, which we log-transformed before correlating them. Figure 3 shows that the NSUM-estimated degree (regardless of estimation method) had positive, moderate Spearman correlations with the two criterion variables ($0.53 < ρ < 0.66$). Thus, although related, the measures capture different aspects of acquaintanceship network sizes, possibly different spheres of interaction.

Figure 3.

Correlation matrix of the logged network sizes estimated with Killworth et al.'s model, Zheng et al.'s model, and the two convergent validity measures. *p < .05; **p < .01; ***p < .001.

At $t_{1}$, the median NSUM-estimated degree (using Killworth’s model) was 765 (see Table 1), and the median Facebook network size was 289 (see “Measures”), resulting in a scaling factor of 2.6. This number aligns with Hofstra and colleagues’ (2021) scaling factor of 2.5 (r = 0.34) and Hampton and colleagues’ (2011) scaling factor of 2.8 between these two measures. As expected, the summation method produced much lower estimates (see “Measures”) because our measure did not cover all social contexts.
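The scaling factor is simply the ratio of the two medians, as the short check below shows. The comparison values are those cited in the text; the variable names are illustrative.

```python
median_nsum_degree = 765     # Killworth-based estimate at t1
median_facebook_size = 289   # Facebook friend count

scaling_factor = median_nsum_degree / median_facebook_size
cited = {"Hofstra et al. (2021)": 2.5, "Hampton et al. (2011)": 2.8}

# The factor found here lies between the two cited values.
within_cited_range = min(cited.values()) <= scaling_factor <= max(cited.values())
print(round(scaling_factor, 1), within_cited_range)
```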

Conclusions

ARD questions and NSUM estimates of network size have become increasingly popular, but their measurement properties have been insufficiently studied. In this mixed-methods study we examined these measurement properties in a quota sample of 50 individuals from diverse backgrounds in Spain. The results should be interpreted cautiously because of the small sample size and the possibility that pretest respondents exert more effort to provide accurate answers than do survey respondents (Habecker 2017:118).⁵

We first analyzed response behaviors and found that respondents generally understood the ARD question instructions and the definition of “knowing someone.” However, some respondents questioned the inclusion criteria or added personal criteria, such as a minimum level of acquaintanceship or liking, despite instructions that neither strong relationships nor liking were necessary to count persons as acquaintances. The effect of these personally added or altered inclusion criteria on statistical inference depends on the prevalence of this behavior and whether it occurs randomly or nonrandomly. If specific subpopulations apply these criteria, it could introduce error in subgroup-specific degree variation. If these criteria are selectively applied to certain relationships (e.g., when the criterion of liking is added in a context of homophily), it could create new barrier effects and distort segregation estimates. Clear instructions help ensure uniform interpretation, even if some respondents still deviate.

To further foster uniform interpretation, we designed a response card listing a broad range of social contexts (e.g., family, work, neighborhood, associations) and reminding respondents of the definition of “knowing someone.” Such memory aids are uncommon in NSUM research despite ARD questions being cognitively demanding. Interviewers observed that respondents often glanced at the card, suggesting it helped them focus their recall. On the basis of these observations, we recommend using response cards or similar visual prompts in NSUM research. Although we have not tested the effect on test-retest reliability or debriefed respondents about their experience, we believe the visual prompt promotes a more uniform recall process by emphasizing contexts that structure social relationships. Future research could evaluate the effect of visual prompts on data quality and respondent experience.

Consistent with Habecker (2017), we found that people tended to provide exact counts when the true number was small, but switched to estimation as the numbers grew larger, with some estimating numbers as low as four. A reassuring finding was that responses to name items increased with the prevalence of the names in the population, not only on average, as other studies have shown, but also within respondents. Most respondents who gave low responses for names with 0.06 percent to 0.26 percent prevalence gave numbers twice as high for names with double that prevalence. Conversely, one respondent with exceptionally high answers showed the opposite pattern. These results suggest people with remarkably low and high degrees are not simply under- and overreporting, as they are likely unaware of each name’s national prevalence.

Second, we assessed the test-retest reliability of ARD items and whether item attributes affected their reliability. Contrary to our expectation that respondents would report similar or higher counts on the retest than on the test, respondents reported lower values on most items during the retest. All name questions showed high test-retest reliability (reproduction of the ranking of individuals across measurements), and most had high agreement (reproduction of their exact values). In the qualitative debriefings, respondents indicated that name questions were easier to answer than questions about other types of populations, because of more clearly defined inclusion criteria. In practice, however, they also responded consistently to questions about occupations and religions. Despite potentially larger recall bias, transmission errors, and barrier effects, ARD questions about these subpopulations were highly reliable. ARD questions on religions may have been simpler to answer than those on voting or regions of origin because of the low prevalence of the selected religions in Spain (Zurlo 2024; 0.06 percent for Hindus, 0.11 percent for Jews, and 2.76 percent for Muslims in 2020). Furthermore, respondents noted that small signals (veils, crucifix pendants) or comments via social media or in person often revealed people’s religions. The lower reliability of ARD questions on voting and regions of origin may be related to larger subpopulation sizes and less clear inclusion criteria. Question-order and interviewer effects were nonsignificant. Whether these findings are generalizable to other studies depends on the comparability of study designs, such as interviewer training, question formulation, and the placement of sensitive ARD questions within the questionnaire. In our case, the debriefing at the end of $t_{1}$ might have led to higher fluctuations, but it was necessary for pretesting purposes beyond this article. Future research should examine the test-retest reliability of ARD questions under different conditions.

Third, we examined the test-retest reliability of acquaintanceship network size estimates using NSUM on the basis of the name items, and assessed whether individual and item attributes predicted the stability of these estimates. We compared two estimation methods: the original method by Killworth, Johnsen, et al. (1998) and an improvement by Zheng et al. (2006), which produced strongly correlated estimates in our case, as we used relatively rare names for NSUM estimation that minimized biases. Researchers using more biased ARD items for NSUM may find larger differences in reliability between the two methods. Both estimates showed excellent test-retest reliability for NSUM-estimated degrees ($ρ > 0.90$), but low agreement of precise values as estimated degrees decreased. Respondents’ age, gender, and origin hardly affected the stability of degree estimates, but higher educated individuals had larger networks and more unstable estimates than did lower educated people. This instability was also observed among highly educated respondents with initially smaller networks. Higher educated respondents may have initially overreported because of social desirability bias.

Higher estimates of network size fluctuated considerably between measurements, even though they remained high. Thus, NSUM can reliably detect hubs but struggles to produce stable estimates of their network sizes. This instability is related to individuals’ difficulty in accurately recalling high numbers of contacts and the use of bins to ease those problems. Translating bins to their midpoints for estimation transforms one-category differences (e.g., from “11–20” to “21–50”) to seemingly much larger ones (from 15 to 35), potentially resulting in more volatile measurements. Whether NSUM estimation methods that do not simply take the midpoints of categories but treat them with distributional assumptions (see Feehan et al. 2016) improve test-retest reliability, especially for larger networks, remains to be seen.
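The effect of midpoint coding on degree estimates can be sketched as follows. The two bins are those mentioned above, with midpoints rounded down to match the values in the text. The degree formula is the basic scale-up form (summed responses divided by the summed population share of the items), used here as a simplifying assumption, and the 3 percent summed share and the other responses are hypothetical.

```python
def bin_midpoint(lower, upper):
    """Integer midpoint of a response bin (rounded down, as in the text)."""
    return (lower + upper) // 2

def scale_up_degree(responses, summed_share):
    """Basic scale-up degree: summed responses over summed population share."""
    return sum(responses) / summed_share

mid_low = bin_midpoint(11, 20)    # bin "11-20"
mid_high = bin_midpoint(21, 50)   # bin "21-50"
jump = mid_high - mid_low         # size of a one-category shift

# Hypothetical: the name items jointly cover 3 percent of the population,
# and the remaining items sum to 10 reported contacts at both interviews.
summed_share = 0.03
degree_t1 = scale_up_degree([10, mid_high], summed_share)
degree_t2 = scale_up_degree([10, mid_low], summed_share)
degree_change = degree_t1 - degree_t2

print(mid_low, mid_high, jump, round(degree_change))
```

Under these assumptions, moving a single item down by one category shifts the estimated degree by several hundred contacts, which illustrates why estimates for respondents who use the higher bins are volatile.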

Fourth, we evaluated the alignment of NSUM-estimated network size with alternative estimates of acquaintanceship network size. We found moderate correlations between NSUM measures and network size estimated with the summation method and Facebook friend counts ($0.53 < ρ < 0.66$). These correlations are considered inadequate for the convergent validity of psychometric scales, but they are considerable for this type of sociological, behavioral measure, with each measure operationalizing acquaintanceship networks slightly differently. Indeed, the correlations suggest the measures capture partially different aspects of this dimension, potentially related to spheres of interaction (e.g., online/offline). Easier-to-obtain measures such as Facebook friend counts thus do not replace NSUM measures and have the added disadvantage that many people (38 percent in our sample) do not use Facebook. Our finding that Facebook network size was more strongly related to NSUM-estimated network size when respondents checked their accounts than when they estimated it from memory highlights the importance of ensuring that survey respondents check their accounts when reporting Facebook friend counts. When surveys lack such controls, our research suggests that responses in multiples of ten tend to indicate memory-based answers.

In summary, we demonstrate that ARD and NSUM estimates reliably reproduce the ranking of individuals across measurements, although the precise values of some ARD items and network size estimates may vary significantly between measurements. Consistent ranking is sufficient for research relating network size to its predictors or outcomes, which is the primary aim of most researchers using these estimates. However, whether consistent ranking suffices for estimating the size of hard-to-count subpopulations remains uncertain. Both the estimated acquaintanceship volume (equation 1’s denominator) and responses to the underlying ARD questions (equation 1’s numerator) decreased between measurements, suggesting that reestimates of a subpopulation’s size may remain within the initial estimate’s confidence interval. However, as Table 1 shows, items varied in their change over time, so the reliability of estimates of unknown subpopulation sizes depends on whether reported numbers of acquaintances in hard-to-count subpopulations change in tandem with those in the known subpopulations used for NSUM.
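The argument about numerator and denominator moving in tandem can be made concrete with the basic ratio form of the scale-up estimator (population size times summed reports about the hidden subpopulation, divided by summed degrees). This form is assumed here to correspond to equation 1, and the population size, responses, and degrees below are hypothetical.

```python
import math

def nsum_size(responses_hidden, degrees, population):
    """Ratio-form scale-up estimate of a subpopulation's size."""
    return population * sum(responses_hidden) / sum(degrees)

population = 1_000_000            # hypothetical population size

# Hypothetical test data for four respondents.
hidden_t1 = [4, 0, 2, 6]          # reported contacts in the hidden subpopulation
degrees_t1 = [800, 300, 500, 1200]

# Retest scenario A: reports and degrees both halve (change in tandem).
hidden_a = [2, 0, 1, 3]
degrees_a = [400, 150, 250, 600]

# Retest scenario B: degrees halve but hidden-subpopulation reports do not.
hidden_b = hidden_t1
degrees_b = degrees_a

size_t1 = nsum_size(hidden_t1, degrees_t1, population)
size_a = nsum_size(hidden_a, degrees_a, population)
size_b = nsum_size(hidden_b, degrees_b, population)
print(round(size_t1), round(size_a), round(size_b))
```

When both components shrink proportionally, the size estimate is unchanged; when only the degrees shrink, the estimate doubles in this example, which is the dependence on tandem change noted above.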

Furthermore, the lower stability of estimates of higher network size is concerning. Given the multifaceted importance of hubs in social networks, for example, in their role as superspreaders of diseases (Manzo and Van De Rijt 2020), brokers of innovative ideas (Clement, Shipilov, and Galunic 2018), and key nodes for network resilience (Callaway et al. 2000), future research should propose ways to estimate their network sizes more precisely. Rather than relying on high responses to ARD questions, a more robust approach could involve using very low-prevalence items in addition to standard items, as explored in this study.

This study provides numerous practical recommendations for ARD and NSUM research. Future research could be conducted in other cultural settings, for other types of ARD questions, and with variable time intervals to bolster the knowledge base regarding the measurement properties of ARD and NSUM. Such studies would further strengthen the design, use, and analysis of NSUM instruments.

Supplemental Material

Supplemental material, sj-pdf-1-smx-10.1177_00811750251340398 for The Measurement Properties of Aggregated Relational Data and NSUM-Estimated Network Size by Miranda J. Lubbers, Michał Bojanowski, Nuria Targarona Rifa and Alejandro Ciordia in Sociological Methodology

Footnotes

Acknowledgements

We thank the anonymous reviewers for their valuable input.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This article is part of the project “A network science approach to social cohesion in European societies” (PATCHWORK; Miranda J. Lubbers, principal investigator). This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement 101020038; 10.17605/OSF.IO/BU2WK). Michał Bojanowski thanks ICM University of Warsaw for support through computational grant G74-3. Miranda J. Lubbers is grateful for funding from the Catalan Institution for Research and Advanced Studies (ICREA Acadèmia).

Supplemental Material

Supplemental material for this article is available online.

Author Biographies

Miranda J. Lubbers is a professor of social and cultural anthropology at the Autonomous University of Barcelona and an Acadèmia fellow of the Catalan Institute of Research and Advanced Studies. Her research studies how social networks shape processes of social cohesion, polarization, and exclusion. She is currently directing the European Research Council Advanced Grant–funded project “A Network Science Approach to Social Cohesion in European Societies” (PATCHWORK), among other projects. She is an elected member of the European Academy of Sociology and associate editor of Social Networks.

Michał Bojanowski is a postdoctoral researcher at the Department of Social and Cultural Anthropology of the Autonomous University of Barcelona and assistant professor in the Department of Quantitative Methods and Information Technology, Kozminski University. He is a computational sociologist and R developer and trainer. His research interests are social network dynamics, computational modeling, and simulation. His work has appeared in Social Networks, Network Science, and the Journal of Mathematical Sociology, among other journals.

Nuria Targarona Rifa is a PhD candidate in the Department of Social and Cultural Anthropology, Autonomous University of Barcelona. She is a researcher in the European Research Council–funded project “A Network Science Approach to Social Cohesion in European Societies” (PATCHWORK), where she studies ethnic boundary making in social networks. She conducts mixed-method social network analysis to explore the processes of categorization and boundary drawing. She has previously published in the Journal of Refugee Studies.

Alejandro Ciordia is a postdoctoral researcher on the Faculty of Political and Social Sciences, Scuola Normale Superiore, Florence, and the Maastricht Sustainability Institute at Maastricht University. His research focuses on social cohesion and polarization, organized civil society, social movements and protests, and environmental and climate activism. To examine these topics, he draws on relational theories and employs mixed-method research designs, with a particular emphasis on social network analysis. His work has been published in Voluntas and Mobilization, among other journals.

References

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, eds. 2014. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.

Arnaboldi, Valerio, Marco Conti, Massimiliano La Gala, Andrea Passarella, and Fabio Pezzoni. 2016. “Ego Network Structure in Online Social Networks and Its Impact on Information Diffusion.” Computer Communications 76:26–41.

Berchtold, André. 2016. “Test–Retest: Agreement or Reliability?” Methodological Innovations 9:1–7.

Bernard, H. Russell, Eugene C. Johnsen, Peter D. Killworth, and Robinson. 1989. “Estimating the Size of an Average Personal Network and of an Event Subpopulation.” Pp. 159–74 in The Small World, edited by Kochen. New York: Ablex.

Bland, J. Martin, and Douglas G. Altman. 1986. “Statistical Methods for Assessing Agreement between Two Methods of Clinical Measurement.” The Lancet 327(8476):307–10.

Bojanowski, Michał, and Derick S. Baum. 2024. “Stansum: Bayesian Models for Aggregated Relational Data.” Retrieved May 4, 2025. https://coalesce-lab.github.io/stansum/.

Breen, Casey, Cormac Herley, and Elissa M. Redmiles. 2022. “A Large-Scale Measurement of Cybercrime against Individuals.” Pp. 1–41 in CHI ’22: Conference on Human Factors in Computing Systems – Proceedings. New York: Association for Computing Machinery.

Breza, Emily, Arun G. Chandrasekhar, Tyler H. McCormick, and Mengjie Pan. 2020. “Using Aggregated Relational Data to Feasibly Identify Network Structure without Network Data.” American Economic Review 110(8):2454–84.

Brooks, Brandon, Howard T. Welser, Bernie Hogan, and Scott Titsworth. 2011. “Socioeconomic Status Updates: Family SES and Emergent Social Capital in College Student Facebook Networks.” Information, Communication & Society 14(4):529–49.

Brooks, Mollie E., Kasper Kristensen, Koen J. van Benthem, Arni Magnusson, Casper W. Berg, Anders Nielsen, Hans J. Skaug, Martin Mächler, and Benjamin M. Bolker. 2017. “glmmTMB Balances Speed and Flexibility among Packages for Zero-Inflated Generalized Linear Mixed Modeling.” The R Journal 9(2):378.

Callaway, Duncan S., Mark E. J. Newman, Steven H. Strogatz, and Duncan J. Watts. 2000. “Network Robustness and Fragility: Percolation on Random Graphs.” Physical Review Letters 85(25):5468–71.

Clement, Julien, Andrew Shipilov, and Charles Galunic. 2018. “Brokerage as a Public Good: The Externalities of Network Hubs for Different Formal Roles in Creative Organizations.” Administrative Science Quarterly 63(2):251–86.

DiPrete, Thomas A., Andrew Gelman, Tyler McCormick, Julien Teitler, and Tian Zheng. 2011. “Segregation in Social Networks Based on Acquaintanceship and Trust.” American Journal of Sociology 116(4):1234–83.

Ezoe, Satoshi, Takeo Morooka, Tatsuya Noda, Miriam Lewis Sabin, and Soichi Koike. 2012. “Population Size Estimation of Men Who Have Sex with Men through the Network Scale-Up Method in Japan.” PLoS One 7(1):e31184.

Feehan, Dennis M., and Matthew J. Salganik. 2016. “Generalizing the Network Scale-Up Method: A New Estimator for the Size of Hidden Populations.” Sociological Methodology 46(1):153–86.

Feehan, Dennis M., Vo Hai Son, and Abu Abdul-Quader. 2022. “Survey Methods for Estimating the Size of Weak-Tie Personal Networks.” Sociological Methodology 52(2):193–219.

Feehan, Dennis M., Aline Umubyeyi, Mary Mahy, Wolfgang Hladik, and Matthew J. Salganik. 2016. “Quantity versus Quality: A Survey Experiment to Improve the Network Scale-Up Method.” American Journal of Epidemiology 183(8):747–57.

Fiske, Alan P. 1995. “Social Schemata for Remembering People: Relationships and Person Attributes in Free Recall of Acquaintances.” Journal of Quantitative Anthropology 5:305–24.

Habecker, Patrick. 2017. “Who Do You Know: Improving and Exploring the Network Scale-Up Method.” PhD thesis, University of Nebraska-Lincoln.

Habecker, Patrick, Kirk Dombrowski, and Bilal Khan. 2015. “Improving the Network Scale-Up Estimator: Incorporating Means of Sums, Recursive Back Estimation, and Sampling Weights.” PLoS One 10(12):e0143406.

Hampton, Keith N., Lauren Sessions Goulet, Lee Rainie, and Kristen Purcell. 2011. “Social Networking Sites and Our Lives: How People’s Trust, Personal Relationships, and Civic and Political Involvement Are Connected to Their Use of Social Networking Sites and Other Technologies.” Pew Research Center. Retrieved May 4, 2025. https://www.pewresearch.org/2011/06/16/social-networking-sites-and-our-lives/.

Hidd, Valentin Vergara, Eduardo Lopez, Simone Centellegher, Sam Roberts, Bruno Lepri, and Robin Dunbar. 2022. “The Stability of Transient Relationships.” Scientific Reports 13:6120.

Hofstra, Bas, Rense Corten, and Frank van Tubergen. 2021. “Beyond the Core: Who Has Larger Social Networks?” Social Forces 99(3):1274–1305.

Kazemzadeh, Yasan, Mostafa Shokoohi, Mohammad Reza Baneshi, and Ali Akbar Haghdoost. 2016. “The Frequency of High-Risk Behaviors among Iranian College Students Using Indirect Methods: Network Scale-Up and Crosswise Model.” International Journal of High Risk Behaviors and Addiction 5(3):5–10.

Killworth, Peter D., Eugene C. Johnsen, Christopher McCarty, Gene Ann Shelley, and H. Russell Bernard. 1998. “A Social Network Approach to Estimating Seroprevalence in the United States.” Social Networks 20(1):23–50.

Killworth, Peter D., Christopher McCarty, H. Russell Bernard, Gene Ann Shelley, and Eugene C. Johnsen. 1998. “Estimation of Seroprevalence, Rape, and Homelessness in the United States Using a Social Network Approach.” Evaluation Review 22(2):289–307.

Kunke, Jessica P., Ian Laga, Xiaoyue Niu, and Tyler H. McCormick. 2024. “Comparing the Robustness of Simple Network Scale-Up Method (NSUM) Estimators.” Sociological Methodology 54(2):385–403.

Laga, Ian, Bao, and Xiaoyue Niu. 2021. “Thirty Years of the Network Scale-Up Method.” Journal of the American Statistical Association 116(535):1548–59.

Laga, Ian, Bao, and Xiaoyue Niu. 2024. “Networkscaleup: Network Scale-Up Models for Aggregated Relational Data (Version 0.1-2).” Retrieved May 4, 2025. https://cran.r-project.org/web/packages/networkscaleup/networkscaleup.pdf.

Laga, Ian, Jessica P. Kunke, Tyler H. McCormick, and Xiaoyue Niu. 2024. “Estimating and Correcting the Degree Ratio Bias in the Network Scale-Up Method.” arXiv. Retrieved May 4, 2025. https://arxiv.org/abs/2305.04381.

Meng Hao, Yang, Abu Bakkar Siddique, Narae Lee, Md Reazul Haque, Md Lutfay Tariq Rahman, Manzur Ahmad, et al. 2023. “Using the Network Scale-Up Method to Characterise Kidney Trafficking in Kalai Upazila, Bangladesh.” BMJ Global Health 8(11):1–10.

Lubbers, Miranda J. Forthcoming. “The Role of Social Networks in Institutional Trust during Economic Downturns.” European Sociological Review. https://doi.org/10.1093/esr/jcaf011

Lubbers, Miranda J., José Luis Molina, and Hugo Valenzuela-García. 2019. “When Networks Speak Volumes: Variation in the Size of Broader Acquaintanceship Networks.” Social Networks 56:55–69.

Maltiel, Rachael, Adrian E. Raftery, Tyler H. McCormick, and Aaron J. Baraff. 2015. “Estimating Population Size Using the Network Scale Up Method.” Annals of Applied Statistics 9(3):1247–77.

Manzo, Gianluca, and Arnout Van De Rijt. 2020. “Halting SARS-CoV-2 by Targeting High-Contact Individuals.” Journal of Artificial Societies and Social Simulation 23(4):10.

McCarty, Christopher, Peter D. Killworth, H. Russell Bernard, and Eugene C. Johnsen. 2001. “Comparing Two Methods for Estimating Network Size.” Human Organization 60(1):28–39.

McCormick, Tyler H., Matthew J. Salganik, and Tian Zheng. 2010. “How Many People Do You Know? Efficiently Estimating Personal Network Size.” Journal of the American Statistical Association 105(489):59–70.

McPherson, Miller, Lynn Smith-Lovin, and James M. Cook. 2001. “Birds of a Feather: Homophily in Social Networks.” Annual Review of Sociology 27:415–44.

Morgan, David. 2009. Acquaintances: The Space between Intimates and Strangers. New York: Open University Press.

Nunnally, Jum C., and Ira H. Bernstein. 1994. Psychometric Theory. New York: McGraw-Hill.

Otero, Gabriel, Beate Völker, Jesper Rözer, and Gerald Mollenhorst. 2022. “The Lives of Others: Class Divisions, Network Segregation, and Attachment to Society in Chile.” British Journal of Sociology 73(4):754–85.

Polit, Denise F. 2014. “Getting Serious about Test-Retest Reliability: A Critique of Retest Research and Some Recommendations.” Quality of Life Research 23(6):1713–20.

Schroeder, Matt, Nicolas Florquin, Gergely Hideg, and Olena Shumska. 2019. “Small Arms Trafficking: Perceptions of Security and Radicalization in Ukraine. Assessment for the Ukrainian Ministry of Temporarily Occupied Territories and Internally Displaced Persons.” Geneva, Switzerland: Small Arms Survey.

Shelton, Janie F. 2015. “Proposed Utilization of the Network Scale-Up Method to Estimate the Prevalence of Trafficked Persons.” Pp. 85–94 in Forum on Crime & Society, Vol. 8, edited by United Nations Office on Drugs and Crime. Geneva, Switzerland: United Nations.

Snidero, Silvia, Federica Zobec, Paola Berchialla, Roberto Corradetti, and Dario Gregori. 2009. “Question Order and Interviewer Effects in CATI Scale-Up Surveys.” Sociological Methods and Research 38(2):287–305.

Sully, Elizabeth, Margaret Giorgio, and Selena Anjur-Dietrich. 2020. “Estimating Abortion Incidence Using the Network Scale-Up Method.” Demographic Research 43:1651–84.

Van Tubergen, Frank, Obaid Ali Al-Modaf, Nora F. Almosaed, and Mohammed Ben Said Al-Ghamdi. 2016. “Personal Networks in Saudi Arabia: The Role of Ascribed and Achieved Characteristics.” Social Networks 45:45–54.

UNAIDS (Joint United Nations Programme on HIV/AIDS). 2010. “Network Scale-up: A Promising Method for National Estimates of the Sizes of Populations at Higher Risk.” Geneva, Switzerland: United Nations.

UNAIDS/WHO Working Group on Global HIV/AIDS and STI Surveillance. 2010. “Guidelines on Estimating the Size of Populations Most at Risk to HIV.” Geneva, Switzerland: World Health Organization.

Vardanjani, Hossein Molavi, Mohammad Reza Baneshi, and Ali Akbar Haghdoost. 2015. “Total and Partial Prevalence of Cancer across Kerman Province, Iran, in 2014, Using an Adapted Generalized Network Scale-Up Method.” Asian Pacific Journal of Cancer Prevention 16(13):5493–98.

Zheng, Tian, Matthew J. Salganik, and Andrew Gelman. 2006. “How Many People Do You Know in Prison? Using Overdispersion in Count Data to Estimate Social Structure in Networks.” Journal of the American Statistical Association 101(474):409–23.

Zurlo, Gina A. 2024. World Religion Database. Leiden, the Netherlands: Brill.
