Abstract

Fine-scale covariate rasters are routinely used in geostatistical models for mapping demographic and health indicators based on household surveys from the Demographic and Health Surveys (DHS) program. However, the geostatistical analyses ignore the fact that GPS coordinates in DHS surveys are jittered for privacy purposes. We demonstrate the need to account for this jittering, and we propose a computationally efficient approach that can be routinely applied. We use the new method to analyse the prevalence of completion of secondary education for 20-49 year old women in Nigeria in 2018 based on the 2018 DHS survey. The analysis demonstrates substantial changes in the estimates of spatial range and fixed effects compared to when we ignore jittering. Through a simulation study that mimics the dataset, we demonstrate that accounting for jittering reduces attenuation in the estimated coefficients for covariates and improves predictions. The results also show that the common approach of averaging covariate values in windows around the observed locations does not lead to the same improvements as accounting for jittering.

Keywords

Jittering, DHS surveys, demographic and health indicators, geostatistical analysis, template model builder (TMB)

1 Introduction

Fine-scale spatial estimation of demographic and health indicators has become commonplace (Burstein et al., 2019; Utazi et al., 2019; Local Burden of Disease Vaccine Coverage Collaborators, 2021). This article is focused on prevalences, which include many important indicators such as completion of secondary education, neonatal mortality, and vaccination coverage (Fuglstad et al., 2021). For low- and middle-income countries (LMICs), the household surveys conducted by the Demographic and Health Surveys (DHS) Program are a crucial data source. Geographic information in DHS data is given through GPS coordinates, which describe centres of clusters of households. However, cluster centres are randomly displaced by up to 10 km before being published in order to protect participants' privacy (Burgert et al., 2013). We refer to such small random displacements as jittering of the GPS coordinates.

In global health, it is common practice to ignore jittering and estimate risk using a standard geostatistical model with a binomial likelihood. The latent spatial variation in risk is modelled as the combination of raster- and distance-based covariates and a Gaussian random field (GRF). However, covariate values extracted from rasters can vary widely on the distance scale of jittering. Using the covariate value at the jittered location instead of the original location induces a non-standard form of measurement error (Gustafson, 2003). This may in turn lead to attenuation of effect estimates and errors in uncertainty. Furthermore, not accounting for the positional uncertainty for the GRF artificially reduces the estimated spatial dependency and may reduce predictive power as well (Cressie and Kornak, 2003; Fanshawe and Diggle, 2011; Fronterrè et al., 2018).

To address uncertainty in covariates, Perez-Heydrich et al. (2013, 2016) suggested (a) to use regression calibration in the context of distance-based covariates (Warren et al., 2016), and (b) to average spatial covariates within a 5 km buffer zone for continuous and categorical rasters. However, this approach does not address the issue of attenuation of associations. Fanshawe and Diggle (2011) proposed a Bayesian approach to account for positional uncertainty for the GRF, but did not propagate uncertainty in the covariates, and only used Gaussian likelihoods that are not applicable to prevalences. The approach was also computationally expensive, but Fronterrè et al. (2018) made the approach computationally efficient and demonstrated its applicability to analyse malnutrition based on DHS data.

Recently, Wilson and Wakefield (2021) formulated a full geostatistical model for DHS data that includes an observation model for the jittered GPS coordinates, and estimated the model with integrated nested Laplace approximations (INLA) (Rue et al., 2009) within Markov chain Monte Carlo (MCMC) (Gómez-Rubio and Rue, 2018). Their approach addresses the effect of positional uncertainty on both the spatial covariates and the GRF, but was computationally expensive, with 1000 MCMC iterations requiring 52 hours in their simulation study. Altay et al. (2022) proposed a model similar to that of Wilson and Wakefield (2021), but used a more efficient inference scheme with computation time measured in minutes instead of hours. Their approach was made possible through an approximation of the likelihood, the stochastic partial differential equations (SPDE) approach (Lindgren et al., 2011), and Laplace approximations through template model builder (TMB) (Kristensen et al., 2016).

The simulation study in Altay et al. (2022) revealed that small spatial ranges for the GRF or larger jittering than the DHS scheme were required to see substantial improvements with the new approach over ignoring jittering. Altay et al. (2022) focused on the impact of jittering on the GRF, and lacked raster- and distance-based covariates. Such covariates are far more variable at small spatial scales than a smoothly varying GRF. The aim of this article is to extend the approach in Altay et al. (2022) to a full generalized geostatistical model for prevalence, and to demonstrate that ignoring jittering can lead to attenuation of associations and reduced predictive power when analysing DHS data. We show this via a spatial analysis of the prevalence of secondary education completion among women aged 20-49 in 2018 based on the 2018 Nigeria DHS (NDHS2018) (National Population Commission - NPC and ICF, 2019).

In addition to the new approach, which adjusts for jittering, and the standard approach, which ignores jittering, we consider the common approach of averaging covariates in 5 km × 5 km windows around the provided GPS coordinates (Perez-Heydrich et al., 2013, 2016). The three methods cannot be compared with cross-validation since the true coordinates of the clusters are not known. Therefore, we construct a simulation study that mimics the NDHS2018 dataset to compare them in terms of their ability to estimate parameters and to predict risk at unobserved locations. We use bias and root mean square error (RMSE) to assess parameter estimation, and RMSE and continuous ranked probability score (CRPS) (Gneiting and Raftery, 2007) to assess predictive ability.

We introduce the datasets and variables of interest in Section 2. Then, in Section 3, we describe the new approach that adjusts for jittering, and discuss its implementation. In Section 4, we evaluate parameter estimation and prediction with the different methods through a simulation study that mimics the prevalence of secondary education completion. Then, in Section 5, we demonstrate the differences between adjusting and not adjusting for jittering when analysing the prevalence of secondary education completion for women aged 20-49 in Nigeria. The article ends with discussion and conclusions in Section 6. The code used in the article can be found in the GitHub repository https://github.com/umut-altay/GeoAdjust.

2 Data sources and variables of interest

Our outcome of interest is completion of secondary education, which is an indicator of social wellbeing and life outcome (Lewin, 2008). Rates vary strongly between women and men, but also between urban and rural areas. According to UNESCO (2019), only 1 % of the poorest girls in low income countries will complete secondary education. If a girl completes secondary education, the risk of HIV infection is reduced by about 50 % (UNAIDS, 2022).

We consider the prevalence of secondary education completion for women aged 20-49 years in Nigeria in 2018. The lower bound was chosen because younger women may not have completed secondary education yet, and the upper limit was chosen since older women are not available in the DHS surveys. The year 2018 was chosen since this corresponds to the most recent DHS survey in Nigeria, NDHS2018.

Nigeria is an LMIC with a population of more than 200 million, and NDHS2018 has data collected from 1389 clusters with responses from 33,398 women aged 20-49, where 15,621 of them reported that they had completed their secondary education. In this article, we use the 1380 clusters with valid GPS coordinates, which have responses from 33,193 women aged 20-49 where 15,490 reported that they had completed their secondary education.

Figure 1(b) shows the direct estimates (Rao and Molina, 2015) computed based on the data and the survey design for the 36 states and one federal capital territory. The corresponding uncertainty is expressed through the coefficient of variation (CV). These 37 areas are the first administrative level (admin1), and the results show considerable variation between areas. The CVs increase when direct estimates are calculated at finer spatial scales because the observations are distributed amongst more areas, resulting in smaller sample sizes per area. The second administrative level (admin2) for Nigeria, for example, consists of the 774 local government areas shown in Figure 1(a). The red dots in the figure indicate the 1380 clusters with GPS coordinates within Nigeria available in NDHS2018. The national boundary, admin1 boundaries and admin2 boundaries are based on GADM version 4.0 (GADM).

Figure 1

Maps of (a) admin2 areas and cluster locations for NDHS2018, and (b) admin1 areas with direct estimates.

We expect the prevalence of completion of secondary education to be closely related to the access to educational resources, such as technological infrastructure, schools and teachers. We consider five spatial covariates: population count (PopD) (World Pop, 2022), travel time to nearest city (CityA) (Weiss et al., 2018), elevation (Elev) (National Oceanic and Atmospheric Administration, 2022), distance to nearest river or lake (DistW) (Natural Earth, 2012), and urbanicity ratio (UrbR) (Pesaresi et al., 2016). For UrbR, we scale the original covariate to be on a zero to one scale, and for the other four covariates, we use a log(1 + x)-transformation and then center and standardize the covariate rasters across the pixels. The information about the covariate rasters and the corresponding figures is summarized in Table 1. The covariate rasters are available at different resolutions, and have not been resampled or aggregated to the same resolution since this would involve an extra preprocessing step that would induce misalignment error and potentially add ecological bias (Greenland and Morgenstern, 1989).
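
The two transformations described above can be sketched as follows. This is an illustration only: the pixel values are hypothetical, the use of the population standard deviation (dividing by the number of pixels) is an assumption, and min-max rescaling is only one possible reading of "zero to one scale".

```python
import math

def log1p_standardize(values):
    """Apply log(1 + x), then center and scale to unit standard deviation."""
    logged = [math.log1p(v) for v in values]
    mean = sum(logged) / len(logged)
    # Assumption: population variance (divide by n), not the sample variance.
    variance = sum((v - mean) ** 2 for v in logged) / len(logged)
    sd = math.sqrt(variance)
    return [(v - mean) / sd for v in logged]

def rescale_unit_interval(values):
    """Linearly rescale values to [0, 1] (assumed min-max scaling)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical pixel values, not taken from the actual rasters.
travel_time_minutes = [0.0, 12.0, 45.0, 180.0, 600.0]
urbanicity_raw = [0.0, 10.0, 55.0, 100.0]

city_a = log1p_standardize(travel_time_minutes)
urb_r = rescale_unit_interval(urbanicity_raw)
```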

Table 1

Summary of covariate rasters providing name, description and figure. CityA, Elev, DistW and PopD are transformed, while UrbR is not.

Name	Description	Figure
PopD	Population count (250 m × 250 m)	2(a)
CityA	Travel time in minutes (1 km × 1 km)	2(b)
Elev	Elevation in meters (1 km × 1 km)	2(c)
DistW	Distance to nearest river or lake in degrees (1 km × 1 km)	2(d)
UrbR	Urbanicity ratio (250 m × 250 m)	2(e)

We aim to map spatial variation at a continuous spatial scale through a geostatistical model. A key challenge with using the above covariates is that the locations shown in Figure 1(a) are not the true locations. The true locations have been randomly displaced while respecting the admin2 borders. This means that we can only imprecisely extract covariates from the rasters shown in Figure 2. The new method to account for the uncertainty in locations is described in Section 3.

Figure 2

Transformed covariate rasters for Nigeria.

3 Adjusting for jittering in a geostatistical model

3.1 Notation for DHS data

For a given country and DHS household survey, $C$ clusters are visited. These clusters constitute small geographic areas and are collections of households. A total of $n_{c}$ people at risk are observed and $y_{c} \leq n_{c}$ individuals have positive outcomes for clusters $c = 1, 2, \dots, C$ . The reported GPS coordinates of the cluster centres are $s_{c} \in ℝ^{2}, c = 1, \dots, C$ . These locations are not the true GPS coordinates, but the jittered GPS coordinates. Additionally, the urban/rural designation is known for each visited cluster.

3.2 Geostatistical model

3.2.1 Model for spatial variation in risk

We envision a spatially varying risk, $r (\cdot)$ , for the country of interest $D \subset ℝ^{2}$ modelled through

r (s) = {logit}^{- 1} (η (s)) = {logit}^{- 1} (x {(s)}^{T} β + u (s)), s \in D,

where $x (\cdot)$ is a $p$-dimensional vector of covariates, $β$ is a $p$-dimensional vector of covariate effect sizes, and $u (\cdot)$ is a Matérn GRF. The Matérn covariance function with smoothness $ν = 1$ is parametrized as

C_{M} (s_{1}, s_{2}; σ_{S}^{2}, ρ_{S}) = σ_{S}^{2} (\sqrt{8} \frac{\| s_{1} - s_{2} \|}{ρ_{S}}) K_{1} (\sqrt{8} \frac{\| s_{1} - s_{2} \|}{ρ_{S}}), s_{1}, s_{2} \in D,

where $σ_{S}^{2}$ is the marginal variance, $ρ_{S}$ is the spatial range, and $K_{1}$ is the modified Bessel function of the second kind, order 1.
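
As a numerical illustration of this parametrization, the sketch below evaluates the covariance as a function of distance. The Python standard library has no Bessel functions, so $K_{1}$ is computed here from its integral representation $K_{1} (x) = \int_{0}^{\infty} e^{- x \cosh t} \cosh t \, d t$ with a trapezoidal rule; this numerical choice, the truncation point, and the function names are assumptions of the sketch and not part of the article's implementation.

```python
import math

def bessel_k1(x, steps=2000):
    """Modified Bessel function K_1(x), x > 0, via a truncated trapezoidal rule."""
    # Truncate one unit beyond the point where x * cosh(t) reaches 40;
    # the integrand is negligible from there on.
    upper = math.acosh(max(40.0 / x, 1.0)) + 1.0
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp(-x * math.cosh(t)) * math.cosh(t)
    return total * h

def matern_covariance(distance, sigma2, rho):
    """Matern covariance with smoothness 1, marginal variance sigma2, range rho."""
    if distance == 0.0:
        return sigma2  # limit of x * K_1(x) as x -> 0 is 1
    x = math.sqrt(8.0) * distance / rho
    return sigma2 * x * bessel_k1(x)

# Correlation at a distance equal to the range parameter.
correlation_at_range = matern_covariance(100.0, 1.0, 100.0)
```

With this parametrization the correlation at distance $ρ_{S}$ is small (roughly 0.14), which is why $ρ_{S}$ is interpreted as the spatial range.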

3.2.2 Unadjusted model

When jittering is ignored, the reported cluster locations $s_{1}, \dots, s_{C}$ are treated as the true locations. This gives the unadjusted observation model:

\begin{array}{l} y_{c} ∣ r_{c}, n_{c} \sim Binomial (n_{c}, r_{c}), \\ r_{c} = r (s_{c}) = {logit}^{- 1} (η (s_{c})), \end{array}

(3.1)

where $r_{c}$ is the risk in cluster $c$ , for $c = 1, \dots, C$ .

3.2.3 Adjusted model

Let $s_{1}^{*}, \dots, s_{C}^{*} \in D$ denote the true locations corresponding to the jittered locations $s_{1}, \dots, s_{C}$ . The adjusted observation model is,

\begin{matrix} y_{c} ∣ r_{c}, n_{c} \sim Binomial (n_{c}, r_{c}), s_{c} ∣ s_{c}^{*} \sim π_{Urb [c]} (s_{c} ∣ s_{c}^{*}), \\ r_{c} ∣ s_{c}^{*} = r (s_{c}^{*}) = {logit}^{- 1} (η (s_{c}^{*})), \end{matrix}

(3.2)

where $r_{c}$ is the risk in cluster $c$, and $Urb [c] \in \{U, R\}$ corresponds to the cluster's urban (U) or rural (R) designation, for $c = 1, \dots, C$. In this observation model, both $y_{c}$ and $s_{c}$ are treated as observed quantities. The unobserved true locations $s_{c}^{*}$ are treated as random quantities and assigned a uniform prior $s_{c}^{*} \sim U (A (s_{c}))$, where $A (s) \in \{1, \dots, K\}$ denotes the administrative region containing location $s \in D$ that a cluster location must be jittered within and $K$ is the number of such administrative regions. For example, the NDHS2018 jittered cluster locations must be in the same admin2 area as the associated true cluster locations. This implies that we treat all true cluster locations $s_{c}^{*}$ in the corresponding admin2 area and within the maximum jittering distance of $s_{c}$ as equally likely a priori.

The jittering distributions $π_{U}$ and $π_{R}$ follow from the (known) DHS jittering scheme. Then for an urban cluster $c$ , which can be jittered up to $2 km$ , the jittering distribution is

π_{U} (s_{c} ∣ s_{c}^{*}) \propto \frac{I (A (s_{c}) = A (s_{c}^{*})) \cdot I (d (s_{c}, s_{c}^{*}) < 2)}{d (s_{c}, s_{c}^{*})}, s_{c} \in D,

where $d (s_{c}, s_{c}^{*})$ is the distance in kilometers between $s_{c}$ and $s_{c}^{*}$ , and $I$ is the indicator function. For a rural cluster $c$ , which can be jittered up to $5 km$ except for $1 %$ of clusters jittered up to $10 km$ , the jittering distribution is:

π_{R} (s_{c} ∣ s_{c}^{*}) \propto \frac{I (A (s_{c}) = A (s_{c}^{*}))}{d (s_{c}, s_{c}^{*})} [\frac{99 I (d (s_{c}, s_{c}^{*}) < 5)}{100} + \frac{I (d (s_{c}, s_{c}^{*}) < 10)}{100}], s_{c} \in D .
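
The two jittering distributions can be evaluated, up to their normalizing constants, with a few lines of code. The sketch below is a direct transcription of the expressions above; the function name and arguments are hypothetical, and raising an error at zero displacement (where the $1 / d$ form is undefined) is a choice made for this illustration.

```python
def jitter_density_unnormalized(distance_km, same_admin, urban):
    """Unnormalized pi_U or pi_R for a displacement of `distance_km` kilometers.

    `same_admin` states whether the jittered and true locations lie in the
    same administrative area, i.e. whether A(s_c) = A(s_c^*).
    """
    if distance_km <= 0.0:
        # Choice of this sketch: the 1/d form is undefined at d = 0.
        raise ValueError("displacement distance must be positive")
    if not same_admin:
        return 0.0
    if urban:
        indicator = 1.0 if distance_km < 2.0 else 0.0
    else:
        # 99/100 weight on d < 5 km plus 1/100 weight on d < 10 km.
        indicator = 0.99 * (distance_km < 5.0) + 0.01 * (distance_km < 10.0)
    return indicator / distance_km
```

For instance, a rural displacement of 7 km is only compatible with the 10 km component, so its unnormalized density is $0.01 / 7$, whereas an urban displacement of 3 km has density zero.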

3.2.4 Priors

We assume linear covariate associations, and use the prior $β \sim N_{p} (0, 25 I_{p})$. The range, $ρ_{S}$, and marginal variance, $σ_{S}^{2}$, of the Matérn GRF are assigned a penalized complexity (PC) prior (Fuglstad et al., 2019). This requires selecting two hyperparameters: the a priori median of range $R_{0}$, and the a priori 95th percentile of marginal standard deviation $S_{0}$.

3.3 Implementation

3.3.1 Inference scheme

The observation model in Equation (3.2) can be written as,

\begin{matrix} π (y_{c}, s_{c} ∣ η (\cdot)) = \int_{ℝ^{2}} π (y_{c}, s_{c} ∣ η (\cdot), s_{c}^{*}) π (s_{c}^{*}) d s_{c}^{*} \\ = \int_{ℝ^{2}} π (y_{c} ∣ η (s_{c}^{*})) π_{Urb [c]} (s_{c} ∣ s_{c}^{*}) π (s_{c}^{*}) d s_{c}^{*}, \end{matrix}

(3.3)

for $c = 1, \dots, C$ . Let $θ = (log (σ_{S}^{2}), log (ρ_{S}))$ . We propose an empirical Bayes approach:

Step 1: Calculate the maximum a posteriori (MAP) estimate, $\hat{θ}$ , of $θ$ using $π (θ ∣ y_{1}, \dots, y_{C}, s_{1}, \dots, s_{C})$ .

Step 2: Extract inference about $β$ from $π (β ∣ y_{1}, \dots, y_{C}, s_{1}, \dots, s_{C}, θ = \hat{θ})$ .

Step 3: Estimate risk $r (s)$ at location $s$ using

π (r (s) ∣ y_{1}, \dots, y_{C}, s_{1}, \dots, s_{C}, θ = \hat{θ}) .

Two key components are combined for rapid inference: the SPDE approach to approximate the Matérn GRF (Lindgren et al., 2011), and Template Model Builder (TMB) for empirical Bayesian inference (Kristensen et al., 2016).

3.3.2 SPDE approach

For each cluster $c$, the true location $s_{c}^{*}$ is not known, and the observation model in Equation (3.3) involves the spatial field $u (\cdot)$ at all locations that are compatible with the jittered location $s_{c}$. If we replace the integral in Equation (3.3) by a numerical integration scheme using $N_{Int}$ integration points, we need to evaluate the spatial field at $C \cdot N_{Int}$ locations. A standard implementation of the Matérn model would result in a dense $C \cdot N_{Int} \times C \cdot N_{Int}$ matrix and make computations infeasible even for a few locations.

The SPDE approach (Lindgren et al., 2011) overcomes this issue by approximating the Matérn GRF in a way that results in a sparse precision matrix. First, the area of interest is triangulated with a triangulation consisting of $m$ nodes. Then the GRF $u (\cdot)$ is approximated by

\tilde{u} (s) = \sum_{i = 1}^{m} w_{i} ϕ_{i} (s)

(3.4)

where $ϕ_{i} (\cdot)$ are pyramidal basis functions and $w = {(w_{1} \dots w_{m})}^{T}$ are weights for the basis functions. The SPDE approach results in a distribution $w \sim N_{m} (0, Q {(θ)}^{- 1})$ , where $Q (θ)$ is sparse.

From Equation (3.4), the value at any location is a linear transformation $\tilde{u} (s) = a {(s)}^{T} w, s \in D$ , where $a (s) \in ℝ^{m}$ is sparse with at most three nonzero elements depending on the location $s$ . This means that the spatial field can be evaluated at a large number of locations quickly.
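
The reason $a (s)$ has at most three nonzero elements is that a pyramidal (piecewise linear) basis function is nonzero at $s$ only if its node is a vertex of the triangle containing $s$, and the three nonzero values are the barycentric coordinates of $s$ in that triangle. A minimal sketch for a single triangle, with hypothetical coordinates and node weights:

```python
def barycentric_weights(point, triangle):
    """Values of the three linear basis functions at `point` inside `triangle`."""
    x, y = point
    (x1, y1), (x2, y2), (x3, y3) = triangle
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    return (w1, w2, 1.0 - w1 - w2)

def field_value(point, triangle, node_weights):
    """Evaluate a(s)^T w restricted to the triangle that contains the point."""
    a = barycentric_weights(point, triangle)
    return sum(a_i * w_i for a_i, w_i in zip(a, node_weights))

# Hypothetical triangle and weights at its three nodes.
triangle = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
value = field_value((0.25, 0.25), triangle, (2.0, 4.0, 8.0))
```

Locating the containing triangle for each evaluation point is a separate step that is not shown here.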

The SPDE is given by

(κ^{2} - Δ) (τ u (s)) = W (s), s \in \tilde{D},

where $κ > 0$ and $τ > 0$ are related to marginal variance and range, $Δ$ is the Laplacian, $W (\cdot)$ is standard Gaussian white noise, and $\tilde{D} \supset D$ is an extended domain to reduce boundary effects. We use Neumann boundary conditions to make the problem well defined, and following Lindgren et al. (2011), the effective range and marginal variance are calculated from the SPDE parameters as

ρ_{S} = \frac{\sqrt{8}}{κ} and σ_{S}^{2} = \frac{1}{4 π τ^{2} κ^{2}} .
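
These two relations are easy to invert, which is convenient when priors or reported estimates are stated in terms of $ρ_{S}$ and $σ_{S}^{2}$ while the SPDE is stated in terms of $κ$ and $τ$. A small sketch of the conversion in both directions (the function names are hypothetical):

```python
import math

def spde_to_matern(kappa, tau):
    """Map SPDE parameters (kappa, tau) to (range, marginal variance)."""
    rho = math.sqrt(8.0) / kappa
    sigma2 = 1.0 / (4.0 * math.pi * tau**2 * kappa**2)
    return rho, sigma2

def matern_to_spde(rho, sigma2):
    """Inverse map from (range, marginal variance) to (kappa, tau)."""
    kappa = math.sqrt(8.0) / rho
    tau = 1.0 / (kappa * math.sqrt(4.0 * math.pi * sigma2))
    return kappa, tau

# Round trip using the FullAdj estimates reported in Table 4.
kappa, tau = matern_to_spde(107.68, 1.65)
rho, sigma2 = spde_to_matern(kappa, tau)
```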

3.3.3 Template model builder

We implement the empirical Bayesian inference scheme by employing the built-in auto-differentiation and Laplace approximations of the TMB R package. Unlike sampling based MCMC methods, TMB uses numerical integration, through Laplace approximations, to perform inference. The auto-differentiation is used to speed up the Laplace approximations. See Kristensen et al. (2016) for the details.

In TMB, one can compute arbitrary likelihoods, and we can approximate the likelihood in Equation (3.3) through the integration scheme

π (y_{c}, s_{c} ∣ η (\cdot)) \propto \sum_{k = 1}^{K} α_{k} π (y_{c} ∣ η (s_{c, k}^{*})) π_{Urb [c]} (s_{c} ∣ s_{c, k}^{*}) π (s_{c, k}^{*}),

(3.5)

where $α_{1}, \dots, α_{K}$ are integration weights. More details are available in Altay et al. (2022). Critically, the integration scheme in Equation (3.5) involves

η (s_{c, k}^{*}) = x {(s_{c, k}^{*})}^{T} β + u (s_{c, k}^{*}), k = 1, \dots, K, c = 1, \dots, C .
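
To make the structure of Equation (3.5) concrete, the sketch below evaluates the approximate log-likelihood contribution of one cluster, given the linear predictor at its integration points. It assumes that each point weight already combines $α_{k}$, the jittering density and the prior density into a single number; the function names are hypothetical, and the log-sum-exp formulation is a standard numerical device rather than a description of the article's TMB code.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def binomial_log_pmf(y, n, eta):
    """log Binomial(y | n, logit^{-1}(eta)), written in terms of eta."""
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    log_p = -softplus(-eta)  # log of logit^{-1}(eta)
    log_q = -softplus(eta)   # log of 1 - logit^{-1}(eta)
    return log_choose + y * log_p + (n - y) * log_q

def cluster_log_likelihood(y, n, etas, point_weights):
    """log sum_k w_k Binomial(y | n, logit^{-1}(eta_k)), up to an additive constant.

    Assumption: w_k = alpha_k * pi_Urb(s_c | s_ck) * pi(s_ck), precomputed.
    """
    terms = [math.log(w) + binomial_log_pmf(y, n, eta)
             for eta, w in zip(etas, point_weights) if w > 0.0]
    largest = max(terms)
    return largest + math.log(sum(math.exp(t - largest) for t in terms))
```

When all integration points share the same linear predictor and the weights sum to one, the expression reduces to the ordinary binomial log-likelihood of the unadjusted model in Equation (3.1).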

Based on the known jittering distribution, we construct the integration scheme with rings of integration points around each cluster center. The observed cluster center is the first integration point, and we use 5 and 10 rings for the clusters that are located within urban and rural administrative areas, respectively. Each ring consists of 15 angularly equidistant integration points.
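
A possible layout of such integration points is sketched below. The text specifies the number of rings and the number of points per ring, but not the ring radii; placing each ring at the midpoint of one of equally wide annuli, and working in projected coordinates measured in kilometers, are assumptions made purely for illustration.

```python
import math

def ring_integration_points(center, max_distance_km, n_rings, points_per_ring=15):
    """Observed center plus `n_rings` rings of angularly equidistant points."""
    cx, cy = center
    points = [(cx, cy)]  # the observed cluster center is the first point
    for ring in range(1, n_rings + 1):
        # Assumption: ring radius at the midpoint of the ring-th annulus.
        radius = max_distance_km * (ring - 0.5) / n_rings
        for j in range(points_per_ring):
            angle = 2.0 * math.pi * j / points_per_ring
            points.append((cx + radius * math.cos(angle),
                           cy + radius * math.sin(angle)))
    return points

urban_points = ring_integration_points((0.0, 0.0), 2.0, n_rings=5)    # 1 + 5 * 15 points
rural_points = ring_integration_points((0.0, 0.0), 10.0, n_rings=10)  # 1 + 10 * 15 points
```

Under this layout an urban cluster has 76 integration points and a rural cluster has 151, which gives a sense of how many field evaluations the adjusted likelihood requires per cluster.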

Throughout this article we consider the three approaches shown in Table 2. UnAdj denotes the traditional model that does not adjust for jittering, Smoothed denotes a model where the covariates have been averaged over $5 km \times 5 km$ windows around each location (Perez-Heydrich et al., 2013, 2016), and FullAdj denotes the new geostatistical model that fully adjusts for the positional uncertainty.

Table 2

The three approaches considered in the article.

Approach	Description
UnAdj	Traditional geostatistical model.
Smoothed	UnAdj with covariates averaged over 5 km × 5 km windows.
FullAdj	Geostatistical model adjusting for jittering.

4 Simulation study

In this section, we evaluate the three models illustrated in Table 2 using a known data generating model, where the parameters correspond to FullAdj estimated based on completion of secondary education in Section 5, see Table 4. The data generating model is chosen to achieve realistic scenarios, and the aim is to evaluate accuracy in parameter estimation, and accuracy in predictive distributions. The key interest is how these accuracies vary across different strengths of the signal from the covariates.

We assume that the true spatial risk varies as

r (s) = {logit}^{- 1} (x {(s)}^{T} β + u (s)), s \in D,

(4.1)

where $D$ is Nigeria, $x (\cdot)$ is a 6-dimensional spatially varying vector with 1 as the first element denoting the intercept, and the covariates DistW, CityA, Elev, PopD, and UrbR as the five last elements, $β = {(μ, β_{DistW}, β_{CityA}, β_{Elev}, β_{PopD}, β_{UrbR})}^{T}$, and $u (\cdot)$ is a Matérn GRF. The spatial range $ρ_{S}$ and the marginal variance $σ_{S}^{2}$ of the GRF are set to the values given in Table 4 for FullAdj. Then we construct three scenarios according to weaker, the same, and stronger association to the covariates compared to the estimated coefficients under FullAdj in Table 4:

SignalLow: $β$ is set to 0.5 times the estimated values.

SignalMed: $β$ is set to 1.0 times the estimated values.

SignalHigh: $β$ is set to 1.5 times the estimated values.

For each of the three scenarios, we generate $n_{sim} = 50$ true risk surfaces, $r (\cdot)$ , according to Equation (4.1).

For observations, we follow the design of the NDHS2018 survey. We fix the number of clusters to $C = 1380$, with 568 urban clusters and 812 rural clusters. We also fix the number-at-risk, $n_{c}$, according to NDHS2018 for $c = 1, \dots, C$. For each of the 150 true risk surfaces, the true locations $s_{1}^{*}, \dots, s_{C}^{*} \in D$ are simulated as a Poisson point process with intensity proportional to population density. This makes it more likely that observations are placed at locations with more people, and ensures that no observations are placed at locations with no people. Then, for each cluster $c = 1, \dots, C$, we simulate the response $y_{c} ∣ r (s_{c}^{*}), n_{c} \sim Binomial (n_{c}, r (s_{c}^{*}))$ and the observed location $s_{c} ∣ s_{c}^{*}$ according to the DHS jittering scheme.
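
The last step of this data-generating procedure can be sketched as follows. The sketch assumes the jittering is implemented by drawing a uniform angle and a uniform displacement distance, which yields a density proportional to $1 / d$ as in Section 3.2.3, and that 1% of rural clusters use the 10 km limit. It omits the restriction to the admin2 area, which would require boundary polygons and, for example, a rejection step. Coordinates are taken to be projected and measured in kilometers, and all names are hypothetical.

```python
import math
import random

def jitter_location(true_location, urban, rng):
    """Displace a true location by a uniform angle and a uniform distance."""
    x, y = true_location
    if urban:
        max_km = 2.0
    else:
        # Assumption: 1% of rural clusters are displaced by up to 10 km.
        max_km = 10.0 if rng.random() < 0.01 else 5.0
    angle = rng.uniform(0.0, 2.0 * math.pi)
    distance = rng.uniform(0.0, max_km)
    # Not shown: rejecting displacements that leave the admin2 area.
    return (x + distance * math.cos(angle), y + distance * math.sin(angle))

def simulate_successes(n_at_risk, risk, rng):
    """Draw y ~ Binomial(n_at_risk, risk) as a sum of Bernoulli trials."""
    return sum(1 for _ in range(n_at_risk) if rng.random() < risk)

rng = random.Random(2018)
observed = jitter_location((500.0, 900.0), urban=False, rng=rng)
successes = simulate_successes(24, 0.45, rng)
```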

We fit the models UnAdj, Smoothed and FullAdj described in Section 3 using an intercept and the five covariates described above. The PC prior on the Matérn GRF is specified by $P (σ_{S} > S_{0}) = 0.05$ and $P (ρ_{S} > R_{0}) = 0.50$, where $S_{0} = 1$ is the 95th percentile of the marginal standard deviation, and $R_{0} = 160$ km is the median range. Inference is performed as described in Section 3.3.

Parameter estimation is evaluated by computing the RMSE, $\sqrt{\frac{1}{n_{sim}} \sum_{b = 1}^{n_{sim}} {({\hat{θ}}^{(b)} - θ)}^{2}}$, and the bias, $\frac{1}{n_{sim}} \sum_{b = 1}^{n_{sim}} ({\hat{θ}}^{(b)} - θ)$, where ${\hat{θ}}^{(b)}$ is the posterior mean (or MAP in the case of $ρ_{S}$ and $σ_{S}^{2}$) for dataset $b$ and $θ$ is the true value of the parameter. Predictions are evaluated on a fixed set of 1,000 randomly selected locations within Nigeria, where we predict $η (s) = logit (r (s))$ with the posterior median. These predictions are evaluated by the average RMSE and the average CRPS, where the CRPS is defined by $\int_{ℝ} {(F (x) - I (y \leq x))}^{2} d x$, with $y$ the true value and $F (\cdot)$ the predictive distribution.
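
These criteria can be sketched in a few lines. For the CRPS, the sketch assumes the predictive distribution is represented by a set of samples and uses the identity $CRPS = E | X - y | - \frac{1}{2} E | X - X' |$ from Gneiting and Raftery (2007), which equals the integral definition when $F$ is the empirical distribution of the samples. How the predictive distribution is actually represented in the article's implementation is not stated here, so this is an illustration only.

```python
import math

def bias(estimates, truth):
    """Average deviation of the estimates from the true value."""
    return sum(e - truth for e in estimates) / len(estimates)

def rmse(estimates, truth):
    """Root mean square error of the estimates around the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))

def crps_from_samples(samples, y):
    """CRPS of the empirical distribution of `samples` at the true value y."""
    n = len(samples)
    mean_abs_error = sum(abs(x - y) for x in samples) / n
    mean_abs_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return mean_abs_error - 0.5 * mean_abs_spread
```

The CRPS is in the same units as the predicted quantity and reduces to the absolute error when the predictive distribution is a point mass, so lower values are better for both metrics.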

The simulation study was run on a computing server that operates on Linux (Ubuntu 20.04). The computing server has 28 cores (2 × 14-core Xeon 2.6 GHz) and provides a 256 GB memory limit per user. It took on average 4 minutes to estimate the parameters $σ_{S}$ and $ρ_{S}$ for FullAdj for each one of the 150 datasets, which is the most time-consuming part of the inference described in Section 3.3.1.

Table 3 shows the bias and RMSE for each of the parameters and covariate coefficients for SignalMed. The results show that UnAdj and Smoothed perform almost the same for $σ_{S}^{2}$ and $ρ_{S}$, but that Smoothed gives higher bias and RMSE for the coefficients of the covariates. The latter is a consequence of the fact that smoothing the covariates changes the interpretation of the coefficients.

Table 3

Bias and RMSE of parameter estimation for SignalMed.

		Parameter
	Model	$ρ_{S}$	$σ_{S}^{2}$	$μ$	$β_{DistW}$	$β_{CityA}$	$β_{Elev}$	$β_{PopD}$	$β_{UrbR}$
Bias	UnAdj	-21.32	0.05	0.32	0.00	0.08	0.01	-0.16	0.84
	Smoothed	-19.97	0.05	0.21	0.14	-0.22	0.04	0.15	-1.03
	FullAdj	7.51	0.009	0.27	0.01	0.02	0.02	-0.09	0.37
RMSE	UnAdj	22.48	0.09	0.42	0.27	0.09	0.19	0.16	0.85
	Smoothed	21.37	0.09	0.34	0.37	0.24	0.29	0.17	1.17
	FullAdj	11.96	0.09	0.37	0.26	0.04	0.17	0.09	0.42

For example, the association with the urbanicity ratio of a 250 m × 250 m pixel and the association with the urbanicity ratio averaged over a 5 km × 5 km window are two different things. Additional tables for SignalLow and SignalHigh are found in Section 3 in the Supplementary Material.

Figure 3 shows boxplots of the estimated $ρ_{S}$ and $β_{UrbR}$, underlining that the difference between models is also clear when considering the variation between simulations within the same scenario. There is a clear trend to underestimate the spatial range for UnAdj and Smoothed, and the association with the covariate is too weak for UnAdj. For Smoothed, the model is estimating a coefficient with a different interpretation than for the other models due to the averaging of covariate rasters over $5 km \times 5 km$ windows. The differences are larger for stronger covariate signal levels.

Figure 3

Box plots of estimated $ρ_{S}$ and $β_{UrbR}$ for SignalLow, SignalMed and SignalHigh. The horizontal red lines show the true parameter value.

Figure 4 shows the variation in RMSE and CRPS across datasets for predictions. FullAdj and UnAdj perform almost the same in both predictive metrics for SignalLow, but FullAdj is slightly better for SignalMed, and substantially better for SignalHigh. This indicates that the stronger the signal of the spatial covariates, the larger the gain from adjusting for jittering. The figure also indicates that Smoothed does not have better predictive ability than UnAdj.

Figure 4

Box plots of CRPS and RMSE for predictions for SignalLow, SignalMed and SignalHigh.

In addition to the simulation study described above, we performed a simulation study with the true locations fixed to the observed locations of NDHS2018. The goal was to examine the predictive metrics for the specific spatial design of NDHS2018. There were only minor differences between this simulation study and the one described above, and we therefore show these results in Section 2 of the Supplementary Material. This suggests that we should expect the unadjusted model to underestimate range, and lose predictive accuracy due to reduced association with the covariates.

5 Analysis of completion of secondary education

In this section, we analyse the spatial variation in prevalence of completion of secondary education among 20-49 year old women in Nigeria in 2018 based on the NDHS2018 using the data sources described in Section 2. Our analysis has two aims. First, to map the fine-scale spatial variation in completion of secondary education for 5 km × 5 km pixels, and for the admin1 areas. Second, to determine the associations between the spatial variation in risk and a set of explanatory spatial covariates.

The NDHS2018 has C = 1,380 clusters with jittered GPS coordinates available under the same jittering distribution as in Section 3.2. For all clusters, the jittering was restricted to stay within the correct admin2 area. In total, 33,193 women aged 20-49 years were interviewed and 15,490 of these had completed secondary education. We use the notation $n_{c}$ individuals-at-risk, $y_{c}$ successes, and jittered GPS coordinate $s_{c}$ for $c = 1, \dots, C$ . The five covariates of interest are introduced in Section 2.

We fit the models UnAdj, Smoothed and FullAdj described in Section 3 using an intercept and the five covariates described in Section 2. We use the PC prior on the Matérn GRF specified by $P (σ_{S} > S_{0}) = 0.05$ and $P (ρ_{S} > R_{0}) = 0.50$, where $S_{0} = 1$ is the 95th percentile of the marginal standard deviation, and $R_{0} = 160$ km is the median range. Inference is performed as described in Section 3.3.

Table 4 shows the estimated parameters and their corresponding credible interval lengths (except for $ρ_{S}$ and $σ_{S}^{2}$ , which are fixed to their MAP estimates). The results show that $ρ_{S}$ is estimated substantially smaller for UnAdj and Smoothed than for FullAdj. Based on the results in Section 4, this indicates that spatial correlation is lost by not accounting for jittering. Additionally, $σ_{S}^{2}$ is estimated higher for UnAdj and Smoothed than for FullAdj. This could be because the covariates are able to explain less of the spatial variation for the former two models.

Table 4

Parameter estimates and the corresponding 95% credible interval lengths in parentheses. Uncertainty is not computed for $ρ_{S}$ and $σ_{S}^{2}$ .

	Parameter
Model	$ρ_{S}$	$σ_{S}^{2}$	$μ$	$β_{DistW}$	$β_{CityA}$	$β_{Elev}$	$β_{PopD}$	$β_{UrbR}$
UnAdj	64.15	1.91	-2.24 (0.89)	0.91 (1.24)	-0.40 (0.13)	-0.14 (0.69)	0.14 (0.09)	-0.14 (0.42)
Smoothed	58.42	1.88	-2.32 (0.82)	1.08 (1.43)	-0.78 (0.29)	-0.40 (0.88)	0.49 (0.30)	-2.97 (2.20)
FullAdj	107.68	1.65	-2.21 (1.04)	0.62 (1.29)	-0.43 (0.17)	-0.02 (0.61)	0.32 (0.13)	-1.35 (0.69)

For PopD and UrbR, there is a strong attenuation when jittering is ignored. The credible intervals for the coefficient of UrbR, $β_{UrbR}$, suggest that $β_{UrbR}$ is not significant at the 95% level for UnAdj, whereas $β_{UrbR}$ is significant for FullAdj. This suggests that not accounting for jittering can lead to misleading conclusions. As discussed in Section 4, it is hard to directly compare the estimated coefficients for Smoothed with those for UnAdj and FullAdj since averaging covariates changes their meaning.

Figure 5 shows the 5 km × 5 km pixel maps of predicted risk and CVs for UnAdj, Smoothed and FullAdj. The results show that some areas such as Borno (in the north-east) have up to three times the risk under the UnAdj approach as under FullAdj. Further, UnAdj tends to lead to higher uncertainty in the predictions than FullAdj.

Figure 5

Rows 1, 2 and 3 are predicted risk and the CVs for UnAdj, Smoothed, and FullAdj, respectively. Row 4 shows ratios (UnAdj/FullAdj) of predictions and CVs.

We aggregate point level predictions with respect to population density to produce areal estimates for the 37 admin1 areas (for more on aggregating point level predictions with respect to a population, see Paige et al., 2022). Figure 6 shows the predicted risk and associated CVs for UnAdj, Smoothed and FullAdj. From Figure 6(g), we see that the ratios of the point estimates range from 0.9 to 1.2, and Figure 6(h) shows that the CVs differ by a factor of up to 1.4 in some areas.
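
In its simplest form, aggregating with respect to population density amounts to a population-weighted average of the pixel-level risks within an area. The sketch below shows only this simplified form with hypothetical numbers; the procedure described by Paige et al. (2022) is more involved, and in practice the aggregation would be applied to each posterior sample of the risk surface so that uncertainty carries over to the areal estimate.

```python
def areal_risk(pixel_risks, pixel_populations):
    """Population-weighted average of pixel-level risks within one area."""
    total_population = sum(pixel_populations)
    weighted_sum = sum(r * p for r, p in zip(pixel_risks, pixel_populations))
    return weighted_sum / total_population

# Hypothetical area with two pixels: a populous low-risk pixel
# and a sparsely populated high-risk pixel.
example = areal_risk([0.2, 0.6], [300.0, 100.0])
```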

Figure 6

Rows 1 and 2 are predicted risk and CVs for UnAdj and FullAdj respectively at the admin 1 level, and row 3 shows ratios (UnAdj/FullAdj) of predictions and CVs.

In Figure 7, we compare the areal estimates from the FullAdj model against direct estimates. Figure 7(b) shows that FullAdj reduces uncertainty compared to the direct estimates, and Figure 7(a) shows that the estimates from FullAdj and the direct estimates are compatible, with no unexpected deviations.

Figure 7

Comparison of direct estimates and (areal) model-based estimates for (a) the mean, and (b) the coefficient of variation.

The ability to predict risk at unobserved locations for UnAdj, CovAdj and FullAdj cannot be compared with cross-validation. If data are held out from NDHS2018, we can only evaluate the models' ability to predict risk at a new jittered cluster with unknown true location. However, the simulation study in Section 4 indicates that FullAdj leads to an improvement in prediction if the data-generating model is the one estimated in this section.

6 Discussion

Accounting for jittering substantially changed the parameter estimates for the geostatistical model for completion of secondary education among women aged 20-49 years. The simulation study demonstrated that these differences were linked to the strength of the signal of the spatial covariates when explaining the spatial variation. For strong signals, the associations were attenuated and the predictive power was reduced.

The most important aspect of jittering in the context of geostatistical models for DHS data is to account for the resulting uncertainty in covariates extracted from rasters or extracted based on distances. This induces measurement error that may lead to attenuation in associations between covariates and the responses. Some covariates such as sanitation practices and household assets can be known exactly (Burgert-Brucker et al., 2016), but these cannot be included when the goal is prediction since fine-scale rasters of these covariates are not available.
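The attenuation mechanism can be reproduced in a toy setting. The sketch below is not the geostatistical model of this article: it assumes a one-dimensional transect, a hypothetical sinusoidal covariate, uniform displacements of up to 5 km, and ordinary least squares. It only illustrates how extracting a covariate at a displaced location weakens the estimated association.

```python
import math
import random

random.seed(1)  # fixed seed so that the sketch is reproducible


def covariate(s):
    """Hypothetical smooth covariate along a 1D transect (s in km)."""
    return math.sin(2.0 * math.pi * s / 20.0)


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx


n = 2000
true_beta = 2.0
max_jitter_km = 5.0  # illustrative displacement bound

true_loc = [random.uniform(0.0, 100.0) for _ in range(n)]
obs_loc = [s + random.uniform(-max_jitter_km, max_jitter_km) for s in true_loc]

x_true = [covariate(s) for s in true_loc]  # covariate at the true location
x_obs = [covariate(s) for s in obs_loc]    # covariate at the jittered location
y = [true_beta * x + random.gauss(0.0, 0.1) for x in x_true]

slope_true = ols_slope(x_true, y)
slope_jittered = ols_slope(x_obs, y)
print("slope using true locations:    ", round(slope_true, 2))
print("slope using jittered locations:", round(slope_jittered, 2))
```

The slope based on jittered locations is pulled towards zero. For this particular construction, the expected attenuation factor is approximately $2/π ≈ 0.64$; the size of the effect depends on how quickly the covariate varies relative to the jittering distance.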

In influential work such as Burstein et al. (2019) and Local Burden of Disease Vaccine Coverage Collaborators (2021), covariates are resampled to a $5\,\mathrm{km} \times 5\,\mathrm{km}$ grid. This is similar to, but slightly different from, the Smoothed model discussed in this article, which uses $5\,\mathrm{km} \times 5\,\mathrm{km}$ windows around the observed locations. However, the results suggest that such approaches that average covariates do not address the issue of jittering, and do not improve predictions.
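For concreteness, window averaging around an observed location can be sketched as follows. The sketch assumes a raster stored as nested lists with 1 km cells, so that a window of 5 by 5 cells corresponds to $5\,\mathrm{km} \times 5\,\mathrm{km}$; the function name, the edge handling and the toy raster are hypothetical and are not taken from the code used in the article.

```python
def window_mean(raster, row, col, half_width=2):
    """Mean of raster values in a square window centred on (row, col).

    half_width=2 gives a 5 x 5 cell window. Near the raster edge the
    window is clipped, so fewer cells enter the average.
    """
    n_rows, n_cols = len(raster), len(raster[0])
    values = [
        raster[r][c]
        for r in range(max(0, row - half_width), min(n_rows, row + half_width + 1))
        for c in range(max(0, col - half_width), min(n_cols, col + half_width + 1))
    ]
    return sum(values) / len(values)


# Toy 7 x 7 raster: zero everywhere except a single spike in the centre.
toy_raster = [[0.0] * 7 for _ in range(7)]
toy_raster[3][3] = 25.0

print(window_mean(toy_raster, 3, 3))  # full 5 x 5 window around the spike
print(window_mean(toy_raster, 1, 1))  # clipped 4 x 4 window that still reaches the spike
print(window_mean(toy_raster, 0, 0))  # clipped 3 x 3 window that misses the spike
```

The averaged value is a different covariate from the point value, which is one reason why coefficients under Smoothed are not directly comparable with those under UnAdj and FullAdj.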

This article used uniform priors for the unknown true locations. One could expect that including information about population density and urbanicity in the priors would produce more accurate inference. However, population density maps and urbanicity maps are also modelled surfaces with biases and uncertainties that are not well understood. This means that the sensitivity to such maps would have to be investigated, and one would need a way to evaluate whether such a model works better.
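The role of the prior can be illustrated on a small set of candidate true locations for one cluster. The sketch below applies Bayes' rule on a discrete set of points: the weight of each candidate is proportional to the prior times the density of observing the jittered location given that candidate. All numbers are made up, and the function is a generic illustration, not the inference scheme of the article.

```python
def location_weights(jitter_density, prior):
    """Normalised weights over candidate true locations.

    jitter_density[i]: density of the observed (jittered) location given
                       that candidate i is the true location.
    prior[i]:          unnormalised prior mass of candidate i.
    """
    unnormalised = [d * p for d, p in zip(jitter_density, prior)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Hypothetical values for three candidate locations.
jitter_density = [1.0, 1.0, 0.5]
uniform_prior = [1.0, 1.0, 1.0]
population_prior = [10.0, 30.0, 60.0]  # e.g. taken from a population raster

print(location_weights(jitter_density, uniform_prior))
print(location_weights(jitter_density, population_prior))
```

With the population-based prior, mass shifts towards the more populated candidates, so any bias in the population raster would propagate directly into these weights.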

The inference scheme uses empirical Bayes. It is possible to investigate methods such as INLA, but the implementation in the R (R Core Team, 2022) package INLA does not allow the likelihood to depend on the latent risk at multiple locations, which is necessary due to the integration points. MCMC software such as Stan (Stan Development Team, 2020) has the required flexibility, but is infeasible for thousands of spatial locations.

When analysing completion of secondary education, we found that, for urbanicity, an effect size of 0 was contained in the 95% credible interval when not adjusting for jittering, and not contained when adjusting for jittering. This suggests that not accounting for jittering when analysing DHS data is a practice that can alter conclusions about statistical significance. Since the proposed approach is fast for spatial analysis, we suggest using the new approach routinely for analysing DHS data to avoid the risk of misleading conclusions and reduced predictive power. The code used throughout the article can be found in the GitHub repository https://github.com/umut-altay/GeoAdjust.

Supplementary material

Supplementary material for this article is available online.

Supplemental Material for Impact of jittering on raster- and distance-based geostatistical analyses of DHS data by Umut Altay, John Paige, Andrea Riebler and Geir-Arne Fuglstad, in Statistical Modelling

Appendix

Computational details

R (R Core Team, 2022) is the main programming language used for the code that produced the results in this manuscript. The computations rely on automatic differentiation via TMB (Kristensen et al., 2016, 2023). This means that low-level computations are done in C++ code, while high-level computations are done in R. The code used for this article can be found in the GitHub repository https://github.com/umut-altay/GeoAdjust.

The application in Section 5 was run on a MacBook Pro (2.4GHz Quad-Core Intel Core i5 and 16 GB memory). For FullAdj, it took 7.8 minutes to estimate parameters, and 4.2 minutes to compute predictions.

A newly developed R package, GeoAdjust (Altay et al., 2023), is now available on CRAN for analysing DHS data while accounting for jittering in a more compact way, similar to the analysis in Section 5. The package provides user-friendly functionality by shielding the user from C++ and complex R code. It also supports the analysis of data with Gaussian, binomial and Poisson responses.

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The authors received no financial support for the research, authorship and/or publication of this article.

References

1. Altay, Paige, Riebler and G-A Fuglstad (2022) Accounting for spatial anonymization in DHS household surveys. arXiv preprint arXiv:2202.11035.

2. Altay, Paige, Riebler and G-A Fuglstad (2023) GeoAdjust: Adjusting for positional uncertainty in geostatistical analysis of DHS data. arXiv preprint arXiv:2303.12668.

3. Burgert, Colston, Roy and Zachary (2013) Geographic displacement procedure and georeferenced data release policy for the Demographic and Health Surveys. DHS Spatial Analysis Reports No. 7. https://dhsprogram.com/pubs/pdf/SAR7/SAR7.pdf.

4. Burgert-Brucker, Dontamsetti, Marshall and Gething (2016) Guidance for use of the DHS program modeled map surfaces. Spatial Report No. 14.

5. Burstein, Henry, Collison, Marczak, Sligar, Watson, Marquez, Abbasalizad-Farhangi, Abbasi, AbdAllah et al. (2019) Mapping 123 million neonatal, infant and child deaths between 2000 and 2017. Nature, 574(7778), 353–358.

6. Cressie and Kornak (2003) Spatial statistics in the presence of location error with an application to remote sensing of the environment. Statistical Science, 436–456.

7. Fanshawe and Diggle (2011) Spatial prediction in the presence of positional error. Environmetrics, 22(2), 109–122.

8. Fronterrè, Giorgi and Diggle (2018) Geostatistical inference in the presence of geomasking: a composite-likelihood approach. Spatial Statistics, 28, 319–330.

9. G-A Fuglstad, Simpson, Lindgren and Rue (2019) Constructing priors that penalize the complexity of Gaussian random fields. Journal of the American Statistical Association, 114, 445–452.

10. G-A Fuglstad, Li and Wakefield (2021) The two cultures for prevalence mapping: Small area estimation and spatial statistics. arXiv preprint arXiv:2110.09576.

11. GADM. GADM (version 4.0). https://gadm.org/download_country.html.

12. Gneiting and Raftery (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359–378.

13. Gómez-Rubio and Rue (2018) Markov chain Monte Carlo with the integrated nested Laplace approximation. Statistics and Computing, 28(5), 1033–1051.

14. Greenland and Morgenstern (1989) Ecological bias, confounding and effect modification. International Journal of Epidemiology, 18, 269–274.

15. Gustafson (2003) Measurement error and misclassification in statistics and epidemiology: impacts and Bayesian adjustments. CRC Press.

16. Kristensen, Nielsen, Berg, Skaug and Bell (2016) TMB: Automatic differentiation and Laplace approximation. Journal of Statistical Software, 70(5), 1–21.

17. Kristensen, Bell, Skaug, Magnusson, Berg, Nielsen, Maechler, Michelot, Brooks, Forrence, Albertsen and Monnahan (2023) TMB: Template model builder: A general random effect tool inspired by 'ADMB'. URL https://CRAN.R-project.org/package=TMB. R package version 1.9.6.

18. Lewin (2008) Strategies for sustainable financing of secondary education in Sub-Saharan Africa, volume 136. World Bank Publications.

19. Lindgren, Rue and Lindström (2011) An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach (with discussion). Journal of the Royal Statistical Society, Series B, 73, 423–498.

20. Local Burden of Disease Vaccine Coverage Collaborators (2021) Mapping routine measles vaccination in low- and middle-income countries. Nature, 589(7842), 415–419.

21. National Oceanic and Atmospheric Administration (2022) National centers for environmental information. URL https://www.ngdc.noaa.gov/mgg/topo/gltiles.html.

22. National Population Commission - NPC and ICF (2019) Nigeria Demographic and Health Survey 2018 - final report. http://dhsprogram.com/pubs/pdf/FR359/FR359.pdf.

23. Natural Earth (2012) Rivers + lake centerlines. URL http://www.naturalearthdata.com/downloads/10m-physical-vectors/10m-rivers-lake-centerlines.

24. Paige, G-A Fuglstad, Riebler and Wakefield (2022) Spatial aggregation with respect to a population distribution: Impact on inference. Spatial Statistics, 52, 1–21.

25. Perez-Heydrich, Warren, Burgert and Emch (2013) Guidelines on the use of DHS GPS data. ICF International, Calverton, Maryland. Spatial analysis reports no. 8.

26. Perez-Heydrich, Warren, Burgert and Emch (2016) Influence of Demographic and Health Survey point displacements on raster-based analyses. Spatial Demography, 4 (*), 135–153.

27. Pesaresi, Ehrlich, Ferri, Florczyk, Freire, Halkia, Julea, Kemper, Soille, Syrris et al. (2016) Operating procedure for the production of the global human settlement layer from Landsat data of the epochs 1975, 1990, 2000, and 2014. Publications Office of the European Union, 1–62.

28. R Core Team (2022) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

29. Rao and Molina (2015) Small area estimation, Second Edition. John Wiley, New York.

30. Rue, Martino and Chopin (2009) Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2), 319–392.

31. Stan Development Team (2020) RStan: the R interface to Stan. URL http://mc-stan.org/. R package version 2.21.2.

32. UNAIDS (2022) School saves lives: World leaders back a courageous goal, 'Education Plus', to prevent new HIV infections through education and empowerment. URL https://www.unaids.org/en/resources/presscentre/pressreleaseandstatementarchive/2022/september/20220919_PR_TES_Eduplus.

33. UNESCO (2019) Her education, our future: UNESCO fast-tracking girls' and women's education. URL https://en.unesco.org/news/her-education-our-future-unesco-fast-tracking-girls-and-womens-education.

34. Utazi, Thorley, Alegana, Ferrari, Takahashi, Metcalf CJE, Lessler, Cutts and Tatem (2019) Mapping vaccination coverage to explore the effects of delivery mechanisms and inform vaccination strategies. Nature Communications, 10(1), 1–10.

35. Warren, Perez-Heydrich, Burgert and Emch (2016) Influence of demographic and health survey point displacements on distance-based analyses. Spatial Demography, 4(2), 155–173.

36. Weiss, Nelson, Gibson, Temperley, Peedell, Lieber, Hancher, Poyart, Belchior, Fullman et al. (2018) A global map of travel time to cities to assess inequalities in accessibility in 2015. Nature, 553(7688), 333–336.

37. Wilson and Wakefield (2021) Estimation of health and demographic indicators with incomplete geographic information. Spatial and Spatio-temporal Epidemiology, 37, 100421.

38. WorldPop (2022) Open spatial demographic data and research. URL https://hub.worldpop.org/doi/10.5258/SOTON/WP00648.

