Abstract

This paper proposes a methodology to obtain estimates in small domains when the target is a composite indicator. These indicators are of utmost importance for studying multidimensional phenomena, but little research has been done on how to estimate them in the small area context. Composite indicators are particularly complex for this purpose because their construction requires different data sources, aggregation procedures, and weighting, which makes both the estimation for small domains and the computation of uncertainty measures challenging. As a case study of our proposal, we estimate the incidence of multidimensional poverty at the municipality level in Colombia by incorporating innovative data sources such as geospatial data. Furthermore, we provide uncertainty measures based on a parametric bootstrap algorithm.

Keywords

composite indicators, small area estimator, generalized linear mixed model, multidimensional poverty

1. Introduction

Official statistics are a useful tool for decision-makers, as they provide information on the characteristics of a country’s population and allow them to apply and monitor public policies aimed at specific population groups. Composite indicators are commonly used to summarize complex phenomena that involve two or more dimensions. These types of indicators are implemented in different areas, for example, in social science: the Human Development Index (HDI; United Nations Development Programme 2023b), the Gender Development Index (United Nations Development Programme 2023a), the Multidimensional Poverty Index (United Nations Development Programme 2023c), and the Corruption Perceptions Index (Transparency International 2023); in environment: the Environmental Performance Index (Wolf et al. 2022); and in economy: the Composite of Leading Indicators (OECD 2023).

A composite indicator facilitates a joint analysis of relevant aspects, which allows for a more comprehensive understanding of complex phenomena, enables better communication of the results, and supports decision-making (CEPAL 2013).

Despite their advantages for public policy, composite indicators should be constructed and used carefully. For example, the application of different weighting and aggregation methods could lead to misleading results, and the quality of the different data sources used for their construction should be evaluated (Freudenberg 2003).

Among the challenges in the construction and applicability of composite indicators is disaggregation. The demand for disaggregated information is increasing quickly, since it is essential for developing and implementing strategies to improve the quality of life of the inhabitants of a country on issues such as employment, poverty, education, and health. A recent case that illustrates this need is the Sustainable Development Goals, which seek to provide information disaggregated by different relevant characteristics of the population (United Nations General Assembly 2015).

Small area estimation (SAE) has been proposed as a solution to the disaggregation problem. SAE methods aim to produce estimates in small domains with adequate precision by combining two or more sources of information. Most of these methodologies, usually supported by unit-level or area-level regression models, provide efficiency gains if the correlation between existing auxiliary information and the survey data is sufficient (Pfeffermann 2013; Pratesi 2016; Rao and Molina 2015; Tzavidis et al. 2018). However, to the best of our knowledge, no literature has been produced on obtaining quality estimates of composite indicators within the small area context. The literature related to this subject is mainly focused on dimensionality reduction for latent indicators such as economic well-being (see, e.g., Moretti et al. 2020, 2021).

Although the construction of composite indicators requires some standard steps (Joint Research Centre of the European Commission and OECD 2008), we focus only on the estimation of indicators that are already defined, as well as their weights. That means we do not deal with dimensionality reduction, normalization, aggregation, and weighting aspects.

Composite indicators can aggregate the information in different ways. One option is to first aggregate at the population level and then combine dimensions. An example of this is the Human Development Index, which is based on summary indicators for the country. A second option is to first aggregate across dimensions and then across individuals. This is the case of indices such as the Unmet Basic Needs index or the Multidimensional Poverty Index (Alkire and Foster 2007). Here we present a proposal for a composite indicator of the latter type, which first aggregates indicators defined as dichotomous variables at the individual level.

To illustrate our proposal, we use the incidence of multidimensional poverty. The index is based on the Alkire and Foster (2007) methodology, which identifies deprivations for each relevant dimension of well-being and then aggregates them at the household level. The main difference is that the index presented here corresponds only to the incidence, which is a component of the Multidimensional Poverty Index. The incidence of multidimensional poverty that we define here is only for illustration purposes.

The components of a multidimensional poverty index are usually estimated using data from household surveys. These instruments usually collect information on many different dimensions of well-being at the household level, which is required by the Alkire and Santos (2010) method and other approaches that consider simultaneous deprivations. Surveys in Latin America also have the benefits of being frequently updated (on a yearly basis in most countries) and having nationally representative samples.

One shortcoming of household surveys is their limited capacity for disaggregation by specific population groups and geographical areas. In addition to the poverty rate at the national level, it is desirable to identify which groups of the population are more afflicted by poverty and what the relative contribution of each particular deprivation to their poverty level is, so as to provide useful information for the implementation of public policies.

Data sources such as administrative records and population censuses are better suited for attaining higher levels of disaggregation, but they face their own particular limitations. In Latin American countries, administrative records are usually not accessible at the individual level, do not have the necessary quality to produce reliable statistics, or do not provide information on different deprivations for the same individuals. On the other hand, population censuses are usually produced only every ten years and collect information on a restricted set of variables that does not allow the calculation of a complete multidimensional poverty index.

The objective of this paper is to provide a methodology to produce disaggregated estimations at the municipal level of the incidence of a multidimensional poverty index using small-area methods, taking Colombia as a case study. To that end, model-based estimation methods are applied to integrate data from the survey with the population census. The methodology aims to preserve the information of each deprivation indicator so that the final index (incidence of multidimensional poverty) can be decomposed by dimension. As an uncertainty measure, the mean squared error (MSE) is derived via parametric bootstrap.

The structure of the paper is as follows: the proposal to obtain small area estimates for composite indicators is explained in Section 2, as well as the procedure to obtain MSE estimates for the corresponding point estimates. We present a case study to show the implementation of our proposal in Section 3 as well as a simulation exercise to validate our proposal. Conclusions and further research are presented in Section 4.

2. Incidence of Multidimensional Poverty: A Methodology for Small Area Estimation

Unlike other problems in SAE, getting a final indicator at the domain level (e.g., a total or mean) is not useful in this case. Because of its nature, the computation of the incidence of multidimensional poverty requires information on each individual for each of the indicators. That is, the challenge is to first estimate the status of deprivation (deprived or not deprived) for each indicator across all persons in the census.

In general, the type of data required by the various composite indicators may differ. Some indicators are constructed from numerical variables, for example, life expectancy, income, or years of schooling. In this paper, we focus on presenting a proposal to obtain small area estimates for a composite indicator consisting only of dichotomous variables.

In this section, we first explain the example that we use in our case study: the incidence of multidimensional poverty. Next, we introduce small area estimation methods for binary response variables (indicators). Then, we present our proposal for the case where one or more indicators are missing in the census. Finally, we describe how we address the problem of finding an uncertainty measure for this scenario.

2.1. The Incidence of Multidimensional Poverty

National statistical surveys are often unsuitable for providing reliable socio-demographic estimates at the domain level, because the high cost of data collection leads to small domain sample sizes. SAE procedures are estimation methodologies for obtaining such highly disaggregated target information under small sample sizes. Their basic principle is to improve classical procedures by combining survey and register data through a suitable model.

In this paper, we focus on composite indicators that are constructed from dichotomous variables. For this reason, we selected as an example the incidence of multidimensional poverty, since it is a well-known indicator (Joint Research Centre of the European Commission and OECD 2008; Moretti et al. 2020). One particularity of the Alkire and Foster (2007) methodology, on which the Global Multidimensional Poverty Index is based, is that each country can define its own set of indicators and dimensions according to its own needs. Regardless of their specification, multidimensional poverty indexes share the same characteristic: indicators explaining deprivations are grouped within dimensions. Dimensions correspond to the relevant components of well-being that are related to the notion of poverty. The measurement of each dimension is operationalized through specific indicators, selected based on the information that is available, usually in household surveys. For each indicator, a deprivation cut-off is used to determine whether a person is to be considered deprived.

Let us assume an index with $K$ indicators which are measured as deprivations: $y_{dj}^{k} = 1$ if the person has the deprivation and $y_{dj}^{k} = 0$ if the person does not have the deprivation. The index requires the information for each individual $j = 1, \dots, N_{d}$ in $d = 1, \dots, D$ domains, where $N_{d}$ denotes the population size of the domain $d$. The Global Multidimensional Poverty Index for the domain $d$ proposed by Alkire and Foster (2007) is computed as:

MDI_{d} = H_{d} \cdot A_{d},

where $H_{d}$ is the incidence of multidimensional deprivations, which measures the proportion of people who are deprived in a specified number of dimensions, and $A_{d}$ is the intensity, which reflects the depth of poverty and is calculated as the average deprivation score among those classified as multidimensionally poor. In this paper, we focus only on the incidence $H_{d}$, which is defined as:

H_{d} = \frac{1}{N_{d}} \sum_{j = 1}^{N_{d}} I (q_{dj} > z) .

(1)

The indicator function $I (\cdot)$ equals 1 when the condition $q_{dj} > z$ is met and 0 otherwise, where $z$ is the threshold that defines whether a person is poor and is traditionally set equal for all domains. In this context, $q_{dj}$ represents the weighted number of deprivations that an individual presents.
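To make Equation (1) concrete, the following is a minimal sketch of how the incidence could be computed from a matrix of deprivation indicators. The function name, the toy data, and the weights are illustrative assumptions and are not taken from the case study.

```python
import numpy as np

def incidence_by_domain(y, weights, domain, z):
    """Compute H_d, the share of persons in domain d with q_dj > z.

    y       : (N, K) array of 0/1 deprivation indicators
    weights : (K,) array of indicator weights
    domain  : (N,) array of domain labels
    z       : poverty threshold
    """
    y = np.asarray(y)
    domain = np.asarray(domain)
    q = y @ np.asarray(weights, dtype=float)  # weighted deprivation score q_dj
    poor = q > z                              # indicator I(q_dj > z)
    return {lab.item(): float(poor[domain == lab].mean())
            for lab in np.unique(domain)}

# Toy data (hypothetical): 5 persons, 3 indicators, 2 domains.
y = np.array([[1, 1, 0],
              [0, 0, 0],
              [1, 1, 1],
              [0, 1, 0],
              [1, 0, 1]])
weights = np.array([0.25, 0.25, 0.5])
domain = np.array(["a", "a", "a", "b", "b"])
H = incidence_by_domain(y, weights, domain, z=0.4)
```

In this toy example, two of the three persons in domain "a" and one of the two persons in domain "b" exceed the threshold.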

Several models could be applied to obtain estimates of the final $MDI_{d}$, the $H_{d}$, and each dimension independently, using, for example, area-level models (Fay and Herriot 1979), or directly modeling the number of deprivations $q_{dj}$ that an individual has via unit-level models. However, our approach seeks to provide information not only for the incidence of multidimensional poverty but also for each component. Applying, for instance, several area-level models, one for each dimension, would not allow the computation of a final estimate. Our approach, that is, a unit-level model for each single indicator $k$, has this advantage in comparison with these alternatives.

2.2. Small Area Estimation for Binary Variables

In many applications the variable of interest in small areas is binary, for example, $y_{dj} = 0$ or $1$ representing the absence or presence of a specific characteristic. For a binary case, the target estimate in each domain $d = 1, \dots, D$ can be the proportion ${\bar{Y}}_{d} = π_{d} = \frac{1}{N_{d}} \sum_{j = 1}^{N_{d}} y_{dj}$ of the population having this characteristic, where $π_{dj}$ is the probability that a specific unit $j$ in the domain $d$ takes the value 1.

Although other methods have been proposed for binary outcomes, for example, based on M-quantile modeling (Chambers et al. 2016), in this application we follow the traditional approach based on generalized linear mixed models. Under this scenario, $π_{dj}$ is modeled with a logit link function:

logit (π_{dj}) = \log (\frac{π_{dj}}{1 - π_{dj}}) = η_{dj} = x_{dj}^{T} β + u_{d}

(2)

with $j = 1, \dots, N_{d}$, $d = 1, \dots, D$, $β$ a vector of fixed-effect parameters, and $u_{d}$ the random area-specific effect for the domain $d$ with $u_{d} \sim N (0, σ_{u}^{2})$. The $u_{d}$ are assumed independent and $y_{dj} | u_{d} \sim Bernoulli (π_{dj})$ with $E (y_{dj} | u_{d}) = π_{dj}$ and $Var (y_{dj} | u_{d}) = σ_{dj}^{2} = π_{dj} (1 - π_{dj})$. Furthermore, $x_{dj}$ represents the $p \times 1$ vector of values of $p$ unit-level auxiliary variables.

Since our specific problem is to find deprivations (0, 1) for several indicators, we use this unit-level Bernoulli logit mixed model as the starting point. Derivations of different algorithms to fit the unit-level logit mixed model can be found in Morales et al. (2021), namely: the method of simulated moments (MSM), the expectation-maximization (EM) algorithm, the penalized quasi-likelihood (PQL) algorithm (González-Manteiga et al. 2007), and the maximum likelihood Laplace (ML-Laplace) approximation algorithm. For convenience, we use the last of these, as it is available in the R lme4 package.

The goal of obtaining quality estimates in small domains can be jeopardized when several of these domains are not in the sample or the sample size is not enough to produce reliable results. An empirical best predictor (EBP) can be defined for this purpose (Jiang 2003; Jiang and Lahiri 2001).

The EBP for quantities of interest such as probabilities, sums of probabilities, and proportions by domain can be approximated using Monte Carlo simulation (Hobza and Morales 2016). In practice, this option is usually avoided because the EBP has no closed form and requires numerical approximation for its computation (Chambers et al. 2016). As an alternative, the plug-in predictor of $π_{dj}$ is defined as:

{\hat{π}}_{dj}^{in} = \frac{\exp (x_{dj}^{T} \hat{β} + {\hat{u}}_{d})}{1 + \exp (x_{dj}^{T} \hat{β} + {\hat{u}}_{d})},

(3)

which would allow obtaining the plug-in predictor of ${\bar{Y}}_{d}$ :

{\hat{\bar{Y}}}_{d}^{in} = \frac{1}{N_{d}} (\sum_{j \in s_{d}} y_{dj} + \sum_{j \in r_{d}} {\hat{π}}_{dj}^{in}),

(4)

where $s_{d}$ and $r_{d}$ represent the in-sample and out-of-sample units of domain $d$, respectively.
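As an illustration of Equations (3) and (4), the sketch below computes the plug-in probabilities and the plug-in proportion for a single domain. It assumes that the fitted coefficients and the predicted random effect are already available (for example, from a fitted mixed model); the function names and the numeric inputs are hypothetical.

```python
import numpy as np

def plug_in_prob(X, beta_hat, u_hat_d):
    """Equation (3): inverse logit of the linear predictor x'beta + u_d.

    exp(eta) / (1 + exp(eta)) is written in the equivalent form
    1 / (1 + exp(-eta)).
    """
    eta = np.asarray(X) @ np.asarray(beta_hat) + u_hat_d
    return 1.0 / (1.0 + np.exp(-eta))

def plug_in_proportion(y_sampled, X_nonsampled, beta_hat, u_hat_d):
    """Equation (4): observed values for sampled units plus predicted
    probabilities for non-sampled units, divided by N_d."""
    pi_hat = plug_in_prob(X_nonsampled, beta_hat, u_hat_d)
    N_d = len(y_sampled) + len(pi_hat)
    return (np.sum(y_sampled) + np.sum(pi_hat)) / N_d

# Hypothetical inputs for one domain: intercept plus one covariate.
beta_hat = np.array([0.0, 1.0])
u_hat_d = 0.0
X_r = np.array([[1.0, 0.0], [1.0, 0.0]])  # two non-sampled units
y_s = np.array([1, 0])                    # two sampled units
prop = plug_in_proportion(y_s, X_r, beta_hat, u_hat_d)
```

With a linear predictor of zero, each non-sampled unit receives probability 0.5, so the plug-in proportion in this toy domain is (1 + 0.5 + 0.5) / 4 = 0.5.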

2.3. Point Estimation for the H Predictor

Let $K$ be the number of missing indicators for each individual $j$ in the census. The proposed procedure to estimate H is as follows:

Step 1. Use the sample data to fit a unit-level Bernoulli logit mixed model for each indicator and estimate ${\hat{β}}^{k}$, ${\hat{u}}_{d}^{k}$, and finally, ${\hat{π}}_{dj}^{in, k}$, with $k = 1, \dots, K$, as in Equation (3).

Step 2. For $l = 1, \dots, L$ Monte Carlo simulations:

(a) For each individual in the census, predict the probability of obtaining the value 1 for the $k$-th indicator. That is, ${\hat{π}}_{dj}^{in, k, (l)} \forall j \in U_{d}$.

(b) Obtain Monte Carlo draws ${\tilde{y}}_{dj}^{k, (l)} \sim Bernoulli ({\hat{π}}_{dj}^{in, k})$.

(c) Compute the $H_{d}^{(l)}$, with the indicators already available in the census and the new indicators ${\tilde{y}}_{dj}^{k, (l)}$, as indicated in Equation (1).

Step 3. Compute the final point estimate in each small area $d$ by taking the mean over the $L$ simulations:

{\hat{H}}_{d} = \frac{1}{L} \sum_{l = 1}^{L} H_{d}^{(l)} .

Note that under this proposal, the incidence of multidimensional poverty (H) can be estimated even if there are one or more missing indicators ($K \geq 1$).
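The Monte Carlo procedure above can be sketched as follows, assuming the plug-in probabilities for the missing indicators have already been predicted for every person in the census. Since these probabilities do not change across simulations, the sketch computes them once outside the loop. The function name, argument layout, and toy inputs are assumptions for illustration only.

```python
import numpy as np

def monte_carlo_H(y_census, w_census, pi_hat, w_missing, domain, z, L=100, seed=0):
    """Monte Carlo point estimate of H_d.

    y_census  : (N, K_obs) 0/1 indicators available in the census
    w_census  : (K_obs,) weights of the available indicators
    pi_hat    : (N, K_mis) plug-in probabilities of the missing indicators
    w_missing : (K_mis,) weights of the missing indicators
    """
    rng = np.random.default_rng(seed)
    domain = np.asarray(domain)
    labels = np.unique(domain)
    q_obs = np.asarray(y_census) @ np.asarray(w_census, dtype=float)
    w_missing = np.asarray(w_missing, dtype=float)
    H_sum = np.zeros(len(labels))
    for _ in range(L):
        y_tilde = rng.binomial(1, pi_hat)   # Bernoulli draws, one per person and indicator
        q = q_obs + y_tilde @ w_missing     # weighted deprivation score
        poor = q > z
        H_sum += np.array([poor[domain == lab].mean() for lab in labels])
    return dict(zip(labels.tolist(), (H_sum / L).tolist()))

# Toy census (hypothetical). The probabilities are degenerate (0 or 1),
# so the result does not depend on the random draws.
y_census = np.array([[1], [1], [0], [0]])
pi_hat = np.array([[1.0], [0.0], [1.0], [0.0]])
domain = np.array([0, 0, 1, 1])
H_hat = monte_carlo_H(y_census, [0.5], pi_hat, [0.5], domain, z=0.4, L=10)
```

With non-degenerate probabilities, the estimate varies with the seed and stabilizes as L grows.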

For the specific cases in which only one or two indicators are missing, generating $y_{dj}$ via Monte Carlo simulation is not necessary. The expectation of $I (q_{dj} > z)$, that is, the probability that a person is classified as poor, can be calculated directly from the predicted probabilities $π_{dj}$, and $H_{d}$ follows by averaging over the domain. The procedure to obtain this expectation when one indicator is missing can be found in Appendix A, and for the case of two missing indicators, it can be found in Appendix B.
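One way to see why simulation can be avoided is to enumerate every possible outcome of the missing indicators for a person and add up the probabilities of the outcomes that lead to poverty. The sketch below does this under the assumption that the missing indicators are independent given their predicted probabilities, as in the Monte Carlo draws; it is not necessarily the derivation given in the appendices, and the function name and numeric values are hypothetical.

```python
from itertools import product

def prob_poor(q_obs, weights_missing, probs_missing, z):
    """Probability that q exceeds z for one person, enumerating all 0/1
    outcomes of the missing indicators (assumed independent).

    q_obs           : weighted score from the indicators observed in the census
    weights_missing : weights of the missing indicators
    probs_missing   : predicted probabilities of deprivation in each of them
    """
    total = 0.0
    for outcome in product([0, 1], repeat=len(weights_missing)):
        p = 1.0
        q = q_obs
        for y, w, pi in zip(outcome, weights_missing, probs_missing):
            p *= pi if y == 1 else 1.0 - pi  # probability of this outcome
            q += w * y                       # score under this outcome
        if q > z:
            total += p
    return total

# One missing indicator: this person is poor only if deprived in it.
p_one = prob_poor(q_obs=0.3, weights_missing=[0.2], probs_missing=[0.7], z=0.4)

# Two missing indicators: poor if deprived in at least one of them.
p_two = prob_poor(q_obs=0.3, weights_missing=[0.2, 0.2],
                  probs_missing=[0.5, 0.5], z=0.4)
```

Averaging these person-level probabilities over a domain gives the expected incidence without any random draws. The enumeration has 2^K terms, which is why it is practical only for a small number of missing indicators.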

2.4. Estimation of the MSE

The estimation of the mean squared error (MSE) as an accuracy measure is a key step when small area estimators are used to produce socio-demographic information. When the variable of interest is binary, some approximations are available for obtaining the analytic form of the MSE. González-Manteiga et al. (2007) derived a small area robust bootstrap (SAWB) for the uncertainty estimation of an empirical predictor. Based on this bootstrap scheme, we present a modification that accounts for the fact that the target estimate, that is, the multidimensional poverty incidence, has several components or indicators, one or more of which are estimated via SAE methods.

The steps of the proposed parametric bootstrap are as follows. For each missing indicator $Y_{k}$, with $k = 1, \dots, K$, and for $b = 1, \dots, B$, with $B$ denoting the number of bootstrap replicates:

Step 1. Using the already estimated ${\hat{β}}^{k}, {\hat{σ}}_{u}^{2, k}$ as described in Section 2.2, generate $u_{d}^{*, k} \sim N (0, {\hat{σ}}_{u}^{2, k})$ iid.

Step 2. Simulate a bootstrap superpopulation for each indicator $y_{dj}^{k, (b)} \sim Bernoulli (π_{dj}^{in, *, k})$ with $π_{dj}^{in, *, k} = \frac{\exp (x_{dj}^{T} {\hat{β}}^{k} + u_{d}^{*, k})}{1 + \exp (x_{dj}^{T} {\hat{β}}^{k} + u_{d}^{*, k})}$.

Step 3. Calculate the $H_{d}^{(b)}$ from the bootstrap superpopulation, as indicated in Equation (1).

Step 4. Extract the bootstrap sample and obtain the ${\hat{H}}_{d}^{(b)}$ following the Monte Carlo point estimation approach described in Subsection 2.3.

Step 5. Estimate the MSE as:

\widehat{MSE} ({\hat{H}}_{d}) = \frac{1}{B} \sum_{b = 1}^{B} {[H_{d}^{(b)} - {\hat{H}}_{d}^{(b)}]}^{2} .
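The bootstrap above can be sketched for a single missing indicator as follows. The step that refits the model on the bootstrap sample and re-estimates H is represented by a user-supplied callable, since it depends on the fitting software; both callables (`true_H` and `estimate_H`), the function name, and the toy data are hypothetical placeholders rather than the implementation used in this paper.

```python
import numpy as np

def bootstrap_mse(X, domain, beta_hat, sigma2_u_hat, sample_idx,
                  true_H, estimate_H, B=50, seed=0):
    """Parametric bootstrap MSE of the H estimator, one missing indicator.

    true_H(y_pop)                -> array of H_d on a full bootstrap population
    estimate_H(y_sample, idx)    -> array of estimated H_d from the bootstrap sample
    """
    rng = np.random.default_rng(seed)
    labels, dom_idx = np.unique(np.asarray(domain), return_inverse=True)
    sq_err = np.zeros(len(labels))
    for _ in range(B):
        # Draw new random area effects from the fitted variance.
        u_star = rng.normal(0.0, np.sqrt(sigma2_u_hat), size=len(labels))
        # Simulate the bootstrap superpopulation of the indicator.
        eta = X @ beta_hat + u_star[dom_idx]
        y_star = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
        # "True" bootstrap incidence versus its estimate from the sample.
        H_true = true_H(y_star)
        H_est = estimate_H(y_star[sample_idx], sample_idx)
        sq_err += (H_true - H_est) ** 2
    return sq_err / B

def domain_means(y, dom):
    return np.array([y[dom == d].mean() for d in np.unique(dom)])

# Toy setup (hypothetical): H_d is the domain mean of the indicator, and the
# plain sample mean stands in for the refit-and-predict estimator.
N = 200
X = np.column_stack([np.ones(N), np.linspace(-1.0, 1.0, N)])
domain = np.repeat([0, 1], N // 2)
beta_hat = np.array([0.0, 1.0])

def true_H(y_pop):
    return domain_means(y_pop, domain)

def estimate_H(y_sample, idx):
    return domain_means(y_sample, domain[idx])

mse_full = bootstrap_mse(X, domain, beta_hat, 0.25, np.arange(N), true_H, estimate_H)
mse_half = bootstrap_mse(X, domain, beta_hat, 0.25, np.arange(0, N, 2), true_H, estimate_H)
```

When the "sample" is the whole population, the estimator reproduces the bootstrap truth and the estimated MSE is zero in every domain, which serves as a sanity check. With several missing indicators, the draws would be repeated for each indicator before computing H.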

3. Case Study: Multidimensional Poverty Incidence for the Adult Population in Colombia

The case study presented in this paper uses an example of the incidence of a multidimensional poverty index (described in Equation (1)) across thirty-three primary administrative divisions, known as departments, and 1,122 secondary administrative divisions, known as municipalities. We use this index only for illustrative purposes. The composition of the index presented here (dimensions and indicators) is not the one defined by the National Statistical Office of Colombia (in Spanish, Departamento Administrativo Nacional de Estadística, DANE) or by any international organization.

For this example, we use data from Colombia, as it has the latest available population and housing census from 2018 and a household survey from the same year. We chose these datasets for their validated reliability and stability compared to more recent data, which may be influenced by ongoing changes, such as those from the COVID-19 pandemic.

Table 1 describes the composition of the index that we use in this example. This index is based on previous research by UN-ECLAC on possible structures for a multidimensional poverty index that is comparable for Latin American countries, based on the availability of information from national household surveys (CEPAL 2014; Santos and Villatoro 2018). The index includes five dimensions (housing; water and sanitation; energy and connectivity; education; and employment and social protection) and $K = 8$ indicators. Indicators and cut-offs are the same for adults and seniors, except in two cases. For education, the cut-off for insufficient education is twelve years for adults ages 18 to 29, nine years for adults ages 30 to 59, and four years for ages 60 and over. For the dimension of employment and social protection, the indicator unemployment or insufficient employment income for adults is replaced by no pension or insufficient pension income for seniors.

Table 1.

Composition of the Index, Availability of Indicators in the Colombian Census and Target Population.

Dimension	Indicator	Weight	In census	Target
Housing	Poor housing materials	1/10	Yes	Adults, seniors
Housing	Overcrowding	1/10	Yes	Adults, seniors
Water and sanitation	Lack of drinking water	1/10	Yes	Adults, seniors
Water and sanitation	Lack of sanitation	1/10	Yes	Adults, seniors
Energy and connectivity	Lack of internet service	1/10	Yes	Adults, seniors
Energy and connectivity	Lack of electricity	1/10	Yes	Adults, seniors
Education	Unfinished education	2/10	No	Adults, seniors
Employment and social protection	No or insufficient pension	2/10	No	Seniors
Employment and social protection	Unemployment or insufficient employment-related income	2/10	No	Adults

Note: Bold text in the original table marks the two indicators assumed “missing” in the census (education, and employment and social protection), shown here as “No” in the “In census” column.

Data available in censuses usually include the information required for calculating this index, except in the case of the employment and social protection dimension, which requires data on individual income. For the purpose of this paper, it has been assumed that information on education is also not available in the census and thus has to be estimated through SAE methods.

We focus on the incidence of multidimensional poverty described in Equation (1). Here, we require $K = 8$ indicators which are measured as deprivations: $y_{dj}^{k} = 1$ if the person has the deprivation and $y_{dj}^{k} = 0$ if the person does not have the deprivation.

The index requires the information for each individual $j = 1, \dots, N_{d}$ in $d = 1, \dots, D$ domains, where $N_{d}$ denotes the population size of the domain $d$ .

The indicator function $I (\cdot)$ equals 1 when the condition $q_{dj} > z$ is met. For the purpose of this paper, we use the value of 0.4 for $z$, that is, $I (\cdot)$ equals 1 when $q_{dj} > 0.4$. The quantity $q_{dj}$ is a weighted sum of the $K = 8$ indicators that comprise the index (see Table 1):

q_{dj} = 0.1 \sum_{k = 1}^{6} y_{dj}^{k} + 0.2 \sum_{k = 7}^{8} y_{dj}^{k} .

The first part of the sum includes the indicators for the housing, water and sanitation, and energy and connectivity dimensions. The second part includes the indicators of the education and the employment and social protection dimensions. The latter are in fact the two missing indicators in the census and will be estimated with the methodology presented in Subsection 2.3.
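As a worked illustration with a hypothetical person (not drawn from the data), consider someone deprived in two of the six census indicators, say overcrowding and internet service, and in education. With the weights of Table 1 and $z = 0.4$, this person is not classified as poor because the score equals the threshold but does not exceed it. If the same person is also deprived in the employment and social protection indicator, the score rises to 0.6 and the person is classified as poor:

```latex
q_{dj} = 0.1 \cdot 2 + 0.2 \cdot 1 = 0.4 \not> 0.4 \quad \Rightarrow \quad I (q_{dj} > z) = 0,
\qquad
q_{dj} = 0.1 \cdot 2 + 0.2 \cdot 2 = 0.6 > 0.4 \quad \Rightarrow \quad I (q_{dj} > z) = 1 .
```

This also shows why the two missing indicators matter: with a combined weight of 0.4, they can move a person across the threshold whenever at least one census deprivation is present.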

Figure 1 shows the proportion of people with deprivations in each of the six indicators, out of the eight that make up H (the incidence of multidimensional poverty), that are available in the census. These maps were generated at the municipal level with information from the census. The maps of the two missing indicators were not generated since the census does not include the required information. Consequently, H cannot be calculated from the census alone.

Figure 1.

Indicators of the H available in the census at the municipal level.

3.1. Data Sources

3.1.1. Censo Nacional de Población y Vivienda (CNPV) 2018

The national population and housing census (in Spanish, Censo Nacional de Población y Vivienda, CNPV) is conducted by the DANE. Although it is planned to take place every ten years, the last census was carried out three years later than planned due to administrative and economic reasons. For the first time, the information was collected through electronic self-interviewing, in addition to the traditional face-to-face interview. The data collection phase took place in 2018 and lasted ten months (DANE 2019).

The census aims at collecting demographic information on the Colombian population and its living conditions, including housing and household characteristics. This information is essential for territorial planning and decision-making in the country (DANE 2019). As mentioned above, this census has most of the single indicators that are required to compute the target index. The estimation based on the model described in Section 2 is applied for the indicators that are not available in the census. Among others, we include as predictors the available indicators belonging to the following dimensions: “housing,” “water and sanitation,” and “energy and connectivity.”

3.1.2. Gran Encuesta Integrada de Hogares (GEIH) 2018

The Great Integrated Household Survey (in Spanish, Gran Encuesta Integrada de Hogares, GEIH) is conducted annually by the DANE. It provides information on the size and structure of the labor force, as well as the sociodemographic characteristics of the population and households, in addition to housing, educational level, affiliation to social security, income, among others (DANE 2019). It is the official source of information for employment and income poverty indicators and includes the necessary variables to calculate the ECLAC multidimensional deprivation index. Administratively, Colombia is organized in 5 regions, 33 departments, and 1,122 municipalities; the GEIH provides representative information at the national level, urban and rural areas, regions, and 24 of the 33 departments. In 2018, the valid survey sample included 231,128 households and 762,753 individuals.

The version of the GEIH used in this paper comes from the Household Survey Data Bank (BADEHOG), a repository of household surveys from eighteen Latin American countries maintained by the ECLAC Statistics Division. In this repository, variables are harmonized to allow the construction of various indicators and their comparison across countries. This feature could simplify the estimation of the incidence of multidimensional poverty for the countries of the region.

3.1.3. Satellite Imagery

Geospatial data has become a powerful resource to improve quality in local-area estimations. Although big data, in general, has gained popularity for this purpose, geospatial data has two special advantages: it is easy to access and it is free of selection bias (Masaki et al. 2020). Newhouse (2023) highlights the critical role and impact of integrating geospatial data with traditional surveys to enhance the accuracy and efficiency of poverty and wealth estimates at detailed spatial levels or small domains. Merfeld et al. (2024) point out that the inclusion of geospatial data could mitigate the limitations of traditional surveys in terms of robustness, such as limited sample sizes and high costs. By integrating satellite imagery, mobile phone data, and other geospatial information with traditional survey data, this approach allows for highly accurate and timely estimates of economic well-being. This integration enhances the precision and efficiency of small area estimates, especially in regions where traditional data collection methods are limited. For instance, the integration of diverse covariates could reduce mean squared errors, thereby producing more accurate and reliable estimates.

Satellite imagery as an auxiliary source of information has been implemented in several small area estimation problems in topics such as well-being (Engstrom et al. 2022), population density (Deng and Wu 2013; Harvey 2002; Steinnocher et al. 2019), poverty mapping (Babenko et al. 2017; Chandra et al. 2018), and multidimensional poverty (Betti et al. 2024; Koebe et al. 2022; Pokhriyal and Jacques 2017), among others.

For this case study, we make use of the resources available in the Earth Engine Data Catalog (Gorelick et al. 2017). Among many available products, we incorporate in our model area-level information on night light intensity, urban cover fraction, and crop cover fraction.

3.1.4. Predictors

Based on the two auxiliary sources, the National Population and Housing Census of Colombia 2018 and satellite imagery, described in Subsections 3.1.1 and 3.1.3 respectively, we list the specific predictors used in our models. These predictors were selected based on expert opinions and statistical diagnostics.

From census:

- Poor housing materials

- Overcrowding

- Lack of drinking water

- Lack of sanitation

- Lack of internet service

- Lack of electricity

- Age group of the head of the household

- Area (urban/rural)

- Department

- Sex of the head of the household

From Satellite Imagery:

- Intensity of nighttime lights

- Distance to cultivated areas (crops)

- Urbanization (human settlements)

3.2. Results and Evaluation

In this section, we present the main results of applying the proposed methodology to obtain estimates of the multidimensional poverty incidence described in Subsection 2.1 in departments and municipalities of Colombia.

The distribution of sample and population sizes for the domains of interest is presented in Table 2. The census size is $N = 34{,}180{,}812$ and the sample has a size of $n = 549{,}077$, which covers 24 out of 33 departments and 438 out of 1,122 municipalities.

Table 2.

Distribution of the Sample and Census Sizes Across Departments and Municipalities.

Domain	Source	Min	First Q	Median	Mean	Third Q	Max
Department (in-sample: 73%)	Survey	6,613	21,601	22,344	22,878	24,692	35,264
Department (in-sample: 73%)	Census	21,037	256,244	752,416	1,035,781	1,036,702	5,847,519
Municipality (in-sample: 40%)	Survey	14	127	220	1,254	369	24,594
Municipality (in-sample: 40%)	Census	103	4,433	8,440	30,464	17,601.25	5,847,519

As can be seen in Table 2, although the sample size is not necessarily small for all domains, the aim is to improve the precision of the estimates, as well as to provide quality information in the domains that were not included in the sample. We analyze the accuracy of the estimates based on the coefficients of variation, defined as

CV ({\hat{H}}_{d}) = \frac{\sqrt{\widehat{MSE} ({\hat{H}}_{d})}}{{\hat{H}}_{d}} \times 100 .
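As a small numerical illustration of this definition, the sketch below computes the coefficient of variation from an estimated MSE and a point estimate. The input values are hypothetical and are not taken from the results of this paper.

```python
import math

def cv_percent(mse, h_hat):
    """Coefficient of variation in percent: 100 * sqrt(MSE) / H-hat."""
    return 100.0 * math.sqrt(mse) / h_hat

# Hypothetical values: an estimated MSE of 0.0004 and an incidence of 0.25.
cv = cv_percent(mse=0.0004, h_hat=0.25)
```

Here the root MSE is 0.02, so the CV is 8 percent of the estimated incidence.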

Summary statistics of the CVs from the direct and model-based estimates are presented in Table 3. Here, the CVs are also disaggregated by domain level (department and municipality) and by missing indicator (education and employment). At the department level, the CVs are relatively small for both indicators and for both direct and model-based estimates. The higher third quartile and maximum values of the model-based estimates can be explained by the uncertainty of the out-of-sample departments. The benefit of using SAE methods is most apparent in municipalities, where the average CVs are similar to those of the direct estimates but the third quartile and the maximum value are lower. This behavior is observed for both indicators, education and employment.

Table 3.

Descriptive Statistics of the Coefficients of Variation Across Departments and Municipalities (in Percentage).

Domain/indicator	Estimation	Min	First Q	Median	Mean	Third Q	Max
Department
Education	Direct	0.64	1.24	1.43	1.47	1.67	2.09
Education	Model-based	0.36	0.44	0.49	1.56	2.85	9.22
Employment	Direct	0.87	1.39	1.73	1.80	2.04	3.40
Employment	Model-based	0.26	0.40	0.53	1.66	2.90	9.24
Municipality
Education	Direct	0.00	2.60	5.37	6.27	8.75	39.81
Education	Model-based	0.45	3.81	5.37	5.23	6.51	10.87
Employment	Direct	0.00	2.33	5.19	5.91	8.46	27.60
Employment	Model-based	0.46	3.69	5.58	5.53	7.03	18.78

The stable and low CVs of the model-based estimates can be clearly observed in Figure 2 for the in-sample municipalities.

Figure 2.

Coefficients of variation (in percentage) of the direct and model-based estimates at the department and municipality level for the indicators employment and education ordered by sample size.

Since the CVs of the direct estimates are acceptable for both indicators, the main benefit of the proposed small area estimation method is to obtain reliable estimates for the out-of-sample domains (9 departments and 684 municipalities), so that the proportion of the population of interest under multidimensional poverty is available for all domains of interest. The results are shown in Figures 3 to 5.

Figure 3.

Model-based estimates for the indicators of employment and social protection and unfinished education at the municipality level.

Figure 4.

Final H at the municipality level.

Figure 5.

Coefficients of variation of the H at the municipality level.

Figure 3 presents the two indicators at the municipality level that were not available in the census data and therefore had to be estimated with SAE methods.

In general, Colombia is a country with diverse regions, each with its own set of economic and social challenges. As shown in Figure 3, the Amazon region, which encompasses departments such as Guainía, Vaupés, and Guaviare, requires special attention, as do other departments such as Chocó, Guajira, and Sucre. These departments are characterized by high levels of poverty, low levels of education and employment, and limited access to basic services such as housing, water, and sanitation. Education in particular is a critical determinant of an individual’s well-being and standard of living. Despite recent progress in increasing access to education, these departments tend to have lower enrollment rates and lower levels of educational attainment, so that many children lack access to quality education and face a greater risk of poverty and exclusion from the formal economy. Employment and social protection is the second key indicator of poverty and well-being that was estimated. Even though the unemployment rate in Colombia has decreased in recent years, disparities in employment persist across the country. The departments mentioned above tend to have higher unemployment rates and a larger informal sector, so that many people are unable to find formal employment and face a greater risk of poverty and exclusion from the formal economy. In contrast, in some departments, such as Bogotá, levels of formal employment and educational attainment are higher, with a significant portion of the population working in the formal sector and having completed tertiary education.

With these two indicators, it is now possible to compute the final H; the result is shown in Figure 4 for the municipalities of Colombia. The CVs of the final H are below 15% (Figure 5), indicating that the estimates are “acceptable” in terms of precision.
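As a rough illustration of the final counting step, the sketch below computes a headcount ratio in the style of the Alkire-Foster counting approach: a person is classified as poor when the weighted sum of their deprivations reaches a cutoff, and H is the share of poor people in the domain. The equal weights, the cutoff of 0.4, and the toy deprivation profiles are assumptions made only for this example; the definition used in the paper is the one given in Section 2.1, and the paper's estimator additionally accounts for the two indicators being predicted rather than observed.

```python
def headcount_ratio(deprivations, weights, cutoff):
    """Share of people whose weighted deprivation score reaches the cutoff.

    deprivations: list of per-person lists of 0/1 deprivation flags.
    weights: one weight per indicator (assumed to sum to 1).
    cutoff: poverty cutoff on the weighted score (an assumption here).
    """
    if not deprivations:
        raise ValueError("at least one person is required")
    poor = 0
    for person in deprivations:
        if len(person) != len(weights):
            raise ValueError("each person needs one flag per indicator")
        score = sum(w * d for w, d in zip(weights, person))
        if score >= cutoff:
            poor += 1
    return poor / len(deprivations)


# Hypothetical domain with four people and eight equally weighted indicators.
weights = [1 / 8] * 8
people = [
    [1, 1, 1, 0, 0, 0, 0, 0],  # score 0.375, below the cutoff
    [0, 0, 0, 0, 0, 0, 0, 0],  # score 0.0
    [1, 1, 0, 0, 0, 0, 1, 1],  # score 0.5, classified as poor
    [0, 1, 0, 0, 0, 0, 0, 0],  # score 0.125
]
h_example = headcount_ratio(people, weights, cutoff=0.4)
print(h_example)
```

In this toy domain only one of the four people reaches the cutoff, so the headcount ratio is 0.25.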

The estimates obtained with the proposed methodology allow a deeper analysis from two points of view: which deprivations are the most severe, and which areas are the most affected overall? We take the department of Valle del Cauca as an example, which is one of the departments with the lowest levels of the H (Figure 4: middle-west of the country with an MDI of 38.5%).

Valle del Cauca has a low proportion of the adult population with deprivations in housing, overcrowding, drinking water, sanitation, and electricity (between 0% and 20%), and intermediate values (40%–60%) in internet service, education and employment, and health insurance (see Figure 6). However, the general H differs strongly between municipalities within the department. Figure 7 shows that the municipalities El Águila and Dagua have high levels of H (92% and 82%, respectively), while other areas such as Cali and Tuluá have relatively low values (30% and 39%). These disparities in poverty incidence underscore the need for targeted interventions to reduce multidimensional poverty in the department.

Figure 6.

Valle del Cauca: indicator of the H at the municipality level.

Figure 7.

Valle del Cauca: final H at the municipality level.

3.2.1. Evaluation

The evaluation of the proposed method is twofold. First, an internal comparison between the direct and the model-based estimates is presented. Second, a design-based simulation study is described to evaluate the performance of the estimator proposed in Subsection 2.2.

Direct and model-based estimates for the in-sample domains are compared in Figure 8. As expected, at both the department and the municipality level, the point estimates produced with the proposed method are very close to the direct estimates, with Pearson correlations of .984 and .880, respectively.
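The comparison above rests on the Pearson correlation between the two sets of point estimates. A minimal sketch of that computation follows; the function name and the five pairs of estimates are hypothetical and are not taken from the case study.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical direct and model-based estimates of H for five in-sample domains.
direct = [0.30, 0.45, 0.52, 0.61, 0.78]
model_based = [0.32, 0.44, 0.55, 0.60, 0.74]
r = pearson_correlation(direct, model_based)
print(round(r, 3))
```

A value close to 1 indicates that the model-based estimates preserve the ranking and relative magnitude of the direct estimates, which is the pattern reported for the in-sample domains.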

Figure 8.

Comparison between the direct and model-based estimates of the H at department and municipality level.

The performance of the estimator proposed in Subsection 2.2 is evaluated with a design-based simulation experiment.

To base the evaluation on a realistic case, we use the National Population and Housing Census of Colombia 2018 as a fixed population and draw repeated samples from it under a simple random sampling design with three sample sizes: (a) 500, (b) 5,000, and (c) 50,000 observations. A final scenario (d) uses a complex sampling design with 550,000 observations, as in the original case study. The complex design mirrors the characteristics of the Great Integrated Household Survey (GEIH) of Colombia 2018 described in Subsection 3.1.

As in the application, the target indicator is the multidimensional poverty index for small domains of Colombia (i.e., departments and municipalities). Six of the eight variables are already available in the population data and are held fixed. The remaining two variables are obtained following the methodology explained in Subsection 2.1. The performance of the small area predictors is evaluated with the coefficient of variation (previously described), the bias, and the root mean squared error for each domain. The last two measures are defined as:
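The repeated-sampling step of a design-based simulation can be sketched as follows. The toy population, the function name, and the choice of sampling without replacement are assumptions made for illustration; the actual study samples from the 2018 census, and its complex-design scenario is not reproduced here.

```python
import random


def draw_srs_samples(population, sample_size, n_runs, seed=0):
    """Draw n_runs simple random samples (without replacement) from a fixed population."""
    rng = random.Random(seed)
    return [rng.sample(population, sample_size) for _ in range(n_runs)]


# Hypothetical fixed population: 10 domains with 1,000 units each.
# Each unit is a (domain_id, poor_flag) pair; poverty rates differ by domain.
toy_rng = random.Random(1)
population = [
    (d, 1 if toy_rng.random() < 0.2 + 0.05 * d else 0)
    for d in range(10)
    for _ in range(1000)
]

# Scenario (a) from the text: samples of 500 observations, repeated 100 times.
samples = draw_srs_samples(population, sample_size=500, n_runs=100)

# Domains represented in the first sample; small samples may miss some domains.
domains_in_first = {domain for domain, _ in samples[0]}
print(len(samples), len(samples[0]), len(domains_in_first))
```

Because the population is held fixed and only the sample varies across runs, any spread in the resulting estimates reflects the sampling design rather than model assumptions.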

Bias ({\hat{H}}_{d}) = \frac{1}{T} \sum_{t = 1}^{T} ({\hat{H}}_{d}^{(t)} - H_{d});

RMSE ({\hat{H}}_{d}) = \sqrt{\frac{1}{T} \sum_{t = 1}^{T} {({\hat{H}}_{d}^{(t)} - H_{d})}^{2}},

where $H_{d}$ was defined in Section 2.1 and the superscript $(t)$ indexes the $T$ simulation runs. In this exercise, we set $T = 100$ .
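The two performance measures translate directly into code. The sketch below assumes that the estimates of one domain across the simulation runs are stored in a list and that the true value is known from the fixed population; the function names and the four illustrative estimates are hypothetical.

```python
import math


def simulation_bias(estimates, true_value):
    """Average deviation of the per-run estimates from the true domain value."""
    return sum(e - true_value for e in estimates) / len(estimates)


def simulation_rmse(estimates, true_value):
    """Root of the average squared deviation from the true domain value."""
    squared = sum((e - true_value) ** 2 for e in estimates)
    return math.sqrt(squared / len(estimates))


# Hypothetical estimates of H for one domain over T = 4 runs, true H = 0.40.
runs = [0.38, 0.42, 0.41, 0.39]
bias_d = simulation_bias(runs, 0.40)
rmse_d = simulation_rmse(runs, 0.40)
print(round(bias_d, 4), round(rmse_d, 4))
```

In this toy case the deviations cancel, so the bias is essentially zero while the RMSE stays positive, which is why both measures are reported per domain.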

Figure 9 reports the distributions of the domain-specific bias, RMSE and CVs over domains for the evaluated estimator. As expected, increasing the sample size reduces the bias and increases the accuracy. This is clear for the “complex” case, which has a larger sample size than the other three scenarios.

Figure 9.

Performance measures of the area-specific point estimates for the multidimensional poverty incidence under different sample sizes and sampling designs.

4. Concluding Remarks and Further Research

Composite indicators are of great value in the study of complex phenomena and are widely used for public policies. Given the growing need for disaggregated information, there is a challenge to increase the level of disaggregation of these types of indicators. Small area estimation methods address this problem, although there is no literature for the specific case of composite indicators. This paper aims to reduce this lack of information by proposing a methodology to obtain small area estimates when the indicator of interest is composed of dichotomous variables. We exemplify our approach with the incidence of multidimensional poverty (H).

The challenge of producing small area estimates for composite indicators such as the H raises many questions that require further investigation. First, a more general approach that includes the analysis of covariance structures and dependencies among indicators and dimensions might be considered, for example, by including correlated random effects. In this case study, exactly two indicators were estimated under the assumption that no dependencies exist between them or with the other indicators. Second, the time gap between the sources of information (e.g., census and survey data) is another strand of research, especially if some of the indicators are available in census data and others are estimated using up-to-date survey data. Third, regarding point and MSE estimation, a model-based simulation could help to validate the proposed methodology. Fourth, benchmarking procedures might be further investigated for correcting possible inconsistencies between the different estimates (e.g., arising from sampling and non-sampling errors). Last but not least, the final H, which is obtained by applying SAE methods, aims to capture the complexity of poverty along multiple dimensions of well-being: housing, water and sanitation, energy and connectivity, education, and employment and social protection. However, intra-household inequality and inequality within the poor population remain to be captured in further research. Measures that capture the overall picture of poverty in a country at the required level of disaggregation have become a critical underpinning for policy-relevant applications.

Further methods are needed to obtain small area estimates when the variables that make up the composite indicator are not dichotomous. A clear example is the Human Development Index (HDI), which is composed of life expectancy, years of education, and gross national income per capita. This paper focuses on the estimation of one component of the global MPI. Further research is required to compute the complete index, which also considers the intensity of poverty.

Footnotes

Appendix A

Appendix B

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Natalia Rojas-Perilla has received the research Start-up Grant from the United Arab Emirates University (UAEU).

ORCID iD

Natalia Rojas-Perilla

Received: December 2023

Accepted: October 2024


Small Area Estimation for Composite Indicators: The Case of Multidimensional Poverty Incidence
