Statistical learning in official statistics: The case of statistical matching

Abstract

Data integration is becoming a crucial task for National Statistical Institutes seeking to exploit the information in existing data sources. This work focuses on statistical matching methods, which are designed to integrate data from traditional sample surveys that refer to the same target population. In particular, it shows how popular statistical learning techniques can benefit matching. Two proposals are presented, each with a different final aim: creating a “fused” data set, or assessing the uncertainty inherent in the typical statistical matching scenario. The characteristics of these procedures are investigated through a series of simulations and an application to real survey data. The results are encouraging: some statistical learning techniques can be very effective in exploiting the information in existing survey data, reducing the uncertainty that arises in the typical statistical matching setting.

Keywords

Data integration, machine learning
