
1. Introduction
I have been asked to present some thoughts on the future challenges and research needs of statistical data editing. While the scope of this short note is necessarily limited and my choice of topics is without a doubt subjective, I still hope it will be of interest to some readers.
Statistical data editing is connected to imputation and the two are often grouped together as “E&I” (GSDEM 2019). Here, I will focus on the editing part. Granquist (1997) identified three purposes of data editing for the production of official statistics based on surveys:
1. “Identify and collect data on problem areas, and error causes in data collection and processing, producing the basics for the (future) improvement of the survey vehicle.”
2. “Provide information about the quality of the data.”
3. “Identify and handle concrete important errors and outliers in individual data.”
To this, EDIMBUS (2007) added a fourth goal:
4. “When needed, provide complete and consistent (coherent) individual data.”
The rest of this note is structured as follows. Sections 2 and 3 discuss the topics of selective editing and automatic editing, respectively. Section 4 briefly discusses approaches that could be used as alternatives or supplements to editing. Some final remarks follow in Section 5.
2. Selective Editing
In general, statistical output is affected by many sources of potential inaccuracy (see, e.g., Bethlehem 2009), of which only a subset can be addressed by data editing. It is therefore not necessary—and in fact could be counterproductive—to try to edit the data until all individual errors have been removed. This idea can be traced back to Nordbotten (1955; and possibly further), but it only started to become widely accepted in official statistics during the 1980s and 1990s, thanks to studies such as Granquist (1984, 1995) and Granquist and Kovar (1997).
For business surveys in particular, which tend to contain mostly variables with right-skewed distributions, it has been found that manually editing a limited number of highly influential errors is usually sufficient. The residual effects of measurement errors on statistical output are then negligible compared to other sources of inaccuracy, such as sampling error or coverage error. Continuing to edit the data after this point would be inefficient and amounts to “over-editing” (Granquist 1997). The challenge then becomes to identify beforehand which observations contain influential errors, so that manual editing can be directed at them. Methods developed for this purpose are known as selective editing or significance editing (De Waal et al. 2011, Chapter 6).
The main technique developed for selective editing is the score function. Key references include Hidiroglou and Berthelot (1986), Latouche and Berthelot (1992), Lawrence and McKenzie (2000), Hedlin (2003), and Norberg (2016). Most score functions combine a measure of the suspicion that an observation contains an error with a measure of the potential impact on statistical output of editing that observation if it is found to contain an error. Each so-called local score function is related to a specific (combination of) variable(s) and a specific statistic at some level of aggregation. For the purpose of prioritizing units for selective editing, different local scores can be aggregated to a single global score (Hedlin 2008). Finally, a threshold can be chosen such that only units with a global score above the threshold are selected for manual editing.
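To fix ideas, one common form of local score, roughly in the spirit of Lawrence and McKenzie (2000), combines the design weight of a unit, the discrepancy between its reported value and an anticipated value, and the size of the target estimate:

$$ s_i = \frac{w_i \, \lvert y_i - \tilde{y}_i \rvert}{\hat{Y}} , $$

where $w_i$ is the design weight of unit $i$, $y_i$ its reported value, $\tilde{y}_i$ an anticipated value (for instance based on historical data), and $\hat{Y}$ an estimate of the target total. Units with a (global) score above a chosen threshold are then routed to manual editing.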
Selective editing is now widespread in national statistical institutes (NSIs), at least for business surveys, and has contributed to establishing production processes that are more efficient and rational. For the most part, the techniques have been developed in a heuristic way. This has often led to systems that are either easy to use but specific to one particular application, or generic and flexible enough to handle many applications but at the cost of having many parameters to be set by a user. The SELEKT framework, developed by Statistics Sweden (Norberg 2016), is perhaps the most advanced generic system of this type so far. More recently, two approaches to selective editing have been proposed that are grounded in statistical theory: the SeleMix framework (Di Zio and Guarnera 2013), based on a contamination model for the observed data, and the SelEdit framework (Arbués et al. 2013; Salgado et al. 2018), based on a mathematical optimization problem in combination with an observation-prediction model.
De Waal (2013) identified the following as “the most important research question for the near future” regarding selective editing: how to support the practical application of selective editing frameworks? Over ten years on, this remains an important topic. Most score functions rely on a predicted or anticipated value, and the efficiency of a selective editing procedure depends strongly on the quality of these predictions. Traditionally, historical data have often been used to obtain predicted values. Developing a good selective editing procedure in practice remains challenging for variables that are inherently difficult to predict (e.g., investments). Semi-continuous variables that are non-zero intermittently over time are a case in point. For instance, in foreign trade statistics certain expensive goods are traded only occasionally, so the next value in a time series could be either zero or (much) larger than zero, and the score function should account for both options (Van de Pol 1998).
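As a purely illustrative sketch (an assumption made here for exposition, not Van de Pol's specific proposal), suppose that a two-part model provides, for each unit, a conditional anticipated value $\hat{\mu}_i$ for the case that the true value is non-zero. A score that treats both a zero and a value close to $\hat{\mu}_i$ as plausible could then be based on the discrepancy

$$ s_i = \frac{w_i}{\hat{Y}} \, \min\bigl( \lvert y_i - 0 \rvert , \ \lvert y_i - \hat{\mu}_i \rvert \bigr) , $$

so that only reported values far from both plausible options receive a high score.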
The rise of artificial intelligence, including machine learning, may provide opportunities for improved selective editing (Dumpert 2020); see Forteza and García-Uribe (2025) for a recent application. Barragán and Salgado (2022) used machine learning for selective editing in two different ways: (i) obtaining predicted values for a traditional score function and (ii) letting the algorithm classify directly whether an observation requires manual editing. During a discussion at the 2022 UNECE Expert Meeting on Statistical Data Editing, David Salgado noted that these could be seen as two extreme cases of a general approach, in which an algorithm is trained to prioritize observations for manual editing and the user imposes more or less structure on the way the algorithm operates. Exploring intermediate cases of this approach and optimizing the degree of imposed structure may be interesting topics for future research.
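A minimal sketch of approach (ii) in Python, with hypothetical file names, column names, and feature choices, and a generic classifier (this is an assumption about how such a classifier could be set up, not a description of Barragán and Salgado's implementation):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Historical data: raw reported values plus a label indicating whether
# manual editing changed the record (hypothetical column names).
hist = pd.read_csv("edited_history.csv")
features = ["turnover_reported", "employees_reported", "turnover_previous_year"]
X, y = hist[features], hist["was_changed_by_editor"]

model = RandomForestClassifier(n_estimators=500, random_state=1)
model.fit(X, y)

# New survey wave: prioritize records by predicted probability of needing edits.
new = pd.read_csv("current_wave.csv")
new["editing_priority"] = model.predict_proba(new[features])[:, 1]
to_review = new.sort_values("editing_priority", ascending=False).head(100)

Intermediate cases of the general approach mentioned above would correspond to constraining such an algorithm, for instance by letting it predict only the anticipated values that enter a traditional score function.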
Finally, selective editing of categorical data appears to be a neglected topic. Of course, in a purely categorical context all observations are equally influential and selective editing is not particularly useful. Suppose, however, that the parameters of interest are domain totals of the form

$$ t_d = \sum_{i \in U} I(x_i = d) \, y_i , $$

where the domains $d$ are defined by a categorical variable $x$ and $y$ is a numerical variable. An error in the categorical value $x_i$ then shifts the entire amount $y_i$ from one domain to another, so errors in units with large values of $y_i$ are more influential than others and selective editing of the categorical variable does become useful. This situation occurred, for instance, in an application to statistics on energy use. For each unit, a local score can then be constructed that weighs the suspicion of an error in the categorical variable against the potential impact of such an error on the domain totals of interest.
3. Automatic Editing
Automatic editing is an umbrella term for methods that try to correct errors in microdata without human intervention. This can involve both deductive (often rule-based) correction of systematic errors (i.e., errors with a known cause) and error localization of random errors (i.e., all other errors); see, for example, Pannekoek et al. (2013) for more details. Deductive correction occurs in some form or other for most surveys, for instance to correct unit of measurement errors such as the “thousand error” (De Waal et al. 2011, Chapter 2). Error localization has so far been applied less widely, but it is an important process step for some surveys, in particular to ensure that the microdata are fully consistent with all edit rules.
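As a minimal sketch of a deductive rule for the “thousand error” (an illustration with a hypothetical tolerance and reference value, not the rule used in any specific production system):

def correct_thousand_error(reported, reference, tol=0.2):
    """Deductively correct a suspected unit-of-measurement error.

    If the reported value is roughly 1000 times the reference value
    (e.g., an anticipated value or last year's edited value), it was
    probably reported in units instead of thousands; divide by 1000.
    """
    if reference > 0 and reported > 0:
        ratio = reported / reference
        if abs(ratio / 1000 - 1) <= tol:
            return reported / 1000
    return reported

# Example: turnover reported as 2_100_000 while last year's edited value was 2_050.
corrected = correct_thousand_error(2_100_000, 2_050)  # -> 2100.0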
Current applications of error localization are usually based on the seminal work of Fellegi and Holt (1976). Given a set of edit rules that should be satisfied by the data and a positive reliability weight for each variable, the Fellegi-Holt paradigm prescribes that, in each record that fails one or more edit rules, a set of variables should be identified for which the sum of reliability weights is minimal, under the condition that the record can be made to satisfy all edit rules by changing only the values of those variables. The identified values are then treated as erroneous and replaced, typically by imputation.
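In symbols, for a record $(x_1, \ldots, x_p)$ with reliability weights $w_1, \ldots, w_p$, the error localization problem can be written as

$$ \min_{\delta_1, \ldots, \delta_p \in \{0, 1\}} \ \sum_{j=1}^{p} w_j \delta_j \quad \text{subject to: there exists a record } (x_1^*, \ldots, x_p^*) \text{ that satisfies all edit rules and has } x_j^* = x_j \text{ whenever } \delta_j = 0 , $$

where $\delta_j = 1$ indicates that variable $j$ is flagged as erroneous and may be changed.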
The Fellegi-Holt paradigm can be generalized. Scholtus (2013, 2015) proposed an extension that can accommodate soft edit rules (i.e., edit rules that may be failed by error-free data). Scholtus (2016) and Daalmans and Scholtus (2018) proposed a generalized error localization problem which minimizes the sum of reliability weights of so-called edit operations specified by the user. Each edit operation is supposed to correct one error by changing one or more variables in a prescribed way, which may involve zero, one or more free parameters. In particular, this allows for errors that affect multiple variables at the same time, such as interchanged values. The original Fellegi-Holt paradigm occurs as a special case, for a particular choice of edit operations.
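For example (the particular operations below are illustrative choices, not an exhaustive list from the cited papers): an operation for interchanged values replaces $(x_j, x_k)$ by $(x_k, x_j)$ and has no free parameters; an operation for a thousand error replaces $x_j$ by $x_j / 1000$, again without free parameters; and an operation that overwrites a single variable $x_j$ by an arbitrary new value $\alpha$ has one free parameter. If the only available operations are of this last type, one for each variable, the generalized problem reduces to the original Fellegi-Holt error localization problem.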
Early implementations of automatic editing required a large investment to develop dedicated tools. Nowadays, it is possible to implement an entire automatic editing production system using open source R packages (Van der Loo and De Jonge 2018).
It is unlikely that automatic editing will ever completely remove the need for selective manual editing. Nonetheless, it could probably be used more than is done in current practice (Pannekoek et al. 2013). Rather than theoretical limitations of the methods themselves, the main obstacle in practice seems to be a lack of subject-matter information that can readily be incorporated in these systems, in the form of edit rules and reliability weights. What happens during manual editing is often not well-documented and relies on subject-matter knowledge that may be implicit and unstructured (and in some cases subjective). Implementing an automatic editing system that adequately mimics the decisions made by human editors can therefore be a long process of trial and error, requiring iterative feedback from subject-matter experts (Di Zio et al. 2005; Rhodes et al. 2024). From a purely theoretical point of view, the error localization problem—especially in its generalized form—is quite flexible. In principle, one could tailor the reliability weights, edit operations, and even the edit rules to each unit separately. However, in practice this is seldom done, due to a lack of suitable unit-specific information.
Again, artificial intelligence may be of use here (Dumpert 2020). One option is to use machine learning to search for suggested new edit rules or deductive correction rules based on historical edited data, although this may lead to relatively complicated rules without a clear substantive meaning (see, e.g., Petrakos et al. 2004). For error localization, a way forward might be as follows:
1. Use a model or algorithm to predict, for each unit, the probability that each available edit operation is needed.
2. For a given unit, immediately apply all edit operations for which this predicted probability is (very close to) one.
3. If at least one edit rule remains failed after the previous step, then solve the error localization problem with the remaining edit operations, using reliability weights derived from the predicted probabilities.
Here, Fellegi-Holt-based error localization is used as a final step to ensure that all edit rules can be satisfied, which is difficult to achieve for complex systems of rules using machine learning alone. In practice, a limiting factor may be the availability of sufficient training data (Dumpert 2020).
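One possible concrete reading of these three steps, in Python, with toy edit rules and operations, a hypothetical certainty threshold of 0.95, and an assumed weight choice $w = -\log(p)$ (probabilities are assumed to lie strictly between 0 and 1):

from itertools import combinations
from math import log

# Toy edit rules: each returns True if the record satisfies the rule.
rules = [
    lambda r: r["profit"] == r["turnover"] - r["costs"],
    lambda r: r["turnover"] >= 0,
]

def satisfies_all(record):
    return all(rule(record) for rule in rules)

# Toy edit operations: each corrects one suspected error type.
operations = {
    "divide_turnover_by_1000": lambda r: {**r, "turnover": r["turnover"] / 1000},
    "swap_costs_and_profit":   lambda r: {**r, "costs": r["profit"], "profit": r["costs"]},
    "recompute_profit":        lambda r: {**r, "profit": r["turnover"] - r["costs"]},
}

def localize(record, predicted_prob, certain=0.95):
    # Step 2: immediately apply operations that are almost certainly needed.
    for name, op in operations.items():
        if predicted_prob[name] >= certain:
            record = op(record)
    if satisfies_all(record):
        return record
    # Step 3: Fellegi-Holt-style search over the remaining operations,
    # minimizing the sum of reliability weights w = -log(p).
    # (For simplicity, the order in which operations are applied is ignored.)
    remaining = [n for n in operations if predicted_prob[n] < certain]
    best, best_weight = None, float("inf")
    for k in range(1, len(remaining) + 1):
        for subset in combinations(remaining, k):
            candidate = record
            for name in subset:
                candidate = operations[name](candidate)
            weight = sum(-log(predicted_prob[name]) for name in subset)
            if satisfies_all(candidate) and weight < best_weight:
                best, best_weight = candidate, weight
    return best

# Example: a record with a likely thousand error in turnover.
record = {"turnover": 500_000, "costs": 300, "profit": 200}
probs = {"divide_turnover_by_1000": 0.98, "swap_costs_and_profit": 0.1, "recompute_profit": 0.4}
print(localize(record, probs))  # -> {"turnover": 500.0, "costs": 300, "profit": 200}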
On a practical level, techniques such as web scraping and text mining could be of interest for making unstructured information that analysts already use during manual editing, such as financial statements published by businesses, available as auxiliary data for automatic editing as well. Potentially, this could be very useful for increasing the quality of automatic editing.
A possible criticism of the above discussion is that I am showing a lack of imagination, by staying quite close to existing methodology. For a more radical vision of the future, in which generative artificial intelligence becomes an important tool to enhance data quality, see Azeroual (2024).
Finally on this topic, it should be noted that as an alternative to the Fellegi-Holt-based approach some innovative Bayesian methods have been developed for combined editing and imputation; see Kim et al. (2015) and Aßmann et al. (2024). Similar in spirit to multiple imputation (Rubin 1987), these methods allow the uncertainty due to automatic editing in resulting statistical output to be taken into account, which has traditionally often been ignored. Currently, this approach still has important practical limitations; it is, for instance, challenging to take complex systems of edit rules into account. Moreover, this approach is quite far removed from current practice at most NSIs. But both of these things might change in the future. An alternative, general approach for measuring the total variance of an estimation process, including the uncertainty due to automatic editing and imputation, is provided by the bootstrap (Efron and Tibshirani 1994; Van der Loo et al. 2017).
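A minimal sketch of the bootstrap idea (a simple with-replacement resampling of units, with hypothetical edit_and_impute and estimate functions supplied by the user and raw_data assumed to be a pandas DataFrame of unit-level records; real applications would need to respect the sampling design, see Van der Loo et al. 2017):

import numpy as np

def bootstrap_total_variance(raw_data, edit_and_impute, estimate, n_replicates=200, seed=1):
    """Approximate the variance of an estimate including the effect of
    automatic editing and imputation, by repeating both steps on each resample."""
    rng = np.random.default_rng(seed)
    n = len(raw_data)
    replicates = []
    for _ in range(n_replicates):
        sample = raw_data.iloc[rng.integers(0, n, size=n)]    # resample units with replacement
        replicates.append(estimate(edit_and_impute(sample)))  # re-run E&I on every resample
    return np.var(replicates, ddof=1)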
4. Beyond Editing?
Data editing is currently the main approach to handle measurement errors at NSIs. Outside of official statistics, although some basic form of editing may still occur, it is more common to rely on other approaches that account for measurement error at the estimation or analysis stage. This includes robust estimation methods (Beaumont et al. 2013; Huber 1981) and statistical models that include an explicit measurement error component, such as latent variable models (Biemer 2011; Bollen 1989). It is interesting to consider whether such approaches could also be used more for official statistics, enabling a further reduction in selective manual editing.
In a similar spirit, but closer to current practice at NSIs, is the suggestion of Ilves and Laitila (2009) to draw a probability sample of units for manual editing, including units with scores below the usual selection threshold. This would allow the quality of statistical output after the regular selective editing process to be estimated in a design-based way. This is in line with the proposal of Zhang (2023) to use audit sampling as a framework for design-based quality evaluation of statistical output; here, an audit sample is a sample that is drawn not to estimate the target parameter itself, but rather to assess the quality of an existing estimator of the target parameter.
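As an illustration (a simple design-based estimator under assumed audit-sample inclusion probabilities $\pi_i$, not the specific estimator proposed in the cited papers), the error remaining in an estimated total after the regular selective editing process could be estimated from an audit sample $s_a$ as

$$ \hat{B} = \sum_{i \in s_a} \frac{w_i \, ( y_i^{\mathrm{reg}} - y_i^{\mathrm{aud}} )}{\pi_i} , $$

where $w_i$ is the survey weight of unit $i$, $y_i^{\mathrm{reg}}$ its value as left by the regular editing process, $y_i^{\mathrm{aud}}$ its value after the additional audit editing, and $\pi_i$ its inclusion probability in the audit sample.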
Zhang’s paper specifically discusses multisource statistics, that is, statistical output based on several data sources, which may include traditional surveys, registers, web scraped data, etc. On one hand, having multiple data sources provides an opportunity for improved editing because more information is available, but on the other hand, the challenges of selective and automatic editing become even larger in this context due to the size and complexity of data and edit rules.
The current methods for selective and automatic editing have been developed mostly in the context of traditional sample surveys. With the trend towards using a greater variety of data sources, in particular non-survey data sources, for the production of official statistics, approaches that go beyond editing may become more relevant in the coming years. For handling measurement errors in large administrative datasets, let alone “big data,” existing editing techniques may not be suitable. For instance, a selection of only the influential suspicious cases from an administrative dataset may still be too large to edit manually. Moreover, many analysts might not have the required administrative knowledge to perform suitable edits on those data. For a discussion of statistical data editing for “big data,” see De Waal et al. (2014).
5. Concluding Remarks
I conclude with two points that did not fit into the previous sections. First, statistical data editing is traditionally done in so-called stovepipes, where each editing process runs mostly in isolation. As a result, large inconsistencies between statistics can go unnoticed for a long time, until they are finally found during the production of the national accounts. A somewhat recent development is that some NSIs have introduced so-called Large Cases Units that perform integrated editing across statistics for the largest and most complicated businesses (Vennix 2012). More recently, Statistics Netherlands has worked on applying editing across statistics to the rest of the business population as well. For selective editing, this has been tested successfully in a few places (Vaasen-Otten et al. 2022) and is now being implemented. Research on automatic editing across statistics is ongoing (Scholtus et al. 2024).
Finally, looking back at the four goals of editing mentioned in the introduction, it should be noted that Granquist considered these goals ranked by priority, from most to least important. However, in practice attention is often still focused mainly on the last two goals, while the first two goals have been more elusive (De Waal 2013). Thus, an important challenge remains for the future.
Acknowledgements
The views expressed in this article are those of the author and do not necessarily reflect the policies of Statistics Netherlands. I would like to thank Jeroen Pannekoek and an anonymous reviewer for several helpful comments.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Received: January 11, 2025
Accepted: March 22, 2025
