Improving Sampling Probability Definitions with Predictive Algorithms

Abstract

Place-based initiatives often use resident surveys to inform and evaluate interventions. Sampling from a well-defined sampling frame is important but challenging for initiatives that target subpopulations. Databases that enumerate the total population can produce overinclusive sampling frames, resulting in costly outreach to ineligible participants. Quantifying eligibility before sampling with machine-learning algorithms can improve efficiency and reduce costs. We developed a model to improve sampling for the West Philly Promise Neighborhood’s biennial population-representative survey of households with children within a geographic footprint. This study proposes a method to estimate the probability of study eligibility by building a well-calibrated predictive model from existing administrative data sources. Six machine-learning models were evaluated; logistic regression provided the best balance of accuracy and interpretable probabilities. This approach can serve as a blueprint for other population-based studies whose sampling frames cannot be well defined using traditional sources.
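To make the idea of a well-calibrated eligibility model concrete, the following is a minimal sketch rather than the study's implementation. It fits a logistic regression to synthetic household records and then compares predicted eligibility probabilities with observed eligibility rates. The two features, the coefficients used to generate the data, and all function names are hypothetical; the administrative data sources and predictors used in the study are not described in this abstract.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=1.0, epochs=200):
    """Fit logistic regression by batch gradient descent.

    Returns weights as [intercept, w1, w2, ...].
    """
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def predict_proba(w, X):
    """Predicted probability of eligibility for each row of X."""
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]


def brier_score(p, y):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(y)


def calibration_table(p, y, n_bins=5):
    """Per non-empty bin: (mean predicted probability, observed rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for pi, yi in zip(p, y):
        bins[min(int(pi * n_bins), n_bins - 1)].append((pi, yi))
    return [
        (sum(a for a, _ in b) / len(b), sum(c for _, c in b) / len(b), len(b))
        for b in bins
        if b
    ]


# Synthetic stand-in for address-level administrative data (an assumption):
# two standardized, hypothetical features and a known eligibility process.
rng = random.Random(0)


def make_households(n):
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1), rng.gauss(0, 1)]
        p_true = sigmoid(-0.5 + 1.2 * x[0] - 0.8 * x[1])
        X.append(x)
        y.append(1 if rng.random() < p_true else 0)
    return X, y


X_train, y_train = make_households(2000)
X_test, y_test = make_households(1000)

w = fit_logistic(X_train, y_train)
p_test = predict_proba(w, X_test)

print("Brier score:", round(brier_score(p_test, y_test), 3))
for mean_p, observed, count in calibration_table(p_test, y_test):
    print(f"predicted {mean_p:.2f}  observed {observed:.2f}  n={count}")
```

In a calibrated model, the mean predicted probability in each bin should sit close to the observed share of eligible households in that bin, which is what allows the predictions to be read as probabilities of eligibility rather than as arbitrary scores.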


