Language Models in Sociological Research: An Application to Classifying Large Administrative Data and Measuring Religiosity

Abstract

Computational methods have become widespread in the social sciences, but probabilistic language models remain relatively underused. We introduce language models to a general social science readership. First, we offer an accessible explanation of language models, detailing how they estimate the probability of a piece of language, such as a word or sentence, on the basis of the linguistic context. Second, we apply language models in an illustrative analysis to demonstrate the mechanics of using these models in social science research. The example application uses language models to classify names in a large administrative database; the classifications are then used to measure a sociologically important phenomenon: the spatial variation of religiosity. This application highlights several advantages of language models, including their effectiveness in classifying text that contains variation around the base structures, as is often the case with localized naming conventions and dialects. We conclude by discussing language models’ potential to contribute to sociological research beyond classification through their ability to generate language.
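To make the mechanics summarized above concrete, the following is a minimal sketch of how a language model can assign a probability to a name from its character context and how those probabilities can drive classification. It is not the authors' implementation: the abstract does not specify the model, so the character bigram design, the add-one smoothing, the class labels, and the tiny training lists are all assumptions chosen for illustration.

```python
import math
from collections import Counter


class CharBigramLM:
    """Character bigram language model with add-one (Laplace) smoothing.

    Illustrative assumption only; the article's actual model is not
    described in the abstract.
    """

    def __init__(self, names):
        self.bigrams = Counter()   # counts of (previous char, current char)
        self.contexts = Counter()  # counts of previous char
        self.vocab = {"$"}         # "$" marks the end of a name
        for name in names:
            chars = ["^"] + list(name.lower()) + ["$"]  # "^" marks the start
            self.vocab.update(chars[1:])
            for prev, cur in zip(chars, chars[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, name):
        """Log probability of a name: sum of log P(char | previous char)."""
        chars = ["^"] + list(name.lower()) + ["$"]
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        total = 0.0
        for prev, cur in zip(chars, chars[1:]):
            count = self.bigrams[(prev, cur)]
            total += math.log((count + 1) / (self.contexts[prev] + v))
        return total


def classify(name, models):
    """Return the label whose language model gives the name the highest probability."""
    return max(models, key=lambda label: models[label].log_prob(name))


# Hypothetical toy training lists, not data from the article.
training = {
    "arabic_origin": ["muhammad", "ahmad", "abdullah", "fatimah",
                      "aisyah", "nur", "khadijah", "abdul"],
    "javanese": ["suharto", "sukarno", "widodo", "susilo",
                 "joko", "sutrisno", "wiranto", "hartono"],
}
models = {label: CharBigramLM(names) for label, names in training.items()}

# Neither name below appears in the training lists.
for unseen in ["abdurrahman", "sugiarto"]:
    print(unseen, classify(unseen, models))
```

Because each model scores a name one character at a time, a spelling variant that shares most of its character sequences with the training names can still receive a relatively high probability, which is one way to read the abstract's point about text that varies around base structures.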

Keywords

language model, classification, administrative data, religiosity, Indonesia


