Abstract
Automated text analysis has become widespread, and image analysis is gaining interest. However, multimodal analysis that combines text and image information remains rare, even though many real-world data, such as social media posts, are intrinsically multimodal. The authors compare three practical workflows for clustering text–image pairs: (1) label-level combination, which clusters text and image separately and combines the resulting labels; (2) vector-level combination, which clusters concatenated embeddings extracted from each modality; and (3) joint embedding, which clusters unified representations from multimodal embedding models such as Contrastive Language-Image Pre-training (CLIP). The authors also introduce a set of reusable evaluation tools to help researchers compare, validate, and benchmark multimodal clustering workflows: adjusted mutual information to assess text–image alignment, the S_Dbw index to select the number of clusters, and within-cluster consistency to validate interpretability. The authors validate the methods on a Chinese protest data set from social media with 336,921 text–image pairs and test robustness and scope conditions using a smaller U.S. news data set on gun violence with 1,297 news headlines. The authors find that when text and image provide distinct, nonoverlapping information, the second and third methods outperform the first. This study serves as a bridge between the text-as-data and image-as-data communities.
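The first two workflows and the alignment diagnostic can be sketched briefly. The snippet below is a minimal illustration, not the authors' implementation: it uses random arrays as hypothetical stand-ins for text and image embeddings, clusters each modality separately (workflow 1), clusters the concatenated vectors (workflow 2), and computes adjusted mutual information between the two per-modality label sets as the text–image alignment measure. Dimensions and cluster counts are arbitrary choices for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins for real embeddings: 200 text-image pairs,
# 64-dim text vectors and 128-dim image vectors.
text_emb = rng.normal(size=(200, 64))
image_emb = rng.normal(size=(200, 128))

# Workflow 1 (label-level combination): cluster each modality separately,
# then combine or compare the resulting label sets.
text_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(text_emb)
image_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(image_emb)

# Text-image alignment: adjusted mutual information between the two labelings.
ami = adjusted_mutual_info_score(text_labels, image_labels)

# Workflow 2 (vector-level combination): concatenate the per-modality
# embeddings into one feature vector per pair, then cluster jointly.
joint = np.concatenate([text_emb, image_emb], axis=1)
joint_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(joint)
```

Workflow 3 would replace the concatenation step with a single embedding per pair from a multimodal model such as CLIP, after which the same clustering and evaluation steps apply.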