Abstract
Automated text analysis has become widespread, and image analysis is gaining interest. However, multimodal analysis that combines text and image information remains rare, even though many real-world data, such as social media posts, are intrinsically multimodal. The authors compare three practical workflows for clustering text–image pairs: (1) label-level combination, which clusters text and image separately and combines the resulting labels; (2) vector-level combination, which clusters concatenated embeddings extracted from each modality; and (3) joint embedding, which clusters unified representations from multimodal embedding models such as Contrastive Language-Image Pre-training (CLIP). The authors also introduce a set of reusable evaluation tools to help researchers compare, validate, and benchmark multimodal clustering workflows: adjusted mutual information to assess text–image alignment, the S_Dbw index to select the number of clusters, and within-cluster consistency to validate interpretability. The authors validate the methods on a Chinese protest data set from social media with 336,921 text–image pairs and test robustness and scope conditions using a smaller U.S. news data set on gun violence with 1,297 news headlines. The authors find that when text and image provide distinct, nonoverlapping information, the second and third methods outperform the first. This study serves as a bridge between the text-as-data and image-as-data communities.
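The first two workflows and the alignment diagnostic can be sketched briefly. The snippet below is a minimal illustration, not the authors' implementation: it uses random arrays as hypothetical stand-ins for text and image embeddings, clusters each modality separately (workflow 1), clusters the concatenated vectors (workflow 2), and computes adjusted mutual information between the two per-modality label sets as the text–image alignment measure. Dimensions and cluster counts are arbitrary choices for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins for real embeddings: 200 text-image pairs,
# 64-dim text vectors and 128-dim image vectors.
text_emb = rng.normal(size=(200, 64))
image_emb = rng.normal(size=(200, 128))

# Workflow 1 (label-level combination): cluster each modality separately,
# then combine or compare the resulting label sets.
text_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(text_emb)
image_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(image_emb)

# Text-image alignment: adjusted mutual information between the two labelings.
ami = adjusted_mutual_info_score(text_labels, image_labels)

# Workflow 2 (vector-level combination): concatenate the per-modality
# embeddings into one feature vector per pair, then cluster jointly.
joint = np.concatenate([text_emb, image_emb], axis=1)
joint_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(joint)
```

Workflow 3 would replace the concatenation step with a single embedding per pair from a multimodal model such as CLIP, after which the same clustering and evaluation steps apply.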