Abstract
This paper presents a practical guide to machine learning–assisted visual content analysis for social scientists. Combining machine automation with human expertise and reflexivity, the proposed methodological framework bridges the gap between computer vision and social research. Our custom approach combines inductive, deductive, and abductive logics of scientific inquiry and consists of three complementary steps: (a) Pattern exploration—employing unsupervised learning to explore visual patterns within image datasets; (b) Theory-driven image classification—utilizing supervised learning with convolutional neural networks to systematically label visual content; and (c) Context-sensitive interpretation—providing critical and creative engagement with the patterns identified in the previous steps. We illustrate these three steps, and their various combinations, through empirical examples from a study of visuality in digital diplomacy, and critically discuss the epistemological implications of using machine learning as a method in visual social research.
Introduction
The booming digital circulation of images, together with recent advancements in computational methods and AI research, has opened uncharted avenues for exploring and understanding the social world through its visual traces. The current abundance of both human- and AI-generated visual content covers every imaginable political, social, or cultural topic. This content takes many forms: from visual media shared on platforms such as Instagram, Reddit, Facebook, or TikTok, to satellite and drone imagery, CCTV footage, news visuals, and archival or historical visual records, among other types of image data. This marks a potentially unprecedented—and yet, largely underappreciated—shift for social science inquiry. However, the epistemological complexities linked, on the one hand, to the analysis of large unstructured visual datasets and, on the other hand, to their interpretation in light of the sociotechnical contexts from which they can be extracted, pose significant technical and heuristic challenges, making it difficult for social scientists to fully embrace the opportunities for knowledge gain in image data (Chen et al., 2023; Pearce et al., 2020). This article contributes to the systematization, formalization, and diffusion of this cross-disciplinary and mostly tacit methodological knowledge, through a practical guide for applying machine learning (ML) methods to visual content analysis in social research.
The methodological framework guiding this work lies at the intersection of multiple research streams centered on visual content analysis for the social sciences, including computational approaches (Joo and Steinert-Threlkeld, 2022), qualitative visual methodologies (Luhtakallio, 2013; Rose, 2016), and digital methods approaches to visual media analysis (Rogers, 2021). In addition, drawing inspiration from methodological and analytical frameworks developed for the computational analysis of large text data (Carlsen and Ralund, 2022; DiMaggio et al., 2013; Nelson, 2020), our practical guide attempts to combine the computational capabilities of machines with human expert knowledge, methodological reflexivity, and context-sensitive interpretation.
In recent years, a growing number of researchers within the broader horizon of social sciences have employed ML for image analysis to answer big societal questions (Casas and Williams, 2019; Chen et al., 2023; Torres and Cantú, 2022). However, the application of ML techniques to visual analysis in social research is currently done in the absence of standardized methodological guidelines for automatically classifying large datasets of images (Chen et al., 2023). Social science researchers interested in automatizing visual analysis are largely left to themselves to seek out experience and knowledge to do so.
This places them in a position similar to earlier phases of computational text analysis applications (Nelson, 2020): operating without theoretical and methodological grounding—an otherwise critical component in any social scientific analysis. Furthermore, from a critical, digital methods perspective, ML methods cannot be approached as neutral research tools (Coromina et al., 2023). On the contrary, training data, value-laden parameters, and specific “algorithmic techniques” (Rieder, 2020: 353) influence final results, often in opaque ways (Burrell, 2016), and occasionally introduce biases (Zou and Schiebinger, 2018). For this reason, our paper (a) emphasizes the active role of humans in the (research) loop (see also Miner et al., 2023), showing how to practically make use of computer vision methods in reflexive and qualitatively driven ways and (b) relies on a custom-made machine classifier as an alternative to the ready-built—yet opaque—ones accessible through digital corporations’ interfaces (e.g., Google Cloud Vision or Amazon Rekognition), guiding its implementation through an illustrative Python script (see Appendix A).
Unlike existing approaches to visual analysis that often prioritize either computational efficiency or qualitative depth, our framework bridges this divide by establishing a dynamic back-and-forth between the two. It leverages the scalability and pattern recognition capabilities of machines while integrating expert knowledge, epistemological reflexivity, and interpretative skills.
To operationalize this methodological approach, we propose a three-step guide articulated following Peirce's established conceptualization of the different logics of science—that is, induction, deduction, and abduction (Peirce, 1878). Building on contributions that describe ML as inherently based on forms of “statistical induction” (Campolo and Schwerzmann, 2023), as well as attuned to the abductive processes of scientific discovery and explicability characterizing qualitative social research (Brandt, 2023), we show how the three main logics of inquiry first described by Peirce drive different and largely complementary moments in ML-assisted visual content analysis: (a) pattern exploration (inductive logic), (b) theory-driven image classification (deductive logic), and (c) context-sensitive interpretation (abductive logic). We describe these three steps through empirical illustrations, highlighting their practical implementation and showing how they can be combined in different ways to address different research goals. In doing so, we aim to provide a substantial contribution to the emerging field of computational visual analysis by equipping social scientists with a rigorous yet flexible toolkit for navigating the challenges and opportunities brought forward by the abundance of image data and advances in computer vision technologies.
The paper is organized as follows. We begin with a summary of recent debates around ML in social science research more generally. Next, we situate our framework within the context of visual content analysis in the social sciences and then outline the three steps of the proposed practical guide, supplementing each step with empirical examples (“Methodological framework” section). In the conclusion, we critically reflect on the opportunities and challenges presented by this methodological development and argue for the importance of combining computational techniques with human expertise to tackle the complexities associated with machine-assisted visual analyses.
Machine learning as method in social science research
In the social sciences, ML is approached from different angles. It is an object (when not, literally, a subject) of critical and sociological inquiry; a sociotechnical mirror that reflects society and culture by means of their vectorized data traces, risking subtly amplifying biases and inequalities (Arseniev-Koehler and Foster, 2022; Burkhardt and Rieder, 2024; Veale and Binns, 2017). And, of course, a (set of) method(s), now extensively applied to the analysis of large unstructured datasets that were previously impossible or very difficult to analyze (Garip and Macy, 2023; Grimmer et al., 2021; Nelson, 2020). In this article, we focus on ML as a method, and we do so by acknowledging the critical implications for visual content analysis raised by the epistemological shift “from rules to examples” dissected by Campolo and Schwerzmann (2023).
Machine learning techniques are currently an integral part of the methodological toolkit of contemporary social science research, as they are increasingly employed to automatically classify and link data, detect patterns in large datasets, make causal inferences (Garip and Macy, 2023) and, to a lesser extent, formulate predictions (Salganik et al., 2020). Yet, so far, social science researchers have mostly used ML methods to analyze large text corpora (Edelmann et al., 2020). At first, it was text—and text only: years before the current academic buzz about AI, social and political scientists were already using supervised and unsupervised ML techniques borrowed from computer science to classify documents of various kinds, from digitized news articles to party manifestos and, of course, the short user-generated posts, reviews, and comments massively circulating online.
Discussing the application of ML to text, scholars have noted how these methods enable novel pathways to theory building (Arseniev-Koehler and Foster, 2022; DiMaggio et al., 2013; Grimmer et al., 2021) and suit both quantitatively and qualitatively driven research designs and interpretative frameworks (Miner et al., 2023; Nelson, 2020).
Over the years, ML tools have been extensively applied also in digital and media research: to classify posts extracted from social media platforms (e.g., Schweinberger et al., 2021), automate data collection procedures (Rama et al., 2022), or with the intent of “repurposing” existing algorithmic infrastructures for research purposes (Coromina et al., 2023).
Can ML similarly “augment” social research dealing with a different kind of unstructured social data, that is, images? So far, the application of ML techniques to the analysis of visual data amounts to only a small fraction of the computational social science literature (Edelmann et al., 2020). Despite the vast abundance of visual content in today's digital communication, its computational processing is considerably more difficult than that of text. Visual data are heavier and more complex: unlike words, their “atomic features”—that is, pixels—are “meaningless” when taken on their own (Grimmer et al., 2021: 401). And yet—or precisely for this reason—advancements in the field of computer vision demonstrate that ML-assisted forms of visual content analysis can be tremendously valuable to the social sciences.
Their value is, nonetheless, highly dependent on the adoption of a sociologically and methodologically reflexive mindset attuned to the epistemological specificities of ML with respect to manual as well as rule-based analytical approaches to visual content analysis (Rogers, 2021; Rose, 2016). These include the—often opaque—role of training data, algorithmic techniques, and platform infrastructures (e.g., Google Cloud Vision) in shaping and confounding results, and, of course, the related risk of a biased classification—or generation—of visual content (Burkhardt and Rieder, 2024; Veale and Binns, 2017).
Hence, building on the critical scholarship on ML and digital methods, here we develop a qualitatively driven methodological framework for ML-assisted visual content analysis that privileges custom-made techniques (see Appendix A) over less transparent ready-built platform infrastructures, and context-sensitive interpretations over accuracy at all costs.
Visual content analysis with machine assistance
Visual content analysis—especially with computer vision—provides a way to navigate the complexity of images at scale by translating them into indexable, quantifiable components, effectively creating a simpler form of representation (Zhang, 2023).
Traditionally, scholars relied on manual, small-scale approaches to analyze visual data, exploring topics such as political activism and protest (Doerr et al., 2013; Luhtakallio, 2013) and media depictions of wars and conflicts (Hallin, 1989; Parry, 2024). More recent studies have examined how digital platforms shape visual discourses, from obesity in YouTube videos (Yoo and Kim, 2012) to gender identities and selfies on Instagram (Baker and Walsh, 2018; Veum and Undrum, 2018). While these works underscore the rich potential of human-led visual content analysis, the labor-intensive nature of manual approaches makes them increasingly inadequate for responding to the continuously increasing availability of visual material.
Advancements in computational methods now allow researchers to overcome many limitations of manual approaches, offering opportunities to study visual material at scale and uncover patterns that were previously inaccessible (Chen et al., 2023). Tools such as ImagePlot, ImageSorter, and PixPlot leverage unsupervised learning and clustering algorithms to analyze images based on visual features such as color, shape, and content (Manovich and Douglass, 2011; Rogers, 2021). These unsupervised approaches have paved the way for a diverse range of studies: from cross-platform analysis of social media images (Pearce et al., 2020) and exploration of brand imagery on Instagram (Caliandro and Anselmi, 2021) to the categorization of visual data to uncover thematic patterns (Peng, 2021; Zhang and Peng, 2022).
While unsupervised approaches excel in inductive pattern exploration, supervised classification techniques provide systematic labeling of image content, thereby expanding the analytical scope to include statistical evaluations and hypothesis testing. Autotagging services such as Amazon's Rekognition and Google's Cloud Vision have been used to analyze relationships between presidential candidates’ facial expressions and election outcomes (Horiuchi et al., 2012) or to examine brand-related user-generated content on social media platforms (Nanne et al., 2020). However, while these tools offer efficiency and accessibility, they generally lack transparency and customization while also raising ethical concerns and questions about the reliability and replicability of their outputs (Goes Mintz et al., 2019; Webb et al., 2020). Perhaps for these reasons, social scientists often turn to custom approaches, which offer greater flexibility, transparency, and control throughout the classification process.
Political scientists, for example, have applied custom-supervised methods to decode complex visual cues in political communication and framing within news and social media landscapes (Joo and Steinert-Threlkeld, 2022; Peng, 2018). Other applications include estimating protest sizes from social media images (Sobolev et al., 2020), and analyzing Google Street View images as proxies for socioeconomic conditions (Gebru et al., 2017; Hwang et al., 2023). Historical datasets have also been used to detect election fraud (Cantú, 2019) and predict conflict through infrastructural mapping (Müller-Crepon et al., 2021).
Hence, the social science literature employing computational and ML methods to visual data varies considerably, particularly regarding the differential use of unsupervised and supervised approaches. While some studies utilize unsupervised techniques to explore emergent visual patterns (Caliandro and Anselmi, 2021; Zhang and Peng, 2022), others apply supervised classification directly when clear theoretical or hypothesis-driven labels already exist (Torres and Cantú, 2022). Our paper acknowledges the multiple applications of ML methods in visual analysis and systematizes them in light of inductive, deductive, and abductive logics of scientific inquiry (Peirce, 1878) and related research designs. The methodological framework illustrated below explicitly embraces methodological flexibility, suggesting that the choice and sequence of supervised and unsupervised techniques should adapt according to specific research questions and epistemological aims.
While usually excelling at automation and pattern recognition, the ML approaches mentioned above often struggle to capture the finer-grained, context-dependent details that are essential for understanding the social significance of visual data. Our methodological framework addresses this limitation by integrating a customizable approach to ML that includes continuous human assessment and context-sensitive interpretation throughout the analytical process. This enables researchers to leverage computational efficiency while maintaining the reflexivity and interpretive depth required to unpack the complexities of images.
Methodological framework
In the following, we outline the three methodological steps of our practical guide, visualized in Figure 1. As previously noted, these steps can be applied independently or in combination. Researchers may engage inductively by moving from pattern exploration (A) directly to context-sensitive interpretation (C) or follow a theory-driven approach by applying supervised classification (B) before interpretation (C). Alternatively, the full framework can be concatenated in an abductive circle (A-C-B-C) combining unsupervised exploration with supervised classification while incorporating interpretative refinement at multiple stages.

Figure 1. Three-step framework for applying machine learning to visual content analysis.
Each step aligns with a distinct logic of inquiry: Step A follows an inductive approach, using unsupervised learning to explore emergent patterns. Step B is more deductive in kind, since it employs predefined labels for systematic supervised classification. While ML techniques excel at recognizing patterns, they do not provide explanations for these patterns. For this reason, step C's integration of abductive reasoning is essential, ensuring each application of the framework involves critical, iterative, and creative engagement with the findings.
We illustrate each step through selected empirical examples from a research study on visuality in digital diplomacy (Møller et al., 2024). In this work, the first author and coauthors explored a collection of more than 55k images retrieved from Twitter (now X) in 2021 from the public profiles of diplomats around the world, aiming to understand what these images convey, and how social norms, power structures, and hierarchies encode themselves into visual forms of digital diplomacy.
Step A: Pattern exploration
The first step is inductive by nature, focusing on exploring patterns within a dataset of images. The goal is to produce a structured description of the dataset by mapping its similarities and differences. This initial mapping serves two purposes: it either lays the groundwork for an in-depth qualitative analysis in step C or supports the development of labels for subsequent supervised classification in step B, facilitated through iterative integration of step C.
To explore patterns in large-scale visual datasets, we recommend employing unsupervised learning methods, such as image clustering and dimensionality reduction, as they represent valuable tools for revealing hidden trends and topics in the visual data (Zhang and Peng, 2022). Unlike supervised classification, these methods do not rely on predefined labels but instead detect latent patterns, allowing for an initial exploration of recurring visual themes. For a simple and fast implementation, tools such as PicArrange (Jung et al., 2022) and PixPlot (Yale Digital Humanities Lab, 2017) are well-suited for this first step. These tools combine pretrained convolutional neural networks (CNNs) with dimensionality reduction techniques to map and cluster images based on visual properties (Jung et al., 2022). The output is a visualization in which images are grouped by similarities in color, shape, and content.
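For researchers who prefer to assemble this step themselves rather than rely on a ready-made tool, a minimal Python sketch of the same pipeline might look as follows. The image folder, backbone network, component count, and cluster count are illustrative assumptions, not prescriptions; libraries such as UMAP or HDBSCAN could equally replace the PCA and k-means used here.

```python
import glob
import numpy as np
import torch
from PIL import Image
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from torchvision import models, transforms

# Pretrained ResNet-50 with its classification head replaced by an
# identity, so each image maps to a 2048-dimensional feature vector.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

paths = sorted(glob.glob("images/*.jpg"))  # hypothetical image folder
features = []
with torch.no_grad():
    for path in paths:
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        features.append(model(img).squeeze(0).numpy())
features = np.stack(features)

# Reduce the feature space, then cluster. 50 components and 12 clusters
# are starting points to be revised through close reading of the output
# (assumes the folder holds well over 50 images).
embedding = PCA(n_components=50).fit_transform(features)
clusters = KMeans(n_clusters=12, random_state=0, n_init=10).fit_predict(embedding)
for path, c in list(zip(paths, clusters))[:5]:
    print(c, path)
```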
To explore and critically reflect on the emerging patterns in the clustering output, the researcher begins by examining them with openness and curiosity, allowing themes to emerge inductively, switching between close and distant reading of images (Rogers, 2021). This approach is comparable to computational text analysis using LDA, where topics are inferred from patterns in the data rather than from predefined categories (DiMaggio et al., 2013). The process is inherently iterative, often requiring refinement, merging, and adjustment of clusters as the researcher gains a clearer understanding of the dataset. Questions guiding this process might include: What patterns do the images create and what clusters can I identify? What specific content and color define the core and boundaries of these clusters? Where does one cluster end, and another begin?
From here, the step can proceed in different directions. If the goal is in-depth qualitative interpretation, analysis can proceed directly to step C. Alternatively, if aiming for systematic classification of images, the focus shifts to translating the identified clusters into labels, which will then serve as the foundation for training a supervised classifier in step B.
When transforming image clusters into semantic representations (labels), researchers are confronted with an epistemological gap between what images show and what they convey; between what exists at the pixel level and the multilayered meanings and nuances they also embed (Maltezos et al., 2024). Images derive their meanings from the social, cultural, and political contexts in which they are embedded, while simultaneously playing an active role in shaping these contexts (Awad, 2020). This complexity makes their systematic analysis challenging, and the construction of labels for training a computational model equally so.
While the multilayeredness of images may seem self-evident to a social scientist, computational models process images not as complex cultural artifacts but rather as a combination of pixels. This means that an image classifier may perform well in detecting concrete, object-based categories such as “people,” “tables,” or “food,” but would struggle with recognizing abstract or context-sensitive concepts, such as “social gatherings” or “diplomatic meetings,” which rely on relational and situational knowledge.
Because of this epistemological divide, the development of labels benefits from adopting an abductive logic to ensure that labels remain both machine-readable and socially meaningful. The framework addresses this challenge by suggesting an iterative engagement with pattern exploration (step A), context-sensitive interpretation of clusters (step C), and theory-driven image classification (step B), before the final interpretation of findings (step C).
Empirical example: Diplomatic imagery, part 1
The first author's diplomacy study investigated how 1207 diplomats around the world used images on Twitter to present themselves and their nation (Møller et al., 2024). The first task was to gain an overview of the visual variance in the large dataset of 55k images. For this, PicArrange was deployed to visualize and cluster a random sample of 1000 images from the dataset. This generated a mosaic of images, which each author looked through, focusing on clusters of visual similarity, and noting their key characteristics.
At this stage, the process was inductive, focused on exploring emergent visual patterns rather than imposing predefined expectations. During the exploration, some authors paid attention to the object-based elements displayed in the images, noting concrete visual features such as person, screen, tree, flag, and sky. Others approached the clusters thematically, identifying categories such as diplomatic meetings, national symbols, and gardens. These different “ways of seeing” highlighted the negotiation between machine vision and human interpretation—and a key challenge for the researchers: while the unsupervised technique effectively highlighted patterns of visual similarity, the nuanced meaning of these patterns required human insights.
To bridge the gap between computational analysis and human interpretation, the authors employed an abductive approach, moving back and forth between pattern exploration and preliminary interpretation. This involved iteratively refining the clusters by merging and splitting them, informed by both empirical evidence and interpretative insights from the researchers.
The overall aim was to develop labels for subsequent supervised classification (Step B), yet, these labels did not simply emerge from the data. Rather, they were developed through an iterative process between pattern exploration and interpretative engagement, continuously keeping in mind the limited ways in which machines “see.”
In sum, step A invites the researcher to engage inductively in exploring visual patterns within large collections of images. Applying unsupervised image clustering techniques in this exploratory phase helps uncover emerging themes and structures in the data that may otherwise go unnoticed while minimizing researcher biases (Nelson, 2020). From here, the process can proceed directly to context-sensitive interpretation (step C) or toward developing labels for theory-driven image classification (step B) through the iterative incorporation of step C, to ensure labels are empirically grounded, analytically meaningful, and machine-readable before being applied for classification.
Step B: Theory-driven image classification
In step B, the goal is to train an ML model that can classify the content of the images in the dataset. This step follows a deductive logic, where predefined labels (derived from preexisting hypotheses or abductively developed between steps A and C) are systematically applied to new images, enabling the automated classification of large-scale visual datasets.
While advances in ML, particularly deep learning and CNNs (LeCun et al., 2015), have made visual analysis significantly more accessible, the task of teaching machines to “see” remains complex. Convolutional neural networks operate in ways that are opaque and difficult for humans to fully understand (Borch and Hee Min, 2022; Burrell, 2016), meaning that despite their ability to automate classification, their application in social research requires continuous human assessment and intervention.
For a supervised learning algorithm to classify visual content—such as houses, people, and animals—it must be trained on images of houses, people, and animals; each image labeled with what it shows. As previously noted, distinct object-oriented labels are critical for ML. The number of annotated images required for effective CNN training depends on the complexity of the task and the desired level of accuracy. Typically, training a CNN from scratch requires millions of annotated images. However, in step B, we propose using transfer learning, which takes a CNN model pretrained on a large dataset (e.g., ImageNet; Russakovsky et al., 2015) and fine-tunes it on the custom dataset (Pan and Yang, 2010). This approach significantly reduces the number of annotated images necessary to achieve satisfactory model performance.
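For illustration, a transfer-learning setup of this kind might look as follows in PyTorch. The backbone, label count, and hyperparameters are assumptions made for the sketch rather than the configuration used in the study; the authors' actual script is in Appendix A.

```python
import torch
from torchvision import models

NUM_LABELS = 10  # hypothetical size of the custom label set

# Backbone pretrained on ImageNet; only the new output head is trained
# at first, and deeper layers can be unfrozen later for fine-tuning.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in model.parameters():
    param.requires_grad = False
model.fc = torch.nn.Linear(model.fc.in_features, NUM_LABELS)

# Multilabel setup: one sigmoid output per label, trained with binary
# cross-entropy on the annotated sample developed between steps A and C.
criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```

From here, a standard training loop over a DataLoader of annotated images completes the setup.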
To train a CNN, a sample of images—annotated according to the labels established between steps A and C—is used as input. These images are split into a training set (used for model learning), a validation set (used to fine-tune model parameters), and a testing set (used for model evaluation). A typical split that is often considered effective for most classification tasks is 80% for training, 10% for validation, and 10% for testing (Webb et al., 2020). During training, the model learns from the training set, while the validation set helps fine-tune the model by evaluating its performance on previously unseen examples within the training phase. Testing on a separate dataset ensures that the model can generalize—as opposed to simply explaining peculiarities or noise in the sample (Brandt, 2023; Watts, 2014). For a detailed description of CNNs and their application in image classification within social sciences, see Michelle Torres and Francisco Cantú (2022) and Jungseock Joo and Zachary C. Steinert-Threlkeld (2022).
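One simple way to produce such an 80/10/10 split, sketched here with scikit-learn on placeholder file names and annotations:

```python
from sklearn.model_selection import train_test_split

# Toy annotated sample: file names and binary multilabel annotations.
paths = [f"images/img_{i}.jpg" for i in range(1000)]
labels = [[i % 2, (i + 1) % 2] for i in range(1000)]

# First hold out 20%, then split it evenly into validation and test;
# a fixed random_state keeps the split reproducible.
train_paths, hold_paths, train_y, hold_y = train_test_split(
    paths, labels, test_size=0.20, random_state=42)
val_paths, test_paths, val_y, test_y = train_test_split(
    hold_paths, hold_y, test_size=0.50, random_state=42)
print(len(train_paths), len(val_paths), len(test_paths))  # 800 100 100
```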
Assessing and refining the model is central. The ability of a CNN to generalize to unseen images is measured using performance metrics such as accuracy, precision, recall, and F1 score (for more information on diagnostics for CNNs, see Torres and Cantú, 2022). However, because ML models process images differently from humans (Burrell, 2016; see also Miner et al., 2023), we recommend supplementing quantitative performance metrics with qualitative assessment to identify systematic errors, ensure interpretative accuracy, and maintain human oversight throughout the classification process. An important way to achieve this involves manually inspecting a sample of the model's predictions and comparing them to human annotations. This allows researchers to identify patterns of misclassification, indicating systematic errors in the model. Additionally, it is important to examine images that were not assigned any predictions during classification, thereby ensuring that no major visual themes have been systematically overlooked. Beyond performance evaluation, researchers should remain attentive to how biases in training data influence classification results. Certain image categories may be over- or underrepresented due to dataset composition or biases embedded in training data, which can systematically shape the types of patterns that emerge in automated classifications (Buolamwini and Gebru, 2018; Zou and Schiebinger, 2018).
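The quantitative half of this assessment can be computed directly with scikit-learn. The sketch below assumes binary multilabel matrices for ground truth and predictions and runs on toy values; it also flags the unlabeled images mentioned above for qualitative review.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy binary multilabel matrices: rows are images, columns are labels.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 0]])

print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall:   ", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F1:       ", f1_score(y_true, y_pred, average="macro", zero_division=0))

# Flag images that received no label at all, for manual inspection.
unlabeled = np.where(y_pred.sum(axis=1) == 0)[0]
print("images with no prediction:", unlabeled)
```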
Informed by these assessments, necessary adjustments are made to both the model and the labels. It is a cycle of assessing, fine-tuning, retraining, and reassessing, in which the researcher negotiates between pursuing high accuracy and preserving the analytical value of the labels and the overall integrity of the research. On the one hand, the researcher wants a model with the highest possible level of performance, so that its predictions are trustworthy. This is partially achieved by training on well-defined, object-oriented categories. On the other hand, high accuracy alone is not sufficient for methodological rigor. The pursuit of accuracy should not overshadow the meaningfulness of the labels, that is, how well they capture the variability in the data, and in turn, facilitate answers to the problems raised in the research questions. Achieving high prediction accuracy in narrowly defined, visually homogeneous labels is technically easy—but may lead to trivial insights.
The goal of step B, therefore, is to develop a model that classifies visual content in unseen images using custom labels that are empirically grounded as well as analytically meaningful. Python code for implementing step B as a multilabel classification task (though customizable for other tasks as well) and for assessing model performance is available in Appendix A.
Empirical example: Diplomatic imagery, part 2
We now turn our attention back to the empirical example (Møller et al., 2024). Here, to classify the images in the dataset, the researchers proceeded toward training a CNN. Initially, a subset of images was annotated based on a set of labels derived from both emergent patterns in the data and the authors’ theoretical knowledge of diplomacy. Labels such as “flags,” “buildings,” “garden,” and “people posing” were informed not only by data-driven insights but also by theories on the symbols and practices that contribute to the practice of diplomacy. This dual approach ensured that the labels were both empirically grounded and analytically relevant.
The annotated images were then divided, allocating 80% for training, with the remainder split equally between validation and testing. For an in-depth description of the choice of dataset expansion, transfer learning techniques, network architecture selection, and fine-tuning methods, see Appendix A (code guide) or Møller et al. (2024).
To assess the model's performance, the researchers relied on standard metrics, including accuracy, precision, recall, and F1 score (Webb et al., 2020), supplemented with human assessment. A sample of 100 classified images was manually reviewed to identify potential errors and inconsistencies. This involved posing questions such as: How did the machine categorize these images in comparison to the human annotations? Were there recurring themes that were consistently misinterpreted? Addressing these questions provided a richer understanding of any systematic errors exhibited by the model, guiding necessary revisions of both the labels and the model itself.
This back-and-forth highlighted a key challenge: developing labels that were both machine-readable and geared toward meaningful interpretation. The researchers quickly realized that the classifier performed best when trained on distinct, object-oriented categories such as “people posing,” “trees/grass,” “flags,” and “sky.” While these labels were perhaps less analytically rich than broader conceptual labels such as “successful meetings,” “serene gardens,” and “harmony,” they nevertheless laid the groundwork from which more nuanced interpretation could occur in the next step.
The process of training the model, assessing its performance, and refining both model and labels was repeated until the classifier achieved satisfactory performance, meaning that it could adequately capture the visual variance in the dataset while remaining attuned to the research objectives. Once optimized, the model was applied to classify the entire dataset using a 0.5 confidence threshold. This threshold was settled on after a number of iterations, slightly favoring precision over recall to ensure that assigned labels were reliable. The final outcome of this process was a numerical representation of the image content, ready for further analysis.
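For concreteness, applying such a confidence threshold to a multilabel model's outputs amounts to a sigmoid followed by a cutoff, as in this short sketch with placeholder tensors (not the study's code; shapes are illustrative).

```python
import torch

torch.manual_seed(0)
logits = torch.randn(4, 10)     # raw model outputs: 4 images, 10 labels
probs = torch.sigmoid(logits)   # per-label confidence scores in [0, 1]
preds = (probs >= 0.5).int()    # 1 = label assigned, 0 = not assigned
print(preds)
```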
In sum, step B invites the researchers to engage in a deductive process that combines automation with human assessment. The process of training a CNN on custom labels, evaluating its performance, and refining both the model and the labels, exemplifies how machine efficiency and human expert knowledge and interpretative skills combine into a framework that can produce insights that are both technically accurate and analytically meaningful. Ultimately, the output of step B is a systematic labeling of the image content, preparing the dataset for context-sensitive interpretation in step C.
Step C: Context-sensitive interpretation
In step C, the task is to interpret the patterns in the context of the social worlds from which they emerge (Nelson, 2020). Abductively oriented, this step is iterative and creative, enabling researchers to uncover findings in light of existing knowledge while remaining open to new insights (Brandt, 2023). Specifically, this means moving between different levels of analysis—balancing large-scale pattern detection and exploration (understanding what the images show) with in-depth interpretation (unpacking how these visual regularities hold social importance).
The execution of step C depends on its placement in the framework. If step C is placed between steps A and B (A-C-B-C), its purpose is to develop and refine labels through an abductive process. When step C follows A without classification (A-C), it serves as a purely descriptive and qualitative analysis. Here, the focus is on interpreting emerging patterns and situating them within broader social contexts. When step C follows B (A-C-B-C or B-C), it extends interpretation to include theory-driven image classification. Drawing from the advantages of having achieved a systematic account of the image content through supervised classification, this phase incorporates statistical methods to examine distribution, relationships, and emergent themes.
Context and reflexivity play a central role in step C. Researchers must remain attentive to how patterns are shaped not only by the data but also by the computational processes that structure them. Decisions about data selection, algorithmic biases, and predefined labels influence the way patterns emerge, making it necessary to continually interrogate the underlying assumptions and methodological choices that shape the analysis. Reflexivity ensures that findings are not merely descriptive but analytically meaningful, accounting for the various factors that contribute to shaping the production of knowledge.
To begin the interpretation, researchers can start by engaging qualitatively with the emerging visual patterns, identifying overarching clusters in the dataset. What major themes emerge? Addressing this question encourages a digital methods approach to visual analysis (Rogers, 2021), which involves critically examining both large-scale patterns and finer details, also in light of the (sociotechnical) contexts where visual data have been generated and extracted. Examining clusters of frequently occurring content is relevant because these may reflect broader social trends or shed light on the normative and the ordinary—that is, facets of society that can be just as socially important as the extraordinary (Humphreys, 2018).
At the same time, it is equally important to remain attuned to unexpected findings, anomalies, and outliers—those “…little facts and their relations, and big unique events as well.” (Mills and Gitlin, 2000: 224). Frequency alone does not equate to social significance but often requires nuanced interpretation to unravel its meaning (Ball and Smith, 1992; Luhtakallio, 2013). Step C, therefore, encourages researchers to extend the analysis beyond dominant themes, maintaining sensitivity toward the unique and the overlooked. At this stage, researchers could engage with relevant theoretical frameworks to further contextualize and make sense of emerging patterns. While clusters might indicate dominating norms and trends, outliers can offer unique insights and challenge existing theoretical assumptions.
While the qualitative approach provides depth and nuance, the systematic classification from step B enables a structured and consistent analysis across the dataset. To translate the frequency and distribution of these classifications into meaningful social insights, researchers may employ various techniques. These include descriptive statistics, cross-tabulation, correlation analysis, or principal component analysis—see Krippendorff (2018) for useful approaches to content analysis. Additionally, network analysis can provide insights into the connections between image content and associated metadata (Correa and Ma, 2011; Freeman, 2000). By organizing large numbers of images into interpretable patterns, these methods provide a structured overview of the visual variance in a dataset, effectively capturing patterns of repetition and association while exposing anomalies. Combining the systematicity and scale of computational methods with nuanced human interpretative skills, this approach shifts from merely object-oriented classifications to relational and contextual understanding. As such, it takes full advantage of the granularity of digital traces (Airoldi, 2021), which “…may provide another way to render the social sciences empirical and quantitative without losing their necessary stress on particulars.” (Latour et al., 2012: 18).
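As a minimal illustration of such tabulations, label frequencies and co-occurrences can be read directly off the binary prediction matrix produced in step B. The pandas sketch below uses toy data and hypothetical label names.

```python
import pandas as pd

# Toy binary prediction matrix from step B: rows are images,
# columns are hypothetical labels.
df = pd.DataFrame(
    [[1, 1, 0], [1, 0, 1], [1, 1, 0], [0, 0, 1]],
    columns=["flags", "people posing", "buildings"],
)

# Share of images in which each label appears.
print(df.mean().sort_values(ascending=False))

# Co-occurrence matrix: diagonal entries count each label, off-diagonal
# entries count how often two labels appear in the same image.
print(df.T.dot(df))
```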
Throughout step C, reflecting critically on the ways in which patterns are shaped by their sociotechnical contexts is essential. Drawing inspiration from digital methods (Venturini et al., 2018), researchers should consider how the origin of the data, including the platform-specific vernaculars of the images’ “native” digital environments (Gibbs et al., 2015) as well as the temporal, political, or historical factors, influence the observed patterns. Recognizing the limitations and biases in ML tools is an essential step in ensuring that patterns reflect the phenomena under study rather than artifacts of the computational processes themselves (Buolamwini and Gebru, 2018; Jacobsen, 2023).
Skepticism and continuous reflection must go hand in hand with the application of CNNs and transfer learning, in particular. For instance, with more than 45% of ImageNet data originating from the US, many pretrained CNNs exhibit a Western-centric bias (Zou and Schiebinger, 2018). Although retraining on custom datasets, as suggested in step B, can largely mitigate this issue and enhance transparency, it is important to recognize that AI technologies encode the gender, ethnic, and cultural biases introduced in their training data (Garcia, 2016; Zou and Schiebinger, 2018). Adding to this concern is the emerging issue of AI models increasingly being trained on synthetically generated data, that is, data produced by the algorithms themselves (Jacobsen, 2023). This recursive process risks amplifying existing biases and reinforcing distortions in machine-generated patterns. It is therefore important to critically assess both the origins and implications of the data used for training and classification.
Ultimately, step C elevates the analysis from a purely data-driven exercise to a context-sensitive inquiry. This enables the production of insights that are both fine-grained and comprehensive, “thick” and “thin.” All in all, through context-sensitive interpretation, this step aims to provide richer understandings of not only what patterns emerge but also how and why they occur, potentially leading to insights that may confirm, challenge, or refine existing knowledge.
Empirical example: Diplomatic imagery, part 3
Returning to the study of diplomats’ Twitter imagery (Møller et al., 2024): the researchers now employed statistical techniques to examine the distribution of image content across the labels identified in step B. This quantitative inspection revealed a striking pattern: five categories—including “people posing,” “people and tables,” and “buildings”—constituted 74.6% of the dataset. This concentration suggested cultural regularities in diplomatic imagery that required further interpretation.
At the same time, attention to smaller visual themes was equally essential, as it helped identify both overlooked patterns and potential limitations and biases in the ML model. Particularly surprising to the authors was the low representation of images depicting food, flowers, and artwork—objects traditionally associated with diplomatic performances (Schultz, 1991). This imbalance raised methodological concerns about the classification process itself. The authors reflected on known limitations of CNN-based classification models, particularly how the size, color, and placement of visual content in images can affect model performance (Torres and Cantú, 2022). Multilabel classification models, in particular, often struggle to identify smaller objects within complex image compositions (Zhang et al., 2017). Therefore, they interpreted the underrepresentation of these categories not only as a reflection of diplomatic visual norms but also as a potential artifact of the applied algorithms.
Moreover, a small proportion of images (4.4%) were not assigned to any category in step B. Through in-depth qualitative examination of these unlabeled images, the authors ensured that no significant visual patterns had been overlooked. They also questioned whether these images shared specific characteristics that might indicate systematic bias in the classification model. Upon closer inspection, however, no systematic visual pattern emerged.
This review also encouraged them to reflect on the sociotechnical contexts under which the images were generated, including factors such as site and time of data collection. For example, the prevalence of images portraying online meetings in the dataset could reflect the collection period, marked by the COVID-19 pandemic. Furthermore, the authors considered how Twitter (now X) as a platform allows and promotes certain modes of communication (Duncombe, 2019), and the ways in which this may have influenced the types of diplomatic images that are shared and the overall patterns in the obtained data.
The authors engaged further, critically and qualitatively, with the classification results, zooming in on clusters that stood out. Applying a theoretical lens to interpret the repetition of specific labels prompted a deeper inquiry into what these emerging patterns reveal about the practice and performance of digital diplomacy. For example, images showing combinations of “flags,” “people posing,” “buildings,” and “people and tables” were interpreted as reflections of well-known (and seen!) symbolic practices in digital diplomacy on Twitter and beyond. These scenes—diplomats lined up in traditional Western attire, flags fluttering in the wind, well-trimmed and well-kept gardens, grand buildings, and important meetings—strategically conveyed stories of harmony and successful relations. At the same time, the authors identified an element of exclusion and hierarchy in these images: only certain ways of performing and displaying diplomacy appeared to be accepted, favoring Western-centric norms and aesthetics. This, despite the rich representation in the dataset of images shared by diplomats from the Global South.
By examining the co-occurrence of certain labels, such as “flags” and “people posing” often combining into visual compositions resembling diplomatic “family photos,” the authors moved beyond generic classifications toward unpacking their nuanced meanings. These images conveyed stories of peaceful relations and harmony. Yet, upon closer inspection—and drawing from knowledge of the specific political and historical contexts in which some of these meetings took place—an alternative reading emerged. In certain cases, these meetings and their carefully curated visual representations were intended to signal peace, yet they functioned as diplomatic performances designed to mask underlying conflicts.
To further contextualize their findings and enhance interpretative depth, the authors incorporated sociological theories of taste and visuality. These theoretical perspectives provided a critical lens for interpreting the broader patterns and situating the visual content within its historical and political contexts. By doing so, they not only traced how diplomatic aesthetics and rituals were constructed and maintained but also reflected on the hierarchies and exclusions embedded in these visual representations, revealing diplomacy as both a performative and power-laden practice in a globalized context.
In sum, step C invites the researcher to move beyond pattern exploration and detection toward context-sensitive interpretation. This involves an iterative dance between quantitative and qualitative levels of analysis, demonstrating the unique advantage of a computational inquiry: providing both overview and detail (Latour et al., 2012). Ultimately, by leveraging this framework, researchers move from pattern exploration (step A) and theory-driven image classification (step B) to a nuanced, critical context-sensitive interpretation (step C). This approach not only retains the systematicity and scale of computational methods but also enriches them with critical, creative, and nuanced contributions of human interpretative skills.
Conclusion
Analyzing the content of images with the eyes of a machine brings unique research opportunities for social research. As society becomes increasingly digitized, the abundance of visual data, together with the advancements in ML, opens new avenues for discovery, measurement, and inference (Grimmer et al., 2021). Computer vision emerges as an innovative tool thanks to its capacity to recognize complex patterns in large amounts of visual data; a difficult task that requires transcending the fundamental units of images—pixels—to grasp the web of relations they form. These opportunities are not solely about scale but also about new ways of organizing and interpreting data, and, ultimately, about revealing patterns previously unattainable (Gebru et al., 2017).
Political scientists Grimmer and colleagues (2021: 397) argue that integrating ML tools into social research “…invites us to reconsider the traditional deductive model of social science,” thus inaugurating new ways of doing social research in the “…era of abundance [of data].” Yet, their recent review on “machine learning for social science” barely mentions computer vision and visual data and focuses instead on applications to text corpora. Despite the potential benefits, social scientists have been somewhat hesitant to fully embrace the methodological potential of ML (Edelmann et al., 2020), largely leaving the utilization of such tools—and, in particular, of computer vision—to other disciplines or commercial entities, with some exceptions (e.g., Bernasco et al., 2023). This hesitancy is likely to be partly linked to a general—yet decreasing—lack of technical expertise on this matter, and partly to legitimate concerns about the opacity and possible biases of ML methods, particularly when applied to nontextual, highly contextual forms of data such as images (Buolamwini and Gebru, 2018; Burrell, 2016; Jacobsen, 2023). While algorithmic classifications may uncover large-scale patterns, these methods do not operate in a vacuum; they are shaped by training data and algorithms, which have epistemological implications that must be critically examined.
With our three-step practical guide, we have attempted to make the methodological opportunities brought by visual data and ML methods more accessible and transparent. This article shows how combining the processing power of computers with the context-sensitive interpretative skills of human researchers allows for the production of critical and rich knowledge about the social world. Through a structured, yet easily customizable, approach involving pattern exploration (Step A), theory-driven image classification (Step B), and context-sensitive interpretation (Step C), this framework empowers researchers to conduct comprehensive visual content analysis that is both systematic and sensitive to the diverse sociotechnical and cultural contexts within which images circulate.
Despite the methodological advantages of computer vision, there is—as Mitchell (2020) rightfully notes—a general tendency to overestimate the capabilities of ML systems: a reminder to stay attuned not only to their strengths but also to their weaknesses (Bernasco et al., 2023). Machine learning methods struggle to grasp multilayered social meanings in images, which extend well beyond a mere combination of pixels computed in vector space. While the multifaceted and ultimately cultural character of images is self-evident upon a qualitative examination, it can be a blind spot for machines (Maltezos et al., 2024), no matter how well they are trained and tuned. Recognizing and addressing these blind spots is not just an analytical necessity but an ethical one, ensuring that computational approaches to visual analysis do not obscure or distort the social meanings embedded in images.
Ultimately, this guide demonstrates how the scalability of computational methods can be utilized without compromising human interpretation and reflexivity. It equips social scientists with the tools necessary to engage with the abundance of visual data in ways that are feasible yet insightful. The outcome is a methodological approach that balances computational efficiency with interpretative depth, potentially contributing to a deeper understanding of the complex social dynamics encoded within images.
Acknowledgements
The authors would like to thank the editor, Matthew Zook, the anonymous reviewers, Rebecca Adler-Nissen, Anders Blok, and Ilir Rama for their valuable comments, critique, and encouragements during the development of this article. Special thanks are also extended to the Copenhagen Center for Social Data Science for their generous support of this research.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: European Research Council StG 68010; Carlsberg Foundation grant no. CF16–0012.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.