Abstract
Psychology has made great strides in how researchers collect, analyze, and report data, but there has been less attention to improving hypothesis generation. Some researchers still rely on intuition, serendipitous observations, or a limited reading of the literature to come up with a single idea about a relationship between constructs. Although this approach has led to valuable insights, it can constrain thinking and often fails to generate a full picture of what is going on. New approaches, however, allow researchers to cast a wider net. Specifically, by reducing the cost and effort of examining a broader set of potential variables, automated content analysis (i.e., computer-assisted methods for extracting features from unstructured data) can uncover new insights and help develop new theories. We describe how these techniques can be applied to various research questions and outline methods and criteria that can be used to gain a wider perspective. In sum, automated content analysis is a powerful tool for identifying new and important phenomena, building (and sharpening) theory, and increasing impact.
Over the past decade, concerns about p-hacking and data fabrication have led to an overhaul of how data are collected and analyzed. Rather than simply reporting results, researchers are now encouraged to preregister analyses, demonstrate replicability, and consider the file drawer.
Although much has changed about the back end of the research process (i.e., how researchers collect, analyze, and report data), the front end (i.e., hypothesis generation) has remained relatively untouched. Initial research ideas are often the result of intuition, serendipitous observations, or reading prior work. This is followed by a more systematic literature review and, if an idea still seems novel enough, initial empirical work (e.g., conducting an experiment). If results conform to expectations, the investigation continues; if not, it is eventually abandoned (or null results are published). 1
Although this (often deductive) approach is useful in some ways, it can be quite narrow. Each source of hypothesis generation (e.g., intuition or literature review) is a biased convenience sample. How likely is it that whatever hypotheses researchers happen to come up with are the most interesting, novel, or important to examine? Or best explain important phenomena?
This article highlights an alternate approach. Building on the rise of automated content analysis (i.e., computer-assisted methods for extracting features from text, image, audio, and video data), we suggest it can be useful to cast a wider net during hypothesis generation. Rather than simply identifying a single abstract conceptual relationship, researchers are increasingly interested in explaining important, complex phenomena (e.g., the diffusion of misinformation or the widening political divide). Consequently, rather than focusing on what has already been done, or what happens to come to mind, we need better ways to generate, sharpen, and develop novel theories (rather than simply extending existing ones). We need to look at multiple variables in parallel and optimize rather than satisfice.
To speak to these challenges, we review research from across the social sciences to showcase how automated content analysis can help cast a wider net, specifically by facilitating exploration (i.e., making it cheaper and easier to examine multiple variables at once) and by identifying relationships that one might not have generated independently. Along the way, we showcase how this approach can increase contribution, rule out alternative explanations, and help identify larger and more consequential effects.
Casting a Wider Net
Psychological research often sets out to test a particular theory. Researchers think a particular independent variable might influence a certain dependent variable (or that a particular process might explain part of an independent variable’s influence on a dependent variable), so they design an experiment to test it.
Theory testing is one source of constraint in the variable-selection process (hypotheses are often based on theories the researcher has worked on in the past), but the nature of experiments themselves can also be restricting. Although experiments are great for testing the causal impact of one variable on another, it is often difficult (and costly) to manipulate many factors at once. This tends to reduce the set of independent variables examined. Further, although it is possible to measure multiple process (or even dependent) variables, adding more measures often increases participant fatigue and reduces the accuracy of responses (e.g., Li et al., 2022).
Automated content analysis can help cast a wider net. Specifically, it facilitates exploration by reducing the difficulty and cost of generating and performing preliminary analysis on many variables at once. Consequently, rather than focusing on a binary question (e.g., whether or not X impacts Y), it allows researchers to ask more open-ended questions (e.g., what makes content viral or how language and paralanguage shape social interactions).
To explore such (often inductive) questions, one might start by collecting secondary data. This might include information on how many times a piece of content was shared online, or records of social interactions paired with a relevant outcome measure.
Next, relevant features or variables can be extracted (for example approaches, see Table 1). Language features have received the most attention. Dictionaries (e.g., Berger, Sherman, et al., 2020; Boyd et al., 2022; Rocklage et al., 2018) can extract features such as pronoun use, emotionality, or linguistic concreteness, and more complex techniques such as topic modeling (Blei et al., 2003) and embeddings (Mikolov et al., 2013) can extract key themes or the semantic progression of discourse (Toubia et al., 2021).
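To make this concrete, here is a minimal sketch of dictionary- and topic-model-style feature extraction in Python. The example posts, word lists, and column names are toy illustrations (not validated instruments such as LIWC categories), and the approach is one of many possible.

```python
# Minimal sketch: dictionary- and topic-model-style feature extraction.
# The example posts and word lists are toy illustrations, not validated dictionaries.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = pd.DataFrame({"text": [
    "I absolutely loved this amazing product",
    "We went to the store and bought some bread",
    "You will not believe what happened to me yesterday",
]})

# Dictionary-style features: count how many words fall in each (toy) category
emotion_words = {"loved", "amazing", "hate", "terrible", "happy", "sad"}
pronouns = {"i", "we", "you", "me", "my", "our"}

def dictionary_count(text, word_set):
    return sum(token in word_set for token in text.lower().split())

posts["emotionality"] = posts["text"].apply(lambda t: dictionary_count(t, emotion_words))
posts["pronoun_use"] = posts["text"].apply(lambda t: dictionary_count(t, pronouns))

# Topic-model features: each document's loading on latent topics
doc_term = CountVectorizer(stop_words="english").fit_transform(posts["text"])
topic_loadings = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(doc_term)
posts[["topic_1", "topic_2"]] = topic_loadings

print(posts)
```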
Table 1. Some Techniques to Cast a Wider Net
Although most psychologists are at least somewhat familiar with automated textual analysis (i.e., computer-assisted methods for extracting features from language; for reviews, see Berger & Packard, 2022; Berger, Humphreys, et al., 2020; Boyd & Markowitz, 2024), similar approaches can be used to extract audio, image, or video features. Praat (Boersma, 2001) takes audio files and measures features such as pitch, tone, and intensity; Google’s Cloud Vision API uses deep learning models and manual coding to extract information from images (for other approaches, see Dzyabura et al., 2021); and features of movement (e.g., velocity or asymmetry; Bravin et al., 2025) can be extracted from videos.
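As a rough illustration of the audio case, the sketch below pulls pitch and intensity summaries from a hypothetical recording using praat-parselmouth, a Python interface to Praat; the file name and choice of summary statistics are assumptions for demonstration.

```python
# Minimal sketch: extracting vocal features (pitch, intensity) with Praat's
# algorithms via the praat-parselmouth wrapper. "interaction.wav" is a
# hypothetical recording; the summary statistics are illustrative choices.
import numpy as np
import parselmouth

sound = parselmouth.Sound("interaction.wav")

pitch = sound.to_pitch()                    # fundamental frequency track
f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]                             # keep voiced frames only
intensity = sound.to_intensity()            # loudness track (dB)

features = {
    "mean_pitch_hz": float(np.mean(f0)),
    "pitch_variability_hz": float(np.std(f0)),
    "mean_intensity_db": float(np.mean(intensity.values)),
}
print(features)
```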
Once extracted, these features can be used in various ways. They can serve as independent variables (e.g., to see which, if any, are linked to a dependent variable), dependent variables, controls (i.e., to rule out spurious correlations or test alternative explanations), or potential mediators or moderators (i.e., to explore potential underlying processes). To understand what makes content viral, for example, Berger and Milkman (2012) used automated textual analysis to measure various potential independent variables (i.e., different emotions evoked by news articles). Similarly, to understand what drives satisfaction with social interactions, Van Zant et al. (2024) used text and audio analytics to measure potential underlying processes (e.g., assents). 2 And rather than collecting traditional Likert-scale dependent measures in an experiment, researchers can ask participants to describe how they think about something, or what they associate it with, and use the words provided to measure attitudes or other potential outcomes.
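As a minimal sketch of the independent-variable use case, extracted emotion features might predict (log) shares while controlling for other article characteristics, as in the regression below; the data file and column names are hypothetical stand-ins for features produced by the steps above.

```python
# Minimal sketch: extracted features as independent variables (and controls)
# in a regression predicting an outcome. File and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

articles = pd.read_csv("articles_with_features.csv")

model = smf.ols(
    "log_shares ~ anger + sadness + awe + usefulness + word_count + C(section)",
    data=articles,
).fit()
print(model.summary())
```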
Overall, automated content analysis helps cast a wider net. Reducing the cost of examining multiple variables at once facilitates exploration. This can increase the likelihood of identifying novel relationships that may not have been predicted a priori and allows researchers to examine which factors have the largest relationship with a key outcome, even controlling for others.
Algorithmic Discovery
Although automated content analysis makes it easier to cast a wider net, it is still somewhat constrained. After all, one still decides which software package to use to extract features, so the variables may still be restricted by what measures are included in available packages or the time required to update code and include additional features.
Machine learning (i.e., a field of AI concerned with statistical algorithms that can learn from data) can facilitate even more unstructured exploration. Rather than relying on existing features (e.g., concrete language), by representing text as points in a multidimensional space (e.g., Mikolov et al., 2013) or breaking images down into pixels, researchers can discover new features and thus novel relationships that may not have been previously theorized. Researchers can then interrogate these features and relationships to understand the underlying psychological processes driving them.
Ludwig and Mullainathan (2024), for example, used this approach to explore novel drivers of judicial decision-making. A model built from defendants’ mug shots, using all of the face pixels, explained a good deal of variation in judges’ decisions about whom to jail (even controlling for other information such as the crime and demographics). But this does not explain what, in particular, about the defendants’ appearance influenced judges’ decisions, or why. So, to gain deeper insight, the authors generated pairs of mug shots that were similar on all other dimensions (i.e., age, gender, and race) but for which the model predicted different outcomes (i.e., judges’ decisions) and asked humans what differentiated them. By exploring these comments, and considering their relation to existing theory, the authors identified two key features: how “well groomed” (e.g., tidy vs. unkempt) the defendants looked and how “heavy-faced” (e.g., wide or round) they were. A subsequent experiment confirmed these variables’ causal impact, and future work could explore the potential psychological processes (e.g., stereotyping or theories of self-control) behind why these novel dimensions played a role.
Similar approaches can be applied more broadly. Although prior work has used existing textual features to understand what drives word of mouth, for example, a more unstructured approach can provide additional insight. Dore and Berger (2024) embedded 3.5 million social media posts in a multidimensional space, reduced its dimensionality, and then used the scores on the key dimensions to predict shares. Three latent features explained much of the variation, and by exploring correlations with known linguistic features, the authors found that negative past events, second-person promotional appeals, and sensory experiences seemed to be linked to increased sharing. They then used these results to begin to develop theories for why these relationships occurred and what might be driving them.
Although these two examples used different content (i.e., images and text) to address different questions, their underlying approaches were similar. Both combined more open-ended feature extraction with machine learning and then interrogated the resulting models to explore which features were driving the outcome. Once important relationships were identified, researchers then examined how existing theories could explain them or whether new theories were necessary. Follow-up experiments could then test those theories in more detail and shed light on underlying psychological processes.
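A stripped-down version of the text-based workflow (not the authors’ exact pipeline) might look like the sketch below, assuming a generic sentence-embedding model and hypothetical file and column names: embed posts, reduce dimensionality, relate the latent dimensions to sharing, and then correlate those dimensions with known linguistic features to interpret them.

```python
# Minimal sketch: embed posts, reduce dimensionality, predict shares from the
# latent dimensions, then interpret those dimensions via known features.
# The embedding model, file name, and column names are illustrative choices.
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

posts = pd.read_csv("posts.csv")   # hypothetical: text, shares, known linguistic features

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(posts["text"].tolist())
latent = PCA(n_components=3, random_state=0).fit_transform(embeddings)

# How much variation in sharing do the latent dimensions explain?
model = LinearRegression().fit(latent, np.log1p(posts["shares"]))
print("R^2:", model.score(latent, np.log1p(posts["shares"])))

# Interpret each latent dimension via correlations with known features
for j in range(latent.shape[1]):
    for feature in ["negativity", "second_person", "sensory_language"]:
        r = np.corrcoef(latent[:, j], posts[feature])[0, 1]
        print(f"dim {j} vs {feature}: r = {r:.2f}")
```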
Prioritization
Because casting a wider net often identifies multiple factors that are linked to an important outcome, researchers must decide which to prioritize for further investigation (e.g., in confirmatory experiments). One simple (albeit not always the best) criterion is effect size, or the amount of variance explained (for related discussion, see Yarkoni & Westfall, 2017). Which independent variable, for example, has the largest effect on the dependent variable? In secondary data, this should help identify variables that are more likely to drive actual behavior (as long as the context explored is similar to the one in which the findings are being applied). 3
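One way to operationalize this criterion, sketched below under hypothetical file and variable names, is to compare standardized coefficients and the variance explained that is lost when each candidate predictor is dropped; this is an illustrative approach rather than the only (or best) way to gauge effect size.

```python
# Minimal sketch: ranking candidate predictors by effect size.
# File and variable names are hypothetical.
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("candidate_features.csv")
predictors = ["anger", "awe", "usefulness", "concreteness", "word_count"]

cols = predictors + ["log_shares"]
z = (data[cols] - data[cols].mean()) / data[cols].std()   # standardize all variables

full = sm.OLS(z["log_shares"], sm.add_constant(z[predictors])).fit()
ranking = full.params.drop("const").abs().sort_values(ascending=False)
print(ranking)                                            # standardized betas, largest first

for p in predictors:                                      # R^2 lost when each predictor is dropped
    reduced = sm.OLS(z["log_shares"],
                     sm.add_constant(z[[q for q in predictors if q != p]])).fit()
    print(p, "delta R^2:", round(full.rsquared - reduced.rsquared, 3))
```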
Novelty (i.e., whether something has been identified previously) or counterintuitiveness may also be worth considering. When analyzing what makes content viral, for example, Berger and Milkman (2012) found that various features (e.g., positivity and usefulness) played a role but focused on features that had not been explored previously (i.e., specific emotions such as anger and sadness) for subsequent follow-up experiments (for a similar approach, see Sheetal et al., 2020).
It may also be worth considering theoretical relevance (i.e., prioritizing variables that speak to prevailing theory) or regularity (i.e., consistency across a variety of contexts). Even if particular relationships are not the most novel or counterintuitive, to the degree that they expand the scope of existing theory or seem to be generalizable, they may be worth examining further.
Helping to Build and Sharpen Theory
Although automated content analysis is well suited for identifying novel relationships, it may not always shed light on causal mechanisms. Finding that two variables are related, for example, does not always specify why. Consequently, as discussed, this approach can be part of an iterative process in which initial findings are used to design subsequent experiments. Berger and Milkman (2012), for example, identified a relationship in the field (i.e., articles that evoked more of certain emotions were more likely to be highly shared) and then conducted follow-up experiments manipulating those specific emotions and testing potential underlying processes (i.e., arousal). By manipulating key features of interest and measuring potential underlying processes, such experiments can both underscore the relationship’s causal nature and identify the mechanism behind it.
Along these lines, although one could wonder whether exploration is somehow antitheoretical, this is not the case. Although automated content analysis can certainly be used for prediction (Yarkoni & Westfall, 2017), by making it easier to generate and analyze many variables at once, it is also a valuable tool for understanding. By allowing researchers to more easily identify previously overlooked relationships or mechanisms, these approaches are ideal for helping to build and refine theory. Identifying novel relationships, or factors that predict important outcomes, can encourage researchers to develop new theories that explain them.
Theory can also help researchers select and interpret variables. When deciding which variables to include in initial analyses, for example, existing theory can help determine what to focus on. Similarly, in attempting to understand why an observed relationship might occur, existing theory can be vital in the decoding process. It can shed light on the potential underlying process behind such a relationship or suggest more accurate and precise variables that better capture the true relationship.
Additional Points
Although simultaneously exploring many independent variables may generate questions about multiple-hypothesis testing, there are ways to address this. Corrections (e.g., Benjamini & Hochberg, 1995; see Batista & Ross, 2024) and cross-validation or out-of-sample testing (i.e., testing the model on a subset of data that were not used in estimating it) can mitigate such concerns (and address questions about overfitting the training data).
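As an illustration, the sketch below applies a Benjamini-Hochberg correction across many feature-outcome tests and then checks out-of-sample predictive performance via cross-validation; the file and variable names are hypothetical, and the specific model is an arbitrary choice.

```python
# Minimal sketch: Benjamini-Hochberg correction across many tests, plus a
# cross-validated (out-of-sample) check on overfitting. Names are hypothetical.
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

data = pd.read_csv("candidate_features.csv")
features = [c for c in data.columns if c != "log_shares"]

# One correlation test per candidate feature, corrected for multiple comparisons
pvals = [stats.pearsonr(data[f], data["log_shares"])[1] for f in features]
rejected, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(dict(zip(features, p_adj)))

# Out-of-sample check: does a model using all features predict held-out data?
scores = cross_val_score(Ridge(), data[features], data["log_shares"], cv=5, scoring="r2")
print("cross-validated R^2:", scores.mean())
```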
One might also wonder about causality or spurious correlations that do not generalize to data sets collected under slightly different conditions. Follow-up experiments that directly manipulate identified independent variables can be useful here.
Other methods beyond automated content analysis can also help cast a wider net. Systematically surveying the literature, for example, or conducting pilot studies can sometimes broaden the set of variables considered (although they may still tend to constrain examination to previously discussed factors). Qualitative research (e.g., interviews) or existing surveys that include multiple measures (e.g., Sheetal et al., 2020) may prove more useful. Optimally designing experiments (e.g., Kim et al., 2014) to maximize diagnostic relevance relative to existing theories can also be valuable.
Finally, note that automated approaches involving machine learning can be susceptible to social or cultural biases (i.e., they reflect the data on which they were originally trained). Large language models are often trained on a particular corpus, for example, and thus may replicate, or even exacerbate, racial or gender biases present in that culture. This can lead to misleading conclusions if not properly accounted for.
Conclusion
Although behavioral science has made great strides in how researchers collect, analyze, and report data, there has been less attention to improving hypothesis generation and, by extension, theory development. Rather than relying on intuition, serendipitous observations, or a selective reading of the literature, casting a wider net may be beneficial. Automated content analysis can help facilitate a more systematic exploratory process that emphasizes variable discovery and prioritization. Doing so should help identify novel and more important drivers of real-world phenomena (Markowitz et al., 2024) and increase psychology’s impact as a field.
Recommended Reading
Batista, R. M., & Ross, J. (2024). (See References). Uses large language models, machine learning, and experiments to generate and test interpretable hypotheses from text.
Berger, J., & Milkman, K. L. (2012). (See References). Uses automated content analysis to identify a novel relationship in field data, to test causality, and to pinpoint underlying processes.
Berger, J., & Packard, G. (2022). (See References). Reviews various approaches to natural language processing and how they can be used to provide insight into psychology and culture.
Ludwig, J., & Mullainathan, S. (2024). (See References). Uses algorithmic discovery to identify novel drivers of an empirical phenomenon.
