Abstract
Psychology has made great strides in how researchers collect, analyze, and report data, but there has been less attention to improving hypothesis generation. Some researchers still rely on intuition, serendipitous observations, or a limited reading of the literature to come up with a single idea about a relationship between constructs. Although this approach has led to valuable insights, it can constrain thinking and often fails to generate a full picture of what is going on. New approaches, however, allow researchers to cast a wider net. Specifically, by reducing the cost and effort of examining a broader set of potential variables, automated content analysis (i.e., computer-assisted methods for extracting features from unstructured data) can uncover new insights and help develop new theories. We describe how these techniques can be applied to various research questions and outline methods and criteria that can be used to gain a wider perspective. In sum, automated content analysis is a powerful tool for identifying new and important phenomena, building (and sharpening) theory, and increasing impact.
Over the past decade, concerns about p-hacking and data fabrication have led to an overhaul of how data are collected and analyzed. Rather than simply reporting results, researchers are now encouraged to preregister analyses, demonstrate replicability, and consider the file drawer.
Although much has changed about the back end of the research process (i.e., how researchers collect, analyze, and report data), the front end (i.e., hypothesis generation) has remained relatively untouched. Initial research ideas are often the result of intuition, serendipitous observations, or reading prior work. This is followed by a more systematic literature review and, if an idea still seems novel enough, initial empirical work (e.g., conducting an experiment). If results conform to expectations, the investigation continues; if not, it is eventually abandoned (or null results are published). 1
Although this (often deductive) approach is useful in some ways, it can be quite narrow. Each source of hypothesis generation (e.g., intuition or literature review) is a biased convenience sample. How likely is it that whatever hypotheses researchers happen to come up with are the most interesting, novel, or important to examine? Or best explain important phenomena?
This article highlights an alternate approach. Building on the rise of automated content analysis (i.e., computer-assisted methods for extracting features from text, image, audio, and video data), we suggest it can be useful to cast a wider net during hypothesis generation. Rather than simply identifying a single abstract conceptual relationship, researchers are increasingly interested in explaining important, complex phenomena (e.g., the diffusion of misinformation or the widening political divide). Consequently, rather than focusing on what has already been done, or what happens to come to mind, we need better ways to generate, sharpen, and develop novel theories (rather than simply extending existing ones). We need to look at multiple variables in parallel and optimize rather than satisfice.
To speak to these challenges, we review research from across the social sciences to showcase how automated content analysis can help cast a wider net, specifically by facilitating exploration (i.e., making it cheaper and easier to examine multiple variables at once) and by identifying relationships that one might not have generated independently. Along the way, we showcase how this approach can increase contribution, rule out alternative explanations, and help identify larger and more consequential effects.
Casting a Wider Net
Psychological research often sets out to test a particular theory. Researchers think a particular independent variable might influence a certain dependent variable (or that a particular process might explain part of an independent variable’s influence on a dependent variable), so they design an experiment to test it.
Theory testing is one source of constraint in the variable-selection process (hypotheses are often based on theories the researcher has worked on in the past), but the nature of experiments themselves can also be restricting. Although experiments are great for testing the causal impact of one variable on another, it is often difficult (and costly) to manipulate many factors at once. This tends to reduce the set of independent variables examined. Further, although it is possible to measure multiple process (or even dependent) variables, adding more measures often increases participant fatigue and reduces the accuracy of responses (e.g., Li et al., 2022).
Automated content analysis can help cast a wider net. Specifically, it facilitates exploration by reducing the difficulty and cost of generating and performing preliminary analysis on many variables at once. Consequently, rather than focusing on a binary question (e.g., whether or not X impacts Y), it allows researchers to ask more open-ended questions (e.g., what makes content viral or how language and paralanguage shape social interactions).
To explore such (often inductive) questions, one might start by collecting secondary data. This might include information on how many times a piece of content was shared online, or records of social interactions paired with a relevant outcome measure.
Next, relevant features or variables can be extracted (for example approaches, see Table 1). Language features have received the most attention. Dictionaries (e.g., Berger, Sherman, et al., 2020; Boyd et al., 2022; Rocklage et al., 2018) can extract features such as pronoun use, emotionality, or linguistic concreteness, and more complex techniques such as topic modeling (Blei et al., 2003) and embeddings (Mikolov et al., 2013) can extract key themes or the semantic progression of discourse (Toubia et al., 2021).
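To make this concrete, here is a minimal sketch of dictionary- and topic-model-style feature extraction in Python. The example posts, word lists, and column names are toy illustrations (not validated instruments such as LIWC categories), and the approach is one of many possible.

```python
# Minimal sketch: dictionary- and topic-model-style feature extraction.
# The example posts and word lists are toy illustrations, not validated dictionaries.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = pd.DataFrame({"text": [
    "I absolutely loved this amazing product",
    "We went to the store and bought some bread",
    "You will not believe what happened to me yesterday",
]})

# Dictionary-style features: count how many words fall in each (toy) category
emotion_words = {"loved", "amazing", "hate", "terrible", "happy", "sad"}
pronouns = {"i", "we", "you", "me", "my", "our"}

def dictionary_count(text, word_set):
    return sum(token in word_set for token in text.lower().split())

posts["emotionality"] = posts["text"].apply(lambda t: dictionary_count(t, emotion_words))
posts["pronoun_use"] = posts["text"].apply(lambda t: dictionary_count(t, pronouns))

# Topic-model features: each document's loading on latent topics
doc_term = CountVectorizer(stop_words="english").fit_transform(posts["text"])
topic_loadings = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(doc_term)
posts[["topic_1", "topic_2"]] = topic_loadings

print(posts)
```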
Table 1. Some Techniques to Cast a Wider Net
Although most psychologists are at least somewhat familiar with automated textual analysis (i.e., computer-assisted methods for extracting features from language; for reviews, see Berger & Packard, 2022; Berger, Humphreys, et al., 2020; Boyd & Markowitz, 2024), similar approaches can be used to extract audio, image, or video features. Praat (Boersma, 2001) takes audio files and measures features such as pitch, tone, and intensity; Google’s Cloud Vision API uses deep learning models and manual coding to extract information from images (for other approaches, see Dzyabura et al., 2021); and features of movement (e.g., velocity or asymmetry; Bravin et al., 2025) can be extracted from videos.
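As a rough illustration of the audio case, the sketch below pulls pitch and intensity summaries from a hypothetical recording using praat-parselmouth, a Python interface to Praat; the file name and choice of summary statistics are assumptions for demonstration.

```python
# Minimal sketch: extracting vocal features (pitch, intensity) with Praat's
# algorithms via the praat-parselmouth wrapper. "interaction.wav" is a
# hypothetical recording; the summary statistics are illustrative choices.
import numpy as np
import parselmouth

sound = parselmouth.Sound("interaction.wav")

pitch = sound.to_pitch()                    # fundamental frequency track
f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]                             # keep voiced frames only
intensity = sound.to_intensity()            # loudness track (dB)

features = {
    "mean_pitch_hz": float(np.mean(f0)),
    "pitch_variability_hz": float(np.std(f0)),
    "mean_intensity_db": float(np.mean(intensity.values)),
}
print(features)
```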
Once extracted, these features can be used in various ways. They can serve as independent variables (e.g., to see which, if any, are linked to a dependent variable), dependent variables, controls (i.e., to rule out spurious correlations or test alternative explanations), or potential mediators or moderators (i.e., to explore potential underlying processes). To understand what makes content viral, for example, Berger and Milkman (2012) used automated textual analysis to measure various potential independent variables (i.e., different emotions evoked by news articles). Similarly, to understand what drives satisfaction with social interactions, Van Zant et al. (2024) used text and audio analytics to measure potential underlying processes (e.g., assents). 2 And rather than collecting traditional Likert-scale dependent measures in an experiment, researchers can ask participants to describe how they think about something, or what they associate it with, and use the words provided to measure attitudes or other potential outcomes.
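As a minimal sketch of the independent-variable use case, extracted emotion features might predict (log) shares while controlling for other article characteristics, as in the regression below; the data file and column names are hypothetical stand-ins for features produced by the steps above.

```python
# Minimal sketch: extracted features as independent variables (and controls)
# in a regression predicting an outcome. File and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

articles = pd.read_csv("articles_with_features.csv")

model = smf.ols(
    "log_shares ~ anger + sadness + awe + usefulness + word_count + C(section)",
    data=articles,
).fit()
print(model.summary())
```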
Overall, automated content analysis helps cast a wider net. Reducing the cost of examining multiple variables at once facilitates exploration. This can increase the likelihood of identifying novel relationships that may not have been predicted a priori and allows researchers to examine which factors have the largest relationship with a key outcome, even controlling for others.
Algorithmic Discovery
Although automated content analysis makes it easier to cast a wider net, it is still somewhat constrained. After all, one still decides which software package to use to extract features, so the variables may still be restricted by what measures are included in available packages or the time required to update code and include additional features.
Machine learning (i.e., a field of AI concerned with statistical algorithms that can learn from data) can facilitate even more unstructured exploration. Rather than relying on existing features (e.g., concrete language), by representing text as points in a multidimensional space (e.g., Mikolov et al., 2013) or breaking images down into pixels, researchers can discover new features and thus novel relationships that may not have been previously theorized. Researchers can then interrogate these features and relationships to understand the underlying psychological processes driving them.
Ludwig and Mullainathan (2024), for example, used this approach to explore novel drivers of judicial decision-making. A model built from defendants’ mug shots, using all of the face pixels, explained a good deal of variation in judges’ decisions about whom to jail (even controlling for other information such as the crime and demographics). But this does not explain what, in particular, about the defendants’ appearance influenced judges’ decisions, or why. So, to gain deeper insight, the authors generated pairs of mug shots that were similar on all other dimensions (i.e., age, gender, and race) but for which the model predicted different outcomes (i.e., judges’ decisions) and asked humans what differentiated them. By exploring these comments, and considering their relation to existing theory, the authors identified two key features: how “well groomed” (e.g., tidy vs. unkempt) the defendants looked and how “heavy-faced” (e.g., wide or round) they were. A subsequent experiment confirmed these variables’ causal impact, and future work could explore the potential psychological processes (e.g., stereotyping or theories of self-control) behind why these novel dimensions played a role.
Similar approaches can be applied more broadly. Although prior work has used existing textual features to understand what drives word of mouth, for example, a more unstructured approach can provide additional insight. Dore and Berger (2024) embedded 3.5 million social media posts in a multidimensional space, reduced its dimensionality, and then used the scores on the key dimensions to predict shares. Three latent features explained much of the variation, and by exploring correlations with known linguistic features, the authors found that negative past events, second-person promotional appeals, and sensory experiences seemed to be linked to increased sharing. They then used these results to begin to develop theories for why these relationships occurred and what might be driving them.
Although these two examples used different content (i.e., images and text) to address different questions, their underlying approaches were similar. Both combined more open-ended feature extraction with machine learning and then interrogated the resulting models to explore which features were driving the outcome. Once important relationships were identified, researchers then examined how existing theories could explain them or whether new theories were necessary. Follow-up experiments could then test those theories in more detail and shed light on underlying psychological processes.
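A stripped-down version of the text-based workflow (not the authors’ exact pipeline) might look like the sketch below, assuming a generic sentence-embedding model and hypothetical file and column names: embed posts, reduce dimensionality, relate the latent dimensions to sharing, and then correlate those dimensions with known linguistic features to interpret them.

```python
# Minimal sketch: embed posts, reduce dimensionality, predict shares from the
# latent dimensions, then interpret those dimensions via known features.
# The embedding model, file name, and column names are illustrative choices.
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

posts = pd.read_csv("posts.csv")   # hypothetical: text, shares, known linguistic features

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(posts["text"].tolist())
latent = PCA(n_components=3, random_state=0).fit_transform(embeddings)

# How much variation in sharing do the latent dimensions explain?
model = LinearRegression().fit(latent, np.log1p(posts["shares"]))
print("R^2:", model.score(latent, np.log1p(posts["shares"])))

# Interpret each latent dimension via correlations with known features
for j in range(latent.shape[1]):
    for feature in ["negativity", "second_person", "sensory_language"]:
        r = np.corrcoef(latent[:, j], posts[feature])[0, 1]
        print(f"dim {j} vs {feature}: r = {r:.2f}")
```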
Prioritization
Because casting a wider net often identifies multiple factors that are linked to an important outcome, researchers must decide which to prioritize for further investigation (e.g., in confirmatory experiments). One simple (albeit not always the best) criterion is effect size, or the amount of variance explained (for related discussion, see Yarkoni & Westfall, 2017). Which independent variable, for example, has the largest effect on the dependent variable? In secondary data, this should help identify variables that are more likely to drive actual behavior (as long as the context explored is similar to the one in which the findings are being applied). 3
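One way to operationalize this criterion, sketched below under hypothetical file and variable names, is to compare standardized coefficients and the variance explained that is lost when each candidate predictor is dropped; this is an illustrative approach rather than the only (or best) way to gauge effect size.

```python
# Minimal sketch: ranking candidate predictors by effect size.
# File and variable names are hypothetical.
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("candidate_features.csv")
predictors = ["anger", "awe", "usefulness", "concreteness", "word_count"]

cols = predictors + ["log_shares"]
z = (data[cols] - data[cols].mean()) / data[cols].std()   # standardize all variables

full = sm.OLS(z["log_shares"], sm.add_constant(z[predictors])).fit()
ranking = full.params.drop("const").abs().sort_values(ascending=False)
print(ranking)                                            # standardized betas, largest first

for p in predictors:                                      # R^2 lost when each predictor is dropped
    reduced = sm.OLS(z["log_shares"],
                     sm.add_constant(z[[q for q in predictors if q != p]])).fit()
    print(p, "delta R^2:", round(full.rsquared - reduced.rsquared, 3))
```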
Novelty (i.e., whether something has been identified previously) or counterintuitiveness may also be worth considering. When analyzing what makes content viral, for example, Berger and Milkman (2012) found that various features (e.g., positivity and usefulness) played a role but focused on features that had not been explored previously (i.e., specific emotions such as anger and sadness) for subsequent follow-up experiments (for a similar approach, see Sheetal et al., 2020).
It may also be worth considering theoretical relevance (i.e., prioritizing variables that speak to prevailing theory) or regularity (i.e., consistency across a variety of contexts). Even if particular relationships are not the most novel or counterintuitive, to the degree that they expand the scope of existing theory or seem to be generalizable, they may be worth examining further.
Helping to Build and Sharpen Theory
Although automated content analysis is well suited for identifying novel relationships, it may not always shed light on causal mechanisms. Finding that two variables are related, for example, does not always specify why. Consequently, as discussed, this approach can be part of an iterative process in which initial findings are used to design subsequent experiments. Berger and Milkman (2012), for example, identified a relationship in the field (i.e., articles that evoked more of certain emotions were more likely to be highly shared) and then conducted follow-up experiments manipulating those specific emotions and testing potential underlying processes (i.e., arousal). By manipulating key features of interest and measuring potential underlying processes, such experiments can both underscore the relationship’s causal nature and identify the mechanism behind it.
Along these lines, although one could wonder whether exploration is somehow antitheoretical, this is not the case. Although automated content analysis can certainly be used for prediction (Yarkoni & Westfall, 2017), by making it easier to generate and analyze many variables at once, it is also a valuable tool for understanding. By allowing researchers to more easily identify previously overlooked relationships or mechanisms, these approaches are ideal for helping to build and refine theory. Identifying novel relationships, or factors that predict important outcomes, can encourage researchers to develop new theories that explain them.
Theory can also help researchers select and interpret variables. When deciding which variables to include in initial analyses, for example, existing theory can help determine what to focus on. Similarly, in attempting to understand why an observed relationship might occur, existing theory can be vital in the decoding process. It can shed light on the potential underlying process behind such a relationship or suggest more accurate and precise variables that better capture the true relationship.
Additional Points
Although simultaneously exploring many independent variables may generate questions about multiple-hypothesis testing, there are ways to address this. Corrections (e.g., Benjamini & Hochberg, 1995; see Batista & Ross, 2024) and cross-validation or out-of-sample testing (i.e., testing the model on a subset of data that were not used in estimating it) can mitigate such concerns (and address questions about overfitting the training data).
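As an illustration, the sketch below applies a Benjamini-Hochberg correction across many feature-outcome tests and then checks out-of-sample predictive performance via cross-validation; the file and variable names are hypothetical, and the specific model is an arbitrary choice.

```python
# Minimal sketch: Benjamini-Hochberg correction across many tests, plus a
# cross-validated (out-of-sample) check on overfitting. Names are hypothetical.
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

data = pd.read_csv("candidate_features.csv")
features = [c for c in data.columns if c != "log_shares"]

# One correlation test per candidate feature, corrected for multiple comparisons
pvals = [stats.pearsonr(data[f], data["log_shares"])[1] for f in features]
rejected, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(dict(zip(features, p_adj)))

# Out-of-sample check: does a model using all features predict held-out data?
scores = cross_val_score(Ridge(), data[features], data["log_shares"], cv=5, scoring="r2")
print("cross-validated R^2:", scores.mean())
```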
One might also wonder about causality or spurious correlations that do not generalize to data sets collected under slightly different conditions. Follow-up experiments that directly manipulate identified independent variables can be useful here.
Other methods beyond automated content analysis can also help cast a wider net. Systematically surveying the literature, for example, or conducting pilot studies can sometimes broaden the set of variables considered (although they may still tend to constrain examination to previously discussed factors). Qualitative research (e.g., interviews) or existing surveys that include multiple measures (e.g., Sheetal et al., 2020) may prove more useful. Optimally designing experiments (e.g., Kim et al., 2014) to maximize diagnostic relevance relative to existing theories can also be valuable.
Finally, note that automated approaches involving machine learning can be susceptible to social or cultural biases (i.e., they reflect the data on which they were originally trained). Large language models are often trained on a particular corpus, for example, and thus may replicate, or even exacerbate, racial or gender biases present in that culture. This can lead to misleading conclusions if not properly accounted for.
Conclusion
Although behavioral science has made great strides in how researchers collect, analyze, and report data, there has been less attention to improving hypothesis generation and, by extension, theory development. Rather than relying on intuition, serendipitous observations, or a selective reading of the literature, casting a wider net may be beneficial. Automated content analysis can help facilitate a more systematic exploratory process that emphasizes variable discovery and prioritization. Doing so should help identify novel and more important drivers of real-world phenomena (Markowitz et al., 2024) and increase psychology’s impact as a field.
Recommended Reading
Batista, R. M., & Ross, J. (2024). (See References). Uses large language models, machine learning, and experiments to generate and test interpretable hypotheses from text.
Berger, J., & Milkman, K. L. (2012). (See References). Uses automated content analysis to identify a novel relationship in field data, to test causality, and to pinpoint underlying processes.
Berger, J., & Packard, G. (2022). (See References). Reviews various approaches to natural language processing and how they can be used to provide insight into psychology and culture.
Ludwig, J., & Mullainathan, S. (2024). (See References). Uses algorithmic discovery to identify novel drivers of an empirical phenomenon.
