Abstract

These three vignettes are recent examples of academic entrepreneurs promoting their artificial intelligence (AI) tools for management and organizational scholarship and for theory building (with or without empirical data) in particular. Each of these academics enthusiastically conveys that it is astounding to learn what AI can do if we were to add it to our methodological repertoire and use it to optimize our scholarly efforts of theory building and to write (read: “generate” or “produce”) many more papers each year. And to be fair, I am astounded: astounded at what I see as a shallow, backward view of theory building underneath these efforts and at an ignorance and recklessness that will not only erode the distinct value of theory but also contribute to a new and improved credibility crisis for our scholarly discipline that will be hard to bounce back from.
At their core, proposals for AI-augmented theory building are based on a presumed functional equivalence between disembodied algorithm-based computational machines and embodied brain-based human reasoning and intelligence. This simile is the reason why computer scientists such as Google engineer Agüera y Arcas (2022: 194) boldly claim that “large language models illustrate for the first time the way language understanding and intelligence can be dissociated from all the embodied and emotional characteristics we share with each other and with many other animals.” In the context of theory building, my focus here, this functional equivalence assumes that intelligent outputs, such as a text framing the theoretical motivation or, say, suggestions for construct labels and definitions, can be generated as effectively by AI and, with only one or a few prompts, far more quickly and efficiently. Like humans, after all, AI tools process input data (whose initial probability distribution may be unknown) through neural networks and use algorithms (embedded and learned mathematical instructions) to assign weights to any connections made and recalculate probabilities to mathematically settle on the “universally approximate” output that they generate. This calculus analogue, together with its training and scale, is what makes AI, in the words of Google and OpenAI engineers Peter Norvig and Andrej Karpathy, “unreasonably effective” and a superb way to “augment” human efforts.
When, guided by these presumptions, we slot AI agents into our theory building, we adopt what the philosopher Cartwright (1999) playfully dubbed a “vending machine view” of theory building. As an author, “you feed it input in certain prescribed forms for the desired output; it gurgitates for a while; then it drops out the sought-for representation, plonk, on the tray, fully formed, as Athena from the brain of Zeus” (Cartwright, 1999, p. 247). Cartwright had offered the metaphor to critique the formal, mathematical calculus by which logical empiricism had presumed theory to be generated, a presumption that has since been convincingly refuted. Ironically, it seems to have returned with AI-augmented theory building, which likewise places (blind) faith in the calculus analogy and uses a (vending) machine as a substitute, either fully or in part, for the breadth, complexities, and nuances of human inferential reasoning that otherwise constitutes theory building.
Even with a “human in the loop,” this mechanical use of AI tools will fundamentally shift and reconfigure how we understand theory building by aligning it with the technology’s code, algorithms, and generated simulations. For example, when we use the “agentic AI theory-building tool” (vignette 3), we come to take a syntactically generated construct label and definition as a proxy for what otherwise would be much more thorough and deep human inferential processes of construct development and validation in theoretical work (see Cornelissen et al., 2026). Similarly, use of the AI-augmented qualitative tool (vignette 2) leads us to trade situated, reflexive processes of human inferential-logical reasoning for a statistical learning process that, instead of induction or abduction, is better described as “transductive inference” (Vapnik, 1999, p. 293): Its learned algorithms statistically recognize a class from examples and generate further exemplars (categories) and connections based on knowledge of the class, creating a standalone “indexical pattern that generalizes to instances of the function” (Weatherby & Justie, 2022, p. 395).
These tools, however, not only will reconfigure theory building, but their use also will create a new and improved credibility crisis with many more simulated theoretical arguments, frameworks, and explanations that are false positives, industrial-scale data mining, and HARKing (hypothesizing after the results are known) and with a construct proliferation problem that, fueled by AI, appears to have no bounds. Consider, for example, the AI-powered management research feedback tool (vignette 1). Although branded as an evaluation tool, an author can easily leverage the Claude-based environment to machine generate complete papers in line with the canvas that it has been trained on (as demonstrated by Novy-Marx & Velikov, 2026). The simulated theoretical motivation, constructs, relationships, and mechanisms that are machine generated in this way will, after a few prompting cycles, create a line of argument that contracts the different parts of the canvas into a compelling “story” that seemingly has enough fidelity with the data. Crucially, however, this machine-generated strategy of matching (somewhat) unmotivated data to a theoretically motivated set of constructs, hypothesized relationships, and mechanisms harbors the real risk of false positives. Claude essentially paraphrases the data into a fitting “just so” story that can readily explain in-sample significant differences or patterns based on a matching theoretical frame while effectively ignoring preanalysis probabilities, whether the same data have as much power in predicting out-of-sample differences, and how they may be equally, or better, explained in alternative ways.
As a stochastic technology, the same prompts in the Claude-based tool also can machine generate different paper versions with different theoretical storylines coherently fitting the same data. Using Claude, Novy-Marx and Velikov (2026, p. 7) machine generated 380 full-paper versions (!) in this way, whereby the “theoretical frameworks are automatically generated” while ensuring “that all empirical analyses and statistical validations are conducted using rigorous methods developed in the academic literature, ensuring the reliability (if not the interpretation) of the underlying findings.” Based on this example, it is not difficult to see how tools such as the management research feedback tool not only will produce many false positives but also will foster data-mining exercises at scale, with Claude (or another LLM) being used to machine generate plausible ex post theoretical stories and explanations to fit observed empirical patterns (HARKing).
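The statistical core of this concern can be illustrated with a small simulation. What follows is only a minimal sketch under stated assumptions, not the procedure used by Novy-Marx and Velikov (2026) or by any of the tools discussed here. All data are pure noise, the function names (`pearson`, `simulate`) and parameter values are hypothetical, and the significance cutoff is a normal approximation. In each trial, many unrelated candidate predictors are screened against an outcome, and the one that fits best in sample is then checked against fresh data.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def simulate(n_trials=200, n_predictors=40, n_obs=60, seed=0):
    """Screen noise predictors against a noise outcome, then retest the winner.

    Returns a pair: the share of trials in which the best in-sample predictor
    looks "significant" in sample, and the share in which that same predictor
    looks "significant" on fresh, out-of-sample data.
    """
    rng = random.Random(seed)
    # Assumption: approximate two-sided 5% cutoff for |r| when the true r is 0.
    threshold = 1.96 / math.sqrt(n_obs)
    in_hits = 0
    out_hits = 0
    for _ in range(n_trials):
        y_in = [rng.gauss(0, 1) for _ in range(n_obs)]
        y_out = [rng.gauss(0, 1) for _ in range(n_obs)]
        best_r_in, best_r_out = 0.0, 0.0
        for _ in range(n_predictors):
            # Each predictor is unrelated to the outcome by construction.
            x_in = [rng.gauss(0, 1) for _ in range(n_obs)]
            x_out = [rng.gauss(0, 1) for _ in range(n_obs)]
            r_in = pearson(x_in, y_in)
            if abs(r_in) > abs(best_r_in):
                # Keep the predictor that fits the in-sample data best.
                best_r_in = r_in
                best_r_out = pearson(x_out, y_out)
        in_hits += abs(best_r_in) > threshold
        out_hits += abs(best_r_out) > threshold
    return in_hits / n_trials, out_hits / n_trials


if __name__ == "__main__":
    in_rate, out_rate = simulate()
    print(f"best predictor 'significant' in sample:     {in_rate:.2f}")
    print(f"same predictor 'significant' out of sample: {out_rate:.2f}")
```

Because every variable in the sketch is noise, any "significant" in-sample pattern is a false positive by construction. The gap between the two rates shows why an ex post story fitted to the best-looking pattern says little about whether that pattern would hold in new data.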
Before generative AI emerged, management and organizational research already had a construct proliferation problem, with thousands of new constructs being introduced each year and artificially labeled as distinct. A key capability of AI tools, as stochastic parrots, is that they can constantly “paraphrase” and autogenerate new text such that, with prompting, such tools can continuously generate new suggestive labels, definitions, and word-embedded representations that fit the data well enough (whether the input data are, say, prior theoretical literature or empirical data). Motivated by the premium on theoretical contributions, an author may prompt an AI tool (which itself has no grounding in the world or semantic core of its own) to keep paraphrasing until a seemingly novel construct is generated in this way. If done at scale, it amplifies an already big problem into one of massive proportions, creating a hall of mirrors that will be hard to navigate by anyone in the field.
Faced with this looming credibility crisis, we not only urgently need “enhanced validation systems” (Novy-Marx & Velikov, 2026, pp. 8 and 24), such as for construct validation (Cornelissen et al., 2026), but we also must draw a hard line. Instead of letting our theory building be hacked by AI, we need to be more discerning regarding what theory building is, cherish and protect what strong theorizing consists of, and maintain high standards for how our theories (including constructs and mechanisms) can be independently validated. We should outright reject the claim that theory building is nothing more and nothing less than machine generating some supporting arguments or stand-alone stories (as a necessary but redundant inference to theory) that, based on statistical algorithms, appear to sufficiently fit the data. Strong theorizing, rather, involves probing any theoretical inferences directly and assessing the extent to which, based on our ongoing inquiries and interventions in the real world, evidential support accrues to a particular inference in how it explains, in a particularly satisfying way, data about a real-world phenomenon.
The reason for standing our ground is not only the scale of things with AI (although think for a moment of the 380 papers that Novy-Marx and Velikov (2026) machine generated in a day) and the massive credibility crisis it will cause. Something even more fundamental, existential even, for management and organizational research is at stake. If we take the bait and start using these AI tools, we are not actually “optimizing” our current ways of theory building. We are conned into believing that this is the case. Instead, we are drawn into an analogous knowledge-production system that redefines the nature of theory building in its own computational terms, commodifying in the process what we understand theory to be as well as how we build or “produce” it. Its inevitable outcome will be a sea of sameness in theoretical contributions, with theory, reduced to linguistic tokens, being abundant and superfluous and no longer a distinct and differentiating element by which to judge a paper’s quality or contributions. In fact, the differentiating value of theory and of theorizing as a distinct practice or process will have been extracted from it to make it amenable to a computational form of social science and befitting a system of mass knowledge production. This commodification of theory, eroding its intrinsic value, is the ultimate price that we will pay when we buy into the presumptions of AI-augmented theory building and one that, now that we still have a chance, we should avoid at all costs.
