Abstract
Clinicians face an ever-increasing volume of medical literature and are turning to large language models and “deep research” systems to retrieve, organize, and synthesize biomedical evidence. In our use of these tools, we have found them useful for producing coherent and comprehensive summaries and proposing testable hypotheses. However, the outputs of these models are prone to flattening evidentiary hierarchies, overgeneralizing across heterogeneous populations and comparators, and occasionally propagating hallucinated citations. These failure modes risk automation bias and erosion of transparency if introduced into clinical pathways without guardrails. Here, we propose a “clinician-in-the-loop” model in which clinicians remain the gatekeepers for artificial intelligence-assisted synthesis. We outline three core duties for clinicians: (1) evidence weighting that privileges randomized trials, high-quality meta-analyses, and absolute risk communication; (2) contextual integration across pathophysiology, existing evidence, and patient populations; and (3) provenance and bias auditing through source verification, uncertainty reporting, and counter-summaries. We further explore how healthcare institutions, medical educators, policymakers, and publishers of medical literature can promote literacy and transparency regarding the use of “deep research” tools, including the implementation of reporting standards, provenance disclosures, and equity surveillance.
As physicians-in-training, we are acutely aware of the explosive growth of medical information. Medical knowledge is forecasted to double approximately every 3 months, 1 PubMed alone added nearly 1.6 million new citations in 2023, 2 and a physician would need an unrealistic 20+ hours per day just to skim all relevant publications to stay current. 3 Faced with this overwhelming deluge, we find ourselves increasingly turning to large language models (LLMs)—particularly the latest large reasoning models (LRMs). These advanced LLMs go beyond mere summarization; they undertake structured, multi-step reasoning tasks to help us organize and synthesize complex medical literature. We are the first generation of clinicians to come of age alongside these technologies, and their so-called “deep research” capabilities offer a tantalizing vision: instant access to organized, synthesized medical knowledge.
Nonetheless, our enthusiasm recently met a sobering test during a literature review of the latest glaucoma therapies (a topic aligned with our specialization as ophthalmology residents). We tasked an LRM with summarizing the newest clinical trials and came to an unsettling realization: the generated summary was fluent and authoritative, yet oddly hollow. It effortlessly aggregated relevant studies but failed to highlight meaningful patterns or provide deeper analytical context. It was a well-organized collage, not a thoughtful synthesis.
Our experience illustrates a core truth about the current state of artificial intelligence (AI)-driven “deep research.” Tools such as ChatGPT and specialized platforms such as Elicit can analyze, summarize, and even draft manuscripts from vast troves of literature. Yet they remain incapable of genuine interpretative insights or conceptual breakthroughs. Still, with over a tenth of physicians now regularly using LLMs to generate summaries of medical research and standards of care, 4 it is paramount that physicians grasp the utility and limitations of “deep research.”
What is deep research?
Deep research refers to LRM-enabled workflows that extend beyond the single-pass text generation typical of earlier LLMs. Traditional LLMs produce responses in a single pass, relying on patterns learned during training rather than engaging with new information; while they can generate fluent text, they cannot consult new sources or verify their own reasoning. In contrast, deep research systems are designed to retrieve and critically reason over the most current biomedical literature and clinical guidelines. Two features are central to their approach: (1) retrieval-augmented generation, which grounds outputs in published sources and tracks provenance through explicit citations; and (2) chain-of-thought reasoning, which breaks complex questions into smaller, sequential steps. 5 The latter means that these models can, through iterative processes, check the coherence of their intermediate reasoning steps and revise outputs when inconsistencies are detected. In short, the resulting outputs are evidence-linked syntheses that double-check and self-correct, thereby enhancing transparency, reproducibility, and clinical applicability.
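To make this architecture concrete, the following is a minimal, hypothetical sketch of such a workflow. The helper functions (search_literature, generate, critique) are illustrative stubs standing in for a literature search and for LRM calls; they do not correspond to any vendor's actual interface.

```python
# Minimal sketch of a "deep research" loop: retrieval-augmented generation
# plus iterative self-checking. All helpers are hypothetical stubs for
# illustration only, not any vendor's actual API.

def search_literature(question: str) -> list[dict]:
    """Stub: return retrieved sources as {'citation': ..., 'text': ...} records."""
    return [{"citation": "Example et al. 2024", "text": "Retrieved abstract text."}]

def generate(question: str, sources: list[dict], feedback: str = "") -> str:
    """Stub: an LRM drafts an answer grounded in the retrieved sources."""
    return f"Synthesis of {len(sources)} sources for: {question}"

def critique(draft: str, sources: list[dict]) -> str:
    """Stub: the model reviews its own intermediate steps; '' means no issues found."""
    return ""

def deep_research(question: str, max_revisions: int = 3) -> dict:
    sources = search_literature(question)            # (1) retrieval grounds the output
    draft = generate(question, sources)              #     draft cites retrieved sources
    for _ in range(max_revisions):                   # (2) iterative self-check and revision
        issues = critique(draft, sources)
        if not issues:
            break
        draft = generate(question, sources, feedback=issues)
    return {"answer": draft, "sources": [s["citation"] for s in sources]}

print(deep_research("Newest clinical trials of glaucoma therapies"))
```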
Opportunities for clinicians
The practical benefits of deep research are evident. In minutes, these systems can ostensibly produce comprehensive summaries and literature overviews that once demanded weeks of human effort. Elicit, for example, is a deep research tool that generates lists of citations from a research query and lets users search easily within the retrieved articles. 6 By highlighting key findings from a diverse range of papers, LRMs become a springboard for creative thinking and hypothesis generation. In institutions without digital libraries or subscription databases, the technology can potentially level the playing field: rather than leaving clinicians to struggle with paywalled articles, deep research tools can retrieve and distill the essential evidence. Many LLMs can also generate tables and graphs and perform basic statistical analyses, further extending their utility in clinical and research settings.
Fluent answers missing insight
On closer inspection, the apparent speed and well-organized nature of deep research-generated content belie critical shortcomings. First, deep research often flattens the hierarchy of evidence. Through the filter of LRMs, case series, randomized controlled trials, observational studies, editorials from subject matter experts, and meta-analyses from small and large journals alike all emerge as equally compelling findings, stitched together with limited consideration of differences in sample size, study design rigor, and bias risk.
Studies have illustrated the limitations of LLMs in summarizing medical evidence: human reviewers identify critical omissions and misinterpretations in system-generated summaries across multiple clinical domains. 7 Beyond medicine, controlled tests of scientific summarization demonstrate a systematic tendency to over-generalize conclusions relative to source texts, reinforcing the risk that fluent language can mask distorted takeaways. 8
Moreover, the snippets recombined from existing literature by LRMs are based purely on statistical co-occurrence, without any internal grasp of pathophysiology or clinical nuance. They cannot distinguish between breakthroughs and incremental progress, nor can they determine when a hypothesis is outdated or contradicted. They do not challenge prevailing assumptions but merely rephrase what already exists. In short, these systems simulate understanding without performing conceptual synthesis.
The fundamental limitation was rigorously quantified in Apple's recent study, “The Illusion of Thinking,” which examined leading LRMs across carefully controlled puzzle environments. 9 The researchers identified a counterintuitive phenomenon termed a “scaling limit”: as task complexity increases, models initially devote more reasoning effort, but soon reach a threshold beyond which their effort and accuracy dramatically decline despite having ample remaining computing resources. Notably, these failures occurred even with straightforward logic puzzles, which lack the inherent ambiguity, complexity, and contextual noise found in clinical medicine. It is not hard to imagine how these same models might falter when faced with the considerably messier problems of clinical decision-making and medical research. In other words, as problems become increasingly complex, these models demonstrate behaviors fundamentally misaligned with human cognition, scientific inquiry, and patient-centered medical care.
Then there are hallucinations, present even in the most advanced LRMs: confidently fabricated PubMed identifiers, nonexistent DOIs, invented trial results, and even spurious “expert quotations” that cannot be found when the cited sources are scrutinized. One of the most frustrating aspects of our use of deep research has been chasing down AI-generated citations that lead to empty archives and dead ends, convincingly embedded within otherwise accurate information. This is not merely anecdotal: in medical contexts, published studies have documented high rates of fabricated or inaccurate references in chatbot-generated outputs. 10
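Some of this citation-chasing can be triaged programmatically. The sketch below, which assumes the freely available requests package and uses placeholder identifiers, checks whether a cited PubMed ID or DOI even resolves against the public NCBI E-utilities and Crossref APIs; a resolving identifier does not prove the citation is accurate or relevant, so the underlying source must still be read.

```python
# Illustrative triage of model-generated references: flag identifiers that do
# not resolve. A resolving PMID/DOI is necessary, not sufficient, for a real
# citation; the source still has to be read and checked against the claim.
import requests

def pmid_exists(pmid: str) -> bool:
    """Heuristic check that a PMID resolves in PubMed via NCBI ESummary."""
    r = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
        params={"db": "pubmed", "id": pmid, "retmode": "json"},
        timeout=10,
    )
    r.raise_for_status()
    result = r.json().get("result", {})
    return pmid in result.get("uids", []) and "error" not in result.get(pmid, {})

def doi_exists(doi: str) -> bool:
    """Heuristic check that a DOI is registered with Crossref."""
    r = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return r.status_code == 200

# Placeholder identifiers for illustration only.
for ref in [{"pmid": "12345678"}, {"doi": "10.1000/obviously-fake-doi"}]:
    ok = pmid_exists(ref["pmid"]) if "pmid" in ref else doi_exists(ref["doi"])
    print(ref, "resolves" if ok else "NOT FOUND: verify manually")
```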
Bias in AI outputs presents another challenge. Trained on a skewed archive of published work, these models perpetuate the blind spots of the medical literature they ingest: underrepresented patient groups remain less visible, minority perspectives go unheard, and dominant narratives grow more entrenched. 11
The clinician as gatekeeper
What our experience tells us is that deep research, at least in its current form, does not replace the clinician's judgment. The real promise lies in a “clinician-in-the-loop” approach: AI may perform a first pass at gathering relevant evidence and organizing raw material, but human experts must interpret the findings and decide which leads warrant further exploration.
Being a “clinician-in-the-loop” entails three specific responsibilities: (1) Evidence weighting: explicitly privileging stronger study designs (e.g. randomized trials and high-quality meta-analyses) over other sources; (2) Contextual integration: aligning aggregated claims with pathophysiology, population differences, practice standards, and trends over time; and (3) Bias and safety checks: screening for hallucinations and for algorithmic bias reflected in historical data.11–13 These duties may be familiar to clinicians, but AI's surface fluency means we must perform them more deliberately and transparently.
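As a deliberately simplified illustration of the first duty, the snippet below sorts model-retrieved citations by a crude design hierarchy so that stronger designs surface first; the ranking and the example records are assumptions made for illustration and are no substitute for formal critical appraisal.

```python
# A deliberately crude sketch of the "evidence weighting" duty: sort
# model-retrieved citations so stronger study designs surface first.
# The hierarchy and example records are illustrative assumptions, not a
# validated appraisal instrument.

DESIGN_RANK = {  # smaller = reviewed earlier / weighted more heavily
    "systematic review / meta-analysis": 0,
    "randomized controlled trial": 1,
    "observational study": 2,
    "case series": 3,
    "editorial / expert opinion": 4,
}

citations = [  # hypothetical records, as a deep research tool might return them
    {"title": "Drug X vs timolol, case series (n=12)", "design": "case series"},
    {"title": "Drug X multicenter RCT (n=842)", "design": "randomized controlled trial"},
    {"title": "Expert commentary on Drug X", "design": "editorial / expert opinion"},
]

for c in sorted(citations, key=lambda c: DESIGN_RANK.get(c["design"], 99)):
    print(f'{c["design"]:<30} {c["title"]}')
```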
To help with this, today's clinician must also gain a high-level familiarity with the language of algorithms. It starts with a curiosity about how these tools learn: what datasets shape their outputs, and how their model architecture makes them confident yet prone to inventing facts. We can then subject AI-generated summaries to the same rigorous critique we apply to any peer-reviewed paper, asking whether the evidence comes from a randomized trial or a case series, whether conflicts of interest exist, and whether key subgroups are represented.
Suppose we brought AI into our existing culture of inquiry. What if, at the next journal club, alongside a presentation of the latest article from the American Journal of Ophthalmology, we also said, “Here's what my deep research model found; let's critique it together”? By sharpening our clinical judgment against the generated information, we won’t just keep pace with the growing flood of knowledge; we’ll shape it into meaningful insights.
The mandate as gatekeepers extends beyond the individual clinician. Medical educators should incorporate critical appraisal of AI outputs into their curricula. Journals and funding bodies should demand complete transparency whenever manuscripts lean on AI, whether it helped mine the literature or draft prose, and hold authors accountable for any undisclosed reliance. Peer reviewers should demand clear source citations and verify dubious claims. At the health system level, leadership should vet a “toolkit” of deep research services and establish protocols for monitoring AI-generated evidence.
Evidence from the literature shows that a thorough understanding of how AI reasons is essential to sound clinical decision-making. A randomized controlled trial found that providing hospital-based clinicians with systematically biased AI diagnostic tools significantly decreased the clinicians’ diagnostic performance. 14 Systematic reviews have demonstrated how clinicians can become overconfident in machine outputs and less vigilant in verifying their validity.12,13 What this suggests is that AI may contribute to de-skilling when clinicians stop practicing core appraisal skills. Table 1 provides a checklist for the responsible use of AI-assisted research.
Table 1. Gatekeeper checklist for artificial intelligence (AI)-assisted research.
Two other aspects of being gatekeepers should be mentioned: patient trust and infrastructure equity. Patients’ perceptions of AI-influenced recommendations are heterogeneous, with common concerns over accuracy and accountability, and a general preference for clinician supervision.15,16 A pragmatic norm would be to disclose AI involvement in plain language and to confirm understanding with a brief teach-back. A study shows that transparency in the use of AI tools can improve patient trust without eliminating appropriate reliance. 17 Without attention to access, literacy, and representative data, AI tools can reinforce existing inequities. 11 Health systems and training programs should allocate resources for equitable access to vetted tools, offer faculty development in AI literacy, and audit for disparate impact. 18
Deep research is a powerful and exciting tool, but it is far from being an arbiter of truth. As we navigate the influx of medical literature and the emergence of LRMs, our human clinical judgment, paradoxically, becomes all the more crucial. We are aware of how rapidly LLM capabilities have evolved, from simple text generation to multi-step analysis within three years, and these advancements suggest that AI-driven reasoning will likely become significantly more sophisticated in the near future. Still, by rigorously scrutinizing these emerging tools, we not only preserve the compassionate, critical inquiry at the heart of our profession, but also translate the promise of deep research into better care for our patients.
Footnotes
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
AI use statement
AI tools (ChatGPT, OpenAI) were used to improve the grammar and phrasing only. The content, references, analysis, and conclusions were entirely written and verified by the authors. The authors take full responsibility for the integrity and accuracy of the work. No text or citations were accepted without manual verification.
