Abstract
Clinicians face an ever-increasing volume of medical literature and are turning to large language models and “deep research” systems to retrieve, organize, and synthesize biomedical evidence. In our use of these tools, we have found them useful for producing coherent and comprehensive summaries and proposing testable hypotheses. However, the outputs of these models are prone to flattening evidentiary hierarchies, overgeneralizing across heterogeneous populations and comparators, and occasionally propagating hallucinated citations. These failure modes risk automation bias and erosion of transparency if introduced into clinical pathways without guardrails. Here, we propose a “clinician-in-the-loop” model in which clinicians remain the gatekeepers for artificial intelligence-assisted synthesis. We outline three core duties for clinicians: (1) evidence weighting that privileges randomized trials, high-quality meta-analyses, and absolute risk communication; (2) contextual integration across pathophysiology, existing evidence, and patient populations; and (3) provenance and bias auditing through source verification, uncertainty reporting, and counter-summaries. We further explore how healthcare institutions, medical educators, policymakers, and publishers of medical literature can promote literacy and transparency regarding the use of “deep research” tools, including the implementation of reporting standards, provenance disclosures, and equity surveillance.
As physicians-in-training, we are acutely aware of the explosive growth of medical information. Medical knowledge is forecasted to double approximately every 3 months, 1 PubMed alone added nearly 1.6 million new citations in 2023, 2 and a physician would need an unrealistic 20+ hours per day just to skim all relevant publications to stay current. 3 Faced with this overwhelming deluge, we find ourselves increasingly turning to large language models (LLMs)—particularly the latest large reasoning models (LRMs). These advanced LLMs go beyond mere summarization; they undertake structured, multi-step reasoning tasks to help us organize and synthesize complex medical literature. We are the first generation of clinicians to come of age alongside these technologies, and their so-called “deep research” capabilities offer a tantalizing vision: instant access to organized, synthesized medical knowledge.
Nonetheless, our enthusiasm recently met a sobering test during a literature review of the latest glaucoma therapies (a topic aligned with our specialization as ophthalmology residents). We tasked an LRM with summarizing the newest clinical trials and came to an unsettling realization: the generated summary was fluent and authoritative, yet oddly hollow. It effortlessly aggregated relevant studies but failed to highlight meaningful patterns or provide deeper analytical context. It was a well-organized collage, not a thoughtful synthesis.
Our experience illustrates a core truth about the current state of artificial intelligence (AI)-driven “deep research.” Tools such as ChatGPT and specialized platforms such as Elicit can analyze, summarize, and even draft manuscripts from vast troves of literature. Yet they remain incapable of genuine interpretative insights or conceptual breakthroughs. Still, with over a tenth of physicians now regularly using LLMs to generate summaries of medical research and standards of care, 4 it is paramount that physicians grasp the utility and limitations of “deep research.”
What is deep research?
Deep research refers to LRM-enabled workflows that extend beyond the single-pass text generation typical of earlier LLMs. Traditional LLMs produce responses in a single pass, relying on patterns learned during training rather than engaging with new information; while they can generate fluent text, they cannot consult new sources or verify their own reasoning. In contrast, deep research systems are designed to retrieve and critically reason over the most current biomedical literature and clinical guidelines. Two features are central to their approach: (1) retrieval-augmented generation, which grounds outputs in published sources and tracks provenance through explicit citations; and (2) chain-of-thought reasoning, which breaks complex questions into smaller, sequential steps. 5 The latter means that these models can, through iterative processes, check the coherence of their intermediate reasoning steps and revise outputs when inconsistencies are detected. In short, the resulting outputs are evidence-linked syntheses that double-check and self-correct, thereby enhancing transparency, reproducibility, and clinical applicability.
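To make this architecture concrete, the following is a minimal, hypothetical sketch of such a workflow. The helper functions (search_literature, generate, critique) are illustrative stubs standing in for a literature search and for LRM calls; they do not correspond to any vendor's actual interface.

```python
# Minimal sketch of a "deep research" loop: retrieval-augmented generation
# plus iterative self-checking. All helpers are hypothetical stubs for
# illustration only, not any vendor's actual API.

def search_literature(question: str) -> list[dict]:
    """Stub: return retrieved sources as {'citation': ..., 'text': ...} records."""
    return [{"citation": "Example et al. 2024", "text": "Retrieved abstract text."}]

def generate(question: str, sources: list[dict], feedback: str = "") -> str:
    """Stub: an LRM drafts an answer grounded in the retrieved sources."""
    return f"Synthesis of {len(sources)} sources for: {question}"

def critique(draft: str, sources: list[dict]) -> str:
    """Stub: the model reviews its own intermediate steps; '' means no issues found."""
    return ""

def deep_research(question: str, max_revisions: int = 3) -> dict:
    sources = search_literature(question)            # (1) retrieval grounds the output
    draft = generate(question, sources)              #     draft cites retrieved sources
    for _ in range(max_revisions):                   # (2) iterative self-check and revision
        issues = critique(draft, sources)
        if not issues:
            break
        draft = generate(question, sources, feedback=issues)
    return {"answer": draft, "sources": [s["citation"] for s in sources]}

print(deep_research("Newest clinical trials of glaucoma therapies"))
```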
Opportunities for clinicians
The practical benefits of deep research are evident. In minutes, these systems can ostensibly produce comprehensive summaries and literature overviews that once demanded weeks of human effort. Elicit, for example, is a deep research tool that generates lists of citations from a research query and lets users search easily within the retrieved articles. 6 By highlighting key findings from a diverse range of papers, LRMs become a springboard for creative thinking and hypothesis generation. In institutions without digital libraries or subscription databases, the technology can potentially level the playing field: rather than leaving clinicians to struggle with paywalled articles, deep research tools can retrieve and distill the essential evidence. Many LLMs can also generate tables and graphs and perform basic statistical analyses, further extending their utility in clinical and research settings.
Fluent answers missing insight
On closer inspection, the apparent speed and well-organized nature of deep research-generated content belie critical shortcomings. First, deep research often flattens the hierarchy of evidence. Through the filter of LRMs, case series, randomized controlled trials, observational studies, editorials from subject matter experts, and meta-analyses from small and large journals alike all emerge as equally compelling findings, stitched together with limited consideration of differences in sample size, study design rigor, and bias risk.
Studies have illustrated the limitations of LLMs in summarizing medical evidence: human reviewers identify critical omissions and misinterpretations in system-generated summaries across multiple clinical domains. 7 Beyond medicine, controlled tests of scientific summarization demonstrate a systematic tendency to over-generalize conclusions relative to source texts, reinforcing the risk that fluent language can mask distorted takeaways. 8
Moreover, the snippets recombined from existing literature by LRMs are based purely on statistical co-occurrence, without any internal grasp of pathophysiology or clinical nuance. They cannot distinguish between breakthroughs and incremental progress, nor can they determine when a hypothesis is outdated or contradicted. They do not challenge prevailing assumptions but merely rephrase what already exists. In short, these systems simulate understanding without performing conceptual synthesis.
The fundamental limitation was rigorously quantified in Apple's recent study, “The Illusion of Thinking,” which examined leading LRMs across carefully controlled puzzle environments. 9 The researchers identified a counterintuitive phenomenon termed a “scaling limit”: as task complexity increases, models initially devote more reasoning effort, but soon reach a threshold beyond which their effort and accuracy dramatically decline despite having ample remaining computing resources. Notably, these failures occurred even with straightforward logic puzzles, which lack the inherent ambiguity, complexity, and contextual noise found in clinical medicine. It is not hard to imagine how these same models might falter when faced with the considerably messier problems of clinical decision-making and medical research. In other words, as problems become increasingly complex, these models demonstrate behaviors fundamentally misaligned with human cognition, scientific inquiry, and patient-centered medical care.
Then there are hallucinations, present even in the most advanced LRMs: confidently fabricated PubMed identifiers, nonexistent DOIs, invented trial results, and even spurious “expert quotations” that cannot be found when the cited sources are scrutinized. One of the most frustrating aspects of our use of deep research has been chasing down AI-generated citations that lead to empty archives and dead ends, convincingly embedded within otherwise accurate information. This is not merely anecdotal: in medical contexts, published studies have documented high rates of fabricated or inaccurate references in chatbot-generated outputs. 10
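Some of this citation-chasing can be triaged programmatically. The sketch below, which assumes the freely available requests package and uses placeholder identifiers, checks whether a cited PubMed ID or DOI even resolves against the public NCBI E-utilities and Crossref APIs; a resolving identifier does not prove the citation is accurate or relevant, so the underlying source must still be read.

```python
# Illustrative triage of model-generated references: flag identifiers that do
# not resolve. A resolving PMID/DOI is necessary, not sufficient, for a real
# citation; the source still has to be read and checked against the claim.
import requests

def pmid_exists(pmid: str) -> bool:
    """Heuristic check that a PMID resolves in PubMed via NCBI ESummary."""
    r = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
        params={"db": "pubmed", "id": pmid, "retmode": "json"},
        timeout=10,
    )
    r.raise_for_status()
    result = r.json().get("result", {})
    return pmid in result.get("uids", []) and "error" not in result.get(pmid, {})

def doi_exists(doi: str) -> bool:
    """Heuristic check that a DOI is registered with Crossref."""
    r = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return r.status_code == 200

# Placeholder identifiers for illustration only.
for ref in [{"pmid": "12345678"}, {"doi": "10.1000/obviously-fake-doi"}]:
    ok = pmid_exists(ref["pmid"]) if "pmid" in ref else doi_exists(ref["doi"])
    print(ref, "resolves" if ok else "NOT FOUND: verify manually")
```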
Bias in AI outputs presents another challenge. Trained on a skewed archive of published work, these models perpetuate the blind spots of the medical literature they ingest: underrepresented patient groups remain less visible, minority perspectives go unheard, and dominant narratives grow more entrenched. 11
The clinician as gatekeeper
What our experience tells us is that deep research, at least in its current form, does not replace the clinician's judgment. The real promise lies in a “clinician-in-the-loop” approach: AI may perform a first pass at gathering relevant evidence and organizing raw material, but human experts must interpret the findings and decide which leads warrant further exploration.
Being a “clinician-in-the-loop” entails three specific responsibilities: (1) Evidence weighting: explicitly privileging stronger study designs (e.g. randomized trials and high-quality meta-analyses) over other sources; (2) Contextual integration: aligning aggregated claims with pathophysiology, population differences, practice standards, and trends over time; and (3) Bias and safety checks: screening for hallucinations and for algorithmic bias reflected in historical data.11–13 These duties may be familiar to clinicians, but AI's surface fluency means we must perform them more deliberately and transparently.
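As a deliberately simplified illustration of the first duty, the snippet below sorts model-retrieved citations by a crude design hierarchy so that stronger designs surface first; the ranking and the example records are assumptions made for illustration and are no substitute for formal critical appraisal.

```python
# A deliberately crude sketch of the "evidence weighting" duty: sort
# model-retrieved citations so stronger study designs surface first.
# The hierarchy and example records are illustrative assumptions, not a
# validated appraisal instrument.

DESIGN_RANK = {  # smaller = reviewed earlier / weighted more heavily
    "systematic review / meta-analysis": 0,
    "randomized controlled trial": 1,
    "observational study": 2,
    "case series": 3,
    "editorial / expert opinion": 4,
}

citations = [  # hypothetical records, as a deep research tool might return them
    {"title": "Drug X vs timolol, case series (n=12)", "design": "case series"},
    {"title": "Drug X multicenter RCT (n=842)", "design": "randomized controlled trial"},
    {"title": "Expert commentary on Drug X", "design": "editorial / expert opinion"},
]

for c in sorted(citations, key=lambda c: DESIGN_RANK.get(c["design"], 99)):
    print(f'{c["design"]:<30} {c["title"]}')
```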
To help with this, today's clinician must also gain a high-level familiarity with the language of algorithms. It starts with a curiosity about how these tools learn: what datasets shape their outputs, and how their model architecture makes them confident yet prone to inventing facts. We can then subject AI-generated summaries to the same rigorous critique we apply to any peer-reviewed paper, asking whether the evidence comes from a randomized trial or a case series, whether conflicts of interest exist, and whether key subgroups are represented.
Suppose we brought AI into our existing culture of inquiry. What if, at the next journal club, alongside a presentation of the latest article from the American Journal of Ophthalmology, we also said, “Here's what my deep research model found; let's critique it together”? By sharpening our clinical judgment against the generated information, we won’t just keep pace with the growing flood of knowledge; we’ll shape it into meaningful insights.
The mandate as gatekeepers extends beyond the individual clinician. Medical educators should incorporate critical appraisal of AI outputs into their curricula. Journals and funding bodies should demand complete transparency whenever manuscripts lean on AI, whether it helped mine the literature or draft prose, and hold authors accountable for any undisclosed reliance. Peer reviewers should demand clear source citations and verify dubious claims. At the health system level, leadership should vet a “toolkit” of deep research services and establish protocols for monitoring AI-generated evidence.
Evidence from the literature shows that a thorough understanding of how AI reasons is essential to sound clinical decision-making. A randomized controlled trial found that providing hospital-based clinicians with systematically biased AI diagnostic tools significantly decreased the clinicians’ diagnostic performance. 14 Systematic reviews have demonstrated how clinicians can become overconfident in machine outputs and less vigilant in verifying their validity.12,13 What this suggests is that AI may contribute to de-skilling when clinicians stop practicing core appraisal skills. Table 1 provides a checklist for the responsible use of AI-assisted research.
Table 1. Gatekeeper checklist for artificial intelligence (AI)-assisted research.
Two other aspects of being gatekeepers should be mentioned: patient trust and infrastructure equity. Patients’ perceptions of AI-influenced recommendations are heterogeneous, with common concerns over accuracy and accountability, and a general preference for clinician supervision.15,16 A pragmatic norm would be to disclose AI involvement in plain language and to confirm understanding with a brief teach-back. A study shows that transparency in the use of AI tools can improve patient trust without eliminating appropriate reliance. 17 Without attention to access, literacy, and representative data, AI tools can reinforce existing inequities. 11 Health systems and training programs should allocate resources for equitable access to vetted tools, offer faculty development in AI literacy, and audit for disparate impact. 18
Deep research is a powerful and exciting tool, but it is far from being an arbiter of truth. As we navigate the influx of medical literature and the emergence of LRMs, our human clinical judgment, paradoxically, becomes all the more crucial. We are aware of how rapidly LLM capabilities have evolved, from simple text generation to multi-step analysis within three years, and these advancements suggest that AI-driven reasoning will likely become significantly more sophisticated in the near future. Still, by rigorously scrutinizing these emerging tools, we not only preserve the compassionate, critical inquiry at the heart of our profession, but also translate the promise of deep research into better care for our patients.
Footnotes
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
AI use statement
AI tools (ChatGPT, OpenAI) were used to improve the grammar and phrasing only. The content, references, analysis, and conclusions were entirely written and verified by the authors. The authors take full responsibility for the integrity and accuracy of the work. No text or citations were accepted without manual verification.
