Abstract
Background:
Patient–provider interactions could inform care quality and communication but are rarely leveraged because collecting and analyzing them is both time-consuming and methodologically complex. The growing availability of large language models (LLMs) makes these analyses more feasible, though their accuracy remains uncertain.
Objectives:
Assess an LLM’s ability to analyze patient–provider interactions.
Design:
Compare a human coder's and an LLM's coding of clinical encounter transcripts.
Setting/Subjects:
Two hundred thirty-six potential symptom discussions from transcripts of clinical encounters with 92 patients living with cancer in the mid-Atlantic United States. Transcripts were analyzed with GPT4DFCI, an instance of GPT-4 (OpenAI) hosted on our hospital's Health Insurance Portability and Accountability Act-compliant infrastructure.
Measurements:
A human and an LLM coded transcripts to determine whether a patient's reported symptom(s) were discussed, who initiated the discussion, and any resulting recommendation. We calculated Cohen's κ to assess interrater agreement between the LLM and the human and qualitatively classified disagreements about recommendations.
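For reference, Cohen's κ corrects the observed agreement between two raters for agreement expected by chance: κ = (p_o − p_e)/(1 − p_e), where p_o is the proportion of items on which the raters agree and p_e is the agreement expected by chance given each rater's marginal rating frequencies.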
Results:
Interrater reliability indicated "strong" to "moderate" agreement across measures: agreement was strongest for whether the symptom was discussed (κ = 0.89), followed by who initiated the discussion (κ = 0.82) and the recommendation provided (κ = 0.78). The human and the LLM disagreed on the presence and/or content of the recommendation in 16% of potential discussions; we categorized these disagreements into nine types.
Conclusions:
Our results suggest that LLMs' abilities to analyze clinical encounters are equivalent to those of humans. Using LLMs as a research tool may therefore make analyzing patient–provider interactions more feasible, with broader implications for assessing care quality, addressing care inequities, and improving provider communication.
