
Abstract

Oral and Maxillofacial Surgery (OMFS) is a surgical specialty that serves as a bridge between medicine and dentistry, focusing on the diagnosis and treatment of diseases affecting the mouth, jaw, face, and neck. Large Language Models (LLMs), which first appeared in 2019, are trained on extensive text collections and can process language with high quality. Although OMFS is a hands-on surgical specialty, LLMs have been increasingly used for patient education, research, and training purposes. This study aimed to explore the capabilities of LLMs in the field of OMFS by investigating the most recent literature. Seven online repositories, namely PubMed, Scopus, Association for Computing Machinery (ACM), IEEE, Embase, Cumulative Index to Nursing and Allied Health Literature (CINAHL), and Google Scholar, were searched for relevant articles. Adhering to the PRISMA-ScR guidelines, we conducted a systematic search across these libraries to select articles that incorporated LLMs into OMFS. The forward and backward reference lists of the included articles were checked to retrieve missing articles. After the final screening process, a total of 20 studies were selected for this review. The selected studies demonstrated applications of LLMs in OMFS such as patient education, clinical decision support, and procedural guidance for specific procedures. The results showed variability in LLM response accuracy: accuracy was lower in citation generation, whereas open-ended questions achieved higher accuracy rates. Advanced versions of LLMs, such as ChatGPT 4, showed improved accuracy and reliability compared with older GPT versions, although some studies reported that LLM responses lacked complete details and exhibited only moderate accuracy. This variability in performance emphasizes the need for the continuous refinement of LLMs, further extensive research, and verification by experts, and it highlights the importance of human oversight in clinical applications.

Keywords

large language model; LLM; ChatGPT; Bard; maxillofacial surgery; oral surgery; head and neck surgery

Introduction

Oral and Maxillofacial Surgery is a surgical specialty that serves as a bridge between medicine and dentistry, focusing on the diagnosis and treatment of diseases affecting the mouth, jaw, face, and neck. The scope of the field is broad and includes the diagnosis and management of facial injuries, head and neck cancers, salivary gland diseases, facial disproportion, facial pain, impacted teeth, cysts, and tumors of the jaws, as well as various issues affecting the oral mucosa, such as mouth ulcers and infections.¹ Specialists in this field are unique in that their training often requires dual degrees in medicine and dentistry, followed by specialty residency training. The specialty is recognized worldwide.

The field of Oral and Maxillofacial Surgery (OMFS) includes various subspecialties, such as head and neck oncology, dentoalveolar surgery, orthognathic surgery, cleft lip and palate, craniofacial surgery, facial esthetic surgery, and craniofacial trauma.^1,2 The International Association of Oral and Maxillofacial Surgeons (IAOMS) has acknowledged the transformative potential of Information Technology in global oral and maxillofacial training. This led to the formation of a committee and initiatives aimed at leveraging IT to disseminate education worldwide.³

The advent of large language models (LLMs), which first appeared in 2019, represents a significant technological development. These models, trained on extensive text collections, can process and generate language with exceptional quality.⁴ Among the most well-known LLMs is ChatGPT, developed by OpenAI (San Francisco, CA, USA) and launched in November 2022. It quickly gained recognition, amassing 1 million users within 5 days of its release. Accessible via web browsers or mobile apps, ChatGPT facilitates queries, communication, and word-based tasks. Although OMFS is a hands-on surgical specialty with practitioners often engaged in clinical or surgical settings, LLMs have found utility in areas such as diagnosis and education for both patients and dental students.^5,6 Cufuna et al.⁷ explored the integration of Augmented Reality (AR) and LLMs to enhance future teachers’ digital competencies, revealing that these technologies boost student engagement, problem-solving skills, and interactive learning. Their findings highlighted the potential of AR and LLMs to transform education by fostering dynamic, participatory teaching methodologies that prepare educators for real-world challenges. Similarly, Askarbekuly and Aničić⁸ automated outcome-based assessment in informal e-learning using ChatGPT, addressing the challenge of evaluating learning trajectories. Their case study and two evaluation stages showed that instructor oversight, a high-quality knowledge base, and well-crafted prompts are key to ensuring assessment quality. Distance simulation is transforming surgical education by providing scalable, high-quality training through effective hardware, validated programs, and timely feedback. With AI-enhanced assessment tools and remote feedback, it optimizes learning, mentorship, and faculty efficiency, making surgical training more accessible and sustainable.⁹

While there have been several narrative reviews on the use of Artificial Intelligence in OMFS, there is a notable lack of reviews focusing on LLMs in this area.^4,5 This scoping review aims to fill this gap by exploring the most recent applications of LLMs in OMFS and providing an overview of language model input and output. The research questions for this review are as follows.

RQ1: In which OMFS subspecialties are LLMs utilized? This research question focuses on identifying the subspecialties investigated using LLMs. These subspecialties include orthognathic surgery, facial esthetic surgery, head and neck oncology, salivary gland disease, osteonecrosis of the jaw, oral cancer, and transoral robotic surgery.

RQ2: Why are LLMs used in OMFS, and in what capacities (such as patient education, diagnosis, or text generation)? This question aims to identify the diverse domains of OMFS explored using LLMs and explainable AI-based systems.

RQ3: Are LLM responses compared to those from experts? This question explores the literature to investigate whether researchers validated the results of their LLMs with medical practitioners and, if so, how healthcare professionals rated the LLM responses.

RQ4: What are the different scoring methods for LLM responses? This research question aimed to explore the different techniques and scoring methodologies used to evaluate the capabilities of multiple LLM models proposed for different research tasks.

These research questions will help us identify the subspecialties of OMFS that are most amenable to digital innovation and highlight the fields where further development might be needed. They also help to explore the various roles that LLMs play in OMFS, thus helping us to assess their impact on clinical outcomes and patient care. Comparing LLM responses with those of experts helps us to evaluate their reliability and accuracy, which are important for the use of LLMs in clinical settings. Understanding the methods used to evaluate the quality of responses from LLMs is essential for ensuring consistency and reliability in their application.

This scoping review provides an understanding of the role of LLMs in OMFS, aiding researchers and practitioners in developing new models or chatbots that can enhance patient and resident education. The remainder of the paper is organized as follows: Section 2 discusses the study protocol and methodology, Section 3 outlines the research findings, Section 4 discusses these findings, Section 5 examines the strengths and limitations of this review, and Section 6 concludes the paper.

Methods

The PRISMA extension for scoping reviews (PRISMA-ScR) guidelines were followed for this scoping review.⁷ The search process and study selection are described in detail below.

Search process

A systematic literature search was performed on February 2, 2024, across seven electronic databases: PubMed, Scopus, ACM, IEEE, Embase, CINAHL, and Google Scholar. The search focused on articles published between November 2022 and January 2024. This period was selected because ChatGPT was launched on November 30, 2022. Google Scholar returned a large number of results of mixed relevance; therefore, only the first 100 results were considered, to limit the search results and focus on the study objectives. The reference lists of the articles selected for inclusion were carefully examined to identify additional relevant studies. The search keywords were as follows.

Intervention-related search strings: (Large Language Model) OR (ChatGPT) OR (Bard) OR (llama) OR (GPT) OR (Dalle2)

Disease-related search strings: (Maxillofacial Surgery) OR (Oral Surgery) OR (Dentoalveolar Surgery) OR (Craniofacial surgery) OR (Orthognathic Surgery) OR (Head and Neck Surgery)

The search strategy is further detailed in the Supplemental file Search Results.pdf.
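
As a minimal sketch of how the two groups of search terms listed above could be assembled into a single Boolean query, the following Python snippet joins the terms of each group with OR and then combines the two groups. The AND between the groups is an assumption (the text lists the groups but does not state how they were combined), the function names are illustrative, and each database has its own query syntax, so the exact strings in the Supplemental file remain the authoritative record.

```python
# Term lists copied from the search strings reported in the text.
INTERVENTION_TERMS = [
    "Large Language Model", "ChatGPT", "Bard", "llama", "GPT", "Dalle2",
]
DISEASE_TERMS = [
    "Maxillofacial Surgery", "Oral Surgery", "Dentoalveolar Surgery",
    "Craniofacial surgery", "Orthognathic Surgery", "Head and Neck Surgery",
]


def or_block(terms):
    """Wrap each term in parentheses and join the terms with OR."""
    return "(" + " OR ".join(f"({term})" for term in terms) + ")"


def build_query(intervention_terms, disease_terms):
    """Combine the two OR groups.

    Assumption: the groups are joined with AND; the source does not
    state the operator used between them.
    """
    return or_block(intervention_terms) + " AND " + or_block(disease_terms)


query = build_query(INTERVENTION_TERMS, DISEASE_TERMS)
print(query)
```

Keeping the term lists as data makes it straightforward to regenerate the query when a term is added, and to adapt the joining logic to the syntax of a particular database.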

Inclusion and exclusion criteria

The studies included in this scoping review focused on populations who underwent OMFS procedures, with no restrictions on age, sex, or ethnicity. The interventions considered were LLMs used in the OMFS field, encompassing applications in treatment, diagnosis planning, post-operative care, student education, research, and other relevant areas. Only studies published in English between 2022 and 2024 were included. The types of publications included were peer-reviewed articles, theses, dissertations, conference proceedings, and preprints. Reviews, conference abstracts, proposals, editorials, and commentaries were excluded. No constraints were placed on the publication country, comparators, or outcomes of the LLMs.

Study selection

The articles retrieved from the search were uploaded to the Rayyan Intelligent Review Application developed by Rayyan Systems Inc.¹¹ This application facilitates efficient collaboration among researchers and expedites the review process. Reviewers can conduct individual or collaborative reviews independently, making decisions regarding the inclusion or exclusion of articles.¹¹ Duplicates were identified and removed from the list, and the remaining studies were evaluated based on their titles and abstracts. The reviewers (SM, MR, and SK) independently assessed the eligibility of each article. Any discrepancies were resolved through mutual consultation and discussion between reviewers.

Data extraction

A data extraction sheet was created using Microsoft Excel, and relevant information was extracted from the final articles included. The following variables were included in the extraction process: the first author’s name, year of publication, month of publication, type of publication, venue (conference or journal name), country, study design, setting, aim, duration or date of study, LLM model, type of disease or subspecialty, subcategory of disease, reasoning mechanism, LLM application, comparison, input to LLM, source of questions, output from LLM, input type, number of questions/input, number of answers/output, fine-tuned or not, number of reviewers, scoring of LLM answers, data analysis tools, inter-rater reliability, statistical analysis, performance values, completeness, accuracy or references, reported outcomes, identified gaps, limitations, future recommendations, and additional comments. A detailed description of the extraction information is provided in the Supplemental file Data Extraction sheet.xlsx. The data extraction process was conducted by the authors (SM and MR), and the extracted data were subsequently reviewed and verified by other authors (SK and ZS).

Data synthesis

The collected data were analyzed and presented using narrative synthesis. The included studies and results are summarized and detailed in the Supplemental file Data Extraction sheet.xlsx.

Results

Search results

In the initial search across the seven databases, 405 articles were retrieved: PubMed (183), Scopus (59), ACM (21), IEEE (9), Embase (29), CINAHL (4), and Google Scholar (100). After removing 71 duplicate articles, the remaining 334 were screened based on their titles and abstracts. Subsequently, 312 articles were excluded: 173 due to different outcomes, 87 due to different populations, and 52 due to not meeting the inclusion/exclusion criteria regarding publication type. Following this screening process, the remaining 22 studies were sought for retrieval, out of which the full-text PDFs of two studies could not be obtained. Finally, 20 studies were assessed for eligibility, and all 20 studies were included in our final review, as they aligned with our inclusion and exclusion criteria, as shown in Figure 1.

Figure 1.

Research protocol followed to execute this scoping review work.
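
The screening counts reported in the search results can be checked with simple arithmetic. The short Python sketch below recomputes each stage of the flow from the figures given in the text; the variable names are illustrative and nothing beyond the reported counts is assumed.

```python
# Records retrieved per database, as reported in the text.
retrieved = {
    "PubMed": 183, "Scopus": 59, "ACM": 21, "IEEE": 9,
    "Embase": 29, "CINAHL": 4, "Google Scholar": 100,
}
duplicates = 71
# Records excluded at title/abstract screening, by reported reason.
excluded = {
    "different outcomes": 173,
    "different populations": 87,
    "publication type": 52,
}
full_text_unavailable = 2

total = sum(retrieved.values())
screened = total - duplicates
sought = screened - sum(excluded.values())
included = sought - full_text_unavailable

# Each stage should match the number stated in the text.
assert total == 405
assert screened == 334
assert sum(excluded.values()) == 312
assert sought == 22
assert included == 20

print(f"retrieved={total}, screened={screened}, sought={sought}, included={included}")
```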

Demographics of included studies

Detailed demographics of the included studies are presented in Table 1. All 20 articles were published in journals.^6,9–30 Eighty-five percent of the articles were published in 2023 (n = 17),^{9–20,22–28,30} and three articles were published in 2024,^6,19,27 indicating a growing interest in research in this field. The included studies originated from nine countries, with the USA having the highest number (n = 5, 25%),^{6,15,17,19,24} followed by Turkey (n = 4, 20%).^16,20,27,30 Three studies were published in Italy^13,14,28 and two each in Australia^22,23 and Spain^25,26; one study each was published in Brazil,²⁹ France,¹⁸ Germany,²¹ and Taiwan.¹²

Table 1.

Demographics of included studies.

Year	N (%)	References
2023	17 (85%)	Wu and Dang,¹² Vaira et al.,¹³ Frosolini et al.,¹⁴ Lechien et al.,¹⁵ Wei et al.,¹⁶ Alten et al.,¹⁷ Lebhar et al.,¹⁸ Lechien et al.,¹⁹ Lee et al.,²⁰ Russe et al.,²² Seth et al.,²³ Xie et al.,²⁴ Dang and Hanba,²⁵ Suárez et al.,²⁶ Chiesa-Estomba et al.,²⁷ Yousefi-Koma and Akbarzadeh-Baghban,²⁸ Aguiar de Sousa et al.³⁰
2024	3 (15%)	Chaker et al.,⁶ Yurdakurban et al.,²¹ Saibene et al.²⁹
Country	N (%)	References
Australia	2 (10%)	Russe et al.,²² Seth et al.²³
Brazil	1 (5%)	Saibene et al.²⁹
France	1 (5%)	Lebhar et al.¹⁸
Germany	1 (5%)	Yurdakurban et al.²¹
Italy	3 (15%)	Vaira et al.,¹³ Frosolini et al.,¹⁴ Yousefi-Koma and Akbarzadeh-Baghban²⁸
Spain	2 (10%)	Dang and Hanba,²⁵ Suárez et al.²⁶
Taiwan	1 (5%)	Wu and Dang¹²
Turkey	4 (20%)	Wei et al.,¹⁶ Lee et al.,²⁰ Chiesa-Estomba et al.,²⁷ Aguiar de Sousa et al.³⁰
USA	5 (25%)	Chaker et al.,⁶ Lechien et al.,¹⁵ Alten et al.,¹⁷ Lechien et al.,¹⁹ Xie et al.²⁴
Publication type	N (%)	References
Journal Article	20 (100%)	Chaker et al.,⁶ Wu and Dang,¹² Vaira et al.,¹³ Frosolini et al.,¹⁴ Lechien et al.,¹⁵ Wei et al.,¹⁶ Alten et al.,¹⁷ Lebhar et al.,¹⁸ Lechien et al.,¹⁹ Lee et al.,²⁰ Yurdakurban et al.,²¹ Russe et al.,²² Seth et al.,²³ Xie et al.,²⁴ Dang and Hanba,²⁵ Suárez et al.,²⁶ Chiesa-Estomba et al.,²⁷ Yousefi-Koma and Akbarzadeh-Baghban,²⁸ Saibene et al.,²⁹ Aguiar de Sousa et al.³⁰

Subspecialty of OMFS

Seven (35%) studies were in the field of head and neck oncology,^{12–15,18,19,24} while the fields of orthognathic surgery and facial esthetic surgery each had 2 (10%) studies.^16,20,22,23 Thirty percent (n = 6) of the studies were in the field of dentoalveolar surgery,^{21,25,27–30} and only 1 (5%) study was in the field of salivary gland disease.²⁶ The distribution of these studies is illustrated in the pie chart shown in Figure 2. The various diseases, procedures, and surgical topics discussed by the authors include osteonecrosis of the jaw, oral cancer, transoral robotic surgery,¹² salivary gland pathology,^13,26 oral oncology and reconstructive surgery,¹³ maxillofacial and oral surgery,^14,27 facial trauma,¹⁴ head and neck cancer prognosis, ICD-10 codes,¹⁵ orthognathic surgery consultation,^16,20,30 steps of Fisher cleft lip repair,¹⁷ extractions,²⁹ German S2 cone-beam CT dental imaging guidelines,¹⁹ odontogenic sinusitis etiologies,²⁸ dental implants,^27,30 temporomandibular joint diseases,³⁰ and total laryngectomy, parotidectomy, neck dissection, glossectomy, and free tissue transfer for the head and neck,¹⁹ as shown in Table 2.

Figure 2.

Evolution of OMFS subspecialty.

Table 2.

Diseases and topics discussed in the final pool of relevant articles.

S.no.	Disease/topic	References
1	Osteoradionecrosis of jaws, oral cancer, transoral robotic surgery	Wu and Dang¹²
2	Salivary gland pathology	Vaira et al.,¹³ Suárez et al.²⁶
3	Maxillofacial and oral surgery	Frosolini et al.,¹⁴ Chiesa-Estomba et al.²⁷
4	Facial trauma	Frosolini et al.¹⁴
5	Head and neck cancer prognosis, ICD-10 codes	Lechien et al.¹⁵
6	Orthognathic surgery consultations	Wei et al.,¹⁶ Lee et al.,²⁰ Aguiar de Sousa et al.³⁰
7	Surgical steps of Fisher cleft repair	Alten et al.¹⁷
8	Oral oncology and reconstructive surgery	Vaira et al.¹³
9	Extractions	Saibene et al.²⁹
10	Dental implants	Chiesa-Estomba et al.,²⁷ Aguiar de Sousa et al.³⁰
11	German S2 cone-beam CT dental imaging guideline	Yurdakurban et al.²¹
12	Odontogenic sinusitis etiologies	Yousefi-Koma and Akbarzadeh-Baghban²⁸
13	Temporomandibular joint diseases	Aguiar de Sousa et al.³⁰
14	Total laryngectomy, parotidectomy, neck dissection, glossectomy, free tissue transfers for the head and neck	Lechien et al.¹⁹

LLM models reported with applications

Seven studies (35%) utilized the latest version, ChatGPT 4,^{13,14,16,18,21,25,28} whereas five studies (25%) employed ChatGPT 3.5.^{14,21,24,26,28} Nine studies^{6,12,15,17,19,20,23,29,30} did not specify the ChatGPT version used. Additionally, two articles used Microsoft Bing AI and Google Bard,^22,27 as shown in Table 3.

Table 3.

LLM models suggested in the literature for the OMFS.

LLM model	N	References
ChatGPT	9	Chaker et al.,⁶ Wu and Dang,¹² Lechien et al.,¹⁵ Alten et al.,¹⁷ Lechien et al.,¹⁹ Lee et al.,²⁰ Seth et al.,²³ Saibene et al.,²⁹ Aguiar de Sousa et al.³⁰
ChatGPT 3.5	5	Frosolini et al.,¹⁴ Yurdakurban et al.,²¹ Xie et al.,²⁴ Suárez et al.,²⁶ Yousefi-Koma and Akbarzadeh-Baghban²⁸
ChatGPT 4	7	Vaira et al.,¹³ Frosolini et al.,¹⁴ Wei et al.,¹⁶ Lebhar et al.,¹⁸ Yurdakurban et al.,²¹ Dang and Hanba,²⁵ Yousefi-Koma and Akbarzadeh-Baghban²⁸
Google Bard	2	Russe et al.,²² Chiesa-Estomba et al.²⁷
Microsoft Bing AI	2	Russe et al.,²² Chiesa-Estomba et al.²⁷

Six studies (30%) utilized LLMs for text generation,^{6,12,14,19,22,24} whereas 17 of the included studies used them for answering questions,^{6,13,15–23,25–30} as shown in Table 4.

Table 4.

Use of LLMs for different purposes in the finalized articles.

LLM used	N	References
Text generation	6	Chaker et al.,⁶ Wu and Dang,¹² Frosolini et al.,¹⁴ Lechien et al.,¹⁹ Russe et al.,²² Xie et al.²⁴
Question answering	17	Chaker et al.,⁶ Vaira et al.,¹³ Lechien et al.,¹⁵ Wei et al.,¹⁶ Alten et al.,¹⁷ Lebhar et al.,¹⁸ Lechien et al.,¹⁹ Lee et al.,²⁰ Yurdakurban et al.,²¹ Russe et al.,²² Seth et al.,²³ Dang and Hanba,²⁵ Suárez et al.,²⁶ Chiesa-Estomba et al.,²⁷ Yousefi-Koma and Akbarzadeh-Baghban,²⁸ Saibene et al.,²⁹ Aguiar de Sousa et al.³⁰

In the selected studies, LLMs were utilized for the following purposes: citation generation,¹⁰ reference generation,¹⁴ clinical decision-making,^13,25,26,28 patient education and counseling,^{6,15,16,19,20,22,23,27,29,30} medical record analysis,¹⁶ dentists’ education and training,^17,30 use as a supportive tool,¹⁶ use as a virtual assistant for general dentists,²⁵ pre-surgical planning,²² and support for publication writing and the creation of a framework for assessing publications.²⁴ These applications demonstrate the versatile use of LLMs across various aspects of patient care and medical education, as shown in Figure 3.

Figure 3.

LLMs application in OMFS.

Input and output of LLMs

All 20 included studies posed text-based questions to the LLM. The input to the LLM varied from keywords¹² and detailed clinical questions and clinical scenarios^{13,14,17,18,21,22,25–28} to patients’ preoperative, consultation, and postoperative questions.^{6,15,16,19,20,23,29,30} The number of questions posed to the LLM ranged from 1¹⁷ to 159.¹³

The sources of the questions posed to the LLM varied: they were commonly searched or frequently asked questions,^{6,12,15,20,29,30} set by the researchers,^{13,14,17,22,26,28} derived from medical records or examination results,¹⁸ obtained from platforms such as Quora,²⁷ based on guidelines from the American Society of Plastic Surgeons website,^16,23 sourced from the Spanish Society of Oral Surgery,²⁵ derived from the German S2 guideline for CBCT,²¹ or taken from publications in the AHNS Head Neck Fellowship Curriculum.²⁴

The output from the LLM varied across studies and included references and citations,^12,14 surgical steps,¹⁷ treatment options,²⁶ procedure information, complications,^20,23,29 post-extraction symptoms,²⁷ and publication scoring systems.²⁴ The number of responses from the LLM ranged from 1¹⁷ to 900.²⁵ Prompt engineering was applied in two studies,^17,25 whereas a zero-shot approach was employed in one study.²¹ The LLM answers were compared with answers from experts in seven studies,^{6,13,17,18,21,26,29} with answers from Google in two studies,^15,19 and with MedSearch and Open Evidence in one study.²⁰

Performance metrics used

To score LLM answers, Likert scale-based evaluations were used in eight studies.^{13,15,17,21,25–28} True/false categorization was employed to assess content correctness,¹⁴ while readability was evaluated using the Flesch Reading Ease Test and Flesch-Kincaid Grade Level.¹⁵ Additional assessments included the Gunning-Fog Index,¹⁹ the SMOG index for readability, the EQIP tool, and a reliability scoring system.²⁰ One study used scoring criteria set by the reviewers,²³ and one study used an agreement score of 1–5,²² as shown in Table 5.

Table 5.

Scoring of LLM response.

Scoring system	No. of studies
Likert Scale	8
True/false categorization	1
Flesch reading ease test and the Flesch-Kincaid grade level (for readability)	1
Gunning-Fog index	1
SMOG index for readability	1
EQIP tool	1
Reliability scoring system	1
Agreement 1–5	1
Criteria set by reviewers	1
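
The readability indices listed in Table 5 are all computed from simple text counts. The Python sketch below implements the standard published formulas for the Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning-Fog, and SMOG indices. The function names are illustrative, the counts in the usage example are hypothetical and not taken from any included study, and how the included studies computed their scores (including how syllables were counted) is not described here.

```python
import math


def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Gunning-Fog index; complex words have three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


def smog_index(polysyllables, sentences):
    """SMOG grade; polysyllable count is normalised to 30 sentences."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


# Hypothetical counts for a short LLM answer (illustration only).
words, sentences, syllables = 100, 5, 150
complex_words = 10  # words with three or more syllables

fre = flesch_reading_ease(words, sentences, syllables)
fkgl = flesch_kincaid_grade(words, sentences, syllables)
fog = gunning_fog(words, sentences, complex_words)
smog = smog_index(complex_words, sentences)

print(f"FRE={fre:.2f}, FKGL={fkgl:.2f}, Fog={fog:.2f}, SMOG={smog:.2f}")
```

The functions take counts rather than raw text because syllable counting is itself heuristic; separating the two keeps the formulas exact and makes differences between tools easier to trace.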

The number of reviewers who assessed the answers ranged from 2 to 35. Inter-rater reliability was high in one study, with a Cohen’s kappa of 0.878,¹² indicating almost perfect agreement among raters. General agreement rates of 90%¹⁵ and 89.7%²⁵ were also reported.
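
Cohen’s kappa corrects the observed agreement between two raters for the agreement expected by chance: kappa = (p_o − p_e) / (1 − p_e). The following Python sketch computes it from two lists of categorical ratings. The function name and the example ratings are hypothetical and are not data from any included study; they only illustrate how a value such as the reported 0.878 is obtained.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)

    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of ten LLM answers by two reviewers.
reviewer_1 = ["correct"] * 5 + ["incorrect"] * 5
reviewer_2 = ["correct"] * 4 + ["incorrect"] * 6

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.3f}")
```

In this example the reviewers agree on 9 of 10 items (p_o = 0.9) while chance agreement is 0.5, which gives a kappa of 0.8. Note that this statistic applies to two raters; studies with more reviewers typically need a multi-rater measure.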

The accuracy of the LLM responses varied across studies. One study¹² found that only 10% of the responses for citation generation contained all correct details, while another¹³ observed higher accuracy rates for open-ended clinical questions (87.2%) than for closed ones (84.7%). Bibliographic references were poorly executed, with 46.4% being nonexistent.¹³ Another study¹⁴ reported that ChatGPT version 4 outperformed version 3.5 in providing true answers (74.2% vs 16.6%). One study¹⁵ indicated a preference for Google over ChatGPT, with Google achieving a higher quality score. Accuracy rates of 71.7%²⁵ and 69%⁶ were also reported. One study²¹ found that ChatGPT 4 gave correct recommendations in 100% of cases, whereas ChatGPT 3.5 gave correct recommendations in only 92.5%. The quality of the explanations was also superior for version 4 compared with version 3.5 (87.5% and 57.5%, respectively).

Discussion

Principal findings

This scoping review identified the diverse applications of LLMs in OMFS. The evidence collected demonstrates that LLMs, particularly advanced models such as ChatGPT 4, are increasingly being used in OMFS with satisfactory results, particularly in patient education and clinical decision-making, and as a supportive tool for practitioners and students. The wide geographic distribution of the studies, with the USA and Turkey contributing the largest number of publications, and the concentration of publications in 2023 indicate the increasing interest in and global relevance of the research topic. The focus on diverse subspecialties within OMFS, notably head and neck oncology and dentoalveolar surgery, reflects the broad applicability of the findings.

Capabilities of LLM

LLMs have been effective in synthesizing information and generating accurate responses, making them valuable in educational settings for both patients and health care professionals.⁸ Their applications range from generating citations and supporting clinical decision-making to enhancing patient education through accessible explanations of medical conditions and treatments related to OMFS, such as cleft lip repair. The use of LLMs for unique purposes, such as medical record analysis and presurgical planning, indicates the growing confidence of the surgical community in the precision and reliability of LLMs.²⁵ The varied input to the LLMs, from keywords to detailed clinical scenarios, shows the ability of LLMs to handle diverse medical data. However, this variety also poses challenges in standardizing responses and ensuring consistency among studies. Comparison with answers from experts, Google, MedSearch, and Open Evidence is an effective method of validating LLM responses.

Accuracy and reliability

Despite their high utility, the overall accuracy of LLM responses, especially in complex clinical scenarios, poses some concerns, particularly in head and neck surgery and in medical imaging domains.³⁰ While some studies have reported high accuracy rates for open-ended questions, the accuracy of bibliographic references was notably poor, which could lead to the dissemination of incorrect information if not properly verified. ChatGPT 4 outperformed version 3.5 by providing more true answers and correct recommendations, and the quality of its explanations was also superior, indicating improved natural language processing abilities. There are significant concerns regarding the “academic hallucination” observed in references related to head and neck oncology, where ChatGPT often produced erroneous or nonexistent references, albeit at a decreased rate with newer versions such as ChatGPT 4. The limited use of other LLM tools, such as Google Bard, Gemini, Claude AI, and Microsoft Bing AI, indicates their low popularity in the OMFS community. Some authors noted that LLM responses lacked complete details and exhibited only moderate accuracy. This variability in performance emphasizes the need for the continuous refinement of LLMs and highlights the importance of human oversight in clinical applications.

Challenges in using LLM in OMFS

During our review, we identified several challenges in the application of LLMs in OMFS and academia.

LLMs such as ChatGPT may invent data when responding to niche topic inquiries in which verified information is scarce. This tendency can lead to the dissemination of unverified or fabricated information, which is particularly problematic in surgical and academic contexts, where accuracy is crucial.

The processing and storage of sensitive health information by LLMs pose significant privacy risks. Ensuring the security of patient data and compliance with medical confidentiality laws are essential.

There needs to be clear communication about the medicolegal aspects of using LLMs, such as ChatGPT. Patients and health care providers must understand the legal implications of relying on AI-generated advice.

Patients should be explicitly informed that they are interacting with an LLM tool and not with a human, which is crucial for maintaining trust and managing expectations.

The current inability of LLMs to reliably cite sources and provide evidence-based responses limits their academic feasibility.

Implications for clinical practices

The integration of LLMs into clinical settings could potentially streamline workflow and enhance the efficiency of patient care. For example, their use in patient education and counseling, as demonstrated in salivary gland clinics for sialendoscopy treatment, suggests that LLMs can effectively communicate complex medical information. Integrating AI-based chatbots into dental imaging workflows can potentially enhance the standardization and quality of medical imaging practices. However, the reliance on LLMs for critical tasks, such as clinical decision-making and pre-surgical planning, should be approached with caution because of the variability in accuracy and the critical nature of these tasks.

Gaps and future recommendations

There is a clear need for further studies that compare the responses of LLMs, such as ChatGPT, against those from experts in the head and neck surgical community to validate the accuracy and applicability of LLM-generated information in clinical settings. Longitudinal studies are needed to understand the impact of LLM use over time, and large database studies specific to OMFS could enhance the comprehensiveness and reliability of LLM tools. Rigorous testing protocols and replication studies should be implemented to ensure the consistency, reliability, and generalizability of LLM applications for both patients and dentists.

No intelligent framework has been identified as implemented on mobile devices in the selected finalized studies. The high computational demands of LLMs and the substantial memory needed for OMFS imaging data may explain the limited adaptation of these applications for mobile use. Future advancements are expected to enable the deployment of these techniques on mobile platforms, integrating them with servers to facilitate diagnoses at the patient’s location.³¹ This could alleviate pressure on healthcare facilities while supporting medical professionals in delivering home-based care and prescribing appropriate treatments.

There is a notable absence of a standardized scoring system for evaluating LLM responses and assessing their accuracy, highlighting the need for a standardized approach in LLM research. Additionally, there is a lack of specialized LLMs tailored specifically for physicians and other professionals in the medical field.

Strengths and limitations

The strengths and limitations of this review are discussed below.

This review explores the diverse applications of LLMs in OMFS, providing a broad understanding of their uses and benefits in this surgical specialty. It highlights the versatility of LLM tools for improving healthcare delivery, training, and education. The review had a clear objective and explicitly stated research questions. Furthermore, the study identifies gaps in current research and applications, offering valuable insights for future researchers to address these shortcomings.

Although every effort has been made to ensure the validity of the review, there are some limitations to consider. First, this review included only seven databases, which may have resulted in the omission of relevant studies from other databases. Second, only a limited number of the included studies examined LLMs other than ChatGPT. In addition, the decision to include only papers in English may have led to the exclusion of relevant articles published in other languages.

Conclusion

This review aimed to explore the role of large language models in oral and maxillofacial surgery. We analyzed 20 articles published between 2023 and 2024, with the largest share originating in the USA. These findings highlight the potential of LLMs to enhance various aspects of OMFS practice by offering timely, accessible, and accurate information that can aid in clinical decision-making and patient care. However, it is essential to acknowledge the limitations and challenges associated with their use. Advanced models such as ChatGPT 4 outperformed earlier versions (GPT 3.5 and others). Additionally, other LLMs, such as Google Bard, Gemini, Claude AI, and Microsoft Bing AI, have seen limited use and are less popular in the OMFS community. Some studies reported that LLM responses lacked complete details and exhibited only moderate accuracy. This variability in performance emphasizes the need for the continuous refinement of LLMs and highlights the importance of human oversight in clinical applications. Future research should address the identified gaps by developing specialized LLM applications tailored to OMFS, improving content generation accuracy, establishing a standard scoring system for LLM responses, and ensuring the ethical use of large language models.

Supplemental Material

sj-zip-1-mac-10.1177_00202940251344491 – Supplemental material for Exploring the capabilities of large language models in oral and maxillofacial surgery

Supplemental material, sj-zip-1-mac-10.1177_00202940251344491 for Exploring the capabilities of large language models in oral and maxillofacial surgery by Sulaiman Khan, Shahira Padinharepattel Mohamed, Md. Rafiul Biswas and Zubair Shah in Measurement and Control

Footnotes

Acknowledgements

Open Access funding provided by the Qatar National Library.

ORCID iD

Sulaiman Khan

Author contributions

S.K., M.R.B., and Z.S. developed the concept and crafted a research question aimed at retrieving articles from online repositories. S.K. and S.P.M. orchestrated the study’s design and spearheaded the analysis, with assistance from M.R.B. and Z.S. S.K. and S.P.M. authored and refined the original manuscript. All authors participated in the analysis, contributed intellectually, made critical revisions to the paper drafts, and endorsed the final version.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Open Access funding provided by the Qatar National Library.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

The data used and/or analyzed during the current study is available from the corresponding author on reasonable request.

Supplemental material

Supplemental material for this article is available online.

