Abstract
Purpose
Chat Generative Pre-trained Transformer (ChatGPT) is now used in various fields of healthcare to answer questions about health-related problems and to evaluate available information. Primary hyperparathyroidism is a common endocrine disorder. We aimed to evaluate the accuracy and quality of ChatGPT's responses to questions about hyperparathyroidism cases discussed at multidisciplinary endocrinology meetings.
Methods
ChatGPT-4 was asked to respond to 10 hyperparathyroidism cases evaluated at multidisciplinary endocrinology meetings. The accuracy, completeness, and quality of its responses were scored independently by two endocrinologists. Accuracy and completeness were rated on Likert scales, and quality on the global quality scale (GQS).
Results
No misleading information was detected in the responses. For diagnosis, the mean accuracy score (ranging from 1 to 5) was 4.9 ± 0.1 and the mean completeness score (ranging from 1 to 3) was 3.0. For further examination, the mean accuracy and completeness scores were 4.8 ± 0.13 and 2.6 ± 0.16, respectively. For treatment recommendations, the mean accuracy and completeness scores were 4.9 ± 0.1 and 2.4 ± 0.16, respectively. On the GQS, 80% of the responses were of high quality and 20% of moderate quality.
Conclusion
In this study, ChatGPT-4's accuracy and quality were generally high in responding to questions about hyperparathyroidism patients. Artificial intelligence may therefore serve as a valuable tool in healthcare. However, the limitations and risks of ChatGPT should also be evaluated.
Introduction
One of the most significant recent developments in artificial intelligence (AI) is the considerable advance in large language model technology. Chatbots are computer programs that simulate human communication using AI and natural language processing (NLP). ChatGPT (Chat Generative Pre-trained Transformer) is an NLP-based chatbot released by OpenAI in November 2022. 1 Since then, ChatGPT has been used to answer questions and evaluate information in various fields, including medicine, with outstanding results.2,3
ChatGPT has been used in various areas of medicine. Stokel-Walker and van Noorden found that ChatGPT produced accurate answers to some open-ended medical questions that were close to what a physician would provide. 3 ChatGPT has passed standardized exams in a wide range of professional medical fields in numerous countries, such as the United States Medical Licensing Examination® (USMLE).4,5 It has also performed well in providing individualized treatment recommendations for breast tumor boards and in infectious disease consultations.6,7 ChatGPT has likewise been evaluated in many studies in endocrinology: it was partially successful in the evaluation and management of obesity in type 2 diabetes mellitus and in the approach to prolactinoma, successful in the management of thyroid nodules, but in need of more up-to-date information for diabetes technology recommendations.8–11
Primary hyperparathyroidism (PHPT) is a common endocrine disorder characterized by excessive secretion of parathormone (PTH) from the parathyroid glands. There are a number of different clinical approaches to this disorder. Multidisciplinary team meetings are usually organized for the effective management of patients with parathyroid pathology. 12
Studies evaluating ChatGPT for endocrinological diseases are limited, and no data are yet available on its approach to hyperparathyroidism patients. Therefore, in this study, we aimed to evaluate how accurately and comprehensively ChatGPT answers open-ended questions about hyperparathyroidism cases discussed at multidisciplinary endocrinology meetings, as well as the quality of its answers.
Methods
Study population
In the present study, we evaluated a single disease group to ensure homogeneity and obtain clearer data about a specific disease. We asked ChatGPT-4 to comment on hyperparathyroidism cases evaluated at our multidisciplinary endocrine meetings in 2023 (within the last year) and then analyzed its answers. Our center is a tertiary education and research hospital, and patients with hyperparathyroidism are evaluated at multidisciplinary endocrine meetings. In these meetings, the participants discuss the diagnosis, differential diagnosis, further investigation if necessary, and treatment options for each patient. Treatment options include follow-up, medical treatment, and surgery.
Model input
The medical data of the patients discussed at the meetings were input into ChatGPT. We tested the model using ChatGPT-4 because it was the latest version. A model entry was designed for each patient, consisting first of clinical and laboratory data and then of open-ended questions investigating diagnosis, examination, and treatment. In the patient information section, the data that could affect diagnostic and treatment decisions were provided: serum albumin-corrected calcium, serum phosphorus, PTH, 25-OH vitamin D level, 24-hour urine calcium, estimated glomerular filtration rate, bone mineral densitometry results, and the results of imaging methods. ChatGPT was then asked about the diagnosis, the need for additional examinations, and treatment recommendations for the case whose data were entered. Table 1 presents a sample question and the answer generated by ChatGPT-4; a programmatic sketch of such a prompt is shown after the table. All questions were edited for grammar and clarity by two authors before being entered into ChatGPT-4. To account for response-to-response variation, each prompt was tested in duplicate on different days.
Sample questions and ChatGPT-generated answers.
ChatGPT: Chat Generative Pre-trained Transformer; CT: computed tomography; MEN1: Multiple Endocrine Neoplasia type 1; MRI: magnetic resonance imaging; PHPT: primary hyperparathyroidism; PTH: parathormone.
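For illustration only, the sketch below shows how a case entry of this kind could be assembled and submitted programmatically. This is an assumption made for reproducibility purposes: the study describes entering prompts into ChatGPT-4 directly, and the field names and laboratory values here are hypothetical, not taken from the study cases.

```python
# Hypothetical sketch of a case prompt, assuming the OpenAI Python SDK
# (the study itself used the ChatGPT-4 chat interface, not the API).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative laboratory data; the values are invented, not study data.
case = {
    "albumin-corrected calcium (mg/dL)": 11.2,
    "phosphorus (mg/dL)": 2.3,
    "PTH (pg/mL)": 142,
    "25-OH vitamin D (ng/mL)": 28,
    "24-hour urine calcium (mg/day)": 310,
    "eGFR (mL/min)": 88,
    "bone mineral densitometry": "osteoporosis at the distal radius",
    "imaging": "neck US: 12 mm hypoechoic lesion posterior to the right lobe",
}

# Clinical data first, then the three open-ended questions, as in the study design.
prompt = (
    "Patient data: "
    + "; ".join(f"{name}: {value}" for name, value in case.items())
    + ". Based on these findings: (1) What is the most likely diagnosis and "
      "differential diagnosis? (2) Are further examinations needed? "
      "(3) What treatment would you recommend?"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```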
Evaluation of accuracy, completeness, and quality
The responses were scored independently by two endocrinologists, and disagreements were reviewed and resolved by a blinded endocrinologist with at least 20 years of experience in endocrinology. The endocrinologists evaluated each chatbot response in three distinct categories: diagnosis and differential diagnosis, further investigation, and treatment recommendation. The evaluation addressed accuracy, completeness, and quality.
A five-point Likert scale was used for accuracy (1—totally wrong, 2—mostly wrong, 3—right and wrong approximately equal, 4—mostly right, and 5—totally right). A three-point Likert scale was used for completeness (1—incomplete, 2—partially incomplete, and 3—covers all aspects). 5
The previously used global quality scale (GQS) was applied to assess the quality of ChatGPT responses. Accordingly, 1 point indicates poor quality, and 5 points indicate excellent quality (Table 2). This scale is also used for quality classification. According to this classification, 1 and 2 points indicate low quality, 3 points moderate quality, and 4 and 5 points high quality. 13
Contents of GQS.
GQS: Global quality score.
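As a minimal illustration of the score-to-class mapping just described, the hypothetical helper below encodes the GQS classification; the function name and structure are ours and are not part of the published scale.

```python
def gqs_class(score: int) -> str:
    """Map a 1-5 GQS point score to its quality class (per the text above)."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError("GQS scores range from 1 to 5")
    if score <= 2:
        return "low quality"       # 1 or 2 points
    if score == 3:
        return "moderate quality"  # 3 points
    return "high quality"          # 4 or 5 points

assert gqs_class(4) == "high quality"
```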
Ethical approval
The study was approved by the Ankara Training and Research Hospital Ethics Committee (date and number: February 7, 2024/E24-3). It was conducted in accordance with the Declaration of Helsinki. The requirement for informed consent was waived due to the retrospective nature of this study.
Statistical analysis
All statistical analyses were performed using SPSS version 27 (IBM SPSS Statistics for Windows, Armonk, NY: IBM Corp). The agreement between the two authors who independently evaluated the ChatGPT responses was tested using Weighted Cohen's Kappa coefficient. The Shapiro-Wilk test was utilized to check the normality of the distribution. The Kruskal-Wallis and the Mann-Whitney U tests were implemented for the nonparametric variables of the relationship between the data.
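For readers without SPSS access, the sketch below reproduces the same analyses with open-source Python libraries (scipy and scikit-learn). The score vectors are illustrative placeholders, not the study data.

```python
# Open-source equivalent of the reported analyses (the study used SPSS v27).
from scipy.stats import shapiro, kruskal, mannwhitneyu
from sklearn.metrics import cohen_kappa_score

# Hypothetical accuracy scores from the two raters (1-5 Likert scale).
rater1 = [5, 5, 4, 5, 5, 4, 5, 5, 5, 4]
rater2 = [5, 4, 4, 5, 5, 5, 5, 5, 4, 4]

# Inter-rater agreement: weighted Cohen's kappa on the ordinal scores.
kappa = cohen_kappa_score(rater1, rater2, weights="linear")

# Normality check on the pooled scores.
_, p_normal = shapiro(rater1 + rater2)

# Comparison across the three question categories (placeholder completeness
# scores), with a pairwise Mann-Whitney U test as in the subgroup analysis.
diagnosis = [3] * 10
workup = [3, 3, 2, 3, 2, 3, 3, 2, 3, 2]
treatment = [2, 2, 2, 2, 2, 2, 3, 3, 3, 3]
_, p_kruskal = kruskal(diagnosis, workup, treatment)
_, p_pairwise = mannwhitneyu(workup, treatment)

print(f"kappa={kappa:.3f}, Shapiro p={p_normal:.3f}, "
      f"Kruskal-Wallis p={p_kruskal:.3f}, Mann-Whitney p={p_pairwise:.3f}")
```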
Results
In total, 10 cases evaluated in endocrinology for hypercalcemia and hyperparathyroidism were input into ChatGPT-4. All of the queries involved questions about diagnosis, further investigation, and treatment options. No misleading information was detected in any of the ChatGPT-4 responses. All the questions and answers are presented in the Supplemental materials.
ChatGPT-4's responses to the questions were generally consistent between the first and second sessions. All of the responses emphasized the need for individualization of the treatment and the need to evaluate the cases by multidisciplinary health professionals who are experts in the field.
Evaluation of accuracy and completeness
The weighted Cohen's kappa coefficient between the two authors was 0.714 and 0.760 for accuracy and 0.780 and 0.754 for completeness in the first and second sessions, respectively. Table 3 presents the accuracy and completeness scores for each question in terms of diagnosis, further examination and evaluation, and treatment.
Grade of responses by ChatGPT in accuracy and completeness.
ChatGPT: Chat Generative Pre-trained Transformer.
With respect to diagnosis and differential diagnosis, 90% of responses received the full accuracy score of 5, and 100% received the full completeness score. For further examination, 80% received the full accuracy score of 5 and 60% the full completeness score of 3. For treatment recommendations, 90% received the full accuracy score; in terms of completeness, 40% received full points and 60% received 2 points (partially incomplete).
For questions about diagnosis and differential diagnosis, the mean accuracy scores (1–5) were 4.9 ± 0.1 and the mean completeness scores (1–3) were 3.0. In the answers given in terms of further examination, the mean accuracy and completeness scores were 4.8 ± 0.13 and 2.6 ± 0.16, respectively. In the answers given in terms of treatment recommendations, the mean accuracy and completeness scores were 4.9 ± 0.1 and 2.4 ± 0.16, respectively (Figure 1).

Graphical distribution of the grades assigned to the ChatGPT language model's answers for accuracy and completeness in the three question groups. ChatGPT: Chat Generative Pre-trained Transformer.
When accuracy was compared between the groups (diagnosis and differential diagnosis, further investigation, and treatment recommendation), no statistically significant difference was observed (p = 0.342). When completeness was compared between the same groups, a statistically significant difference was observed (p = 0.017). In the subgroup analyses, a significant difference was found between diagnosis and further investigation and between diagnosis and treatment recommendation, but not between further investigation and treatment recommendation (p = 0.029, p = 0.004, and p = 0.384, respectively). Responses were most often incomplete in the treatment recommendation category.
In the diagnosis and differential diagnosis section, familial hypocalciuric hypercalcemia (FHH) was included in the differential diagnosis in only one case, even though the 24-hour urine calcium was high; this did not recur in the second session of the same case. Points were deducted for accuracy here. In the further examination section, in some cases of PHPT that could not be localized, 18F-fluorocholine (FCH) positron emission tomography/computed tomography (PET/CT) was not offered as a recommendation, and points were deducted for completeness. In the treatment recommendations section, cinacalcet was not mentioned in some cases where it was indicated, although it was recommended in the second session of those cases. In some applicable cases, the neck exploration and radiofrequency ablation options were not offered; points were deducted for completeness here. Otherwise, ChatGPT was quite successful. It detected FHH, lithium use, and multiple endocrine neoplasia (MEN) syndrome when relevant in the differential diagnosis. It appropriately recommended assessing additional comorbidities and evaluating the other components when MEN syndrome was considered. It was also very successful in identifying which patients had a surgical indication and which did not.
Evaluation of quality
The weighted Cohen's kappa coefficient between the two authors was 0.730 and 0.737 for quality in the first and second sessions, respectively. Table 4 shows the score distribution of ChatGPT-4 responses to each question according to the GQS quality classification. Overall, 80% of the responses were of high quality and 20% of moderate quality (Figure 2).

GQS quality classification of the ChatGPT language model's responses. ChatGPT: Chat Generative Pre-trained Transformer; GQS: global quality score.
Grade of responses by ChatGPT in GQS.
ChatGPT: Chat Generative Pre-trained Transformer; GQS: global quality score.
Discussion
In this study, we evaluated the accuracy, comprehensiveness, and quality of the ChatGPT-4 language model in answering open-ended questions about diagnosis, further examination, and treatment by providing clinical information of hyperparathyroidism patients evaluated at endocrinology meetings. ChatGPT's responses, which were evaluated by endocrinologists, generated generally accurate and comprehensive information for diagnosis, further investigation, and treatment recommendations. The study results revealed that ChatGPT-4 responses were also generally of high quality.
Hyperparathyroidism is commonly encountered in endocrinology outpatient clinics and can sometimes be confusing, requiring patient-based decisions in diagnosis and treatment. We therefore planned to evaluate AI as an auxiliary tool for multidisciplinary councils. AI can be used as a supporting tool to increase efficiency in many areas of medicine, especially radiology and pathology.14,15 The cases in our study were frequently encountered presentations selected from real patients managed in our own clinic and discussed at our own multidisciplinary meetings. International guidelines were taken into account when evaluating the accuracy and completeness of the answers: the Fifth International Workshop on Primary Hyperparathyroidism summary statement and guidelines for the evaluation and management of primary hyperparathyroidism (2022); the European Society of Endocrinology (ESE) expert consensus on practical management of specific aspects of parathyroid disorders in adults and in pregnancy, from the ESE Educational Program of Parathyroid Disorders (PARAT 2021); and the European Association of Nuclear Medicine (EANM) practice guidelines for parathyroid imaging (2021).
PHPT is a clinical condition characterized by elevated PTH and the clinical symptoms and signs of hypercalcemia. Although mostly sporadic, familial cases may occur as a component of multiple endocrine neoplasia (MEN1, MEN2A). 16 Genetic testing is recommended in young patients (<30 years), those with a family history of PHPT, and those with multigland disease. 17 ChatGPT-4 was able to recommend genetic testing and examination for MEN in cases under the age of 30 and in cases with multigland involvement.
Familial hypocalciuric hypercalcemia and lithium treatment may also cause elevated serum calcium and PTH values, so they should be considered in the differential diagnosis of PHPT. FHH is a hereditary disease caused by mutations affecting the calcium-sensing receptor. The presence of hypocalciuria (excretion of less than 50 mg/day) is essential for the diagnosis. 18 Lithium therapy likewise leads to acquired calcium-receptor insensitivity. 19 ChatGPT-4 was able to make the differential diagnosis in patients with hypocalciuria and in patients using lithium, and it suggested genetic testing as a further examination in cases of suspected FHH. It was also able to distinguish primary from tertiary hyperparathyroidism.
After diagnosis, localization studies should be performed. Imaging methods include ultrasonography, Tc-99m sestamibi imaging, magnetic resonance imaging, or preferably 4-dimensional computed tomography (4D CT). ChatGPT-4 was able to request these examinations, including 4D CT. FCH PET/CT is a novel imaging method for parathyroid pathologies that has been applied in some centers, especially in Europe, since 2017. 20 In our cases, FCH PET/CT was never mentioned among the advanced examination and imaging methods in ChatGPT-4's responses. Presumably this is because the method has not yet been included in hyperparathyroidism management guidelines and is not yet widespread in the United States.21,22 Selective venous sampling is an invasive method for localizing parathyroid lesions. 23 This technique was included in ChatGPT-4's recommendations. Bilateral exploration with intraoperative parathyroid hormone monitoring may be recommended by an experienced parathyroid surgeon for patients with primary hyperparathyroidism whose lesion cannot be localized but who meet the operative criteria. 24 However, ChatGPT-4 did not recommend this surgical technique for the cases that could not be localized.
Surgical treatment is recommended for symptomatic individuals and for asymptomatic individuals who meet the operative criteria of the Fifth International Workshop on Asymptomatic Primary Hyperparathyroidism guideline. 21 For patients who meet the surgical criteria but cannot or will not undergo surgery, medical treatments such as bisphosphonates or cinacalcet may be preferred.16,25 ChatGPT-4 was able to give correct answers in determining the indications for operation, and it generally produced correct answers in recommending medical treatment for elderly and inoperable patients. However, in some cases it was incomplete in recommending cinacalcet treatment and, similarly, in offering the radiofrequency ablation option. It also failed to offer the option of neck exploration in a patient whose parathyroid adenoma could not be localized but who had an indication for surgery.
ChatGPT has been used in several other professional healthcare fields before. Several studies evaluated its responses to multiple-choice medical exams, such as the USMLE and the Medical College Admission Test (MCAT), and reported successful performance.4,26 The literature also contains a study on the use of ChatGPT in situations requiring a multidisciplinary approach, similar to ours: the recommendations of ChatGPT for primary breast cancer cases were compared with those of a multidisciplinary tumor board, with an agreement rate of 64.2%. 6 The relatively low rate of agreement may reflect the fact that treatment options in oncology change rapidly and that ChatGPT's database is restricted to information it learned before 2021. The researchers noted that ChatGPT provided general responses rather than individual-specific recommendations. In contrast, our study showed that ChatGPT offered individual diagnosis, further examination, and treatment options for the cases under investigation. The literature on the approach to primary hyperparathyroidism is more stable, which may explain this difference.
Another study analyzed ChatGPT's performance in Korean general surgery board exams. In this study, 280 questions were evaluated, and ChatGPT-4 had an accuracy rate of 76.4% in the specific exam. ChatGPT was found to be highly successful in understanding complex surgical clinical information. 27
Another study in the literature evaluated the accuracy and completeness of ChatGPT for medical questions. In this study, 284 medical questions were entered into ChatGPT. Accuracy was evaluated using a 6-point Likert scale, with a median accuracy score of 5.5 and an average accuracy score of 4.8. The completeness score was also evaluated using a three-point Likert scale. The data analysis revealed an average completeness score of 2.5. In conclusion, it was stated that ChatGPT produces largely accurate information for various medical queries. 5
However, in another study investigating the performances of ChatGPT-3 and ChatGPT-4 in gastroenterology self-assessment tests, the researchers reported that both models failed the test, and therefore they did not recommend their current use in gastroenterology medical education. 28
Our study had certain limitations. First, the small sample size may reduce the robustness of the findings, limiting the power of the tests to detect significant differences. Second, the responses were evaluated solely in English, so the results cannot be generalized to all languages; although ChatGPT-4 is available in numerous other languages, studies conducted in English constitute the great majority of the literature. Moreover, ChatGPT does not produce standard answers to the questions asked; on the contrary, it may give different answers in different sessions. Although the answers are often consistent, this cannot be generalized to all situations. Furthermore, ChatGPT can only access data that predate 2021; it is not actively connected to the Internet and does not contain up-to-date data. Finally, no standard method exists for evaluating ChatGPT-4 responses, which may cause differences in assessments, and the evaluation of responses introduces possible bias because they are not measured objectively.
Conclusion
In conclusion, this study demonstrated the potential of the AI-based ChatGPT-4 to answer open-ended questions about the data of real hyperparathyroidism patients. The accuracy and quality of the answers were high. ChatGPT can be a valuable resource for enhancing the efficiency of healthcare and patient-based individual decision making. However, the limitations and risks of ChatGPT, as well as its potential benefits, should be known, and clinicians and patients should be informed about them. The reliability of, and ethical debates about, ChatGPT in the medical field should be further clarified. It is not appropriate to use ChatGPT alone in patient management; diagnosis and treatment must be carried out by a physician. AI tools such as ChatGPT can never replace medical professionals; they can only serve as helpful tools in healthcare. Further studies are needed to evaluate the performance of ChatGPT in the medical field more clearly.
Supplemental Material
Supplemental material, sj-docx-1-dhj-10.1177_20552076241278692, for "Evaluation of the accuracy and quality of ChatGPT-4 responses for hyperparathyroidism patients discussed at multidisciplinary endocrinology meetings" by Işılay Taşkaldıran, Çağatay Emir Önder, Püren Gökbulut, Gönül Koç and Şerife Mehlika Kuşkonmaz in DIGITAL HEALTH.
Footnotes
Contributorship
All authors contributed to the conception and design of the study. Material preparation, data collection, and analysis were carried out by IT, CEO, PG, GK, and SMK. The first draft of the article was written by IT, and all authors commented on previous versions. All authors have read and approved the final article.
Data availability statement
The data that support the findings of this study are available on request from the corresponding author.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical approval
The study was approved by the Ankara Training and Research Hospital Ethics Committee (date and number: February 7, 2024 / E24-3). It was conducted in accordance with the Declaration of Helsinki. The requirement for informed consent was waived due to the retrospective nature of this study.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Guarantor
IT
Supplemental material
Supplemental material for this article is available online.
References