Abstract
Introduction
Acute appendicitis is a common cause of acute abdomen in secondary care. Despite advancements in diagnostics, misdiagnosis and negative appendectomies remain significant. Artificial Intelligence (AI), particularly machine learning (ML) and deep learning, shows promise in improving diagnostic accuracy.
Materials and Methods
A literature review using PubMed and Cochrane databases included studies on AI’s role in diagnosing and prognosing appendicitis. Studies relying solely on clinical or radiology reports were excluded.
Results
AI models, particularly random forest (RF), logistic regression (LR), and neural networks (NN), demonstrated high diagnostic accuracy, with RF outperforming others. Machine learning methods like SVM and XGBoost (XGB) were effective in predicting appendicitis prognosis, especially in distinguishing complicated cases. AI models outperformed traditional diagnostic scores, such as the Alvarado score.
Conclusion
AI has significant potential to enhance the diagnosis and prognosis of acute appendicitis, but challenges in data requirements and standardisation must be addressed for widespread clinical use.
Introduction
Acute appendicitis is a prevalent cause of acute abdominal pain affecting both paediatric and adult populations, with an annual incidence rate of 5.7-50 patients per 100 000 individuals in developed countries. 1 The pathophysiology of acute appendicitis arises from the obstruction of the appendiceal orifice, which can be attributed to various factors such as infections, fecaliths, lymphoid hyperplasia, or neoplasms. 2 This condition can be categorised as simple or complex. Simple appendicitis denotes non-perforated or non-gangrenous appendicitis, with a mortality risk of 0.1% for non-gangrenous cases and 0.6% for gangrenous appendicitis. Conversely, complex appendicitis may rapidly progress to abscess or perforation, carrying an elevated mortality rate of approximately 5%. 1
The traditional method of diagnosing appendicitis involves clinical assessment and correlating laboratory inflammatory markers. However, to enhance diagnostic accuracy, ultrasonography and computed tomography (CT) scanning are recommended.2,3 Nonetheless, due to the potential harm from ionising radiation used in CT scanning, it is often avoided in pregnant women and children. 4 For these groups, ultrasonography is frequently used as an alternative. However, limitations of ultrasonography include operator dependency and patient factors. Even though the introduction of systematic preoperative imaging has reduced negative appendectomy rates in recent times, this number remains high, reaching rates of 32.8% in some studies. 3 It is important to note, nonetheless, that although one study reported a 32.8% negative appendectomy rate, 3 more recent analyses suggest rates have fallen below 10% in most high-resource settings, particularly with the advent of preoperative imaging and scoring systems. 1 Additionally, 4.3% and 6.3% of children who had undergone appendectomies in the United States and Canada respectively, were found to have had unnecessary appendectomies despite a clinical and/or radiological diagnosis. 5
Current research increasingly emphasises the application of Artificial Intelligence (AI) to support clinical decision-making and reduce the occurrence of missed and incorrect diagnoses of acute appendicitis.6,7 Machine learning (ML), a specific type of AI, entails training computers to learn from datasets using algorithms to make predictions or decisions. ML algorithms such as random forests (RF) and support vector machines (SVM) are utilised for classification tasks. 8 In the case of acute appendicitis, these classification tasks can be employed to predict a diagnosis based on laboratory inflammatory marker values, patient signs and symptoms, and imaging findings. Deep learning, a subset of ML, uses multiple neural networks like convolutional neural networks (CNNs) to recognise patterns and analyse data or images, which can be valuable for interpreting CT scans in appendicitis. 9 These forms of AI, among others, have garnered increased interest and usage in acute appendicitis diagnosis and subsequent management. This literature review will examine the effectiveness of AI in assisting clinical decision-making for this condition.
Materials and Methods
Although this is a narrative review, a systematic process was followed to identify relevant studies. A comprehensive literature search was conducted to identify relevant studies evaluating the application of artificial intelligence (AI), including machine learning (ML) and deep learning (DL), in the diagnosis and prognosis of acute appendicitis. Searches were performed using several major electronic databases including PubMed, Ovid/MEDLINE, Google Scholar and the Cochrane Library, covering the period from January 1, 1990, to December 31, 2024. The search strategy employed a wide range of terms to capture relevant studies, including but not limited to: “acute appendicitis”, “appendicitis diagnostics”, “appendicitis imaging”, “artificial intelligence in appendicitis”, “machine learning in appendicitis”, “deep learning for appendicitis diagnosis”, “AI in diagnosis of appendicitis”, “predictive models in appendicitis”, “clinical decision support for appendicitis”, and “prognosis of appendicitis using AI”. The search focused on identifying studies that explore the role of AI, including machine learning and deep learning algorithms, in the diagnosis and prognosis of acute appendicitis.
The following Boolean search strategy was used: (“acute appendicitis” OR “appendicitis”) AND (“diagnosis” OR “prognosis”) AND (“artificial intelligence” OR “machine learning” OR “deep learning” OR “neural networks” OR “natural language processing”).
Medical Subject Headings (MeSH) were applied where appropriate. The search was limited to studies published in English and conducted on human subjects. Manual searches of reference lists from included studies and relevant reviews were also performed to ensure completeness.
Studies were eligible for inclusion if they investigated the use of AI-based tools (including ML, DL, NLP, or ensemble methods) in the diagnosis or prognosis of acute appendicitis, and reported quantitative performance metrics. The type of AI model used (eg, logistic regression, random forest, support vector machine, convolutional neural network) was noted for each study.
QUADAS-2 Risk of Bias Assessment for Included Diagnostic AI Studies in Appendicitis. Evaluated Across Four Domains With Overall Bias Summarised
Given the heterogeneity of AI models, input features, and outcome measures across studies, statistical analysis was not conducted. Instead, findings were synthesized narratively with thematic emphasis on: AI model performance (diagnosis and prognosis separately); type and quality of input variables; validation approaches; study population (adult vs paediatric); and implementation potential and limitations.
Study Quality and Validation Methods
Many included studies were conducted at single centres with small sample sizes, often under 500 participants. This raises concerns about overfitting and limited generalisability. For example, Aydin et al. trained models on 300 paediatric cases, limiting their external validity.
Only 4 studies used external validation datasets (eg, Kim et al., Reismann et al.), while most applied internal validation or cross-validation, increasing the risk of optimism bias. Additionally, paediatric studies (eg, Reismann et al., Aydin et al.) showed significantly different input features and model performance compared to adult cohorts (eg, Hsieh et al., Shahmoradi et al.), suggesting that AI models may not generalise well across age groups without retraining.
Understanding of AI
AI encompasses the capability of computers to replicate human brain processing and functionality. AI in healthcare includes distinct subfields such as machine learning (ML), which learns patterns from numerical or image data, and natural language processing (NLP), which focuses on interpreting unstructured clinical text. While both contribute to AI decision-support systems, they operate using different data modalities and architectures.
ML, widely utilised in medical research, enables a system to improve its performance and acquire knowledge from experience by exposure to “training data” or a running algorithm. In appendicitis, ML can be applied in various ways, such as developing a diagnostic system capable of analysing diverse inputs like clinical features and laboratory results or determining the likelihood of appendicitis. Moreover, ML can be utilised to interpret medical imaging, such as ultrasound scans, to identify abnormalities, and it can even predict patient outcomes, such as post-appendectomy recovery time.10-12
The following passage offers an overview of various subsets of ML relevant to this review, organised into four main sections.
13
1. Statistical and machine learning classifiers are models that categorise inputs based on known training data. Examples include logistic regression (LR), Naive Bayes, random forest (RF), support vector machine (SVM), k-nearest neighbours (KNN), and decision trees (DT). 2. Neural networks (NN) are machine learning systems inspired by the biological nervous system. They utilise interconnections to enhance processing speed and accuracy, leading to improved outcomes. 3. Ensemble learning machine models use multiple learning algorithms to achieve superior predictive performance. Examples include RF, gradient boosting (GB), extreme gradient boosting (XGB), and CatBoost. 4. Natural language processing (NLP) involves interpreting and manipulating human-generated spoken or written information, focusing on knowledge rather than just data. NLP is particularly valuable in extracting clinical information from patient records and evaluating patients’ status and outcomes.
Role of AI in Diagnosis of Appendicitis
Summarising Findings of Issaiy et al. 10 With Accuracy and Area Under Curve (AUC) Found for Each AI Model
Comprehensive Diagnostic AI Performance Metrics Summary
A recent extensive analysis conducted by Lam et al. 7 delved into the potential of AI in diagnosing paediatric conditions. The analysis encompassed 10 studies, with a specific focus on AI’s application in diagnosing appendicitis, and 7 of the studies addressed this area specifically. The RF algorithm was employed in 4 of the studies, while NN, DT and LR were each used in 2 studies. Some studies incorporated multiple AI methods. From the 7 studies, DT, LR, and RF were identified as the most suitable models in individual analyses, with respective AUC values of 0.93, 0.84, and 0.91. RF emerged as the most effective model in 2 studies, exhibiting AUC values ranging from 0.86 to 0.96. However, due to the small sample size, Lam et al. 7 were unable to definitively establish the superiority of any specific AI model. Both systematic reviews concurred on the diagnostic superiority of AI over any current appendicitis score, with several studies, including Alvarado and Adult Appendicitis Score as the main comparisons.7,10 AI models outperformed these scores, with studies such as Park et al. 16 and Hsieh et al. 17 obtaining P values of P < .001 and P < .003 respectively.
In a study conducted by Aydin et al, 18 six distinct AI models were trained to identify appendicitis in paediatric patients. The researchers utilised the same training set for all six models, incorporating SVM, generalised linear models, RF, DT, k-nearest neighbours, and Naive Bayes. RF consistently yielded the highest performance across all metrics: an AUC of 99.67%, accuracy of 97.45%, sensitivity of 97.79%, and specificity of 97.21%. Despite RF’s superior accuracy, the researchers opted for DT over RF due to its enhanced interpretability, enabling them to comprehend the correlation between blood variables and the ailment. This comprehension is anticipated to facilitate the development of a decision support system in the future. Even though NN appears to be the AI model most capable of creating an accurate algorithm to diagnose appendicitis, they are limited by the vast amount of training data required, as seen in Issaiy et al. 10 RF is regarded as a better option due to its versatility in processing all the variables present in appendicitis, whether numeric or not, and due to it being less influenced by overfitting than NN. LR and DT, however, are preferable to the rest if result interpretation is necessary, because of their clear decision-making process.
Input Variables in Diagnosis
The selection of input significantly impacts the overall performance of an AI model. When designing AI models, it’s essential to consider both the quantity and quality of the input data. Having too few input features can lead to suboptimal performance, while including too many inputs can result in overfitting the training dataset. 12 AI models use input data to create parameters and algorithms for accurate diagnosis by giving weight to each parameter in the most favourable way. In the case of appendicitis diagnosis, older studies primarily focused on demographics and clinical variables, while newer studies included a higher percentage of laboratory and imaging-related inputs.16,19-21 The shift in input features away from clinical parameters is an attempt to reduce bias and move towards more standardised and less biased numerical parameters. Some studies even omitted common clinical appendicitis signs to reduce noise and subjectivity, allowing the AI to reach conclusions more efficiently.
The selection of input features for an AI model typically relies on the expertise of professionals or existing research findings. However, Xia et al. 20 deviated from this norm by leveraging RF to identify inputs based on the mean decrease in accuracy. These selected inputs were subsequently integrated into their primary diagnostic AI model (SVM) to pinpoint crucial features and eliminate superfluous noise. The amalgamation of manual and AI-driven techniques for input selection may represent the most optimal approach for discerning the most impactful inputs.
To enhance their diagnostic capacity further, Kim et al. 11 developed an AI model specifically designed to interpret appendiceal features from ultrasound images, significantly enhancing diagnostic accuracy. Reismann et al. 22 further demonstrated the power of AI in image interpretation by training a model using lab results, which achieved an AUC of 0.8. When appendiceal diameter data from ultrasound scans were incorporated, the AUC increased to 0.9. Rajpurkar et al. 23 showed that AI could interpret CT images with an AUC of 0.81 for diagnosing appendicitis. They also found that pretraining the AI on video data, rather than static images, improved diagnostic accuracy through data augmentation.
Overall, AI’s ability to interpret imaging, such as ultrasound and CT scans, has proven superior to human interpretation in identifying appendicitis, leading to more accurate diagnoses.
AI models that do not rely on imaging may still be valuable in settings where imaging is not readily available. Although developing AIs focusing on clinical signs over numerical data may pose challenges due to sample size limitations, they could still be beneficial in supporting primary care providers with limited access to costly laboratory tests.24-26 For instance, Shikha et al.’s 27 AI assisted trainee doctors, enhanced their diagnostic success rate for identifying appendicitis from 70% to 97%. The AI leveraged the trainee doctors’ clinical findings in conjunction with white cell count including neutrophils to yield more accurate predictions, effectively transferring accumulated expertise from experts to trainees through their algorithm.
Role of AI in Prognosis of Appendicitis
The prognosis for appendicitis often involves categorising it as either complicated or uncomplicated. 28 Prompt identification of complicated appendicitis is important as these cases may require surgical intervention or interventional radiology procedures. In contrast, uncomplicated appendicitis may be considered for non-operative management in select cases.
While studies examining AI’s diagnostic capabilities are plentiful in the literature, research on AI’s prognostic abilities is less common. In contrast to appendicitis diagnosis, the most frequently used AI models in appendicitis prognosis were different. According to two separate systematic reviews by Bhandarkar et al. 12 and Issay et al. 10 analysing a total of 16 papers on the role of AI in appendicitis prognosis, ML models were exclusively utilised over NLP. Among the ML models, ensemble models were the most prevalent, encompassing 15 out of the 35 AI models used in the 16 studies. This is in contrast to ML classifiers, which were the second most popular, with 10 models, closely followed by statistical classifiers, accounting for 9. NNs only represented 6 AI models, with other models making up the remainder. The four most commonly used AIs within the two systematic reviews were SVM and LR each with 6 instances, and XGB/Gradient Boosting (GB) and RFs with 4 and 3 instances, respectively.
SVM are a machine learning classifier known for their exceptional ability to establish distinct boundaries between different classes, leading to improved generalisation to new data. They are frequently employed in appendicitis prediction due to their proficiency in binary classification, allowing them to identify the hyperplane that separates complex and uncomplicated cases. Nevertheless, SVMs exhibit higher sensitivity to noise when compared to other AI models like ML ensembles. 29
XGBoost, or XGB, is a machine learning ensemble that operates similarly to Random Forests by generating multiple DTs. However, unlike RF, XGB constructs its DTs sequentially, with each tree correcting the collective prediction errors of the existing trees through gradient boosting. Although XGBoost is recognised for its exceptional performance and efficiency, it demands thorough tuning and presents challenges in interpretation due to its operational complexity. Nevertheless, its iterative predictive enhancement makes it well-suited for capturing dynamic changes in a patient’s condition.30,31
Studies Utilising the 4 AI Models: RF, XGB/GB, SVM & LR in Appendicitis Prognosis
Comparing the accuracy of AI models across different studies proved to be challenging because of the use of various performance measures, including AUC and accuracy, with some studies not utilising either. Generally, diagnostic AIs demonstrated higher accuracy in comparison to prognostic AIs. For example, Bhandarkar et al.'s 12 systematic review revealed an average AUC of 0.825 for diagnostic AIs, compared to 0.774 for prognostic AIs. However, some studies reported exceptionally high AUC values, such as 0.97 in Abkulut et al, 32 highlighting the potential of AI in prognosis. The remarkable accuracy in prognostic capabilities offers numerous advantages for medical practitioners, including support in decision-making, implementation of action plans, and assessment of treatment response.
In a specific study, Li et al. 33 identified LR as the preferred AI for distinguishing complicated and uncomplicated appendicitis cases in pregnant women. This underscores the importance of carefully selecting AI models tailored to distinct healthcare settings.
Management of complicated appendicitis often includes drainage and delayed surgery, particularly in patients with localised abscesses or poor surgical fitness. Conversely, uncomplicated appendicitis may be treated conservatively with antibiotics in selected cases. Thus, accurate classification by AI models may inform both escalation and de-escalation of care.
Numerous studies have examined postoperative complications. For instance, Eickhoff et al 34 conducted a study that developed a RF model to forecast postoperative outcomes following perforated appendicitis. The RF model effectively predicted the necessity for intensive care with 77% accuracy, an ICU stay exceeding 24 h with 88% accuracy, complications assessed with Clavien-Dindo scores >3 with 68% accuracy, reoperation subsequent to initial appendectomy with 74% accuracy, the requirement for oral antibiotic therapy after discharge with 79% accuracy, as well as hospital stay duration and the occurrence of surgical site infections. Their AI showcased the extensive prognostic capabilities of AI, aiding in predicting whether further prophylactic treatment may be necessary. Similarly, Alramadhan et al. 34 conducted a comparable study using ANN to forecast the risk of intra-abdominal abscess post-appendectomy with an accuracy of 89.4%.
The prognostic input characteristics were similar to those of the diagnostic AIs, but they relied more on laboratory results and had a larger number of inputs. AIs examining postoperative outcomes included unique inputs, as seen in Eickhoff et al. 34 Their model utilised various parameters such as ASA score, type of surgery and incisions, comorbidities, blood test results, and patient demographics to predict patient outcomes. 35
Direction of AI in Appendicitis
This narrative review raises the question of ‘Could AI also be utilised to aid in making surgical decisions?’ AI holds significant potential in aiding the clinician’s decision making, including selecting patients for operative vs conservative management. For instance, Monsell et al. 36 discovered that 77% of patients initially treated with antibiotics but later requiring appendectomies within 30 days exhibited acute clinical signs, underscoring the significance of further exploration into these cases using AI to potentially recognise patterns and develop an algorithm to determine whether non-surgical treatment outweighs the risks of surgery on an individual basis. Although the complexity of surgical decisions poses a challenge, the rapid advancement of technology and improvements in AI may make this a feasible prospect.
Studies Encompassing the 4 AI Models: NN, RF, LR & DT in Appendicitis Diagnosis
Footnotes
Limitations and Ethical Considerations
AI models risk replicating bias if trained on non-representative datasets (eg, overrepresentation of adult males or urban hospitals). Without sufficient diversity, outputs may underperform in marginalised populations. Furthermore, many AI models—particularly neural networks—function as ‘black boxes,’ making it difficult to interpret their decisions. Explainable AI (XAI) techniques such as SHAP values or saliency maps are gaining traction but remain underused in appendicitis models.
However, the widespread adoption of AI in appendicitis care requires harmonisation of laboratory input ranges, imaging protocols, and data labelling across institutions. For example, the definition of leukocytosis or appendiceal diameter cut-offs may vary between centres.
Moreover, regulatory approval is a critical hurdle. In the United States, AI tools for diagnosis must be approved by the FDA as Software as a Medical Device (SaMD). In Europe, the Medical Device Regulation (MDR) governs similar approvals.
Economic considerations include the cost of model training, infrastructure needs (eg, cloud computing), and integration into EHR systems. Smaller hospitals may lack the resources to adopt or validate these tools without institutional or national funding support.
Data privacy remains a key concern. Robust de-identification, ethical approval, and GDPR/HIPAA compliance are prerequisites for AI development. Finally, the lack of external validation in most studies limits real-world deployment.
Conclusion
AI shows great promise for widespread implementation in healthcare, particularly in diagnosing and managing acute appendicitis. AI’s diagnostic abilities can reduce complications from late detection, optimise testing, and lower the incidence of diagnostic errors. Its use in medical imaging interpretation, such as ultrasound and CT scans, has proven superior to human interpretation, and AI models can assist healthcare providers in challenging cases where traditional methods fall short. Prognostically, AI can help identify complicated cases of appendicitis, enabling timely surgical intervention, improving postoperative outcomes, predicting complications, and helping tailor treatments to individual patient needs. While current AI models face challenges such as the requirement for large datasets and the need for standardised inputs, these hurdles are likely to be addressed as the technology continues to advance. Additionally, high-quality literature is essential to further develop our understanding of what systems are better suited for specific tasks. Despite the advancements in AI technologies, challenges remain, such as the need for large, high-quality datasets and standardised inputs.
Author Contributions
Authors Mohamad Bashir and Ali Murtada were involved in the conceptualization and supervision of the manuscript. Authors Samuel Ghattas, Samuel Rezk, Marco David Bokobza De la Rosa, Fatima Kayali, Albert Mensah, Matti Jubouri, Shuaiyb Majid and Hussam Khougali Mohamed were involved in the investigation, methodology, data curation and original writing of the manuscript. Authors Fatima Kayali and Matti Jubouri were involved in project administration. Authors Feroze Ahmed Mir, Ian Williams, Damian M. Bailey, Mohamad Bashir and Ali Murtada were involved in the review and editing of the final draft. All authors read and approved the final submitted version of this manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Author D.M.B. is supported by the Royal Society Wolfson Research Fellowship (#WM170007).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The evidence used to support this review is publicly available in electronic databases including PubMed, Ovid/Medline, Scopus and Google Scholar. No new/original data was generated for the purpose of this review.
