Abstract
Background:
Capsule endoscopy (CE) is a valuable tool for assessing inflammation in patients with Crohn’s disease (CD). The current standard for evaluating inflammation relies on validated scores (and clinical laboratory values), such as the Lewis score (LS), Capsule Endoscopy Crohn’s Disease Activity Index (CECDAI), and ELIAKIM. Recent advances in artificial intelligence (AI) have made it possible to automatically select the most relevant frames in CE.
Objectives:
In this proof-of-concept study, our objective was to develop an automated scoring system using CE images to objectively grade inflammation.
Design:
Pan-enteric CE videos (PillCam Crohn’s) performed in CD patients between 09/2020 and 01/2023 were retrospectively reviewed and LS, CECDAI, and ELIAKIM scores were calculated.
Methods:
We developed a convolutional neural network-based automated score consisting of the percentage of positive frames selected by the algorithm (for small bowel and colon separately). We correlated clinical data and the validated scores with the artificial intelligence-generated score (AIS).
Results:
A total of 61 patients were included. The median LS was 225 (0–6006), CECDAI was 6 (0–33), ELIAKIM was 4 (0–38), and SB_AIS was 0.5659 (0–29.45). We found a strong correlation between SB_AIS and LS, CECDAI, and ELIAKIM scores (Spearman’s r = 0.751, r = 0.707, r = 0.655, p = 0.001). We found a strong correlation between LS and ELIAKIM (r = 0.768, p = 0.001) and a very strong correlation between CECDAI and LS (r = 0.854, p = 0.001) and CECDAI and ELIAKIM scores (r = 0.827, p = 0.001).
Conclusion:
Our study showed that the AI-generated score had a strong correlation with validated scores indicating that it could serve as an objective and efficient method for evaluating inflammation in CD patients. As a preliminary study, our findings provide a promising basis for future refining of a CE score that may accurately correlate with prognostic factors and aid in the management and treatment of CD patients.
Plain language summary
This study introduces an innovative AI-based approach to evaluate Crohn’s Disease. The AI system automatically analyzes images from capsule endoscopy, focusing on finding ulcers and erosions to measure disease activity. The research reveals a robust correlation between the AI-generated score assessing inflammation in the small bowel and traditional clinical scores. This suggests that the AI solution could be a quicker and more consistent way to evaluate Crohn’s Disease, speeding up the evaluation process and reducing manual scoring variability. While promising, the study acknowledges limitations and emphasizes the need for further validation with larger groups of patients. Overall, it represents a crucial step toward integrating AI into gastroenterology, offering a glimpse into a future of more objective and personalized Crohn’s Disease evaluation.
Keywords
Introduction
Inflammatory bowel disease (IBD) is a chronic inflammatory disease of the gastrointestinal tract that affects millions of people worldwide. The assessment and management of IBD can be challenging, requiring a comprehensive evaluation of disease activity and response to treatment.
Capsule endoscopy (CE) is a minimally invasive procedure initially designed to assess the small bowel (SB), demonstrating a high diagnostic yield in detecting SB lesions.1,2
CE is typically safe and well tolerated, with few contraindications, but a clinical evaluation of the risk of capsule retention is advisable. This is especially relevant for patients with established Crohn’s disease (ECD), in whom the likelihood of retention is increased, particularly when a known stricturing phenotype or obstructive symptoms are present. A meta-analysis evaluated the risk of capsule retention in patients with suspected Crohn’s disease (SCD) and ECD. The authors found a retention rate in the overall CD cohort of 3.3% (SCD = 2.3% and ECD = 4.6%). However, the concern for retention should not preclude CE utilization, as long as patients have undergone appropriate patency testing. 3
The role of CE in patients with SCD and ECD has expanded over the years. Its applications encompass the diagnosis of SB CD, evaluation of disease activity and therapy response, objective assessment of mucosal healing, and detection of postoperative recurrence. Indeed, interest in evaluating the entire SB and colon has grown in recent years. 4
The introduction of pan-enteric capsule endoscopy (PCE), enabling simultaneous enteric and colonic evaluation, has raised expectations that this modality may allow a comfortable and accurate evaluation of the gastrointestinal involvement in IBD within a single examination.5,6 Indeed, recent studies have emphasized the potential role of PCE in IBD patients, particularly its utility in monitoring patients with CD. 7
CD, with its discontinuous nature and varied disease location, benefits from a pan-enteric approach. PCE, for example with PillCam Crohn’s® (PCC), offers a convenient method for simultaneously assessing SB and colonic lesions, allowing for the evaluation of disease severity, extent, and distribution. 8 Numerous studies have evaluated the application of video capsule endoscopy in patients with CD and have consistently demonstrated that PCE achieves a high diagnostic yield for lesion detection throughout the entire gastrointestinal tract.9–12
To standardize reporting of CE exams and enhance reproducibility and inter-observer agreement, scoring systems have been developed. 13 The Lewis score (LS) assesses CD inflammation in the SB. It evaluates ulcerations, erythema, and mucosal abnormalities, showing excellent inter-observer agreement and correlation with inflammatory markers, providing a standardized and objective evaluation of disease activity. 14 Similar to the LS, the Capsule Endoscopy Crohn’s Disease Activity Index (CECDAI) evaluates inflammation in the SB using three main parameters: inflammation, extent of disease, and strictures. The score is calculated separately for proximal and distal segments. In contrast to the previously mentioned scores, the ELIAKIM score is a novel PCC score that evaluates mucosal inflammation not only in the small intestine but also in the colon. The entire bowel is divided into five segments, with the small intestine divided into three tertiles and the colon into right and left segments. The score takes into account the most common and severe lesions, disease extent, and the presence of strictures, providing a comprehensive assessment of CD activity. 15
Traditional scoring systems, such as the LS, CECDAI, and ELIAKIM, have been used to evaluate disease activity in IBD. However, these scoring systems have limitations: they are subjective, variable between observers, and time-consuming to apply.
Over the past few years, significant efforts have been dedicated to developing and implementing artificial intelligence (AI) tools for automated image analysis in the field of gastroenterology. 16 Based on convolutional neural network (CNN) models, AI-powered systems have been developed to decrease reading times and improve lesion detection. 17 To this date, evidence regarding the application of deep learning modules to pan-endoscopy systems is still in its early stages and comes mainly from retrospective studies including a small number of patients and with limited datasets.18–22
In a proof-of-concept study by this group, Ferreira et al. developed a deep learning model for automatic detection of both SB and colonic ulcers and erosions using PCC® capsule images. This model achieved encouraging results for lesion detection, with a sensitivity of 90% and a specificity of 96%. 22
The main objective of this article is to explore the application of AI in assessing disease activity in IBD, including the development of an automated AI score based on CE images selected by CNN and comparing it with laboratory data and existing scoring systems.
Materials and methods
Study design
We performed a retrospective analysis of clinical data of patients with CD undergoing PCC between September 2020 and January 2023 at a single center (São João University Hospital, Porto, Portugal). The reporting of this study conforms to the Strengthening the Reporting of Observational Studies in Epidemiology statement. 23 In all, 61 patients were enrolled in this study. All procedures were recorded as a video file. Each full-length video was reviewed, and images retrieved from these examinations were used. These images comprised still frames extracted by the decomposition of each video. The segmentation of each video into frames was performed using dedicated video software (VLC media player, Paris, France). Each exam was analyzed by a pre-developed and validated CNN for automatic detection of ulcers and erosions in PCC images. 22
A total of 1,659,175 frames of enteric and colonic mucosa were ultimately extracted. Each exam video was reviewed by three readers (PC, MM, and FM) and inflammation was scored in all studies using LS, CECDAI, and ELIAKIM. A final decision on each score attribution required the agreement of at least two of the three researchers.
Laboratory parameters of these patients, such as C-reactive protein (CRP) and fecal calprotectin (FCP), were also collected.
CE procedure
PCE procedures were conducted using the PCC system (Medtronic, Dublin, Ireland). The images were reviewed using PillCam™ software version 9.0 (Medtronic). Each frame was processed to remove any information allowing patient identification (name, operating number, date of procedure). The bowel preparation protocol followed previously published guidelines. 24
Development of the CNN
A previously published CNN for automatic detection of ulcers or erosions in the enteric or colonic mucosa was used. 22 From the collected pool of images (n = 1,659,175), 55,317 displayed ulcers and erosions. The remaining images (n = 1,603,858) showed normal SB or colonic mucosa. For each image, the CNN estimated the probability for each category: ulcers or erosions versus normal mucosa. The category with the highest probability score was outputted as the CNN’s predicted classification.
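The decision rule described above can be illustrated with a minimal sketch. The function name, the label strings, and the example probabilities are hypothetical and are not taken from the published model; only the rule itself (output the category with the highest probability) comes from the text.

```python
# Hypothetical labels for the two categories described in the text.
LABELS = ("ulcers or erosions", "normal mucosa")


def predict_category(probabilities):
    """Return the label with the highest estimated probability.

    `probabilities` holds one value per entry in LABELS, in the same order.
    """
    if len(probabilities) != len(LABELS):
        raise ValueError("expected one probability per category")
    best_index = max(range(len(LABELS)), key=lambda i: probabilities[i])
    return LABELS[best_index]


# Illustrative values only, not output of the published CNN.
example = predict_category([0.93, 0.07])
```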
Development of the automated AI score
The videos were subsequently analyzed by software and image processing to detect ulcers and erosions. A score was calculated from the number of images selected as displaying ulcers and erosions (in the SB and colon, separately) and the total number of frames per video. The score is the count of images with findings divided by the overall number of frames, and therefore reflects the proportion of affected images in the video. The score was calculated for the SB (AIS_Small Bowel), the colon (AIS_Colon), and for both (AIS_total). A potential challenge for the algorithm lies in the risk of counting the same ulcer or erosion multiple times, especially when the capsule crosses a segment repeatedly. To address this issue, cross-matching procedures were integrated into both the training and test datasets, effectively eliminating redundant images of identical ulcers. In addition, to reduce overfitting bias, patients were methodically divided between the training and validation datasets.
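As a rough illustration of the calculation described above, the sketch below computes the score as the percentage of frames flagged by the CNN in each segment. The function names, the dictionary layout, and the frame counts are assumptions made for illustration. The way AIS_total is pooled (flagged frames from both segments over all frames) is also an assumption, and the removal of repeated images of the same lesion is not shown.

```python
def ai_score(positive_frames, total_frames):
    """Percentage of frames in which the CNN detected ulcers or erosions."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if not 0 <= positive_frames <= total_frames:
        raise ValueError("positive_frames must lie between 0 and total_frames")
    return 100.0 * positive_frames / total_frames


def exam_scores(sb_positive, sb_total, colon_positive, colon_total):
    """Scores for the small bowel, the colon, and both segments pooled."""
    return {
        "AIS_Small Bowel": ai_score(sb_positive, sb_total),
        "AIS_Colon": ai_score(colon_positive, colon_total),
        # Assumption: the total score pools the frames of both segments.
        "AIS_total": ai_score(sb_positive + colon_positive,
                              sb_total + colon_total),
    }


# Made-up frame counts for a single exam.
scores = exam_scores(sb_positive=120, sb_total=12000,
                     colon_positive=30, colon_total=18000)
```

With these made-up counts, 1% of small bowel frames and 0.5% of all frames are flagged.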
Model performance and statistical analysis
Statistical analysis was performed with SPSS version 29.0 (IBM Corporation, Armonk, NY, USA). Data were expressed as means, and the minimum and maximum of each variable were calculated. Spearman correlation was used to examine the relationship between the variables of interest. A value of p < 0.01 was considered statistically significant.
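For readers unfamiliar with the statistic, the following is a minimal sketch of Spearman's rank correlation in plain Python; the study itself used SPSS, so this is not the authors' code. The helper names and the sample values are hypothetical. Tied values receive the average of the ranks they span, and the coefficient is the Pearson correlation of the two rank vectors.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman coefficient: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: a monotonic but nonlinear relationship.
rho = spearman_rho([0, 1, 2, 3], [0, 1, 8, 27])
```

Because only the ordering of the values matters, the monotonic example yields a coefficient of 1 even though the relationship is not linear.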
Results
PCC characteristics
In all, 61 CD patients underwent PCC examination. The capsule reached the cecum in all cases. The examination was considered complete (visualization of rectum or toilet) in 55/61 (90%). The bowel preparation was assessed in all cases and was considered satisfactory in the SB in all cases.
CNN characteristics
The CNN for automatic detection of ulcers or erosions was applied to a total of 1,659,175 images. Of these, 617,743 showed normal enteric mucosa and 986,115 showed normal colonic mucosa. The CNN identified 20,787 images of ulcers or erosions of enteric mucosa and 34,530 of colonic mucosa. Figure 1 presents an example of the output of the CNN. Figure 2 presents the heatmaps produced by the neural network, which are visual representations that highlight the important features in the input image, in this case ulcers and erosions. These heatmaps help in interpreting the CNN’s decision-making process by highlighting these regions.

Figure 1. Output obtained from the application of the CNN. A blue bar represents a correct prediction. The red bars represent an incorrect prediction. The category with the highest probability was outputted as the CNN’s prediction.

Figure 2. Heatmaps produced by the CNN: highlighted regions of ulcers and erosions.
Capsule activity scores
The median LS was 225 (0–6060), CECDAI was 6 (0–33), ELIAKIM was 4 (0–38), and AIS_Small Bowel was 0.5659 (0–29.45). The median, minimum, and maximum values are shown in Table 1.
Table 1. Descriptive statistics of variables.
CECDAI, Capsule Endoscopy Crohn’s Disease Activity Index; CRP, C-reactive protein; FCP, fecal calprotectin.
To evaluate the automated score generated by the CNN, we compared it with the laboratory data and with the already validated scores.
Since the variables were nonparametric in nature, we used Spearman correlation for the analysis. The Spearman coefficient (ρ) varies between −1 and 1; the closer ρ is to −1 or 1, the stronger the correlation, and values above 0 indicate a positive correlation. The authors considered the following ranges: ρ 0–0.19 = very weak, 0.2–0.39 = weak, 0.4–0.59 = moderate, 0.6–0.79 = strong, and 0.8–0.99 = very strong. 25 Table 2 shows the correlations between all the variables.
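The interpretation bands above can be written as a small lookup, sketched here under two assumptions that the text does not state: the bands are treated as half-open intervals on the absolute value of the coefficient (so 0.195 counts as very weak), and a coefficient of exactly 1 is grouped with "very strong". The function name is hypothetical.

```python
def correlation_strength(rho):
    """Map a Spearman coefficient to the descriptive band used in the text."""
    magnitude = abs(rho)  # the sign gives direction, not strength
    if magnitude > 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if magnitude < 0.2:
        return "very weak"
    if magnitude < 0.4:
        return "weak"
    if magnitude < 0.6:
        return "moderate"
    if magnitude < 0.8:
        return "strong"
    return "very strong"  # assumption: includes a coefficient of exactly 1


# Two coefficients reported in this study.
sb_ais_vs_ls = correlation_strength(0.751)
cecdai_vs_ls = correlation_strength(0.854)
```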
Table 2. Spearman rank correlation coefficient between variables.
Correlation is significant at the 0.05 level (two-tailed).
Correlation is significant at the 0.01 level (two-tailed).
CECDAI, Capsule Endoscopy Crohn’s Disease Activity Index; FCP, fecal calprotectin; LS, Lewis score.
It is important to mention that, due to the retrospective nature of the study, the timing of blood sample collection was not standardized for all patients. In fact, there was considerable variation in the time interval between sample collection and CE. For this reason, the researchers established a reasonable time frame of 3 months to ensure meaningful comparisons between these values and the findings in the PCC. This explains why the comparisons involving CRP and FCP were performed for only 36 and 35 patients, respectively.
We first analyzed the laboratory values, CRP and FCP. Regarding CRP, we found virtually no correlation with the other variables. For FCP, we found Spearman correlations close to 0.5 (p < 0.05) with the known scores (low to moderate association).
When analyzing the validated scores (LS, CECDAI, and ELIAKIM), we found that they correlate well. We found a very strong correlation between LS and CECDAI (Spearman’s r = 0.854, p < 0.001), CECDAI and ELIAKIM scores (Spearman’s r = 0.827, p < 0.001), and a strong correlation between LS and ELIAKIM (Spearman’s r = 0.768, p < 0.001).
We found a strong correlation between SB_AIS and LS, CECDAI, and ELIAKIM scores (Spearman’s r = 0.751, r = 0.707, r = 0.655, p = 0.001). We did not find any correlation between the generated score for the colon segment (AIS_Colon) and the other scores. For the score corresponding to both the SB segment and colon (AIS_total), we found statistically significant correlations, although not very strong: a moderate positive correlation with LS (Spearman’s r = 0.472, p < 0.001) and ELIAKIM (Spearman’s r = 0.517, p < 0.001), and a weak positive correlation with CECDAI (Spearman’s r = 0.398, p = 0.001).
Discussion
The idea of a comprehensive pan-enteric examination, particularly for evaluating conditions like CD, arose with the advent of colon CE. 26 Given that PCE enables the assessment of the entire gastrointestinal tract, the concept of a single, minimally invasive panendoscopy has become an enticing prospect. PCE is a convenient and accurate method for assessing both the SB and colon, reducing the need for separate evaluations.
As previously mentioned, there are several validated scores for evaluating disease activity in CE. Some are dedicated to assessing the mucosa of the small intestine (LS and CECDAI), while others provide a pan-enteric evaluation (ELIAKIM). Because multiple validated scores already exist, the need for a new score can be questioned, and it is reasonable to consider instead developing an AI tool for swift computation of the existing scores. However, the application of these scores involves certain challenges. In manual reading and scoring, we need to acknowledge subjectivity and variability among different observers. Furthermore, it is often a time-consuming task, especially with panendoscopy capsules (double-headed cameras), for which a single exam can take anywhere from 30 to 90 min. As stated, the traditional scoring systems (LS and CECDAI) are dedicated to assessing only the small intestine mucosa, and this limited scope can make it challenging to obtain a comprehensive assessment of disease activity.
The establishment of scoring systems has indeed introduced standardized reporting, but we believe that there is space for improvement. The use of AI algorithms in assessing disease activity in IBD, in particular CD, has shown promise for accurate and less biased disease evaluation.27,28 The use of CNN in CE has shown that the implementation of these technologies may improve the accuracy, efficiency, and reproducibility of IBD evaluations.18–21
Based on a pre-developed and validated CNN, 22 we developed an automated score for disease activity assessment and compared it with clinical laboratory markers and other established scores. While the other scores are based on villous appearance, edema, the presence of ulcers, strictures, stenosis, and the extent and severity of lesions, the score we created is based on the count of images with ulcers and erosions relative to the total number of frames in the video. This automated score provides an objective and standardized assessment of inflammation, reducing subjectivity and variability. In practice, the number of frames with lesions will indirectly reflect the extent and severity of the different lesions, as well as the presence of stenosis (more selected frames from delayed passage).
In this research, our primary method for estimating inflammatory burden centered on identifying ulcers and erosions, with a specific emphasis on quantifying the number of frames featuring those lesions. We acknowledge that this approach may lack complete accuracy, as other relevant endoscopic aspects, notably strictures, were not considered. However, developing a CNN for stricture identification is complex, given the impracticality of a frame-by-frame assessment. In future work, we will address this limitation by incorporating all relevant endoscopic features that could reflect inflammatory activity. Our goal is to develop an AI tool based on diverse endoscopic aspects across the entire gastrointestinal tract, aiming to automatically provide a comprehensive estimate of pan-enteric inflammatory activity. Although the application of additional landmarks poses technical challenges, this aspect is already undergoing development.
Another important point to emphasize is the reason behind the separate analysis of SB and colon. It is true that one of the objectives of this study is to perform a pan-endoscopic evaluation but the division into the SB and colon makes sense from a methodological perspective. In fact, considering the anatomical differences between the two segments, from the standpoint of neural network development, it is possible to create much more accurate and robust CNNs by conducting a segmented evaluation as described. However, given the high reading rate of each CNN and the fact that both networks are coupled in the algorithm, this division does not interfere with the overall accuracy. Therefore, this methodological division is part of a data science strategy that guarantees the best performance of the network.
The results of the study showed a strong correlation between the automated AI score in the SB segment and validated scoring systems. This strong correlation may indicate that the AI-based score provides results that are consistent and aligned with the assessments made by well-established and validated scores, which suggests that this score might serve as an effective and efficient alternative for evaluating CD activity. In addition, the speed at which the neural network evaluates thousands of frames can significantly speed up the evaluation process, saving time. Furthermore, the use of AI-assisted technologies eliminates inter-observer variability, leading to increased consistency and reproducibility.
This study lays a promising foundation for future research. The researchers are enthusiastic about exploring ways to enhance the score’s performance for disease activity assessment, with the ultimate goal of creating a superior tool compared to existing scores. Further validation studies with larger and diverse patient cohorts are necessary to affirm this automated score’s reliability and generalizability.
Despite the potential value of these technologies, we must acknowledge some limitations. First, the retrospective nature of this study limits its generalization potential. Second, the nature of this score is mainly quantitative (fraction of images with ulcers and erosions in relation to the total number of frames) which can limit the ability to capture the full spectrum of disease activity. In addition, this is the first study with the use of the score to assess the disease activity and therefore has a limited number of patients. It is essential to validate the AI score in a clinical setting in a prospective manner to determine its correlation with clinical symptoms, response to treatment, and the need for escalation or de-escalation of therapy. This validation would provide valuable insights into the clinical utility and reliability of the score in guiding clinical decision-making and optimizing patient care.29–31
Despite the acknowledged limitations, this study suggests that the development of an automated score holds significant potential as an objective and efficient tool for assessing disease activity in patients with CD. In fact, the use of AI in PCE for assessing inflammation in IBD patients is a promising field that has the potential to revolutionize disease management. Ultimately, the goal of the group is to develop clinical scores and interfaces that aggregate all patient information, including clinical status, laboratory values, endoscopic exams (including CE), radiological exams, and others, to optimize disease treatment and improve quality of life. 27 This approach would provide a comprehensive and integrated view of the patient’s disease status, allowing for more informed clinical decision-making, toward personalized medicine.
Conclusion
In conclusion, our study suggests that the application of AI tools to assess IBD, in particular CD, can have potential benefits. Our group explored the development of an automated score using CNN for CE image analysis. Preliminary findings indicate a correlation with established clinical scores, suggesting the possibility of objective disease activity evaluation. However, it is important to acknowledge uncertainties, and further research with consideration of more endoscopic features and larger cohorts is needed to determine the real-world effectiveness of this AI-based approach. These emerging technologies might improve the evaluation and management of CD patients, but their practical implications require careful consideration and validation.
