Abstract
Objective:
While artificial intelligence (AI) is known to benefit radiotherapy (RT), its applications in carbon ion RT (CIRT) have yet to be surveyed. This scoping review aims to examine the current landscape of AI in CIRT and highlight future directions.
Materials and methods:
Following PRISMA guidelines, three databases (PubMed, Scopus, and Web of Science) were searched to find eligible English articles investigating AI in CIRT between January 2014 and June 2025. Two human reviewers and an AI tool independently screened articles. Agreement was quantified using Cohen’s kappa and McNemar’s test (α = 0.05).
Results:
Of the 413 unique articles screened, 16 were selected, and an additional three were included through citation search, totaling 19 articles. The screening agreement between human and AI reviewers was substantial (Cohen’s kappa = 0.75) with no statistically significant difference (p = 0.18). The AI contributions in CIRT were categorized into treatment planning, optimization, and verification; synthetic imaging; tumor control; and normal tissue complication prediction. All selected studies were limited by small datasets and lacked external validation.
Conclusions:
AI development in the CIRT domain is in early stages but holds promise for reducing costs and improving treatment, especially in adaptive settings. Future efforts should foster collaboration across CIRT facilities to support robust AI applications.
Keywords
Introduction
In recent years, increasing interest in artificial intelligence (AI) approaches has been reported in healthcare. 1 This is largely driven by significant technological advancements over the past decade, including hardware innovations and the wider accessibility of the hardware components required to fully leverage powerful AI algorithms, such as deep learning (DL). The concurrent digitalization of healthcare data further facilitated the proliferation of AI applications. Radiotherapy (RT) has been one of the most explored fields, likely owing to the large availability of multimodal imaging datasets, especially after digitalization. According to a recent review, 2 AI efforts in photon RT have mainly focused on treatment plan optimization, image quality improvement, and treatment outcome prediction. A similar pattern was also observed in proton therapy (PT), 3 although with fewer contributions owing to the more limited adoption of this treatment modality. Notably, AI applications have been progressively developed and are starting to be integrated into routine clinical practice to increase RT efficiency. A key example is the automatic contouring of organs of interest, 4 which is now commercially available in most treatment planning systems (TPS) and is generally powered by AI.
Carbon ion radiotherapy (CIRT), an advanced form of RT, leverages the distinctive properties of carbon ions to gain geometrical and radiobiological advantages in the treatment of deep-seated, unfavorably located, and radioresistant tumors. CIRT offers highly favorable treatment outcomes, albeit with substantial technological costs. Only a few CIRT facilities are operating worldwide; hence, the availability of such a promising treatment is currently limited.
Substantial economic advantages have been reported in many healthcare fields when integrating AI into key clinical 5 and healthcare administrative processes. 6 In this scenario, AI may represent a pivotal tool to mitigate CIRT treatment costs and increase its accessibility. Furthermore, to promote cost-effective resource allocation, AI tools may serve to effectively select patients who would likely gain the highest benefit from CIRT. 7 Beyond the economic aspects, CIRT could also gain significant advantages from AI-guided personalized treatments. 8 Finally, due to the distinctive characteristics of carbon ion interaction with biological tissues, CIRT is particularly sensitive to inter-fraction anatomical modifications, and adaptive treatments are warranted to maximize CIRT outcomes. For such adaptive settings, AI-based frameworks are particularly well-suited to address the stringent demands of both accuracy and computational efficiency.
This scoping review aims to systematically examine the current landscape and status of AI applications in CIRT, while critically identifying gaps and future directions.
Materials and methods
This scoping review was conducted following the PRISMA Extension for Scoping Reviews (PRISMA-ScR) 9 to describe the current status of AI applications in the landscape of CIRT. The PRISMA-ScR checklist was filled out and is provided as Online Supplementary Material (SM).
Eligibility criteria
The population, concept, and context framework was used to define the eligibility criteria. 10 No limitations regarding population were defined, and articles on AI (concept) applications in CIRT (context) were included. Specifically, scientific manuscripts exploring, developing, or taking advantage of any AI method in the context of CIRT were included in the literature analysis. Articles not written in English were excluded.
Information sources and search
Manuscripts published between January 2014 and June 2025 were retrieved from the following three databases: Web of Science, Scopus, and PubMed. The search string (see SM, S1) was constructed to include the following keywords within the manuscript’s title or abstract:
• carbon ion radiotherapy, carbon ion radiation therapy, carbon ion RT, CIRT, particle therapy, hadron therapy, heavy ion radiotherapy, heavy ion radiation therapy, heavy ion RT; AND
• artificial intelligence, AI, machine learning, deep learning, neural network, model, radiomics, dosiomics, omics, bioinformatics, sequencing, LLM, NLP, auto-segmentation, auto-contouring, GANN, CNN, large language model, natural language processing, synthetic, prediction, predictive.
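The two keyword groups above combine as a boolean query: terms within a group are joined with OR, and the two groups are joined with AND. The following is a minimal sketch of that structure only, not the authors' actual search string (which is given in SM, S1). The function names are hypothetical, the keyword lists are abbreviated, and the database-specific title/abstract field tags are deliberately omitted.

```python
def or_group(terms):
    """Join terms with OR, quoting multi-word phrases so they match as a unit."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(context_terms, concept_terms):
    """Combine the context (CIRT) and concept (AI) groups with AND."""
    return or_group(context_terms) + " AND " + or_group(concept_terms)


# Abbreviated, illustrative subsets of the two keyword groups.
CONTEXT = ["carbon ion radiotherapy", "CIRT", "hadron therapy"]
CONCEPT = ["artificial intelligence", "deep learning", "radiomics"]

query = build_query(CONTEXT, CONCEPT)
print(query)
```

In practice each database would need its own field restriction wrapped around this expression, which is why the exact strings are best taken from the supplementary material.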
Relevant scoping, narrative, or mapping reviews exploring AI in the field of particle RT were evaluated through citation search to account for potentially pertinent articles not captured by our search string.
The following metadata were collected from all sourced manuscripts: authors, title, date of publication, volume, issue, journal, publisher, DOI, abstract, URL, keywords, and location.
Selection of sources of evidence
Metadata from all the retrieved manuscripts were imported into Rayyan 11 for the screening process. Duplicate entries were automatically detected and were deleted manually after inspection. Two human reviewers and a fully open-source AI large language model (LLM) reviewer (DeepSeek-R1:32B) 12 independently screened the unique manuscripts based on the article title and abstract. The AI reviewer was prompted to provide a simple yes or no answer along with a concise one-sentence description for the decision (see SM, S2 for prompt details). Only human decisions were considered for source selection purposes, and disagreements among them were resolved by consensus.
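The yes/no-plus-reason screening step can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the prompt template is hypothetical (the real prompt is in SM, S2), `ask_llm` is a hypothetical callable standing in for whichever client sends the prompt to the model, and the `<think>...</think>` block is an assumption about the raw reply format of a reasoning model.

```python
import re

# Hypothetical prompt; the prompt actually used is reported in SM, S2.
PROMPT_TEMPLATE = (
    "You are screening articles for a scoping review on artificial "
    "intelligence in carbon ion radiotherapy.\n"
    "Title: {title}\nAbstract: {abstract}\n"
    "Answer 'yes' or 'no' first, then give a one-sentence reason."
)


def parse_decision(raw):
    """Return (include, reason); include is None when the reply is unparseable."""
    # Drop any reasoning block before the answer (assumed reply format).
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    match = re.match(r"\W*(yes|no)\b[\s.,:;-]*(.*)", text,
                     flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return None, text  # leave the record for manual inspection
    include = match.group(1).lower() == "yes"
    reason = " ".join(match.group(2).split())  # collapse whitespace
    return include, reason


def screen(record, ask_llm):
    """Screen one record (dict with 'title' and 'abstract') via ask_llm."""
    prompt = PROMPT_TEMPLATE.format(**record)
    return parse_decision(ask_llm(prompt))
```

Returning `None` for unparseable replies, rather than defaulting to exclusion, keeps ambiguous model output visible to the human reviewers.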
Cohen’s kappa 13 was used to measure the screening agreement between consensual human and AI reviewer decisions. Metrics such as sensitivity, specificity, and precision were also measured. 13 Subsequently, McNemar’s test was applied to determine if there was a statistically significant difference in article selections between the human and AI reviewers. The significance level was set to 0.05. Finally, studies not identified with the search string, but found pertaining to the review scope through citation search, were also included in the analysis.
Data charting process and data items
The two human reviewers evaluated the selected manuscripts and classified them into different AI application categories within the CIRT domain. A spreadsheet was designed to collect the following data items from each included article: application category, title, authors, journal, year of publication, aim, dataset, dataset modality, site, AI methodology, AI performance metrics, and LLM summary. For tabulation, data were primarily extracted manually from the articles’ abstracts, while the full text was consulted only when deriving the data items was not possible from the abstract alone. One human reviewer independently extracted the data, while the second human reviewer subsequently verified the entries. The LLM summary of the selected manuscripts was obtained by asking the AI reviewer to respond to a specific set of tailored questions and subsequently combine its responses to generate a short summary (see SM, S3 for prompt details), using the full manuscript text (excluding the references sub-section to reduce the context length). The generated AI summaries were included after careful assessment of their correctness by one of the human reviewers, as LLMs are prone to hallucinations.
Synthesis of results
Word clouds were used to characterize the AI terminology and identify trends in the CIRT research domain. A comprehensive critical summary for each defined AI category was elaborated by one of the two human reviewers. All the exploratory trends were plotted using Seaborn-0.13.2 14 and wordcloud-1.9.4 15 packages, while the AI-based abstract screening and critical summary pipelines were implemented using Ollama-0.6.2 16 and LangChain-0.3.20, 17 all in Python-3.13.0. To support future reviews leveraging fully open-source locally run LLM for screening and critical review, the code was made open source and is available at: https://github.com/sithin-cnao/llm_review.git (accessed on 22 January 2026).
Results
A total of 689 records were retrieved from the three databases. After removing duplicates, 413 unique articles were screened. Of these, 16 were included in the study after consensus between the two human reviewers. The PRISMA flowchart 18 is reported in Figure 1. On the other hand, the AI reviewer selected 21 articles, 14 of which overlapped with the human reviewers’ selection. When quantitatively compared against the consensual human selections, the AI reviewer demonstrated 88% sensitivity and 98% specificity, but with a moderate precision of 67%. The agreement between the human reviewers and the AI reviewer was substantial, with a Cohen’s kappa of 0.75 and no statistically significant difference in article selection (p = 0.18).
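The reported agreement figures can be re-derived from the counts stated above. With 413 screened articles, 16 human selections, 21 AI selections, and 14 in common, the implied 2x2 table is 14 true positives, 7 false positives, 2 false negatives, and 390 true negatives. The sketch below recomputes the metrics from those counts. The exact (binomial) form of McNemar's test is an assumption, since the variant actually used is not stated; the function name is hypothetical.

```python
from math import comb


def screening_agreement(tp, fp, fn, tn):
    """Agreement metrics for an AI reviewer against the human consensus."""
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    # Chance agreement from the marginal "yes" and "no" proportions.
    p_chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    # Exact McNemar test on the discordant pairs (assumed variant).
    discordant = fp + fn
    smaller = min(fp, fn)
    tail = sum(comb(discordant, i) for i in range(smaller + 1))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "kappa": (p_observed - p_chance) / (1 - p_chance),
        "mcnemar_p": min(1.0, 2 * tail / 2**discordant),
    }


# Counts derived from the figures reported in the text.
result = screening_agreement(tp=14, fp=7, fn=2, tn=390)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the output agrees with the reported values to rounding: sensitivity 14/16, precision 14/21, specificity 390/397, kappa close to 0.75, and p close to 0.18.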

Figure 1. PRISMA flowchart describing the source of evidence retrieval and selection process of the scientific literature on artificial intelligence applications in the context of carbon ion radiotherapy. The experimental LLM screening was excluded from the flowchart, as the final inclusion decisions relied exclusively on humans.
The AI methods associated with abstracts where human and AI reviewers’ decisions matched and mismatched were identified through word clouds, and are reported in Figure 2. The terms related to heuristic search (e.g., A*), scheduling algorithms, and LASSO were found to be mainly related to the AI reviewer’s incorrect manuscript selection. This highlighted the LLM’s inclination to classify conventional statistical tools or heuristic optimization algorithms as AI applications.

Figure 2. Word cloud of AI methodologies extracted from the screened abstracts where human and LLM-reviewer agreed (black) and disagreed (red).
Figure 3 presents the AI methods explored in the selected articles according to their year of publication. Interestingly, generative AI methods were mostly represented in the period from 2022 onwards, while conventional machine learning (ML) approaches were more common in earlier years. Excluding AI-unrelated terms, the most common AI methods explored in the context of CIRT were: generative adversarial networks (GAN), radiomics, U-Net, and ML methods such as logistic regression, support vector machines (SVM), random forest, etc. Word clouds were used to detect such trends (see Figure 3b).

Figure 3. General trend of AI literature in CIRT. (a) Number of publications per year. (b) Word clouds of AI methods explored in the selected articles, grouped by year.
The CIRT AI applications investigated in the selected articles were classified into four categories: (i) Treatment planning, optimization, and verification (# articles = five, publication years = 2020-2025), (ii) Synthetic imaging (five, 2022-2025), (iii) Tumor control prediction (five, 2019-2024) and finally (iv) Normal tissue complication prediction (four, 2018-2025). Table 1 presents the manually extracted data items alongside the human-revised LLM-generated summary, while the LLM’s response to critical review questions for the selected papers is provided in the Online Supplementary Material. A comprehensive overview of each category is presented in the subsequent paragraphs.
Table 1. Manually extracted data items and LLM-generated short summaries of 19 selected articles.
Treatment planning, optimization, and verification
Five studies were published between 2020 and 2025, encompassing AI applications related to treatment planning or dose distribution prediction in CIRT. The selected papers showed heterogeneity not only in AI methodologies but also in their application scopes and have not undergone external validation. Nonetheless, they mostly aimed to overcome the resource and time consumption limitations of Monte Carlo (MC) simulation approaches toward adaptive CIRT, predominantly using DL based on U-Net 4 architecture.
In the context of particle therapy range verification, in 2020, Yabe et al. 19 successfully developed a U-Net model to predict accurate and reliable dose distributions almost instantaneously (<1 s) from measured luminescence water images during proton or carbon ion treatment. This was an extension of their previous work on protons. 20 Similarly, Yamaguchi et al. 21 proposed a U-Net model to predict dose images with low error margins from secondary electron bremsstrahlung (SEB) X-ray images. On the other hand, Zhang et al. 22 focused on plan verification and quality assurance. In this context, conventional phantom-based methods are laborious and time-consuming, while MC approaches require substantial resources and time. 23 The authors explored three DL approaches (cycle-GAN, 24 3D U-Net, 25 and Ghost-U-Net 26 ) for MC dose simulation denoising in lung and head and neck cancer patients undergoing CIRT. This approach had been explored previously and was aimed at preserving the dosimetric accuracy of MC simulations while reducing the number of particles simulated. 27,28 Along the same lines, He et al. 29 developed a custom U-Net model for end-to-end dose prediction in online head and neck adaptive CIRT. Unlike Zhang et al., the authors did not include MC simulations in their proposed pipeline; instead, doses were predicted directly from the treatment planning computed tomography (CT), the 3D energy matrices, and the ray masks. Similarly, to facilitate adaptive planning, Quarz et al. 30 aimed to mitigate MC time consumption by leveraging DL models for voxel sampling in CIRT. While heuristic approaches to defining the relevant voxels have been investigated, such methods remain patient-specific and are still labor-intensive and time-consuming. 31 Hence, the authors explored the feasibility of a novel AI model, P-Net, 32 in deriving the optimal sampling rate within the target and organs of interest volume regions, which was found to generate high-quality plans while reducing computational burden.
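The motivation for MC dose denoising follows from a standard property of Monte Carlo estimates, stated here as general background rather than as a result of the reviewed studies: the relative statistical uncertainty of the dose in a voxel scales with the inverse square root of the number of simulated primary particles N. Reducing N by a factor k therefore shortens the simulation roughly k-fold but inflates the noise by the square root of k. The value k = 100 below is purely illustrative.

```latex
\[
\frac{\sigma_D}{D} \propto \frac{1}{\sqrt{N}}
\quad\Longrightarrow\quad
\frac{\sigma_D(N/k)}{\sigma_D(N)} = \sqrt{k},
\qquad
\text{e.g. } k = 100 \;\Rightarrow\; \sqrt{k} = 10 .
\]
```

A denoising network is useful in this setting only if it recovers that lost precision without biasing the dose, which is why dosimetric accuracy is the key evaluation criterion.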
Synthetic imaging
Five studies published between 2022 and 2025 explored generative AI imaging applications in the domain of CIRT. All of them focused on generating synthetic CT (sCT) from X-ray digital radiography (DR), magnetic resonance imaging (MRI), or cone beam CT (CBCT). The main benefits identified were the potential integration of such tools in an adaptive CIRT framework and/or an MRI-only treatment planning.
In July 2022, Zhang et al. 33 published the first paper on sCT in CIRT, aiming to evaluate the feasibility of generating sCT from a single setup verification DR. The authors built a two-step framework where an improved cycle-GAN and a customized deep neural network were trained to map DR to digitally reconstructed radiography (DRR), and DRR to CT, respectively. Although the performance metrics and gamma pass rates were acceptable, the dataset was extremely small, and consisted only of a thorax-abdomen anthropomorphic phantom and two head and neck patients. Furthermore, concerning the clinical cases, the reference treatment plan was only optimized on the treatment positioning CBCT, further limiting their conclusions. Knäusl et al. 34 explored a more popular research approach and evaluated the feasibility of using MRI to generate sCT using a 3D U-Net model in head and neck CIRT. Nonetheless, the relevant dosimetric differences between sCT- and CT-based treatment plans highlighted the need for further improvements in the methodology before such an approach could be integrated into an adaptive CIRT framework. The treatment plans’ dosimetric differences were mainly ascribed to the immobilization mask being poorly reconstructed in the sCT, as well as the target and organs of interest volume-specific positions. Parrella et al. 35 and Nakas et al. 36 investigated the feasibility of generating sCT and four-dimensional sCT (4D-sCT), respectively, from virtual 4D-MRI to mitigate the increased exposure to ionizing radiation during long CT procedures and potential errors due to abdominal breathing motion. In both cases, three-channel conditional-GANs 37 were built, demonstrating good performance and acceptable treatment plans compared to the 4D-CT reference plans. Overall, the results of both Parrella et al. and Nakas et al. were aligned with the literature investigating sCT in the abdominal site. 
Nonetheless, in the context of an MRI-only scenario, both reported a few cases with sub-optimal results, likely due to air pockets and inter-acquisition variability as reported in previous literature studies on proton RT. 38 Finally, Pepa et al. 39 explored the feasibility of using cycle-GAN to generate sCT from CBCT acquired for patient positioning in pediatric CIRT. Although the authors reported good results, their evaluation was only based on imaging, with no accompanying dosimetric analysis. On the other hand, Klassen et al. 40 conducted a dosimetric-oriented synthetic study to evaluate the potential advantages of sCT from CBCT for hypo-fractionated adaptive pancreatic CIRT.
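The gamma pass rate mentioned above combines a dose-difference criterion and a distance-to-agreement (DTA) criterion into a single index, and a point passes when the index is at most 1. The sketch below is a simplified, global, one-dimensional version for illustration only. The 3%/3 mm criteria and the 10% low-dose cutoff are assumed example values, not those of the reviewed studies, and clinical gamma analysis is performed in 3D with interpolation between grid points.

```python
from math import inf, sqrt


def gamma_pass_rate(ref, evl, spacing_mm, dose_tol=0.03, dta_mm=3.0, cutoff=0.10):
    """Global 1D gamma pass rate (%) of an evaluated profile against a reference.

    ref, evl: dose samples on the same regular grid; spacing_mm: grid spacing.
    """
    d_max = max(ref)
    if d_max <= 0:
        raise ValueError("reference profile must contain positive dose")
    passed = total = 0
    for i, d_ref in enumerate(ref):
        if d_ref < cutoff * d_max:
            continue  # ignore points below the low-dose cutoff
        gamma_sq = inf
        for j, d_evl in enumerate(evl):
            dist = (i - j) * spacing_mm                # spatial offset in mm
            diff = (d_evl - d_ref) / (dose_tol * d_max)  # normalized dose difference
            gamma_sq = min(gamma_sq, (dist / dta_mm) ** 2 + diff**2)
        total += 1
        passed += sqrt(gamma_sq) <= 1.0
    return 100.0 * passed / total


# Toy profiles: the evaluated profile is the reference shifted by one sample.
reference = [0, 10, 50, 100, 100, 50, 10, 0]
shifted = [0, 0, 10, 50, 100, 100, 50, 10]
print(gamma_pass_rate(reference, shifted, spacing_mm=1.0))  # 1 mm shift
print(gamma_pass_rate(reference, shifted, spacing_mm=5.0))  # 5 mm shift
```

The same one-sample shift is tolerated on a 1 mm grid but not on a 5 mm grid, which illustrates how the index trades spatial against dosimetric disagreement.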
Tumor control prediction
Five manuscripts published between 2019 and 2024 investigated the potential of AI in tumor control prediction. Most of these manuscripts (four out of five) were based on a radiomics approach or its derivatives. All of the papers took advantage of simple conventional ML methods rather than complex DL models. Note that none of the AI models developed in these studies has undergone external validation.
Wu et al. 41 investigated MRI radiomics for both T2-weighted and apparent diffusion coefficient maps to predict prostate tumor relapse after CIRT. While the robustness of the radiomics features was considered, the opportunity to properly evaluate the model was strongly limited by an extremely small sample size (n = 23). A support vector machine (SVM) model was proposed, and its performance was evaluated with the leave-one-out cross-validation area under the receiver operating characteristic curve (AUROC). Despite the high AUROC (0.88), no further studies were conducted to investigate the validity of their findings. In 2020, Buizza et al. 42 investigated the potential benefit of multimodal (MR, CT) radiomics, dosiomics (using 3D dose maps extracted from the treatment plan), and clinical features in predicting local relapse of skull-base chordoma patients after CIRT. Unlike Wu et al., the authors chose to implement a survival-SVM (s-SVM) over its classification variant (SVM) to stratify patients based on their risk for adverse outcomes. Such a decision was likely supported by the higher interest in the survival outcome rather than in the mere prediction of relapse, which is quite common in radiotherapy. Additionally, a comparison with more common statistical approaches was conducted by building a regularized Cox-proportional hazard (r-Cox) model. Their best model capable of effectively separating high and low risk groups was dosiomics-based (shape and texture features), while the lowest generalizability was observed for MRI-based models. A few years later, Parrella et al. 43 also explored models within the same framework, confirming the findings of Buizza et al. to a certain extent. Specifically, Parrella et al.’s best-performing model also selected shape features, even though texture features were not included. 
Additionally, the authors introduced radiomics features derived from dose-averaged linear energy transfer (LETd) maps and physical dose maps in their models. These features showed promising results both when fed to the s-SVM model and when integrated into the tumor control probability (TCP) model compared to a clinically based TCP. Despite such encouraging results, it is worth noting that the studies by Buizza et al. and Parrella et al. relied on samples gathered at the same institution and under the same protocol. Consequently, further investigations and external validation are needed. Morelli et al. 44 also investigated biological dose maps and LETd in survival prediction models (s-SVM and r-Cox) to predict local recurrence in sacral chordoma after CIRT. LETd-based radiomics r-Cox was reported as the best model, outperforming the more conventional ones based on dose-volume-histogram (DVH) parameters. Biological dosimetric maps were computed by means of two radiobiological models: the local effect model (LEM v.1) and the micro-dosimetric kinetic model (MKM). Interestingly, only MKM-based dosimetric radiomics led to significant results in model performance.
Only one of the selected studies was not related to radiomics. Qiu et al. 45 compared the performance of statistical and ML models in the prediction of high-grade glioma (HGG) relapse after particle therapy with either protons or carbon ions. Using patient, tumor, and treatment variables as input, the Cox model showed better results compared to a random survival forest.
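Leave-one-out cross-validated AUROC, as used in the smallest of the studies above, can be read as follows: each patient is scored by a model trained on all the other patients, and the AUROC is then computed over the pooled held-out scores. The sketch below illustrates that procedure under this reading. The nearest-centroid scorer is a deliberately simple stand-in and is not an SVM, and the toy feature values are hypothetical.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def loocv_scores(features, labels, fit, predict):
    """Score every sample with a model fitted on all the remaining samples."""
    scores = []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        scores.append(predict(fit(train_x, train_y), features[i]))
    return scores


# Stand-in model (NOT an SVM): class centroids of a single feature.
def fit_centroids(xs, ys):
    def mean(values):
        return sum(values) / len(values)
    return (mean([x for x, y in zip(xs, ys) if y == 0]),
            mean([x for x, y in zip(xs, ys) if y == 1]))


def predict_centroids(model, x):
    c0, c1 = model
    return abs(x - c0) - abs(x - c1)  # larger means closer to class 1


# Hypothetical toy data: one feature per patient, 1 = relapse.
features = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
labels = [0, 0, 0, 1, 1, 1]
held_out = loocv_scores(features, labels, fit_centroids, predict_centroids)
print(auroc(labels, held_out))
```

With very small samples such an estimate has wide uncertainty, which is one reason external validation matters for the studies discussed above.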
Normal tissue complication prediction
Four research papers, published from 2018 to 2025, explored the application of AI in normal tissue complication probability (NTCP) modeling. The majority of these papers (three out of four) developed their models using clinical, tumor, and treatment data, rather than incorporating radiomics-based features. Additionally, all the studies utilized traditional ML techniques, mostly logistic regression, instead of employing advanced AI models for building NTCP models. Although these NTCP models have not undergone external validation, overall, they were characterized by larger sample sizes, ranging from several hundred to over a thousand patients.
In 2018, Zhang et al. 46 aimed to build a model predicting weight loss in cancer patients treated with particle therapy. The study sample included 365 patients, with no constraints on either the treatment site or the particle type. Logistic regression and chi-square automatic interaction detector (CHAID) decision tree models were investigated with the following input variables: demographic, nutrition status, tumor-related, treatment-related, and laboratory test results. Basic univariable-based selection methods were implemented to define the multivariable predictors. Both models reported good performance, with slightly better accuracy and AUROC for the CHAID decision trees. Li et al. 47 built a model for predicting high-grade acute radiation dermatitis using 187 head and neck patients treated with CIRT. The input predictors were consistent with the classification task and most prior literature (patient-specific and treatment-related factors), and the model development followed a systematic structure without substantial novelty. Nevertheless, robust procedures were implemented from the feature selection to the model evaluation step. The authors also highlighted that the multivariable model did not outperform the univariable logistic regression models based on RBE-weighted or physical dose to surface parameters. More recently, in 2025, Zhang et al. 48 evaluated the performance of ML approaches in predicting high-grade xerostomia in head and neck cancer patients receiving proton and carbon ion RT. The sample size for model development was remarkable, with 1769 patients, although the treatment heterogeneity may hinder the model's specificity to CIRT. The pipeline leading to model development and evaluation was consistent with good practices in ML. After evaluating the most common ML models, linear-SVM reported the highest balanced accuracy, outperforming logistic regression. 
The authors reported synthetic minority oversampling technique (SMOTE) as a key method to improve the reliability and model performance while accounting for class imbalance. Finally, Meng et al. 49 developed an NTCP model for predicting high-grade acute oral mucositis in head and neck cancer patients undergoing CIRT. The authors presented the integration of radiomics and dosiomics features in their NTCP model as a major novelty. A conventional logistic regression model was employed, robustly developed and evaluated, achieving a satisfactory AUROC consistent with the literature on the topic. The last three aforementioned studies were evaluated by taking advantage of cross-validation and bootstrap procedures at different stages to ensure a higher level of robustness in the final model.
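The core idea of SMOTE, cited above as a remedy for class imbalance, is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below illustrates that idea only. It is not the implementation used in the cited study, the function name and parameters are hypothetical, and the toy points are invented.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating minority-class neighbours.

    minority: list of feature tuples (at least two points are required).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()  # random position on the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class samples with two features each.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority_points, n_new=10)
print(len(new_points))
```

As a general caution, oversampling of this kind is normally applied to the training folds only, so that synthetic points derived from a patient never appear in the data used to evaluate the model.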
Discussion
To the best of our knowledge, this is the first scoping review exploring the landscape of AI applications in the context of CIRT. In addition to the two independent human reviewers, the integration of an LLM-based AI reviewer into the screening and data extraction processes represents a novelty in scoping reviews. Although the AI reviewer tended to be slightly liberal in article selection compared to humans (moderate precision), nearly all of the articles selected by humans were also included in the LLM reviewer’s selections (high sensitivity and specificity). In fact, the LLM reviewer’s choices were substantially similar to human selections, with no statistically significant difference. While underscoring the immaturity of such tools in conducting independent scoping reviews, our results are encouraging and highlight their potential as a support tool. Furthermore, given the rapid advances in LLMs, we expect higher screening accuracy in the future.
The number of papers meeting the inclusion criteria of our scoping review, in the period between January 2014 and June 2025, was substantially lower than in recent literature reviews on AI applications in proton 3 or photon therapy. 2,50 Of course, the limited worldwide adoption of CIRT may explain such a discrepancy, leading to fewer AI applications and potentially lower data availability for AI development.
To date, AI research efforts dedicated to CIRT have focused on optimizing laborious CIRT-specific tasks. In fact, DL-based MC dose simulation prediction, denoising, or efficient dose grid sampling were reported to significantly speed up the treatment planning phase, complying with adaptive CIRT requirements. Similarly, the need for dedicated re-planning CT scans during the CIRT course entails additional radiation exposure and computational time. Especially in the framework of adaptive CIRT, such limitations could be a serious obstacle; hence, synthetic imaging was identified as a promising tool to aid these processes. sCT from CBCT or MRI offers a promising avenue for low-dose or radiation-free efficient re-planning and for non-invasively monitoring anatomical changes, though clinical translation remains hindered by variations in experimental setup, heterogeneity of anatomical regions, small sample sizes, and limited validation. In this context, model transferability may be a promising research direction to explore.
Indeed, specific AI tools developed in the context of photon 51 or proton 52 RT could be transferred to CIRT; however, the higher accuracy required in heavy ion therapy demands caution in this process. Dedicated experiments should be designed to validate the existing models and evaluate the need for further tuning, improvement, or implementation of new CIRT-specific AI tools. For example, in the context of CIRT, only one study investigated the feasibility of generating sCT from daily positioning CBCT. Conversely, such an approach has been widely explored in proton and photon RT, 53–55 owing to the multiple opportunities it offers. Generating sCT from CBCT would lead to a considerable reduction of the facility costs and invasiveness associated with further re-planning CT scans in adaptive CIRT. 39 The acceptability of such AI methods should be systematically evaluated with a CIRT-based treatment plan that incorporates dosimetric accuracy constraints, an aspect missing in the aforementioned study. 39 On the other hand, an in-silico study by Klassen et al. 40 reported dosimetric advantages of using CBCT adaptive planning in pancreatic hypo-fractionated CIRT, although the authors did not take advantage of AI-based sCT generation. Moreover, it is also worth noting that the higher sensitivity to variations along the beam path in CIRT may also result in lower gamma pass rates when compared to photons and protons (~99%), and a consensus for the evaluation of acceptability in CIRT is yet to be established. This further underscores the need for dedicated CIRT experiments on the topic and warrants caution in the direct transferability of such applications to the CIRT domain. From a technical point of view, the AI models considered in CIRT literature were aligned with the most recent literature for RT and PT. 
No clear trends in the AI approaches over time were identified; in fact, the choice of the AI method seemed to be mostly guided by technical aspects rather than technological availability.
Conversely, we suspect that predictive models may exhibit lower transferability from photon/proton therapy due to CIRT-specific relevance of data modalities (e.g., LETd maps) and heterogeneity in radiobiological models 50 (LEM vs. MKM, both available in multiple versions) used for dose calculation. While in photons no biological model needs to be applied, and in protons, most of the centers apply a constant RBE (1.1), 56 in CIRT, different radiobiological models are currently in use without a homogeneous landscape. This underscores a further need for dedicated approaches for CIRT in TCP modeling while emphasizing the difficulty in directly transferring AI models from proton and photon RT. Despite the recent developments in DL, most of the included CIRT studies for prediction tasks (i.e. TCP and NTCP) involved simple ML approaches (such as SVM, s-SVM, or random survival forests) rather than complex AI models (e.g. DL). Such a choice may be justified, given the small sample size, which could generate unstable results when used with data-hungry DL models for prediction tasks. With respect to prediction tasks, it is also worth noting that 60% of the studies were conducted by the same group, potentially explaining to some extent the methodological homogeneity. Unlike for TCP tasks, a trend over time was observed in NTCP modeling. Most of the studies used logistic regression models for NTCP prediction, although the ones published later considered a wider variety of ML models 48 or included radiomics and dosiomics-derived features in the model. Such a pattern is well aligned with RT literature, where logistic regression models are widely recognized and even validated, making it more difficult to experiment with advanced ML methodologies. Note that, even in the context of proton and photon RT, the application of AI in prediction tasks is currently far from clinical practice; this gap seems wider for CIRT.
Given the scarcity of literature on AI applications in CIRT, this scoping review showed that AI research efforts in CIRT largely retrace mainstream PT research. However, a recent literature review on AI in PT 3 highlighted additional research areas, including image contouring and image quality improvement, that warrant further attention. As previously discussed, the transferability of most of the proposed approaches is worth evaluating in a CIRT setting, with fine-tuning where needed. On the other hand, there are AI-based commercial tools that are currently used in clinical routine, 57,58 mainly for automatic contouring (e.g., for organs of interest), which explains the lack of dedicated studies on the topic. However, the higher demand for spatial accuracy in CIRT poses relevant challenges to target segmentation tool development. Above all, evaluating AI tools that derive mass density and relative stopping power maps from dual-energy CT images may be of particular interest in CIRT. 59 Although not yet extensively explored in photon and proton RT either, AI tools for patient selection are also worth exploring in CIRT, especially considering the fast-growing data digitalization and the increasingly widespread use of electronic health records (EHR).
All of the included studies suffered from limited sample size and lacked external validation. The few studies that reported larger samples showed treatment heterogeneity, i.e., they combined patients treated with either proton or carbon ion RT. Although this may not affect the study outcome for synthetic imaging tasks, at least with respect to specific evaluation metrics, 39 we expect a higher impact on prediction tasks. Overall, the largest sample sizes were found in the NTCP prediction category, and the smallest in treatment planning and imaging applications. This pattern is consistent with the AI tasks and the available literature. Small sample sizes and the lack of external validation lead to low robustness and generalizability of AI models. To mitigate such limitations, we encourage the development of a collaborative network among CIRT centers worldwide to facilitate data sharing, AI model validation, and protocol standardization. However, the intrinsic heterogeneity in CIRT treatment planning approaches, which rely on different radiobiological models, further hinders the collection of large datasets even in a collaborative scenario.
Strengths and limitations
To the best of our knowledge, this is the first scoping review to systematically explore the current landscape of AI in CIRT while critically providing an overview of challenges and future directions. The integration of an LLM-based AI reviewer into the screening and data extraction process is a novel methodological contribution and could enhance the scale and efficiency of future reviews. Among the limitations of this scoping review, the search was conducted on only three databases, covered the period from January 2014 to June 2025, and included only articles written in English. Nonetheless, we believe that the selected articles capture the present AI applications in CIRT and that the findings would not be substantially affected by a broader search strategy. Finally, it is worth noting that the literature included in this review reported positive results. In this context, a positive bias may exist; however, reduced data availability and the limited use of CIRT to date may partially explain this occurrence.
Conclusions
This scoping review highlights the immaturity of AI developments in the CIRT domain. Besides the limited AI contributions in the field, current studies are further constrained by small sample sizes and a lack of external validation. Despite these limitations, transferring existing AI tools from photon and proton RT to CIRT represents a promising avenue, although caution is warranted in this delicate process. Leveraging AI applications in CIRT holds promise for reducing costs and increasing treatment effectiveness and efficiency, especially in adaptive CIRT scenarios. To support the development of robust and generalizable AI tools in CIRT, the establishment of a collaborative network among CIRT centers is strongly encouraged.
Supplemental Material
sj-docx-1-tmj-10.1177_03008916261438952 – Supplemental material for The landscape of artificial intelligence in carbon ion radiotherapy: A scoping review
Supplemental material, sj-docx-1-tmj-10.1177_03008916261438952 for The landscape of artificial intelligence in carbon ion radiotherapy: A scoping review by Giulia Fontana, Sithin Thulasi Seetha, Sara Lillo, Silvia Molinelli, Alessandro Vai, Anna Maria Camarda, Lorenzo Preda, Amelia Barcellini and Ester Orlandi in Tumori Journal
Footnotes
Author contribution
Conception and design: Ester Orlandi and Amelia Barcellini; Collection and assembly of data: Giulia Fontana and Sithin Thulasi Seetha; Data analysis and interpretation: Giulia Fontana and Sithin Thulasi Seetha; Writing – original draft: Giulia Fontana and Sithin Thulasi Seetha; Writing – review & editing: All authors.
This manuscript has been read and approved by all the authors, and each author believes that the manuscript represents honest work.
Data availability statement
No additional datasets were generated or analyzed in this study. The code supporting the large language model (LLM)-related findings of this study is openly available to support future systematic reviews leveraging fully open-source, locally run LLM for screening and critical appraisal. The source code is available on GitHub:
(accessed on 22 January 2026).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
References
