Abstract
Introduction
Early identification of patients at risk for toxicity induced by radiotherapy (RT) is essential for developing personalized treatments and mitigation plans. Preclinical models with relevant endpoints are critical for systematic evaluation of normal tissue responses. This study aims to determine whether attention-based vision transformers can classify MR images of irradiated and control mice, potentially aiding early identification of individuals at risk of developing toxicity.
Method
C57BL/6J mice (n = 14) were subjected to 66 Gy of fractionated RT targeting the oral cavity, swallowing muscles, and salivary glands. A control group (n = 15) received no irradiation but was otherwise treated identically. T2-weighted MR images were obtained 3–5 days post-irradiation. Late toxicity in terms of saliva production in individual mice was assessed at day 105 after treatment. A pre-trained vision transformer model (ViT Base 16) was employed to classify the images into control and irradiated groups.
Results
The ViT Base 16 model classified the MR images with an accuracy of 69%, with identical overall performance for control and irradiated animals. The model's predictions showed a significant correlation with late toxicity (r = 0.65, p < 0.01). One of the attention maps from the ViT model highlighted the irradiated regions of the animals.
Conclusions
Attention-based vision transformers using MRI have the potential to predict individuals at risk of developing early toxicity. This approach may enhance personalized treatment and follow-up strategies in head and neck cancer radiotherapy.
Introduction
Patients with head and neck (H&N) cancer undergoing radiotherapy (RT) may experience serious side effects from radiation-induced damage to healthy tissue. Early adverse effects from RT include dermatitis and oral mucositis, which can occur during or shortly after RT and may necessitate unwanted treatment pauses.1 RT-induced dermatitis affects up to 95% of H&N cancer patients,2 while oral mucositis affects about 80% and may require painkillers and, in the worst cases, a gastrostomy tube to aid feeding.3,4 Late adverse effects include tissue fibrosis, osteoradionecrosis, and hypo-functioning salivary glands, all of which can significantly lower the patient's quality of life.1,5 Hypo-functioning salivary glands, leading to hyposalivation and/or xerostomia, are estimated to occur in more than 80% of H&N cancer patients undergoing RT and can negatively affect oral health, causing loss of taste and difficulties in eating, speaking, and swallowing.6,7 Currently, few methods exist to identify patients at risk of early-stage toxicity, limiting opportunities to mitigate side effects.
Magnetic resonance imaging (MRI) is a non-invasive and versatile medical imaging technique with excellent soft tissue contrast, useful for assessing tumors and organs at risk, visualizing tissue boundaries, and extracting quantitative image biomarkers with prognostic or predictive potential.8 MR technology and sequences have improved significantly during the last decades, enabling evaluation of tissue properties such as diffusion and perfusion. T2-weighted MR sequences are standard for cancer diagnosis, as they highlight variations in water content, often making tumors appear hyperintense while also revealing hypointense features such as fibrosis. While MRI can identify RT-induced tissue injuries typically occurring months to years after treatment,9 its usefulness in detecting toxicity during or early after treatment appears to be very limited. Thus, developing an MRI-based approach for detecting and predicting tissue effects and toxicities could offer significant benefits.
During the last decade, we have witnessed remarkable progress in artificial intelligence (AI) from machine learning to deep learning, with valuable applications in medical imaging.10 Recent developments in deep learning have shown promise for image-based prediction of radiotherapy-induced toxicity.11,12 Moreover, transfer learning can be a useful approach, particularly given the challenges involved in obtaining large-scale datasets.13 In addition, explainable AI (XAI) has been introduced to provide models that are interpretable and more trustworthy.14 The attention-based vision transformer (ABVT)15 is a neural network architecture inspired by the transformer architecture underlying large language models such as ChatGPT (Chat Generative Pre-trained Transformer).16 ABVTs can model long-range pixel dependencies, adapt to different input sizes, and leverage parallel processing, making them highly suitable for image classification tasks.17 Here, a weighted representation of the complete image is created by summing attention scores assigned to image patches via an attention mechanism. Local attention can improve model efficiency, and the resulting attention maps indicate the significance of individual patches. As the attention score reflects the importance of an image patch, patches with high scores may align with the classifier's predictions. Therefore, in this study, we investigated the use of ABVT together with MRI as a tool to predict toxicity and to enhance explainability.
In this work, we hypothesize that subtle radiographic patterns in early post-RT MR images can indicate the risk of late radiation-induced toxicity, and that an ABVT can predict its likelihood. T2-weighted MR images were acquired from a preclinical mouse study that employed a clinically relevant radiation field and fractionated radiation delivery to the head and neck of mice.18 A pre-trained ABVT model, the vision transformer (ViT) base 16 model,15,19 was used for binary classification of MR images into two classes, identifying the images as stemming from irradiated mice or controls. The predictive capability of the image-based ABVT classifier was then tested against late toxicity, with respect to reduced saliva production, in the same mice. Furthermore, attention maps were generated and compared between control and irradiated mice to reveal image regions of importance for the model.20 Our assumption was that the model's attention, i.e. the relative weighting of the image information used for the classification task, is related to where in the images the model looks for radiation-induced alterations.
Materials and Methods
Mice and MRI
Nine-week-old C57BL/6J female mice purchased from Janvier Labs (France) were kept in a 12-h light/12-h dark cycle under pathogen-free conditions and fed a standard commercial fodder with water given ad libitum (see ref 18 for animal handling). Standard housing with nesting material and refuge was provided. All experiments were approved by the Norwegian Food Safety Authority (ID 27931) and performed in accordance with directive 2010/63/EU on the protection of animals used for scientific purposes. Euthanasia was performed through overdose of anesthetic (pentobarbital, Exagon® Vet) by intraperitoneal injection under terminal gas anesthesia (sevoflurane 4% in O2). Efforts were made to minimize the number of animals used and to reduce their suffering. The animals were closely monitored throughout the experimental timeline and were adequately cared for. The reporting of this study conforms to the ARRIVE 2.0 guidelines.21,22
At the onset of the experiments, the animals were 12 weeks old. A total of 29 mice were randomly assigned to a control group (n = 15) and an irradiated group (n = 14). X-irradiation was performed with a Faxitron Multirad 225 system (Faxitron Bioptics, Tucson, AZ, USA). A total dose of 66 Gy was given in 10 fractions with the following settings: 100 kV X-ray potential, 15 mA current, 2.0 mm Al filter, and 0.68 Gy/min dose rate. The radiation field (1.5 × 0.75 cm2), covering the oral cavity, swallowing muscles, and salivary glands, was delivered over five days (two fractions per day) or twelve days (one fraction per day) (see ref 18). The control group was not irradiated but was otherwise treated identically. Saliva production, used as a late toxicity endpoint, was assessed 105 days after RT. Saliva production was stimulated by intraperitoneal administration of 0.375 mg/kg pilocarpine 23 (pilocarpine hydrochloride, Sigma) to 8 control and 6 irradiated mice under injection anesthesia (Zoletil mix: Narcoxyl or Rompun® + Torbugesic® + Zoletil®), as described in refs 18 and 24. Saliva was collected into a cotton swab for 15 min. The swab was then centrifuged and the obtained volume was measured. The two fractionation schedules did not result in significant differences in saliva production.
MRI of the H&N region was performed using a 7.05 T Biospec scanner (Bruker Medical Systems, Germany) with a fast T2-weighted spin-echo sequence (TurboRARE) with TE = 31 ms, TR = 3100 ms, and a voxel size of 0.12 × 0.12 × 0.70 mm3 in the sagittal, coronal, and axial planes. Body temperature was monitored and maintained at 37°C by a feedback-regulated heating fan. Respiration rate was monitored by a respiration probe. Animals were anesthetized during imaging using gas anesthesia (sevoflurane 4% in O2). For image analysis, 420 (14 mice × 30 slices per mouse) and 450 (15 mice × 30 slices) sagittal MR image slices were acquired from irradiated and control mice, respectively. Figure 1 shows the treated area in a transmission X-ray image from the X-ray source used for treatment, together with a T2-weighted MR image.

Transmission X-ray image of the irradiation field projected through a mouse (A) and a corresponding T2-weighted MR image (B). The blue and green lines indicate the irradiation field and the midline contour, respectively. In (A), large parts of the mouse are not seen due to lead shielding. A submandibular gland (C), responsible for saliva production, is indicated.
Data Preparation and Vision Transformer Model
The DICOM MR images were converted to uncompressed TIFF format using a batch conversion tool in the IrfanView™ software. Some of the most lateral MR images containing virtually no anatomical information of interest (i.e. images not showing the mouse head, or only minor parts of it) were manually removed from the dataset, reducing the image count to 300 for each class. Thereafter, the images were resized from 256 × 256 to 224 × 224 pixels using bilinear interpolation, converted into 3-channel tensors, and normalized channel-wise using mean = [0.485, 0.456, 0.406] and standard deviation = [0.229, 0.224, 0.225], as required by the pre-trained PyTorch model.25 Using repeated random training-test splits, images were assigned to training (approximately 70% of images) and test (approximately 30%) sets, stratified by class to ensure balance between control and irradiated mouse images in each set. To avoid information leakage, all images from a single mouse were assigned to only one set (either the training or the test set). Given the small sample size (29 mice), this procedure was repeated five times with different splits, creating five distinct training and test sets and thereby generating five different models.
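The mouse-level splitting described above, where stratification is done by class but assignment is done per animal to avoid leakage, can be sketched as follows. This is an illustrative reimplementation, not the study's code; the function name and mouse ID scheme are invented for the example.

```python
import random

def mouse_level_split(mouse_ids, labels, test_frac=0.3, seed=0):
    """Split at the mouse level (not the image level) so that all images
    from one animal end up in exactly one set, avoiding information leakage.
    Stratified by class: each class is shuffled and split separately."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels.values())):
        group = [m for m in mouse_ids if labels[m] == cls]
        rng.shuffle(group)
        n_test = round(test_frac * len(group))
        test += group[:n_test]
        train += group[n_test:]
    return train, test

# Example with the study's group sizes: 15 controls ("C...") and 14 irradiated ("I...")
mice = [f"C{i}" for i in range(15)] + [f"I{i}" for i in range(14)]
labels = {m: m[0] for m in mice}
train, test = mouse_level_split(mice, labels, test_frac=0.3, seed=1)
```

Repeating this with five different seeds yields the five distinct training/test splits used in the study.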
The vision transformer used herein was proposed by Dosovitskiy et al 15 as an alternative neural network architecture to convolutional neural networks (CNNs). Given that ViT is a fairly large network (approximately 86 million parameters for the base model), a transfer learning strategy can be used, adapting only the weights of the last layer to the given task.26 We used the ViT base 16 model with 12 attention heads (and 16 × 16 pixel patches) together with transfer learning. The model was pre-trained with the IMAGENET1K_V1 weights.25 We then examined the ViT base 16 model's ability to classify MR images as originating from a control or an irradiated mouse.
Computations were conducted on an NVIDIA RTX 4090 GPU with 16 GB of graphics memory. We used a batch size of 32 and the Adam optimizer 27 with a learning rate of 0.000009 and a weight decay of 0.0006. Training for 10 epochs was sufficient to achieve the model's maximum accuracy. Only the final layer of the transfer learning model was modified and trained, while the other layers (pre-trained with IMAGENET1K_V1 weights) were frozen. The model output was the classification probability pc; for the binary classification task, the control class was labeled "1" (pc ≥ 0.5) and the irradiated class "0" (pc < 0.5). The 12 attention maps generated for the central MR image of each mouse were extracted to identify patterns that could explain what the model highlighted in the images. Here, the maps were split at the coronal midline, approximately coinciding with the radiation field border. Furthermore, a region of interest (ROI) including the mouth and throat (the irradiated part in treated animals) was defined inferior to the midline, while a ROI including the brain and nape (unirradiated part) was defined superior to the midline. The group mean attention value for each attention head and ROI was then calculated.
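The ROI comparison can be illustrated with a small numpy sketch. The rectangular split at a single row is a simplification of the anatomical midline ROIs used in the study, and the 14 × 14 map size simply reflects a 224 × 224 image divided into 16 × 16 patches; the midline index below is illustrative.

```python
import numpy as np

def roi_mean_attention(attn_map, midline_row):
    """Split a 2D attention map at a (coronal) midline row and return the
    mean attention superior and inferior to it."""
    superior = attn_map[:midline_row, :]   # brain and nape (unirradiated)
    inferior = attn_map[midline_row:, :]   # mouth and throat (irradiated)
    return float(superior.mean()), float(inferior.mean())

# Toy example: a head attending exclusively to the lower half of the image
attn = np.zeros((14, 14))
attn[7:, :] = 1.0
sup_mean, inf_mean = roi_mean_attention(attn, midline_row=7)
```

Averaging such per-ROI values over the animals in each treatment group gives the group mean attention values compared in the Results.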
Performance and Evaluation Metrics
The accuracy in the classification is defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

True positive (TP) is the number of positive samples correctly predicted as positive, false positive (FP) is the number of samples incorrectly predicted as positive, true negative (TN) is the number of samples correctly predicted as negative, and false negative (FN) is the number of samples incorrectly classified as negative. Recall (also called sensitivity), precision, and F1-score are given by 28:

Recall = TP / (TP + FN)

Precision = TP / (TP + FP)

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Precision gives the confidence of the classifier in correctly assigning images to their true class. Recall, often known as the 'true positive rate', gives the fraction of samples in a class that the classifier correctly assigned to that class. Thus, precision indicates how precise the classifier is, while recall measures the completeness of classifying true samples. The F1-score combines recall and precision using their harmonic mean.
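These definitions translate directly into code; the following standalone sketch (illustrative, not from the study's pipeline) computes all four metrics from binary label lists.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from binary labels,
    following the standard TP/FP/TN/FN definitions."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Example with 1 = control, 0 = irradiated
acc, prec, rec, f1 = classification_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```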
Statistics
Group mean and standard error of the mean (SEM) are reported. Student's t-test was used to compare the mean values of the two groups, with a significance level of p < 0.05. Pearson's correlation coefficient r was calculated to assess associations between two variables.
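For reference, Pearson's r used here can be computed directly from its definition; this standard-library sketch is illustrative (in practice a routine such as scipy.stats.pearsonr, which also returns the p-value, would typically be used).

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences:
    covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Perfectly linearly related data give r = 1
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```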
Results
As seen in Table 1, the ViT Base 16 model showed good performance across the five splits of the data into training and test sets. The mean accuracy was 69% and was identical for irradiated and control animals. The control class had slightly higher mean precision than the irradiated class, though with higher standard error, whereas the irradiated class had higher recall and F1-score. However, none of these differences were statistically significant (p > 0.05).
Mean test set performance metrics (±SEM) for the ViT Base model. The model was trained and tested on five different training and test sets obtained by random splitting.
To investigate the dependence of the classification accuracy on the MR slice location, we separated the image series of each mouse into central and peripheral images, where the former depicts submandibular glands and the vertebral column while the latter does not. The mean predicted class probability for central and peripheral images for control and irradiated animals is shown in Figure 2. For control mice, there was a significant difference (p = 0.01) between the central and peripheral images, but for the irradiated mice there was not.

Mean predicted class probability with error bars indicating SEM for control and irradiated mice, where each group was separated into central and peripheral MR images.
The 12 attention maps corresponding to the central MR slice were extracted, visualized, and analyzed for all mice. Figure 3 shows an MR image illustrating the ROIs for the mouth and throat (irradiated part in treated animals) and the brain and nape (unirradiated part), along with attention maps for head #9, averaged over each treatment group. For every attention map, the mean attention value within each ROI was calculated for each treatment group. As seen in Figure 4, attention heads #1, #4, #9, #10, and #12 showed significant differences between the two ROIs for controls. Only head #9 showed significant differences between the two ROIs in both controls and irradiated mice. As seen in Figure 3, the attention map for head #9 highlights the mouth and throat, corresponding to the irradiated part in the treated animals. There were generally small differences between the mean attention maps of the two treatment groups. Attention maps for heads #1 and #4 were quite noisy and predominantly highlighted air-tissue interfaces, while those for heads #10 and #12 appeared similar to #9 but with smaller differences between the ROIs (data not shown).

Central MR image of a mouse (left) and mean attention maps for head #9 for control (middle) and irradiated mice (right). The solid line represents the ROI encompassing the mouth and throat (irradiated part in treated animals), while the stippled line represents the brain and nape (unirradiated part).

Mean attention value with SEM as error bar for the mouth and throat (irradiated part in treated animals) and the brain and nape (unirradiated part) for each attention head and treatment group. Data for control and irradiated animals are given in the upper and lower panels, respectively. * p < 0.05.
To investigate the relation between our image-based ViT model predictions and late toxicity in terms of reduced saliva production, we computed the mean predicted class probability over all images (including misclassified ones) for each mouse. Figure 5 shows the mean probability of an animal belonging to the control group plotted against the saliva production. As seen, there was a marked association, with a Pearson's correlation coefficient of 0.65 (p = 0.01). A probability cutoff at 0.5 (also used in the image classification) and a saliva cutoff at 125 μL separated the two treatment groups with a classification accuracy of 0.86.

Late saliva production per mouse versus the image-based ViT probability of a mouse being classified as a control mouse. The dotted lines represent a probability cutoff at 0.5 and a saliva cutoff at 125 μL, respectively, used when calculating prediction accuracy. Pearson's correlation coefficient r is given.
Discussion
We have shown that Vision Transformers, specifically the ViT Base 16, achieved a mean accuracy of approximately 0.69 in classifying MR images of control and irradiated mice. The transformer's predictions, based on images acquired very early after treatment, were associated with late toxicity in terms of reduced saliva production. Notably, one of the attention maps highlighted the mouth and throat, the irradiated area, demonstrating the model's ability to provide interpretable results. To our knowledge, these findings are novel and hold potential for future clinical applications.
One of the aims of this study was to investigate the attention maps and to determine which region(s) in the MR images were highlighted. As recently reviewed,29 explainability in ViTs is challenging because there is no clear standard for what constitutes an appropriate explanation. However, in the current work, one of the extracted attention maps predominantly pointed to the irradiated region, with focus on the submandibular glands, oral cavity, and throat. This substantiates that the ViT model used MR information from the irradiated region in the classification. However, the per-image classification varied and was poorer towards the periphery of the field of view for control images. This could be due to noise and little tissue-specific information, but it is not clear why this was not the case for irradiated animals. It may be that the irradiation caused pronounced effects throughout the animal, making it easier for the vision transformer to extract relevant image information. In future studies, it could be relevant to investigate more recent and sophisticated transformer architectures such as Swin or DeiT 30 in an effort to increase model accuracy. Furthermore, it would be interesting to apply ViTs to MR images of patients with head and neck cancer and explore their role in a clinical setting, as such applications could potentially aid in detecting early post-irradiation effects.
Previously, it has been shown that the treatment protocol used in this study gives both acute and late toxicities in all the irradiated animals.18,24,31,32 Thus, the current classification task was in essence a prediction of the probability of developing toxicity. Indeed, we found a correlation between per-animal prediction probability, as based on MR images acquired early after irradiation, and late saliva production, indicating that the ViT classifier could potentially be used for toxicity prediction. The advantage of our approach is that radiation-induced toxicity may be detected early, allowing appropriate action to be taken to determine the best course of therapy. Furthermore, prompt information regarding the possible causes of any side effects that may have resulted from radiation toxicity, and possible mitigation strategies, can be provided to the patient.
AI approaches such as deep learning on medical images have previously been used to predict toxicity in patients receiving radiotherapy (see 11,33 for reviews). Typically, these studies have employed pre-treatment, patient-specific radiation dose distributions 33 and thus fail to account for inter-patient variations in radiosensitivity. Treatment monitoring by imaging and AI may potentially identify radiosensitive individuals at an early stage after treatment. Also, attention mapping could point to regions in the patient with subtle radiation-induced changes that may progress to clinical toxicity at a later stage. To our knowledge, there are very few publications on the topic. Zhen et al used CNNs and unfolded dose maps for patients receiving radiotherapy for cervical cancer to highlight the possible rectum toxicity location.34 Still, only pre-treatment dose maps were employed, disregarding any radiographic information that could indicate likely progression to toxicity in individual patients. In a very recent study, Kapoor et al used CNNs and integrated gradient techniques to identify regions in CT and dose images indicative of radiation pneumonitis in lung cancer patients receiving RT.35 Still, that analysis employed CT images acquired 3 and 6 months after RT, time points at which clinical symptoms and radiographic CT changes are already present and early mitigation is no longer possible. Thus, there is a lack of methods that could indicate toxicity at an early stage during or immediately after RT.
The number of animals included in the study was a limitation, as small sample sizes can impact model robustness. However, we employed pre-training and transfer learning, which are expected to reduce the risk of overfitting. Additionally, by using repeated random training-test splits, we obtained a more reliable estimate of model performance than a single training-test split would provide. This approach helped minimize the likelihood of the model becoming overly tuned to specific animals. Moreover, we used images from only a single MR sequence (T2-weighted); incorporating multiparametric MRI could potentially have enhanced model performance, although this would increase the risk of overfitting. Finally, when evaluating associations between the model prediction and saliva production, the saliva sampling procedure itself introduces inherent uncertainty, resulting in inter-animal variation of approximately 50%.18 In future work, improving the interpretation of attention scores is important, as this could provide valuable insights into the underlying mechanisms of radiotherapy-induced toxicity. We also plan to investigate different vision transformer architectures, which may offer improved performance and interpretability. Furthermore, we intend to test our methodology on a larger cohort of patients with H&N cancer, also incorporating multiparametric MRI. In conclusion, while our approach is still in its early stages, it is promising and lays the groundwork for future studies.
Footnotes
Ethical Considerations and Informed Consent Statements
All experiments were approved by the Norwegian Food Safety Authority (ID 27931) and performed in accordance with directive 2010/63/EU on the protection of animals used for scientific purposes.
Consent to Participate
Not applicable.
Consent for Publication
Not applicable.
Author Contributions/CRediT
Manish Kakar: Conceptualization, Methodology, Formal analysis, Investigation, Writing – Original Draft
Bao Ngoc Huynh: Writing - Review & Editing, Formal analysis
Olga Zlygosteva: Writing - Review & Editing, Resources
Inga Solgård Juvkam: Writing - Review & Editing, Resources
Nina Edin: Writing - Review & Editing, Resources
Oliver Tomic: Writing - Review & Editing, Formal analysis
Cecilia Marie Futsaether: Writing - Review & Editing, Formal analysis
Eirik Malinen: Conceptualization, Methodology, Resources, Writing - Original Draft, Writing - Review & Editing, Project administration.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability
The data that support the findings of this study are available from Prof. Eirik Malinen (eirik.malinen@fys.uio.no) but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available.
