Sage Journals: Discover world-class research

Abstract

Multi-modality medical image classification aims to combine information from different modalities or devices to generate comprehensive and accurate diagnostic results. Existing research methods have ignored two characteristics of medical images across different phases: the highly redundant background and the low differentiation between different phases. Based on the idea of disentangled representation learning, we introduce a dual-branch network to disentangle images into shared features and modality-specific features. And based on the properties of different features, we propose a prototypical loss and a similar prototypical loss to constrain the two types of features, respectively. Our approach achieves strong performance in classification on the LLD-MMRI dataset and fusion on the ANNLIB dataset. Extensive ablation studies validate the contribution of each component of our framework. Code will be available at: https://github.com/wangzx-tech/explore_mmia

Keywords

medical image analysis multi-phase magnetic resonance imaging prototype

1. Introduction

Automatic classification of medical images can reduce the mental workload of physicians, improving the efficiency of diagnosis (Bhatnagar et al., 2013; Loening & Gambhir, 2003). However, previous research has been limited by the ability to collect datasets. Bring to assisted diagnosis on a single image (Li et al., 2014; Xiao et al., 2021), making it difficult to improve accuracy and real-world application. With the advancement of device capabilities, an increasing number of researchers have expanded their focus from single modality to multi-modality (Lou, 2023; Menze et al., 2015). Aiming to combine information from different medical image modalities to generate more comprehensive and accurate diagnostic results.

In the research of multi-modal medical image classification (MMIC), early methods (Guo et al., 2019; Weibin et al., 2019) adopted simple fusion approaches, such as early fusion, feature fusion, and decision fusion, struggling to capture complex inter-modality relationships. Later, with transformers demonstrating powerful capabilities in modeling relationships between modalities in computer vision (Vaswani et al., 2017), researchers introduced them to medical imaging tasks and achieved superior performance (Dai et al., 2021; Shamshad et al., 2023; Zhan et al., 2022).

However, they overlooked the unique characteristics of medical image processing tasks when incorporating this architecture (Zhao et al., 2023). Specifically, medical images tend to have highly redundant background information. For example, in multi-phase magnetic resonance imaging (MRI), the background features are similar across modalities despite phase differences. Under high inter-modality feature similarity, the capability of transformers to model relationships is impaired, which in turn inhibits the final fitting performance of the model.

Disentangled representation learning (Bengio et al., 2013) can decompose feature representations into multiple independent latent factors, allowing widespread applications in image editing and generation (Lee et al., 2018; Singh et al., 2019). Most existing image fusion methods have achieved improved performance by disentangling images into local and global features (Fu & Wu, 2021; Zhao et al., 2023). Therefore, we propose to introduce disentangled representation learning into the MMIC task by decomposing images into shared and modality-specific information. Attempt to address the problem of redundant background information and release the performance for the transformer architecture. More importantly, we designed dedicated loss functions for each branch. Through constraints such as fusion and contrast, these losses effectively guide the feature extraction tendencies of the dual branches, thereby significantly enhancing the model’s capability to extract information from different modalities.

However, existing disentanglement methods are based primarily on variational autoencoders, generative adversarial networks, and other latent variable models (Wang, Chen, et al., 2024), which are unsuitable for classification tasks. Disentangling methods used in image fusion adopt dual-branch architectures that can be adapted for classification, but most works (Fu & Wu, 2021; Vs et al., 2022) lack effective constraints on disentangled representations to ensure controllability and validity.

To address these issues, we propose corresponding constraints on the disentangled representations from the dual-branch architecture. We first validated our method for semantic extraction and fusion capabilities in the LLD-MMRI dataset, which contains eight-phase MRI images (Lou, 2023), and achieved promising results. Compared to excellent existing published methods (Gao et al., 2021), our approach can improve accuracy by 8.65%. Subsequently, we conducted experiments on medical image fusion tasks on AANLIB datasets and achieved near state-of-the-art (SOTA) performance (Zhan et al., 2022), demonstrating the versatility of our method. Figures 1(a) and 1(b) are the performance in classification and fusion, respectively, validating the capability of our method in fusing multi-modal information. Our main contributions can be summarized as:

(1)
We propose a dual-branch feature extraction network. This architecture separately learns shared features and modality-specific features, effectively enhancing the model’s performance. A corresponding loss function is designed to work in synergy with our network architecture, specifically for the decoupling and fusion of different modal information in multi-modal tasks. It addresses the challenges where high-semantic information is difficult to extract and low-semantic information suffers from high redundancy.
(2)
Considering the redundant nature of shared features, we propose a prototype loss to constrain them to aggregate common characteristics between modalities, thus enhancing the effectiveness of the features.
(3)
For modality-specific features, in order to avoid over-constraint on similar modalities, which may lead to loss of modality information, we propose a prototypical loss based on a similarity prior, which ensures the effectiveness of distinctive features for similar modalities.
(4)
We perform experiments on MMIC and medical image fusion for validation. The results achieve outstanding performance, demonstrating the feasibility of fusing multi-modalities via our method.

Figure 1.
Performance comparison of our proposed method with existing classification and fusion approaches. (a) Performance on LLD-MMRI (Lou, 2023) contrast with seven methods and (b) performance on AANLIB (Zhang, 2022) in eight metrics.
2. Related Work

This section will first introduce relevant background knowledge and previous work on medical image classification, followed by an overview of disentangled representation learning and application to image fusion. Finally, we briefly introduce the concept of prototype networks for few-shot learning.

2.1. Medical Image Classification

Common medical image analysis tasks include lesion classification and segmentation (Azam et al., 2022). Since segmentation focuses more on morphological changes while classification targets high-level semantics, which are higher requirements for multi-modal information utilization, this section elaborates on classification.

Motivated by recent advances in deep learning in natural image analysis, there has been a growing interest in leveraging deep learning for medical image diagnosis (Chan et al., 2020; Jiang et al., 2020). Li et al. (2014) employed a shadow neural network to classify interstitial lung disease in chest images. Xiao et al. (2021) proposed an ultra-lightweight end-to-end classification neural network for electrocardiogram classification. However, their networks were designed for single images, whereas in medical image analysis, advanced utilization of multi-phase data can promote better feature learning and representation, and further improve diagnostic performance and reduce uncertainty (Gao et al., 2021; Hamm et al., 2019; Panayides et al., 2020).

Although researchers have made preliminary attempts at multi-modal information fusion, such as Guo et al. (2019) utilizing multi-modal images including computed tomography (CT), MRI (T1 and T2), and positron emission tomography (PET) to identify soft tissue sarcomas, and Weibin et al. (2019) proposing a deep learning-based radiomic approach that uses multiphase CT images to predict early recurrence of hepatocellular carcinoma, these studies have limitations. The aforementioned method applies simple concatenation or averaging for fusion, which cannot fully explore complementary information across modalities.

Recently, researchers have favored transformers for MMIC (Shamshad et al., 2023). Xu et al. (2023) used attention mechanisms for feature extraction and fusion, along with utilizing prior knowledge of different lesions in the same patient, to guide accurate prediction in multiphase CT. Dai et al. (2021) and Zhan et al. (2022) both attempted to combine transformer and convolutional neural network (CNN) for MMIC, adopting the same pipeline of using CNN for feature extraction followed by transformer for feature fusion, achieving excellent classification performance.

However, their research overlooks constraints between modalities. Incorporating the dual-branch decoupling concept from image fusion, specific constraints can be designed for different features to minimize redundant information that repeatedly disturbs the model.

2.2. Disentangled Representation Learning and Image Fusion Applications

Disentangled representation learning aims to learn interpretable and controllable representations that capture the underlying explanatory factors behind the data (Bengio et al., 2013). In a disentangled representation, each dimension corresponds to a single generative factor while being invariant to changes in other factors.

Medical image fusion commonly involves modalities such as MRI, PET, CT, and single-photon emission computed tomography (SPECT), and is often jointly validated with natural image fusion (James & Dasarathy, 2014; Li et al., 2021). Recent studies on combining disentangled representation learning have achieved performance improvements. Fu and Wu (2021) proposed a dual-branch network, first decoupling features into detail and semantic ones, then fusing the decoupled features independently, before finally merging outputs of the two branches. Xu et al. (2022) proposed a unified unsupervised image fusion method that adaptively determines importance relations between images through multi-level features extracted by CNN. Subsequent studies have been largely based on two-branch architectures (Ding et al., 2020; Dong et al., 2022; Liu et al., 2021).

With the rise of transformers and their strong performance in cross-modal tasks, an increasing number of researchers have applied this architecture to image fusion (Dosovitskiy et al., 2020; Radford et al., 2021; Xu et al., 2023). For example, Vs et al. (2022) employ a two-branch design for fusion, where CNN captures local features and transformer models long-range dependencies. Zhao et al. (2023) adopt the transformer for basic feature extraction and fusion of dual-branch outputs, while the invertible neural network extracts detailed features.

However, existing loss functions for dual-branch tasks struggle to handle scenarios with more than two modalities (Azam et al., 2022; James & Dasarathy, 2014) and lack generalizability across different decoupling tasks (Chen et al., 2022). Therefore, it is necessary to design a tailored loss function specifically for the complementary information within our dual-branch framework.

2.3. Prototype Network

Prototype networks are a type of neural network that can perform effective representation learning from limited training data. The key idea is to learn a prototypical representation for each class, rather than explicitly learning a decision boundary as in traditional neural networks (Snell et al., 2017).

Prototype features help alleviate the impact of inter-class differences on the model. We transfer them to multi-modal scenarios and use them in a self-supervised manner to constrain the information deviation between different modalities, thereby reducing the influence of redundant information.

3. Method

This section provides a detailed explanation of the proposed method. The first subsection introduces the overall architecture of the model. The next subsection delineates the core design methodology. The last two subsections elaborate on the details of the feature fusion module and the overall training loss function, respectively.

3.1. Overview

As illustrated in Figure 2, our model architecture consists mainly of three parts. One is a dual-branch feature disentanglement module, which is used to decouple the original information into shared and specific information simultaneously. The second is a feature fusion module that fuses the shared and specific information. The last is a task decoding module.

Figure 2.

Overview of our proposed framework. The blue box represents input images of different modalities, and the overall architecture contains three parts: Disentanglement, fusion, and decoding. Where $◯ P$ represents the computation of the prototype characteristic.

Our core contribution lies in the feature disentanglement module, which mainly consists of a dual-branch network structure and constraint methods designed for each branch to extract disentangled representations, detailed in Section 3.2. For the feature fusion module, we adopt existing techniques with slight modifications to accommodate our disentanglement approach, elucidated in Section 3.3. As for the decoding module, a simple fully connected layer is used to obtain classification results. The fused features are flattened and then fed directly into a fully connected layer to output the final predictions.

3.2. Disentanglement Module

Li et al. (2023) believe that in remote sensing image detection, most objects are small under a bird’s eye view and require environmental features such as object locations to aid identification. Similarly, in medical images, contextual information around lesion regions is also valuable for disease diagnosis. Therefore, we believe that global information can promote accuracy in MMIC. Hence, we utilize a dual-branch structure in the extraction stage to extract global and local features separately from the images. The global features primarily contain background information for the different modalities, which is weakly relevant to the lesions; we refer to these as shared features. Local features contain primarily modality-specific representations that are strongly relevant to local lesion regions. We refer to these as specific features.

Due to the fact that ResNet (He et al., 2016) shows great potential in deep learning and its simple structure, making it suitable as a benchmark network to verify the validity of the framework. For image classification tasks, simpler ResNet networks are adopted in both branches with weight sharing for all phases. The particular network configurations are visualized in Figure 3.

Figure 3.

Architecture of the feature extractors in medical image classification.

Note that the adopted ResNet networks convert all the original 2D operations into 3D, in order to adapt to the 4D input and encode it into a feature vector for each modality.

However, most existing dual-branch fusion methods do not impose effective constraints on the disentangled representations, resulting in uncertain convergence of the features and making it difficult to interpret whether the model has learned the corresponding representations. Zhao et al. (2023) constrain the base and detail features between the two modalities for different branches by maximizing and minimizing the correlation coefficients, respectively, as shown in the following:

L_{cddfuse} = \frac{{(C C (ϕ_{m 1}^{D}, ϕ_{m 2}^{D}))}^{2}}{C C (ϕ_{m 1}^{B}, ϕ_{m 2}^{B}) + ϵ},

(1)where

ϕ_{m}^{B}

represents the branch

B

feature extracted from modality

m

through the network,

C C

is the correlation coefficient calculation formula, and

ϵ

is a very small value guaranteeing the validity of formulas with division.

We extend the approach of equation (1) to constraint methods for an arbitrary number of modalities by adding the constraints equation (1) between every two modalities to accommodate the broader range of multi-modal model constraint methods being investigated:

L_{tradition} = \sum_{i = 1}^{N - 1} \sum_{j = i + 1}^{N} \frac{{(C C (ϕ_{m_{i}}^{Speci}, ϕ_{m_{j}}^{Speci}))}^{2}}{C C (ϕ_{m_{i}}^{Share}, ϕ_{m_{j}}^{Share}) + ε},

(2)where

N

denotes the number of modalities,

ϕ^{Share}

represents the fused share feature, which corresponds to the base feature, while specific is detail.

We formulate the final loss function as the sum of pairwise modality difference losses. In the next two subsections, we first analyze this simple extension approach, then propose our prototype loss to achieve effective disentanglement of multi-modal features.

3.2.1. Modality-Shared Feature

First, as illustrated in Figure 4, we assume the shared feature of each modality is represented by a point. The optimization objective with Figure 4(a) is to gradually pull the feature points of all modalities closer together, which may eventually converge to a result similar to Figure 4(b). We can observe that the traditional optimization target is rather complex, making it difficult to ensure that the solutions will not bias toward any single modality, thus leading to unpredictable issues.

To address the limitations of the aforementioned constraint and considering the high redundancy among the shared features from different modalities, we can set a background feature to encompass the redundant parts across all modalities. This forces the level of importance to converge evenly across the modalities. Inspired by the prototype network in few-shot learning (Snell et al., 2017), which identifies prototype centers for each class in semantic space. We hypothesize that the shared features across modalities should have a prototype representation. Thus, we define a prototype for shared features:

ϕ_{proto}^{Share} = \frac{1}{N} \sum_{i = 1}^{N} ϕ_{i}^{Share} .

(3)

Figure 4.

Illustrative comparison of the traditional method (equation (2)) and proposed prototype loss (equation (4)) for share features. Including multi-modality feature alignment and problems with the traditional method. Green nodes represent the computed prototype feature, blue nodes represent the original features, and arrows indicate the direction of optimization of the original features. (a) Close share features with $L_{traditional}$ , (b) problem with traditional optimization, and (c) close share features with prototype feature

As illustrated in Figure 4(c), we utilize the green dot as the prototype feature. Based on this concept, we adopt the prototype feature as pseudo-labels to constrain the convergence direction of the features from each modality. Meanwhile, the constraint on each modality is independent, given the determined prototype, which encourages equal treatment of all modalities. It also mitigates the effects of noise or anomalies during modal acquisition by limiting the variance of shared features across modalities, enhancing stability while reducing the complexity of fusing shared features. The loss function is defined as:

L_{share} = \sum_{i = 1}^{N} {‖ ϕ_{proto}^{Share} - ϕ_{i}^{Share} ‖}_{2},

(4)where

{‖ \cdot ‖}_{2}

means the

l_{2}

-norm. This is to prepare for subsequent modal feature fusion.

3.2.2. Modality-Specific Feature

The proposed approach visualizes the specific feature representations of each modality through differently colored data points, using matching colored arrows to denote inter-modal constraint effects (Figure 5). Infrared and visible spectrum images exhibit substantially dissimilar intrinsic feature spaces. To address this, Zhao et al. (2023) apart these modal representations by expanding the loss function to include additional regularization terms that explicitly enforce distribution apart between modal feature spaces, as shown in Figure 5(a). However, directly extending this methodology to more than two modalities significantly increases optimization difficulty due to the higher complexity of the problem.

Initially in Figure 5(b), prototype representations are leveraged to generate pseudo-label assignments for self-supervision. By imposing weighted losses that magnify the distances between encoded features and specific prototypical features, attentional mechanisms learn to focus on modality-specific patterns.

Figure 5.

Illustrative comparison of the traditional method (equatio (2)) and proposed prototype loss (equations (7) and (9)) for specific features. Including multi-modality feature apart from the problem with the simply-prototype method. Green nodes represent the computed prototype feature, other color nodes represent the original features with different modalities, and arrows indicate the direction of optimization of the same color original feature. (a) Apart specific features with $L_{traditional}$ , (b) apart specific features with prototype feature and possible problem, and (c) apart specific features with similarity-based prototype feature.

However, both methods may overlook potential inter-modal commonalities during distancing—for example, in multi-phase MRI classification, liver cyst traits are not substantially enhanced in contrast phase, so their representations need not be far distanced, which can degrade performance or cause mistakes. Thus, we implement a similarity-preserving constraint to retain cross-modal generalizability showed in Figure 5(c). First, the cosine similarity between modal feature embeddings is quantified, then leveraged to cluster modalities based on a predefined threshold $τ$ :

C (i, j) = {\begin{cases} 1, & if CosSim (ϕ_{i}^{Speci}, ϕ_{j}^{Speci}) > τ, \\ 0, & else, \end{cases}

(5)where

C

is a matrix with

N \times N

C (i, j)

indicates whether the two modalities are the same coarse-grained class. And CosSim() represents the calculation of the cosine similarity with two feature vectors.

Then, we calculate the prototype features for each cluster, and the calculation function for single modality $i$ is shown as follows:

ϕ_{proto - i}^{Speci} = \frac{\begin{matrix} \sum_{j = 0}^{N} ϕ_{j}^{Speci} \cdot C (i, j) \end{matrix}}{\begin{matrix} \sum_{j = 0}^{N} C (i, j) \end{matrix}} .

(6)

Subsequently, we provide cluster loss terms that enforce tight feature distributions between modalities assigned to common clusters and their respective prototypical representations. This preserves cross-modal generalizability for modalities deemed sufficiently similar by the initial thresholding process. Constraining modal feature deviations from highly representative prototypes compels generalized, transferable representations while sustaining discriminative power.

L_{cluster} = \sum_{i = 1}^{N} {‖ ϕ_{proto - i}^{Speci} - ϕ_{i}^{Speci} ‖}_{2} .

(7)

Concurrently, the cluster-specific prototypes are leveraged to extract more generalized representations for each modality. Specifically, prototypes derived from the joint cross-modal cluster distributions are utilized to initialize the modality-specific feature mappings. This framework enhances model stability while still enabling the dedicated pathways to capture intrinsic characteristics. The resultant independent prototypical representations for all modalities are obtained through an aggregation of the cluster-level prototypical vectors.

ϕ_{proto}^{Speci} = \frac{1}{N} \sum_{i = 1}^{N} ϕ_{proto - i}^{Speci} .

(8)

Finally, restricting the distance between prototype features and cluster-level prototypes is maximized, as shown in equation (9). It can lead to a large difference in the similarity of specific features of different modalities, thus ensuring that different modality-specific features can extract modality-specific information.

L_{speci} = \frac{1}{N} \sum_{i = 1}^{N} PCCs (ϕ_{proto - i}^{Speci}, ϕ_{proto}^{Speci}),

(9)where PCCs represent the calculation of the Pearson correlation coefficient for two elements. Thus, the distance between features is further apart from the perspective of feature similarity.

This completes an integrated regularization framework that balances (i) cross-modal generalizability through similarity-preserving constraints, while (ii) maintaining representational independence by separation from specialized prototypical conceptions. This joint optimization approach enhances model versatility across heterogeneous datasets by extracting both shared and unique information.

3.3. Fusing Module

For image classification, the extracted shared and specific representations are fused separately. Unlike previous network fusion approaches (Fu & Wu, 2021; Vs et al., 2022; Zhao et al., 2023), we simply concatenated the shared prototype feature with specific features, input into the transformer (Vaswani et al., 2017) encoder for fusion, and the architecture is illustrated in Figure 6.

Figure 6.

Basic architecture of the feature fusion module with $N$ cascaded blocks.

In order to validate the effectiveness of our approach, we employ the most basic transformer encoder architecture. And only retained the self-attention mechanism, which is important for multi-modal interaction, with no additional token settings. Since the information from different modalities is independent of each other, unlike the positional relationship between words or image patches, we discard the positional encoding scheme.

3.4. Loss Function

For classification tasks, the loss function includes the final classification loss additionally.

L_{cls} = CE (y_{label}, y_{predict}),

(10)where

y_{label}

denotes the ground-truth label,

y_{predict}

represents the predicted label from the network, and CE is the cross-entropy loss function.

The final total loss function is:

L_{total} = L_{cls} + α_{1} L_{share} + α_{2} L_{speci} + α_{3} L_{cluster},

(11)where

α_{1}

α_{2}

, and

α_{3}

are hyper-parameters regulating different loss weights.

4. Experiments

4.1. Datasets

LLD-MMRI. The LLD-MMRI dataset, introduced in a challenge hosted by MICCAI 2023, contains multi-phase MRI scans of the liver, including T2-weighted, diffusion-weighted, in-phase, out-phase, pre-contrast, and post-contrast enhanced images in arterial, venous, and delayed phases (Lou, 2023). It encompasses seven types of liver lesions (three malignant: intrahepatic cholangiocarcinoma, liver metastases, and hepatocellular carcinoma; four benign: hemangioma, liver abscess, cysts, and focal nodular hyperplasia) for differential diagnosis. The training and validation subsets comprise 316 and 78 volumes, respectively, with ground truth region of interest (ROI) and lesion annotations provided. The test set offers 104 volumes with ROI labels but undisclosed lesion types, whose classification performance is evaluated by reverse labeling with the official scoring metrics.

AANLIB. The Harvard Medical School and Brigham and Women’s Hospital and others maintain a website called “Whole Brain Atlas,” which provides information and images about the human brain and its diseases (Summers, 2003). It contains different types of brain images, such as MRI, CT, PET, SPECT, etc. Zhang (2022) collected 810 pairs of images from this site, with 184 pairs of MRI–CT, 269 pairs of MRI–PET, and 357 pairs of MRI–SPECT for image fusion. We extracted these three types of pairs according to a split ratio of 0.6, 0.2, and 0.2 to form the training, validation, and test sets.

4.2. Implementation Details

Setup. The experiments in this work were conducted on a single NVIDIA 4090 GPU with an Ubuntu system and Pytorch 2.0.1. For image classification, our basic settings mirror those of the baseline model provided in the LLD-MMRI challenge. The modification included setting the dropout rate to 0.5 to alleviate the potential overfitting due to limited medical image data. And the hyper-parameters $α_{1}$ , $α_{2}$ , and $α_{3}$ are set to 1 for all. In addition, we use ResNet-18 and two layers of two head transformer encoder as our feature extractor and fusion layer.

For the image fusion task, the baseline settings follow Zhao et al. (2023).

Metrics. For the evaluation of image classification, we adopt the relevant metrics associated with the official challenge scoring scheme, including the F1 score and the Kappa score (McHugh, 2012; Sasaki, 2007). F1 score comprehensively measures model precision and recall. Kappa quantifies the accuracy of the classifier while taking into account the possibilities of random guessing.

For image fusion evaluation, we adopt the same eight quantitative metrics as in CDDFuse. Specifically, entropy (EN; Roberts et al., 2008) measures the information contained in the fused image, mutual information (MI; Qu et al., 2002) evaluates the amount of information transferred from source images, and normalized weighted edge information ( $Q_{AB / F}$ ; Petrovic & Xydeas, 2005) based on fractional calculus and logical functions quantifies the edge information in the fused output. These three metrics perform an assessment based on information theory. The feature-based metrics include two aspects: standard deviation (SD) characterizes intensity distribution and contrast, while spatial frequency (SF; Eskicioglu & Fisher, 1995) examines gradient distribution to reveal texture and detail variations. The spatial correlation discrepancy (SCD; Aslantas & Bendes, 2015) and structural similarity (SSIM; Wang, Bovik, et al., 2004) gauge image similarity with aspects of correlation and structure differences, respectively. Lastly, visual information fidelity (VIF; Han et al., 2013) provides an assessment grounded in aspects of human visual perception.

All metrics used showed better performance with higher values.

4.3. Compared With Other Methods

We compare the proposed algorithm with existing state-of-the-art approaches.

First, we selected three baseline models from Guo et al. (2019), abbreviated as EF, FF, and LF, respectively, for classification. We also included four advanced approaches STIC (Gao et al., 2021), MCCNet (Xu et al., 2023), TransMed (Dai et al., 2021), and Zhan (Zhan et al., 2022) as comparison methods. Among them, except for TransMed and MCCNet, the other five approaches are based on 2D CNN, which we simply converted to 3D CNN with a conversion operation. Furthermore, since STIC uses diagnostic data as additional information and MCCNet utilizes different lesion images from the same patient to build complementary knowledge, we removed these components to adjust to the dataset.

As shown in Table 1, MCCNet performed poorly on the LLD-MMRI dataset, due to the lack of knowledge and greater inconsistency without coordination. The baseline model achieved around 60% accuracy, while existing SOTA models reach approximately 70% accuracy. Our method achieved 80.77% accuracy, improved by almost 8 percentage points over the best-performing TransMed model in this classification dataset. The F1 score increased from 0.7184 to 0.8202, a substantial gain of 0.1018. The Kappa score also improved by more than 0.1. These results from different metrics demonstrate the efficacy and powerful semantic fusion ability of the proposed method.

Table 1.
Quantitative Results of the Classification Task on LLD-MMRI.

Algorithm Accuracy(%) F1 score Kappa $P$ -value

EF (Guo et al., 2019) 57.69 0.5765 0.5003 –

FF (Guo et al., 2019) 61.54 0.5911 0.5332 –

LF (Guo et al., 2019) 64.42 0.6208 0.5710 –

MCCNet (Xu et al., 2023) 45.19 0.4268 0.3323 –

STIC (Gao et al., 2021) 68.27 0.6678 0.6138 –

TransMed (Dai et al., 2021) 72.12 0.7184 0.6638 –

Zhan (Zhan et al., 2022) 69.23 0.6940 0.6317 –

$L_{traditional}$ 76.92 0.7572 0.7189 –

Proposed(ours) 80.77 0.8202 0.7688 0.0027

Note. Boldface and underline show the best values by our proposed methods and existing methods, respectively.

Algorithm	Accuracy(%)	F1 score	Kappa	$P$ -value
EF (Guo et al., 2019)	57.69	0.5765	0.5003	–
FF (Guo et al., 2019)	61.54	0.5911	0.5332	–
LF (Guo et al., 2019)	64.42	0.6208	0.5710	–
MCCNet (Xu et al., 2023)	45.19	0.4268	0.3323	–
STIC (Gao et al., 2021)	68.27	0.6678	0.6138	–
TransMed (Dai et al., 2021)	72.12	0.7184	0.6638	–
Zhan (Zhan et al., 2022)	69.23	0.6940	0.6317	–
$L_{traditional}$	76.92	0.7572	0.7189	–
Proposed(ours)	80.77	0.8202	0.7688	0.0027

Experiments were also conducted using the traditional loss function $L_{traditional}$ expanded from Zhao et al. (2023) as a methodology. This method can achieve an accuracy of 76%, which has numerous improvements compared to existing methods, but lower than the method we finally proposed. Analysis indicates that imposing effective constraints on the dual-branch architectural framework can substantially improve benchmark metrics. However, performance deficiencies still manifest in the absence of accounting for the distinct intrinsic attributes associated with each modality.

To enhance the reliability of the experimental results, we selected methods employing traditional loss functions as the baseline and performed a paired sample $t$ -test with our proposed approach. Specifically, we independently trained the baseline model and the proposed model 10 times each on the same dataset, treating the accuracy metrics from the 10 training runs as 10 paired samples to compute the $p$ -value. The resulting $p$ -value was 0.0026, which is below the commonly used threshold of 0.05, demonstrating a statistically significant improvement of our method.

The visualization heatmaps are shown in Figure 7. It illustrates an example of the test set, where the first row displays the eight original MRI modalities. The three middle rows present heatmaps obtained by existing methods after training on LLD-MMRI and testing on the example images. The last row shows the results from our proposed method. It can be observed that STIC has larger attention regions for each modality, with only T2WI, out-phase, and venous phase focusing on the tumor area, while other modalities attend to peripheral background regions. In comparison, Zhan and our work leverage a transformer for feature fusion, which effectively avoids the problem of uneven attention across modalities in multi-modal situations. TransMed cuts images into patches before feeding them into the transformer, leading to unnecessary attention requirement for each patch by the network and, therefore, wasting computational resources. Although achieving good performance, Zhan presents overconcentration on the tumor area with attention regions covering the tumor for all modalities, which also results in inefficient usage of the network capacity. Our method demonstrates proper attention effects on both the tumor region and high-importance areas in different modalities, while significantly reducing redundant attention regions, hence enhancing the utilization efficiency of the network.

Figure 7.

Comparison of heatmap visualizations on the LLD-MMRI dataset. Here, eight columns represent eight different phases. The first row shows the original image, the middle row shows the results of other methods, and the last row shows the results of our method.

4.4. Ablation Study

To explore the impact of each module of our proposed method further, we conducted ablation studies on image classification.

Ablation study on architectures. For the image classification task, we use FF (Guo et al., 2019) as our baseline with shared ResNet-18 weights for feature extractors in all modalities, different from the compared methods above. We then validate our model design through four aspects: replacing the feature extractor with 3D networks, using a dual-branch architecture that calculates prototype features for shared features without limits, and fusing modalities with a transformer. The results are shown in Table 2.

Table 2.
Ablation Study Results of the Classification With Different Architectures on LLD-MMRI Dataset.

Architecture Metrics

Transformer 3D-ResNet Dual-branch Accuracy (%) F1 score Kappa

68.27 0.6843 0.6255

✓ 61.54 ( $- 6.73$ ) 0.6198 ( $- 0.0645$ ) 0.5496 ( $- 0.0759$ )

✓ 63.46 ( $- 4.81$ ) 0.6515 ( $- 0.0328$ ) 0.5742 ( $- 0.0513$ )

✓ 63.46 ( $- 4.81$ ) 0.632 ( $- 0.0523$ ) 0.5649 ( $- 0.0606$ )

✓ ✓ 69.23 ( $+ 0.96$ ) 0.6940 ( $+ 0.0097$ ) 0.6317 ( $+ 0.0062$ )

✓ ✓ 66.67 ( $- 1.6$ ) 0.6106 ( $- 0.0737$ ) 0.5882 ( $- 0.0373$ )

✓ ✓ 57.69 ( $- 10.58$ ) 0.5738 ( $- 0.1105$ ) 0.4937 ( $- 0.1318$ )

✓ ✓ ✓ 70.19 ( $+ 1.92$ ) 0.6917 ( $+ 0.0074$ ) 0.6403 ( $+ 0.0148$ )

Note. The boldface shows the best values.

Architecture	Metrics
			68.27	0.6843	0.6255
✓			61.54 ( $- 6.73$ )	0.6198 ( $- 0.0645$ )	0.5496 ( $- 0.0759$ )
	✓		63.46 ( $- 4.81$ )	0.6515 ( $- 0.0328$ )	0.5742 ( $- 0.0513$ )
		✓	63.46 ( $- 4.81$ )	0.632 ( $- 0.0523$ )	0.5649 ( $- 0.0606$ )
✓	✓		69.23 ( $+ 0.96$ )	0.6940 ( $+ 0.0097$ )	0.6317 ( $+ 0.0062$ )
✓		✓	66.67 ( $- 1.6$ )	0.6106 ( $- 0.0737$ )	0.5882 ( $- 0.0373$ )
	✓	✓	57.69 ( $- 10.58$ )	0.5738 ( $- 0.1105$ )	0.4937 ( $- 0.1318$ )
✓	✓	✓	70.19 ( $+ 1.92$ )	0.6917 ( $+ 0.0074$ )	0.6403 ( $+ 0.0148$ )

Firstly, results represent we can have the best performance when combining all three architectures, and we can observe that using each architecture alone for feature fusion leads to a noticeable performance drop, especially on the transformer. However, when we utilize transformer architecture in collocation with others, performance could be improved. This indicates the complementary effects between different modules, and the transformer cannot fully exploit when models are inappropriate. Another phenomenon worth noting is that when we applied 3D-ResNet and dual-branch, the performance is the worst compared to other studies. This may mean that the transformer can effectively combine different modules to improve the overall performance of the model.

Ablation study on loss function. Further, ablation studies were performed on multiple proposed loss functions, with detailed benchmarking conducted. The experimental implementation leveraged all the architectural frameworks mentioned above, facilitating comparison between the loss function of the proposed methodology and traditional loss $L_{traditional}$ . The proposed loss function encompasses three distinct subordinate terms $L_{share}, L_{cluster}, L_{speci}$ . Comprehensive ablation experimentation on each of these terms was performed independently. Results shown in Table 3 denote individual performance enhancement capability associated with each loss sub-component, with overall impact manifesting in a superimposed fashion. Optimal aggregate metrics were ultimately achieved using the full proposed loss.

Table 3.

Ablation Study Results of Classification with Different Loss Functions on LLD-MMRI Dataset.

Loss Function				Metrics
$L_{traditional}$	$L_{share}$	$L_{cluster}$	$L_{speci}$	Accuracy(%)	F1-Score	Kappa
				70.19	0.6917	0.6403
✓				76.92 ( $+ 6.73$ )	0.7572 ( $+ 0.0655$ )	0.7189 ( $+ 0.0786$ )	✓	75.28 ( $+ 5.09$ )	0.7675 ( $+ 0.0758$ )	0.6992 ( $+ 0.0589$ )
		✓		75 ( $+ 4.81$ )	0.7488 ( $+ 0.0571$ )	0.6979 ( $+ 0.0576$ )
			✓	73.08 ( $+ 2.89$ )	0.74 ( $+ 0.0483$ )	0.6753 ( $+ 0.035$ )
	✓	✓		74.04 ( $+ 3.85$ )	0.7547 ( $+ 0.063$ )	0.6862 ( $+ 0.0459$ )
	✓		✓	75 ( $+ 4.81$ )	0.7637 (0.072)	0.6988 ( $+ 0.0585$ )
		✓	✓	75.96 ( $+ 5.77$ )	0.7724 ( $+ 0.0807$ )	0.7133 ( $+ 0.073$ )
	✓	✓	✓	80.77 ( $+ 10.58$ )	0.8202 ( $+ 0.1285$ )	0.7688 ( $+ 0.1285$ )

Note. The boldface shows the best values.

Analysis indicates performance deficits when excluding each of the proposed loss sub-modules relative to baseline metrics achieved using traditional loss functions, validating the synergistic interplay conferred by the introduced objective terms. The isolated use of $L_{speci}$ displays the lowest comparative gains, albeit exceeding the inherent performance of the system independent of additional constraints. This validates the different optimization capabilities presented by the unique contribution of the different proposed modules. Concurrent employment of $L_{cluster}$ and $L_{speci}$ approaches parity with the conventional methodological standard. Multiple underlying factors contribute to this, namely—the robust interoperability displayed between these two sub-components, alongside the dual-branch architectural framework leveraged, where conjoined use of losses targeting representation sharing enjoys significant benefits. This mutually inclusive implementation facilitated the superior benchmark results attained.

Ablation study on feature importance. To further analyze the effectiveness of the two-branch feature. We conducted an ablation study for both cases of the proposed method and without using the proposed loss function. Experiments were conducted in the trained model by placing the retention of one feature and the other at 0. The results are shown in Table 4. Analyzing the data in the table, it can be found that it is closer to the best result when retaining the specific features, which indicates that the specific features of each modality are an important part of the reference for medical image classification, while the share features can provide some information to help the model to further improve the classification effect, which proves the effective information decoupling of the two-branch structure. Meanwhile, it can be found that the Kappa of the shared features is 0 after adding the loss function, which indicates that the accuracy obtained only by the share features is consistent by accident, and the accuracy of the specific features is greatly improved, which proves that our proposed loss function can help the model decouple the effective information of each modality to the specific features.

Ablation study on hyperparameter. In order to investigate the sensitivity of the proposed method in terms of hyperparameter values. We explored the performance of the model for different values of $τ$ and the results are shown in Figure 8. Since cosine similarity is represented as having similarity in the range of [0, 1], $τ$ is taken in that range. It can be found that the best performance is obtained when taking 0. The small difference in performance in the range 0.1–0.8 proves that there is not much sensitivity to the choice of $τ$ . A significant drop in performance can be seen at values >0.8, which may be due to the failure of this $L_{cluster}$ .

Ablation study on data quality. Finally, we conducted additional experiments and analysis on data quality. It consists of two main aspects, one is the performance under different numbers of modalities, and the other is the robustness study on noisy data as well as incomplete data after the training is completed.

The first aspect, we conducted supplementary experiments on increased-modality scenarios to validate the rationality of the proposed approach from two dimensions: inference time per image expended and final classification results. Specifically, we expanded the LLD-MMRI into eight subsets with one to eight modalities, respectively, and performed training and testing under the corresponding modalities. The results are presented in Table 5.

Figure 8.

Results for different value of $τ$ .

Table 4.

Ablation Study Results of Classification With Different Features on LLD-MMRI Dataset.

Proposed loss	Retaining	Accuracy (%)	F1 score	Kappa
w/o	Share feature	30.77	0.2711	0.1587
	Special feature	68.27	0.6725	0.6184
	All features	70.19	0.2711	0.1587
w/	Share feature	31.77	0.0672	0.0000
	Special feature	78.88	0.7997	0.7460
	All features	80.77	0.8202	0.7688

Note. w/ and w/o represent whether or not the proposed loss function is used during training.

Table 5.

Quantitative Results of the Classification Task With Different Number of Modalities.

Modality	Accuracy(%)	F1 score	Kappa	Time (ms)
1	50.00	0.4951	0.4012	4.5
2	56.73	0.5285	0.4644	6.2
3	62.50	0.6024	0.5367	5.2
4	63.46	0.6188	0.5545	6.5
5	64.42	0.6279	0.5607	6.7
6	71.15	0.7207	0.6455	7.5
7	70.19	0.6955	0.6349	6.3
8	80.77	0.8202	0.7688	6.4

Note. The time column shows average inference time per image.

Comparing performance from one to eight modalities shows accuracy and other metrics improving steadily as the number of modalities increases, with accuracy boosted by 30.77% and other metrics improved by over 0.3 when reaching eight modalities. This validates the importance and necessity of multi-modal image fusion and demonstrates our model’s efficacy in fusing high-level semantic information across modalities. Also, the testing time shows a large increase only from one to two modalities, remaining stable as modalities further increase. This proves our model’s strong scalability with the number of modalities.

On the other hand, we test the robust performance of the model by adding Gaussian noise or a mask to the test image to simulate a low-quality image. The results are shown in Figure 9. It can be found that for low noise intensity, there is good robustness, but when the Gaussian noise intensity is >40, the performance decreases sharply. For adding masks, the robustness is poor, 5% mask also leads to a severe degradation in performance, probably because the model utilizes the detailed information to a high degree.

Figure 9.

Results for different strengths of noise and different percentages of mask ratio.

4.5. Fusion Application

For image fusion, we selected IFCNN (Zhang et al., 2020), DenseFuse (Li & Wu, 2019), EMFusion (Xu & Ma, 2021), U2Fusion (Xu et al., 2022), DualFuse (Fu & Wu, 2021), IFT (Vs et al., 2022), and CDDFuse (Zhao et al., 2023) as comparison methods. The first four are based on single-branch architectures, among which U2Fusion fuses features from different depths of the single branch and is categorized as a single-branch method. The last three share the dual-branch structure with our model, allowing comprehensive evaluation of our approach’s efficacy. Experiments were conducted by training and testing each model on the full AANLIB dataset, where similarity loss in EMFusion is only applicable to fusing red–green–blue images, we remove it in training.

As shown in Table 6, displaying fusion results between every modality pair would be lengthy, so we included them in the supplemental materials and only present the average performance on the test set after full testing in the main text. Ablation studies will also follow this format. It can be observed that the EN metric has little difference, since it was used as the loss function by all methods. We see that U2Fusion achieved excellent performance in the single-branch setting, but still lags behind the dual-branch networks. Among the dual-branch methods, the transformer structure outperformed CNN by a large SSIM margin. Without the share feature fusion module and constraints for specific features, our method remained comparable to the SOTA method CDDFuse. This validates that our proposed approach maintains strong multi-modal semantic fusion capacity without sacrificing low-level dual-modality fusion performance.

Table 6.
Quantitative Results of the Fusion Task on AANLIB.

Algorithm EN MI $Q_{AB / F}$ SD SF SCD SSIM VIF

Single-Branch IFCNN (Zhang et al., 2020) 4.53 1.68 0.46 50.54 19.49 0.36 0.49 0.45

DenseFuse (Li & Wu, 2019) 4.65 1.83 0.54 56.63 18.76 0.87 0.54 0.53

U2Fusion (Xu et al., 2022) 4.52 2.03 0.57 64.47 18.11 1.31 0.75 0.59

EMFusion (Xu & Ma, 2021) 4.67 1.93 0.50 60.63 16.46 1.16 0.60 0.55

Dual-Branch DualFuse (Fu & Wu, 2021) 4.62 1.91 0.59 60.37 18.47 1.10 0.59 0.59

IFT (Vs et al., 2022) 4.59 1.74 0.33 65.84 12.42 1.28 1.48 0.44

CDDFuse (Zhao et al., 2023) 4.30 2.32 0.66 71.95 28.00 1.39 1.52 0.62

Proposed 4.50 2.09 0.70 71.27 26.51 1.46 1.53 0.68

Note. Boldface and underline show the best and second-best values, respectively. EN = entropy; MI = mutual information; SD = standard deviation; SF = spatial frequency; SCD = spatial correlation discrepancy; SSIM = structural similarity; VIF = visual information fidelity.

	Algorithm	EN	MI	$Q_{AB / F}$	SD	SF	SCD	SSIM	VIF
Single-Branch	IFCNN (Zhang et al., 2020)	4.53	1.68	0.46	50.54	19.49	0.36	0.49	0.45
	DenseFuse (Li & Wu, 2019)	4.65	1.83	0.54	56.63	18.76	0.87	0.54	0.53
	U2Fusion (Xu et al., 2022)	4.52	2.03	0.57	64.47	18.11	1.31	0.75	0.59
	EMFusion (Xu & Ma, 2021)	4.67	1.93	0.50	60.63	16.46	1.16	0.60	0.55
Dual-Branch	DualFuse (Fu & Wu, 2021)	4.62	1.91	0.59	60.37	18.47	1.10	0.59	0.59
	IFT (Vs et al., 2022)	4.59	1.74	0.33	65.84	12.42	1.28	1.48	0.44
	CDDFuse (Zhao et al., 2023)	4.30	2.32	0.66	71.95	28.00	1.39	1.52	0.62
	Proposed	4.50	2.09	0.70	71.27	26.51	1.46	1.53	0.68

The visualization results are shown in Figure 10. Three subfigures depict the visualization of the fusion between MRI and CT, PET, and SPECT, respectively. The images demonstrate that our proposed method and CDDFuse achieve clearer and more detailed fusion results compared to other existing methods across different modality fusion tasks. Moreover, we can observe that CDDFuse suffers from loss of CT characteristics after MRI–CT fusion, and also leads to inversion of SPECT features from white to black after MRI–SPECT fusion, while our method still obtains better results on such samples, indicating higher preservation of semantic features after fusion. Furthermore, in MRI–PET fusion, the small colored region in the middle fused by our method exhibits more significant contrast compared to the surroundings than that of CDDFuse, suggesting that our method enables finer-grained fusion of color characteristics.

Figure 10.

Visual comparison in medical image on the ANNLIB dataset with three combinations of different modalities. (a) Visual comparison for MRI–CT, (b) visual comparison for MRI–PET, and (c) visual comparison for MRI–SPECT. MRI = magnetic resonance imaging; CT = computed tomography; PET = positron emission tomography; SPECT = single-photon emission computed tomography.

5. Conclusion

In this paper, we propose a dual-branch architecture that can be used to efficiently diagnose multi-modal medical images. Using a basic network architecture and following the pipeline, we propose a prototype loss for different feature maps to efficiently consolidate multi-modal information when increasing the number of modalities. Experiments on both image classification and fusion tasks validate the fusion efficacy of the method. This provides an effective general solution for deep learning to utilize different modality information as the number of medical imaging modalities increases. However, due to limited time and resources, we did not conduct more experiments to identify the most effective architectures for medical images, nor validated on more multi-modal medical imaging tasks and datasets.

Footnotes

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

ORCID iDs

Zhixian Wang

Tao Zhang

Wu Huang

References

Aslantas

Bendes

(2015). A new image quality metric for image fusion: The sum of the correlations of differences. AEU – International Journal of Electronics and Communications, 69(12), 1890–1896. https://doi.org/10.1016/j.aeue.2015.09.004

Azam

M. A.

Khan

K. B.

Salahuddin

Rehman

Khan

S. A.

Khan

M. A.

Kadry

Gandomi

A. H.

(2022). A review on multimodal medical image fusion: Compendious analysis of medical modalities, multimodal databases, fusion techniques and quality metrics. Computers in Biology and Medicine, 144, 105253. https://doi.org/10.1016/j.compbiomed.2022.105253

Bengio

Courville

Vincent

(2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828. https://doi.org/10.1109/TPAMI.2013.50

Bhatnagar

Q. M. J.

Liu

(2013). Directive contrast based multimodal medical image fusion in NSCT domain. IEEE Transactions on Multimedia, 15(5), 1014–1024. https://doi.org/10.1109/TMM.2013.2244870

Chan

H. P.

Samala

R. K.

Hadjiiski

L. M.

Zhou

(2020). Deep learning in medical image analysis. Advances in Experimental Medicine and Biology, 1213, 3–21. https://doi.org/10.1007/978-3-030-33128-3_1

Chen

Wang

Zhang

Fung

K. M.

Thai

T. C.

Moore

Mannel

R. S.

Liu

Zheng

Qiu

(2022). Recent advances and clinical applications of deep learning in medical image analysis. Medical Image Analysis, 79, 102444. https://doi.org/10.1016/j.media.2022.102444

Dai

Gao

Liu

(2021). TransMed: Transformers advance multi-modal medical image classification. Diagnostics, 11, 1384. https://doi.org/10.3390/diagnostics11081384

Ding

Zhou

Nie

Hou

Liu

(2020). Brain medical image fusion based on dual-branch CNNs in NSST domain. BioMed Research International, 14, 6265708. https://doi.org/10.1155/2020/6265708

Dong

Zhang

Xia

(2022). A spatial–spectral dual-optimization model-driven deep network for hyperspectral and multispectral image fusion. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–16. https://doi.org/10.1109/TGRS.2022.3217542

10.

Dosovitskiy

Beyer

Kolesnikov

Weissenborn

Zhai

Unterthiner

Dehghani

Minderer

Heigold

Gelly

Uszkoreit

Houlsby

(2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

11.

Eskicioglu

A. M.

Fisher

P. S.

(1995). Image quality measures and their performance. IEEE Transactions on Communications, 43(12), 2959–2965. https://doi.org/10.1109/26.477498

12.

X. J.

(2021). A dual-branch network for infrared and visible image fusion. In 2020 25th international conference on pattern recognition (ICPR). (pp. 10675–10680). IEEE.

13.

Gao

Zhao

Aishanjiang

Cai

Wei

Zhang

Liu

Zhou

Han

Wang

Ding

Liu

(2021). Deep learning for differential diagnosis of malignant hepatic tumors based on multi-phase contrast-enhanced CT and clinical data. Journal of Hematology & Oncology, 14, 154. https://doi.org/10.1186/s13045-021-01167-2

14.

Guo

Huang

Guo

(2019). Deep learning-based image segmentation on multimodal medical imaging. IEEE Transactions on Radiation and Plasma Medical Sciences, 3(2), 162–169. https://doi.org/10.1109/TRPMS.2018.2890359

15.

Hamm

C. A.

Wang

C. J.

Savic

L. J.

Ferrante

Schobert

Schlachter

Lin

Duncan

J. S.

Weinreb

J. C.

Chapiro

Letzen

(2019). Deep learning for liver tumor diagnosis part I: Development of a convolutional neural network classifier for multi-phasic MRI. European Radiology, 29(7), 3338–3347. https://doi.org/10.1007/s00330-019-06205-9

16.

Han

Cai

Cao

(2013). A new image fusion performance metric based on visual information fidelity. Information Fusion, 14(2), 127–135. https://doi.org/10.1016/j.inffus.2011.08.002

17.

Zhang

Ren

Sun

(2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778). IEEE.

18.

James

A. P.

Dasarathy

B. V.

(2014). Medical image fusion: A survey of the state of the art. Information Fusion, 19, 4–19. https://doi.org/10.1016/j.inffus.2013.12.002

19.

Jiang

Gao

Huang

Zhu

(2020). Attentive and ensemble 3D dual path networks for pulmonary nodules classification. Neurocomputing, 398, 422–430. https://doi.org/10.1016/j.neucom.2019.03.103

20.

Lee

H. Y.

Tseng

H. Y.

Huang

J. B.

Singh

Yang

M. H.

(2018). Diverse image-to-image translation via disentangled representations. In Proceedings of the European conference on computer vision (ECCV) (pp. 35–51). Springer.

21.

X-J.

(2019). DenseFuse: A fusion approach to infrared and visible images. IEEE Transactions on Image Processing, 28(5), 2614–2623. https://doi.org/10.1109/TIP.2018.2887342

22.

Cai

Wang

Zhou

Feng

D. D.

Chen

(2014). Medical image classification with convolutional neural network. In 2014 13th international conference on control automation robotics & vision (ICARCV), Singapore (pp. 844–848). IEEE. https://doi.org/10.1109/ICARCV.2014.7064414.

23.

Hou

Zheng

Cheng

M. M.

Yang

(2023). Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF international conference on computer vision. (pp. 16794–16805). IEEE.

24.

Zhao

(2021). Medical image fusion method by deep learning. International Journal of Cognitive Computing in Engineering, 2, 21–29. https://doi.org/10.1016/j.ijcce.2020.12.004

25.

Liu

Jiang

Fan

Luo

(2021). A bilevel integrated model with data-driven layer ensemble for multi-modality image fusion. IEEE Transactions on Image Processing, 30, 1261–1274. https://doi.org/10.1109/TIP.2020.3043125

26.

Loening

A. M.

Gambhir

S. S.

(2003). AMIDE: A free software tool for multimodality medical image analysis. Molecular Imaging, 2(3), 131–137. https://doi.org/10.1162/15353500200303133

27.

Lou

(2023). Liver lesion diagnosis challenge on multi-phase MRI. In International conference on medical image computing and computer assisted intervention (MICCAI) Zenodo. https://doi.org/10.5281/zenodo.7852363.

28.

McHugh

M. L.

(2012). Interrater reliability: The kappa statistic. Biochem Med (Zagreb), 22(3), 276–82.

29.

Menze

B. H.

Jakab

Bauer

Kalpathy-Cramer

Farahani

Kirby

Burren

Porz

Slotboom

Wiest

Lanczi

Gerstner

Weber

M. A.

Arbel

Avants

B. B.

Ayache

Buendia

Collins

D. L.

Cordier

Corso

J. J.

Criminisi

(2015). The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Transactions on Medical Imaging, 34(10), 1993–2024. https://doi.org/10.1109/TMI.2014.2377694

30.

Panayides

A. S.

Amini

Filipovic

N. D.

Sharma

Tsaftaris

S. A.

Young

Foran

Golemati

Kurc

Huang

Nikita

K. S.

Veasey

B. P.

Zervakis

Saltz

J. H.

Pattichis

C. S.

(2020). AI in medical imaging informatics: Current challenges and future directions. IEEE Journal of Biomedical and Health Informatics, 24(7), 1837–1857. https://doi.org/10.1109/JBHI.2020.2991043

31.

Petrovic

Xydeas

(2005). Objective image fusion performance characterisation. In Tenth IEEE international conference on computer vision (ICCV’05), Beijing, China (Vol. 1, pp. 1866–1871). IEEE. https://doi.org/10.1109/ICCV.2005.175.

32.

Zhang

Yan

(2002). Information measure for performance of image fusion. Electronics Letters, 38, 313–315. https://doi.org/10.1049/el:20020212

33.

Radford

Kim

J. W.

Hallacy

Ramesh

Goh

Agarwal

Sastry

Askell

Mishkin

Clark

Krueger

Sutskever

(2021). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748–8763). PMLR.

34.

Roberts

J. W.

Van Aardt

J. A.

Ahmed

F. B.

(2008). Assessment of image fusion procedures using entropy, image quality, and multispectral classification. Journal of Applied Remote Sensing, 2(1), 023522. https://doi.org/10.1117/1.2945910

35.

Sasaki

(2007). The truth of the F-measure. Teach Tutor Mater, 1(5), 1–5. https://www.researchgate.net/publication/268185911_The_truth_of_the_F-measure

36.

Shamshad

Khan

Zamir

S. W.

Khan

M. H.

Hayat

Khan

F. S.

(2023). Transformers in medical imaging: A survey. Medical Image Analysis, 88, 102802. https://doi.org/10.1016/j.media.2023.102802

37.

Singh

K. K.

Ojha

Lee

Y. J.

(2019). FineGAN: Unsupervised hierarchical disentanglement for fine-grained object generation and discovery. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. (pp. 6490–6499). IEEE.

38.

Snell

Swersky

Zemel

(2017). Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems, 30, 4080–4090. https://dl.acm.org/doi/10.5555/3294996.3295163

39.

Summers

(2003). Harvard whole brain Atlas. www.med.harvard.edu/AANLIB/home.html. Journal of Neurol Neurosurg Psychiatry, 74(3), 288. https://doi.org/10.1136/jnnp.74.3.288

40.

Vaswani

Shazeer

Parmar

Uszkoreit

Jones

Gomez

A. N.

Kaiser

Polosukhin

(2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 6000–6010. https://doi.org/10.48550/arXiv.1703.03325

41.

Valanarasu

J. M.

Oza

Patel

V. M.

(2022). Image fusion transformer. In 2022 IEEE international conference on image processing (ICIP). (pp. 3566–3570). IEEE.

42.

Wang

Chen

Tang

Zhu

(2024). Disentangled representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12), 9677–9696. https://doi.org/10.1109/TPAMI.2024.3420937

43.

Wang

Bovik

A. C.

Sheikh

H. R.

Simoncelli

E. P.

(2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4), 600–612. https://doi.org/10.1109/TIP.2003.819861

44.

Weibin

W. A.

Qingqing

C. H.

Xianhua

H. A.

Hongjie

H. U.

Lanfen

L. I.

Yen-Wei

C. H.

(2019). Deep learning-based radiomics models for early recurrence prediction of hepatocellular carcinoma with multi-phase CT images and clinical data. In 2019 41st annual international conference of the IEEE engineering in medicine and biology society (EMBC), Berlin, Germany (pp. 4881–4884). IEEE. https://doi.org/10.1109/EMBC.2019.8856356.

45.

Xiao

Liu

Yang

Liu

Wang

Zhu

Chen

Long

Chang

Zhou

(2021). ULECGNet: An ultra-lightweight end-to-end ECG classification neural network. IEEE Journal of Biomedical and Health Informatics, 26(1), 206–217. https://doi.org/10.1109/JBHI.2021.3090421

46.

(2021). EMFusion: An unsupervised enhanced medical image fusion network. Information Fusion, 76, 177–186. https://doi.org/10.1016/j.inffus.2021.06.001

47.

Jiang

Guo

Ling

(2022). U2Fusion: A unified unsupervised image fusion network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1), 502–518. https://doi.org/10.1109/TPAMI.2020.3012548

48.

Zhu

Clifton

D. A.

(2023). Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis & Machine Intelligence, 45(10), 12113–12132. https://doi.org/10.1109/TPAMI.2023.3275156

49.

Zhu

Ying

Cai

Liu

(2023). A knowledge-guided framework for fine-grained classification of liver lesions based on multi-phase CT images. IEEE Journal of Biomedical and Health Informatics, 27(1), 386–396. https://doi.org/10.1109/JBHI.2022.3220788

50.

Zhan

Wang

Chen

Y. W.

(2022). A transformer-based model for preoperative early recurrence prediction of hepatocellular carcinoma with multi-phase MRI. In Proceedings of the Asian conference on computer vision (pp. 179–188). Springer.

51.

Zhang

(2022). Harvard-Medical-Image-Dataset. https://github.com/HaoZhang1018/Harvard-Medical-Image-Dataset

52.

Zhang

Liu

Sun

Yan

Zhao

Zhang

(2020). IFCNN: A general image fusion framework based on convolutional neural network. Information Fusion, 54, 99–118. https://doi.org/10.1016/j.inffus.2019.07.011

53.

Zhao

Bai

Zhang

Lin

Timofte

Van Gool

(2023). CDDFuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. (pp. 5906–5916). IEEE.

Architecture			Metrics
Transformer	3D-ResNet	Dual-branch	Accuracy (%)	F1 score	Kappa
			68.27	0.6843	0.6255
✓			61.54 ( $- 6.73$ )	0.6198 ( $- 0.0645$ )	0.5496 ( $- 0.0759$ )
	✓		63.46 ( $- 4.81$ )	0.6515 ( $- 0.0328$ )	0.5742 ( $- 0.0513$ )
		✓	63.46 ( $- 4.81$ )	0.632 ( $- 0.0523$ )	0.5649 ( $- 0.0606$ )
✓	✓		69.23 ( $+ 0.96$ )	0.6940 ( $+ 0.0097$ )	0.6317 ( $+ 0.0062$ )
✓		✓	66.67 ( $- 1.6$ )	0.6106 ( $- 0.0737$ )	0.5882 ( $- 0.0373$ )
	✓	✓	57.69 ( $- 10.58$ )	0.5738 ( $- 0.1105$ )	0.4937 ( $- 0.1318$ )
✓	✓	✓	70.19 ( $+ 1.92$ )	0.6917 ( $+ 0.0074$ )	0.6403 ( $+ 0.0148$ )

Exploiting Shared and Distinctive Representations for Enhanced Multi-Modality Medical Image Analysis

Abstract

Keywords

1. Introduction

2.1. Medical Image Classification

2.2. Disentangled Representation Learning and Image Fusion Applications

2.3. Prototype Network

3. Method

3.1. Overview

4.1. Datasets

4.2. Implementation Details

4.3. Compared With Other Methods

Footnotes

Funding

Declaration of Conflicting Interests

ORCID iDs

References