Abstract
In recent years, vision-language pretraining (VLP) models have become a crucial driving force in the advancement of artificial intelligence. Studies such as contrastive language-image pretraining (CLIP) have further demonstrated that incorporating prompt learning into VLP models can significantly improve performance on downstream tasks. However, we argue that CLIP’s visual encoder suffers from feature extraction bias in image classification tasks, which arises from the uneven quantity and distribution of image features learned during pretraining relative to the fine-tuning stage. This can be further summarized as an inherent bias in feature extraction for differently distributed samples during the pretraining phase. To address this problem, this paper proposes (i) a text-semantic hierarchical injection prompt learning method, which constructs self-attention layers and prompt mapping structures and injects text semantic features into the visual encoder layer by layer to generate visual prompt features, and (ii) a visual-semantic attention interactive prompt learning method, which further integrates text embeddings with the output features of the visual encoder through cross-attention and constructs instance-level text prompt features for each image. Building on these two methods, this paper further proposes multimodal coupling prompt learning CLIP (MCPL-CLIP) to enhance CLIP’s performance in image classification tasks. Experiments conducted on 15 image classification datasets demonstrate that MCPL-CLIP outperforms baseline models such as MaPLe, CoCoOp, and CoOp in cross-dataset transfer, domain generalization, and base-to-novel class generalization tasks, showcasing its superior text semantic representation and visual feature extraction capabilities.
Introduction
The field of artificial intelligence (AI) is witnessing a significant convergence of computer vision (CV) and natural language processing (NLP), where the combined learning paradigm of CV and NLP, known as vision-language pretraining (VLP) models, has emerged as a critical driving force in the development of AI.
To advance VLP models’ ability to perceive the real world, they need to understand, interpret, and reason with multimodal information. By pretraining on large-scale image-text corpora, VLP models can learn general cross-modal representations beneficial for downstream vision-language tasks (Zellers et al., 2019). For example, LXMERT (Tan & Bansal, 2019) employs a dual-stream fusion encoder to learn vision-language joint representations, significantly outperforming traditional models in tasks such as VQA (Antol et al., 2015) and NLVR2 (Suhr et al., 2018) by pretraining on 9.18 million image–text pairs.
However, fine-tuning VLP models is both expensive and complex, making the effective transfer of VLP models to downstream tasks an exciting and valuable problem. Prompt learning offers an effective solution to this issue. By using a small number of task-specific parameters, VLP models can achieve huge performance gains on numerous vision-language tasks.
Our motivation derives from retrieval-augmented contrastive language-image pretraining (RA-CLIP), which points out a semantic structure mismatch in the text encoder of CLIP (Radford et al., 2021) between its pretraining text data and image classification label text. Going a step further, we believe that CLIP’s visual encoder also suffers from feature extraction bias between the pretraining and fine-tuning stages in image classification tasks. The primary cause stems from the uneven quantity and distribution of image features learned during the pretraining stage. This can be further summarized as an inherent bias in feature extraction for differently distributed samples during the pretraining phase.
Considering that language features are generally more stable in semantic structure compared to visual features (less dependent on specific visual representations), we think it is necessary to inject the rich semantic information from CLIP’s text encoder into the visual encoder to supplement the semantic information of rare samples from the pretraining stage, thereby enhancing the recognition accuracy and robustness of CLIP’s visual encoder in handling various fine-grained image classification tasks.
To address the issues of feature extraction biases and domain shifts between the pretraining and fine-tuning phases of CLIP’s visual encoders, this paper conducts the following research:
This paper proposes a two-stage prompt method to tackle the feature extraction bias of CLIP’s vision encoder: (a) the text-semantic hierarchical injection prompt learning (THIPL) method, which constructs a prompt mapping structure (PMS) and injects learnable prompt vectors from the text encoder into each layer of the visual encoder’s transformer structure to construct visual prompt features; and (b) the visual-semantic attention interactive prompt learning (VAIPL) method, which integrates text embeddings and visual encoder output features through cross-attention to build instance-level text prompt features and enhances CLIP’s cross-modal alignment capability. Based on this two-stage prompt method, this paper further proposes multimodal coupling prompt learning CLIP (MCPL-CLIP) to enhance CLIP’s performance in image classification tasks. MCPL-CLIP is evaluated on 15 image classification datasets, showing performance improvements in cross-dataset transfer, domain generalization, and base-to-novel class generalization scenarios compared to multiple baseline models such as CLIP, CoOp, CoCoOp, and MaPLe. For instance, MCPL-CLIP leads baselines by margins of 0.93% to 3.35% in cross-dataset transfer, 0.78% to 3.88% in domain generalization, and 0.70% to 7.55% in base-to-novel class generalization. This indicates that MCPL-CLIP’s paradigm of using multimodal interactive information to construct prompts is more effective for image classification tasks and their generalization scenarios than using single-modal features.
VLP Models
Given the success of pretrained models in the fields of CV and NLP, numerous studies have attempted to pretrain large-scale models on the joint modality of vision and language. These pretrained models are known as VLP models. The rise of VLP paradigms began with the transfer of BERT (Devlin et al., 2018) to cross-modal representation learning, followed by a series of studies (Lu et al., 2019) that introduced BERT into multimodal pretraining. Recently, multimodal pretraining paradigms based on the encoder–decoder framework have gained attention, with many encoder–decoder models achieving state-of-the-art performance in cross-modal understanding and generation tasks (Bao et al., 2022).
Another research trend in the VLP field is contrastive learning. The most typical contrastive learning-based VLP model is CLIP, which uses vision transformer (ViT) (Dosovitskiy et al., 2020) or ResNet (He et al., 2016) as the image encoder, transformer (Vaswani et al., 2017) as the text encoder, and a contrastive loss objective to jointly train the two encoders. CLIP’s pretraining dataset is extensive, comprising approximately 400 million image–text pairs. Following CLIP, a series of studies have demonstrated that successful pretraining with contrastive learning methods on large-scale data is not accidental (Jia et al., 2021).
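To make the contrastive objective concrete, the following is a minimal sketch of the symmetric image–text contrastive loss used by CLIP-style models; the batch construction and the fixed temperature value here are illustrative assumptions rather than CLIP’s exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss over a batch of matched image-text pairs.

    image_emb, text_emb: (batch, dim) embeddings from the two encoders;
    the pair at index i in each tensor is a positive match.
    """
    # L2-normalize so the dot product becomes a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```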
Prompt Learning
The motivation behind prompt learning is to leverage the prior knowledge learned by large-scale pretrained models to perform various downstream tasks. Prompt learning can be summarized as “pretraining, prompting, predicting”: downstream tasks are reorganized into forms similar to the pretraining tasks, thereby better guiding models to utilize pretrained knowledge to complete specific downstream tasks.
Schick and Schütze (2020) proposed PET (Pattern-Exploiting Training), transforming the input of text classification tasks into cloze questions to convert the reasoning process into a text generation task, fully utilizing the text generation capabilities of language models. Petroni et al. (2019) proposed LAMA (LAnguage Model Analysis), modifying the relation extraction task into cloze questions without altering the pretrained language model, achieving better relation extraction performance than knowledge bases. Compared to fine-tuning, prompt learning freezes most of the pretrained model’s parameters, adjusting the model with only a small number of parameters (e.g., about 1%). Recent advancements indicate that prompt learning can help pretrained models achieve performance comparable to fine-tuning across different NLP downstream tasks, including natural language understanding and generation (Liu et al., 2021).
Prompt Learning in VLP Models
Recent studies in the vision-language learning field have also demonstrated the effectiveness of prompt learning. Khattak et al. (2023) showed that visual prompt learning can surpass fine-tuning on a series of tasks, with significant advantages in training efficiency. In cross-modal representation learning, prompt learning for CLIP has become a major focus. CLIP is a contrastive learning-based multimodal pretraining model: by using handcrafted prompt templates to convert labels into text descriptions, it achieves remarkable performance in zero-shot image classification tasks. To further enhance performance, the CLIP authors also proposed prompt ensembling with multiple manually created templates. Because crafting fixed prompts is labor-intensive, later work adopted continuous prompts or integrated adapters into CLIP. In addition to CLIP, another line of research explores the use of visual prompts with pretrained language models for multimodal representation learning. These studies demonstrate that even when large-scale pretrained language models are frozen during downstream transfer, they can still adapt effectively to few-shot learning scenarios in multimodal tasks. Similarly, combining adapters with CLIP has been shown to yield promising performance. CoOp (Zhou et al., 2022b) enhances CLIP for few-shot transfer by fine-tuning a continuous set of prompt vectors within its language branch. CoCoOp (Zhou et al., 2022a) addresses the generalization challenges of CoOp, particularly its underperformance on novel classes, by explicitly conditioning the prompts on specific image instances, thus improving adaptability to previously unseen data. MaPLe (Khattak et al., 2023) adds learnable context tokens in the language branch and conditions vision prompts on language prompts through a coupling function to enable interaction. Inspired by MaPLe, we add self-attention to the text encoder and project features to the visual encoder, enriching the text representation and guiding visual learning; cross-modal interaction between text and visual outputs then generates prompt features that guide contrastive learning.
Methodology
In this section, we first analyze the feature extraction bias of CLIP’s visual encoder. Then we propose a two-stage prompt method: THIPL method and VAIPL method. Finally, based on THIPL and VAIPL, we propose MCPL-CLIP to enhance CLIP’s performance in image classification tasks.
Problem Description and Analysis for CLIP’s Feature Extraction Bias
Based on RA-CLIP, we believe that CLIP’s visual encoder also suffers from feature extraction bias between the pretraining and fine-tuning stages in image classification tasks. The primary cause is the uneven quantity and distribution of image features learned during the pretraining stage. This can be further summarized as an inherent bias in feature extraction for differently distributed samples during the pretraining phase.
For example, the VLP models may predominantly learn features of common flowers such as “roses,” “carnations,” and “tulips” during the pretraining stage. These common flowers form the VLP models’ prior knowledge and recognition capability for the concept of “flowers.” However, when the VLP models are deployed on a downstream image classification dataset that includes a wide variety of flower species, including many rare ones that the VLP models did not encounter during the pretraining stage, such as “ghost orchids” and “blue cat’s face orchids,” they struggle to recognize and correctly classify these rare flower species. This visual feature recognition bias due to the uneven image distribution between the pretraining and fine-tuning stages is particularly detrimental to fine-grained image classification datasets such as Flowers102 (flower species dataset), StanfordCars (vehicle dataset), and FGVC Aircraft (aircraft dataset).
In summary, the model tends to more effectively recognize categories frequently appearing in the pretraining dataset, whereas its feature extraction and recognition abilities significantly diminish for less frequent categories due to insufficient sample support. This learning imbalance directly reflects the model’s prior knowledge bias, where the knowledge base formed during the pretraining stage is not comprehensive but rather overly optimized for certain categories and lacks necessary adaptability for others. When these pretrained models are deployed on specific downstream image classification datasets, the aforementioned feature extraction bias further impacts the model’s recognition accuracy on those datasets. This impact is typically caused by distribution differences between datasets, that is, the visual feature space distance between the pretraining dataset and the downstream task dataset. To narrow this distance, it is essential for the model to capture a broader and more balanced range of visual features during the pretraining stage, thereby enhancing its adaptability to various categories and scenarios.
Multimodal Coupling Prompt Learning CLIP (MCPL-CLIP)
THIPL Method
As shown in Figure 1, this section proposes the THIPL method, which optimizes both the visual and text encoders of CLIP by creating learnable prompt embeddings in the text modality and injecting them into the corresponding prompt embeddings of the visual modality. The specific implementation of THIPL involves adding learnable prompt vectors to several layers of the text encoder’s transformer structure, which are then introduced into the corresponding layers of the visual encoder’s transformer structure through a bottleneck mapping structure. This method retains the prior knowledge learned by the visual encoder during the pretraining stage and allows it to more easily capture the classification features of new image samples during the generalization process.
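To make this concrete, the sketch below shows one plausible PyTorch realization of the bottleneck prompt mapping structure (PMS) and the layer-wise prompt pairs it produces; the module names, embedding widths (512 for the text branch, 768 for ViT-B/16), prompt count, and injection depth are illustrative assumptions, not the exact configuration used in this paper.

```python
import torch
import torch.nn as nn

class PromptMappingStructure(nn.Module):
    """Bottleneck MLP that maps a text prompt vector into a visual prompt vector."""
    def __init__(self, text_dim: int = 512, vision_dim: int = 768, hidden_dim: int = 128):
        super().__init__()
        self.mapping = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),    # compress to the bottleneck width
            nn.ReLU(),
            nn.Linear(hidden_dim, vision_dim),  # expand to the visual encoder width
        )

    def forward(self, text_prompts: torch.Tensor) -> torch.Tensor:
        # (num_prompts, text_dim) -> (num_prompts, vision_dim)
        return self.mapping(text_prompts)

class THIPLPrompts(nn.Module):
    """Learnable text prompt vectors for the first `depth` transformer layers,
    each paired with a PMS that yields the visual prompts injected into the
    corresponding layer of the visual encoder."""
    def __init__(self, depth: int = 9, num_prompts: int = 2,
                 text_dim: int = 512, vision_dim: int = 768):
        super().__init__()
        self.text_prompts = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(num_prompts, text_dim)) for _ in range(depth)]
        )
        self.pms = nn.ModuleList(
            [PromptMappingStructure(text_dim, vision_dim) for _ in range(depth)]
        )

    def forward(self, layer: int):
        text_p = self.text_prompts[layer]   # prepended to the text tokens at this layer
        visual_p = self.pms[layer](text_p)  # injected into the matching visual layer
        return text_p, visual_p
```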
Essentially, THIPL is based on the principle of complementarity in cross-modal learning, achieving deeper semantic alignment between images and text to better understand and express the intrinsic features of image data. The THIPL method includes the following components:

Structure of text-semantic hierarchical injection prompt learning (THIPL), where “add” denotes element-wise addition.
Let
Thus, the input embedding features in CLIP’s text encoder
We introduce this structure into the first
The features transmitted in the rear
Based on the definition of PMS, visual prompt vectors
Assuming CLIP’s visual encoder is represented by
Similar to the text encoder
The features transmitted in the rear
Optimizing and improving text descriptions in the text encoder is a crucial research direction for CLIP models, as it significantly impacts the model’s generalization performance. In the past, during generalization from source to target datasets, it was often difficult to capture concrete visual features that express key semantic information. This is because the text encoder cannot provide a text description that adapts to the new data distribution, making it difficult for the vision-language model to activate its pretrained knowledge circuits for vision-language alignment.
Building on THIPL, we further propose the VAIPL method, as shown in Figure 2. VAIPL method introduces information from the visual encoder under specific data distributions into the text encoder, using the visual information under the specific data distribution to construct instance-level enhanced text features, thereby reducing or compensating for the feature expression differences between modalities in cross-data distribution generalization scenarios.

Structure of visual-semantic attention interaction prompt learning (VAIPL). It uses the text embeddings as query vectors after max-pooling, and the visual encoder’s output features as key and value vectors. The first multiplication sign represents the product of text features (Q) and image features (K), while the second multiplication sign represents the product of the attention map (the product of Q and K) and image features (V).
The implementation of the VAIPL method is to apply cross-attention between instance-level image features from the visual encoder and text input embeddings and generate instance-level image-weighted text features through a nonlinear projection structure, referred to as semantic interaction prompt features. VAIPL helps the vision-language model learn patterns that identify and extract features beneficial for cross-modal alignment from specific data distributions. VAIPL ensures that semantic interaction prompt features can be integrated into text features in a language-compatible form, thereby enhancing the vision-language model’s modality alignment capability in generalization scenarios.
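A minimal single-layer sketch of this interaction is given below; the paper describes multiple transformer decoder layers, whereas this simplification uses one cross-attention block, and the dimensions, head count, and the linear map that aligns visual features to the text width are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VAIPL(nn.Module):
    """Cross-attention between the max-pooled text embedding (query) and the
    visual encoder's output features (key/value), followed by a nonlinear
    projection that yields an instance-level text prompt feature."""
    def __init__(self, text_dim: int = 512, vision_dim: int = 768,
                 num_heads: int = 8, hidden_dim: int = 128):
        super().__init__()
        self.vis_to_text = nn.Linear(vision_dim, text_dim)  # width alignment (assumption)
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.proj = nn.Sequential(                          # nonlinear projection structure
            nn.Linear(text_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, text_dim),
        )

    def forward(self, text_emb: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # text_emb:     (batch, seq_len, text_dim), token embeddings of the class prompt
        # visual_feats: (batch, num_patches, vision_dim), visual encoder output features
        query = text_emb.max(dim=1, keepdim=True).values   # max-pooling over text tokens
        kv = self.vis_to_text(visual_feats)
        attended, _ = self.cross_attn(query, kv, kv)        # image-weighted text feature
        prompt = self.proj(attended)                        # semantic interaction prompt feature
        return prompt.squeeze(1)                            # (batch, text_dim)
```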
Additionally, we use CLIP’s handcrafted template “a photo of a CLASS” to wrap the class labels of different datasets at the text embedding input layer. This method alleviates text feature mismatches and training instability issues during the domain generalization process.
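For illustration, nesting class labels into the handcrafted template amounts to the following; the class names here are hypothetical examples.

```python
# Hypothetical class names; in practice they come from each dataset's label set.
class_names = ["ghost orchid", "rose", "tulip"]

# CLIP's handcrafted template wraps every label in the same sentence form,
# keeping the text inputs close to the distribution CLIP saw during pretraining.
text_inputs = [f"a photo of a {name}." for name in class_names]
# -> ["a photo of a ghost orchid.", "a photo of a rose.", "a photo of a tulip."]
```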
VAIPL consists of multiple transformer decoder layers, a max-pooling function, and a nonlinear projection structure. Assuming
VAIPL can be defined by equation (13):
Assuming
We propose a two-stage MCPL-CLIP model to enhance CLIP’s performance in image classification tasks. The MCPL-CLIP structure, shown in Figure 3, includes two stages: the THIPL method and the VAIPL method. The THIPL method introduces learnable text prompt vectors into the text encoder’s various transformer layers and constructs a PMS to inject text prompt features into the visual encoder’s corresponding transformer layers, leveraging the semantic information of the text encoder to build visual prompt features. On this basis, the VAIPL method further integrates text embeddings with the visual encoder output of the text-semantic hierarchy injection prompt framework through cross-attention, constructing text prompt features for each image.
MCPL-CLIP essentially embodies a bidirectional information flow prompt design strategy. It injects semantic features from the text encoder into the visual encoder to construct visual prompts and incorporates feedback from the visual encoder to generate text prompts, achieving bidirectional mapping and joint optimization of text and visual prompts. This ultimately improves MCPL-CLIP’s classification accuracy and robustness across various fine-grained image classification scenarios, such as cross-dataset transfer, domain generalization, and base-to-novel class scenarios.
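Whatever prompt construction is used, classification in CLIP-style models ultimately reduces to comparing the prompted image feature with the prompted per-class text features; the sketch below shows this final step under the assumption that both feature sets have already been computed, with the logit scale chosen only for illustration.

```python
import torch
import torch.nn.functional as F

def classification_logits(image_feat: torch.Tensor,
                          class_text_feats: torch.Tensor,
                          logit_scale: float = 100.0) -> torch.Tensor:
    """Cosine-similarity classification head shared by CLIP-style models.

    image_feat:       (batch, dim)        output of the prompted visual encoder
    class_text_feats: (num_classes, dim)  output of the prompted text encoder
    Returns (batch, num_classes) classification logits.
    """
    image_feat = F.normalize(image_feat, dim=-1)
    class_text_feats = F.normalize(class_text_feats, dim=-1)
    return logit_scale * image_feat @ class_text_feats.t()
```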

Structure of MCPL-CLIP. MCPL-CLIP includes THIPL with text prompt vectors, visual prompt vectors, PMSs, self-attention layers, and VAIPL with transformer decoder structures, max-pooling functions, and nonlinear projection structures. Point-wise adding represents the word embedding method, and Patch Embed represents the image patch embedding method.
MCPL-CLIP comprises the following parts:
Baselines and Datasets
The baselines include CLIP, CoOp, CoCoOp, and MaPLe. CLIP uses manually designed templates as text prompts; CoOp optimizes only learnable context vectors as text prompt features; CoCoOp uses a lightweight neural network to learn an instance-level image input token and adds it to the text encoder’s prompt embeddings; MaPLe introduces learnable context vectors in the text encoder and maps them to the visual encoder.
We use a total of 15 image classification datasets (11 image classification datasets and four ImageNet variant datasets) for evaluation. For base-to-new class generalization and cross-dataset transfer test scenarios, we evaluate the proposed method on 11 image classification datasets. These datasets cover a wide range of recognition tasks, including two general object datasets, ImageNet and Caltech101; five fine-grained datasets, Oxford Pets, Stanford Cars, Flowers102, Food101, and FGVC Aircraft; scene recognition dataset SUN397; action recognition dataset UCF101; texture dataset DTD; and satellite image dataset EuroSAT.
For the cross-dataset transfer experiment and domain generalization experiment, MCPL-CLIP is trained on the ImageNet dataset and tested on 10 image classification datasets and four ImageNet variant datasets. Detailed dataset statistics are shown in Table 1.
Datasets Statistics.
For the visual encoder, we use ViT-B/16 as the backbone structure, and for the text encoder, we use the original CLIP text encoder. During training, only the training parameters of
All experiments use a few-shot training strategy, randomly selecting 16 samples per category. The prompt embedding depth
Both MCPL-CLIP and the benchmark models are trained on a single NVIDIA 3090 GPU for two epochs. The training process is optimized using the stochastic gradient descent optimizer with a learning rate of 0.0035. Training and evaluation are repeated three times, and the average result is reported as the final result.
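As a rough sketch of this protocol, the few-shot subsampling and the optimizer configuration could look as follows; the data representation and the `prompt_parameters()` helper are hypothetical stand-ins, and only the learning rate and shot count are taken from the setup above.

```python
import random
from collections import defaultdict

def sample_few_shot(samples, shots: int = 16, seed: int = 0):
    """Keep `shots` randomly chosen examples per class.
    `samples` is a list of (image_path, label) pairs, a simplified stand-in
    for the real dataset loaders."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append((path, label))
    subset = []
    for items in by_class.values():
        rng.shuffle(items)
        subset.extend(items[:shots])
    return subset

# Only the prompt-related parameters are trained; the CLIP backbone stays frozen.
# `model.prompt_parameters()` is a hypothetical helper exposing those parameters.
# optimizer = torch.optim.SGD(model.prompt_parameters(), lr=0.0035)
```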
Cross-Dataset Transfer Experiment
This experiment focuses on verifying the image classification performance of the model on new data distributions different from the training data distribution. This section evaluates the robustness of the proposed method on out-of-distribution datasets. The emphasis is on the model’s ability to transfer from one dataset to another, highlighting the model’s adaptability to different data sources and potential data variations. Consistent with CoCoOp (Zhou et al., 2022a), the proposed MCPL-CLIP is trained in a few-shot manner on all 1000 classes of the ImageNet dataset.
Table 2(a) shows the experimental results of MCPL-CLIP transferring to 10 datasets. It can be seen that MCPL-CLIP outperforms other benchmark models on multiple datasets. Overall, MCPL-CLIP achieves the highest average accuracy of 67.23%, surpassing other baseline models.
Comparison of MCPL-CLIP and Baselines in Cross-Data Transfer Setting.
Note. MCPL-CLIP = multimodal coupling prompt learning contrastive language-image pretraining.
MCPL-CLIP can capture image-region-to-text correlation features that adapt to new datasets during the transfer to new datasets, which is crucial for handling complex image classification tasks. For example, in fine-grained image classification datasets such as StanfordCars and Flowers102, where the model needs to understand and distinguish highly similar categories, MCPL-CLIP effectively utilizes its cross-modal prompts to capture these nuances. Additionally, VAIPL further enhances MCPL-CLIP’s cross-modal alignment capability. By applying cross-attention between instance-level image features from the visual encoder and text input embeddings, and generating weighted text features through nonlinear projection, VAIPL allows MCPL-CLIP to identify and extract features beneficial for cross-modal alignment when learning from specific data distributions. This might explain MCPL-CLIP’s excellent performance on datasets such as EuroSAT (satellite image dataset) and UCF101 (action recognition dataset), as these datasets require the model to identify complex, task-relevant patterns from visual features.
However, in some datasets such as Aircraft, although MCPL-CLIP’s performance has improved, the improvement is relatively small. This might be due to the high similarity in visual features of aircraft in the Aircraft dataset, posing a greater challenge to the model’s discrimination ability. While MCPL-CLIP helps the model identify features beneficial for cross-modal alignment, distinguishing such highly similar visual features may require more refined model adjustments or specially designed feature extraction strategies. The Caltech 101 dataset contains 101 different object categories, with each category having several dozen images. The images exhibit variations in background and shooting angles. CoCoOp may be better at extracting stable features from images of objects with multiple categories and angles, which gives it an advantage on datasets such as Caltech 101. In contrast, MCPL-CLIP might not achieve optimal performance when handling these features.
The design concept of MCPL-CLIP is to inject textual features into the visual encoder and enhance the text prompts through a feedback mechanism. While this may help improve performance on specific datasets, this feedback mechanism could introduce additional noise in cross-dataset transfer tasks, thereby affecting generalization performance. Moreover, the complex prompt coupling in the MCPL-CLIP model may not be as stable and effective across different datasets as MaPLe and CoCoOp, as the complex feedback mechanism may be inefficient when handling domain shifts.
Domain Generalization Experiment
This experiment verifies the model’s robustness under domain shift, that is, its ability to generalize from the training distribution to variant renditions of the same classes, highlighting its adaptability to different data sources and potential data variations. The performance of MCPL-CLIP is tested on four ImageNet variant datasets.
Table 3 shows the experimental results of MCPL-CLIP in the domain generalization evaluation scenario. MCPL-CLIP performs comparably to other benchmark models on the ImageNet training set, showing good baseline performance. Overall, MCPL-CLIP achieves the best performance with an average accuracy of 61.06% across the four domain generalization datasets.
Comparison of MCPL-CLIP and Baselines in Domain Generalization Settings.
Note. MCPL-CLIP = multimodal coupling prompt learning contrastive language-image pretraining.
On the ImageNetV2 dataset, which has a slight distribution shift, MCPL-CLIP achieves the highest accuracy of 65.18% and a Macro F1 score of 0.689, indicating that it can better maintain its performance when dealing with slight distribution changes. ImageNet-S (Sketch) contains sketch images with styles significantly different from the original ImageNet images. Although MCPL-CLIP performs slightly lower than MaPLe on this dataset, it still outperforms other models, demonstrating its robustness to significant style differences in data.
ImageNet-A (Adversarial) contains difficult-to-classify image samples from natural images. MCPL-CLIP achieves the best accuracy of 51.97% and a Macro F1 score of 0.526, showing its strong adaptability and generalization ability to challenging samples. ImageNet-R (Rendition) includes artistic renditions of various images. The experimental results of MCPL-CLIP again lead, further proving the effectiveness of its cross-modal prompt and interaction mechanism in handling highly diverse image style data samples.
From the provided experimental results, it can be seen that MCPL-CLIP shows varying performance in domain generalization tasks compared to other baseline models. MCPL-CLIP’s multimodal coupling mechanism may cause feature combinations from the original data to lose effectiveness on stylized data. For stylized images, simpler feature extraction (e.g., lines and shapes) may be more effective than complex bidirectional information flow. This is why MCPL-CLIP underperforms MaPLe on the ImageNet-S dataset in domain generalization tasks, where the model needs strong feature abstraction and recognition despite domain shifts and reduced details. The complexity of MCPL-CLIP may not always be beneficial in such cases, suggesting that optimizations and adaptations are needed for stylized datasets to improve their domain generalization and robustness.
Base-to-Novel Class Generalization Experiment
Table 4 shows the experimental results of MCPL-CLIP in base-to-novel class generalization within the same data distribution. It can be seen that MCPL-CLIP maintains good generalization ability for unseen classes while ensuring the accuracy of base classes, with the harmonic mean (HM) score outperforming other benchmark models on most datasets. This indicates that MCPL-CLIP can capture and utilize cross-modal features to construct more effective and enriched cross-modal prompt features, providing the model with the ability to understand and adapt to unseen classes.
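For reference, the harmonic mean (HM) reported in this setting is the standard combination of base-class and novel-class accuracy:

$$\mathrm{HM} = \frac{2 \cdot \mathrm{Acc}_{\mathrm{base}} \cdot \mathrm{Acc}_{\mathrm{novel}}}{\mathrm{Acc}_{\mathrm{base}} + \mathrm{Acc}_{\mathrm{novel}}}$$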
Comparison of MCPL-CLIP and Baselines in Base-to-new Generalization Settings.
On general image classification datasets such as ImageNet and Caltech101, MCPL-CLIP achieves the best results in base class accuracy, new class accuracy, and HM. This proves that MCPL-CLIP can effectively leverage the knowledge learned during the pretraining stage and apply it to new, unseen classes through vision-language coupled interaction prompt learning. On fine-grained datasets such as StanfordCars and Flowers102, MCPL-CLIP’s performance still leads other benchmark models, indicating that MCPL-CLIP is more advantageous in handling data samples with subtle differences in categories. On other datasets such as satellite images (EuroSAT), scene understanding (SUN397), and action recognition (UCF101), MCPL-CLIP also generally leads, demonstrating its robust visual-language alignment and generalization capabilities.
MCPL-CLIP couples text and visual features through bidirectional information flow, a complex mechanism that may not effectively generalize the fine-grained features required when handling new classes, especially in tasks that demand capturing subtle differences. In tasks that require fine-grained feature extraction and recognition (such as Stanford Cars), MCPL-CLIP may not be as effective as some models specifically optimized for such tasks, such as CoOp. The bidirectional information flow and complex prompt design may introduce interference or noise in certain cases, leading to insufficient extraction and generalization of features for new classes.
Prompt Mapping Structure (PMS)
To verify the importance and effectiveness of the PMS, we conducted an ablation study by removing PMS and directly adding the text prompt vectors to the visual encoder (i.e., the text prompt vectors are the same as the visual prompt vectors). Other training parameters and model structures were kept unchanged. MCPL-CLIP was trained on ImageNet and tested on the cross-dataset transfer scenarios of StanfordCars, OxfordPets, and EuroSAT, as well as the domain generalization scenarios of ImageNet-A and ImageNet-R. The specific experimental results are shown in Figures 4 and 5.

Ablation experiments of prompt mapping structure (PMS) on Stanford Cars, Oxford Pets, and EuroSAT.

Ablation experiments of prompt mapping structure (PMS) on ImageNet-A and ImageNet-R.
The results show that the PMS plays a crucial role in extracting higher-level abstract semantic features. Removing the PMS causes the vision-language model to lose the ability to transform text prompt vectors into visual-compatible prompt vectors, leading to a collapse of the multimodal feature representation in the visual encoder.
Visual-Semantic Attention Interactive Prompt Learning (VAIPL)
To verify the role of the VAIPL in image classification tasks, we proposed a cross-modal adding-pooling prompting (CMPP) method for comparison experiments. CMPP averages the input embedding features of the text description and directly adds them point-wise to the output features of the visual encoder to construct prompts.
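A minimal sketch of this comparison module is given below; the linear layer that aligns the text width with the visual width is an assumption added so that the point-wise addition is well defined.

```python
import torch
import torch.nn as nn

class CMPP(nn.Module):
    """Cross-modal adding-pooling prompting: average the text input embeddings
    and add the pooled vector point-wise to the visual encoder's output
    features. Used here only as an ablation baseline for VAIPL."""
    def __init__(self, text_dim: int = 512, vision_dim: int = 768):
        super().__init__()
        self.align = nn.Linear(text_dim, vision_dim)  # width alignment (assumption)

    def forward(self, text_emb: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # text_emb:     (batch, seq_len, text_dim)
        # visual_feats: (batch, num_patches, vision_dim)
        pooled = text_emb.mean(dim=1)                          # average over text tokens
        return visual_feats + self.align(pooled).unsqueeze(1)  # point-wise addition
```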
In the ablation study, VAIPL was replaced with CMPP, and MCPL-CLIP was trained on the source dataset ImageNet and tested on the cross-dataset transfer scenarios of Flowers102, Food101, and SUN397, as well as the domain generalization scenarios of ImageNetV2 and ImageNet-R. The specific experimental results are shown in Figures 6 and 7.

Ablation study of VAIPL and CMPP on Flowers102, Food101, and SUN397.

Ablation study of VAIPL and CMPP on ImageNetV2 and ImageNet-R.
Figures 6 and 7 show that in cross-dataset transfer and domain generalization evaluations, VAIPL achieves higher accuracy on all three datasets compared to the CMPP module. This indicates that VAIPL, by introducing cross-attention and projection mechanisms, not only achieves deeper semantic alignment between visual and language modalities but also captures and integrates semantic information at different levels. This is crucial for improving CLIP’s cross-dataset transfer ability, domain generalization ability, and alignment capability.
In contrast, CMPP, although using a more direct feature fusion method, may be less effective in handling more complex semantic relationships because it lacks deep interaction and nonlinear feature fusion capabilities. Therefore, we believe that compared to simpler prompt construction methods, VAIPL can provide more enriched and multilayered semantic alignment prompts for vision-language models.
Analysis of PMS Scale
This experiment explores whether different scales of the PMS affect MCPL-CLIP’s performance in cross-dataset generalization and domain generalization scenarios. Three different sizes of PMS, namely PMS-Standard, PMS-Large, and PMS-XLarge, were designed and trained on the source dataset ImageNet and tested on the cross-dataset transfer scenarios of Caltech101, Food101, and UCF101, and the domain generalization scenarios of ImageNetV2 and ImageNet-R.
PMS-Standard is the standard parameter scale used in this section, which is a two-layer bottleneck MLP structure (linear-ReLU-linear) with a hidden layer of 128 dimensions; PMS-Large is a four-layer bottleneck MLP structure (linear-ReLU-linear) with a hidden layer of 128 dimensions; PMS-XLarge is a four-layer bottleneck MLP structure (linear-ReLU-linear) with a hidden layer of 256 dimensions. The experimental results are shown in Figures 8 and 9:

Experiments of different scales of prompt mapping structure (PMS) on StanfordCars, OxfordPets, and EuroSAT datasets.

Experiments of different scales of prompt mapping structure (PMS) on ImageNet-A and ImageNet-R datasets.
Figures 8 and 9 show that larger-scale PMS does provide a certain degree of generalization performance improvement, indicating that larger hidden layer dimensions may offer more parameters to capture complex features, thereby enhancing the model’s ability to recognize samples in new data distributions.
However, it is important to note that excessively increasing the size of PMS may lead to overfitting the source training dataset and significant computational resource consumption. Meanwhile, on the Food101 dataset, the performance difference between PMS-XLarge and PMS-Standard is minimal, indicating that for this type of dataset, a larger-scale PMS does not bring significant benefits.
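For concreteness, the three PMS variants compared above could be instantiated as follows; the interpretation of “layers” as the number of linear layers with ReLU activations between them, and the input/output widths, are assumptions for illustration.

```python
import torch.nn as nn

def build_pms(scale: str = "standard", text_dim: int = 512, vision_dim: int = 768) -> nn.Sequential:
    """Build one of the PMS variants used in the scale analysis."""
    if scale == "standard":     # two linear layers, 128-dimensional hidden layer
        dims = [text_dim, 128, vision_dim]
    elif scale == "large":      # four linear layers, 128-dimensional hidden layers
        dims = [text_dim, 128, 128, 128, vision_dim]
    elif scale == "xlarge":     # four linear layers, 256-dimensional hidden layers
        dims = [text_dim, 256, 256, 256, vision_dim]
    else:
        raise ValueError(f"unknown PMS scale: {scale}")

    layers = []
    for i in range(len(dims) - 1):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:   # ReLU between consecutive linear layers
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)
```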
Analysis of VAIPL Scale
To explore whether the scale of the VAIPL affects the image classification performance of MCPL-CLIP, we proposed three different scales, namely VAIPL-Small, VAIPL-Standard, and VAIPL-Large, which were trained on the source dataset ImageNet and evaluated on the cross-dataset transfer scenarios of Caltech101, StanfordCars, and OxfordPets, as well as the domain generalization scenarios of ImageNetV2 and ImageNet-S. The detailed structure settings of the different scales of VAIPL are shown in Table 5:
Detailed Structure Settings for Different Prompt Embedding Depths.
Analysis of Prompt Embedding Depth
Ablation experiments were conducted on the prompt embedding depth in both the text encoder and the visual encoder to explore whether different prompt embedding depths affect the classification performance of MCPL-CLIP. The experiments were trained on ImageNet and tested on the cross-dataset transfer scenarios of DTD, Food101, and EuroSAT, as well as the domain generalization scenarios of ImageNet-A and ImageNet-S.
Tables 6 and 7 indicate that as the prompt embedding depth increases, the generalization performance of MCPL-CLIP also improves, but the improvement is not significant. This suggests that deeper prompt embedding can indeed bring performance improvements, possibly because deeper embeddings provide more complex feature representations, helping the model better understand and adapt to new domain data. However, the marginal performance improvement indicates that beyond a certain depth, additional complexity may not bring significant performance gains. In practical applications, this means finding a balance between performance and computational efficiency. If additional depth does not yield noticeable performance improvements, simpler structures may be the optimal choice as they achieve similar performance at lower computational costs.
Experiments of MCPL-CLIP with Different Prompt Embedding Depths on Food101 and EuroSAT Datasets.
Note. MCPL-CLIP = multimodal coupling prompt learning contrastive language-image pretraining.
Experiments of MCPL-CLIP with Different Prompt Embedding Depths on ImageNet-A and ImageNet-S Datasets.
Therefore, we believe that while increasing prompt embedding depth positively impacts the model’s generalization performance, considering the diminishing marginal benefits and the need to maintain computational efficiency, choosing a “standard” depth of prompt embedding is a reasonable solution.
Analysis of Training Epochs
We conducted an analysis experiment on training epochs to explore whether different training epochs affect MCPL-CLIP’s generalization performance and to investigate whether MCPL-CLIP faces overfitting or underfitting during the ImageNet dataset training process. The training epochs of MCPL-CLIP were set to 1, 2, 3, and 4, respectively, and the training dataset was ImageNet. The generalization scenarios were evaluated on the EuroSAT and ImageNet-R datasets. The experimental results of MaPLe under the same training settings were introduced for comparison. The specific experimental results are shown in Figures 10 to 12.

Experiments of multimodal coupling prompt learning contrastive language-image pretraining (MCPL-CLIP) and MaPLe for different training epochs on ImageNet.

Experiments of multimodal coupling prompt learning contrastive language-image pretraining (MCPL-CLIP) and MaPLe for different training epochs on EuroSAT.

Experiments of multimodal coupling prompt learning contrastive language-image pretraining (MCPL-CLIP) and MaPLe for different training epochs on the ImageNet-R dataset.
Figure 10 shows that during the ImageNet training phase, the accuracy of MCPL-CLIP gradually improves with the increase in training epochs, from 69.23% in one epoch to 72.29% in four epochs. This indicates that MCPL-CLIP can better learn the features of the ImageNet dataset as training progresses. The accuracy of MaPLe also increases with the number of epochs but shows a slight decline in the fourth epoch, from 70.91% to 70.84%, indicating a slight overfitting phenomenon.
Figures 11 and 12 show that during the generalization to EuroSAT/ImageNet-R, MCPL-CLIP improves from 47.23%/76.53% in one epoch to 50.02%/78.30% in four epochs. This shows that MCPL-CLIP does not exhibit significant overfitting and can improve its adaptation and generalization ability to new datasets with more training epochs. In contrast, the performance of MaPLe tends to stabilize and even regress to some extent in four epochs. This may indicate that MaPLe’s fewer training parameters lead to saturation and reaching its performance bottleneck earlier in the training phase.
In summary, MCPL-CLIP can more effectively benefit from the increased training epochs, showing stronger generalization ability and sustained learning capability.
Computational Cost and Resource Overhead Analysis
This section conducts resource consumption analysis experiments on three different computing platforms. The experiments were conducted on NVIDIA 3090, NVIDIA 3080 Ti, and NVIDIA Tesla T4 GPU platforms, and all 14 target datasets were tested. The reported metrics include test duration, GPU memory consumption, and main memory consumption. Due to the high memory usage during the training phase, MCPL-CLIP was trained only on the NVIDIA 3090 GPU platform, and the training duration on the source dataset ImageNet is additionally reported. The detailed resource consumption on each computing platform is shown in Table 8.
Computational Cost and Resource Overhead Analysis of MCPL-CLIP.
Note. MCPL-CLIP = multimodal coupling prompt learning contrastive language-image pretraining; GPU = graphics processing unit.
Conclusion
The main research work of this paper includes:
We identify feature extraction bias and domain shift issues in CLIP’s visual encoder between the pretraining and fine-tuning phases. To address these, we propose a two-stage MCPL framework that synergistically constructs and optimizes text-visual prompts through cross-modal interactions. The first stage employs text-semantic hierarchical injection to embed learnable prompts from the text encoder into the visual transformer layers. The second stage enhances cross-modal alignment via visual-semantic attention interaction, generating instance-aware text prompts. Based on these two stages, this paper proposes MCPL-CLIP to improve CLIP’s performance in image classification and generalization tasks. Experiments demonstrate that MCPL-CLIP achieves varying degrees of improvement over multiple baseline models such as CLIP, CoOp, CoCoOp, and MaPLe in cross-dataset transfer, domain generalization, and base-to-novel class generalization settings. For example, MCPL-CLIP leads by 0.93% to 3.35% in cross-dataset transfer scenarios, 0.78% to 3.88% in domain generalization scenarios, and 0.70% to 7.55% in base-to-novel class generalization scenarios. This indicates that, compared to constructing prompts from single-modal features, MCPL-CLIP’s paradigm of constructing prompts from multimodal interactive information is more effective for image classification tasks and their generalization scenarios.
Through systematic analysis of experimental results, we identify the following key limitations of the MCPL-CLIP approach: (1) performance improvement remains limited on datasets with highly similar visual features (e.g., Aircraft) due to insufficient sensitivity to fine-grained differences; (2) cross-modal feedback mechanisms may introduce noise interference in complex domain transfer tasks, leading to suboptimal performance on datasets with multi-perspective variations such as Caltech101; (3) the multimodal coupling structure shows limited adaptability to stylized data (e.g., ImageNet-S sketches), where bidirectional information flow becomes less efficient in simplified feature extraction scenarios; (4) computational cost and resource consumption are higher than baseline models, restricting deployment in resource-constrained environments; and (5) complex prompt designs may induce feature interference in fine-grained classification tasks (e.g., Stanford Cars), compromising novel class generalization. To address these limitations, future work will focus on:
(1) Fine-grained feature enhancement: developing dynamic attention-focusing mechanisms with differentiable feature masks to strengthen learning of local discriminative regions, particularly in aerospace recognition scenarios requiring subclass differentiation. (2) Robust cross-modal interaction: designing noise-suppressed hierarchical gating structures integrated with self-supervised contrastive learning to enhance feature stability in cross-dataset transfer. (3) Adaptive architecture optimization: constructing configurable modal coupling units that automatically switch feature extraction paradigms based on input data style complexity (e.g., sketches vs. natural images). (4) Resource-efficient training: exploring parameter-shared lightweight prompt mapping networks combined with gradient accumulation strategies to reduce training memory requirements below 18 GB. (5) Interference-aware learning: introducing feature purity assessment modules to implement selective feature fusion in the bidirectional information flow, improving generalization accuracy in fine-grained classification tasks.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
