Sage Journals: Discover world-class research

Abstract

Fine-grained image retrieval (FGIR) is a challenging task that demands precise representation of subtle inter-class variations and intra-class similarities. Traditional methods, heavily reliant on labeled data, face scalability and cost-efficiency challenges, limiting their applicability in real-world scenarios. To address these limitations, this study introduces a novel self-supervised pre-training framework tailored for FGIR. Central to our approach is the adaptive sample selector module, which dynamically selects training samples of varying difficulties, enhancing the learning of discriminative features without significantly increasing training costs. Additionally, we propose an integrated learning strategy that synergistically combines contrastive and generative learning, enabling the model to capture both intra-image contextual information and inter-image similarities. This approach significantly improves feature extraction, setting a new benchmark for FGIR tasks. Extensive experiments on multiple FGIR benchmarks demonstrate that our model achieves state-of-the-art performance, consistently surpassing existing methods. By reducing reliance on labeled data and advancing self-supervised learning for fine-grained image analysis, this framework offers a scalable and efficient solution for FGIR and beyond.

Keywords

fine-grained image retrieval pre-training method self-supervised learning adaptive sample selector

1. Introduction

Fine-grained image retrieval (FGIR) has become a significant research area within computer vision (CV), focused on retrieving images belonging to the same subclass within broader categories. Unlike general image retrieval, FGIR emphasizes capturing subtle visual differences essential for distinguishing between subclasses, marking a developmental trend in content-based image retrieval research. FGIR applications span several domains, including pedestrian reidentification (Han et al., 2025; Sun et al., 2024), fashion image retrieval (Islam et al., 2024; Vagena et al., 2025), remote sensing image retrieval (Zhou et al., 2024), and medical image retrieval (Shetty et al., 2024; Wang et al., 2024).

Early FGIR approaches relied mainly on hand-crafted features and traditional machine learning, which often struggled to differentiate visually similar objects at a fine-grained level. With deep learning’s rise, FGIR has seen notable advancements in retrieval accuracy, primarily through models pre-trained on large labeled datasets. Supervised learning models, such as convolutional neural network (CNN)-based architectures such as ResNet and VGG, demonstrate strong feature extraction capabilities but rely heavily on annotated data, creating a significant bottleneck. Obtaining fine-grained annotations is both time-intensive and costly. To address this, self-supervised learning (SSL) has emerged as an effective solution, enabling models to learn representations from unlabeled data by solving pretext tasks. Self-supervised models in CV generally fall into two categories: contrastive learning and generative learning. Contrastive learning models, such as MoCo (He et al., 2020) and SimCLR (Chen et al., 2020), learn representations by contrasting positive and negative samples (Figure 1(b)). This method effectively brings similar images closer in feature space while pushing dissimilar images apart. Generative learning models, such as MAE (He et al., 2022) and SimMIM (Xie et al., 2022), focus on reconstructing masked portions of images to learn representations (Figure 1(c)). Both paradigms are effective for exploiting large-scale unlabeled data and have proven beneficial for downstream tasks such as object recognition, semantic segmentation, and image understanding.

Figure 1.

Supervised Learning Versus Contrastive Learning Versus Generative Learning Versus Self-Supervised Learning. “ $\to$ ” Indicates Forward Propagation, While “ $⇢$ ” Presents Gradient Back-Propagation. (a) Supervised Learning; (b) Contrastive Learning; (c) Generative Learning; And (d) Ours.

While SSL has shown substantial potential, current pre-trained models face notable challenges when applied to FGIR tasks. First, in contrastive learning-based models, the choice of the level of difficulty of the training sample significantly impacts FGIR performance. Random pair selection can lead to overly simplistic training or, conversely, overfitting to complex samples. Some researchers employ strategies such as self-paced learning (Kumar et al., 2010) or hard negative mining (Schroff et al., 2015) to address this; however, these methods often increase computational complexity and training costs. Additionally, the application of self-paced learning in self-supervised pre-training remains in an early exploratory phase with limited findings. These strategies are generally suited for closed-set problems with pre-defined classes, making them less effective for the open-set nature of self-supervised tasks where only sample embeddings are available. Furthermore, contrastive learning-based models rely solely on image similarity to learn features, limiting their ability to capture contextual information critical for fine-grained detail in FGIR. In contrast, generative learning-based models focus on contextual information within image content but overlook inter-image similarity. Effective FGIR requires leveraging both intra-image contextual information and inter-image similarity. However, current self-supervised pre-training models lack effective integration of these two paradigms, which restricts their applicability for FGIR tasks.

To address these limitations, we propose a novel self-supervised pre-trained model tailored for FGIR. Our contributions are as follows: (1)

We introduce an adaptive sample selector (ASS) module that dynamically selects training sample pairs with varying difficulty levels based on the model’s learning progress. Unlike existing strategies, ASS efficiently selects samples of different difficulties without a significant increase in computational cost, thereby reducing the overall training cost while enhancing model performance. Its plug-and-play design allows easy integration into various image processing frameworks, making ASS a versatile tool for multiple tasks.

(2)

We propose a novel SSL strategy that integrates contrastive and generative learning through a masking mechanism, as illustrated in Figure 1(d). This approach allows the model to leverage both contextual information within images (from generative learning) and inter-image similarities (from contrastive learning), leading to significantly enhanced feature extraction for FGIR tasks.

(3)

Our model achieves new state-of-the-art results on challenging FGIR benchmarks, validating the effectiveness of our approach and demonstrating its superiority over existing FGIR methods.

The remainder of this paper is organized as follows: Section 2 provides a comprehensive review of related work on FGIR, self-supervised pre-trained models, and sample selection techniques in contrastive learning. Section 3 describes our proposed method, including the construction of the ASS module and the design of a SSL strategy tailored for FGIR. Section 4 presents experimental results and ablation studies, demonstrating the efficacy of our approach. Finally, Section 5 concludes the paper and explores potential directions for future research.

2. Related Work

In this section, we review the related work on FGIR, self-supervised pre-trained models, and sample selection techniques in contrastive learning, highlighting key developments and challenges that motivate our work.

2.1. Fine-Grained Image Retrieval (FGIR)

FGIR focuses on retrieving images within the same subclass of a specific category, a critical task in computer vision. Traditional FGIR methods relied on hand-crafted features such as HOG (Yu et al., 2013), color histograms (Han & Ma, 2002), and SIFT (Lowe, 2004), often paired with traditional machine learning methods such as support vector machines (Suthaharan, 2016). However, these techniques struggle to capture the subtle distinctions required for FGIR tasks.

The advent of deep learning marked a significant shift in FGIR. CNNs can extract features at different levels, significantly improving retrieval performance (Wei et al., 2017). Supervised learning models, such as VGGNet and ResNet, leverage large labeled datasets to develop robust feature representations (He et al., 2016; Simonyan & Zisserman, 2014). Building on these networks, some FGIR methods explore attention mechanisms (Li & Ma, 2023; Zeng et al., 2024) to focus on discriminative image regions, while others integrate CNNs with metric learning (Duan et al., 2022; Yang et al., 2023; Zhang et al., 2024) to develop similarity measures tailored to FGIR’s unique challenges. Despite their success, these models are constrained by the need for annotated data, a limitation particularly pronounced in fine-grained categories due to the high cost and effort of labeling.

To relieve the annotation burden, several recent works explore SSL for FGIR. LCR (Shu et al., 2023) pools gradient-weighted class activation mapping (Grad-CAM) foreground responses as a common rationale before contrastive training and boosts recall over MoCov2. DMCAC (Trivedy & Latecki, 2024) aligns query–gallery similarity distributions via a divergence loss, narrowing the gap to supervised baselines. A2-SSL Hu et al. (2024) leverages asymmetric view augmentation and part-oriented dense contrast to learn compact hashing codes, while CMD (Bi et al., 2025) distills patch-crop cues into whole-image tokens, improving the retrieval recall rate. All the above methods rely on heuristic or fixed strategies to form positive/negative pairs and do not consider masked reconstruction. Our ASS schedules pairs from easy to hard and is combined with a masking branch, resulting in consistently higher retrieval accuracy without additional labels.

2.2. Self-Supervised Pre-Trained Models

To address the limitations of supervised learning, SSL has emerged as a viable alternative. Self-supervised models learn useful representations from unlabeled data by solving pretext tasks, enabling them to generalize effectively to downstream tasks. These models, which achieve state-of-the-art performance across diverse downstream applications, primarily fall into two categories: generative learning and contrastive learning models. Generative models, such as CAE (Chen et al., 2024), MAE (He et al., 2022), SimMIM (Xie et al., 2022), and DeepMIM (Ren et al., 2025), adopt a masked-image-modeling (MIM) objective in which random patches are removed and subsequently reconstructed. Recent evidence shows that MIM can benefit fine-grained tasks: SemMAE employs semantic-guided masking and boosts fine-grained recognition accuracy (Li et al., 2022), while BirdSAT leverages cross-view masked autoencoding to enhance bird-species retrieval (Sastry et al., 2024). Although these models capture rich contextual cues, they generally overlook inter-image similarity relations that are critical for retrieval. Contrastive learning models such as SimCLR (Chen et al., 2020), MoCo (Chen et al., 2021), and DINO (Caron et al., 2021) address this by comparing positive and negative samples, bringing similar features closer in the feature space while pushing dissimilar ones apart. However, contrastive models struggle to capture the fine-grained contextual information essential for FGIR, and their performance is often impacted by sample selection strategies.

Recent efforts attempt to combine generative and contrastive learning to leverage the strengths of both approaches, such as the iBOT model (Zhou et al., 2021). Despite these efforts, there remains a significant gap in effectively integrating these learning paradigms to meet the specific demands of FGIR. Firstly, the contrastive learning component does not pay sufficient attention to the strategic selection of training samples, which can lead to high training costs and suboptimal feature extraction. Additionally, current methods fail to balance the extraction of contextual and discriminative features, essential for accurate FGIR.

2.3. Sample Selection Techniques in Contrastive Learning

In contrastive learning, the selection of positive and negative pairs significantly impacts model performance. Simple pairs may result in learning only basic features, whereas overly challenging pairs can lead to overfitting and degraded performance. Hard negative mining and semi-hard negative selection techniques (He et al., 2020; Oh Song et al., 2017; Schroff et al., 2015) address these issues but often increase training complexity and cost. Some approaches apply self-paced learning Kumar et al. (2010), enabling the model to start with simpler samples and gradually tackle harder ones (Liu et al., 2023, 2022). However, these methods are unsuitable for SSL. First, they are generally based on closed-set problems with pre-defined classes, not applicable to SSL frameworks, where only sample embeddings are available (Franco et al., 2023). Secondly, they need to pre-define the sample difficulty evaluation and sample selection, which significantly increases computational complexity and training time. Additionally, they do not consider how to reasonably use hard samples for training, which is not conducive to improving the model’s learning ability (Ge et al., 2020).

In self-supervised pre-trained models, dynamic sample selection based on difficulty is relatively underexplored, yet it is especially needed for FGIR, where intra-class variance is high and inter-class variance is low. Effective sample selection strategies can significantly enhance the training efficiency and performance of self-supervised models, benefiting downstream tasks such as FGIR.

To address these gaps, we propose a self-supervised pre-training method tailored to FGIR tasks. Our approach effectively combines generative and contrastive learning, enabling models to extract fine-grained contextual and discriminative features during pre-training. Additionally, we introduce an ASS module to dynamically select training pairs based on difficulty level as the model’s training progresses, improving training efficiency and effectiveness without significantly increasing computational complexity. This method empowers the pre-trained model to develop discriminative fine-grained features, enhancing FGIR performance.

3. Proposed Method

Our self-supervised pre-training method for FGIR consists of two main stages: the ASS module and the SSL task. Firstly, the ASS module adaptively selects training sample pairs with varying levels of difficulty. By dynamically adjusting the difficulty of the selected pairs as training progresses, the module optimizes the learning process. This allows the model to efficiently capture fine-grained differences across images without significantly increasing training costs. Building on the pairs selected by the ASS, we design a novel SSL that integrates generative and contrastive learning. This enables the pre-trained model to simultaneously capture contextual information within images and similarity relationships across images, resulting in the extraction of discriminative features enriched with contextual details, enhancing its capability for FGIR tasks. The following subsections provide detailed explanations of these components.

3.1. ASS Module

Training SSL pre-trained models is resource-intensive in both computation and time. In contrastive learning, the choice of difficulty for positive and negative sample pairs significantly impacts the effectiveness and efficiency of model training. To address this, we propose an ASS module. The ASS dynamically selects positive and negative sample pairs with progressively increasing difficulty based on the training progress. As illustrated in Figure 2, we first pre-train an independent ASS network whose backbone is a lightweight ViT-S/16 optimized with a joint generative–contrastive objective (training strategy in Section 3.2). The network follows the standard ViT-S/16 configuration (12 transformer layers, six attention heads, and 384-d patch embeddings), and adds a 128-d projection head. It is trained with a 75% random-masking ratio, and the complete training recipe is listed in Section 4.1. This stage is inexpensive, yet it equips the encoder with strong discrimination. Once converged, the decoder is discarded and the encoder is frozen, becoming the feature extractor $g_{ϕ}$ . The $g_{ϕ}$ remains independent of the downstream retrieval backbone and can be flexibly integrated into the pre-training process of other retrieval architectures. In the ensuing main SSL stage, every image in train data is passed through this frozen $g_{ϕ}$ to obtain compact embeddings, which are fed to our ASS algorithm 1. The algorithm can dynamically control the difficulty of training samples according to the training process. Specifically, for a batch of $N$ training pairs, the ASS prioritizes easy-to-distinguish samples in the initial training stages and shifts to harder pairs as training progresses.

Figure 2.

Top: ASS Network Is Self-Trained with a Lightweight ViT-S/16 Encoder–Decoder and Dual Objectives. Bottom: After Training, the Encoder $g_{ϕ}$ Is Frozen and Used to Rank Image–Image Pairs from Easy to Hard, Providing Difficulty-Aware Curricula for the Main SSL Network. Note. ASS = Adaptive Sampling Selector; SSL = Self-Supervised Learning.

In Algorithm 1, the parameter $γ$ controls task difficulty. When $γ \to 0$ , $p_{i}$ approximates a uniform distribution, favoring easier samples. As $γ$ increases, the model focuses on harder samples. Therefore, during the $T$ -th iteration, we set $γ = σ T_{max} / T$ , where $T_{max}$ represents the maximum number of iterations. $σ > 1$ is a manually specified hyperparameter used to adjust the rate at which the training difficulty increases. This incremental adjustment ensures a smooth transition from learning coarse-grained to fine-grained features. The selected positive sample set $A$ and negative sample set $B$ compose a batch of input samples during the training phase.

It is worth noting that, due to the small size of the ASS module, the cost associated with selecting candidate hard samples is significantly low. This process can be further optimized by pre-extracting the feature vectors for all data using the ASS. Consequently, during model training, each sample only needs to pass through the ASS once, effectively minimizing the computational and time costs of sample selection to nearly negligible levels.

Using Algorithm 1, the large-scale model is designed to learn from a broad set of easily distinguishable sample pairs during the early stages of training. This approach helps the model acquire foundational coarse-grained feature representations. As training progresses to later stages, the model transitions to learning from more challenging sample pairs, fostering a deeper understanding and acquisition of fine-grained feature representations.

3.2. Pre-Trained Strategy

With difficulty-aware training pairs supplied by ASS, we design a self-supervised regimen that tightly couples generative and contrastive learning (Figure 3). The synergy between these two objectives allows the network to recover global context while learning instance-level distinctions that are critical for FGIR.

Figure 3.

Dual-Objective Pre-Training Driven by Adaptive Sample Selector (ASS)-Selected Image Pairs. For Each Pair, We Patchify and Randomly Mask the Images, Encode the Visible Tokens with a ViT-B/16 Encoder $f_{θ}^{e}$ , and Reconstruct the Missing Content with a Lightweight Decoder $f_{θ}^{d}$ . A Pixel-Level Generative Loss and an Instance-Level Contrastive Loss Are Jointly Optimized, Endowing the Model with Both Contextual Understanding and Fine-Grained Discrimination.

3.2.1. Generative Learning With Random Masking

Given an image $X^{i}$ split into $n \times n$ patches

X^{i} = {x_{1}^{i}, \dots, x_{n \times n}^{i}},

(1)

we randomly select a subset

M

of patches to mask. Define a masked version

M (X^{i})

such that

M (x_{j}^{i}) = {\begin{cases} x_{j}^{i}, & j \notin M, \\ [MASK], & j \in M . \end{cases}

(2)

The model

f_{θ}

reconstructs the missing patches:

{\hat{x}}_{j}^{i} = f_{θ} (M (X^{i})), \forall j \in M .

(3)

Here

f_{θ}

consists of an encoder (

f_{θ}^{e}

) and a decoder (

f_{θ}^{d}

). It can be expressed as

f_{θ} = f_{θ}^{d} \circ f_{θ}^{e}

, where the symbol

\circ

represents the composition of functions. We measure reconstruction quality via the mean squared error:

L_{g} (X^{i}) = \frac{1}{| M |} \sum_{j \in M} ‖ x_{j}^{i} - {\hat{x}}_{j}^{i} ‖_{2}^{2} .

(4)

This process enables the model to learn both local and global contextual relationships, crucial for FGIR tasks. To enhance the synergy between generative and contrastive learning, we use shared and auxiliary representations. The shared representation, generated by the encoder $f_{θ}^{e}$ from unmasked patches, is utilized across both reconstruction and contrastive tasks to ensure consistency in learning objectives. Simultaneously, the auxiliary representation, produced by the decoder $f_{θ}^{d}$ during reconstruction, contains latent features for the masked patches. These auxiliary features are integrated with the shared representation during contrastive learning, effectively bridging the generative and discriminative components of the model.

Masking of patches compels the encoder to recover missing details from the visible context, thereby establishing long-range part–whole dependencies that complement local texture cues. Such holistic reasoning is particularly valuable in fine-grained categories where discriminative regions (e.g., a bird’s beak or a car emblem) occupy only a small fraction of the image. MIM has already proved effective on fine-grained benchmarks: MAE achieves competitive transfer accuracy without labels (He et al., 2022), and the hybrid iBOT framework further improves CUB-200-2011 performance by combining masking with self-distillation (Zhou et al., 2021). Building on these insights, our design unifies MAE-style masking with an adaptive contrastive schedule, allowing the network to learn global context and progressively sharpen subtle inter-class distinctions.

3.2.2. Contrastive Learning: Inter-Image and Intra-Image

The contrastive learning component is designed to enhance the discriminative capability of image features. Denote the reconstructed image as ${\hat{X}}^{i}$ :

{\hat{X}}^{i} = {x_{1}^{i}, x_{2}^{i}, \dots, {\hat{x}}_{j}^{i}, \dots, {\hat{x}}_{k}^{i}, \dots, x_{n \times n}^{i}} .

(5)

Here, the original image

X^{i}

serves as the anchor sample,

{\hat{X}}^{i}

as the positive sample, and reconstructed images from other original images,

{\hat{X}}^{m}

, as negative samples. The triplet loss

L_{tri}

, is given by:

\begin{aligned} L_{tri} = max ( & D (f_{θ}^{e} (X^{i}), f_{θ}^{e} ({\hat{X}}^{i})) - D (f_{θ}^{e} (X^{i}), f_{θ}^{e} ({\hat{X}}^{m})) + α, 0), \end{aligned}

(6)

where

D (\cdot, \cdot)

measures the distance between feature embeddings, and

α

is a margin hyperparameter ensuring that positive samples are closer to the anchor than negative samples by at least

α

To further enhance representation learning, we introduce a dual contrastive objective that maximizes the similarity between the encoder’s shared representation and the decoder’s auxiliary representation for the same image. This intra-image contrastive loss $L_{intra}$ is defined as:

L_{intra} = - \log \frac{\exp (S (f_{θ}^{e} (X^{i}), f_{θ}^{d} ({\hat{X}}^{i})) / τ)}{\sum_{j = 1}^{N} \exp (S (f_{θ}^{e} (X^{i}), f_{θ}^{d} ({\hat{X}}^{j})) / τ)},

(7)

where

S (\cdot, \cdot)

is the cosine similarity, and

τ

is a temperature hyperparameter controlling the sharpness of the similarity distribution, which affects how strongly the model pushes apart negative samples and pulls together positive samples.

3.2.3. Final Loss Function

We unify the above objectives into a single cost function:

L = λ_{1} L_{g} + λ_{2} L_{t r i} + (1 - λ_{2}) L_{intra},

(8)

where

λ_{1}

and

λ_{2}

are regularization coefficients that balance the contributions of each loss component. These coefficients ensure that the model optimally weighs reconstruction accuracy, discriminative feature learning, and cross-component alignment, resulting in robust representations.

4. Experimental Results

4.1. Implementation Details

We use ImageNet as the pre-training dataset for our self-supervised model. Our framework employs a two-tiered backbone strategy for efficiency. The main self-supervised model, which learns the final feature representations, is built upon a ViT-B/16 backbone (Dosovitskiy et al., 2020). For the lightweight ASS module, whose purpose is the efficient pre-selection of training pairs, we utilize a smaller ViT-S/16 backbone, where “/16” represents a patch size of 16 pixels. This design significantly reduces the computational cost of the adaptive sampling process. During pre-training, we employ random cropping and random horizontal flipping as the only data augmentation techniques. We set the batch size to 64. The key hyperparameters are set as $σ = 16$ , temperature parameter $τ = 0.1$ , and number of negative samples $K = 10$ . Based on our ablation study, the optimal loss weights are set to $λ_{1} = 0.7$ and $λ_{2} = 0.3$ , respectively. The main ViT-B/16 model is pre-trained for 800 epochs using the AdamW optimizer (Loshchilov & Hutter, 2017). Input images are resized to $224 \times 224$ , resulting in each image being divided into $14 \times 14$ patches. We apply a weight decay of 0.05, a dropout rate of 0.1, and accumulate gradients over four steps. The initial learning rate is set to $1 \times 10^{- 4}$ and decays according to a cosine schedule following a 20-epoch warm-up phase.

For FGIR-specific fine-tuning, we further optimize the ViT-B/16 model using AdamW. The input resolution is increased to $512 \times 512$ , and the batch size is reduced to 16. Fine-tuning is conducted over 160K steps, with a layer-wise learning rate decay of 0.65 and a drop path rate of 0.1 to stabilize training. We evaluate four learning rates ( $1 \times 10^{- 4}, 2 \times 10^{- 4}, 3 \times 10^{- 4}$ , and $4 \times 10^{- 4}$ ) and report the best results. During this phase, the weight decay is set to $1 \times 10^{- 4}$ , the dropout rate is 0.1, and gradients are accumulated over four steps. For consistency, multi-scale testing is excluded from our evaluation.

4.2. Datasets

We evaluate the effectiveness of our self-supervised pre-training model on four widely used FGIR benchmark datasets: CUB-200-2011 (Wah et al., 2011), a fine-grained bird image dataset with 200 species; Cars-196 (Krause et al., 2013), which contains images of 196 cars with varying visual similarities; Stanford Online Products (SOP; Oh Song et al., 2016), featuring over 22,000 product categories from online stores; and In-Shop Clothes Retrieval (Liu et al., 2016), a dataset focused on fashion item retrieval with complex intra-class variations across 11,735 categories. For dataset partitioning, we adhere to the splits widely adopted in prior works (An et al., 2023; El-Nouby et al., 2021; Patel et al., 2022). To ensure consistent evaluation, we use the Recall@K metric, a standard measure for retrieval tasks, enabling fair comparisons with state-of-the-art methods.

4.3. Comparison With State-of-the-art Methods

We compare our approach with three categories of state-of-the-art methods: Supervised FGIR methods, including AMCICIR (Li & Ma, 2023), SGSL (Yang et al., 2023), IRTr (El-Nouby et al., 2021), DVF (Jiang et al., 2024), and RS@k (ViT-B/16, supervised ImageNet-1k baseline;Patel et al., 2022); and self-supervised pre-trained models, including SimCLR (Chen et al., 2020), SwAV (Zhu et al., 2020), DINO (Caron et al., 2021), MoCo v3 (Chen et al., 2021), MAE (He et al., 2022), CAE (Chen et al., 2024), iBOT (Zhou et al., 2021), and DeepMIM (Ren et al., 2025). These self-supervised pre-trained models are further categorized by their objectives—contrastive learning, generative learning, or a combination of both. And recent self-supervised models tailored for FGIR, including LCR (Shu et al., 2023), DMCAC (Trivedy & Latecki, 2024), and CMD (Bi et al., 2025). For fairness, whenever we report a supervised ImageNet-1k baseline (e.g., RS@k), we fine-tune it with exactly the same ViT-B/16 backbone, data-augmentation pipeline, $224 \times 224$ input resolution, and AdamW training recipe described in Section 4.1. Table 1 provides a summary of the backbones and sample selection strategies employed by these methods.

Table 1.
Overview of Representative State-of-the-art Methods Used in Comparative Experiments.

Method Backbone Sample Selection Strategy

Supervised FGIR

AMCICIR R50 Attention-guided hard negative mining

SGSL R50 Stop-gradient softmax with hard negative mining

RS@k ViT Large batches with similarity mixup

IRTr ViT Hard negative mining with cross-batch memory

DVF ViT Random sample selection

SSL – contrastive

SimCLR R50 Random sample pair selection

SwAV R50 Multi-view clustering

DINO ViT Self-distillation without labels

MoCo v3 ViT Momentum-based queue

SSL – generative

MAE ViT Random sample selection

CAE ViT Random sample selection

DeepMIM ViT Random sample selection

SSL - tailored for FGIR

LCR R50 Grad-CAM rationale screening

DMCAC ViT Divergence profile matching

CMD ViT Cross-level patch image distillation

SSL – generative and contrastive

iBOT ViT Online tokenizer (teacher-guided pairs)

Ours ViT Adaptive difficulty-based sample selection

Method	Backbone	Sample Selection Strategy
Supervised FGIR
AMCICIR	R50	Attention-guided hard negative mining
SGSL	R50	Stop-gradient softmax with hard negative mining
RS@k	ViT	Large batches with similarity mixup
IRTr	ViT	Hard negative mining with cross-batch memory
DVF	ViT	Random sample selection
SSL – contrastive
SimCLR	R50	Random sample pair selection
SwAV	R50	Multi-view clustering
DINO	ViT	Self-distillation without labels
MoCo v3	ViT	Momentum-based queue
SSL – generative
MAE	ViT	Random sample selection
CAE	ViT	Random sample selection
DeepMIM	ViT	Random sample selection
SSL - tailored for FGIR
LCR	R50	Grad-CAM rationale screening
DMCAC	ViT	Divergence profile matching
CMD	ViT	Cross-level patch image distillation
SSL – generative and contrastive
iBOT	ViT	Online tokenizer (teacher-guided pairs)
Ours	ViT	Adaptive difficulty-based sample selection

Note. FGIR = fine-grained image retrieval; SSL = self-supervised learning; R50 = ResNet50; ViT = ViT-B/16; Grad-CAM = gradient-weighted class activation mapping.

Table 2 shows the experimental results on CUB-200-2011 and Cars-196 datasets, while Table 3 presents the experimental results of these methods on SOP and In-Shop datasets. From these experimental results, we can observe that with the improvement of SSL, self-supervised pre-trained models can be effectively applied to the FGIR tasks. And, they have the potential to outperform FGIR methods using supervised pre-training. Under matched conditions (ViT-B/16 backbone, ImageNet-1k pre-training, and identical fine-tuning), the self-supervised DeepMIM model outperforms the supervised baseline RS@k by $+ 3.7$ , $+ 2.5$ , $+ 0.9$ , and $+ 0.5$ percentage points at Recall@1/2/4/8 on Cars-196. On CUB-200-2011, our method achieves an R@1 of 89.2%, significantly surpassing SGSL’s 72.0% and AMCICIR’s 68.8%. Similarly, on the Cars-196 dataset, our method reaches an R@1 of 95.5%, compared to SGSL’s 94.1% and AMCICIR’s 87.2%. These findings underscore the capability of self-supervised pre-training to exceed the performance of supervised methods in FGIR tasks without relying on expensive labeled datasets. These performance margins can be traced to two complementary properties of self-supervised pre-training. First, because no external labels are involved, the encoder is not constrained by the coarse ImageNet taxonomy; instead, it is optimized directly on instance-level distinctions and therefore attends to the fine-grained, part-level cues that are decisive in FGIR. Second, the absence of an annotation requirement allows the pre-training stage to exploit a substantially larger and more diverse image collection which further enhances generalization.

Table 2.

Recall@K(%) on CUB-200-2011 and Cars-196 With State-of-the-art Methods.

	CUB-200-2011 (K)				Cars-196 (K)
Method	1	2	4	8	1	2	4	8
AMCICIR	68.8	79.5	87.5	92.4	87.2	92.2	95.9	98.4
SGSL	72.0	81.1	88.3	93.1	94.1	96.7	98.0	99.0
RS@k	–	–	–	–	89.5	94.2	96.6	98.3
DVF	88.8	93.1	95.3	96.5	90.2	94.6	97.3	98.9
SimCLR	66.3	74.7	85.6	89.1	86.1	90.6	94.8	96.2
SwAV	67.5	75.3	87.4	91.0	86.8	91.2	95.1	97.4
DINO	79.4	85.5	88.7	92.3	92.2	94.7	96.5	97.9
MoCo v3	81.6	86.7	90.5	93.5	89.2	93.8	97.3	98.6
MAE	80.1	86.2	91.7	93.1	88.7	93.0	95.5	97.2
CAE	82.5	88.0	92.1	93.8	93.9	97.8	98.6	99.5
DeepMIM	84.0	90.3	93.5	94.1	93.2	96.7	97.5	98.7
iBOT	87.4	92.5	94.9	96.5	91.8	95.6	97.9	98.9
DMCAC	86.2	92.0	94.7	96.7	88.5	93.9	96.7	98.0
LCR	71.3	74.8	86.9	89.4	60.8	78.9	86.3	90.5
CMD	81.5	85.6	89.2	92.7	85.9	92.3	94.5	96.8
Ours	89.2	93.6	95.3	96.4	95.5	98.6	99.0	99.2

Note. The best performances are marked in bold.

Table 3.

Recall@K(%) on SOP and In-Shop With State-of-the-art Methods.

	SOP (K)				In-Shop (K)
Method	1	10	100	1000	10	20	30	40
IRTr	84.2	93.7	97.3	99.1	91.9	98.1	98.7	98.9
SimCLR	79.1	88.4	94.6	97.2	89.2	96.4	97.2	98.0
SwAV	79.9	89.5	95.4	97.8	90.7	96.3	97.8	98.2
DINO	82.2	91.9	97.6	99.1	92.8	97.4	98.6	99.0
MoCo v3	81.3	91.2	96.8	98.4	92.3	96.5	98.1	98.9
MAE	81.5	90.8	96.7	98.9	91.8	97.2	97.9	98.3
CAE	84.6	93.4	96.9	99.0	92.5	97.6	98.2	99.1
SGSL	81.4	91.8	96.2	–	93.5	98.6	99.1	99.3
DeepMIM	84.2	94.1	96.5	98.7	92.0	98.1	98.6	99.1
iBOT	84.9	93.2	96.9	98.8	93.0	97.7	98.5	99.0
DMCAC	86.3	95.2	97.5	99.5	92.7	98.2	98.9	99.3
CMD	80.2	90.6	95.3	97.9	90.1	95.3	97.2	98.0
Ours	87.9	95.2	98.1	99.4	94.0	98.8	99.1	99.3

Note. The best performances are marked in bold.

Moreover, our model, which combines contrastive and generative learning, consistently outperforms methods relying on a single paradigm. For instance, on the SOP dataset, our method achieves an R@1 of 87.9%, surpassing contrastive-only models such as SimCLR (79.1%) and MoCo v3 (81.3%), as well as generative-only models such as MAE (81.5%) and CAE (84.6%). This demonstrates the strength of integrating contrastive and generative approaches, enabling the model to capture both inter-image similarities and intra-image contextual details, leading to superior feature learning and retrieval performance. Furthermore, the ASS module plays a critical role in enhancing retrieval performance. On Cars-196, our model achieves an R@1 of 95.5% and an R@2 of 98.6%, outperforming strong competitors such as CAE (R@1 of 93.9%) and DeepMIM (R@1 of 93.2%). By dynamically adjusting sample difficulty during training, ASS ensures the model learns from a balanced mix of simple and challenging examples, improving generalization and overall performance across datasets.

We further benchmark against three very recent methods specifically designed for fine-grained retrieval—LCR, DMCAC, and CMD—and the hybrid SSL baseline iBOT. iBOT, which unifies masked reconstruction with teacher–student distillation, achieves Recall@1 values of 87.4% on CUB-200-2011 and 91.8% on Cars-196. These figures remain 1.8 and 3.7 percentage points below those of our model, respectively, indicating that adaptive difficulty sampling offers additional benefits beyond online token distillation. On the large-scale product datasets SOP and In-Shop, iBOT trails our approach by 3.0–4.0 percentage points, reinforcing the same observation. Compared with the most recent FGIR-oriented method, DMCAC, our model improves Recall@1 by 3.0 percentage points on CUB, 7.0 percentage points on Cars, and 1.6 percentage points on SOP. CMD and LCR exhibit substantially lower performance, emphasizing their limited scalability to transformer backbones.

In summary, these experimental results decisively demonstrate that our self-supervised method surpasses state-of-the-art supervised methods in FGIR. The integration of contrastive and generative learning paradigms proves superior to relying on a single learning mode, while the adaptive difficulty-based sample selection strategy further enhances performance, establishing our model as a new benchmark in FGIR tasks.

4.4. Ablation Study

To assess the contribution of individual components of our self-supervised pre-training method, we conducted a series of ablation experiments on the CUB-200-2011 dataset. The components examined include the ASS module, the model backbone, the use of random masking, and the loss weights $λ_{1}$ and $λ_{2}$ .

ASS module. The importance of the ASS module is evaluated in Table 4. Without the ASS module, the model’s performance drops significantly, achieving only R@1 = 85.3% compared to R@1 = 89.2% when the ASS module is included. This trend is consistent across all recall metrics, with the ASS-enabled model outperforming the one without the ASS by large margins. These results confirm the effectiveness of the ASS in dynamically selecting training samples of varying difficulty, which enhances the model’s ability to learn discriminative features and boosts retrieval performance.

Model Backbone. To evaluate the impact of the backbone architecture, we compare the performance of our method using ResNet50 and ViT-B/16 backbones, as shown in Table 5. The model using ViT-B/16 achieves superior results, with R@1 = 89.2% compared to R@1 = 81.9% for ResNet50. This indicates that ViT-B/16 is better suited for FGIR tasks, likely due to its ability to capture more complex and nuanced image representations, leading to improved feature extraction and retrieval accuracy.

Loss Weights $λ_{1}$ and $λ_{2}$ . To evaluate the effect of the losses $λ_{1}$ and $λ_{2}$ in balancing the generative and contrastive components of our self-supervised pre-training strategy, we conducted an ablation study on the CUB-200-2011 dataset, as shown in Table 6. Setting $λ_{1} = 1.0$ and $λ_{2} = 0.0$ (generative-only) results in R@1 = 81.1%, demonstrating the limitations of relying solely on generative learning. Conversely, when $λ_{1} = 0.0$ and $λ_{2} = 1.0$ (triplet contrastive-only), the model achieves R@1 = 85.5%, indicating the critical role of contrastive objectives in discriminative feature learning. When both $λ_{1}$ and $λ_{2}$ are non-zero, the model exhibits significant improvements across recall metrics. This shows the advantages of combining generative learning and comparative learning. The best performance is achieved with $λ_{1} = 0.3$ and $λ_{2} = 0.7$ , yielding R@1 = 89.2% and consistent gains across other recall metrics. This configuration highlights the synergistic effect of integrating generative learning with a balanced emphasis on both triplet contrastive loss and intra-image contrastive alignment loss. These results validate the effectiveness of the proposed loss design in producing robust and fine-grained representations for image retrieval tasks.

Random Masking. Table 7 contrasts the default setting (75% random masking) with an otherwise identical no-mask baseline in which the reconstruction branch is disabled. Removing masking causes a 3.7 percentage-point drop in Recall@1 (from 89.2% to 85.5%), and consistent degradations at higher $K$ . Because the encoder–decoder structure and inference floating-point operations remain the same, the performance gap can be attributed solely to the absence of the MIM objective. These findings confirm that random masking forces the encoder to model long-range part–whole relations that are critical for fine-grained discrimination, and provides complementary supervisory signals to the adaptive contrastive pairs generated by ASS.

The ablation studies validate the effectiveness of each key component in our self-supervised pre-training framework. Specifically, the integration of the ASS module, the adoption of the ViT-B/16 backbone, the use of random masking, and the tuning of the loss weight $λ$ each contribute meaningfully to the overall performance. These results collectively affirm the soundness of our design choices.

Table 4.
Ablation Study of Adaptive Sample Selector (ASS).

CUB-200-2011 (Recall@K)

Method 1 2 4 8

Without the ASS module 85.3 91.0 93.8 94.6

With the ASS module 89.2 93.6 95.2 96.4

	CUB-200-2011 (Recall@K)
Without the ASS module	85.3	91.0	93.8	94.6
With the ASS module	89.2	93.6	95.2	96.4

Note. The best performances are marked in bold.

Table 5.

Ablation Study of Model Backbone.

	CUB-200-2011 (Recall@K)
Backbone	1	2	4	8
ResNet50	81.9	86.4	92.5	93.8
Vit-B/16	89.2	93.6	95.2	96.4

Note. The best performances are marked in bold.

Table 6.

Ablation Study of Weights $λ_{1}$ and $λ_{2}$ on CUB-200-2011.

		CUB-200-2011 (Recall@K)
$λ_{1}$	$λ_{2}$	1	2	4	8
1.0	0.0	81.1	85.3	89.8	92.0
0.9	0.1	84.9	88.7	92.9	93.9
0.7	0.3	87.6	89.9	94.8	94.9
0.5	0.5	89.2	92.5	95.2	96.5
0.3	0.7	89.2	93.6	95.2	96.4
0.1	0.9	88.3	92.1	94.6	96.0
0.0	1.0	85.5	89.4	93.3	94.2

Note. The best performances are marked in bold.

Table 7.

Ablation Study of Random Masking.

	CUB-200-2011 (Recall@K)
Backbone	1	2	4	8
No-mask	85.5	90.7	93.4	95.0
Random mask	89.2	93.6	95.3	96.4

Note. The best performances are marked in bold.

4.5. Visualization

To evaluate the focus and attention mechanisms of our model, we generated feature heat maps for images in the CUB-200-2011 dataset, as shown in Figure 4. These heat maps demonstrate how different learning configurations influence the model’s ability to capture fine-grained details that are essential for accurate image retrieval.

Figure 4.

Feature Heat Maps Comparing Different Learning Configurations on the CUB-200-2011 Dataset. Each row Represents a Different Bird Species, With Columns Showing the Original Image, Only Contrastive Learning, Only Generative Learning, and our Proposed Method.

The figure is organized into five columns for comparison, as described below. The first column displays the original bird images. In the second column, the results of using only contrastive learning are shown. The heat maps reveal that the model’s attention is scattered and only partially focused on key features of the birds. This limited attention reduces the model’s ability to capture fine-grained details effectively, leading to suboptimal retrieval performance. The third column illustrates the effects of applying only generative learning, where the heat maps display broader and more continuous attention across the bird’s body. However, this approach lacks a distinct focus on discriminative features, such as the beak or unique plumage patterns, which are crucial for distinguishing between visually similar bird species. The fourth column highlights the results of our proposed method, which combines contrastive learning, generative learning, and the ASS module. The heat maps produced by this configuration demonstrate clear, consistent, and highly focused attention on critical areas, such as feather texture, color patterns, and specific body parts such as the beak and wings. This focused attention reflects the model’s ability to dynamically adjust sample difficulty through the ASS, thereby enhancing feature learning and enabling the identification of subtle distinctions essential for FGIR tasks.

The comparison of heat maps across these configurations highlights the advantage of our method. By integrating contrastive and generative learning with adaptive sample selection through the ASS, our approach achieves superior focus on discriminative regions, thereby enhancing the model’s ability to capture essential fine-grained details for image retrieval tasks.

5. Conclusion

This study presents a novel self-supervised pre-training framework specifically tailored for the challenging task of FGIR. Traditional FGIR approaches, which rely heavily on labeled datasets, are constrained by scalability and cost limitations, hindering their real-world applicability. To address these challenges, we propose a framework that combines an ASS module with an integrated learning strategy that synergizes contrastive and generative learning. The ASS module plays a critical role in dynamically selecting training samples of varying difficulty, enabling the model to focus on learning discriminative features progressively. This dynamic sample selection not only enhances the learning process but also maintains computational efficiency, making the approach practical for large-scale applications. Additionally, the integration of contrastive and generative learning provides the model with the ability to capture both intra-image contextual details and inter-image similarities, resulting in robust and discriminative fine-grained feature representations. Extensive experimental results on widely used FGIR benchmarks, including CUB-200-2011, Cars-196, SOP, and In-Shop Clothes Retrieval, demonstrate that our framework achieves state-of-the-art performance. By consistently outperforming existing supervised and self-supervised methods, our approach highlights the power of SSL for FGIR tasks. Furthermore, the framework significantly reduces dependence on costly labeled data, offering a scalable and cost-effective solution for fine-grained image analysis.

The ASS module proposed in this study holds promise for a broader application in other image processing tasks, where it could improve training efficiency by dynamically adjusting sample difficulty. Furthermore, the training approach and integrated learning strategy presented here can be generalized to various fine-grained image-related tasks, extending the impact of this work beyond the FGIR. Future research can explore these applications and investigate further improvements in SSL pre-trained techniques to reduce reliance on labeled data while advancing model performance.

Footnotes

Acknowledgments

We would like to express our sincere gratitude to colleagues who participated in discussions and offered valuable suggestions.

ORCID iDs

Xiaoqing Li

Ya Wang

Author Contributions

Xiaoqing Li conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and tables, reviewed drafts of the article, and approved the final draft. Ya Wang conceived and designed the experiments, analyzed the data, reviewed drafts of the article, and approved the final draft.

Ethical Considerations

This study did not involve any human participants, animal subjects, or sensitive data, and therefore did not require ethical approval.

Consent to Participate

No human participants were involved in this study; informed consent was not required.

Consent for Publication

This study does not include any individual person’s data in any form; therefore, consent for publication is not required.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Capital University of Economics and Business (Project No. XRZ2022065). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability

The datasets used in this study are publicly available:

iCartoonFace: https://github.com/luxiangju-PersonAI/iCartoonFace?tab=readme-ov-file#Dataset

CIFAR-100: https://pytorch.org/vision/stable/generated/torchvision.datasets.CIFAR100.html

Flowers102: https://pytorch.org/vision/stable/generated/torchvision.datasets.Flowers102.html

Stanford Cars: https://pytorch.org/vision/stable/generated/torchvision.datasets.StanfordCars.html

References

Deng

Yang

Feng

Guo

Liu

(2023). Unicom: Universal and compact representation learning for image retrieval. arXiv preprint arXiv:2304.05884.

Zhan

Xia

G. S.

(2025). Cross-level multi-instance distillation for self-supervised fine-grained visual categorization. IEEE Transactions on Image Processing, 34, 2954–2969.

Caron

Touvron

Misra

Jégou

Mairal

Bojanowski

Joulin

(2021). Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9650–9660). IEEE Computer Society.

Chen

Kornblith

Norouzi

Hinton

(2020). A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning (pp. 1597–1607). JMLR.org.

Chen

Ding

Wang

Xin

Wang

Han

Luo

Zeng

Wang

(2024). Context autoencoder for self-supervised representation learning. International Journal of Computer Vision, 132(1), 208–223.

Chen

Xie

(2021). An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 9640–9649). IEEE.

Dosovitskiy

Beyer

Kolesnikov

Weissenborn

Zhai

Unterthiner

Houlsby

(2020). An image is worth

16 \times 16

words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Duan

Feng

Zhou

Xiong

Wang

Qiang

Jia

(2022). Multilevel similarity-aware deep metric learning for fine-grained image retrieval. IEEE Transactions on Industrial Informatics, 19(8), 9173–9182.

El-Nouby

Neverova

Laptev

Jégou

(2021). Training vision transformers for image retrieval. arXiv preprint arXiv:2102.05644.

10.

Franco

Mandica

Munjal

Galasso

(2023). Hyperbolic self-paced learning for self-supervised skeleton-based action representations. arXiv preprint arXiv:2303.06242.

11.

Zhu

Chen

Zhao

(2020). Self-paced contrastive learning with hybrid memory for domain adaptive object re-ID. Advances in Neural Information Processing Systems, 33, 11309–11321.

12.

Han

Liu

Shao

Liu

Zhou

(2025). Feature aggregation and connectivity for object re-identification. Pattern Recognition, 157, 110869.

13.

Han

K. K.

(2002). Fuzzy color histogram and its use in color image retrieval. IEEE Transactions on Image Processing, 11(8), 944–952.

14.

Chen

Xie

Dollár

Girshick

(2022). Masked autoencoders are scalable vision learners. In In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16000–16009). IEEE.

15.

Fan

Xie

Girshick

(2020). Momentum contrast for unsupervised visual representation learning. In In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9729–9738). IEEE.

16.

Zhang

Ren

Sun

(2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778). IEEE.

17.

Zhang

Guo

Wei

X. S.

Zhao

Gao

(2024). An asymmetric augmented self-supervised learning method for unsupervised fine-grained image hashing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 17648–17657). IEEE.

18.

Islam

S. M.

Joardar

Sekh

A. A.

(2024). A survey on fashion image retrieval. ACM Computing Surveys, 56(6), 1–25.

19.

Jiang

Tang

Yan

Tang

(2024). DVF: Advancing robust and accurate fine-grained image retrieval with retrieval guidelines. In Proceedings of the 32nd ACM international conference on multimedia (pp. 2379–2388). ACM.

20.

Krause

Stark

Deng

Fei-Fei

(2013). 3D object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops (pp. 554–561). IEEE.

21.

Kumar

Packer

Koller

(2010). Self-paced learning for latent variable models. In Advances in neural information processing systems, 23 (pp. 1189–1197). Curran Associates, Inc.

22.

Zheng

Liu

Wang

Zheng

(2022). SemMAE: Semantic-guided masking for learning masked autoencoders. Advances in Neural Information Processing Systems, 35, 14290–14302.

23.

(2023). Fine-grained image retrieval by combining attention mechanism and context information. Neural Computing and Applications, 35(2), 1881–1897.

24.

Liu

Zhu

Shen

Liu

Razavian

Geras

K. J.

Fernandez-Granda

(2023). Multiple instance learning via iterative self-paced supervised contrastive learning. In In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3355–3365). IEEE.

25.

Liu

Luo

Qiu

Wang

Tang

(2016). DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1096–1104). IEEE.

26.

Liu

Zhu

Zheng

Liu

Zhou

Zhao

(2022). Margin preserving self-paced contrastive learning towards domain adaptation for medical image segmentation. IEEE Journal of Biomedical and Health Informatics, 26(2), 638–647.

27.

Loshchilov

Hutter

(2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

28.

Lowe

D. G.

(2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

29.

Oh Song

Jegelka

Rathod

Murphy

(2017). Deep metric learning via facility location. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5382–5390). IEEE.

30.

Oh Song

Xiang

Jegelka

Savarese

(2016). Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4004–4012). IEEE.

31.

Patel

Tolias

Matas

(2022). Recall@k surrogate loss with large batches and similarity mixup. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7502–7511). IEEE.

32.

Ren

Wei

Zhang

S. A. Z.

(2025). DeepMIM: Deep supervision for masked image modeling. In 2025 IEEE/CVF winter conference on applications of computer vision (WACV) (pp. 879–888). IEEE.

33.

Sastry

Khanal

Dhakal

Huang

Jacobs

(2024). BirdSAT: Cross-view contrastive masked autoencoders for bird species classification and mapping. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 7136–7145). IEEE.

34.

Schroff

Kalenichenko

Philbin

(2015). FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 815–823). IEEE.

35.

Shetty

Bhat

V. S.

Pujari

(2024). Content-based medical image retrieval using deep learning-based features and hybrid meta-heuristic optimization. Biomedical Signal Processing and Control, 92, 106069.

36.

Shu

Van den Hengel

Liu

(2023). Learning common rationale to improve self-supervised representation for fine-grained visual recognition problems. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11392–11401). IEEE.

37.

Simonyan

Zisserman

(2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

38.

Sun

Wang

Zhang

Song

Zhao

(2024). A comprehensive review of pedestrian re-identification based on deep learning. Complex & Intelligent Systems, 10(2), 1733–1768.

39.

Suthaharan

(2016). Support vector machine. In Machine learning models and algorithms for big data classification: Thinking with examples for effective learning (vol. 36, pp. 207–235). Springer.

40.

Trivedy

Latecki

L. J.

(2024). Image retrieval with self-supervised divergence minimization and cross-attention classification. In Proceedings of the thirty-third international joint conference on artificial intelligence (IJCAI) (pp. 1344–1352). ACM.

41.

Vagena

Wei

Kurtz

Cloppet

(2025). Semantic aware representation learning for optimizing image retrieval systems in radiology. Pattern Recognition, 158, 111060.

42.

Wah

Branson

Welinder

Perona

Belongie

(2011). The Caltech-UCSD Birds-200-2011 dataset.

43.

Wang

Zhang

Liu

(2024). Triplet-constrained deep hashing for chest X-ray image retrieval in COVID-19 assessment. Neural Networks, 173, 106182.

44.

Wei

X. S.

Luo

J. H.

Zhou

Z. H.

(2017). Selective convolutional descriptor aggregation for fine-grained image retrieval. IEEE Transactions on Image Processing, 26(6), 2868–2881.

45.

Xie

Zhang

Cao

Lin

Bao

Yao

Dai

(2022). SimMIM: A simple framework for masked image modeling. In In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9653–9663). IEEE.

46.

Yang

Wang

Zhang

(2023). Stop-gradient softmax loss for deep metric learning. In Proceedings of the AAAI conference on artificial intelligence (vol. 37, No. 3, pp. 3164–3172). AAAI Press.

47.

Qin

Wan

Zhang

(2013). Feature integration analysis of bag-of-features model for image retrieval. Neurocomputing, 120, 355–364.

48.

Zeng

Wang

Chen

Dai

Xia

S. T.

Wang

(2024). Pyramid hybrid pooling quantization for efficient fine-grained image retrieval. Pattern Recognition Letters, 178, 106–114.

49.

Zhang

Xiao

Wang

Feng

Yang

(2024). Relationship constraint deep metric learning. Applied Intelligence, 54(8), 6654–6666.

50.

Zhou

Qin

Hou

Dai

Huang

Zhang

(2024). Deep global semantic structure-preserving hashing via corrective triplet loss for remote sensing image retrieval. Expert Systems with Applications, 238, 122105.

51.

Zhou

Wei

Wang

Shen

Xie

Yuille

Kong

(2021). iBOT: Image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832.

52.

Zhu

Wang

Zhou

Yang

Meng

Zhang

(2020). SWAV: A web-based visualization browser for sliding window analysis. Scientific Reports, 10(1), 149.

Self-Supervised Pre-Training With Adaptive Sampling for Fine-Grained Image Retrieval

Abstract

Keywords

1. Introduction

2.1. Fine-Grained Image Retrieval (FGIR)

2.2. Self-Supervised Pre-Trained Models

2.3. Sample Selection Techniques in Contrastive Learning

3. Proposed Method

3.1. ASS Module

4.1. Implementation Details

4.2. Datasets

4.3. Comparison With State-of-the-art Methods

Table 4. Ablation Study of Adaptive Sample Selector (ASS). CUB-200-2011 (Recall@K) Method 1 2 4 8 Without the ASS module 85.3 91.0 93.8 94.6 With the ASS module 89.2 93.6 95.2 96.4

Footnotes

Acknowledgments

ORCID iDs

Author Contributions

Ethical Considerations

Consent to Participate

Consent for Publication

Funding

Declaration of Conflicting Interests

Data Availability

References

Table 4.
Ablation Study of Adaptive Sample Selector (ASS).

CUB-200-2011 (Recall@K)

Method 1 2 4 8

Without the ASS module 85.3 91.0 93.8 94.6

With the ASS module 89.2 93.6 95.2 96.4