Sage Journals: Discover world-class research

Abstract

Test-time adaptation enables a pre-trained model to update its weights during inference, in order to adapt to a target domain that has a different distribution from the source domain. This adaptation occurs without any supervision and often in a more challenging source-free setting where no data from the source domain are used. While test-time adaptation has received considerable attention for classification tasks, domain adaptation is equally important for other computer vision tasks, such as object detection. Many approaches consider a static target domain, which fails to simulate real-world conditions, where the target domain is non-stationary and the target distribution can gradually change over time. In this work, we focus on the continual test-time adaptation scenario, in which the target domain is continually changing over time. Leveraging the mean teacher framework for object detection, we stochastically restore a small part of the student’s weights to the source pre-trained model weights during adaptation. Additionally, we aim to enhance performance by using contrastive learning. After a consistent experimental work, it is shown that our proposed method compares favorably with the standard mean teacher approach.

Keywords

continual test-time adaptation mean teacher stochastic restoration contrastive learning

1. Introduction

In real-world applications, domain shifts are inevitable due to the dynamic nature of the environments. These dynamically changing conditions can lead to degradation of the performance of the models. Especially, object detectors can suffer from a significant performance drop, due to the more challenging localization task they perform. For example, the performance of an object detector trained on data from clear daytime weather can degrade when tested on data from rainy nighttime conditions. Therefore, it is essential to be able to adapt to new domains and bridge the gap of domain shifts in order to maintain optimal performance across various settings.

In test-time adaptation (TTA), a pre-trained model can adapt to a target domain that has a different distribution from the source domain, during the inference phase in an online fashion. Adaptation is used in a self-supervised manner, leveraging unlabeled data streams. Given that labeling data are both time-consuming and expensive, this further highlights the value of self-supervised adaptation. In addition, due to data privacy issues, data from the source domain are considered unavailable, which leads to a more demanding scenario. Existing studies consider a static target domain, and the model is adapted through pseudo-labeling, entropy minimization, batch-normalization statistics, etc. However, these approaches may be unstable in a continually changing environment, where the target distribution is dynamically changing. Furthermore, research is required to achieve improved performance of existing methods in a non-stationary target domain.

Approaches that use the mean teacher (MT) framework for test-time domain adaptive object detection comprise two identical models. The teacher generates pseudo-labels on the target domain, which are then used for updating student’s weights through backpropagation. Teacher’s weights are updated as the exponential moving average (EMA) of the student’s weights. A teacher can be considered as a weighted average of consecutive student models. Unfortunately, MT self-training can suffer from noisy pseudo-labels. As the domain continually changes, pseudo-labels become unreliable and mis-calibrated. This can result in a degradation in the performance of the detector. However, contrastive learning (CL) does not require accurate labels for the learning process. It can produce robust features by pulling together representations of similar instances and pushing apart representations of dissimilar instances, even with noisy pseudo-labels. Specifically, contrastive mean teacher (CMT; Cao et al., 2023) introduces object-level CL that focuses on learning representations at the level of individual objects. These representations are beneficial for both the localization and the classification tasks in object detection and can enhance the performance of the MT framework. Moreover, as the model is adapted to continually changing distributions for a long time, retaining knowledge from the source domain becomes challenging. This phenomenon is referred to as catastrophic forgetting. CoTTA (Wang et al., 2022) proposes to stochastically restore a small part of the students’ weights to the source pre-trained model weights during adaptation. This strategy aims to preserve the source knowledge and prevent forgetting.

In this paper, we integrate MT with object-level CL, similar to CMT, but in a source-free setting. The object-level CL is applied to a one-stage object detector, namely YOLOX (Ge et al., 2021), in contrast to CMT, which uses a two-stage object detector. We further enhance its performance with the stochastic restoration (SR) technique. Our method shows an improvement over the existing baseline.

The key contributions of this work can be summarized as follows:

We propose a novel approach for continual TTA that combines MT with object-level CL and SR.

Our method operates in a source-free setting, without using any data from the source domain, and can directly deploy the pre-trained model. Furthermore, it is agnostic to domain shifts that may occur during inference.

We demonstrate that incorporating object-level CL into a one-stage object detector enhances its performance, highlighting that competitive results can be achieved without relying on more complex two-stage detectors.

Our proposed method is thoroughly evaluated on a variety of standard datasets, which can serve as benchmarks for continual TTA techniques.

Our approach achieves improved performance compared to the existing baseline for object detection.

2. Related Work

2.1. Unsupervised Domain Adaptation (UDA)

UDA refers to the setting in which there is a shift between the labeled source domain and the unlabeled target domain. Labeled data from the source domain are available during the adaptation process. Many methods aim to align the feature distributions between the two domains, using discrepancy losses or adversarial training (Chen et al., 2020a; Ganin et al., 2017; Tzeng et al., 2017). Recently, self-training methods have also received attention and have been extended from semi-supervised to unsupervised settings, utilizing pseudo-labels for supervision and relying on the MT framework (Tarvainen & Valpola, 2017).

2.2. Test-Time Adaptation (TTA)

TTA methods leverage the available test data during inference to adapt to the target domain. In numerous studies, TTA is regarded as a source-free domain adaptation, where labeled source data are unavailable and cannot be used for the adaptation process. The model is adapted for multiple epochs on the data from the target domain before being used to generate the final predictions. This process operates in an offline manner, so that the target data are available and can be fed into the model multiple times. However, in some applications where the network is updated in real-time, the target data may not be available beforehand. Additionally, predictions might be needed immediately and may be impossible to store the target data after making predictions. All these factors have led to the development of online TTA methods (Wang et al., 2024), which aim to adapt the model using only the current batch of available data.

Test data provide valuable information about the domain shifts, leading researchers to focus on updating batch normalization (BN) statistics during the testing process (Li et al., 2018). While these techniques require only a forward pass, recent TTA approaches perform a backward pass to update the weights of the model, by entropy minimization with respect to the BN parameters of the model (Wang et al., 2021). Other methods, especially used in object detection, utilize pseudo-labels to update the model and are focused on improving the quality of these predicted pseudo-labels (Chen et al., 2023; He et al., 2023). Typically, a student–teacher model is employed for this purpose (Chen et al., 2022, 2023; Deng et al., 2021; He et al., 2023; Li et al., 2022; Sinha et al., 2023). These self-training approaches, which are often combined with CL or other self-supervised learning techniques, assume that the pre-trained model on the source domain has some degree of generalization ability to the target domain due to the similarity between the two domains (Li et al., 2024; Wang et al., 2024). Therefore, the network parameters can be updated using the predictions on the new data, in order to improve performance.

Many works are designed under the assumption of having a batch of test data available, which is unrealistic for online real-world applications. Some other approaches focus on online TTA, where only one test sample is available in each step (Bartler et al., 2022; Zhang et al., 2022). Other studies require the retraining of the pre-trained model and cannot use it directly. For example, Liang et al. (2020) trained a specialized source model using the label-smoothing technique in conjunction with a weight normalization layer and Chen et al. (2023) had to compute the feature distributions of the source domain in an offline manner. Some methods use self-attention mechanisms to better focus on important features in the data (Li et al., 2024). There are also some approaches which aim to align the source and target domains, based on domain reconstruction techniques or to reduce domain shifts, based on extracted information from the data (Li et al., 2024).

2.3. Continual TTA

TTA methods aim to adapt the model to a single target domain while trained on a source domain. In real-world applications, a model can encounter many domain shifts, and it should be able to adapt to the gradual changes. Continual TTA considers the setting where the target domain is changing over time. TTA methods may lead to error accumulation and catastrophic forgetting when the target distribution is non-stationary (Wang et al., 2022).

The first method addressing the problem of continual TTA setting is CoTTA (Wang et al., 2022). CoTTA introduces weight and augmentation-averaged predictions to minimize error accumulation and SR to prevent forgetting. Some additional studies have been published that focus specifically on the continual TTA setting, such as RMT (Döbler et al., 2023) and AR-TTA (Sójka et al., 2023). RMT utilizes CL to bring the test feature space closer to the source domain, where the pre-trained model is well established. AR-TTA utilizes a small memory buffer to store exemplars from the source domain. All the aforementioned methods address the problem of continual TTA setting for classification, and some are extended for segmentation.

Research on continual TTA for object detection is not adequately explored. The proposed methods and the existing benchmarks are still quite limited. This problem was investigated in the “ICCV VCL 2023 Challenge B” (vcl_workshop_2023, 2023), but only one technical report has been published by one of the participating teams (Lin et al., 2023). Some other recent approaches include Mirza et al. (2023) and Yoo et al. (2024). Mirza et al. (2023) explore both classification and object detection, focusing primarily on online adaptation, but also tested in a continual adaptation scenario. The method adapts the model to out-of-distribution data by aligning activation statistics across multiple layers of the network. Yoo et al. (2024) investigate the online domain adaptation scenario for object detection in continually changing test domains, introducing architecture-agnostic adaptor modules that update only lightweight parts of the model to prevent catastrophic forgetting, resolve domain shifts through class-wise feature alignment, and determine when further adaptation is needed. Further development of specialized datasets, performance metrics, and baselines is essential, and more research is needed, in order to explore this area.

3. Methodology

In this work, we combine the SR technique used in CoTTA (Wang et al., 2022) with the MT framework for object detection. We further enhance the model’s performance by using object-level CL inspired by the novel CMT (Cao et al., 2023). Our approach does not use data from the source domain during adaptation, and the pre-trained model can be deployed directly, without requiring retraining. We aim at enhancing the performance of the pre-trained model on the continually changing target domain by leveraging the sequentially provided test data. The model only has access to the test data of the current time step, which leads our setting to be an online setting. An overview of our proposed method is given in Figure 1.

Figure 1.

An Overview of Our Proposed Method. The Original Target Image is Fed Into the Teacher Model, While the Student Model Receives two Different Strongly Augmented Views. The Student Detector is Updated Based on the Final Loss Through Backpropagation. The Teacher’s Weights Are Updated as the EMA of the Student’s Weights. A Small Part of the Student’s Weights is Stochastically Restored to the Source Pre-Trained Model’s Weights.

3.1. MT Framework

The MT architecture was initially proposed for semi-supervised learning, but was extended for unsupervised or self-supervised learning by leveraging pseudo-labeling. The MT for object detection consists of two identical detectors, which relate to the teacher and the student, respectively. Both detectors take as input the images from the target domain. The teacher generates pseudo-labels based on its predictions on the target data.

A consistency loss (bounding box regression loss and classification loss) is calculated between the pseudo-labels generated by the teacher detector and the predictions of the student. The student detector is updated based on the consistency loss through backpropagation. Then, the teacher’s weights $θ^{T}$ are updated as the EMA of the student’s weights $θ^{S}$ :

\begin{aligned} θ_{t}^{T} \leftarrow α θ_{t - 1}^{T} + (1 - α) θ_{t}^{S}, \end{aligned}

(1)

where

α

is a smoothing coefficient hyper-parameter and is usually large to ensure smooth teacher updates. A teacher can be considered as a weighted average of consecutive student models and is more robust and capable of preserving knowledge from previous domains. In contrast, the student gains knowledge for new domains but tends to forget previous domains more easily. Final detections are generated by the teacher. To ensure the high quality of the pseudo-labels and further boost the MT , the input of the student is strongly augmented (e.g., color jittering, blurring, and random erasing). The key idea is to enforce consistency between the teacher’s predictions and the student’s predictions. The models learn to produce consistent outputs over different augmentations of the same image.

The mutual knowledge transfer that occurs in the MT framework has demonstrated promising results for domain adaptive object detection (Cao et al., 2023). Unfortunately, self-training can suffer from noisy pseudo-labels. As the domain continually changes, pseudo-labels become unreliable and mis-calibrated. This can result in a degradation in the performance of the detector. Many works have focused on correcting the pseudo-labels and trying to make them more reliable (Chen et al., 2023; He et al., 2023). In object detection, correcting pseudo-labels presents a greater challenge compared to classification, as it involves adjusting not just the labels themselves but also their corresponding positions. Furthermore, in continual TTA, the adaptation for a long time may lead to catastrophic forgetting. Inspired by the work in CMT (Cao et al., 2023), we use CL, which does not depend on accurate pseudo-labels during the learning process. Additionally, to minimize forgetting, we employ SR as presented in CoTTA (Wang et al., 2022). A detailed description of these methods is given in the following sections.

3.2. Stochastic Restoration (SR)

Continual TTA for a long time, by MT self-training can lead to forgetting. This happens because when there is a sequence of domain shifts, the model is continually adapted to the new domains, and after many updates of the model, the knowledge from the initial source domain is lost. Moreover, strong domain shifts may lead to very noisy pseudo-labels, introducing errors in the model’s updates. If the model’s update is significantly incorrect after encountering challenging examples, it may struggle to recover, even when the new data are not severely shifted. To preserve the knowledge of the source domain and mitigate the impact of incorrect model updates, we utilize the SR method proposed in CoTTA (Wang et al., 2022). In this approach, we stochastically restore a small part of the student’s weights to the source pre-trained model’s weights during adaptation.

A convolution layer within the student model, after a gradient update at time step $t$ is:

\begin{aligned} x_{l + 1} = W_{t + 1} * x_{l}, \end{aligned}

(2)

where * denotes the convolution operation, index

l

denotes the layer within the student model,

x_{l}

and

x_{l + 1}

denote the input and output to this layer, respectively, and

W_{t + 1}

, denotes the trainable convolution filters. The proposed SR method additionally updates the weight

W

as follows:

\begin{aligned} M & \sim Bernoulli (p) \end{aligned}

(3)

\begin{aligned} W_{t + 1} & = M \cdot W_{0} + (1 - M) \cdot W_{t + 1}, \end{aligned}

(4)

where

\cdot

denotes the element-wise multiplication,

p

is a small restoration probability, and

M

is a mask tensor of the same dimension as

W_{t + 1}

. Based on the mask tensor, the elements within

W_{t + 1}

that are selected will be restored back to the source weight

W_{0}

This restoration mechanism aims to stabilize learning by preserving source knowledge and reducing catastrophic forgetting during continual adaptation. SR can be interpreted as a structured regularization technique similar to dropout (Srivastava et al., 2014). By randomly resetting a small subset of trainable weights to their initial values prevents the model from deviating excessively from the source representation, thereby reducing the risk of catastrophic forgetting. This mechanism allows the entire network to remain fully trainable without suffering from model collapse, thus offering greater flexibility during adaptation.

3.3. Object-Level CL

In addition to the consistency loss discussed in Section 3.1, we also employ a contrastive loss for updating the student. This method was inspired by CMT (Cao et al., 2023) and is capable of generating robust features by bringing together representations of similar instances, while pushing apart representations of dissimilar instances, even when the pseudo-labels are noisy.

Initially, we extract object-level features based on the features and the pseudo-labels generated by the teacher. We then utilize these features to calculate a class-based contrastive loss, inspired by supervised CL (Khosla et al., 2020) and self-supervised CL (Chen et al., 2020b).

3.3.1. Object-Level Features

We generate pseudo-labels using the teacher detector. Regions of interest (RoIs) are given by the bounding boxes of the predicted pseudo-labels. Next, we extract the feature maps of the student’s and the teacher’s backbones. To extract object-level features from RoIs within an image, we use RoIAlign (He et al., 2017). This pooling operation addresses challenges related to misalignment between the RoIs and the underlying feature maps. Finally, we normalize the extracted object-level features according to standard procedures, as described in Khosla et al. (2020). If augmentations on the input images alter bounding boxes differently, we need to transform the bounding boxes of the pseudo-labels to align the two feature maps.

3.3.2. Class-Based Contrastive Loss

Inspired by supervised CL (Khosla et al., 2020), we utilize the classes predicted by the teacher to calculate the class-based contrastive loss:

\begin{aligned} L_{contrastive} = \frac{λ}{N} \sum_{i = 1}^{N} \frac{- 1}{| P (i) |} \sum_{p \in P (i)} \log \frac{\exp (z_{i}^{S} \cdot z_{p}^{T} / τ)}{\sum_{j = 1}^{N} \exp (z_{i}^{S} \cdot z_{j}^{T} / τ)}, \end{aligned}

(5)

where

P (i) = {p | C_{p} = C_{i}, p \in {1, \dots, N}}

is the positive pair set that includes all objects of the same predicted class as object

i

N

represents the total number of generated bounding boxes. The hyperparameter

λ

determines how much emphasis is put on the contrastive loss during training. Increasing the value of

λ

strengthens the impact of the contrastive term, encouraging the student model to produce feature representations that are more closely aligned with those of the teacher. The temperature parameter

τ

controls the sharpness of the similarity distribution in contrastive loss (Chen et al., 2020b). Lower values of

τ

sharpen the distribution, making the model more focused on distinguishing positive pairs from negative pairs, while higher values produce a smoother distribution that treats all pairs on an equal basis.

z_{i}^{T}

and

z_{i}^{S}

denote the object-level normalized features for the teacher and the student, which are achieved as follows:

\begin{aligned} z_{i}^{M} = Normalize (RoIAlign ({FeatureMap}^{M}, {BoundingBoxes}^{T})), \end{aligned}

(6)

where

M

is either

T

(teacher) or

S

(student). RoIAlign is utilized to extract object-level features from RoIs by aligning the RoIs with the underlying feature map. The normalization of these object-level features is performed based on the Euclidean distance (

L_{2}

normalization), where each feature vector is scaled by its Euclidean norm as follows:

z_{i}^{M} = z_{i}^{M} / max (‖ z_{i}^{M} ‖_{2}, 10^{- 12})

The object-level CL method is illustrated in Figure 2. In this example, the teacher’s object-level features of the original images are pulled closer to the student’s object-level features of the strongly augmented images when they have the same pseudo-labels, while features with different pseudo-labels are pushed apart (supervised contrastive loss).

Figure 2.

Object-Level Contrastive Learning Method. Object-Level Features Are Extracted From Regions of Interest (RoIs) by using the Pseudo-Labels Along With the Feature Maps From Both the Teacher and Student Models. Contrastive Learning Aims to Pull Together Representations of Similar Instances (Shown by “Green” arrows) While Pushing Apart Representations of Dissimilar Instances (Shown by “Red” Arrows). The Continuous Arrows Illustrate the Relationships Between the First Detection of the Teacher and all Student Detections. The Dashed Arrows Represent Relationships Between the Second Teacher Detection and the Student Detections, While the Dash-dot Arrows Indicate Relationships Involving the Third Teacher Detection.

3.3.3. Multi-Scale Feature Maps

To maximize the benefits of object-level CL, multi-scale feature maps are extracted from the YOLOX detector. Our architecture employs a CSPDarknet backbone and a YOLOXPAFPN neck, which together produce three scales of feature maps corresponding to down-sampling factors of $\times 8$ , $\times 16$ , and $\times 32$ relative to the input resolution. Each feature map has 320 channels and spatial dimensions that vary depending on the input image size. These multi-scale features enable the model to capture object representations at different spatial resolutions. The object-level contrastive loss is computed independently at each of the three scales, and the final contrastive loss is the sum of all three losses. This approach enables the model to learn more comprehensive and robust representations that generalize across varying object sizes and image contexts.

Finally, $L_{contrastive}$ is added to the consistency loss, described in Section 3.1 and the student detector is updated through gradient descent using the following loss:

\begin{aligned} L_{Final} = L_{Consistency} + L_{Contrastive} . \end{aligned}

(7)

4. Experiments

4.1. Datasets, Settings, and Metrics

In our experimental work, a consistent evaluation framework was employed that comprises a variety of standard datasets, namely SHIFT (Sun et al., 2022), KITTI (Geiger et al., 2013, 2012), Cityscapes (Cordts et al., 2016; Sakaridis et al., 2018), C-CLAD (Verwimp et al., 2023), and COCO-C (Hendrycks & Dietterich, 2019; Lin et al., 2014).

All datasets but the SHIFT dataset consist of images, while the SHIFT dataset is composed of videos. Our experiments followed a scenario as in Mirza et al. (2023) and Yoo et al. (2024), where they begin with a pre-trained model on the source domain, undergo several consecutive domain shifts, and finally return to the source domain to assess how well knowledge is preserved during the adaptation process. Synthetic images simulating domain shifts are used in four out of five datasets, with CLAD-D being the only one that includes realistic images from shifted domains.

Detailed information for understanding the particularities of each dataset is given in the following sections.

4.1.1. SHIFT

We evaluate our method on the continuous validation set of the SHIFT dataset (Sun et al., 2022), and we follow the setting that was specified by “ICCV VCL 2023 Challenge B” (vcl_workshop_2023, 2023). The SHIFT dataset consists of six classes and is divided into two main sets. The discrete set contains images depicting various weather conditions and times of day, while the continuous set consists of 40-second video sequences in which the driving conditions gradually change. The weather conditions in the dataset include clear, foggy, cloudy, overcast, and rainy, while the times of the day comprise daytime, night, and dawn/dusk.

We use the pre-trained model that was provided by the organizers of the challenge, which is a YOLOX object detector, trained on the SHIFT clear-daytime discrete train/val set. We evaluate our method on six validation videos presenting continuous domain shifts starting from the clear-daytime conditions. Another method that participated in the competition is outlined in Lin et al. (2023). However, this method uses information about domain shifts observed in the validation set and simulates these shifts during training to enhance the model’s performance. Thus, it may not be effective when domain shifts are unknown during the training process. Our approach is evaluated in a per-sequence manner, where the model is assessed on each sequence separately. After the evaluation in each sequence, the MT model (both teacher and student detectors) is reset to its source state before evaluating the next sequence. The evaluation metric is the average performance across all sequences.

To assess the effectiveness of our approach in adapting to the shifted domain while also retaining knowledge from the source domain, we divide each video into three parts. Then, we calculate the mean average precision (mAP) for each of these parts rather than solely reporting the overall mAP for the whole sequence. The first part contains information only from the source domain (first 20 frames), the middle part contains information about the shifted target domain (frames 180–220), and the last part loops back to the source domain (last 20 frames). These three mAP values are denoted as mAP_Source, mAP_Target, and mAP_Loopback, respectively. Then, we use the “Drop” metric that was specified in the “ICCV VCL 2023 Challenge B”:

\begin{aligned} Drop = mAP\_Source - mAP\_Target . \end{aligned}

(8)

This metric demonstrates the ability of our model to adapt to the target domain. A small value indicates that the model retains optimal performance even in shifted target domains. Finally, we calculate “Overall” metric as follows:

\begin{aligned} Overall = Ave\_mAP - 2 \cdot Drop, \end{aligned}

(9)

where Ave_mAP denotes the total mAP for the entire sequence. Subtracting by a factor of two gives more emphasis to the adaptation to the target domain. The final ranking in “ICCV VCL 2023 Challenge B” was determined by the “Overall” metric. Finally, “Forgetting” metric is proposed:

\begin{aligned} Forgetting = mAP\_Source - mAP\_Loopback . \end{aligned}

(10)

This metric estimates the amount of information loss by the model after adaptation. A small value indicates that the model has successfully preserved the knowledge learned in the source domain, while a negative value suggests that the adaptation process has enhanced the model’s ability to generalize within the source domain.

4.1.2. KITTI

The KITTI (Geiger et al., 2012, 2013) training set consists of 7,481 images, which are divided into two parts. One part, containing 3,740 images, is used as a training set, while the other part, comprising 3,741 images, is designated as the validation set. The network is initially trained on the 3,740 images from the training set. Afterward, the 3,741 images from the validation set are used to assess the network’s performance. This validation set serves as the source domain. Subsequently, we simulate four synthetic domains under varying weather conditions, as described in publications (Mirza et al., 2023; Yoo et al., 2024). Finally, we evaluate our proposed adaptation method on the following scenarios: Fog, Rain, Snow, and Clear, which includes a total of 14,964 images for domain adaptation across these four distinct domains. The Clear domain contains the images of the validation set of the Cityscapes dataset and is used to estimate the performance on the source domain after the adaptation.

4.1.3. Cityscapes

The network is initially trained on the Cityscapes (Cordts et al., 2016) training set, which consists of 2,975 images, and uses the validation set as the source domain, which includes 500 images. The Foggy Cityscapes dataset (Sakaridis et al., 2018) is a synthetic dataset that simulates fog in real-world urban scenes. The evaluation of the adaptation method during testing is conducted at three increasing levels of fog density. Finally, a set of clear-weather images from the Cityscapes validation set is included to assess the model’s performance under normal conditions after the adaptation process. Overall, the method is evaluated under the scenarios Low Fog, Medium Fog, High Fog, and Clear, which contains a total of 2,000 images.

4.1.4. CLAD-D

The “Continual Learning for Autonomous Driving (CLAD)” benchmark (Verwimp et al., 2023) is dedicated to autonomous driving, focusing on the problems of object classification and object detection. In this work, we encounter the dataset related to CLAD-D, which is a domain incremental continual object detection benchmark. It consists of four domains: Clear Weather-Daytime-City Streets, Clear Weather-Daytime-Highway, Night, and Rain-Daytime. This dataset is specifically designed for continual learning, where the model is trained sequentially across these domains and its performance is subsequently evaluated on all four. We introduce a setting suitable for continual TTA scenarios, based on these four domains. The domain Clear Weather-Daytime-City Streets is used for training the network, and the validation set of this domain serves as the source domain. This dataset contains 4,470 training images and 497 validation images. Then, the test sets from all four domains are used in the following sequence: Clear Weather-Daytime-Highway, Night, Rain, and Clear Weather-Daytime-City Streets. In total, 9,969 images are used for evaluating our domain adaptation technique.

4.1.5. COCO-C

The COCO dataset (Lin et al., 2014) is one of the most extensive and widely used datasets for training and evaluating computer vision algorithms. The COCO-C or corrupted COCO dataset simulates continuous and drastic domain shifts in the images of the COCO dataset. COCO-C is created using 15 types of realistic image corruptions (Hendrycks & Dietterich, 2019), such as image distortion or various weather conditions, to simulate various domain shifts from the original domain. For training the network, the 118,287 images of the COCO training set are used, and the 5,000 images of the validation set serve as the source domain. During adaptation, the model’s performance is evaluated sequentially on each set of corrupted images. Finally, the model is assessed on the original COCO validation set, referred to as Original, to evaluate its performance on the source domain after adaptation. This setting includes a total of 16 domains, and the adaptation method is evaluated on 80,000 images. The sequence of corruptions is as follows: Gaussian-Noise, Shot-Noise, Impulse-Noise, Defocus-Blur, Glass-Blur, Motion-Blur, Zoom-Blur, Snow, Frost, Fog, Brightness, Contrast, Elastic-Transform, JPEG-Compression, Pixelate, and Original. The same scenario is used in Yoo et al. (2024). This setting is particularly interesting for evaluating our adaptation method on long sequences and analyzing its performance.

4.2. Implementation Details

As mentioned earlier, we use the YOLOX object detector. In the MT framework, we set the smoothing coefficient $α$ =0.99965, the temperature $τ = 0.07$ , the contrastive weight $λ = 0.003$ , and the restoration probability p=0.025. We also have a weight for the consistency loss that we set to 0.005. At each step, the detectors take as input only one image to maintain an online setting; thus, the batch size is set to one. Consistency loss is calculated using the mean squared error between the classification scores, the predicted bounding boxes, and the objectness scores of the teacher and the student detectors. The teacher takes as input the original image, and the student takes as input two different augmented views of the same image in each step. The augmentations we use are “ColorJitter,” “Solarize,” “RandomGrayscale,” “GaussianBlur,” and “RandomErasing.” To ensure the consistency of the predictions of the two detectors, the same augmented views of the input image are given multiple times to the student detector. Each time the student makes a prediction, the loss is calculated, and both detectors are updated accordingly. This approach provides the student with multiple chances to learn from a particular instance of an image. We choose to show the same views of an image five times.

For the KITTI and CLAD-D datasets, the smoothing coefficient is $α = 0.99995$ and the restoration probability is $p = 0.01$ . A larger smoothing coefficient results in slower updates of the teacher model, leading to slower adaptation. A smaller restoration probability means fewer parameters are restored to the source pre-trained model’s values during adaptation. As a result, the model adapts more quickly to the shifted domains, but less knowledge from the source domain is retained. For the Cityscapes dataset, $α = 0.99965$ and $p = 0.01$ . For the COCO-C dataset, $α = 0.995$ and $p = 0.01$ . The smaller smoothing coefficient for COCO facilitates quicker adaptation in each shifted domain.

The hyperparameter tuning experiments revealed a clear tradeoff between adaptation and forgetting. When the model adapts too quickly to domain shifts, it tends to forget previously learned knowledge, especially in cases where large or abrupt adaptation steps are applied. Such rapid changes can lead to model degradation over time. To mitigate this, it is essential to find an optimal balance, where the model adapts efficiently to new domains while maintaining the integrity of previously learned information.

To reduce the computational complexity, SR occurs only after the last update of the student for the same augmented views of the input image, ensuring that intermediate steps do not require additional computational resources. In view of this, for each input image, the augmented views are processed five times. The model is also updated five times, with SR occurring only after the final update.

In CL, the teacher’s predictions are filtered by a confidence threshold, and non-maximum suppression is performed to generate the pseudo-labels. We use a confidence threshold of 0.7 and an intersection over union (IOU) threshold of 0.7 to retain predictions that are likely to be correct. Additionally, we extract multi-scale features from three stages of the backbone network. Other hyperparameters are the confidence threshold of the detectors, which is set to 0.01, the IOU threshold for non-maximum suppression is set to 0.7, and the optimizer used is stochastic gradient descent with a learning rate 0.00025, momentum 0.9, and weight decay 0.0005.

Last but not least, for the sake of clarity, it should be mentioned that the only difference from the metrics used for the SHIFT dataset lies in the definition of each domain. The source domain corresponds to the validation set of the domain on which the model was trained, and mAP_Source is calculated before the adaptation. The target domain includes all images where domain shifts occur. Finally, the loopback domain is identical to the source domain, but mAP_Loopback is calculated after completing the adaptation process. Ave_mAP indicates the total average mAP across all domains, including the loopback domain, while Target_Ave_mAP denotes the total average mAP for the shifted domains only.

4.3. Results

In Table 1, we report our results on the validation set of the SHIFT dataset (Sun et al., 2022) for sequences starting from clear-daytime conditions. Additionally, we include the baseline results for “no adaptation” and “MT” adaptation. Our method shows an improvement of +1.0 Ave_mAP compared to the existing “no adaptation” baseline. When SR and object-level CL are combined with MT, the “Drop” is reduced to 4.1, and the reduction equals to $- 2.2$ . Following the metric utilized in the “ICCV VCL 2023 Challenge,” the “Overall” metric is enhanced to 31.3, reflecting an improvement of +5.4. It is evident that SR and CL techniques contribute to an increase in overall mAP, while minimizing mAP drop. Additionally, the SR technique plays a crucial role in addressing the catastrophic forgetting introduced by the MT network, indicated by an increase in mAP loopback.

Table 1.
Comparative Performance in Terms of mAP of our Proposed Method and the Baselines of “no adaptation” and “MT.” The MT Model is Reset to its Source State at the Beginning of Each Sequence.

Method mAP_Source mAP_Target mAP_Loopback Ave_mAP Drop Forgetting Overall

No adaptation 46.9 40.6 47.1 38.5 6.3 −0.2 25.9

MT baseline 47.0 42.2 46.5 38.6 4.8 0.5 29.0

$MT + CL$ 47.0 42.4 46.5 39.2 4.6 0.5 30.0

$MT + SR$ 47.0 42.3 46.9 39.4 4.7 0.1 30.0

$MT + SR + CL$ 47.0 42.9 46.6 39.5 $(+ 1.0)$ $4.1 (- 2.2)$ 0.4 31.3 $(+ 5.4)$

Note. mAP = mean average precision; MT = mean teacher; CL = contrastive learning; SR = stochastic restoration. The bold values highlight the best-performing results.

To simulate a more realistic setting, we conduct the same experiments as above, but without resetting the model to its source state at the beginning of each sequence. As previously discussed, TTA methods often encounter issues with overfitting when applied in dynamic environments, leading to catastrophic forgetting and subsequent deterioration of the detector’s performance. Resetting the model to its original source state can mitigate this problem, enabling the use of TTA methods even in continually changing target domains. However, in real-world applications, it is difficult to determine the appropriate timing for resetting to the source model. A recent study (Chakrabarty et al., 2023) introduces a technique to detect significant domain shifts and perform a reset to the source model accordingly. We choose to evaluate our method without resetting to the source model. The results of our experiments are presented in Table 2.

Table 2.

Comparative Performance in Terms of mAP of Our Proposed Method and the Baselines of “No Adaptation” and “MT.” The MT Model is not Reset to its Source State at the Beginning of Each Sequence.

Method	mAP_Source	mAP_Target	mAP_Loopback	Ave_mAP	Drop	Forgetting	Overall
No adaptation	46.9	40.6	47.1	38.5	6.3	−0.2	25.9
MT baseline	44.4	38.6	42.8	35.9	5.8	1.6	24.3
$MT + CL$	43.1	37.7	45.0	35.1	5.4	−1.9	24.3
$MT + SR$	46.6	42.8	45.6	39.3	3.8	1.0	31.7
$MT + SR + CL$	46.6	43.0	45.6	39.3 $(+ 0.8)$	3.6 $(- 2.7)$	1.0	32.1 $(+ 6.2)$

Note. mAP = mean average precision; MT = mean teacher; CL = contrastive learning; SR = stochastic restoration. The bold values highlight the best-performing results.

We observe a significant deterioration in the performance of the “MT” baseline model without resetting. Specifically, the Ave_mAP for the MT baseline, which is 35.9, is even lower than that with “No Adaptation,” which is 38.5. Although the “Drop” is reduced to 5.8, the overall performance of the detector is degraded. Catastrophic forgetting is evident, particularly from the noticeable drop in the mAP_Loopback metric, which equals to 42.8.

The SR technique demonstrates its capability to improve over the “no adaptation” baseline, even without resetting to the source model, achieving an Ave_mAP of 39.3. When we employ only the CL with MT, we observe a significant drop in the performance of the detector, resulting in a 35.1 Ave_mAP. However, when both SR and CL are utilized with MT, the performance of the detector is improved, “Drop” is reduced, and the Ave_mAP is slightly lower than when we used the resetting and equals 39.3. Nevertheless, some forgetting is still present as indicated by the lower mAP_Loopback, which is 45.6. The adaptation to the target domain is improved, as indicated by the reduced “Drop” of 3.6.

Therefore, our method proves to be suitable and capable of retaining optimal performance even when the target domain is continually changing. This is primarily due to the SR technique, which can be characterized as a continual TTA method. The MT network alone faces significant issues when domains change continuously during adaptation. Consequently, it is not suitable for continual TTA problems, as it introduces catastrophic forgetting. Methods that utilize CL can, in some cases, mitigate forgetting by learning more robust and generalized features, leading to better adaptation to new domains. However, in the case of rapid domain shifts or long-term adaptation, catastrophic forgetting remains unavoidable. SR appears to be the most effective method for reducing catastrophic forgetting. This approach, which can be easily combined with other techniques, such as the MT framework, is highly effective in addressing continual TTA problems. Additionally, utilizing object-level CL can further enhance the adaptation process.

In Table 3, we present experimental results for the KITTI dataset under the continual setting mentioned above, including fog, rain, snow, and clear weather conditions. For the “no adaptation” baseline, “Forgetting” is 0.0 because the model is frozen. Therefore, the performance on the same images from the source domain, which are used as the loopback domain, remains exactly the same. The MT method shows noticeable improvements across all domain shifts except for the last one, suggesting that some knowledge from the source domain has been lost, as indicated by the “Forgetting” value, which increases to 2.5.

Table 3.

Comparative Performance in Terms of mAP for the KITTI Dataset Under Different Weather Conditions and Clear Weather.

Method	Fog	Rain	Snow	Clear	Target_Ave_mAP	Ave_mAP	Drop	Forgetting
No adaptation	22.4	37.9	28.0	50.1	29.4	33.6	20.7	0.0
MT	22.7	38.6	42.8	47.6	34.7	36.3	15.4	2.5
$MT + CL$	22.7	40.2	46.9	56.0	36.6	39.7	13.5	−5.9
$MT + SR$	22.7	40.2	47.5	57.7	36.8	41.2	13.3	−7.6
$MT + SR + CL$	22.7	40.3	47.6	57.7	36.9 $(+ 7.5)$	41.3 $(+ 7.7)$	13.2 $(- 7.5)$	−7.6

Note. mAP = mean average precision; MT = mean teacher; CL = contrastive learning; SR = stochastic restoration. The bold values highlight the best-performing results.

When SR and CL are combined with MT, significant improvements are observed. The Target_Ave_mAP increases to 36.9, showing an improvement of +7.5 over the “no adaptation” baseline, and the Ave_mAP rises to 41.3, an improvement of +7.7. The “Drop” is reduced by $- 7.5$ to 13.2, and “Forgetting” improves to $- 7.6$ , demonstrating the robustness of this method across diverse weather conditions and its enhanced ability to detect objects, even outperforming the pre-trained model on the source domain. Our proposed approach outperforms the MT baseline in nearly all shifted domains.

In Table 4, the mAP@50 results are shown for the KITTI dataset. The MT method shows improvements, particularly for the “Snow” domain, with no evidence of forgetting. This suggests that although localization accuracy, as indicated by the mAP results, is slightly reduced, the model remains capable of detecting objects in the source domain with similar effectiveness as before.

Table 4.

Comparative Performance in Terms of mAP@50 for the KITTI Dataset Under Different Weather Conditions and Clear Weather.

Method	Fog	Rain	Snow	Clear	Target_Ave_mAP	Ave_mAP	Drop	Forgetting
No adaptation	43.3	76.6	48.2	78.0	56.0	60.9	22.0	0.0
MT	44.1	79.6	74.9	79.8	66.2	68.9	11.8	−1.8
$MT + CL$	43.9	80.1	77.2	84.7	67.1	71.1	10.9	−6.7
$MT + SR$	43.9	80.1	77.5	85.2	67.2	71.4	10.8	−7.2
$MT + SR + CL$	43.9	80.1	77.6	85.2	67.2 $(+ 11.2)$	71.4 $(+ 10.5)$	10.8 $(- 11.2)$	−7.2

Note. mAP = mean average precision; MT = mean teacher; CL = contrastive learning; SR = stochastic restoration. The bold values highlight the best-performing results.

MT with SR and CL leads to further improvements. The Target_Ave_mAP increases to 67.2, a +11.2 improvement over the “no adaptation” baseline, and the Ave_mAP increases to 71.4, showing a +10.5 improvement. The “Drop” is reduced by 11.2, and “Forgetting” improves to $- 7.2$ . Our method outperforms both baselines by a considerable margin. Only in the “Fog” domain are the MT results better, likely because our method focuses on a more gradual and steady adaptation, which over time, leads to robust results and enhances performance even for the original images.

In Table 5, we present experimental results for the Cityscapes dataset under varying fog levels and clear conditions. The MT method shows improvements, particularly for the low and medium fog conditions, with minimal evidence of forgetting. Although the performance slightly decreases for the high fog condition, the model remains effective in detecting objects.

Table 5.

Comparative Performance in Terms of mAP for the Cityscapes Dataset Under Varying Fog Levels and Clear Conditions.

Method	Low fog	Medium fog	High fog	Clear	Target_Ave_mAP	Ave_mAP	Drop	Forgetting
No adaptation	40.0	38.9	36.5	40.3	38.5	38.7	1.8	0.0
MT baseline	41.6	40.4	37.4	42.4	39.8	40.1	0.5	−2.1
$MT + CL$	41.1	40.3	37.6	42.6	39.7	40.2	0.6	−2.3
$MT + SR$	41.1	40.3	37.6	42.6	39.7	40.2	0.6	−2.3
$MT + SR + CL$	41.1	40.3	37.6	42.6	39.7 $(+ 1.2)$	40.2 $(+ 1.5)$	0.6 $(- 1.2)$	−2.3

Note. mAP = mean average precision; MT = mean teacher; CL = contrastive learning; SR = stochastic restoration. The bold values highlight the best-performing results.

When SR and CL are combined with MT, the Target_Ave_mAP is 39.7, a +1.2 improvement over the “no adaptation” baseline, but there is a small decrease, $- 0.1$ over the “MT” baseline. However, there is a slight decrease of $- 0.1$ compared to the “MT” baseline. On the other hand, the Ave_mAP increases to 40.2, demonstrating a +1.5 improvement, emphasizing that our method enhances overall performance despite minor domain-specific fluctuations. The “Drop” is reduced by 1.2, and “Forgetting” improves to $- 2.3$ . Overall, our proposed technique improves adaptation and effectively minimizes forgetting. However, in certain domains, such as Low Fog and High Fog, it does not achieve the best performance.

Table 6 presents the results in terms of the mAP@50 metric. More subtle performance differences are observed across the various methods. The proposed technique continues to demonstrate consistent performance gains in terms of adaptation and stability, as evidenced by slight improvements in Target_Ave_mAP and Ave_mAP metrics and reductions in “Drop” and “Forgetting.”

Table 6.

Comparative Performance in Terms of mAP@50 for the Cityscapes Dataset Under Varying Fog Levels and Clear Conditions.

Method	Low fog	Medium fog	High fog	Clear	Target_Ave_mAP	Ave_mAP	Drop	Forgetting
No adaptation	63.1	61.5	57.3	63.4	60.6	61.2	2.8	0.0
MT baseline	63.8	61.9	56.3	64.9	60.7	61.5	2.7	−1.5
$MT + CL$	63.5	62.2	57.1	65.0	60.9	61.7	2.5	−1.6
$MT + SR$	63.5	62.2	57.1	65.0	60.9	61.7	2.5	−1.6
$MT + SR + CL$	63.5	62.2	57.1	65.0	60.9 $(+ 0.3)$	61.7 $(+ 0.5)$	2.5 $(- 0.3)$	−1.6

Note. mAP = mean average precision; MT = mean teacher; CL = contrastive learning; SR = stochastic restoration. The bold values highlight the best-performing results.

In Table 7, we present experimental results for the CLAD-D dataset under various environmental conditions and scenes. The MT method demonstrates slight improvements over “no adaptation,” especially for the clear highway and night conditions, although some performance degradation is observed for Rain and Clear City Street. There is a noticeable amount of forgetting, as indicated by the forgetting score of 3.6. However, when SR and CL are combined with MT, the Target_Ave_mAP increases to 46.6, a +0.7 improvement over the “no adaptation” baseline, and the Ave_mAP increases by +1.0 to 48.5. Additionally, the “Drop” is reduced by 0.7, and “Forgetting” improves significantly to $- 1.3$ , further showcasing the ability of our technique to minimize forgetting and enhance adaptation.

Table 7.

Comparative Performance in Terms of mAP for the CLAD-D Dataset Across Different Environmental Conditions and Scenes.

Method	Clear highway	Night	Rain	Clear city street	Target_Ave_mAP	Ave_mAP	Drop	Forgetting
No adaptation	51.9	38.0	47.8	58.5	45.9	47.5	12.6	0.0
MT	52.4	38.3	46.6	54.9	45.8	45.6	12.7	3.6
$MT + CL$	52.1	38.8	49.0	59.4	46.6	48.2	11.9	−0.9
$MT + SR$	52.1	38.8	49.0	59.7	46.6	48.4	11.9	−1.2
$MT + SR + CL$	52.1	38.8	49.0	59.9	46.6 $(+ 0.7)$	48.5 $(+ 1.0)$	11.9 $(- 0.7)$	−1.3

Note. mAP = mean average precision; MT = mean teacher; CL = contrastive learning; SR = stochastic restoration. The bold values highlight the best-performing results.

Table 8 presents experimental results in terms of mAP@50. As with Table 7, the MT, SR, and CL combination outperforms both the “no adaptation” and MT baselines, yielding improvements across evaluation metrics and reinforcing the robustness of the proposed approach.

Table 8.

Comparative Performance in Terms of mAP@50 for the CLAD-D Dataset Across Different Environmental Conditions and Scenes.

Method	Clear highway	Night	Rain	Clear city street	Target_Ave_mAP	Ave_mAP	Drop	Forgetting
No adaptation	71.0	57.3	69.6	79.5	66.0	68.4	13.5	0.0
MT	71.7	58.1	69.6	77.7	66.5	67.3	13.0	1.8
$MT + CL$	71.2	58.1	70.6	80.0	66.6	69.2	12.8	−0.5
$MT + SR$	71.2	58.0	70.5	80.1	66.5	69.1	12.9	−0.6
$MT + SR + CL$	71.2	58.1	70.6	80.1	66.6 $(+ 0.6)$	69.2 $(+ 0.8)$	12.8 $(- 0.7)$	−0.6

Note. mAP = mean average precision; MT = mean teacher; CL = contrastive learning; SR = stochastic restoration. The bold values highlight the best-performing results.

In Table 9, we present experimental results for the COCO-C dataset, showcasing performance across various noise, blur, weather, and digital corruptions. The “no adaptation” method shows limited performance across all conditions, with values ranging from 1.1 to 35.6, reflecting its struggle to adapt to the synthetic distortions. The “MT Baseline” method shows improvements, particularly in the “Noise” and “Blur” categories, where it achieves better results compared to “no adaptation.” Significant degradation of the model is observed in the “Org” domain, which contains the original COCO images. Both Target_Ave_mAP and Ave_mAP show a noticeable decline, indicating a degradation in performance. The “Drop” metric increases, suggesting that the model struggles to adapt to the new challenging domains, and the amount of forgetting is substantial, reaching 33.6, highlighting the model’s difficulty in retaining previously learned knowledge.

Table 9.

Comparative Performance in Terms of mAP for the COCO-C Dataset, Showing Performance Across Different Types of Noise, Blur, Weather, and Digital Distortions.

	Noise			Blur				Weather				Digital
Method	Gau	Sht	Imp	Def	Gls	Mtn	Zm	Snw	Frs	Fog	Brt	Cnt	Els	Px	Jpg	Org	Target_Ave_mAP	Ave_mAP	Drop	Forgetting
No adaptation	1.6	1.7	1.1	9.6	5.5	8.4	4.5	23.7	29.5	35.6	34.3	10.6	17.5	5.0	9.9	43.3	13.2	15.7	30.1	0.0
MT baseline	12.6	15.9	14.2	2.7	2.1	3.4	1.5	8.5	10.1	10.4	12.1	0.9	4.1	1.8	3.5	9.7	6.9	6.8	36.4	33.6
$MT + CL$	13.5	13.4	12.5	6.3	5.9	6.7	3.5	14.3	15.4	18.4	20.3	6.0	11.9	4.9	7.3	19.2	10.7	10.6	32.6	24.1
$MT + SR$	13.8	14.7	14.9	6.4	6.7	7.7	4.6	24.0	27.9	33.7	34.8	12.5	21.7	11.2	15.1	42.2	16.6	18.2	26.7	1.1
$MT + SR + CL$	13.8	14.8	14.9	6.4	6.7	7.7	4.6	24.0	28.0	33.7	34.8	12.5	21.8	11.3	15.2	42.2	16.7 $(+ 3.5)$	18.3 $(+ 2.6)$	26.6 $(- 3.5)$	1.1

Note. mAP = mean average precision; MT = mean teacher; CL = contrastive learning; SR = stochastic restoration. The bold values highlight the best-performing results.

Table 10.

Comparative Performance in Terms of mAP@50 for the COCO-C Dataset, Showing Performance Across Different Types of Noise, Blur, Weather, and Digital Distortions.

	Noise			Blur				Weather				Digital
Method	Gau	Sht	Imp	Def	Gls	Mtn	Zm	Snw	Frs	Fog	Brt	Cnt	Els	Px	Jpg	Org	Target_Ave_mAP	Ave_mAP	Drop	Forgetting
No adaptation	2.5	2.7	1.6	16.4	9.5	16.4	11.9	38.2	46.4	54.4	52.9	17.7	29.9	7.9	16.8	63.5	21.7	29.4	41.8	0.0
MT baseline	21.2	26.1	24.0	4.6	3.6	6.2	4.3	14.2	17.1	17.6	20.3	1.9	7.6	3.1	6.1	16.9	11.9	13.1	51.6	46.6
$MT + CL$	22.7	22.0	20.9	10.6	10.3	11.5	8.6	24.2	25.9	30.8	34.2	10.6	20.5	8.1	12.6	32.9	18.2	19.8	45.3	30.6
$MT + SR$	23.8	24.7	25.3	11.7	11.8	15.4	12.9	39.4	44.8	52.5	54.0	21.1	37.4	17.9	26.0	62.8	27.9	33.7	35.6	0.7
$MT +$ SR+ $CL$	23.8	24.8	25.4	11.7	11.9	15.4	12.9	39.4	44.8	52.6	54.0	21.2	37.5	17.9	26.1	62.8	28.0 $(+ 6.3)$	33.7 $(+ 4.3)$	35.5 $(- 6.3)$	0.7

Note. mAP = mean average precision; MT = mean teacher; CL = contrastive learning; SR = stochastic restoration. The bold values highlight the best-performing results.

Our approach significantly improves performance across most categories. For example, Target_Ave_mAP reaches 16.7, showing a +3.5 improvement, while Ave_mAP increases to 18.3, a +2.6 improvement. The “Drop” decreases to 26.6, and “Forgetting” is minimized to 1.1, indicating enhanced adaptation. The method outperforms both “no adaptation” and “MT baseline,” demonstrating its effectiveness in handling the various consecutive distortions. Although the “Drop” remains high due to the challenging nature of the setting, our method demonstrates robustness for long-term adaptation across multiple and diverse domain shifts.

In Table 10, the experimental results in terms of mAP@50 reinforce the robustness of our method and support the findings from the previous experiments. The method performs particularly well in noise and digital distortions, compared to the other approaches. Target_Ave_mAP reaches 28.0, a +6.3 improvement, and Ave_mAP equals 33.7, showing a +4.3 increase. “Drop” decreases to 35.5, reflecting a $- 6.3$ improvement, while “Forgetting” is minimized to 0.7. Overall, our method significantly enhances performance, highlighting its capacity to handle complex distortions and adapt effectively to diverse environments.

Overall, our experimental results demonstrate that the proposed MT + SR + CL method consistently outperforms the MT approach and generally exceeds the performance of each individual component (MT + CL and MT + SR). In Table 4, the performance of MT + SR + CL is almost identical to that of MT + SR, while in Table 8, the results of MT + SR + CL are nearly equal to those of MT + CL. These observations indicate that, in all cases, employing our combined method achieves the best possible performance, eliminating the need to decide which individual components to use for each scenario. In the Cityscapes dataset (Tables 5 and 6), the improvement over individual components is imperceptible, possibly because there is not a strong domain shift, only fog levels change, meaning that the domain difference is limited, allowing even the “no adaptation” baseline to achieve strong performance.

Figures 3, 4, 5 and 6 illustrate the evolution of mAP during the adaptation process across four datasets and provides valuable insight into how our proposed method compares to existing techniques. The plots highlight that the mAP of the “no adaptation” baseline drops when a domain shift occurs. For extreme domain shifts, such as in the KITTI dataset under the snow domain conditions, the model’s performance is completely degraded. This severe drop underscores the significant challenge posed by such domain shifts, where the model fails to generalize effectively. It also emphasizes the necessity of adaptation methods to maintain or improve performance in the presence of such shifts. Without adaptation, the model struggles to handle new environments, further validating the importance of implementing robust adaptation strategies.

Figure 3.

Evolution of Mean Average Precision (mAP) During Adaptation for the KITTI Dataset.

Figure 4.

Evolution of Mean Average Precision (mAP) During Adaptation for the Cityscapes Dataset.

Figure 5.

Evolution of Mean Average Precision (mAP) During Adaptation for the CLAD-D Dataset.

Figure 6.

Evolution of Mean Average Precision (mAP) During Adaptation for the COCO-C Dataset.

In contrast, the MT approach demonstrates a more robust performance, indicating that adaptation helps the model recover from these extreme shifts. However, in datasets such as KITTI, CLAD-D, and COCO-C, the performance on the source domain after adaptation is significantly lower due to catastrophic forgetting. Additionally, in the COCO-C dataset, long-term adaptation leads to further deterioration of the model’s performance after some domain shifts, where the model’s mAP declines to a point where it performs worse than the “no adaptation” baseline. This suggests that while adaptation can improve resilience to domain shifts, it is crucial to balance the adaptation process to avoid catastrophic forgetting and ensure stable, long-term performance across different domains.

Our proposed technique achieves improved performance under domain shifts and minimizes catastrophic forgetting, resulting in even better performance on the source domain after adaptation. Furthermore, it is well-suited for long-term adaptation, as demonstrated by the performance improvement on the COCO-C dataset. This highlights the effectiveness of our approach in maintaining model robustness and adaptability over time, even in continually changing environments.

5. Conclusion

In this study, we focus on continual TTA setting, in which the target domain is non-stationary, involving a sequence of domain shifts. We use the MT framework for object detection and integrate it with the SR technique. Furthermore, we seek to boost its effectiveness through object-level CL. Our proposed approach demonstrates an improved performance over the MT approach on several standard datasets. Additionally, our method is agnostic to the domain shifts that may occur during inference and has proved to be robust in long-term adaptation. This can be particularly valuable in real-world applications, where conditions often change over time.

Research in this field is relatively recent, and there are currently no widely accepted benchmarks for evaluating performance. The development of specialized datasets, performance metrics, and baselines is essential. CL stands out as an intriguing future direction that can be combined with self-supervision to address this challenging scenario. Future research may focus on enhancing the synergy between CL and SR technique, potentially by restoring only a subset of the network parameters. Given that SR significantly increases computational complexity, exploring more efficient mechanisms, such as restoring parameters only when significant domain shifts are detected, could be beneficial.

Advancements in continual TTA will enable the development of adaptive models for dynamically evolving domains, reducing the cost and time required to train new models for each domain. Experimental results are promising, indicating that adaptation during inference can enhance the performance and flexibility of deep learning algorithms.

Footnotes

ORCID iDs

Panagiota Moraiti

Ilias Papadeas

Ioannis Pratikakis

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Competing Interest

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Bartler

Bühler

Wiewel

Döbler

Yang

(2022). MT3: Meta test-time training for self-supervised test-time adaption. In G. Camps-Valls, F. J. R. Ruiz & I. Valera (Eds.), Proceedings of the 25th international conference on artificial intelligence and statistics, proceedings of machine learning research (Vol. 151, pp. 3080–3090). PMLR. https://proceedings.mlr.press/v151/bartler22a.html.

Cao

Joshi

Gui

L. Y.

Wang

Y. X.

(2023). Contrastive mean teacher for domain adaptive object detectors. In 2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 23839–23848). IEEE. https://doi.org/10.1109/CVPR52729.2023.02283.

Chakrabarty

Sreenivas

Biswas

(2023). A simple signal for domain shift. In 2023 IEEE/CVF international conference on computer vision workshops (ICCVW) (pp. 3569–3576). IEEE. https://doi.org/10.1109/ICCVW60793.2023.00384.

Chen

Jin

Cheng

Jin

Hua

(2020a). HoMM: Higher-order moment matching for unsupervised domain adaptation. Proceedings of the AAAI Conference on Artificial Intelligence, 34(4), 3422–3429. https://doi.org/10.1609/aaai.v34i04.5745

Chen

Yang

Song

Wang

Zhang

Yan

Zhuang

Xie

(2022). Learning domain adaptive object detection with probabilistic teacher. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu & S. Sabato (Eds.), Proceedings of the 39th international conference on machine learning, proceedings of machine learning research (Vol. 162, pp. 3040–3055). PMLR. https://proceedings.mlr.press/v162/chen22b.html.

Chen

Kornblith

Norouzi

Hinton

(2020b). A simple framework for contrastive learning of visual representations. In H. Daumé III & A. Singh (Eds.), Proceedings of the 37th international conference on machine learning, proceedings of machine learning research (Vol. 119, pp. 1597–1607). PMLR. https://proceedings.mlr.press/v119/chen20j.html.

Chen

Jia

(2023). STFAR: Improving object detection robustness at test-time by self-training with feature alignment regularization. arXiv:2303.17937. https://doi.org/10.48550/arXiv.2303.17937.

Cordts

Omran

Ramos

Rehfeld

Enzweiler

Benenson

Franke

Roth

Schiele

(2016). The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3213–3223). IEEE.

Deng

Chen

Duan

(2021). Unbiased mean teacher for cross-domain object detection. In 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 4089–4099). IEEE. https://doi.org/10.1109/CVPR46437.2021.00408.

10.

Döbler

Marsden

R. A.

Yang

(2023). Robust mean teacher for continual and gradual test-time adaptation. In 2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 7704–7714). IEEE. https://doi.org/10.1109/CVPR52729.2023.00744.

11.

Ganin

Ustinova

Ajakan

Germain

Larochelle

Laviolette

Marchand

Lempitsky

(2017). Domain-adversarial training of neural networks (pp. 189–209). Springer International Publishing.

12.

Liu

Wang

Sun

(2021). YOLOX: Exceeding yolo series in 2021. CoRR abs/2107.08430.

13.

Geiger

Lenz

Stiller

Urtasun

(2013). Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11), 1231–1237. https://doi.org/10.1177/0278364913491297

14.

Geiger

Lenz

Urtasun

(2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on computer vision and pattern recognition (CVPR) (pp. 3354–3361). IEEE.

15.

Gkioxari

Dollár

Girshick

(2017). Mask R-CNN. In 2017 IEEE international conference on computer vision (ICCV) (pp. 2980–2988). IEEE. https://doi.org/10.1109/ICCV.2017.322.

16.

Chen

Liang

Tan

Liang

Guo

(2023). Pseudo-label correction and learning for semi-supervised object detection. https://doi.org/10.48550/arXiv.2303.02998.

17.

Hendrycks

Dietterich

(2019). Benchmarking neural network robustness to common corruptions and perturbations. In Proceedings of the 7th international conference on learning representations (pp. 1–16).

18.

Khosla

Teterwak

Wang

Sarna

Tian

Isola

Maschinot

Liu

Krishnan

(2020). Supervised contrastive learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan & H. Lin (Eds.), Advances in neural information processing systems (Vol. 33, pp. 18661–18673). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2020/file/d89a66c7c80a29b1bdbab0f2a1a94af8-Paper.pdf.

19.

Zhu

Shen

H. T.

(2024). A comprehensive survey on source-free domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8), 5743–5762. https://doi.org/10.1109/TPAMI.2024.3370978 .

20.

Wang

Shi

Hou

Liu

(2018). Adaptive batch normalization for practical domain adaptation. Pattern Recognition, 80, 109–117. https://doi.org/10.1016/j.patcog.2018.03.005

21.

Y. J.

Dai

C. Y.

Liu

Y. C.

Chen

Kitani

Vajda

(2022). Cross-domain adaptive teacher for object detection. In 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 7571–7580). IEEE. https://doi.org/10.1109/CVPR52688.2022.00743.

22.

Liang

Feng

(2020). Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation. In H. Daumé III & A. Singh (Eds.), Proceedings of the 37th international conference on machine learning, proceedings of machine learning research (Vol. 119, pp. 6028–6039). PMLR. https://proceedings.mlr.press/v119/liang20a.html

23.

Lin

Zang

Tang

Wang

(2023). VCL challenges 2023 at ICCV 2023 technical report: Bi-level adaptation method for test-time adaptive object detection. In International conference on computer vision, arXiv:2310.08986. https://doi.org/10.48550/arXiv.2310.08986.

24.

Lin

T. Y.

Maire

Belongie

Hays

Perona

Ramanan

Dollár

Zitnick

C. L.

(2014). Microsoft COCO: Common objects in context. In D. Fleet, T. Pajdla, B. Schiele & T. Tuytelaars (Eds.), Computer vision – ECCV 2014 (pp. 740–755). Springer International Publishing.

25.

Mirza

M. J.

Soneira

P. J.

Lin

Kozinski

Possegger

Bischof

(2023). ActMAD: Activation matching to align distributions for test-time-training. In 2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 24152–24161). IEEE Computer Society. https://doi.org/10.1109/CVPR52729.2023.02313.

26.

Sakaridis

Dai

Van Gool

(2018). Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision, 126(9), 973–992. https://doi.org/10.1007/s11263-018-1072-8

27.

Sinha

Gehler

Locatello

Schiele

(2023). Test: Test-time self-training under distribution shift. In 2023 IEEE/CVF winter conference on applications of computer vision (WACV) (pp. 2758–2768). IEEE. https://doi.org/10.1109/WACV56688.2023.00278.

28.

Sójka

Cygert

Twardowski

Trzcinski

(2023). AR-TTA: A simple method for real-world continual test-time adaptation. In 2023 IEEE/CVF international conference on computer vision workshops (ICCVW) (pp. 3483–3487). IEEE. https://doi.org/10.1109/ICCVW60793.2023.00374.

29.

Srivastava

Hinton

Krizhevsky

Sutskever

Salakhutdinov

(2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56), 1929–1958.

30.

Sun

Segu

Postels

Wang

Van Gool

Schiele

Tombari

(2022). SHIFT: A synthetic driving dataset for continuous multi-task domain adaptation. In 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 21339–21350). IEEE. https://doi.org/10.1109/CVPR52688.2022.02068.

31.

Tarvainen

Valpola

(2017). Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan & R. Garnett (Eds.), Advances in neural information processing systems 30 (NIPS 2017) (pp. 1195–1204). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/68053af2923e00204c3ca7c6a3150cf7-Paper.pdf.

32.

Tzeng

Hoffman

Saenko

Darrell

(2017). Adversarial discriminative domain adaptation. In 2017 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2962–2971). IEEE. https://doi.org/10.1109/CVPR.2017.316.

33.

Verwimp

Yang

Parisot

Hong

McDonagh

Pérez-Pellitero

De Lange

Tuytelaars

(2023). CLAD: A realistic continual learning benchmark for autonomous driving. Neural Networks, 161, 659–669. https://doi.org/10.1016/j.neunet.2023.02.001

34.

vcl_workshop_2023 (2023) 1st workshop on visual continual learning. ICCV 2023. https://wvcl.vis.xyz/.

35.

Wang

Shelhamer

Liu

Olshausen

Darrell

(2021). Tent: Fully test-time adaptation by entropy minimization. In Proceedings of the 9th international conference on learning representations (pp. 1–15).

36.

Wang

Fink

Van Gool

Dai

(2022). Continual test-time domain adaptation. In 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 7191–7201). IEEE. https://doi.org/10.1109/CVPR52688.2022.00706.

37.

Wang

Luo

Zheng

Chen

Wang

Huang

(2024). In search of lost online test-time adaptation: A survey. International Journal of Computer Vision, 133, 1106–1139. https://doi.org/10.1007/s11263-024-02213-5

38.

Yoo

Lee

Chung

Kim

Kwak

(2024). What, how, and when should object detectors update in continually changing test domains? In 2024 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 23354–23363). IEEE Computer Society. https://doi.org/10.1109/CVPR52733.2024.02204.

39.

Zhang

Levine

Finn

(2022). MEMO: Test time robustness via adaptation and augmentation. In Proceedings of the 36th International conference on neural information processing systems, NIPS ’22 (pp. 38629–38642). Curran Associates Inc.

Continual Test-Time Domain Adaptation for Object Detection via Contrastive Mean Teacher and Stochastic Restoration

Abstract

Keywords

1. Introduction

2. Related Work

2.1. Unsupervised Domain Adaptation (UDA)

2.2. Test-Time Adaptation (TTA)

2.3. Continual TTA

3. Methodology

3.3.1. Object-Level Features

3.3.2. Class-Based Contrastive Loss

4.1. Datasets, Settings, and Metrics

4.1.1. SHIFT

4.1.3. Cityscapes

4.1.4. CLAD-D

4.1.5. COCO-C

4.2. Implementation Details

4.3. Results

Footnotes

ORCID iDs

Funding

Declaration of Competing Interest

References