Abstract

Person re-identification technology has made significant progress in recent years with the development of deep learning. However, the recognition rate of models in this field is still lower than that of face recognition, which is challenging to implement in practical application scenarios. Therefore, improving the recognition rate of the pedestrian re-identification model is still a critical task. This paper mainly focuses on three aspects of this problem. The first is to use the characteristics of the multi-branch network structure of person re-identification to dig out the most effective online self-distillation scheme between branches without increasing additional resource requirements, making full use of the information contained in each branch. Secondly, this paper analyzes and verifies the pros and cons of knowledge distillation based on mean squared error (MSE) loss function and Kullback-Leibler (KL) divergence from theoretical and experimental perspectives. Finally, we verified through experiments that adding a specific value of noise perturbation to the model weights can further improve the recognition rate of the model. After several improvements in these areas, we obtained the current state-of-the-art performance on four public datasets for person re-identification.

Keywords

person re-identification, deep learning, self-distillation, transformer

Introduction

In recent years, with the emergence of vision transformer (ViT),¹ the transformer network structure has also shown explosive development in computer vision after the outstanding achievements in natural language processing.² In computer vision research, video surveillance systems consider person re-identification to be a crucial task. Deep learning techniques have significantly enhanced recognition outcomes in this domain.^3–10 However, many problems still exist, such as the presence of different viewpoints,^11,12 occlusion,^13,14 and changes in pedestrian poses.^15,16 Several approaches within this field rely on extracting dependable feature representations.^17–28 There are also methods based on distance metric learning.^29–34

TransReid³⁵ is the first method to use transformers for person re-identification, and it has demonstrated strong performance on various publicly available person re-identification datasets. Its primary innovation is to bring the transformer architecture, in the form of ViT,¹ to pedestrian re-identification for the first time. The authors also propose the Side Information Embeddings (SIE) module and the Jigsaw Patch Module (JPM) to enhance re-identification performance. The SIE module mainly sets the corresponding embedded features for different cameras. The JPM module divides the entire image's patches into different areas for processing. This partition processing scheme is a prevalent practice in pedestrian re-identification. In TransReid, these partitions are turned into embedded features one by one at the beginning, then extracted separately in the last transformer structure, and each is processed with its own loss function.

Therefore, TransReid essentially performs partition processing on pedestrian images. The information carried by the different JPM branches describes different partitions of the human body. Like earlier pedestrian re-identification networks based on convolutional neural networks, TransReid ends in several branches that represent the global features and several local features of a pedestrian image, respectively. The primary motivation of our paper is the observation that, in TransReid, the cross-entropy and triplet losses are calculated separately for these branches, even though all branches have the same goal: to distinguish pedestrian images with different IDs. Since the branches share this goal, having them work together should be more effective than training each one in isolation. In principle, if the information in these different branches can be effectively integrated, the overall model's recognition performance can be improved.

Furthermore, research on knowledge distillation has been highly active in deep learning, and many knowledge distillation methods have been developed.^36–45 A person re-identification network that has both global and local branches can naturally be used for online knowledge distillation. However, unlike the parallel branch structures constructed in the literature,⁴⁵ our branches are not identical, so it is an open question which branch should act as the teacher and which as the student. This is one of the main issues that our paper aims to address. In addition, among the methods of knowledge distillation, literature³⁶ distills with Kullback-Leibler (KL) divergence, and literature⁴³ distills with a mean squared error (MSE) loss function. The former requires an appropriate temperature parameter, T, which has to be selected through experiments on the specific dataset and network structure and therefore requires extra work. The latter is more convenient, but literature⁴² reports that this method has some defects if used to distill the softmax output.

In summary, our contributions are as follows. Firstly, through extensive experiments based on the TransReid³⁵ network structure and several public person re-identification datasets, this paper identifies the most suitable online distillation scheme. Using this scheme, we can effectively improve the recognition performance of this network structure, which has global and local branches, without adding parameters. Secondly, this paper analyzes the defect of distilling the softmax output with the MSE loss function and derives that it can be avoided if a batch normalization (BN) layer is added before the softmax; experiments verify this inference. They show that when there is a BN layer in front of the softmax, distilling the softmax output with the MSE loss function achieves the same effect as KL divergence distillation with a carefully set temperature parameter T. Thirdly, we suggest monitoring for signs of overfitting in the network and, if it occurs, introducing noise to the network weights in order to prevent getting stuck in a suboptimal solution. Finally, through our improvement schemes, we obtain state-of-the-art recognition performance on four public datasets for person re-identification.

Related work

Partitioned pedestrian re-identification

In person re-identification research, performing feature extraction after partitioning the human image is a standard scheme. Literature²² proposes the Part-based Convolutional Baseline (PCB), which divides the human body image uniformly into six horizontal regions before global pooling and treats each region as a human body part. Global pooling is then applied to each region to obtain its features. Literature²³ proposes the Multi-Granularity Network (MGN), which uses three parallel branches to process the feature map of the human image after several convolutional layers. The first branch extracts the overall features of the complete image, while the second branch splits the human body image into two regions and extracts the corresponding local features.

Similarly, the third branch splits the human body image into three regions and extracts the local features of each region. In the testing phase, the final feature is obtained by merging the features from the three branches. The author in the literature²⁴ still partitions pedestrian images, but the network learns the specific partitioning scheme from a specific dataset. A common feature of these schemes is that the final output layer is not a simple output but has several branches simultaneously, representing pedestrian images’ global and local information, respectively.

The transformer-based image recognition method

Convolutional neural networks have dominated almost all computer vision (CV) fields in the past. Meanwhile, the transformer² began to develop rapidly in natural language processing (NLP). With the arrival of ViT,¹ the transformer wave reached the CV field, and in just a few years many algorithms have used the transformer structure to solve CV problems. Literature⁴⁶ adopts almost the same network structure as ViT,¹ adding better hyperparameter settings, multiple data augmentations, and knowledge distillation to improve the effectiveness of ViT. Literature⁴⁷ proposes a transformer-based end-to-end object detector that needs no non-maximum suppression (NMS) post-processing and no prior knowledge or constraints such as anchors, which greatly simplifies the object detection pipeline. In literature⁴⁸, a feature pyramid transformer (FPT) is proposed to fully exploit feature interactions across spaces and scales. FPT incorporates three transformer types: the self-transformer, grounding transformer, and rendering transformer. These transformers encode the information from the feature pyramid's self-level, top-down, and bottom-up paths. FPT leverages the self-attention module in the transformer to augment the feature pyramid network's feature fusion process.

TransReid³⁵ is the first method to use the transformer structure to solve the person re-identification problem. The underlying structure is similar to that of ViT.¹ It divides the input image into several patches, converts each patch into an embedded feature, and then passes through several transformer block structures for the final output. Different from ViT, TransReid adds SIE modules that represent relevant information, such as cameras and perspectives for the particularity of pedestrian re-identification. In addition, one output from ViT becomes a multi-branch structure with a global branch and several local branches. The method proposed in this paper is mainly realized based on TransReid combined with various improvement schemes. Finally, state-of-the-art recognition outcomes were achieved in person re-identification across four publicly available datasets.

Knowledge distillation

Over the years, many techniques for knowledge distillation have been developed. In literature³⁶, KL divergence is used to measure the similarity between the probability distributions of the teacher and student networks' softmax outputs. The objective is for the student network to attain a probability distribution that resembles that of the teacher network. Literature³⁷ performs knowledge distillation on the intermediate features of the layers in the network. The aim is for the student network to produce outputs similar to those of the teacher network at different levels of the network, with the mean squared error guiding the learning. Literature³⁸ proposes to use attention as the knowledge carrier for knowledge distillation. Literature³⁹ suggests transferring similarity-preserving knowledge from the teacher to the student network so that the two produce comparable activations for the same input samples. To measure this similarity, the paper computes inner products of the corresponding feature maps of the two networks and compares them with the mean squared error. Literature⁴⁰ uses mutual information to measure the difference between the student and teacher networks. Mutual information indicates the degree of mutual dependence of two variables: the larger the value, the higher the degree of dependence. Here, the mutual information is the entropy of the teacher model minus the conditional entropy of the teacher model given the student model, and the aim is to increase it. In reference⁴², contrastive learning is incorporated into knowledge distillation. The approach seeks a representation that minimizes the distance between the teacher and student networks for positive sample pairs and maximizes it for negative sample pairs.

Some of these knowledge distillation methods are online distillation methods, in which the teacher and student networks are trained simultaneously. However, online distillation methods such as literature⁴⁵ must manually construct additional parallel branch structures. Although the redundant branches are removed in the prediction stage, the computational complexity of the training stage is doubled. The method proposed in this paper also belongs to the category of online knowledge distillation. We exploit a particularity of the person re-identification field, namely that the network structure often has a global branch and multiple local branches, to explore how an online knowledge distillation scheme can be applied to this particular structure. With this scheme, TransReid achieves an improved recognition rate on various public person re-identification datasets without increasing computational requirements during training and testing. In addition, we analyze some problems with knowledge distillation of the softmax output using the MSE loss function. When there is a BN layer in front of the softmax, the effect of using the MSE loss function for knowledge distillation is the same as that of using KL divergence. However, the former does not require time to select the appropriate temperature parameter, T. A series of comparative experiments verifies the correctness of our inferences.

Method

Self-distillation structure

Our self-distillation network structure adds no extra network structures to TransReid³⁵. Instead, it uses knowledge distillation to fully exploit the information already present in the TransReid network structure, which has both a global branch and multiple local branches. We tried several different combinations of self-training.

First, we believe global features should be more informative than local ones. We hope that the local branch structure should not only pay attention to the local features themselves during training but also consider the constraints of the relevant information of the global features of the image so that they can reduce the ill effects of some locally noisy images. During the training process, we employ the output of the global branch as the teacher network’s structure and the four local branch structures as the structure of the student network. The loss function of this network structure is shown in the following formulas:

L_{D} = (distill (local1_{s}, global_{s}) + distill (local2_{s}, global_{s}) + distill (local3_{s}, global_{s}) + distill (local4_{s}, global_{s})) / 4,

(1)

L_{T} = \log [1 + \exp (\| f_{a} - f_{p} \|_{2}^{2} - \| f_{a} - f_{n} \|_{2}^{2})],

(2)

L_{C} = L_{c}^{global} + L_{c}^{local 1} + L_{c}^{local 2} + L_{c}^{local 3} + L_{c}^{local 4},

(3)

L = L_{D} + L_{T} + L_{C},

(4)

L_{softmax} = - \sum_{i = 1}^{N} \log \frac{e^{w_{y_{i}}^{T} x_{i}}}{\sum_{j = 1}^{K} e^{w_{j}^{T} x_{i}}} .

(5)

The distill function in the above formulas represents a specific knowledge distillation function, which can be chosen freely, for example KL divergence or MSE loss. Please refer to the following sections for the analysis of this choice. The $global_{s}$, $local1_{s}$, $local2_{s}$, $local3_{s}$, and $local4_{s}$ in the formulas represent the softmax output values of the global branch and the four local branches, respectively. $L_{T}$ represents the triplet loss function we use for the features of the global branch, and $f_{a}$, $f_{p}$, and $f_{n}$ in it represent the features of the anchor, positive, and negative samples of a triplet, respectively. $L_{C}$ represents the cross-entropy loss function, and $L_{c}^{global}$, $L_{c}^{local 1}$, $L_{c}^{local 2}$, $L_{c}^{local 3}$, and $L_{c}^{local 4}$ represent the cross-entropy loss functions of the global and four local branches, respectively. This self-training network is shown in Figure 1.

Figure 1.

The first self-training network structure.
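The first scheme (Equations 1 to 4) can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy tensor sizes, the choice of MSE as the distill function, and the averaging of the cross-entropy and triplet terms over the batch are all assumptions made here for readability.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mse_distill(student, teacher):
    """One possible `distill` choice: MSE between two softmax outputs."""
    return float(np.mean((student - teacher) ** 2))

def cross_entropy(probs, labels):
    """Cross-entropy against integer ID labels (batch mean, an assumption)."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))

def soft_margin_triplet(f_a, f_p, f_n):
    """Equation 2: log(1 + exp(||f_a - f_p||^2 - ||f_a - f_n||^2)), batch mean."""
    d_ap = np.sum((f_a - f_p) ** 2, axis=1)
    d_an = np.sum((f_a - f_n) ** 2, axis=1)
    return float(np.mean(np.logaddexp(0.0, d_ap - d_an)))

def scheme1_loss(global_logits, local_logits, labels, f_a, f_p, f_n):
    """Global branch as teacher, the four local branches as students."""
    global_s = softmax(global_logits)
    local_s = [softmax(z) for z in local_logits]
    l_d = sum(mse_distill(s, global_s) for s in local_s) / len(local_s)  # Eq. 1
    l_t = soft_margin_triplet(f_a, f_p, f_n)                             # Eq. 2
    l_c = cross_entropy(global_s, labels) + sum(
        cross_entropy(s, labels) for s in local_s)                       # Eq. 3
    return l_d + l_t + l_c, (l_d, l_t, l_c)                              # Eq. 4

# Toy data: sizes are illustrative only, not the paper's settings.
rng = np.random.default_rng(0)
batch, num_ids, dim = 8, 5, 16
global_logits = rng.normal(size=(batch, num_ids))
local_logits = [rng.normal(size=(batch, num_ids)) for _ in range(4)]
labels = rng.integers(0, num_ids, size=batch)
f_a, f_p, f_n = (rng.normal(size=(batch, dim)) for _ in range(3))
total, (l_d, l_t, l_c) = scheme1_loss(
    global_logits, local_logits, labels, f_a, f_p, f_n)
print(total, l_d, l_t, l_c)
```

Swapping `mse_distill` for a KL-based function changes only the distillation term; the triplet and cross-entropy terms are unaffected.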

The second is that we believe that the global branch should pay attention to not only global features but also some noticeable local features, and for the capture ability of local features, the local branch should be stronger than the global branch because it only needs to focus on a local area. Therefore, during training, we use the local branch structure as the teacher network structure and the global branch structure as the student network structure. The loss function of this network structure is very similar to the first case above, only the form of the knowledge distillation loss function is slightly different. The specific form is as follows:

\begin{aligned} L_{D} = {} & (distill (global_{s}, local1_{s}) + distill (global_{s}, local2_{s}) + distill (global_{s}, local3_{s}) \\ & + distill (global_{s}, local4_{s})) / 4, \end{aligned}

(6)

The only difference between the above formula and the form of the first case is that now, in the distill function, the softmax output value of the local branch is used as the teacher, while in the first case, the softmax output value of the global branch is used as the teacher. As before, we averaged the knowledge distillation loss values. The forms of the other related loss functions and the final total loss function are the same as in the first case. This self-training network is shown in Figure 2.

Figure 2.

The second self-training network structure. In order to avoid excessive similarity with Figure 1, we only depicted the parts that are different from it.

The third one is inspired by literature⁴⁴ and combined with the ideas of the first and second schemes above. When the network is trained, each local branch and global branch serve as both the student and teacher network structures. Each branch learns from the other and progresses with each other so that the global branch can learn from the local feature extraction ability, and the local branch can also learn from the overall view of the global branch and finally obtain a better recognition effect. The loss function in this network architecture closely resembles the one in the first scenario. However, the structure of the loss function varies for knowledge distillation. The precise structure of the knowledge distillation loss function is as follows:

\begin{aligned} L_{D g} = {} & (distill (global_{s}, local1_{s}) + distill (global_{s}, local2_{s}) + distill (global_{s}, local3_{s}) \\ & + distill (global_{s}, local4_{s})) / 4, \end{aligned}

(7)

\begin{aligned} L_{D 1} = {} & (distill (local1_{s}, global_{s}) + distill (local1_{s}, local2_{s}) + distill (local1_{s}, local3_{s}) \\ & + distill (local1_{s}, local4_{s})) / 4, \end{aligned}

(8)

\begin{aligned} L_{D 2} = {} & (distill (local2_{s}, global_{s}) + distill (local2_{s}, local1_{s}) + distill (local2_{s}, local3_{s}) \\ & + distill (local2_{s}, local4_{s})) / 4, \end{aligned}

(9)

\begin{aligned} L_{D 3} = {} & (distill (local3_{s}, global_{s}) + distill (local3_{s}, local1_{s}) + distill (local3_{s}, local2_{s}) \\ & + distill (local3_{s}, local4_{s})) / 4, \end{aligned}

(10)

\begin{aligned} L_{D 4} = {} & (distill (local4_{s}, global_{s}) + distill (local4_{s}, local1_{s}) + distill (local4_{s}, local2_{s}) \\ & + distill (local4_{s}, local3_{s})) / 4, \end{aligned}

(11)

L_{D} = (L_{D g} + L_{D 1} + L_{D 2} + L_{D 3} + L_{D 4}) / 5,

(12)

In the above formulas, $L_{D g}$, $L_{D 1}$, $L_{D 2}$, $L_{D 3}$, and $L_{D 4}$ are the distillation losses obtained when the global branch, local branch 1, local branch 2, local branch 3, and local branch 4, respectively, act as the student. In each of these five cases, all other branches act as teachers for the student branch. The final knowledge distillation loss is the average of these five terms (Equation 12). As before, we average the relevant loss values to keep the knowledge distillation part of the loss from generating excessive gradients. This self-training network is shown in Figure 3.

Figure 3.

The third self-training network structure. In order to avoid excessive similarity with Figure 1, we only depicted the parts that are different from it.
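The mutual scheme of Equations 7 to 12 can be written compactly as a loop over branches. The sketch below is a hedged illustration only: `mutual_distill_loss` and `mse_distill` are hypothetical names, MSE is just one possible distill function, and the random probability vectors stand in for real softmax outputs.

```python
import numpy as np

def mse_distill(student, teacher):
    """One possible `distill` choice: MSE between two softmax outputs."""
    return float(np.mean((student - teacher) ** 2))

def mutual_distill_loss(branch_probs, distill=mse_distill):
    """branch_probs: softmax outputs [global_s, local1_s, ..., local4_s]."""
    n = len(branch_probs)
    per_student = []
    for i, student in enumerate(branch_probs):
        # Every other branch acts as a teacher for branch i (Eqs. 7-11).
        teachers = [branch_probs[j] for j in range(n) if j != i]
        per_student.append(
            sum(distill(student, t) for t in teachers) / len(teachers))
    # Average over the five student roles (Eq. 12).
    return sum(per_student) / n

# Toy softmax outputs: 5 branches, batch of 8, 6 identities (illustrative sizes).
rng = np.random.default_rng(0)
branch_probs = [rng.dirichlet(np.ones(6), size=8) for _ in range(5)]
loss = mutual_distill_loss(branch_probs)
print(loss)
```

With five branches, each student averages over four teachers and the outer average runs over five students, which matches the divisors 4 and 5 in the equations.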

The fourth is inspired by literature⁴⁵. We believe that although the concerns of different branches may differ, the goals are the same, so if we integrate the outputs of all different branches, we should get better recognition results. Moreover, this better result incorporates all the valuable information from the local and global branches. Using this comprehensive information as the teacher network structure and then teaching each branch structure, in turn, should improve each branch’s feature extraction ability. The loss function for this network architecture is similar to the one described in the previous case, but the function used in knowledge distillation differs in its form. The precise form of the knowledge distillation loss function is as follows:

\begin{aligned} L_{D} = {} & (distill (global_{s}, ensemble_{s}) + distill (local1_{s}, ensemble_{s}) \\ & + distill (local2_{s}, ensemble_{s}) + distill (local3_{s}, ensemble_{s}) \\ & + distill (local4_{s}, ensemble_{s})) / 5, \end{aligned}

(13)

ensemble_{s} = (global_{s} + local1_{s} + local2_{s} + local3_{s} + local4_{s}) / 5,

(14)

Equation 14 defines $ensemble_{s}$ as the average of the softmax output values of the global branch and all local branches, that is, their ensemble probability distribution. Equation 13 then uses this ensemble as the teacher to perform knowledge distillation on all branch structures. This self-training network is shown in Figure 4.

Figure 4.

The fourth self-training network structure. In order to avoid excessive similarity with Figure 1, we only depicted the parts that are different from it.

It should be noted that the different self-training schemes we mentioned earlier have different methods and effects, but generally, each scheme does not increase the amount of extra computation. At the same time, the recognition efficiency of the original network structure is improved. Please refer to the related content in the following chapters for the experimental results.

Compared with other online distillation methods, our method has clear differences and advantages. For example, literature⁴⁵ additionally constructs multiple branch network structures parallel to the original network structure and uses them for information distillation, thereby improving the recognition effect of the original network structure. This higher performance comes at the cost of a large increase in computational requirements during training.

The approach presented in literature⁵⁴ does not introduce additional network structure. However, this paper proposes a method of rearranging the sampling order by constraining half of each mini-batch with the previous iteration while the remaining half coincides with the upcoming iteration. In this way, the first half of the mini-batch distills the soft targets generated in the previous iteration. This method performs self-distillation from the perspective of the input timing of the data. Literature⁵⁰ uses the perspective of different network parameters at different times to do self-distillation. The methods of these articles are different from ours.

Choice of the distillation method

The distillation method proposed in literature³⁶ is the most commonly employed technique among the various knowledge distillation methods. It employs KL divergence as a loss function to quantify the distribution difference between the softmax output results of the teacher and student networks. However, this method needs to set a suitable temperature parameter, T. If the set parameter value is not suitable, a harmful distillation effect may be obtained. Moreover, the appropriate temperature parameter T may be inconsistent in different datasets.
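To make the role of the temperature concrete, here is a hedged sketch of temperature-scaled KL distillation. The multiplication by T squared follows a common convention for this kind of loss and is an assumption here, as are the function names, the default T of 4, and the toy logits; none of these come from the text above.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    """Row-wise log-softmax of logits / T, computed stably."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def kl_distill(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions."""
    log_p_t = log_softmax(teacher_logits, temperature)
    log_p_s = log_softmax(student_logits, temperature)
    kl = np.sum(np.exp(log_p_t) * (log_p_t - log_p_s), axis=1)
    # T^2 rescaling is a common convention, assumed here.
    return float(np.mean(kl) * temperature ** 2)

teacher_logits = np.array([[6.0, 2.0, 1.0]])
student_logits = np.array([[3.0, 2.5, 0.5]])
sharp = np.exp(log_softmax(teacher_logits, 1.0))     # T = 1
softened = np.exp(log_softmax(teacher_logits, 4.0))  # T = 4, flatter
loss = kl_distill(student_logits, teacher_logits)
print(sharp, softened, loss)
```

A larger T flattens the teacher distribution, so how much weight the non-target classes receive depends on T, which is why its value has to be tuned per dataset.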

However, literature⁴³ proposes that MSE loss can be used to measure the difference between the teacher and student networks, so that the parameter T can be avoided. That paper also points out that the softmax output cannot be directly distilled using MSE loss. The example given is that two inputs to the softmax function, [10, 20, 30] and [–10, 0, 10], produce precisely the same softmax output although the inputs themselves are different. Therefore, only the data before the softmax can be regressed, not the softmax output. Our analysis of this problem is as follows. If there is no BN layer, the softmax outputs of [10, 20, 30] and [–10, 0, 10] are indeed the same, [2.0611e-09, 4.5398e-05, 9.9995e-01], as shown in Figure 5a. In this case, regressing the softmax output with the MSE loss function cannot reflect the difference between the two sample features, as mentioned in literature⁴³. However, since a BN layer normalizes all the data before it enters the softmax activation function, this problem does not exist in our model. Let us analyze the calculation formula of the BN layer:

μ_{B} = \frac{1}{m} \sum_{i = 1}^{m} x_{i},

(15)

σ_{B}^{2} = \frac{1}{m} \sum_{i = 1}^{m} (x_{i} - μ_{B})^{2},

(16)

\hat{x_{i}} = \frac{x_{i} - μ_{B}}{\sqrt{σ_{B}^{2} + ϵ}},

(17)

y_{i} = γ \hat{x_{i}} + β,

(18)

Figure 5.

Illustration of the difference in adding a BN layer before the softmax layer for special input data: (a) without BN layer and (b) with BN layer.

With a BN layer, a batch of input data must contain other samples besides these two; we assume a third sample, [15.0, 21.0, 16.0]. According to the calculation formula of the BN layer, the data entering the softmax is determined by the four parameters $μ_{B}$, $σ_{B}^{2}$, $γ$, and $β$. Even if we assume that $γ$ is 1 and $β$ is 0, the data entering the softmax is still changed by $μ_{B}$ and $σ_{B}^{2}$. Because $μ_{B}$ and $σ_{B}^{2}$ are computed over the whole batch, including the other samples, the values of [10, 20, 30] and [–10, 0, 10] are changed as well. With these three samples in the batch, and with $γ$ equal to 1 and $β$ equal to 0, the BN formulas give $μ_{B}$ = [5.0, 13.6667, 18.6667] and $σ_{B}^{2}$ = [116.6667, 93.5556, 70.2222]. The data actually fed to the softmax function is then [0.4629, 0.6548, 1.3524] for [10, 20, 30], [–1.3887, –1.4130, –1.0342] for [–10, 0, 10], and [0.9258, 0.7582, –0.3182] for [15.0, 21.0, 16.0]. The corresponding softmax outputs are [0.2153, 0.2608, 0.5239], [0.2940, 0.2869, 0.4191], and [0.4686, 0.3963, 0.1351], as shown in Figure 5b. After the BN layer, the softmax outputs obtained for [10, 20, 30] and [–10, 0, 10] are therefore no longer the same, but the trend of their probability distributions is still the same. In this way, the network can focus more on learning the probability distribution of the sample feature responses rather than the location in feature space where the feature lies.

We therefore compare three options: using the MSE loss function to distill the data before the softmax input, using the MSE loss function to distill the softmax output, and using KL divergence on the softmax output. We found that, at least for the few public datasets we used, distilling the softmax output with the MSE loss function gives a result almost identical to that of the best KL divergence distillation, whose temperature T was carefully tuned to its most appropriate value. However, using the MSE loss function to distill the data before the softmax input is less effective. This result is consistent with our previous deduction, indicating that the deduction is reasonable. Please refer to the subsequent analysis of relevant content for specific experimental results.
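The worked example above can be checked numerically. The short NumPy sketch below applies Equations 15 to 18 with $γ$ = 1 and $β$ = 0 to the same three-sample batch; the helper names and the small epsilon of 1e-5 are assumptions made for this illustration.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Eqs. 15-18: per-feature normalization with batch statistics."""
    mu = x.mean(axis=0)   # Eq. 15
    var = x.var(axis=0)   # Eq. 16 (biased variance, 1/m)
    x_hat = (x - mu) / np.sqrt(var + eps)  # Eq. 17
    return gamma * x_hat + beta, mu, var   # Eq. 18

# Rows: the two samples from the text plus the assumed third sample.
x = np.array([[10.0, 20.0, 30.0],
              [-10.0, 0.0, 10.0],
              [15.0, 21.0, 16.0]])

no_bn = softmax(x)            # rows 0 and 1 coincide without BN
y, mu, var = batch_norm(x)
with_bn = softmax(y)          # rows 0 and 1 differ once BN is applied

np.set_printoptions(precision=4, suppress=True)
print(mu, var)
print(y)
print(with_bn)
```

Without BN, the first two rows of `no_bn` are identical because softmax is invariant to adding a constant to all logits; with BN, the batch statistics break that invariance, while the largest probability of both rows still falls on the third class.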

At the same time, using KL divergence as the distillation scheme requires selecting the appropriate temperature parameter T through experiments, and different datasets may require different values of T. We therefore believe that the MSE loss function is a simpler and more appropriate scheme for distilling the softmax output, and in our final experiment we used the MSE loss function to distill the output of the softmax. It is worth noting that, with the MSE loss function, the first and second network structures mentioned earlier in this article are equivalent. The MSE loss is symmetric in its two arguments, and because our online self-distillation does not fix the parameters of the teacher network structure, gradients flow through both the teacher and the student outputs. Hence, Equations 1 and 6 are entirely equivalent. For KL divergence, however, the formula clearly distinguishes the probability distribution to be fitted from the target probability distribution, so Equations 1 and 6 can represent different ideas.
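The symmetry argument can be illustrated with two toy distributions. In this sketch the probability values and function names are made up for illustration, and the KL term is written without a temperature for brevity.

```python
import numpy as np

def mse_distill(student, teacher):
    """MSE between two softmax outputs; symmetric in its arguments."""
    return float(np.mean((student - teacher) ** 2))

def kl_distill(student, teacher, eps=1e-12):
    """KL(teacher || student); the roles of the two arguments differ."""
    return float(np.mean(np.sum(
        teacher * (np.log(teacher + eps) - np.log(student + eps)), axis=1)))

# Hypothetical softmax outputs of a global and a local branch.
global_s = np.array([[0.7, 0.2, 0.1]])
local_s = np.array([[0.4, 0.4, 0.2]])

mse_eq1 = mse_distill(local_s, global_s)  # global as teacher (Eq. 1 style)
mse_eq6 = mse_distill(global_s, local_s)  # local as teacher (Eq. 6 style)
kl_eq1 = kl_distill(local_s, global_s)
kl_eq6 = kl_distill(global_s, local_s)
print(mse_eq1, mse_eq6, kl_eq1, kl_eq6)
```

The two MSE values coincide, so when no stop-gradient is applied to the teacher the two schemes define the same objective; the two KL values differ, so the choice of teacher matters there.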

For the case that the BN layer is not connected, we have also done relevant experiments, and the results show that, as we have analyzed, if the BN layer is not connected, the effect of direct regression on the output value of softmax is worse than that on the input value of softmax. For specific experimental results, please refer to relevant content later.

Add noise to network weights

The transformer network structure used in this paper is highly complex, while the public person re-identification datasets contain only tens of thousands of images, so the network overfits easily. There are many approaches to mitigating overfitting, and regularizing the network is a common one. Adding noise to the network weights can be seen as a regularization technique. Following this idea, this paper treats adding noise to the network weights as a scheme to avoid getting stuck in local minima. We observed that when the network gets stuck in a local minimum, it is usually overfitted, with 100% accuracy on the training set. We can therefore monitor whether the accuracy on the current training set approaches 100% and, if so, add noise to the network weights to help the network jump out of the current local minimum. In this way, the network can explore more local extreme points, converge to a relatively good local minimum, and obtain a better recognition effect. For the related experimental analysis, please refer to the later sections of this article.
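The monitoring rule described above could be sketched as follows. Everything specific in this snippet is an assumption for illustration: the text does not state the noise distribution, its magnitude, or the accuracy threshold, so zero-mean Gaussian noise, `noise_std=0.01`, and `threshold=0.999` are placeholders, and `maybe_perturb_weights` is a hypothetical helper name.

```python
import numpy as np

def maybe_perturb_weights(weights, train_accuracy, rng,
                          threshold=0.999, noise_std=0.01):
    """Add noise to all weights once training accuracy nears 100%.

    weights: dict mapping parameter names to NumPy arrays.
    Returns (weights, perturbed), where perturbed says whether noise was added.
    """
    if train_accuracy < threshold:
        return weights, False
    # Assumed noise model: independent zero-mean Gaussian per weight.
    noisy = {name: w + rng.normal(0.0, noise_std, size=w.shape)
             for name, w in weights.items()}
    return noisy, True

rng = np.random.default_rng(0)
weights = {"head.weight": np.zeros((4, 3)), "head.bias": np.zeros(4)}

# Training accuracy still below the threshold: weights are left untouched.
same, flag_low = maybe_perturb_weights(weights, train_accuracy=0.93, rng=rng)
# Training accuracy at 100%: weights are perturbed.
noisy, flag_high = maybe_perturb_weights(weights, train_accuracy=1.0, rng=rng)
print(flag_low, flag_high)
```

In a training loop, such a check would typically run once per epoch after the training accuracy has been measured.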

Experimental results and analysis

Introduction to experimental datasets

This article uses four public datasets, namely Market-1501,³ DukeMTMC-reID,⁴ MSMT17,⁶ and Occluded-Duke,⁵¹ to evaluate the proposed methods. Market-1501³ is a dataset of 1501 pedestrians captured by six cameras on the campus of Tsinghua University in 2015, with a training set of 751 people and a test set of 750 people. DukeMTMC-reID⁴ is a subset of the DukeMTMC dataset, containing 1404 people and 36,411 images captured by eight cameras. MSMT17⁶ was collected from 15 cameras on campus, with a training set of 1041 pedestrians and a test set of 3060 pedestrians. Finally, the images in Occluded-Duke⁵¹ are selected from DukeMTMC-reID, and its training/query/gallery sets contain 9%/100%/10% occluded images, respectively.

The Market-1501 dataset is fascinating because it includes high-resolution and low-resolution images captured by multiple cameras.³ The training set averages 17.2 images per person, while the test set averages 26.3 images per person. The query images are manually annotated, while the gallery images use a DPM detector for pedestrian detection.

DukeMTMC-reID contains a large number of images captured by multiple cameras.⁴ The training set and the test set each include 702 people, with 16,522 images in the training set and 17,661 images in the test set.

MSMT17 uses 15 cameras covering indoor and outdoor areas, with video collected over four days under varying weather conditions.⁶ The dataset was annotated by three human annotators and contains a training set of 1041 pedestrians with 32,621 bounding boxes and a test set of 3060 pedestrians with 93,820 bounding boxes. The query images are randomly selected from the test-set bounding boxes.

Finally, Occluded-Duke focuses on occluded images: 9% of the training set, 100% of the query set, and 10% of the gallery set are occluded.⁵¹ The dataset was selected from DukeMTMC-reID and includes 1812 people with 36,411 images.

These datasets provide valuable resources for researchers studying pedestrian re-identification and related areas, such as object detection and image recognition.

Experimental parameters and testing metrics

The experimental environment and parameters are as follows. The experiments run on Ubuntu 16.04 with Python 3.6 and PyTorch 1.7.0 on an NVIDIA 2080 Ti GPU. Input images are resized to 384 × 128, and to 256 × 128 for Occluded-Duke. The batch size is 64. Because the triplet loss function is used, each batch samples 16 pedestrian IDs with four images per ID. The learning rate follows a cosine annealing schedule with an initial value of 0.008. The weight decay factor is 0.0005, and the backbone is the TransReid model pre-trained on ImageNet, as in the literature.³⁵ Data augmentation uses random horizontal flipping, cropping, and erasing. The triplet loss function uses the soft-margin formulation. The model is optimized with stochastic gradient descent (SGD) with a momentum of 0.9.
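The learning-rate schedule and the identity-balanced batches described above can be sketched in a few lines. This is an illustration under stated assumptions: the function names, the 120-epoch horizon, the minimum learning rate of 0, and the with-replacement fallback for identities with fewer than four images are hypothetical; only the initial rate of 0.008 and the 16 × 4 batch layout come from the text.

```python
import math
import random
from collections import defaultdict


def cosine_annealing_lr(epoch, total_epochs, base_lr=0.008, min_lr=0.0):
    """Learning rate at `epoch` for a single cosine decay from base_lr to min_lr."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def sample_pk_batch(labels, num_ids=16, num_instances=4, rng=None):
    """Pick `num_ids` identities, then `num_instances` image indices for each."""
    rng = rng or random.Random()
    by_id = defaultdict(list)
    for index, pid in enumerate(labels):
        by_id[pid].append(index)
    batch = []
    for pid in rng.sample(sorted(by_id), num_ids):
        pool = by_id[pid]
        if len(pool) >= num_instances:
            batch.extend(rng.sample(pool, num_instances))
        else:
            # Assumption: identities with too few images are sampled with replacement.
            batch.extend(rng.choices(pool, k=num_instances))
    return batch


# Toy label list: 20 identities with 5 images each.
toy_labels = [i // 5 for i in range(100)]
batch = sample_pk_batch(toy_labels, rng=random.Random(0))  # 16 IDs x 4 images = 64
schedule = [cosine_annealing_lr(e, total_epochs=120) for e in range(121)]
```

Sampling several images per identity guarantees that every anchor in the batch has at least one positive, which the triplet loss needs.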

Following established practice in the person re-identification community, we evaluate all approaches using cumulative matching characteristic (CMC) curves and mean average precision (mAP).
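As a reminder of what these two metrics measure, the sketch below computes CMC rank-k accuracy and mAP from ranked match lists. It is a simplified illustration with hypothetical function names: each query is represented by a list of booleans ordered by gallery similarity, and the usual filtering of gallery images from the same camera as the query is assumed to have been done beforehand.

```python
def average_precision(ranked_matches):
    """AP for one query; ranked_matches[i] is True if the i-th result is correct."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_ranked_matches):
    """Mean of the per-query average precision values."""
    return sum(average_precision(m) for m in all_ranked_matches) / len(all_ranked_matches)


def cmc_at_k(all_ranked_matches, k):
    """Fraction of queries with at least one correct match in the top k results."""
    found = sum(1 for m in all_ranked_matches if any(m[:k]))
    return found / len(all_ranked_matches)


# Two toy queries, each with a three-image gallery ranking.
rankings = [
    [True, False, True],   # correct matches at ranks 1 and 3
    [False, True, False],  # correct match at rank 2
]
rank1 = cmc_at_k(rankings, 1)
rank2 = cmc_at_k(rankings, 2)
map_score = mean_average_precision(rankings)
```

Rank-1 only asks whether the top result is correct, whereas mAP rewards placing every correct gallery image early, which is why the two metrics can move differently in the tables.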

Analysis of the results of the experiment

Experimental analysis of self-distillation learning structure

As described in the method section, we consider four self-distillation structures. The first, denoted ltg, uses the information of the local branch as the teacher to train the global branch, so that the global branch acquires some classification ability based on local features while learning global classification. The second, denoted gtl, uses the information of the global branch as the teacher to train the local branch, so that the local branch attends to local features while also taking into account the implicit information of the global image to which the local region belongs. The third, denoted mt, is inspired by mutual teaching:⁴⁴ every branch acts as both student and teacher in mutual online distillation, so that the branches learn from and improve each other. The fourth, denoted st, integrates the outputs of the global branch and all local branches and uses the resulting probability distribution as the teacher to guide the training of the global branch and all local branches. We use it to test whether such an integrated teacher is superior on TransReid, whose branches differ in structure, without adding computational burden. In all four cases, KL divergence is the loss function for knowledge distillation.
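The fourth structure (st) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the text does not say how the branch outputs are integrated, so a plain average of the branch probability distributions is assumed here, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) between two discrete distributions."""
    return sum(t * math.log((t + eps) / (s + eps)) for t, s in zip(teacher, student))


def ensemble_teacher(branch_probs):
    """Assumption: the integrated teacher is the mean of the branch distributions."""
    count = len(branch_probs)
    return [sum(column) / count for column in zip(*branch_probs)]


def st_distillation_loss(branch_logits):
    """Sum of KL(teacher || branch) over the global branch and all local branches."""
    probs = [softmax(logits) for logits in branch_logits]
    # In a deep learning framework the teacher would be detached from the graph.
    teacher = ensemble_teacher(probs)
    return sum(kl_divergence(teacher, p) for p in probs)


# One global branch and two local branches scoring three identities.
logits_per_branch = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5], [2.5, 0.0, -1.5]]
loss = st_distillation_loss(logits_per_branch)
```

Because the teacher is built from outputs the network already computes, this form of distillation adds no extra model and no extra forward pass.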

Table 1 shows that the "st" self-distillation structure performs best on the public datasets, so our final network structure uses it. Among the other three methods, "mt" is slightly better, while "ltg" and "gtl" perform similarly. All four methods improve mAP over the original TransReid network on every dataset, while Rank-1 is essentially unchanged on MSMT17 and Occluded-Duke. This demonstrates the value of our approach, which uses online knowledge distillation to extract information from the different branches without increasing the computational burden.

Table 1.

Self-distillation learning structure comparison experimental results. The baseline values in the table are those of TransReid.

Dataset	mAP_baseline	R1_baseline	mAP_gtl	R1_gtl	mAP_ltg	R1_ltg	mAP_mt	R1_mt	mAP_st	R1_st
Market1501	88.8	95.0	89.2	95.3	89.3	95.3	89.5	95.4	89.7	95.8
DukeMTMC-reID	81.8	90.4	82.3	90.6	82.2	90.5	82.5	90.8	82.6	91.0
MSMT17	66.6	84.6	66.8	84.5	66.7	84.5	67.1	84.6	67.3	84.6
Occluded-Duke	55.7	64.2	56.5	64.0	56.3	64.0	56.8	64.1	57.0	64.1

Experimental analysis of the choice of the distillation method

We argued earlier that, in a network with a BN layer before the softmax layer, using the MSE loss function to distill the softmax output does not have the disadvantages mentioned in the literature.⁴³ To verify this, we conducted experiments on the fourth self-distillation structure (st) described above. The results are shown in Table 2. On the four public datasets, when a BN layer precedes the softmax layer, distilling the softmax output with the MSE loss function (MSE_bn) is better than distilling the features before the softmax input (MSE_bn_logit). To check our inference, we then removed the last BN layer and repeated the comparison. In this case, as stated in the literature,⁴³ directly distilling the softmax output (MSE) is worse than distilling the features before the softmax input (MSE_logit). The results of these two cases support our inference.

Table 2.

Comparison of experimental results for distillation method selection. Rows beginning with T use KL divergence for knowledge distillation, and the number after T is the temperature parameter. Rows beginning with MSE use the MSE loss function for knowledge distillation. The keyword bn indicates that a BN layer precedes the softmax layer. The keyword logit indicates that knowledge distillation is performed on the features before the softmax input; without it, knowledge distillation is performed on the softmax output.

Distillation Method	Market1501 mAP	Market1501 R1	DukeMTMC mAP	DukeMTMC R1	MSMT17 mAP	MSMT17 R1	Occluded-Duke mAP	Occluded-Duke R1
T1	88.8	94.9	82.1	90.6	66.6	83.8	56.7	63.6
T2	88.9	94.9	82.2	90.5	66.8	84.0	56.8	63.9
T4	89.0	95.0	82.2	90.7	66.8	84.1	56.8	63.8
T8	89.4	95.2	82.4	90.8	67.1	84.4	56.9	63.9
T16	89.7	95.8	82.6	91.0	67.3	84.6	57.0	64.1
T32	89.6	95.6	82.7	91.1	67.2	84.5	57.0	64.0
T64	89.7	95.7	82.5	90.8	67.2	84.4	56.9	63.9
T128	89.4	95.3	82.1	90.2	67.0	84.1	56.7	63.7
MSE	89.1	94.9	81.9	89.9	66.7	84.0	56.5	63.6
MSE_logit	89.3	95.2	82.2	90.4	66.9	84.3	56.8	63.9
MSE_bn	89.8	95.9	82.7	91.0	67.2	84.7	57.0	64.1
MSE_bn_logit	89.3	95.3	82.3	90.7	67.0	84.3	56.7	63.8

In addition, we compare KL-divergence distillation under different temperature parameters T with distillation using the MSE loss function. The results show that, once the temperature is set carefully (for example, T = 16 in Table 2), KL divergence is no worse than the MSE loss function, but it has no clear advantage either. Distillation with the MSE loss function is simpler because it requires no search over the temperature parameter T. We therefore consider MSE distillation preferable to KL divergence when a BN layer precedes the softmax layer.
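The practical difference between the two losses can be seen in a short sketch. It is illustrative only: the function names are hypothetical, and the T² scaling of the KL term follows the common convention from Hinton et al.'s distillation paper, which this paper does not confirm it uses.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_kl_loss(student_logits, teacher_logits, temperature):
    """Temperature-softened KL(teacher || student), scaled by T^2 (assumed convention)."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / max(q, 1e-12)) for p, q in zip(teacher, student) if p > 0.0)
    return temperature ** 2 * kl


def kd_mse_loss(student_logits, teacher_logits):
    """Mean squared error between the two softmax outputs; no temperature to tune."""
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)


student_logits = [2.0, 0.5, -1.0]
teacher_logits = [1.5, 1.0, -0.5]

# KL distillation needs one run per candidate temperature (values from Table 2).
kl_by_temperature = {t: kd_kl_loss(student_logits, teacher_logits, t)
                     for t in (1, 2, 4, 8, 16, 32, 64, 128)}
mse_value = kd_mse_loss(student_logits, teacher_logits)
```

The KL variant exposes a hyperparameter that has to be swept, as the eight T rows of Table 2 show, while the MSE variant is a single fixed function of the two outputs.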

Experimental analysis of adding noise to network weights

In the noise experiments, the backbone has already been pre-trained on the large ImageNet dataset and therefore has good feature extraction and generalization capabilities, so we do not perturb the backbone weights. The classifier layer that we add to each branch on top of the backbone has not been pre-trained on ImageNet and overfits the training set more easily, so we add noise to the classifier layer of each branch.

More specifically, we add Gaussian noise with a mean of 0 and a variance of 1 to the weights. Adding this noise directly disrupts the network too strongly and prevents it from converging properly, so the noise must be multiplied by a small coefficient. The coefficient must not be too small either, or the weights are not perturbed enough to have any effect. We therefore multiply the Gaussian noise by a series of coefficients, add the result to the weights, and compare the recognition performance with that obtained without noise (the baseline data without noise are shown in Table 3). In Figure 6, e0 denotes adding the noise directly, e1 denotes multiplying it by 10⁻¹, e2 denotes multiplying it by 10⁻², and so on. Figure 6 shows that adding the unscaled noise directly to the network weights is harmful. Once the coefficient is 10⁻³ or smaller, however, the mAP and Rank-1 indicators on these datasets improve. The improvements are small, but taken over all the datasets, adding appropriately scaled noise to the weights still improves the model's recognition rate. A coefficient of 10⁻⁷ is the most effective across the public person re-identification datasets, so we use it in our final training scheme.
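The scaled perturbation can be sketched as below. This is an illustration with hypothetical names and toy weights: it shows only the operation of adding 10⁻ᵏ times standard Gaussian noise to a list of classifier weights, following the e0 to e7 naming of Figure 6, and does not reproduce any result.

```python
import random


def perturb_weights(weights, exponent, rng=None):
    """Return weights plus coefficient * N(0, 1) noise, with coefficient = 10 ** -exponent."""
    rng = rng or random.Random()
    coefficient = 10.0 ** -exponent
    return [w + coefficient * rng.gauss(0.0, 1.0) for w in weights]


# Toy classifier weights (hypothetical values); the backbone would be left untouched.
classifier_weights = [0.5, -0.25, 0.1, 0.03]

# Largest absolute change per setting e0 .. e7, using a fixed seed for repeatability.
largest_shift = {}
for exponent in range(8):
    noisy = perturb_weights(classifier_weights, exponent, rng=random.Random(0))
    largest_shift["e%d" % exponent] = max(
        abs(a - b) for a, b in zip(noisy, classifier_weights)
    )
```

With e0 the change is of the same order as the weights themselves, which matches the observation that unscaled noise prevents convergence, while e7 nudges each weight by roughly 10⁻⁷.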

Figure 6.

Experimental data with added noise. (a) Experimental data on the mAP indicator; (b) Experimental data on the rank-1 indicator.

Table 3.

Baseline data for adding noise experiments.

Test Metrics	Market1501	DukeMTMC	MSMT17	Occluded-Duke
mAP	89.8	82.7	67.2	57.0
Rank-1	95.9	91.0	84.7	64.1

Comparison with state-of-the-art methods

Table 4 compares this paper's approach, which combines all the methods discussed above, with state-of-the-art techniques on four publicly available person re-identification datasets. To facilitate comparison with other approaches, the table includes two input resolutions, 384 × 128 and 256 × 128, indicated in the Size column. For the Occluded-Duke dataset, other articles report results only for 256 × 128 inputs, so we also ran only the 256 × 128 experiments on this dataset. The training images are augmented with random horizontal flipping, padding, cropping, and erasing. The other training parameters are as follows: the initial learning rate is 0.008 and is adjusted using cosine annealing. The batch size is 64, with 16 IDs sampled per batch and four samples per ID. The weight decay is 1e-4, the momentum is 0.9, and the optimizer is SGD. Training uses PyTorch (version 1.7) on an NVIDIA 2080 Ti GPU. A ViT model pre-trained on the ImageNet dataset initializes the backbone of our network. The margin of the triplet loss is 0.3. The other losses, including the cross-entropy loss and the knowledge distillation loss, are weighted equally.

Table 4.

The performance of the proposed method is compared with state-of-the-art techniques.

Method	Size	Market1501 mAP	Market1501 R1	DukeMTMC mAP	DukeMTMC R1	MSMT17 mAP	MSMT17 R1	Occluded-Duke mAP	Occluded-Duke R1
MGN²³	384	86.9	95.7	78.4	88.7	52.1	76.9	–	–
SCSN⁵²	384	88.5	95.7	79.0	91.0	58.5	83.8	–	–
ABDNet⁵³	384	88.3	95.6	78.6	89.0	60.8	82.3	–	–
MGN_CPWA²⁴	384	87.1	95.8	78.5	89.2	–	–	–	–
PGFA⁵¹	256	76.8	91.2	65.5	82.6	–	–	37.4	81.4
HOReID⁵⁴	256	84.9	94.2	75.6	86.9	–	–	43.8	55.1
ISP⁵⁵	256	88.6	95.3	80.0	89.6	–	–	52.3	62.8
TransReid³⁵	384	88.8	95.0	81.8	90.4	66.6	84.6	–	–
TransReid*³⁵	384	89.5	95.2	82.6	90.7	69.4	86.2	–	–
TransReid³⁵	256	88.2	95.0	80.6	89.6	64.9	83.3	55.7	64.2
TransReid*³⁵	256	88.9	95.2	82.0	90.7	67.4	85.3	59.2	66.4
Our	384	90.0	96.0	82.8	91.2	67.4	84.8	–	–
Our*	384	90.2	96.1	83.0	91.2	69.8	86.3	–	–
Our	256	89.5	95.7	81.8	90.5	65.7	83.6	57.2	64.3
Our*	256	89.9	95.9	82.8	90.7	68.4	85.5	59.6	66.4

The asterisk (*) indicates that the backbone is configured with a sliding-window setting. The bold values indicate the maximum values in their respective columns in the table.

Our approach achieves superior results on all four datasets. Compared with the previous state-of-the-art method, TransReid with the sliding-window setting (TransReid* in Table 4), it improves mAP by 0.7 percentage points and Rank-1 by 0.9 percentage points on Market1501. On DukeMTMC-reID, it improves mAP by 0.4 percentage points and Rank-1 by 0.5 percentage points. On MSMT17, it improves mAP by 0.4 percentage points and Rank-1 by 0.1 percentage points. Finally, on Occluded-Duke, it improves mAP by 0.4 percentage points, while Rank-1 remains the same.
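The gains quoted above can be recomputed from Table 4. The short check below copies the (mAP, Rank-1) pairs of the TransReid* and Our* rows, at 384 × 128 for the first three datasets and 256 × 128 for Occluded-Duke; the variable and function names are only for illustration.

```python
# (mAP, Rank-1) pairs copied from Table 4, sliding-window rows marked with *.
transreid_sw = {
    "Market1501": (89.5, 95.2),
    "DukeMTMC": (82.6, 90.7),
    "MSMT17": (69.4, 86.2),
    "Occluded-Duke": (59.2, 66.4),
}
ours_sw = {
    "Market1501": (90.2, 96.1),
    "DukeMTMC": (83.0, 91.2),
    "MSMT17": (69.8, 86.3),
    "Occluded-Duke": (59.6, 66.4),
}


def gains(ours, baseline):
    """Per-dataset (mAP, Rank-1) differences in percentage points, rounded to 0.1."""
    return {
        name: tuple(round(o - b, 1) for o, b in zip(ours[name], baseline[name]))
        for name in ours
    }


improvements = gains(ours_sw, transreid_sw)
```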

Regarding these experimental data, note that the literature³⁵ proposed the sliding-window setting, which effectively improves the recognition rate of the model but significantly increases the resource requirements. Because the best previous results were obtained with this setting, we adopt the same setting for ease of comparison and set the sliding window size to 12. The results above are therefore those of our method under this setting. Even without it, our approach still achieves state-of-the-art results on Market1501 and DukeMTMC-reID, which supports the efficacy of our method. With the sliding window set to 12, we achieve state-of-the-art results on all four public datasets. These results show that our approach improves person re-identification with vision transformers and multi-branch network structures such as TransReid without additional resource requirements.

Analyze the reasons for improved performance

The experimental data on the four public datasets show that our method improves the model's recognition rate. To understand why the recognition rate improves on specific images, we use the Grad-CAM method⁵⁶ to visualize the attention map of the input image. The visualization results are shown in Figure 7.
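For readers unfamiliar with this visualization, the core Grad-CAM computation can be sketched as below. This is a generic illustration with a hypothetical function name and toy inputs: it operates on plain channel maps and their gradients, and how the feature maps are obtained from the transformer tokens is not described in the text and is not modeled here.

```python
def grad_cam(activations, gradients):
    """Combine K channel maps (each H x W) into one normalized heatmap.

    Each channel is weighted by the mean of its gradient map, the weighted
    channels are summed, negative values are clipped, and the result is
    scaled so that its maximum is 1.
    """
    heatmap = None
    for act, grad in zip(activations, gradients):
        cells = [g for row in grad for g in row]
        weight = sum(cells) / len(cells)  # global-average-pooled gradient
        weighted = [[weight * a for a in row] for row in act]
        if heatmap is None:
            heatmap = weighted
        else:
            heatmap = [[h + w for h, w in zip(h_row, w_row)]
                       for h_row, w_row in zip(heatmap, weighted)]
    heatmap = [[max(0.0, v) for v in row] for row in heatmap]  # ReLU
    peak = max(max(row) for row in heatmap)
    if peak > 0.0:
        heatmap = [[v / peak for v in row] for row in heatmap]
    return heatmap


# Two toy 2 x 2 channels; the second has a negative gradient and is suppressed.
toy_activations = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 1.0], [1.0, 0.0]]]
toy_gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
toy_heatmap = grad_cam(toy_activations, toy_gradients)
```

Regions that stay bright after the clipping step are those whose features push the predicted identity score up, which is how the highlighted clothing and bag regions in the figure should be read.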

Figure 7.

Heatmap analysis of experimental data. We give a total of six sets of experimental data, and each set of experimental data has three pictures. These three images visually represent the performance comparison between the TransReid model and the proposed method. By analyzing the heatmaps, we can observe the strengths and weaknesses of each approach in identifying and localizing persons of interest in the image.

Figure 7 contains six sets of experimental data. In the first set, our model and the TransReid model focus on similar regions, namely the upper-body clothes and the backpack, but the heatmap shows that our model attends to this area more strongly. In the second set, the two models focus on different regions. Our model mainly focuses on the upper-body clothes, especially the logo on the top, which is more reasonable than focusing on plain black pants. In the third set, both models attend to the black pants, but our model also attends to the markings on the upper-body clothes. In the fourth set, the green area in the middle of the dress and the shoes stand out to the human eye, and our model captures these features. The advantages of our model are most evident in the fifth and sixth sets. First, our model produces more pronounced heatmap regions in both sets. Second, the red bag in the fifth set is a very prominent cue, and attending to it is reasonable. In the sixth set, our model attends to the distinctive texture of most of the clothes, especially the right shoulder strap of the backpack, whose outline is even visible in the heatmap. These examples suggest that our model improves the recognition rate because it attends more to the distinctive regions of pedestrian images, which is closer to how the human eye works.

Conclusion

This paper builds on TransReid-like network structures and makes full use of their global and local branches. It introduces online self-distillation without adding any resource requirements. We compare the possible online self-distillation schemes experimentally in detail, and the best scheme effectively improves the recognition rate of network structures such as TransReid on four public person re-identification datasets. Second, through theoretical analysis and experimental verification, we show that, at least on these four datasets and provided that a BN layer precedes softmax, knowledge distillation with the MSE loss function performs about the same as knowledge distillation with KL divergence. Because it needs no temperature search, we consider the MSE loss function the simpler and more effective choice. In addition, we decide whether to add noise to the network weights by checking whether the recognition rate on the training set approaches 100%, and we conduct a series of experiments on the magnitude of the added noise, which further improves recognition performance. By combining these improvements, we achieve state-of-the-art recognition performance on four publicly available person re-identification datasets.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Chongqing College of Electronic Engineering Project (grant number XJWT202107).

ORCID iD

Wenjie Chen

Author biographies

Wenjie Chen received a BS degree from the School of Environmental Engineering, Nanchang University, Nanchang, China, in 2008 and an MS degree from the School of Technology of computer application, Nanchang University, in 2012. He is currently a lecturer at the Chongqing College of Electronic and Engineering, China. His research interests include computer vision, machine learning, and deep learning.

Kuan Yin received an MS degree in Computer Science and Technology from the School of Computer Science, Sichuan Normal University in 2020. Currently, he is pursuing a PhD degree in Computer Science and Technology at Chongqing University of Posts and Telecommunications. His research interests encompass pattern recognition and intelligent systems, object tracking and detection, real-time image processing, and embedded systems.

Yongsheng Wu is a student at Chongqing College of Electronic and Engineering, China. His research interests include computer vision and deep learning.

Yunbing Hu received an MS degree from Chongqing University in 2007. He is currently pursuing a PhD degree with Xiamen University in an Australia-China Joint Cultivation Program. He has experience working for higher education institutes. His research interests are indoor navigation and multi-source positioning, and navigation.

References

1. Dosovitskiy, Beyer, Kolesnikov, et al. An image is worth 16×16 words: transformers for image recognition at scale. ArXiv 2021; 2010: 11929.
2. Vaswani, Shazeer, Parmar, et al. Attention is all you need. Adv Neural Inf Process Syst 2017; 30: 6000–6010.
3. Zheng, Shen, Tian, et al. Scalable person re-identification: a benchmark. In: IEEE International Conference on Computer Vision (ICCV), 2015, pp.1116–1124. doi:10.1109/ICCV.2015.133.
4. Zheng, Yang. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In: Proceedings of the IEEE international conference on computer vision, 2017, pp.3754–3762. doi:10.1109/ICCV.2017.405.
5. Zhao, Xiao. Deepreid: deep filter pairing neural network for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp.152–159. doi:10.1109/CVPR.2014.27.
6. Wei, Zhang, Gao. Person transfer GAN to bridge domain gap for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp.79–88. doi:10.1109/CVPR.2018.00016.
7. Zhu, Gong. Harmonious attention network for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp.2285–2294. doi:10.1109/cvpr.2018.00243.
8. Shen, Lin, et al. Deep learning for person re-identification: a survey and outlook. IEEE Trans Pattern Anal Mach Intell 2022; 44: 2872–2893.
9. Zheng, Gong, Xiang. Person re-identification by probabilistic relative distance comparison. In: IEEE Conf Comput Vis Pattern Recognit, 2011, pp.649–656. doi:10.1109/CVPR.2011.5995598.
10. Zhao, Tian, Sun, et al. Spindle net: person re-identification with human body region guided feature decomposition and fusion. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp.1077–1085. doi:10.1109/CVPR.2017.103.
11. Karanam, Radke. Person re-identification with discriminatively trained viewpoint invariant dictionaries. Int Conf Comput Vis, 2015, pp.4516–4524. doi:10.1109/ICCV.2015.513.
12. Bak, Zaidenberg, Boulay. Improving person re-identification by viewpoint cues. AVSS, 2014, pp.175–180. doi:10.1109/AVSS.2014.6918664.
13. Huang, Zhang, et al. Adversarially occluded samples for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp.5098–5107. doi:10.1109/CVPR.2018.00535.
14. Hou, Chang, et al. VRSTC: occlusion-free video person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2019, pp.7183–7192. doi:10.1109/CVPR.2019.00735.
15. Cho, Yoon K-J. Improving person re-identification via pose-aware multi-shot matching. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp.1354–1362. doi:10.1109/CVPR.2016.151.
16. Sarfraz, Schumann, Eberle. A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp.420–429. doi:10.1109/CVPR.2018.00051.
17. Hoffer, Ailon. Deep metric learning using triplet network. SIMBAD, October 12-14, 2015, Proceedings 3 (pp. 84–92).
18. Farenzena, Bazzani, Perina, et al. Person re-identification by symmetry-driven accumulation of local features. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2010, pp.2360–2367. doi:10.1109/CVPR.2010.5539926.
19. Zhao, Ouyang, Wang. Learning mid-level filters for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp.144–151. doi:10.1109/CVPR.2014.26.
20. Zhao, Ouyang, Wang. Person re-identification by salience matching. In: Proceedings of the IEEE international conference on computer vision, 2013, pp.2528–2535. doi:10.1109/ICCV.2013.314.
21. Chang, Liang, et al. Learning locally-adaptive decision functions for person verification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp.3610–3617. doi:10.1109/CVPR.2013.463.
22. Sun, Zheng, Yang, et al. Beyond part models: person retrieval with refined part pooling. In: Proceedings of the European conference on computer vision (ECCV), 2018, pp.501–518. doi:10.1007/978-3-030-01225-0_30.
23. Wang, Yuan, Chen, et al. Learning discriminative features with multiple granularities for person re-identification. In: Proceedings of the 26th ACM international conference on Multimedia, 2018, pp.274–282. doi:10.1145/3240508.3240552.
24. Chen, Fan. Person re-identification based on partition adaptive network structure and channel partition weight adaptive. IEEE ACCESS 2021; 9: 101402–101413.
25. Kviatkovsky, Adam, Rivlin. Color invariants for person reidentification. IEEE Trans Pattern Anal Mach Intell 2013; 35: 1622–1634.
26. Liu, Gong, Loy. Person re-identification: what features are important? In: Computer Vision–ECCV, 2012, pp.391–401. doi:10.1007/978-3-642-33863-2_39.
27. Jurie. Local descriptors encoded by fisher vectors for person re-identification. In: Computer Vision–ECCV, 2012, pp.413–422. doi:10.1007/978-3-642-33863-2_41.
28. Wang, Doretto, Sebastian, et al. Shape and appearance context modeling. In: Conference on computer vision, 2007, pp.1–8. doi:10.1109/iccv.2007.4409019.
29. Barhillel, Hertz, Shental, et al. Learning a Mahalanobis metric from equivalence constraints. J Mach Learn Res 2005; 6: 937–965.
30. Liao. Efficient PSD constrained asymmetric metric learning for person re-identification. In: Proceedings of the IEEE international conference on computer vision, 2015, pp.3685–3693. doi:10.1109/ICCV.2015.420.
31. Paisitkriangkrai, Shen, Hengel AVD. Learning to rank in person re-identification with metric ensembles. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp.1846–1855. doi:10.1109/CVPR.2015.7298794.
32. Zhang, Chen, Saligrama. Group membership prediction. Int Conf Comput Vis, 2015, pp.3916–3924. doi:10.1109/ICCV.2015.446.
33. Koestinger, Hirzer, Wohlhart, et al. Large scale metric learning from equivalence constraints. In: IEEE conference on computer vision and pattern recognition, 2012, pp.2288–2295. doi:10.1109/CVPR.2012.6247939.
34. Xiong, Gou, Camps. Person re-identification using kernel-based metric learning methods. Comput Vis ECCV, 2014, pp.1–16. doi:10.1007/978-3-319-10584-0_1.
35. Luo, Wang, et al. TransReid: transformer-based object re-identification. In: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp.15013–15022. doi:10.1109/iccv48922.2021.01474.
36. Hinton, Vinyals, Dean. Distilling the knowledge in a neural network. ArXiv 2015; 1503: 02531.
37. Romero, Ballas, Kahou, et al. FitNets: hints for thin deep nets. ArXiv 2014; 1412: 6550.
38. Zagoruyko, Komodakis. Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. ArXiv 2016; 1612: 03928.
39. Tung, Mori. Similarity-preserving knowledge distillation. In: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp.1365–1374. doi:10.1109/iccv.2019.00145.
40. Ahn, Damianou, et al. Variational information distillation for knowledge transfer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp.9163–9171. doi:10.1109/cvpr.2019.00938.
41. Tian, Krishnan, Isola. Contrastive representation distillation. ArXiv 2019; 1910: 10699.
42. Song, et al. Online knowledge distillation for efficient pose estimation. In: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp.11740–11750. doi:10.1109/iccv48922.2021.01153.
43. Caruana. Do deep nets really need to be deep? ArXiv 2014; 3: 2654–2662.
44. Zhang, Xiang, Hospedales. Deep mutual learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp.4320–4328. doi:10.1109/cvpr.2018.00454.
45. Lan, Zhu, Gong. Knowledge distillation by on-the-fly native ensemble. ArXiv 2018; 1806: 04606.
46. Touvron, Cord, Douze, et al. Training data-efficient image transformers & distillation through attention. ArXiv 2012; 2012: 12877.
47. Carion, Massa, Synnaeve, et al. End-to-end object detection with transformers. ArXiv 2020; 2005: 12872.
48. Zhang, Tang, et al. Feature pyramid transformer. ArXiv 2020; 2007: 09451.
49. Shen, Yang, et al. Self-distillation from the last mini-batch for consistency regularization. ArXiv 2022; 2203: 16172.
50. Kim, Yoon, et al. Self-knowledge distillation with progressive refinement of targets. ArXiv 2020; 2006: 12000.
51. Miao, Liu, et al. Pose-guided feature alignment for occluded person re-identification. In: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp.542–551. doi:10.1109/iccv.2019.00063.
52. Chen, Zhao, et al. Salience-guided cascaded suppression network for person re-identification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp.3300–3310. doi:10.1109/cvpr42600.2020.00336.
53. Chen, Ding, Xie, et al. ABD-Net: attentive but diverse person re-identification. In: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp.8351–8361. doi:10.1109/iccv.2019.00844.
54. Wang, Yang, Liu, et al. High-order information matters: learning relation and topology for occluded person re-identification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp.6449–6458. doi:10.1109/cvpr42600.2020.00648.
55. Zhu, Guo, Liu, et al. Identity-guided human semantic parsing for person re-identification. In: Computer Vision–ECCV, 2020, pp.346–363. doi:10.1007/978-3-030-58580-8_21.
56. Selvaraju, Cogswell, Das, et al. Grad-cam: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision, 2017, pp.618–626. doi:10.1109/iccv.2017.74.

Person re-identification based on multi-branch visual transformer and self-distillation
