Abstract
In the long-term deployment of mobile robots, appearance changes bring challenges for localization. When a robot travels to the same place or restarts from an existing map, global localization is needed, where place recognition provides coarse position information. For visual sensors, appearance changes such as the transition from day to night and seasonal variation can reduce the performance of a visual place recognition system. To address this problem, we propose to learn domain-unrelated features across extreme appearance changes, where a feature disentanglement network trained in an adversarial manner separates domain-unrelated place content from domain-related appearance.
Introduction
Place recognition is a vital ability for robots. Ranging from autonomous driving to flying robots for precision agriculture, different kinds of sensors are leveraged in different localization scenarios, among which cameras are gaining more and more popularity. The main advantage of imagery sensors is their low cost, compared to expensive light detection and ranging (LiDAR), inertial navigation systems (INS), etc. Visual localization has been studied for years, and many visual simultaneous localization and mapping (SLAM) systems have been proposed, 1,2 which achieve impressive performance under ideal conditions. Visual place recognition plays two roles in a SLAM system: (1) in the mapping stage, robots need to find loop closures so as to reduce drift and build a globally consistent map and (2) in the localization stage, localization may sometimes fail, as in the kidnapped robot problem, which also needs loop detection. Most loop detection algorithms first try to coarsely relocalize the robot against a known database using place recognition, followed by pose estimation. Thus, visual place recognition may affect mapping accuracy and relocalization success rate.
As the community pays more and more attention to scenes with appearance changes, such as the urban environment, challenges for visual place recognition also arise. A typical visual place recognition pipeline includes feature extraction, 3,4 feature matching, 5,6 and temporal fusion, 7,8 among which feature extraction is now a bottleneck. The recent success of deep learning has shown great potential for a neural network to become a robust feature extractor. Thus, this article tries to improve the feature extraction module using deep learned features. The main challenge for handcrafted visual features is that they are sensitive to appearance changes, such as the shift from day to night and seasonal transitions. Some methods exploit deep learned features from supervised learning. 9,10 When testing data are similar to training data, these supervised methods perform very well. However, they require massive manually labeled data, which are labor-intensive and time-consuming to obtain. That may be unsuitable for some place recognition scenarios. Thus, self-supervised or unsupervised methods are preferred. The autoencoder learns features in a self-supervised way, and the output of the encoder is shown to be suitable for place recognition. 11 In that work, training data and testing data are quite different; therefore, it pays more attention to generalization ability. Compared to supervised learning, it sacrifices performance to obtain generalization. To find a good balance between performance and generalization, some researchers assume that training and testing data share some similar properties, such as the same place, the same sensors, and similar illumination, but without position-level alignment. 12,13 This assumption is reasonable in some applications, such as inspection robots working in the same place at different times. ToDayGAN translates nighttime images to a daytime style using a generative adversarial network (GAN), and the generated images are used to match with images captured in the daytime. 12 Although nighttime images in the training and testing phases look similar, they still have subtle appearance differences because they are captured at different times. Our article also targets this setting. The main difference is that features in ToDayGAN are extracted from the translated images after image translation, while our method directly puts constraints on the features.
In this article, we set out to construct a neural network that can explicitly disentangle domain-unrelated and domain-related content. A data modeling method is presented for visual place recognition, and a self-supervised feature learning method is proposed to disentangle the domain-unrelated and domain-related content from multiple-domain images. A disentangled feature learning network based on adversarial learning is proposed, which can be extended to multiple domains without increasing model complexity. This makes our network feasible for applications with limited resources. Two toy case studies are carried out to validate our feature disentanglement method with qualitative and quantitative results. We also try to interpret our network in this part. Experiments for place recognition are conducted on three public datasets and one newly proposed dataset. Our method shows favorable performance. We also open the source code for reproduction (https://github.com/dawnos/fdn-pr).
This article is an extended study based on our previous work. 15 One additional contribution is that we improve our network to obtain higher performance by reconstructing high-level image features instead of the original image. Another is that we try to interpret our network through theoretical discussion and an ablation study. Finally, more datasets and comparison methods are employed to verify the proposed disentanglement method.
The remainder of this article is organized as follows: related work is discussed and summarized in the second section. Then in the third section, we present the data model used in this article. Our method will be presented thoroughly in the fourth section. We will introduce the experiments in detail and show results in the fifth section. The conclusion will be made in the sixth section.
Related work
A typical pipeline for visual place recognition includes (1) feature extraction and (2) matching, 5,6,16–21 optionally followed by (3) temporal fusion. 7,8,22,23 This article focuses on the feature extraction module.
Handcrafted features
In the early years, handcrafted features were used for place recognition. These features can be classified into two categories, namely global features and local features. Methods based on handcrafted global features try to assign appearance-invariant features to each image directly. In some early works, histogram of oriented gradients (HOG) 24 features are used as descriptors to compute the distance between database and query images, where the gradient is able to overcome simple illumination changes. 7 Later, the GIST descriptor, which represents the high-frequency part of an image, is also leveraged because the human eye is more sensitive to it. 25 On the other hand, local-feature-based methods first extract a set of local features and then aggregate them into a global feature. In place recognition, performance is limited by the choice of local features, thus local features with appearance-invariant properties are preferred. Gradient-based local features such as Scale-Invariant Feature Transform (SIFT) 3 and Oriented FAST and Rotated BRIEF (ORB) 4 are found to be robust to small appearance changes. In particular, the ORB descriptor is used in the place recognition (loop closure) module of the widely used ORB-SLAM system. 26 In addition, different aggregating algorithms are designed to generate global descriptors, such as bag of visual words 27 and Vector of Locally Aggregated Descriptors (VLAD). 28 Due to the limited robustness of handcrafted features, these methods are sensitive to extreme appearance changes and thus not preferable in visual place recognition.
Supervised features
After the great success of deep neural networks in the computer vision area in recent years, researchers have started to explore how visual place recognition can benefit from deep learning. In one of the earliest trials, different layers in AlexNet 29 are reported to have different place recognition performances. 30 The authors find that features from the middle layers are more robust against changing appearance. However, the AlexNet features are not good enough, because the network is pretrained on the ImageNet Large Scale Visual Recognition Challenge dataset, 31 which is different from the testing set. 11 To go further, different supervised methods are proposed, which constrain the extracted features of images from the same place to be similar. 9,10,32–35 One way is to use a classification network for place recognition where each place is a class, and this method can achieve comparable results. 9 The network is trained and tested on a dataset captured from static cameras at different times. This work does not need manual labeling, and it shows the potential for a neural network to distinguish places. However, as the places are fixed once training is finished, it cannot be extended to other scenes. To make the network more flexible, NetVLAD improves the competitive aggregating method VLAD 28 by using soft assignment, which makes it a differentiable module. 10 With this module, the feature network can be optimized by a triplet loss. Labeled data from Google Street View Time Machine are needed to construct training tuples. Besides, another study assigns multiple images to one place and fuses their features together to boost performance, using two datasets aligned by Global Positioning System (GPS). 32 In addition to improving the features, finding useful regions for place recognition can also help, such as stable regions of interest 36–38 and attention maps. 39–41 These supervised methods achieve impressive results, but the requirement for massive labeled data may be hard to fulfill in fast deployment.
It is also possible to make the extracted features more robust by postprocessing before the feature matching module. 42,43 For example, the original AlexNet features 30 can be further improved by applying principal component analysis (PCA) to those features and keeping only the components with small eigenvalues, because the principal components with large eigenvalues represent variations in the images. 42 Additionally, quantizing the features using hashing can speed up the matching process. 43 These methods are not competitors of our method; instead, they can be added to any feature extractor to obtain higher performance.
Self-supervised features
Another branch of machine learning methods, namely self-supervised learning, does not rely on aligned data. The autoencoder 44 is a popular self-supervised network architecture, which is used in many place recognition studies. 11,45–47 The output features of the encoders exhibit robustness in place recognition. The robustness of the learned features can be improved by reconstructing corrupted input images 47 or by reconstructing the HOG of the input images. 11 One advantage of the latter is that its training set is different from the testing set, demonstrating favorable generalization.
In recent years, adversarial learning has been getting more and more attention in the self-supervised learning literature. Inspired by work on style transfer 13,48,49 using GANs, 50 some researchers try to transfer query images to match the style of database images, followed by local feature matching, global descriptor matching, or dense matching. 12,51–54 For example, ToDayGAN transfers nighttime images into daytime ones, extracts features using DenseVLAD, 55 and finally matches them with daytime images in the database. 12 Each model in these methods targets the two domains used in the training phase. When new domains are added, new models are needed, and the number of models increases rapidly. Another method enhances the pretrained NetVLAD 10 with semantic information, which is shown to be viewpoint-invariant. 56 It demonstrates that appearance-based descriptors, such as ours, can be improved to overcome changing viewpoints with this technique.
Data modeling
In this section, we formulate our problem and define disentangled representation. The input data are modeled as a multiple-domain generative process, and the feature extraction module is modeled as an inference process. These two processes are summarized in Figure 1. Based on these, we derive our definition at the end of this section.

Figure 1. The generative process and the inference process.
Generative process
Images are the reflections of the world. Let
In our setting, images are from different
Any image
Under the multi-domain setting, latent space
where
Given
Images taken from the same place at different times with different appearances share the same
Inference process
Our goal is to find two functions
They are estimated from the inference processes
As
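As a compact illustration of the two processes (with assumed notation: an image x, domain-unrelated place content z_p, domain-related appearance z_a, and a domain index d; these symbols are ours, chosen for exposition, not the article's original ones), they can be written as

    % Generative process: an image is rendered from shared place content and
    % domain-dependent appearance (d indexes the domain)
    x = g(z_p, z_a), \qquad z_p \sim p(z_p), \quad z_a \sim p(z_a \mid d)
    % Inference process: two functions recover estimates of the two factors
    \hat{z}_p = f_p(x), \qquad \hat{z}_a = f_a(x)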
Adversarial disentangled feature learning
In this section, we present our feature disentanglement network in detail, including the network architecture and the loss functions, which are summarized in Figure 2. Finally, we introduce how to train the network and extend it to multiple domains.

Figure 2. Network architecture.
Network architecture
The motivation of the proposed network is to extract disentangled features from given images. To achieve this, we explicitly constrain the features to fulfill the disentanglement requirement.
Firstly, our method uses an autoencoder as the feature extractor. The autoencoder is a widely used self-supervised machine learning method, where the encoder encodes the input as a feature and the decoder tries to reconstruct the original input from the feature. Our encoder is composed of two parts, namely the place encoder
A pure autoencoder is not enough for disentanglement, as there is no constraint for
where
Our autoencoder follows the widely used bottleneck architecture, where the input is downsampled into two smaller feature maps by the encoders, while the decoder tries to recover the input from those two feature maps. The discriminators take combinations of those feature maps as input and produce a one-dimensional output (Figure 2). The detailed architectures for the different experiments are listed in the appendix.
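A minimal PyTorch-style sketch of this bottleneck layout is given below; the module names (Encoder reused for the place and appearance encoders, Decoder, and a shared Discriminator) and all layer sizes are illustrative assumptions, not the exact configurations listed in the appendix.

    import torch
    import torch.nn as nn

    def conv_block(c_in, c_out):
        # stride-2 convolution for downsampling, as in the bottleneck encoders
        return nn.Sequential(nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                             nn.InstanceNorm2d(c_out), nn.LeakyReLU(0.2))

    class Encoder(nn.Module):
        """Used twice: once as the place encoder and once as the appearance encoder."""
        def __init__(self, c_out):
            super().__init__()
            self.net = nn.Sequential(conv_block(3, 32), conv_block(32, 64),
                                     nn.Conv2d(64, c_out, 3, padding=1))
        def forward(self, x):
            return self.net(x)          # a downsampled feature map

    class Decoder(nn.Module):
        """Reconstructs the input from the concatenated place and appearance feature maps."""
        def __init__(self, c_p, c_a):
            super().__init__()
            self.net = nn.Sequential(
                nn.ConvTranspose2d(c_p + c_a, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 3, 3, padding=1))
        def forward(self, z_p, z_a):
            return self.net(torch.cat([z_p, z_a], dim=1))

    class Discriminator(nn.Module):
        """Takes a combination of feature maps and outputs a single score."""
        def __init__(self, c_in):
            super().__init__()
            self.net = nn.Sequential(conv_block(c_in, 64), conv_block(64, 128),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1))
        def forward(self, feats):
            return self.net(torch.cat(feats, dim=1))

Because the two discriminators only differ in which feature maps are concatenated at their input, the same Discriminator class can serve for both roles described below.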
Autoencoder
To measure the reconstruction quality, different distances can be used. In the previous version, we used L2 loss for reconstruction, which can be expressed as follows 15
where
where
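As a concrete illustration of reconstructing high-level image features rather than raw pixels, the sketch below compares the reconstruction and the input in the feature space of a frozen pretrained network; the choice of VGG-16 and of the layer cut-off is an assumption made here for illustration, not necessarily the backbone used in the article.

    import torch.nn as nn
    from torchvision.models import vgg16

    class PerceptualLoss(nn.Module):
        """L2 distance between high-level features of a frozen pretrained network."""
        def __init__(self, n_layers=16):
            super().__init__()
            self.features = vgg16(weights="IMAGENET1K_V1").features[:n_layers].eval()
            for p in self.features.parameters():
                p.requires_grad = False        # the feature extractor itself is not trained

        def forward(self, x_rec, x):
            return nn.functional.mse_loss(self.features(x_rec), self.features(x))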
Appearance compatibility discriminator
As pointed out in the last section, disentanglement requires that
During training,
where
Encoders
It is worth noticing that
Equations (9) and (10) are formulated for the case that the first input image is sampled from
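One way to read the appearance-compatibility game in code is sketched below, under the assumption that the discriminator D_a judges whether a place/appearance feature pair was extracted from a single image or mixed from two images; the pairing scheme and loss form are illustrative, not a reproduction of the article's equations (9) and (10).

    import torch
    import torch.nn.functional as F

    def d_a_step(D_a, z_p1, z_a1, z_p2, z_a2):
        # Discriminator step: same-image pairs are "real", mixed pairs are "fake".
        real = D_a([z_p1.detach(), z_a1.detach()])
        fake = D_a([z_p1.detach(), z_a2.detach()])
        return F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) + \
               F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake))

    def enc_compat_step(D_a, z_p1, z_a2):
        # Encoder step: make mixed pairs indistinguishable from same-image pairs,
        # so the appearance feature carries no place-specific content.
        fake = D_a([z_p1, z_a2])
        return F.binary_cross_entropy_with_logits(fake, torch.ones_like(fake))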
Place domain discriminator
With the appearance compatibility discriminator, we can only ensure that
Supervised methods like NetVLAD 10 constrain features from the same place with different appearances to be the same, which requires alignment information. But as we hope to remove the dependency on aligned data, we accomplish this in a different way. Constraining
where
Similarly, we can have
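A sketch in the same spirit is given below, under the assumption that the place domain discriminator D_p receives a place feature paired with an appearance feature and is trained to tell whether the two come from images of the same domain, while the place encoder tries to make cross-domain pairs indistinguishable; this is an illustrative reading rather than the article's exact loss, and it reuses the imports and interfaces of the previous sketch.

    def d_p_step(D_p, z_p_a, z_a_a, z_a_b):
        # z_p_a and z_a_a come from domain-A images; z_a_b comes from a domain-B image.
        same = D_p([z_p_a.detach(), z_a_a.detach()])     # same-domain pair
        cross = D_p([z_p_a.detach(), z_a_b.detach()])    # cross-domain pair
        return F.binary_cross_entropy_with_logits(same, torch.ones_like(same)) + \
               F.binary_cross_entropy_with_logits(cross, torch.zeros_like(cross))

    def enc_domain_step(D_p, z_p_a, z_a_b):
        # Place encoder step: if the place feature carries no domain cue,
        # D_p cannot tell cross-domain pairs from same-domain ones.
        cross = D_p([z_p_a, z_a_b])
        return F.binary_cross_entropy_with_logits(cross, torch.ones_like(cross))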
Training strategies
The discriminators (
where
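Putting the pieces together, one alternating training iteration can be sketched as follows, reusing the modules and loss helpers sketched above; the weights w_rec, w_a, and w_p stand in for the article's hyperparameters and their values here are placeholders.

    # One training iteration: first update the discriminators on detached features,
    # then update the encoders and decoder on the weighted sum of losses.
    def train_step(x_a, x_b, E_p, E_a, Dec, D_a, D_p, perceptual, opt_d, opt_g,
                   w_rec=1.0, w_a=0.1, w_p=0.1):
        z_p_a, z_a_a = E_p(x_a), E_a(x_a)      # features of a domain-A image
        z_p_b, z_a_b = E_p(x_b), E_a(x_b)      # features of a domain-B image

        # 1) discriminator update (features are detached inside the loss helpers)
        loss_d = d_a_step(D_a, z_p_a, z_a_a, z_p_b, z_a_b) + d_p_step(D_p, z_p_a, z_a_a, z_a_b)
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # 2) encoder/decoder update: reconstruction plus the two adversarial terms
        rec = perceptual(Dec(z_p_a, z_a_a), x_a) + perceptual(Dec(z_p_b, z_a_b), x_b)
        adv = w_a * enc_compat_step(D_a, z_p_a, z_a_b) + w_p * enc_domain_step(D_p, z_p_a, z_a_b)
        loss_g = w_rec * rec + adv
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()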
During training, the training datasets are augmented to increase robustness against viewpoint changes. For each image, we randomly select four points, and the enclosed area is cropped and warped back to the original size. 11 Besides, the warped image is randomly flipped horizontally.
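A sketch of this crop-and-warp augmentation with OpenCV is given below; the corner-sampling margin and the flip probability are assumed values.

    import random
    import cv2
    import numpy as np

    def augment(img, margin=0.25):
        # Pick one random point near each corner, warp the enclosed quadrilateral
        # back to the full image size, then flip horizontally with probability 0.5.
        h, w = img.shape[:2]
        def r():
            return random.uniform(0.0, margin)
        src = np.float32([[r() * w, r() * h],
                          [(1 - r()) * w - 1, r() * h],
                          [(1 - r()) * w - 1, (1 - r()) * h - 1],
                          [r() * w, (1 - r()) * h - 1]])
        dst = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
        warped = cv2.warpPerspective(img, cv2.getPerspectiveTransform(src, dst), (w, h))
        return cv2.flip(warped, 1) if random.random() < 0.5 else warped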
Extension: Multiple domain case
Based on this domain-unrelated architecture, we present how to extend our network to multiple domains. Assume that there are
This extension does not require additional parameters. To see it, one should note that the autoencoder is shared across different domains. Besides, the discriminators are also shared and domain-unrelated, which only use information
The extension enables that only one model is needed in a specific scene for long-term deployment. In the beginning, we have a baseline model trained on several domains. When new data with different appearances in the same area become available, the model can be fine-tuned by retraining on the dataset enhanced with the newly collected data. The retraining does not require additional parameters. In contrast, style-transfer-based methods need new models to transfer new data into known styles. When new data come periodically, this leads to a quadratically increasing number of parameters. To see this, one can assume that there are n domains: a pairwise style-transfer approach needs a model for every pair of domains, that is, on the order of n(n - 1) models, whereas our method always keeps a single model.
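As a worked example, treating the four seasons of Partitioned Nordland as n = 4 domains, a per-pair approach needs 4 x 3 = 12 models, the same count as the 12 two-domain models trained in the multiple-domain experiment below, while the extended network keeps a single model.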
Experiments
We conduct several experiments to illustrate our method. Firstly, we validate the proposed network with two toy cases, a linear Gaussian model and colored MNIST.
Toy case validation
Linear Gaussian
We test the network on a linear Gaussian generative process with two domains to validate whether it can produce disentangled representations as desired. The reason for choosing the linear model is that, in the linear case, we can use correlation as a quantitative metric for disentanglement. The generative process can be written as follows
where
The encoders
To demonstrate the power of the two proposed adversarial losses, we compute the correlation matrix between the place feature
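The disentanglement metric can be sketched as follows: compute the Pearson correlation between each recovered feature dimension and each ground-truth latent dimension, and check that the place feature is (nearly) uncorrelated with the appearance latent. Variable names and shapes here are assumptions for illustration.

    import numpy as np

    def correlation_matrix(feat, latent):
        # feat: (N, d_f) recovered features; latent: (N, d_z) ground-truth factors.
        f = (feat - feat.mean(0)) / (feat.std(0) + 1e-8)
        z = (latent - latent.mean(0)) / (latent.std(0) + 1e-8)
        return f.T @ z / len(feat)      # entry (i, j) is corr(feat_i, latent_j)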
Figure 3(a) to (d) shows the case that

Figure 3. Correlation matrices for the linear toy case. Each subfigure is composed of two correlation matrices: correlation matrix between
Colored MNIST
The linear Gaussian model is a simple toy case. To show the power of our method in more complicated scenes, we propose
To investigate what the network has learned from this toy case, we visualize the place feature and appearance feature using t-distributed stochastic neighbor embedding (t-SNE). 66
As the dimensions of our features (1024 for

Figure 4. Visualization of the learned features on colored MNIST. Left: t-SNE visualization of place feature (Figure 4(a)) and appearance feature (Figure 4(b)) and right: image translation. (a)
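The projection used for these plots can be reproduced with an off-the-shelf t-SNE, for example as sketched below; the file name and the perplexity value are placeholders.

    import numpy as np
    from sklearn.manifold import TSNE

    feats = np.load("place_features.npy")   # hypothetical dump of stacked place (or appearance) features
    embedding = TSNE(n_components=2, perplexity=30.0, init="pca").fit_transform(feats)
    # `embedding` is an (N, 2) array that can be scattered and colored by digit class or by domain.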
We also try to analyze our network in image space. As done in some image-to-image translation literature, 67 we can replace the appearance features with those from another domain and let the decoder generate an image with a different style. We implement this in two ways. Firstly, given two colored digits as inputs, we combine the place feature from the first digit and the appearance feature from the second digit and reconstruct a new digit through the decoder. The newly recovered digit is called the translated digit.
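In terms of the modules sketched earlier, this cross-combination (and the all-zero appearance variant used later for Figure 6(d)) amounts to:

    # Swap the appearance feature between two inputs, or zero it out, and decode.
    z_p = E_p(x1)                                      # place content of the first image
    x_translated = Dec(z_p, E_a(x2))                   # same place, appearance borrowed from the second image
    x_zero_app = Dec(z_p, torch.zeros_like(E_a(x1)))   # place content with an all-zero appearance feature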
Datasets
To validate the proposed method on the visual place recognition task, we test our network on three public datasets: the Partitioned Nordland dataset, 32 the Alderley Day/Night dataset, 7 and the RobotCar-Seasons dataset. 68 Besides, we also experiment on a new dataset, the YQ Day/Night dataset, which is collected in a campus environment with changing appearance and will be released to the public (https://tangli.site/projects/academic/yq21).
Partitioned Nordland dataset
It is collected from a train on the same route (729 km) in four seasons (spring, summer, fall, and winter) with GPS data. In this article, the four seasons are treated as four domains. The original Nordland dataset is used by many studies, 69,42 but they use different partitions for training and testing. This dataset proposes a reasonable partition of Nordland, where the whole route is partitioned into five segments, with two as the training set and three as the testing set. The training and testing sets have 24,570 and 3450 images for each domain, respectively. The GPS information is accurate enough to align the different domains. This dataset provides images without perspective changes and high-quality ground truth, thus it is useful for validating the ability to overcome appearance changes (Figure 5(a)).

Alderley Day/Night dataset
It is captured from a camera mounted on a car and consists of two domains, one daytime and one nighttime, over an 8 km journey. As mentioned in their article, the GPS data are not reliable enough, thus the images are manually aligned frame by frame. 7 However, as the images are collected in an urban environment, the vehicle moves with lateral and heading changes, which makes the alignment less reliable. Besides, the appearance differences between the two sessions are very large, making the dataset challenging for place recognition. From Figure 5(b), one can see that sometimes it is even challenging for humans. Each domain in the dataset is split into a training and a testing set with 10,007 and 4600 images, respectively. 32
RobotCar-Seasons dataset
It is based on the Oxford RobotCar dataset, which is recorded on a vehicle with six cameras under different conditions in an urban environment, with other sensors including INS, GPS, and LiDAR. 70
RobotCar-Seasons selects a subset of RobotCar dataset, with one reference traversal in overcast condition (
YQ Day/Night dataset
It is a subset of the YQ dataset. 62 The original YQ dataset has 21 sessions, out of which we choose two for the place recognition task in this article. These two sessions become the two domains of this dataset, namely day and night. The first session is collected in the morning, while the second one is collected in the evening (Table 1). The evening traversal is collected at the time when day is turning to night. Although this traversal only lasts for about 19 min, it contains distinct appearances. Thus, it can be used to validate the robustness of algorithms against dynamic appearance changes over a short period. Sample images are shown in Figure 5(d). The ground truth is obtained from LiDAR SLAM results. We split the trajectory into two folds, with the first 60% as the training set and the last 30% as the testing set. The training set is sampled every 0.1 m, and the testing set is sampled every 2 m. As the data are recorded on a mobile robot driven by a remote controller at low speed, it can be seen as representative of mobile logistics in a small area, with appearance and perspective changes.
Details of YQ Day/Night dataset.
Evaluation metrics and training details
We choose two widely used metrics for visual place recognition in the following experiments (except for RobotCar-Seasons, which will be discussed later): area under curve (AUC) and accuracy (true positive rate). After training is done, the place features are used as global descriptors for matching. The goal of visual place recognition is to find the database image nearest to a given query image.
Before computing the distances, all the features, including
Our network is trained with
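A sketch of the matching step used for evaluation is given below; the L2 normalization, the use of cosine similarity, and the frame tolerance are assumptions made for illustration rather than the article's exact protocol.

    import numpy as np

    def top1_accuracy(db_feat, q_feat, gt_index, tol=1):
        # L2-normalize descriptors, retrieve each query's nearest database image by
        # cosine similarity, and count a hit when it lies within `tol` frames of
        # the ground-truth index.
        db = db_feat / np.linalg.norm(db_feat, axis=1, keepdims=True)
        q = q_feat / np.linalg.norm(q_feat, axis=1, keepdims=True)
        nearest = np.argmax(q @ db.T, axis=1)
        return float(np.mean(np.abs(nearest - gt_index) <= tol))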
Comparison methods
To illustrate the performance of our method on visual place recognition, several methods are selected for comparison. Three methods based on handcrafted features (DBoW2, HOG, and DenseVLAD) are chosen as representatives of traditional methods. 24,27,55 For supervised methods, NetVLAD and the method by Facil et al. are used to show their advantages and limitations. 10,32 As the authors of NetVLAD have released their training code, we also retrain NetVLAD on these datasets to see the improvement obtained from supervision. All settings follow the original NetVLAD article, except that when training on YQ Day/Night, the "nNegChoice," "nTestSample," and "nTestRankSample" are set to 20, 20, and 100, respectively, because the dataset is small. Besides, two style-transfer-based algorithms are also selected. 12,71
The RobotCar-Seasons dataset uses different evaluation metrics. As described by the benchmark, it measures performance using the percentage of query images localized within three error tolerance thresholds. 68
Ablation study
Our method introduces two new adversarial losses to the autoencoder. To see whether the new losses work, we conduct an ablation study in this section. We select the most challenging pair in Partitioned Nordland, namely winter and spring, as the target domains. The experiment follows the same settings described in the last subsection, except for the hyperparameters
Results are displayed in Table 2. We can see that the complete network (with
Ablation study.
AUC: area under curve.
Two domains
This section investigates the performances of different methods in the scenario with two domains. For Partitioned Nordland dataset, only the winter and spring sessions are chosen as two domains.
Experimental results are listed in Tables 3 and 4. Table 3 presents the performance of different methods on three of the mentioned datasets: Partitioned Nordland, Alderley Day/Night, and YQ Day/Night. To evaluate our method on RobotCar-Seasons dataset, we submit our results to the Visual Localization Benchmark (https://www.visuallocalization.net/benchmark). In Table 4, we also consider several methods on that benchmark as a comparison (only published place recognition algorithms are selected).
Performance of different methods.
AUC: area under curve; HOG: histogram of oriented gradients; NA: not available because they do not use AUC as a criterion.
Performance on RobotCar-Seasons.
Red: best performance; blue: second-best performance.
From Table 3, we can see that methods based on handcrafted features are not good at place recognition tasks with extreme appearance changes. Among the learning-based methods, the supervised ones outperform the self-supervised ones on some datasets (Partitioned Nordland), as expected. However, our method is better than those supervised methods on other datasets (Alderley Day/Night and YQ Day/Night). There are two possible reasons. One is that supervised methods depend on the quality of alignment. As described before, the alignment of Alderley Day/Night is not reliable in some places, which may degrade the performance of supervised methods. Another reason is that the training and testing sets of YQ Day/Night have subtle differences, as described above, which demands generalization ability from the algorithms. Generally speaking, self-supervised methods outperform supervised methods in the sense of generalization.
Compared with other self-supervised methods, our method achieves favorable results. It shows the best performance on Partitioned Nordland, Alderley Day/Night, and YQ Day/Night. Combined with Table 4, we find that our method and ToDayGAN perform better than the other self-supervised methods on different datasets. One main difference between our method and ToDayGAN is that our adversarial learning is applied directly to features, while ToDayGAN works on images. The transferred images of ToDayGAN look realistic, but sometimes they still look different from images in the original style (the reader can see transferred images in the appendix). The reason is that in style transfer the same input image can generate different outputs. In particular, Partitioned Nordland has slight appearance differences within the same domain, resulting in mismatching problems. Conversely, the appearance within the same domain in RobotCar-Seasons is quite stable, which is why ToDayGAN outperforms our method on RobotCar-Seasons under some criteria.
To see the improvement brought by perceptual loss, we also trained our network with L2 reconstruction loss (
We also try to explore what the network learns by generating images from cross-domain features. Specifically, we feed place features from one domain (Figure 6(a)) together with appearance features from the other domain (Figure 6(b)) into our network to see what will be generated by the decoder

Figure 6. Translated and zero-appearance images of the Partitioned Nordland dataset. Each image in (c) is generated from the place feature of an image from (a) in the same column and the appearance feature of an image from (b) in the same column, while an image in (d) is generated from the place feature of (a) and an all-zero appearance feature. Columns 1 to 5:
Multiple domains
One novelty of our network is that it can be trained with multiple domains without additional parameters. The benefit is that when new data from a different domain arrive, we can retrain our network without increasing model capacity. We use the Partitioned Nordland dataset to illustrate this point as it has four domains. Firstly, for every pair of domains, we train a network and evaluate it on the testing set. In this process, we obtain 12 models in total. Secondly, we train a unified model with all four domains as the training set and then evaluate it on every pair of domains. In this stage, only one model is obtained.
Table 5 compares the two-domain models and the multiple-domain model. We can see that by fusing more domains, our network achieves performance comparable to the two-domain models. However, the two-domain setting needs 12 models, while the multiple-domain setting needs only one. This means our method can be extended to more domains while keeping the same model complexity, which is very useful when deploying deep learning networks in an environment with changing appearance.
Performance comparison between two-domain and multiple-domain models.a
AUC: area under curve.
a Each item corresponds to two-domain/multiple-domain.
Conclusion
We propose a feature disentanglement network for place recognition, which is composed of an autoencoder and two discriminators. By training the network in an adversarial manner, we can obtain domain-unrelated and domain-related features from multiple-domain data. The appearance compatibility discriminator enforces the appearance feature to be invariant to place content, while the place domain discriminator constrains the place feature to be robust against appearance changes. Qualitative and quantitative results on two two-domain toy cases demonstrate that our network is capable of obtaining disentangled representations. Experiments on four datasets show that our method achieves favorable performance on visual place recognition tasks. Additionally, our network can be extended to multiple domains without increasing model capacity or sacrificing performance. We also release a new place recognition dataset for the research community.
Appendices
Network architecture
Networks for the linear case, colored MNIST, and place recognition are listed in Tables 6 to 8, respectively. Conv-(N
Network for linear case.
LReLU: leaky rectified linear unit.
Network for colored MNIST.
ReLU: rectified linear unit.
Network for place recognition.
ReLU: rectified linear unit; IN: instance normalization; LN: layer normalization.
Samples of style transfer
Figure 7 shows sample outputs of ToDayGAN by Anoosheh et al., 12 including three datasets (Partitioned Nordland, Alderley Day/Night, and YQ Day/Night).

Figure 7. Image style transfer results of ToDayGAN. 12 Row 1: Partitioned Nordland dataset; row 2: Alderley Day/Night dataset; and row 3: YQ Day/Night dataset. (a) Spring; (b) winter; (c) winter to spring; (d) spring to winter; (e) day; (f) night; (g) night to day; (h) day to night; (i) day; (j) night; (k) night to day; and (l) day to night.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the National Natural Science Foundation of China under Grant 61903332 and in part by the Natural Science Foundation of Zhejiang Province under Grant LGG21F030012.
