Sketch face recognition based on domain adaptation scaled entropy meta-network

Abstract

In recent years, sketch face recognition has found wide application in law enforcement and criminal investigation. Deep learning plays a crucial role in recent developments in face recognition; however, it is challenging to employ deep learning methods for sketch face recognition because face photo–sketch data are scarce. Moreover, compared with photos, sketches lack detailed texture, and there is a domain gap between photos and sketches; hence, traditional homogeneous face recognition methods perform poorly on sketch face recognition. In this paper, a novel deep learning method termed Domain Adaptation Scaled Entropy Meta-Network (DASEMN) is proposed to tackle sketch face recognition. Specifically, a meta-learning training strategy is designed to tackle the few-shot problem and improve the generalization ability of the network. Then, a generalized entropy loss termed scaled mean entropy loss is proposed to guide the network to extract discriminative features. Finally, a domain adaptation module is introduced during training to reduce the domain gap between the sketch domain and the photo domain. Experiments on the UoM-SGFS and CUFSF sketch face databases show that the proposed method is superior to other sketch face recognition methods.

Keywords

Sketch face recognition, meta-learning, domain adaptation, generalized entropy loss

Introduction

With the rapid progress in computer vision, pattern recognition, and machine learning, face recognition¹ has achieved significant success in essential application fields, including video surveillance² and law enforcement.¹ As application scenarios multiply, collected face images span various modalities, such as photos, near-infrared images,³ thermal infrared images, line drawings,⁴ and sketches. Sketch face recognition⁵ refers to retrieving the matching photo from a large gallery set given a sketch. It is widely used in criminal investigation: when no photo of a suspect is available from the crime scene, the police have to produce a hand-drawn or composite sketch from an eyewitness description and match it against the gallery set.

Since photos are captured by digital devices while sketches are drawn by artists, there are significant modality variations between face photos and sketches. For example, sketches lack some important texture details of the forehead, chin, and cheeks compared with face photos. Furthermore, current publicly available sketch-photo datasets comprise only a small number of sketch-photo pairs, usually with one sketch per subject, which makes it harder for the network to learn robust and discriminative features.

To address the modality variation problem, several sketch face recognition methods have been designed. They can be broadly categorized into inter-modality and intra-modality¹ algorithms. Intra-modality algorithms reduce the modality gap by synthesizing a sketch from a photo (or a photo from a sketch) and then using traditional face recognition methods designed for the target modality to match the synthesized sketch (photo) with the original sketch (photo); they therefore rely heavily on the quality of the synthesized samples. Inter-modality algorithms map the image features of the two modalities into a shared space and then learn a classifier that maximizes inter-class separability while minimizing intra-class differences, so that face photos are matched with sketches through modality-consistent features. Recently, convolutional neural network (CNN) based inter-modality methods have been widely used to learn discriminative features for sketch face recognition.⁶,⁷ Because drawing sketches requires time-consuming human effort, the amount of paired training data is limited. Although CNNs offer near-perfect performance on standard image test sets given abundant training data, they perform poorly on few-shot sketch face recognition.

Meta-learning⁸⁻¹¹ is a popular topic in few-shot learning; it can mitigate the critical issue of overfitting when the amount of paired training data is limited. Meta-learning is most commonly understood as learning to learn, which refers to the process of improving a learning algorithm over multiple learning episodes. Specifically, a machine learning model gains experience over multiple learning episodes and uses it to enhance its future learning performance. Meta-learning can be divided into two steps. During base learning, an inner learning algorithm solves a task such as image classification. During meta-learning, an outer algorithm updates the inner learning algorithm so that the model learned by the inner algorithm improves an outer objective.

In this paper, inspired by meta-learning, we propose a Domain Adaptation Scaled Entropy Meta-Network (DASEMN) for sketch face recognition. It consists of a scaled entropy meta module and a domain adaptation module. Since current sketch face datasets contain few sketch face images, and traditional deep learning methods easily overfit with a small amount of training data, a scaled entropy meta module is designed to raise the learning level from data to task. In the scaled entropy meta module, a meta-learning training strategy is designed to enhance the network’s generalization ability, and the feature vectors of the photos and the sketches are then extracted by the network separately. Finally, a scaled mean entropy loss is proposed that uses the metric relationship between the two in the feature space. Domain adaptation¹²,¹³ is a variant of transfer learning that attempts to enhance the generalization capability of learned models by reducing the data shift between training and testing distributions. To further reduce the domain gap between sketches and photos, a domain adaptation module is therefore introduced to minimize the maximum mean discrepancy between sketches and photos in the embedding space.

The rest of the paper is organized as follows: an overview of related work is provided in Section 1, followed by a description of the proposed methods in Section 2. The experimental results and analysis are reported in Section 3, and the conclusion is finally given in Section 4.

Related work

Sketch face recognition

Several sketch face recognition approaches belong to the intra-modality algorithms, also known as Face Hallucination (FH) techniques. Representative intra-modality approaches include the Eigen Transformations,¹⁴ Local Linear Embedding (LLE),¹⁵ feature-level GAN,¹⁶ and the generative adversarial multi-task learning method.¹⁷ Eigen Transformations is a linear face sketch recognition and synthesis method. It performs photo-sketch transformation under the assumption that the transformation can be viewed as a linear mapping. However, a simple linear transformation cannot represent the whole face and whole sketch well; specifically, the high-frequency information of the generated sketches is not well synthesized, and hence the quality of the generated sketches tends to be unsatisfactory. To address this limitation of the Eigen Transformations, Liu et al.¹⁵ proposed a non-linear transformation method for face sketch synthesis that utilizes Local Linear Embedding (LLE). In recent years, most intra-modality algorithms have focused on generative adversarial networks (GANs),¹⁸ which comprise a generative model G and a discriminative model D jointly trained in an adversarial way. Inspired by CycleGAN¹⁹ and conditional GANs,²⁰ Yu et al. proposed a feature-level GAN¹⁶ for face sketch-to-photo transformation, in which a new feature-level loss is designed to ensure the quality of the generated photos. In Wan and Lee,¹⁷ a generative adversarial multi-task learning method is proposed to deal with sketch face synthesis and recognition simultaneously. This method uses a multi-task deep network to train the generator and extract features for sketch face recognition at the same time.

Another strategy to reduce the modality discrepancy in sketch face recognition, adopted by inter-modality methods, is to extract modality-invariant feature representations from face photos and sketches. Traditional inter-modality methods use hand-crafted features, including Scale-Invariant Feature Transform (SIFT) and Multiscale Local Binary Pattern (MLBP), to extract modality-invariant feature representations. Klare and Jain²¹ proposed the Direct Random Subspace (D-RS) method for feature extraction: it convolves images with three filters, extracts SIFT and MLBP descriptors from overlapping patches, and uses a cosine similarity measure to match sketches and photos. Recently, several state-of-the-art deep learning-based inter-modality methods have been proposed. The Fast-RSLCR method proposed in Wang et al.²² randomly samples patches offline and then uses them to reconstruct the target sketch patch, which improves synthesis efficiency. Cheraghi and Lee²³ proposed the Sketch-Photo Net (SP-Net), which consists of an S-Net and a P-Net that learn discriminative features from sketches and photos, and which employs a contrastive loss to discover the coherent visual structures between sketch and photo. Similarly, based on our proposed meta-learning training strategy, we propose a scaled mean entropy loss to look for potential connections between facial sketches and photos of the same person.

Meta-learning

Meta-learning, also known as “learning to learn,” uses previous knowledge and experience to learn new tasks. It raises the learning level from data to task: the network learns from labeled tasks instead of from a training dataset. Existing meta-learning methods can be classified into three categories. The first is optimization-based methods, in which the inner-level task is solved as an optimization problem and the focus is on extracting the meta-knowledge required to improve optimization performance. Ravi and Larochelle proposed an LSTM-based meta-learning optimizer²⁴ to improve generalization performance, in which the meta-learner simultaneously considers the short-term memory of a single task and the long-term memory of all tasks. The second is model-based methods, in which the inner learning step is wrapped up in the feed-forward pass of a single model. Finn et al. proposed the Model-Agnostic Meta-Learning method.²⁵ This method aims to find a weight configuration of a given neural network such that gradient descent update steps effectively fine-tune it on a small-sample problem. The third is metric learning-based methods, which model the distance distribution between samples so that samples of the same class are close to each other while samples of different classes are far apart. Vinyals et al. proposed Matching Networks,²⁶ which use an attention mechanism to predict the class of unlabeled points (query sets) by learning to embed labeled sample sets (support sets). Wang et al. proposed Siamese Networks,²⁷ which learn a similarity metric to compare and match samples of new unknown categories. Garcia and Bruna²⁸ designed a Graph Neural Network that learns the relational task of information transmission end-to-end and regards each sample as a graph node; it learns not only the node embedding but also the edge embedding.
Since meta-learning has achieved great success in few-shot learning and has begun to be applied to sketch face recognition in recent years,²⁹ we propose a meta-learning training strategy suited to sketch face recognition that mimics few-shot tasks, and we further design domain-related query and support sets to incorporate domain information.

Domain adaptation

To avoid both labeling effort and overfitting, domain adaptation has been extensively studied to overcome domain shift or dataset bias, so that a learner trained on other large-scale datasets can be leveraged. In recent years, several domain adaptation methods have been proposed to learn domain-invariant features directly; they can be roughly categorized into statistic-based approaches and adversarial learning-based approaches. Among statistic-based approaches, Long and Wang proposed the Deep Adaptation Network (DAN)³⁰ to reduce domain shift and improve the feature transferability of the task-specific layers of deep neural networks by embedding the hidden representations of all task-specific layers into a reproducing kernel Hilbert space. Ganin et al. proposed the Domain-Adversarial Neural Network (DANN),³¹ which learns domain-uninformative features by back-propagating the gradients of the loss related to the domain classifier. Adversarial learning-based approaches learn domain-invariant features by using Generative Adversarial Networks (GANs) to synthesize images in different domains. Zhu et al. proposed Cycle-Consistent Adversarial Networks (CycleGAN)¹⁹ to improve the similarity of the distributions of real and generated images by conditioning the generator and discriminator on discriminative information. In our method, we propose a domain adaptation module that reduces the impact of modality differences on recognition performance by minimizing the maximum mean discrepancy between sketches and photos in the embedding space.

Proposed method

The framework of the domain adaptation scaled entropy meta-network is shown in Figure 1; it consists of a scaled entropy meta module and a domain adaptation module. Since current sketch face datasets contain few face sketches, and traditional deep learning methods easily overfit with a small amount of training data, we design a scaled entropy meta module to raise the learning level from data to task. In the scaled entropy meta module, a meta-learning training strategy is designed to enhance the network’s generalization ability, and the feature vectors of the photos and the sketches are then extracted by the network separately. Finally, the scaled mean entropy loss is calculated from the metric relationship between the two in the feature space. To reduce the domain gap between sketches and photos, a domain adaptation module is introduced to minimize the maximum mean discrepancy between sketches and photos in the embedding space.

Figure 1.

The overall architecture of our domain adaptation scaled entropy meta network (DASEMN).

Meta-learning training strategy

Inspired by the idea of the representative prototype network,³² the following meta-learning training strategy is designed. $N$ ($N < K$) classes are randomly sampled from the training set $D_{train} = \{(I_{1}^{p}, y_{1}), (I_{1}^{s}, y_{1}), \dots, (I_{K}^{p}, y_{K}), (I_{K}^{s}, y_{K})\}$ to form a meta-training episode, where $I_{i}^{p}$ and $I_{i}^{s}$ represent the photo and the sketch of class $i$ respectively, and $y_{i}$ represents the corresponding label. In each episode, all images constitute a query set $Q = \{(I_{1}^{s}, y_{1}), (I_{1}^{p}, y_{1}), \dots, (I_{N}^{s}, y_{N}), (I_{N}^{p}, y_{N})\}$, all photos constitute a photo support set $S_{p} = \{(I_{1}^{p}, y_{1}), \dots, (I_{N}^{p}, y_{N})\}$, and all sketches constitute a sketch support set $S_{s} = \{(I_{1}^{s}, y_{1}), \dots, (I_{N}^{s}, y_{N})\}$.
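The episode construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `sample_episode`, the dictionary layout of the training set, and the use of file-name strings in place of images are all assumptions made for the example.

```python
import random

def sample_episode(train_set, n_way, rng=random):
    """Build one meta-training episode (hypothetical helper).

    train_set maps each class label y to a (photo, sketch) pair, i.e. one
    photo and one sketch per subject, mirroring D_train (an assumption).
    """
    # Draw N of the K classes without repetition.
    labels = rng.sample(sorted(train_set), n_way)
    photo_support = [(train_set[y][0], y) for y in labels]    # S_p
    sketch_support = [(train_set[y][1], y) for y in labels]   # S_s
    # Q holds every image of the episode: the sketch and photo of each class.
    query = [item for s, p in zip(sketch_support, photo_support)
             for item in (s, p)]
    return query, photo_support, sketch_support

# Toy data: K = 5 subjects, images stood in for by file-name strings.
toy = {y: (f"photo_{y}.png", f"sketch_{y}.png") for y in range(5)}
query, s_p, s_s = sample_episode(toy, n_way=3, rng=random.Random(0))
```

With $N = 3$ the query set holds six images, and the two support sets hold the three photos and the three sketches of the same classes.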

Since ResNet-18³³ has achieved state-of-the-art performance on traditional face recognition, it is utilized to extract the photo and sketch features. The ResNet-18 is an 18-layer residual network; it contains four residual channel modules with different channel sizes as the basic structure. Let $f (\cdot)$ denote the embedding function implemented by the ResNet-18, that is, given an input image $I$ , its corresponding feature is $f (I)$ .

The entire training process can be divided into two steps. During base training, the classification task is performed in each episode, and a model is trained based on the samples in the episode. During the meta training, the model trained in the previous episode is input to the next episode and optimized based on the data in the next episode. Finally, a final model optimized based on multiple episodes is obtained.

Base training

Scaled mean entropy loss

During base training, in a training episode, given a feature $f (I_{c})$ extracted from the query set $Q$ , the Euclidean distance between $f (I_{c})$ and the photo feature $f (I_{i}^{p})$ extracted from the photo support set $S_{p}$ , and the Euclidean distance between $f (I_{c})$ and the sketch feature $f (I_{i}^{s})$ extracted from the sketch support set $S_{s}$ , can be computed respectively:

d (f (I_{c}), f (I_{i}^{p}))_{i} = ‖ f (I_{c}) - f (I_{i}^{p}) ‖,

(1)

d (f (I_{c}), f (I_{i}^{s}))_{i} = ‖ f (I_{c}) - f (I_{i}^{s}) ‖,

(2)

where $‖ \cdot ‖$ denotes the Euclidean norm.

Our method, the Domain Adaptation Scaled Entropy Meta-Network, is based on the idea that the similarity between a sample and a class can be represented by the average of the distances between the sample and the two support samples (photo and sketch) of that class. We therefore average the distances in equations (1) and (2) to measure the similarity between the query sample $I_{c}$ and class $i$ . For a feature $f (I_{c})$ extracted from the query set $Q$ , its mean metric over the support sets is as follows:

M_{c, i} = \frac{1}{2} (‖ f (I_{c}) - f (I_{i}^{p}) ‖ + ‖ f (I_{c}) - f (I_{i}^{s}) ‖),

(3)

Furthermore, by introducing a metric scaling factor $α$ , the probability that $I_{c}$ belongs to class $i$ is computed as a softmax over the negatively scaled mean metrics of the $N$ classes in the episode:

P (y = y_{i} | I_{c}) = \frac{\exp (- α M_{c, i})}{\sum_{j = 1}^{N} \exp (- α M_{c, j})} .

(4)

Inspired by the cross-entropy function, the scaled mean entropy loss minimizes the negative log-probability of the true class $y_{c}$ over the entire query set, where $M_{c, y_{c}}$ denotes the mean metric between $I_{c}$ and the support samples of its own class:

\begin{matrix} L_{s - mean} = \frac{1}{| Q |} \sum_{(I_{c}, y_{c}) \in Q} - \ln P (y = y_{c} | I_{c}) \\ = \frac{1}{| Q |} \sum_{(I_{c}, y_{c}) \in Q} [α M_{c, y_{c}} + \ln \sum_{j = 1}^{N} \exp (- α M_{c, j})] . \end{matrix}

(5)
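To make equations (1) to (5) concrete, the following is a minimal sketch that evaluates the scaled mean entropy loss on precomputed embeddings using only the Python standard library. It is an illustration under stated assumptions, not the authors' code: the function name, the list-of-lists feature layout, and the toy feature values are hypothetical, and the max-shift inside the log-sum-exp is a standard numerical-stability step that the text does not mention.

```python
import math

def scaled_mean_entropy_loss(f_query, y_query, f_photo, f_sketch, alpha=0.5):
    """Eqs. (1)-(5) on precomputed embeddings (lists of float vectors).

    f_photo[i] and f_sketch[i] are the support features of class i;
    y_query[c] is the class index of the query feature f_query[c].
    """
    total = 0.0
    for f_c, y_c in zip(f_query, y_query):
        # Eqs. (1)-(3): mean of the distances to the photo and sketch of class i.
        m = [0.5 * (math.dist(f_c, p) + math.dist(f_c, s))
             for p, s in zip(f_photo, f_sketch)]
        # Eq. (4) in log form; shifting by the max keeps exp() stable.
        logits = [-alpha * m_i for m_i in m]
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
        # Eq. (5): negative log-probability of the true class.
        total += log_norm - logits[y_c]
    return total / len(f_query)

# Toy episode with N = 2 classes and 2-D features (made-up numbers).
f_photo = [[0.0, 0.0], [4.0, 0.0]]
f_sketch = [[0.0, 1.0], [4.0, 1.0]]
f_query = f_sketch + f_photo          # Q contains every image of the episode
y_query = [0, 1, 0, 1]
loss = scaled_mean_entropy_loss(f_query, y_query, f_photo, f_sketch, alpha=0.5)
uniform = scaled_mean_entropy_loss(f_query, y_query, f_photo, f_sketch, alpha=0.0)
```

Setting $α = 0$ makes every class equally likely, so the loss equals $\ln N$ regardless of the features; a larger $α$ sharpens the distribution in equation (4).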

Maximum mean discrepancy loss

Since the images in the photo support set and the sketch support set come from different modalities (photos are captured by digital devices while sketches are drawn by artists), there is a large modality difference between photos and sketches, which results in a large difference in the feature representations learned by the network. To reduce the impact of this difference, a domain adaptation module is introduced that regards the sketch sets and photo sets as the source task and target task. During training, the maximum mean discrepancy between the sketch sets and photo sets is continuously reduced to improve recognition performance. The maximum mean discrepancy³⁴ was first proposed for the two-sample detection problem; it is computed between two samples to determine whether their distributions are the same. In this paper, the maximum mean discrepancy loss is used to evaluate the similarity between two different distributions in the Reproducing Kernel Hilbert Space (RKHS)³⁵:

L_{mmd} = | | E_{I^{p} ~ S_{p}} (f (I^{p})) - E_{I^{s} ~ S_{s}} (f (I^{s})) | |_{H}^{2},

(6)

where $S_{p}$ and $S_{s}$ refer to the distributions of photos and sketches in each episode.

It can be reformulated as follows:

\begin{matrix} L_{mmd} = | | \frac{1}{N} \sum_{i = 1}^{N} f (I_{i}^{p}) - \frac{1}{N} \sum_{j = 1}^{N} f (I_{j}^{s}) | |_{H}^{2}, \end{matrix}

(7)

where $N$ is the number of photos in the photo set and the number of sketches in the sketch set in each episode.

Expanding the squared norm in equation (7) into inner products and replacing each inner product with a kernel function, the loss can be rewritten as follows:

\begin{matrix} L_{mmd} = \frac{1}{N^{2}} [\sum_{i = 1}^{N} \sum_{j = 1}^{N} k (f (I_{i}^{p}), f (I_{j}^{p})) \\ - \sum_{i = 1}^{N} \sum_{j = 1}^{N} 2 k (f (I_{i}^{p}), f (I_{j}^{s})) \\ + \sum_{i = 1}^{N} \sum_{j = 1}^{N} k (f (I_{i}^{s}), f (I_{j}^{s}))], \end{matrix}

(8)

where $k (x, y)$ is the Gaussian kernel function³⁶ with bandwidth parameter $σ$ :

k (x, y) = e^{\frac{- {‖ x - y ‖}^{2}}{σ}} .

(9)

In practical applications, the maximum mean discrepancy usually combines multiple Gaussian kernel functions (i.e. with different $σ$ ) into the final kernel function. Therefore, the $k (x, y)$ in equation (8) is:

k (x, y) = \sum_{i = 1}^{n} e^{\frac{- {‖ x - y ‖}^{2}}{σ_{i}}} .

(10)

where $n$ is the number of Gaussian kernels, which is set to five empirically.

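The bandwidths $σ_{i}$ are computed from the pairwise squared distances among the $N + N$ photo and sketch features of an episode: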

σ_{1} = \frac{\sum_{j \in {‖ x - y ‖}^{2}} j}{[{(N + N)}^{2} - (N + N)] \times 2^{2}},

(11)

σ_{i} = σ_{1} \times 2^{i - 1} (i = 2, . . ., n),

(12)
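Equations (8) to (12) can be traced step by step in the short standard-library sketch below. It is a hedged illustration, not the authors' implementation: the function names and the toy features are hypothetical, and the lower bound on $σ_{1}$ is a safeguard added here for the degenerate case where all features coincide.

```python
import math

def sq_dist(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def multi_kernel_mmd(f_photo, f_sketch, n_kernels=5):
    """Eqs. (8)-(12) on precomputed embeddings (lists of float vectors)."""
    n = len(f_photo)                      # N photos and N sketches
    feats = f_photo + f_sketch            # all 2N features of the episode
    d2 = [[sq_dist(x, y) for y in feats] for x in feats]
    # Eq. (11): total pairwise squared distance over the (2N)^2 - 2N
    # off-diagonal pairs, further divided by 2^2.
    sigma_1 = sum(map(sum, d2)) / (((2 * n) ** 2 - 2 * n) * 2 ** 2)
    sigma_1 = max(sigma_1, 1e-12)         # added safeguard, not in the source
    # Eq. (12): each bandwidth is twice the previous one.
    sigmas = [sigma_1 * 2 ** i for i in range(n_kernels)]
    # Eqs. (9)-(10): sum of Gaussian kernels over all bandwidths.
    k = [[sum(math.exp(-d / s) for s in sigmas) for d in row] for row in d2]
    idx = range(n)
    pp = sum(k[i][j] for i in idx for j in idx)            # photo-photo
    ss = sum(k[n + i][n + j] for i in idx for j in idx)    # sketch-sketch
    ps = sum(k[i][n + j] for i in idx for j in idx)        # photo-sketch
    return (pp - 2.0 * ps + ss) / n ** 2                   # Eq. (8)

# Toy features: identical sets versus a shifted copy (made-up numbers).
photos = [[0.0, 0.0], [1.0, 0.0]]
shifted = [[5.0, 5.0], [6.0, 5.0]]
same = multi_kernel_mmd(photos, photos)
far = multi_kernel_mmd(photos, shifted)
```

The loss vanishes when the photo and sketch features coincide and grows as the two sets drift apart, which is the behavior the domain adaptation module relies on.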

Meta training

During meta training, to balance classification performance against the reduction of the modality difference, we integrate the scaled mean entropy loss of the scaled entropy meta module and the maximum mean discrepancy loss of the domain adaptation module into a unified optimization problem:

min L_{total} = L_{s - mean} + λ L_{mmd} .

(13)

where $λ$ is the trade-off hyperparameter between the scaled mean entropy loss and the maximum mean discrepancy loss. The network parameters are optimized end-to-end by gradient backpropagation. Algorithm 1 summarizes the loss computation for each episode.

Algorithm 1 Training episode loss computation for DASEMN.

Input: Training set $D_{train} = \{(I_{1}^{p}, y_{1}), (I_{1}^{s}, y_{1}), \dots, (I_{K}^{p}, y_{K}), (I_{K}^{s}, y_{K})\}$ including $K$ training classes; $N$ is the number of classes per training episode, and $λ$ is the trade-off hyperparameter.
1. Select class indices for the training episode: $\{1, \dots, N\}$ ← $RandomSample (\{1, \dots, K\})$ ;
2. Build a query set: $Q = \{(I_{1}^{s}, y_{1}), (I_{1}^{p}, y_{1}), \dots, (I_{N}^{s}, y_{N}), (I_{N}^{p}, y_{N})\}$ ;
3. Build a photo support set: $S_{p} = \{(I_{1}^{p}, y_{1}), \dots, (I_{N}^{p}, y_{N})\}$ ;
4. Build a sketch support set: $S_{s} = \{(I_{1}^{s}, y_{1}), \dots, (I_{N}^{s}, y_{N})\}$ ;
5. Compute $L_{s - mean}$ by Eq. (5);
6. Compute $L_{mmd}$ by Eq. (8);
Output: Unified episode loss $L_{total} = L_{s - mean} + λ L_{mmd}$

Experiment and analysis

Dataset setup

The UoM-SGFS database³⁷ and the CUFSF database³⁸ are used to evaluate our method. The UoM-SGFS database is currently the largest software-generated sketch database, and all of its sketches are in full color. It contains 600 photos from the Color FERET face database³⁹ and two sets of sketches: set A contains 600 sketch-photo pairs whose sketches are generated with EFIT-V, and set B contains 600 sketch-photo pairs whose more realistic sketches are obtained by fine-tuning the sketches of set A in an image editing program. Some examples of sketches and the corresponding photos from the UoM-SGFS database are shown in Figure 2.

Figure 2.

The photos of four subjects in the Color FERET database³⁹ and the corresponding sketches from the two sets of the UoM-SGFS database.³⁷

The CUHK Face Sketch FERET Database (CUFSF) is a representative database for studying sketch face recognition and synthesis. It contains 1194 individuals from the FERET database; each individual has a face photo with illumination changes and a face sketch drawn by an artist. This database is quite challenging since the photos are taken under different illumination conditions and the sketches contain numerous exaggerations compared with the photos. Figure 3 gives some examples.

Figure 3.

The photos of four subjects and the corresponding sketches in the CUFSF database.³⁸

We define three setups based on the above two databases. In the first setup (S1), 450 subjects of UoM-SGFS set A are selected randomly for training and the remaining 150 subjects are assigned to the test set. In the test set, the sketches constitute the probe set and the photos constitute the gallery set. In the second setup (S2), UoM-SGFS set B is divided into training and test sets in the same way as in S1. To better simulate a real scenario, the gallery sets in S1 and S2 are expanded with photos of 1521 subjects following the same strategy as in previous work,¹ including 509 subjects from the MEDS-II database,¹ 199 subjects from the FEI database² and 813 subjects from the LFW database.⁴⁰ Because only the MEDS-II database and the FEI database are available, we randomly select 813 subjects from the LFW database to replace the original 476 subjects from the FRGCv2.0 database and 337 subjects from the Multi-PIE database. In the third setup (S3), we randomly select 500 subjects from the CUFSF database as the training set, and the remaining 694 subjects are used for testing. To assess the generalization ability of the proposed method, five rounds of random cross-validation are performed on each of the three setups. Table 1 details the experiment setups.

Table 1.

Experiment setups.

Setup name	Testing dataset	Training dataset	Training pairs	Probe	Gallery
S1	UoM-SGFS set A	UoM-SGFS set A, MEDS-II, FEI, LFW	450	150	150 + 1521
S2	UoM-SGFS set B	UoM-SGFS set B, MEDS-II, FEI, LFW	450	150	150 + 1521
S3	CUFSF	CUFSF	500	694	694
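The results in this section are reported as Rank-1, Rank-10, and Rank-50 accuracy of probe sketches matched against gallery photos. A minimal sketch of how such a Rank-k identification accuracy can be computed is given below; the text does not state the test-time matching metric, so the Euclidean distance between embeddings, the function name, and the toy features are all assumptions made for illustration.

```python
import math

def rank_k_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids, k=1):
    """Fraction of probes whose true identity is among the k nearest gallery entries."""
    hits = 0
    for f, pid in zip(probe_feats, probe_ids):
        # Sort gallery indices by distance to the probe (closest first).
        order = sorted(range(len(gallery_feats)),
                       key=lambda g: math.dist(f, gallery_feats[g]))
        hits += pid in [gallery_ids[g] for g in order[:k]]
    return hits / len(probe_feats)

# Toy gallery of three photos and three probe sketches (made-up features).
gallery_feats = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
gallery_ids = [0, 1, 2]
probe_feats = [[0.1, 0.0], [0.9, 0.2], [0.4, 0.0]]
probe_ids = [0, 1, 1]
r1 = rank_k_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids, k=1)
r2 = rank_k_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids, k=2)
```

In this toy case the third probe is closer to the wrong subject, so it counts as a miss at rank 1 but as a hit once the two nearest gallery entries are considered.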

Experimental details

Experimental settings

We employ the MTCNN⁴¹ to perform face detection and alignment in the UoM-SGFS dataset and the CUFSF database, and only retain the critical point information of the face to obtain face images of the same scale. Figure 4 shows some examples after performing the above processing from the UoM-SGFS database. All images are cropped to the size of $256 \times 256$ and images are flipped horizontally with a probability of 0.5 for data augmentation.

Figure 4.

Cropped images from the UoM-SGFS database. The first row is photos, the second and third row are the corresponding sketches from the two sets of the UoM-SGFS database, respectively.

Training details

In the training process, a total of $60 \times 100$ episodes are trained; the model is saved every 100 episodes, and the final trained model is used for testing. For the S1, S2, and S3 setups, each meta-training episode contains 32 classes randomly sampled from the training set; that is, $N$ in the Meta-learning training strategy section is 32. The model is trained using the Adam optimizer⁴² with $β_{1} = 0.5$ and $β_{2} = 0.999$ . The initial learning rate is set to 0.0001, the trade-off hyperparameter $λ$ between the scaled mean entropy loss and the maximum mean discrepancy loss is set to 0.01, and the metric scaling factor $α$ is set to 0.5 empirically.

In this subsection, we show how the hyperparameter $λ$ , the metric scaling factor $α$ , the number of episodes, and the number of classes per episode $N$ are determined, and we analyze the sensitivity of the method to each of them.

To analyze the sensitivity of the overall loss to the trade-off hyperparameter, we set $λ$ to 0.001, 0.01, 0.1, and 1, and test the models learned with each value on the S1 setup. As shown in Table 2, the recognition performance of the learned model is best when $λ$ is set to 0.1. This is because, at this value, the discriminability and the transferability of the features learned by the feature extractor are balanced, which maximizes the performance of the network. When $λ$ is less than 0.1, the classification loss has relatively greater weight, and the transferability of the learned features is reduced. Conversely, when $λ$ is greater than 0.1, the discriminability of the learned features decreases.

Table 2.

Recognition accuracy of our method for different values of the hyperparameter $λ$ on the S1 setup.

$λ$	Rank-1 (%)	Rank-10 (%)	Rank-50 (%)
0.001	63.47	90.80	97.73
0.01	63.73	91.73	98.00
0.1	66.00	92.67	98.00
1	65.07	89.60	96.40

To analyze the sensitivity of the recognition performance to the metric scaling factor $α$ , we set $α$ to 0.01, 0.1, 0.5, and 1, and report the recognition accuracy of our method for each value on the S1 setup in Table 3. Our method reaches the best performance when $α$ is set to 0.5. By linearly scaling the metric function, both the embedding distance between samples of the same class and the embedding distance between samples of different classes become smaller, which increases the difficulty of the classification task and thereby improves the learning ability of the model. However, when $α$ is too small, the difference between positive and negative samples is too small for the network to distinguish them correctly.

Table 3.

Recognition accuracy of our method for different values of the metric scaling factor $α$ on the S1 setup.

$α$	Rank-1 (%)	Rank-10 (%)	Rank-50 (%)
0.01	62.53	91.73	97.73
0.1	63.47	92.00	97.60
0.5	66.00	92.67	98.00
1	62.93	90.40	97.07

To study the influence of the number of training episodes on the recognition performance of our method, we keep the other experimental parameters unchanged and train the network for different numbers of episodes on the S1 setup. Table 4 shows the results of the models saved after different numbers of training episodes. When the number of training episodes is less than 6000, the recognition accuracy of the learned model increases with the number of episodes, since the model has not yet converged. However, when the number of training episodes increases from 6000 to 10,000, the model begins to overfit and the recognition performance decreases.

Table 4.

Recognition accuracy of our method for different numbers of training episodes on the S1 setup.

The number of episodes	Rank-1 (%)	Rank-10 (%)	Rank-50 (%)
2000	60.40	90.67	97.20
4000	63.07	90.27	97.33
6000	66.00	92.67	98.00
8000	64.27	90.53	98.13
10,000	64.00	90.40	97.87

To study the influence of the number of classes in each episode on the recognition performance of our method, we set $N$ to 8, 16, 32, 64, and 128 on the S1 setup and keep the other experimental parameters unchanged. The results are shown in Table 5. The Rank-1 accuracy of the learned model is best when $N$ is set to 32. As $N$ increases from 8 to 32, the recognition performance of the learned model gradually increases: the number of hard negative samples grows with the number of classes in each training episode, which improves the learning ability of the model. As $N$ increases from 32 to 128, the learned model begins to overfit, leading to a decline in recognition performance.

Table 5.

Recognition accuracy of our method for different numbers of classes in each episode on the S1 setup.

N	Rank-1 (%)	Rank-10 (%)	Rank-50 (%)
8	56.80	89.60	96.93
16	62.80	90.27	97.47
32	66.00	92.67	98.00
64	65.47	90.13	97.20
128	62.27	92.80	96.67

To further study the influence of the optimizer and the initial learning rate on the recognition performance of our method, we keep the other experimental parameters unchanged and train the network with different optimizers and initial learning rates on the S1 setup. Table 6 shows the results of the models trained with the SGD and Adam optimizers, and Table 7 shows the results of the models trained with different initial learning rates. As shown in Table 6, the SGD optimizer and the Adam optimizer achieve comparable performance, illustrating the robustness of the proposed method. As shown in Table 7, when the initial learning rate is less than 0.0001, the recognition accuracy of the learned model increases with the initial learning rate, since the model has not converged. However, when the initial learning rate increases from 0.0001 to 0.001, the model overshoots the optimum and the recognition performance decreases.

Table 6.

Recognition accuracy of our method with different optimizers on the S1 setup.

Optimizer	Rank-1 (%)	Rank-10 (%)	Rank-50 (%)
SGD	66.53	92.40	97.60
Adam	66.00	92.67	98.00

Table 7.

Recognition accuracy of our method with different initial learning rates on the S1 setup.

Learning rate	Rank-1 (%)	Rank-10 (%)	Rank-50 (%)
0.00001	46.67	83.60	93.60
0.00002	52.00	86.93	95.73
0.00005	63.20	89.87	96.93
0.00008	65.67	92.00	97.47
0.0001	66.00	92.67	98.00
0.0002	62.80	90.27	97.33
0.0005	57.60	89.47	97.60
0.0008	50.67	85.47	96.53
0.001	48.13	84.00	96.67

Ablation study

In this section, we conduct an extensive ablation study to evaluate each component’s contribution to our network. Four variants are compared:

DASEMN without MMD (w/o mmd): the maximum mean discrepancy loss is removed, and the network is trained only with the meta-learning training strategy and the scaled mean entropy loss.

DASEMN without ML&SMEL (w/o ML&SMEL): the network is trained with the traditional cross-entropy loss and the maximum mean discrepancy loss, without the meta-learning training strategy or the scaled mean entropy loss. Training is epoch-based: every epoch passes over all training data, and the updated parameters are carried into the next epoch. The cross-entropy loss serves as the classification loss, and the maximum mean discrepancy loss is retained to reduce the modal difference between photos and sketches. The network is trained for 50 epochs with a batch size of 4.

DASEMN without ML (w/o ML): the network is trained with the scaled mean entropy loss and the maximum mean discrepancy loss, without the meta-learning training strategy (i.e. both losses are minimized without episodic meta-learning). Training is epoch-based: every epoch passes over all training data, and the updated parameters are carried into the next epoch. The scaled mean entropy loss serves as the classification loss, and the maximum mean discrepancy loss is retained to reduce the modal difference between photos and sketches. The network is trained for 50 epochs with a batch size of 4.

Baseline: the meta-learning training strategy, the scaled mean entropy loss, and the maximum mean discrepancy loss are all removed, and only the traditional cross-entropy loss is used to train the network. The training parameters of the baseline are consistent with those of w/o ML: the network is trained for 50 epochs with a batch size of 4.
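To make concrete the term that the w/o mmd variant removes, the following is a minimal NumPy sketch of a squared maximum mean discrepancy between a batch of photo features and a batch of sketch features. It assumes a single Gaussian kernel with a hand-picked bandwidth and the biased estimator; the kernel choice, the bandwidth, and the function names are illustrative assumptions rather than the exact configuration of our domain adaptation module.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth):
    """Pairwise Gaussian kernel values between the rows of a and b."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))


def mmd2(photo_feats, sketch_feats, bandwidth=1.0):
    """Biased estimate of the squared MMD between two feature batches."""
    k_pp = gaussian_kernel(photo_feats, photo_feats, bandwidth)
    k_ss = gaussian_kernel(sketch_feats, sketch_feats, bandwidth)
    k_ps = gaussian_kernel(photo_feats, sketch_feats, bandwidth)
    return float(k_pp.mean() + k_ss.mean() - 2.0 * k_ps.mean())


# Synthetic features (assumed data): one sketch batch close to the photo
# distribution and one shifted away from it.
rng = np.random.default_rng(0)
photo = rng.normal(0.0, 1.0, size=(64, 8))
sketch_near = rng.normal(0.0, 1.0, size=(64, 8))
sketch_far = rng.normal(3.0, 1.0, size=(64, 8))

gap_near = mmd2(photo, sketch_near, bandwidth=4.0)
gap_far = mmd2(photo, sketch_far, bandwidth=4.0)
```

In this toy setting the shifted batch yields a much larger discrepancy than the matched batch, which is the quantity a domain adaptation term of this kind drives down during training.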

The results of the four ablation experiments are shown in Table 10. Figure 5 visualizes the top five matching images of the proposed DASEMN, w/o mmd, w/o ML&SMEL, w/o ML, and the baseline on the S1 setup. The table and the figure show that the proposed DASEMN outperforms all of these variants, indicating that the components are complementary and that each contributes to the final performance.

Figure 5.

The top five matching images of ours, w/o mmd, w/o ML&SMEL, w/o ML, and the baseline on the S1 setup. The images in the blue boxes are the query sketches, the images in the green boxes are positive samples, and the images in the red boxes are negative samples.

The ablation results on the S1, S2, and S3 setups show that training the model with meta-learning, the maximum mean discrepancy loss, and the scaled mean entropy loss together achieves the best results. As shown in Table 10, the recognition accuracies of w/o mmd, w/o ML&SMEL, and w/o ML are all lower than that of the complete method, which indicates that each component is essential. The recognition accuracy of w/o mmd is higher than that of our baseline, which indicates that the meta-learning training strategy and the scaled mean entropy loss improve the generalization ability of the network by raising the learning level from the data to the task under the same data volume, thereby improving recognition performance. The recognition accuracy of w/o ML&SMEL is roughly 1 to 3 percentage points higher than that of our baseline on all three setups, which indicates the validity and stability of our domain adaptation module. Table 10 also shows that the recognition accuracy of w/o ML is significantly lower than that of the complete method and is even the worst of all variants, falling below the baseline. After meta-learning is removed, the diversity of tasks is reduced, so the network cannot learn as much knowledge as before.

Table 8.

Recognition accuracy of w/o ML with different numbers of epochs on the S1 setup.

The number of epochs	Rank-1 (%)	Rank-10 (%)	Rank-50 (%)
10	28.07	64.73	80.47
30	41.13	75.13	87.07
50	49.47	92.67	98.00
70	47.67	89.16	95.44
90	46.00	88.32	93.79

The number of epochs and the batch size are important parameters in deep learning. Based on w/o ML, we conduct two sets of experiments to study their influence on recognition performance on the S1 setup. As shown in Table 8, we test the models trained for different numbers of epochs and record their recognition accuracy. As the number of epochs increases, the number of weight updates in the neural network increases: w/o ML starts in an underfitted state, gradually reaches its best fit at 50 epochs, and finally overfits. To study the influence of the batch size on the recognition performance of w/o ML, we set the batch size to 1, 2, 4, 8, and 16 and keep the other parameters fixed. The experimental results are listed in Table 9. When the batch size is set to 1, w/o ML does not converge and the recognition accuracy is zero. When the batch size increases from 2 to 16, the recognition accuracy fluctuates only slightly.

Table 9.

Recognition accuracy of w/o ML with different batch sizes on the S1 setup.

Batch size	Rank-1 (%)	Rank-10 (%)	Rank-50 (%)
1	0	0	0
2	49.40	92.38	97.60
4	49.47	92.67	98.00
8	49.32	92.40	97.07
16	49.38	92.51	97.33

Table 10.

Rank-1 recognition accuracy of the ablation variants on the S1, S2, and S3 setups.

Setup Name	Ours (%)	w/o mmd (%)	w/o ML&SMEL (%)	w/o ML (%)	Baseline (%)
S1	66.00	62.80	54.00	49.47	51.33
S2	74.00	71.33	68.67	58.27	65.33
S3	83.29	82.51	78.70	70.37	77.72

Comparison to the state-of-the-art sketch face recognition methods

In this section, for the S1 and S2 setups, we compare our method with representative sketch face recognition methods, including ET (+ PCA),¹⁴ EP (+ PCA),⁴³ DEEPS,¹ D-RS + CBR,⁴⁴ LGMS,⁴⁵ SP-Net,²³ and Transfer deep feature learning.⁴⁶ ET + PCA and EP + PCA are intra-modality methods, whereas DEEPS, D-RS + CBR, LGMS, SP-Net, and Transfer deep feature learning are inter-modality methods. Except for SP-Net and Transfer deep feature learning, the results of the compared methods are taken directly from DEEPS.¹ Since the results of SP-Net and Transfer deep feature learning on the S1 and S2 setups are unavailable, we reproduced them ourselves following SP-Net²³ and Transfer deep feature learning.⁴⁶ Tables 11 and 12 show the experimental results of the compared methods and our method on the S1 and S2 setups, respectively. Because FRGCv2.0 and Multi-PIE are unavailable to us, we use LFW in their place; LFW is a low-resolution, uncontrolled face image dataset, so the experimental setup is not strictly equivalent, and we use Ours* to denote the results of our method. On the more challenging S1 setup, the recognition accuracy of every algorithm except ours is low. As shown in Tables 11 and 12, the inter-modality methods are superior to the intra-modality methods at both low and high ranks. D-RS + CBR, DEEPS, and SP-Net are specifically designed for matching software-generated sketches to photos, and their low recognition performance on the S1 and S2 setups indicates how challenging the UoM-SGFS dataset is. Benefiting from transfer learning, Transfer deep feature learning achieves the best performance on the S1 and S2 setups among the compared methods. Because our method combines a meta-learning training strategy with a generalized entropy loss that guides the network to extract discriminative features, it achieves the best performance overall.
Indeed, only our method correctly retrieves over 90% of subjects by Rank-10 on the S1 setup, with a mean match rate of 92.67%. In addition, the baseline results in Table 10 surpass those of DEEPS on the S1 and S2 setups. DEEPS uses VGG-16 as its basic network, whereas our baseline uses the more advanced ResNet-18, which gives our baseline better recognition performance. The highest recognition accuracy on the S1 and S2 setups indicates that the face features extracted by our method are robust and discriminative.
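The Rank-k accuracies reported throughout this section can be computed from a sketch-to-photo similarity matrix. The following is a minimal NumPy sketch, assuming that a higher score means a closer match and that every query sketch has exactly one matching photo in the gallery; the function name `rank_k_accuracy` and the toy scores are hypothetical.

```python
import numpy as np


def rank_k_accuracy(similarity, query_ids, gallery_ids, ks=(1, 10, 50)):
    """Percentage of queries whose true match is among the top-k gallery photos.

    similarity[i, j] is the score between query sketch i and gallery photo j.
    Each query is assumed to have exactly one matching photo in the gallery.
    """
    similarity = np.asarray(similarity, dtype=float)
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    # Gallery indices sorted from most to least similar, per query.
    order = np.argsort(-similarity, axis=1)
    ranked_ids = gallery_ids[order]
    # Zero-based position of the correct identity in each ranked list.
    match_pos = np.argmax(ranked_ids == query_ids[:, None], axis=1)
    return {k: 100.0 * float(np.mean(match_pos < k)) for k in ks}


# Toy example: 3 sketches matched against a gallery of 4 photos.
sim = np.array([
    [0.9, 0.1, 0.3, 0.2],  # correct photo (id 0) is ranked first
    [0.2, 0.4, 0.8, 0.1],  # correct photo (id 1) is ranked second
    [0.1, 0.7, 0.5, 0.6],  # correct photo (id 2) is ranked third
])
scores = rank_k_accuracy(sim, query_ids=[0, 1, 2],
                         gallery_ids=[0, 1, 2, 3], ks=(1, 2, 3))
```

In the toy example, one of the three sketches is matched at Rank-1, two by Rank-2, and all three by Rank-3, so the accuracy is non-decreasing in k, as in the tables.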

Table 11.

The recognition accuracy of the state-of-the-art sketch face recognition methods and our method on the S1 setup.

Methods	Rank-1 (%)	Rank-10 (%)	Rank-50 (%)
ET + PCA	8.40	30.00	54.53
EP + PCA	12.53	35.60	62.80
LGMS	21.87	51.20	72.40
D-RS + CBR	25.87	56.00	76.27
DEEPS	31.60	66.13	86.00
SP-Net	45.20	79.60	91.47
Transfer deep feature learning	52.13	83.07	95.87
Ours*	66.00	92.67	98.00

Table 12.

The recognition accuracy of the state-of-the-art sketch face recognition methods and our method on the S2 setup.

Methods	Rank-1 (%)	Rank-10 (%)	Rank-50 (%)
ET + PCA	12.13	39.07	63.47
EP + PCA	15.20	48.27	70.00
D-RS + CBR	42.93	75.87	90.13
LGMS	43.47	73.60	86.93
DEEPS	52.17	82.67	94.00
SP-Net	50.93	83.07	93.20
Transfer deep feature learning	59.60	87.47	96.27
Ours*	74.00	98.00	99.33

For the S3 setup, as shown in Table 13, we compare our method with several state-of-the-art methods, including MWF,⁴⁷ Fast-RSLCR,²² CDFL,⁴⁸ CMML,⁴⁹ Transfer Deep Feature Learning,⁴⁶ and CMTDML.⁵⁰ MWF and Fast-RSLCR are intra-modality methods, CDFL is a feature descriptor-based method, CMML is a common inter-modality method, and Transfer Deep Feature Learning and CMTDML are deep learning methods. The results of the compared methods are taken directly from CMTDML.⁵⁰ As Table 13 shows, Transfer Deep Feature Learning performs relatively poorly even though it is a deep learning method. The intra-modality methods MWF and Fast-RSLCR depend heavily on the quality of the composite image, which degrades their recognition performance. Although the feature descriptor-based method and the common inter-modality method achieve good recognition performance, there is still room to improve their accuracy. Our method’s Rank-1 accuracy (83.29%) is slightly lower than that of CMTDML (83.86%), but it exceeds that of the other five compared methods.

Table 13.

The recognition accuracy of the state-of-the-art sketch face recognition methods and our method on the S3 setup.

Methods	Rank-1 (%)
Transfer Deep Feature Learning	72.38
MWF	74.00
Fast-RSLCR	75.94
CMML	80.00
CDFL	81.30
CMTDML	83.86
Ours	83.29

Comparison to the state-of-the-art domain adaptation methods

We also conducted experiments on the S1, S2, and S3 setups with state-of-the-art domain adaptation methods, including DANN,³¹ CDAN,⁵¹ and BSP + CDAN.⁵² In all three experiments, ResNet-18 is used as the feature extractor network during training to ensure a fair comparison. The other experimental parameters follow the settings of the original papers.

DANN comprises a deep label predictor and a domain classifier, which are jointly trained in an adversarial way. The deep label predictor predicts the label of each image, while the domain classifier is trained to distinguish the photo domain (i.e. the source domain) from the sketch domain (i.e. the target domain). Training is epoch-based: every epoch passes over all training data, and the updated parameters are carried into the next epoch. The classification loss is the cross-entropy loss, and the domain classifier takes the sketch and photo features learned by the feature extractor network as input and outputs a domain adaptation loss. Training proceeds in the standard way and minimizes the cross-entropy loss and the domain adaptation loss.
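The domain adaptation loss of such a domain classifier can be sketched as a binary cross-entropy over domain labels. The snippet below is a minimal NumPy illustration, assuming that the classifier outputs one raw logit per sample, with label 0 for photos (source) and label 1 for sketches (target); the function names are hypothetical, and the adversarial coupling with the feature extractor (a gradient reversal layer in the original DANN formulation) is deliberately omitted.

```python
import numpy as np


def bce_with_logits(logits, targets):
    """Numerically stable binary cross-entropy computed on raw logits."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return (np.maximum(logits, 0.0) - logits * targets
            + np.log1p(np.exp(-np.abs(logits))))


def domain_loss(photo_logits, sketch_logits):
    """Mean domain-classification loss (photo = label 0, sketch = label 1)."""
    photo_logits = np.asarray(photo_logits, dtype=float)
    sketch_logits = np.asarray(sketch_logits, dtype=float)
    losses = np.concatenate([
        bce_with_logits(photo_logits, np.zeros_like(photo_logits)),
        bce_with_logits(sketch_logits, np.ones_like(sketch_logits)),
    ])
    return float(losses.mean())


# A classifier that cannot tell the domains apart (all logits zero) ...
confused = domain_loss([0.0, 0.0], [0.0, 0.0])
# ... versus one that separates photos from sketches confidently.
separated = domain_loss([-5.0, -4.0], [4.0, 5.0])
```

A loss near log 2 means the domain classifier is reduced to guessing, which is the state the adversarially trained feature extractor aims for; a loss near zero means the two domains are still easy to distinguish.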

CDAN is similar to DANN and comprises a feature extractor, a source classifier, and a conditional domain discriminator. The difference is that CDAN concatenates the feature representation and the classifier prediction and feeds the result to the conditional domain discriminator. Training is epoch-based: every epoch passes over all training data, and the updated parameters are carried into the next epoch. The cross-entropy loss is used as the classification loss. The conditional domain discriminator takes as input the sketch and photo features learned by the feature extractor network, together with the sketch and photo classification score tensors produced by the classifier, and outputs a domain discrimination loss. Finally, the cross-entropy loss and the domain discrimination loss are integrated into a unified optimization problem.

Building on CDAN, BSP + CDAN applies SVD to obtain the largest k singular values of the photo feature matrix and of the sketch feature matrix. Batch Spectral Penalization (BSP) acts as a regularization term over these largest k singular values and outputs a BSP loss. Finally, the classification loss, the domain discrimination loss, and the BSP loss are integrated into a unified optimization problem.
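The BSP term can be sketched in a few lines of NumPy. This is an illustration under assumptions: it takes the penalty to be the sum of the squared top-k singular values of each batch feature matrix (rows are samples), and the function name `bsp_loss` and the default `k=1` are hypothetical choices for the example.

```python
import numpy as np


def bsp_loss(photo_feats, sketch_feats, k=1):
    """Sum of the squared top-k singular values of both feature matrices."""
    def top_k_squared(feats):
        # numpy returns singular values sorted in descending order.
        sv = np.linalg.svd(np.asarray(feats, dtype=float), compute_uv=False)
        return float(np.sum(sv[:k] ** 2))

    return top_k_squared(photo_feats) + top_k_squared(sketch_feats)


# Diagonal toy matrices make the singular values easy to read off.
photo = np.diag([3.0, 2.0, 1.0])    # singular values 3, 2, 1
sketch = np.diag([4.0, 1.0, 0.5])   # singular values 4, 1, 0.5

penalty_k1 = bsp_loss(photo, sketch, k=1)  # 3^2 + 4^2
penalty_k2 = bsp_loss(photo, sketch, k=2)  # (3^2 + 2^2) + (4^2 + 1^2)
```

Penalizing only the largest singular values suppresses the dominant feature directions without flattening the whole spectrum, and the resulting scalar is added to the classification and domain discrimination losses.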

The experimental results are shown in Tables 14 to 16. Our method performs at least as well as the other three domain adaptation methods at every reported rank and is clearly better at Rank-1, which indicates the robustness of our method. On the S1 setup, only our method correctly retrieves over 90% of subjects by Rank-10.

Table 14.

The recognition accuracy of the state-of-the-art domain adaptation methods and our method on the S1 setup.

Methods	Rank-1 (%)	Rank-10 (%)	Rank-50 (%)
DANN	52.00	85.47	95.20
BSP + CDAN	55.73	89.20	97.33
CDAN	57.07	88.53	95.07
Ours	66.00	92.67	98.00

Table 15.

The recognition accuracy of the state-of-the-art domain adaptation methods and our method on the S2 setup.

Methods	Rank-1 (%)	Rank-10 (%)	Rank-50 (%)
CDAN	62.00	91.87	97.47
DANN	65.20	94.00	98.53
BSP + CDAN	67.60	91.47	97.73
Ours	74.00	98.00	99.33

Table 16.

The recognition accuracy of the state-of-the-art domain adaptation methods and our method on the S3 setup.

Methods	Rank-1 (%)	Rank-10 (%)	Rank-50 (%)
CDAN	78.96	97.84	99.14
DANN	79.25	97.84	99.42
BSP + CDAN	79.97	96.97	98.99
Ours	83.29	98.39	99.42

Conclusions

To address overfitting caused by insufficient face photo-sketch data and the large modal gap between sketches and photos in sketch face recognition, we propose a meta-learning-based method termed Domain Adaptation Scaled Entropy Meta-Network (DASEMN). It consists of a scaled entropy meta module and a domain adaptation module. In the scaled entropy meta module, a meta-learning perspective coupled with episodic training is used to tackle the small-sample problem, and a scaled mean entropy loss is proposed to extract discriminative features during training. The domain adaptation module is introduced to reduce the modal gap between the photo and sketch domains. Comparisons with state-of-the-art sketch face recognition and domain adaptation methods on the UoM-SGFS and CUFSF databases show that our method effectively improves sketch face recognition performance. However, hard and easy negative samples are treated equally during training, even though hard negative mining is considered an essential component of many optimization algorithms for improving convergence speed and verification performance. In future work, we will further narrow the performance gap between hard and easy negative samples.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (62001033, U20A20163, 62201066), the Qin Xin Talents Cultivation Program of Beijing Information Science and Technology University (QXTCP A201902, QXTCPC 202108), and by the General Foundation of Beijing Municipal Commission of Education (KZ202111232049, KM202011232021, and KM202111232014).

ORCID iDs

Changwu Chen

Yanan Guo

References

1. Galea, Farrugia RA. Matching software-generated sketches to face photographs with a very deep CNN, morphed faces, and transfer learning. IEEE Trans Inf Forensics Secur 2018; 13(6): 1421–1431.

2. Mittal, Vatsa, Singh. Composite sketch recognition via deep network – a transfer learning approach. In: 2015 international conference on biometrics, Phuket, Thailand, 2015, pp.251–256.

3. Song, et al. Coupled deep learning for heterogeneous face recognition. Proc AAAI Conf Artif Intell 2018; 32. DOI: 10.1609/aaai.v32i1.11500.

4. Chen, Shum, et al. Example-based facial sketch generation with non-parametric sampling. In: Proceedings IEEE international conference on computer vision, 2001, pp.433–438. New York: IEEE.

5. Uhl Jr, da Vitoria Lobo, Kwon YH. Recognizing a facial image from a police sketch. In: IEEE workshop on applications of computer vision, 1995. New York: IEEE.

6. Galea, Farrugia RA. Forensic face photo-sketch recognition using a deep learning-based architecture. IEEE Signal Process Lett 2017; 24(11): 1586–1590.

7. Mittal, Jain, Goswami, et al. Recognizing composite sketches with digital face images via SSD dictionary. In: IEEE international joint conference on biometrics, 2014, pp.1–6. New York: IEEE.

8. Thrun. Lifelong learning algorithms. In: Thrun, Pratt (eds.) Learning to learn. Boston, MA: Springer, 1998, pp.181–209.

9. Brazdil, Soares, da Costa JP. Ranking learning algorithms: using IBL and meta-learning on accuracy and time results. Mach Learn 2003; 50(3): 251–277.

10. Schweighofer, Doya. Meta-learning in reinforcement learning. Neural Netw 2003; 16(1): 5–9.

11. Fei-Fei, Fergus, Perona. One-shot learning of object categories. IEEE Trans Pattern Anal Mach Intell 2006; 28(4): 594–611.

12. Pan, Yang. A survey on transfer learning. IEEE Trans Knowl Data Eng 2010; 22(10): 1345–1359.

13. Candela JQ, Sugiyama, Schwaighofer, et al. Dataset shift in machine learning. Cambridge, MA: The MIT Press, 2009.

14. Tang, Wang. Face sketch recognition. IEEE Trans Circuits Syst Video Technol 2004; 14(1): 50–57.

15. Liu, Tang, Jin, et al. A nonlinear approach for face sketch synthesis and recognition. In: 2005 IEEE conference on computer vision and pattern recognition, 2005, pp.1005–1010. New York: IEEE.

16. Han, Shan, et al. Improving face sketch recognition via adversarial sketch-photo transformation. In: 2019 IEEE international conference on automatic face gesture recognition, 2019, pp.1–8. New York: IEEE.

17. Wan, Lee. Generative adversarial multi-task learning for face sketch synthesis and recognition. In: 2019 IEEE international conference on image processing, 2019, pp.4065–4069. New York: IEEE.

18. Goodfellow, Abadie, Mirza, et al. Generative adversarial nets. In: NIPS’14: Proceedings of the 27th international conference on neural information processing systems, Montreal, QC, Canada, 2014, pp.2672–2680.

19. Zhu, Park, Isola, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: 2017 IEEE international conference on computer vision, 2017, pp.2242–2251. New York: IEEE.

20. Mirza, Osindero. Conditional generative adversarial nets. Comput Sci 2014; 1411: 2672–2680.

21. Klare, Jain AK. Heterogeneous face recognition: matching NIR to visible light images. In: 2010 international conference on pattern recognition, Istanbul, Turkey, 2010, pp.1513–1516.

22. Wang, Gao. Random sampling for fast face sketch synthesis. Pattern Recognit 2018; 76: 215–227.

23. Cheraghi, Lee HJ. Sp-net: a novel framework to identify composite sketch. IEEE Access 2019; 7: 131749–131757.

24. Ravi, Larochelle. Optimization as a model for few-shot learning. In: International conference on learning representations, Toulon, France, 2017, pp.1–11.

25. Finn, Abbeel, Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In: 2017 international conference on machine learning, 2017, pp.1126–1135.

26. Vinyals, Blundell, Lillicrap, et al. Matching networks for one shot learning. In: Advances in neural information processing systems, Barcelona, Spain, 2016, pp.3630–3638.

27. Wang, Zhu, et al. Attention based Siamese networks for few-shot learning. In: 2018 IEEE international conference on software engineering and service science, 2018, pp.551–554. New York: IEEE.

28. Garcia, Bruna. Few-shot learning with graph neural networks. In: 2017 international conference on learning representations, 2017, pp.1–11.

29. Guo, Zhu, Zhao, et al. Learning meta face recognition in unseen domains. In: 2020 IEEE conference on computer vision and pattern recognition, 2020, pp.6162–6171. New York: IEEE.

30. Long, Wang. Learning transferable features with deep adaptation networks. Clin Orthop Relat Res 2015.

31. Ganin, Ustinova, Ajakan, et al. Domain-adversarial training of neural networks. J Mach Learn Res 2017; 17(1): 2096–2030.

32. Snell, Swersky, Zemel. Prototypical networks for few-shot learning. In: 2017 advances in neural information processing systems, Long Beach, CA, USA, 2017, pp.4077–4087.

33. Zhang, Ren, et al. Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition, 2016, pp.770–778. New York: IEEE.

34. Gretton, Borgwardt, Rasch, et al. A kernel two-sample test. J Mach Learn Res 2012; 13: 723–773.

35. Borgwardt, Gretton, Rasch, et al. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 2006; 22(14): e49–e57.

36. Long, Zhu, Wang, et al. Unsupervised domain adaptation with residual transfer networks. In: 2016 advances in neural information processing systems, Barcelona, Spain, 2016, pp.136–144.

37. Galea, Farrugia RA. A large-scale software-generated face composite sketch database. In: 2016 international conference of the biometrics special interest group, Darmstadt, 2016.

38. Zhang, Wang, Tang. Coupled information-theoretic encoding for face photo-sketch recognition. In: 2011 IEEE conference on computer vision and pattern recognition, 2011, pp.513–520. New York: IEEE.

39. Rallings, Thrasher, Gunter, et al. The FERET database and evaluation procedure for face-recognition algorithms. Image Vis Comput 1998; 16(5): 295–306.

40. Samma, Suandi, Mohamad-Saleh. Component-based face sketch recognition using an enhanced evolutionary optimizer. SN Appl Sci 2019; 1(8): 939.

41. Zhang, et al. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process Lett 2016; 23(10): 1499–1503.

42. Kingma. Adam: a method for stochastic optimization. Comput Sci 2014.

43. Galea, Farrugia RA. Fusion of intra- and inter-modality algorithms for face-sketch recognition. In: Azzopardi, Petkov (eds.) Computer analysis of images and patterns. CAIP 2015. Lecture Notes in Computer Science, vol. 9257. Cham: Springer, 2015, pp.700–711.

44. Klum, Han, Klare, et al. The FaceSketchID system: matching facial composites to mugshots. IEEE Trans Inf Forensics Secur 2014; 9(12): 2248–2263.

45. Galea, Farrugia RA. Face photo-sketch recognition using local and global texture descriptors. In: 24th European signal processing conference (EUSIPCO), Budapest, 2016, pp.2240–2244.

46. Wan, Gao, Lee HJ. Transfer deep feature learning for face sketch recognition. Neural Comput Appl 2019; 31: 9175–9184.

47. Zhou, Kuang, Wong. Markov weight fields for face sketch synthesis. In: 2012 IEEE conference on computer vision and pattern recognition, 2012, pp.1091–1097. New York: IEEE.

48. Jin, Ruan. Coupled discriminative feature learning for heterogeneous face recognition. IEEE Trans Inf Forensics Secur 2015; 10(3): 640–652.

49. Mignon, Jurie. CMML: a new metric learning approach for cross modal matching. In: Asian conference on computer vision, Daejeon, 2012, pp.8–14.

50. Feng, Huang, et al. Cross-modality multi-task deep metric learning for sketch face recognition. In: 2019 Chinese automation congress, 2019, pp.2277–2281.

51. Long, Cao, Wang, et al. Conditional adversarial domain adaptation. In: 2018 advances in neural information processing systems, Montréal, QC, Canada, 2018, pp.1640–1650.

52. Chen, Wang, Long, et al. Transferability vs. discriminability: batch spectral penalization for adversarial domain adaptation. In: 2019 international conference on machine learning, Long Beach, CA, USA, 2019, pp.1081–1090.