Abstract
Convolutional neural networks (CNNs) were popular in ImageNet large scale visual recognition competition (ILSVRC 2012) because of their identification ability and computational efficiency. This paper proposes a palm vein recognition method based on CNN. The four main steps of palm vein recognition are image acquisition, image preprocessing, feature extraction, and matching. To reduce the processing steps in the recognition of palm vein images, a palm vein recognition method using a CNN is proposed. CNN is a deep learning network. Palm vein images are acquired using near-infrared light, under which the veins in the palm of the hand are relatively prominent. To obtain a good vein image, many previous methods used preprocessing to further enhance the image before using feature extraction to find feature matches for further comparison. In recent years, CNNs have been shown to have great advantages and have performed well in image classification. To reduce early-stage image processing, a CNN is used to classify and recognize palm vein images. The networks AlexNet and VGG depth CNN were trained to extract image features. The palm vein recognition rates by VGG-19, VGG-16, and AlexNet were 98.5%, 97.5%, and 96%, respectively.
Introduction
User authentication identification methods include passwords, cards, and biometrics (O’Gorman, 2003). With the increase in financial activities and safety awareness, followed by the development of science, technology, and societal progress, traditional identity authentication, such as passwords, personal identification numbers, and smart cards are largely unable to meet today’s convenience and reliability needs. Biometrics technologies classify authentication identification into physiological features, such as the user’s face, fingerprint, iris, and veins, or personal behaviour features, such as signature, gait, and voice (Jain
The process of recognizing palm veins is complicated by image preprocessing, feature extraction, and matching methods. The process is shown in Fig. 1. First, a palm vein image is preprocessed to make the image features more obvious. Second, a method must be found to extract the features of the image. Finally, feature matching must be performed via distance measures to identify the user. To reduce the manual selection process, deep learning is proposed, and target features can be automatically selected through model training. Deep learning excels in issues related to visual recognition, voice recognition, natural language, and time series. Convolutional neural networks (CNNs) were further studied by LeCun
The major contributions of this paper are as follows: We propose a palm vein recognition method based on a CNN that uses palm vein features to identify users. CNNs reduce the labourious manual selection process and simplifies authentication.
Palm vein recognition process. Finally, we test the performance metrics of CNN, such as the False Acceptance Rate (FAR) and the False Rejection Rate (FRR).

Vein recognition includes palm veins (Kang and Wu, 2014; Lu and Yuan, 2017; Raut and Humbe, 2015), finger veins (Huang Structure-based methods use line features and point features. This information is highly dependent on the selected coordinates and very sensitive to spatial occlusion; thus, the region of interest (ROI) must be carefully selected. Usually, when poorly performing images and feature points are used in feature extraction, difficulty in matching occurs. This method is also relatively sensitive to scaling, rotation, and displacement (Akinsowon and Alese, 2013; Xu, 2015). Appearance-based (subspace-based) methods use features to reduce dimensionality from high-dimensional space to low-dimensional space and retain the required information. Common methods include PCA, LDA, ICA, and subspace clustering (SSC) (Liu and Zhang, 2011; Raut and Humbe, 2015). Statistical-based methods use local statistics, such as the mean and variance of each small area, which are calculated and considered as characteristics. Gabor, wavelet, and Fourier transforms have been applied, including local binary patterns (LBP) or local derivative patterns (LDP) (Aglio-Caballero Local invariant-based methods use an image pyramid that is constructed to form a three-dimensional image space. The local maxima of each layer are obtained using a Hessian matrix, and a neighbourhood corresponding to the scale is selected at the feature point to find the main direction. With the main direction as the axis, coordinates can be established at each feature point with methods including scale invariant feature transform (SIFT) and speeded up robust features (SURF) (Gurunathan Fusion, also a very popular method, can improve the accuracy and security of the system, but the relative amount of data required is large. There are many fusion methods, including the use of palm, finger, vein, and face and many fusion rules, including algorithms, SVM, and neural networks (Garg Deep learning. In recent years, the prevalence of neural networks has made CNNs the most popular image classification method; the recognition rate is not inferior to previous methods (Huang
With the many research topics discussed above, because of the convenience and practicality of CNNs, we propose a palm vein recognition method based on CNN. The main contributions of our work are as follows. The vein pattern of the palm of the hand was photographed with a near-infrared camera using a pretrained CNN model called AlexNet (Krizhevsky
Preliminaries

Using the CNN framework for palm vein recognition.
In this section, we introduce the CNN model and the network architecture. In the proposed scheme, the model is trained and used for identification. A CNN has advantages in image recognition. According to the pioneering work of Lenet (Lécun
In this paper, a CNN is used to analyse the important parts of the network and the overall architecture. A CNN is a type of deep neural network. It has important advantages in image recognition processing. CNN feature extraction is different in each layer, and as the number of layers increases, feature classification improves.
Previous neural networks were still very limited when dealing with problems, such as computer vision, natural language processing and voice recognition. Due to the appearance of the convolution kernel, CNN has advantages for processing multidimensional data. The most important concepts of the CNN are the convolution layer, activation function, pooling layer, and fully connected layer.
Convolution Layer

Convolution operation of
The processing performed by the convolutional layer is called the “convolutional operation”. It uses filters to process and extract features and reduce noise. As the local links and weights are shared, the training parameters are significantly reduced. In high-dimensional images, all connected neurons will need to train a large number of parameters, which causes the calculation to be time consuming and leads to severe overfitting. The characteristics of local links and weight sharing simplifies parameter training. As shown in Fig. 3, the image is
The activation function is used to add nonlinear factors. If the activation function is not used in the neural network, the output and input cannot be separated from each other using a linear relationship. Therefore, it is meaningless to make a deep neural network without the activation function. The Sigmoid and Tanh functions were previously used as the activation function, but gradient disappearance easily occurs when using these functions. When the neural network is deepened, training obstacles often occur. The ReLU function can effectively overcome gradient disappearance and requires less computation. The ReLU function gradually replaced the Sigmoid and Tanh functions.

The activation functions.
The data extracted by the convolutional layer have a high data dimension. The primary role of the pooling layer is to reduce the data dimension to avoid overfitting. The commonly used methods are max pooling and average pooling. In addition to reducing data dimensionality, pooling is resistant to image translation and slight deformation. Figure 5 shows the max pooling processing. For reducing the data dimensionality, pooling extracts the largest features with a specific window size to avoid overfitting problems.

The max pooling processing.
The entire CNN acts as a classifier. If the operations of the convolutional layer, pooling layer, and activation function map the original data to the hidden layer feature space, the fully connected layer will learn the “distributed feature representation” maps to the role of the sample tag space. Among them, the result is classified by Softmax and normalized to obtain the probability value, as shown in (4).
Palm Vein Database
The two palm vein databases currently published have a multispectral palm vein database. The database equipment used by the Hong Kong Polytechnic University has a fixed contact type acquisition device (Zhang

Original palm vein images.
Krizhevsky

Structure of AlexNet.

Structure of VGG16 and VGG19.
In 2014, an image recognition method based on a VGG network was developed by the University of Oxford. VGG was the first to use small
Transfer Learning
In most machine learning, deep learning, and data mining tasks, we assume the training and inference use data following the same distribution and coming from the same feature space. However, in practical applications, this assumption is difficult to establish and often encounters some problems: The number of marked training samples is limited. For example, when dealing with the classification of the target domain, there is a lack of sufficient training samples. Additionally, there is a large number of training samples in the B (source) domain related to the A field, but the B field and the A field are in different feature spaces or their samples obey different distributions. The data distribution will change. The data distribution can be related to time, location or other dynamic factors. With a change in dynamic factors, the data distribution will change. Thus, the previously collected data are outdated, and so it must be recollected and the model must be rebuilt.
At this time, knowledge transfer, i.e. transferring the knowledge in the B field to the A field and improving the classification in the A field, whose data do not require much time to mark, is a good choice. Migration learning, a new method of transfer learning, was proposed to solve this problem (Pan and Yang, 2010).
Transfer learning is a research issue in machine learning. The model will be trained by transferring the trained model parameters to new problems. Considering that most of the data or tasks are relevant, we can learn the model parameters (this can also be understood as model learned knowledge) through transfer learning in a way that optimizes and improves the speed of the learning efficiency of the model and does not require learning from zero, as with most networks.
We use pretrained AlexNet and VGG to move from the 1 000 classes of the original model to our palm vein categories.
Table 1 shows the network architecture of AlexNet, including the use of each layer, stride and padding in the convolution layer, following the rules to adjust the image, and the dropout in the fully connected layer. We used transfer learning to convert the last three layers into our classification, as in the AlexNet example. VGG16 and VGG19 use the same method to replace the last three layers.
Architecture network of AlexNet.
Table 1 is the initial network architecture. The last three layers are initially set to 1 000 categories, and we adjust for the new categories in Table 2.
Fully connected layer and output layer adjustment.
In Table 2, we adjust the final fully connected layer, Softmax layer, and output layer to our experimental classification.
In this section we discuss the results of the CNN palm vein recognition experiments and analysis. We constructed AlexNet and VGG using MATLAB and the deep learning convolutional neural network framework. The main process includes the following: image acquisition, model training, parameter adjustment, and result identification.
Dataset
We established two set of datasets of palm veins. The first dataset contains twenty images from each of 50 individuals, for a total of 1 000 experimental images. The second dataset contains twenty images from each of 63 individuals, for a total of 1 260 experimental images. The size of a captured image is
Equipment of the first dataset.
Equipment of the first dataset.
Equipment of the second dataset.
The experimental environment for our equipment of the first dataset is shown in Table 3. We use a Windows 10 64-bit system, and 8 G memory, NVIDIA GTX940MX; the framework is MATLAB. The equipment of the second dataset is shown in Table 4. We use a Windows 10 64-bit system, and 32 G memory, NVIDIA GTX1080; the framework is MATLAB.

Training process of AlexNet.
The training of CNN is to optimize various parameters. Different parameters, as well as the network structure and usage, will affect the effectiveness and training speed of CNN recognition. The AlexNet neural network model is used as an example to illustrate the training process of palm vein recognition, whose network structure is shown in Fig. 9. The input image size of the first layer is 227*227*3, where 3 represents the colour of the image (RGB). The equation for the convolution and pooling layers is
The first convolution layer (
Experimental Results of the First Dataset
Training iterations results of the first dataset.
Training iterations results of the first dataset.
In this paper, the experimental process includes the following: randomly selecting the training image and validation sets, adjusting the batch value according to the performance of the hardware, and adjusting the learning rate. The hyperparameter adjustment on the model will affect the experimental results. In this process, the final numbers of iterations were 800, 800, and 1 000 in AlexNet, VGG-16 and VGG-19, respectively, as shown in Table 5, and the accuracy rates were 96%, 97.5%, and 98.5%, respectively. Experimental process diagrams are shown in Figs. 10, 11 and 12. The loss function was used to evaluate the degree of inconsistency between the predicted value and the true value of the model. It is a nonnegative real-valued function; the smaller the loss function, the better the robustness of the model. The loss function is as follows:
The number of iterations in Table 5 is either 800 or 1 000, and the recognition rate is as high as 98.5%.

Accuracy, iterations and loss for iterations of AlexNet of the first dataset.

Accuracy, iterations and loss for iterations of VGG-16 of the first dataset.

Accuracy, iterations and loss for iterations of VGG-19 of the first dataset.
The comparison with other authors’ research results are shown in Table 6. Fronitasari and Gunawan (2017) proposed using DCLBP, which is a LBP, average LBP, median LBP and LBP original corrections, and feature matching process using PNN. (Badrinath
Test results for various algorithms of the first dataset.
Based on the performance of biometrics-based verification systems (Seshikala
Performance metrics of the first dataset.
From Table 7, it can be seen that performance of CNN is quite remarkable among the unprocessed images.
In this study, we divided the image dataset into two sets, namely a training set and a validation set. For the training and validation of the three models AlexNet, VGG-16, and VGG-19, we set the number of iterations to be 1000 times.
The palm vein images in the database were collected by using a near-infrared light camera. Since these were contactless shots, each image had a different resolution and needed to be preprocessed. To obtain the best result of image contrast enhancement while avoiding noise amplification, we used a new technique for contactless palm vein recognition. To preprocess the input image, our design uses contrast limited adaptive histogram equalization (CLAHE) to enhance the image quality and thus the feature distinguishability. Figure 13(a) shows a raw image, and Fig. 12(b) shows the enhanced image.

(a) Original image; (b) Enhanced image.

Training processes and accuracy of the models of the second dataset.
The training processes of the three models are shown in Fig. 14. In Table 8, we present the final results of accuracy of the three models. All three models gave excellent results with the accuracy rate of AlexNet being 99.35%, the accuracy rate of VGG-16 being 99.45%, and the accuracy rate of VGG-19 being 99.5%.
Training iteration results of the second dataset.
The smaller the loss function, the better the stability. The loss function is the core part of the empirical risk function and an important part of the structural risk function. The common loss function includes the following factors: Hinge Loss: Mainly used in support vector machines (SVM). Cross Entropy Loss; Softmax Loss: Used in Logistic Regression and Softmax Classification. Square Loss: Mainly in Ordinary Least Squares (OLS). Exponential Loss: Mainly used in the Adaboost integrated learning algorithm. Other loss (such as 0–1 loss, absolute value loss)
Here we mainly use Cross Entropy Loss, or Softmax Loss. Figure 15 shows the loss function of the training processes of the four models. It can be seen that all four models were very stable after training. Figure 16 shows the training and verification results for four different users randomly selected after the final validation.

Training process and loss of the models of the second dataset.

Training results of four random users.
In addition, we also tested the models with two different graphics cards, and the training time and accuracy differed accordingly. The results are shown in Table 9. Please note that in this part of our experiment, the test image had not been preprocessed.
Performance comparison among models when using different graphics cards.
The statistics of our performance metrics for the three models of the second dataset are shown in Table 10.
Performance metrics of the second dataset.
The performance comparison between our approaches with others are shown in Table 11. Cancian
Performance comparison among similar methods.
This paper proposed a palm vein recognition method based on a convolutional neural network. Through AlexNet and VGG, the entire palm vein recognition process is simplified, and image processing, feature extraction, and matching methods are no longer as complicated. The unique advantages of CNNs compared to other methods include greater convenience and excellent accuracy rates; there are also good results for the performance metrics. In our first palm vein dataset, 1000 images from 50 individuals were adopted for testing, and an FRR of 0.6% was achieved. The above three models provide a new research method for palm vein recognition and prove the advantages of deep learning in the image field. The second dataset, including 1260 images from 63 individuals, was adopted for testing, and an FRR of 0.3% was achieved. We tried to preprocess the palm vein image, use the CLAHE method to increase the contrast of the image, highlight its feature, and improve the accuracy of the three types of CNN to 99%. When using different graphics cards, the training time will have an enormous impact, and the accuracy will be slightly affected.
In the future, this study can be improved in two directions. First, the palm vein images used in our method need to be clear and uncontaminated files, and the effect on shadows is not very good. In the future, we can further improve the pre-processing method of CLAHE proposed in this paper to remove some noise and shadows, which should improve the recognition rate. The second one is towards the finger vein recognition so that it can be applied to the security system control of handheld mobile devices.
