Dog recognition in public places based on convolutional neural network

Abstract

As the number of dogs in cities increases, dogs can be seen everywhere in public places. At the same time, more and more stray dogs appear in public places where dogs are prohibited, which affects the city environment and personal safety. In view of this, we propose a novel algorithm that combines the dense–scale invariant feature transform and a convolutional neural network to solve the dog recognition problem in public places. First, the image is divided into several grids; then, the dense–scale invariant feature transform algorithm is used to split and combine the descriptors, and the channel information of the eight directions of the image is extracted as the input of the convolutional neural network; finally, we design a convolutional neural network based on the Adam optimization algorithm and cross-entropy to identify the dog species. The experimental results show that the algorithm combines the advantages of the dense–scale invariant feature transform and the convolutional neural network to achieve dog recognition in public places, with a recognition rate of 94.2%.

Keywords

Convolutional neural network, dense–scale invariant feature transform, dog recognition

Introduction

With the improvement of people’s living standards, more and more people are starting to raise dogs. Residential areas and parks are great places to take a dog, because the owner does not need to worry about joggers, kids on bikes, and inattentive drivers. Like any recreational area, however, these public places still carry risks. People and dogs get injured in dog parks throughout the United States.¹ Unneutered male dogs and fearful dogs can be dangerous, because they might fight or bite as a fear reaction. Some dog owners do not clean up after their dogs, which can spread disease. In 2018, a student from Xiangtan University was bitten by many stray dogs on campus.² After the incident was published on the Internet, it caused a public outcry, and the reputation of Xiangtan University suffered as a result. Face recognition in public areas has been studied extensively, but little attention has been paid to dog recognition in public places. Therefore, we propose a dog recognition algorithm that can help the managers of public places. For residential areas, schools, parks, and other areas that prohibit dogs, it can significantly improve safety.

Dog detection in public places falls within the scope of target detection. Compared to face detection, dog detection involves greater inter-species differences and is a more challenging pattern recognition problem. Y Zhang et al.³ proposed an algorithm for dog detection in an elevator. The algorithm was based on a stacked autoencoder (SAE), combined with the histogram of oriented gradient (HOG) to characterize the dog’s features. S Kumar and SK Singh⁴ proposed a fusion-based pet identification method that identifies dogs by biometrics. H Perez-Espinosa et al.⁵ used dog barks to identify dogs through machine learning. AA Mohamed et al.⁶ proposed a novel face recognition technique based on the discrete wavelet transform and the local binary pattern (LBP) with an adapted threshold to recognize avatar faces in different virtual worlds. They all achieved good results. However, none of them addressed dog species detection in a public-place environment, which limits their applicability here.

For the particular environment of public places, we propose a dog recognition algorithm based on convolutional neural networks. First, the image is divided into several grids; then the dense–scale invariant feature transform (SIFT) algorithm is used to split and combine the descriptors, and the channel information of the eight directions of the image is extracted as the input of the convolutional neural network (CNN); finally, we design a CNN based on the Adam optimization algorithm and cross-entropy to identify the dog species. The algorithm has achieved good results in both time and space.

Dense-SIFT feature extraction

Feature extraction is one of the most important steps in image classification, and its quality directly affects classification performance. DG Lowe⁷ proposed the SIFT algorithm for local image feature extraction, which has been widely used in image matching, object recognition, and other fields. Later, Lazebnik et al.⁸ proposed dense-SIFT, an improved SIFT feature extraction algorithm. The method first sets the grid size and sampling step size, then traverses the image from top to bottom and from left to right, and finally extracts the SIFT features of each image block to form a feature descriptor. We use dense-SIFT because it extracts the features of the image in eight directions in advance, which helps to improve the performance of the CNN model. The steps for dense-SIFT feature extraction are as follows:

1. Divide the image into several equal-sized grids.

We divide the original image (m pixels × m pixels) into regions of size x × x and discard any region smaller than x × x, which gives $(m - x) / x \times (m - x) / x$ grid images of size x × x.

2. SIFT feature extraction for each divided grid image.

The feature descriptor is obtained by applying the SIFT algorithm to each small rectangular block produced by the division. SIFT descriptors are rich in information and distinctive. They are invariant to image rotation, changes in light intensity, and changes in image scale. Moreover, they are robust to changes in the viewing angle and to occlusion of the target. DG Lowe⁹ showed that the SIFT descriptor performs best when the area around the feature point is divided into 4 × 4 sub-areas. In this case, each sub-area serves as a key point, and each key point has eight directions, so each feature descriptor is a vector of 16 × 8 = 128 dimensions.

3. Split and combine the descriptors.

First, we describe the key points and extract the dimensional information that belongs to the same direction. Then, this information is combined according to its original positional relationship to form 4 × 4 intermediate information. Finally, we place the intermediate information at the position that its rectangular block occupies in the original image, which yields description information of dimension $4 (m - x) / x \times 4 (m - x) / x$. The splitting of the descriptor is shown in Figure 1.

Figure 1.

The descriptor is split.

We extract the dimensional information of each 128-dimensional descriptor that belongs to the same direction and combine it according to its original positional relationship to form 4 × 4 intermediate information. Its matrix representation (equation (1)) is as follows

[\begin{matrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{matrix}]

(1)

After the descriptors are normalized, the value of each dimension is not greater than 1. Since each dimension is a magnitude accumulated along one direction, no negative values occur $(0 \leq a_{ij} \leq 1)$. By splitting all $(m - x) / x \times (m - x) / x$ descriptors in the same way, we obtain $(m - x) / x \times (m - x) / x$ blocks of 4 × 4 intermediate information. The 4 × 4 intermediate information of each small rectangular block is placed at the position that the block occupies in the original image. This gives a description of dimension $[4 (m - x) / x] \times [4 (m - x) / x]$. Its matrix representation (equation (2)) is as follows

[\begin{matrix} a_{11} & \cdots & a_{1 M} \\ \vdots & \ddots & \vdots \\ a_{M 1} & \cdots & a_{MM} \end{matrix}]

(2)

This is equivalent to the gray-level information of an image with a resolution of $[4 (m - x) / x] \times [4 (m - x) / x]$, which can be used as an input of the CNN. If the feature maps in each direction are split and combined in the same way, we obtain the grayscale information of eight images. These eight grayscale images are the input to the CNN. The algorithm flow is shown in Figure 2.

Figure 2.

Dense-SIFT feature extraction algorithm flow.
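The split-and-combine step can be illustrated with a minimal sketch. It assumes that each 128-dimensional descriptor is stored with the eight direction values of one sub-area adjacent to each other, that is, index (r × 4 + c) × 8 + d for sub-area (r, c) and direction d. This storage order, the function name `split_and_combine`, and the toy input are assumptions made for illustration and are not taken from the original implementation.

```python
def split_and_combine(descriptors):
    """Rearrange a G x G grid of 128-dim descriptors into 8 channel images.

    Assumed layout (hypothetical): desc[(r * 4 + c) * 8 + d] is the value of
    direction d in sub-area (r, c) of the 4 x 4 descriptor grid.
    """
    grid = len(descriptors)
    side = 4 * grid  # each descriptor contributes a 4 x 4 block per channel
    channels = [[[0.0] * side for _ in range(side)] for _ in range(8)]
    for gy in range(grid):
        for gx in range(grid):
            desc = descriptors[gy][gx]
            for r in range(4):
                for c in range(4):
                    for d in range(8):
                        # Keep the block at the position of its grid cell.
                        channels[d][4 * gy + r][4 * gx + c] = desc[(r * 4 + c) * 8 + d]
    return channels


# Toy 2 x 2 grid of descriptors; every value encodes its direction index.
toy = [[[(i % 8) / 8 for i in range(128)] for _ in range(2)] for _ in range(2)]
images = split_and_combine(toy)
print(len(images), len(images[0]), len(images[0][0]))
```

With a 28 × 28 grid of descriptors, the same function would return eight 112 × 112 channels, which matches the input size used by the network in the next section.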

Construction of dog species recognition model based on CNN

CNNs are the most commonly used algorithms in image recognition and have also achieved remarkable results in speech recognition, text analysis, and video processing. Mainstream CNN models such as LeNet,¹⁰ VGGNet,¹¹ and InceptionNet¹² are mostly based on RGB three-channel image input, and few models take eight-channel images as input. We therefore designed a CNN model specifically for the eight-channel image input. The network structure is given in Figure 3.

Figure 3.

Convolutional neural network structure.

After the dense-SIFT algorithm extracts features, the image feature size is 112 × 112 × 8.

The first layer: convolution layer; the size of the convolution kernel is 3 × 3; the step size is 1; the number of feature maps is 64; the activation function is ReLU; and the output size is 112 × 112 × 64.

The second layer: max pooling layer; the size of the kernel is 3 × 3; the step size is 2; the output size is 56 × 56 × 64; batch normalization¹³ is applied to the output. Batch normalization is a technique for improving the performance and stability of artificial neural networks. It provides any layer in a neural network with inputs that have zero mean and unit variance, and it normalizes the input of a layer by adjusting and scaling the activations. For a layer with d-dimensional input $x = (x^{(1)}, \ldots, x^{(d)})$, we normalize each dimension

{\hat{x}}^{(k)} = \frac{x^{(k)} - E [x^{(k)}]}{\sqrt{Var [x^{(k)}]}}

(3)

We introduce, for each activation $x^{(k)}$ , a pair of parameters $λ^{(k)}$ and $β^{(k)}$ , which scale and shift the normalized value

y^{(k)} = λ^{(k)} {\hat{x}}^{(k)} + β^{(k)}

(4)

These parameters are learned along with the original model parameters and restore the representation power of the network.
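Equations (3) and (4) can be sketched for a single activation dimension as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `batch_norm` and the small constant `eps` are assumptions (equation (3) has no such term; it is added here only to avoid division by zero), and `scale` and `shift` stand for $λ^{(k)}$ and $β^{(k)}$.

```python
import math


def batch_norm(values, scale=1.0, shift=0.0, eps=1e-5):
    """Normalize one activation dimension over a mini-batch.

    scale and shift play the roles of lambda^(k) and beta^(k) in equation (4).
    eps is an assumed stabilizer that does not appear in equation (3).
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [scale * (v - mean) / math.sqrt(var + eps) + shift for v in values]


activations = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(activations)
print(normalized)
```

The normalized values have a mean of zero and a variance close to one; choosing other values for `scale` and `shift` moves the output to any desired mean and spread, which is how the learned parameters restore the representation power.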

The third layer: convolution layer; the size of the convolution kernel is 3 × 3; the step size is 1; the number of feature maps is 192; the activation function is ReLU; and the output size is 56 × 56 × 192.

The fourth layer: max pooling layer; the size of the kernel is 3 × 3; the step size is 2; the output size is 28 × 28 × 192; batch normalization is applied to the output.

The fifth layer: convolution layer; the size of the convolution kernel is 3 × 3; the step size is 1; the number of feature maps is 192; the activation function is ReLU; and the output size is 28 × 28 × 192.

The sixth layer: max pooling layer; the size of the kernel is 3 × 3; the step size is 1; the output size is 28 × 28 × 192; batch normalization is applied to the output.

The seventh layer: two Inception Module¹² layers, which use convolution kernels of different scales to process the features. The four results are concatenated, and the output size is 14 × 14 × 480.

The subsequent layers follow the same pattern. Finally, the average pooling layer, the linear layer, and the Softmax layer produce the scores for the 121 categories. Detailed network model parameters are shown in Table 1.

Table 1.

Convolutional neural network model structure.

Type	Kernel size	Stride	Output size
Input			112 × 112 × 8
Convolution	3 × 3	1	112 × 112 × 64
Max Pool	3 × 3	2	56 × 56 × 64
Convolution	3 × 3	1	56 × 56 × 192
Max Pool	3 × 3	2	28 × 28 × 192
Convolution	3 × 3	1	28 × 28 × 192
Max Pool	3 × 3	1	28 × 28 × 192
2 × Inception	Inception Module		14 × 14 × 480
3 × Inception	Inception Module		7 × 7 × 832
2 × Inception	Inception Module		7 × 7 × 1024
Average Pool	7 × 7	1	1 × 1 × 1024
Linear	Logits		1 × 1 × 2048
Softmax	Classifier		1 × 1 × 121
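The output sizes of the first six layers in Table 1 can be reproduced with a short calculation. The sketch assumes "same" padding, as provided by TensorFlow, so that each layer outputs ceil(input size / stride) values per side; the padding mode is not stated in the text and is an assumption here, as is the helper name `same_padding_output`.

```python
import math


def same_padding_output(size, stride):
    """Spatial output size of a layer with 'same' padding (assumed)."""
    return math.ceil(size / stride)


# (layer type, stride, number of output channels) as listed in Table 1.
layers = [
    ("Convolution", 1, 64),
    ("Max Pool", 2, 64),
    ("Convolution", 1, 192),
    ("Max Pool", 2, 192),
    ("Convolution", 1, 192),
    ("Max Pool", 1, 192),
]

size = 112  # side length of the eight-channel input
shapes = []
for name, stride, channels in layers:
    size = same_padding_output(size, stride)
    shapes.append((name, size, size, channels))
    print(f"{name}: {size} x {size} x {channels}")
```

Under this assumption the computed sizes agree with the rows of Table 1 from the first convolution to the third max pooling layer.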

The Adam¹⁴ optimization algorithm is an extension of the stochastic gradient descent algorithm that uses the first-order and second-order moment estimates of the gradient to dynamically adjust the learning rate of each parameter. When a back propagation neural network (BPNN) is trained with Adam, the bias correction keeps the effective learning rate of each iteration within a bounded range, which keeps the parameter updates relatively stable. We therefore use Adam as the optimization algorithm for back propagation (equation (5))

\begin{matrix} m_{t} = β_{1} m_{t - 1} + (1 - β_{1}) g_{t}, v_{t} = β_{2} v_{t - 1} + (1 - β_{2}) g_{t}^{2} \\ {\hat{m}}_{t} = \frac{m_{t}}{1 - β_{1}^{t}}, {\hat{v}}_{t} = \frac{v_{t}}{1 - β_{2}^{t}} \end{matrix}

(5)

where $g_{t}$ is the gradient at time t; $m_{t}$ and $v_{t}$ are the first-order moment estimate and the second-order moment estimate of the gradient at time t, respectively; ${\hat{m}}_{t}$ and ${\hat{v}}_{t}$ are the bias-corrected values of $m_{t}$ and $v_{t}$, respectively; $α$ is the learning rate of the model; and ε is a smoothing term that prevents the denominator from being 0 and is set to 1e−8. Following the recommendation of Kingma and Ba,¹⁴ we set $β_{1}$ to 0.9 and $β_{2}$ to 0.999. The parameter $θ_{t + 1}$ is then updated as follows

θ_{t + 1} = θ_{t} - α \frac{{\hat{m}}_{t}}{\sqrt{{\hat{v}}_{t}} + ε}

(6)
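Equations (5) and (6) amount to the following update for a single scalar parameter. This is a minimal sketch rather than the TensorFlow optimizer used in the experiments; the function name `adam_step` and the toy objective f(θ) = θ² are illustrative assumptions, while the default values of `beta1`, `beta2`, and `eps` are those given in the text.

```python
import math


def adam_step(theta, grad, m, v, t, alpha=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter, following equations (5) and (6)."""
    m = beta1 * m + (1 - beta1) * grad        # first-order moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-order moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy objective (assumed for illustration): f(theta) = theta^2, gradient 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 11):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, alpha=0.01)
print(theta)
```

In the first iteration the bias correction cancels the factors (1 − β₁) and (1 − β₂), so the step has a magnitude of almost exactly α regardless of the gradient scale, which illustrates why the effective learning rate stays within a bounded range.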

The loss function describes how accurately the model classifies the data. The smaller the loss, the smaller the deviation between the classification result of the model and the real data. We use cross-entropy as the loss function. Cross-entropy first appeared in information theory and is widely used in multi-classification problems, communication, error correction codes, and game theory. In addition, an L2 regularization term is added to the loss function to prevent over-fitting. The final loss function is shown in equation (7), where ${\hat{y}}_{i}$ represents the predicted probability distribution, $y_{i}$ represents the true probability distribution, and $λ$ is the regularization coefficient

H (y) = - \sum_{i = 1}^{n} y_{i} \log ({\hat{y}}_{i}) + \frac{λ}{2 n} \sum_{j = 1}^{m} \| w_{j} \|^{2}

(7)
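Equation (7) can be evaluated directly for one example. The sketch below follows the formula term by term and takes n to be the length of the probability vectors; the function name `cross_entropy_l2`, the representation of the weights as a list of vectors, and the sample values are assumptions made for illustration.

```python
import math


def cross_entropy_l2(y_true, y_pred, weights, lam):
    """Cross-entropy with an L2 penalty, following equation (7).

    weights is a list of weight vectors w_j; lam is the regularization
    coefficient. Terms with y_i = 0 are skipped so that log(0) is never taken.
    """
    n = len(y_true)
    cross_entropy = -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)
    penalty = lam / (2 * n) * sum(sum(w * w for w in vec) for vec in weights)
    return cross_entropy + penalty


# Hypothetical three-class example with a one-hot true distribution.
y_true = [0.0, 1.0, 0.0]
y_pred = [0.2, 0.7, 0.1]
weights = [[1.0, 2.0], [0.5]]
loss = cross_entropy_l2(y_true, y_pred, weights, lam=0.3)
print(loss)
```

For a one-hot true distribution only the predicted probability of the correct class contributes to the first term, so the loss decreases as that probability approaches 1, while the penalty grows with the squared magnitude of the weights.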

Experiment and analysis

Introduction to the experimental environment

The experimental environment and configuration are given in Table 2.

Table 2.

Experimental environment and configuration.

Environment	Configuration
OS	Ubuntu Server 14.04
Memory	16GB
CPU	Intel® Xeon® CPU E5-4603 v2
GPU	Matrox Electronic Systems Ltd
TensorFlow	1.3.0
CUDA	8.0
cuDNN	8.0.61

Experimental data

Many public data sets of dogs are available on the Internet, but their categories, quantity, or image size do not meet our needs (for example, they lack Chinese garden dogs). Pictures of dogs in various categories, however, are abundant on the web. Therefore, we use Scrapy and Redis distributed crawler technology to crawl image data from the web, and we clean, crop, classify, and label the captured images. We have crawled the image data of 121 categories of dogs (see Appendix 1 for the specific categories), with 500 pictures per category and an image size of 560 × 560. Some of the images are shown in Figure 4.

Figure 4.

Part of the data set.

Experimental recognition results and analysis

Features are extracted from each image and then split and combined. We set the step size to 20 pixels, so a picture with a resolution of 560 × 560 is divided into 28 × 28 small pictures with a size of 20 × 20. Using the dense-SIFT algorithm, 28 × 28 descriptors of 4 × 4 × 8 dimensional information are obtained. Applying the split-and-combine method of section “Dense-SIFT feature extraction” to all descriptors gives eight channels of grayscale images with a resolution of (28 × 4) × (28 × 4) = 112 × 112. In the training process, 80% of the data are used as the training sample, 10% as the test sample, and 10% as the validation sample. We train with a learning rate of 1e–4; the accuracy and loss function of the test sample are shown in Figure 5.

Figure 5.

Accuracy (left) and loss function (right) of the test set.
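The sizes quoted in the paragraph before Figure 5 can be checked with a few lines of arithmetic. The sketch uses only numbers stated in the text; the variable names are illustrative, and the split counts assume that every one of the 121 categories contributes exactly 500 images.

```python
image_side = 560   # pixels per side of each crawled image
grid_side = 20     # step size and grid size in pixels
sub_areas = 4      # each descriptor covers 4 x 4 sub-areas
directions = 8     # directions per sub-area

grids_per_side = image_side // grid_side                  # small pictures per side
descriptor_dims = sub_areas * sub_areas * directions      # dimensions per descriptor
channel_side = grids_per_side * sub_areas                 # side of each grayscale channel
input_shape = (channel_side, channel_side, directions)    # CNN input

total_images = 121 * 500
train_count = total_images * 80 // 100
test_count = total_images * 10 // 100
validation_count = total_images - train_count - test_count

print(grids_per_side, descriptor_dims, input_shape)
print(train_count, test_count, validation_count)
```

The resulting input shape is the 112 × 112 × 8 tensor listed in the first row of Table 1.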

It can be seen from Figure 5 that the accuracy rises throughout most of the training process. After 1000 iterations, the accuracy and the loss function gradually stabilize, and the recognition rate is about 90%. The fast convergence is mainly due to the Adam optimization algorithm. After 2000 iterations, the final accuracy is 94.2%. The accuracy and loss function values of the validation set are shown in Figure 6.

Figure 6.

Accuracy (left) and loss function (right) of the validation set.

We can see that when the model has been trained for 3000 iterations, the accuracy of the validation set is at its highest and the value of the loss function is at its lowest. When training continues beyond this point, the accuracy decreases, which indicates over-fitting. We therefore train the model for 3000 iterations with a learning rate of 1e–4.

The experimental results were compared with models such as support vector machine (SVM), InceptionNet V3, VGGNet, and BPNN (three layers). The results are shown in Table 3.

Table 3.

Comparison of experimental results.

Method	Recognition time (s)	Recognition rate (%)
Ours	0.432	94.2
InceptionNet V3	0.663	85.7
VGGNet	0.871	83.4
SVM	0.470	80.8
BPNN (three layers)	0.028	70.6

SVM: support vector machine; BPNN: back propagation neural network.

In terms of recognition rate, the proposed method achieves a correct recognition rate of 94.2%, which benefits from the feature extraction of dense-SIFT. The dense-SIFT algorithm extracts the information of the picture in eight directions, so the features are preprocessed before they enter the neural network. In terms of recognition speed, the proposed algorithm is faster than the InceptionNet V3 and VGGNet networks. The main reason is that the proposed network model has fewer layers than InceptionNet V3 and VGGNet, which gives it an advantage in recognition time.

Relationship between learning rate and recognition rate

When the learning rate is set to 0.1, 0.01, 0.001, 0.0005, 0.0001, and 0.00001, the experimental results of the recognition rate are shown in Figure 7.

Figure 7.

Relationship between learning rate and recognition rate.

It can be seen from Table 4 that when the learning rate is 0.1, the recognition rate stays at about 30%, which indicates that features cannot be extracted effectively at this rate. When the learning rate is 0.01, the recognition rate increases gradually up to 50 iterations and then declines, indicating that over-fitting occurs after 50 iterations. When the learning rate is 0.001, the recognition rate stays near 51% and never exceeds 53.33%. When the learning rate is 0.00001, the recognition rate shows no sign of over-fitting, but it reaches only 54.67%. When the learning rate is 0.0001 or 0.0005, the recognition rate increases gradually to about 70% with no downward trend. This shows that both learning rates enable the network to perform effective feature extraction and learning without over-fitting. With a learning rate of 0.0001, the recognition rate reaches 70% after only 40 iterations, whereas with a learning rate of 0.0005 it approaches 70% only after nearly 100 iterations. The network therefore performs better with a learning rate of 0.0001, so we choose this value.

Table 4.

Comparison of recognition rates in different learning rates.

Learning rate	Iterations
	10	20	30	40	50	60	70	80	90	100
	Recognition rate (%)
0.1	30.33	30	30.67	30	30	30.67	30	30.67	30	30
0.01	40	53	55	57	62.67	56	46.33	44.67	40	38
0.001	50	50.67	53.33	51	51.67	51	50.67	48.67	44.67	43
0.0005	48	52	55.67	54.67	59	60	67.33	68	69.67	69
0.0001	49	64	68.33	70	71.67	71.67	71	72	70	70.33
0.00001	48	48.33	50	49.67	51	51.67	52.33	53.67	54	54.67
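The comparison in Table 4 can be summarized programmatically. The sketch below copies the recognition rates from the table and reports, for each learning rate, the peak rate, the iteration at which it first occurs, and the rate after 100 iterations; the names `table4` and `summarize` are illustrative.

```python
iterations = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Recognition rates (%) from Table 4, keyed by learning rate.
table4 = {
    0.1: [30.33, 30, 30.67, 30, 30, 30.67, 30, 30.67, 30, 30],
    0.01: [40, 53, 55, 57, 62.67, 56, 46.33, 44.67, 40, 38],
    0.001: [50, 50.67, 53.33, 51, 51.67, 51, 50.67, 48.67, 44.67, 43],
    0.0005: [48, 52, 55.67, 54.67, 59, 60, 67.33, 68, 69.67, 69],
    0.0001: [49, 64, 68.33, 70, 71.67, 71.67, 71, 72, 70, 70.33],
    0.00001: [48, 48.33, 50, 49.67, 51, 51.67, 52.33, 53.67, 54, 54.67],
}


def summarize(rates):
    """Return (peak rate, iteration of the first peak, rate at the last iteration)."""
    peak = max(rates)
    return peak, iterations[rates.index(peak)], rates[-1]


for learning_rate, rates in table4.items():
    print(learning_rate, summarize(rates))

best_learning_rate = max(table4, key=lambda lr: max(table4[lr]))
print("highest peak recognition rate at learning rate", best_learning_rate)
```

The learning rate of 0.01 stands out because its final rate falls far below its peak, which is the over-fitting pattern described in the text, while 0.0001 attains the highest peak of all settings.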

Summary and outlook

We studied the recognition of dogs in public places, designed a CNN model, and used the dense-SIFT extraction algorithm to extract the channel information of the eight directions of the image as the input of the CNN. We have achieved good results in accuracy and recognition time. Future work will address the following three aspects: (1) improving the recognition speed and generalization ability of the feature model; (2) detecting the number and location of dogs; and (3) testing in bad weather.

Footnotes

Appendix

Appendix 1

Dog breed

1	Affenpinscher	32	Chow	62	Italian Greyhound	92	Rhodesian Ridgeback
2	Afghan Hound	33	Clumber	63	Japanese Spaniel	93	Rottweiler
3	African Hunting Dog	34	Cocker Spaniel	64	Keeshond	94	Saint Bernard
4	Airedale Terrier	35	Collie	65	Kelpie	95	Saluki
5	American Staffordshire Terrier	36	Curly	66	Kerry Blue Terrier	96	Samoyed
6	Appenzeller	37	Dandie Dinmont	67	Komondor	97	Schipperke
7	Australian Terrier	38	Dhole	68	Kuvasz	98	Scotch Terrier
8	Basenji	39	Dingo	69	Labrador Retriever	99	Scottish Deerhound
9	Basset Hound	40	Dobermann	70	Lakeland Terrier	100	Sealyham Terrier
10	Beagle	41	English Foxhound	71	Leonberg	101	Shetland Sheepdog
11	Bedlington Terrier	42	English Setter	72	Lhasa	102	Shih
12	Bernese Mountain Dog	43	English Springer	73	Malamute	103	Siberian Husky
13	Black	44	Entlebucher	74	Malinois	104	Silky Terrier
14	Blenheim Spaniel	45	Eskimo Dog	75	Maltese Dog	105	Soft
15	Bloodhound	46	Flat	76	Mexican Hairless	106	Staffordshire Bullterrier
16	Bluetick	47	French Bulldog	77	Miniature Pinscher	107	Standard Poodle
17	Border Collie	48	German Shepherd	78	Miniature Poodle	108	Standard Schnauzer
18	Border Terrier	49	German Short	79	Miniature Schnauzer	109	Sussex Spaniel
19	Borzoi	50	Giant Schnauzer	80	Newfoundland	110	Tibetan Mastiff
20	Boston Bull	51	Golden Retriever	81	Norfolk Terrier	111	Tibetan Terrier
21	Bouvier Des Flandres	52	Gordon Setter	82	Norwegian Elkhound	112	Toy Poodle
22	Boxer	53	Great Dane	83	Norwich Terrier	113	Toy Terrier
23	Brabancon Griffon	54	Great Pyrenees	84	Old English Sheepdog	114	Vizsla
24	Briard	55	Greater Swiss Mountain Dog	85	Otterhound	115	Walker Hound
25	Brittany Spaniel	56	Groenendael	86	Papillon	116	Weimaraner
26	Bullmastiff	57	Ibizan Hound	87	Pekinese	117	Welsh Corgi
27	Cairn	58	Irish Setter	88	Pembroke	118	West Highland White Terrier
28	Cardigan	59	Irish Terrier	89	Pomeranian	119	Whippet
29	Chesapeake Bay Retriever	60	Irish Water Spaniel	90	Pug	120	Wire
30	Chihuahua	61	Irish Wolfhound	91	Redbone	121	Yorkshire Terrier
31	Chinese Garden Dog

Handling Editor: Shigeng Zhang

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by National Natural Science Foundation of China (NSFC) (61672495), Scientific Research Fund of Hunan Provincial Education Department (16A208), Project of Hunan Provincial Science and Technology Department (2017SK2405), and, in part, by the Construct Program of the Key Discipline in Hunan Province.

ORCID iD

Hao He

References

1. Dog bite law, https://dogbitelaw.com/dog-parks-liability-for-dog-bites/overview

2. Baijiahao, https://baijiahao.baidu.com/s?id=1599514030341927010

3. Zhang Y, Zang, Hua, et al. Research on algorithm of dog recognition in elevator based on video. Mech Des Manuf Eng 2018; 3: 103–107.

4. Kumar S, Singh SK. Biometric recognition for pet animal. J Softw Eng Appl 2014; 7(5): 470–482.

5. Perez-Espinosa H, Reyes-Meza, Aguilar-Benitez, et al. Automatic individual dog recognition based on the acoustic properties of its barks. J Intell Fuzzy Syst 2018; 34(5): 3273–3280.

6. Mohamed AA, Ibrahem, Dabour, et al. Dogs animal recognition system in IoT environment based on orthogonal statistical adapted local binary pattern. In: International conference on advanced intelligent systems and informatics, Cairo, Egypt, 1–3 September 2017. Berlin: Springer.

7. Lowe DG. Object recognition from local scale-invariant features. In: IEEE international conference on computer vision, Kerkyra, 20–27 September 1999, p.1150. New York: IEEE.

8. Lazebnik, Schmid, Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: IEEE computer society conference on computer vision and pattern recognition, vol. 2, 2006, pp.2169–2178. IEEE.

9. Lowe DG. Distinctive image features from scale-invariant keypoints. Int J Comput Vision 2004; 60: 91–110.

10. Lécun, Bottou, Bengio, et al. Gradient-based learning applied to document recognition. P IEEE 1998; 86(11): 2278–2324.

11. Simonyan, Zisserman. Very deep convolutional networks for large-scale image recognition, 2015, https://arxiv.org/abs/1409.1556

12. Szegedy, Liu, Jia, et al. Going deeper with convolutions. In: IEEE conference on computer vision and pattern recognition (CVPR), Boston, MA, 7–12 June 2015, pp.1–9. New York: IEEE.

13. Ioffe, Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift, 2015, pp.448–456, https://arxiv.org/abs/1502.03167

14. Kingma, Ba. Adam: a method for stochastic optimization, 2014, https://arxiv.org/abs/1412.6980