Abstract
To address the shortcoming that current metro video surveillance systems do not handle crowd density estimation effectively, a Metro Crowd density estimation Network (MCNet) is proposed to automatically classify passenger crowd density levels. First, an Integrating Multi-scale Attention (IMA) module is introduced to enhance the ability of plain classifiers to extract semantic crowd texture features. The novelty of the IMA module lies in fusing dilated convolution, multi-scale feature extraction, and attention mechanisms to obtain multi-scale crowd feature activations from a larger receptive field at lower computational cost, strengthening the activation of crowd features in the top layers. Second, a novel lightweight crowd texture feature extraction network is proposed to automatically extract image texture features for crowd density estimation; its high efficiency and small parameter count make it suitable for deployment on resource-constrained embedded platforms. Finally, the IMA module and the lightweight feature extraction network are combined to form MCNet, whose feasibility is tested on the CIFAR-10 image classification dataset and on four crowd density datasets. Experimental results demonstrate that, with the IMA module, the prediction accuracy of MCNet improves across these datasets, outperforming competing methods in accuracy, total network parameters, and inference speed. Furthermore, experiments on power consumption and inference speed support the feasibility of deploying MCNet on embedded metro platforms. These results show that MCNet is a suitable solution for crowd density estimation in metro video surveillance under complex real-life scenes.
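The core idea behind the IMA module, fusing dilated-convolution branches at several dilation rates and weighting them with an attention mechanism, can be illustrated with a minimal sketch. This is not the paper's actual implementation: the single-channel convolution, the softmax fusion rule, and all function names here are illustrative assumptions.

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation):
    """Valid-mode 2D convolution of a single-channel map with a dilated kernel."""
    kh, kw = kernel.shape
    eff_h = (kh - 1) * dilation + 1  # effective receptive field height
    eff_w = (kw - 1) * dilation + 1
    H, W = x.shape
    out = np.zeros((H - eff_h + 1, W - eff_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + eff_h:dilation, j:j + eff_w:dilation]
            out[i, j] = np.sum(patch * kernel)
    return out

def ima_block(x, kernel, dilations=(1, 2, 3)):
    """Sketch of an IMA-style block: multi-scale dilated branches
    fused by a softmax attention weight per branch (an assumption)."""
    branches = []
    for d in dilations:
        # Pad so every branch keeps the input's spatial size.
        pad = ((kernel.shape[0] - 1) * d) // 2
        xp = np.pad(x, pad, mode="edge")
        branches.append(dilated_conv2d(xp, kernel, d))
    stack = np.stack(branches)                       # (n_branches, H, W)
    scores = stack.mean(axis=(1, 2))                 # global average pooling per branch
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax attention weights
    return np.tensordot(weights, stack, axes=1)      # weighted multi-scale fusion

feat = np.random.default_rng(0).random((16, 16))
smooth = np.ones((3, 3)) / 9.0
fused = ima_block(feat, smooth)  # same spatial size as the input
```

Each dilated branch sees a larger receptive field at the same kernel cost, which is what allows the module to capture crowd texture at multiple scales without a heavy parameter budget.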
