Abstract
Crowd counting aims to estimate the number of individuals in images, and the use of multimodal data has been shown to significantly enhance counting accuracy. However, such approaches are highly sensitive to the loss or corruption of data from any single modality, leading to severe performance degradation. To address this limitation, a new problem setting—Modality-Reconfigurable Crowd Counting—is introduced, in which a model is required to maintain robust performance even when one of the input modalities (e.g., RGB or thermal) is perturbed or entirely unavailable. Modality reconfigurability is achieved through effective cross-modal information transfer, enabled by a Feature Patches Generator that leverages Margin Ranking Loss across multiple network layers to align and transfer discriminative features between modalities. Additionally, a Negative Knowledge Transfer Prevention module is incorporated to suppress misleading or detrimental cross-modal signals. State-of-the-art performance is demonstrated on RGB-T crowd counting benchmarks, with consistent accuracy maintained under both complete and degraded modality conditions.
Introduction
In recent years, rapid urbanization and population growth have led to dense gatherings in public spaces, posing significant challenges to public safety and epidemic control—as tragically illustrated by events such as the 2015 New Year’s Day stampede in Shanghai, China, and the 2022 Itaewon crowd crush in Seoul, South Korea. In response, crowd counting has attracted increasing research interest. The task involves estimating the number of individuals present in a given image, typically by generating a density map from the input. The total crowd count is obtained by integrating the values across this map, which simultaneously encodes spatial distribution information.
While extensive research has been devoted to crowd counting using RGB imagery,1–4 such approaches exhibit notable limitations under low-light conditions, as RGB data is highly sensitive to illumination variations. In contrast, thermal imaging captures heat radiation emitted by objects, with the human body serving as a consistent thermal source. This modality remains unaffected by ambient lighting and often enables clearer human detection in darkness, where interfering visual cues are absent. The widespread deployment of thermal sensors—accelerated by public safety initiatives and the global response to the COVID-19 pandemic—has further increased the accessibility of thermal data. Nevertheless, thermal images are not without drawbacks; their relatively low spatial resolution and susceptibility to confusion with non-human heat sources (e.g., vehicles or machinery) can impair counting accuracy in complex scenes.
To leverage the complementary strengths of RGB and thermal modalities, numerous multimodal crowd counting approaches have been proposed.5–10 These methods typically fuse multimodal features during both training and inference, yielding improved accuracy compared to unimodal counterparts. Despite their success, most existing frameworks rely on a critical yet often implicit assumption: all modalities are consistently available and reliable at inference time. In real-world deployments, especially in large-scale or outdoor environments, this assumption is frequently violated. Sensors may become unavailable or unreliable due to hardware malfunction, occlusion, adverse weather, thermal saturation, or severe illumination changes. Under such conditions, tightly coupled multimodal models often experience abrupt performance collapse, sometimes performing worse than unimodal baselines, thereby limiting their practical applicability.
To address this gap, a new problem setting, Modality-Reconfigurable Crowd Counting, is introduced, in which a model is required to maintain robust counting performance when one of the input modalities (e.g., RGB or thermal) is perturbed or entirely unavailable at inference time.
The proposed framework enables modality reconfigurability through a training strategy that exploits cross-modal complementarity without enforcing tight coupling during inference. A Feature Patches Generator is employed in conjunction with Margin Ranking Loss applied across multiple network layers to facilitate effective knowledge transfer between modalities. Crucially, feature extraction pathways for each modality remain independent, ensuring that inference can proceed using either modality alone when necessary. In the event of modality loss, the system seamlessly reconfigures to rely on the available input while retaining performance gains acquired during multimodal training. As a result, competitive counting accuracy is maintained under both complete and degraded input conditions.
Unlike prior multimodal crowd counting approaches that rely on tightly coupled fusion or modality recovery at inference time, the key novelty of this work lies in a modality-reconfigurable training paradigm that strengthens each unimodal branch through directional, performance-aware cross-modal supervision, while preserving strict modality independence during inference. This design enables robust deployment under missing or degraded modality conditions without requiring modality reconstruction or joint inference.
The main contributions of this work are summarized as follows. First, the problem of Modality-Reconfigurable Crowd Counting is introduced, requiring robust performance when an input modality is perturbed or missing. Second, a modality-reconfigurable training framework is proposed, which leverages cross-modal interactions to enhance individual modality representations while preserving inference-time flexibility. Third, experimental results on RGB-T crowd counting benchmarks demonstrate that, with the same backbone architecture, the incorporation of the proposed approach yields improved counting accuracy and robustness under varying environmental conditions, including missing or degraded modalities.
Related work
This section reviews prior work most closely related to the proposed approach, focusing on crowd counting and multimodal learning.
Crowd counting
Crowd counting has become a critical task in computer vision, with applications in public safety, traffic analysis, and urban planning. The evolution of methodologies can be broadly categorized into three paradigms: detection-based, regression-based, and density map estimation-based approaches.
Detection-based methods aim to identify and localize individuals before aggregating counts. Early efforts in this direction include,11–15 and the approach in 16 specifically targets sparse crowd counting in intelligent building environments. Additional detection frameworks have been explored in.17,18 While effective in low-density, unoccluded settings, these methods struggle significantly in crowded or heavily occluded scenes due to missed detections and overlapping instances.
Regression-based techniques were introduced to bypass explicit detection by mapping image features—either global or local—to a scalar crowd count. Representative works in this category include,1–4 which demonstrate improved efficiency over detection-based strategies. However, regression approaches provide no spatial information about crowd distribution, limiting their utility in applications requiring localization.
The current dominant framework is based on density map estimation, where a continuous map is generated such that the integral over the image yields the total count. This paradigm was pioneered in 1 and subsequently advanced by numerous studies. Multi-column architectures were introduced in,2,19 while context-aware modeling was explored in.20–22 Scale-adaptive networks were developed in,23,24 and attention mechanisms were incorporated in.25–27 Further improvements include hierarchical feature fusion,28–30 adversarial training,31,32 and transformer-based designs.33,34 Notably, spatially decomposable features were leveraged in 35 to enhance counting accuracy, and a weather-adaptive module was proposed in 36 to improve robustness under severe weather conditions. Despite their success with RGB imagery, these methods remain highly sensitive to lighting variations and often fail in dimly lit environments where reliable visual features are unavailable.
Multimodal learning
Multimodal learning seeks to enhance model performance by integrating complementary information from heterogeneous data sources. In crowd counting, this has motivated the use of modalities beyond RGB to overcome inherent limitations of visible-light imaging.
Depth-based crowd counting has been investigated in several works, including,37–39 which demonstrate the utility of geometric cues for person localization. However, depth sensors exhibit practical constraints in outdoor environments due to sunlight interference and limited effective range.
Thermal imaging has gained prominence due to its invariance to lighting conditions and strong response to human body heat. With the widespread deployment of infrared devices, thermal data has become increasingly accessible. RGB–thermal (RGB-T) crowd counting has been explored in,5–7 where fusion strategies such as cross-modal feature alignment and joint optimization are employed.8,10 Additional multimodal frameworks have been proposed in,40,41 and broader studies on multimodal representation learning include.42–45 These approaches consistently demonstrate that leveraging complementary features from multiple modalities improves both accuracy and robustness in complex scenes.9,46,47
Nevertheless, a common limitation across existing multimodal methods is their reliance on the simultaneous availability of all input modalities during inference. When one modality is missing or corrupted, for example due to sensor failure, occlusion, or environmental interference, the fused representation may degrade substantially, compromising real-world applicability. A large body of prior work focuses on recovering missing modalities or their latent representations. These approaches attempt to reconstruct unavailable inputs through modality imputation or modality generation, enabling multimodal models to operate under a complete-input assumption.48,49 Another line of research adopts model-level strategies to cope with missing modalities without explicitly reconstructing them.50–52
To further clarify the conceptual distinction from related multimodal robustness paradigms, Table 1 summarises key characteristics of missing-modality recovery, modality dropout, and cross-modal knowledge distillation in comparison with the proposed method.
Conceptual comparison of representative multimodal learning paradigms addressing missing or unreliable modalities.
The present work addresses this gap by introducing modality reconfigurability: a structured information exchange mechanism is employed during training and negative knowledge transfer is explicitly mitigated, thereby enabling more robust unimodal performance under missing-modality conditions.
Despite the progress of existing RGB–thermal crowd counting methods, most approaches share a common assumption that all modalities are consistently available and reliable at inference time. When this assumption is violated, performance often degrades substantially, limiting practical deployment in real-world surveillance scenarios where sensor failure or degradation is common. To better contextualise the proposed approach, Table 2 provides a side-by-side comparison of representative RGB–thermal crowd counting methods, highlighting their core design principles and their limitations under missing-modality conditions. This comparison clarifies the key distinction between existing fusion-based approaches and the modality-reconfigurable paradigm adopted in this work.
Side-by-side comparison of representative RGB–thermal crowd counting methods and their main limitations under missing-modality scenarios.
The proposed framework, termed MRCrowd, is described in the following subsections, beginning with the multimodal feature extraction mechanism.
Multimodal feature extraction via feature patches generator
To enable effective cross-modal knowledge transfer, discriminative local features are extracted from intermediate network representations using the Feature Patches Generator (FPG), inspired by unsupervised feature augmentation techniques.24
Given a feature map extracted from an intermediate layer of a modality-specific backbone, the FPG produces a set of nested patches of increasing size. As illustrated in Figures 1 and 2, a cropping center point is first sampled at random within a central region of the feature map, and patches of progressively larger size are then cropped around this point so that each smaller patch is contained within the next larger one.

Overview of the proposed framework for modality-reconfigurable crowd counting. During training, feature maps from intermediate layers are processed by the Feature Patches Generator (FPG) to produce patch representations used in Margin Ranking Loss. The Preventing Negative Knowledge Transfer (PNKT) module computes adaptive parameters based on the regression losses of the RGB and thermal modalities, ensuring unidirectional knowledge transfer from the more accurate to the less accurate modality. The total loss—comprising regression and ranking terms—is backpropagated through each modality-specific network independently.

Detailed illustration of the Feature Patches Generator (FPG). A cropping center point is randomly selected within a central region of the feature map. Subsequently, a set of nested patches of increasing size is cropped around this point.
The complete procedure is formalized in Algorithm 1.
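As a concrete illustration, the following PyTorch sketch implements one plausible version of the FPG; the number of patches, the size ratios, and the extent of the central sampling region are hypothetical values rather than the exact settings used in this work.

```python
import torch

def feature_patches_generator(feat, num_patches=4, min_ratio=0.5, center_margin=0.25):
    """Illustrative sketch of the Feature Patches Generator (FPG).

    Given a feature map of shape (C, H, W), a cropping center is sampled inside
    a central region, and nested patches of increasing size are cropped around
    it. All numeric defaults here are assumptions for illustration only.
    """
    _, h, w = feat.shape
    # Sample the crop center inside the central region of the map.
    cy = torch.randint(int(h * center_margin), int(h * (1 - center_margin)), (1,)).item()
    cx = torch.randint(int(w * center_margin), int(w * (1 - center_margin)), (1,)).item()

    patches = []
    for i in range(num_patches):
        # Patch size grows linearly from min_ratio to 1.0 of the map size.
        ratio = min_ratio + (1.0 - min_ratio) * i / max(num_patches - 1, 1)
        ph, pw = int(h * ratio / 2), int(w * ratio / 2)  # half-height / half-width
        top, left = max(cy - ph, 0), max(cx - pw, 0)
        bottom, right = min(cy + ph, h), min(cx + pw, w)
        patches.append(feat[:, top:bottom, left:right])
    return patches  # nested: patches[i] is contained in patches[i + 1]
```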
During training, the predictive accuracy of different modalities may vary due to scene-specific factors such as illumination or thermal interference. To prevent detrimental cross-modal influence, a Preventing Negative Knowledge Transfer (PNKT) module is employed to enforce unidirectional knowledge flow—from the more accurate modality to the less accurate one. The relative performance of the RGB and thermal modalities is quantified using the mean squared error (MSE), a standard regression loss. Let $\mathcal{L}_{rgb}$ and $\mathcal{L}_{t}$ denote the regression losses of the RGB and thermal branches, respectively.
The performance difference is defined as $\Delta = \mathcal{L}_{rgb} - \mathcal{L}_{t}$, where a positive value indicates that the thermal branch is currently the more accurate one.
An adaptive weighting function maps this difference to a pair of transfer weights, assigning a non-zero weight only to the branch that currently performs worse, so that knowledge flows exclusively from the stronger modality to the weaker one.
This mechanism prevents the less accurate modality from dominating the learning signal, thereby stabilizing convergence and avoiding negative transfer.
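A minimal sketch of such a performance-aware gating rule is given below. The function name, the sigmoid form, and the `temperature` parameter are illustrative assumptions standing in for the adaptive weighting function described above, whose exact form (involving a focusing parameter) is not reproduced here.

```python
import torch

def pnkt_weights(loss_rgb, loss_t, temperature=1.0):
    """Hypothetical sketch of a PNKT-style weighting rule.

    The modality with the lower regression loss acts as the teacher; only the
    weaker branch receives a non-zero cross-modal weight, and the weight grows
    with the loss gap. Inputs are assumed to be scalar loss tensors.
    """
    delta = loss_rgb.detach() - loss_t.detach()  # positive: thermal currently stronger
    # RGB learns from thermal only when thermal is more accurate, and vice versa.
    w_rgb = torch.sigmoid(delta / temperature) if delta > 0 else torch.tensor(0.0)
    w_t = torch.sigmoid(-delta / temperature) if delta < 0 else torch.tensor(0.0)
    return w_rgb, w_t
```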
The total loss in MRCrowd integrates regression objectives with cross-modal feature alignment, enabling modality-reconfigurable learning. The framework consists of two parallel modality-specific backbones trained jointly. Feature maps from multiple intermediate layers are processed by the Feature Patches Generator (FPG) to produce patch sets for the RGB and thermal branches.
For each patch set, a count estimate is obtained for every patch.
A Margin Ranking Loss is then applied to enforce ordinal consistency among patches: larger patches (which encompass more people) should yield higher count estimates than smaller, nested patches. Given a margin $m$, every pair of nested patches is penalized whenever the count estimated for the inner patch is not at least $m$ lower than that of the enclosing patch, i.e., $\mathcal{L}_{rank} = \sum_{(i,j):\,p_i \subset p_j}\max\left(0,\ \hat{c}_i - \hat{c}_j + m\right)$, where $\hat{c}_i$ denotes the count estimate for patch $p_i$.
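The following sketch shows one way such a nested-patch ranking loss can be computed in PyTorch; the patch ordering convention and the default margin value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def nested_patch_ranking_loss(counts, margin=0.1):
    """Margin ranking loss over count estimates of nested patches.

    `counts[i]` is the estimated count for the i-th patch, ordered from the
    smallest (innermost) to the largest (outermost). Each enclosing patch
    should predict at least `margin` more people than any patch it contains.
    """
    loss = counts[0].new_zeros(())
    num_pairs = 0
    for i in range(len(counts)):
        for j in range(i + 1, len(counts)):
            # Hinge penalty when the inner count is not sufficiently below the outer count.
            loss = loss + F.relu(counts[i] - counts[j] + margin)
            num_pairs += 1
    return loss / max(num_pairs, 1)
```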
To enable controlled cross-modal interaction, the PNKT-generated weights scale the cross-modal ranking terms, so that the strength of the complementarity signal received by each branch depends on the relative accuracy of the two modalities.
Finally, the total loss for each modality combines its own regression loss with the weighted complementarity term derived from the other modality.
During backpropagation, gradients of the total loss are propagated through each modality-specific network independently, so that cross-modal supervision shapes the unimodal representations without introducing any inference-time coupling between the two branches.
This section presents the experimental setup, including the datasets, evaluation metrics, implementation details, and comparative results of the proposed method against existing approaches.
Datasets
Experiments are conducted on two RGB–thermal benchmarks, RGBT-CC and DroneRGBT, and additionally on the RGB–depth ShanghaiTechRGBD dataset54 to assess generality beyond RGB–thermal inputs.
Implementation details
The proposed framework was implemented using PyTorch. CSRNet1 was adopted as the backbone architecture for both modality-specific branches. Stochastic Gradient Descent (SGD) was employed as the optimizer, with an initial learning rate of
The hyperparameters were configured as follows: the focusing parameter
For multimodal inference (denoted RGB+T), predictions were selected based on a brightness threshold applied to the input RGB image. If the average pixel intensity exceeded the threshold, the prediction from the RGB branch was used; otherwise, the thermal branch prediction was adopted. All other implementation details follow those of CSRNet. 1
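A minimal sketch of this brightness-based decision rule is shown below; the function name and threshold value are hypothetical, and pixel intensities are assumed to be normalized to [0, 1].

```python
import torch

def select_prediction(rgb_image, count_rgb, count_t, brightness_threshold=0.35):
    """Decision-level fusion for RGB+T inference, as described above.

    If the mean intensity of the RGB input exceeds the brightness threshold,
    the RGB branch's prediction is used; otherwise the thermal branch's
    prediction is adopted. The threshold here is an illustrative assumption.
    """
    brightness = rgb_image.float().mean()  # assumes intensities in [0, 1]
    return count_rgb if brightness > brightness_threshold else count_t
```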
Evaluation metrics
The performance was evaluated using Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). Additionally, following the protocol in,5 the Grid Average Mean Absolute Error (GAME)53 was employed on the RGBT-CC dataset to assess spatially localized counting accuracy. These metrics are defined as

$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{c}_i - c_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{c}_i - c_i\right)^2}, \qquad \mathrm{GAME}(l) = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{4^{l}}\left|\hat{c}_i^{\,j} - c_i^{\,j}\right|,$

where $N$ is the number of test images, $\hat{c}_i$ and $c_i$ denote the estimated and ground-truth counts for image $i$, and $\hat{c}_i^{\,j}$ and $c_i^{\,j}$ are the corresponding counts within the $j$-th of the $4^{l}$ non-overlapping grid cells; GAME(0) is equivalent to MAE.
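For reference, a straightforward NumPy sketch of these standard metric definitions is given below; it is illustrative rather than project-specific code.

```python
import numpy as np

def mae_rmse(pred_counts, gt_counts):
    """Standard MAE and RMSE over per-image crowd counts."""
    err = np.asarray(pred_counts, dtype=float) - np.asarray(gt_counts, dtype=float)
    return np.abs(err).mean(), np.sqrt((err ** 2).mean())

def game(pred_density, gt_density, level):
    """Grid Average Mean Absolute Error for a single image.

    The density maps are split into 4**level non-overlapping cells and the
    absolute counting errors of the cells are summed; GAME(0) reduces to the
    per-image absolute counting error. Averaging over images gives GAME(l).
    """
    n = 2 ** level
    h, w = pred_density.shape
    total = 0.0
    for i in range(n):
        for j in range(n):
            ps = pred_density[i * h // n:(i + 1) * h // n, j * w // n:(j + 1) * w // n].sum()
            gs = gt_density[i * h // n:(i + 1) * h // n, j * w // n:(j + 1) * w // n].sum()
            total += abs(ps - gs)
    return total
```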
The modality reconfigurability of the proposed framework is evaluated under three conditions: (1) normal operation with both modalities available, (2) missing-modality scenarios, and (3) varying illumination conditions. All experiments are conducted on the RGBT-CC dataset.
1) Full-modality setting
Under standard conditions—where both RGB and thermal inputs are available and no external perturbations are applied—the proposed method is compared against state-of-the-art multimodal approaches that share the same CSRNet backbone.1
As shown in Table 3, the proposed framework achieves the lowest error across all metrics, with a GAME(0) of
Comparison of multimodal crowd counting methods on the RGBT-CC dataset under full-modality conditions. Bold values indicate the best performance. All methods use CSRNet1 as the backbone.
To assess computational efficiency, floating-point operations (FLOPs) are reported in Table 4, using a
Accuracy and computational complexity comparison. GFLOPs are measured on a
2) Missing-modality robustness
To evaluate robustness under partial input failure, we focus on failure patterns that most commonly arise in real-world multimodal sensing systems, where a modality may become unavailable or unreliable due to sensor malfunction, occlusion, or severe environmental interference. Accordingly, two representative failure modes are simulated: (i) replacing the unavailable modality with a zero tensor, which emulates a complete sensor dropout, and (ii) performing inference using only the available modality, corresponding to a unimodal fallback configuration.
Although these settings appear binary in form, they represent extreme yet practically meaningful endpoints of a broader spectrum of modality degradation. Importantly, the proposed framework does not rely on binary modality indicators during training. Instead, the Preventing Negative Knowledge Transfer (PNKT) mechanism continuously modulates cross-modal interaction based on regression loss differences, enabling adaptive learning behavior across varying degrees of modality reliability.
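The two protocols can be sketched as follows; `fused_model` and `unimodal_model` are placeholder names for illustration, and the loop assumes a loader yielding aligned RGB inputs, thermal inputs, and ground-truth counts with batch size 1.

```python
import torch

@torch.no_grad()
def eval_missing_modality(fused_model, unimodal_model, loader, missing="rgb", mode="zero"):
    """Sketch of the two missing-modality protocols described above.

    mode="zero": the unavailable modality is replaced by a zero tensor and the
    multimodal model is run as usual (complete sensor dropout).
    mode="fallback": inference uses only the remaining modality's branch.
    """
    abs_errors = []
    for rgb, thermal, gt_count in loader:  # assumes batch size 1
        if mode == "zero":
            if missing == "rgb":
                rgb = torch.zeros_like(rgb)          # emulate RGB sensor failure
            else:
                thermal = torch.zeros_like(thermal)  # emulate thermal sensor failure
            pred = fused_model(rgb, thermal).sum()
        else:
            available = thermal if missing == "rgb" else rgb
            pred = unimodal_model(available).sum()   # unimodal fallback
        abs_errors.append(abs(pred.item() - float(gt_count)))
    return sum(abs_errors) / len(abs_errors)         # MAE under the chosen protocol
```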
As shown in Table 5, conventional multimodal methods such as CSRNet+IADM suffer severe performance degradation when one modality becomes unreliable—particularly in the absence of RGB input—often performing worse than a unimodal CSRNet baseline. In contrast, the proposed framework maintains strong performance under both unimodal inference settings, achieving GAME(0) scores of 24.22 (RGB-only) and 13.30 (thermal-only), significantly outperforming all competing methods.
Modality reconfigurability evaluation on RGBT-CC under missing-modality conditions. ✓ indicates available modality; ✗ indicates missing. Bold values denote best results.
These results demonstrate that the proposed training strategy enables true modality reconfigurability: the model can seamlessly fall back to a single modality at inference time without performance collapse, while still benefiting from multimodal supervision during training. This behavior directly reflects robustness to partial and progressive modality degradation encountered in practical deployment scenarios.
3) Performance under varying illumination
The dataset is partitioned into bright and dark subsets to assess illumination robustness. Results in Table 6 show that the proposed method consistently outperforms baselines in both conditions. Under bright lighting, the fused model achieves GAME(0)
Performance comparison under bright and dark illumination on RGBT-CC. Bold values indicate best results.
Qualitative results in Figure 3 further illustrate these advantages. In bright scenes with thermal interference (e.g., reflections from glass or car grills), the thermal modality produces spurious detections, while the RGB modality remains reliable. Conversely, in dark scenes, RGB features degrade, but thermal predictions remain accurate. The proposed framework leverages these complementary strengths during training, enabling robust unimodal inference when needed. In contrast, fixed-fusion methods like CSRNet+IADM fail to adapt, often misinterpreting thermal artifacts as people (highlighted in red boxes) or missing counts in low-light RGB inputs.

Visualization of density map predictions on RGBT-CC. Rows 1–3: bright scenes; rows 4–6: dark scenes. (a) RGB input, (b) thermal input, (c) ground truth, (d) RGB-only prediction, (e) thermal-only prediction, (f) CSRNet+IADM, 5 (g) proposed decision fusion. Red boxes indicate regions where thermal interference causes false positives in fixed-fusion methods.
The proposed method is further evaluated on the DroneRGBT dataset, which features aerial-view RGB–thermal image pairs. As shown in Table 7, the framework achieves MAE =
Results on the DroneRGBT dataset. Bold values indicate best performance.
Four ablation studies are conducted to evaluate the contribution of each component in the proposed framework. For consistency, all reported results correspond to the RGB modality and are evaluated on the RGBT-CC dataset.
1) Feature Patches Generator (FPG)
To assess the effectiveness of the Feature Patches Generator in enriching intermediate feature representations, an ablation variant was implemented in which raw feature maps from selected network layers were directly fed into the Margin Ranking Loss, bypassing the FPG. All other components remained unchanged. As shown in Table 8, the absence of patch-based augmentation leads to a measurable performance drop, confirming that the FPG effectively enhances cross-modal feature learning by providing multi-scale contextual information during training.
Ablation of core components (RGB modality, RGBT-CC).
2) Margin Ranking Loss
The utility of the Margin Ranking Loss was verified by replacing it with the standard mean squared error (MSE) loss commonly used in crowd counting. In this variant, the MSE loss was applied to the same feature patches generated by the FPG. Results in Table 8 indicate a degradation in counting accuracy, demonstrating that the ordinal constraint imposed by the Margin Ranking Loss is critical for effective feature-level knowledge transfer. Unlike regression-based losses, the ranking loss preserves relative crowd density relationships across spatial scales, enabling more informative cross-modal alignment.
3) PNKT module
To evaluate the role of the Preventing Negative Knowledge Transfer (PNKT) module, a variant was tested in which the adaptive weights were disabled, allowing knowledge to flow in both directions regardless of the relative accuracy of the two modalities.
4) Feature map position
The impact of feature extraction depth was examined by selecting intermediate layers from different stages of the VGG-16 backbone. Three configurations were evaluated, corresponding to shallow, intermediate, and deep layers. Results in Table 9 show that deeper layers yield more discriminative representations, leading to improved performance. Furthermore, aggregating features from multiple layers consistently outperforms single-layer extraction, highlighting the benefit of multi-level semantic fusion.
Impact of feature map selection (VGG-16 layers).
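For illustration, the following sketch shows how multi-level features can be tapped from a torchvision VGG-16 backbone using forward hooks; the selected indices (ends of the third, fourth, and fifth convolutional stages) are assumptions rather than the exact layers used in this work.

```python
import torch
import torchvision

def extract_stage_features(images, stage_indices=(16, 23, 30)):
    """Collect intermediate feature maps from selected VGG-16 stages.

    `stage_indices` are illustrative positions in torchvision's
    vgg16.features sequential; each hook stores the output of one stage.
    """
    vgg = torchvision.models.vgg16(weights=None).features.eval()
    feats, handles = [], []
    for idx in stage_indices:
        handles.append(vgg[idx].register_forward_hook(
            lambda module, inp, out: feats.append(out)))
    with torch.no_grad():
        vgg(images)          # images: tensor of shape (B, 3, H, W)
    for h in handles:
        h.remove()
    return feats             # one feature map per selected stage
```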
5) Number of patches
The influence of the number of generated patches was also examined; the corresponding results are reported in Table 10.
Effect of the number of generated patches on counting performance.
The proposed training framework involves a weighting parameter that controls the strength of the cross-modal knowledge transfer term. To assess its influence, a sensitivity analysis is conducted over a range of values.
Sensitivity analysis of the knowledge transfer weight on the RGBT-CC dataset.
The results indicate that the proposed method is not overly sensitive to the choice of this weighting parameter.
While this work primarily focuses on RGB–thermal crowd counting, the proposed modality-reconfigurable training strategy is not inherently tied to a specific sensor pairing. To provide additional evidence of generality, we evaluate the proposed framework on the ShanghaiTechRGBD dataset 54 using RGB and depth modalities. The same network architecture and training protocol are adopted, without any modality-specific modification. This experiment is intended to assess whether the proposed training paradigm can be transferred to a different multimodal setting, rather than to establish a new state of the art for RGB–depth crowd counting.
The quantitative results are reported in Table 12, where unimodal inference performance after multimodal training is also presented.
Performance on the ShanghaiTechRGBD dataset.54 "No need for two modalities" indicates unimodal inference after multimodal training.
These results suggest that the proposed modality-reconfigurable training strategy can be applied beyond RGB–thermal inputs, although more extensive evaluation on additional modalities and scenarios remains an important direction for future work.
In this work, a novel problem setting, Modality-Reconfigurable Crowd Counting, is introduced, in which a model must maintain robust counting performance even when one of the input modalities is perturbed or entirely unavailable at inference time.
During training, feature-level interactions between modalities are facilitated through a Feature Patches Generator (FPG), which produces multi-scale patch representations used in a Margin Ranking Loss to enforce ordinal consistency in crowd density estimates. Crucially, knowledge transfer between modalities is regulated by a Preventing Negative Knowledge Transfer (PNKT) module, which adaptively suppresses detrimental cross-modal signals based on relative regression performance. This mechanism enhances the representational capacity of each unimodal branch without introducing inference-time dependencies.
At test time, the system operates in a fully reconfigurable manner: either modality can be used independently, and a simple brightness-based decision rule enables dynamic selection between RGB and thermal predictions in multimodal deployment scenarios. Experimental results on two large-scale RGB–thermal crowd counting benchmarks–RGBT-CC and DroneRGBT–demonstrate that the proposed method achieves state-of-the-art accuracy under both complete and degraded modality conditions, thereby validating the effectiveness of the modality reconfigurability paradigm.
Acknowledgments
This work was supported in part by the Inner Mongolia Autonomous Region Science and Technology Breakthrough Project (2024KJTW0019), the National Natural Science Foundation of China (62301098), the Chongqing Postdoctoral Research Project Special Funding (2023CQBSHTBT004), the Science and Technology Research Program of Chongqing Municipal Education Commission (KJQN202300618), the Chongqing Postdoctoral Science Foundation Project (CSTB2023NSCQ-BHX0109), the China Scholarship Council (202107000087), and the Jiangsu Distinguished Professor Programme.
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
