Abstract
This study investigates imbalanced class distributions in high-dimensional, sparse text corpora, where feature selection is often biased toward majority classes because dominant term frequencies overshadow minority evidence. Many conventional scoring functions fail to capture minority-class information adequately, thereby reducing categorisation effectiveness under skewed sampling. An additional challenge is selecting informative features while limiting the influence of ubiquitous, weakly discriminative terms that destabilise normalisation in sparse regimes. To address these issues, this paper proposes Chi-square Adaptive Weightage (CAW), an adaptive weight-scaling method that enhances minority recognition while maintaining stable, interpretable scoring under imbalance. CAW integrates three components: an adaptive weighting mechanism that compensates for uneven class frequencies during scoring, a normalised probability ratio that suppresses off-class and background evidence to retain class-specific informative terms, and a Chi-square component that statistically reinforces term-class dependence. Robustness is evaluated on two benchmark corpora with distinct characteristics, Reuters-21578 and Ohsumed, using four systematically sampled imbalance ratios (1:1, 2:1, 5:1, and 10:1) across three classifiers. Overall effectiveness is assessed using Macro-F1 and Micro-F1, while minority recognition is assessed using Minority Macro-F1 and mean AUPRC. Aggregated results show that CAW achieves the strongest Macro-F1 on Reuters across all ratios and remains top-tier on Micro-F1. Minority-focused evaluation under SVM provides direct evidence that CAW maintains minority influence as skew increases, ranking consistently in the top tier on Reuters under higher skew and remaining among the top methods on Ohsumed for Minority Macro-F1 and AUPRC. 
Overall, CAW offers a practical imbalance-aware feature selector for sparse text classification, with especially reliable gains under linear decision boundaries.
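
As a point of reference for the Chi-square component named above, the following is a minimal sketch of the standard Chi-square statistic for term-class dependence, computed from a 2x2 document-count contingency table. It is not the CAW score itself: the adaptive weighting and the normalised probability ratio are not specified in this abstract and are not reproduced here. The function name `chi_square` and the argument names `a`, `b`, `c`, `d` are illustrative assumptions.

```python
def chi_square(a, b, c, d):
    """Standard chi-square statistic for one term and one class.

    a: in-class documents that contain the term
    b: out-of-class documents that contain the term
    c: in-class documents that do not contain the term
    d: out-of-class documents that do not contain the term
    """
    n = a + b + c + d
    # Product of the four marginal totals of the 2x2 table.
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no evidence.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# A term concentrated in a 10-document minority class (8 of 10 documents)
# against a 100-document majority class (10 of 100 documents).
minority_term_score = chi_square(a=8, b=10, c=2, d=90)

# A term spread evenly across both classes shows no dependence.
uniform_term_score = chi_square(a=5, b=5, c=5, d=5)
```

Under skewed sampling the raw statistic is driven largely by majority-class counts, which is the kind of bias the abstract says the adaptive weighting is meant to compensate for.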
