Abstract
As large language models (LLMs) evolve rapidly, distinguishing AI-generated text (AIGT) from human-written text (HWT) is becoming increasingly challenging. Several AIGT detectors have recently been developed to address this challenge and have achieved reasonable accuracy. However, their brittle text representations make them highly susceptible to text perturbations, such that even minor character-level perturbations can reverse their predictions. In this work, we propose a multi-grained latent feature denoising and contrastive representation learning architecture that enhances the granularity, robustness, and distinguishability of text representations, thereby achieving robust AIGT detection. Specifically, we first extract both document-level and fine-grained segment-level features using a dual network, which captures the global and subtle local differences between AIGT and HWT. To encourage feature stability under perturbations, we inject random noise into both latent features and employ a denoising network to reconstruct the original representations. While this does not precisely simulate discrete character-level perturbations, it acts as a feature-level regularizer that suppresses non-essential variations and promotes smoother, more stable representations. Considering the similarities between AIGT and HWT, we further design a contrastive augmentation mechanism to increase the distinguishability between them. Extensive experiments demonstrate that our method not only outperforms baseline models in terms of classification accuracy but also exhibits superior robustness against various text perturbations.
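To make the described pipeline concrete, the following is a minimal sketch, not the authors' implementation: it combines document-level and segment-level features, injects Gaussian noise into both latent features and denoises them with small MLPs, and adds a supervised contrastive term separating AIGT from HWT. All names, the mean-pooling of segment features, the MLP denoisers, and the hyperparameters (noise_std, temperature, loss weights) are illustrative assumptions rather than details from the paper.

```python
# Illustrative sketch only; assumes doc/segment features come from a backbone encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiGrainedDetector(nn.Module):
    def __init__(self, hidden_dim=768, noise_std=0.1):
        super().__init__()
        self.noise_std = noise_std
        # Denoising networks: reconstruct clean features from noised ones (assumed MLPs).
        self.doc_denoiser = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.seg_denoiser = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.classifier = nn.Linear(2 * hidden_dim, 2)  # AIGT vs. HWT

    def forward(self, doc_feat, seg_feat):
        # doc_feat: (B, H) document-level features; seg_feat: (B, S, H) segment-level features.
        seg_pooled = seg_feat.mean(dim=1)  # simple pooling, an assumption of this sketch

        # Inject random noise into both latent features, then denoise.
        doc_noised = doc_feat + self.noise_std * torch.randn_like(doc_feat)
        seg_noised = seg_pooled + self.noise_std * torch.randn_like(seg_pooled)
        doc_rec = self.doc_denoiser(doc_noised)
        seg_rec = self.seg_denoiser(seg_noised)

        # Feature-level regularizer: reconstruct the clean representations.
        denoise_loss = F.mse_loss(doc_rec, doc_feat) + F.mse_loss(seg_rec, seg_pooled)

        fused = torch.cat([doc_rec, seg_rec], dim=-1)
        logits = self.classifier(fused)
        return logits, denoise_loss, fused


def contrastive_loss(feats, labels, temperature=0.1):
    """Supervised contrastive term: pull same-class (AIGT or HWT) features
    together and push the two classes apart."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.T / temperature                            # (B, B)
    self_mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    pos_mask = pos_mask.masked_fill(self_mask, 0.0)                # drop self-pairs
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    pos_per_anchor = pos_mask.sum(1).clamp(min=1)
    return -(pos_mask * log_prob).sum(1).div(pos_per_anchor).mean()


# Example combined objective (weights are illustrative):
#   logits, l_dn, feats = model(doc_feat, seg_feat)
#   loss = F.cross_entropy(logits, labels) + 0.5 * l_dn + 0.5 * contrastive_loss(feats, labels)
```

The point of the sketch is the division of roles: the denoising term regularizes the latent space against perturbation-like variation, while the contrastive term enlarges the margin between the two classes; the classification head then operates on the stabilized, fused representation.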
