Abstract
The advent of Large Language Models (LLMs) has brought text generation to a human level, enabling malicious uses such as disinformation propagation and academic dishonesty. Existing detection methods suffer from low detection rates and poor generalization on multilingual generated text and short text. To address these gaps, we propose a generic bilingual generated-text detection model that integrates semantic and statistical features and performs well in both English and Chinese. To obtain fine-grained features, we employ the multilingual pre-trained language model xlm-RoBERTa to extract the CLS vector as an overall semantic representation, which is combined with the statistical features log rank, probability, and cumulative probability for detection. Moreover, Shapley additive explanations (SHAP) are used to interpret the model's decision-making process. The experimental results demonstrate significant advancements over baselines, with F1 score improvements exceeding 10% and 5% on the English and Chinese HC3 sentence-level datasets, respectively. Our method also generalizes better to advanced LLMs and out-of-domain datasets, achieving a 91.13% F1 score, and thus provides a more robust solution for detecting generated text.
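The following is a minimal sketch of the feature-fusion idea described above, assuming the HuggingFace transformers library, the xlm-roberta-base checkpoint for the CLS semantic vector, and a GPT-2 causal LM as a stand-in scorer for the statistical features (mean log probability and mean log rank; cumulative probability is omitted for brevity). The paper's exact scorer, feature set, and downstream classifier may differ.

```python
# Hypothetical sketch: fuse an xlm-RoBERTa [CLS] embedding with simple
# statistical features computed under a causal LM scorer.
# Model names and feature choices are assumptions, not the paper's exact setup.
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

semantic_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
semantic_enc = AutoModel.from_pretrained("xlm-roberta-base")
scorer_tok = AutoTokenizer.from_pretrained("gpt2")
scorer_lm = AutoModelForCausalLM.from_pretrained("gpt2")

@torch.no_grad()
def extract_features(text: str) -> torch.Tensor:
    # Semantic features: the [CLS] (first-token) hidden state of xlm-RoBERTa.
    enc = semantic_tok(text, return_tensors="pt", truncation=True)
    cls_vec = semantic_enc(**enc).last_hidden_state[:, 0, :]  # shape (1, 768)

    # Statistical features: log probability and log rank of each observed
    # token under the scorer LM, averaged over the sequence.
    ids = scorer_tok(text, return_tensors="pt", truncation=True).input_ids
    logits = scorer_lm(ids).logits[:, :-1, :]   # predictions for next tokens
    targets = ids[:, 1:]                        # the tokens actually observed
    log_probs = torch.log_softmax(logits, dim=-1)
    tok_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    ranks = (log_probs > tok_logp.unsqueeze(-1)).sum(-1) + 1
    stats = torch.stack([tok_logp.mean(), torch.log(ranks.float()).mean()])

    # Concatenate semantic and statistical features for a downstream classifier.
    return torch.cat([cls_vec.squeeze(0), stats])

features = extract_features("This sentence may have been written by an LLM.")
print(features.shape)  # 768 semantic dims + 2 statistical features
```

In this sketch the fused vector would simply be fed to a binary classifier (e.g., a small MLP or logistic regression) trained to separate human-written from generated text.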
