A text-dependent copy-paste plagiarism and text-rewriting plagiarism model

Abstract

Plagiarism is common in English writing exams. Researchers classify it into copy-paste plagiarism and text-rewriting plagiarism, but existing detection models rely on a single checking method and produce unsatisfactory results. To address copy-paste plagiarism in English writing, this paper proposes a digital fingerprint model based on an N-gram window-jumping mechanism. The model combines a sliding window with an improved matching tool to reduce the excessive fingerprint density and low checking efficiency that arise during text extraction. It also adds a Fisher-Yates shuffle algorithm with a salt parameter to resolve hash collisions in text matching. Experiments show that the model can detect copy-paste plagiarism in English compositions. For text-rewriting plagiarism, this paper designs a model that combines TextCNN and BiGRU so that the extracted semantic representation captures both the local and the contextual features of the text. Experiments on the MRPC dataset show that this model improves accuracy by 1.9 percentage points and the F1 value by 1.2 percentage points compared with other models.
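The abstract does not spell out the fingerprinting procedure, so the following is only a minimal sketch of the general idea behind N-gram fingerprints selected with a sliding window. It uses the classic winnowing-style rule (keep the minimum hash in each window) as a stand-in, not the paper's window-jumping mechanism or its improved matching tool, and it does not model the Fisher-Yates step. The function names, the word-level tokenization, the default values `n=3` and `window=4`, the use of a salted BLAKE2b hash, and the Jaccard overlap score are all assumptions made for illustration.

```python
import hashlib
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_hashes(tokens, n, salt=b""):
    """Hash every run of n consecutive tokens to a 64-bit integer.

    The salt (assumed here to be at most 16 bytes, the BLAKE2b limit)
    changes the hash values without changing which n-grams are equal.
    """
    hashes = []
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i:i + n]).encode("utf-8")
        digest = hashlib.blake2b(gram, digest_size=8, salt=salt).digest()
        hashes.append(int.from_bytes(digest, "big"))
    return hashes


def fingerprint(text, n=3, window=4, salt=b""):
    """Keep only the minimum hash of each sliding window of n-gram hashes.

    This thins out the fingerprint set compared with keeping every hash.
    """
    hashes = ngram_hashes(tokenize(text), n, salt)
    if not hashes:
        return set()
    if len(hashes) <= window:
        return {min(hashes)}
    return {min(hashes[i:i + window])
            for i in range(len(hashes) - window + 1)}


def overlap(text_a, text_b, n=3, window=4, salt=b""):
    """Jaccard similarity between the two fingerprint sets (0.0 to 1.0)."""
    fa = fingerprint(text_a, n, window, salt)
    fb = fingerprint(text_b, n, window, salt)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog near the river bank"
    suspect = "yesterday i saw that the quick brown fox jumps over the lazy dog again"
    print(overlap(source, suspect, salt=b"exam-2024"))
```

With this selection rule, any run of at least `window + n - 1` shared tokens fills one complete window in both texts, so the two fingerprint sets share at least one value. A checker built this way would flag a pair when the overlap score exceeds a chosen threshold; the threshold itself is a tuning decision that the abstract does not specify.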

Keywords

Plagiarism checking, copy and paste, text rewriting

