Integrating NLP and ensemble methods for large-scale phishing email detection

Abstract

Phishing emails remain a significant cybersecurity threat and call for robust detection systems. This paper presents a methodology for detecting phishing emails on a large-scale dataset that combines natural language processing (NLP) techniques with ensemble learning methods. The ensemble methods evaluated are AdaBoost, XGBoost, Gradient Boosting Machine (GBM), Light Gradient Boosting Machine (LGBM), CatBoost, Extra Trees, and Random Forest, together with Stacking and Majority Voting ensembles. In the experiments, the Stacking ensemble achieved 98.89% accuracy, precision, recall, and F-measure, with a false positive rate (FPR) and a false negative rate (FNR) of 0.01 each. The Majority Voting ensemble achieved 98.56% accuracy, precision, recall, and F-measure, with an FPR of 0.02 and an FNR of 0.01. These findings show that modern ensemble approaches can detect phishing emails with high accuracy and low error rates. Combining NLP-based feature extraction with ensemble models offers a viable method for combating phishing attacks in real-world applications.
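As an illustration of the Majority Voting procedure and the reported evaluation measures, the following is a minimal sketch rather than the authors' implementation. It assumes hard (label-level) voting over three hypothetical base classifiers and uses toy labels where 1 denotes phishing and 0 denotes legitimate; the function names `majority_vote` and `binary_metrics` and all data are illustrative assumptions. FPR and FNR are computed here as fractions.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Hard majority vote over equal-length label lists, one list per base model."""
    voted = []
    for votes in zip(*predictions_per_model):
        # Pick the most frequent label for this email; an odd number of
        # binary voters (as below) guarantees there is no tie.
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, F-measure, FPR and FNR."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f_measure": safe_div(2 * precision * recall, precision + recall),
        "fpr": safe_div(fp, fp + tn),  # legitimate emails flagged as phishing
        "fnr": safe_div(fn, fn + tp),  # phishing emails that were missed
    }


# Toy ground truth and hypothetical base-model outputs (not from the paper).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
model_a = [1, 1, 0, 0, 0, 1, 1, 0]
model_b = [1, 0, 1, 0, 0, 1, 1, 0]
model_c = [1, 1, 1, 0, 1, 0, 0, 0]

voted = majority_vote([model_a, model_b, model_c])
metrics = binary_metrics(y_true, voted)
print(voted)
print(metrics)
```

In this toy run, individual errors made by a single model are outvoted, while an error shared by two of the three models survives the vote and shows up as a false positive. A Stacking ensemble differs in that a meta-learner is trained on the base models' outputs instead of counting votes.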

Keywords

Phishing email detection, natural language processing, ensemble learning, stacking, majority voting, cybersecurity
