Abstract

In this study, we set up a scalable framework for large-scale data processing and analytics using a big data framework. Popular classification methods are implemented, tuned, and evaluated on intrusion datasets, with the objective of selecting the best classifier after hyper-parameter optimization. We observed that the decision tree (DT) approach outperforms the other methods in terms of classification accuracy, training time, and average prediction rate. It is therefore selected as the base classifier in our proposed ensemble approach to study class imbalance. Because intrusion datasets are imbalanced, most classification techniques are biased toward the majority class, and the misclassification rate is higher for the minority class. We propose an ensemble-based method that uses the K-Means, RUSBoost, and DT approaches to mitigate the class imbalance problem; empirically investigate the impact of class imbalance on the performance of classification approaches; and compare the results using popular performance metrics such as Balanced Accuracy, Matthews Correlation Coefficient, and F-Measure, which are more suitable for assessing performance on imbalanced datasets.



An Ensemble-Based Scalable Approach for Intrusion Detection Using Big Data Framework
