Abstract
In this study, we set up a scalable framework for large-scale data processing and analytics using a big data framework. Popular classification methods are implemented, tuned, and evaluated on intrusion datasets, with the objective of selecting the best classifier after hyper-parameter optimization. We observed that the decision tree (DT) approach outperforms the other methods in terms of classification accuracy, training time, and average prediction rate. It is therefore selected as the base classifier in our proposed ensemble approach to study class imbalance. Because the intrusion datasets are imbalanced, most classification techniques are biased toward the majority class, and the misclassification rate is higher for the minority class. We propose an ensemble-based method that combines K-Means, RUSBoost, and DT to mitigate the class imbalance problem, empirically investigate the impact of class imbalance on the performance of classification approaches, and compare the results using performance metrics such as Balanced Accuracy, Matthews Correlation Coefficient, and F-Measure, which are more suitable for assessing classifiers on imbalanced datasets.
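To illustrate why these metrics are preferred over plain accuracy on imbalanced data, the following minimal sketch computes them from a binary confusion matrix using their standard textbook definitions. It is not the evaluation code used in the study. The function name `imbalance_metrics` and the example counts (990 normal records, 10 attacks) are hypothetical and chosen only for illustration.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute accuracy, balanced accuracy, MCC, and F-measure (F1)
    from binary confusion-matrix counts. The positive class is taken
    to be the minority (attack) class; this is an assumption."""
    total = tp + fp + tn + fn
    # Recall on the positive class (true positive rate).
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    # Recall on the negative class (true negative rate).
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    accuracy = (tp + tn) / total if total else 0.0
    # Balanced accuracy is the mean of the per-class recalls.
    balanced_accuracy = (tpr + tnr) / 2

    # Matthews Correlation Coefficient; set to 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # F-measure (F1) is the harmonic mean of precision and recall.
    f_measure = (
        2 * precision * tpr / (precision + tpr) if (precision + tpr) else 0.0
    )

    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
        "f_measure": f_measure,
    }


if __name__ == "__main__":
    # Hypothetical classifier that labels every record as "normal":
    # all 990 normal records are correct, all 10 attacks are missed.
    print(imbalance_metrics(tp=0, fp=0, tn=990, fn=10))
```

For this hypothetical majority-class predictor, accuracy is 0.99, while balanced accuracy is 0.5 and both MCC and F-measure are 0. This exposes that the minority class is never detected.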
