Special issue – Operationally relevant methods for Big Data problems

Abstract

Big Data is an emerging problem in a variety of domains, as detailed in Bihl et al.¹ While many different characteristics exist for Big Data, three are of primary interest: the volume at which it is created, the velocity at which it is sensed, and the variety of data types. These are colloquially termed the “3 Vs” of Big Data.¹

This special issue focuses on the operationally relevant application of methods; this implies the use of practical methods that solve real-world Big Data problems. While many publications address academic problems, e.g. MNIST digit recognition, publications on real-world problems and solutions are more limited. From our perspective, this is due to real-world problems being incredibly complicated and messy,² proprietary solutions being used, and/or the reliance on effective and efficient methods over complicated and elegant ones. Additionally, in real-world problems, ground truth is rarely available and complex algorithms can become brittle and fail.

When handling real-world Big Data challenges, the latest advanced algorithms are not always used, since they can be brittle or unexplainable. Instead, advanced procedures are frequently seen, in which appropriate algorithms and logic are combined to solve the problem at hand. Big Data analytics for real-world problems focuses primarily on practical utility, while theoretical utility can lag. Thus, research publications on practically relevant problems are not always seen.

Of interest to this special issue are such advanced procedures. To this end, this special issue contains four papers. The first paper, by Butler et al., is entitled “The effectiveness of using diversity to select multiple classifier systems with varying classification thresholds.” While decision fusion, or ensembling, of classifiers generally improves results and provides robustness, the selection of classifiers can be difficult. Two approaches are generally considered: selecting classifiers based on their accuracy and selecting classifiers that make different decisions. This paper considers the relative merits of both approaches.
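To make the two selection approaches concrete, the following is a minimal sketch and not the procedure used by Butler et al. It assumes hard class labels, uses the pairwise disagreement rate as one common diversity measure, and searches all subsets exhaustively, which is only practical for small classifier pools. The function names, the classifier labels A to D, and the toy predictions are hypothetical.

```python
from itertools import combinations


def accuracy(pred, truth):
    """Fraction of samples where the prediction matches the true label."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def disagreement(pred_a, pred_b):
    """Fraction of samples on which two classifiers give different labels."""
    return sum(a != b for a, b in zip(pred_a, pred_b)) / len(pred_a)


def select_by_accuracy(preds, truth, k):
    """Pick the k classifiers with the highest individual accuracy."""
    ranked = sorted(preds, key=lambda name: accuracy(preds[name], truth),
                    reverse=True)
    return sorted(ranked[:k])


def select_by_diversity(preds, k):
    """Pick the k-subset (k >= 2) with the highest mean pairwise disagreement."""
    def mean_disagreement(subset):
        pairs = list(combinations(subset, 2))
        return sum(disagreement(preds[a], preds[b]) for a, b in pairs) / len(pairs)

    return sorted(max(combinations(sorted(preds), k), key=mean_disagreement))


# Hypothetical toy data: true labels and the outputs of four classifiers.
truth = [1, 0, 1, 1, 0, 0, 1, 0]
preds = {
    "A": [1, 0, 1, 1, 0, 0, 1, 1],
    "B": [1, 0, 1, 1, 0, 0, 0, 1],
    "C": [0, 1, 0, 1, 0, 0, 1, 0],
    "D": [1, 0, 1, 0, 1, 1, 0, 0],
}

by_accuracy = select_by_accuracy(preds, truth, k=2)
by_diversity = select_by_diversity(preds, k=2)
print("selected by accuracy: ", by_accuracy)
print("selected by diversity:", by_diversity)
```

On this toy data the two criteria choose different pairs, which illustrates the tension between them: the most accurate classifiers tend to make the same errors, while the most diverse ones are individually weaker.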

Variety is itself a Big Data challenge, and fusing data from different sensing modalities is difficult. The second paper, by Shen et al., entitled “A Joint Manifold Leaning Based Framework for Heterogeneous Upstream Data Fusion,” aims to develop methods for these types of problems. In this work, a joint manifold learning fusion (JMLF) method is developed that is applicable to multi-modal or mixed sensor fusion. When compared with model-based approaches, JMLF is considerably more accurate.

Temporal issues are inherent in Big Data challenges, wherein data and events are constantly logged and considerations must be made for dynamic operations. Storage allocation in warehouses is one such dynamic operation, whereby shipping operations dynamically change the situation and allocation in the warehouse. Ghalehkhondabi and Masel address issues in this area in the third paper, entitled “Storage allocation in a warehouse based on the forklifts fleet availability.” In this paper, storage allocation based on efficient utilization of resources is considered, rather than the traditional focus on product shipping distance. Simulations in this paper show that the proposed storage allocation could reduce forklift idle time when compared to traditional storage allocation methods.

One domain that highlights all “3 Vs” of Big Data is cybersecurity. Cyber data is very much multi-modal (text, numbers, temporal, etc.) and is constantly being generated in massive amounts. One example of this problem is in firewall logs, wherein a large enterprise network could see tens of thousands of flagged events in a minute from even more general internet traffic activity. While algorithmic methods exist to handle the bulk of this data, these fail to find security issues that have not been seen before, and forensic analysis is needed to find as yet undiscovered patterns and events. Gutierrez et al., in “Cyber Anomaly Detection: Using Tabulated Vectors and Embedded Analytics for Efficient Data Mining,” develop a method for this problem area that incorporates both machine learning methods and analyst insight to process large firewall logs. Beyond the methodology, Gutierrez et al. also present a dashboard module, which enables this work to be applied in an enterprise-level cyber forensics platform.

We hope that readers find the papers in this special issue interesting. These papers were selected for their ability to solve real-world Big Data problems, with emphasis on which of the “3 Vs” (volume, variety, and velocity) are being handled; the final paper handles all “3 Vs” in a real-world, enterprise-level context. We further thank the anonymous reviewers for their prompt reviews and the publisher for publishing the papers quickly.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

1. Bihl, Young and Weckman. Defining, understanding, and addressing Big Data. Int J Bus Anal 2016; 3: 1–32.

2. Hernández and Stolfo. Real-world data is dirty: data cleansing and the merge/purge problem. Data Mining Knowl Discov 1998; 2: 9–37.