Abstract
Data Analytics Solution (DAS) engineering often involves multiple tasks, from data exploration to result presentation, which are applied in various contexts and on different datasets. Semantic modeling based on the open world assumption supports flexible modeling of linked knowledge. The objective of this paper is to review existing techniques that leverage semantic web technologies to tackle challenges such as heterogeneity and changing requirements in DAS engineering. We explore the application scope of those techniques, the different types of semantic concepts they use and the role these concepts play during the DAS development process. To gather evidence for the study we performed a systematic mapping study, identifying and reviewing 82 papers that incorporate semantic models in engineering DASs. One of the paper’s findings is that existing models can be classified into four types of knowledge spheres: domain knowledge, analytics knowledge, services and user intentions. Another finding shows how this knowledge is used in the literature to enhance different tasks within the analytics process. We conclude our study by discussing the limitations of the existing body of research, showcasing the potential of semantic modeling to enhance DASs and presenting the possibility of leveraging ontologies for effective end-to-end DAS engineering.
Introduction
The business intelligence and analytics fields are rapidly expanding across all industry sectors, and many organisations are trying to make analytics an integral part of everyday decision making [12,47]. These fields include the techniques, technologies, systems, practices, methodologies and applications that are concerned with analysing critical business data to help an enterprise better understand its business and market and make timely business decisions [12,33].
There is no universally accepted definition of the data analytics process. CRISP-DM [11] and the KD process proposed by Fayyad et al. [17] are two well-developed and popular examples. In the context of this paper, we identify a “data analytics process” (also called an “analytics pipeline”) as an end-to-end Data Analytics Solution (DAS) that captures tasks related to data mining, knowledge discovery or business intelligence. A DAS can itself be decomposed into multiple tasks such as identifying suitable datasets, developing and validating analytics models and interpreting final results. The software engineering aspect of a DAS, which we refer to as “DAS engineering”, involves designing, developing and deploying DASs and includes tasks such as requirement elicitation, data integration and process composition [43].
With the increasing popularity of big data as a research area, the focus of most research efforts has been on developing specific analysis techniques (e.g. machine learning algorithm design) rather than on supporting the overall DAS engineering [15]. Within many organisations, analysts with limited programming experience are often required to manually establish relationships between the software components of an analytics solution, such as software services used for computation, data elements and data mining algorithms [16,32]. According to the No-Free-Lunch theorem [31], DAS engineering becomes even more challenging: no single model works best for every problem, and depending on the application context and input data, analysts have to try different techniques before obtaining optimal results. Most organisations are looking for flexible solutions that align with their specific objectives and IT infrastructures [1], usually resulting in the use of a mix of data sources and software frameworks. Understanding and managing these heterogeneous technologies needs to be supported by a sound knowledge management infrastructure. In addition to incurring high software development costs, maintaining and evolving heterogeneous software infrastructures in the face of constant changes in both business requirements and technical specifications is very expensive [49].
Ontology is the formal foundation for semantic modeling. The primary roles of an ontology are to capture domain knowledge, to evaluate constraints over domain data and to guide domain model engineering [3]. It is a powerful tool for modeling and reasoning [1]. As ontologies can produce a sound representation of concepts and the relationships between concepts, they provide malleable models that are suitable for tracking various kinds of software development artifacts ranging from requirements to implementation code [39]. Although there have been many efforts in applying semantic modeling for DAS engineering, the overall picture of their capabilities is far from clear.
Hence, this study systematically explores research studies that focus on engineering applications that support DASs with the aid of semantic technology. Further, we identify unresolved challenges and potential research directions in the DAS engineering space. We follow the systematic mapping study process proposed by Petersen et al. [41], collect evidence from publications in five prominent databases and extend the evidence further by snowballing [55] relevant references of identified studies. We conduct our study around a main research question of identifying the existing techniques that use semantic models in DAS engineering, and two related sub-questions. We evaluate how semantic models represent different knowledge areas related to DASs, such as the mental models of the end-user, domain knowledge, the semantics of data, the applicability of analytics algorithms and tools for a particular task, and the compatibility between data and tools, and how this knowledge is leveraged for conducting tasks related to DAS engineering.
The rest of the paper is structured as follows. Section 2 presents the background related to this study. In Section 3 we explain the review method followed. Section 4 includes the results derived from the 82 identified studies, followed by a discussion in Section 5. Section 6 discusses the limitations of the study, and the paper concludes in Section 7.
Background
The literature emphasises the significance of knowledge management in different fields such as enterprise data analytics [53] and scientific workflows [51], and there have been many attempts at identifying knowledge specific to DASs. For example, the ADAGE framework [56] proposes an approach that leverages the capabilities of service-oriented architectures and scientific workflow management systems. The main idea is that the models used by analysts (i.e. workflow, service, and data models) contain concise information and instructions that can be viewed as an accurate record of the analytics process. These models can become useful artifacts for provenance tracking and can ensure reproducibility of such analytics processes. However, designing models that accurately represent the complex business contexts and expertise associated with a DAS remains a challenge. Development methods such as CRISP-DM for enterprise-level data mining [11] and domain-oriented data mining [54] advocate the necessity of using knowledge management techniques for capturing the business domain and understanding of data in order to build better analytics solutions.
There are multiple known knowledge representation approaches related to different aspects of DASs, such as UML diagrams [16,28] and Petri-nets [30]; these are syntactically oriented and limited in their ability to capture semantic details and reason over the knowledge [13]. The focus of this paper is on semantic knowledge representation with ontological models, a technology that originated from the Semantic Web concept [7], where ontologies are expressed through languages such as RDF, RDFS and OWL.
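As a minimal illustration of this notation (the class and property names below are hypothetical, not drawn from any surveyed study), a fragment of a DAS ontology in RDF/Turtle might declare an algorithm concept and relate it to the datasets it can process:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix ex:   <http://example.org/das#> .

# Hypothetical vocabulary: a dataset class, an algorithm class hierarchy
# and a property linking algorithms to the datasets they can process.
ex:Dataset   a owl:Class .
ex:Algorithm a owl:Class .
ex:ClusteringAlgorithm rdfs:subClassOf ex:Algorithm .

ex:applicableTo a owl:ObjectProperty ;
    rdfs:domain ex:Algorithm ;
    rdfs:range  ex:Dataset .
```

Because such statements are open-world and machine-interpretable, the vocabulary can be extended or linked to other ontologies without invalidating existing assertions.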
To our knowledge, no formal study has been conducted on how semantic modeling has contributed to DAS engineering, except the surveys conducted by Abello et al. [1] and Ristoski and Paulheim [46]. In particular, Abello et al. [1] studied the use of semantic web technologies for Exploratory OLAP, considering the data extraction and integration aspects, while Ristoski and Paulheim [46] surveyed, in 2016, the different stages of the knowledge discovery process that use semantic web data. In comparison, our work is unique in that it looks at the applications of semantic models in DAS engineering from both a data analytics and a software engineering perspective.
Review method
As our goal is to provide a holistic view of how semantic modeling is used in the DAS engineering landscape, we conducted a systematic mapping study (SMS). By definition, an SMS provides an overview of a research topic by categorising the type of research and results that have been published, with the goal of answering a specific research question [23,41]. We followed the process proposed by Petersen et al. [41] to ensure the accuracy and quality of the outcome. We conducted an initial evidence search on five databases, two conference proceedings and one journal related to semantic web technologies. Findings were extended through the snowballing approach proposed by Wohlin [55].
Research questions
The primary focus of our SMS is to identify and understand how semantic modeling approaches are used to represent and communicate the knowledge of a data analyst, as well as how existing DAS engineering techniques leverage this knowledge. The review was conducted around a primary research question and two sub-questions, stated as follows:
RQ: What are the existing techniques that use semantic models in DAS engineering?
Sub-question 1: What type of concepts are modeled/used by these techniques?
Sub-question 2: What tasks related to DAS engineering are enabled using the identified concepts?
Search of relevant literature
We adapted the work used in [23,37,41,50] and identified the following strategy to construct the search strings:
Derive major terms used in the review questions
Search for synonyms and alternative words
Use the Boolean OR to incorporate alternative spellings and synonyms
Use the Boolean AND to link the major terms
To obtain a balance between sensitivity and specificity as highlighted by Petticrew and Roberts [42], we selected a search string that contains three significant terms related to the concepts: semantic technology, data analytics, and software engineering, connected by a Boolean AND operation. Each term contains a set of keywords related to the respective concept, connected by a Boolean OR operation.
The complete search string initially used for the searching of the literature was as follows:
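The original string is not reproduced here; purely as a hypothetical illustration of the AND-of-ORs structure described above (these keywords are illustrative, not the paper's actual terms), such a string has the shape:

```
("semantic web" OR "ontology" OR "semantic model" OR "linked data")
AND ("data analytics" OR "data mining" OR "knowledge discovery" OR "business intelligence")
AND ("software engineering" OR "software development" OR "service composition")
```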
The primary search process involved the use of 5 online databases: Web of Science, Scopus, ACM Digital Library, IEEE Xplore, and ProQuest. The selection of databases was based on our knowledge about those that index major publications related to computer science, software engineering and semantic technology. Based on the recommendations of domain experts, we expanded the search space to the collection of proceedings of International Semantic Web Conference and European Semantic Web Conference, their associated workshops, and the publications by the Semantic Web Journal accessible via DBLP search API.
Upon completion of the primary search phase, we further expanded the identification of relevant literature through snowballing. All the papers identified from the primary search phase were reviewed for relevancy. If a paper satisfied the selection criteria, we included it in the list of studies qualified for the synthesis.
Selection of studies
Below are the exclusion criteria we adapted from [21]:
Books and news articles
Papers where semantic modeling was not applied directly to DAS engineering
Vision papers
Papers not written in English
Application-specific research that does not generalise (such as text extraction and web search applications)
Infrastructure-related or performance-oriented research
Papers whose full text was not available for public access and not licensed by the University of New South Wales digital library
We did not restrict the search to a period of time and included all research available in the selected databases up to 27/06/2018.
Study quality assessment
We designed a quality checklist to measure the quality of the primary studies by reusing some of the questions proposed in the literature [42,52]. Our quality checklist comprised four general questions stated below:
Is the study related to DAS engineering?
Does the study leverage semantic models for information modeling?
Does the study provide sound evaluation?
Are the findings credible?
Initially, one author went through the title, abstract and keywords of search results and divided papers into three categories by relevancy: “Yes”, “No” and “Maybe”. Then the second author went through the full text of the papers under the “Maybe” category to identify whether they were compliant or not with our quality checklist.
Through the initial database search, we identified 1414 empirical studies as candidates. Among those results, 63 (4.46% of 1414 studies) were identified as relevant studies, based on the study quality assessment and exclusion criteria. The same steps were applied to the literature identified through snowballing at the second stage. We iterated through the references of 63 papers selected during the initial search and identified 19 additional relevant papers for our study.
To avoid the inclusion of duplicate studies which would inevitably bias the result of the synthesis, we thoroughly checked if very similar studies were published in more than one paper. In total, 82 studies were included in the synthesis of evidence.
Constructing classification schemas
Our study requires two classification schemas to answer sub-questions 1 and 2, i.e. what type of semantic concepts are modeled/used by the identified studies and what tasks related to DAS engineering are enabled using the identified concepts.
To construct each classification schema for our mapping study, we adapted the systematic process proposed in [41]. We created the classification schema in a top-down fashion, incorporating different classifications proposed and used in literature. We used the abstract, introduction and conclusion of the selected 82 studies and aligned the studies with categories identified in the literature. When necessary, the classification schema was extended with keywords and categories defined in the selected literature to provide clarity and granularity.
To answer sub-question 1, we distinguished four broad classes of concepts represented through ontologies in the identified studies, referred to as Domain, Analytics, Service and Intent. This classification was guided by the proposal of Nigro [38] to use three ontology types in data mining. The first two, “Domain Ontologies” and “Ontologies for Data Mining Process”, were included in our schema as the domain and analytics concepts respectively. The third one, called “Metadata Ontologies”, defines how variables are constructed. Because this definition is high-level and incomplete, we introduced two new concept types, Intent concepts and Service concepts, to capture knowledge that supports requirements management and implementation management within DAS engineering. Using the evidence of the identified literature, we extended this classification to a more granular level with different subtypes. Section 4.2 discusses the details of this classification.
When classifying different DAS-related tasks for sub-question 2, we found no unique definition in the literature of what constitutes a task in relation to DAS engineering. Fayyad et al. [17] propose a five-step process model for knowledge discovery – selection, preprocessing, transformation, data mining, and evaluation and interpretation. CRISP-DM, proposed by Chapman et al. [11], is more enterprise-oriented and breaks the life-cycle down into six steps: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. The identified studies do not all follow a specific model; some focus on high-level tasks such as domain understanding and process composition while others support more granular tasks such as model selection.
Fig. 1. Tasks associated with DAS engineering identified from the literature.
Hence, we combined key categories extracted from the identified 82 studies with the tasks proposed in the literature and came up with nine tasks under two categories, as illustrated in Fig. 1. We initially identified five data analytics related tasks by aligning the CRISP-DM process model with the tasks proposed in the 82 studies – Business Understanding, Data Extraction and Transformation, Model Selection, Model Building, and Result Presentation and Interpretation. We further identified four tasks from the studies that were rooted in the software engineering literature: Data Integration, Service Composition, Analytics Solution Validation and Code Generation. Software engineering related tasks are important because they contribute to enhancing the overall quality of the DAS implementation. Data analytics related tasks are conducted in the order shown by the arrow, while software engineering related tasks can support different data analytics related tasks without a particular order. Not every DAS engineering project needs to perform all of these tasks; the selection depends on the nature of the analysis (e.g. whether we need to choose amongst multiple competing data mining algorithms or use a specific one) and the context of development (e.g. whether we need to support automatic code generation to save software development cost).
During the data extraction phase, we read and sorted papers following the classification schemas and then reviewed them in detail. One author read and classified the 82 papers according to the two schemas, noting down the rationale for why each paper belongs to the selected category. The second author reviewed the table, discussed and resolved disagreements, and compiled the final mapping. The classification schemas evolved through this phase, resulting in the addition of new subcategories and splits in specific scenarios.
The next section describes the finalised mappings and their associated details.
Results
Primary question: What are the existing techniques that use semantic modeling for DAS engineering?
The Appendix contains the complete list of identified studies. RDF, RDFS and OWL are the encoding languages widely accepted by the semantic web community to represent ontologies. All the identified studies, except S35, are relying on this notation for their semantic model representation. S35 deviates from that common practice and uses the Predictive Model Markup Language (PMML) [20] and Background Knowledge Exchange Format (BKEF) [24] to represent knowledge associated with DASs.
By assessing the identified studies, we observed that these efforts vary in the application context they are addressing and in the way the analytics knowledge is modeled. Through the sub-questions in Sections 4.2 and 4.3, we explore the different semantic concepts and tasks used by these 82 identified studies according to the classification schemas mentioned in Section 3.6. Section 4.2 classifies different ontological concepts into four types and describes the characteristics of the knowledge they capture. In Section 4.3, we relate identified semantic concepts to their role in realising and facilitating various DAS engineering tasks.
Sub-question 1. What type of concepts are modeled/used by these techniques?
The mapping results according to the classification schema described in Section 3.6 are illustrated in Table 1. Out of 82 studies, the majority (54 studies) model domain concepts related to various application domains, and 11 of them reuse publicly available ontologies. Thirty of the studies capture and use analytics concepts and thirty capture service concepts. The smallest representation was in the intent category, with 14 studies. The following subsections present a detailed analysis.
Table 1. Classification of semantic concepts modeled and used in identified literature
Domain concepts
Domain concepts are context-specific, high-level concepts which represent domain-specific knowledge and objects. We identified 54 studies that rely on different subtypes of domain concepts to support DAS engineering.
The first subtype represents studies that construct custom domain ontologies to describe their application context and data. The second subtype represents studies that reused concepts from different publicly available ontologies.
Analytics concepts
Analytics concepts are closely aligned with knowledge that reflects different algorithms, computational models and the data-flow nature of the analytics process in terms of inputs, outputs, and their compatibility. Analytics concepts provide a vocabulary for defining and communicating analytics operations and related attributes. Further, they can help in describing dependency relationships between variables. These concepts are not coupled to a specific application domain or context.
Thirty studies leverage analytics concepts. We classified them into three subtypes according to their role in representing different analytics tasks or methods: concepts related to data preprocessing and integration activities, concepts that capture data analytics and mining techniques and concepts that represent the control and data flow nature of an analytics process.
Under the first subtype, studies model concepts related to data preprocessing and integration activities.
There are 25 studies under the subtype that captures data analytics and mining techniques.
Six studies focus on concepts that represent the control and data flow nature of an analytics process.
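The role such analytics concepts play in checking input/output compatibility can be sketched in plain Python (the vocabulary below is hypothetical; the surveyed studies encode this knowledge in OWL and evaluate it with reasoners rather than ad hoc code):

```python
# Minimal sketch (hypothetical names): analytics concepts as a small
# catalogue of algorithms with declared input requirements, used to check
# compatibility between a dataset and candidate techniques.
from dataclasses import dataclass

@dataclass(frozen=True)
class Algorithm:
    name: str
    task: str                 # e.g. "clustering", "classification"
    accepts: frozenset        # feature kinds the algorithm can process

@dataclass(frozen=True)
class Dataset:
    name: str
    feature_kinds: frozenset  # e.g. {"numeric", "categorical"}

CATALOGUE = [
    Algorithm("k-means", "clustering", frozenset({"numeric"})),
    Algorithm("decision-tree", "classification",
              frozenset({"numeric", "categorical"})),
]

def compatible(dataset, task):
    """Return names of algorithms for `task` whose accepted feature
    kinds cover every feature kind present in the dataset."""
    return [a.name for a in CATALOGUE
            if a.task == task and dataset.feature_kinds <= a.accepts]

print(compatible(Dataset("sensor-readings", frozenset({"numeric"})), "clustering"))
# -> ['k-means']
```

An ontology-based encoding of the same knowledge additionally supports inference, e.g. a new algorithm declared as a subclass of a clustering concept inherits its compatibility constraints.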
Service concepts
Service concepts capture knowledge related to DAS architectures and platforms. We identified 30 studies that model different aspects, namely web services, software APIs, data schemas, workflow design, knowledge related to provenance or data quality, deployment information and data sources. We classified service concepts into four subtypes: software component management, data management, composition management and implementation management (see Table 2).
Table 2. Classification of literature by their application to different DAS engineering tasks and concepts used
There are 10 studies under the software component management subtype.
The second subtype represents data management concepts.
The third subtype has concepts related to composition management.
The last subtype is related to the implementation management of DASs.
Intent concepts
Intent concepts capture knowledge concerning the data analyst’s requirements or goals. This knowledge can be in the form of low-level queries performed on data or high-level goals and intentions of users.
The first subtype represents concepts related to low-level queries performed on data. Under this subtype, user requests are captured as formal queries or rules over datasets.
Fig. 2. Trend of publications over time, related to different DAS engineering tasks.
Under the second subtype, studies model the high-level goals and intentions of users.
Sub-question 2. What tasks related to DAS engineering are enabled using the identified concepts?
In this section, we analyse the association of the 82 identified studies with the different tasks related to DAS engineering and how the semantic concepts discussed in Section 4.2 are used to realise these tasks. The classification schema for analytics tasks was described in Section 3.6.
Table 2 shows the mapping of 82 studies among 9 tasks and 4 types of concepts. One study can be focused on more than one task, using one or more concept types. Business understanding is the focus of 6 studies that leverage domain or intent concepts. Data extraction and transformation approaches that use domain, analytics or service concepts are proposed in 15 studies. 31 papers propose data integration approaches mostly using domain concepts, but some use analytics, service and intent concepts as well. Model selection (17 studies) and model building (15 studies) were conducted using domain, analytics or intent concepts. All four concept types were used to realise service composition (20 studies) and solution validation (8 studies). Code generation was supported by domain, analytics or service concepts in 4 studies. 9 studies that proposed approaches for result presentation and interpretation used one or two concept types out of four.
Different applications of those concepts are described in more detail in the rest of this section. Figure 2 illustrates the trend of publications related to each task. We observe a particular interest among researchers in applying semantic technology to support DAS engineering in 2014 and 2015.
Domain understanding
The domain understanding task focuses on analysing the domain and context of the problem and understanding available datasets, which helps to establish solid definitions and facilitates communication between different stakeholders. Moreover, the ontologies and concepts related to this task support inference, and the resulting knowledge has the flexibility to expand over time.
Five methods [S5, S12, S18, S49, S59] use domain concepts for domain understanding. The platforms proposed in S5 and S59 use these concepts to capture semantic and interpretive aspects of data, whereas S18 uses them to provide a standard specification of data for analysts. In contrast, S49 uses these concepts to model expert knowledge related to an analytics problem, which helps in understanding the constraints and expected behaviour. S12 proposes a feature-rich framework that can use custom-built domain concepts to understand the context through data browsing and visualisation.
Two methods use intent concepts for domain understanding. S39 uses a Goal Model to define requirements that can help in understanding and designing a data warehouse model. S28 extracts analytics requirements expressed in natural language through an ontology and proposes to refine them and identify specific data requirements by interviewing the stakeholders.
Data extraction and transformation
This task focuses on retrieving data from one or more sources and preparing it for the subsequent analysis. It involves transforming data into desired formats and annotating with additional metadata as well.
Eight studies apply domain concepts for this task. S30, S33 and S70 annotate streaming input data using domain concepts to make the data queryable when necessary. In S18, data transformation is assisted by standard models built using domain concepts. S2, S29 and S40 propose approaches for on-demand data extraction based on domain concepts that describe data sources. S34 focuses on the extraction of JSON data from web resources, generating semantic concepts around them and using those to convert data into ontology instances. S50 conducts data label correction using a domain ontology of medical entities. S59 enables analysts to understand datasets, and thereby supports data extraction, by querying knowledge represented in domain ontologies. S80 uses domain and analytics concepts together to access and transform static as well as streaming data.
Three studies use data source related service concepts for data extraction. In S10, these concepts are used to define the organisation of data and how to access it on demand. S38 uses a set of concepts that can model any data source for model-driven code generation for data extraction. S52 uses service concepts to identify implementations of required data transformations in workflows.
S8 uses both domain and service concepts (data source and implementation management) to extract data from heterogeneous sources such as event data streams and to map the extracted data into a database schema. S52 uses domain concepts and service concepts related to data processing to automatically generate new datasets on demand.
Data integration
Data integration implies combining heterogeneous datasets in order to obtain a high-level and coherent view of the data. Most studies use domain concepts to aid in this task. Some studies use intent concepts to conduct integration based on user needs. S9, S14, S46 and S71 are special cases that leverage analytics and service concepts to support the integration task.
We identified five different strategies in which intermixed concepts from different classes contribute to data integration. The first one, observed in S12, S13, S19, S39, S42, S46 and S47, transforms data from heterogeneous sources into instances of a global ontology. S12, S13, S39 and S42 conduct ETL processes incorporating such global ontologies to create semantic-aware data warehouses.
The second one, identified in S2, S21, S23, S44, S48 and S77, uses a global ontology that represents the user’s perspective and a set of local ontologies that represent the datasets. The data is then matched by aligning or converting the local ontologies into the global ontology. Users can refer to the global ontology to query the data.
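A rough, code-level caricature of this second strategy (the source and column names are hypothetical; real systems express the alignments as ontology mappings and answer queries over them):

```python
# Sketch of the global/local ontology strategy with plain dictionaries.
# Each local schema is aligned to a shared global vocabulary; records
# from any source are rewritten into that vocabulary before merging.

# Alignment of local column names to global terms (hypothetical).
ALIGNMENTS = {
    "hospital_a": {"pid": "patient_id", "age_years": "age"},
    "hospital_b": {"patient": "patient_id", "age": "age"},
}

def to_global(source, record):
    """Rewrite one local record into the global vocabulary,
    dropping columns with no alignment."""
    mapping = ALIGNMENTS[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

def integrate(sources):
    """Merge records from all sources under the global vocabulary."""
    return [to_global(name, rec)
            for name, recs in sources.items()
            for rec in recs]

rows = integrate({
    "hospital_a": [{"pid": 1, "age_years": 70}],
    "hospital_b": [{"patient": 2, "age": 65}],
})
print(rows)  # -> [{'patient_id': 1, 'age': 70}, {'patient_id': 2, 'age': 65}]
```

The ontology-based variants go further than this sketch: alignments can be inferred rather than hand-coded, and queries posed against the global vocabulary are rewritten automatically for each source.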
The third approach describes each dataset through a local ontology and achieves data integration by merging local ontologies together. S34 is an example where each local ontology is constructed by extracting data provided in JSON format, generating suitable semantic concepts and converting the extracted data into ontology instances using generated concepts.
The fourth method used in 10 studies, S10, S14, S29, S40, S45, S51, S59, S64, S70, S76, maintains linked meta-data about datasets from different data sources so that relevant data can be acquired at query time from multiple sources.
The fifth one, used in S9, S43, S49, S55, S65, S71 and S81, is a query or requirement driven approach for data integration where formal rules or program logic are used to represent user queries or analytics questions. Different datasets are mapped into those rules to derive answers. S9 is unique in that it captures analytics queries as intent concepts.
Model selection
Model selection is crucial for users who do not have intuitive knowledge about the performance of different models in different contexts or when many competing analytics models and techniques can be used for a single purpose. This task facilitates comparison of algorithms or makes recommendations of tools and models suitable for users.
Four studies use domain concepts for this purpose. S49 stores expert domain knowledge in a domain ontology and uses it to evaluate possible analytics models generated by association rule mining, incorporating an “interestingness” measure. S61 proposes a recommendation engine that maintains a repository of meta-data related to historical analytics solutions using domain concepts; when a user provides a new dataset, a matching solution is recommended. In S27, a suitable analytics technique for a particular dataset is identified by matching dataset features represented in domain concepts with the requirements of the analytics techniques captured through analytics concepts. S62 supports model selection by allowing users to define the context of the analytics problem via domain concepts and using analytics concepts to recommend appropriate analytics techniques.
There are 15 methods that apply data analytics and mining focused concepts for model selection. The simplest method, proposed in S17, uses an ontology to describe data analytics algorithms and creates a knowledge repository that provides a querying capability for users.
The concepts defined in S15, S27, S37, S41, S53, S60, S66, S75 and S78 assist users in matching analytics components that suit their goal and constraints. S53, S66 and S75 are unique among those studies as they propose intent concepts to capture user goals and requirements which are then matched with suitable models represented as analytics concepts.
In S22, analytics concepts are used to model data mining algorithms, which are then linked to web services, providing the means for service composition. S38 and S79 use analytics concepts to capture existing analytics workflows to enable novice users to search and learn from them.
Model building
Model building is a core data analytics task, where a selected model needs to be applied and customised for the problem at hand.
S58, S64, S70, S73, S77 and S82 use domain concepts to write analytic queries or event patterns executed on data to obtain descriptive analytics insights. S67 and S69 follow a similar approach, but use analytics concepts to support users in writing queries or rules.
S63 and S74 use domain concepts and analytics concepts to describe and customise the feature space for analytics model construction. S7 leverages domain concepts and intent concepts for model building. The generated analytics model consists of axioms based on formal competency questions to evaluate whether an enterprise model, represented through domain concepts, adheres to the compliance standards represented as intent concepts.
S56 uses analytics concepts to define analytics experiments in detail for supervised classification of propositional datasets. S54 proposes a case-based system for model selection and uses analytics concepts to adapt the solutions suggested by case-based reasoning to fit the user's interests. S3 applies analytics concepts together with intent concepts for analytics model generation in a MapReduce based analytics solution.
Result presentation and interpretation
Storing knowledge about different aspects of the analytics process in a semantic model inherently provides a degree of inference and interpretation capability that helps in result presentation. In addition, some studies propose methods that use semantic concepts explicitly to facilitate result presentation and interpretation.
Six studies use domain concepts for result presentation and interpretation. S49 uses domain concepts to store expert knowledge and uses it to validate data mining results and evaluate their interestingness. S57 uses domain concepts to interpret hypotheses and related attributes on statistical datasets. S61 uses domain concepts to annotate data tables via ontology alignment, enabling easy interpretation. S35 incorporates domain concepts with analytics concepts to generate reports on the conducted data analytics tasks. S76 generates metadata about numerical analysis using domain and analytics concepts. S48 uses domain concepts with intent concepts to extract metadata about OLAP operations and generate reports. Further, S48 proposes a method to automatically match the OLAP reports with other documents in a related repository.
When considering the use of service concepts for presenting results, S4, S25 and S26 capture knowledge about the different aspects of scientific workflows, especially aspects that can help describe and present the outputs/results.
In S4, intent concepts are used to capture initial goals of the analysts and annotate workflows with them, in order to identify different decisions that have led to the outcome and to explain the results from the perspective of the analyst.
Service composition
Service composition means identifying and assembling different analytics-related services to provide a complete or partial DAS, from data acquisition and extraction to result generation. Scientific workflow planning and service composition largely fall under this task.
Identifying a suitable service or tool to include in a DAS is a major activity within service composition. Some studies [S32, S36, S37, S40] use software component management concepts to guide component selection, but leave the responsibility of process composition to the analyst. S32 proposes a model to represent and recommend web services using pre-defined rules, based on a context expressed through domain concepts. S37 and S40 use service concepts to model a wide array of software components to be selected from, including pre-processing capabilities such as null value removal. Users can query the ontologies to identify suitable components. S36 proposes a methodology to facilitate the selection of suitable service implementations based on the input data. In S43 and S52, suitable service implementations for datasets are identified by matching characteristics represented through domain and service concepts.
In contrast, S30 and S31 select matching data sources for a software component according to its ability to extract and process related data. Data provision services are annotated using OWL-S and SSN ontology concepts so that users can query them and identify suitable services. Moreover, S30 uses quality-related service concepts to incorporate essential quality attributes that need to be considered in component selection.
Other studies extend service selection to facilitate composition planning and execution by incorporating different domain, service and analytics concepts.
S24 uses domain concepts that describe datasets, as well as data analytics and mining concepts, to support workflow composition in the Kepler tool by matching the analytics operations with data properties. S27 uses domain, analytics and service concepts to support the selection of suitable data sources, analytics techniques and software components respectively. S41 uses analytics concepts to represent components of the Weka analytics tool and links these components as a process. It does not provide executable workflows but recommends an analytics plan to be manually executed by the user. S1 proposes a similar approach for the MIT Lincoln Laboratory’s Composable Analytic Environment that includes executable workflow definitions. S20 uses service concepts for software component modeling and links components by modeling data transformation rules with analytics concepts. S62 uses control and data flow concepts to guide the knowledge discovery process and supports decision making at each stage using a knowledge base that encompasses domain, analytics and service concepts.
S78 uses analytics and workflow template concepts to generate analytics processes, with a significant focus on performance optimisation. S16 uses concepts related to workflow templates to store pre-composed analytics processes which can be queried by users in order to select a suitable implementation.
S53 captures analytics, service and intent concepts through a Knowledge Discovery Ontology and offers support for planning abstract analytics processes. S6 supports software composition through components modeled as generic APIs by matching their respective inputs and outputs. It assists users in planning and matching analytics components with a comprehensive goal-based planning algorithm that considers the input and output conditions. Similarly, S75 uses analytics, service and intent concepts to capture the user goals and KDD workflows implemented in RapidMiner. These concepts are used to identify an optimal analytics process for a new analytics problem based on Hierarchical Task Network planning.
S11 supports comprehensive scientific workflow composition with a graphical user interface based on the Taverna workflow engine, SADI/BioMoby plug-ins and SADI-compliant web services. It uses web service concepts to model components that implement data analytics algorithms and domain concepts to describe input and output data. Those concepts are used to recommend services that match analytics requirements or data constraints.
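The input/output-driven composition underlying planners such as the one in S6 can be sketched compactly. The component catalogue and the single-input/single-output assumption below are hypothetical simplifications; real planners reason over ontological descriptions of multiple typed inputs and outputs:

```python
# Hypothetical sketch of input/output-based service chaining, the idea
# behind goal-driven composition planners such as the one in S6.
# Each component declares the concept it consumes and the one it produces.
components = [
    {"name": "CsvLoader",  "input": "csv_file",    "output": "raw_table"},
    {"name": "Cleaner",    "input": "raw_table",   "output": "clean_table"},
    {"name": "Classifier", "input": "clean_table", "output": "predictions"},
]

def plan(start, goal, comps):
    """Chain components backwards from the goal concept until the
    available input concept is reached; returns names in execution order."""
    chain, current = [], goal
    while current != start:
        step = next((c for c in comps if c["output"] == current), None)
        if step is None:
            return None  # no component produces the needed concept
        chain.append(step["name"])
        current = step["input"]
    return list(reversed(chain))

print(plan("csv_file", "predictions", components))
# ['CsvLoader', 'Cleaner', 'Classifier']
```

Semantic concepts make this tractable: because inputs and outputs are described against a shared ontology rather than ad hoc type names, a planner can decide mechanically whether one component's output satisfies another's input.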
Analytics solution validation
Validation of analytics solutions involves capturing provenance data, validating solution workflows for service compatibility or data consistency, and confirming that the solution addresses the analyst’s goals. S61 uses a domain ontology to generate metadata about input data and output results, enabling provenance. S24 uses domain concepts that describe data, together with analytics concepts, to validate the structural and semantic correctness of a workflow before execution. S76 captures provenance data related to datasets, data sources and operations around numerical analysis through domain and analytics concepts. S25 uses provenance-related concepts to model workflow and data, and allows users to query them in order to validate workflows, identify defects or extract further information. S26 defines a Research Object concept as an instance of a scientific workflow, for provenance purposes. S75 proposes a solution validation approach that uses analytics, service and intent concepts to annotate data, operators, models, data mining tasks and KDD workflows as an extension to the RapidMiner tool. The Scientist’s Intent Ontology in S4 uses goal-focused intent concepts to describe user goals that are used for workflow validation. S72 uses domain and service concepts to capture the provenance of ETL workflows.
Code generation
Methods that support code generation rely on ontologies to convert abstract models into executable analytics software. Such methods can reduce the burden of software programming for data analysts. Code generation can be used to support multiple stages of an analytics process (workflow) execution.
Most techniques use service concepts (e.g. S3, S6 and S38) to drive a code generation process in a Model Driven Engineering (MDE) fashion. For example, data sources modeled as service concepts in S38 are used to generate data extraction software modules. In S6, service concepts modeling generic APIs are used to generate an executable analytics process. In S3, analytics concepts and service concepts related to implementation management contribute to capturing the implementation details of an analytics task. This enables a semi-automated code generation scheme for selected analytics techniques. The method proposed in S47 is the only one that uses domain concepts to model data sources, which are then used for generating code that extracts data from data sources into linked datasets.
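The MDE-style generation described above can be illustrated with a minimal sketch. The source description schema, field names and emitted query are hypothetical; methods such as S47 derive such descriptions from ontologies and target richer extraction code:

```python
# Minimal sketch of ontology-driven code generation: a data source
# described through domain/service concepts (here a plain dict, as a
# stand-in for an ontology instance) is turned into executable
# extraction logic, in the spirit of S47.
source_description = {
    "name": "sales_db",
    "type": "sql",
    "table": "orders",
    "fields": ["order_id", "amount", "region"],
}

def generate_extractor(desc):
    """Emit a SQL extraction query from the semantic description."""
    if desc["type"] != "sql":
        raise ValueError("only SQL sources are handled in this sketch")
    cols = ", ".join(desc["fields"])
    return f"SELECT {cols} FROM {desc['table']};"

print(generate_extractor(source_description))
# SELECT order_id, amount, region FROM orders;
```

The benefit mirrors that claimed in the surveyed studies: the analyst edits a declarative model of the data source, and the executable extraction code is regenerated rather than hand-written.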
Discussion
Limitations of existing work
This section summarises the limitations we identified by studying what type of semantic concepts were used in DASs and how those different concepts types were applied in different analytics tasks.
Limited usage of intent concepts
Although 14 studies propose intent concepts (see Table 2), we observe that only a few of them use intent concepts at stages of the analytics process other than data integration. The survey did not identify any studies that use intent concepts for data extraction and transformation, although this is a computationally expensive and time-consuming task that may waste resources if not performed aptly. Existing techniques use intent concepts mostly to facilitate the search for algorithms, data providers, web services and computational software modules. Hence, to a large extent, analytics requirements, such as which business decisions the analysis will support or what level of accuracy is required, remain part of the mental model of the developer or analyst who performs these tasks. In practice, several iterations of data cleansing, reformatting, model selection and process composition may be required to address the analytics problem at hand optimally. This can result in less effective DASs whose performance degrades over time. Furthermore, modifications to the process can only be made by someone with a sound understanding of the original analytics requirements. Moreover, as discussed by Canhoto [8], cognitive and context information, which can be captured through intent concepts, is crucial for accurate interpretation and validation of a DAS. We believe that further incorporating suitable intent concepts can enhance the efficiency and effectiveness of DAS engineering.
Lack of proper concept classification
Semantic concepts can be classified in many different ways. For example, S43 separates domain, analytics and service knowledge into three ontologies, and S79 uses a class hierarchy that separates data mining related entities into processes, information content and realisable entities. In some studies, this separation of concept types is not visible: a single ontology with a unique prefix may contain concepts from one or more categories without following modular approaches such as class hierarchies. For example, S37 models both analytics and service concepts in one ontology, and S3 models both analytics and intent concepts in the Task-methods ontology. S27 contains two ontologies (WekaOntology and ProtOntology) that cut across concepts of all classes without proper separation.
Little support for end-to-end development process
Although there is an array of research on adapting semantic models to individual development tasks such as data integration or model selection, only a few studies go beyond addressing one or two tasks in the DAS engineering lifecycle. In many cases, knowledge from earlier tasks would have been beneficial if carried over to subsequent tasks. Hence there is a lack of studies that propose semantic modeling based solutions supporting the end-to-end DAS engineering lifecycle.
We identified four studies that use semantic models for code generation (Section 4.3.9), related to data transformation and analytics process execution. These studies are limited to a specific domain or tool and do not provide sufficient flexibility to be applied to a broader class of DASs.
Recommendations for future research
Based on the findings of this study, we propose a set of recommendations for future research on applying semantic models to DAS engineering.
Developing intent concepts for analytics
As discussed in Section 5.1.1, the service composition techniques among the identified studies do not leverage intent concepts adequately. One reason could be that existing intent concepts are either too high-level (e.g. a business goal) or too low-level (e.g. a query). As the data analytics community extends into more industries and organisations, and as analytics contexts and requirements change rapidly, it is necessary to explore techniques that consider all dimensions, such as business requirements, contexts and constraints. Hence a potential research area is to study how high-level user goals and contexts can be represented and incorporated in DAS engineering through data integration, process construction and result interpretation. Initial work in this direction can be found in Bandara et al. [4].
Approaches that link user intentions and contexts into analytics models have the potential to turn the static analytics models deployed today into dynamic, adaptable models that respond to changes in user goals or the operational context.
Decoupling concept classes and encouraging concept reuse across development tasks
In Section 5.1.2 we argued that it is more effective to decouple different concept types into separate ontologies, as this leads to better, more modular knowledge management. Each concept type can then be reused or evolve independently of the others, enabling users to change the application domain, implementation, data source or analytics requirements without altering other models. The integration between these different knowledge areas has to be done separately within the DAS environment, taking the context into account as well. Some studies achieve concept integration through program logic or annotation schemes, but it would be useful to have standard, platform-independent ways of modeling the relationships between different types of analytics knowledge to match the context of a particular analytics process.
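The decoupling argued for above can be sketched as follows. All concept and relation names are hypothetical; in practice each module would be an ontology with its own namespace, and the alignment layer would consist of OWL axioms or mapping assertions rather than tuples:

```python
# Sketch of decoupled concept modules linked by a separate alignment
# layer (hypothetical names). Each module can evolve independently;
# only the alignment mappings change when, say, the application
# domain is swapped.
domain_concepts    = {"Patient", "BloodPressure"}
analytics_concepts = {"TimeSeries", "AnomalyDetection"}
service_concepts   = {"StreamProcessor"}

# Alignment layer: cross-module links kept outside the modules themselves.
alignments = [
    ("BloodPressure", "is_a", "TimeSeries"),
    ("AnomalyDetection", "implemented_by", "StreamProcessor"),
]

def related(concept, mappings):
    """Concepts reachable from `concept` through one alignment link."""
    return {target for source, _, target in mappings if source == concept}

print(related("BloodPressure", alignments))  # {'TimeSeries'}
```

Because the domain module never mentions analytics or service terms directly, replacing the medical domain with, for example, a manufacturing one requires rewriting only the domain module and its alignment entries.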
To promote the usage of semantic models among the research community and to enhance the value and reusability of the research, it is essential to advocate the reuse of ontologies. This enables the creation of a common vocabulary, and the resulting data/models become interoperable among a variety of systems. We observed that specific ontologies such as SSN [14] and the Gene Ontology [18] are used in multiple research studies, and there are ongoing efforts, such as the OBO Foundry, that promote the coordinated development and reuse of shared ontologies.
As knowledge represented through ontologies can enhance each task of the DAS engineering process, a standard framework for designing and extending ontologies that is usable in all analytics process stages is necessary. The ontologies should incorporate knowledge related to domain concepts and business goals as well as the concepts useful at the execution level. For example, an ontological representation of a data source may contain not only the information necessary to retrieve data, but also information about data quality, the latency of data acquisition, metadata that can be used to decide which algorithm is suitable to process the data (e.g. knowing whether the data is a time series can be used to reduce the set of algorithms to consider) and the relationships between the data and other concepts. Representing existing knowledge and enabling efficient reuse of accumulated knowledge and resources can reduce the effort spent on expert consultations or employee training.
Finally, our evidence reveals the opportunities of using semantic models for code generation in the light of model-driven engineering methods. This needs to be explored and experimented with further, as it has the potential of lifting the burden of software programming expertise from data analysts. There are already some examples of applying MDE to data analytics applications [44], such as creating Hadoop MapReduce analyses through conceptual models [10]. A promising finding is that the four identified studies related to code generation (Section 4.3.9) use semantic models that are well aligned with the four ontologies proposed by Pan et al. in their book Ontology-Driven Software Development [39]. In this book, the authors align a Requirement ontology (intent concepts) to a Computational Independent Model (CIM) and an Infrastructure Ontology (service concepts) to a Platform Specific Model (PSM). They propose to use a domain ontology for converting a CIM to a Platform Independent Model (PIM) and a business process ontology (analytics concepts) to convert a PIM to a PSM. Hence we believe that studying the use of ontologies to develop analytics solutions in a model-driven fashion, particularly adapting the framework proposed by Pan et al. [39], is timely and significant.
Semantic-based service orchestration plays a significant role in realising a semantic model driven analytics environment, as all operations, from data exporting and integration to model building, execution and result publication, can be implemented as independent service modules represented through semantic models. Yet there is a lack of applications that integrate the existing body of research on semantic-based service orchestration, such as [5,26,36], with semantic DAS engineering research. Such a combination can contribute to a paradigm of service-based DASs and establish the basis for semantic model-driven data analytics systems.
Limitations of the study
There are limitations to our study, mainly due to the literature selection process, including the selection of keywords and the construction of inclusion and exclusion criteria. Firstly, our study focuses on peer-reviewed publications in academic literature only, so grey literature such as technical reports, white papers and unpublished work was not included. Secondly, the study might miss some relevant work because the search string failed to match other relevant publications within the digital libraries; snowballing helped mitigate this limitation to a certain extent. These limitations are in line with our exclusion criteria, yet they pose a risk to the completeness and validity of the results.
One other limitation is that the selected databases may not contain all related literature, especially applications published in domain-specific venues such as medical journals. We attempted to reduce the impact of such limitations by using the Web of Science.
Although we conducted a systematic process to identify and map the literature, this paper may contain some outdated work or fail to reflect the most recent achievements in the discipline. Recent ontologies related to analytics and data modeling, such as OntoDT [40], were not observed in any identified work. This may be due to limitations of the search approach, and the authors believe future research will utilise existing analytics-related knowledge in DAS engineering.
As the goal of this study is to present a holistic overview of how semantic modeling has been used in engineering DASs, summarising two decades of research, it is beyond the scope of this paper to drill down into the specific characteristics of analytics solutions that support particular tasks or to conduct a thorough evaluation. We limit our contribution to a mapping study that researchers can use to examine specific aspects extensively in the future.
Conclusion
Capturing knowledge using models to drive the software development life cycle is at the heart of the software engineering discipline. Traditional models have severe limitations in the area of building DASs, which are characterised by the need to represent rich knowledge encompassing specialised application domains, complex computing infrastructures and changing analytics requirements. This has triggered our interest in the use of semantic modelling and ontologies as a way of underpinning new software development practices in this area. In this paper, we presented 82 studies, identified through a systematic mapping study, that leverage semantic modeling for engineering DASs. We adopted a broad approach, encompassing distinct research areas such as data mining and service computing.
The results of our study reveal the diversity of knowledge representation in existing studies. Through sub-question 1 we identified what types of semantic concepts are modelled and used in the literature; they fall under four main categories: domain, analytics, service and intent concepts. Through sub-question 2 we identified the different categories of analytics or software engineering related tasks mentioned in the identified literature. Different types of concepts were observed to play different roles in improving each task and supporting various stages of DAS engineering. Semantic modeling was widely used for tasks such as data integration, model selection, process composition and data extraction, which shows the ability of semantic models to represent heterogeneous resources. Studies that focus on model selection and process composition tasks highlight the capability of semantic models to provide end-user support for analytics solution engineering. We identified and discussed some limitations of existing work, such as the limited usage of intent concepts and the lack of end-to-end support for DAS engineering.
Recommended future work, discussed in Section 5.2, emphasises the importance of moving semantic technology out of specific research silos. Researchers should aim to develop new research agendas around capturing the high-level intents and goals of data analysts and translating them into executable analytics processes, incorporating a multitude of well-defined semantic knowledge repositories that can be developed, expanded and maintained independently of each other. This can be achieved within established software engineering frameworks, but they need to be tailored to the particular characteristics of the DAS engineering life-cycle as presented in [43]. As the next stage of this research effort, the authors are working on designing a requirement-driven platform that provides support for end-to-end analytics process engineering, incorporating the semantic concept types identified through this mapping study [4,5].
Acknowledgements
We are grateful to Capsifi, especially Dr. Terry Roach, for sponsoring the research which led to this paper.
