Abstract
Data Analytics Solution (DAS) engineering often involves multiple tasks, from data exploration to result presentation, which are applied in various contexts and on different datasets. Semantic modeling based on the open world assumption supports flexible modeling of linked knowledge. The objective of this paper is to review existing techniques that leverage semantic web technologies to tackle challenges such as heterogeneity and changing requirements in DAS engineering. We explore the application scope of those techniques, the different types of semantic concepts they use and the role these concepts play during the DAS development process. To gather evidence for the study we performed a systematic mapping study, identifying and reviewing 82 papers that incorporate semantic models in engineering DASs. One of the paper’s findings is that existing models can be classified into four types of knowledge spheres: domain knowledge, analytics knowledge, services and user intentions. Another finding shows how this knowledge is used in the literature to enhance different tasks within the analytics process. We conclude our study by discussing the limitations of the existing body of research, showcasing the potential of semantic modeling to enhance DASs and presenting the possibility of leveraging ontologies for effective end-to-end DAS engineering.
Introduction
The business intelligence and analytics fields are rapidly expanding across all industry sectors, and many organisations are trying to make analytics an integral part of everyday decision making [12,47]. These fields include the techniques, technologies, systems, practices, methodologies and applications that are concerned with analysing critical business data to help an enterprise better understand its business and market and make timely business decisions [12,33].
There is no universally accepted definition of the data analytics process. CRISP-DM [11] and the KD process proposed by Fayyad et al. [17] are two well-developed and popular examples. In the context of this paper, we identify a “data analytics process” (also called an “analytics pipeline”) as an end-to-end Data Analytics Solution (DAS) that captures tasks related to data mining, knowledge discovery or business intelligence. A DAS can itself be decomposed into multiple tasks such as identifying suitable datasets, developing and validating analytics models and interpreting final results. The software engineering aspect of a DAS, which we refer to as “DAS engineering”, involves designing, developing and deploying DASs and includes tasks such as requirement elicitation, data integration and process composition [43].
With the increasing popularity of big data as a research area, the focus of most research efforts has been on developing specific analysis techniques (e.g. machine learning algorithm design) rather than on supporting the overall DAS engineering [15]. Within many organisations, analysts with limited programming experience are often required to manually establish relationships between the software components of an analytics solution, such as software services used for computation, data elements and data mining algorithms [16,32]. According to the No-Free-Lunch theorem [31], DAS engineering becomes even more challenging: no single model works best for every problem, and depending on the application context and input data, analysts have to try different techniques before obtaining optimal results. Most organisations are looking for flexible solutions that align with their specific objectives and IT infrastructures [1], usually resulting in the use of a mix of data sources and software frameworks. Understanding and managing these heterogeneous technologies needs to be supported by a sound knowledge management infrastructure. In addition to incurring high software development costs, maintaining and evolving heterogeneous software infrastructures in the face of constant changes in both business requirements and technical specifications is very expensive [49].
Ontology is the formal foundation for semantic modeling. The primary roles of an ontology are to capture domain knowledge, to evaluate constraints over domain data and to guide domain model engineering [3]. It is a powerful tool for modeling and reasoning [1]. As ontologies can produce a sound representation of concepts and the relationships between concepts, they provide malleable models that are suitable for tracking various kinds of software development artifacts ranging from requirements to implementation code [39]. Although there have been many efforts in applying semantic modeling for DAS engineering, the overall picture of their capabilities is far from clear.
Hence, this study systematically explores research studies that focus on engineering applications that support DASs with the aid of semantic technology. Further, we identify unresolved challenges and potential research directions in the DAS engineering space. We follow the systematic mapping study process proposed by Petersen et al. [41], collect evidence from publications in five prominent databases and extend the evidence further by snowballing [55] relevant references of identified studies. We conduct our study around a main research question of identifying the existing techniques that use semantic models in DAS engineering, and two related sub-questions. We evaluate how semantic models represent different knowledge areas related to DASs, such as the mental models of the end-user, domain knowledge, the semantics of data, the applicability of analytics algorithms and tools for a particular task, and the compatibility between data and tools, and how this knowledge is leveraged for conducting tasks related to DAS engineering.
The rest of the paper is structured as follows. Section 2 presents the background related to this study. In Section 3 we explain the review method followed. Section 4 includes the results derived from the 82 identified studies, followed by a discussion in Section 5. Section 6 discusses the limitations of the study, and the paper concludes in Section 7.
Background
The literature emphasises the significance of knowledge management in different fields such as enterprise data analytics [53] and scientific workflows [51], and there have been many attempts at identifying knowledge specific to DASs. For example, the ADAGE framework [56] proposes an approach that leverages the capabilities of service-oriented architectures and scientific workflow management systems. The main idea is that the models used by analysts (i.e. workflow, service, and data models) contain concise information and instructions that can be viewed as an accurate record of the analytics process. These models can become useful artifacts for provenance tracking and can ensure reproducibility of such analytics processes. However, designing models that accurately represent the complex business contexts and expertise associated with a DAS remains a challenge. Development methods such as CRISP-DM for enterprise-level data mining [11] and domain-oriented data mining [54] advocate the necessity of using knowledge management techniques for capturing the business domain and understanding of data in order to build better analytics solutions.
There are multiple known knowledge representation approaches related to different aspects of DASs, such as UML diagrams [16,28] and Petri-nets [30]; these are syntactically oriented and limited in their ability to capture semantic details and reason over the knowledge [13]. The focus of this paper is on semantic knowledge representation with ontological models, a technology that originated from the Semantic Web concept [7], where ontologies are expressed through languages such as RDF, RDFS and OWL.
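As a minimal illustration of this notation (the class and property names below are hypothetical, not drawn from any surveyed study), a fragment of a DAS ontology in RDF/Turtle might declare an algorithm concept and relate it to the datasets it can process:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix ex:   <http://example.org/das#> .

# Hypothetical vocabulary: a dataset class, an algorithm class hierarchy
# and a property linking algorithms to the datasets they can process.
ex:Dataset   a owl:Class .
ex:Algorithm a owl:Class .
ex:ClusteringAlgorithm rdfs:subClassOf ex:Algorithm .

ex:applicableTo a owl:ObjectProperty ;
    rdfs:domain ex:Algorithm ;
    rdfs:range  ex:Dataset .
```

Because such statements are open-world and machine-interpretable, the vocabulary can be extended or linked to other ontologies without invalidating existing assertions.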
To our knowledge, no formal study has been conducted on how semantic modeling has contributed to DAS engineering, except the surveys conducted by Abello et al. [1] and Ristoski and Paulheim [46]. In particular, Abello et al. [1] studied the use of semantic web technologies for Exploratory OLAP, considering the data extraction and integration aspects, while Ristoski and Paulheim [46] surveyed, in 2016, the different stages of the knowledge discovery process that use semantic web data. In comparison, our work is unique in that it looks at the applications of semantic models in DAS engineering from both a data analytics and a software engineering perspective.
Review method
As our goal is to provide a holistic view of how semantic modeling is used in the DAS engineering landscape, we conducted a systematic mapping study (SMS). By definition, an SMS provides an overview of a research topic by categorising the type of research and results that have been published, with the goal of answering a specific research question [23,41]. We followed the process proposed by Petersen et al. [41] to ensure the accuracy and quality of the outcome. We conducted an initial evidence search on five databases, two conference proceedings and one journal related to semantic web technologies. Findings were extended through the snowballing approach proposed by Wohlin [55].
Research questions
The primary focus of our SMS is to identify and understand how semantic modeling approaches are used to represent and communicate the knowledge of a data analyst, as well as how existing DAS engineering techniques leverage this knowledge. The review was conducted around a primary research question and two sub-questions, stated as follows:
RQ: What are the existing techniques that use semantic models in DAS engineering?
Sub-question 1: What type of concepts are modeled/used by these techniques?
Sub-question 2: What tasks related to DAS engineering are enabled using the identified concepts?
Search of relevant literature
We adapted the work used in [23,37,41,50] and identified the following strategy to construct the search strings:
Derive major terms used in the review questions
Search for synonyms and alternative words
Use the Boolean OR to incorporate alternative spellings and synonyms
Use the Boolean AND to link the major terms
To obtain a balance between sensitivity and specificity as highlighted by Petticrew and Roberts [42], we selected a search string that contains three significant terms related to the concepts: semantic technology, data analytics, and software engineering, connected by a Boolean AND operation. Each term contains a set of keywords related to the respective concept, connected by a Boolean OR operation.
The complete search string initially used for the searching of the literature was as follows:
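The original string is not reproduced here; purely as a hypothetical illustration of the AND-of-ORs structure described above (these keywords are illustrative, not the paper's actual terms), such a string has the shape:

```
("semantic web" OR "ontology" OR "semantic model" OR "linked data")
AND ("data analytics" OR "data mining" OR "knowledge discovery" OR "business intelligence")
AND ("software engineering" OR "software development" OR "service composition")
```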
The primary search process involved the use of 5 online databases: Web of Science, Scopus, ACM Digital Library, IEEE Xplore, and ProQuest. The selection of databases was based on our knowledge about those that index major publications related to computer science, software engineering and semantic technology. Based on the recommendations of domain experts, we expanded the search space to the collection of proceedings of International Semantic Web Conference and European Semantic Web Conference, their associated workshops, and the publications by the Semantic Web Journal accessible via DBLP search API.
Upon completion of the primary search phase, we further expanded the identification of relevant literature through snowballing. All the papers identified from the primary search phase were reviewed for relevancy. If a paper satisfied the selection criteria, we included it in the list of studies qualified for the synthesis.
Selection of studies
Below are the exclusion criteria we adapted from [21]:
Books and news articles
Papers where semantic modeling was not applied directly to DAS engineering
Vision papers
Papers not written in English
Application-specific research that does not generalise (such as text extraction and web search applications)
Infrastructure-related or performance-oriented research
Papers whose full text was not available for public access and not licensed by the University of New South Wales digital library
We did not restrict the search to a period of time and included all research available in the selected databases up to 27/06/2018.
Study quality assessment
We designed a quality checklist to measure the quality of the primary studies by reusing some of the questions proposed in the literature [42,52]. Our quality checklist comprised four general questions stated below:
Is the study related to DAS engineering?
Does the study leverage semantic models for information modeling?
Does the study provide sound evaluation?
Are the findings credible?
Initially, one author went through the title, abstract and keywords of search results and divided papers into three categories by relevancy: “Yes”, “No” and “Maybe”. Then the second author went through the full text of the papers under the “Maybe” category to identify whether they were compliant or not with our quality checklist.
Through the initial database search, we identified 1414 empirical studies as candidates. Among those results, 63 (4.46% of 1414 studies) were identified as relevant studies, based on the study quality assessment and exclusion criteria. The same steps were applied to the literature identified through snowballing at the second stage. We iterated through the references of 63 papers selected during the initial search and identified 19 additional relevant papers for our study.
To avoid the inclusion of duplicate studies which would inevitably bias the result of the synthesis, we thoroughly checked if very similar studies were published in more than one paper. In total, 82 studies were included in the synthesis of evidence.
Constructing classification schemas
Our study requires two classification schemas to answer sub-questions 1 and 2, i.e. what type of semantic concepts are modeled/used by the identified studies and what tasks related to DAS engineering are enabled using the identified concepts.
To construct each classification schema for our mapping study, we adapted the systematic process proposed in [41]. We created the classification schema in a top-down fashion, incorporating different classifications proposed and used in literature. We used the abstract, introduction and conclusion of the selected 82 studies and aligned the studies with categories identified in the literature. When necessary, the classification schema was extended with keywords and categories defined in the selected literature to provide clarity and granularity.
To answer sub-question 1, we distinguished four broad classes of concepts represented through ontologies in the identified studies, referred to as Domain, Analytics, Service and Intent. This classification was guided by the proposal of Nigro [38] to use three ontology types in data mining. The first two, “Domain Ontologies” and “Ontologies for Data Mining Process”, were included in our schema as the domain and analytics concepts respectively. The third one, called “Metadata Ontologies”, defines how variables are constructed. Because this definition is high-level and incomplete, we introduced two new concept types, Intent concepts and Service concepts, to capture knowledge that supports requirements management and implementation management within DAS engineering. Using the evidence of the identified literature, we extended this classification to a more granular level with different subtypes. Section 4.2 discusses the details of this classification.
When classifying different DAS-related tasks for sub-question 2, we found no unique definition in the literature of what constitutes a task in relation to DAS engineering. Fayyad et al. [17] propose a five-step process model for knowledge discovery – selection, preprocessing, transformation, data mining, and evaluation and interpretation. CRISP-DM, proposed by Chapman et al. [11], is more enterprise-oriented and breaks the life-cycle down into six steps: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. The identified studies do not all follow a specific model; some focus on high-level tasks such as domain understanding and process composition while others support more granular tasks such as model selection.
Fig. 1. Tasks associated with DAS engineering identified from the literature.
Hence, we combined key categories extracted from the identified 82 studies with the tasks proposed in the literature and came up with nine tasks under two categories, as illustrated in Fig. 1. We initially identified five data analytics related tasks by aligning the CRISP-DM process model with the tasks proposed in the 82 studies – Business Understanding, Data Extraction and Transformation, Model Selection, Model Building, and Result Presentation and Interpretation. We further identified four tasks from the studies that were rooted in the software engineering literature: Data Integration, Service Composition, Analytics Solution Validation and Code Generation. Software engineering related tasks are important because they contribute to enhancing the overall quality of the DAS implementation. Data analytics related tasks are conducted in the order shown by the arrow, while software engineering related tasks can support different data analytics related tasks without a particular order. Not every DAS engineering project needs to perform all of these tasks; the selection depends on the nature of the analysis (e.g. whether we need to choose amongst multiple competing data mining algorithms or use a specific one) and the context of development (e.g. whether we need to support automatic code generation to save software development cost).
During the data extraction phase, we read and sorted papers following the classification schemas and then reviewed them in detail. One author read and classified the 82 papers according to the two schemas, noting down the rationale for why each paper belongs to the selected category. The second author reviewed the table, discussed and resolved disagreements, and compiled the final mapping. The classification schemas evolved through this phase, resulting in the addition of new subcategories and splits in specific scenarios.
The next section describes the finalised mappings and their associated details.
Results
Primary question: What are the existing techniques that use semantic modeling for DAS engineering?
The Appendix contains the complete list of identified studies. RDF, RDFS and OWL are the encoding languages widely accepted by the semantic web community to represent ontologies. All the identified studies, except S35, are relying on this notation for their semantic model representation. S35 deviates from that common practice and uses the Predictive Model Markup Language (PMML) [20] and Background Knowledge Exchange Format (BKEF) [24] to represent knowledge associated with DASs.
By assessing the identified studies, we observed that these efforts vary in the application context they are addressing and in the way the analytics knowledge is modeled. Through the sub-questions in Sections 4.2 and 4.3, we explore the different semantic concepts and tasks used by these 82 identified studies according to the classification schemas mentioned in Section 3.6. Section 4.2 classifies different ontological concepts into four types and describes the characteristics of the knowledge they capture. In Section 4.3, we relate identified semantic concepts to their role in realising and facilitating various DAS engineering tasks.
Sub-question 1. What type of concepts are modeled/used by these techniques?
The mapping results according to the classification schema described in Section 3.6 are illustrated in Table 1. Out of 82 studies, the majority (54 studies) model domain concepts related to various application domains, and 11 of them reuse publicly available ontologies. Thirty of the studies capture and use analytics concepts and thirty capture service concepts. The smallest representation was in the intent category, with 14 studies. The following subsections present a detailed analysis.
Table 1. Classification of semantic concepts modeled and used in identified literature
Domain concepts
Domain concepts are context-specific, high-level concepts which represent domain-specific knowledge and objects. We identified 54 studies that rely on different subtypes of domain concepts to support DAS engineering.
The first subtype represents studies that construct custom domain ontologies to describe their application context and data. The second subtype represents studies that reused concepts from different publicly available ontologies.
Analytics concepts
Analytics concepts are closely aligned with knowledge that reflects different algorithms, computational models and the data-flow nature of the analytics process in terms of inputs, outputs, and their compatibility. Analytics concepts provide a vocabulary for defining and communicating analytics operations and related attributes. Further, they can help in describing dependency relationships between variables. These concepts are not coupled to a specific application domain or context.
Thirty studies leverage analytics concepts. We classified them into three subtypes according to their role in representing different analytics tasks or methods: concepts related to data preprocessing and integration activities, concepts that capture data analytics and mining techniques and concepts that represent the control and data flow nature of an analytics process.
Under the first subtype, studies model concepts related to data preprocessing and integration activities.
There are 25 studies under the subtype that captures data analytics and mining techniques.
Six studies focus on concepts that represent the control and data flow nature of an analytics process.
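The role such analytics concepts play in checking input/output compatibility can be sketched in plain Python (the vocabulary below is hypothetical; the surveyed studies encode this knowledge in OWL and evaluate it with reasoners rather than ad hoc code):

```python
# Minimal sketch (hypothetical names): analytics concepts as a small
# catalogue of algorithms with declared input requirements, used to check
# compatibility between a dataset and candidate techniques.
from dataclasses import dataclass

@dataclass(frozen=True)
class Algorithm:
    name: str
    task: str                 # e.g. "clustering", "classification"
    accepts: frozenset        # feature kinds the algorithm can process

@dataclass(frozen=True)
class Dataset:
    name: str
    feature_kinds: frozenset  # e.g. {"numeric", "categorical"}

CATALOGUE = [
    Algorithm("k-means", "clustering", frozenset({"numeric"})),
    Algorithm("decision-tree", "classification",
              frozenset({"numeric", "categorical"})),
]

def compatible(dataset, task):
    """Return names of algorithms for `task` whose accepted feature
    kinds cover every feature kind present in the dataset."""
    return [a.name for a in CATALOGUE
            if a.task == task and dataset.feature_kinds <= a.accepts]

print(compatible(Dataset("sensor-readings", frozenset({"numeric"})), "clustering"))
# -> ['k-means']
```

An ontology-based encoding of the same knowledge additionally supports inference, e.g. a new algorithm declared as a subclass of a clustering concept inherits its compatibility constraints.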
Service concepts
Service concepts capture knowledge related to DAS architectures and platforms. We identified 30 studies that model different aspects, namely web services, software APIs, data schemas, workflow design, knowledge related to provenance or data quality, deployment information and data sources. We classified service concepts into four subtypes: software component management, data management, composition management and implementation management (see Table 2).
Table 2. Classification of literature by their application to different DAS engineering tasks and concepts used
There are 10 studies under the software component management subtype.
The second subtype represents data management concepts.
The third subtype has concepts related to composition management.
The last subtype is related to the implementation management of DASs.
Intent concepts
Intent concepts capture knowledge concerning the data analyst’s requirements or goals. This knowledge can be in the form of low-level queries performed on data or high-level goals and intentions of users.
The first subtype represents concepts related to low-level queries performed on data. Under this subtype, user requests are captured as formal queries or rules over datasets.
Fig. 2. Trend of publications over time, related to different DAS engineering tasks.
Under the second subtype, studies model the high-level goals and intentions of users.
Sub-question 2. What tasks related to DAS engineering are enabled using the identified concepts?
In this section, we analyse the association of the 82 identified studies with the different tasks related to DAS engineering and how the semantic concepts discussed in Section 4.2 are used to realise these tasks. The classification schema for analytics tasks was described in Section 3.6.
Table 2 shows the mapping of 82 studies among 9 tasks and 4 types of concepts. One study can be focused on more than one task, using one or more concept types. Business understanding is the focus of 6 studies that leverage domain or intent concepts. Data extraction and transformation approaches that use domain, analytics or service concepts are proposed in 15 studies. 31 papers propose data integration approaches mostly using domain concepts, but some use analytics, service and intent concepts as well. Model selection (17 studies) and model building (15 studies) were conducted using domain, analytics or intent concepts. All four concept types were used to realise service composition (20 studies) and solution validation (8 studies). Code generation was supported by domain, analytics or service concepts in 4 studies. 9 studies that proposed approaches for result presentation and interpretation used one or two concept types out of four.
Different applications of those concepts are described in more detail in the rest of this section. Figure 2 illustrates the trend of publications related to each task. We observe a particular interest among researchers in applying semantic technology to support DAS engineering in 2014 and 2015.
Domain understanding
The domain understanding task focuses on analysing the domain and context of the problem and understanding available datasets, which helps to establish solid definitions and facilitates communication between different stakeholders. Moreover, the ontologies and concepts related to this task support inference, and the resulting knowledge has the flexibility to expand over time.
Five methods [S5, S12, S18, S49, S59] use domain concepts for domain understanding. The platforms proposed in S5 and S59 use these concepts to capture semantic and interpretive aspects of data, whereas S18 uses them to provide a standard specification of data for analysts. In contrast, S49 uses these concepts to model expert knowledge related to an analytics problem, which helps in understanding the constraints and expected behaviour. S12 proposes a feature-rich framework that can use custom-built domain concepts to understand the context through data browsing and visualisation.
Two methods use intent concepts for domain understanding. S39 uses a Goal Model to define requirements that can help in understanding and designing a data warehouse model. S28 extracts analytics requirements expressed in natural language through an ontology and proposes to refine them and identify specific data requirements by interviewing the stakeholders.
Data extraction and transformation
This task focuses on retrieving data from one or more sources and preparing it for the subsequent analysis. It involves transforming data into desired formats and annotating with additional metadata as well.
Eight studies apply domain concepts for this task. S30, S33 and S70 annotate streaming input data using domain concepts to make the data queryable when necessary. In S18, data transformation is assisted by standard models built using domain concepts. S2, S29 and S40 propose approaches for on-demand data extraction based on domain concepts that describe data sources. S34 focuses on the extraction of JSON data from web resources, generating semantic concepts around them and using those to convert data into ontology instances. S50 conducts data label correction using a domain ontology of medical entities. S59 enables analysts to understand datasets, and thereby supports data extraction, by querying knowledge represented in domain ontologies. S80 uses domain and analytics concepts together to access and transform static as well as streaming data.
Three studies use data source related service concepts for data extraction. In S10, these concepts are used to define the organisation of data and how to access it on demand. S38 uses a set of concepts that can model any data source for model-driven code generation for data extraction. S52 uses service concepts to identify implementations of required data transformations in workflows.
S8 uses both domain and service concepts (data source and implementation management) to extract data from heterogeneous sources such as event data streams and to map the extracted data into a database schema. S52 uses domain concepts and service concepts related to data processing to automatically generate new datasets on demand.
Data integration
Data integration implies combining heterogeneous datasets in order to obtain a high-level and coherent view of the data. Most studies use domain concepts to aid in this task. Some studies use intent concepts to conduct integration based on user needs. S9, S14, S46 and S71 are special cases that leverage analytics and service concepts to support the integration task.
We identified five different strategies in which intermixed concepts from different classes contribute to data integration. The first one, observed in S12, S13, S19, S39, S42, S46 and S47, transforms data from heterogeneous sources into instances of a global ontology. S12, S13, S39 and S42 conduct ETL processes incorporating such global ontologies to create semantic-aware data warehouses.
The second one, identified in S2, S21, S23, S44, S48 and S77, uses a global ontology that represents the user’s perspective and a set of local ontologies that represent the datasets. The data is then matched by aligning or converting the local ontologies into the global ontology. Users can refer to the global ontology to query the data.
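A rough, code-level caricature of this second strategy (the source and column names are hypothetical; real systems express the alignments as ontology mappings and answer queries over them):

```python
# Sketch of the global/local ontology strategy with plain dictionaries.
# Each local schema is aligned to a shared global vocabulary; records
# from any source are rewritten into that vocabulary before merging.

# Alignment of local column names to global terms (hypothetical).
ALIGNMENTS = {
    "hospital_a": {"pid": "patient_id", "age_years": "age"},
    "hospital_b": {"patient": "patient_id", "age": "age"},
}

def to_global(source, record):
    """Rewrite one local record into the global vocabulary,
    dropping columns with no alignment."""
    mapping = ALIGNMENTS[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

def integrate(sources):
    """Merge records from all sources under the global vocabulary."""
    return [to_global(name, rec)
            for name, recs in sources.items()
            for rec in recs]

rows = integrate({
    "hospital_a": [{"pid": 1, "age_years": 70}],
    "hospital_b": [{"patient": 2, "age": 65}],
})
print(rows)  # -> [{'patient_id': 1, 'age': 70}, {'patient_id': 2, 'age': 65}]
```

The ontology-based variants go further than this sketch: alignments can be inferred rather than hand-coded, and queries posed against the global vocabulary are rewritten automatically for each source.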
The third approach describes each dataset through a local ontology and achieves data integration by merging local ontologies together. S34 is an example where each local ontology is constructed by extracting data provided in JSON format, generating suitable semantic concepts and converting the extracted data into ontology instances using generated concepts.
The fourth method used in 10 studies, S10, S14, S29, S40, S45, S51, S59, S64, S70, S76, maintains linked meta-data about datasets from different data sources so that relevant data can be acquired at query time from multiple sources.
The fifth one, used in S9, S43, S49, S55, S65, S71 and S81, is a query or requirement driven approach for data integration where formal rules or program logic are used to represent user queries or analytics questions. Different datasets are mapped into those rules to derive answers. S9 is unique in that it captures analytics queries as intent concepts.
Model selection
Model selection is crucial for users who do not have intuitive knowledge about the performance of different models in different contexts or when many competing analytics models and techniques can be used for a single purpose. This task facilitates comparison of algorithms or makes recommendations of tools and models suitable for users.
Four studies use domain concepts for this purpose. S49 stores expert domain knowledge in a domain ontology and uses it to evaluate possible analytics models generated by association rule mining, incorporating an “interestingness” measure. S61 proposes a recommendation engine that maintains a repository of meta-data related to historical analytics solutions using domain concepts; when a user provides a new dataset, a matching solution is recommended. In S27, a suitable analytics technique for a particular dataset is identified by matching dataset features represented in domain concepts with the requirements of the analytics techniques captured through analytics concepts. S62 supports model selection by allowing users to define the context of the analytics problem via domain concepts and using analytics concepts to recommend appropriate analytics techniques.
There are 15 methods that apply data analytics and mining focused concepts for model selection. The simplest method, proposed in S17, uses an ontology to describe data analytics algorithms and creates a knowledge repository that provides a querying capability for users.
The concepts defined in S15, S27, S37, S41, S53, S60, S66, S75 and S78 assist users in matching analytics components that suit their goal and constraints. S53, S66 and S75 are unique among those studies as they propose intent concepts to capture user goals and requirements which are then matched with suitable models represented as analytics concepts.
In S22, analytics concepts are used to model data mining algorithms, which are then linked to web services, providing the means for service composition. S38 and S79 use analytics concepts to capture existing analytics workflows to enable novice users to search and learn from them.
Model building
Model building is a core data analytics task, where a selected model needs to be applied and customised for the problem at hand.
S58, S64, S70, S73, S77 and S82 use domain concepts to write analytic queries or event patterns executed on data to obtain descriptive analytics insights. S67 and S69 follow a similar approach, but use analytics concepts to support users in writing queries or rules.
S63 and S74 use domain concepts and analytics concepts to describe and customise the feature space for analytics model construction. S7 leverages domain concepts and intent concepts for model building. The generated analytics model consists of axioms based on formal competency questions to evaluate whether an enterprise model, represented through domain concepts, adheres to the compliance standards represented as intent concepts.
S56 uses analytics concepts to define analytics experiments in detail for supervised classification of propositional datasets. S54 proposes a case-based system for model selection and uses analytics concepts to adapt the solutions suggested by case-based reasoning to fit the user's interests. S3 applies analytics concepts together with intent concepts for analytics model generation in a MapReduce based analytics solution.
Result presentation and interpretation
Storing knowledge about different aspects of the analytics process in a semantic model inherently provides a degree of inference and interpretation capability that helps in result presentation. In addition, some studies propose methods that use semantic concepts explicitly to facilitate result presentation and interpretation.
Six studies use domain concepts for result presentation and interpretation. S49 uses domain concepts to store expert knowledge and uses it to validate data mining results and evaluate their interestingness. S57 uses domain concepts to interpret hypotheses and related attributes on statistical datasets. S61 uses domain concepts to annotate data tables via ontology alignment, enabling easy interpretation. S35 incorporates domain concepts with analytics concepts to generate reports on the conducted data analytics tasks. S76 generates metadata about numerical analysis using domain and analytics concepts. S48 uses domain concepts with intent concepts to extract metadata about OLAP operations and generate reports. Further, S48 proposes a method to automatically match the OLAP reports with other documents in a related repository.
When considering the use of service concepts for presenting results, S4, S25 and S26 capture knowledge about the different aspects of scientific workflows, especially aspects that can help describe and present the outputs/results.
In S4, intent concepts are used to capture initial goals of the analysts and annotate workflows with them, in order to identify different decisions that have led to the outcome and to explain the results from the perspective of the analyst.
Service composition
Service composition means identifying and assembling different analytics-related services to provide a complete or partial DAS, from data acquisition and extraction to result generation. Scientific workflow planning and service composition largely fall under this task.
Identifying a suitable service or tool to include in a DAS is a major activity within service composition. Some studies [S32, S36, S37, S40] use software component management concepts to guide component selection, but leave the responsibility of process composition to the analyst. S32 proposes a model to represent and recommend web services using pre-defined rules, based on a context expressed through domain concepts. S37 and S40 use service concepts to model a wide array of software components to be selected from, including pre-processing capabilities such as null value removal. Users can query the ontologies to identify suitable components. S36 proposes a methodology to facilitate the selection of suitable service implementations based on the input data. In S43 and S52, suitable service implementations for datasets are identified by matching characteristics represented through domain and service concepts.
In contrast, S30 and S31 select matching data sources for a software component according to its ability to extract and process related data. Data provision services are annotated using OWL-S and SSN ontology concepts so that users can query them and identify suitable services. Moreover, S30 uses quality-related service concepts to incorporate essential quality attributes that need to be considered in component selection.
Other studies extend service selection to facilitate composition planning and execution by incorporating different domain, service and analytics concepts.
S24 uses domain concepts that describe datasets, as well as data analytics and mining concepts, to support workflow composition in the Kepler tool by matching the analytics operations with data properties. S27 uses domain, analytics and service concepts to support the selection of suitable data sources, analytics techniques and software components respectively. S41 uses analytics concepts to represent components of the Weka analytics tool and links these components as a process. It does not provide executable workflows but recommends an analytics plan to be manually executed by the user. S1 proposes a similar approach for the MIT Lincoln Laboratory’s Composable Analytic Environment that includes executable workflow definitions. S20 uses service concepts for software component modeling and links components by modeling data transformation rules with analytics concepts. S62 uses control and data flow concepts to guide the knowledge discovery process and supports decision making at each stage using a knowledge base that encompasses domain, analytics and service concepts.
S78 uses analytics and workflow template concepts to generate analytics processes, with a significant focus on performance optimisation. S16 uses concepts related to workflow templates to store pre-composed analytics processes which can be queried by users in order to select a suitable implementation.
S53 captures analytics, service and intent concepts through a Knowledge Discovery Ontology and offers support for planning abstract analytics processes. S6 supports software composition through components modeled as generic APIs by matching their respective inputs and outputs. It assists users in planning and matching analytics components with a comprehensive goal-based planning algorithm that considers the input and output conditions. Similarly, S75 uses analytics, service and intent concepts to capture the user goals and KDD workflows implemented in RapidMiner. These concepts are used to identify an optimal analytics process for a new analytics problem based on Hierarchical Task Network planning.
S11 supports comprehensive scientific workflow composition with a graphical user interface based on the Taverna workflow engine, SADI/BioMoby plug-ins and SADI-compliant web services. It uses web service concepts to model components that implement data analytics algorithms and domain concepts to describe input and output data. Those concepts are used to recommend services that match analytics requirements or data constraints.
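The input/output-driven composition underlying planners such as the one in S6 can be sketched compactly. The component catalogue and the single-input/single-output assumption below are hypothetical simplifications; real planners reason over ontological descriptions of multiple typed inputs and outputs:

```python
# Hypothetical sketch of input/output-based service chaining, the idea
# behind goal-driven composition planners such as the one in S6.
# Each component declares the concept it consumes and the one it produces.
components = [
    {"name": "CsvLoader",  "input": "csv_file",    "output": "raw_table"},
    {"name": "Cleaner",    "input": "raw_table",   "output": "clean_table"},
    {"name": "Classifier", "input": "clean_table", "output": "predictions"},
]

def plan(start, goal, comps):
    """Chain components backwards from the goal concept until the
    available input concept is reached; returns names in execution order."""
    chain, current = [], goal
    while current != start:
        step = next((c for c in comps if c["output"] == current), None)
        if step is None:
            return None  # no component produces the needed concept
        chain.append(step["name"])
        current = step["input"]
    return list(reversed(chain))

print(plan("csv_file", "predictions", components))
# ['CsvLoader', 'Cleaner', 'Classifier']
```

Semantic concepts make this tractable: because inputs and outputs are described against a shared ontology rather than ad hoc type names, a planner can decide mechanically whether one component's output satisfies another's input.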
Analytics solution validation
Validation of analytics solutions involves capturing provenance data, validating solution workflows for service compatibility or data consistency, and confirming that the solution addresses the analyst’s goals. S61 uses a domain ontology to generate metadata about input data and output results, enabling provenance. S24 uses domain concepts that describe data, together with analytics concepts, to validate the structural and semantic correctness of a workflow before execution. S76 captures provenance data related to datasets, data sources and operations around numerical analysis through domain and analytics concepts. S25 uses provenance-related concepts to model workflow and data, and allows users to query them in order to validate workflows, identify defects or extract further information. S26 defines a Research Object concept as an instance of a scientific workflow, for provenance purposes. S75 proposes a solution validation approach that uses analytics, service and intent concepts to annotate data, operators, models, data mining tasks and KDD workflows as an extension to the RapidMiner tool. The Scientist’s Intent Ontology in S4 uses goal-focused intent concepts to describe user goals that are used for workflow validation. S72 uses domain and service concepts to capture the provenance of ETL workflows.
Code generation
Methods that support code generation rely on ontologies to convert abstract models into executable analytics software. Such methods can reduce the burden of software programming for data analysts. Code generation can be used to support multiple stages of an analytics process (workflow) execution.
Most techniques use service concepts (e.g. S3, S6 and S38) to drive a code generation process in a Model Driven Engineering (MDE) fashion. For example, data sources modeled as service concepts in S38 are used to generate data extraction software modules. In S6, service concepts modeling generic APIs are used to generate an executable analytics process. In S3, analytics concepts and service concepts related to implementation management contribute to capturing the implementation details of an analytics task. This enables a semi-automated code generation scheme for selected analytics techniques. The method proposed in S47 is the only one that uses domain concepts to model data sources, which are then used for generating code that extracts data from data sources into linked datasets.
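The MDE-style generation described above can be illustrated with a minimal sketch. The source description schema, field names and emitted query are hypothetical; methods such as S47 derive such descriptions from ontologies and target richer extraction code:

```python
# Minimal sketch of ontology-driven code generation: a data source
# described through domain/service concepts (here a plain dict, as a
# stand-in for an ontology instance) is turned into executable
# extraction logic, in the spirit of S47.
source_description = {
    "name": "sales_db",
    "type": "sql",
    "table": "orders",
    "fields": ["order_id", "amount", "region"],
}

def generate_extractor(desc):
    """Emit a SQL extraction query from the semantic description."""
    if desc["type"] != "sql":
        raise ValueError("only SQL sources are handled in this sketch")
    cols = ", ".join(desc["fields"])
    return f"SELECT {cols} FROM {desc['table']};"

print(generate_extractor(source_description))
# SELECT order_id, amount, region FROM orders;
```

The benefit mirrors that claimed in the surveyed studies: the analyst edits a declarative model of the data source, and the executable extraction code is regenerated rather than hand-written.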
Discussion
Limitations of existing work
This section summarises the limitations we identified by studying what type of semantic concepts were used in DASs and how those different concepts types were applied in different analytics tasks.
Limited usage of intent concepts
Although 14 studies propose intent concepts (see Table 2), we observe that only a few of them use intent concepts at stages of the analytics process other than data integration. The survey did not identify any studies that use intent concepts for data extraction and transformation, although this is a computationally expensive and time-consuming task that may waste resources if not performed aptly. Existing techniques use intent concepts mostly to facilitate the search for algorithms, data providers, web services and computational software modules. Hence, to a large extent, analytics requirements, such as which business decisions the analysis will support or what level of accuracy is required, remain part of the mental model of the developer or analyst who performs these tasks. In practice, several iterations of data cleansing, reformatting, model selection and process composition may be required to address the analytics problem at hand optimally. This can result in less effective DASs whose performance degrades over time. Furthermore, modifications to the process can only be made by someone with a sound understanding of the original analytics requirements. Moreover, as discussed by Canhoto [8], cognitive and context information, which can be captured through intent concepts, is crucial for accurate interpretation and validation of a DAS. We believe that further incorporating suitable intent concepts can enhance the efficiency and effectiveness of DAS engineering.
Lack of proper concept classification
Semantic concepts can be classified in many different ways. For example, S43 separates domain, analytics and service knowledge into three ontologies, and S79 uses a class hierarchy that separates data mining related entities into processes, information content and realisable entities. In some studies, this separation of concept types is not visible: a single ontology with a unique prefix may contain concepts from one or more categories without following modular approaches such as class hierarchies. For example, S37 models both analytics and service concepts in one ontology, and S3 models both analytics and intent concepts in the Task-methods ontology. S27 contains two ontologies (WekaOntology and ProtOntology) that cut across concepts of all classes without proper separation.
Little support for end-to-end development process
Although there is an array of research on adapting semantic models to individual development tasks such as data integration or model selection, only a few studies go beyond addressing one or two tasks in the DAS engineering lifecycle. In many cases, knowledge from earlier tasks would have been beneficial if carried over to subsequent tasks. Hence there is a lack of studies that propose semantic modeling based solutions supporting the end-to-end DAS engineering lifecycle.
We identified four studies that use semantic models for code generation (Section 4.3.9), related to data transformation and analytics process execution. These studies are limited to a specific domain or tool and do not provide sufficient flexibility to be applied to a broader class of DASs.
Recommendations for future research
Based on the findings of this study, we propose a set of recommendations for future research on applying semantic models to DAS engineering.
Developing intent concepts for analytics
As discussed in Section 5.1.1, the service composition techniques among the identified studies do not leverage intent concepts adequately. One reason could be that existing intent concepts are either too high-level (e.g. a business goal) or too low-level (e.g. a query). As the data analytics community extends into more industries and organisations, and as analytics contexts and requirements change rapidly, it is necessary to explore techniques that consider all dimensions, such as business requirements, contexts and constraints. Hence a potential research area is to study how high-level user goals and contexts can be represented and incorporated in DAS engineering through data integration, process construction and result interpretation. Initial work in this direction can be found in Bandara et al. [4].
Approaches that link user intentions and contexts into analytics models have the potential to turn the static analytics models deployed today into dynamic, adaptable models that respond to changes in user goals or the operational context.
Decoupling concept classes and encouraging concept reuse across development tasks
In Section 5.1.2 we argued that it is more effective to decouple different concept types into separate ontologies, as this leads to better, more modular knowledge management. Each concept type can then be reused or evolve independently of the others, enabling users to change the application domain, implementation, data source or analytics requirements without altering other models. The integration between these different knowledge areas has to be done separately within the DAS environment, taking the context into account as well. Some studies achieve concept integration through program logic or annotation schemes, but it would be useful to have standard, platform-independent ways of modeling the relationships between different types of analytics knowledge to match the context of a particular analytics process.
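The decoupling argued for above can be sketched as follows. All concept and relation names are hypothetical; in practice each module would be an ontology with its own namespace, and the alignment layer would consist of OWL axioms or mapping assertions rather than tuples:

```python
# Sketch of decoupled concept modules linked by a separate alignment
# layer (hypothetical names). Each module can evolve independently;
# only the alignment mappings change when, say, the application
# domain is swapped.
domain_concepts    = {"Patient", "BloodPressure"}
analytics_concepts = {"TimeSeries", "AnomalyDetection"}
service_concepts   = {"StreamProcessor"}

# Alignment layer: cross-module links kept outside the modules themselves.
alignments = [
    ("BloodPressure", "is_a", "TimeSeries"),
    ("AnomalyDetection", "implemented_by", "StreamProcessor"),
]

def related(concept, mappings):
    """Concepts reachable from `concept` through one alignment link."""
    return {target for source, _, target in mappings if source == concept}

print(related("BloodPressure", alignments))  # {'TimeSeries'}
```

Because the domain module never mentions analytics or service terms directly, replacing the medical domain with, for example, a manufacturing one requires rewriting only the domain module and its alignment entries.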
To promote the usage of semantic models among the research community and to enhance the value and reusability of the research, it is essential to advocate the reuse of ontologies. This enables the creation of a common vocabulary, and the resulting data/models become interoperable among a variety of systems. We observed that specific ontologies such as SSN [14] and the Gene Ontology [18] are used in multiple research studies, and there are ongoing efforts, such as the OBO Foundry, that promote the coordinated development and reuse of shared ontologies.
As knowledge represented through ontologies can enhance each task of the DAS engineering process, a standard framework for designing and extending ontologies that is usable in all analytics process stages is necessary. The ontologies should incorporate knowledge related to domain concepts and business goals as well as the concepts useful at the execution level. For example, an ontological representation of a data source may contain not only the information necessary to retrieve data, but also information about data quality, the latency of data acquisition, metadata that can be used to decide which algorithm is suitable to process the data (e.g. knowing whether the data is a time series can be used to reduce the set of algorithms to consider) and the relationships between the data and other concepts. Representing existing knowledge and enabling efficient reuse of accumulated knowledge and resources can reduce the effort spent on expert consultations or employee training.
Finally, our evidence reveals the opportunities of using semantic models for code generation in the light of model-driven engineering methods. This needs to be explored and experimented with further, as it has the potential of lifting the burden of software programming expertise from data analysts. There are already some examples of applying MDE to data analytics applications [44], such as creating Hadoop MapReduce analyses through conceptual models [10]. A promising finding is that the four identified studies related to code generation (Section 4.3.9) use semantic models that are well aligned with the four ontologies proposed by Pan et al. in their book Ontology-Driven Software Development [39]. In this book, the authors align a Requirement ontology (intent concepts) to a Computational Independent Model (CIM) and an Infrastructure Ontology (service concepts) to a Platform Specific Model (PSM). They propose to use a domain ontology for converting a CIM to a Platform Independent Model (PIM) and a business process ontology (analytics concepts) to convert a PIM to a PSM. Hence we believe that studying the use of ontologies to develop analytics solutions in a model-driven fashion, particularly adapting the framework proposed by Pan et al. [39], is timely and significant.
Semantic-based service orchestration plays a significant role in realising a semantic model driven analytics environment, as all operations, from data exporting and integration to model building, execution and result publication, can be implemented as independent service modules represented through semantic models. Yet there is a lack of applications that integrate the existing body of research on semantic-based service orchestration, such as [5,26,36], with semantic DAS engineering research. Such a combination can contribute to a paradigm of service-based DASs and establish the basis for semantic model-driven data analytics systems.
Limitations of the study
There are limitations to our study, mainly due to the literature selection process, including the selection of keywords and the construction of inclusion and exclusion criteria. Firstly, our study focuses on peer-reviewed publications in academic literature only, so grey literature such as technical reports, white papers and unpublished work was not included. Secondly, the study might miss some relevant work because the search string failed to match other relevant publications within the digital libraries; snowballing helped mitigate this limitation to a certain extent. These limitations are in line with our exclusion criteria, yet they pose a risk to the completeness and validity of the results.
One other limitation is that the selected databases may not contain all related literature, especially applications published in domain-specific venues such as medical journals. We attempted to reduce the impact of such limitations by using the Web of Science.
Although we conducted a systematic process to identify and map the literature, this paper may contain some outdated work or fail to reflect the most recent achievements in the discipline. Recent ontologies related to analytics and data modeling, such as OntoDT [40], were not observed in any identified work. This may be due to limitations of the search approach, and the authors believe future research will utilise existing analytics-related knowledge in DAS engineering.
As the goal of this study is to present a holistic overview of how semantic modeling has been used in engineering DASs, summarising two decades of research, it is beyond the scope of this paper to drill down into the specific characteristics of analytics solutions that support particular tasks or to conduct a thorough evaluation. We limit our contribution to a mapping study that researchers can use to examine specific aspects extensively in the future.
Conclusion
Capturing knowledge using models to drive the software development life cycle is at the heart of the software engineering discipline. Traditional models have severe limitations in the area of building DASs, which are characterised by the need to represent rich knowledge encompassing specialised application domains, complex computing infrastructures and changing analytics requirements. This has triggered our interest in the use of semantic modelling and ontologies as a way of underpinning new software development practices in this area. In this paper, we presented 82 studies, identified through a systematic mapping study, that leverage semantic modeling for engineering DASs. We adopted a broad approach, encompassing distinct research areas such as data mining and service computing.
The results of our study reveal the diversity of knowledge representation in existing studies. Through sub-question 1 we identified what types of semantic concepts are modelled and used in the literature; they fall under four main categories: domain, analytics, service and intent concepts. Through sub-question 2 we identified the different categories of analytics or software engineering related tasks mentioned in the identified literature. Different types of concepts were observed to play different roles in improving each task and supporting various stages of DAS engineering. Semantic modeling was widely used for tasks such as data integration, model selection, process composition and data extraction, which shows the ability of semantic models to represent heterogeneous resources. Studies that focus on model selection and process composition tasks highlight the capability of semantic models to provide end-user support for analytics solution engineering. We identified and discussed some limitations of existing work, such as the limited usage of intent concepts and the lack of end-to-end support for DAS engineering.
Recommended future work, discussed in Section 5.2, emphasises the importance of moving semantic technology out of specific research silos. Researchers should aim to develop new research agendas around capturing the high-level intents and goals of data analysts and translating them into executable analytics processes, incorporating a multitude of well-defined semantic knowledge repositories that can be developed, expanded and maintained independently of each other. This can be achieved within established software engineering frameworks, but they need to be tailored to the particular characteristics of the DAS engineering life-cycle as presented in [43]. As the next stage of this research effort, the authors are working on designing a requirement-driven platform that provides support for end-to-end analytics process engineering, incorporating the semantic concept types identified through this mapping study [4,5].
Acknowledgements
We are grateful to Capsifi, especially Dr. Terry Roach, for sponsoring the research which led to this paper.
