Abstract
RDF validation is a field where the Semantic Web community is currently focusing attention. In addition, there is a recent trend to migrate data from different sources to Semantic Web formats. Therefore, in order to facilitate this transformation, we propose: a set of mappings that can be used to convert from XML Schema to Shape Expressions (ShEx), a prototype that implements a subset of the proposed mappings, an example application to obtain a ShEx schema from an XML Schema and a discussion on the conversion implications of non-deterministic schemata. We demonstrate that an XML document and its corresponding XML Schema are still valid when converted to their RDF and ShEx counterparts. This conversion, along with the development of other format mappings, could lead to an improvement of data interoperability by reducing the technological gap.
Introduction
Data validation is a key area when normalisation and confidence are desired. Normalisation – which can be defined, in this context, as using a homogeneous schema or structure across different sources of similar information – is desired as a way of making a dataset more reliable and even more useful to possible consumers because of its standardised schema. Validation can improve data cleansing, querying and standardisation of datasets. In the words of P.N. Fox et al. [16]: “
XML Schema [5] was designed as a language to make XML validation possible with more expressiveness than DTDs [4]. Using XML Schema developers can define the structure, constraints and documentation of an XML vocabulary. Besides DTD and XML Schema, other alternatives for XML validation (such as Relax NG [11] and Schematron [17]) were proposed.
In the Semantic Web, RDF was missing a standard constraints validation language which covers the same features that XML Schema does for XML. Some alternatives were OWL [40] and RDF Schema [10]; however, they do not cover completely what XML Schema does for XML [36]. For this purpose, Shape Expressions (ShEx) [30,31] was proposed to fulfill the requirement of a constraints validation language for RDF, and SHACL [19] (another proposed language for RDF validation) has recently become a W3C recommendation.
As many documents and data are persisted in XML, the need for migration and interoperability with more flexible data formats is more pressing than ever. Many authors have proposed conversions from XML to RDF [2,6,13,26], with the goal of transforming XML data to Semantic Web formats.
Although these conversions enable users to migrate their data to the Semantic Web, means for validating the output data after converting XML to RDF are missing. Therefore, we should ensure that the conversion has been done correctly and that both versions – in different languages – define the same meaning.
Conversions between XML and RDF, and between XML Schema and ShEx, are necessary to narrow the gap between semantic technologies and more traditional ones (e.g., XML, JSON, CSV, relational databases). With that in mind, providing generic transformation tools from non-semantic technologies to semantic technologies can enhance the migration possibilities; in other words, if we can create tools that ease the transformation and adaptation among technologies, we will encourage future migrations. Taking the Text Encoding Initiative (TEI) [37] as an example, digital humanities can benefit from Semantic Web approaches [33,35]. There are many manuscripts transcribed to XML – using TEI – that can be converted to RDF. However, transcribers are hesitant to deal with the underlying technology although they can benefit from it [25]. Those are the cases where generic approaches, such as the one introduced here, can offer a solution and where automatic conversion of schemata has its place when transformations are to be checked.
Taking into account what we previously presented, the questions that we want to address in the present work are the following:
RQ1: What components should have a mapping from XML Schema to ShEx?
RQ2: How to ensure that both schemata are equivalent?
RQ3: Is it possible to ensure a backwards conversion in all cases?
RQ4: Are non-deterministic schemata (i.e., ambiguous schemata) possible to translate and validate?
In this paper, we describe how to perform the conversion from XML Schema to ShEx and how each element in XML Schema can be translated into ShEx. Moreover, we present a prototype that can convert a subset of what is defined in the following sections.
The rest of the paper is structured as follows: Section 2 presents the background; Section 3 gives a brief introduction to ShEx; Section 4 describes a possible set of mappings between XML Schema and ShEx; Section 5 presents a prototype used to validate a subset of the previously presented mappings and how this conversion works against existing RDF validators; Section 6 discusses the implications of non-deterministic schemata on our work. Finally, Section 7 draws some conclusions and outlines future lines of work and improvement.
Background
The related work on XML ecosystem conversion can be divided into three main categories: conversions from XML to Semantic Web formats, conversions from XML schemata to non-Semantic Web schemata and conversions from XML schemata to RDF schemata.
From XML to Semantic Web formats
Along with schemata conversions, data transformation has to be tackled. Therefore many authors have worked on this topic of converting from XML to Semantic Web formats and more specifically to RDF. For these conversions there are plenty of strategies that have been proposed and followed by other authors.
In [26], the authors describe their experience developing this transformation for the business-to-business industry in the case of the Semantic Mediation tools. An XML Schema to RDF Schema transformation is performed as part of the requirements of the Semantic Mediation tool.
In [13], a transformation between XML and RDF depending on an ontology is described. This transformation takes an XML document, a mapping document and an ontology document and makes the transformations to RDF instances compliant with the input ontology. Using the mapping file, conversions between the XML Schema and the ontology are established.
In [1], the author explains how XML can be converted to RDF – and vice versa – using XML Schema as the base for the mappings. This work is then expanded in [2] where the author tries to solve the lift problem (the problem of how to map heterogeneous data sources in the same representational framework) from XML to RDF and backwards by using the Gloze mapping approach on top of Apache Jena.
In [39], the authors present a mechanism to query XML data as RDF. Firstly, a matching from XML Schema to RDF Schema class hierarchy is performed. Then XML elements can be interpreted as RDF triples. The same procedure but using DTDs is described in [38].
In [9], the author presents a technique for making standard transformations between XML and RDF using XSLT. A case study in the field of astronomy is used to illustrate the solution.
Another approach using XSLT is [34] where authors describe a mapping mechanism using XSLT that can be attached to schemata definition.
In [3], a transformation from RDF to other kinds of formats, including XML, is proposed using SPARQL embedded in XSLT stylesheets, which, by means of these extensions, can query, merge and transform data from the Semantic Web.
In [6], the authors describe XSPARQL, a framework that enables the transformation between XML and RDF based on XQuery and SPARQL and addresses the disadvantages of using XSLT for these transformations.
However, these works (except [26]) do not cover the schemata mapping problem.
From XML schemata to other schemata
Although data migration is important, during this process it is desirable to transform the constraint rules or schemas too. This is also a way to verify that the transformations have been done correctly. Therefore, many authors have proposed different techniques and transformations from XML Schema.
In [28], a transformation from XML Schema to JSON Schema is proposed. These transformations are made using equivalent constraints when it is possible and concrete transformations when no equivalent constraints exist.
In [12], an algorithm that converts from XML Schemata to ER diagrams is proposed. This algorithm (called Xere mapping) is proposed as a part of the Xere technique to assist the integration of XML data.
In [23], the authors propose an algorithm to convert from a relational schema to an XML Schema and two algorithms to convert from an XML Schema to a relational schema. All these techniques preserve the structure and the semantics.
However, none of these works brings XML schemata to Semantic Web technologies.
From XML schemata to RDF schemata
In the Semantic Web community there has been an effort to convert XML schemata to OWL [15,32] and to RDF Schema [26]. Moreover, when no schema is available the transformation can be performed from XML to OWL [7,20,22,29].
However, RDF Schema and OWL were not designed as RDF validation languages. Their use of the Open World and Non-Unique Name Assumptions can make it difficult to define the integrity constraints that RDF validation languages require [36].
FHIR approach
Another approach for transformation between schemas is to take a domain model as the main representation of data structure and constraints and then transform between that model and other schema formats like XML Schema, JSON Schema or ShEx. This has been the approach followed by FHIR.
Various languages have recently been developed for RDF validation. Shapes Constraint Language (SHACL) [19] has been developed by the W3C Data Shapes Working Group and Shape Expressions (ShEx) [31] is being developed by the W3C Shape Expressions Community Group.
To the best of our knowledge, no conversion between XML Schema and ShEx/SHACL has been proposed to date. This might be due to the recent introduction of ShEx and SHACL.
In this paper, ShEx is used to describe the mappings due to its compact syntax and its support for recursion, whereas in SHACL recursion depends on the implementation. However, we consider that converting the mappings proposed in this paper to SHACL is feasible and can be an interesting line of future work, given that SHACL has already been accepted as a W3C recommendation and that there are some ways to simulate recursion by target declarations or property paths.
Brief introduction to ShEx
ShEx was proposed as a language for RDF validation in 2014 [31]. It was one of the foundations for the W3C Data Shapes Working Group which developed the Shapes Constraint Language (SHACL) for the same purpose. SHACL was also inspired by SPIN [18] and although both languages can perform RDF validation there are some differences between them like the support of recursion or the emphasis on validation versus constraint checking (see chapter 7 of [21] for more details). In this paper, we will focus on ShEx because it has a well-defined semantics for recursion [8] and its semantics are more inspired by grammar-based formalisms like Relax NG.
ShEx syntax was inspired by Turtle, SPARQL and Relax NG with the aim to offer a concise and easy to use syntax. In July 2017, version 2.0 was released together with a draft community group report and the community group is currently developing version 2.1.
ShEx uses shapes to group different validations associated with the same node ‘type’. That is, a shape can define how a node and its triples should be in order to be valid. Listing 1 illustrates an example of a ShEx document defining a shape with a

ShEx shape example.
Prefixes are defined at the beginning of the snippet and use the same syntax as in Turtle. Triple constraints are defined inside the shape where a purchase order must have an
The

RDF validation example.
In Listing 2 there is an example of two purchase orders defined in RDF. The first one passes validation and conforms to the shapes declaration given in Listing 1 whereas
ShEx supports different serialization formats:
ShExC: a concise, human-readable compact syntax, which is the one presented in the previous example.
ShExJ: a JSON-LD syntax which is used as an abstract syntax in the ShEx specification [30].
ShExR: an RDF representation syntax based on ShExJ.
ShEx defines an extension mechanism through which users can embed portions of code written in a programming language or SPARQL. This feature is known as Semantic Actions, which are introduced between the definitions of triples with the
In this paper, ShExC syntax was used because it is easy to read and understand. The goal of this introduction was to provide a basic understanding of ShEx. For more examples and a longer comparison between ShEx and SHACL readers can consult [21].
XML Schema defines a set of elements and datatypes for validation that need to be converted to ShEx. In this section, we describe different XML Schema elements and a possible conversion to ShEx. All examples use the default prefix
Element
Elements are treated as a triple predicate and object, i.e., we convert them to a triple constraint whose predicate is the name of the element:

Element mapping.
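As an illustration of the element mapping described above, the following Python sketch renders a single `xs:element` declaration as a ShExC triple constraint. It is not the prototype's code: the function name `element_to_triple_constraint`, the assumption that the XML Schema namespace is bound to the `xs` prefix, and the choice to render user-defined types as shape references of the form `@<TypeName>` are all ours and only serve to make the idea concrete.

```python
import xml.etree.ElementTree as ET

XS_NS = "http://www.w3.org/2001/XMLSchema"


def element_to_triple_constraint(element_xml: str, xs_prefix: str = "xs") -> str:
    """Render one xs:element declaration as a ShExC triple constraint.

    Hypothetical helper: the element name becomes the predicate (in the
    default prefix) and the declared type becomes the node constraint.
    """
    element = ET.fromstring(element_xml)
    if element.tag != f"{{{XS_NS}}}element":
        raise ValueError("expected an xs:element declaration")
    name = element.get("name")
    type_name = element.get("type")
    if name is None or type_name is None:
        raise ValueError("this sketch needs both 'name' and 'type' attributes")
    prefix, _, local = type_name.rpartition(":")
    if prefix == xs_prefix:
        # Built-in XSD datatype: reuse it directly as the node constraint.
        node_constraint = f"xsd:{local}"
    else:
        # Assumption: any other type refers to a shape generated elsewhere.
        node_constraint = f"@<{local}>"
    return f":{name} {node_constraint} ;"


if __name__ == "__main__":
    declaration = (
        '<xs:element xmlns:xs="http://www.w3.org/2001/XMLSchema" '
        'name="birthday" type="xs:date"/>'
    )
    print(element_to_triple_constraint(declaration))
```

The element name `birthday` in the usage example is hypothetical; nested (anonymous) types are deliberately left out of this sketch.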
The

Element mapping with linked type.

Element mapping with nested type.
As presented in Listing 5, when an element has its complex type nested the shape name will be the
Cardinality in ShEx is defined with the following symbols: ‘*’ for 0 or more repetitions, ‘+’ for 1 or more repetitions, ‘?’ for 0 or 1 repetitions (optional element) or ‘{m, n}’ for m to n repetitions where m is

Cardinality mapping.
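The cardinality symbols listed above can be derived mechanically from the `minOccurs` and `maxOccurs` attributes of XML Schema. The following Python sketch shows one possible way to do it; the function name `shex_cardinality` is hypothetical, and the sketch assumes the XML Schema defaults (both attributes default to 1) and uses the open-ended ShExC form `{m,}` when `maxOccurs` is `unbounded` and `minOccurs` is greater than 1.

```python
def shex_cardinality(min_occurs=1, max_occurs=1):
    """Return the ShExC cardinality suffix for an XML Schema occurrence range.

    max_occurs is either an integer or the string "unbounded".
    An empty string means the ShEx default of exactly one occurrence.
    """
    unbounded = max_occurs == "unbounded"
    if min_occurs < 0 or (not unbounded and max_occurs < min_occurs):
        raise ValueError("invalid occurrence range")
    if unbounded:
        if min_occurs == 0:
            return "*"
        if min_occurs == 1:
            return "+"
        return f"{{{min_occurs},}}"  # at least m repetitions
    if (min_occurs, max_occurs) == (1, 1):
        return ""
    if (min_occurs, max_occurs) == (0, 1):
        return "?"
    if min_occurs == max_occurs:
        return f"{{{min_occurs}}}"  # exactly m repetitions
    return f"{{{min_occurs},{max_occurs}}}"


if __name__ == "__main__":
    for occurs in [(1, 1), (0, 1), (0, "unbounded"), (1, "unbounded"), (2, 5)]:
        print(occurs, repr(shex_cardinality(*occurs)))
```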
ShEx treats attributes like elements because it makes no difference between an attribute and an element. This difference is part of XML data model whereas the RDF data model does not have the concept of attributes. One possibility to transform attributes is to use their
ComplexType
Complex types are translated directly to ShEx shapes. The

Complex type mapping.
While sequences in XML Schema define a sequential order of elements, representing the same model in ShEx is complex due to the RDF graph structure. There are several ways to represent order in RDF; the most obvious one is to use RDF lists (see [14,24] for other ways to represent it).
The example in Listing 8 shows how the mapping is done for an

Sequence mapping.
Choices in XML Schema are the disjunction operator to select between two options, for instance: choice between two elements. This operator is supported in ShEx using the

Choice mapping.
While sequences are an ordered set of elements,

All mapping.
XSD Types can be used in ShEx as they are used on XML Schema, e.g., whenever a string type is required we can use
Enumerations (using NMTokens)
Enumerations in XML Schema can be used to declare the possible values that an element can have. In ShEx, this is supported using the symbols ‘[’ and ‘]’. The enclosed values are the possible values that the RDF object can take. See Listing 11 for an example.

Enumerations (using NMTokens) mapping.

Pattern mapping.
Simple types in XML Schema are based on XSD Types (see Section 4.4) and allow some enhancements such as restrictions, lists and unions. Depending on the content, translation is performed following different strategies, which we detail below. For translation of restrictions, see Section 4.7.
List
Lists inside simple types define a way of creating collections of a base XSD type in XML Schema. These lists are supported in RDF using RDF Collections.

List mapping.

Example of an RDF list construction.
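To make the structure of an RDF Collection concrete, the following Python sketch builds the `rdf:first`/`rdf:rest` chain for an ordered list of values. It is only an illustration under our own assumptions: the function name `build_rdf_list` and the blank node labels `_:b0`, `_:b1`, ... are hypothetical, and triples are represented as plain tuples rather than with an RDF library.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_FIRST = f"<{RDF}first>"
RDF_REST = f"<{RDF}rest>"
RDF_NIL = f"<{RDF}nil>"


def build_rdf_list(items):
    """Return (head, triples) encoding `items` as an RDF Collection.

    Each list cell is a blank node with one rdf:first triple (the item)
    and one rdf:rest triple (the next cell, or rdf:nil at the end).
    """
    triples = []
    if not items:
        return RDF_NIL, triples  # the empty list is rdf:nil itself
    nodes = [f"_:b{i}" for i in range(len(items))]
    for i, (node, item) in enumerate(zip(nodes, items)):
        rest = nodes[i + 1] if i + 1 < len(nodes) else RDF_NIL
        triples.append((node, RDF_FIRST, item))
        triples.append((node, RDF_REST, rest))
    return nodes[0], triples


if __name__ == "__main__":
    head, triples = build_rdf_list(['"red"', '"green"', '"blue"'])
    print("head:", head)
    for s, p, o in triples:
        print(s, p, o, ".")
```

Because every cell points to exactly one successor, the order of the original values is preserved in the graph, which is the property needed to represent ordered XML content.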
Unions are the mechanism that XML Schema offers to make new types that are the combination of two simple types. With this kind of disjunction, a new type which allows any value admitted by any of the members of the

Union mapping.
Complex contents and simple contents are a way to define a new type from a base type using restrictions or extensions. The base type is the one that is used as a base for the restriction (or extension) clause and the new type is the one that is being restricted (or extended). Complex content allows extending or restricting a base
Restriction
Restrictions are used in XML Schema to restrict possible values of a base type. A new type can be defined using restrictions applied to a base type. Depending on how the type and the restrictions are defined, the translation strategies vary.
Simple Content: If Complex Content: If Future versions of ShEx are planning to include inheritance. See:
With extensions in XML Schema, it is possible to define a new type as an extension of a previously defined one. This is a case of classic inheritance, where the child inherits its parent elements that are added to its own defined elements. Depending on the content, i.e.,
Simple content: If
Complex content: If
Restrictions and extensions in ShEx are not supported directly in the current version (i.e., ShEx has no support for extensions, restriction or inheritance) with the same semantics as XML Schema. Therefore, we use the normal syntax provided by ShEx and create the two resulting shapes – by solving the

Restrictions and extensions mapping, where extensions and restrictions are directly transformed into the equivalent shape.
Enumeration

Enumeration mapping.

Fraction digits mapping.
This feature allows restricting the total number of digits permitted in a numeric type. In ShEx, this is possible using

Total digits mapping.

Length mapping.

Max length and min length mapping.
These features allow restricting number types to an interval of desired values. This is the same notion as in open and closed intervals. In ShEx, these features are supported directly. Therefore, transformation is done as shown in Listing 21.

Max exclusive, min exclusive, min inclusive and max inclusive mapping.
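Since these interval facets have direct counterparts in ShEx, the translation amounts to renaming each facet. The following Python sketch illustrates this under our own assumptions: the function name `interval_facets_to_shexc` is hypothetical, and the upper-case facet keywords follow the ShExC grammar as we understand it (the keywords are case-insensitive there).

```python
# XML Schema facet name -> ShExC numeric facet keyword.
INTERVAL_FACETS = {
    "minInclusive": "MININCLUSIVE",
    "minExclusive": "MINEXCLUSIVE",
    "maxInclusive": "MAXINCLUSIVE",
    "maxExclusive": "MAXEXCLUSIVE",
}


def interval_facets_to_shexc(datatype, facets):
    """Render a datatype plus its interval facets as a ShExC node constraint.

    `facets` maps XML Schema facet names to numeric bounds; facets outside
    the four interval facets are rejected in this sketch.
    """
    parts = [datatype]
    for name, value in facets.items():
        keyword = INTERVAL_FACETS.get(name)
        if keyword is None:
            raise ValueError(f"unsupported facet: {name}")
        parts.append(f"{keyword} {value}")
    return " ".join(parts)


if __name__ == "__main__":
    # Closed lower bound, open upper bound: the interval [1, 100).
    print(interval_facets_to_shexc(
        "xsd:integer", {"minInclusive": 1, "maxExclusive": 100}))
```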
Preserve: This option will not remove any white space character from the given string.
Replace: This option will replace all white space characters (line feeds, tabs, spaces and carriage returns) with spaces.
Collapse: This option will collapse white space: line feeds, tabs, spaces and carriage returns are replaced with spaces, leading and trailing spaces are removed, and multiple spaces are reduced to a single space.
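The three options described above can be illustrated with a short Python sketch that applies each policy to a string value. The function name `apply_whitespace` is hypothetical and the sketch only reproduces the behaviour of the XML Schema facet as described here; it says nothing about how ShEx handles the facet.

```python
import re


def apply_whitespace(value: str, mode: str) -> str:
    """Normalise `value` according to an XML Schema whiteSpace option."""
    if mode == "preserve":
        return value  # nothing is changed
    # Line feeds, tabs and carriage returns each become a single space.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    if mode == "replace":
        return replaced
    if mode == "collapse":
        # Runs of spaces shrink to one space; outer spaces are removed.
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError(f"unknown whiteSpace option: {mode}")


if __name__ == "__main__":
    raw = " a\t\tb \n"
    for mode in ("preserve", "replace", "collapse"):
        print(mode, repr(apply_whitespace(raw, mode)))
```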
In ShEx,

WhiteSpace mapping.

Unique mapping.
In addition to the proposed mappings from XML Schema to Shape Expressions, and in order to answer RQ2, a prototype has been developed. This prototype uses a subset of the presented mappings and converts a given XML Schema input to a ShEx output.
The prototype has been developed in Scala and is available online.
Supported features and features pending implementation in the XMLSchema2ShEx prototype. * Not natively supported in ShEx 2.0
The tool is built on top of Scala parser combinators [27]. Once the XML Schema input is analysed and verified, it is converted to ShEx based on the different elements and types declared in it. These conversions are made recursively and printed to the output in ShEx Compact Format (ShExC).

XML Schema to ShEx example.

XML to RDF example.
The input XML Schema document example presented in Listing 24 is used to ensure that the prototype works and performs the transformation as expected. This example includes complex types, attributes, elements, simple types and patterns, among others. Complex types are converted to shapes, elements and attributes to triple predicates and objects, restrictions (max/minExclusive and max/minInclusive) to numeric intervals, cardinality attributes to ShEx cardinality and so on. Although it is a small example, it has the structure of typical XML Schemas used nowadays and the prototype can convert it properly, as shown in Listing 24.
Once the conversion from XML Schema to ShEx is done, it must be verified that the validation that was performed on XML data using XML Schema works equivalently on RDF data using ShEx. To this end, a valid XML document is translated to RDF; the result is presented in Listing 25. The conversion presented in the snippet uses blank nodes to represent the nested types. This is done to avoid creating a fictitious node every time a triple is pointing to another triple (in other words, every time it has a nested type). The conversion was performed following similar equivalences to those proposed in the mappings. That is, complex types to triple subjects or predicates, simple types to triple objects, cardinality translated directly and so on.
For RDF validation using ShEx there are various implementations in different programming languages that are being developed. A list of ShEx implementations is available at:
Using the examples given above, the validation can be performed with the mentioned tool, which accepts the RDF and the ShEx inputs in various formats and offers the option to validate the RDF against a ShEx or SHACL schema. As seen in Fig. 2, validation is performed by trying to match the shapes with the existing graphs; whenever the tool matches a pattern, it shows the evidence in green and a short explanation of why this graph has matched.

Validation result using the Shaclex validator. The RDF data is entered in the left text area whereas the ShEx schema is entered in the right text area. At the bottom, a ShapeMap is declared to tell the validator where and how to begin the validation; in this case, we instructed it to validate the :order1 node against the <PurchaseOrderType> shape. At the top of the page, the result is shown, detailing how each node was validated and what the evidence or failures for the validation are. A link to the validation example can be found in the Supplementary Material.

Validation result using the Shaclex validator for a ShEx schema converted from a non-deterministic XML Schema document. In the ShapeMap input area, we have instructed the Shaclex validator to check whether :nondeterministic1 and :nondeterministic2 conform to the shape <nondeterministic>. At the top of the page, the satisfactory result is shown in green.
There is an issue that arises in XML Schema documents that should be solved when proposing a transformation from XML Schema. This is the topic of non-deterministic schemata, where the parser is unable to determine the sequence to validate due to the Unique Particle Attribution constraint. This issue appears, for example, in a choice between two sequences that begin with the same element. This event can be formulated with the regular expression:
These sequences are translated as shown in Section 4.3.1 and the final result can be seen in Listing 26. The question is whether this non-determinism is also transferred to the converted schemata. In order to check the actual behaviour, we have run this example on the Shaclex validator, which shows that the validation is performed correctly (see Fig. 3).
This behaviour is motivated by two things: firstly, the structure of RDF lists is different from that of XML Schema sequences, which makes the validation proceed in a different way; secondly, the validation in ShEx is performed recursively, trying to match shape by shape. Therefore, if an element matches a shape, the match propagates up the recursion tree without creating ambiguity problems.

Non-Deterministic schema and its ShEx counterpart.
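The kind of ambiguity discussed in this section, a choice between sequences that begin with the same element, can be detected with a simple check on the first element of each alternative. The following Python sketch is a deliberately simplified illustration and not a full Unique Particle Attribution test: the function name `ambiguous_first_elements` and the element names `a`, `b` and `c` are hypothetical, and occurrence constraints, wildcards and nested groups are ignored.

```python
def ambiguous_first_elements(choice):
    """Return the element names that start more than one alternative.

    `choice` is a list of sequences, each given as a list of element
    names. A non-empty result means that a parser reading the first
    element cannot tell which alternative it has entered.
    """
    seen = set()
    conflicts = set()
    for sequence in choice:
        if not sequence:
            continue  # empty alternatives cannot clash on a first element
        first = sequence[0]
        if first in seen:
            conflicts.add(first)
        seen.add(first)
    return sorted(conflicts)


if __name__ == "__main__":
    # Both alternatives start with "a": non-deterministic for XML Schema.
    print(ambiguous_first_elements([["a", "b"], ["a", "c"]]))
    # Distinct first elements: no ambiguity at the first step.
    print(ambiguous_first_elements([["a", "b"], ["c", "b"]]))
```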
In this work, a possible set of mappings between XML Schema and ShEx has been presented. With this set of mappings, automated conversion of XML Schema to ShEx becomes a new possibility for schema translation, as demonstrated by the prototype that has been developed and presented in this paper. Using an existing validator helped to demonstrate that an XML document and its corresponding XML Schema are still valid when they are converted to RDF and ShEx.
One future line of work that should be tackled is the loss of semantics: with this kind of transformation, some of the elements cannot be converted back to their original XML Schema constructs. Nevertheless, it is a difficult problem due to the difference between the ShEx and XML data models and it would involve some modifications and additions to the ShEx semantics (like the previously mentioned inheritance).
To cover more business cases and make this solution more compatible with existing systems, mappings for Schematron and Relax NG need to be created as future work. Relax NG is grammar-based but Schematron is rule-based, which will make conversion from Relax NG to ShEx more straightforward than from Schematron to ShEx, as ShEx is also grammar-based. Another line of future work is to adapt the presented mappings to SHACL: most of the mappings follow a similar structure. Moreover, the rule-based Schematron conversion seems more feasible using the advanced SHACL-SPARQL features, which allow expanding the core SHACL language by using SPARQL queries to validate complex constraints.
With the present work, validation of existing transformations between XML and RDF is now possible and convenient. This kind of validation makes the transformed data more reliable and trustworthy and it also facilitates migrations from non-semantic data formats to semantic data formats.
Conversions from other formats (such as JSON Schema, DDL, CSV Schema, etc.) will also be investigated to permit an improvement of data interoperability by reducing the technological gap.
Acknowledgements
This work has been partially funded by the Vice-rectorate for Research of the University of Oviedo under the call of “
