Abstract
A typical case of producing records within the domain of conservation of cultural heritage is considered. During condition and collection surveys in memory organisations, surveyors observe the types of multiple components of an object without creating a record for each one. They also observe the absence of components. Such observations are significant to researchers and are documented in registration forms, but they are not easy to implement using popular ontologies, such as the CIDOC CRM, which primarily consider individuals. In this paper, techniques for expressing such observations within the context of the CIDOC CRM in both OWL and RDFS are explored. OWL cardinality restrictions are considered and new special properties deriving from the CIDOC CRM are proposed, namely ‘typed properties’ and ‘negative typed properties’, which allow stating the types of multiple individuals and the absence of individuals. The nature of these properties is then explored in relation to their correspondence to longer property paths, their hierarchical arrangement and relevance to thesauri. An example from bookbinding history is used alongside a demonstration of the proposed solution with a dataset from the library collection of the Saint Catherine Monastery in Sinai, Egypt.
Introduction
The problem addressed in this paper is observed often in research. It will be introduced within the cultural heritage domain using a representative example from bookbinding history and the Conceptual Reference Model (CIDOC CRM) [9]. The CIDOC CRM is a popular standard for modelling records in memory organisations such as museums, libraries, galleries and archives. It can be used to model bookbinding records of observations of material evidence on books used for answering research questions in the field of bookbinding history [27].
Datasets integrated with the CIDOC CRM can be jointly queried in a knowledge base through the property relations offered by the CIDOC CRM. Observations recorded in knowledge bases are often contradictory, for example when they reflect different understandings or views of a phenomenon. Such contradictions are interesting cases for further research, so identifying their corresponding statements in the knowledge base is important. In the field of bookbinding history these have a significant impact on decision making for professionals in memory organisations, primarily for the conservators treating and repairing books but also for curators and scholars studying them to understand the history of the dissemination of copies of the texts and their reception.
The maturity and robustness of the CIDOC CRM is demonstrated, among other things, by the apparatus that describes the materiality of objects; this apparatus has been critically reviewed for many years. For example, a frequently used property of the CIDOC CRM is ‘P46 is composed of (forms part of)’, which can be used to express the link between a physical thing (e.g. a book) belonging to the CIDOC CRM class ‘E18 Physical Thing’ and its component part belonging to the same class. In bookbinding, one such component is the leaf marker: a small piece of material attached to a leaf for marking an important part of the text such as the beginning of a chapter.
An important construct within the CIDOC CRM is the use of the property ‘P2 has type’ with domain ‘E1 CRM Entity’ (i.e. the property is inherited by all CIDOC CRM classes) and range ‘E55 Type’. The property can be used to classify individuals based on terminological systems such as thesauri. The use of this property allows the CIDOC CRM to remain an ontology primarily of generic properties while being able to accommodate the granularity offered by domain experts through thesauri (see [9], on Minimality). ‘E55 Type’ can be used to model concepts in thesauri and the property ‘P2 has type’ can be considered as a way to extend the CIDOC CRM by using classes from thesauri as if they were individuals.
These properties are used regularly in descriptions of material aspects of books, including components making up a book structure. Books are complex objects which may include several hundred components and many observations can be recorded for each one of them. However, often due to limited resources it is not possible for all components to be recorded in a knowledge base. A survey of the manuscripts in the Saint Catherine Monastery in Sinai, Egypt [21], which was generally accepted as very detailed, was limited to observations which were needed for answering specific research questions. For example, leaf markers are recorded on page 1 of the Saint Catherine data registration form [20]. A book may have many leaf markers but recording each one of them is logistically not possible. In the Saint Catherine data form, only the materials and types of leaf markers are recorded without reference to individual ones. Therefore, there is a record of the type ‘leaf markers’ [11], but there is no record for any individual leaf marker.
Another important set of questions expressed in data registration forms is that of existence. For example, it is important to know if there are no leaf markers attached to a book. This information is important when planning conservation work or when studying leaf markers. If a book does not have any leaf markers, then there exists no individual to be described. Therefore the absence of individuals of a type (leaf marker type) needs to be recorded. This raises questions about the capacity of observation of material evidence and the set of real world constraints that are necessary to establish certainty of lack of existence. While the main focus of this work is how it is best to express lack of existence in a knowledge base, some discussion on the nature of observing non-existence is included in Section 4.3.
To summarise, the problems explored here are: a) how to record the type of things when these are too numerous to be documented individually (e.g. for a book with too many leaf markers) and b) how to produce records of things of a type for which individuals do not exist (e.g. for a book without leaf markers).
Solutions to these problems in both RDF Schema (RDFS) and OWL 2 DL (OWL) are discussed in this paper. OWL offers additional expressiveness in comparison to RDFS, but an RDFS solution is considered valuable because RDFS is popular in the CIDOC CRM community, especially for Linked Data implementations. It is noted that neither RDFS nor OWL can express the full semantics of the CIDOC CRM, in particular with respect to the concept of shortcuts as described by Meghini and Doerr [16] (p. 138), but here the aim is to provide a practical solution in either case.
Structure of the paper
Following the introduction, Section 2 presents the formalisation of the problem including a set of competency questions which are used to evaluate possible solutions. Section 3 presents related work which includes possible solutions and their evaluation based on these questions which shows that they have limitations. Section 4 provides recommended solutions including an analysis of the implications in the context of the CIDOC CRM and documentation practice. The recommended solutions are evaluated based on the same competency questions and this evaluation shows that the recommended solutions can answer the competency questions successfully. Section 5 presents an implementation of the proposed solutions for a sample dataset on book history from the library of the St. Catherine Monastery in Sinai. Section 6 summarises the conclusions and Section 7 points to future work.
Formalisation
When discussing solutions in OWL in this paper, the OWL 2 DL language in the functional notation is adopted alongside the notion of ontology as defined in that language.
The RDF language and the Turtle notation are adopted when discussing solutions in RDFS. In both cases CIDOC CRM classes and properties are used, abbreviated by their identifiers, for example ‘E55’ as opposed to ‘E55 Type’.
Internationalized Resource Identifiers (IRIs) standing for individuals are written as XML QNames. For example:
The instances of class ‘E55 Type’ are also referred to as ‘types’ and object property assertions involving the property ‘P2 has type’ are referred to as ‘type assertions’. For example:
For each problem a set of ‘selection’ competency questions [24] is defined, i.e. those which return individuals as results. These are used to help with assessing possible solutions. These questions are also articulated in a simplified SPARQL notation for clarity.
Problem 1: Numerous things
In the first problem, an ontology O is considered containing several large sets of assertions of the form:
The problem is how to compress O into O’ so that the above assertion sets
The competency questions are selected based on the following criteria to ensure relevance:
Questions involving
Questions involving
Questions involving Π are relevant where Π is considered as a constant.
The first kind of assertion in O
In addition, the classes to which
From the above list, questions a), b), d), e), g), h), k) are not relevant as they ask for or mention individual members of
If the exercise is repeated for the second kind of assertion (type assertion),
Q1:
Q2:
Q3:
Q4:
Q5:
Q5 can also be posed for individual
The answers to both these queries can be obtained from the answer to Q5, so they do not need to be considered separately.
Q1 and Q5 may also be stated over super-properties of Π which will be considered in Section 4. Q3 and Q5 might also be stated over super-properties of
It is also noted that questions Q1–Q5 may also occur as part of larger ones. The substitution strategy considered in Section 4 can be applied directly to Q1–Q5 and also to larger questions.
Therefore the problem considered here is replacing
A similar problem has been described previously as ‘MISO-R’, ‘multiple indirectly specified objects through a relationship’ [26]: “There is a distinguished object that is related, by the same kind of relationship, to (possibly even one, but usually) multiple undistinguished objects of a certain type.” The issue addressed in MISO-R is how to express the multiplicity of these undistinguished objects beyond the statements of existence. Problem 1 is a special case of MISO-R for the CIDOC CRM property ‘P2 has type’, considering at least one undistinguished object but without aiming to quantify them.
Problem 2: Non-existing things
In the second problem, an ontology ON is considered containing a set of assertions of the form:
The second problem is often solved in databases using the Closed World Assumption (CWA) but given that the formulation of the problem is done using OWL and the CIDOC CRM, both of which adhere to the Open World Assumption (OWA), CWA databases are not considered here. However, in Section 4.3.2 it is shown how a reduced scope of the OWA can be used to reason about non-existing things.
The competency questions for problem 2 have the same structure as those for problem 1, but they are posed with negative polarity. For example, Q1’ is articulated as: “Which individuals are not connected to an individual through Π?” Q1’ is impossible to answer in OWA systems unless there is knowledge from which it can be deduced that an individual satisfies the question condition. For example, it is impossible to know whether a book has no components, since assertions about non-existing components are made for specific types while components of other types may exist. Q2’ and Q4’ are not considered because knowing all classes that individuals in
Q5’: ¬
The domain expertise of the maintainer of ON is important for identifying the relevant negative statements to include. In considering negative statements with property Π, only knowledge that involves individuals in the domain of Π is relevant. The rest may be logically true but not relevant. For example, the domain of property ‘P46 is composed of’ is ‘E18 Physical Thing’. Only statements that negate ‘P46 is composed of’ for physical things are relevant because they bring significant knowledge. Statements that negate ‘P46 is composed of’ for individuals which do not belong to ‘E18 Physical Thing’ do not contribute to knowledge.
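The relevance criterion above can be illustrated with a minimal Python sketch. It is not part of the original method: the individuals `ex:book1` and `ex:binding_event1`, the type `ex:leaf_marker` and the flat class lookup are all hypothetical, and subclass inheritance is deliberately ignored for brevity.

```python
# Illustrative sketch only: individuals and class memberships are invented.
E18 = "E18 Physical Thing"

# Hypothetical class memberships of individuals in the knowledge base.
classes = {
    "ex:book1": {E18},
    "ex:binding_event1": {"E7 Activity"},
}

# Candidate negative statements: (subject, negated property, type).
candidates = [
    ("ex:book1", "P46 is composed of", "ex:leaf_marker"),
    ("ex:binding_event1", "P46 is composed of", "ex:leaf_marker"),
]

# Domain of each property, as stated in the text for P46.
domains = {"P46 is composed of": E18}


def relevant(statement):
    """Keep a negative statement only if its subject is in the property's domain."""
    subject, prop, _type = statement
    return domains[prop] in classes.get(subject, set())


relevant_statements = [s for s in candidates if relevant(s)]
```

Under these assumptions only the statement about the book is retained; the statement about the activity may be logically true but is filtered out as not contributing knowledge.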
Problem 1 and problem 2 are related as the existence of one implies the existence of the other.
The competency questions for each problem will be used as the basis for evaluating existing solutions in Section 3 and also the recommended solutions in Section 4. As will be shown, none of the solutions in Section 3 can correctly answer the competency questions, while the proposed solutions in Section 4 can.
Existing solutions
An obvious solution is to declare two disjoint classes: a) one for individuals for which the connection to a type through Π and
The existing solutions evaluated in the following sections are summarised in Table 1.
Summary of existing solutions based on related work
Mirza et al. [18] describe a method for automatically introducing ‘counting quantifiers’ in a knowledge base with examples from Wikipedia. Counting quantifiers are statements about the number of types of statements:
Existential restrictions
OWL existential restrictions can be applied to indicate that at least one individual exists, that connects to
This allows answering questions Q1 to Q5. In order to answer Q5’, it would be possible to use the complement of the class expression contained in the above assertion:
However, the last class expression would also denote all individuals not connected to any type apart from
Alternatively it is possible to use the axiom:
Existential restrictions are sometimes implemented in RDFS using blank nodes, i.e. to indicate that one unknown individual exists. This reduces the capacity of the solution to express the possible (and likely) existence of many individuals. Using multiple blank nodes implies that a fixed number of individuals exist which, in the case considered here, is unknown and cannot be specified. Other limitations of blank nodes have been reported (e.g. [15] and [8]) regarding the inconsistency of software implementations processing blank nodes and the lack of understanding of blank nodes by people who work within a Linked Data context.
Placeholder individuals
One way of solving problem 1 is by defining one individual which represents all of the numerous things that need to be described. Svátek et al. [26] call these individuals ‘some instances placeholder individuals’.
Semantically, this would be consistent if the identity criterion for
Shortcut properties
Another potential solution is defining a new property
For example a contradiction would be created when asserting the following for
Shortcut properties do not answer the competency questions of problems 1 and 2 since they do not refer to Π.
Linguistic annotations
Svátek et al. [26] consider existential restrictions and placeholder individuals as possible solutions for problem 1 and flag the problem of one individual potentially representing multiple individuals. To resolve this problem they propose a solution based on linguistic annotations of restrictions. They make an argument for wider adoption of annotations as part of computing processes but this relies on naming conventions which may be difficult to use for reasoning as the semantics of the labels used may not be known.
Reification
Another potential solution is declaring a new class
For example:
This is a viable solution for both problems and the competency questions could be encoded and answered based on the reification structure, in particular in RDFS. A solution based on class reification is not recommended because when querying the knowledge base, explanations are required to define: a) which part of it contains direct statements with individuals as subjects and b) which part of it contains reified statements with individuals as objects. Reification methods based on properties which eliminate this problem are possible [7] and this is explored further in Section 4.2.
Negated properties
Negated properties are properties whose semantics are understood as negation. These are often used in Wikidata within an RDFS context. For example the property
Property chains
OWL property chains can be used to specify the path from
This can then be used to assert statements:
Asserting both for the same individual will identify a contradiction. For this solution to work it is necessary to specify manually the
Cardinality and typed properties
Two recommended solutions, one for OWL and one for RDFS are presented next.
OWL cardinality restrictions
OWL cardinality restrictions can be used to define unnamed classes based on cardinality of properties. The axioms defining these classes are called ‘Object Property Cardinality Restrictions’ in OWL 2 DL [19]. For problem 1 the range cardinality of Π is at least 1 and for problem 2 the range cardinality of Π is at most 0:
For example, the following statements indicate a contradiction in the knowledge base:
OWL cardinality restrictions are preferred to the solution discussed in Section 3.2 because they allow explicit articulation of the intended statements making them more readable.
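The way a minimum cardinality of 1 and a maximum cardinality of 0 clash for the same individual can be sketched in Python. This is only an illustration of the numerical conflict, not a substitute for an OWL reasoner and not the authors' class expressions; the tuple encoding and the names `ex:book1`, `ex:book2` and `ex:leaf_marker` are assumptions made for the example.

```python
# Each assertion reads: the individual belongs to the class of things with
# at least ("min") or at most ("max") n values of `prop` carrying `type_`.
# All ex: names are hypothetical.
assertions = [
    ("ex:book1", "min", 1, "P46", "ex:leaf_marker"),  # problem 1: some exist
    ("ex:book1", "max", 0, "P46", "ex:leaf_marker"),  # problem 2: none exist
    ("ex:book2", "max", 0, "P46", "ex:leaf_marker"),
]


def clashes(assertions):
    """Return (individual, prop, type) keys whose bounds cannot both hold."""
    lower, upper = {}, {}
    for individual, kind, n, prop, type_ in assertions:
        key = (individual, prop, type_)
        if kind == "min":
            lower[key] = max(lower.get(key, 0), n)
        else:
            upper[key] = min(upper.get(key, n), n)
    # A clash occurs when the tightest lower bound exceeds the tightest upper bound.
    return sorted(k for k in lower if k in upper and lower[k] > upper[k])


contradictions = clashes(assertions)
```

In this toy dataset only `ex:book1` is reported, since it is asserted both to have at least one and at most zero components of the given type.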
In Section 5 it is shown that this solution allows answering the competency questions for both problems and a proof is exemplified for a test dataset.
RDFS typed properties
The solution in Section 4.1 is specific to OWL and this paper aims to offer a solution also in RDFS where cardinality restrictions for properties cannot be used.
The main limitation of a shortcut property
For example, for multiple existing leaf markers the property is
The reification statements apply to the property and therefore this solution is free from the class reification problems mentioned in Section 3.6, i.e. the individuals
Intuitively, a
Syntactically, the identifiers of the new properties can be produced automatically by inserting ‘T’ (typed) and ‘NT’ (negative typed) in front of the CIDOC CRM property identifier. The labels require human processing to ensure readability and generally fall into this pattern:
TP: “[CIDOC CRM property label] of type”
NTP: “[negation (e.g. “is not” or “does not”)] [CIDOC CRM property label] of type”.
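The naming pattern above could be automated along the following lines. This is a sketch under assumptions: the function name `typed_property_names` is hypothetical, only forward property identifiers of the form `P` followed by digits are handled, and the generated labels are drafts that still need the human review mentioned in the text.

```python
import re


def typed_property_names(crm_property, negation="does not"):
    """Derive draft TP and NTP identifiers and labels from a CIDOC CRM property.

    `crm_property` is a string such as "P46 is composed of".
    The labels are drafts only and are expected to need manual editing.
    """
    match = re.fullmatch(r"(P\d+)\s+(.+)", crm_property)
    if match is None:
        raise ValueError(f"unexpected property name: {crm_property!r}")
    identifier, label = match.groups()
    return {
        # 'T' and 'NT' are inserted in front of the CIDOC CRM identifier.
        "TP": (f"T{identifier}", f"{label} of type"),
        "NTP": (f"NT{identifier}", f"{negation} {label} of type"),
    }


names = typed_property_names("P46 is composed of", negation="is not")
```

For ‘P46 is composed of’ the identifiers come out as `TP46` and `NTP46`, while the draft labels are clumsy (the positive one ends in "of of type"), which shows why the labels cannot be produced fully automatically.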
The additional statements required to correctly answer the competency questions Q1–Q5 and Q5’ in the compressed graphs O’ and ON’ for TP properties and NTP properties respectively are examined next.
Additional statements for TP properties
In order to identify the impact of the TP properties in a knowledge base, it is appropriate to consider the way properties are axiomatised in RDFS. In particular, RDFS offers three properties to describe the semantics of a property (see Hayes and Patel-Schneider [6], Section 9.2.1):
For each
Therefore, assuming that Π
For each super-property of Π a new
The property ‘P2 has type’ does not have any super-properties but if that were the case additional properties TP” would be necessary for each super-property of
Additional statements for NTP properties
A similar process is followed for NTP properties. In the following cases individuals
that the individual
that the individual
For both cases the domain and range of NTP are defined as follows:
Case b) is considered first. In case b), for every sub-property of
In case a), for every sub-property of Π, in whose domain
Adding these statements in the knowledge base for any of the three cases does not lead to contradictions in relation to the observations and it does not affect the capacity to answer Q5’.
In Section 5, it will be shown that the compressed graphs O’ and ON’ allow the competency questions to be answered, exemplifying the proof using a specific dataset for concreteness. In the rest of this section, the implications of the proposed solution are discussed in relation to documentation practice with the CIDOC CRM.
Observation, negation and categorical statements
The philosophical discourse around non-existence is often introduced with the concept of ‘referential fallacy’ (for example, see the discussion of the existence of Pegasus in [23]), i.e. the assumption that a referenced entity in a knowledge base exists in real life when it could be fictitious. Fictitious things are not considered in this paper. In heritage research and when producing documentation records the following are considered: a) a potentially real individual and b) the absence of any real individual.
A potentially real individual
A potentially real individual may be the result of interpreting references and other sources of evidence or the result of indirect observation. For example a publication referencing leaf markers existing on a specific book, or evidence of adhesive on the leaf at the location where a leaf marker would be expected, may indicate the existence of an individual leaf marker for a period of time. In these cases, the available knowledge only constitutes a finite set of constraints
Additional knowledge beyond these sets of constraints, such as direct observation, means that the individual is known to exist.
Absence of any real individual
Absence of any real individual is described by the same constraints. For example the question whether a book has leaf markers requires a complete observation of every leaf of the book which in itself is limited by constraints of the conditions of observation (for example, part of the book may be inaccessible). There may be cases when previously unknown leaves of a book with leaf markers are reunited with the book thus creating a new boundary for complete observation (this is often illustrated in the field of biodiversity where previously thought extinct species have reappeared [25]). Therefore negative statements about the absence of any individuals presuppose complete observation within a set of constraints.
In the context of a knowledge base Razniewski and Nutt [22] have summarised the nature of partially-complete knowledge bases which follow neither the OWA, nor the CWA. In their work, knowledge base queries are characterised based on completeness to allow users to understand whether the results assume OWA or CWA. This characterisation can be done through providing contextual information about data completeness (i.e. similar to a set of constraints). Darari et al. [4] explore the Semantic Web as an Open World dataset with pockets of complete data under CWA. In a similar fashion they consider the certainty of answers as a metric to evaluate results of queries by comparing to a hypothetical complete dataset within a given context. In the example of leaf markers, completeness of observation is reflected by the material aspects of the book. The context of the limited Closed World for the NTP properties consists of: a) the domain of the property, i.e. the individual being completely observed, b) the range of the property, i.e. the type that it is observed for, and c) the original CIDOC CRM property Π included in the reification statements through
Typed properties as categorical statements
The importance of categorical and cross-categorical knowledge in the CIDOC CRM has been discussed before [5]. Lin et al. [12] discuss issues around categorical knowledge using an example from the field of biodiversity: “The Kobra eats rodents and lives in India”. This statement is expressed as if the category of ‘Kobra snakes’ is an instance of a snake (instance of ‘E18 Physical Thing’) although in reality it is an instance of ‘E55 Type’. The example goes further mixing categories and individuals: “a specific snake of the type Kobra eats rodents”. This is in parallel to the example of a specific book carrying leaf markers. In order to accommodate such statements a proposal for the MetaCRM [1] was established where all domains and ranges of CIDOC CRM properties were replaced by ‘E55 Type’. These highlight the switch from statements about individuals to statements about types of things similar to TP properties. However, the TP properties additionally offer direct links through
Characteristics of typed properties
Characteristics of TP and NTP are considered next.
Typed properties as CIDOC CRM shortcuts
TP properties can be considered as shortcuts within the CIDOC CRM. For the example of
NTP properties are not CIDOC CRM shortcuts since the property of the chain for which the negation applies is unclear.
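The correspondence between a TP property and the longer property path it abbreviates can be sketched in Python as follows. The triple encoding, the function `compress` and the individuals `ex:book1`, `ex:marker1` and `ex:marker2` are hypothetical, and the identifier `TP46` simply follows the naming pattern given earlier; the sketch does not cover the reification statements discussed in the text.

```python
# Triples are (subject, property, object); all ex: names are hypothetical.
long_path = {
    ("ex:book1", "P46", "ex:marker1"),
    ("ex:book1", "P46", "ex:marker2"),
    ("ex:marker1", "P2", "ex:leaf_marker"),
    ("ex:marker2", "P2", "ex:leaf_marker"),
}


def compress(triples, prop="P46", typed_prop="TP46"):
    """Replace each path x -prop-> y -P2-> t with one triple x -typed_prop-> t."""
    types = {}
    for s, p, o in triples:
        if p == "P2":
            types.setdefault(s, set()).add(o)
    return {
        (s, typed_prop, t)
        for s, p, o in triples
        if p == prop
        for t in types.get(o, ())
    }


compressed = compress(long_path)
```

The four triples about two individual leaf markers collapse into a single statement about the book and the type, and the individual markers no longer appear in the result.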
Existing CIDOC CRM typed properties
The scope note of the CIDOC CRM property ‘P125 used object of type’ reads: “This property associates an instance of E7 Activity to an instance of E55 Type, which defines the type of object used in an instance of E7 Activity, when the specific instance is either unknown or not of interest, such as use of ‘a hammer’.” Its sub-property ‘P32 used general technique’ can also be considered a typed property.
Hierarchy of typed properties
CIDOC CRM property inheritance applies to derived TP and NTP properties. This does not conflict with the additional statements as a result of the requirements for the RDFS entailment patterns.
Negative typed properties and thesauri
Thesauri used with the CIDOC CRM are often hierarchical using broader/narrower relationships provided by standards like ISO 25964-1:2011 [10] and SKOS [17]. For example, in the field of bookbinding history the Language of Bindings Thesaurus (LoB) [30] provides such relationships. The concept for ‘leaf markers’ has broader concept ‘bookmarks’. TP properties are consistent with broader relationships in thesauri, but NTP properties are not. For example, if:
It is noted that the quality of the thesauri should be such that it allows such reasoning.
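The asymmetry between TP and NTP properties with respect to thesaurus hierarchies can be illustrated with a short Python sketch. It assumes a hypothetical, acyclic thesaurus fragment in which ‘leaf markers’ has the broader concept ‘bookmarks’, and it uses the illustrative identifiers `TP46` and `NTP46`; it is not the authors' implementation.

```python
# Hypothetical thesaurus fragment: concept -> its broader concept.
broader = {"ex:leaf_marker": "ex:bookmark"}


def ancestors(concept):
    """All broader concepts of `concept`, nearest first (assumes no cycles)."""
    found = []
    while concept in broader:
        concept = broader[concept]
        found.append(concept)
    return found


def descendants(concept):
    """All narrower concepts of `concept`, transitively."""
    return sorted(c for c in broader if concept in ancestors(c))


def expand(statements):
    """TP statements extend to broader concepts, NTP statements to narrower ones."""
    expanded = set(statements)
    for subject, prop, concept in statements:
        if prop == "TP46":
            related = ancestors(concept)
        elif prop == "NTP46":
            related = descendants(concept)
        else:
            related = []
        expanded.update((subject, prop, c) for c in related)
    return expanded


result = expand({
    ("ex:book1", "TP46", "ex:leaf_marker"),
    ("ex:book2", "NTP46", "ex:bookmark"),
})
```

Under these assumptions a book with leaf markers is also inferred to have bookmarks, and a book without bookmarks is inferred to have no leaf markers, whereas neither inference is drawn in the opposite direction.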
Application and correctness of the proposed solutions
Dataset
Data collected during the survey of the Library of the St. Catherine Monastery in Sinai, Egypt is used to demonstrate the two solutions. The data describes whether the manuscripts in the library feature leaf markers. In total there are 3,277 records [28]. Two of them are shown in Table 2 [29].
Example records of existence of leaf markers on manuscripts
The records were encoded for the OWL solution as shown next:
The records were also encoded for the RDFS solution as shown next:
The URI
Tables 3 and 4 show the queries for questions Q1–Q5 implemented in OWL DL query expressions and SPARQL for the OWL and RDFS solutions respectively where
OWL DL queries for questions Q1–Q5
SPARQL queries for questions Q1–Q5
An example of a query involving super-properties of Π is included next:
Q5’ can be answered using OWL DL query expressions for the OWL solution in ON and ON’. It is noted that due to the OWA the assertions in ON cannot answer Q5’ and the query can only be constructed involving individual members of
In ON’ the query can be formulated based on cardinality restrictions:
Q5’ can be answered using SPARQL for the RDFS solution in ON’ only. ON does not contain relevant RDFS statements. Negation in ON through SPARQL tools such as the MINUS operator (for example see [2]) does not return relevant knowledge.
This query can also include sub-properties of Π:
The SPARQL query for Q5’ can be articulated with
The identification of contradictory statements about the existence of individuals is important for scholarship as they indicate areas of further discussion. In OWL such contradictions are automatically identified. For example the following is inconsistent:
An assertion that the individual
In RDFS statements matching contradictory observations can be identified through SPARQL queries:
This will identify the individuals
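The same check can be mimicked outside SPARQL with a few lines of Python, shown here only as a rough analogue of such a query rather than the query itself. The compressed graph, the manuscripts `ex:ms1` and `ex:ms2` and the identifiers `TP46` and `NTP46` are hypothetical.

```python
# Hypothetical compressed graph: (subject, property, type).
graph = {
    ("ex:ms1", "TP46", "ex:leaf_marker"),
    ("ex:ms1", "NTP46", "ex:leaf_marker"),
    ("ex:ms2", "NTP46", "ex:leaf_marker"),
}


def contradictory(graph, tp="TP46", ntp="NTP46"):
    """Pairs (subject, type) asserted with both the TP and the NTP property."""
    positive = {(s, o) for s, p, o in graph if p == tp}
    negative = {(s, o) for s, p, o in graph if p == ntp}
    return sorted(positive & negative)


conflicts = contradictory(graph)
```

Here only `ex:ms1` is flagged, because it carries both a positive and a negative typed statement for the same type; as in the RDFS setting, the conflict is found by an explicit query and not by automatic inference.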
Proof of correctness
It is now possible to show the correctness of the method by proving the following proposition using the notions and the notations introduced in [6]: The proof is carried out only for Q1 and the positive case, as the proofs for the other queries and the negative case are similar. For every RDFS model J of O, there exists a corresponding model H of O’ such that
Conclusions
When documenting heritage, the following two problems often appear: a) how the typology of numerous individuals can be recorded without including them in the knowledge base and b) how the non-existence of individuals can be recorded. These problems were summarised with a set of competency questions in comparison to the knowledge available when individuals are included. The competency questions were then filtered based on the significance of the knowledge for research.
Following a review of potential solutions, in OWL the use of cardinality restrictions is recommended as an optimal solution as it excludes statements about the numerous individuals
When describing the non-existence of individuals, the reified statements of NTP properties apply to both parts of the property chain when, in reality, it could be that only one of them is negated, but negating both does not have a negative impact on the results of the competency questions.
The use of NTP properties requires a context to define completeness of observation. In practice this means full capacity to observe the individual. The proposed NTP properties derive from the CIDOC CRM properties. Completeness of observation is described by the domain and range of the NTP property as well as the original CIDOC CRM property from which the NTP property is derived.
TP properties are shortcuts in the CIDOC CRM whereas NTP properties are not. The hierarchy of TP and NTP properties mirrors that of the CIDOC CRM property hierarchy. When discussing reasoning about broader/narrower concepts from thesauri, statements using NTP properties also apply to narrower terms of a thesaurus in contrast to statements using TP properties where this is not the case.
Future work
An implementation extension of the CIDOC CRM which will allow easy use of TP and NTP properties is in preparation. The development of that extension is undertaken as part of work for the Linked Conservation Data project [13]: a project which explores ways of sharing data produced by conservators with significant representation from book and paper conservators working with historic books. The progress of the development of the extension can be followed in the Linked Conservation Data GitHub repository [14].
Acknowledgements
The authors thank the CIDOC CRM special interest group. This work was initiated by the Linked Conservation Data project and is partly funded by the Arts and Humanities Research Council in the UK. The problems were introduced by Prof. Nicholas Pickwoad during the condition survey of the manuscripts of the library of the St. Catherine’s Monastery in Sinai, Egypt.
