
Abstract

The Henri Poincaré correspondence is a corpus of letters sent and received by this mathematician. The edition of this correspondence is a long-term project begun during the 1990s. Since 1999, a website has been devoted to publishing this correspondence online, together with digitized letters. In 2017, it was decided to rebuild this website using Omeka S. This content management system offers useful services, but some user needs have led to the development of an RDFS infrastructure associated with it. Approximate and explained searches are handled by means of SPARQL query transformations. A prototype for efficient RDF annotation of this corpus (and similar corpora) has been designed and implemented. This article deals with these three research issues and how they are addressed.

Keywords

History of science, Digital Humanities, Henri Poincaré, scientific correspondence, RDF(S), approximate and explained search, SPARQL query transformation, corpus annotation

1. Introduction

Jules Henri Poincaré is a great name of the history of science. Born in Nancy (France) on April 29, 1854, he died in Paris on July 17, 1912.

His name is associated with discoveries or works of primary importance. He is responsible for, among other things, the discovery of Fuchsian functions in mathematics, an essential contribution to the resolution of the Three-Body problem in celestial mechanics, for which he won the Oscar II Prize, King of Sweden’s mathematical competition, in 1889, as well as the introduction in topology of the concepts of homotopy and homology. His theoretical research on new mechanics after 1900 prepared the discovery of the special theory of relativity by Albert Einstein in 1905.

Mathematician, physicist, astronomer and research administrator, Henri Poincaré was also very active in the field of philosophy as a regular contributor to the Revue de métaphysique et de morale. His philosophical books, such as La science et l’hypothèse [Science and Hypothesis] [26], contributed to the birth of the French philosophy of science and earned him a great reputation in France and internationally. Although not very engaged on the political scene, he nevertheless played a significant role in the Dreyfus Affair through several mathematical expert reports.

Henri Poincaré was elected in 1887 to the Académie des sciences of Paris, at the age of 33. During his life he also became a member of the Bureau des longitudes (1893), of the Académie Française (1908) and of numerous foreign learned societies and academies.

In 1992, the laboratory of history of science and philosophy Archives Henri Poincaré was created to study Henri Poincaré’s manuscripts and to organize the publication of his scientific and private correspondence. Over more than 25 years, this long-term project has produced four volumes of letters. The first one is devoted to the letters exchanged between Henri Poincaré and the Swedish mathematician Gösta Mittag-Leffler [23]. The second one concerns the correspondence with physicists, chemists and engineers [30]. The third volume gathers the correspondence with astronomers and, in particular, geodesists [31]. The fourth one deals with Henri Poincaré’s youth correspondence [28]. Two other volumes are in preparation. The first one will be devoted to the letters from or to mathematicians. The second will contain Henri Poincaré’s administrative, academic and private correspondence.

The corpus¹ consists of around 2100 letters, 1126 sent by Henri Poincaré and 956 received by him. Some other letters have been discovered recently. About 50 % of these letters form a correspondence with scientists. The original letters come from 63 different archive centers and libraries in 14 countries. All known letters are digitized² and around 60 % of them are available as plain text (in LaTeX and XML versions). Many letters contain mathematical and physical formulae. The correspondence is available on the Henri Poincaré website.³ The letters on this website are indexed with Dublin Core extended metadata.⁴

¹ The notion of corpus is in accordance with the terminology in history, i.e., a collection of documents on a specific subject gathered and stored to be exploitable. In natural language processing, the appropriate term could be semantic digital library.

² Due to copyright laws, some images of letters are not available online, though transcripts are.

³ http://henripoincare.fr.

⁴ Several projects are devoted to scientific correspondences, for example the CKCC project (http://ckcc.huygens.knaw.nl [32]) or the Mapping the Republic of Letters project (http://republicofletters.stanford.edu). More generally, several Semantic Web projects are dedicated to history. A recent paper [21] proposes a state of the art on this issue. For instance, SemanticHPST is a project in which Semantic Web principles are applied to the history and philosophy of science and technology [5]. A newly founded European consortium called Data for History (http://dataforhistory.org/) aims at uniting such projects.

This enables to query the corpus by e.g.

A plain-text search engine exists for the letters that are already transcribed,⁵ but the proposed set of results can be incomplete or incorrect. For instance, if the query is to find letters sent or received by Henri Poincaré with the topic “lacunary functions”, the result is only one letter, though there are more than 10 letters referring partially to this topic. The problem is that, in these letters, the term “lacunaire” does not explicitly occur, hence the incompleteness of this search from a semantic viewpoint. In other letters, “espaces lacunaires” can be found but it refers to titles of Henri Poincaré’s papers, hence its incorrectness.

⁵ The search engine Solr is installed on the platform.

This article is organized as follows. Section 2 presents the Henri Poincaré correspondence and how it was edited, annotated and published on the web of documents. It turned out that the principles and technologies of the Semantic Web should prove useful for the exploitation of this corpus, in particular the RDF(S) technology, which is briefly recalled in Section 3 with some examples related to the corpus. The remainder of the paper presents three main works related to the Semantic Web for the Henri Poincaré correspondence. Section 4 explains how the RDF(S) infrastructure is added to the Henri Poincaré correspondence website, which involves some translation mechanisms. This makes it possible to interrogate this corpus with SPARQL queries. The need for more flexible querying is motivated in Section 5, together with a way of handling this flexibility. Finally, the prototype of a tool for efficient editing of RDF triples for the purpose of indexing the Henri Poincaré correspondence is described in Section 6.

2. Editing and annotating the correspondence of Henri Poincaré

From 1999 to 2015, the Henri Poincaré Papers website, which was created and maintained by Scott Walter to highlight the Henri Poincaré correspondence, underwent several developments. The last version was a web application associated with the open source full-text search engine Sphinx.⁶ When available, one could find the digitized letters, a transcript, a critical apparatus and Dublin Core metadata. This website was harvested through OAI-PMH.⁷

⁶ http://sphinxsearch.com.

⁷ OAI-PMH is a protocol developed by the Open Archives Initiative for harvesting metadata descriptions of archives. It is mainly based on the Dublin Core model. See http://www.openarchives.org/pmh/.

In 2017, the Archives Henri Poincaré decided to rebuild the website. In order to better structure the site and to benefit from semantic annotation, this new platform has been based on the content management system (CMS) Omeka S.⁸ Developed by the Roy Rosenzweig Center for History and New Media, this CMS has been created for publishing and promoting cultural heritage collections. This system has been used to build digital collections for various institutions such as the Metropolitan New York Library Council (METRO) collection [16], the University of Binghamton [13] or the University of São Paulo [25]. It can also be relevant for the management and sharing of educational resources [22].

⁸ https://omeka.org/s/.

Fig. 1.

Example of transcription and metadata from a letter.

Omeka S allows semantic annotations. In the back office, several vocabularies are already available: Dublin Core Terms, Friend of a Friend, and Bibliographic Ontology. Adding other vocabularies is possible, like the Archives Henri Poincaré Ontology.⁹ A search engine based on the properties is available but it is not very user-friendly. All data of the previous website (digitized letters, transcripts, metadata) are available on the platform. An environment has been created in this CMS in which letters can be visualized (as an image or in plain text) with a critical apparatus essentially coming from the printed edition of this correspondence and an indexation (Fig. 1).

⁹ For now, this ontology is rather basic. It still requires an epistemological investigation which will rely on other ontologies in digital humanities such as the CIDOC-CRM ontology (http://www.cidoc-crm.org).

For all letters, there are two types of indexing. The first one is a physical description: type of letter (telegram, autograph letter, minutes, etc.), number of pages, location of the letter within the archive, sender, recipient, date and place of dispatch, etc. Some pieces of information are missing. For example, there are letters for which the exact sending date is unknown, but for which the year and the month are known from the context and associated with the letter.

The second type of indexing relates to the content of the letters; all relevant information can be indexed (see Fig. 2 for an example). These data are relevant from the viewpoint of historians. For instance, people or publications cited, mathematical theories or formulae, and philosophical concepts are taken into account.

3. Preliminaries on RDFS and SPARQL

This section makes a presentation of RDF, RDFS and SPARQL that is simplified for the needs of this article, and exemplified in the domain of the correspondence of Henri Poincaré.

The atoms of RDF are resources and literals. A resource is either anonymous or named. An anonymous resource (aka blank node) is an existentially quantified variable; by convention its identifier starts with a question mark (e.g., $?x$ or $?firstName$ ). A named resource is identified by a name without a question mark; it is a constant of any type: instance (e.g., $henriPoincaré$ ), class (e.g., $Mathematician$ ), etc. A property is a resource denoting a binary relation (e.g., $sentBy$ is a property relating a letter to the person who sent it). A literal is a constant of a predefined datatype, such as $integer$ , $float$ or $date$ . The term “value” is used in this paper for “resource or literal”.

An RDF triple is a triple $τ = ⟨ s p o ⟩$ where s is a resource (the subject of τ), p is a property (the predicate of τ) and o is either a resource or a literal (the object of τ). For example, the RDF triple $⟨ letter22 sender henriPoincaré ⟩$ states that $letter22$ was sent by Henri Poincaré. An RDF graph represents a set of RDF triples: if $⟨ s p o ⟩$ belongs to an RDF graph $G$ then s and o are nodes of $G$ , and p labels the edge $(s, o)$ of $G$ . The set of named resources of an RDF graph $G$ is denoted by $Res (G)$ .

Fig. 2.

Examples of indexation by RDF triples.

Fig. 3.

An RDF graph $G_{e x}$ and a part of its deductive closure.

RDFS is a logic whose language is RDF and whose inference relation ⊢ is defined by a set of inference rules that are based on some given resources: $rdf:type$ , $rdfs:subClassOf$ , $rdfs:subPropertyOf$ , $rdfs:domain$ and $rdfs:range$ , respectively abbreviated as $a$ , $subc$ , $subp$ , $domain$ and $range$ .

Let p, q and r be properties, x and y be resources, and C, D, E and R be classes. The inference rules considered in this paper are $\begin{array}{l} \frac{⟨ x a C ⟩ \quad ⟨ C subc D ⟩}{⟨ x a D ⟩} \; r_{1}, \\ \frac{⟨ C subc D ⟩ \quad ⟨ D subc E ⟩}{⟨ C subc E ⟩} \; r_{2}, \\ \frac{⟨ p subp q ⟩ \quad ⟨ q subp r ⟩}{⟨ p subp r ⟩} \; r_{3}, \\ \frac{⟨ x p y ⟩ \quad ⟨ p subp q ⟩}{⟨ x q y ⟩} \; r_{4}, \\ \frac{⟨ x p y ⟩ \quad ⟨ p domain D ⟩}{⟨ x a D ⟩} \; r_{5}, \\ \frac{⟨ x p y ⟩ \quad ⟨ p range R ⟩}{⟨ y a R ⟩} \; r_{6}. \end{array}$ The (RDFS) deductive closure of an RDF graph $G$ is $G^{⊢} = \{τ : \text{RDF triple} ∣ G ⊢ τ\}$ . An RDF graph $G$ can be partitioned into $G = O \cup D$ where $O$ is an ontology (containing all the triples whose predicates are $subc$ , $subp$ , $domain$ or $range$ ) and $D$ gathers the data (that are about the individuals). The RDFS graph $G_{e x}$ of Fig. 3 is used as an example in the remainder of the paper.
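The six rules can be read as a saturation procedure. The following is a minimal sketch, not the implementation used on the platform: it represents a graph as a Python set of `(s, p, o)` tuples and applies $r_{1}$ to $r_{6}$ until nothing new is produced. The toy graph, including the property `involves` and the class names, is an assumption made for illustration and is not the graph $G_{e x}$ of Fig. 3.

```python
# Abbreviations used in the text: a = rdf:type, subc, subp, domain, range.
A, SUBC, SUBP, DOMAIN, RANGE = "a", "subc", "subp", "domain", "range"


def rdfs_closure(graph):
    """Saturate a set of (s, p, o) triples with the rules r1 to r6."""
    closure = set(graph)
    while True:
        new = set()
        for (s, p, o) in closure:
            for (s2, p2, o2) in closure:
                if p == A and p2 == SUBC and o == s2:      # r1
                    new.add((s, A, o2))
                if p == SUBC and p2 == SUBC and o == s2:   # r2
                    new.add((s, SUBC, o2))
                if p == SUBP and p2 == SUBP and o == s2:   # r3
                    new.add((s, SUBP, o2))
                if p2 == SUBP and p == s2:                 # r4
                    new.add((s, o2, o))
                if p2 == DOMAIN and p == s2:               # r5
                    new.add((s, A, o2))
                if p2 == RANGE and p == s2:                # r6
                    new.add((o, A, o2))
        if new <= closure:  # fixpoint: no rule produces a new triple
            return closure
        closure |= new


# Hypothetical toy graph (ontology triples first, then data triples).
G_toy = {
    ("Mathematician", SUBC, "Scientist"),
    ("Scientist", SUBC, "Person"),
    ("sentBy", SUBP, "involves"),
    ("sentBy", DOMAIN, "Letter"),
    ("sentBy", RANGE, "Person"),
    ("letter22", "sentBy", "henriPoincaré"),
    ("henriPoincaré", A, "Mathematician"),
}

closure = rdfs_closure(G_toy)
```

The quadratic double loop is only meant to mirror the rules one by one; a real RDFS engine would index triples by predicate instead.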

SPARQL is a query language for RDF. In this article, SPARQL querying is assumed to be performed by an engine using RDFS entailment, meaning that querying an RDF graph $G$ by a SPARQL query $Q$ gives the same result as interrogating $G^{⊢}$ with $Q$ . Moreover, the SPARQL queries considered have the following syntax: SELECT $vars$ WHERE {$body$}, where $vars$ is a sequence of anonymous resources and $body$ is a sequence of statements separated by “ $.$ ”, a statement being either an RDF triple written without ⟨ and ⟩ (i.e., s p o instead of $⟨ s p o ⟩$ ) or a filter statement of the form $FILTER (condition)$ , where $condition$ is a Boolean expression using Boolean operators, equality, and inequality relations on numbers and dates. Moreover, without loss of generality, it is assumed that only one filter statement occurs in the body of a SPARQL query. For example, the following is a query for letters sent by Henri Poincaré to a scientist before 1898:

More generally, the execution of a SPARQL query Q on an RDF graph $G$ consists in finding matchings between the body of the query and $G$ ; it results in a set of bindings, a binding being an assignment of a resource or a literal to each variable, this assignment having to satisfy the constraint given in the filter statement. The result of this execution, denoted by $exec (Q, G)$ , is the set of bindings restricted to the variables given after the $SELECT$ keyword. For example:
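The notion of execution described here can be sketched as a small pattern matcher. This is an illustrative toy, not how a SPARQL engine such as Jena works internally: the body is a list of triple patterns, the filter is a Python predicate on a binding, and the graph is assumed to be already closed under RDFS inference. The letters, their dates and the property names `sentTo` and `sentInYear` are assumptions made for the example.

```python
def is_var(term):
    """Variables are strings starting with a question mark."""
    return isinstance(term, str) and term.startswith("?")


def match_body(body, graph, binding=None):
    """Yield every binding that maps each triple pattern of `body` to a triple of `graph`."""
    binding = binding or {}
    if not body:
        yield dict(binding)
        return
    first, rest = body[0], body[1:]
    for triple in graph:
        candidate = dict(binding)
        ok = True
        for pattern_term, value in zip(first, triple):
            if is_var(pattern_term):
                # A variable must keep the value it was given earlier, if any.
                if candidate.setdefault(pattern_term, value) != value:
                    ok = False
                    break
            elif pattern_term != value:
                ok = False
                break
        if ok:
            yield from match_body(rest, graph, candidate)


def exec_query(select, body, graph, condition=lambda b: True):
    """Bindings satisfying the filter, restricted to the selected variables."""
    return {tuple(b[v] for v in select)
            for b in match_body(body, graph) if condition(b)}


# Hypothetical, already saturated toy graph.
G_closed = {
    ("letter1", "a", "Letter"), ("letter1", "sentBy", "henriPoincaré"),
    ("letter1", "sentTo", "felixKlein"), ("letter1", "sentInYear", 1882),
    ("letter2", "a", "Letter"), ("letter2", "sentBy", "henriPoincaré"),
    ("letter2", "sentTo", "gostaMittagLeffler"), ("letter2", "sentInYear", 1901),
    ("letter3", "a", "Letter"), ("letter3", "sentBy", "felixKlein"),
    ("letter3", "sentTo", "henriPoincaré"), ("letter3", "sentInYear", 1881),
    ("felixKlein", "a", "Scientist"), ("gostaMittagLeffler", "a", "Scientist"),
    ("henriPoincaré", "a", "Scientist"),
}

# Letters sent by Henri Poincaré to a scientist before 1898.
body = [
    ("?l", "a", "Letter"),
    ("?l", "sentBy", "henriPoincaré"),
    ("?l", "sentTo", "?p"),
    ("?p", "a", "Scientist"),
    ("?l", "sentInYear", "?y"),
]
result = exec_query(["?l", "?p"], body, G_closed, lambda b: b["?y"] < 1898)
```

On this toy graph, `result` contains the single pair made of `letter1` and `felixKlein`: `letter2` is too late and `letter3` was not sent by Henri Poincaré.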

4. Adding an RDFS infrastructure on an Omeka S site

The Henri Poincaré website is hosted and managed using a Huma-Num service. Huma-Num is a French infrastructure dedicated to Digital Humanities [17]. A system, branded “Huma-Num Box”, has been developed to address several issues related to the treatment of data for Humanities (scalability, volume, accessibility, security, etc.). Using such a service enhances the development of standards when dealing with Digital Humanities. The Omeka S instance (vocabularies, content, website modules, etc.) has been installed on a shared server.

Omeka S comes with a search engine called Solr,¹⁰ which lets users query the database to get information about transcribed letters. A specific configuration using lemmatization has been set up to improve search results. Although it can be very powerful, this search tool is limited in some situations and does not take advantage of Semantic Web technologies. That is why the need to install a SPARQL endpoint has emerged. Omeka S data is stored in a dedicated MySQL database which cannot be directly interrogated through SPARQL queries. A specific RDFS base had to be installed. In addition to the server used to manage the Omeka S installation, a virtual machine has been configured in which a dedicated Java application runs to manage an RDF database. This application uses the Jena engine [20] to manipulate the RDF documents and to execute SPARQL queries. The textual RDF syntax Turtle [6] was chosen to write the triples because it is easy to read. An associated interface has been developed and is embedded in the Omeka S website (with no impact on user navigation).

¹⁰ Solr is an open source search platform built on Apache Lucene.

An automatic script retrieves data from Omeka S to update this RDFS base on a daily basis.¹¹ If an ontology is modified, it is also updated. At the time of writing, the RDFS base is composed of more than 200 000 triples. The ontology used contains 17 classes and 60 properties, and the database gathers more than 6 000 instances. For instance, there are around 2 100 letters, 1 700 instances of the class $Person$ , 700 documents of the class $Article$ and more than 80 identified instances of $ArchivePlace$ .

¹¹ Although Omeka S allows export to different data formats, Turtle is not one of them. The script exports Omeka S data in JSON-LD format, and then converts it to Turtle before updating the RDFS base.
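The JSON-LD to Turtle conversion step can be pictured with the following sketch. It is not the script used on the platform, and it deliberately ignores JSON-LD context expansion: it assumes that each exported item is a flat dictionary with an `@id`, an `@type` and prefixed property keys whose values are lists of `{"@value": ...}` or `{"@id": ...}` objects, and that the prefixes are declared elsewhere in the Turtle file. The item shown, its URLs and the `ahpo:` prefix are hypothetical.

```python
import json


def item_to_turtle(item):
    """Convert one simplified JSON-LD item into a list of Turtle statements."""
    subject = f"<{item['@id']}>"
    types = item.get("@type", [])
    if isinstance(types, str):
        types = [types]
    lines = [f"{subject} a {rdf_type} ." for rdf_type in types]
    for key, values in item.items():
        # Skip JSON-LD keywords and anything that is not a list of values.
        if key.startswith("@") or not isinstance(values, list):
            continue
        for value in values:
            if not isinstance(value, dict):
                continue
            if "@id" in value:
                # Link to another resource.
                lines.append(f"{subject} {key} <{value['@id']}> .")
            elif "@value" in value:
                # json.dumps quotes and escapes strings, and leaves numbers bare.
                literal = json.dumps(value["@value"], ensure_ascii=False)
                lines.append(f"{subject} {key} {literal} .")
    return lines


# Hypothetical exported item.
item = {
    "@id": "http://example.org/api/items/22",
    "@type": ["o:Item", "ahpo:Letter"],
    "o:id": 22,
    "dcterms:title": [{"type": "literal", "@value": "Lettre à Mittag-Leffler"}],
    "ahpo:sentBy": [{"type": "resource", "@id": "http://example.org/api/items/1"}],
}

turtle_lines = item_to_turtle(item)
```

A production script would more likely rely on an RDF library to parse JSON-LD and serialize Turtle, which also handles contexts, datatypes and language tags.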

Fig. 4.

The architecture of the Henri Poincaré website.

On the website, a user can choose between using the basic input search and creating more complex queries with SPARQL. Three SPARQL query editing modes are proposed:

Classical mode

The user can directly write SPARQL queries, within a text area, to access the RDFS database. This allows creating complex queries by taking advantage of the expressiveness of the SPARQL language. But this requires a good understanding of SPARQL syntax, which makes this mode ill-suited to historians of science and people who are not familiar with Semantic Web technologies.

Form-based mode

A form containing a set of inputs is proposed to the user to help him/her build the query. This mode is suitable for all users, as it does not require any specific knowledge.

Graphical mode

A graphical interface is presented to let the user construct a graph corresponding to an inference graph. This mode can be a good compromise because it is not too difficult to grasp while retaining a certain expressiveness when formulating queries. The interface has been created using the D3.js library, which is designed for manipulating documents based on data [3].

It is important to mention that the different blocks shaping the website are invisible to users. They can easily retrieve the data they are interested in by choosing the appropriate search engine. In practice, for daily use, historians and domain experts tend to use the form-based and graphical modes. Figure 4 illustrates the website architecture presented above. For the historians, Omeka S grants access to a specific back-office interface to visualize and update data (collections, vocabularies and content). As the data transfer from Omeka S to the RDFS base is automatic, they do not need to manage this particular task.

Consider a user who wants to express the following informal query using SPARQL:

He/she can use any of the three modes to express the SPARQL query and to find the corresponding letters. Figure 5 shows how this query is edited in the three modes.

Fig. 5.

SPARQL query editing modes available on Henri Poincaré website (http://henripoincare.fr/s/correspondance/page/sparql).

5. Approximate and explained search in the Henri Poincaré correspondence

To be exploitable, the Henri Poincaré correspondence has to be queried. SPARQL querying, with adequate user interfaces, serves this purpose. However, more flexible searches can be useful, which is explained in Section 5.1. The way flexible searches are managed is based on SPARQL query transformations, as explained in Section 5.2. These transformations are based on rules, and a tool for expressing and managing these rules is briefly presented in Section 5.3. This work is implemented and works well, but can be improved: some future improvements are described in Section 5.4.

5.1. Motivating approximate and explained searches

Two reasons why flexible search is useful in the context of the correspondence of Henri Poincaré are presented below and exemplified.

Taking into account vagueness. Consider the following informal query:

$Q$ = “letters sent by Henri Poincaré to Felix Klein at the end of the $19^{th}$ century” (1)

The time period “end of the $19^{th}$ century” is vague, like many notions used by human beings. For example, the years 1876 and 1903 might be considered to belong to this time period.¹² By contrast, there is a large consensus that the year interval $[1890, 1900]$ is a part of this time period. Therefore, if two letters $ℓ_{1}$ and $ℓ_{2}$ have Henri Poincaré as sender and Felix Klein as recipient, $ℓ_{1}$ written in 1892 and $ℓ_{2}$ in 1876, then $ℓ_{1}$ undoubtedly is an answer to $Q$ but $ℓ_{2}$ could also be considered as an answer to $Q$ .

¹² Some historians of science state that the end of the $19^{th}$ century is the year 1905 during which Albert Einstein published three of his most important articles, while some historians of geopolitics consider that the $20^{th}$ century starts in 1914.

One way to handle these kinds of informal queries, where boundaries of notions are imprecise, consists in using the tools of fuzzy set theory: a fuzzy query with a kernel corresponding to the years $[1890, 1900]$ seems appropriate. However, for the examples presented below, fuzzy set theory seems to be inappropriate or, at least, incomplete.
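To make the fuzzy reading concrete, here is a minimal sketch of a trapezoidal membership function for “end of the $19^{th}$ century”. Only the kernel $[1890, 1900]$ comes from the text; the support bounds 1870 and 1914 are arbitrary assumptions chosen for illustration, and other choices would be just as defensible.

```python
def end_of_19th_century(year, support=(1870, 1914), kernel=(1890, 1900)):
    """Degree (between 0 and 1) to which `year` belongs to the vague period."""
    low, high = support
    kernel_low, kernel_high = kernel
    if kernel_low <= year <= kernel_high:
        return 1.0  # full membership inside the kernel
    if low < year < kernel_low:
        return (year - low) / (kernel_low - low)      # rising edge
    if kernel_high < year < high:
        return (high - year) / (high - kernel_high)   # falling edge
    return 0.0  # outside the support


degree_1892 = end_of_19th_century(1892)
degree_1876 = end_of_19th_century(1876)
degree_1903 = end_of_19th_century(1903)
```

With these assumed bounds, a letter from 1892 fully matches, while letters from 1876 and 1903 match only to a partial degree.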

Finding related results. Now, consider a historian of science querying the corpus with the query $Q$ for the letters sent by Felix Klein to Henri Poincaré that are about complex analysis. Such a query can be easily formulated in SPARQL:

The execution of this query gives a set of letters that match exactly the formulated query $Q$ . Now, imagine that the historian wants to go further, to get more letters related to the query $Q$ . This may occur in particular in the following situations:

The set of results is empty or, at least, too small (for the historian accessing the corpus): the historian may desire that the system provides more letters associated to explanations pointing out the mismatch between these letters and the query, something like “This is not a letter about complex analysis but it is about analysis.”

The corpus is incomplete: there exist letters sent or received by Henri Poincaré that do not exist anymore and also, there probably exist such letters that are not yet discovered. However, there may be a letter $ℓ_{reply}$ of the corpus sent by Henri Poincaré to Felix Klein that is a reply to a letter exactly matching $Q$ but that is not in the corpus. Therefore $ℓ_{reply}$ should be interesting for the historian whose initial query was $Q$ despite the fact that $ℓ_{reply} \notin exec (Q, G_{HP})$ .

Different tools and methods have been proposed to deal with the notion of approximation when using SPARQL querying. The f-SPARQL engine [7] is a flexible extension of SPARQL which introduces the use of fuzzy set theory. Several new terms and operators (high, recent, close to, at most, etc.) are proposed. Other approaches try to integrate user preferences within the SPARQL query. The PrefSPARQL extension [14] introduces new operators (highest, lowest, around, more than, etc.) which can be used within SPARQL filter clauses. New clauses are also proposed (preferring and prior to).

Query transformations can be useful for taking into account vagueness and for finding results that are related to the query but do not match it exactly. One method [11] proposes a framework for relaxing a query by analysing the causes of failure of the initial query.

Another line of work related to this issue is the approach of case retrieval in case-based reasoning based on query transformations that has been applied in various application domains with various query languages (e.g., in machine translation [29], organic chemistry synthesis [18] and cooking [9]).

An approach based on the use of transformation rules is explained in the next section.

5.2. SPARQL query transformations

Consider first the informal query $Q$ of equation (1). Let a and b be two integers with $a ⩽ b$ . Let $Q_{a}^{b}$ be the query for the letters sent by Henri Poincaré to Felix Klein between year a and year b:

Let $L_{a}^{b}$ be the set of letters resulting from the execution of the query $Q_{a}^{b}$ on the RDFS graph $G_{HP}$ of the Henri Poincaré correspondence ( $L_{a}^{b} = \{ℓ ∣ (?ℓ, ℓ) \in exec (Q_{a}^{b}, G_{HP})\}$ ). A letter $ℓ_{0} \in L_{1890}^{1900}$ is, under a strong consensus, an answer to the informal query $Q$ . A letter $ℓ_{1} \in L_{1885}^{1900} ∖ L_{1890}^{1900}$ can be accepted as an answer to $Q$ , but it is more debatable. For a letter $ℓ_{2} \in L_{1880}^{1900} ∖ L_{1885}^{1900}$ , this is even more debatable. Thus, the idea is to generate sets of letters, starting from $L_{1890}^{1900}$ and progressively enlarging the interval by steps of 5 years to get results that could be considered as answers to $Q$ but with less and less relevance.

Fig. 6.

A search tree (truncated at depth 2) for an informal query related to the end of the $19^{th}$ century (penalties are below the queries, assuming $cost (r_{past}) = cost (r_{future}) = 1$ ).

To implement this idea, the notion of SPARQL query transformation rules is considered. Let $r_{past}$ and $r_{future}$ be two such rules that are both applicable to a query $Q_{a}^{b}$ and such that their application to this query respectively gives $Q_{a - 5}^{b}$ and $Q_{a}^{b + 5}$ . For example, $Q_{1890}^{1900} \overset{r_{past}}{\to} Q_{1885}^{1900}$ . A search tree can be defined using SPARQL queries as states and these two rules to generate successor states. Figure 6 presents this search tree. To each rule r is associated a transformation cost $cost (r) > 0$ that is assumed to be additive: this cost is used to associate to a query generated by rules a penalty such that if $Q$ has a penalty of π and $Q \overset{r}{\to} Q^{'}$ , then a penalty $π^{'} = π + cost (r)$ is associated to $Q^{'}$ .

Thus, the informal query $Q$ is modeled by an elastic query $(Q_{1890}^{1900}, \{r_{past}, r_{future}\})$ . More generally, an elastic query is an ordered pair $(Q, R)$ where $Q$ is a SPARQL query and R is a finite set of SPARQL query transformation rules. The result of an elastic query is a stream of letters ordered by increasing penalty and resulting from a search in the tree associated to $(Q, R)$ : the states of this tree are SPARQL queries with penalties $(Q, π)$ , the root of this search tree is $(Q, 0)$ , and a successor of a tree node $(Q, π)$ is a node $(Q^{'}, π^{'})$ such that there is a rule $r \in R$ that can be applied on $Q$ with $Q \overset{r}{\to} Q^{'}$ and $π^{'} = π + cost (r)$ .
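The search associated with an elastic query can be sketched as a uniform-cost exploration. In this simplified illustration, a query $Q_{a}^{b}$ is reduced to its pair of years `(a, b)`, the corpus is a hypothetical dictionary from letters to years, and both rules have cost 1 as in Fig. 6. Returning each letter only once, at its lowest penalty, and bounding the search by `max_penalty` are design choices of this sketch rather than features stated in the text.

```python
import heapq
from itertools import count

# Hypothetical corpus: letter identifier -> year of writing.
LETTERS = {"l1": 1892, "l2": 1876, "l3": 1903, "l4": 1886}


def run_query(a, b):
    """Stand-in for exec(Q_a^b, G_HP): letters written between years a and b."""
    return {letter for letter, year in LETTERS.items() if a <= year <= b}


# Each rule maps (a, b) to a new interval and has an additive cost.
RULES = {
    "r_past": (lambda a, b: (a - 5, b), 1.0),
    "r_future": (lambda a, b: (a, b + 5), 1.0),
}


def elastic_search(a, b, max_penalty):
    """Yield (letter, penalty, applied rules) by increasing penalty."""
    tie = count()  # tie-breaker so the heap never compares states
    frontier = [(0.0, next(tie), (a, b), [])]
    seen = set()
    while frontier:
        penalty, _, (qa, qb), path = heapq.heappop(frontier)
        for letter in sorted(run_query(qa, qb) - seen):
            seen.add(letter)
            yield letter, penalty, path
        for name, (apply_rule, cost) in RULES.items():
            if penalty + cost <= max_penalty:
                successor = apply_rule(qa, qb)
                heapq.heappush(
                    frontier,
                    (penalty + cost, next(tie), successor, path + [name]),
                )


answers = list(elastic_search(1890, 1900, max_penalty=3.0))
```

The list of applied rules attached to each answer is what allows an explanation to be shown to the user alongside the letter.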

For the example of the query $Q$ of equation (2), the elastic query $(Q, R)$ can be considered, with R a set of query transformation rules such as the ones defined informally below:

$(r_{GenPred})$

Substitution of the predicate p of a triple by a superproperty q of p (i.e., $O ⊢ ⟨ p subp q ⟩$ ).

$(r_{GenObjInst})$

Substitution of the object o of a triple $⟨ s p o ⟩$ when it is an instance of a class C by a variable instance of this class (i.e., replace this triple by the triples $⟨ s p ?x ⟩$ and $⟨ ?x a C ⟩$ ).

$(r_{exchangeSenderRecipient})$

If the body of a query $Q$ has two triples of the form $⟨ s sentBy o_{1} ⟩$ and $⟨ s sentTo o_{2} ⟩$ then this rule can be applied on it and its application consists in replacing these triples with $⟨ s sentTo o_{1} ⟩$ and $⟨ s sentBy o_{2} ⟩$ .

$(r_{substByColleague})$

If the body of a query $Q$ has a triple $⟨ s p o ⟩$ such that $O ⊢ ⟨ o worksWith c ⟩$ then the rule substitutes this triple by $⟨ s p c ⟩$ . For example, this transforms a query about letters from Felix Klein into a query about letters from a colleague of this mathematician.

It is noteworthy that the first two rules are generalization rules that can be applied to other domains.¹³ Several such rules can be defined (generalization of classes in subject or object, removal of a triple, etc.). The two other rules are not generalization rules and are rules defined for the application domain of a correspondence corpus.

¹³ A generalization rule is a rule r such that if $Q \overset{r}{\to} Q^{'}$ then, for any RDFS graph $G$ , $exec (Q, G) \subseteq exec (Q^{'}, G)$ .
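Two of the rules above can be written as plain functions on a query body, taken here to be a list of triple patterns. This is a hand-coded sketch for illustration only: the property `hasTopic`, the fresh variable name `?x` (assumed not to occur in the query) and the list of type triples are assumptions of the example.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def exchange_sender_recipient(body):
    """r_exchangeSenderRecipient: one transformed body per matching pair of triples."""
    results = []
    for t1 in body:
        for t2 in body:
            if t1[1] == "sentBy" and t2[1] == "sentTo" and t1[0] == t2[0]:
                rest = [t for t in body if t not in (t1, t2)]
                results.append(
                    rest + [(t1[0], "sentTo", t1[2]), (t1[0], "sentBy", t2[2])]
                )
    return results


def gen_obj_inst(body, type_triples, fresh="?x"):
    """r_GenObjInst: replace a constant object by a variable instance of its class."""
    results = []
    for (s, p, o) in body:
        if is_var(o):
            continue
        for (x, pred, cls) in type_triples:
            if x == o and pred == "a":
                rest = [t for t in body if t != (s, p, o)]
                results.append(rest + [(s, p, fresh), (fresh, "a", cls)])
    return results


# Hypothetical query body: letters from Felix Klein to Henri Poincaré about complex analysis.
body = [
    ("?l", "sentBy", "felixKlein"),
    ("?l", "sentTo", "henriPoincaré"),
    ("?l", "hasTopic", "complexAnalysis"),
]
type_triples = [("felixKlein", "a", "Mathematician")]

swapped = exchange_sender_recipient(body)
generalized = gen_obj_inst(body, type_triples)
```

Here `generalized` asks for letters sent by some mathematician rather than by Felix Klein specifically, which can only enlarge the set of answers, whereas `swapped` asks for a different set of letters altogether.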

5.3. SQTRL

SQTRL (SPARQL Query Transformation Rule Language) is a language associated with a tool designed for handling SPARQL query transformations. This tool has been used in various application contexts [4].

An SQTRL transformation rule r is characterized by the following fields:

$name (r)$ : an identifier of the rule by a string;

$context (r)$ : a set of RDFS triples;

$left (r)$ : a set of RDFS triples;

$right (r)$ : a set of RDFS triples;

$cost (r)$ : a positive float;

$explanation (r)$ : a text describing the transformation that may contain blank nodes occurring in the fields $context$ , $left$ and $right$ .

The RDFS triples of such a rule appear in SPARQL syntax.

Given an RDFS graph $G$ , a SPARQL query $Q$ , and an SQTRL rule r, r is applicable on $Q$ (given $G$ ) if $context (r)$ can be bound with the graph $G$ and $left (r)$ can be bound with the body of $Q$ , provided that these bindings are consistent: if $?x$ occurs in both $context (r)$ and $left (r)$ , only the values x such that $(?x, x)$ appears in both bindings are kept. If so, the application of r gives queries $Q^{'}$ such that $left (r)$ is substituted by $right (r)$ , where blank nodes are substituted by their values in the bindings.

For example, consider first the query transformation rule $r_{GenPred}$ , presented in Section 5.2 and corresponding to the generalization of a predicate p by a superproperty q of p. This rule can be described as follows ( $r = r_{GenPred}$ ): $\begin{array}{lcl} name (r) & = & \text{"Generalize a property in predicate position"} \\ context (r) & = & ?p \ subp \ ?q \\ left (r) & = & ?s \ ?p \ ?o \\ right (r) & = & ?s \ ?q \ ?o \\ cost (r) & = & 1.0 \\ explanation (r) & = & \text{"Generalize ?p in ?q"} \end{array}$

The rule $r_{GenPred}$ is a generalization rule that can be applied to many application contexts. As pointed out in Section 5.2, it is possible to add domain-dependent rules. For example, the rule $r = r_{exchangeSenderRecipient}$ can be represented as follows: $\begin{array}{lcl} name (r) & = & \text{"Exchange sender and recipient"} \\ context (r) & = & \text{(empty context)} \\ left (r) & = & ?s \ sentBy \ ?o1 \ . \ ?s \ sentTo \ ?o2 \\ right (r) & = & ?s \ sentTo \ ?o1 \ . \ ?s \ sentBy \ ?o2 \\ cost (r) & = & 1.0 \\ explanation (r) & = & \text{"Exchange sender ?o1 and recipient ?o2"} \end{array}$
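The applicability condition and the substitution step can be sketched generically. The code below is an illustration of the fields and of the binding mechanism described above, not the SQTRL tool itself: rules are Python dictionaries, the context is matched against the graph, the left part against the query body, and the right part is instantiated with the resulting binding. The superproperty `involves` is a hypothetical name, and terms of the query body are treated as opaque values, so rule variables are assumed not to clash with query variables.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def match(patterns, triples, binding):
    """Yield (binding, matched triples), each pattern matching a distinct triple."""
    if not patterns:
        yield binding, []
        return
    for triple in triples:
        b = dict(binding)
        if all((b.setdefault(p, v) == v) if is_var(p) else (p == v)
               for p, v in zip(patterns[0], triple)):
            rest = [t for t in triples if t != triple]
            for b2, used in match(patterns[1:], rest, b):
                yield b2, [triple] + used


def apply_rule(rule, body, graph):
    """Return the (new body, explanation) pairs produced by the rule."""
    results = []
    for ctx_binding, _ in match(rule["context"], graph, {}):
        # The left part must be bound consistently with the context binding.
        for binding, used in match(rule["left"], body, ctx_binding):
            new_triples = [tuple(binding.get(t, t) for t in tr)
                           for tr in rule["right"]]
            new_body = [t for t in body if t not in used] + new_triples
            explanation = rule["explanation"]
            # Longest variable names first, so that ?o does not clobber ?o1.
            for var in sorted(binding, key=len, reverse=True):
                explanation = explanation.replace(var, str(binding[var]))
            results.append((new_body, explanation))
    return results


r_gen_pred = {
    "name": "Generalize a property in predicate position",
    "context": [("?p", "subp", "?q")],
    "left": [("?s", "?p", "?o")],
    "right": [("?s", "?q", "?o")],
    "cost": 1.0,
    "explanation": "Generalize ?p in ?q",
}
r_exchange = {
    "name": "Exchange sender and recipient",
    "context": [],
    "left": [("?s", "sentBy", "?o1"), ("?s", "sentTo", "?o2")],
    "right": [("?s", "sentTo", "?o1"), ("?s", "sentBy", "?o2")],
    "cost": 1.0,
    "explanation": "Exchange sender ?o1 and recipient ?o2",
}

graph = [("sentBy", "subp", "involves")]  # hypothetical ontology triple
body = [("?l", "sentBy", "felixKlein"), ("?l", "sentTo", "henriPoincaré")]

generalized = apply_rule(r_gen_pred, body, graph)
exchanged = apply_rule(r_exchange, body, graph)
```

Each result pairs a transformed body with its instantiated explanation, which is the information needed to present an approximate answer together with the reason for the mismatch.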

An XML syntax has been chosen to properly define a transformation rule. As an example, Fig. 7 illustrates the XML syntax associated with the rule $r_{GenObjInst}$ corresponding to the generalization of an object instance.

Fig. 7.

An example of SQTRL rule in XML syntax.

The SQTRL tool internally reuses the Corese engine [8]. Such an engine makes it possible to query Semantic Web data stored as RDF(S) files.

5.4. Future work on approximate and explained search

The current version of SQTRL has some limitations that need to be overcome. This section lists some of them and points out future studies for addressing them.

The first limitation is related to the costs associated to a rule: such a cost is a constant, whereas it is sometimes more relevant to have costs that depend on the bindings of anonymous resources occurring in the rule (in the fields $context$ , $left$ and $right$ ) at rule application time. For instance, the cost of generalizing a class C into a class D may depend on the “generalization leap” from C to D.
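One possible reading of a binding-dependent cost, given purely as an illustration and not as the authors' design, is to make the cost proportional to the number of direct $subc$ steps between the two classes. The class hierarchy and the `unit_cost` parameter below are hypothetical.

```python
from collections import deque


def generalization_leap(subclass_of, c, d):
    """Number of direct subc edges on a shortest path from c up to d, or None."""
    queue = deque([(c, 0)])
    visited = {c}
    while queue:
        cls, distance = queue.popleft()
        if cls == d:
            return distance
        for parent in subclass_of.get(cls, ()):
            if parent not in visited:
                visited.add(parent)
                queue.append((parent, distance + 1))
    return None  # d is not a superclass of c


def generalization_cost(subclass_of, c, d, unit_cost=1.0):
    """Cost growing with the leap; raises if d does not generalize c."""
    leap = generalization_leap(subclass_of, c, d)
    if leap is None:
        raise ValueError(f"{d} is not a superclass of {c}")
    return unit_cost * leap


# Hypothetical direct-superclass relation.
SUBCLASS_OF = {
    "Mathematician": ["Scientist"],
    "Physicist": ["Scientist"],
    "Scientist": ["Person"],
}

small_leap = generalization_cost(SUBCLASS_OF, "Mathematician", "Scientist")
large_leap = generalization_cost(SUBCLASS_OF, "Mathematician", "Person")
```

Under this assumption, generalizing from a mathematician to any person is penalized twice as much as generalizing to any scientist.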

The second limitation is linked with the unnecessary applications of some rules given the rules already applied in the same branch of the search tree. For example, it is not necessary to apply the rule $r_{exchangeSenderRecipient}$ twice in sequence on a query $Q$ , since it leads back to $Q$ . Another example is linked with the rules $r_{past}$ and $r_{future}$ : it is unnecessary to apply both in the same branch since the results they add are already generated by queries with a lower or equal cost in the search tree. Therefore, in Fig. 6, the two nodes labelled by $Q_{1885}^{1905}$ are both useless (the letters answering this query are in the union of the answers to queries $Q_{1885}^{1900}$ and $Q_{1890}^{1905}$ that are generated at a higher level in the tree). The objective of a future work is thus to avoid such unnecessary composition of rules, which would have a positive impact both on the computing time and (more importantly) on the load on the user, who would otherwise have to examine generated results appearing several times with equivalent (but not necessarily equal) explanations.
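The redundancy of combining $r_{past}$ and $r_{future}$ can be checked directly on the year intervals. As a short worked step, assuming integer years, $a ⩽ b$ , unit costs as in Fig. 6, and $L_{a}^{b}$ being the letters dated within $[a, b]$ :

```latex
[a - 5,\, b] \cup [a,\, b + 5] = [a - 5,\, b + 5]
\quad\Longrightarrow\quad
L_{a-5}^{b+5} = L_{a-5}^{b} \cup L_{a}^{b+5},
\qquad
\underbrace{2}_{\text{penalty of } Q_{a-5}^{b+5}}
\;>\;
\underbrace{1}_{\text{penalty of } Q_{a-5}^{b} \text{ and of } Q_{a}^{b+5}}
```

With $a = 1890$ and $b = 1900$ , this gives $L_{1885}^{1905} = L_{1885}^{1900} \cup L_{1890}^{1905}$ , so every answer of the penalty-2 query has already been returned at penalty 1.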

The third limitation is related to the filter statements of the SPARQL queries. Currently, only filter statements representing intervals (e.g., those handled by $r_{past}$ and $r_{future}$) are handled by SQTRL rules. More complex filter statements (involving, e.g., negations or disjunctions) are not considered yet. Taking them into account is a complex research direction, since complex filter statements are propositionally closed, which implies that a purely syntactic handling is not sufficient if query transformation rules are considered at a semantic level (i.e., if the execution of two queries $Q$ and $Q^{'}$ gives the same results on any RDFS graph, then a SQTRL rule should be applicable in the same manner to $Q$ and to $Q^{'}$).

The fourth limitation is more domain-dependent since it is related to the notion of time in history. For example, the rules $r_{past}$ and $r_{future}$ enlarge time intervals by steps of 5 years. By contrast, a historian often considers the time scale according to milestones (e.g., the year 1914 for political history and the year 1905 for history of physics). Therefore, using such milestones in time interval transformation rules, as well as their significance in the context of the search, is a challenging future work. For this purpose, ontologies for history (in particular Data for History, http://ontome.dataforhistory.org/) should be useful.
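The difference between the two kinds of enlargement can be sketched as follows. The milestone list is hypothetical (only 1905 and 1914 are mentioned above) and the function names are illustrative; how milestones should be selected and weighted according to the context of the search is precisely the open question.

```python
import bisect

# Hypothetical milestone years, sorted increasingly.
MILESTONES = [1870, 1889, 1905, 1914]


def enlarge_past(start, end, step=5):
    """Fixed-step enlargement towards the past, in the manner of r_past."""
    return (start - step, end)


def enlarge_past_to_milestone(start, end, milestones=MILESTONES):
    """Move the lower bound to the closest milestone strictly before it.

    The interval is left unchanged when no earlier milestone exists.
    """
    i = bisect.bisect_left(milestones, start)
    if i == 0:
        return (start, end)
    return (milestones[i - 1], end)
```

For the interval 1890 to 1900, the first function gives 1885 to 1900, whereas the second one gives 1889 to 1900 with the milestones assumed here.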

6. A tool for efficient editing of indexing triples

Currently, the Henri Poincaré correspondence is indexed in a satisfactory way by RDFS files. However, new properties may emerge from research in history that would require additional annotation work and, more importantly, there exist other correspondence corpora that are not yet annotated and that are of interest to colleagues in the history of science. This justifies the development of an RDF triple editing tool for the purpose of corpus annotation. The RDF graph $G$ is partitioned into an ontology $O$, which is assumed to be given, and a set of triples $D$, which has to be edited with this tool. The development of this tool is ongoing work that is described below in four parts: the editing tool, the use of deductive inferences in RDFS for improving the efficiency of this tool, the use of hypothetical inferences on RDFS for the same purpose using case-based reasoning principles, and a first evaluation of this work. Finally, some research directions on this issue are presented.

6.1. RDFWebEditor4Humanities: An editor for indexing corpora

Fig. 8.

A screenshot of RDFWebEditor4Humanities.

The indexing work done by historians of science for the Henri Poincaré correspondence was mainly done using the user interface of Omeka S: the RDF files are generated by translating the information edited via Omeka S (in a SQL server) to RDF, as described before, in Fig. 4. It was decided to implement an RDF triple editor prototype for the new annotation tool in order to benefit from the RDF(S) infrastructure.

Several tools and methods have been proposed to assist users in RDF data editing. Protégé [24] is a frequently used tool that benefits from a user-friendly interface but lacks real assistance for finding resources and editing new triples. Several methods use language processing to let users edit RDF databases in a simpler way. For example, the GINO editing tool [2] introduces a guided and controlled natural language to define statements. Another example is the CLOnE language [12], which provides a method for editing data without any knowledge of Semantic Web standards.

Figure 8 presents a screenshot of the developed interface. An autocomplete mechanism has been implemented, which assists users in writing the values of a triple. Applying such a filter is particularly relevant when working with a base containing a large number of triples. The set of potential values for a field is ranked in alphabetical order.

The editor also displays the editing context. When editing a correspondence, the context is a letter together with the already edited triples about this letter. If the current letter is $letter44$, the context is the set of triples $⟨ letter44 p o ⟩$ where p and o result from the execution of the query select ?p ?o where {letter44 ?p ?o}.

When editing a triple for which only one of the three values is missing, the tool avoids proposing a value that would form an already known triple. For example, when the subject s and the predicate p are already edited, the values o resulting from the query select ?o where {s p ?o} are not proposed.
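The behaviour described in this section can be sketched on a graph represented as a plain set of triples. The toy data, the resource names other than $letter44$ and the function names are hypothetical, and the sketch does not reflect the actual implementation of RDFWebEditor4Humanities, which relies on SPARQL queries.

```python
# Hypothetical toy graph: a set of (subject, predicate, object) triples.
GRAPH = {
    ("letter44", "sentBy", "henriPoincare"),
    ("letter44", "sentTo", "gostaMittagLeffler"),
    ("letter22", "topic", "fuchsianFunc"),
}


def context(graph, subject):
    """Pairs (p, o) answering: select ?p ?o where {subject ?p ?o}."""
    return sorted((p, o) for (s, p, o) in graph if s == subject)


def propose_objects(graph, subject, predicate, candidates, prefix=""):
    """Candidate objects for a question with known subject and predicate.

    Values o such that the triple (subject, predicate, o) is already known
    are not proposed. The remaining values are filtered by the autocomplete
    prefix and ranked in alphabetical order.
    """
    known = {o for (s, p, o) in graph if s == subject and p == predicate}
    return sorted(c for c in candidates
                  if c not in known and c.startswith(prefix))
```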

The example of Fig. 8 displays the context of the letter that is currently being edited. This letter is identified as item no. 9866 of the Henri Poincaré database and its context contains four triples.

6.2. Indexing support using RDFS deduction

Consider an expert indexing the Henri Poincaré correspondence: $D$ is the set of triples already edited at the current time. At a given time, an editing question is raised, which is composed of (1) a triple with 1, 2 or 3 missing values, denoted by anonymous resources, and (2) a mark on the field that the editor wants to fill, denoted by framing the highlighted anonymous resource. For example, the editing question $⟨ letter44 \boxed{?p} ?o ⟩$ corresponds to the state in which the editor has already edited the subject of the triple but neither its predicate nor its object, and now wants to edit its predicate. The editing assistance proposed here essentially aims at re-ranking the set of potential values for the highlighted field, the objective being that the rank of the value chosen by the expert is, on average, lower than the rank given by the alphabetical order. For this purpose, some features about the missing values are deduced from $G = O \cup D$, and the more features a potential value has, the closer to the top of the list it is placed. In the following, given a value c in the set of candidate values for the highlighted field, $c.\#f$ denotes the number of features of c.

For example, consider the editing question $⟨ letter44 sentTo \boxed{?o} ⟩$ and $G = G_{ex}$ from Fig. 3. Once $?o$ is replaced by a value o, it can be deduced that o is an instance of $Agent$: $G \cup \{⟨ letter44 sentTo o ⟩\} ⊢ ⟨ o a Agent ⟩$. The set of candidate values for $?o$ is $Res(G)$. For such a candidate value c, two situations may occur:

$S^{+}$: either $G ⊢ ⟨ c a Agent ⟩$;

$S^{?}$: or $G ⊬ ⟨ c a Agent ⟩$, i.e., it cannot be deduced from $G$ that c is an instance of $Agent$.

Let $C^{+}$ (resp., $C^{?}$) be the set of $c \in Res(G)$ in situation $S^{+}$ (resp., $S^{?}$). If $c_{1} \in C^{+}$ and $c_{2} \in C^{?}$, all other things being equal, then $c_{1}.\#f = c_{2}.\#f + 1$.

Fig. 9.

A screenshot of RDFWebEditor4Humanities using RDFS deduction to rank the set of potential values.

Fig. 10.

Editing question types and associated features of candidate values.

This editing question is of the type $⟨ s p \boxed{?o} ⟩$. There are 12 editing question types (cf. the first column of Fig. 10). For an editing question of the type $⟨ s p \boxed{?o} ⟩$, the knowledge about the ranges of p can be used (there may be several ranges for a property, each of them providing an additional constraint on the value for $?o$). Let R be such a range (i.e., $G ⊢ ⟨ p range R ⟩$). If R is a datatype then the value for $?o$ has to be a literal of this datatype. Otherwise, the value for $?o$ is either new or has to be selected in $Res(G)$. Therefore, for each $c \in Res(G)$, if $G ⊢ ⟨ c a R ⟩$ then $c.\#f$ is incremented. This can be computed by executing the query select ?c where {?c a R} on $G$: if $(?c, c)$ belongs to the result, $c.\#f$ is incremented. This is called $rangePred$ in Fig. 10. For editing questions of the type $⟨ ?s p o ⟩$, the idea is similar, except that $domain$ is used instead of $range$ (called $domainPred$).

Consider an editing question of the type $⟨ s \boxed{?p} o ⟩$. First, the value to be edited for $?p$ has to be a property, so a candidate value c that is a property (written $G ⊢ ⟨ c a rdf:Property ⟩$) has its feature count $c.\#f$ incremented (this is called $predProperty$ in Fig. 10). The value of s can also be used: if s is an instance of a class D then a candidate value c having D as domain (i.e., $G ⊢ ⟨ c domain D ⟩$) has its $c.\#f$ incremented (called $subjectInDomain$). This can be computed by executing the query select ?c where {s a ?D . ?c domain ?D} on $G$. Symmetrically, the value of o can be used (with $range$ instead of $domain$): this is called $objectInRange$ in Fig. 10.

For editing questions of the type $⟨ \boxed{?s} ?p o ⟩$, the feature count $c.\#f$ of a candidate subject c is increased each time c is in the domain of a property whose range contains o, which corresponds to the query select ?c where {?c a ?D . ?p domain ?D . ?p range ?R . o a ?R}.

For the editing questions of the type $⟨ s ?p \begin{matrix} ?o \end{matrix} ⟩$ , a similar idea applies (with exchange of $domain$ and $range$ ). For both types of editing questions, this is denoted by subjImRel(subject-image relationship) in Fig. 10.

For editing questions of the type $⟨ s \boxed{?p} ?o ⟩$ (resp., $⟨ ?s \boxed{?p} o ⟩$), the editing of a value for $?p$ is similar to that for $⟨ s ?p o ⟩$, except that $range$ (resp., $domain$) is not used. For $⟨ ?s \boxed{?p} ?o ⟩$, this also applies, but without the use of $domain$ and $range$ (only the fact that the value of $?p$ should be a property is used).

For the other possibilities no re-ranking is provided in the current version of the tool.

Figure 9 illustrates the use of RDFS deduction in the editing tool. The editing question is the same as the one presented in Fig. 8. This is a question of the type $⟨ s \boxed{?p} o ⟩$, where s is a $Letter$ ($letter9866$) and o is a $Person$ ($thomasCraig$). Imagine that the user wants to define $thomasCraig$ as the sender of $letter9866$. The question is $⟨ letter9866 \boxed{?p} thomasCraig ⟩$. The graph contains the following triples: $⟨ sentBy domain Letter ⟩$, $⟨ sentTo domain Letter ⟩$, $⟨ sentBy range Person ⟩$, $⟨ sentTo range Person ⟩$, $⟨ letter9866 a Letter ⟩$, and $⟨ thomasCraig a Person ⟩$.

As the class $Letter$ is a $domain$ of the properties $sentBy$ and $sentTo$, the counts associated with these two properties are incremented. Moreover, the class $Person$ is a $range$ of the two properties, which also increases their counts by one. The property $sentBy$ is now the first value proposed by the tool, whereas it was in third position with the basic editor. This version of the tool thus provides a better ranking in this situation. It is important to point out that applying RDFS deduction does not add any noticeable delay to the use of the tool.
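The feature counting for a question of the type $⟨ s ?p o ⟩$ can be sketched as follows on the triples of this example. The sketch makes two simplifying assumptions: the graph is a plain set of triples that is searched directly (a real implementation would query an RDFS engine so that entailed triples are also taken into account), and the property `archivedAt` with its range `Archive` is a hypothetical addition used only to show a candidate with fewer features.

```python
# The six triples of the example, the typing of the properties, and one
# hypothetical extra property (archivedAt) not taken from the corpus.
GRAPH = {
    ("sentBy", "domain", "Letter"), ("sentTo", "domain", "Letter"),
    ("sentBy", "range", "Person"), ("sentTo", "range", "Person"),
    ("letter9866", "a", "Letter"), ("thomasCraig", "a", "Person"),
    ("sentBy", "a", "rdf:Property"), ("sentTo", "a", "rdf:Property"),
    ("archivedAt", "a", "rdf:Property"),
    ("archivedAt", "domain", "Letter"), ("archivedAt", "range", "Archive"),
}


def classes_of(graph, resource):
    """Classes explicitly asserted for a resource (no RDFS closure here)."""
    return {o for (s, p, o) in graph if s == resource and p == "a"}


def feature_counts(graph, s, o):
    """Feature count of every resource of the graph as a value for ?p."""
    candidates = {r for triple in graph for r in triple}
    counts = {}
    for c in candidates:
        f = 0
        if (c, "a", "rdf:Property") in graph:
            f += 1  # predProperty
        if any((c, "domain", d) in graph for d in classes_of(graph, s)):
            f += 1  # subjectInDomain
        if any((c, "range", r) in graph for r in classes_of(graph, o)):
            f += 1  # objectInRange
        counts[c] = f
    return counts


def rank_candidates(graph, s, o):
    """Candidates by decreasing feature count, ties in alphabetical order."""
    counts = feature_counts(graph, s, o)
    return sorted(counts, key=lambda c: (-counts[c], c))
```

On this toy graph, `sentBy` and `sentTo` both get three features and are proposed first, followed by the hypothetical `archivedAt` with two features.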

6.3. Indexing support using case-based reasoning

The work presented in this section is loosely inspired by the research on the UTILIS system, which assists users in updating RDFS graphs [15], and by the case-based reasoning methodology.

Case-based reasoning (CBR [27]) aims at solving problems with the help of a case base, where a case is a representation of a problem-solving episode. The target problem is the problem currently under resolution. A source case is an element of the case base. The reasoning process is usually composed of several steps: (1) retrieval chooses one or several source cases judged similar to the target problem, (2) adaptation modifies, if necessary, the retrieved case, (3) storage adds the newly formed case, possibly after a correction from the user. In this ongoing work, only retrieval is considered (adaptation is currently a mere copy and storage is quite straightforward).

In this application framework, a problem is an editing problem, defined by an editing question (as in the previous section), and the context of the problem (as defined in Section 6.1). For example, consider the following target problem for the running example developed below:

A solution is a plausible value for the highlighted field in the editing question of the problem. The set of edited data, $D$ , is used as a case base. It can be noticed that $D$ is not a set of problem-solution pairs, but appears as a whole in which such pairs can be “cut out”. Such situations occur for other CBR systems, as depicted already in [27].

The retrieval step aims at extracting cases from $D$. This gives candidate solutions that are proposed for editing and ranked according to their similarity to the target problem (the more similar the cases are to the target problem, the higher in the list they are proposed). For example, consider $D = D_{ex}$ from Fig. 3. Both letter instances $letter22$ and $letter33$ have topics associated with them: $G_{ex} ⊢ ⟨ letter22 topic fuchsianFunc ⟩$ and $G_{ex} ⊢ ⟨ letter33 topic complexAnalysis ⟩$. Both topics can be proposed as candidate values, but which one should be proposed before the other? The answer is based on the similarity between the letters of $D$ and the letter of the target problem: if $letter22$ is more similar to $letter44$ than $letter33$ is, then the value $fuchsianFunc$ is proposed before the value $complexAnalysis$.

So, the question is how to assess the similarity between a letter and the target problem. The answer proposed for the UTILIS system cited above is based on query relaxation. For RDFWebEditor4Humanities, this idea is reused, with query relaxation replaced by SQTRL query transformation (cf. Section 5). More precisely, the target problem $tgt$ is transformed into a SPARQL query $Q_{tgt} = SELECT var WHERE \{body\}$, where $var$ is the variable highlighted in the editing question ($var = ?o$ in the running example) and $body$ consists of the triple of the editing question and the triples of the context, with the current letter ($letter44$ in the example) replaced by a variable $?\ell$. This gives, for the running example:

Then retrieval amounts to an approximate search, as described in Section 5, taking $Q_{tgt}$ as an initial query and a set of SQTRL rules. More precisely, the search tree of root $Q_{tgt}$ is developed with a given maximal transformation cost and then, for each $Q$ of this tree associated with a transformation cost $t c$ , the execution of $Q$ on the edited graph gives a (possibly empty) set of values v (in the example, values for $?o$ ). These values v are then ranked according to the increasing value of the transformation cost $t c$ , which provides the list of potential values for the expert that indexes the letters.

With the running example, consider only the SQTRL rules $r_{GenPred}$ and $r_{GenObjInst}$ , defined in Section 5.2, with a cost of 1 for both. In the search tree of root $Q_{tgt}$ , the query $Q_{1}$ is generated at depth 2 (corresponding to 2 applications of $r_{GenPred}$ on $Q_{tgt}$ ):

The execution of $Q_{1}$ on $D_{ex}$ gives $\{(?o, fuchsianFunc)\}$. At depth 3, the query $Q_{2}$ is generated (corresponding to 1 application of $r_{GenObjInst}$ on $Q_{1}$):

The execution of $Q_{2}$ on $D_{ex}$ gives $\{(?o, fuchsianFunc), (?o, complexAnalysis)\}$.

Finally, the transformation cost for obtaining $fuchsianFunc$ is $tc = \min\{2, 3\} = 2$ and the one for $complexAnalysis$ is $tc = 3$: the former is proposed before the latter in the list proposed during the editing process.
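The final ranking step of this retrieval can be sketched as follows, assuming that the search tree has already been developed and that each query has been executed. The function name and the input format (a list of pairs made of a transformation cost and the set of values returned by the corresponding query) are hypothetical.

```python
def rank_by_transformation_cost(executions):
    """Rank values by the minimal cost of a query that returns them.

    executions is an iterable of (tc, values) pairs, one per query of the
    search tree. Returns a list of (value, minimal tc) pairs sorted by
    increasing cost, ties being broken alphabetically.
    """
    best = {}
    for tc, values in executions:
        for v in values:
            if v not in best or tc < best[v]:
                best[v] = tc
    return sorted(best.items(), key=lambda item: (item[1], item[0]))


# Results of Q_1 (cost 2) and Q_2 (cost 3) in the running example.
running_example = [
    (2, {"fuchsianFunc"}),
    (3, {"fuchsianFunc", "complexAnalysis"}),
]
print(rank_by_transformation_cost(running_example))
# prints [('fuchsianFunc', 2), ('complexAnalysis', 3)]
```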

This work and its combination with the work of Section 6.2 are still in their early stages, so they have not been evaluated yet and are thus not considered in the next section.

6.4. A first evaluation

A first evaluation has been carried out that is fully automated, though it is acknowledged that a complete evaluation of an editing tool has to involve human editors.

For this evaluation, the baseline is the editor tool using only alphabetical order to rank the values (cf. Section 6.1) and it is compared to the editing prototype using RDFS entailment (cf. Section 6.2). The autocomplete mechanism has been disabled for the automatic evaluation.

Given an editing question $e q$ , the measure used in this evaluation is the rank $rank (e q)$ of the value chosen by the editor in the list of the values proposed by the tool: the lower $rank (e q)$ is, the better the tool is.

The test set is built on the basis of the current RDF graph $G_{HP} = O_{HP} \cup D_{HP}$ that indexes the Henri Poincaré correspondence. More precisely, the ontology $O_{HP}$ (that imports other ontologies, such as FOAF) is given and the edition of the data in $D_{HP}$ is simulated as follows:

(1) Initially, $D = \emptyset$ and $Ranks$ is the empty multiset.

(2) For each triple $⟨ s p o ⟩ \in D_{HP}$, considered in a random order:

(2.1) For each of the 3 fields subject, predicate and object, considered in a random order:

(2.1.1) Let $eq$ be the editing question corresponding to this field;

(2.1.2) Let $rank(eq)$ be the rank of the edited value (i.e., s, p or o) in the list of the proposed values given by the evaluated editing tool using $G = O_{HP} \cup D$;

(2.1.3) Add $rank(eq)$ to $Ranks$;

(2.2) Add $⟨ s p o ⟩$ to $D$.

At the end $D = D_{HP}$ and $Ranks$ is the multiset of the ranks. The average and standard deviation of the elements of $Ranks$ are computed.
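The simulation described above can be sketched as follows. Several points are assumptions of this sketch and not statements about the actual evaluation: the fields of the current triple that have not been edited yet are treated as missing values in the editing question, a value that does not appear in the proposed list is given a rank one past the end of the list, and the population standard deviation is reported. The function names and the ranker signature are hypothetical.

```python
import random
import statistics


def alphabetical_ranker(graph, question, field, candidates):
    """Baseline of Section 6.1: candidates in alphabetical order."""
    return sorted(candidates)


def simulate(ontology, data, ranker, seed=0):
    """Replay the editing of data, triple by triple, and collect ranks.

    ontology and data are sets of (s, p, o) triples of strings. Returns
    the average and the standard deviation of the collected ranks.
    """
    rng = random.Random(seed)
    triples = sorted(data)
    rng.shuffle(triples)
    edited = set()  # D, initially empty
    ranks = []      # the multiset Ranks
    for triple in triples:
        graph = ontology | edited  # G is the union of the ontology and D
        candidates = {r for t in graph for r in t}
        fields = [0, 1, 2]
        rng.shuffle(fields)
        known = set()
        for field in fields:
            # Fields not edited yet are missing values (None).
            question = tuple(x if i in known else None
                             for i, x in enumerate(triple))
            proposed = ranker(graph, question, field, candidates)
            value = triple[field]
            if value in proposed:
                ranks.append(proposed.index(value) + 1)
            else:
                ranks.append(len(proposed) + 1)  # assumption for new values
            known.add(field)
        edited.add(triple)
    return statistics.mean(ranks), statistics.pstdev(ranks)
```

A ranker using RDFS deduction can be evaluated with the same function by passing it in place of `alphabetical_ranker`, which makes the two versions of the editor directly comparable on the same simulated editing sequence when the same seed is used.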

The results are presented in Fig. 11.

Fig. 11.

The average and the standard deviation of the values $rank (e q)$ , for editing questions $e q$ simulated from the Henri Poincaré correspondence RDF graph, for two versions of the editor.

Although the evaluation has been carried out on a subset of the Henri Poincaré correspondence, this subset was constructed in order to be representative of the whole correspondence. What emerges from these results is that the use of RDFS deduction yields a better overall ranking and could thus effectively assist historians during the indexing work.

6.5. Future work on assisted indexing

It is noteworthy that the tool presented here is still a prototype needing some technical and ergonomic improvements that are necessary before carrying out the complete evaluation involving human users.

The CBR approach presented in Section 6.3 is not fully implemented (and not evaluated) and this constitutes the first future work.

The combination of the deductive and CBR approaches for editing also has to be investigated. In particular, the following hypothesis is made: the editing improvements brought by these methods are largely independent of each other, so a well-designed combination of them should give interesting results. One promising way to combine them is to apply one of these methods and then to break ties using the other method. More sophisticated combination techniques can also be investigated.
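The tie-breaking combination mentioned above can be sketched as a lexicographic sort. Here the deductive feature count is applied first and the transformation cost of the CBR approach breaks the ties; the opposite order is equally conceivable. The function name and the dictionary-based inputs are hypothetical.

```python
def combine(candidates, feature_count, transformation_cost):
    """Rank by decreasing feature count, then by increasing cost.

    feature_count maps a candidate to its number of features (0 if absent).
    transformation_cost maps a candidate to its minimal transformation
    cost; candidates not retrieved by the CBR approach come last among
    ties. Remaining ties are broken alphabetically.
    """
    inf = float("inf")
    return sorted(
        candidates,
        key=lambda c: (-feature_count.get(c, 0),
                       transformation_cost.get(c, inf),
                       c),
    )
```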

Another future work is related to the deductive approach and can be illustrated by the example of the editing question $⟨ letter44 \boxed{?p} ?o ⟩$ introduced in Section 6.2. Two types of situations were distinguished for a candidate value c for $?o$: $S^{+}$ and $S^{?}$. A third type of situation can be considered, provided that the representation language is extended with class complement (as in OWL DL and many of its fragments):

$S^{-}$: $G ⊢ ⟨ c a \neg Agent ⟩$,

where $\neg Agent$ represents the entities that are not agents, $G$ is the graph in its current state and c is a candidate value for $?o$. For example, if $G$ entails that the classes $Letter$ and $Agent$ are disjoint, then any c such that $G ⊢ ⟨ c a Letter ⟩$ is in situation $S^{-}$. For such a c, it is certain that it is not a potential value for $?o$, and it can then be removed from the list of candidate values proposed to the expert. More generally, using a logic more expressive than RDFS that contains some sort of negation would make it possible to remove elements from the list of candidates. The drawback of this extension is the computation time required by this more expressive logic, which is a particularly sensitive issue when dealing with a user interface. Thus, an option to be investigated would be to make offline deductions in this more expressive logic and to use the deduced data online in order to point out elements that cannot be correct values for a given editing question.
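The offline/online split suggested above can be sketched as follows, under strong simplifications: class membership is given by an explicit dictionary, disjointness by explicit pairs of classes, and no OWL reasoner is involved. All function names are hypothetical, and the typing of the toy resources is assumed for illustration.

```python
def precompute_forbidden(types, required_class, disjoint_pairs):
    """Offline step: resources known to be outside required_class.

    types maps each resource to the set of its classes, and
    disjoint_pairs is a collection of pairs of disjoint classes. A resource
    is forbidden when one of its classes is disjoint from required_class.
    """
    excluded = set()
    for a, b in disjoint_pairs:
        if a == required_class:
            excluded.add(b)
        if b == required_class:
            excluded.add(a)
    return {c for c, classes in types.items() if classes & excluded}


def filter_candidates(candidates, forbidden):
    """Online step: drop the forbidden candidates, keeping the order."""
    return [c for c in candidates if c not in forbidden]


# Toy data: Letter and Agent are disjoint, as in the example above.
TYPES = {
    "letter22": {"Letter"},
    "letter33": {"Letter"},
    "thomasCraig": {"Person"},
}
FORBIDDEN = precompute_forbidden(TYPES, "Agent", [("Letter", "Agent")])
```

Only the set `FORBIDDEN` has to be consulted while the user is editing, so the cost of the more expressive inferences is paid before the interactive session.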

A last potential future work would be to integrate this work into an existing RDF editing tool such as Neologism [1], Protégé [24] or Vitro [19].

7. Conclusion and future work

The Henri Poincaré correspondence is a corpus gathering letters received and sent by this famous scientist. This article has presented the application of Semantic Web technologies to this corpus. Before these technologies were used, the letters were digitized, transcribed in plain text and associated with metadata within the content management system Omeka S.

Then, a script translating this information into RDFS was implemented, which made the corpus accessible via an RDF endpoint using the vocabularies of standard ontologies as well as the ontology developed for this corpus. SPARQL querying thus became possible, and three user interfaces were developed for this purpose: a classical interface giving access to full SPARQL querying, a form-based interface better suited to the habits of the users, and an interface using a graphical view.

However, SPARQL querying is sometimes insufficient when approximate searches are needed. This is the case when vague notions have to be taken into account in the query (e.g., “the end of the $19^{th}$ century”) or when results that are somehow related to the initial query can be of interest to the user. For this purpose, a mechanism based on the notion of elastic query has been designed. The result of the execution of an elastic query is a stream ordered by increasing cost, each result being associated with explanations of the mismatch between the SPARQL query and this result.

Currently, the indexing of the corpus is done via Omeka S. It is planned to do it directly in RDFS and an editing tool is currently under study. A prototype has been developed in which the user edits each element of a triple by selecting a resource in a list (or creating a new resource or a literal). This prototype has been improved by the use of RDFS entailment in order to propose most promising values first. Another (potential) improvement of this prototype using case-based reasoning principles has been presented and is an ongoing work.

These contributions have different degrees of maturity and have been presented from the most mature one (which is currently in use) to the least mature one (which is ongoing work). To the best of our knowledge, there exists no other work that uses the methods presented in this article in the context of cultural heritage. Future work has been planned for approximate and explained search and for assisted indexing of triples (cf. Sections 5.4 and 6.5) and is not recalled here. The first ongoing work consists in putting all these contributions into practice so that they all come into routine use.

Another future work will aim at dealing with the uncertainty and imprecision of some information about the letters and, more specifically, about their dates. When the date is written in the preamble of a letter, it is reasonable to consider it as its writing date, but in many letters this information is missing. However, an imprecise date can sometimes be given, e.g., because the letter relates to an event (a historical event or another letter that is well dated). In such situations, the precise date is often difficult to know, so an imprecise date (e.g., giving only the month and the year) is associated with the letter. Sometimes, a precise yet hypothetical date can be inferred by historians, e.g., when the letter refers to something that is interpreted as a historical event. Dealing with imprecise and/or uncertain dates is a challenging issue that requires both careful ontology modeling and inference mechanisms. Another challenge related to the ontology is the representation within it of mathematical formulae. Historians of science would benefit from the implementation of a search engine for mathematical content. Different methods have already been investigated [10] and should be considered in order to propose an integration in the framework of this research.

In this paper, the representation formalism for the ontologies and for the data does not go beyond RDFS. However, in order to carry out some of the future studies described in this conclusion as well as in other sections of the paper, the use of larger fragments of OWL is likely to be required. For instance, in Section 6.5, it was shown how the use of negation can improve the editor using entailment. It was also noted in the same section that the increase in computing time potentially caused by a more expressive formalism can be detrimental to the practical use of such a system. Some offline deductions may be used to alleviate the online computing burden.

Although this work has been carried out for the Henri Poincaré correspondence, it is planned to apply it to other correspondences of well-known scientists. More widely, it should be reusable for other history of science corpora.

Acknowledgements

This work was supported partly by the French PIA project “Lorraine Université d’Excellence”, reference ANR-15-IDEX-04-LUE. It was also supported by the CPER LCHN (Contrat de Plan État-Région Lorrain “Langues, Connaissances et Humanités Numériques”), which financed engineer Ismaël Bada, who participated in this project.

References

1. Basca, Corlosquet, Cyganiak, Fernández and Schandl, Neologism: Easy vocabulary publishing, in: Proceedings of the 4th Workshop on Scripting for the Semantic Web, Bizer, Auer, G.A. Grimnes and Heath, eds, CEUR Workshop Proceedings, Vol. 368, 2008, http://CEUR-WS.org/Vol-368/paper10.pdf.

2. Bernstein and Kaufmann, GINO – a guided input natural language ontology editor, in: The Semantic Web – ISWC 2006, Cruz, Decker, Allemang, Preist, Schwabe, Mika, Uschold and L.M. Aroyo, eds, Springer, 2006, pp. 144–157. doi:10.1007/11926078_11.

3. Bostock, D3.js – data-driven documents, 2012, http://d3js.org/.

4. Bruneau, É. Gaillard, Lasolle, Lieber, Nauer and Reynaud, A SPARQL query transformation rule language – application to retrieval and adaptation in case-based reasoning, in: Case-Based Reasoning Research and Development. ICCBR 2017, Aha and Lieber, eds, Lecture Notes in Computer Science, Vol. 10339, Springer, 2017, pp. 76–91. doi:10.1007/978-3-319-61030-6_6.

5. Bruneau, Garlatti, Guedj, Laubé and Lieber, SemanticHPST: Applying semantic web principles and technologies to the history and philosophy of science and technology, in: The Semantic Web: ESWC 2015 Satellite Events, Gandon, Guéret, Villata, Breslin, Faron-Zucker and Zimmermann, eds, Lecture Notes in Computer Science, Vol. 9341, Springer International Publishing, 2015, pp. 416–427. doi:10.1007/978-3-319-25639-9_53.

6. Carothers and Prud’hommeaux, RDF 1.1 turtle, W3C, 2014, http://www.w3.org/TR/2014/REC-turtle-20140225/.

7. Cheng, Z.M. Ma and Yan, f-SPARQL: A flexible extension of SPARQL, in: Database and Expert Systems Applications, 21st International Conference, DEXA 2010, P.G. Bringas, Hameurlain and Quirchmayr, eds, Springer, 2010, pp. 487–494. doi:10.1007/978-3-642-15364-8_41.

8. Corby, Dieng-Kuntz and Faron Zucker, Querying the semantic web with corese search engine, in: European Conference on Artificial Intelligence, Valence, Spain, 2004, pp. 705–709, https://hal.inria.fr/hal-01531219.

9. Cordier, Dufour-Lussier, Lieber, Nauer, Badra, Cojan, Gaillard, Infante-Blanco, Molli, Napoli and Skaf-Molli, Taaable: A case-based system for personalized cooking, in: Successful Case-Based Reasoning Applications-2, Montani and L.C. Jain, eds, Studies in Computational Intelligence, Vol. 494, Springer, 2014, pp. 121–162, https://hal.inria.fr/hal-00912767. doi:10.1007/978-3-642-38736-4_7.

10. Elizarov, Kirillovich, Lipachev and Nevzorova, Semantic formula search in digital mathematical libraries, in: 2017 Second Russia and Pacific Conference on Computer Technology and Applications (RPC), IEEE, 2017, pp. 39–43. doi:10.1109/rpc.2017.8168063.

11. Fokou, Jean, Hadjali and Baron, Handling failing RDF queries: From diagnosis to relaxation, Knowledge and Information Systems 50(1) (2017), 167–195. doi:10.1007/s10115-016-0941-0.

12. Funk, Tablan, Bontcheva, Cunningham, Davis and Handschuh, CLOnE: Controlled language for ontology editing, in: The Semantic Web, Aberer, K.-S. Choi, Noy, Allemang, K.-I. Lee, Nixon, Golbeck, Mika, Maynard, Mizoguchi, Schreiber and Cudré-Mauroux, eds, Springer, 2007, pp. 142–155. doi:10.1007/978-3-540-76298-0_11.

13. A.E. Gay, Using a content management system for student digital humanities projects: A pilot run, Online Searcher 43(1) (2019), 32–35.

14. Gueroussova, Polleres and S.A. McIlraith, SPARQL with qualitative and quantitative preferences, in: OrdRing@ISWC, 2013, pp. 2–8, http://ceur-ws.org/Vol-1059/ordring2013-paper1.pdf.

15. Hermann, Ferré and Ducassé, An interactive guidance process supporting consistent updates of RDFS graphs, in: Knowledge Engineering and Knowledge Management. 18th International Conference, EKAW 2012, ten Teije, Völker, Handschuh, Stuckenschmidt, d’Acquin, Nikolov, Aussenac-Gilles and Hernandez, eds, Springer, 2012, pp. 185–199. doi:10.1007/978-3-642-33876-2_18.

16. Kucsma, Reiss and Sidman, Using Omeka to build digital collections: The METRO case study, D-Lib Magazine 16(3/4) (2010), 1–11. doi:10.1045/march2010-kucsma.

17. Larrousse and Marchand, A techno-human mesh for humanities in France: Dealing with preservation complexity, in: DH 2019, Utrecht, Netherlands, 2019, https://hal.archives-ouvertes.fr/hal-02153016.

18. Lieber and Napoli, Using classification in case-based planning, in: Proceedings of the 12th European Conference on Artificial Intelligence (ECAI’96), Budapest, Hungary, Wahlster, ed., John Wiley & Sons, Ltd., 1996, pp. 132–136.

19. Lowe, Caruso, Cappadona, Worthington, Mitchell and Corson-Rikert, The vitro integrated ontology editor and semantic web application, in: ICBO 2011: International Conference on Biomedical Ontology, Bodenreider, M.E. Martone and Ruttenberg, eds, CEUR Workshop Proceedings, Vol. 833, 2011, pp. 296–297, http://ceur-ws.org/Vol-833/paper54.pdf.

20. McBride, Jena: A semantic web toolkit, IEEE Internet Computing 6(6) (2002), 55–59. doi:10.1109/mic.2002.1067737.

21. Meroño-Peñuela, Ashkpour, van Erp, Mandemakers, Breure, Scharnhorst, Schlobach and van Harmelen, Semantic technologies for historical research: A survey, Semantic Web Journal (2015), 1–27, http://iospress.metapress.com/content/A842V11135QK5055. doi:10.3233/SW-140158.

22. J.-M. Meunier, Szoniecky and Berthereau, Utilisation d’Omeka-S pour la conception et le partage de ressources pédagogiques, in: Zotero & Omeka – des Outils Pour les Humanités Numériques, Poitiers, France, 2019, https://hal-univ-paris8.archives-ouvertes.fr/hal-02018389.

23. Nabonnand (ed.), La Correspondance Entre Henri Poincaré et Gösta Mittag-Leffler, Birkhäuser, Basel, 1998. doi:10.2307/3621538.

24. N.F. Noy, Sintek, Decker, Crubézy, R.W. Fergerson and M.A. Musen, Creating semantic web contents with Protégé-2000, IEEE Intelligent Systems 16(2) (2001), 60–71. doi:10.1109/5254.920601.

25. F.C. Paletta, M.M. Macambyra, S.L. Ferreira and V.M.A. Lima, Digital library of the artistic production of ECA USP, IFLA 2019 (2019), preprint. doi:10.31219/osf.io/cp5a2.

26. Poincaré, La Science et L’Hypothèse, Ernest Flammarion, Paris, 1902.

27. C.K. Riesbeck and R.C. Schank, Inside Case-Based Reasoning, Lawrence Erlbaum Associates, Inc., Hillsdale, New Jersey, 1989. doi:10.4324/9780203781821.

28. Rollet (ed.), La Correspondance de Jeunesse D’Henri Poincaré: Les Années de Formation. De L’École Polytechnique à L’École des Mines (1873–1878), Publications of the Henri Poincaré Archives, Springer International Publishing, Basel, 2017. ISBN 9783319559599. doi:10.1007/978-3-319-55959-9.

29. Shimazu, Kitano and Shibata, Retrieving cases from relational data-bases: Another stride towards corporate-wide case-based systems, in: Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI’93), Chambéry, 1993, pp. 909–914.

30. Walter, Bolmont and Coré (eds), La Correspondance Entre Henri Poincaré et les Physiciens, Chimistes et Ingénieurs, Birkhäuser, Basel, 2007. doi:10.1007/978-3-7643-8303-9.

31. Walter, Krömer and Schiavon (eds), La Correspondance Entre Henri Poincaré Avec les Astronomes et les Géodésiens, Birkhäuser, Basel, 2014. doi:10.1007/978-3-7643-8293-3.

32. Wittek and Ravenek, Supporting the exploration of a corpus of 17th-century scholarly correspondences by topic modeling, in: Supporting Digital Humanities 2011: Answering the Unaskable, Maegaard, ed., Copenhagen, Denmark, 2011.