Abstract
During the 1990s, historians and computer scientists jointly created a research agenda around the
Introduction
Historians have a long tradition of using computers for their research [15]. The field of historical research is currently undergoing major changes in its methodology, largely due to the advent and availability of high-quality digital data sources. More recently, the Web has shaken the paradigm of research data publication, particularly since the inception of the Semantic Web [13] and the Linked Data principles [46]. This paper examines how Semantic Web technology has been applied to historical data, and how these technologies can facilitate, boost and improve research by historians. This survey revisits the open problems in historical data and historical research, and analyses current contributions, namely papers, projects, online resources and tools, that apply semantic technologies to solve such problems. We study how successful these solutions have been and propose some challenges for the future.
Historical research is an interesting domain for the Semantic Web. Historical data are extremely context-dependent, and always open to a variety of possible interpretations. The availability on the Web of historical research data, which concerns the study and understanding of our past, is growing. The Semantic Web is an evolution of the existing Web (based on the paradigm of the document) into a Web based on the paradigm of structured data and meaning. It is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. This survey studies the crossroads of the Semantic Web and history as research domains.
We consider surveying the state of the art in Semantic Web and history a fundamental task for both fields. First, it is necessary as a knowledge organisation task, in order to articulate research and discern contributions. Second, it fosters the development of semantic technology and history, both individually and as a joint field, and helps in building research agendas. Other attempts at gathering research efforts on Semantic Web and history exist, but most of them study specific history subfields [25,94,97] or analyse concrete task-oriented tools [33,39] and methodologies [46,53,54]. Moreover, none of them consists of a survey or literature review. To the best of our knowledge, this is the first survey reviewing contributions on history and the Semantic Web as generic fields of research.
The elaboration of the study in this paper is not free of obstacles. The first of them is the large number of research contributions to survey, which had to be strictly filtered against the goals of the Semantic Web and of historical research in order to remain feasible. By historical research we mean strictly research performed by historians, and we treat history as a research domain. Thus, we exclude other fields of the humanities in which historical research is also performed, such as art history or the history of literature. Nevertheless, in the end the number of contributions amounts to more than a hundred. Secondly, even though the corpus of available literature is large, we also encountered difficulties in accessing some of the sources. To solve this, we complemented the contributions with the knowledge of domain experts, conducting eight interviews with pioneers in this area. Third, structuring and articulating all this work is an arduous task that requires many schemas, tables and discussions. Finally, the clash of the vocabularies used by two different research communities, usually pointing at similar issues, is problematic. To bridge the different jargons we devote some space to existing classifications of historical data, especially discussing terms like
The paper delivers four contributions. First, it describes a classification of historical data depending on several factors, merging existing distinctions by historians with structural approaches from computer science. Second, it articulates the research conducted in the emerging field of the historical Semantic Web in terms of several
While we concentrate on historical research, similar solutions also emerge in other humanities fields at the turn towards e-humanities or Digital Humanities [14,92]. As historical research overlaps with literary studies, ancient language studies, archaeology, art history and other humanities fields, these areas of encounter are also natural candidates for transferring generic methods, developed from a semantic-technology perspective for historical research, to other humanities fields [68].
The survey is organized as follows. In Section 2 we introduce some background on historical research and the Semantic Web. In Section 3 we study the ecosystem of historical data. We describe the life cycle of historical information, propose a classification for historical sources, and show open problems of historical data. In Section 4 we articulate contributions that apply semantic technologies to historical research. In Section 5 we answer the question of how the contributions presented in Section 4 solve the open problems described in Section 3. In Section 6 we show the challenges that are still left to solve. Finally, in Section 7 we discuss our findings and establish some conclusions.
Background
Historical research
The field of historical research concerns the study and the understanding of the past. The field is currently undergoing major changes in its methodology, largely due to the advent of computers and the Web [15].
Computer science has inspired historians from the start.
Although computing tools are currently embedded in the daily life of most researchers, the use of these tools did not revolutionize all sciences equally. Accordingly, history failed to acknowledge many of the tools computing had come up with [15]. Instead of improving the quality of the work of historians and assisting them in their processes, software developed for historians often requires attending several summer schools [16]. Currently there are still many challenges and information problems in historical research. These difficulties range from textual, linkage, structuring and interpretation problems to visualization problems [15].
Despite these challenges, computing in history, and in the humanities more broadly, has also brought significant contributions in certain fields like linguistics (corpus annotations, text mining, historical thesauri, etc.), archaeology (nowadays impossible without geographic information systems (GIS)), and other fields using sources that have been digitized for (comparative) historical research and converted to databases [15]. The use of electronic tools and media is extremely valuable for opening up various sources for research which would otherwise remain unused. Open access to research data has always been an issue, especially in the humanities. However, over the past years various efforts have been made to open up these black boxes and make them available to researchers. These different sources contain rich information from various fields, often digital in nature in the form of databases, text corpora or images. These sources, in practice isolated databases, often contain a lot of semantics, but their data models were designed independently of one another, making them difficult to compare. So, while more and more sources are being digitized, more attention has to be given to the development of computational methods to process and analyze all these different types of information [45].
A key issue for historians and other humanities researchers when dealing with historical data for comparative research concerns the lack of consistency and comparability across time and space, due to changing meanings, various interpretations of the same historical situations or processes, changing classifications, etc.
Though not all research dreams materialized in the way initially envisioned [60], the inception of the Web allowed historians to aim for world-wide, large-scale collaborations, especially in the area of economic and social history. This kind of Web-based cooperation allows researchers to collect, distribute, annotate and analyze historical information all around the globe [30].
Changes in historical research are closely connected to the emergence of new scientific methods, and this co-evolution has held for decades, even centuries. Statistics has influenced many fields including history, and paved the ground for quantitative studies [61]. However, these kinds of historical studies became more and more the domain of sociologists, economists and demographers rather than of scientists trained as historians [89]. The latest important changes are consequences of recent technological trends connected to the emergence of the Web [76] and the inception of Semantic Web technologies [4].
The Semantic Web
The advent of the Semantic Web brings new perspectives, challenges and research
opportunities for historical research. Envisioned in 2001 by Berners-Lee,
Hendler and Lassila [13], the Semantic
Web was conceived as an evolution of the existing Web (based on the paradigm of
the document) into a Semantic Web (based on the paradigm of structured data and
meaning). At that time, most of the content of the Web was designed for humans
to read, but not for computer programs to process meaningfully. Although
computer programs could parse the source code of Web pages to extract layout
information and text, computers had no mechanism to process the semantics. In
other words, the Semantic Web
More practically, the Semantic Web can be defined as the collaborative movement
and the set of standards that pursue the realization of this vision. The World
Wide Web Consortium [12] (W3C) is the
leading international standards body, and the Resource Description Framework
[121] (RDF) is the basic layer on which the Semantic Web is built. RDF is a set of W3C specifications designed
as a metadata data model. It is used as a conceptual description method:
entities of the world are represented with nodes (e.g.
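As a loose illustration of this triple-based graph model (a sketch only; all URIs and names below are invented for the example, not taken from any real vocabulary), subject-predicate-object statements can be represented with plain tuples:

```python
# Minimal sketch of RDF's graph model using plain Python tuples.
# All URIs below are invented for illustration.
EX = "http://example.org/"

triples = {
    (EX + "AnneFrank", EX + "wrote", EX + "TheDiary"),
    (EX + "AnneFrank", EX + "bornIn", EX + "Frankfurt"),
    (EX + "Frankfurt", EX + "locatedIn", EX + "Germany"),
}

def objects(graph, subject, predicate):
    """All objects linked to `subject` via `predicate` (the edge's targets)."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}

print(objects(triples, EX + "AnneFrank", EX + "wrote"))
# → {'http://example.org/TheDiary'}
```

Real RDF adds globally unique identifiers, literals and serialization formats, but the underlying structure remains this set of labelled edges.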
Efforts on standardization have produced ontologies and vocabularies to describe
multiple domains. An ontology is an
A large number of RDF datasets have been published and interlinked on the Web, using these ontologies and vocabularies and following the Linked Data principles [11]. In the middle of the document-Web and the data-Web, formats and vocabularies for rich structured document markup (such as RDFa [122] or schema.org [91]) are enabling software agents to crawl semantics from web pages, bridging the gap between the Web for humans and the Web for machines. These efforts have evolved the Web into a global data space [46] where data can be queried e.g. using the SPARQL query language (SPARQL Protocol and RDF Query Language) [120]. Although the transition from the document-Web to the data-Web exists in the form of these standards and technologies, the simple idea of the Semantic Web remains largely unrealized [98].
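To give a flavour of such querying (a toy sketch over an invented miniature dataset; a real SPARQL query is sent as text to an endpoint, not evaluated by this matcher), a basic graph pattern such as `SELECT ?city WHERE { ?city ex:locatedIn ex:Netherlands }` amounts to filtering triples and binding a variable:

```python
# Naive evaluation of a single SPARQL-like triple pattern over an
# in-memory triple set. Dataset and URIs are invented examples.
EX = "http://example.org/"

triples = {
    (EX + "Rotterdam", EX + "locatedIn", EX + "Netherlands"),
    (EX + "Amsterdam", EX + "locatedIn", EX + "Netherlands"),
    (EX + "Antwerp", EX + "locatedIn", EX + "Belgium"),
}

# SELECT ?city WHERE { ?city ex:locatedIn ex:Netherlands }
def select_subjects(graph, predicate, obj):
    """Bind the subject variable of the pattern (?city) against the graph."""
    return sorted(s for (s, p, o) in graph if p == predicate and o == obj)

cities = select_subjects(triples, EX + "locatedIn", EX + "Netherlands")
```

Full SPARQL additionally supports joins of several patterns, filters and aggregation, which this one-pattern matcher deliberately omits.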
Historical data
Since the introduction of computers in the field, historical research has produced high-quality digital resources [15]. Historical datasets encompass texts, images, statistical tables and objects that contain information about events, people and processes throughout history. Converted or born-digital, historical datasets are now analyzed at large scale and published on the Web. Their temporal perspective makes them valuable resources and interesting objects of study.
In this section we describe the ecosystem where historical information lives. First we introduce the life cycle of historical information, which is the framework we use to study how historical data is created, enriched, edited, retrieved, analysed, and presented. Then we propose a classification of historical data depending on several factors. Finally, we revisit the traditional open problems of historical data. Some of these problems have found solutions in current Semantic Web developments we present in Section 4.
The life cycle
The main object of study in historical research is historical information, and the multiple ways to create, design, enrich, edit, retrieve, analyse and present historical information with the help of information technology. It is important to distinguish historical information from raw data in historical sources. These data are selected, edited, described, reorganized and published in some form, before they become part of the historian’s body of scientific knowledge. We use the life cycle of historical information proposed by Boonstra et al. [15] to study the workflow of historical information in historical research.
Historical objects go through distinct phases in historical research. In each
phase, these objects are transformed in order to produce an outcome meeting
specific historical requirements. The phases can be laid out as the workflow of
a

The life cycle of historical information (Boonstra et al. [15]). The phases in the life cycle are: (1) creation; (2) enrichment; (3) editing; (4) retrieval; (5) analysis; and (6) presentation.
The life cycle of historical information consists of six phases:
In the middle of the historical information life cycle, three aspects are identified which are central to history and computing, but also in the humanities in general:
The continuous usage of computing in different areas of historical research has produced digital historical data with different formats, perspectives and goals. To be used in the Semantic Web, these historical data have to be represented semantically, using the current standards (see Section 2.2). In this section we propose a classification of historical data in order to bridge the gap between the data representation tradition in historical research, and the standard modelling paradigms of the Semantic Web [4,46].
Primary and secondary sources
Historical sources can be characterized and divided in many ways. A basic
distinction used by historians is between
Primary sources are original materials created at the time under study [10]. They present information in its
original form, neither interpreted, condensed nor evaluated by other
writers, and describe original thinking and data [56]. Examples of primary sources are scientific
journal articles reporting experimental research results, persons with
direct knowledge of a situation, government documents, legal documents (e.g.
the Constitution of Canada), original manuscripts, diaries (e.g. the Diary
of Anne Frank) and creative work. Primary sources can be distinguished into
Secondary sources are materials that have been written by historians or their predecessors about the past [128]. They describe, interpret, analyze and evaluate the primary sources. Usually, secondary sources gather modified, selected, or rearranged information of primary sources for a specific purpose or audience [56]. Examples of secondary sources are bibliographies, encyclopedias, review articles and literature reviews, or works of criticism and interpretation.
Since historical data have not been produced under the controlled conditions of an experiment, historical research always has something of the work of a detective, and certain details (read: annoying inconsistencies) should not be destroyed or manipulated, as they may contain relevant information. On the other hand, to be able to extract statistical information and arrive at more general statements, some formalization is needed, relating information and harmonizing expressions of what is later used as variables. Harmonization, the process of making data sources uniformly accessible without altering their original form, is closely related to issues of standardization and formalization [69].
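A minimal sketch of such non-destructive harmonization (with invented names and variant spellings) is a mapping layer that standardizes values for analysis while leaving the source records untouched:

```python
# Harmonization sketch: variant spellings map to one standard form.
# The mapping is an added layer; source records stay unaltered.
# All names and variants here are invented examples.
HARMONIZED = {
    "Lars Erikson": "Lars Eriksson",
    "Lars Ericson": "Lars Eriksson",
}

def harmonize(name):
    """Return the standard form; unknown names pass through unchanged."""
    return HARMONIZED.get(name, name)

source_records = ["Lars Erikson", "Lars Ericson", "Anna Berg"]
analysis_values = [harmonize(r) for r in source_records]
# source_records keeps the original spellings for source-critical work
```

Keeping the mapping separate from the data is what makes the process reversible, in line with the source-oriented tradition discussed above.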
Intended further processing
Some historians [15] propose to
structure historical data depending on their required further machine
processing. They distinguish between
Source oriented vs. goal oriented
Researchers make the distinction between
Level of structure
At the end of the creation phase (see Section 3.1) one may expect to have a historical dataset suitable for further processing. However, the nature of the steps to be taken thereafter may strongly depend on the way the resulting dataset is structured. Indeed, applying Semantic Web technologies to these historical sources (e.g. to extract RDF triples from them, or to enrich them semantically) is strongly dependent on their level of structure. We propose the historical data classification shown in Fig. 2. We distinguish three levels of inner structure in historical datasets:

Classification of historical data according to their level of structure. Dotted arrows indicate the direction of usual transformations in workflows that identify historical entities (and their relations), from unstructured to structured representations.
The use of the terms
Although structure really matters when deciding which specific computing technique or semantic model to apply to the sources, whether those sources are administrative or narrative, deliberate or inadvertent, matters little once their inner structure is clearly identified. Their belonging to one type or another may have an influence at some point, but in general the procedure to extract RDF triples from the sources relies strongly on their structural type. The goal is a faithful representation of the source in Semantic Web formats: a source-close representation allowing to model the data as-is, meeting the same requirements of faithfulness as critical source editions (the standard for historians). It is critical for semantic representations to consider
Open problems
The classification proposed in Section 3.2.4 is not strict and admits hybrid examples. For instance, annotated digital text sources can be provided both as XML files and stored in a relational database (e.g. for statistical analysis). Some authors classify sources like these, which combine primary and secondary sources, as tertiary sources [129].
Although many advances have been made in different fields and computers are seen as valuable assets, a high percentage of historians are unfamiliar with semantic technologies, or remain unconvinced that they may become a new methodological asset [3,106]. The reason is that the weapon of choice of historians was, and mostly remains, the database, particularly in relational form [3]. This not only enabled historians to retain some of the integrity of the original data sources, but also paved the way for rapid advances on issues such as classifications and record linkage. Therefore, historians typically do research using their
Historical data problems can be divided into four main categories: information problems of historical sources, information problems of relationship between sources, information problems in historical analysis, and information problems of the presentation of sources [15].
Historical sources
The first set of open problems in historical research happens in phase 1 of the historical data life cycle (see Section 3.1). This is when the historical data are created.
Whether manually encoded or OCR-scanned, the creation of a dataset reveals the first barriers. Some characters, words or entire phrases in the original material may be lost, or impossible for the human or the computer to read or recognise. Moreover, different techniques may extract historical entities differently. An example would be: what is the word that is written on this thirteenth-century manuscript?
The next question usually is: what does it mean? In the offline world, background knowledge is provided by libraries. But computer-aided tools also need means to help the historian, using the Web as a channel and semantics as meaning.
Related to background knowledge is the provenance of the data. Even if the source is clearly identified and its meaning deciphered, the historian needs to know more. To which issue does it relate? Why was it put there? Why was the text written? Who was the author? Who was supposed to read the manuscript? Why has it survived?
Another main issue relates to the structuring problem of historical data [131]. How can historical objects be encoded in a database? Researchers have to decide on an adequate data model for their datasets. As historians often have no clear research question when starting an investigation, it is neither possible nor desirable to model the data according to certain requirements in advance. Moreover, different sources have been produced throughout different periods in history with different views and motives. Historical census data are a good example, having varying structures and changing levels of detail which hinder comparative social history research, both in past and present efforts [131].
The main discussion regarding this involves whether to use a source- or a goal-oriented data model for historical data (see Section 3.2.3). Researchers in favor of the source-oriented approach claim that a commitment to a certain data model suitable for analysis should be postponed to the final stages of a project, in order to maintain flexibility and build on the data in a non-destructive manner. This is especially the case when the database is supposed to be shared with other researchers or used in the future [65].
Relationships between sources
As historical researchers deal with various isolated sources, they face the problem of how to integrate these dissimilar sources for their purposes. This typically happens in phase 2 (enrichment) of the life cycle of historical information (see Section 3.1). An example would be: is this Lars Erikson, from this register, the same man as the Lars Eriksson from this other register?
Quite often several sources are used in historical research, which makes linking different sources another key problem. Microdata on the same person contained in different censuses, parish registers, and marriage or death certificates are a good example. Obvious linkage problems are how to disambiguate between persons with the same name, how to manage changing names (e.g. when a woman marries) and how to standardize spelling variations in names. In databases, several issues affect data comparability.
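One simple way to sketch such record linkage (an illustration with invented registers, not any project's actual method, which would also weigh birth dates, places and occupations) is to score candidate name pairs by string similarity and keep those above a threshold:

```python
import difflib

def name_similarity(a, b):
    """Similarity in [0, 1] between two name strings (case-insensitive)."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_links(register_a, register_b, threshold=0.85):
    """Pairs of names from two registers that may denote the same person."""
    return [
        (a, b)
        for a in register_a
        for b in register_b
        if name_similarity(a, b) >= threshold
    ]

# Invented example registers with a spelling variant of the same person.
census = ["Lars Erikson", "Maria Lindqvist"]
parish = ["Lars Eriksson", "Johan Berg"]
print(candidate_links(census, parish))
# → [('Lars Erikson', 'Lars Eriksson')]
```

The threshold is a judgment call: set too low, distinct people are merged; too high, and genuine spelling variants are missed.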
Other problems relate to how to link historical data with their spatial and temporal context. For example, some historical facts may need to be linked with occupational titles that evolve over time [48] or with countries with changing geographical boundaries (compare for example the contemporary geographic position of countries in Europe with the situation in 1930 and in 1900; or the fact that the city of Rotterdam suffered nine major changes in its composition between 1886 and 1941 [133]). As historical research often deals with changes in time and space, historians require tools which enable them to deal with these aspects. Accordingly several techniques have been developed for historical research, but the applicability of these has yet to be determined [15].
Historical analysis
Historical analysis is a fundamental part of the life cycle (see phase 5 in Section 3.1). It usually implies data transformations that aid historians in guiding their research. It also builds the bridge between their hypotheses and historical evidence.
The first issue in analysis is the massive treatment of historical data
processed in previous stages to satisfy historical requirements, or to
support a specific historical interpretation. An example would be: from this
huge amount of digital records, is it possible to discern patterns that add
to our knowledge of history? Various statistical techniques are borrowed
from the social sciences to this end, like multilevel regression, and other
techniques have been specifically developed for historical research, such as
In historical research the meaning of data cannot exist without interpretations [15]. Due to drifting concepts in history, different interpretations may exist with regard to certain data [137]. However, as interpretation of data is a subjective matter, this information should be added in a non-destructive way, preserving the original source data.
Presentation
Presentation is the final phase of the historical information life cycle (see Section 3.1). Its goal is to use visualizations to aid the study and comprehension of historical data. An example problem of this phase would be: how do you put time-varying historical information on a historical map?
Presentation of historical data must be adequate. Different types of presentation are suitable at different stages of a research project. Presentation may take different shapes, ranging from digitized documents and databases (poorly or well modelled) to visualizations and representations in Geographic Information Systems (GIS). Currently there is a great need for tools and methods to present changes over time and space.
Findings
In this section we review the current state of the art in the application of semantic technologies to historical research, describing relevant contributions towards a historical Semantic Web.
Reviewed papers. The ✓ and ∘ signs indicate a strong and a medium relationship, respectively, between the contributions (rows) and the tasks (columns)
Reviewed projects. The ✓ and ∘ signs indicate a strong and a medium relationship, respectively, between the contributions (rows) and the tasks (columns)
Online resources. The ✓ and ∘ signs indicate a strong and a medium relationship, respectively, between the contributions (rows) and the tasks (columns)
Tools, ontologies, and lexical resources. The ✓ and ∘ signs indicate a strong and a medium relationship, respectively, between the contributions (rows) and the tasks (columns)
Under this category we study research that has been conducted to model historical knowledge or historical facts using standard Semantic Web representations (see Section 2.2). We group contributions to a semantically enabled historical Web by their research emphasis: historical ontologies, and linking historical data.
Historical ontologies
Data models are necessary for giving structure to any historical data, since they are the abstract models that document and organise data properly for communication. Ontologies encode such models in the Semantic Web [13] (see Section 2.2), and attention has been given to the need for historical ontologies [54]. In historical research, ontologies are the providers of metadata and background knowledge in phases 2 (enrichment) and 3 (editing) of the historical information life cycle (see Section 3.1). Semantic Wikis [86,96] are a valuable resource for historians to collaboratively build such ontologies.
We find a first category of such models in the form of (typically XML-encoded) taxonomies for historical research. A taxonomy is a collection of controlled vocabulary terms organized into a hierarchical structure, in general with less expressivity than an ontology. The first important example of such knowledge organization is the CLIO system, a databank-oriented system for historians [113] which appeared in 1980. CLIO included a tag/content representation for historical data that could be structured in complex hierarchies, supporting the recoding of material with doubtful semantics. CLIO remained as
More recently, the Semantic Web for Family History [118] exposes a set of genealogy markup languages based on XML to semantically tag genealogical information on sources containing that kind of historical data. In the context of the Text Encoding Initiative [111] (TEI) there is an important discussion about building the bridge between XML (taxonomies) and OWL (ontologies) in historical data. SIG: Ontologies [101] contains a full log on contributions on how to use ontologies with TEI formats; namely, how TEI-XML encoded documents can refer to historical concepts and properties that have been previously formalized in an external OWL ontology.
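The XML-to-OWL bridge under discussion can be sketched as follows (the TEI-like markup and ontology URIs below are simplified, invented examples, not actual TEI or SIG: Ontologies output): elements in the XML-encoded source carry references to concepts formalized in an external ontology.

```python
import xml.etree.ElementTree as ET

# Sketch: an XML-encoded source whose elements point, via `ref`
# attributes, at concepts formalized in an external OWL ontology.
# The markup is TEI-like but simplified; URIs are invented.
doc = """
<text>
  <persName ref="http://example.org/onto#AnneFrank">Anne Frank</persName>
  <placeName ref="http://example.org/onto#Amsterdam">Amsterdam</placeName>
</text>
"""

root = ET.fromstring(doc)
# Pair each surface form in the text with its ontology concept.
entities = [(el.text, el.get("ref")) for el in root]
```

The document keeps its editorial, source-close markup, while the semantics (class membership, properties, relations) live in the referenced ontology.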
The Historical Event Markup and Linking Project [47,88] (HEML) was probably the first project with the goal of creating a Semantic Web of history. Started in 2001, it explored the use of W3C markup technologies to encode and visualize historical events on the Web. Although XML was initially the selected language to provide tagging and markup for describing historical events, the project later experimented with RDF to model and visualize them [87]. This transition was also happening across the historical ontologies community, as researchers better understood RDF and its differences from XML.
The modelling and representation of events, often defined as
Another big focus in historical ontologies is given to geographical modelling. Owens et al. [83] describe a geographically-integrated history, and stress the importance of dynamics and semantics in Geographic Information Systems (GIS). They set an agenda for historical GIS systems that includes important semantic modelling tasks involving ontologies and geography for historical analysis. Moot et al. [74] depict the interesting crossroads between text analysis, historical semantics and geography in a work that structures geographical knowledge from a historical corpus of itineraries. Vocabularies for historical place names are under discussion [85]. Although not intended for historical research, the GeoNames ontology [117] is the reference for geographical modelling in the Semantic Web.
Since entities like places, persons or events change over history and time, there is work raising the importance of change-aware modelling in ontologies [36,68,70]. In historical research and the Semantic Web this is especially true for geographical names, places and regions [52], but also for demographic, social and economic indicators, such as occupations [48].
Linking historical data
By understanding the use and advantages of semantic technologies, practitioners and researchers of historical data can not only connect their own data sources but also disseminate their data into the Semantic Web and integrate them with other data sources, something previously impossible or cumbersome. The approaches reviewed in this section match the historical data problem of the
If one side of knowledge modelling stresses the importance of ontologies and
formalization of the semantics of historical domains, the other side pursues
the usage of such ontologies to interlink related historical data on the
Web. Some researchers in history have centered their interest on how semantics can help relate and link historical sources and entities:
There is a wide variety of project types looking for that structure, though not doing so solely (or explicitly) in RDF. For instance, the Circulation of Knowledge and Learned Practices in the 17th-century Dutch Republic [24] (CKCC project) studies the epistolary network for the circulation of knowledge in Europe in the 17th century, extracting all entities and links from the correspondence of scientific scholars of that time. The LINKing System for historical family reconstruction (LINKS) project [63] reconstructs the links between individuals of historical families across several registries. The CCed [20] project follows a similar approach with clerical careers from the Church of England Database. While these projects mine the historical sources for important historical personalities and their relationships, other approaches, such as the SAILS [90] project, dive into more concrete historical events, linking various World War I naval registries together. The common goal in these initiatives is to produce a
Many other projects expose their domain-specific historical datasets using RDF. These datasets facilitate their linkage to others using existing ontologies (see Section 4.1.1), pursuing goals shared with the old task of historical record linkage. For instance, the Agora project [1] aims at formally describing museum collections and linking their objects with historical context using the Simple Event Model (SEM) [95]. Historical events appear throughout historical data. The FDR Pearl Harbor project links events, persons, dates, and correspondence found in government letters and memoranda surrounding the 1941 Pearl Harbor attack between the US and Japanese governments. All these entities are represented in RDF to model a graph of historical knowledge about that particular event. From a more socio-historical point of view, the Verrijkt Koninkrijk project [25] links RDF concepts found in a structured version of De Jong's studies on the Kingdom of the Netherlands during the Second World War.
Some general purpose tools facilitate the creation of historical Linked Data. The Fawcett toolkit [33] and the Armadillo project [5] are good examples. The latter exports RDF from any unstructured historical source, producing an RDF graph of historical knowledge that encodes the historical entities and their relationships expressed in that source. Other tools like Open Refine [82] or TabLinker [108] are tailored to produce such Linked Data from structured sources like tables (see Section 3.2).
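As an illustration of what such table-oriented tools produce, the following sketch converts a tiny tabular source into subject-predicate-object triples. The `ex:` vocabulary, the column names and the values are invented for the example; this is not the actual output format of TabLinker or Open Refine:

```python
import csv
import io

# Toy tabular source; in practice this would be a historical spreadsheet.
TABLE = """city,occupation,count
Amsterdam,carpenter,120
Utrecht,baker,45
"""

def table_to_triples(text):
    """Turn each row into triples describing one observation resource."""
    triples = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text))):
        obs = "ex:observation/%d" % i
        triples.append((obs, "ex:city", row["city"]))
        triples.append((obs, "ex:occupation", row["occupation"]))
        triples.append((obs, "ex:count", int(row["count"])))
    return triples

triples = table_to_triples(TABLE)
```

Each row becomes an addressable resource, which is precisely what makes the tabular data linkable to other datasets afterwards.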
Text processing and mining
In this section we review contributions that use text processing and mining techniques to extract structured historical knowledge from textual sources.
Structuring historical information from textual resources for further analysis is the bottom line of many research projects. The interesting differences usually come from the various source materials these projects mine. The Agora project [1], aimed at the general public, enriches museum collections with historical knowledge in order to help users place museum objects in their historical contexts. To this end, Agora employs information extraction techniques from statistical natural language processing to extract named entities (actors, locations, times, event names) from textual resources such as Wikipedia and collection catalogues, which are used to populate SEM [95] (see Section 4.1.1) instances. Relevant historical entities are also extracted from the object descriptions and linked to the events. To formalize this workflow, Segers et al. [94] present a prototype pipeline for extracting events and their properties from text using off-the-shelf natural language processing tools such as named entity recognition and pattern-based approaches. The main problem they encounter is that the notion of an event is still ill-defined in NLP research, so that suitable tools are not yet readily available.
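A drastically simplified, pattern-based extractor can illustrate the shape of such pipelines. Real projects use trained NER models rather than regular expressions, and the `ex:` predicates and document identifier below are invented:

```python
import re

# Toy pattern-based extraction: pull year mentions and capitalized
# multi-word name candidates out of free text and emit them as triples.
TEXT = "In 1941 the attack on Pearl Harbor drew the United States into war."

def extract(text, source="ex:doc1"):
    triples = [(source, "ex:mentionsYear", y)
               for y in re.findall(r"\b(1[0-9]{3})\b", text)]
    for name in re.findall(r"\b(?:[A-Z][a-z]+ )+[A-Z][a-z]+\b", text):
        triples.append((source, "ex:mentionsEntity", name))
    return triples

triples = extract(TEXT)
```

The output triples attach every detected entity to the source document, which is the minimal structure needed for the search, retrieval and linking tasks discussed below.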
Textual encodings of the media have also been a source for extracting historical knowledge in several projects. The Bridge project [18] aims at bringing more cohesion into Dutch television archives by finding relevant links between the official archives maintained at the Netherlands Institute for Sound and Vision and other information sources, such as program guides and the websites of broadcasting organizations. It thus focuses on improving access to television archives for media professionals. To do so, relevant entities are extracted from the archives using statistical NLP techniques. Furthermore, interesting events are detected in the television archives by detecting redundant stories, utilising the structure of the archive to identify links between different entities [19]. The Poli Media project [84] mines the text of the minutes of the States General debates to extract and link historical entities from the archives of historical newspapers, radio bulletins and television programs.
The Historical Timeline Mining and Extraction (HiTiME) project [50] aims at detecting and structuring biographical events. To this end, it analyses biographies of persons from Dutch labour union history to create timelines that tell the life stories of these persons, as well as social networks of the persons they interacted with. Van de Camp and Van den Bosch [130] describe an approach to building networks of historical persons by mining biographies for person names and relationships between persons. They use standard named entity recognition tools and exploit the inherent structure of biographies (the topic of a biography is a particular person, and any other persons mentioned in it should have something to do with this person) to detect interpersonal relations.
Many ehumanities and ehistory projects are exploring document summarization or document enrichment techniques from NLP to aid search in their archives. One such technique is topic modelling, which can be used to add topic indicators to a document, helping to cluster search results or to create more fine-grained indexes of archive records. Wittek and Ravenek [139] explore the state of the art in topic modelling techniques to index 19,000 letters of correspondence between 16th- and 17th-century Dutch scientists.
Other high-level text analysis methods, such as frequency-based corpus analysis to compare, e.g., works from different authors, or the investigation of other stylometric characteristics, are also popular in the ehumanities domain [115]. These methods are not domain-dependent and fit more easily into the ehumanities researcher's search-based toolbox.
The spectrum of tools to extract knowledge from unstructured historical data is wide. Important contributions are essentially domain-independent [7], thus not particularly focused on historical text processing. Gangemi [39] presents a recent and complete comparison of generic knowledge extraction tools for the Semantic Web, which will aid historical researchers working in phases 2 (enrichment) and 3 (editing) of the historical information life cycle (see Section 3.1).
Search and retrieval
In this section we review contributions that improve search and retrieval of historical data and entities.
It is no coincidence that many contributions aiming at the extraction of structured entities from historical data also point at some desired system to improve search and retrieval of those entities. Indeed, once a semantic graph of historical knowledge is constructed, search and retrieval of that knowledge, as well as indexing systems that give exact pointers to the source in which particular historical entities are mentioned, can easily be built and improved. The Agora [1] (museum collections), BRIDGE [18] (historical TV metadata), CHOoral [23] (historical audio metadata), Historical Timeline Mining and Extraction (HiTiME) [50] (biographical events), Verrijkt Koninkrijk [25] (Dutch post-war social clusters concepts) and FDR Pearl Harbor [34] (historical events around the 1941 Pearl Harbor attack) projects are all good examples of this tendency. Once the knowledge is successfully extracted from the historical sources and formalized appropriately, the structured entities can be used for graph-based search and retrieval, for instance through SPARQL queries (see Section 2.2), although most systems use specific access methods [53]. Other projects, like H-BOT [44], use a natural language interface instead of a query system for the retrieval of such structured historical knowledge.
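The kind of retrieval a SPARQL endpoint offers over such a graph can be sketched with a toy triple store and a single basic graph pattern; the entities and predicates here are hypothetical:

```python
# Toy triple store with single-pattern matching, analogous to a SPARQL
# basic graph pattern. All identifiers are invented for illustration.
GRAPH = {
    ("ex:letter42", "ex:mentions", "ex:pearl_harbor"),
    ("ex:letter42", "ex:sentBy",   "ex:us_government"),
    ("ex:memo7",    "ex:mentions", "ex:pearl_harbor"),
}

def match(pattern, graph=GRAPH):
    """Return variable bindings ('?'-prefixed terms) for one triple pattern."""
    results = []
    for triple in graph:
        binding = {}
        if all(p == t or (p.startswith("?") and binding.setdefault(p, t) == t)
               for p, t in zip(pattern, triple)):
            results.append(binding)
    return results

# "Which sources mention the Pearl Harbor attack?" -- cf. the SPARQL query
#   SELECT ?src WHERE { ?src ex:mentions ex:pearl_harbor }
sources = {b["?src"] for b in match(("?src", "ex:mentions", "ex:pearl_harbor"))}
```

This is exactly the indexing benefit described above: the graph gives exact pointers back to the sources in which an entity is mentioned.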
Indexing of historical contents is another way of improving search and retrieval of historical data. Indexing and historical data storage systems have a long tradition in historical research [15]. CLIO [113] is a traditional example of such a system; nowadays, indexing is performed with XML annotation-oriented approaches, such as the one described by Robertson [88]. These initiatives should consider the emerging RDFa, microformats and microdata technologies (see Section 2.2) to study how they fit in the vast domain of historical text annotation systems.
Semantic interoperability
In this section we analyze to what extent contributions consider the problem of data integration and use the Semantic Web to deal with it. The specific problems encountered are data model mismatches, schema incompatibilities and disparate source formats. Semantic interoperability has much to do with data integration, namely, how to uniformly represent and commonly query data that come from multiple sources (i.e. data fitting several, probably incompatible, data models).
Semantic heterogeneity of historical sources is especially present in social history projects. The North Atlantic Population Project [78] (publication of microdata of several Atlantic countries) faces this problem of data harmonization: the heterogeneity of its sources requires intense work on resolving data model inconsistencies between datasets.
The source material for the Historical Sample of the Netherlands (HSN) database [49] consists mainly of certificates of birth, marriage and death, and of population registers. From those sources the life courses of about 78,000 people born in the Netherlands between 1812 and 1922 have been reconstructed. Stored in a database and downloadable as files, this information forms a unique tool for research in Dutch history and in the fields of sociology and demography. Sources of this type are, as in the case of the HSN, usually stored in archives and, especially those from a more remote past, not yet machine-readable and not easy to analyse with NLP techniques. There is one major pitfall in linking this kind of data: extracting data about persons, events, institutions and locations is one thing, but linking their different instantiations (for instance different name spellings, or different persons with the same name) and keeping good documentation is the real challenge [65].
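The spelling-variation problem mentioned above is usually attacked with fuzzy matching. A minimal sketch, using the standard-library similarity measure and an illustrative 0.85 threshold (the names are invented, and real record linkage would also use birth dates, places and other variables):

```python
from difflib import SequenceMatcher

# Fuzzy record linkage sketch: accept a candidate pair of person names
# when their string similarity exceeds a threshold. Illustrative only.
def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link(names_a, names_b, threshold=0.85):
    """Return candidate pairs across two registries."""
    return [(a, b) for a in names_a for b in names_b
            if similarity(a, b) >= threshold]

pairs = link(["Jansen, Willem", "Bakker, Maria"],
             ["Janssen, Willem", "Visser, Jan"])
```

Here "Jansen" and "Janssen" are linked despite the spelling difference, while unrelated names are not; the documentation challenge is then to record *why* each pair was accepted.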
The CEDAR project [21], located at the crossroads of the Semantic Web, statistical analysis and social history, exposes the Dutch historical census data on the Semantic Web. Censuses are a great source of unbiased socio-historical information, but they present complex interlinking problems, both internal (i.e. between the time series) and external (i.e. with other datasets) [71].
The work developed by Sieber et al. [100] provides a deep analysis of how semantic heterogeneity can be addressed exclusively with semantic technologies, and describes how to achieve success in environments with very disparate data models. In the history-related domain of geographic information systems (GIS), already discussed in Section 4.1.1, Manso and Wachowicz [66] provide an extensive review of current interoperability issues.
Classification systems
Multiple publications on classification systems [31,42,72,103] are especially aimed at solving interoperability problems in historical data. Classification systems provide a standard mechanism to compare such data, but their specific implementation and effectiveness depend on the orientation of the historical data towards sources or goals (see Section 3.2.3), determined in phase 1 of the historical data life cycle (see Section 3.1).
When dealing with vast amounts of historical data, classification systems are a necessity in order to organize and make sense of the data. The main goal of a classification system is therefore to put things into meaningful groups [9]. This entails an allocation of classes created according to certain relations or similarities. The main issue with historical classification systems is that they are not consistent over time, which makes comparative historical studies problematic. Historical census data are a typical example of this problem [21,69]. Census data are the only historical data on population characteristics that are not strongly distorted, and they yield an extremely valuable source of information for researchers [89].
However, major changes in the classification and coding of the different censuses have hindered comparative historical research in both past and present efforts [131]. Researchers are forced to create their own classification systems in order to answer their research questions; however, this process often results in disparate systems which are not comparable, embed a lot of expert knowledge and different interpretations of the data, and cannot easily be (re)used by other researchers. The fact that many of the modelling techniques are destructive in nature (we cannot go back to the source) makes it even more cumbersome to comprehend these sources. In order to deal with the changing classifications and the vast differences at both national and international level, we need to bridge the gaps between the datasets and conform to certain shared classification standards.
Currently, several significant efforts have been made in this direction. The Integrated Public Use Microdata Series (IPUMS) project [55], which contains the richest source of quantitative information on the American population, faces the problem of bridging 8 different occupational classification systems with a total of 3,200 different categories. The North Atlantic Population Project (NAPP) [78] provides a machine-readable database of nine censuses from several countries. The main focus of the NAPP project is to harmonize these datasets and link individuals across different censuses for longitudinal and comparative analysis. Its linking strategy involves the use of variables which do not change over time: records are only matched if some variables, such as race and state of birth, match exactly, while other variables, like age and names, are permitted some variation. Another significant historical classification system is the Historical International Standard Classification of Occupations (HISCO) [48]. As occupations are one of the most problematic variables in historical research, HISCO aims to overcome the problem of occupational terminologies changing over time and space. It encodes historical occupations gathered from different historical sources coming from different time periods, countries and languages, and classifies tens of thousands of occupational titles, linking these to short descriptions and images.
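The core mechanism of such occupational harmonization can be sketched as a codebook lookup. The codes and spellings below are illustrative placeholders, not verified HISCO entries:

```python
# Harmonization sketch in the spirit of HISCO: free-text occupational
# titles from different sources and languages map to one common code.
# Codes and spellings here are invented for illustration.
CODEBOOK = {
    "carpenter": "95410", "timmerman": "95410",  # same occupation, two languages
    "baker": "77610", "bakker": "77610",
}

def harmonize(records):
    """Replace free-text titles with codes; unknown titles are flagged for review."""
    return [(name, CODEBOOK.get(title.lower(), "UNMATCHED"))
            for name, title in records]

coded = harmonize([("W. Jansen", "Timmerman"),
                   ("M. Bakker", "baker"),
                   ("J. Visser", "smith")])
```

Once titles from two censuses share a code, comparative queries across time and country become a join on that code rather than on unstable strings.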
Transversal approaches
Finally, there are few but key contributions we have classified as being transversal, since they span several of the tasks discussed above.
The CLIO system [113], a databank-oriented system for historians, is the first such contribution. CLIO was, for decades, a reference system for database-oriented historical research.
In the Linked Data universe, the Agora project [1] is one such transversal contribution. It generates historical RDF of events extracted from unstructured texts using NLP techniques, uses it for enhanced search and retrieval, mitigates semantic heterogeneity and gives context by linking to other datasets. Similarly, the Verrijkt Koninkrijk [25] and Multilingual Access to Large Spoken Archives (NSF-ITR/MALACH) [79] projects perform these tasks in their particular domains (see Section 4.1.2). The FDR Pearl Harbor project [34] also contributes along this line, additionally opening the very promising field of historical knowledge inference through the formalization and usage of historical OWL ontologies. All these are good examples of how historical data become much richer when their semantics are explicitly expressed and they are interlinked through standard vocabularies and ontologies.
Regarding tools, the Armadillo architecture of Semantic Web services [5] and the Fawcett toolkit [33] embody the generic workflow behind all these contributions, covering the whole pipeline of semantic historical data management. The latter extracts RDF event-oriented triples from unstructured texts and additionally allows historians to install a full semantic toolbox with widgets to experiment with their data. Open Refine [82], in combination with its RDF-export plugin, allows the extraction, transformation, modelling and publishing of historical Linked Data when the sources come in tabular format.
Additionally, the theoretical study by Boonstra et al. [15] envisages how the Semantic Web can enhance research by historians. It constitutes, besides, a major work on the evolution of historical computing, ehistory and historical information science, and gives a deep intuition of how computer science can help to solve long-standing problems in historical research.
Solving historical problems
Mapping between the open problems of historical data (see Section 3.3) and the surveyed contributions in historical Semantic Web (see Section 4). The sign ✓ indicates that the problem is directly addressed in the Semantic Web task. The sign ∘ indicates that the problem is indirectly or partially addressed in the Semantic Web task.
The first interesting result is that some of the problems identified in Section 3.3 are already being addressed by several Semantic Web tasks, while others receive only partial attention.
As part of the problems in historical sources, the provision of background historical knowledge has been only partially successful. The infrastructure (Linked Data cloud, SPARQL endpoints on historical data) is set up and running, but the amount of historical data available is still too low to give good support to historians creating historical datasets at the beginning of the life cycle (see Section 3.1). Consequently, little background knowledge is available today to help these historians solve e.g. errors or inconsistencies at that phase. Similarly, the generic infrastructure for provenance publishing and retrieval on the Semantic Web is very mature and extensively used in other domains [123], but scarce or non-existent in the historical domain, despite being identified as a very important requirement (see Section 3.3.1). The provision of such provenance on historical datasets needs to be guaranteed in projects using semantic technologies to publish historical data.
Solutions to the problem of
The problems in
In Table 5 all open problems have Semantic Web tasks
providing solutions, but not all tasks are mapped to some historical open problem.
Concretely, the tasks of
The use of semantic technologies has contributed significantly to solving the open problems of historical data (see Section 5). However, there is much room for improvement: the open problems are being addressed as shown, but they will remain far from solved without additional attention. The scarcity of historical data on the Semantic Web is a good example. Other problems, some more specific, some more generic, could also be tackled with semantic solutions. In this section we explore some aspects of the Semantic Web that have not yet been used, or could be further exploited, in historical research.
Semantics of time, change, language, uncertainty and interpretation
Classifications and ontologies in history do exist, but not for all areas, not in Semantic Web languages and not always agreed upon. Although several historical ontologies have been developed (see Section 4.1.1), these models are insufficient for the vast amount and variety of historical data that still has to be published on the Semantic Web, especially when key issues for historians like time, change, language, uncertainty and interpretation need to be represented.
Reasoning
From the point of view of Linked Data, ontologies and vocabularies are designed to control the terms in which datasets may express data, as well as the data model in which these data are represented. However, from a more Semantic Web perspective, one may expect these ontologies and vocabularies to facilitate the discovery of new knowledge; that is, to make explicit some implicit fact that was not trivial for a human to deduce, especially in big knowledge bases.
Reasoning is one of the key mechanisms of the Semantic Web still to be exploited in historical research. The absence of specific methods and tools for automatic historical inference, through which new, implicit historical facts could be derived, remains an important gap.
Historical ontologies can be used to facilitate historical knowledge discovery using reasoners. Assuming that a particular domain is completely formalized as historical ontologies, it is possible to run a reasoner on these ontologies to derive implicit rules and facts that were not present in the original model as explicit knowledge (i.e. specifically encoded in the ontology), but were there as underlying knowledge. For instance, if an ontology describes, on the one hand, the fact that a letter was sent from one government to another, and on the other hand, the fact that governments have a person responsible for sending and receiving letters, then the reasoner may be able to infer which concrete persons sent and received which letters. As the knowledge base grows, implicit knowledge is no longer evident, and reasoners can save an enormous amount of work and produce high-value pieces of historical knowledge.
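The letter example above can be worked through with a hand-rolled inference rule. A real system would encode this declaratively (e.g. as an OWL property chain) and let a reasoner do the work; all entities and predicates here are hypothetical:

```python
# Forward-chaining sketch of the inference described in the text: from
# "letter sent from government A to government B" plus "person P handles
# correspondence for government G", derive who sent and received each
# letter. All identifiers are invented for illustration.
FACTS = {
    ("ex:letter1", "ex:from", "ex:gov_us"),
    ("ex:letter1", "ex:to",   "ex:gov_jp"),
    ("ex:alice",   "ex:handlesMailFor", "ex:gov_us"),
    ("ex:bob",     "ex:handlesMailFor", "ex:gov_jp"),
}

def infer(facts):
    """Derive sender/receiver facts not explicitly stated in the input."""
    derived = set()
    handlers = {gov: person for person, rel, gov in facts
                if rel == "ex:handlesMailFor"}
    for subj, rel, gov in facts:
        if rel == "ex:from" and gov in handlers:
            derived.add((handlers[gov], "ex:sentLetter", subj))
        if rel == "ex:to" and gov in handlers:
            derived.add((handlers[gov], "ex:receivedLetter", subj))
    return derived

new_facts = infer(FACTS)
```

The two derived triples were nowhere in the original data; as the base grows, exactly this kind of non-obvious derivation is what a reasoner contributes.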
Since historians have different interpretations and no clear research question when starting an investigation, abductive reasoning (i.e. given a conclusion and a rule, trying to select possible premises that support the conclusion) may be more convenient than deductive reasoning (i.e. deducing true conclusions given a premise and a rule) in historical research [22,51]. This would invert the order of some phases of the life cycle of historical information (see Section 3.1), generating a more bottom-up, data-driven generation of hypotheses and supporting evidence. The impact of abductive reasoning on historical research and its relationship with the life cycle need further study and clarification.
The introduction of any kind of reasoning in the life cycle needs to be done with the goal of supporting, not replacing, the task of the historian, who must keep control of the implementation of the different phases.
Linking more historical data
We show in Section 4.1.2 that great efforts are being devoted to publishing historical Linked Data. However, the amount of structured historical knowledge available on the Web is still insufficient to aid tasks that need large amounts and different kinds of historical background knowledge. While many different data and information sources exist, they are not always interlinked. This isolation of historical data sources not only makes them harder to find, but also inhibits how they can be further processed and connected.
One of the big claims of Linked Data is that, by linking datasets, the relations established between nodes of these datasets highly enrich the information contained in them. Browsing datasets is then no longer an isolated task: by allowing users (and machines) to explore entities through their predicate links, data acquire new meanings, countless contexts and useful perspectives for historians.
For example, consider a scenario with three different SPARQL endpoints exposing RDF triples of a census with occupational data, a historical register of labour strikes, and a generic classification system for occupations (in the context of one particular country, for instance). Suppose that the occupational census exposes triples with counts of occupations (for example, how many men and women worked in a particular occupation in a concrete city); that the historical register of labour strikes contains counts of how many people participated in labour strikes (numbers of women and men, per occupation and city); and that the generic classification system harmonizes the occupation names between both previous datasets (for example, giving a common code for occupation names that may vary between census occupations and labour strike occupations). Then several SPARQL queries can be constructed to give very meaningful and interesting linked data to the historian. For instance, such a query may return, given a city and an occupation code, the ratio of men and women who followed a particular well-known labour strike. Another SPARQL query may return an ordered list of historical labour strikes by relevance, according to several indicators (strike success ratio, total number of workers on strike, density of people on strike depending on the location, etc.). The possibilities obviously increase if we think of more related historical sources to link, like datasets describing historical weather or historical geographical names and areas.
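The scenario above can be executed as a join over three toy datasets. In a real setting each dataset would sit behind its own SPARQL endpoint and the join would be expressed as a (possibly federated) SPARQL query; all numbers, names and codes here are invented:

```python
# Three toy "endpoints": census counts, strike counts, and a codebook
# that harmonizes the occupation names used by the first two.
CENSUS  = {("Amsterdam", "timmerman"): {"men": 120, "women": 4}}
STRIKES = {("Amsterdam", "carpenter"): {"men": 30, "women": 1}}
CODES   = {"timmerman": "95410", "carpenter": "95410"}  # common code

def strike_ratio(city, code):
    """Share of workers of an occupation (by harmonized code) who struck."""
    worked = sum(sum(v.values()) for (c, occ), v in CENSUS.items()
                 if c == city and CODES.get(occ) == code)
    struck = sum(sum(v.values()) for (c, occ), v in STRIKES.items()
                 if c == city and CODES.get(occ) == code)
    return struck / worked if worked else None

ratio = strike_ratio("Amsterdam", "95410")
```

Without the shared code, "timmerman" and "carpenter" would never join, and the question could not be answered from the two sources alone.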
Flexibility of data models
In historical research it is considered bad practice not to carry out the historical data modelling at phase 1 of the historical information life cycle (see Section 3.1). The choice of a particular data model to represent historical data is a critical issue for most historical computing projects. The selection of an appropriate data model may seem a good design decision at some stage of a project; however, new requirements, research directions or stakeholder priorities may turn that data model into an obstacle rather than an aid. Flexibility of the data with respect to the data model used to represent historical domains is desirable, to avoid restructuring entire databases. Comparison in historical research requires models flexible enough to be matched to one another. In the end, this forces historians to make their data selection and processing dependent on a certain data model that cannot easily be replaced or altered if needed. This happens frequently in environments with changing and creeping requirements [57].
Applying semantic technologies and Linked Data principles to historical data may have a major advantage regarding historical data models, providing flexibility at the historical data modelling phase. Two different approaches to historical data modelling have traditionally been followed in historical computing: the source-oriented and the goal-oriented approaches (see Section 3.2.3).
Moreover, additional questions arise when considering the traditional perspectives on data modelling: the conceptual, logical and physical data models. These perspectives help to detach data management technology, such as relational databases or RDF triple stores, from conceptual schemas (i.e. the semantics of a domain). While conceptual data models are currently shared on the Web as, e.g., historical ontologies (see Section 4.1.1), the flexibility of the whole modelling stack towards semantic changes needs to be better understood.
Non-destructive data transformations
The inflexibility of data models (see Section 6.4) is related to the inflexibility of historical data transformations. Historical data are modified throughout the life cycle of historical information (see Section 3.1), but if update, enrichment, analytic and interpretative operations are not controlled, these transformations lead to different historical data representations which can hardly be related to each other any more, neither in terms of provenance nor in terms of relatedness.
Another issue is supporting data transformations under two constraints: (a) without modifying the source data (so the originals stay intact); and (b) with tracking of changes. Consequently, destructive updates are a major concern when selecting, aggregating and modifying historical data. On the one hand, specific encodings (CSV, spreadsheets, XML) do not support non-destructive updates, so version control systems are necessary to retrieve previous states. On the other hand, relational databases can be inefficient when querying all recorded transformations, edits and manipulations.
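Both constraints can be sketched with an append-only edit log: the source dataset is never modified, every transformation is recorded as an event, and any state can be replayed from the log. The record and field names below are illustrative:

```python
# Non-destructive update sketch: the source stays intact and all edits
# are recorded append-only with their actor, so changes are fully tracked.
SOURCE = {"rec1": {"occupation": "timmerman"}}
LOG = []  # append-only: (record, field, new_value, actor)

def apply_edit(record, field, value, actor):
    """Record an edit without touching the source data."""
    LOG.append((record, field, value, actor))

def current_view():
    """Replay the log over a copy of the source to get the current state."""
    view = {k: dict(v) for k, v in SOURCE.items()}
    for record, field, value, _actor in LOG:
        view.setdefault(record, {})[field] = value
    return view

apply_edit("rec1", "occupation_code", "95410", "historian_a")
view = current_view()
```

Truncating the replay at an earlier point in the log reconstructs any previous state, which is exactly what destructive encodings need external version control to achieve.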
Non-destructive updates are well supported by current Semantic Web technology like SPARQL (see Section 2.2). SPARQL updates can insert new triples, for instance into separate named graphs, without overwriting the original data, so that every transformation can be recorded and previous states remain retrievable.
Discussion and conclusions
In this paper we present a general overview of semantic technologies applied to historical research. We describe a general approach to historical research and the Semantic Web, and motivate why the combination of the two is an interesting field of research. We introduce core elements of historical research, such as the life cycle of historical information, several classifications for historical data, and the open problems shared by historians and computer scientists. Then, we overview contributions to the young historical Semantic Web in the form of papers, projects and tools, articulating the work into several tasks and trends within these tasks. We provide a mapping to see to what extent the work on these tasks is helping to solve the open problems of historical data and historical research. Finally, we draw up a list of interesting open challenges for the future, like working out the semantics of aspects critical for historians, such as interpretation and time, and encouraging reasoning in the historical Semantic Web.
It is interesting to observe the sparsity of Tables 1, 2, 3 and 4. There is a significant difference in the number of empty spaces (i.e. the specificity of the contributions) between Tables 1 and 4 (papers and tools, ontologies) and Tables 2 and 3 (projects and online resources). While the former set has essentially lots of empty cells, reflecting very focused contributions, the latter tends to cover several tasks at once.
We show how the Semantic Web and history communities understand the need for representing the inner semantics implicitly contained in historical sources, and how these semantics can be conveniently identified, formalized and linked. With the appropriate pipelines, algorithms can extract entities from digital historical sources and transform these occurrences into RDF triples, according to some historical ontology or vocabulary. These entities can be linked among themselves and with other historical Linked Data, contributing to an open, worldwide, persistent online graph of historical knowledge: a historical Semantic Web. The work presented in this survey contributes to one phase or another of this graph-building pipeline. We leave it to the reader to decide whether this historical Semantic Web building pipeline is, in fact, the Semantic Web version of the life cycle of historical information.
The challenge of realising a historical Semantic Web that meets as many requirements as possible may bring new facilities for a number of stakeholders. On the one hand, humanities researchers, also outside history, will be able to integrate the historical Semantic Web into their own information life cycle. They will be able to search, retrieve and compare historical knowledge and use it for the construction of their narratives, still the final outcome of historical research. On the other hand, practitioners will be able to search for new data sources to develop history-aware applications for public institutions, private companies and citizens.
Acknowledgements
The work on which this paper is based has been supported by the Computational Humanities Programme of the Royal Netherlands Academy of Arts and Sciences, under the auspices of the CEDAR project. This work has also been supported by the Dutch national programme COMMIT.
The authors want to thank all contributing colleagues, especially the interviewees Onno Boonstra, Peter Doorn, Jan Kok, Henk Laloli and Dirk Roorda.
