Abstract
This paper presents an overview of the LL(O)D and NLP methods, tools and data for detecting and representing semantic change, with its main application in humanities research. The paper’s aim is to provide the starting point for the construction of a workflow and set of multilingual diachronic ontologies within the humanities use case of the COST Action.
Introduction
Detecting semantic change in diachronic corpora and representing the change of concepts over time as linked data is a core challenge at the intersection of digital humanities (DH) and the Semantic Web (SW). Semantic Web technologies have already been used successfully in humanistic initiatives such as the Mapping the Manuscripts project [29] and Pelagios [80]. They facilitate the creation, publication and interlinking of FAIR (Findable, Accessible, Interoperable and Reusable) datasets [191]. In particular, using a common data model, common formalisms and common vocabularies in linked data helps to render datasets more interoperable; the use of readily available technologies such as the query language SPARQL also makes such data more (re-)usable. Semantic change data can be highly heterogeneous and potentially include linguistic, historical, bibliographic and geographical information. The linked data model is well suited to handling this. For instance, the lexical aspect of semantic change data is already served by the existing OntoLex-Lemon vocabulary and its extensions, and there are also numerous vocabularies and datasets dealing with bibliographic metadata, historical time periods and geographic locations. In addition, the Web Ontology Language (OWL) and associated reasoning tools allow for basic ontological reasoning to be carried out on such data, which is useful for dealing with different classes of entities referred to by word senses.
Although significant advances have been made in the development of natural language processing (NLP) methods and tools for extracting historical entities and modelling diachronic linked data, as well as in the field of Linguistic Linked (Open) Data (LL(O)D)), these strands of research have rarely been considered in combination; one exception is [54]. (We have added parentheses around the word ‘open’ because, although the focus is often on linked data, and in our case linguistic linked data, that has been published with an open license, this is not always the case, and linked data may have other types of license.)
The contribution of this paper is a literature survey intended to consider these areas together. We posit that, to better contextualise and target the combination of NLP and LL(O)D techniques for detecting and representing semantic change, the main workflow implied in the process should be taken into account.
The current study is developed as part of the use case in the humanities (UC4.2.1) carried out within the COST Action.
The paper is organised in eight sections describing the survey methodology and the state of the art in data, tools and methods for NLP and LL(O)D resources that we deem important to a workflow designed for the diachronic analysis and ontological representation of concept evolution. Our main focus is concept change for humanities research, which involves investigations and data that include a time dimension, though the approaches discussed may also apply to other domains. The various sections focus on the essential aspects needed to understand current trends and to build applications for detecting and representing semantic change. The remainder of this paper is organised as follows. Section 2 presents the methodology applied to build the survey. Section 3 discusses existing theoretical frameworks for tracing different types of semantic change. Section 4 presents current LL(O)D formalisms (e.g. RDF, OntoLex-Lemon, OWL-Time) and models for representing diachronic relations. Section 5 is dedicated to existing methods and NLP tools for the exploration and detection of semantic change in large sets of data, e.g. diachronic word embeddings, named entity recognition (NER) and topic modelling. Section 6 presents an overview of methods and NLP tools for the (semi-)automatic generation of (diachronic) ontological structures from text corpora. Section 7 provides an overview of the main diachronic LL(O)D repositories in the humanities domain, with particular attention to collections in various languages, and emerging trends in publishing ontologies representing semantic change as LL(O)D data. The paper is concluded by Section 8, where we discuss our findings and future directions.
The motivation for combining DH approaches with semantic technologies is mainly related to the target audiences of the survey: researchers, students and teachers interested in detecting how concepts in a certain domain evolve and how this evolution can be represented via Semantic Web and linked data technologies that support the production and dissemination of FAIR data on the Web. Therefore, the paper addresses the study of semantic change and the creation of diachronic ontologies in connection with areas in the humanities such as the history of concepts and the history of ideas on the one side, and linguistics on the other. This topic may be of potential interest to other researchers concerned with semantic change detection within a particular domain and its modelling as linked data. Scholars in Semantic Web technologies may be interested in such areas of application, in the further development of the linked data paradigm, and in the possibilities of integrating diachronic representations of data from the humanities into the LL(O)D cloud in the future.
The scope of the paper covers diachronic corpora that may span more distant or more recent periods in time. The article therefore focuses on studies dealing with diachronic variation, that is, change over time, but not with synchronic variation, which can refer, for instance, to variation across genre (or register), class, gender or other social categories [117] within a given, more limited period of time. The survey also targets the construction of diachronic ontologies which, unlike synchronic ontologies that ignore the historical perspective, allow us to capture the temporal dimension of concepts and investigate gradual semantic changes and concept evolution through time [76].
As mentioned above, the survey follows a workflow for detecting and representing semantic change as LL(O)D ontologies, based on diachronic corpora. Figure 1 illustrates the main building blocks of such a workflow and the possible interconnections among the various areas of research considered relevant for the study. Each block can be mapped onto one of the subsequent sections (referred to as Sections 3–7 in Fig. 1). It should be noted that not all relationships displayed in the figure are explicitly expressed in the surveyed literature. Some of them represent work in progress or projections of possible developments implied by the intended workflow. For instance, we consider that theoretical modelling of semantic change in diachronic corpora can play an important role in designing the following steps in the workflow, such as LL(O)D modelling, detection of lexical semantic change and ontology generation, and thus this area is worth surveying together with the other blocks. Moreover, approaches from the domain of lexical semantic change detection may inform and potentially bring about new perspectives on learning or generating (diachronic) ontologies from unstructured texts, which, in turn, connects with existing or future means of publishing such ontologies in the LL(O)D cloud.

Fig. 1. Generic workflow and related sections.
Our methodology consisted of three phases: (1) selecting or searching for (recent) surveys or reference works in areas related to the five blocks depicted in Fig. 1; (2) expanding the set by considering relevant references cited in the works collected during the previous phase; (3) refining the structure of the covered areas and corresponding sections and subsections, as shown in Table 1. The first phase started with works already known to the authors from their fields of research, or resulting from searching by keywords such as “semantic change/shift/drift”, “history of concepts/ideas”, “historical linguistics/semantics”, “diachronic/synchronic variation/ontology”, “ontology generation/acquisition/extraction/learning”, and “semantic change” + “word embeddings”. Keyword search mainly involved the use of Google and the selection of journal articles, conference papers and book sections usually made available via ResearchGate, arXiv.org, the ACL Anthology, IEEE Xplore, Semantic Scholar, Google Scholar, Academia.edu, open access journals such as the Journal of Web Semantics and Semantic Web, and institutional libraries. The filtering process included criteria such as relevance to the topic discussed in a certain section or subsection and in the workflow as a whole, and timeframe, with preference, where applicable, for recent research (in particular, the last decade). Publication year and citation counts provided by various platforms, e.g. Google Scholar and the ACL Anthology, were also taken into account as pointing to newer and influential research. Finally, the co-authors reached a consensus on the works to be analysed and cited. Table 1 summarises the structure and size of the referenced material presented in the survey.
Table 1. Structure and size of the surveyed material
Different disciplines (within or applied in the humanities) make use of different interpretations, theoretical notions and approaches in the study of semantic change. In this section, we survey various theoretical frameworks that rest in the domain of either linguistics or knowledge representation and that can serve the theoretical modelling purposes of block 1 in the generic workflow (Fig. 1). These theoretical frameworks come from two distinct lines of enquiry, arising from two traditions: one coming from philosophy, history of concepts and history of ideas, the other from linguistics. Although there are no strict demarcations between the two threads and some overlap exists, the first is more closely associated with Semantic Web technologies (and the corresponding representation of knowledge, including ontologies), and the second with corpus-based analysis.
Knowledge-oriented approaches
Scholars in domains such as the history of ideas, the history of concepts and philosophy focus on concepts as units of analysis. In his comparative reading of German and English conceptual history, Richter [145] accounts for the distinction between words and concepts in charting the history of political and social concepts, where a concept is understood as “forming part of a larger structure of meaning, a semantic field, a network of concepts, or as an ideology, or a discourse” (p. 10). Basing his study on three major reference works by 20th-century German-speaking theorists, Richter notes that outlining the history of a concept may sometimes require tracking several words to identify continuities, alterations or innovations, as well as a combination of methodological tools from history, diachronic and synchronic analysis of language, semasiology, onomasiology and semantic field theory. He also highlights the importance of sources (e.g. dictionaries, encyclopaedias, political, social and legal materials, professional handbooks, pamphlets and visual, nonverbal forms of expression, journals, catechisms and almanacs) and of procedures for dealing with these sources in tracing the history of concepts in a certain domain, as demonstrated by the reference works covered in his analysis.
Within the framework of intellectual history, Kuukkanen [103] proposes a vocabulary allowing for a more formal description of conceptual change, in response to critiques of Lovejoy’s long-debated notion of “unit-ideas” or “unchangeable concepts”. Assuming that a concept X is composed of two parts, the “core” and the “margin”, underlain by context-unspecific and context-specific features respectively, Kuukkanen describes the core as “something that all instantiations must satisfy in order to be ‘the same concept’”, and the margin as “all the rest of the beliefs that an instantiation of X might have” (p. 367). This paradigm enables us to record a full spectrum of possibilities, from conceptual continuity, implying core stability and different degrees of margin variability, to conceptual replacement, when the core itself is affected by change.
Another type of generic formalisation, combining philosophical standpoints on semantic change, the theory of knowledge organisation and Semantic Web technologies, is proposed by Wang et al. [186], who consider that the meaning of a concept can be defined in terms of “intension, extension and labelling applicable in the context of dynamics of semantics” (p. 1). Thus, since they reflect a world in continuous transformation, concepts may also change their meanings, a process called “concept drift”. (The term “semantic drift” is also used, although the difference between the two is not explicitly defined; see also the discussion of [168] below. The following acronyms are used in this context: SKOS (Simple Knowledge Organization System); RDFS (RDF Schema); RDF (Resource Description Framework); OWL (the W3C Web Ontology Language); OBO (Open Biomedical Ontologies).)
Drawing upon methodologies in the history of philosophy, computer science and cognitive psychology, and elaborating on Kuukkanen’s and Wang et al.’s formalisations, Betti and Van den Berg [15] devise a model-based approach to the “history of ideas or concept drift (conceptual change and replacement)” (p. 818). The proposed method regards ideas or concepts (used interchangeably in the paper) as models or parts of models, i.e. complex conceptual frameworks. Moreover, the authors consider that “concepts are (expressible in language by) (categorematic) terms, and that they are compositional; that is, if complex, they are composed of subconcepts” (p. 813).
Starting with an overview of concept change approaches in different disciplines, such as computer science, sociology, historical linguistics, philosophy, the Semantic Web and cognitive science, Fokkens et al. [54] propose an adaptation of [103]’s and [186]’s interpretations for modelling semantic change. Unlike [186], [54] argue that only changes in the concept’s intension (definitions and associations), provided that the core remains intact, are likely to be understood as concept drift across domains, with what belongs to the core being decided by domain experts (oracles). Changes to the core would determine conceptual replacement (following [103]), while changes in the concept’s extension (reference) or label (the words used to refer to it) are considered related phenomena of semantic change that may or may not be relevant and indicative of concept drift. Fokkens et al. [54] apply these definitions in an example using context-dependent properties and an RDF representation in Lemon (the Lexicon Model for Ontologies); note that although [54] cites the original Lemon model, the example featured in that article seems to use the later OntoLex-Lemon model.
A different interpretation is offered by Stavropoulos et al. [168] through a background study intended to describe the usage of terms such as “concept drift” and “semantic drift” in the literature.
Scholars in computational semantics employ a slightly different terminology from that of scholars in the history of ideas, the history of concepts and philosophy. Kutuzov et al. [102], for example, describe the evolution of word meaning over time in terms of “lexical semantic shifts” or “semantic change”, and identify two classes of semantic shifts: “linguistic drifts (slow and regular changes in core meaning of words) and cultural shifts (culturally determined changes in associations of a given word)” (p. 1385).
Disciplines from more traditional linguistics-related areas provide other types of theoretical bases and terminologies to research semantic change and concept evolution. For instance, Kvastad [104] underlines the distinction made in semantics between concepts and ideas, on one side, and terms, words and expressions, on the other side, where a “concept or idea is the meaning which a term, word, statement, or act expresses” (p. 158). Kvastad also proposes a set of methods bridging the field of semantics and the study of the history of ideas. Such approaches include synonymity, subsumption and occurrence analysis allowing historians of ideas to trace and interpret concepts on a systematic basis within different contexts, authors, works and periods of time. Other semantic devices listed by the author can be used to define and detect ambiguity in communication between the author and the reader, formalise precision in interpretation or track agreement and disagreement in the process of communication and discussion ranging over centuries.
Along a historical timeline spanning from the middle of the 19th century to 2009, Geeraerts [61] presents the major traditions in the linguistic field of lexical semantics, with a view to the theoretical and methodological relationships among five theoretical frameworks: historical-philological semantics, structuralist semantics, generativist semantics, neostructuralist semantics and cognitive semantics. While focusing on the description of these theoretical frameworks and their interconnections in terms of affinity, elaboration and mutual opposition, the book also provides an overview of the mechanisms of semantic change within these different areas of study. The main classifications of semantic change resulting from historical-philological semantics include, on the one hand, semasiological mechanisms (such as specialisation, generalisation, metaphor and metonymy) and, on the other hand, onomasiological mechanisms.
In cognitive linguistics and diachronic lexicology, Grondelaers et al. [66] likewise observe that semantic change can be approached from two different perspectives, the onomasiological and the semasiological. The onomasiological approach focuses on the referent and studies diachronically the representations of the referent, whereas the semasiological approach investigates the linguistic expression by researching diachronically the variation of the objects identified by the linguistic expressions under investigation. There is a tendency to apply the semasiological approach in computational semantic change research because it relies on words or phrases extracted from the datasets; however, the extraction of concept representations from linguistic data poses certain challenges and requires learning ontologies either semi-automatically or automatically in order to trace concept drift or change, as discussed above.
In other fields, such as terminology, semasiological and onomasiological approaches may encompass either a concept- or a term-oriented perspective [65,150]. Other standpoints, framed for instance in a sociocognitive context, attempt to take into account both the principles of stability, univocity (“one form for one meaning”) and the synchronic term–concept relationship from traditional terminology, and the need to understand and interpret the world and language in their dynamics as they change over time, applying more flexible tools, such as prototype theory, when analysing semantic change in a specialised domain [176, pp. 126, 130].
Diachronic change at the level of pragmatics requires special treatment, as it is context-specific. In the analysis of diachronic change of discourse markers, it should first be noted that the notion of discourse marker was introduced by Schiffrin [160], who considered phrases such as ‘I think’ discourse markers performing the function of discourse management, deictically “either point[ing] backward in the text, forward, or in both directions”. Fraser [56] provided a taxonomy of pragmatic markers drawn from the syntactic classes of conjunctions, adverbials and prepositional phrases, followed by Aijmer [4], who suggested that ‘I think’ is a “modal particle”. Over the last few decades, research on discourse markers has developed into a considerable and independent field that has settled on the term discourse marker [8,57,161].
Diachronic change of discourse markers has also been analysed manually: for example, Waltereit and Detges [185] analysed the development of a Spanish discourse marker.
In addition to linguistic approaches focusing on text linguistics and pragmatics, discourse analysis in a broad sense studies naturally occurring language with reference to socio-related textual characteristics in the humanities and social sciences. According to Foucault, one of the key theorists of discourse analysis, the term “discourse” refers to institutionalized patterns and disciplinary structures concerned with the connection of knowledge and power [44]. Discourse analysis approaches language as a means of social interaction and is related to the social contexts embedding the discourse. Within this framework, the discourse-historical approach (DHA) is of particular interest, as part of the broader field of critical discourse analysis (CDA) that investigates “language use beyond the sentence level” and other “forms of meaning-making such as visuals and sounds” as elements in the “(re)production of society via semiosis” [192]. Thus, based on the principle of “triangulation”, DHA takes into account a variety of datasets, methods, theories and background information to analyse the historical dimension of discursive events and the ways in which specific discourse genres are subject to diachronic change. Recent studies on linguistic change using diachronic corpora and a combination of computational methods, such as word embeddings, and discourse-based approaches argue that a discourse-historical angle can provide a better understanding of the interrelations between language and social, cultural and historical factors, and their change over time [165,184].
LL(O)D formalisms
Having given an overview of different theoretical perspectives on semantic change across numerous disciplines in (digital) humanities-related areas, in this section we look at how some of these perspectives can be modelled as linked data. In particular, we survey possible modalities for formally representing the evolution of word meanings and their related concepts over time within an LL(O)D and Semantic Web framework (also in connection with block 2, Fig. 1). In Section 4.1, we will look at the OntoLex-Lemon model for representing lexicon-ontologies as linked data. This model is useful for representing the relationship between a lexicon and a set of concepts, something that is relevant for both the knowledge-oriented and the language-oriented approaches mentioned in Section 3. Next, in Section 4.2, we look at the representation of etymologies or word histories in linked data, as these are particularly useful in language-oriented approaches to semantic change. Afterwards, in Section 4.3, we look at how to explicitly represent diachronic relations in RDF; this is useful for any situation in which we have to model dynamic information, is relevant to both of the general approaches in Section 3 and is not limited only to linked data. Finally, in Section 4.4, we look at resources for representing temporal information in linked data.
The OntoLex-Lemon model
OntoLex-Lemon [116] is the most widely used model for publishing lexicons as linked data. As regards its modelling of the semantics of words, it represents the meaning of any given lexical entry “by pointing to the ontological concept that captures or represents its meaning” (Lexicon Model for Ontologies: Community Report, 10 May 2016, w3.org).

OntoLex-Lemon core model.
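To make the core pattern concrete, the following is a minimal sketch, assuming rdflib; the entry “bank”, its sense and the DBpedia target are illustrative choices, not examples taken from the specification.

```python
# A minimal OntoLex-Lemon sketch with rdflib: a lexical entry, its
# canonical form, and a sense whose meaning is given by pointing to an
# ontological concept via ontolex:reference.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
EX = Namespace("http://example.org/lexicon/")

g = Graph()
g.bind("ontolex", ONTOLEX)

entry, form, sense = EX["bank-n"], EX["bank-n-form"], EX["bank-n-sense1"]

g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
g.add((entry, ONTOLEX.canonicalForm, form))
g.add((form, RDF.type, ONTOLEX.Form))
g.add((form, ONTOLEX.writtenRep, Literal("bank", lang="en")))
g.add((entry, ONTOLEX.sense, sense))
g.add((sense, RDF.type, ONTOLEX.LexicalSense))
# The semantics of the entry: the sense refers to an ontology entity.
g.add((sense, ONTOLEX.reference, URIRef("http://dbpedia.org/resource/Bank")))

print(g.serialize(format="turtle"))
```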
OntoLex-Lemon also allows users the possibility of modelling
To summarise, OntoLex-Lemon offers users a model for representing, in linked data, the relationship between a lexical sense and an ontological entity, that is, between the lexical and conceptual, or more broadly speaking, the linguistic and conceptual aspects of meaning. (Note that ontologies are usually described as formal, explicit specifications of a shared conceptualisation.)
Another OntoLex-Lemon class for modelling meaning is ontolex:LexicalConcept, which represents units of thought that can be lexicalised by one or more lexical entries.
Work on a Frequency, Attestation and Corpus Information module (FrAC) for OntoLex-Lemon is underway in the OntoLex W3C group [31]. This module, once finished, will enable the addition of corpus-related information to lexical senses, including information pertaining to word embeddings.
One important source of information on semantic shifts is etymologies. These are defined as word histories and include descriptions of both the linguistic drifts and the cultural shifts described by Kutuzov et al. and by other (language-related) approaches discussed in Section 3.2. They can also be used in some of the knowledge-oriented approaches mentioned in Section 3.1, such as that of Richter.
Current work in modelling etymology in LL(O)D was preceded and influenced by similar work in related standards such as the Text Encoding Initiative (TEI) and the Lexical Markup Framework (LMF). This includes notably Salmon-Alt’s LMF-based approach to representing etymologies in lexicons [157], as well as Bowers and Romary’s [26] work, which builds on already existing TEI provisions for encoding etymologies.
Work on the representation of etymologies in RDF includes de Melo’s [38] work on the Etymological Wordnet.
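Since the exact vocabularies differ across these proposals, the following rdflib sketch uses a hypothetical ety: namespace; the property names are illustrative placeholders, not the published terms of [38] or of the other models mentioned here.

```python
# Illustrative only: "ety:" is a hypothetical stand-in for an etymology
# vocabulary of the kind discussed above; the property names are
# placeholders, not published terms.
from rdflib import Graph, Literal, Namespace

ETY = Namespace("http://example.org/etymology#")
LEX = Namespace("http://example.org/lexicon/")

g = Graph()
g.bind("ety", ETY)

# English "cheese" traced back through Old English "cese" to Latin "caseus".
g.add((LEX["cheese-en"], ETY.etymologicalOrigin, LEX["cese-ang"]))
g.add((LEX["cese-ang"], ETY.etymologicalOrigin, LEX["caseus-la"]))
g.add((LEX["caseus-la"], ETY.language, Literal("la")))

print(g.serialize(format="turtle"))
```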
Etymological datasets in LL(O)D include the Latin-based etymological lexicon published as part of the LiLa project and described in [114].
Representing diachronic relations
We have thus far looked at ways of representing, in RDF, information about lexicons and the concepts which they lexicalise, information that is salient for both knowledge-oriented and language-oriented approaches. However, as argued by [92], to be able to represent changes in the meaning of concepts, as well as the concepts themselves, within the framework of the OntoLex-Lemon model, it would be useful to be able to add temporal parameters to (at least) the properties that link lexical senses to ontological entities. One solution is to extend RDF itself with temporal annotations; the authors of work in this direction are able to show that their entailment for temporal RDF graphs does not lead to an asymptotic increase in complexity.
In terms of the second solution, there are numerous design patterns for adding temporal information to RDF and permitting temporal reasoning over RDF graphs without adding extra constructs to the language. We will look very briefly at a few of the most prominent of these. We refer the reader to [60] for a more detailed survey.
The first pattern we will look at is to reify the relation in question, that is, turn it into an object; this was proposed by the W3C as a general strategy for representing relations with an arity greater than 2. According to this pattern, we can turn a property such as ontolex:reference into a class whose instances stand for individual, temporally qualified statements of the relation, as sketched below.
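The following rdflib sketch illustrates this reification pattern combined with OWL-Time (discussed further below): the sense-to-concept reference becomes an object that carries a validity interval. The class ex:TemporalReference and the ex: properties are ad hoc illustrations, not OntoLex-Lemon terms, and the word “awful” and its dates are invented for the example.

```python
# A sketch of the reification pattern: the sense-to-concept reference
# becomes an object carrying a validity interval encoded with OWL-Time.
from rdflib import Graph, Literal, Namespace, RDF, XSD

TIME = Namespace("http://www.w3.org/2006/time#")
EX = Namespace("http://example.org/")

g = Graph()
ref, interval = EX["reference-1"], EX["interval-1"]

g.add((ref, RDF.type, EX.TemporalReference))
g.add((ref, EX.hasSense, EX["awful-sense1"]))       # first argument
g.add((ref, EX.hasReference, EX["AweInspiring"]))   # second argument
g.add((ref, EX.validDuring, interval))              # temporal qualification

# The validity period, encoded as an OWL-Time interval with two instants.
g.add((interval, RDF.type, TIME.Interval))
for prop, year in [(TIME.hasBeginning, "1600"), (TIME.hasEnd, "1800")]:
    instant = EX[f"instant-{year}"]
    g.add((interval, prop, instant))
    g.add((instant, RDF.type, TIME.Instant))
    g.add((instant, TIME.inXSDgYear, Literal(year, datatype=XSD.gYear)))

print(g.serialize(format="turtle"))
```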
Other prominent patterns instead treat the related entities themselves as time-bound, for instance by modelling them as perdurants with temporal parts.
The most well known linked data resource for encoding temporal information is the OWL-Time ontology [78]. As of March 2020, it is a W3C Candidate Recommendation. OWL-Time allows for the encoding of temporal facts in RDF, both according to the Gregorian calendar as well as other temporal reference systems, including alternative historical and religious calendars. It includes classes representing time instants and time intervals as well as a provision for representing topological relationships among intervals and instants and in particular those included in the Allen temporal interval algebra [5]. This allows for reasoning to be carried out over temporal data that uses the Allen properties, in conjunction with an appropriate set of OWL axioms and SWRL rules, such as those described in [14].
Other useful resources that should be mentioned here include PeriodO, a gazetteer of scholarly definitions of historical periods.
Given the possibilities described above for modelling semantic change via LL(O)D formalisms, we will address the question of automatically capturing such changes in word meaning (block 3, Fig. 1) by analysing diachronic corpora available in electronic format. This section provides an overview of existing methods and NLP tools for the exploration and detection of lexical semantic change in large sets of data, e.g. related to diachronic word embeddings, named entity recognition (NER) and topic modelling.
Overview
The past decade has seen a growing interest in computational methods for lexical semantic change detection. This interest has spanned different communities, including NLP and computational linguistics, information retrieval, digital humanities and computational social sciences. A number of different approaches have been proposed, ranging from topic-based models [36,58,106] and graph-based models [127,173] to word embeddings [13,47,74,83,96,100,155,170]. [171,174], and [102] provide comprehensive surveys of this research up to 2018. Since then, the field has advanced even further [46,101,136,164].
In spite of this rapid growth, it was only in 2020 that the first standard evaluation task and data were created: [162] present the results of the first SemEval shared task on unsupervised lexical semantic change detection.
NLP challenges
Detecting lexical semantic change via NLP implies a series of challenges, related to the digitisation, preparation and processing of data, as discussed below.
Applying NLP tools such as POS taggers, syntactic parsers and named entity recognisers to historical texts is difficult, because most existing NLP tools are developed for modern languages [118,140], and historical language use often differs significantly from its modern counterpart. The two often differ in linguistic aspects such as lexicon, morphology, syntax and semantics, which makes naive use of these tools problematic [144,159]. One of the most prevalent differences is spelling variation, and the detection of spelling variants is an essential preliminary step for identifying lexical semantic change. A frequently suggested solution to the spelling variation issue is normalisation, generally described as the mapping of historical variant spellings onto a single, contemporary “normal form”.
Recently, Bollmann [21] systematically reviewed automatic historical text normalisation, dividing the research into six conceptual or methodological approaches. In the first approach, each historical variant is looked up in a compiled list that maps it to its expected normalisation. Although this method does not generalise to variants not included in the list, it has proved highly successful as a component of several other normalisation systems [12,20]. The second approach is rule-based: it aims to encode regularities in spelling variation in the form of substitution rules, usually including context information to distinguish between different character uses. This approach has been applied to various languages including German [23], Basque and Spanish [143], Slovene [50], and Polish [82]. The third approach is based on edit distance measures, which are used to compare historical variants to modern lexicon entries [20,87,139]. Normalisation systems often combine several of these first three approaches [1,12,139,180]. The fourth approach is statistical: it models normalisation as a probability optimisation task, maximising the probability that a certain modern word is the normalisation of a given historical word. The statistical approach has been applied as a noisy channel model [50,134], but more commonly as character-based statistical machine translation (CSMT) [43,138,158], where the historical word is “translated” as a sequence of characters. The fifth approach is based on neural network architectures, where the encoder–decoder model with recurrent layers is the most common [22,53,73,98,149]; the encoder–decoder model is the logical neural counterpart of the CSMT model. Other works modelled the normalisation task as a sequence labelling problem and applied long short-term memory networks (LSTMs) [10,24], and convolutional networks have also been used for lemmatisation [88]. In the sixth approach, Bollmann [21] included models that use context from the surrounding tokens to perform normalisation [110,126]. Bollmann [21] also compares and analyses the performance of three freely available tools, covering all types of proposed normalisation approaches, on eight languages; the datasets and scripts are publicly available.
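As an illustration of the third, edit-distance family of approaches, the following minimal sketch maps a historical variant to its most similar entry in a toy modern lexicon; both the lexicon and the variants are invented for the example.

```python
# A minimal sketch of edit-distance-based historical spelling
# normalisation: each variant is mapped to the closest modern lexicon
# entry by character-level similarity.
from difflib import SequenceMatcher

modern_lexicon = ["very", "have", "heaven", "evil", "ever"]

def normalise(variant: str, lexicon=modern_lexicon) -> str:
    # Pick the lexicon entry with the highest character-level similarity.
    return max(lexicon, key=lambda w: SequenceMatcher(None, variant, w).ratio())

print(normalise("verie"))   # -> "very"
print(normalise("heauen"))  # -> "heaven"
```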
Other studies in detecting lexical semantic change have pointed out different types of challenges. For instance, in their analysis of markers of semantic change and leadership in semantic innovation, using diachronic word embeddings and two corpora containing scientific articles and legal opinions (covering, respectively, periods from the 20th and the 18th century to the present), [166] reported difficulties posed by names and abbreviations in identifying genuine candidates of semantic innovation. They applied a series of post-processing heuristics to alleviate these problems: training a feed-forward neural network and using a pre-trained tagger to label names and proper nouns or to detect abbreviations under a certain frequency threshold, and discarding them from the list of candidates.
[128] addressed the scalability and interpretability issues observed when semantic change detection relies on clustering all of a word’s contextual embeddings over large datasets, issues mainly related to high memory consumption and computation time. The authors used a pre-trained BERT model (see Section 5.5) to detect word usage change in a set of multilingual corpora (in German, English, Latin and Swedish) of COVID-19 news from January to April 2020. To improve scalability, they limited the number of contextual embeddings kept in memory for a given word and time slice by merging highly similar vectors. The most changing words were identified according to divergence and distance measures of usage computed between successive time slices. The most discriminating items from the clusters of usage corresponding to these words were then used by the researchers and domain experts in the interpretation of results.
Another type of challenge is that of assessing the impact of OCR (Optical Character Recognition) quality on downstream NLP tasks, including the combined effects of time, linguistic change and OCR quality when using tools trained on contemporary languages to analyse historical corpora. [182] performed a large-scale analysis of the impact of OCR errors on NLP applications such as sentence segmentation, named-entity recognition (NER), dependency parsing and topic modelling. They used datasets drawn from historical newspaper collections and based their tests and evaluation on OCR’d and human-corrected versions of the same texts. Their results showed that the performance of the examined NLP tasks was affected to various degrees as OCR quality decreased, with NER progressively degrading and topic modelling diverging from the “ground truth”. The study demonstrated that the effects of OCR errors on this type of application are still not fully understood, and highlighted the importance of rigorous heuristics for measuring OCR quality, especially when heritage documents and a temporal dimension are involved.
Named-entity recognition and named-entity linking
Named-entity recognition (NER) and named-entity linking (NEL), which allow organisations to enrich their collections with semantic information, have increasingly been embraced by the digital humanities (DH) community. For many NLP-based systems, identifying named-entity changes is crucial, since failure to recognise the various names referring to the same entity greatly affects their efficiency. Temporal NER has mostly been studied in the context of historical corpora. Various NER approaches have been applied to historical texts, ranging from early rule-based approaches [25,67,89] through unsupervised statistical approaches [172] and conventional machine learning approaches [3,111,130] to deep learning approaches [79,105,146,163,167]. Named-entity disambiguation (NED) has also been investigated, and Agarwal et al. [2] introduced the first time-aware method for NED on diachronic corpora.
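As a minimal illustration of the deep learning end of this spectrum, the following sketch runs a pre-trained transformer NER pipeline over a sentence; applying such modern models to historical text typically requires fine-tuning on period-specific annotated data, as in the shared task discussed below. The pipeline uses the Hugging Face default English NER model, an assumption made for illustration.

```python
# A minimal NER sketch using a pre-trained transformer pipeline.
from transformers import pipeline

# Loads the library's default English NER model.
ner = pipeline("token-classification", aggregation_strategy="simple")

sentence = "Samuel Hartlib corresponded with scholars across Europe from London."
for entity in ner(sentence):
    # Each result groups subword tokens into a labelled entity span.
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
```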
Different eras, domains and typologies have been investigated, so comparing different systems or algorithms is difficult. Thus, [48] recently introduced the first edition of HIPE (Identifying Historical People, Places and other Entities), a pioneering shared task dedicated to the evaluation of named entity processing on historical newspapers in French, German and English [49]. One of its subtasks is named entity linking (NEL), which requires linking each named entity to a particular referent in the knowledge base (KB) (Wikidata), or to a NIL node if the entity is not included in the base.
Traditionally, NEL has been addressed with two main approaches: text similarity-based and graph-based. Both of these approaches have been adapted to historical domains, mostly as ‘off-the-shelf’ NEL systems. While some of the previous works perform NEL using the KB’s unique ids [49,154], other works use LL(O)D formalisms [27,40,59,181]. One of the aims of the HIPE shared task was to encourage the application of neural-based approaches to NER, which had not yet been applied to historical texts; this aim was achieved successfully. Teams experimented with various entity embeddings, including classical type-level word embeddings and contextualised embeddings such as BERT (see Section 5.5). The manual annotation guidelines of the HIPE corpus were derived from the Quaero annotation guide [153], and thus the HIPE corpus remains largely compatible with the NewsEye project’s Finnish, French, German and Swedish NE datasets.
Social media communication platforms such as Twitter, with their informal, colloquial and non-standard language, have led to major changes in the character of written languages. Therefore, in recent years, there has been research interest in NER for social media diachronic corpora. Rijhwani and Preoţiuc-Pietro [147] introduced a new dataset of 12,000 English tweets annotated with named entities. They examined and offered strategies for improving the utilisation of temporally-diverse training data, focused on NER. They empirically illustrated how temporal drift affects performance and how time information in documents can be leveraged to achieve better models.
A common approach to lexical semantic change detection is based on vector-space meaning representations. Each term is represented by two vectors capturing its co-occurrence statistics in two different eras. Semantic change is then usually quantified with a distance metric between the two vectors (e.g. cosine distance), or by differences in their contextual dispersion, as sketched below.
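A minimal sketch of this comparison, with toy vectors standing in for corpus-derived representations of a word in two eras:

```python
# Semantic change estimated as cosine distance between two era-specific
# vectors of the same word. The toy vectors are illustrative; in practice
# they come from co-occurrence counts or trained embeddings.
import numpy as np

def cosine_distance(v1: np.ndarray, v2: np.ndarray) -> float:
    return 1.0 - float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

vec_era1 = np.array([0.9, 0.1, 0.0])  # dominated by one set of contexts
vec_era2 = np.array([0.2, 0.1, 0.9])  # shifted towards a new dominant context

print(f"semantic change score: {cosine_distance(vec_era1, vec_era2):.3f}")
```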
Previously, most methods for lexical semantic change detection built co-occurrence matrices [70,84,109]. In some cases, high-dimensional sparse matrices were used directly; in other cases, the dimensionality of the matrices was reduced, mainly using singular value decomposition (SVD) [156]. In the last decade, however, with the development of neural networks, the word embedding approach has largely replaced these matrix-based approaches to dimensionality reduction.
Word embedding is a collective name for neural network-based approaches in which words are embedded into a low-dimensional space. They are used as a lexical representation for textual data, where words with a similar meaning have similar representations [19,124,125,135]. Although these representations have been used successfully for many natural language processing and understanding tasks, they cannot deal with the semantic drift that appears with the change of meaning over time if they are not specifically trained for this task; a common remedy, used among others in [74], is to train a separate space per time slice and align the spaces before comparison, as sketched below.
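A minimal sketch of this alignment step, assuming two embedding matrices over a shared, identically ordered vocabulary; random matrices stand in for real epoch-specific embeddings.

```python
# Orthogonal Procrustes alignment between two independently trained
# embedding matrices (rows = the shared vocabulary, in the same order).
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
emb_epoch1 = rng.normal(size=(1000, 100))  # embeddings trained on epoch 1
emb_epoch2 = rng.normal(size=(1000, 100))  # embeddings trained on epoch 2

# Find the rotation R minimising ||emb_epoch1 @ R - emb_epoch2||_F.
R, _ = orthogonal_procrustes(emb_epoch1, emb_epoch2)
aligned_epoch1 = emb_epoch1 @ R

# After alignment, per-word cosine distances between aligned_epoch1 and
# emb_epoch2 can be read as semantic change scores.
```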
In [64], a new unsupervised model for learning condition-specific embeddings is presented, which encapsulates the word’s meaning whilst taking into account temporal-spatial information. The model is evaluated using the degree of semantic change, the discovery of semantic change, and the semantic equivalence across conditions. The experimental results show that the model captures the language evolution across both time and location, thus making the embedding model sensitive to temporal-spatial information.
Another word embedding approach, for tracing the dynamics of change of conceptual semantic relationships in a large diachronic scientific corpus, is proposed in [16]. The authors focus on the increasing domain-specific terminology emerging from scientific fields and propose to use hyperbolic embeddings [131] to map partial graphs into low-dimensional, continuous hierarchical spaces, making the latent structure of the input more explicit. Using this approach, the authors built diachronic semantic hyperspaces for four scientific topics (i.e., chemistry, physiology, botany, and astronomy) over a large historical English corpus stretching over 200 years. The experiments show that the resulting spaces exhibit a growing hierarchisation of concepts, both in terms of their inner structure and in a lightweight comparison with a contemporary semantic resource, i.e., WordNet.
To deal with the evolution of word representations through time, the authors of [178] propose three LSTM-based sequence-to-sequence (Seq2Seq) models (i.e., a word representation autoencoder, a future word representation decoder, and a hybrid approach combining the autoencoder and decoder) that measure the level of semantic change of a word by tracking its evolution through time in a sequential manner. Words are represented using the word2vec skip-gram model [124]. The level of semantic change of a word is evaluated using the average cosine similarity between the actual and the predicted word representations through time. The experiments show that the hybrid approach yields the most stable results. The paper concludes that the performance of the models increases with the duration of the time period studied.
Word embeddings are also used to capture semantic distortions in textual corpora. In [188], the authors propose a new method to determine paradigmatic (i.e., a term can be replaced by a word) and syntagmatic association (i.e., the co-occurrence of terms) shifts. The study employs three real-world datasets, i.e., Reddit, Amazon, and Wikipedia, with texts collected between 1996 and 2018 for the experiments. Among the measures analysed is the local neighbourhood measure [75], which detects shifts via changes in a word’s set of nearest neighbours.
Transformer-based language models
The current state of the art in word representation for multiple well-known NLP tasks is established by transformer-based pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers) [42] and XLNet [193], alongside earlier contextual models such as ELMo [137]. Recently, transformers have also been used for lexical semantic change tasks. In [62], the authors present one of the first unsupervised approaches to lexical semantic change that utilises a transformer model. Their solution uses the BERT transformer model to obtain contextualised word representations, computes usage representations for each occurrence of target words, and measures their semantic shifts over time. For evaluation, the authors utilise a large diachronic English corpus that covers two centuries of language use. The authors provide an in-depth analysis of the proposed model, showing that it captures a range of synchronic phenomena (e.g., syntactic functions, literal and metaphorical usage) as well as diachronic linguistic aspects. In [86], different clustering methods are applied to contextualised BERT word embeddings to quantify the level of semantic shift for target words in four languages, i.e., English, Latin, German, Swedish. The proposed solutions outperform baselines based on normalised frequency difference or cosine distance methods.
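The following sketch illustrates the general recipe behind such approaches, under simplifying assumptions: the standard bert-base-uncased model, naive subword matching, and simple averaging of usage vectors rather than the clustering used in the cited works.

```python
# Contextualised usage representations: one BERT vector per occurrence of
# a target word in each period, averaged and compared across periods.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def usage_vector(sentences: list[str], target: str) -> torch.Tensor:
    vecs = []
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        # Naive matching: keep subword tokens that belong to the target word.
        positions = [i for i, t in enumerate(tokens) if t.lstrip("#") in target]
        if positions:
            vecs.append(hidden[positions].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

v_old = usage_vector(["The awful majesty of the cathedral."], "awful")
v_new = usage_vector(["The traffic this morning was awful."], "awful")
shift = 1 - torch.cosine_similarity(v_old, v_new, dim=0).item()
print(f"usage shift: {shift:.3f}")
```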
Topic modelling
Topic modelling is another category of methods proposed for the study of semantic change. Topic modelling often refers to latent Dirichlet allocation (LDA) [18], a probabilistic technique for modelling a corpus by representing each document as a mixture of topics and each topic as a distribution over words. LDA is referred to either as an element of comparison or as a basis for further extensions that take into account the temporal dimension of word meaning evolution. Frermann and Lapata [58] draw ideas from such an extension, the dynamic topic modelling approach [17], to build a dynamic Bayesian model of Sense ChANge (SCAN) that defines word meaning as a set of senses tracked over a sequence of contiguous time intervals. In this model, senses are expressed as a probability distribution over words and, given a word, its senses are inferred for each time interval. According to [58], SCAN is able to capture the evolution of a word’s meaning over time and detect the emergence of new senses, variation in sense prevalence, and changes within individual senses such as meaning extension, shift or modification. Frermann and Lapata validate their findings against WordNet and evaluate the performance of their system on the SemEval-2015 benchmark datasets released as part of the diachronic text evaluation task.
Pölitz et al. [141] compare standard LDA [18] with the continuous-time topic model [187] (called “topics over time LDA” in the paper) for the task of word sense induction (WSI), intended to automatically find possible meanings of words in large textual datasets. The method uses lists of key words in context (KWIC) as documents, and is applied to two corpora: the 20th-century core corpus of the Digital Dictionary of the German Language (DWDS) and the Die Zeit newspaper corpus covering the issues of the German weekly from 1946 to 2009. The paper concludes that standard LDA can be used, to a certain degree, to identify novel meanings, while topics over time LDA can make clearer distinctions between senses but may sometimes result in too strict representations of meaning evolution.
[36,106] apply the hierarchical Dirichlet process technique [175], a non-parametric variant of LDA, to detect word senses that are not attested in a reference corpus and to identify novel senses found in a corpus but not captured in a word sense inventory. The two studies include experiments with various datasets, such as selections from the BNC corpus (British English from the late 20th century), the ukWaC Web corpus (built from the .uk domain in 2007), the SiBol/Port collection (texts from several British newspapers from 1993, 2005 and 2010) and domain-specific corpora such as sports and finance. Another example is [120], which applies topic modelling to the corpus of Hartlib Papers, a multilingual collection of correspondence and other papers of Samuel Hartlib (c.1600–1662) spanning the period from 1620 to 1662, in order to identify changes in the topics discussed in the letters. The authors then experimented with using topic modelling to detect semantic change, following the method developed in [77].
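A minimal sketch of the basic per-time-slice topic-modelling recipe underlying such studies, assuming gensim and a toy two-slice corpus; real studies use far larger corpora, more topics, and temporal extensions of LDA.

```python
# One LDA model per time slice: comparing the topics in which a word
# participates across slices hints at changes in its dominant senses.
from gensim import corpora
from gensim.models import LdaModel

slices = {
    "1850-1900": [["bank", "river", "water"], ["bank", "shore", "fishing"]],
    "1950-2000": [["bank", "money", "loan"], ["bank", "credit", "account"]],
}

for period, docs in slices.items():
    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(bow, num_topics=1, id2word=dictionary, random_state=0)
    print(period, lda.print_topics())
```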
Based on these overviews and state of the art, we can say that automatic lexical semantic change detection is not yet a solved task in NLP, but a good amount of progress has been achieved and a great variety of systems have been developed and tested, paving the way for further research and improvements. An important aspect to stress is that this research has rarely reached outside the remit of NLP, and relatively few applications have involved humanities research (e.g., [121,165,184]). This is not particularly surprising, as it usually takes time for foundational research to find its way into application areas. However, as pointed out before (cf. [119]), given the high relevance of semantic change research for the analysis of concept evolution, this lack of disciplinary dialogue and exchange is a limiting factor and we hope that it will be addressed by future multidisciplinary research projects.
NLP for generating ontological structures
While automatic detection of lexical semantic change has advanced in recent years despite a still insufficient interdisciplinary dialogue, the field of generating ontologies from diachronic corpora and representing them as linked data on the Web also needs further development of multidisciplinary approaches and exchanges, given the inherent complexity of the work involved. In this section, we discuss the main aspects pertaining to this type of task (block 4, Fig. 1), taking account of previous research in areas such as ontology learning, the construction of diachronic ontological structures from texts and the automatic generation of linked data.
Ontology learning
Iyer et al. [81] survey the various approaches for (semi-)automatic ontology extraction and enrichment from unstructured text, covering research papers from 1995 to 2018. They identify four broad categories of algorithms (similarity-based clustering, set-theoretic approaches, Web corpus-based approaches and deep learning) allowing for different types of ontology creation and updating, from clustering concepts in a hierarchy to learning and generating ontological representations for concepts, attributes and attribute restrictions. The authors perform an in-depth analysis of four “seminal algorithms” representative of each category (guided agglomerative clustering, C-PANKOW, formal concept analysis and word2vec) and compare them using ontology evaluation measures such as contextual relevance, precision and algorithmic efficiency. They also propose a deep learning method based on LSTMs to tackle the problem of filtering out irrelevant data from corpora and to improve the relevance of retained concepts in a scalable manner.
Asim et al. [7] base their survey on the so-called “ontology learning layer cake” (introduced by Buitelaar et al. [28]), which illustrates the step-wise process of ontology acquisition, starting with the extraction of terms and synonyms at the bottom layers and moving up through the discovery of concepts, concept hierarchies and relations towards the definition of axioms at the top.
He et al. [76] use the ontology learning layer cake framework and a diachronic corpus in Chinese (the People’s Daily Corpus), spanning from 1947 to 1996, to construct a set of diachronic ontologies by year and period. Their ontology learning system deals only with the four bottom layers of the ‘cake’ (see also [7] and [28] above), covering term extraction, synonymy recognition, concept discovery and hierarchical concept clustering. The first layer is built by segmenting and part-of-speech (POS) tagging the raw text using a hierarchical hidden Markov model (HHMM) for Chinese lexical analysis [194] and retaining all the words except stopwords and low-frequency items. For synonymy detection, He et al. apply a distributional semantic model taking into account both lexical and syntactic contexts to compute the similarity between two terms, a method already utilised in diachronic corpus analysis in [195]. Cosine similarity and Kleinberg’s “hubs and authorities” methodology [97] are used to group terms and synonyms into concepts and to select the two terms with the highest authority as semantic tags or labels for the concepts. An iterative K-means algorithm [112] is adopted to create a hierarchy of concepts with highly semantically associated clusters and sub-clusters. He et al. employ this four-step approach to build yearly and period-based diachronic XML ontologies for the considered corpus and evaluate concept discovery and clustering by comparing their results with a baseline computed via a Google word2vec implementation. The authors report that the proposed method outperformed the baseline in both concept discovery and hierarchical clustering, and that their diachronic ontologies were able to capture semantic changes of a term through comparison of its neighbouring terms or clusters at different points in time, and to detect the appearance of new topics in a specific era. [76] also provides examples of diachronic analysis based on the ontologies derived from the studied corpus, such as shifts in meaning from one domain to another, semantic change leading to polysemy, and the emergence of new similar terms as a result of real-world phenomena occurring in the period covered by the considered textual sources.
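A minimal sketch of the concept-clustering step in such pipelines, with toy distributional vectors and plain K-means standing in for the corpus-derived representations and the iterative clustering described above:

```python
# Terms from one time slice, represented by toy 2-d distributional
# vectors, are grouped into concept clusters with K-means.
import numpy as np
from sklearn.cluster import KMeans

terms = ["train", "railway", "steam", "vote", "election", "ballot"]
vectors = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.7, 0.1],   # transport-related contexts
    [0.1, 0.9], [0.2, 0.8], [0.1, 0.7],   # politics-related contexts
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for term, label in zip(terms, kmeans.labels_):
    print(f"concept-{label}: {term}")
```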
Other papers have addressed the question of conceptualising semantic change using NLP techniques and diachronic corpora [16,69,152], with various degrees of ontological formalisation.
Focusing on the way conceptual structures and the hierarchical relations among their components evolve over time, Bizzoni et al. [16] explore the direction of using hyperbolic embeddings for the construction of corpus-induced diachronic ontologies (see also Section 5.4). Using the Royal Society Corpus, with a time span from 1665 to 1869, as their dataset, they show that such a method can detect symptoms of hierarchisation and specialisation in scientific language. Moreover, they argue that this type of technology may offer a (semi-)automatic alternative to hand-crafted historical ontologies, which require a considerable amount of human expertise and skill to build hierarchies of concepts based on the beliefs and knowledge of a different time.
In their analysis of changing relationships in temporal corpora, Rosin and Radinsky [152] propose several methods for constructing timelines that support the study of evolving languages. The authors introduce the task of timeline generation, which implies two components: one for identifying “turning points”, i.e. points in time when the target word underwent significant semantic changes, and the other for identifying associated descriptors, i.e. words and events, that explain these changes in relation to real-world triggers. Their methodology includes techniques such as “peak detection” in time series and “projected embeddings”, in order to define the timeline turning points and create a joint vector space for words and events representing a specific time period. Different approaches are tested to compare vector representations of the same word or select the most relevant events causing semantic change over time, such as orthogonal Procrustes [74], similarity-based measures, and supervised machine learning (random forest, SVM and neural networks). After assessing these methods on datasets from Wikipedia, the New York Times archive and DBpedia, Rosin and Radinsky conclude that the best results are yielded by a supervised approach leveraging the projected embeddings, and that the main factors affecting the quality of the created timelines are word ambiguity and the available amount of data and events related to the target word. Although [152] does not explicitly address ontology acquisition as a whole, automatic timeline generation provides insight into the modalities of detecting and conceptualising semantic change and word–event–time relationships that may serve the task of corpus-based diachronic ontology generation.
Gulla et al. [69] use “concept signatures”, i.e. representations constructed automatically from textual descriptions of existing concepts, to capture semantic changes of concepts over time. A concept signature is represented as a vector of weights. Each element in the vector corresponds to a linguistic unit or term (e.g. a noun or noun phrase) extracted from the textual description of the concept, with its weight calculated as a tf-idf (term frequency – inverted document frequency) score. The process of signature building includes POS tagging, stopword removal, lemmatisation, noun/phrase selection and tf-idf computation for the selected linguistic units. According to Gulla et al., this type of vector representation enables comparisons via standard information retrieval measures, such as cosine similarity and Euclidean distance, which can uncover semantic drift of concepts in the ontology with respect to real-world phenomena. (Their case study concerns a company specialising in risk management and certification.)
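A minimal sketch of the concept-signature idea, assuming scikit-learn: tf-idf vectors are built from two textual descriptions of the same concept at different times and compared with cosine similarity. The descriptions are invented for illustration.

```python
# Concept signatures as tf-idf vectors over concept descriptions; drift
# is read off the cosine similarity between two time-specific signatures.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

desc_1990 = "A telephone is a device transmitting speech over wires."
desc_2020 = "A telephone is a mobile device for calls, messaging and apps."

vec = TfidfVectorizer(stop_words="english")
signatures = vec.fit_transform([desc_1990, desc_2020])

drift = 1 - cosine_similarity(signatures[0], signatures[1])[0, 0]
print(f"signature drift: {drift:.3f}")
```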
The transformation of the extracted information into formal descriptions that can be published as linked data on the Web is an important aspect of the process of ontology generation from textual sources, and a number of tools have been devised to implement an integrated workflow for extracting concepts and relations and converting the derived ontological structure into Semantic Web formalisations. While the two subsections above provided an overview of various approaches for corpus-based production of ontologies and ontological constructs including a temporal dimension, this subsection focuses on means for making the generated output available on the Web in a structured and re-usable format. Three categories of tools dedicated to such tasks are discussed: tools for extracting information and linking entities to available ontologies on the Web; tools for learning ontologies and translating the resulting models into Semantic Web representations; and tools for performing shallow conversion to RDF.
An example from the first category is LODifier [9], which combines different NLP techniques for named entity recognition, word sense disambiguation and semantic analysis to extract entities and relations from text and produce RDF representations linked to the LOD cloud using DBpedia and WordNet 3.0 vocabularies. The tool was evaluated on an English benchmark dataset containing newspapers, radio and television news from 1998.
From the second category, OntoGain [45] is a platform for unsupervised ontology acquisition from unstructured text. Its concept identification module is based on C/NC-value [55], a method that enables the extraction of multi-word and nested terms from text. For the detection of taxonomic and non-taxonomic relations, [45] applies techniques such as agglomerative hierarchical clustering and formal concept analysis for the first task, and association rules and conditional probabilities for the second. OntoGain allows for the transformation of the resulting ontology into standard OWL statements. The authors report an assessment including experiments with corpora from the medical and computer science domains, and comparisons with hand-crafted ontologies and similar applications such as Text2Onto.
Concept-Relation-Concept Tuple-based Ontology Learning (CRCTOL) [85] is a system for automatically mining ontologies from domain-specific documents. CRCTOL adopts various NLP methods, such as POS tagging, multi-word extraction and tf-idf-based relevance measures for concept learning, a variant of Lesk’s algorithm [108] for word sense disambiguation, and WordNet hierarchy processing and full-text parsing for the construction of taxonomic and non-taxonomic relations. The derived ontology is then modelled as a graph, with the possibility of exporting the corresponding representation in RDFS and OWL formats. [85] presents two case studies, building a terrorism domain ontology and a sport event domain ontology, as well as results of quantitative and qualitative evaluations of the tool through comparisons with other systems and assessment references such as Text-To-Onto/Text2Onto, WordNet, expert ratings and human-edited benchmark ontologies.
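For a flavour of the disambiguation step, NLTK ships a simplified Lesk implementation (shown below as a hedged stand-in; CRCTOL uses its own variant of the algorithm) that picks the WordNet synset whose gloss overlaps most with the sentence context:

```python
# Minimal Lesk-style word sense disambiguation with NLTK's built-in
# simplified Lesk (a stand-in for CRCTOL's variant of the algorithm).
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)   # glosses used for overlap scoring

context = "the bank approved the loan for the new branch office".split()
sense = lesk(context, "bank", pos="n")              # best-overlapping noun synset
print(sense, "-", sense.definition() if sense else "no sense found")
```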
One of the systems often cited as a reference in ontology learning from textual resources (see also above) is Text2Onto (the successor of TextToOnto) [35]. Based on the GATE framework [37], it combines linguistic pre-processing (e.g. tokenisation, sentence splitting, POS tagging, lemmatisation) with a JAPE transducer and shallow parsing run on the pre-processed corpus to identify concepts, instances and different types of relations (subclass-of, part-of, instance-of, etc.) to be included in a Probabilistic Ontology Model (POM). The model, independent of any knowledge representation formalism, can then be translated into various ontology representation languages such as RDFS, OWL and F-Logic. The paper also describes a strategy for data-driven change discovery that allows for selective POM updating and traceability of the ontology’s evolution, consistent with changes in the underlying corpus. Evaluation is reported on specific tasks over a collection of tourism-related texts, with results compared against a reference taxonomy for the domain.
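The final translation step can be pictured as follows (a hedged sketch with an invented namespace and toy relations, not Text2Onto's code): learned subclass-of pairs are serialised as RDFS/OWL statements with rdflib.

```python
# Hedged sketch: serialising learned subclass-of relations as RDFS/OWL.
from rdflib import Graph, Namespace, RDF, RDFS, OWL

EX = Namespace("http://example.org/tourism#")                          # illustrative namespace
learned = [("Hotel", "Accommodation"), ("Campsite", "Accommodation")]  # toy POM output

g = Graph()
g.bind("ex", EX)
for child, parent in learned:
    g.add((EX[child], RDF.type, OWL.Class))
    g.add((EX[parent], RDF.type, OWL.Class))
    g.add((EX[child], RDFS.subClassOf, EX[parent]))
print(g.serialize(format="turtle"))
```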
Recent work accounts for more specialised tools from the third category, such as converters that generate linked data in RDF format from, for instance, CSV files (CoW, cattle [123]) or existing lexical resources (LLODifier [33]).
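The gist of such shallow conversion can be sketched as follows (our minimal illustration with an invented file name and namespace; real converters such as CoW add schema metadata and richer mappings): each row becomes a resource and each column a property, with no ontological modelling.

```python
# Hedged sketch of shallow CSV-to-RDF conversion: one resource per row,
# one property per column (file name and namespace are invented).
import csv
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/data/")
g = Graph()
with open("lexicon.csv", newline="", encoding="utf-8") as f:
    for i, row in enumerate(csv.DictReader(f)):
        subject = EX[f"row{i}"]
        for column, value in row.items():
            g.add((subject, EX[column.replace(" ", "_")], Literal(value)))
g.serialize(destination="lexicon.ttl", format="turtle")
```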
In this section (related to block 5, Fig. 1), we outline existing resources on the Web that include diachronic representations of data from the humanities, with a view to integrating more resources of this kind into the LL(O)D cloud in the future.
The main nucleus for linguistic linked open data is the LL(O)D cloud [34].
Not all diachronic datasets are registered through Linghub or the LL(O)D cloud. Within the CLARIAH project in the Netherlands, for example, diachronic resources have been published as linked data without being registered in these hubs.
Also in the Netherlands, the Amsterdam Time Machine connects attestations of Amsterdam dialects and sociolects, cinema and theatre locations and tax information to base maps of Amsterdam at various points in time [132]. A combined resource like this allows scholars to investigate ‘higher’ and ‘lower’ sociolects in conjunction with ‘elite density’ in a neighbourhood (i.e. the proportion of wealthier people that lived in an area). Lexicologists at the Dutch Language Institute have been creating dictionaries of Dutch that cover the period from 500 to 1976 which are now being modelled through OntoLex-Lemon [41].
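To illustrate what such modelling can look like, the sketch below builds a historical dictionary entry in OntoLex-Lemon with rdflib; the entry, URI scheme and use of dcterms:temporal for the attested period are our own illustrative assumptions, not the institute's actual scheme.

```python
# Hedged sketch: a historical dictionary entry in OntoLex-Lemon via rdflib.
from rdflib import Graph, Literal, Namespace, RDF

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
DCT = Namespace("http://purl.org/dc/terms/")
EX = Namespace("http://example.org/dutch-lexicon/")      # illustrative namespace

g = Graph()
g.bind("ontolex", ONTOLEX)
entry, form, sense = EX["wijf"], EX["wijf_form"], EX["wijf_sense1"]
g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
g.add((entry, ONTOLEX.canonicalForm, form))
g.add((form, ONTOLEX.writtenRep, Literal("wijf", lang="nl")))
g.add((entry, ONTOLEX.sense, sense))
g.add((sense, DCT.temporal, Literal("attested 1200-1600")))  # illustrative period annotation
print(g.serialize(format="turtle"))
```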
Searching for and modelling diachronic change requires rethinking some contemporary (Semantic) Web infrastructure. As [177] shows, standardised language tags cannot capture the differences between Old-, Middle- and Modern French resources.
Digital editions, often modelled in TEI [183], are a rich resource for studying diachronic language variation. Some corpora, such as the 15th–19th-century Spanish poetry corpus described in [11], contain additional annotations such as psychological and affective labels, although that study did not focus on how these aspects may have changed over time.
For humanities scholars such as historians, who deal with source materials dating back to, for example, the early modern period, language change is a given, but the knowledge they gain over time is not always formalised or published as linked data. For example, in a project analysing the representation of emotions in plays from the 17th to the 19th century, a dataset and a lexicon were developed, but these were not explicitly linked to the LL(O)D cloud [107,179]. In contrast to [11], here the labels are explicitly grounded in time. There is a task here for the Semantic Web community: making it easier for non-Semantic Web experts to publish and maintain LL(O)D datasets.
It should also be noted that while there are currently no guidelines for publishing lexicons and ontologies representing semantic change as LL(O)D data, there are moves towards producing such material within the COST Action.
This paper presents a literature survey bringing together various fields of research that may be of interest in the construction of a workflow for detecting and representing semantic change (Fig. 1). The state of the art described in the paper also represents the starting point for designing a methodology, based on this workflow, for the humanities use case UC4.2.1 as an application within the COST Action.
At this stage, the reviewed literature and the main surveyed approaches and tools (see Appendix) suggest that the theoretical frameworks (Section 3) and the NLP techniques for detecting lexical semantic change (Section 5) show good levels of development, although certain conceptual and technical difficulties are yet to be overcome. The fields dealing with the generation of diachronic ontologies from unstructured text and their representation through LL(O)D formalisms on the Web (Sections 4, 6 and 7) require further harmonisation with the former, as well as further research investment.
Despite recent advances in creating and publishing linguistic resources on the LL(O)D cloud, and the availability of potentially relevant resources, humanities researchers working on the detection and representation of semantic change as linked data on the Web are still confronted with a series of challenges. These include limitations in representing temporal and dynamic aspects, given the work-in-progress status of some of the applicable Semantic Web technologies, the absence of guidelines for producing diachronic ontologies, and the lack of means to ease the publication and maintenance of data for non-Semantic Web experts. Another point requiring further attention is the need to build connections between the various areas of research involved in the type of task described in this paper. As we have tried to illustrate through the structure of the generic workflow and the discussions within the related sections, the research agenda for attaining this goal should include interdisciplinary approaches and exchanges among the identified fields of study. The results of the survey suggest that there are not yet enough interrelations and explicit connections between these fields, and that the area under investigation would benefit from further developments in this direction.
We expect that, given the current progress in deep learning and digital humanities and the ongoing undertakings in LL(O)D, the detection and representation of semantic change as linked data, combined with the analysis of large datasets from the humanities, will attract the level of attention and dialogue needed to advance this area of study. Detecting and representing semantic change as LL(O)D is an important topic for the future development of Semantic Web technologies, since learning to deal with the knowledge of the past and its evolution over time also implies learning to deal with the knowledge of the future.
Appendix
Main NLP applications for generating (diachronic) ontological and linked data structures surveyed in Section 6
| Task | Applications |
| --- | --- |
| Learning diachronic constructs | Ontologies [16,76]; Timelines [152]; Concept signatures [69] |
| Learning ontologies and producing linked data | OntoGain [45]; CRCTOL [85]; Text2Onto [35] |
| Extracting information and linking entities | LODifier [9] |
| Converting to linked data formats | CoW, cattle [123]; LLODifier [33] |
