Abstract
Glottocodes constitute the backbone identification system for the language, dialect and family inventory Glottolog (
Introduction
Glottocodes constitute the backbone identification system for the language, dialect and family inventory Glottolog (
In the current release, there are 25,900 glottocodes (8,533 language-level, 4,571 family-level and 12,796 dialect-level).
Motivation and history
Glottocodes were introduced in 2010 by Glottolog collaborator Sebastian Nordhoff, in response to the following requirements:
An ID specifically designed for machine readability, not confusable with an informal or human-directed identifier
An ID type oblivious to level of linguistic abstraction (idiolect, sociolect, dialect, language, subfamily, family, etc.)
An ID system for languages that improves on the ISO 639-3 language identifiers in terms of quality, transparency and anchoring
The 8-character long alphanumeric string was designed not to resemble an abbreviation or an easily remembered mnemonic. This was done specifically in order to counter any temptation to capitalize, modify, inflect or translate it, which users might if the ID-string had a more human-palatable appearance (such as a three-letter mnemonic, a standardized name or the like).
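The machine-oriented shape of the identifier can be checked mechanically. The following is a minimal sketch, not an official validator: the text above states only that a glottocode is an 8-character alphanumeric string, so the stricter pattern used here (four lowercase alphanumeric characters followed by four digits) is an assumption, as is the function name.

```python
import re

# Assumption: four lowercase alphanumerics followed by four digits.
# The text itself only guarantees "8-character alphanumeric".
GLOTTOCODE_PATTERN = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def is_wellformed_glottocode(value: str) -> bool:
    """Return True if `value` has the assumed glottocode shape.

    The check is deliberately strict: no case folding and no trimming,
    because the identifier is meant to be used exactly as issued.
    """
    return GLOTTOCODE_PATTERN.fullmatch(value) is not None
```

A check like this only tests the shape of a string; whether a well-formed code actually denotes a languoid can only be decided against a Glottolog release.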
Glottolog had adopted a doculect-based approach. (Footnote 1: The term ‘doculect’ emanates from [2] but the general approach has been used by many earlier authors, notably [10,17,22,26].)
Finally, a decade or more ago, the quality and transparency of the ISO 639-3 standard for languages were problematic and an alternative was clearly needed [11]. ISO 639-32 See
Glottolog was initiated by Harald Hammarström, Sebastian Nordhoff and Martin Haspelmath and is now run by a group of editors.3
As explained above, glottocodes are the identifiers for
In the following, we briefly describe the infrastructure and framework for data curation and publication that were pioneered for the Glottolog data ([5] and [6]).
All data is stored in UTF-8 encoded text files (with consistency ensured by the
Thus, collaboration and curation workflows can make use of
The master copy of the Glottolog
Released versions of the repository are published and archived with Zenodo (
This setup not only provides a stable management system for all information shared between the Glottolog editors, but also a collaborative environment that allows the wider community to be involved. Somewhat similar to – but less formal than – ISO 639-3 change requests, Glottolog users can make use of GitHub issues to indicate errors or request inclusion of new languoids,4 E.g. E.g.
Glottolog aims to share this data in an open and FAIR ([27]) way. Stepping through the FAIR Guiding Principles for scientific data management and stewardship6
Glottolog is a well-established language catalog, as evidenced by more than 600 citations of editions of Glottolog such as “Glottolog 4.0” in the scholarly literature (according to Google Scholar). The Glottolog data repository lists 20 contributors in addition to the Glottolog editors (and not including users opening issues, see
Glottolog data is also well indexed in relevant catalogues: The first point of contact for many users is the Glottolog web application at
All languoids in Glottolog are unambiguously identified via glottocodes. These glottocodes are transparently associated with URLs in the glottolog.org domain, turning them into globally unique identifiers. Each release of Glottolog is identified by the DOI assigned by Zenodo.
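The association between glottocodes and URLs can be illustrated with a small helper. This is a sketch under stated assumptions: the text above says only that the URLs live in the glottolog.org domain, so the path layout in `RESOURCE_URL_TEMPLATE` and both function names are assumptions made for illustration.

```python
import re

# Assumption: path layout of resource URLs. Only the domain is given in the text.
RESOURCE_URL_TEMPLATE = "https://glottolog.org/resource/languoid/id/{glottocode}"

# The text describes glottocodes as 8-character alphanumeric strings.
_SHAPE = re.compile(r"[a-z0-9]{8}")


def resource_url(glottocode: str) -> str:
    """Build the (assumed) resource URL for a glottocode."""
    if _SHAPE.fullmatch(glottocode) is None:
        raise ValueError(f"not an 8-character alphanumeric string: {glottocode!r}")
    return RESOURCE_URL_TEMPLATE.format(glottocode=glottocode)


def glottocode_from_url(url: str):
    """Recover the glottocode from an (assumed) resource URL, or return None."""
    prefix = RESOURCE_URL_TEMPLATE.format(glottocode="")
    if not url.startswith(prefix):
        return None
    candidate = url[len(prefix):]
    return candidate if _SHAPE.fullmatch(candidate) else None
```

Because the mapping is a plain string template, the code and its URL can be converted into each other without a lookup table, which is what makes the identifier globally unique once the domain is fixed.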
CLDF – one of the dissemination formats of Glottolog – is designed to allow for explicit linking of metadata to identifiers. The underlying mechanism to do this is described in [25], and the semantics are provided through the CLDF Ontology (see
Glottolog data is accessible
Zenodo (and the metadata associated by Zenodo with the DOI assigned to data releases) guarantees that data is retrievable using the standard protocol associated with DOIs.
For each languoid, the CLDF/CSVW data associates an HTTP ([4]) URL, which is resolvable via the Glottolog web application.
Glottolog data is interoperable
Glottolog aims at integration with the Semantic Web at large and the Linguistic Linked Data initiative in particular.
At the most fundamental level this means resource URLs – aka URLs for Glottocodes. These resource URLs are not only usable as universally unique identifiers, but are also resolvable through the Glottolog web application. HTTP status codes ([4]) returned by the web application signal the status of Glottocodes as follows:
for active codes
or
for invalid codes
The web application also provides several serializations of RDF ([19]) representations of languoid data. These serializations can be retrieved using standard content negotiation mechanisms such as using
While these efforts provide convenient integration with the “living” Semantic Web, Glottolog also aims at interoperability for its archived, long-term available datasets. To this end, Glottolog data is serialized as a CLDF Structure Dataset ([12]). The CLDF standard ([8,9]) not only provides interoperability with other CLDF datasets, but – because it is built on the W3C’s “CSV on the Web” recommendation ([25] and [21]) – also allows automatic conversion to RDF ([24]).
CLDF bundles data with structured, machine readable, semantic web-ready metadata. Since CLDF metadata is encoded in JSON-LD ([23]), the data can be marked up using standard ontologies such as Dublin Core, DCAT (
Thanks to improvements in the curation of ISO 639-3 language identifiers during the last decade, ISO 639-3 codes and language-level glottocodes are one-to-one interchangeable in the vast majority of cases, and the differences are few enough that a specific comment explaining the difference is given on Glottolog in each of the remaining cases. In fact, Glottolog aims at covering all valid ISO 639-3 codes to provide a full mapping, but typically there is a time lag of a couple of months between additions to ISO 639-3 and a Glottolog release addressing these changes. There remains a principled difference in anchoring, in that the denotation of a glottocode in Glottolog is defined by the data and information in the references tied to it. The references are associated in Glottolog in such a way that the referenced data and information are enough to distinguish the languoid from all other languoids. Strictly speaking, the ISO 639-3 standard provides no definition or justification of the recorded entries. In Ethnologue [3] – the reference for most of the ISO 639-3 codes – each entry has metadata such as geographical information, name(s), speaker numbers and classification, which presumably define the language, but no actual or referenced data from the denoted language. Unfortunately, such metadata is not always enough to identify its denotation. Language names are notoriously ambiguous, and the case of language-shifting ethnic groups is particularly tricky, as most metadata (speaker numbers, geography, name) is not sufficient to disambiguate between the original and substituted language.
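The mapping between ISO 639-3 codes and glottocodes can be derived from a CLDF language table. The sketch below assumes a CSV with the column names `ID`, `Name`, `Glottocode` and `ISO639P3code`. The three sample rows are illustrative only and are not copied from a Glottolog release, and the last row is entirely hypothetical.

```python
import csv
import io

# Illustrative sample only; a real table would be read from a release file.
SAMPLE_LANGUAGES_CSV = """ID,Name,Glottocode,ISO639P3code
stan1293,English,stan1293,eng
stan1295,German,stan1295,deu
exam1234,Hypothetical variety,exam1234,
"""


def iso_to_glottocode(csv_text: str) -> dict:
    """Map ISO 639-3 codes to glottocodes, skipping rows without an ISO code.

    Raises ValueError if one ISO code is attached to two glottocodes,
    since the mapping is expected to be one-to-one at the language level.
    """
    mapping = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        iso = row["ISO639P3code"].strip()
        if not iso:
            continue  # languoid without an ISO 639-3 counterpart
        if iso in mapping:
            raise ValueError(f"ISO code {iso!r} is mapped more than once")
        mapping[iso] = row["Glottocode"]
    return mapping
```

Rows with an empty ISO column are kept out of the mapping rather than treated as errors, reflecting that not every languoid has an ISO 639-3 counterpart.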
Glottolog, and in particular individual languoids, are also well-linked from Wikipedia. In particular, practically all language- and family-level languoids are referenced in Wikipedia. These Wikipedia links translate to Wikidata links (e.g.
Glottolog data is reusable
Glottolog data is released under a Creative Commons CC-BY-4.0 license.
Like most large-scale databases, parts of Glottolog data are aggregated from various sources. Glottolog tries to be transparent about this, e.g. by
providing references for all classification proposals7 E.g.
providing references for all endangerment assessments8 See
describing the provenance of the bibliography9 See
We already described the relation between glottocodes and ISO 639-3 language codes. Arguably, the transparent mapping between the two, which Glottolog provides, is the most important contribution towards meeting domain-relevant standards.
But as explained above, Glottolog also
caters to the LLD community, by meeting Semantic Web standards, e.g. re-using ontologies like GOLD10
serves the OLAC community, by implementing the OAI-PMH data provider specification ([16]), thereby allowing harvesting through OLAC
helps researchers in descriptive and comparative linguistics to inform their analyses using Glottolog metadata, by making this data accessible as a CLDF dataset
provides the NLP community with the means necessary to follow the “Bender Rule” [1] of always identifying the language(s) (or language varieties) involved in NLP research
Glottolog aims to be complete with respect to all assertable L1 languages12 See
Glottolog also classifies dialects insofar as it attaches them to exactly one language-level languoid. However, the inventory of dialects (varieties of a language), non-L1 languages (artificial languages, speech registers, pidgins), non-assertable languages (putative languages for which there is insufficient data to decide if they are different from all other languages) and putative families (hypotheses about family relationships that have appeared in the literature) is growing but still far from complete. The world may contain more of any or all of these entities without a necessary reflection in a glottocode. Genuine completeness with respect to these categories is deemed practically (if not theoretically) impossible.
For these reasons, Glottolog accounts for any changes to the language-level inventory between two releases, i.e. language-level glottocodes of the previous release will always be valid glottocodes in the next. So if something was deemed a real-world language, a user can follow any changes to that assertion. If a language-level languoid was completely erroneous, it is moved to the Bookkeeping category. If it is promoted/demoted to a family/dialect, it retains its glottocode and changes its level accordingly, but from that point on it ceases to be “protected” by its language-level status and may therefore be retired in the next release.13 This process works similarly to the way deprecation (
Glottocodes are not recycled – for new entities, completely new glottocodes are assigned (retired codes are not re-used/re-purposed). Hence, all glottocodes that have ever appeared are either active or retired.
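The two guarantees described here (language-level codes survive into the next release, and retired codes are never re-used) can be expressed as a consistency check between releases. This is a minimal sketch, not part of the Glottolog tooling: the function name, the dictionary representation of a release, and the level labels are all assumptions made for illustration.

```python
def check_release_transition(previous, current, retired_before):
    """Check two release invariants and return a list of problem messages.

    previous, current: dict mapping glottocode -> level
        ("language", "dialect" or "family"; labels are assumed here).
    retired_before: set of glottocodes retired in any earlier release.
    """
    problems = []

    # Invariant 1: every language-level code of the previous release
    # is still a valid code in the next one (its level may have changed).
    for code, level in sorted(previous.items()):
        if level == "language" and code not in current:
            problems.append(f"{code}: language-level code dropped")

    # Invariant 2: retired codes are never recycled for new entities.
    for code in sorted(set(current) & set(retired_before)):
        problems.append(f"{code}: retired code re-used")

    return problems
```

Dialect- and family-level codes that disappear between releases are not reported, since only language-level codes carry the survival guarantee.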
We already pointed out that Glottolog data is versioned and released periodically (aiming at a bi-annual release frequency). Each such Glottolog release is self-contained, i.e. does not reference any base data, but instead includes it. Thus, when linking other resources to Glottolog, one should always specify the particular target version.
Glottolog follows a semantic-versioning scheme14
Resources using Glottolog should always target the highest patch version of a particular minor version. This should not break any processing code, but may correct errata.
Upgrading resources to a new minor version may change data/links, but should not break processing code.
Upgrading to a new major version may break processing code, i.e. the data structure may change.
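The three upgrade rules above can be sketched as version arithmetic. The helper names below are hypothetical, the version strings in any usage are illustrative, and a missing patch component is assumed to mean patch level 0.

```python
def parse_version(text):
    """Parse 'major.minor[.patch]' into a 3-tuple, padding with zeros."""
    parts = tuple(int(p) for p in text.split("."))
    return parts + (0,) * (3 - len(parts))


def highest_patch(pinned, available):
    """Return the highest patch release sharing major.minor with `pinned`."""
    target = parse_version(pinned)[:2]
    candidates = [v for v in available if parse_version(v)[:2] == target]
    if not candidates:
        raise ValueError(f"no release matches {pinned!r}")
    return max(candidates, key=parse_version)


def upgrade_kind(old, new):
    """Classify an upgrade as 'major', 'minor', 'patch' or 'none'."""
    o, n = parse_version(old), parse_version(new)
    if n[0] != o[0]:
        return "major"  # data structure may change; code may break
    if n[1] != o[1]:
        return "minor"  # data/links may change; code should keep working
    if n[2] != o[2]:
        return "patch"  # errata only
    return "none"
```

A resource pinned to one minor version would call `highest_patch` to pick up errata, and use `upgrade_kind` to decide whether an upgrade needs only a data refresh or also a review of the processing code.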
The Glottolog version history can be explored in two ways: The Glottolog web application resolves resource URLs of obsolete languoids as follows:
Where the HTML page at
Alternatively, since Glottolog data is curated as
We have described the practices and principles for glottocodes as the identificational system for the languages, dialects and families of the world including data curation, technical infrastructure and update/version-tracking systematics. The resulting data observes the crucial aspects of the FAIR (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship. As such the glottocode-system responds to an important challenge in the realm of Linguistic Linked Data with numerous NLP applications.
