Abstract
The ISO 25964-1 Thesauri and interoperability with other vocabularies—Part 1: Thesauri for information retrieval, creation, management, and use standard is undergoing a significant revision from the 2011 edition. The use cases for thesauri (or taxonomies) have expanded considerably, driving the revision. Thesauri are no longer solely used in learned publishing and libraries. This paper provides a brief history and discusses the new needs, the use cases, and then the actual changes to the standard. The data model is updated to include support for web navigation, chatbots, online shopping, knowledge graphs/maps, and artificial intelligence, among other use cases. Content is now digitized or born digital, and its distribution is virtual. Thesauri have always supported search and indexing/tagging of content. They are now the preferred alternative to traditional subject headings. This paper outlines the revision process and provides an overview of the specific revisions and additions. These include the inclusion of Globally Unique Identifiers (GUIDs), a list of connected standards, expanded examples in many non-Latin languages, guideline references for Diversity, Equity, and Inclusion (DEI), the addition of concept and concept term, substantial updates to the Annexes and references, and much more. The revision was made available for comment and vote on July 30, 2024.
Keywords
Introduction
Thesaurus, taxonomy, ontology, controlled vocabulary, structured vocabulary, knowledge organization system (KOS), knowledge graph, etc. are all terms that are used rather loosely to cover the same topical areas. The ISO 25964-1 Thesauri and interoperability with other vocabularies—Part 1: Thesauri for information retrieval, creation, management, and use standard 1 and its parallel national standards, such as ANSI/NISO Z39.19, 2 have been around since the early 1990s, codifying the processes used by early abstracting and indexing services (secondary publishers) and then online databases. ISO 25964-1 was published in 2011 and is the current edition. So why the recent interest in updating such apparently static tools and their matching standards? Are they not too old-fashioned to worry about?
A little background
Traditionally, thesauri have been used in scholarly publishing and libraries. They have been particularly important in providing subject access to complex esoteric areas such as biology, chemistry, engineering, medicine, physics, theology, and many more. Their strength has been providing navigation through hierarchies, bringing together as synonyms the various words used to describe the same concept, and cross-referencing related areas. Printed indexes were an integral tool for researchers before information was made available over long distances via computers. In parallel, libraries used the subject heading approach, for example, the Library of Congress Subject Headings (LCSH), inverting the words to put the main words first and supplying see also (cross references) and see (synonym) references to their researchers. Both approaches are still widely used. The thesaurus approach is different in that it states the terms (the words used) in natural language rather than inverting them. That is, the words are put in the natural order of the spoken word. Using a thesaurus, the individual concepts are discrete: they stand alone and are combined at the time of search. This is known as post-coordination. Using subject headings, the terms are put together in advance by the cataloguer. This concatenation is known as pre-coordination.
With the computerization of search, it is no longer necessary to pre-coordinate the terms, and researchers are free to use the terms that they know in the same way in which they speak about their topic. Using Boolean commands (AND, OR, NOT) to combine the topics gives them a power that they did not have in the printed indexes or the library catalogues.
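Post-coordination with Boolean operators can be illustrated with a small sketch. The Python below is only an illustration under stated assumptions: the document identifiers, the concept labels, and the function names are hypothetical and do not come from the standard. Each document carries discrete concepts, and the combination happens only when the search is run.

```python
from collections import defaultdict

# Hypothetical sample data: each document is tagged with discrete,
# natural-language concepts rather than pre-combined subject headings.
DOCUMENT_TAGS = {
    "doc1": {"water pollution", "rivers"},
    "doc2": {"water pollution", "lakes"},
    "doc3": {"air pollution", "cities"},
    "doc4": {"rivers", "fish"},
}


def build_index(document_tags):
    """Build an inverted index: concept -> set of documents tagged with it."""
    index = defaultdict(set)
    for doc_id, concepts in document_tags.items():
        for concept in concepts:
            index[concept].add(doc_id)
    return index


def search_and(index, *concepts):
    """Documents tagged with every listed concept (Boolean AND)."""
    sets = [index.get(c, set()) for c in concepts]
    return set.intersection(*sets) if sets else set()


def search_or(index, *concepts):
    """Documents tagged with at least one listed concept (Boolean OR)."""
    sets = [index.get(c, set()) for c in concepts]
    return set.union(*sets) if sets else set()


def search_not(index, include, exclude):
    """Documents tagged with `include` but not with `exclude` (Boolean NOT)."""
    return index.get(include, set()) - index.get(exclude, set())


index = build_index(DOCUMENT_TAGS)
river_pollution = search_and(index, "water pollution", "rivers")
```

A pre-coordinated heading would have fixed a combination such as the two concepts above at cataloguing time; here the same two concepts remain independent and can be recombined with any other concept in a later query.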
For the last forty-plus years, library subject headings and thesaurus terms have formed separate camps for searching the literature. When the Dublin Core 3 initiative began in 1995, it was built as a crosswalk between those two communities, reducing the more than 650 library automation (MARC) 4 fields to the average of 15 fields needed in the computer database indexes. Such a wide difference in the number of items needed to execute a search was a widely debated topic. Dublin Core, as it became known, was approved as NISO Standard Z39.85 5 in 2001.
One of the reasons for the debate was the acceptance, or not, of the thesaurus terms (subject metadata) as a core tenet of the Dublin Core. Why? Because there was soon a movement to create a Digital Object Identifier (DOI) 6 and deposit “contributed metadata” with the information object (article, book, etc.) into a system wherein people could find the item using that number. The DOI database and the Handler for that system became known as CrossRef. 7 The secondary publishers worried that they would lose their intellectual property, perhaps their whole reason for existing, to CrossRef if they supplied the thesaurus terms along with the other DOI data. The result was the Contributed Metadata, proposed by the National Federation of Abstracting and Indexing Services (NFAIS) 8 and accepted by the DOI Foundation.
Why the interest now?
The historical use cases are still very active and valid. With the exponential growth in computerized information, the need to organize, manage, and retrieve this vast amount of information has led to new use cases bursting upon the scene. The concept terms added to documents now support search, navigation, and discovery on web sites, Content Management Systems (CMS), 9 and Digital Asset Management (DAM) 10 systems. They provide the building blocks for creating knowledge graphs and ontologies. Use in automated help responses and call centers has grown geometrically. Most recently, their use within Generative Artificial Intelligence (GenAI) 11 systems, to keep them from hallucinating and to provide subject disambiguation, has been noted as being crucial to the embedding process and to accurate chat answers to users.
The explosion of use cases for the known thesaurus capabilities gave rise to increased interest in updating the ISO 25964 standard. Now thesauri and the taxonomic view, also known as the hierarchical view, are used for:
• Search expansion,
• Tagging/indexing for retrieval,
• Alt term suggestions,
• Support of clustering and other ontological options,
• Shopping carts,
• Web navigation,
• Type ahead,
• Recommendation engine matching,
• Collection organization,
• Chatbots,
• Decision trees,
• Profiles (members and authors),
• Semantic fingerprints,
• Knowledge graphs,
• Human intelligence bases,
• Artificial intelligence,
• Generative AI,
• Building search query triples for SPARQL, 12
• Conference track sorting,
• Peer review keywords,
• Interoperability with any number of other systems,
• Etc.
The process
All standards developed by the International Organization for Standardization (ISO), as well as those developed by related national standards organizations, are required to be reviewed every 5 years, with each standard being reaffirmed, revised, or retired. When this standard came up for review, it was clear from the comments that it was time to revise it.
The committee to do the work must be made up of representatives from throughout the ISO member nations. Each of the 137 member national standards bodies may name representatives, and approve them through ISO, to participate in the revision of the standard. That said, it does take outreach to find individuals who are willing to lend their expertise and spend the time that is required to work on the revision. They also must go through the approval process of their own national body and then of ISO. It takes a frustrating amount of time to get this done.
There is a timeline to be met for the standard revision process. The first year is devoted to creating the revision. The second and third years provide time for all ISO members to vote and comment, for the committee to revise based on the feedback received, and then for a second round of voting prior to the publication process. For this revision of ISO 25964, creation of the committee itself took the first 6 months before sufficient people were approved for the actual work to begin in earnest. Fortunately, the worldwide community for taxonomies, thesauri, and related standards is a group which was keenly interested and ready to work. Once the work begins, ISO insists on a closed environment for the work. The drafts may not be shared outside of the committee, as ISO’s intellectual property must be carefully preserved. The drafts may be shared only through the voting process to the member nations. So how do you get widespread input on the needs of users and potential users while honoring the ISO guidelines? How do we learn what needs to be changed and updated?
While the committee was being formed, we held an open call for comments and a workshop. To gather additional viewpoints and information, we reached out to potential user communities including:
• Taxonomy Division of the Special Libraries Association (SLA),
• Taxonomy Boot Camp/KM World, 13
• Listservs for taxonomy groups,
• TaxoDiary Blog, 14
• Software developers,
• Known authors in the field,
• Previous members of the committee,
• Members of NISO, Association for Information Science & Technology (ASIS&T), Special Libraries Association (SLA), Networked Knowledge Organization Systems/Services/Structures (NKOS), and International Society for Knowledge Organization (ISKO).
More than 150 individuals responded with interest and many provided helpful comments and insights. We held a brainstorming session at the Taxonomy Boot Camp/KM World meeting in November 2023 with 52 people registered. This resulted in a white paper, which was circulated to the interested people and drew additional comments. The information gathered formed the basis for the committee’s work once the full roster was achieved. The mailing list of respondents is periodically contacted with updates on the progress of the standard to help manage expectations for the final version.
Creating the revision
With our list of priorities gathered from the user community in hand, the revision committee turned its attention to the existing standard. The document was redlined in rotation by committee members, with extensive commentary for discussion. At first the committee met monthly, then biweekly, and then weekly to bring the revision draft forward on time, completing its work the day before the ISO deadline of June 30, 2024.
As the drafts moved forward, we created subcommittees to work on sections of the standard in order to move more quickly and to allow specific attention to some of the thornier sections.
It was clear that there was a need to expand beyond the current standard coverage if this was to be a truly international standard. Most of the examples in the current standard were in English, French, or German. We needed to expand the examples to cover character sets beyond the Latin alphabet. There has been considerable outreach to gather an international panel of experts in the field. The revision includes examples in Arabic, Cyrillic, Finnish, Greek, Tamil, and other languages. A group pursued examples and varied ways of stating terms in many languages to contribute significantly expanded examples throughout the standard.
Some of the more esoteric aspects of thesaurus creation were rewritten for clarity. Others were rewritten and moved to Annexes to streamline the main standard.
New clauses
Guidance on concept identifiers, resource locators, and GUIDs (Globally Unique Identifiers) has been added to the standard to support knowledge graphs, interoperability, and SPARQL. This meant that the data models section also needed to be updated.
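As a rough illustration of why a stable identifier matters, the sketch below mints an identifier for a concept and derives a locator from it. It is a minimal example under assumptions, not the scheme prescribed by the standard: the base URI, the function names, and the record layout are hypothetical, and a random version 4 UUID is used only because it is readily available in Python's standard `uuid` module.

```python
import uuid

# Hypothetical base URI for a vocabulary; an assumption for illustration only.
BASE_URI = "https://example.org/thesaurus/concept/"


def mint_concept_id():
    """Return a new random (version 4) UUID as a string."""
    return str(uuid.uuid4())


def concept_uri(concept_id, base_uri=BASE_URI):
    """Combine the base URI and the identifier into a locator for the concept."""
    return base_uri + concept_id


concept_id = mint_concept_id()
record = {
    "id": concept_id,
    "uri": concept_uri(concept_id),
    "label": "Water pollution",  # hypothetical label
}

# The label may be revised later; the identifier and locator stay the same,
# so links from other systems or graphs continue to point at the same concept.
record["label"] = "Pollution of water"
```

The point of the design is that other vocabularies, knowledge graphs, or queries refer to the identifier rather than to the wording of the label, which is free to change.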
A significantly expanded list of connecting standards was developed and a crosswalk was built to them. There are more than 50 terminology standards just within ISO itself and it is important to ensure that they do not contradict each other. Many of these different standards are not currently in alignment. If a vocabulary is to be controlled there should be one overarching guideline for them and the organization should adhere to its own work. A committee is now established within the ISO Technical Committee 46 (ISO TC 46) 15 to consider this initiative. Crosswalks to TC 37 16 are in the process of being established, for example: ISO 704—Terminology work—Principles and methods 17; ISO 1087—Terminology work and terminology science—Vocabulary 18; ISO 30042 becoming 24634—Management of terminology resources—Term Base eXchange—TBX 19. Non-ISO groups who also have controlled vocabulary, terminology, ontology, and taxonomy interests include organizations such as BSI, DIN, SKOS (W3C), OWL, and the SKOS - Thes Namespace built for OWL (now hosted by DCMI). We are learning of more daily!
A section on Diversity, Equity, and Inclusion (DEI) was added, as many of the terms we need to have in a thesaurus are laden with unintended meanings outside of the specific group applying the terms. However, the older literature will likely include the earlier terms, which should be referenced since the literature, once published, does not change. Hence the preferred term is updated with synonyms for previous usage. We cannot lose access to that information when a more current term is now in vogue. Without the previous terms, the information does not surface in search. Many organizations have already faced this quandary, and we provided links to their specific suggestions since treatment will vary by subject area.
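The effect of retaining earlier terms as synonyms can be sketched in a few lines. This is a simplified illustration under assumptions: the sample entry, the labels, and the function names are hypothetical and are not taken from the standard or from any organization's guidelines. A searcher may enter either the current term or an earlier one, and the query is expanded to all retained labels so that older literature is still retrieved.

```python
# Hypothetical entry: one preferred term plus earlier labels kept as synonyms.
THESAURUS = {
    "older adults": ["elderly", "senior citizens", "aged"],
}


def build_lookup(thesaurus):
    """Map every label, preferred or not, to its preferred term."""
    lookup = {}
    for preferred, non_preferred in thesaurus.items():
        lookup[preferred] = preferred
        for label in non_preferred:
            lookup[label] = preferred
    return lookup


def expand_query(term, thesaurus, lookup):
    """Return the preferred term followed by all labels retained for it."""
    key = term.strip().lower()
    preferred = lookup.get(key)
    if preferred is None:
        return [key]  # unknown term: search it as entered
    return [preferred] + thesaurus[preferred]


lookup = build_lookup(THESAURUS)
expanded = expand_query("Elderly", THESAURUS, lookup)
```

If the earlier labels were deleted instead of retained, a search on the current term would miss every document indexed or written with the older wording.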
The original work is divided into main chapters or clauses
Clauses 1–14: Definitions, Historical. These sections have been substantially updated. Since the standard was originally written, a number of new terms and applications have come into common use. For example, “Concept” replaces “Term” as the most common usage. A term represents a concept, which can be stated in several ways. The idea of a Preferred term (PT)/Non-Preferred term (NPT) was hotly debated, as the labels indicate a preference and may suggest a bias in the word selected to represent a concept. Should we allow that? There was a long list of alternatives under consideration. With the review of the other ISO standards, especially the Technical Committee (TC) 37 standards, we realized that the labeling is already in widespread use in other ISO and national standards. In the end we decided to leave it in the standard, but also to use concept term or concept label where appropriate.
Clause 15: Data Model needed to be updated to accommodate concept labels, GUID, PID, and other new fields added to the standard. There are also extensive models and ties to both the SKOS and UML formats.
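To make the relationship between a concept, its labels, and a SKOS representation more concrete, here is a minimal sketch. It is not the data model of the standard: the class, its field names, and the example URIs are hypothetical, and the mapping of a preferred term to `skos:prefLabel` and of non-preferred terms to `skos:altLabel` is an assumption based on common SKOS practice. Only the SKOS property names themselves come from the W3C vocabulary.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Turtle prefix declaration for the W3C SKOS core vocabulary.
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."


@dataclass
class Concept:
    """Hypothetical, simplified concept record with an identifier and labels."""
    uri: str                      # persistent identifier for the concept
    pref_label: str               # preferred term
    alt_labels: List[str] = field(default_factory=list)  # non-preferred terms
    broader: Optional[str] = None  # URI of the broader concept, if any
    lang: str = "en"

    def to_turtle(self) -> str:
        """Serialize the concept as one Turtle statement.

        Labels are written as-is; quotes inside labels are not escaped
        in this sketch.
        """
        statements = [
            "a skos:Concept",
            f'skos:prefLabel "{self.pref_label}"@{self.lang}',
        ]
        for label in self.alt_labels:
            statements.append(f'skos:altLabel "{label}"@{self.lang}')
        if self.broader:
            statements.append(f"skos:broader <{self.broader}>")
        body = " ;\n    ".join(statements)
        return f"<{self.uri}> {body} ."


concept = Concept(
    uri="https://example.org/thesaurus/concept/1",
    pref_label="Water pollution",
    alt_labels=["Aquatic pollution"],
    broader="https://example.org/thesaurus/concept/0",
)
turtle = SKOS_PREFIX + "\n\n" + concept.to_turtle()
```

Keeping the identifier separate from the labels means the wording of a term can be revised, or further labels added in other languages, without changing what other systems point to.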
Clause 16: Integrations and interoperability with other methodologies and software is mostly covered by Part 2 of this standard. So, this section was lightly updated, referring to the other work which will likely be revised in the next couple of years.
Clause 17: Exchange formats have increased in availability since 2011, and this section needed review and updating to ensure that the references were up to date and the formats covered were still current. Interestingly, the SKOS format itself is still valid, although individual organizations have embroidered on their own implementations, making it more of a guideline than a standard.
Clause 18: Protocols also needed updating, although not extensively.
The Annexes with sample thesauri were brought up-to-date with the current examples from each and, of course, with permissions to use them. Although many of the supporting examples are from organizations which are not part of the revision committee, they were very generous in providing updated examples of their prototypical work and permission to use the examples.
The section on software for thesaurus creation was moved to the third Annex in order to streamline the standard body. It is still being considered by the committee whether to include this Annex or not.
ISO 25964 merges monolingual and multilingual thesauri into a single work. The project concentrates on the development and maintenance of a thesaurus, but does not cover how to use it in the indexing process. There are other standards concerning how to index with subject metadata.
Status
This paper is meant only to provide the highlights of the current revision. The draft was completed June 29, 2024. It was released for initial comments and voting to the ISO member bodies on July 29, 2024. The voting concluded in October 2024; the committee will reconvene to resolve the comments made, as well as any additional items that need to be brought to conclusion.
Revision committee
As on all committees, the members of the working group varied in knowledge, skills, backgrounds, contributions, and attendance. The discussions at the regular Zoom meetings were lively, informative, and productive. We all learned a lot. Several members quickly and willingly commented extensively, took sections or clauses and models, and rewrote/updated them for the group’s comments. Others researched/updated the bibliography and the references to other standards, and created new sections. Of special note are the contributions, alphabetically, of Kerry Blinco, Joseph Busch, Dave Clarke, Betsy Fanning, David Gillikin, Pat Harpring, Heather Hedden, Elisabeth Moscara, Doug Tudhope, and Marcia Zeng. They made the time fly, the work fun, and my job as convenor much lighter.
