Abstract
We present a novel, logic-based solution to the challenge of reconciling the meanings of taxonomic names across multiple biological taxonomies. The challenge arises due to limitations inherent in using type-anchored taxonomic names as identifiers of granular semantic similarities and differences being expressed in original and revised taxonomic classifications. We address this challenge through: (1) the use of taxonomic concept labels – thereby individuating name usages according to particular sources and allowing each taxonomy to be recognized separately; (2) sets of user-provided Region Connection Calculus articulations among concepts (RCC-5: congruence, proper inclusion, inverse proper inclusion, overlap, exclusion); and (3) the use of an Answer Set Programming-based reasoning toolkit that ingests these constraints to infer and visualize consistent multi-taxonomy alignments. The feasibility of this approach is demonstrated with a use case involving pairwise alignments of 11 non-congruent classifications of Eastern United States grass entities variously assigned to the
Keywords
Introduction
We present a novel, logic-based solution to the challenge of integrating the meanings of taxonomic names across multiple biological taxonomies. The challenge arises due to limitations inherent in using taxonomic names as identifiers of granular semantic similarities and differences being expressed in succeeding classifications. We address this challenge through the combined use of taxonomic concepts [5,35], Region Connection Calculus (RCC-5) articulations [34,71], and an Answer Set Programming-based reasoning toolkit that infers consistent multi-taxonomy alignments [16,61]. The feasibility of this approach is demonstrated with a use case involving 11 classifications of Eastern United States grass entities variously assigned to the
Names as identifiers of taxonomic meanings – challenges and solutions
Why are names not good enough? We adopt the view that taxonomic names and nomenclatural relationships are necessary but not sufficient for integrating biodiversity data for semantic information environments [5,35,58,73]. The reasons for this insufficiency are systemic and well known to taxonomy contributors and users [3,10,66,74]. Ultimately they are rooted in the way in which identity is established according to the rules of nomenclature that guide the application of names to perceived taxonomic groups [29,51,64,96].
Biological classifications strive to reflect natural, phylogenetic relationships. They are therefore subject to adjustments whenever new evidence regarding the identity of taxonomic entities or relationships among these is brought forth by the latest systematic research [37]. For many organismal groups in the tree of life, systematists are not close to completing this process of adjustment. For instance, in the past 20 years the number of validly recognized species of primates has increased from 233 to 488 [76]. While such necessary taxonomic changes accumulate over time, the
Typically both a type and a feature-based circumscription are provided when anchoring the meaning (referential extension) of a taxonomic name [29,34,37,96]. However, the former arbiter – i.e., the type identity – has special weight when dealing with alternative name:meaning (read: “name-to-meaning”) assignments that become necessary when taxonomies undergo revisions. Another relevant, Code-mandated naming rule is the Principle of Priority [67], which states that in case of (again, type-grounded) synonymy, the oldest available name remains the valid one. The vast majority of the 250+ year-old names of Linnaeus [78] are ‘eternally validated’ by this important Principle.
Application of the rules of nomenclature to changing classifications can create semantically complex networks of many-to-many relationships among valid and invalid names on one side, and associated circumscriptions on the other side [35,43,74]. Thus, in spite of the central role of Code-compliant names in interconnecting biodiversity data [69,70,78], these names have shortcomings as identifiers of granular differences between taxonomic perspectives that biodiversity data communities create and apply at any given time. Sound knowledge representation in the biodiversity data realm requires recognition of, and compensation for, these systemic insufficiencies [32,34,58].
Solutions to overcome taxonomic name:meaning dissociations may take two major pathways. One option is to assemble single, comprehensive taxonomies for particular groups, with periodically updated versions [10,66,79]. This approach offers an immediate and valuable service to users. However, in the longer term it often leads to multiple distinct perspectives being represented by earlier and later versions of the ‘same’ standard [5,37,90]. Thus in effect the unitary taxonomy turns into an open-ended temporal chain of partially incongruent taxonomies. Overlapping sets of names are reused from version to version, with varying circumscriptions and no explicit tracking of taxonomic alignment [18]. In the end, unitary systems are likely to promote the proliferation of ambiguous name:meaning relationships.
Truly alternative – though also complementary – options to unitary classifications are being developed under the term
The resolution gained by using such labels is critical. They permit the assembly of multiple alternative, internally coherent hierarchies where all concepts derived from one hierarchy can be connected via parent/child (
Here we integrate concept-level annotations of alternative taxonomic perspectives with two additional workflow components: (1) user provision of an initial set of Region Connection Calculus (RCC-5)
Here we apply the taxonomy alignment approach to the 11-classification
Reasoning about multi-taxonomy alignments with RCC-5 articulations
The Euler/X toolkit is a successor of the CleanTax software [84–86]. The CleanTax prototype was built on top of a traditional First-Order Logic (FOL) reasoner [63]. Euler/X advancements include interactive workflow support, inconsistency and ambiguity analysis functions [15,17,84], and the use of Answer Set Programming (ASP) reasoners, based on Stable Model Semantics [39,40,60].
Taxonomy alignment problems are modeled as sets of In the qualitative reasoning domain [54], the basic RCC-5 relationships are known as EQ (

Overview of input/output information for processing with the Euler/X taxonomy alignment toolkit, using the example of the Blomquist (1948)/Small (1933) alignment. (A) Input data format, showing the two input taxonomies and the set of six user-provided input articulations (Appendix A). (B) Input visualization, with legend (left) providing information on numbers of input concepts per taxonomy,
The set (C) of constraints applicable to taxonomy alignments are [87]: (1)
The toolkit functions with relevance to the Andro-UC are as follows (Fig. 1). (1) Visualization of each input taxonomy in the format of an

Tabular representation of the input alignment of taxonomic names and concepts used in the 11 succeeding classifications of the Andro-UC, as provided by Weakley [35,93,94]. Columns represent classifications whereas rows contain information on taxonomic name and concept identity (via taxonomic concept labels, see column headers). Cell shadings indicate congruent multi-concept lineages. Consecutive concept numbers (1–100) are reused in Fig. 3 for the purpose of comparison. See text for further details.
Additional toolkit functions include logic-based diagnosis and repair options in the case of inconsistent input (=constraint over-specification), and visualizations of multiple alignments as aggregate and cluster views in the case of ambiguous input (=constraint under-specification) [15–17,23,61]. The latter visualizations can inform interactive decision tree routines, where the user is repeatedly prompted to resolve ambiguous (i.e., disjunctive) articulations, thereby reducing the number of possible world alignments. Both sets of functions are intended to aid the user in achieving consistent, well-specified alignments [33]. However, neither set of functions is needed to properly align the Andro-UC input, which by virtue of the unambiguously specified user input displayed in Fig. 2 already satisfies the criteria of consistency and sufficiency. We refer readers to other contributions where these issues are discussed in more detail [15,17,33,36,53].
To our knowledge, the specific combination of generating reasoner-inferred alignments between multiple biological taxonomies with RCC-5 articulations (and ASP reasoners) has no immediate precedent in the broader semantics domain. The logic foundations for this particular approach were developed in [16,43,86]. The step of modeling an input taxonomy as an
Biodiversity scientists are often faced with use cases where sets of taxonomic occurrence records or entities can either be relevantly merged, or not, for information ingestion into subsequent analyses. This requirement, together with the notion that taxonomic boundaries are natural and empirically accessible [37], may motivate using RCC-5 over alternatives that express similarity ratios among individual concepts and concept hierarchies [92]. The latter are most appropriate for expressing “how semantically close?” two concepts are. However, for the biodiversity scientist this begs an additional question [34]: “are the differences significant, or negligible, for the purpose of merging data?” In this context, RCC-5 provides direct, actionable, set theory-based information for multi-taxonomy integration. The specific representation needs for biological taxonomies and derivations of FOL constraints are further discussed in [87].
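The set-theoretic reading of RCC-5 described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the referential extension of each concept is available as a finite set of member identifiers; in practice the articulations are asserted by experts rather than computed from enumerated members. The function name `rcc5`, the symbol strings, and the specimen identifiers are hypothetical and are not part of the Euler/X toolkit.

```python
from enum import Enum


class RCC5(Enum):
    """The five base relations; the symbol strings are illustrative only."""
    CONGRUENT = "=="      # congruence
    INCLUDES = ">"        # proper inclusion
    INCLUDED_IN = "<"     # inverse proper inclusion
    OVERLAPS = "><"       # overlap
    EXCLUDES = "|"        # exclusion


def rcc5(a, b):
    """Return the RCC-5 relation of extension `a` to extension `b`.

    Both arguments are non-empty collections of hypothetical member
    identifiers (e.g. specimen IDs).
    """
    a, b = frozenset(a), frozenset(b)
    if not a or not b:
        raise ValueError("concept extensions must be non-empty")
    if a == b:
        return RCC5.CONGRUENT
    if a > b:                      # a is a proper superset of b
        return RCC5.INCLUDES
    if a < b:                      # a is a proper subset of b
        return RCC5.INCLUDED_IN
    if a & b:                      # shared members, but neither contains the other
        return RCC5.OVERLAPS
    return RCC5.EXCLUDES


if __name__ == "__main__":
    later = {"s1", "s2", "s3"}     # hypothetical later concept
    earlier = {"s2", "s3", "s4"}   # hypothetical earlier concept
    print(rcc5(later, earlier).name)
```

Exactly one of the five relations holds for any pair of non-empty sets, which is what makes each articulation directly interpretable for data-merging decisions.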
Use of the RCC-5 articulations means that ambiguities due to incomplete knowledge in alignments are modeled through disjunctive articulations, which may be present in the input articulations, output MIR, [Footnote 3: A MIR is the unique node in a given R32 lattice that implies all other true articulations in the lattice.]
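As a rough illustration of why the lattice of disjunctive articulations is called R32, the sketch below enumerates every subset of the five base relations, giving 2^5 = 32 candidate articulations. The symbol strings and helper names are hypothetical, and the reading of the empty set as "inconsistent" and of the full set as "no information" is an interpretive assumption rather than a statement about the toolkit's internals.

```python
from itertools import combinations

# Illustrative symbols for the five base relations (assumed notation).
BASE_RELATIONS = ("==", ">", "<", "><", "|")


def r32():
    """Enumerate all subsets of the five base relations."""
    return [
        frozenset(combo)
        for size in range(len(BASE_RELATIONS) + 1)
        for combo in combinations(BASE_RELATIONS, size)
    ]


def is_ambiguous(articulation):
    """A disjunction of two or more base relations leaves the pair under-specified."""
    return len(articulation) > 1


if __name__ == "__main__":
    lattice = r32()
    print(len(lattice))                                   # 32 nodes
    print(sum(is_ambiguous(node) for node in lattice))    # nodes with >= 2 disjuncts
```

Under this reading, a well-specified alignment is one in which every concept pair resolves to a node containing a single base relation.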
Parallel efforts to derive taxonomic concept alignments ‘directly’ from textual descriptions through the application of Natural Language Processing methods and phenotype ontologies are introduced in [22]. Other taxonomically focused integration projects that do not utilize RCC-5 include [10,13,66,73,88]. The degree to which the RCC-5 alignment approach is relevant to other fields that model semantic drift requires further exploration.
The Andro-UC has been selected to demonstrate the multi-taxonomy alignment approach for several reasons. First among these is the availability of preexisting concept circumscriptions and articulations through co-author Alan S. Weakley, an expert on the Flora (and floristic legacy) of the Southern and Mid-Atlantic States [93,94]. An earlier version of the use case was published in [35] and included eight classifications. Three recent classifications are herein added to the Andro-UC. The use case is furthermore suitable because it illustrates the considerable extent to which names and meanings may dissociate over time as Code-compliant names are applied to incongruent taxonomic classifications. The implications for integrating biodiversity data are thereby made clear. Moreover, with only 100 concepts, the Andro-UC is relatively small. Its outer taxonomic boundaries are well defined and stable throughout the 126-year time interval (1889–2015). These properties allow us to present the alignment visualizations within the confines of this contribution. Additional comments on the relevance of this use case and applicability of our approach to other alignment challenges are offered in the Discussion.
Taxonomic particulars
The history of the Andro-UC is reviewed in [35,93,94]. The 11 input classifications T1, …, T11 are each reproduced according to the source publications (Fig. 2). All input articulations were provided by the user in tabular format (Fig. 3), which readily facilitates translation into RCC-5 relations. Strictly speaking, the Andro-UC concerns the “

Hierarchical, multi-level representations of the 11 input classifications of the Andro-UC (see also Appendix A). Taxonomic name and concept identities (numbered from 1–100) as in Fig. 2.
The classifications of the Andro-UC include, in chronological sequence (Figs 2 and 3): Hackel (1889) [49], Small (1933) [81], Blomquist (1948) [7], Hitchcock & Chase (1950) [50], Radford et al. (1968) [71], abbreviated as “RAB (1968)”, Godfrey & Wooten (1979) [45], Campbell (1983) [11], Campbell (2003) [12], Weakley (2006) [93], Kartesz (2014) [56], referred to as “BONAP (2014)”, and Weakley (2015) [94].
The tabular representation of Fig. 2 encodes taxonomic congruence as a function of occupying the same row (width). For instance,
Another noteworthy aspect of the input representation are higher-ranked entities (compare Figs 2 and 3). These entities are not depicted in Fig. 2, because the table provided by Weakley emphasizes congruence among the narrowest concepts recognized in each classification. However, these higher-level entities are implied by conventions that guide the source taxonomies, and are usually made explicit therein. For instance, the acceptance of two variety-level concepts
Our representations fully account for the implied higher-level taxonomic concepts, yielding comprehensive alignments with up to four levels (Fig. 3). Where necessary, we have added nominal (type) taxonomic names and concepts to represent comparable ranked entities at all levels; e.g.,
The Euler/X toolkit is open source and available at [61]. The software can be cloned and then deployed on a desktop using the command-line interface. An overview of the toolkit’s reasoning and visualization options is available through the “help” command. Additional software dependencies include Python, the Answer Set Programming reasoners DLV [25] and Potassco (Gringo, claspD) [39], and GraphViz [38].
The input conventions for labeling concepts and representing parent/child (
Visualizations for alignments 1–5 of the Andro-UC, 1889–1979. Representation conventions and annotations as in Fig. 1C. (A) Small (1933)/Hackel (1889) alignment; (B) Blomquist (1948)/Small (1933) alignment; (C) Hitchcock & Chase (1950)/Blomquist (1948) alignment; (D) RAB (1968)/Hitchcock & Chase (1950) alignment; (E) Godfrey & Wooten (1979)/RAB (1968).
Visualizations for alignments 6–9 of the Andro-UC, 1950–2006. Representation conventions and annotations as in Fig. 1C. (A) Godfrey & Wooten (1979)/Hitchcock & Chase (1950); (B) Campbell (1983)/Godfrey & Wooten (1979); (C) Campbell (2003)/Campbell (1983); (D) Weakley (2006)/Campbell (2003).
Visualizations for alignments 10–12 of the Andro-UC, 1889–2015. Representation conventions and annotations as in Fig. 1C. (A) BONAP (2014)/Weakley (2006); (B) Weakley (2015)/BONAP (2014); (C) Weakley (2015)/Hackel (1889).


In configuring the pairwise alignments, we represent the later (younger) taxonomy as T2 and the earlier (older) taxonomy as T1 [33]. Accordingly, the visualizations (Figs 1, 4–6) show concepts unique to T2 as green rectangles, and concepts unique to T1 as yellow octagons. Aligned regions with multiple congruent concepts are shown as grey rectangles with rounded corners (Fig. 1C). We use the shorthand of [36] for taxonomic concept labels, where (e.g.)
All alignments were obtained using “polynomial encoding/possible world/reduced containment graph” commands, which show overlapping articulations among input concepts as blue dashed lines in the output visualizations [14,16,61]. The commands generate the set of output MIR (.csv format) and GraphViz-rendered alignment visualizations (.pdf format).
The sets of Maximally Informative Relations (MIR) for each of the 12 alignments are provided in Appendix B. To ensure complete reproducibility, we have also prepared the Andro-UC use case as an experiment at
Summary of taxonomic and nomenclatural identities of Euler regions across 12 alignment visualizations for the Andro-UC (see Figs 4–6). Columns show the number of aligned regions (excluding the congruent parent region), ratio of congruent versus unique regions, percentage of congruent regions, ratio of identical (=) versus different (≠) names occupying the congruent regions, ratio of unique (+) versus non-unique (−) names occupying unique regions, and ratio and percentage of reliable versus unreliable names (see text for explanation). Totals and percentages are provided for the cumulative values across all alignments.
Number of aligned regions excludes the root/parent region (“
Summary of numbers of input concepts (T2/T1) and input articulations (A) for the 12 alignments of the Andro-UC, and of the Maximally Informative Relations (MIR), including totals and partitions according to each type of RCC-5 articulation. Legend: Rel.
Number in parentheses includes all MIR that articulate the root/parent region (“
Analysis of taxonomic name:meaning relationships in the 12 alignments of the Andro-UC, based on the 824 Maximally Informative Relations (MIR), and including assessments of reliable names [R] and unreliable names [UR]. Legend:
Third, we reinterpret the input displayed in Fig. 2 to evaluate the performance of names as concept identifiers over the entire 1889–2015 interval. We adopt Remsen’s [74] notion of
Analysis of name:meaning cardinality for the entire Andro-UC, based on 88 name usages of 36 taxonomic names corresponding to 46 unique (sets of) taxonomic meanings. Cell values indicate (1) that the name is used and (2) which of the 1–n meanings is symbolized by the name in the corresponding classification. Names are ordered according to their frequency of use in the 11 classifications. Non-congruent (sets of) meanings associated with each name are numbered in reverse chronological order, i.e., starting with the 2015 taxonomy. See also Fig. 1
Analysis of taxonomic name:meaning cardinality for the entire Andro-UC, based on 85 occurrences of concepts (“members”) that participate in 21 congruent concept chains, where individual chains are labeled with 1–4 taxonomic names. Cell values indicate (1) that the concept is an element of the chain and (2) which of the 1–n names is used to symbolize the member in the corresponding classification. Each of the 21 chains is labeled by its most recent member, and concept lineages are ordered accordingly. Non-identical (sets of) names associated with each chain are numbered in reverse chronological order, i.e., starting with their name in the 2015 taxonomy. See also Fig. 1 and Table 4
Extent and origins of taxonomic incongruence
Each of the 12 input configurations yields a single, consistent, and unambiguously resolved alignment (Figs 4–6). The 12 visualizations clearly illustrate that none of the paired input taxonomies are entirely congruent, instead showing 2–12 unique regions (compare Figs 5B and 5C), and an overall ratio of 56 congruent to 71 non-congruent regions (Table 1). While we cannot examine each alignment in fine detail, we highlight select phenomena that capture the extent and causes of taxonomic incongruence in the Andro-UC. One cause for incongruence is unequal granularity across classifications. For instance, at the lowest taxonomic level, classifications authored from 1933 to 1979 recognize 1–5 concepts, whereas taxonomies published outside of this interval accept 7–9 concepts (Figs 2 and 3). Such differences cause the more finely resolving taxonomy to have one or more non-congruent (properly included) low-level concepts in comparison to its counterpart (e.g., Figs 4A and 5B). As an extreme case, alignments of any taxonomy to the most coarse-grained RAB (1968) classification are only congruent with regard to the root-level concepts (Figs 4D and 4E), given that RAB (1968) recognize no additional taxonomic subdivisions within the complex. In the context of its immediate predecessor and successor (Figs 2, 4D, 4E, and 5A), the 1968 classification appears disruptive because the chain of taxonomic resolution between Hitchcock & Chase (1950) and Godfrey & Wooten (1979) is not propagated in RAB (1968).
Taxonomies produced in 1983 or later show higher levels of congruence between their finest-degree entities (Figs 5C, 5D, and 6). By and large, taxonomists publishing in the past 30 years have adopted Campbell’s (1983) perspective on how finely one should differentiate units within the complex. Incongruences among these recent perspectives are rooted mainly in disagreements on how to name and integrate low-level entities into parent concepts. Interestingly, Hackel (1889) already recognized seven low-level entities, and in that sense his classification is more congruent with contemporary perspectives (Fig. 6C) than with those published in 1933–1979.
In addition to unequal granularity, five alignments show overlapping (
The 1950/1948 alignment represents an interesting case of overlap (Fig. 4C). Both Hitchcock & Chase (1950) and Blomquist (1948) recognize three identically named species-level concepts within the complex, one of which is also taxonomically congruent (1950.A_capillipes
Of particular note is the articulation 1950.A_glomeratus
Generalizing the phenomenon exemplified in the 1950/1948 alignment, we observe that overlap of two (or more) concepts creates
Overall, occurrences of differential resolution and overlapping concepts in the Andro-UC result in pairwise alignments with 5–15 regions (Table 1). Taking the 12 alignments in conjunction, 44.1% of the 127 inferred alignment regions are taxonomically congruent (range: 0.0–85.7%), leaving the remaining 55.9% incongruent. This ratio of in-/congruence between paired taxonomies is the semantic basis of the dis-/agreements that taxonomic names are suited to identify and track, though only up to a point, as we analyze in the next section.
Quantification of name:meaning dissociation
Taxonomic names are reliable identifiers of taxonomic in-/congruence for 77/127 (60.6%) of the regions present in the 12 pairwise alignments of the Andro-UC (range: 38.5–83.3%) (Table 1). The highest ratios are obtained for the 1968/1950 and 1979/1968 alignments. The latter include no congruent regions, since every unique name also symbolizes a unique alignment region (Figs 4D and 4E). The 5:13 ratio (38.5%) for the 2015/1889 alignment (Fig. 6C) is low as expected. In particular, 0/7 congruent concept regions in this 126 year-spanning alignment have reliable names; i.e., each of these regions is labeled by two non-identical names. However, taxonomic names in the Andro-UC do not necessarily perform better over short time intervals, or in alignments whose input taxonomies are closer to the present (2015). One example is the 2006/2003 alignment (Fig. 5D), which has an undesirable 6:9 ratio (40.0%) of reliable:unreliable names.
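The percentages quoted here and in the preceding section can be re-derived from the stated counts. The short check below is only a sketch; it assumes that "5:13" denotes 5 reliable names out of 13 regions, whereas "6:9" denotes 6 reliable versus 9 unreliable names (15 regions in total), since these are the readings that reproduce the quoted 38.5% and 40.0%. The helper name `percent` is hypothetical.

```python
def percent(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)


TOTAL_REGIONS = 127  # aligned regions across the 12 pairwise alignments

congruent_share = percent(56, TOTAL_REGIONS)       # congruent regions
incongruent_share = percent(71, TOTAL_REGIONS)     # non-congruent regions
reliable_share = percent(77, TOTAL_REGIONS)        # regions with reliable names
alignment_2015_1889 = percent(5, 13)               # assumed: 5 of 13 regions
alignment_2006_2003 = percent(6, 6 + 9)            # assumed: 6 reliable, 9 unreliable

if __name__ == "__main__":
    print(congruent_share, incongruent_share, reliable_share)
    print(alignment_2015_1889, alignment_2006_2003)
```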
The 824 output MIR permit finer assessments of name:meaning dissociation (Table 3). Accordingly, among all 60 instances of pairwise taxonomic concept congruence (
Among the remaining 161 non-congruent articulations (>, <,
Quantification of name:meaning cardinality over the 126-year period of the Andro-UC reveals that 18/36 taxonomic names (50.0%) have been used in multiple treatments, whereas the other 18 names are particular to single treatments (Table 4). Cumulatively, the use case entails 88 taxonomic name usages and 46 unique name:meaning combinations (ratio: 1.91:1). Only one name –
The most frequently used name –
The 88 name usages in the Andro-UC correspond to 21 chains of taxonomically congruent concepts (Table 5). Of these, the chain symbolized by 2015.A_glaucopsis (most recent member) is the longest, with elements appearing in 9/11 classifications and under four non-identical names. Other long chains include 2015.A_virginicus (8 usages/4 non-identical names), 2014.A_capillipes (8/2), 2015.A_hirsutior (7/4), and 2015.A_tenuispatheus (7/4). At the other end of the length spectrum, there are five concepts whose meanings are unique to one classification (two authored in 1979; and one in 1968, 1950, and 1948, respectively).
The least favorable name:meaning cardinality among the 21 chains is 4:1, meaning that four non-identical names are used to identify sets of taxonomically congruent concepts. This ratio applies to six concept chains: 2015.A_capillipes, 2015.A_dealbatus, 2015.A_virginicus, 2015.A_glaucopsis, 2015.A_hirsutior, and 2015.A_tenuispatheus. Conversely, a cardinality of 1:1 is obtained in 9/21 chains, of which only four have more than one usage (Table 5).
The information shown in Tables 4 and 5 provides an intuitive sense of how taxonomic names fare in the longer term as identifiers of taxonomic meanings in the Andro-UC. The performance of names should be evaluated in the context of taxonomic stability. High taxonomic stability would be reflected in an abundance of occupied cells in Table 5, because early-authored concepts would have congruent successors – with either identical or non-identical names – in the 1889–2015 time interval. This is not the case: only 85/231 cells (36.8%) have values, and 14/16 chains (87.5%) with multiple elements are ‘interrupted’.
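The occupancy figures above follow from the dimensions of Table 5: 21 chains by 11 classifications gives 231 cells, and 21 chains minus the five single-member chains leaves 16 chains with multiple elements. The sketch below shows how such an occupancy count could be taken from a chain-by-classification table; the toy table and the function name `occupancy` are hypothetical, and only the final two calculations use numbers stated in the text.

```python
def occupancy(table):
    """Count filled cells in a chain-by-classification table.

    Each row is one congruent concept chain; a cell holds a name index
    when the chain has a member in that classification, else None.
    Returns (filled_cells, total_cells).
    """
    cells = [cell for row in table for cell in row]
    filled = sum(cell is not None for cell in cells)
    return filled, len(cells)


# Hypothetical toy table: 2 chains x 3 classifications.
toy_table = [
    [1, None, 1],
    [None, None, 2],
]

# Figures stated in the text for Table 5.
CHAINS, CLASSIFICATIONS, FILLED = 21, 11, 85
total_cells = CHAINS * CLASSIFICATIONS
occupied_pct = round(100 * FILLED / total_cells, 1)
interrupted_pct = round(100 * 14 / (CHAINS - 5), 1)

if __name__ == "__main__":
    print(occupancy(toy_table))
    print(total_cells, occupied_pct, interrupted_pct)
```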
In spite of persistent taxonomic meaning evolution, identifiers could nevertheless (in principle) be designed to achieve a name:meaning cardinality of 1:1. In that case taxonomic names would simultaneously show a score of 1 in the “Meanings” column of Table 4 and a score of 1 in the “Names” column of Table 5. Thirty-two names meet the former condition, and nine names meet the latter condition. However, the intersection of these two sets includes only one name –
In summary, even though names used in the Andro-UC act as identifiers of meanings with reliability ratios of 56.6% or higher in the local, pairwise alignments (Tables 1–3), their global reliability is such that >97.2% diverge from an ideal name:meaning cardinality of 1:1. This assessment remains adequate even if taxonomic change is taken into account.
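The 1:1 criterion used in this section can be sketched as a small computation over name usage records. The record format (pairs of a name and a meaning identifier), the toy data, and the function names are hypothetical; the sketch only illustrates the two conditions, namely that a name symbolizes exactly one meaning and that this meaning is symbolized by no other name. The closing figures assume that 35 of the 36 names diverge from 1:1, which is our reading of the single-name intersection reported above.

```python
from collections import defaultdict


def cardinalities(usages):
    """Index usages both ways: meanings per name and names per meaning."""
    meanings_per_name = defaultdict(set)
    names_per_meaning = defaultdict(set)
    for name, meaning in usages:
        meanings_per_name[name].add(meaning)
        names_per_meaning[meaning].add(name)
    return meanings_per_name, names_per_meaning


def one_to_one_names(usages):
    """Names with exactly one meaning that no other name shares."""
    meanings_per_name, names_per_meaning = cardinalities(usages)
    result = []
    for name, meanings in meanings_per_name.items():
        if len(meanings) == 1:
            (meaning,) = meanings
            if names_per_meaning[meaning] == {name}:
                result.append(name)
    return sorted(result)


# Hypothetical usage records: (name, meaning identifier).
toy_usages = [
    ("A_alpha", "m1"), ("A_alpha", "m1"),   # stable name, stable meaning
    ("A_beta", "m2"), ("A_beta", "m3"),     # one name, two meanings
    ("A_gamma", "m3"),                      # second name for meaning m3
]

usage_ratio = round(88 / 46, 2)             # usages per unique name:meaning combination
divergent_pct = round(100 * 35 / 36, 1)     # assumed: 35 of 36 names diverge from 1:1

if __name__ == "__main__":
    print(one_to_one_names(toy_usages))
    print(usage_ratio, divergent_pct)
```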
Discussion
We focus our discussion on the performance of names as identifiers of taxonomic concepts, emphasizing new insights gained from our representation and reasoning approach. We also assess the relevance of the RCC-5 multi-taxonomy alignment approach for wider application (with scalability implications) in the biodiversity data realm, and potential applications to other semantic integration tasks.
New knowledge products
What aspects of our approach are new and valuable? The Andro-UC illustrates the unique ability of RCC-5 multi-taxonomic alignments to resolve taxonomic meaning evolution at more granular levels than is possible using taxonomic names and nomenclatural relationships [70]. This follows directly from the input information – we can only represent and align the entities shown in Fig. 1 if taxonomic concept labels and RCC-5 articulations are used. Critically, the approach requires an initial set of articulations provided by human (expert) users, and grounded in their assessments of pertinent taxonomic evidence, that satisfy criteria of consistency and (lack of) ambiguity to yield well-specified alignments. Compliance with these criteria is achieved by the interactive toolkit workflow [33].
New knowledge products for the Andro-UC include the output MIR, alignment visualizations, and name:meaning cardinality analyses. Through the reasoning process, the set of 88 user-provided input articulations is logically tested and augmented to yield 824 Maximally Informative Relations (1025 MIR if the root concepts are included). The MIR derived for each alignment can be queried to determine whether any concept pairs (and ancillary biological data) are suitable for integration, or not [36,43,53,85,86]. In particular, articulations of congruence (“yes, integrate”) and exclusion (“no”) between two concepts are reciprocally actionable in this context. Proper inclusion and inverse proper inclusion are at least unilaterally actionable without ambiguity (“add data assigned to the less inclusive concept to those of the more inclusive one”). Overlap is the most challenging articulation for the purpose of merging ancillary information. However, in some instances overlap at higher levels in an alignment can be resolved into proper inclusion at lower levels. For instance, the 2014/2006 alignment (Fig. 6A) shows the articulation 2014.A_glomeratus
The alignment visualizations are logically congruent with the output MIR [14,16,23,87]. Their unique value lies in aiding human users to understand multi-concept relationships through tree-like representations. Visualization tools for multi-taxonomy relationships have advanced significantly over the past 20 years [3,5,16,46,47,98]. Nevertheless, the Euler/X toolkit is the first platform to leverage RCC-5 relationships and logic reasoning to yield comprehensive, tree-like alignment visualizations.
The visualizations communicate uniquely valuable information. For instance, Figs 1–3 all show information related to the 1948/1933 alignment. Figure 2 effectively visualizes the lowest-level entities and articulations of the entire Andro-UC, but is not well suited for input taxonomies nested into three or more ranks (or phylogenetic levels). Such tables are ‘flattened’ into two dimensions. Figure 3, in turn, can show all nested entities for the individual 1948/1933 taxonomies, but does not provide accurate multi-concept alignment information. Using names to navigate across these trees may lead to erroneous conclusions such as 1948.A_virginicus | 1933.A_glomeratus, when the proper articulation is
In contrast, the alignment visualizations (Fig. 1) simultaneously communicate information about nomenclatural identity, multi-level tree hierarchy, and multi-tree in-/congruence. Their interpretation is intuitive; for instance, the proportion and position of grey squares versus green rectangles or yellow octagons communicate the extent and localization of taxonomic in-/congruence in an alignment (compare, e.g., Figs 5B and 5C). The relative occurrences of (=, ≠, +, −) annotations show the degree to which taxonomic names can reliably integrate taxonomic meanings.
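Returning to the MIR-based integration guidance described earlier in this section, the following sketch shows one way the rules could be applied to a list of articulations. It assumes a hypothetical triple format (left concept, relation symbol, right concept) and hypothetical concept labels; it does not reflect the actual layout of the toolkit's MIR output files. Congruence permits merging in both directions, proper inclusion permits adding the data of the less inclusive concept to the more inclusive one, and overlap and exclusion yield no merge.

```python
def sources_for(target, mir):
    """Return concepts whose data may be added to the data of `target`.

    `mir` is a list of (left, relation, right) triples, where the relation
    is one of "==", ">", "<", "><", "|" (illustrative symbols) and is read
    as "left <relation> right".
    """
    sources = []
    for left, relation, right in mir:
        if left == target and relation in ("==", ">"):
            # target is congruent with, or properly includes, the right concept
            sources.append(right)
        elif right == target and relation in ("==", "<"):
            # the left concept is congruent with, or properly included in, target
            sources.append(left)
    return sources


# Hypothetical MIR for a two-taxonomy alignment.
toy_mir = [
    ("T2.A", "==", "T1.A"),   # congruent: merge either way
    ("T2.B", ">", "T1.C"),    # T1.C data can be added to T2.B
    ("T2.B", "><", "T1.D"),   # overlap: no automatic merge
    ("T2.A", "|", "T1.C"),    # exclusion: never merge
    ("T2.E", "<", "T1.D"),    # T2.E data can be added to T1.D
]

if __name__ == "__main__":
    for concept in ("T2.A", "T2.B", "T1.D", "T1.C"):
        print(concept, sources_for(concept, toy_mir))
```

In this reading, the overlapping pair is simply left unmerged; resolving it would require inspecting lower-level articulations or expert input, as discussed above.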
Building better identifiers for biodiversity data
How relevant is our representation approach to the broader, semantics-facilitated biodiversity data realm? Multiple reviewers raised this important question. We think that it is too early to attempt a comprehensive answer. Technical, scientific, cognitive-evolutionary, and socio-cultural constraints affect how identifier granularity is managed in the biological domain. Predicting how the RCC-5 alignment approach will fare in light of these trade-offs is beyond the scope of this analysis. We can, however, assess the particulars and generalities of the Andro-UC, and what it teaches us about identifying and linking taxonomically identified information in open-ended biodiversity data environments.
The scale of the Andro-UC is small. Weakley’s (2015) classification recognizes seven species-level concepts – more than any other taxonomy. The taxonomic history is evidently complex, but no more complicated than that of many other continuously revised groups [1,20,32,35,43,68,77,90]. The poor performance of names as identifiers of divergent taxonomic meanings is not exceptional for the field (herein broadly defined to include phylogenetics).
The problem of name:meaning dissociation in biological taxonomy is systemic. It is rooted in Code-mandated principles that promote stability and change in naming (largely) as a function of nomenclatural type identity and priority. To some degree the inadequacies are manageable through social processes, including conservative re-/naming practices or ‘standardized’ taxonomies [6,44,55,79,91]. In practice, the long-term drawbacks of using taxonomic names as concept identifiers are frequently mitigated by the ability of well-trained human scientists to contextualize name usages and thereby infer the intended meanings [30,32,69]. However, no counteracting human practice can alter the insight that taxonomic names and nomenclatural relationships are fundamentally not designed to track granular similarities and differences in taxonomic meaning of the sort exemplified in the Andro-UC. Computer algorithms in particular struggle with inferring what “
Specifying the referential extension of taxonomic names for reliable reuse requires more than ostension (the act of pointing) to exemplars (types). Ostensive definitions of taxonomic meanings are bound to under-specify the intended meanings in many applied contexts, such as those of the Andro-UC. Instead it is more appropriate to model the name-to-(currently-perceived-)taxon linkage as a matter of theory construction [31,75]. The challenge of integrating biodiversity data then becomes one of aligning multiple taxonomic theories, which can be modeled with the RCC-5 approach.
The aforementioned insufficiencies are most apparent in cases of multi-concept overlap. Such cases are frequent in taxonomy, and they cannot be reduced to differences in degree of resolution [26,65,68,77]. As an example, the 1950/1948 classifications of the Andro-UC concur that there are two identically named species-level concepts entailed in the complex (Fig. 4C). They also concur that 1950/1948.A_virginicus has three variety-level child concepts. However, they disagree on the extent to which the available, type-anchored names reach out to perceived, and necessarily more inclusive, taxa presumed (more precisely: theorized) to constitute natural, evolutionary entities [9]. As a consequence of this differential inference of ‘extra-typical’ taxonomic boundaries, the four 1950/1948 species-level concepts overlap in complex ways (Fig. 4C). Such multi-theory overlap is more frequent at higher taxonomic levels, where the performance of names as identifiers of taxonomic meanings becomes increasingly poor [33,35,36].
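The difference between overlap and mere differences in resolution can be made concrete with a minimal sketch. It assumes, purely for illustration, that each concept's extension can be written down as a finite set of member identifiers; the labels T1.A, T2.A, etc. and the members s1 to s5 are hypothetical placeholders, not Andro-UC data, and the function name `articulate` is our own. In actual use the articulations are asserted by experts rather than computed from enumerated members.

```python
from enum import Enum


class RCC5(Enum):
    """The five RCC-5 articulations used in this paper."""
    CONGRUENCE = "=="
    PROPER_INCLUSION = "<"
    INVERSE_PROPER_INCLUSION = ">"
    OVERLAP = "><"
    EXCLUSION = "!"


def articulate(a, b):
    """Return the RCC-5 relation of extension a to extension b.

    Assumption: both extensions are non-empty finite sets.
    """
    a, b = frozenset(a), frozenset(b)
    if not a or not b:
        raise ValueError("concept extensions must be non-empty")
    if a == b:
        return RCC5.CONGRUENCE
    if a < b:  # strict subset
        return RCC5.PROPER_INCLUSION
    if a > b:  # strict superset
        return RCC5.INVERSE_PROPER_INCLUSION
    if a & b:  # shared members, but neither contains the other
        return RCC5.OVERLAP
    return RCC5.EXCLUSION


# Hypothetical extensions: two taxonomies reuse the names "A" and "B"
# but draw the boundary between the two concepts differently.
t1 = {"T1.A": {"s1", "s2", "s3"}, "T1.B": {"s4", "s5"}}
t2 = {"T2.A": {"s1", "s2", "s4"}, "T2.B": {"s3", "s5"}}

for label1, ext1 in t1.items():
    for label2, ext2 in t2.items():
        print(label1, articulate(ext1, ext2).value, label2)
```

In this toy input every pairing of a T1 concept with a T2 concept is an overlap, even though both taxonomies use the same two names. Neither taxonomy is a finer-grained version of the other, which is the situation that name matching alone cannot express.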
The alignment approach demonstrated herein paves the way for building better taxonomic concept identifiers and multi-taxonomy resolution services.
Scalability of the RCC-5 alignment approach
How widely applicable (or scalable) is the RCC-5 multi-taxonomy alignment approach within the field of biological taxonomy? Generally speaking, reasoning about taxonomies with RCC-5 remains in its infancy [16,33,36,43,87]. At present, the Euler/X toolkit can effectively process consistent, well-specified, pairwise input taxonomies with up to 200–400 concepts each [14,16,61]. While this scale is sufficient for small- to medium-sized alignment use cases, future toolkit development should concentrate on modularizing the reasoning process, specifically by using a divide-and-conquer approach that better leverages the hierarchical structure of the input constraints and dynamic user/reasoner interactions. Demonstrating the practical value of the approach requires making the toolkit accessible to larger use cases and biodiversity data environments where taxonomy evolution is an important variable to identify and control.
The analysis of the Andro-UC demonstrates the potential of reasoning about taxonomies and at the same time leaves much room for further work. In particular, the 11 input classifications allow for 55 pairwise comparisons, of which only 12 are presented here. This omission is deliberate. Forthcoming toolkit releases, which remain in development, will be able to align more than two input taxonomies simultaneously. Such an approach entails new reasoning challenges and products. For instance, we could ask to what extent the 12 alignments produced in the current study are sufficient for recovering the full set of 55 pairwise alignments, based on transitive reasoning. Solutions to such challenges are relevant to the issue of scalability, and can inform the users’ practice of engaging with the toolkit.
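Two points in this paragraph lend themselves to a short worked sketch. First, the figure of 55 is simply the number of unordered pairs among 11 classifications. Second, transitive reasoning over RCC-5 can be illustrated by composition: given the articulation of A to B and of B to C, which articulations of A to C remain possible? The code below derives the composition table by enumerating which of the seven Venn regions of three sets are occupied. It assumes set-like, non-empty extensions; the function names are our own, and the sketch is not the Answer Set Programming encoding used by the Euler/X toolkit.

```python
from itertools import combinations, product
from math import comb

# 11 classifications yield C(11, 2) unordered pairwise comparisons.
N_PAIRS = comb(11, 2)


def rcc5(a, b):
    """RCC-5 relation of non-empty set a to non-empty set b, as a symbol."""
    if a == b:
        return "=="
    if a < b:
        return "<"
    if a > b:
        return ">"
    return "><" if a & b else "!"


# The seven Venn regions of three sets A, B, C, encoded as membership
# triples (in_A, in_B, in_C); the region outside all three sets is omitted.
REGIONS = [r for r in product((0, 1), repeat=3) if any(r)]


def composition_table():
    """Map (rel(A,B), rel(B,C)) to the set of possible rel(A,C)."""
    table = {}
    for k in range(1, len(REGIONS) + 1):
        for occupied in combinations(REGIONS, k):
            a = frozenset(r for r in occupied if r[0])
            b = frozenset(r for r in occupied if r[1])
            c = frozenset(r for r in occupied if r[2])
            if not (a and b and c):
                continue  # every concept must have a non-empty extension
            key = (rcc5(a, b), rcc5(b, c))
            table.setdefault(key, set()).add(rcc5(a, c))
    return table


TABLE = composition_table()
print(N_PAIRS)
print(TABLE[("<", "<")])   # A inside B, B inside C
print(TABLE[("!", "!")])   # A excludes B, B excludes C
```

Where the composed set contains a single relation, the A-to-C articulation is fully determined by the two given alignments; where it contains several, the unstated alignment cannot be recovered from those two alone and further input articulations would be needed.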
Pathways to broader implementation should focus on directly integrating the use of taxonomic concept labels, parent/child relationships, RCC-5 articulations, and reasoning and visualization services into prominent biodiversity data platforms [2,34,35,48,57,59]. We envision information environments where identifications of organismal occurrence records are augmented to the level of carrying taxonomic concept labels [53]. The circumscriptions of the respective concepts are also managed in the platform, and consistent, well-specified RCC-5 alignments are provided. Building such an infrastructure would permit biologically significant queries of the following types. (1) Return all records identified to the name
The above queries (2)–(6) are biologically significant and depend on utilizing the RCC-5 alignment approach to achieve the desired degree of resolution. In our assessment, such logic-enabled integration services are urgently needed to build open-ended biodiversity data environments that can manage the complexities of evolving taxonomic knowledge. Strengths of the RCC-5 approach in this context include explicitness, consistency, machine-interpretability, and flexibility in processing diverse forms of taxonomic concept input ranging from minimally structured lists of taxonomic concept labels to phylogenies and monographic revisions [8,33,35,36,53].
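As a generic illustration of how such a service might resolve a concept-level query, the following sketch filters occurrence records by articulation rather than by name. Everything in it is hypothetical: the `Record` type, the function `records_within`, the concept labels (T1.A, T2.A, etc.), and the articulations themselves are placeholders chosen for the example. It is not one of the specific queries listed above, nor an existing platform interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str
    concept_label: str  # name usage individuated by source, e.g. "T1.A"


# Hypothetical RCC-5 articulations, keyed as (source concept, target concept).
ARTICULATIONS = {
    ("T1.A", "T2.A"): "><",  # overlap
    ("T1.B", "T2.A"): "<",   # proper inclusion
    ("T1.C", "T2.A"): "!",   # exclusion
    ("T1.D", "T2.A"): "==",  # congruence
}


def records_within(target, records, articulations):
    """Return records whose concept certainly falls within the target concept.

    A record qualifies if it carries the target label itself, or if its
    concept is congruent with or properly included in the target.
    """
    safe = {"==", "<"}
    return [
        r for r in records
        if r.concept_label == target
        or articulations.get((r.concept_label, target)) in safe
    ]


records = [
    Record("r1", "T1.A"),
    Record("r2", "T1.B"),
    Record("r3", "T1.C"),
    Record("r4", "T1.D"),
    Record("r5", "T2.A"),
]

hits = records_within("T2.A", records, ARTICULATIONS)
print([r.record_id for r in hits])
```

Records identified to an overlapping concept (r1 in this toy input) are deliberately left out, because the articulation alone does not say on which side of the target's boundary the record falls. A name-based query would have no means of making that distinction.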
Non-taxonomic alignment challenges
The RCC-5 alignment approach has so far been limited to use cases in biological taxonomy. Exploring the toolkit’s performance on other integration challenges is advisable where the new focal domain shares several of the toolkit’s critical (taxonomic) input/output constraints [87]. This means that other semantic integration challenges that need to consistently align and visualize multiple, hierarchically structured sets of concepts with coverage and/or disjoint siblings constraints may benefit from exploring the RCC-5 alignment approach.
Our approach can be complemented by Semantic Web methods that reason over concept similarity and drift by leveraging Natural Language Processing techniques and relationships defined in OWL-DL ontologies [13,22,27,37,73,80,92]. Such complementary analyses of concept identity and semantic evolution are now possible.
