Abstract
One of the key value propositions of knowledge graphs and semantic web technologies is fostering semantic interoperability, i.e., integrating data across different themes and domains. But why do we aim for interoperability in the first place? A common answer is that each individual data source only contains partial information about some phenomenon of interest. Consequently, combining multiple diverse datasets provides a more holistic perspective and enables us to answer more complex questions, e.g., those that span the physical and social sciences. Interestingly, while these arguments are well established and go by different names, e.g., variety in the realm of big data, we seem less clear about whether the same arguments apply at the level of schemata. Put differently, we want diverse data, but do we also want diverse schemata, or a single one to rule them all?
Diverse data
Let us first answer the question of what data diversity is or could be. Several different perspectives come to mind.
Diverse schemata?
Interestingly, while the first two types of data diversity are unquestioned and have been part of data science theory and practice for years, this last type of diversity is often misunderstood and invites controversy. Intuitively, as scientists, we are inclined to believe that one perspective is more accurate, less biased, simpler, leads to better predictions, and so forth. To some degree, the idea of several equally valid perspectives seems alien to us, perhaps because it appears to collide with the law of the excluded middle, by which two statements that disagree cannot both be true.
Similarly, we tend to believe that data are raw [15], i.e., independent of who observes them. However, this is not always the case, particularly not for categorical data. For example, there is no ‘true’ definition of poverty, gender, forest, or planet, yet these terms play a prominent role in science and society. While it seems easy to claim that only the first two are partially defined by culture and society, the same is true for the latter examples, as is evident from the more than 600 commonly used (and contradictory) definitions of forest [1, 9] and the changes [14] to the category of planets over time.
Put differently, many concepts are cognitive artifacts, and there are many ways to construct them.1
1. This should not be confused with questioning scientific methods or the need for well-established definitions of physical quantities, and so forth.
It is interesting to examine how disciplines that cannot afford the crisp nature of top-down axiomatic knowledge representation address this discussion. For instance, in machine learning and representation learning, data diversity is desirable during training to ensure that the resulting model captures the entire range of cases that it will encounter in the wild. Similarly, recent work on linguistic embeddings can distinguish between the different meanings that terms take depending on their context [12]. Another example is cognitive science, in which many different theories of categorization are studied, including concepts with multiple prototypes [10].
From an even more abstract stance, the ongoing cultural goal of increasing workforce diversity is rooted in the assumption that diversity improves representation. Put differently, who we are and where we come from (culturally and geographically) influence how we experience, i.e., categorize, the world around us.
This has important consequences for both the schemata we design and the representations we learn. Intuitively, increasing the number of classes to be distinguished reduces the accuracy of a model (all other parameters, such as training set size, being equal). Similarly, TBox axioms that are easy to learn by rule mining from existing knowledge graphs may still fail to capture the context of the data when, for lack of diversity in their construction, the schemata do not match how the data arise in reality [8]. Finally, for some concepts, we may even end up in situations where the features that can be extracted from a given source, e.g., a facial image, can no longer be used effectively for a task at hand, e.g., classification.
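The intuition that more classes (at fixed data and model capacity) means lower accuracy can be made concrete with a small, self-contained sketch. Everything here is an illustrative assumption, not from the paper: synthetic one-dimensional Gaussian data and a simple nearest-mean classifier, with class means packed into a fixed interval so that a larger number of classes increases the overlap between neighbors.

```python
import random

def nearest_mean_accuracy(k, n_per_class=2000, noise_sd=1.0, seed=42):
    """Accuracy of a nearest-mean classifier on synthetic 1-D data.

    The k class means are spread evenly over the fixed interval [0, 10],
    so a larger k packs the classes closer together and increases the
    overlap between neighboring classes.
    """
    rng = random.Random(seed)
    means = [10.0 * i / (k - 1) for i in range(k)]
    correct, total = 0, 0
    for true_class, mu in enumerate(means):
        for _ in range(n_per_class):
            x = rng.gauss(mu, noise_sd)  # noisy observation of true_class
            # Predict the class whose mean is closest to the observation.
            predicted = min(range(k), key=lambda c: abs(x - means[c]))
            correct += predicted == true_class
            total += 1
    return correct / total

# Same data-generating range, same model; only the number of classes grows.
acc_2 = nearest_mean_accuracy(2)
acc_10 = nearest_mean_accuracy(10)
print(f"2 classes: {acc_2:.3f}, 10 classes: {acc_10:.3f}")
```

The exact numbers depend on the assumed spread and noise; the point is only the direction of the effect, which is what the argument above relies on.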
The resulting dilemma can be summarized nicely by the following observation: diverse data and application needs require diverse schemata, while interoperability and integration benefit from common schemata. So, how do we support diverse (even contradictory) schema knowledge while avoiding another Tower of Babel? Many potential solutions were discussed in the classical AI literature decades ago, such as the notion of contexts and microtheories [11]. However, they do not answer how much diversity across schemata we deem beneficial, nor how to strike the right balance between the increasing complexity of individualized schemata and the need for efficient retrieval and integration. In fact, efforts such as Schema.org seem to favor single, shallow vocabularies to fulfill application needs. Modular ontology design, supported by structural patterns and expressive alignments between ontologies, is another path forward [13]. Despite success stories, a large-scale, industry-strength application of these ideas is still missing.2
2. In fact, the authors are working on such an application as part of their KnowWhereGraph project, see
However, this is not a technical paper but one meant to start an important discussion: How diverse do we want our schemata to be, and what price are we willing to pay in terms of prediction accuracy and reduced interoperability?
Acknowledgements
The authors acknowledge support from NSF award 2033521.
