Abstract
For a domain with a strong focus on unambiguous identifiers and meaning, the Semantic Web research field itself has a surprisingly ill-defined sense of identity. Started at the end of the 1990s at the intersection of databases, logic, and the Web, and influenced along the way by all major tech hypes such as Big Data and machine learning, our research community needs to look in the mirror to understand who we really are. The key question amid all possible directions is pinpointing the important challenges we are uniquely positioned to tackle. In this article, we highlight the community’s unconscious bias toward addressing the Paretonian 80% of problems through research – handwavingly assuming that trivial engineering can solve the remaining 20%. In reality, that overlooked 20% could actually require 80% of the total effort and involve significantly more research than we are inclined to think, because our theoretical experimentation environments are vastly different from the open Web. As it turns out, these formerly neglected “trivialities” might very well harbor those research opportunities that only our community can seize, thereby giving us a clear hint of how we can orient ourselves to maximize our impact on the future. If we are hesitant to step up, more pragmatic minds will gladly reinvent technology for the real world, only covering a fraction of the opportunities we dream of.
Back to the future
Re-reading the original Semantic Web vision [7] from 2001, we immediately notice where the predictions went wrong. Far less obvious are those that came true; they have become givens in today’s world, part of the new normal that now forms our everyday reality. We have forgotten the era ruled by the indestructible Nokia 3310, whose monochrome screen barely counted more pixels than a modern-day app icon, years before most people had Internet access at home – let alone on their phone. The crazy thing was imagining that we would be instructing our mobile devices to perform actions for us; the planning and realization of those actions were plausibly explained in the rest of the article. With the unimaginable eventually being solved after a decade of research, the imaginable may have turned out to be the toughest nut to crack.
The Semantic Web’s roots can be traced further back to the initial Web proposal [2], whose opening diagram presents what we now refer to as a knowledge graph, an early glimpse into subject–predicate–object triples rather than the untyped hyperlinks that would come to dominate the Web.
Nonetheless, semantic technologies are regularly put forward as a means of tackling some of the Web’s most pressing challenges, such as combating disinformation or fueling its re-decentralization movement [23].
Meanwhile, the Semantic Web research community is facing its own battles with some of the latest technological hypes, torn between defending its own relevance next to Big Data, machine learning, and blockchain, and surfing atop the waves they create.
If you can’t beat them, join them; if you can’t join them, repackage. The days when the keyword “semantics” led to guaranteed project funding have faded faster from our collective memory than the Nokia 3310 ever will.
Granted, cracks have started creeping into these other technologies, too. Maybe Big Data is not limitless in practice if technical capabilities scale faster than the human and legal processes for ethical data management, and we do need to link data across distributed sources instead of unconditionally aggregating them. Perhaps there are problems that machine learning can never solve reliably, and the safety provided by first-order logic proofs is irreplaceable for crucial decisions. And possibly it will turn out that decentralized consensus only touches a small part of all use cases, that disagreement under the “anyone can say anything about anything” flag provides a more workable model of the virtual world.
So when we are not riding others’ waves, what is it that unites the Semantic Web research community? What makes us truly “us”, what are the semantics we can attach to our own identity? Having emerged at the intersection of the Web, databases, and logic, we have since become disconnected from these domains, our awareness of which sometimes appears to be frozen in time. We tend to disregard that the Web from which we spun off is no longer the same as it was, and that different approaches are required today. We have held on to
The main danger within an existential crisis is the risk of losing our connection to the reality from which we originate. The philosophy of our community seems to align with Alan Kay’s quote that “The best way to predict the future is to invent it.” We build and we investigate, expecting the future to wrap its arms around the creations we are spawning. In this vision article, we rather embrace John Perry Barlow’s inversion of the quote, in which “The best way to invent the future is to predict it.” Looking back at the dreams from the past and recombining those with the aspirations of the present, what are the essential missing pieces that require our unique dedication as Semantic Web scholars? As in the original Semantic Web article, those topics that have long been considered trivial might very well be the hardest ones in practice [21].
In this article, we make the case for a return to our roots of “Web” and “semantics”, from which we as a Semantic Web community – what’s in a name – seem to have drifted in search of other pursuits that, however interesting, perhaps needlessly distract us from the quest we had tasked ourselves with. In covering this journey, we have no choice but to trace those meandering footsteps along the many detours of our community – yet this time around with a promise to come back home in the end.
A little semantics
The term “Semantic Web” evidently refers to adding semantics to Web content in order to improve interpretation by machines. However, after two decades of debate, we still seem uncertain about exactly how much semantics are in fact useful. The gap between data that are published and applications that should consume them continues to grow. While the call for Linked Data has brought us the eggs, the chickens that were supposed to be hatching them are still missing, partly because making sense of others’ data remains hard.
To intertwine data with meaning, we rely on ontologies: shared, formal conceptualizations of a domain.
Early efforts were devoted to the development of ontology engineering, and understandably so. Having generic software to automatically act on a variety of independent data sets was what made the Semantic Web vision so appealing. Once domain knowledge had been formalized, it could be applied to represent facts, from which reasoners could automatically derive new facts. Yet once we took those endeavors to the Web, it became apparent we had missed the general practical implications. As semantics are always consensus-based, domain models are only as valuable as the scope of the underlying consensus. Hence, their usage cannot be guaranteed by parties that were not involved or disagree with the consensus. Often, people resort to mitigation strategies that disregard the semantics enshrined in description logic, by selectively reusing properties and classes upon publication, or freely reinterpreting semantics upon consumption.
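The promise of “automatically deriving new facts” can be made concrete with a minimal sketch. The vocabulary and instance data below are invented for illustration, and a real reasoner implements many more entailment rules; this shows only the single RDFS rule that propagates types along a subclass hierarchy:

```python
# Minimal forward-chaining sketch over an rdfs:subClassOf hierarchy.
# Illustrative data: a real reasoner covers the full RDFS/OWL rule sets.
def rdfs_type_closure(triples):
    """Repeatedly apply the RDFS rule:
       (x rdf:type C) and (C rdfs:subClassOf D)  =>  (x rdf:type D)."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        subclass = {(s, o) for s, p, o in facts if p == "rdfs:subClassOf"}
        for s, p, o in list(facts):
            if p == "rdf:type":
                for c, d in subclass:
                    if c == o and (s, "rdf:type", d) not in facts:
                        facts.add((s, "rdf:type", d))
                        changed = True
    return facts

data = {
    ("ex:Mia", "rdf:type", "ex:Cat"),
    ("ex:Cat", "rdfs:subClassOf", "ex:Mammal"),
    ("ex:Mammal", "rdfs:subClassOf", "ex:Animal"),
}
inferred = rdfs_type_closure(data)
# The closure contains facts that were never stated explicitly:
assert ("ex:Mia", "rdf:type", "ex:Animal") in inferred
```

The value of the derived facts hinges entirely on everyone agreeing with the modeled hierarchy – which is precisely the consensus problem described above.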
Core frameworks such as RDF, RDFS, and OWL provide the formal machinery to express such models, yet formal machinery alone does not create the consensus on which their value depends.
The disconnect between the need for semantics and the effort required to provide them has cultivated a heterogeneous and underspecified Web of Data [20]. We cannot afford any longer to handwavingly address practical implementation and usability with deep theories. As depicted in Fig. 1, a strong implicit assumption underlies a lot of our work: that solving the core 80% of a problem is where research is needed, and that the remaining 20% consists of simple engineering to take that research from theory to practice. However, is what we often dismiss as “engineering” really just a matter of writing more code? As scientists, we might want to validate that hypothesis, given the considerable problems that arise when we try to deploy semantics at Web scale.

Fig. 1. After having solved the core 80% of a research problem, we often assume that the remaining 20% are practicalities that can be addressed through trivial engineering. In reality, lifting research from controlled experimental environments to the open Web likely leads to other research problems. In addition to bringing problems from theory to practice, we can let practical problems inspire theory.
We need to consider the Web we have, before we can have the Web we want. After all, what good is high-performance inferencing if ontologies cannot be found or are outdated? What good are unique identifiers for concepts when stating equality with owl:sameAs happens both too liberally and too sparingly?
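The equality problem is easy to demonstrate: because owl:sameAs is symmetric and transitive, a single careless link collapses entire identity clusters. The identifiers below are illustrative (wikidata:Q90 really does denote Paris, France; the others are invented for the sketch):

```python
# Sketch: owl:sameAs links induce identity clusters via their symmetric,
# transitive closure; one wrong link conflates everything it touches.
from collections import defaultdict

def sameas_clusters(links):
    """Group IRIs into identity clusters using union-find."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for a, b in links:
        union(a, b)
    clusters = defaultdict(set)
    for x in parent:
        clusters[find(x)].add(x)
    return list(clusters.values())

links = [
    ("dbpedia:Paris", "wikidata:Q90"),    # correct: Paris, France
    ("ex:Paris_Texas", "ex:ParisTX"),     # correct: Paris, Texas
    ("wikidata:Q90", "ex:Paris_Texas"),   # one careless sameAs link...
]
clusters = sameas_clusters(links)
# ...and all four identifiers collapse into a single "identity":
assert len(clusters) == 1
```

On the open Web, no central authority can retract that third link; consumers have to decide for themselves which equalities to trust.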
What arguably sets us apart besides semantics is, well, the Web. In contrast to relational or other databases, our domain of discourse is infinite and unpredictable on multiple levels. Because of the open-world assumption, no single source can ever be assumed complete: the absence of a fact does not imply its falsehood, and anything stated in one place can be contradicted in another.
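The contrast can be sketched in a few lines, with invented facts: a closed-world database answers “no” for any absent fact, whereas under the open-world assumption absence only means “unknown”:

```python
# Sketch: closed-world vs. open-world evaluation of the same query.
# Facts are illustrative.
known_facts = {("ex:alice", "ex:knows", "ex:bob")}

def closed_world(fact):
    # A database assumes its contents are complete: absent means false.
    return fact in known_facts

def open_world(fact):
    # On the Web, we can never enumerate all sources: absent means unknown.
    return True if fact in known_facts else "unknown"

q = ("ex:alice", "ex:knows", "ex:carol")
assert closed_world(q) is False      # a database would say "no"
assert open_world(q) == "unknown"    # the Web can only say "we don't know"
```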
The Web is what we deliver as an answer to any Linked Data skeptic, as an irrefutable argument that all of our perceived or actual complexity is justified, because we are dealing with problems that span the entire virtual address space of the globe and in fact the universe. The Web is the reason why our ontologies are spread all over the place, why the prefix expansion for the vocabularies we depend on can break the moment a remote server disappears.
We are not even talking here about taking our scholarly communication to the Web; let that be the crusade of the dogfooders [9], to whom we dedicate Section 7. We mean to say that “it works in our university basement” has become an acceptable and applauded narrative – and to be fair to both the innocent and the guilty, impressive efforts undertaken in such basements have rightly been awarded scientific stamps of excellence through rigorous non-Web peer review processes. However, we cannot claim the Web as the sole source of our intricacies, while simultaneously ignoring all of the Web’s difficulties by conducting all of our experiments in hermetically controlled environments. By doing so, we pretend that the comfortable 80% cannot be significantly affected by the unpredictable impurities of the 20%, that an n-fold performance gain in our basements can be directly extrapolated to the same gain for Linked Data in general. As Goodhart’s law states: “When a measure becomes a target, it ceases to be a good measure”, except that we can strongly question whether non-Web environments, pure and controlled as they are, have ever fulfilled the role of good measure providers in the first place.
No, we cannot safely assume that the performance and behavior we observe in controlled settings will carry over to the open Web.
Upon closer reflection, our fears about testing on the Web are probably justified; our scientific conclusions and their presumed external validity perhaps a little less.
In all honesty, the academic community did take its publish or perish adage to heart, and is co-responsible for the billions of triples that were published with little regard for whether anyone would ever consume them.
We should, however, not become too puristic in our judgment; an important aspect of scientific studies is their ability to zoom in on the isolated contribution of specific factors. Many valid use cases for non-Web experiments remain, as long as we do not mistake their conclusions for evidence of Web-scale validity.
“Linked” as bigger than “Big”
When Big Data became mainstream around 2010, the Semantic Web community was listening with great attention. After all, we had already been working with staggering numbers of facts, hundreds of millions of triples not being an exception. Furthermore, when considering all data on the Web as a whole, we would surely reach the threshold at which Linked Data should be considered Big Data in its own right.
However, Big Data and Linked Data are not necessarily structurally compatible. A main advantage of the Big Data approach is its homogeneity: aggregated data can be processed uniformly at scale, whereas Linked Data is heterogeneous and distributed by design.
A conceptual issue with the Big Data vision, at least for our purposes, is that it takes the path of the lowest common denominator, as a natural result of an aggregation process. While aggregation definitely has its merits for discovery and analysis, it also flattens unique characteristics and attributes of individual data sets, dissolving them into a much larger and more homogeneous space. An example of how this can unintentionally become troublesome is found within the Europeana initiative [17], which serves the noble cause of aggregating highly diverse metadata from cultural institutions all across Europe. However, several individual institutions felt wronged when they had to upload their data set – which they knew so well and had taken care of for so many years – only for it to be mingled with those of others who surely would have different accents and inferior quality thresholds [25]. What gives Big Data its attractiveness and efficiency might thus take away what differentiates us. Time will tell if similar arguments can be made about the Wikidata project [28], which aims to be a global knowledge base.
For some time, we have been mildly apologetic about not doing Big Data, at one point hastily rebranding ourselves as “Semantics and Big Data” [10] before realizing that, indeed, there is another research community out there that is better positioned to tackle those challenges. Considering the 2001 article [7] as the official birth date of the Semantic Web, let us conveniently ignore those teenage years during which we should be forgiven for rapidly cycling through different phases as we were in fact just constructing our own identity. We should not aspire to be that popular kid from high school, who, as it turned out later, had merely peaked early in life. Nearing our twenties now, let us stop apologizing already for just being ourselves.
If we conceptually think about Big Data versus what we are aiming to achieve with Linked Data, our challenges might very well be the bigger ones. Notwithstanding the impressive research and engineering efforts that scale up Big Data solutions, harvesting an enormous amount of homogeneous data in a single place creates ideal conditions for processing and analysis. A small number of very large data sets is easier to manage than a very large number of small data sets. Size does matter, just not always in the way others think: the heterogeneity and distribution of Linked Data is currently at a level that cannot be adequately tackled with Big Data techniques. Instead of being ashamed about practicing Small Data, we should proudly flaunt its multitude and diversity. In times of increasing calls for inclusion, let this be a good thing.
Because even if we technically would be able to centralize everything in one place, we could only serve the relatively small space of public data, not all of the private data that is the focus point of Big Data applications. After all, there are very good reasons for data to live in different places, not in the least legal or privacy concerns. Those needs are only becoming more pressing, given important drivers such as the European General Data Protection Regulation, which constrains where and how personal data can be stored and processed.
In a distributed future, there will not be less data, but more; if it cannot reside in one place for whatever reason, it will have to be linked. This is yet another reason why we need to be prepared for Web-scale discovery and querying over federations that are magnitudes more challenging than our current experimental environments.
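A rough illustration of why Web-scale federation is hard: the naive sketch below (source contents invented) evaluates one triple pattern by querying every source and merging the results. On the real Web, each source is a network round trip to a server we do not control, with its own vocabulary, availability, and quality:

```python
# Naive federation sketch: evaluate a triple pattern over many small
# sources and merge the results. None wildcards match anything.
def match(source, pattern):
    s, p, o = pattern
    return {t for t in source
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

def federated_query(sources, pattern):
    # One request per source; on the open Web, each of these can be
    # slow, stale, or simply gone.
    results = set()
    for source in sources:
        results |= match(source, pattern)
    return results

museum_a = {("ex:painting1", "ex:creator", "ex:Magritte")}
museum_b = {("ex:painting2", "ex:creator", "ex:Magritte"),
            ("ex:painting2", "ex:year", "1929")}
hits = federated_query([museum_a, museum_b],
                       (None, "ex:creator", "ex:Magritte"))
assert len(hits) == 2
```

Already with two cooperative sources, join ordering, source selection, and duplicate handling are non-trivial; with thousands of heterogeneous sources, they become research problems in their own right.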
AI beyond ML
There is no question the age of deep learning is very much upon us. As the latest branch of machine learning to mature, deep learning has spawned numerous research efforts, techniques, and even production-ready applications, elevating the state of the art in fields such as computer vision and natural language processing.
Semantic technologies were originally considered part of the symbolic branch of artificial intelligence: knowledge is represented explicitly, and new conclusions follow transparently from formal inference. Machine learning takes the opposite angle, deriving statistical models from large numbers of examples.
As both angles have their merits, the future is very likely hybrid, and we need to further explore complementary roles. For instance, semantics and inference can pre-label data to improve the accuracy of models. Post-execution explainability could be achieved by reasoning over semantic descriptions of nodes. In the area of personal digital assistants, declarative rules and preferences could let assistants act on users’ behalf in a transparent and verifiable way.
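The pre-labeling idea can be made concrete with a minimal sketch (the class hierarchy and sample are invented): an ontology lets us expand a single sparse label into all of its implied superclasses before a model ever sees the data:

```python
# Sketch of semantic pre-labeling: expand sparse training labels along
# a subclass hierarchy, so a learner also sees the broader categories.
# Hierarchy and labels are illustrative.
SUBCLASS_OF = {
    "ex:Cat": "ex:Mammal",
    "ex:Dog": "ex:Mammal",
    "ex:Mammal": "ex:Animal",
}

def expand_labels(label):
    """Return the label plus every superclass implied by the ontology."""
    labels = [label]
    while label in SUBCLASS_OF:
        label = SUBCLASS_OF[label]
        labels.append(label)
    return labels

sample = ("img_0042.jpg", "ex:Cat")
expanded = (sample[0], expand_labels(sample[1]))
assert expanded[1] == ["ex:Cat", "ex:Mammal", "ex:Animal"]
```

A classifier trained on the expanded labels can fall back to a coarser but still correct answer when it is unsure about the fine-grained class.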
Semantic inference and first-order logic might lead to less spectacular conclusions, but they will nonetheless be crucial to advanced machine learning systems. Here, too, it is important to solve the engineering side of things. Several machine learning tools are readily available to developers, who, through testing, discover further challenges. When machine learning solutions “just work”, developers do not need to know what is inside; importantly, such simplicity is the result of research, not just engineering. Getting rid of the “trivial” problems of semantic inference hopefully means providing similarly spectacular results, on the Web. Maybe this is the better way to position ourselves in the next waves to come, such as reinforcement learning.
Challenging until proven trivial
Ultimately, all of the above indicates a need to guard ourselves against conducting research in a vacuum. Not all science requires practical purposes, but many of the research problems we study will never actually occur if the Semantic Web does not take off any further, so we should at least consider – for our own sake – prioritizing those urgent problems that are blockers to its adoption. Part of our hesitance might be that, having fought hard for recognition as a scientific domain, we are afraid of being pushed back into the corner of engineering. We usually zoom in on very focused, often incremental research problems, which tend to bring us progress. Our conferences and journals strive to maintain a high threshold for what qualifies as research, with a strong focus on qualitative experimentation. Thereby, we risk optimizing for familiarity and purity rather than for originality and impact, because the scientific merits of novel directions are inherently much harder to assess. While high thresholds in general are commendable, they also result in a higher percentage of false negatives, both in submitted works that never get accepted, and in stellar research ideas that never materialize because fear of such rejections encourages safer bets.
As much time as we spend justifying ourselves toward other communities, those efforts sometimes pale in comparison to how our reviewers expect authors to justify their choice to address pragmatic concerns that, all things considered, should be no less of a scientific contribution. Pareto’s law from Fig. 1 lurks around the corner: we consider the core 80% of a hard problem and assume that the remaining 20% is a non-issue. Converting technological research into digestible chunks for developers is considered trivial and outside of our scientific duty, despite the considerable scientific challenges of creating simple abstractions over complex technology, as the machine learning community shows time and time again.
Yet everything that reeks of engineering is shunned, even though most researchers in our community have not built a single Semantic Web app, so we cannot pretend to understand the insides of the 20%. As such, it is impossible to tell whether that remainder is trivial or not. We do not get in touch with some of the most pressing issues, because we have already ruled them out as trivial, and then wonder about the reasons for the low adoption of the otherwise excellent 80% research.
Since the Semantic Web started, Web development has changed massively. Many apps are now built by front-end developers, for whom Semantic Web technologies are inaccessible – explaining the success of substantially less powerful but far more developer-friendly technologies such as GraphQL. The GraphQL community, who pride themselves on simplicity compared to the Semantic Web technology stack, are slowly discovering that they were merely solving simpler problems. Queries with local semantics indeed become problematic if data originates from multiple sources. Instead of applying the lessons from years of research on federated querying and data integration, they now find themselves rediscovering those problems one by one.
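The locality of those semantics can be shown in a few lines. The two endpoint payloads below are invented; the two IRIs in the second dictionary are actual schema.org properties:

```python
# Sketch: a field name like "title" has only local semantics, so merging
# responses from two GraphQL-style endpoints silently conflates meanings.
library_response = {"title": "Moby-Dick"}          # title of a book
hr_response      = {"title": "Senior Engineer"}    # title of a job

merged = {**library_response, **hr_response}
assert merged["title"] == "Senior Engineer"        # the book title is gone

# With globally unique identifiers, the two meanings cannot collide:
merged_iri = {
    "http://schema.org/name": "Moby-Dick",
    "http://schema.org/jobTitle": "Senior Engineer",
}
assert len(merged_iri) == 2
```

Within a single back-end, the collision never surfaces; it is only when data from independent sources meet that global identifiers start paying for themselves.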
Designing an appropriate Linked Data developer experience [24] is so challenging because, while regular apps are hard-coded against one specific well-known back-end, Linked Data apps need to expect the unexpected as they interface with heterogeneous data from all over the Web. Building such complex behavior involves a sophisticated integration of many branches of our research, which requires designing and implementing complex program code. Exposing such complex behavior into simple primitives, as is needed for front-end developers, requires automating the generation of that complex code, likely at runtime. Such endeavors have not been attempted at the research level, let alone would they be ready for implementation by skilled engineers.
This gap between current research solutions and practice means that much of our work cannot be applied. Some find it acceptable that nothing works in practice yet. Unfortunately, such a lax attitude leaves us with an all too comfortable hiding spot: why would my research have to work in the real world if others’ does not? As a direct consequence of this line of thought, we cannot meaningfully distinguish research that could eventually work from research that never will.
Until we have examined whether or not something is trivial, we should not make any implicit assumptions. Perhaps we should consider scoring manuscripts on the 80/20 Pareto scale, and ensure that we have enough of both sides at our conferences and in our journals. By also judging applicability, we abandon our filter bubbles and extend our action radius to urgent problems in the way of adoption – which will only grow our research community.
Practice what we preach
Not only do many of us lack Semantic Web experience as application developers; our even bigger gap is experience as users. Although a significant amount of our communication (not in the least toward funding bodies) consists of technological evangelism, we rarely succeed in leveraging our own technologies. If we keep on finding excuses for not using our own research outcomes, how can we convince others? The logicians among us will undoubtedly recognize the previous statement as a tu quoque fallacy: our reluctance to dogfood is factually independent of our technology’s claim to fame. Yet if all adoption were solely based on sound reasoning, our planet would look very different today. Credibility and fairness aside, we are not in the luxury position to tell others to “do as I say, not as I do.” The burden of proof is entirely upon ourselves, and the required evidence extends beyond the scientific.
In addition to being an instrument of persuasion, dogfooding addresses a more fundamental question: which parts of our technology are ready for prime time, and which parts are not? By becoming users of our own technologies, we will gain a better understanding of the elusive 20% that clearly, had it actually been so trivial, would already have been there. Never underestimate the power of frustration: feeling frustrated about unlocked potential is what prompted Tim Berners-Lee to invent the Web [3]. Only by managing almost his entire life with Linked Data is he able to keep a finger on the Semantic Web’s pulse, and his eyes on its Achilles’ heel.
If we similarly had a deeper understanding of real-world Linked Data flows and obstacles, would we not be in a better position to make a difference? We might want to address concrete problems happening today, in addition to targeting those that will hopefully arise – conditional on today’s problems ending up solved – after several more years.
In conclusion
After almost two decades, the Semantic Web should step out of its identity crisis into adolescence. In search of a target market for adoption, research in semantic technologies has ridden others’ waves perhaps a little too often. While those bring in useful lessons to be learned, we should not forget to learn our own in the one place where we can make a major difference: the Web. There, new technologies still emerge every day – just not ours. Investing in theoretically interesting problems without also delivering the necessary research to achieve practical implementations seems to have sidelined us.
A Semantic Web has data and semantics intertwined, yet distributing those semantics has proven hard. Can we focus on the practice and implications of sharing and preserving semantics? If not, we might leave the original vision to die in the hands of a more short-term and pragmatic agenda. No doubt, the need for full-scale data integration will eventually reappear, possibly reinventing the solutions and methods we are working on today. But that realization might take another decade.
The Web might not be our only target market, but it is the one that sets us apart. Yet it does not pop up in the average “threats to validity” section of articles – if there even is one. The Web sets its rules in a unique way, which requires overcoming specific hurdles to make things work. To really test the external validity of our work, we should immerse ourselves in the practical side of things and thus make the Web a better suited place for data consumption. Our experimental environment should not be the same as that of Big Data; we should thrive with a lot of small data sets instead of a few large ones, and in heterogeneity instead of homogeneity. We could differentiate ourselves as the main driver for the much needed re-decentralization of the Web, where, backed by privacy and data legislation, Web-scale federation is the next big thing. To this end, positioning semantic technologies as a complement to machine learning is a necessity. The future of intelligent applications on the Web will need both learning and reasoning.
In order to succeed, we will need to hold ourselves to a new, significantly higher standard. For too many years, we have expected engineers and software developers to take up the remaining 20%, as if they were the ones needing to catch up with us. Our fallacy has been our insistence that the remaining part of the road solely consisted of code to be written. We have been blind to the substantial research challenges we would surely face if we only took our experiments out of our safe environments into the open Web. It turns out that the engineers and developers have moved on and are creating their own solutions, bypassing many of the lessons we already learned, because we stubbornly refused to acknowledge the amount of research needed to turn our theories into practice. As we were not ready for the Web, more pragmatic people started taking over.
And if we are honest, can we blame them? Clearly, the world will not wait for us. Let us not wait for the world.
