Abstract
This article gives an overview of recent efforts focusing on integrating heterogeneous data using Knowledge Graphs. I introduce a pipeline consisting of five steps to integrate semi-structured or unstructured content. I discuss some of the key applications of this pipeline through three use-cases, and present the lessons learnt while designing and building data integration systems.
Introduction
Data abounds in large enterprises. Beyond structured data, which garnered a lot of attention from data specialists in the past, the last few decades saw the meteoric rise of semi-structured and unstructured data including JSON documents, email or social network messages, and media content. Most companies are struggling to create a coherent and integrated view over all those types of data.
Knowledge Graphs have become one of the key modalities to integrate disparate data in that context. They provide declarative and extensible mechanisms to relate arbitrary concepts through flexible graphs that can be leveraged by downstream processes such as entity search [10] or ontology-based access to distributed information [2].

Yet, integrating enterprise data into a given Knowledge Graph is a highly complex and time-consuming task. In this article, I briefly summarize the recent research efforts from my group in that regard. I introduce the pipeline we devised to integrate heterogeneous content, and then present three of its deployments as use-cases.
Fig. 1. The XI Pipeline goes through a series of five steps to integrate semi-structured or unstructured content leveraging a Knowledge Graph.
An overview of the pipeline we devised to integrate heterogeneous content leveraging a Knowledge Graph is given in Fig. 1. This pipeline focuses on semi-automatically integrating unstructured or semi-structured documents, as they are, from our perspective, the most challenging types of data to integrate, and as end-to-end techniques to integrate strictly structured data abound [9,13]. The Knowledge Graph underpinning the integration process should be given a priori, and can be built through crowdsourcing (see Section 3.1), by sampling from existing graphs (Section 3.2), or through a manual process (Section 3.3). The integration process starts with semi-structured or unstructured data given as input (left-hand side of Fig. 1) and goes through a series of steps, described below, to integrate the content by creating a set of new nodes and edges in the Knowledge Graph as output (right-hand side of Fig. 1).
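To make the overall flow concrete, the following sketch chains the five steps as interchangeable stages operating on a shared state. All function and field names are illustrative placeholders, not the actual APIs of our systems.

from typing import Callable

# Each stage enriches a shared state dict holding the input text, the
# mentions found so far, and the new edges to add to the Knowledge Graph.
Stage = Callable[[dict], dict]

def run_pipeline(document: str, stages: list[Stage]) -> dict:
    state = {"text": document, "mentions": [], "edges": []}
    for stage in stages:
        state = stage(state)
    return state

# The five steps, stubbed out here; each is a full component in practice.
def ner(state): return state                  # 1. named-entity recognition
def entity_linking(state): return state       # 2. entity linking
def type_ranking(state): return state         # 3. type ranking
def coreference(state): return state          # 4. co-reference resolution
def relation_extraction(state): return state  # 5. relation extraction

result = run_pipeline("Roger Federer was born in Basel.",
                      [ner, entity_linking, type_ranking,
                       coreference, relation_extraction])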
Named-Entity Recognition (NER)
The first step is to go through all labels and textual content in the input data and to identify all entity mentions (e.g., locations, objects, persons, or concepts) appearing in the text. Two main strategies can be applied here.
When the Knowledge Graph is complete and contains all entities of interest along with their labels, we proceed with Information Retrieval techniques to build inverted indices over the Knowledge Graph and identify all potential entities from the text by leveraging ad-hoc object retrieval techniques [21].
When the Knowledge Graph is incomplete and is missing a number of entities and labels of interest, things get more complex. The main problem we face in that case is to identify entities from text while not knowing anything about them, which is intrinsically very challenging. To solve this issue, we leverage NLP techniques (e.g., part-of-speech tags), third-party information such as large collections of N-grams, and Machine Learning to identify new entities and add them to the Knowledge Graph dynamically [11].
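As a toy illustration of the first strategy, the snippet below builds a naive label index over the Knowledge Graph and scans the text for known labels. Real deployments rely on inverted indices and ad-hoc object retrieval engines [21]; all identifiers here are hypothetical.

# Toy NER over a complete Knowledge Graph: map entity labels to entity
# IDs, then scan the text for known labels.
kg_labels = {
    "Roger Federer": "kg:RogerFederer",
    "Basel": "kg:Basel",
    "Switzerland": "kg:Switzerland",
}

def find_mentions(text: str, labels: dict[str, str]) -> list[tuple[str, str, int]]:
    """Return (surface form, entity id, offset) for every label found."""
    mentions = []
    for label, entity in labels.items():
        start = text.find(label)
        if start != -1:
            mentions.append((label, entity, start))
    return sorted(mentions, key=lambda m: m[2])  # order by text position

print(find_mentions("Roger Federer was born in Basel.", kg_labels))
# [('Roger Federer', 'kg:RogerFederer', 0), ('Basel', 'kg:Basel', 26)]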
Entity linking
The first step typically returns a set of textual mentions (i.e., strings appearing in the input data that potentially refer to entities). The second step consists in linking each of those mentions to the correct entity in the Knowledge Graph. This task is challenging, as a given mention is often ambiguous and can refer to several distinct entities.
Our solution to that problem departs from the state of the art in two important ways [3]: we use human intelligence at scale, by crowdsourcing the links that automated techniques cannot confidently establish, and we use probabilistic reasoning to combine the algorithmic results and the human input in a principled way.
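The probabilistic combination in [3] relies on a richer factor-graph model; the simplified sketch below only conveys the flavor, blending an automatic linker's confidence with independent yes/no crowd votes under an assumed uniform worker reliability.

# Sketch of combining a machine-generated link score with crowd votes for
# a candidate (mention, entity) link; a simplification of the full model.
def link_probability(machine_score: float, crowd_votes: list[bool],
                     worker_reliability: float = 0.8) -> float:
    """Posterior-style blend of a machine score and yes/no crowd votes."""
    p = machine_score  # prior from the automatic linker
    for vote in crowd_votes:
        likelihood = worker_reliability if vote else 1 - worker_reliability
        # Bayesian update: each vote is an independent noisy observation
        p = (p * likelihood) / (p * likelihood + (1 - p) * (1 - likelihood))
    return p

print(link_probability(0.6, [True, True, False]))  # ~0.86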
Type ranking
The next step we perform is pretty unique. We assume that each entity in the Knowledge Graph is associated with a series of types (e.g., a given entity can simultaneously be a Person, an Athlete, and a Tennis Player). Given a mention linked to an entity, we rank the types attached to that entity based on the context in which the mention appears, in order to retain only the most relevant types.
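A minimal sketch of the idea, assuming each candidate type comes with a bag of associated words: types are ranked by their overlap with the mention's context. Production systems rank over full type hierarchies using far richer signals; all data below is hypothetical.

# Toy type ranking: score each candidate type of a linked entity by the
# overlap between words associated with the type and the mention context.
def rank_types(candidate_types: dict[str, set[str]], context: set[str]) -> list[str]:
    """Order types by how many context words support them."""
    return sorted(candidate_types,
                  key=lambda t: len(candidate_types[t] & context),
                  reverse=True)

types = {
    "TennisPlayer":   {"tennis", "match", "tournament"},
    "Person":         {"born", "who"},
    "Philanthropist": {"charity", "foundation"},
}
context = {"won", "tennis", "tournament", "basel"}
print(rank_types(types, context))
# ['TennisPlayer', 'Person', 'Philanthropist']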
Co-Reference Resolution
Up to this point, we have created a series of high-quality links, along with relevant type information, to integrate mentions from the input data to entities in the Knowledge Graph. However, a number of further mentions available in the input data, such as noun phrases (e.g., “the Swiss champion” or “the former president”), cannot be resolved by our method. To tackle this issue, we introduce a co-reference resolution technique that leverages the entity links and ranked types produced in the previous steps to split or merge the clusters of mentions returned by classical co-reference tools, attaching such noun phrases to the correct entities whenever possible.
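The sketch below illustrates the core intuition with hypothetical data: a noun phrase is attached to a previously linked entity only when its inferred type is compatible with exactly one entity cluster.

# Sketch of type-aware co-reference: attach an unresolved noun phrase to a
# previously linked entity only when their types are compatible.
def attach_noun_phrase(phrase_type: str, clusters: dict[str, set[str]]) -> str | None:
    """Return the entity whose type set contains the phrase's type, if unique."""
    matches = [e for e, types in clusters.items() if phrase_type in types]
    return matches[0] if len(matches) == 1 else None  # ambiguous: leave unresolved

clusters = {
    "kg:RogerFederer": {"Person", "TennisPlayer", "SwissChampion"},
    "kg:Basel": {"City", "Place"},
}
print(attach_noun_phrase("SwissChampion", clusters))  # kg:RogerFederer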
Relation extraction
The final step is to extract relations connecting the entities identified in the previous steps (e.g., the fact that a given person won a given competition), and to materialize those relations as new edges in the Knowledge Graph.
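A toy pattern-based extractor in the same vein: a textual pattern connecting two linked mentions is mapped onto a (hypothetical) KG predicate and materialized as a new edge. Real extractors typically use learned models rather than hand-written patterns.

import re

# Illustrative mapping from textual patterns to (made-up) KG predicates.
PATTERNS = {
    "was born in": "kg:birthPlace",
    "won": "kg:wonEvent",
}

def extract_edges(text: str, subj: tuple[str, str], obj: tuple[str, str]) -> list[tuple]:
    """subj/obj are (surface form, entity id); emit (s, p, o) triples."""
    edges = []
    for pattern, predicate in PATTERNS.items():
        if re.search(f"{re.escape(subj[0])} {pattern} {re.escape(obj[0])}", text):
            edges.append((subj[1], predicate, obj[1]))
    return edges

print(extract_edges("Roger Federer was born in Basel.",
                    ("Roger Federer", "kg:RogerFederer"),
                    ("Basel", "kg:Basel")))
# [('kg:RogerFederer', 'kg:birthPlace', 'kg:Basel')]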
Use-cases
The outcome of the process described above is a set of nodes and links connecting mentions from the input data to entities and relations in the Knowledge Graph. As a result, the Knowledge Graph can then be used as a central gateway (i.e., as a mediation layer) to search, query, and analyze the integrated contents.
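For instance, once mentions are resolved, one can navigate from a document straight into the graph. The toy gateway below (with hypothetical identifiers) returns the KG facts about the entities mentioned in a given document.

# The KG as a gateway from documents to structured facts (toy data).
edges = [
    ("kg:RogerFederer", "kg:birthPlace", "kg:Basel"),
    ("kg:Basel", "kg:locatedIn", "kg:Switzerland"),
]
doc_links = {"doc42": ["kg:RogerFederer"]}  # mentions resolved in document 42

def facts_about(doc_id: str) -> list[tuple]:
    """All KG facts whose subject is mentioned in the given document."""
    entities = set(doc_links.get(doc_id, []))
    return [e for e in edges if e[0] in entities]

print(facts_about("doc42"))
# [('kg:RogerFederer', 'kg:birthPlace', 'kg:Basel')]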
We extended this generic approach to integrate various types of Big Data. We briefly present below three such deployments focusing on integrating different input data: (1) research articles, (2) social media content, and (3) cloud infrastructure data.
ScienceWise: Integrating research articles
As the production of research artifacts is booming, it is getting more and more difficult to track down all the papers related to a given scientific topic. The ScienceWise [1] platform (co-created with EPFL and Leiden University) was conceived in that context, in order to help physicists track down articles of interest from arXiv. The platform allows physicists to register their interests using a Knowledge Graph in which most entities relating to physics have been defined through crowdsourcing. As new articles are uploaded on arXiv, they are automatically integrated into the Knowledge Graph using a pipeline similar to (although simpler than) the one described in Section 2. As a result, physicists are automatically notified whenever a new paper relating to one of their interests gets uploaded.
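The notification logic essentially boils down to intersecting the entities linked in a new article with the entities each physicist has registered; a minimal sketch with hypothetical identifiers:

# Toy interest matching: notify users whose registered KG entities appear
# among the entities linked in a newly integrated article.
interests = {
    "alice": {"kg:Supersymmetry", "kg:DarkMatter"},
    "bob": {"kg:QuantumGravity"},
}

def notify(article_entities: set[str]) -> list[str]:
    """Return the users interested in at least one entity of the article."""
    return [user for user, topics in interests.items()
            if topics & article_entities]

print(notify({"kg:DarkMatter", "kg:GalaxyRotation"}))  # ['alice']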
ArmaTweet: Integrating social media contents
The second system we built tackles social media content. Specifically, we looked into how Knowledge Graphs can help integrate series of tweets (i.e., microposts) that are difficult to handle otherwise given their short and noisy nature. The resulting system, ArmaTweet [20] (a collaboration between ArmaSuisse, the University of Oxford, and my group), takes as input a stream of tweets, extracts structured representations from the tweets using a pipeline similar to the one presented above, and integrates them into a Knowledge Graph built by borrowing content from both DBpedia and WordNet. ArmaTweet makes it possible to pose complex queries (such as “find all politicians dying in Switzerland” or “find all militia terror acts”) against a set of tweets, which could not be handled otherwise using classical Information Retrieval or Knowledge Reasoning methods.
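A query such as “find all politicians dying in Switzerland” can then be expressed declaratively over the Knowledge Graph. The sketch below uses rdflib with a made-up namespace and made-up facts to show the shape of such a query; ArmaTweet's actual schema and vocabulary differ.

from rdflib import Graph, Namespace

# Made-up namespace and facts standing in for the DBpedia/WordNet-based graph.
KG = Namespace("http://example.org/kg/")
g = Graph()
g.add((KG.tweet42, KG.describesDeathOf, KG.somePolitician))
g.add((KG.somePolitician, KG.occupation, KG.Politician))
g.add((KG.tweet42, KG.location, KG.Switzerland))

# "Find all politicians dying in Switzerland", expressed as SPARQL.
q = """
PREFIX kg: <http://example.org/kg/>
SELECT ?person WHERE {
    ?tweet kg:describesDeathOf ?person ;
           kg:location kg:Switzerland .
    ?person kg:occupation kg:Politician .
}
"""
for row in g.query(q):
    print(row.person)  # http://example.org/kg/somePolitician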
Guider: Integrating cloud infrastructure data
Another integration project we worked on (together with Microsoft CISL) is Guider [7]: a system to automatically integrate cloud infrastructure data into a Knowledge Graph. The input data in this case is a very large set of heterogeneous data describing the cloud infrastructure, which Guider automatically maps onto a Knowledge Graph in order to provide an integrated view over the infrastructure and its components.
Conclusions & lessons learnt
Drawing from our own experience, Knowledge Graphs proved to be powerful and flexible abstractions to integrate heterogeneous pieces of content. Yet, the integration process required to correctly map the input data onto a Knowledge Graph is taxing, as automated techniques cannot fully grasp the semantics of arbitrary input data (yet). While working on the various efforts described above, we learnt a few lessons that we hope will be valuable for future research.
First, human attention (in the form of crowdsourcing or manual inspection of the input and/or output data) is still key to provide high-quality results. While automated techniques have improved, they are still far from providing ideal results. Along similar lines, one cannot expect perfect results from human experts either, given the inevitable subjectivity or ambiguity of some of the tasks in a large-scale integration project.
Second, entity types represent very useful constructs in integration efforts. We are not talking about coarse-grained types (e.g., Person or Organization) here, but rather about fine-grained types (e.g., left-handed tennis players), which carry far more discriminative signal for tasks such as entity linking, co-reference resolution, and relation extraction.
Third, the quality of the integration process is always constrained by the quality of the Knowledge Graph used as a mediation layer. Large Knowledge Graphs are typically full of errors and inconsistencies [17], which have to be fixed prior to the integration process in order to maximize the quality of the results. Missing data in the Knowledge Graph is yet another issue, which jeopardizes the entire integration process, as working with incomplete data is inherently very challenging.
Finally, designing a generic platform capable of integrating different data for different applications proved to be impractical. Even if, as described above, many ideas and processes can be recycled from one project to the next, real data is always intricate and specific, making it essential to specialize the approach for the use-case at hand. Providing a library of composable software artifacts, each responsible for a certain integration subprocess and each focusing on a certain data modality, might be an interesting avenue for future work in that context.
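Such a library could expose each subprocess behind a common contract, letting engineers compose use-case-specific pipelines from modality-specific parts. A minimal sketch of what that contract might look like (the names are ours, not an existing library):

from typing import Protocol

class IntegrationStep(Protocol):
    """Common contract for a single integration subprocess."""
    def __call__(self, state: dict) -> dict: ...

def compose(*steps: IntegrationStep) -> IntegrationStep:
    """Chain modality-specific steps into a use-case-specific pipeline."""
    def pipeline(state: dict) -> dict:
        for step in steps:
            state = step(state)
        return state
    return pipeline

# e.g., a tweet pipeline and an article pipeline could share the same
# linking step while swapping in modality-specific NER components.
tweet_pipeline = compose(lambda s: s, lambda s: s)  # placeholder steps
print(tweet_pipeline({"text": "..."}))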
