Abstract
Extracting information from Web pages for populating large, cross-domain knowledge bases requires methods which are suitable across domains, do not require manual effort to adapt to new domains, are able to deal with noise, and integrate information extracted from different Web pages. Recent approaches have used existing knowledge bases to learn to extract information with promising results, one of those approaches being distant supervision. Distant supervision is an unsupervised method which uses background information from the Linking Open Data cloud to automatically label sentences with relations to create training data for relation classifiers. In this paper we propose the use of distant supervision for relation extraction from the Web. Although the method is promising, existing approaches are still not suitable for Web extraction as they suffer from three main issues: data sparsity, noise and lexical ambiguity. Our approach reduces the impact of data sparsity by making entity recognition tools more robust across domains and extracting relations across sentence boundaries using unsupervised co-reference resolution methods. We reduce the noise caused by lexical ambiguity by employing statistical methods to strategically select training data. To combine information extracted from multiple sources for populating knowledge bases we present and evaluate several information integration strategies and show that those benefit immensely from additional relation mentions extracted using co-reference resolution, increasing precision by 8%. We further show that strategically selecting training data can increase precision by a further 3%.
Introduction
In the past years, several cross-domain knowledge bases such as Freebase [7], DBpedia and Wikidata [37] have been constructed by Web companies and research communities for purposes such as search and question answering. Even the largest knowledge bases are far from complete, since new knowledge is emerging rapidly. Most of the missing knowledge is available on Web pages in the form of free text. To access that knowledge, information extraction (IE) and information integration methods are necessary. In this paper, we focus on the task of relation extraction (RE), that is, extracting individual mentions of relations from text. We also present how those individual mentions can be integrated and how redundancy of information across Web documents can be exploited to extract facts for knowledge base population. One important aspect of every relation extraction approach is how to annotate training and test data for learning classifiers. In the past, four groups of approaches have been proposed (see also Section 2).
Supervised approaches use manually labelled training and test data. Those approaches are often specific for, or biased towards a certain domain or type of text. This is because IE approaches tend to have a higher performance if training and test data is restricted to the same narrow domain. In addition, developing supervised approaches for different domains requires even more manual effort.
Unsupervised approaches do not need any annotated data for training and instead extract words between entity mentions, then cluster similar word sequences and generalise them to relations. Although unsupervised approaches can process very large amounts of data, the resulting relations are hard to map to ontologies. In addition, it has been documented that these approaches often produce uninformative as well as incoherent extractions [13].
Semi-supervised methods only require a small number of seed instances. The hand-crafted seeds are used to extract patterns from a large corpus, which are then used to extract more instances and those again to extract new patterns in an iterative way. The selection of initial seeds is very challenging – if they do not accurately reflect the knowledge contained in the corpus, the quality of extractions might be low. In addition, since many iterations are needed, these methods are prone to semantic drift, i.e. an unwanted shift of meaning. This means these methods require a certain amount of human effort – to create seeds initially and also to help keep systems “on track” to prevent them from semantic drift.
A fourth group of approaches are distant supervision or self-supervised learning approaches [30]. The idea is to exploit large knowledge bases (such as Freebase [7]) to automatically label entities in text and use the annotated text to extract features and train a classifier. Unlike supervised systems, these approaches do not require manual effort to label data and can be applied to large corpora. Since they extract relations which are defined by vocabularies, these approaches are less likely to produce uninformative or incoherent relations.
Although promising, distant supervision approaches have several limitations with respect to Web IE that require further research. This work improves on existing distant supervision approaches by addressing four challenges, illustrated with the following example: “Let It Be is the twelfth and final album by The Beatles which contains their hit single ‘Let it Be’. They broke up in 1974.”
The contributions of this paper to research on distant supervision for Web information extraction are: (1) recognising named entities across domains on heterogeneous Web pages by using Web-based heuristics; (2) reporting results for extracting relations across sentence boundaries by relaxing the distant supervision assumption and using heuristic co-reference resolution methods; (3) proposing statistical measures for increasing the precision of distantly supervised systems by filtering ambiguous training data; (4) documenting an entity-centric approach for Web relation extraction using distant supervision; and (5) evaluating distant supervision as a knowledge base population approach and evaluating the impact of our different methods on information integration.
Related work
In the recent past, there have been several different approaches for IE from text for populating knowledge bases which try to minimise manual effort.
Semi-supervised bootstrapping approaches such as KnowItAll [12], NELL [9], PROSPERA [23] and BOA [17] start with a set of seed natural language patterns, then employ an iterative approach to both extract information for those patterns and learn new patterns. For KnowItAll, NELL and PROSPERA, the patterns and underlying schema are created manually, whereas they are created automatically for BOA by using knowledge contained in DBpedia.
Ontology-based question answering systems often use patterns learned by semi-supervised information extraction approaches as part of their approach. Unger et al. [35], for instance, use patterns produced by BOA.
Open information extraction (Open IE) approaches such as TextRunner [43], Kylin [39], StatSnowball [44], Reverb [13], WOE [40], OLLIE [20] and ClausIE [11] are unsupervised approaches, which discover relation-independent extraction patterns from text. Although they can process very large amounts of data, the resulting relations are hard to map to desired ontologies or user needs, and can often produce uninformative or incoherent extractions, as mentioned in Section 1.
Bootstrapping and Open IE approaches differ from our approach in that they learn extraction rules or patterns, not weights for features for a machine learning model. The difference between the two is that statistical approaches take more factors into account to make ‘soft’ judgements, whereas rule- and pattern-based approaches merge observed contexts to patterns, then only keep the most prominent patterns and make hard judgements based on those. Because information is lost in the pattern merging and selection process, statistical methods are generally more robust to unseen information, i.e. if the training and test data are drawn from different domains, or if unseen words or sentence constructions occur. We opt for a statistical approach, since we aim at extracting information from heterogeneous Web pages.
Automatic ontology learning and population approaches such as FRED [25,26] and LODifier [5] extract an ontology schema from text, map it to existing schemas and extract information for that schema. Unlike bootstrapping approaches, they do not employ an iterative approach. However, they rely on several existing natural language processing tools trained on newswire and are thus not robust enough for Web IE.
Finally, distantly supervised or self-supervised approaches aim at exploiting background knowledge for RE, most of them for extracting relations from Wikipedia. Mintz et al. [22] aim at extracting relations between entities in Wikipedia for the most frequent relations in Freebase. They report precision of about 0.68 for their highest ranked 10% of results, depending on which features they used. In contrast to our approach, Mintz et al. do not experiment with changing the distant supervision assumption or removing ambiguous training data; they also do not use fine-grained relations, and their approach is not class-based. The approach of Nguyen et al. [24] is very similar to that of Mintz et al., except that they use a different knowledge base, YAGO [32]. They use a Wikipedia-based named entity recogniser and classifier (NERC), which, like the Stanford NERC, classifies entities into persons, locations and organisations. They report a precision of 0.914 for their whole test set; however, those results might be skewed by the fact that YAGO is a knowledge base derived from Wikipedia. In addition to Wikipedia, distant supervision has also been used to extract relations from newswire [27,28], to extract relations for the biomedical domain [10,29] and the architecture domain [36]. Bunescu and Mooney [8] document a minimal supervision approach for extracting relations from Web pages, but only apply it to the two relations company-bought-company and person-bornIn-place. Distant supervision has also been used as a pre-processing step for learning patterns for bootstrapping and Open IE approaches, e.g. Kylin, WOE and BOA annotate text with DBpedia relations to learn patterns.
A few strategies for seed selection for distant supervision have already been investigated: at-least-one models [18,21,27,33,42], hierarchical topic models [1,31], pattern correlations [34], and an information retrieval approach [41]. At-least-one models are based on the idea that “if two entities participate in a relation, at least one sentence that mentions these two entities might express that relation” [27]. While positive results have been reported for those models, Riedel et al. [27] argue that it is challenging to train those models because they are quite complex. Hierarchical topic models [1,31] assume that the context of a relation is either specific for the pair of entities, the relation, or neither. Min et al. [21] further propose a 4-layer hierarchical model to only learn from positive examples to address the problem of incomplete negative training data. Pattern correlations [34] are also based on the idea of examining the context of pairs of entities, but instead of using a topic model as a pre-processing step for learning extraction patterns, they first learn patterns and then use a probabilistic graphical model to group extraction patterns. Xu et al. [41] propose a two-step model based on the idea of pseudo-relevance feedback which first ranks extractions, then only uses the highest ranked ones to re-train their model.
Our research is based on a different assumption: instead of trying to address the problem of noisy training data by using more complicated multi-stage machine learning models, we want to examine how background data can be even further exploited by testing if simple statistical methods based on data already present in the knowledge base can help to filter unreliable training data. Preliminary results for this have already been reported in Augenstein et al. [3,4]. The benefit of this approach compared with other approaches is that it does not result in an increase of run-time during testing and is thus more suited towards Web-scale extraction than approaches which aim at resolving ambiguity during both training and testing. To the best of our knowledge, our approach is the first distant supervision approach to address the issue of adapting distant supervision to relation extraction from heterogeneous Web pages and to address the issue of data sparsity by relaxing the distant supervision assumption.
Distantly supervised relation extraction
Distantly supervised relation extraction is defined as automatically labelling a corpus with properties, P and resources, R, where resources stand for entities from a knowledge base,
In general relations are of the form
In the remainder of this paper, several adjustments to this approach are presented; method names are indicated in bold font.
Seed selection
Before using the automatically labelled corpus to train a classifier, we detect and discard examples containing highly ambiguous lexicalisations. We measure the degree to which a lexicalisation
Ambiguity within an entity
Our first approach is to discard lexicalisations of objects if they are ambiguous for the subject entity, i.e. if a subject is related to two different objects which have the same lexicalisation, and express two different relations. To illustrate this, let us consider the problem outlined in the introduction again: Let It Be can be both an album and a track of the subject entity The Beatles, therefore we would like to discard Let It Be as a seed for the class Musical Artist.
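The within-entity check described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the function name `ambiguous_lexicalisations`, the triple format and the object identifiers are hypothetical, and lexicalisations are assumed to be compared by exact string match.

```python
from collections import defaultdict

def ambiguous_lexicalisations(triples, lexicalisations):
    """Return, per subject, the object lexicalisations to discard.

    triples: iterable of (subject, property, object_id) (assumed format)
    lexicalisations: dict mapping object_id -> set of surface strings
    """
    # subject -> lexicalisation -> (properties seen, objects seen)
    seen = defaultdict(lambda: defaultdict(lambda: (set(), set())))
    for subject, prop, obj in triples:
        for lex in lexicalisations.get(obj, ()):
            props, objs = seen[subject][lex]
            props.add(prop)
            objs.add(obj)

    discard = defaultdict(set)
    for subject, by_lex in seen.items():
        for lex, (props, objs) in by_lex.items():
            # Ambiguous: same string, different objects, different relations.
            if len(props) > 1 and len(objs) > 1:
                discard[subject].add(lex)
    return dict(discard)

# Hypothetical identifiers, for illustration only.
triples = [
    ("The Beatles", "MusicalArtist:album", "obj_album_let_it_be"),
    ("The Beatles", "MusicalArtist:track", "obj_track_let_it_be"),
    ("The Beatles", "MusicalArtist:album", "obj_album_abbey_road"),
]
lexicalisations = {
    "obj_album_let_it_be": {"Let It Be"},
    "obj_track_let_it_be": {"Let It Be"},
    "obj_album_abbey_road": {"Abbey Road"},
}
print(ambiguous_lexicalisations(triples, lexicalisations))
```

In this toy input, “Let It Be” is flagged for The Beatles because it names both an album and a track, whereas “Abbey Road” is kept.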
Ambiguity across classes
In addition to being ambiguous for a subject of a specific class, lexicalisations of objects can be ambiguous across classes. Our assumption is that the more senses an object lexicalisation has, the more likely it is that an occurrence of that object is confused with an object lexicalisation of a different property of any class. Examples of this are common names of book authors or common genres, as in the sentence “Jack mentioned that he read On the Road”, in which Jack is falsely recognised as the author Jack Kerouac.
We view the number of senses of each lexicalisation of an object per relation as a frequency distribution. We then compute min, max, median (
Relaxed setting
In addition to increasing the precision of distantly supervised systems by filtering seed data, we also experiment with increasing recall by changing the method for creating test data. Instead of testing, for every sentence, if the sentence contains a lexicalisation of the subject and one additional entity, we relax the former restriction. We make the assumption that the subject of the sentence is mostly consistent within one paragraph as the use of paragraphs usually implies a unit of meaning, i.e. that sentences in one paragraph often have the same subject. In practice this means that we first train classifiers using the original assumption and then, for testing, instead of only extracting information from sentences which contain a lexicalisation of the subject, we also extract information from sentences which are in the same paragraph as a sentence which contains a lexicalisation of the subject.
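The difference between the normal and the relaxed sentence selection can be sketched as follows. This is an illustrative sketch under simplifying assumptions: the function `select_sentences` is hypothetical, paragraphs are assumed to be pre-split into sentences, and a subject mention is detected by plain substring match.

```python
def select_sentences(paragraphs, subject_lexicalisations, relaxed=False):
    """Select candidate sentences for extraction.

    paragraphs: list of paragraphs, each a list of sentence strings
    relaxed: if True, keep every sentence of a paragraph in which at least
             one sentence mentions the subject
    """
    def mentions_subject(sentence):
        # Assumption: simple substring match on the subject lexicalisations.
        return any(lex in sentence for lex in subject_lexicalisations)

    selected = []
    for sentences in paragraphs:
        hits = [mentions_subject(s) for s in sentences]
        if relaxed and any(hits):
            selected.extend(sentences)
        else:
            selected.extend(s for s, hit in zip(sentences, hits) if hit)
    return selected

paragraphs = [
    ["Let It Be is the twelfth and final album by The Beatles.",
     "They broke up in 1974."],
    ["Jack mentioned that he read On the Road."],
]
print(select_sentences(paragraphs, ["The Beatles"]))
print(select_sentences(paragraphs, ["The Beatles"], relaxed=True))
```

With the normal setting only the first sentence is selected; with the relaxed setting the second sentence of the same paragraph is selected as well, and its subject then still has to be resolved.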
Our new relaxed distant supervision assumption is then:
This means, however, that we have to resolve the subject in a different way, e.g. by performing co-reference resolution and searching for a pronoun which is coreferent with the subject mention in a different sentence. We test four different methods for our relaxed setting, one of which does not attempt to resolve the subject of sentences, one based on an existing co-reference resolution tool, and two based on gazetteers of Web co-occurrence counts for number and gender of noun phrases.
The first step in co-reference resolution is usually to group all mentions in a text, i.e. all noun phrases and pronouns, by gender and number. If two mentions disagree in number or gender, they cannot be coreferent. As an example, should we find “The Beatles” and “he” in a sentence, then “The Beatles” and “he” could not be coreferent, because “The Beatles” is a plural neutral noun phrase, whereas “he” is a singular male pronoun. Since we do not have any a-priori information on what number and gender the subject entity is, we instead make those judgements based on the number and gender of the class of the subject, e.g. The Beatles is a Musical Artist, which can be a band (plural) or a female singer or a male singer. Bergsma and Lin have collected such a resource automatically, which also includes statistics to assess how likely it is for a noun phrase to be a certain number or gender. In particular, they collected co-occurrence counts of different noun phrases with male, female, neutral and plural pronouns using Web search.
Our heuristic co-reference approach consists of three steps. First, we collect noun phrases which express general concepts related to the subject entity, which we refer to as the synonym gazetteer. We start with the lexicalisation of the class of the entity (e.g. “Book”), then retrieve synonyms, hypernyms and hyponyms using Wikipedia redirection pages and WordNet [14]. Second, we determine the gender of each class by looking up co-occurrence counts for each general concept in the noun phrase, gender and number gazetteer. We aggregate the co-occurrence counts for each class and gender or number (i.e. male, female, neutral, plural). If the aggregated count for a number or gender is at least 10% of the total count for all genders and numbers, we consider that gender or number to agree with the class. For each class, we then create a pronoun gazetteer containing all male, female, neutral or plural personal pronouns including possessives, e.g. for “Book”, that gazetteer would contain “it, its, itself”. Lastly, we use those gazetteers to resolve co-reference. For every sentence in a paragraph that contains at least one sentence with the subject entity, if any of the following sentences contain a pronoun or noun phrase that is part of the synonym or pronoun gazetteer for that class and it appears in the sentence before the object lexicalisation, we consider that noun phrase or pronoun coreferent with the subject. The reason to only consider noun phrases or pronouns to be coreferent with the subject entity if they appear before the object entity is to improve precision, since anaphora (expressions referring back to the subject) are far more common than cataphora (expressions referring to the subject appearing later in the sentence).
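The second and third steps can be sketched as follows. This is a minimal illustration rather than the system described here: the co-occurrence counts are invented for the example, the pronoun lists other than “it, its, itself” are our own assumption, and the helper names (`agreeing_categories`, `pronoun_gazetteer`, `coreferent_mention`) are hypothetical. Only single-token gazetteer entries are matched.

```python
PRONOUNS = {
    "male": ["he", "him", "his", "himself"],
    "female": ["she", "her", "hers", "herself"],
    "neutral": ["it", "its", "itself"],
    "plural": ["they", "them", "their", "theirs", "themselves"],
}

def agreeing_categories(synonyms, counts, threshold=0.10):
    """Genders/numbers whose aggregated share of counts reaches the threshold."""
    totals = {cat: 0 for cat in PRONOUNS}
    for phrase in synonyms:
        for cat, n in counts.get(phrase, {}).items():
            totals[cat] += n
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [cat for cat in PRONOUNS if totals[cat] / grand_total >= threshold]

def pronoun_gazetteer(synonyms, counts, threshold=0.10):
    """Pronouns of every gender/number that agrees with the class."""
    cats = agreeing_categories(synonyms, counts, threshold)
    return [p for cat in cats for p in PRONOUNS[cat]]

def coreferent_mention(sentence_tokens, object_index, gazetteer):
    """Index of the first gazetteer entry before the object, or None."""
    for i, tok in enumerate(sentence_tokens[:object_index]):
        if tok.lower() in gazetteer:
            return i
    return None

# Invented co-occurrence counts, for illustration only.
counts = {
    "book": {"male": 30, "female": 20, "neutral": 900, "plural": 50},
    "novel": {"male": 10, "female": 10, "neutral": 450, "plural": 30},
}
synonyms = ["book", "novel"]
gazetteer = set(pronoun_gazetteer(synonyms, counts)) | set(synonyms)
tokens = ["It", "was", "written", "by", "Jack", "Kerouac"]
print(pronoun_gazetteer(synonyms, counts))
print(coreferent_mention(tokens, 4, gazetteer))
```

With these counts only the neutral category reaches the 10% threshold, so the pronoun gazetteer for “Book” contains “it, its, itself”, and the sentence-initial “It” is taken as coreferent with the subject because it precedes the object.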
We test two different methods.
Information integration
After features are extracted and a classifier is trained and used to predict relation mentions, those predictions can be used for the purpose of knowledge base population by aggregating relation mentions to relations. Since the same relations might be found in different documents, but some contexts might be inconclusive or ambiguous, it is useful to integrate information taken from multiple predictions to increase the chances of predicting the correct relation. We test several different methods to achieve this.
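To make the idea of aggregating relation mentions into relations concrete, the following sketch uses a plain majority vote per subject-object pair. The voting scheme and the function name `integrate_by_vote` are assumptions made purely for illustration; they are not necessarily among the integration strategies evaluated in this paper.

```python
from collections import Counter, defaultdict

def integrate_by_vote(mention_predictions):
    """Aggregate mention-level predictions into relation-level predictions.

    mention_predictions: iterable of (subject, object, label), where label is
    a property name or "NONE". Hypothetical scheme: the most frequent
    non-NONE label per (subject, object) pair wins.
    """
    votes = defaultdict(Counter)
    for subject, obj, label in mention_predictions:
        if label != "NONE":
            votes[(subject, obj)][label] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in votes.items()}

predictions = [
    ("The Beatles", "Let It Be", "MusicalArtist:album"),
    ("The Beatles", "Let It Be", "MusicalArtist:album"),
    ("The Beatles", "Let It Be", "MusicalArtist:track"),
    ("The Beatles", "Liverpool", "NONE"),
]
print(integrate_by_vote(predictions))
```

Pairs for which every mention is classified as NONE produce no relation, so redundancy across documents only helps when at least one context is conclusive.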
There is only one instance of three mutual labels for our evaluation set, namely River:origin, River:countries and River:contained by.
Freebase classes and properties used
Corpus
To create a corpus for Web relation extraction using background knowledge from Linked Data, seven Freebase classes and their five to seven most prominent properties are selected, as shown in Table 1. The selected classes are subclasses of either “Person” (Musical Artist, Politician), “Location” (River), “Organisation” (Business (Operation), Education(al Institution)) or “Mixed” (Film, Book). To avoid noisy training data, we only use entities which have values for all of those properties and retrieve them using the Freebase API. This resulted in 1800 to 2200 entities per class. For each entity, at most 10 Web pages were retrieved via the Google Search API using the search pattern ‘“subject_entity” class_name relation_name’, e.g. ‘“The Beatles” Musical Artist Origin’. By adding the class name, we expect the retrieved Web pages to be more relevant to our extraction task. Although subject entities can have multiple lexicalisations, Freebase distinguishes between the most prominent lexicalisation (the entity name) and other lexicalisations (entity aliases). We use the entity name for all of the search patterns. In total, the corpus consists of around one million pages drawn from 76,000 different websites. An overview of the distribution of websites per class is given in Table 2.
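The search pattern described above can be assembled as in the following sketch. The helper names `build_query` and `queries_for_entity` are hypothetical, and the sketch only builds the query strings; it does not call any search API.

```python
def build_query(entity_name, class_name, relation_name):
    """Quote the entity name and append class and relation names."""
    return f'"{entity_name}" {class_name} {relation_name}'

def queries_for_entity(entity_name, class_name, relation_names):
    """One query per relation of the class, all using the entity name."""
    return [build_query(entity_name, class_name, r) for r in relation_names]

print(build_query("The Beatles", "Musical Artist", "Origin"))
```

Quoting only the entity name keeps it as an exact phrase, while the class and relation names act as loose context terms.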
Distribution of websites per class in the Web corpus sorted by frequency
Text content is extracted from HTML pages using the Jsoup API,2
Some of the relations we want to extract values for cannot be categorised according to the 7 classes detected by the Stanford NERC and are therefore not recognised. Examples are MusicalArtist:album, MusicalArtist:track and MusicalArtist:genre. Therefore, as well as recognising named entities with Stanford NERC as relation candidates, we also implement our own NER, which only recognises entity boundaries, but does not classify them.
To detect entity boundaries, we recognise sequences of nouns and sequences of capitalised words and apply both greedy and non-greedy matching. The reason for doing greedy as well as non-greedy matching is that the lexicalisation of an object does not always span a whole noun phrase, e.g. while ‘science fiction’ is a lexicalisation of an object of Book:genre, ‘science fiction book’ is not. However, for MusicalArtist:genre, ‘pop music’ would be a valid lexicalisation of an object. For greedy matching, we consider whole noun phrases and sequences of capitalised words. For non-greedy matching, we consider all subsequences starting with the first word of those phrases as well as single tokens, i.e. for ‘science fiction book’, we would consider ‘science fiction book’, ‘science fiction’, ‘science’, ‘fiction’ and ‘book’ as candidates. We also recognise short sequences of words in quotes. This is because lexicalisations of objects of MusicalArtist:track and MusicalArtist:album often appear in quotes, but are not necessarily noun phrases.
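Candidate generation for a single recognised phrase can be sketched as follows. The function name `candidates` is hypothetical, and the phrase is assumed to be already tokenised by an upstream noun phrase or capitalisation detector.

```python
def candidates(phrase_tokens):
    """Greedy and non-greedy candidates for one recognised phrase.

    The first candidate is the greedy match (the whole phrase), followed by
    the shorter prefixes and then any single tokens not yet included.
    """
    n = len(phrase_tokens)
    result = [" ".join(phrase_tokens[:k]) for k in range(n, 0, -1)]
    for tok in phrase_tokens:
        if tok not in result:
            result.append(tok)
    return result

print(candidates(["science", "fiction", "book"]))
# ['science fiction book', 'science fiction', 'science', 'fiction', 'book']
```

Generating all of these candidates raises recall at the cost of more negative examples, since most sub-phrases will not match any object lexicalisation.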
Annotating sentences
The next step is to identify which sentences express relations. We only use sentences from Web pages which were retrieved using a query which contains the subject of the relation. To annotate sentences, we retrieve all lexicalisations
Seed selection
After training data is retrieved by automatically annotating sentences, we select seeds from it, or rather discard some of the training data, according to the different methods outlined in Section 3.1. Our
Features
Given a relation candidate as described in Section 4.3, our system then extracts the following lexical features and named entity features, some of them also used by Mintz et al. [22]. Features marked with (*) are only used in the normal setting, but not for the
The object occurrence
The bag of words of the occurrence
The number of words of the occurrence
The named entity class of the occurrence assigned by the 7-class Stanford NERC
A flag indicating if the object or the subject entity came first in the sentence (*)
The sequence of POS tags of the words between the subject and the occurrence (*)
The bag of words between the subject and the occurrence (*)
The pattern of words between the subject entity and the occurrence (all words except for nouns, verbs, adjectives and adverbs are replaced with their POS tag, nouns are replaced with their named entity class if a named entity class is available) (*)
Any nouns, verbs, adjectives, adverbs or named entities in a 3-word window to the left of the occurrence
Any nouns, verbs, adjectives, adverbs or named entities in a 3-word window to the right of the occurrence
In comparison with Mintz et al. [22], we use a richer feature set, specifically more bag-of-words features, patterns, a numerical feature and a different, more fine-grained named entity classifier.
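The features listed above can be sketched for a single relation candidate as follows. This is an illustrative sketch only: the `Token` type, the feature names, the use of Penn Treebank tag prefixes to identify nouns, verbs, adjectives and adverbs, and the hand-assigned tags in the example are all assumptions rather than details taken from the system.

```python
from typing import NamedTuple

class Token(NamedTuple):
    word: str
    pos: str  # POS tag (Penn Treebank tag set assumed)
    ne: str   # named entity class, or "O" if none

CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")  # nouns, verbs, adjectives, adverbs

def is_content(tok):
    return tok.pos.startswith(CONTENT_PREFIXES)

def pattern_item(tok):
    # Nouns become their NE class if available; other content words are kept;
    # everything else is replaced by its POS tag.
    if tok.pos.startswith("NN") and tok.ne != "O":
        return tok.ne
    return tok.word.lower() if is_content(tok) else tok.pos

def window_words(tokens):
    return [t.word.lower() for t in tokens if is_content(t) or t.ne != "O"]

def extract_features(tokens, subj, occ):
    """subj and occ are (start, end) token spans, end exclusive."""
    occ_toks = tokens[occ[0]:occ[1]]
    subject_first = subj[0] < occ[0]
    between = tokens[subj[1]:occ[0]] if subject_first else tokens[occ[1]:subj[0]]
    return {
        "occurrence": " ".join(t.word for t in occ_toks),
        "occ_bow": sorted({t.word.lower() for t in occ_toks}),
        "occ_len": len(occ_toks),
        "occ_ne": occ_toks[0].ne,  # assumption: class of the first token
        "subject_first": subject_first,
        "pos_between": [t.pos for t in between],
        "bow_between": sorted({t.word.lower() for t in between}),
        "pattern_between": [pattern_item(t) for t in between],
        "left_window": window_words(tokens[max(0, occ[0] - 3):occ[0]]),
        "right_window": window_words(tokens[occ[1]:occ[1] + 3]),
    }

# Hand-tagged example sentence; the tags are assigned manually for illustration.
tagged = [("The", "DT"), ("Hunt", "NNP"), ("for", "IN"), ("Red", "NNP"),
          ("October", "NNP"), ("remains", "VBZ"), ("a", "DT"),
          ("masterpiece", "NN"), ("of", "IN"), ("military", "JJ"),
          ("fiction", "NN"), (".", ".")]
tokens = [Token(w, p, "O") for w, p in tagged]
features = extract_features(tokens, subj=(0, 5), occ=(9, 11))
print(features["pattern_between"])
```

The four features that depend on the subject position correspond to those marked with (*) in the list, so they would be dropped whenever the subject does not occur in the same sentence.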
We experiment both with predicting properties for relations, as in Mintz et al. [22], and with predicting properties for relation mentions. Predicting relations means that feature vectors are aggregated for relation tuples, i.e. for tuples with the same subject and object, for training a classifier. In contrast, predicting relation mentions means that feature vectors are not aggregated for relation tuples. While predicting relations is sufficient if the goal is only to retrieve a list of values for a certain property, and not to annotate text with relations, combining feature vectors for distant supervision approaches can introduce additional noise for ambiguous subject and object occurrences.
Models
Our models differ with respect to how sentences are annotated for training, how positive training data is selected, how negative training data is selected, which features are used, how sentences are selected for testing and how information is integrated.
Predicting relations
In order to be able to compare our results, we choose the same classifier as in Mintz et al. [22], a multi-class logistic regression classifier. We train one classifier per class and model. The models are used to classify each relation mention candidate into one of the relations of the class or NONE (no relation). Relation mention predictions are then aggregated to predict relations using the different information integration methods described in Section 3.3.
Evaluation
The goal of our evaluation is to measure how the different distant supervision models described in Section 4.7 perform for the task of knowledge base population, i.e. to measure how accurate the information extraction methods are at replicating the test part of the knowledge base. To this end we carry out a hold-out evaluation, for which 50% of the knowledge base is used for training and 50% for testing. We annotate the whole corpus with relations already present in Freebase, as described in Section 4 and use 50% of it for training and 50% for testing.
The following metrics are computed: precision, recall and a top line. Precision is defined as the number of correctly labelled relations divided by the number of correctly labelled plus the number of incorrectly labelled relations. Recall is defined as the number of correctly labelled relations divided by the number of all relation tuples in the knowledge base. The number of all relation tuples includes all different lexicalisations of objects contained in the knowledge base. To achieve a perfect recall of 1, all relation tuples in the knowledge base have to be identified as relation candidates in the corpus first. However, not all relation tuples also have a textual representation in the corpus. To provide insight into how many of them do, we compute a top line for recall. The top line would usually be computed by dividing the number of all relation tuples appearing in the corpus by the number of relation tuples in the knowledge base, as e.g. in [16]. The top line we provide is only an estimate, since the corpus is too big to examine each sentence manually. We instead compute the top line by dividing the number of relation tuples identified using the most inclusive relation candidate identification strategy, those used by the
Results for different seed selection models detailed in Section 4.7 averaged over all properties of each class are listed in Table 3. Model settings are incremental, i.e. the row
Seed selection results: micro average of precision (P) and recall (R) over all relations, using the Multilab+Limit75 integration strategy and different seed selection models. The top line for recall is 0.0917
Information integration results: micro average of precision (P) and recall (R) over all relations, using the
Co-reference resolution results: micro average of precision (P) and recall (R) over all relations, using the
Best overall results: micro average of precision (P), recall (R) and top line recall (top line) over all relations. The best normal method is the
From our evaluation results we can observe that there is a significant difference in terms of performance between the different model groups.
The
The
For our different seed selection methods,
Although strategically selecting seeds improves precision, the different
Our different models based on the relaxed setting show a surprisingly high precision. They outperform all models in terms of recall, and even increase precision for most classes. The classes they do not increase precision for are “Educational Institution” and “Film”, both of which already have a high precision for the normal setting. The
In general, the availability of test data poses a challenge, which is reflected by the top line. The top line is quite low: depending on the class, it lies between 0.035 and 0.23. Using a search-based method to retrieve Web pages for training and testing is quite common, e.g. [36] also use it for gathering a corpus for distant supervision. To increase the top line, one strategy could be to just retrieve more pages per query, as Vlachos and Clark [36] do. Another option would be to use a more sophisticated method for building search queries, as for instance researched by West et al. [38]. As for different relations and classes, we can observe that there is a sizable difference in precision between them. Overall, we achieve the lowest precision for Musical Artist and the highest for Educational Institution.
When examining the training set we further observe that there seems to be a strong correlation between the number of training instances and the precision for that property. This is also an explanation as to why removing possibly ambiguous training instances only improves precision up to a certain point: the classifier is better at dealing with noisy training data than too little training data.
We also analyse the test data to try to identify patterns of errors. The two biggest groups of errors are entity boundary recognition and subject identification errors. An example for the first group is the following sentence: “<s>The Hunt for Red October</s> remains a masterpiece of military <o>fiction</o>.”
Although “fiction” would be a correct result in general, the correct property value for this specific sentence would be “military fiction”. Our NER suggests both as possible candidates (since we employ both greedy and non-greedy matching), but the classifier should only classify the complete noun phrase as a value of Book:genre. There are several reasons for this error. First, “military fiction” is more specific than “fiction”, and since Freebase often contains the general category (“fiction”) in addition to more fine-grained categories, we have more property values for abstract categories to use as seeds for training than for more specific categories. Second, our Web corpus also contains more mentions for broader categories than for more specific ones. Third, when annotating training data, we do not restrict positive candidates to whole noun phrases, as explained in Section 4.2. As a result, if none of the lexicalisations of the entity match the whole noun phrase, but there is a lexicalisation which matches part of the phrase, we use that for training and the classifier learns wrong entity boundaries. The second big group of errors is that occurrences are classified for the correct relation, but the wrong subject.
“<s>Anna Karenina</s> is also mentioned in <o>R. L. Stine</o>’s Goosebumps series Don’t Go To Sleep.”
In that example, “R. L. Stine” is predicted to be a property value for Book:author for the entity “Anna Karenina”. This happens because, at the moment, we do not take into consideration that two entities can be in more than one relation. Therefore, the classifier learns wrong, positive weights for certain contexts.
Discussion and future work
In this paper, we have documented and evaluated a distantly supervised class-based approach for relation extraction from the Web which strategically selects seeds for training, extracts relation mentions across sentence boundaries, and integrates relation mentions to predict relations for knowledge base population. Previous distantly supervised approaches have been tailored towards extraction from narrow domains, such as news and Wikipedia, and are therefore not fit for Web relation extraction: they fail to identify named entities correctly, they suffer from data sparsity, and they either do not try to resolve noise caused by ambiguity or do so at the cost of a significant increase in runtime. They further assume that every sentence may contain any entity in the knowledge base, which is very costly.
Our research has made a first step towards addressing these limitations. We experiment with a simple NER, which we use in addition to a NERC trained for the news domain, and find that it can especially improve the number of extractions for non-standard named entity classes such as MusicalArtist:track and MusicalArtist:album. At the moment, our NER only recognises NEs, but does not classify them. In future work, we aim to research distantly supervised named entity classification methods to assist relation extraction.
To overcome data sparsity and increase the number of extractions, we extract relation mentions across sentence boundaries and integrate them to predict relations. We find that extracting relation mentions across sentence boundaries not only increases recall by up to 25% depending on the model, but also increases precision by 8% on average. Moreover, we find that a gazetteer-based method for co-reference resolution achieves the same performance on our Web corpus as the Stanford CoreNLP co-reference resolution system. To populate knowledge bases, we test different information integration strategies, which differ in performance by 5%. We further show that simple, statistical methods to select seeds for training can help to improve the performance of distantly supervised Web relation extractors, increasing precision by 3% on average. The performance of those methods depends on the type of relation they are applied to and on how many seeds are available for training. Removing too many seeds tends to hurt performance rather than improve it.
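As an illustration of what integrating relation mentions across documents can look like, the following minimal sketch sums classifier confidences per candidate object and keeps the highest-scoring one for each subject and relation. This is a generic aggregation scheme shown under our own assumptions, not necessarily one of the integration strategies evaluated in this paper. The function name `integrate_mentions`, the tuple layout and the confidence values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (subject, relation, object, classifier confidence); layout is an assumption
Mention = Tuple[str, str, str, float]


def integrate_mentions(mentions: List[Mention]) -> Dict[Tuple[str, str], str]:
    """For each (subject, relation), return the object with the highest summed confidence."""
    scores = defaultdict(lambda: defaultdict(float))
    for subj, rel, obj, conf in mentions:
        scores[(subj, rel)][obj] += conf
    return {key: max(objs, key=objs.get) for key, objs in scores.items()}


# Made-up confidences for mentions found in three different documents.
mentions = [
    ("The Hunt for Red October", "Book:genre", "fiction", 0.6),
    ("The Hunt for Red October", "Book:genre", "military fiction", 0.5),
    ("The Hunt for Red October", "Book:genre", "military fiction", 0.4),
]

predicted = integrate_mentions(mentions)
print(predicted)
```

The sketch returns a single value per subject and relation for simplicity. Relations that admit several values would need a threshold or top-k selection instead of a single maximum.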
One potential downside of using distant supervision for knowledge base population is that it requires either a very large corpus, such as the Web, or a big knowledge base for training. As such, distant supervision itself is an unsupervised domain-independent approach, but might not necessarily be useful for scenarios for which only a small corpus of documents or only a very small number of relation tuples is available in the knowledge base. For our experiments, we use a relatively large part of the knowledge base for training, i.e. 1000 seed entities for training per class, and 10 Web documents per entity and relation. In other experimental setups for distant supervision, only 30 seed entities, but 300 Web documents per entity and relation are used [36]. It is not just the quantity of documents retrieved that matters, but also their relevance to the information extraction task. Information retrieval for Web relation extraction, i.e. how to formulate queries to retrieve relevant documents for the relation extraction task, has already been researched, but has not yet been exploited for distant supervision [38]. In future work, we plan to research how to jointly extract relations from text, lists and tables on Web pages in order to reduce the impact of data sparsity and increase precision for relation mention extraction. A detailed description of future work goals is also documented in Augenstein [2].
Acknowledgements
We thank the EKAW and SWJ reviewers for their valuable feedback. This research was partly supported by the EPSRC funded project LODIE: Linked Open Data for Information Extraction, EP/J019488/1.
