Abstract
The ontology underlying the Wikidata knowledge graph (KG) has not been formalized. Instead, its semantics emerges bottom-up from the use of its classes and properties. Flexible guidelines and rules have been defined by the Wikidata project for the use of its ontology, however, it is still often difficult to reuse the ontology’s constructs. Based on the assumption that identifying ontology design patterns from a knowledge graph contributes to making its (possibly) implicit ontology emerge, in this paper we present a method for extracting what we term empirical ontology design patterns (EODPs) from a knowledge graph. This method takes as input a knowledge graph and extracts EODPs as sets of axioms/constraints involving the classes instantiated in the KG. These EODPs include data about the probability of such axioms/constraints happening. We apply our method on two domain-specific portions of Wikidata, addressing the music and art, architecture, and archaeology domains, and we compare the empirical ontology design patterns we extract with the current support present in Wikidata. We show how these patterns can provide guidance for the use of the Wikidata ontology and its potential improvement, and can give insight into the content of (domain-specific portions of) the Wikidata knowledge graph.
Introduction
Ontologies are often developed in a top-down manner, for example, by starting from specific requirements coming from domain experts or through deriving concepts from application needs. Then, as a second step, those formally defined ontologies are populated by knowledge graphs (KGs). However, this is not always the case.
As an exemplary use case, Wikidata1
Due to this bottom-up and flexible definition, and to its constant evolution, it can sometimes become challenging for an (external) user to understand the knowledge graph, and to effectively reuse both the ontology and its data [5,16]. The analysis of the content of a KG may have as goals (i) to understand the structure of the KG and its main contents, (ii) to determine whether the KG answers possible questions from the user, and (iii) to find the subset of triples that are pertinent to the user’s use case [13]. In this context, it would be useful to exploit the KG’s data to derive axioms or constraints as building blocks of a background ontology. Such an inferred ontology would represent an access point to the content of the knowledge graph, possibly supporting its reuse. Wikidata does provide some flexible guidelines around use (see Section 4); however, there still remains room to provide additional, more detailed, guidance on how to use Wikidata’s ontology by exploiting its actual usage.
In [8], we introduced a method for extracting what we name empirical ontology design patterns2
What we term here empirical pattern corresponds to what we termed emerging pattern in [8].
This paper extends [8], by presenting the development of two additional steps of the method, that transform these frequent triplets into:
probabilistic ontology design patterns, using RDF-star, a syntax that extends RDF in order to make statements about statements, as sets of axioms that are given a probability value based on their actual usage in the KG;
probabilistic ShEx shapes3
ShEx is one of the schema languages used to formalise shapes.
We also provide a clear definition of what we mean by probabilistic patterns/shapes, and explain which interpretation of probability we adopt (i.e. frequentist probability).
Moreover, while in [8] our experiments were only performed on a subset of Wikidata focused on the music domain, in this paper, besides re-running our method on the Wikidata music sub-KG, we also perform the same experiments on the art, architecture, and archaeology (AAA) domain. Besides providing additional, useful results, and comparing the results obtained from two different domains, this new contribution shows that our method is domain independent.
Finally, another novel contribution of this work is the comparison of the empirical patterns extracted from 3 different versions of Wikidata, with an interval of 15 months in between.
The remainder of the paper is organized as follows. In Section 2 we discuss related work that is relevant to the generation of shapes or data-driven patterns. Section 3 describes our method. Section 4 presents existing ways, constructs, and projects that support the reuse of the Wikidata ontology and KG, while Section 5 shows the results of our experiments on two Wikidata subsets in the music and AAA domains. In Section 6, we compare our results and current Wikidata functionality. Finally, in Section 7 we briefly discuss the potential in applying such method to KGs other than Wikidata. Section 8 concludes the paper and presents future work.
There exists many methods that generate constraints for concepts, in the form of shapes or patterns, mostly to provide support for validating knowledge graphs, but possibly useful also for its exploration.
Some of these approaches (like Astrea) are only based on ontologies. Astrea4
However, the majority of the methods build shapes, as collections of validation rules, from knowledge graphs. Shape Designer [4] is a graphical tool for automatically constructing valid SHACL or ShEx constraints that are satisfied from an RDF KG. The cardinality of the triple constraints (that is, exactly one, optional, at least one, any number) is derived from the data based on predefined rules. However, the tool cannot process large KGs, like Wikidata, without specifying a limit to the number of SPARQL query results. The experiments presented in [17] demonstrate that currently available methods cannot handle the scale of large knowledge graphs such as Wikidata: they crash even with KGs with a few millions triples.5
The subKG of Wikidata on the music domain, one of the two subgraphs to which we apply our method, contains more than 5 millions triples.
Some methods are based on knowledge graph profiling techniques. These techniques generate concise and meaningful summaries from an RDF knowledge graph, which are then used as an input for constructing shapes. In [15] the authors present an approach that relies on machine learning techniques for automatically generating RDF shapes. Profiled RDF data are used as features, and the method uses the Loupe tool6
As shown by [17], all existing approaches that automatically generate shapes include in such shapes a high number of constraints such that it is non-trivial for a human user to assess their correctness and validity. Moreover, in most cases, no constraint is produced for non-literal objects, i.e. most constraints do not indicate that the objects of a property should be of a specific type.
In [3,22], a method for extracting what is termed Statistical Knowledge Patterns (SKPs) from a knowledge graph is presented. This method is based on statistical measures similar to the ones usually used for generating data-driven shapes, which rely on the frequency in the data. An SKP is expressed in OWL and is constructed around one of the main (i.e. with more instances) classes from an ontology: it reflects the usage of that class by enriching the class’ properties and axioms from the ontology with properties and axioms that can be inferred thanks to statistical measures computed on the data. The most frequent (based on a threshold) properties in the data are selected, and the appropriate range axioms are introduced, unless they have been already explicitly asserted in the ontology. A catalogue containing 34 SKPs extracted from a version of DBpedia can be found online.7
Our method extracts the empirical patterns from a knowledge graph. At a fundamental level, the axioms and constraints extracted by our method are dependent on the KG being processed and its state at a given time. We observe the regularity of patterns and use that to assign a probability to each of the constructs we extract.
We adopt the frequentist interpretation of probability.8
We now will go into details about our method for extracting empirical patterns from a knowledge graph. An overview of the method can be seen in Fig. 1. The method is also formally described in Algorithm 1.

Method for extracting empirical patterns from a KG.

Pseudocode of the algorithm for extracting empirical patterns from a KG
Select relevant classes from the domain subgraph The first step of the method takes as input the domain subgraph and returns the number of instances for each instantiated class of the graph. For each class, it also returns a percentage of coverage, which represents its frequentist probability, intended as a simple ratio between the number of instances of a class and the total number of distinct instances included in the graph. Subsequently, all classes that have a number of instances that fall below a given threshold are filtered out. This threshold needs to be given as input. The empirical patterns will be generated around the classes that have been selected based on this threshold, i.e. as a final output there will be a pattern corresponding to each filtered class.
This threshold is computed based on the absolute distance between the number of instances of a given class and the number of instances of the most populated class in the whole knowledge graph, i.e. the maximum value in the distribution. The distance is normalised, by dividing the result by the maximum value, such that this threshold falls within the
Extract a subgraph for each selected class (lines 6–8 in Algorithm 1) Once the list of classes has been obtained, a subgraph for each class is constructed. This subgraph includes, from the domain subgraph, only the triples that have an instance of the given class as subject. For example, the subgraph corresponding to the class album will include all the triples where an instance of album is in the subject position.
Most frequent properties for each class (lines 9–11 in Algorithm 1) At this point, iterating over all subgraphs, we count the number of distinct instances that have at least one triple involving each property in the subgraph, i.e. we compute their occurrences. The frequentist probability, expressed as a percentage, represents the ratio between the number of instances that have a given property, and the total number of instances of the specific class. Then, for each subgraph, we include in the final empirical pattern only the properties that are above another threshold
Most frequent ranges for each frequent property (lines 12–14 in Algorithm 1) For each subgraph, we build the domain-property-range triplets, where domain is the type of the subject and range corresponds to either the type of the object (when the object is a resource) or the data type (e.g.
Empirical patterns: From triplets to probabilistic ODPs (lines 15–16 in Algorithm 1) After these steps, we have obtained the patterns in the form of sets of domain-property-range triplets associated with their occurrences and probability values. We transform each domain-property-range triplet into an OWL existential axiom, that will be part of an OWL ontology design pattern. Each axiom is annotated with its probability with respect to the specific pattern.
This is the reason why we chose, in the implementation of our method, to formalise the patterns using RDF-star,10
Empirical patterns: From triplets to shapes Additionally, each set of triplets is transformed into a shape, where each constraint is annotated with its probability value through comments. In implementing the method, we rely on the Shape Expressions (ShEx13

Example of ShEx shape.
Before describing the results of using our method for Wikidata, we first provide motivation for its usage based on the resources that have been created and adopted by Wikidata for recommending how to use its underlying ontology.
Property constraints Property constraints,14

Example of the limits of Wikidata property constraints.

Examples of type of Wikidata property related to chess and sport.
For instance, as in Fig. 2, a triple with an instance of recurrent event edition as subject, part of the series as predicate, and an instance of collection of articles as object would comply with the property constraints of the property, even if a more appropriate range in this case would be the class recurring event (
Properties for this type The property properties for this type (
Type of Wikidata property The class Type of Wikidata property (
Wikidata schemas The Schemas Wikidata project15
See Shape E10.
Properties list in a WikiProject In the context of domain-specific projects (e.g. music, astronomy, books, geology), the members of the community that are expert in that domain may define a list of properties that are recommended to be used for describing relevant entities of that domain. Each property, listed in a table, is usually associated with the data type of its range (that is, item, string, etc.), and a description of the usage of that property. This description, in some cases, also includes in plain text possible types for the range. For example, in the context of the WikiProject Books,20
In this section, we discuss the results of applying our method to the Wikidata subgraphs on music and the ‘art, architecture and archaeology’ domains.
Input
In order to deal with the size of Wikidata, we used the Knowledge Graph Toolkit (KGTK)21
In order to extract domain-specific patterns, and to handle a sub-graph of Wikidata with a more manageable size, we focus on specific domains represented in the Wikidata knowledge graph. While we work on the “music” and “art, architecture, and archaeology (AAA)” domains, the method can be applied to any domain. The extraction of instances related to both domains is based on a list of WordNet and BabelNet synsets identified as belonging to the respective domains, according to BabelDomains [20]. Then, the two Wikidata subgraphs are extracted by selecting each triple where the Wikidata domain-specific instance is in the subject position. The method we present in this paper does not focus on the extraction of domain-specific knowledge graphs from Wikidata (or any KG): by relying on BabelDomains we were able to easily obtain the two Wikidata subgraphs. However, a user that wants to use our method for extracting empirical patterns from a KG, could either run the method on the whole KG, or give as input a subgraph obtained with a method of their choice.
As explained in Section 3, each main step of our method takes as input a threshold, for filtering the list of classes (that is, the set of empirical patterns that will be generated), the list of properties to be included in each pattern, and the number of ranges that will define the final number of triplets/probabilistic axioms. These thresholds are to be defined by the user, based on her requirements. In this work, we do not present a method for finding the best thresholds to be chosen. However, for the purpose of presenting our results and comparing them with the current support in Wikidata, we have chosen reasonable defaults. For both the music and AAA patterns, the threshold
Both code and results are available on GitHub:
The time required for producing all results with our method, with the chosen thresholds, is: about 8 hours for downloading the Wikidata dump; about 5 hours for processing such dump with KGTK and producing the necessary files for the extraction of the EODPs; about 1 hour for the actual extraction of the EODPs. The experiments have been executed on a RTX3090 with 128GB of RAM and Intel Core i9.
The Wikidata subgraph on the music domain contains 5,083,818 triples, and 226,989 distinct instances.
Most populated classes: Music patterns Having
Most populated classes in the Wikidata music subKG
Most populated classes in the Wikidata music subKG
In Table 1, the 7 classes around which we build music patterns are presented, with the respective number of instances and of triples that have an instance of the class as subject. The most relevant entities in this Wikidata music KG include both agents (such as human and musical group) and objects (namely, single, album, musical work, extended play, record label). Notice that single (
Recommended properties for each pattern The average number of the most frequent properties selected for each pattern based on the
The number of selected properties is not directly proportional to the number of triples in the subgraph: for instance, for musical groups we recommend more properties that are frequently used (selected out of a total of 891 properties) than albums (369 properties in total). The most common properties across all patterns (without considering ID properties) are:
Statistics of selected properties and triplets for each pattern

The album pattern.
Recommended ranges for each property Table 2 lists the number of triplets
Example: The album pattern In Fig. 4, we graphically represent the pattern for albums. Each domain-property-range triplet is accompanied by the number of instances in the Wikidata music subgraph that comply with that triplet (light blue rectangles). Based on the threshold we used, for most properties we recommend only one range. However, the performer property, when used with albums, can have either a human or a musical group as range within the pattern, and the 3 recommended ranges for the property language of work or name have a subclass-of relation. It is interesting to notice that 4 recommended properties link to other empirical patterns as recommended ranges (record label, human, musical group).
@prefix rdf: <
@prefix wd: <
@prefix wdt: <
@prefix wikibase: <
@prefix dcterms: <
@prefix prov: <
@prefix os: <
@prefix oplax: <
@prefix weps: <

Snapshot of the album empirical ODP, expressed using rdf-star and owl-star.

Classes and properties used for annotating the probabilistic pattern.
The axioms in Listing 2 correspond to the subset of the ShEx shape we generate for this pattern, that can be found in Listing 3. The frequentist probabilities are reported via comments. Additionally, the shape also includes the constraint about the type (

Snapshot of the album empirical shape, expressed as a shape using ShEx.
The Wikidata subgraph on the art, architecture, and archaeology (AAA) domain includes 493,999 triples, and 26,380 distinct instances.
Most populated classes: AAA patterns The threshold
Most populated classes in the Wikidata art, architecture, and archaeology subKG
Most populated classes in the Wikidata art, architecture, and archaeology subKG
Table 3 lists the 11 classes around which we build our patterns, along with their number of instances and the number of triples with an instance of the class as subject. The most relevant entities in the Wikidata AAA domain include both agents (human, museum intended as an institution) and objects (house, skyscraper, painting, etc.). As for hierarchical relations (
Recommended properties for each pattern The threshold
Statistics of selected properties and triplets for each pattern

The museum pattern.
Recommended ranges for each property Table 4 reports the number of triplets
Example: The museum pattern In Fig. 6 we provide a graphical representation of the pattern for museums. 11 out of 15 properties are datatype, and, as expected, have only one recommended range. Out of the remaining 4 object properties, 1 has one range, i.e. heritage designation for the homonymous property heritage designation, while the other 3 properties, which are all place-related, have multiple ranges: (i) country and sovereign state for the property country, the latter being a subclass of the former, with a very close percentage of coverage; (ii) city (8%) and its subclass big city (9%) for the property location; and (iii) again, city (15%) and its subclass big city (21%) for the property location, in addition to U.S. state (15%). As you can notice, between these 3 properties related to places, country is way more frequent, indeed almost all (99%) museums have this property.

Snapshot of the museum empirical ODP, expressed using rdf-star and owl-star.

Snapshot of the museum empirical ODP, expressed as a shape using ShEx.
Listing 4 includes a snapshot of the OWL probabilistic pattern for museums. Below the triples that annotate the pattern and its related sources, we include a subset of the probabilistic axioms that have been automatically generated.
In this section, we discuss the results we obtain from both the music and AAA Wikidata knowledge graphs, and we perform an evaluation of the resulting empirical patterns through a comparison between them with the current support to reuse provided by Wikidata.
Percentages of coverage of the patterns properties in the KG
Percentages of coverage of the patterns properties in the KG
Patterns coverage In order to observe how the extracted patterns are populated in the Wikidata subKG on music, we report in Table 525
Columns indicate the number/fraction of properties considered. The actual number of properties corresponding to the fraction is reported in square brackets. The number of instances covering the whole pattern is in round brackets. Example instances populating the whole patterns can be found here:
Comparison with property constraints If we take into account the most common properties across all patterns, that is genre (7/7 patterns) and record label (6/7), we can notice that the domains and ranges we suggest are all included in the subject type and value-type constraints of the two properties. In some cases, the Wikidata constraints list as ranges classes that are more general in the hierarchy with respect to the classes we suggest: for instance, they include work in place of the more specific musical work. However, as explained in Section 4, the correct pairs of domain and range cannot be specified inside Wikidata, thus our results integrate these constraints by suggesting that e.g. music genre is more appropriate as range of the property genre with record label as domain, than e.g. criticism – which is included in the value-type constraint of genre, and which never occurs in the data. Moreover, the subject type and value-type constraints are not available for all properties; for instance, follows (used in 4 out of 7 patterns) has no such constraints.

Example of comparison between our pattern and the Wikidata support for describing musical groups.
Comparison with properties for this type Taking into account our selected music patterns, we compared the properties we include and those included as value of the property properties for this type (
Comparison with type of Wikidata property
Comparison with properties listed in the WikiProject Music The WikiProject Music27
Y stands for yes, N stands for no.
Comparison between properties recommended by WikiProject Music and properties included in our pattern for record labels
Comparison between properties recommended by WikiProject Music and properties included in our patterns for releases
Now, let us compare the properties recommended for instances of subclasses of Release by WPMusic and our relevant patterns album (A), single (S) and extended play (P) (Table 7). 6 properties recommended by WPMusic are included in all our 3 patterns, 3 properties are included in some of our patterns, while 9 properties are not included. However, for instance, the property composer is used only 6, 186 and 1602 times for instances of extended play, album and single, respectively; instrument is never used for any of these entities (instead, it is frequently used, and is included, in the human pattern); working title is used only twice for albums, while the property title (
Comparison with music-related shapes We manually identified from the complete list of Wikidata entity schemas30
Patterns coverage Table 8 reports the percentage of the total instances that covers increasing subsets of the recommended properties for each class. 3 out of 11 AAA patterns have 100% coverage considering only the most frequent property: castle, English country house, and art museum. However, while in the case of the castle and English country house patterns, this percentage of coverage with the most frequent property is associated with a high (wrt the other patterns) percentage of coverage also considering all properties of the pattern – 2.78% and 9.89%, respectively, being English country house the most populated pattern out of all 11 patterns –, art museum has only one instance that complies with the whole pattern (0.24%). Other patterns widely populated (considering all properties), wrt the total number of instances, are house (4.88%) and single-family detached home (5.88%). All patterns but painting have at least one representative entity for the whole set of properties: the pattern painting, which has a very high coverage (higher than other patterns) considering the first sets of properties (e.g. 97.34% with the first three properties), has a 0.70 percentage of coverage with 20/23 selected properties, while has no instances covering all the most frequent 21, 22, and 23 (i.e. all) properties. Other patterns with a lower percentage of coverage considering the whole set of properties are: human (8 individuals complying with the whole pattern out of 7,622), museum (1/636), and hotel (1/427).
Percentages of coverage of the AAA patterns properties in the KG
Percentages of coverage of the AAA patterns properties in the KG
Comparison with property constraints The most common properties across all patterns are
Comparison with properties for this type 8 out of the 11 classes we build our patterns around have values for the property properties for this type (
Comparison with type of Wikidata property Wikidata property to identify artworks links only to IDs related to specific museums, while the only IDs we select for paintings, based on their usage, are quite general (see Fig. 8). Wikidata property related to architecture, apart from IDs, recommends the properties architect32
And the more specific landscape architect.

Example of comparison between our patterns and the Wikidata support for describing artworks and architecture.
Comparison with properties listed in the WikiProjects It does not exist a WikiProject addressing the Art, architecture, and archaeology (AAA) domain. The WikiProject Archaeology33
Comparison between properties recommended by WikiProject Museums and properties included in our pattern for museums
The WikiProject Museums36
Comparison with AAA-related shapes We could identify 10 shapes related to the architecture, archaeology and art domain from the list of Wikidata entity schemas:37
The shape for museums38
The shape for paintings39
The only class common to both music and AAA domains is
Let us now take a look at the ranges recommended for some common properties. For the property occupation, in both patterns, the classes occupation and profession are recommended as ranges; however, the more specific range artistic profession is recommended for AAA humans (45.76%), while the domain-specific range musical profession is recommended for music humans (83.78%). The property educated at recommends as ranges, along with other general classes, conservatory in the music human pattern, and art school in the AAA human pattern. It is clear from these examples that, even in the case of general properties common to patterns from different domains, it is possible to have domain-specific recommendations of ranges.
Evolution of the music and AAA patterns across two versions of Wikidata
In this section, we report our analysis of the empirical ODPs extracted from both the music and the art, architecture, and archaeology sub-KGs, from three different versions of Wikidata: the release on which the experiments presented and discussed in the previous sections have been run (downloaded on 04-04-2022, april 2022 version), a release dated 6 months later (downloaded on 10-10-2022, october 2022 version), and a release date 9 months later than the october 2022 version (downloaded on 10-07-2023, july 2023 version). The classes that have been selected for building the patterns, using the same thresholds, are the same across the three releases, and the number of instances for each class is almost equal, or the difference is negligible.
Music Out of the 7 music patterns extracted from all Wikidata versions, 2 include the same exact list of properties in the three versions, with the same – or slightly different – order of frequency. The remaining 5 patterns show small changes:
the human EODP includes 2 additional identifier properties in the october 2022 version: National Library of Israel J9U ID (with an increase of 14.26% than the april 2022 version) and Grove Music Online ID (+15.47%), to which are added other 2 properties in the july 2023 version, that is award received (+1.14%) and Film.ru person ID (which is indeed a quite recent property that replaces the older Film.ru actor ID40
See
the musical group pattern also includes 2 more IDs in the october 2022 version: Musik-Sammler.de artist ID (+6.58%) and Bibliothèque nationale de France ID (+0.18%), which are present also in the july 2023 version.
album recommends 2 properties (not identifiers) in addition to those from the april 2022 version: place of publication (+10.37%) and title (+1.2%), which are included in the july 2023 version too.
the musical work/composition EODP is the only one that looses properties in the newer versions: one property is not included anymore in the october 2022 version, i.e. followed by (−0.31%); however, this is the first property in the list of those excluded based on the threshold; and other two properties are not included in the july 2023 version, i.e. follows (−0.32%) and producer (−0.2%).
the record label EODP includes 2 additional properties in the july 2023 version: name (+2.91%) and Commons Category (+3.92%); moreover, the newer pattern makes it evident, thanks to the change of label, that one property is being gradually superseded (WorldCat Identities ID (superseded), for now −0.2%).
Across the april 2022 and the october 2022 versions, the properties that show a wider increase in their usage, due to the edits along the 6 months, are National Library of Israel J9U ID (human pattern), Grove Music Online ID (human pattern), and place of publication (album). The july version shows that the new property Film.ru person ID has been widely used from october 2022 to july 2023 (human pattern).
AAA 4 patterns, out of the 11 patterns extracted on the art, architecture, and archaeology Wikidata subgraph, include the same set of properties across the two versions. As for the remaining 7 patterns:
the human EODP includes 2 additional ID properties in the october 2022 version: described by source (+5.08%) and National Library of Israel J9U ID (+12.42%), to which are added other 2 properties in the july 2023 version, that is field of work (+3.2%) and AKL Online artist ID (1.84%).
castle has 4 additional properties in the october 2022 release, that is: TripAdvisor ID (+16.36%), described by source (+15.67%), official website (+3.11%), and Historic England research records ID (this property was never used for castles in the april 2022 version); additionally, the july 2023 version includes also Canmore ID (+0.88%) and GB1900 ID (+12.31%).
hotel recommends 4 other properties in the october 2022 version: hotel rating (+4.73%), Skyscanner hotel ID (+4.96%), Agoda hotel numeric ID (+3.78%), and Booking.com numeric ID (+4.72%); instead, no addition could be found in the july 2023 release.
the art museum EODP has one additional property in the october 2022 release, that is National Library of Israel J9U ID (+15.23%), and other 2 properties in the july 2023 release: heritage designation (+0.83%) and Sotheby’s Museum Network ID (+0.35%).
the museum pattern includes two more properties in the july 2023 release, that is: postal code (+1.49%) and OpenStreetMap node ID (16.98%, never used in the two previous releases).
the English country house includes also the TripAdvisor ID property in the latest version (+5.88%).
the painting EODP recommends also the depicts Iconclass notation property in the latest version (+7.95%).
We notice that, in the human EODP, from both the AAA domain and the music domain, one of the properties that most increased in their usage is National Library of Israel J9U ID. In the AAA domain, the same property is more used also with art museums. Other properties that have undergone a relevant increase are TripAdvisor ID and described by source (castle).
In the previous sections, we presented how empirical patterns can be extracted from the Wikidata knowledge graph, which lacks an explicit and formal ontology, but provides some constraints and guidelines for its usage. Here, we briefly discuss how such patterns could benefit knowledge graphs other than Wikidata.
In the case of a knowledge graph that is completely devoid of an ontology, with no rules or constraints on the use of properties and classes, our EODPs may play a fundamental role in guiding a user when trying to navigate/reuse the knowledge graph, and may be a useful starting point to the development of a formalised ontology from the knowledge graph.
Our method extracts patterns from the data level without taking into account an ontology, however, it can be useful even when a strongly formalised ontology does exist. Indeed, it may suggest possibly missing axioms in the input ontology, or wrong/irrelevant axioms that do not reflect the actual usage in the data, giving insights on how the data (the empirical ODPs) match the actual intent of the ontology designers.
Let us take the ontology network and knowledge graph ArCo [6], which we contributed to, as an example. The ArCo ontology network has been developed following the pattern-based eXtreme Design methodology [7]. One of the ODPs modelled within the ontology is built around the class
@prefix a-cd: <

Comparison between ArCo’s top-down ODP for acquisitions and the EODP extracted from ArCo’s KG.
First of all, the EODP provides a useful prioritization of the properties needed to describe an acquisition, based on the specific knowledge graph taken into account. For instance, the link to an
@prefix cis: <
@prefix ti: <
@prefix core: <
@prefix dc: <
In this paper, we presented a method for extracting empirical patterns from a knowledge graph, without relying on an ontology. The method takes as input a KG and builds empirical ontology design patterns (EODPs) around the classes instantiated in the KG, by formalising axioms and constraints that involve those classes. Such EODPs incorporate information about the probability of each axiom or constraint to happen, based on their occurrences in the KG. Data about the probability is used as a filter to define how likely a class or axiom should be in order to be considered relevant. When run on a knowledge graph, the extracted EODPs allow a user to observe the main (wrt their frequency) EODPs in a (domain-specific) knowledge graph, and how the core classes of the EODPs are described in the actual usage through relations, and to which entities they are linked.
Experiments on two subgraphs extracted from Wikidata, on the music and the art, architecture and archaeology domains, demonstrated how these patterns can support the reuse of the Wikidata ontology, and how they can be an access point to understand the content of a knowledge graph, and possibly compare different KGs.
As future work, we would like to investigate how the empirical ODPs we can extract from a KG can be mapped to state-of-the-art ODPs, like those published on the ODP Portal.47
Footnotes
Acknowledgements
This work has been enabled by the H2020 Project Polifonia: a digital harmoniser for musical heritage knowledge funded by the European Commission Grant number 101004746.
