Abstract
Semantic annotation of tabular data is the process of matching table elements with knowledge graphs. As a result, the table contents could be interpreted or inferred using knowledge graph concepts, enabling them to be useful in downstream applications such as data analytics and management. Nevertheless, semantic annotation tasks are challenging due to insufficient tabular data descriptions, heterogeneous schema, and vocabulary issues. This paper presents an automatic semantic annotation system for tabular data, called MTab4D, to generate annotations with DBpedia in three annotation tasks: 1) matching table cells to entities, 2) matching columns to entity types, and 3) matching pairs of columns to properties. In particular, we propose an annotation pipeline that combines multiple matching signals from different table elements to address schema heterogeneity, data ambiguity, and noisiness. Additionally, this paper provides insightful analysis and extra resources on benchmarking semantic annotation with knowledge graphs. Experimental results on the original and adapted datasets of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2019) show that our system achieves an impressive performance for the three annotation tasks. MTab4D’s repository is publicly available at
Introduction
Many tabular data resources have been made available on the Web and data portals, thanks to the Open Data initiative in recent years. These resources contain valuable information that helps establish transparency, improve human life quality, and inspire business opportunities. Although tabular data offers enormous potential, it is difficult to use in applications due to insufficient descriptions, heterogeneous schema, and vocabulary issues.
One possible solution to these usability problems is to generate semantic annotations of tables, in particular by matching table elements with knowledge graphs (KGs) such as DBpedia. As a result, the meaning of tabular data can be interpreted or inferred through knowledge graph concepts, making it easier to use in downstream applications such as data analytics and management.
This paper presents MTab4D, a semantic annotation system for tabular data, designed to address the three annotation tasks of the Semantic Web challenge on tabular data annotation with knowledge graphs (SemTab 2019).1 SemTab 2019:

Tabular data annotations with DBpedia. dbr: is an entity prefix, and dbo: is a type prefix.
Figure 2 depicts an example of semantic annotations for tabular data. dbr:Arthur_Drews is the entity annotation for the cell “A. Drews”. The type annotations for column “col1” are dbo:PopulatedPlace and dbo:Place. The property dbo:deathYear is the annotation for the relation between column “col0” and column “col2”.

An example of semantic annotation of tabular data.
This paper proposes an annotation pipeline that combines multiple matching signals from table elements to address schema heterogeneity and ambiguity. Our system is inspired by the probabilistic graphical model-based approach [11] and the signal (or confidence) propagation of the T2K system [22]; however, our system improves the annotation tasks’ performance with the following contributions:
We also provide the following contributions to the tabular data annotation community.
API documents:
MTab4D Graphical Interface: MTab4D Repository:
MTab4D is an extended version of our work MTab [13,15] (the best performance for the three matching tasks: first rank in all four rounds and all three tasks). This system advances the previous study in three directions:
Refactoring Implementation: We refactor the code of most components of MTab [13,15] to optimize MTab4D’s efficiency. Moreover, MTab4D can automatically predict the table header, subject column, and matching targets (Section 3.2), and provide annotations. MTab4D is publicly available under an open-source license.
Reproducibility: MTab’s entity search modules are built by aggregating many online entity search services from DBpedia, Wikipedia, and Wikidata [13,15]. As a result, it is hard to reproduce the entity search results since the search index changes over time. We build new search modules based on the DBpedia October 2016 dump to enable reproducibility. Moreover, we also provide the adapted SemTab 2019 data with the DBpedia October 2016 version [14]. These resources enable a consistent environment setup for a fair comparison between annotation systems in future studies.
Public services: We also focus on building public services: we refactored the implementation, optimized system efficiency, added support for multilingual tables, and can process various table formats such as Excel, CSV, TSV, or Markdown tables. We also provide a graphical interface that enables the user to annotate a table by pasting its contents from table files or websites. Wang et al. [26] note that only our system could generate the annotations, while other annotation systems require high time complexity.
The rest of this paper is organized as follows. In Section 2, we define the annotation tasks and describe MTab4D assumptions. Then, we present the overall framework and the details of each framework’s module in Section 3. Section 4 reports experimental settings, results, and error analysis. We describe MTab4D public APIs and graphical interfaces in Section 5. Section 6 discusses the related work on semantic annotation of table data and summarizes the participant approaches. Finally, we summarize the paper and discuss future directions and the lessons learned in Section 7.
In this section, we provide formal definitions for the three annotation tasks in Section 2.1. The assumptions on MTab4D are described in Section 2.2.
Problem definitions
Knowledge graph
The DBpedia knowledge graph
We denote the set of entities as
Entity types are related by an rdfs:subClassOf relation. When
Entity
A triple comprises an entity
Tabular data
Let
Matching targets
Let
Semantic annotation tasks
Given the DBpedia knowledge graph, the three tasks are: Cell-Entity matching (CEA), Column-Type matching (CTA), and Column Pair Relation-Property matching (CPA).
Assumptions
We build MTab4D system based on the following assumptions:
MTab4D is built based on a closed-world assumption.
MTab4D annotates tabular data based on knowledge graph information. Therefore, we assume that the knowledge graph (DBpedia) is complete and correct. When table elements are not available in the knowledge graph, the system still returns the most relevant results, which are then incorrect answers.
The tabular input data is a horizontal relational table type.
A horizontal relational table contains semantic knowledge graph triples as (subject, predicate, object). The table also has a subject column containing entity names, and the relation between the subject column and other table columns represents the predicate relation between the entities (subject) and attribute values (object). All the cell values of the same column have the same datatype, and the entities related to cell values of the same column are of the same type.
In this section, we describe the MTab4D framework in Section 3.1. The details of each step are described from Section 3.2 to Section 3.7.

MTab4D framework for tabular data annotations.
We design MTab4D as a seven-step pipeline as shown in Fig. 3.
Step 1 pre-processes the input table: cell value normalization, cell and column datatype prediction, header prediction, subject column prediction, and matching target prediction. Step 2 generates candidate entities.
Steps 3 and 4 then generate candidate types and properties, respectively, using row-based aggregation from Step 2. Step 5 disambiguates candidate entities by aggregating confidence from Steps 2, 3, and 4.
Steps 6 and 7 disambiguate candidate types and properties, respectively, using the results from Step 5.
The following are detailed explanations of each step of the framework.
Step 1: Pre-processing
We perform the following five processes: cell normalization, datatype prediction, header prediction, subject column prediction, and matching target prediction.
Cell normalization
We remove HTML tags and non-cell-values such as -, NaN, none, null, blank, unknown, ?, #. Additionally, we use the
Data type prediction
The system predicts a table cell’s datatype into either non-cell (empty cell), literal, or named-entity (NE). We use the pre-trained SpaCy models [8] (OntoNotes 5 dataset) to identify named and numeric entities. A cell has a named-entity type when the SpaCy model recognizes an entity-name tag such as PERSON (human names), NORP (nationalities), FAC (building), ORG (companies), GPE (countries, cities), LOC (locations), PRODUCT (objects, vehicles), EVENT (wars, sports events), WORK_OF_ART (books, songs), LAW (law documents), LANGUAGE (named language). A cell has a literal type when the recognized SpaCy tag is a numeric tag such as DATE (date), TIME (time), PERCENT (percentage), MONEY (amount of money), QUANTITY (measurements), ORDINAL (ordinal), and CARDINAL (other numerical values). If no tag is assigned, we associate the cell type with named-entity because the SpaCy model could miss the named-entity types.
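The tag-to-datatype decision above can be sketched as follows. The tag sets are the SpaCy OntoNotes 5 labels listed in the text; the function itself is illustrative, not MTab4D’s actual code, and assumes a SpaCy tag has already been produced for the cell.

```python
from typing import Optional

# SpaCy OntoNotes 5 tags that indicate a named entity.
NE_TAGS = {"PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT",
           "EVENT", "WORK_OF_ART", "LAW", "LANGUAGE"}
# SpaCy tags that indicate a numeric/literal value.
LITERAL_TAGS = {"DATE", "TIME", "PERCENT", "MONEY", "QUANTITY",
                "ORDINAL", "CARDINAL"}

def cell_datatype(cell_text: str, spacy_tag: Optional[str]) -> str:
    """Map a cell (and the SpaCy tag of its first recognized entity) to a datatype."""
    if not cell_text.strip():
        return "non-cell"        # empty cell
    if spacy_tag in LITERAL_TAGS:
        return "literal"         # numeric tag
    # Named-entity tags AND untagged cells both fall back to named-entity,
    # because the SpaCy model could miss named-entity types.
    return "named-entity"
```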
Next, the system predicts a table column’s datatype into either a non-match column (empty column)
Header prediction
Table headers can be located in the first rows of a table. If the list of datatypes of a header candidate row differs from the majority datatypes of the remaining rows, the candidate is predicted as the table header. For example, the list of datatypes of a header candidate (row) is [named-entity, named-entity, named-entity], while the list of majority datatypes of the remaining rows is [named-entity, literal, literal]. We also empirically found that header text tends to be shorter or longer than the remaining data rows. If the value length of the header candidate row is less than the 0.05 quantile or larger than the 0.95 quantile of the value lengths of the remaining rows, the candidate is predicted as the table header.
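The two header heuristics can be sketched as below. The row representation (lists of `(value, datatype)` pairs) and the exact quantile indexing are assumptions for illustration; MTab4D’s implementation may differ.

```python
from collections import Counter

def is_header(candidate, rows):
    """candidate and each element of rows are lists of (value, datatype) pairs."""
    # Heuristic 1: datatype pattern of the candidate differs from the
    # per-column majority datatype of the remaining rows.
    cand_types = [t for _, t in candidate]
    majority = [Counter(col).most_common(1)[0][0]
                for col in zip(*(tuple(t for _, t in r) for r in rows))]
    if cand_types != majority:
        return True
    # Heuristic 2: candidate text length falls outside the [0.05, 0.95]
    # length quantiles of the remaining rows.
    lengths = sorted(len(v) for r in rows for v, _ in r)
    lo = lengths[int(0.05 * (len(lengths) - 1))]
    hi = lengths[int(0.95 * (len(lengths) - 1))]
    cand_len = sum(len(v) for v, _ in candidate) / len(candidate)
    return cand_len < lo or cand_len > hi
```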
Subject column prediction
A column is a subject-column candidate when its datatype is named-entity and its average cell value length is from 3.5 to 200. We also restrict the computation to non-header cells, since the length of table headers could differ from the remaining cells. The subject column is then determined by a uniqueness score that increases for columns with many unique values and decreases for columns with many missing values. The subject column is the column with the highest uniqueness score; if several columns share the same score, the left-most column is chosen.
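The selection can be sketched as follows. The exact uniqueness formula is not given in the text, so the scoring used here (unique-value ratio minus missing-value ratio) is an assumption; the length filter and left-most tie-breaking follow the description above.

```python
def subject_column(columns, datatypes):
    """columns: list of lists of cell strings; datatypes: per-column datatype.
    Returns the index of the predicted subject column, or None."""
    best, best_score = None, float("-inf")
    for i, (col, dtype) in enumerate(zip(columns, datatypes)):
        if dtype != "named-entity":
            continue                                   # only named-entity columns qualify
        non_empty = [c for c in col if c]
        avg_len = sum(len(c) for c in non_empty) / max(len(non_empty), 1)
        if not (3.5 <= avg_len <= 200):
            continue                                   # average cell length filter
        # Assumed uniqueness score: reward unique values, penalize missing ones.
        score = len(set(non_empty)) / len(col) - (len(col) - len(non_empty)) / len(col)
        if score > best_score:                         # strict '>' keeps the left-most on ties
            best, best_score = i, score
    return best
```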
Matching targets prediction
MTab4D uses the following heuristics to generate matching targets for the three annotation tasks when the input does not have matching targets.
CEA task: matching targets are the table cells whose datatypes are strings.
CTA task: matching targets are the columns whose datatypes are strings.
CPA task: matching targets are the relations between the core attribute and the remaining table columns.
Step 2: Candidate entity generation
To generate candidate entities for the CEA matching targets, we perform entity searching on the MTab4D search engine.5 MTab4D Entity search:
We build the three entity search modules (i.e., keyword search, fuzzy search, and aggregation search) to address table cell values’ ambiguity and noisiness.
Let
For the three search modules, the default size of the ranking list is set to 100 in all our experiments for efficiency reasons. The details of the search modules are described as follows.
We build the keyword search to address table cell ambiguity and entity name variants. We use the Elasticsearch engine6
Another challenge of entity search is that table cells might be noisy, contain spelling errors, or be expressed as abbreviations. We introduce the fuzzy search module using edit distance (Damerau–Levenshtein) and entity popularities. The ranking score of fuzzy search is calculated as follows.
Since the edit distance calculation is expensive, we perform candidate filtering and hashing to reduce the number of operations on pairwise edit distance calculation. We remove candidate entities with their length larger or smaller
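The fuzzy ranking can be sketched as below. Since the exact scoring formula is not reproduced here, the linear combination of normalized edit-distance similarity and popularity (weight `alpha`) is an assumption; the distance is the optimal-string-alignment variant of Damerau–Levenshtein.

```python
def dl_distance(a, b):
    """Optimal-string-alignment Damerau-Levenshtein distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def fuzzy_rank(query, candidates, alpha=0.9):
    """candidates: {label: popularity in [0, 1]} -> labels sorted by fuzzy score."""
    def score(label):
        sim = 1 - dl_distance(query.lower(), label.lower()) / max(len(query), len(label))
        return alpha * sim + (1 - alpha) * candidates[label]
    return sorted(candidates, key=score, reverse=True)
```

For example, the noisy query “Senaticweb” still ranks “Semantic Web” first despite the spelling errors.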
Aggregation search
This search module is designed to aggregate keyword search and fuzzy search results with a weighted fusion as the following equation.
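A minimal sketch of such a weighted fusion, assuming a simple convex combination with weight `w` (the actual weight and equation in MTab4D are not reproduced here):

```python
def aggregate(keyword_scores, fuzzy_scores, w=0.5):
    """Fuse two {entity: score} rankings with weight w on the keyword search."""
    entities = set(keyword_scores) | set(fuzzy_scores)
    # Entities missing from one ranking contribute a score of 0 for that module.
    return {e: w * keyword_scores.get(e, 0.0) + (1 - w) * fuzzy_scores.get(e, 0.0)
            for e in entities}
```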
Step 3: Candidate type generation
This step generates candidate types for the named-entity columns. The overall confidence scores of candidate types are described in Section 3.4.5. The details of entity search signals, named-entity recognition signals, table header signals, and numerical column signals are described in Section 3.4.1, Section 3.4.2, Section 3.4.3, and Section 3.4.4, respectively.
Entity search signals
Let
Named-entity recognition signals
We denote the potential function of candidate types derived from named-entity recognition signals as
Finally, we normalize the confidence scores of candidate types to the range
Mappings between named entities (using SpaCy toolkit) and DBpedia entity types
Let
We also normalize candidate type confidence scores to
Numerical column signals
Let
We first use EmbNum+ [16] to retrieve relevant properties for each numerical column. We use the 200 most frequent numerical attributes (numerical properties, e.g., height, weight) of DBpedia as the database.

Property lookup with EmbNum+.
Next, we use the candidate properties to infer the classes (types) for the subject column. The inferred classes are the candidate properties’ domain classes (dbo:domain). For example, in Fig. 4, the candidate properties of the two numerical columns are “dbo:oclc” and “dbo:finalPublicationYear”. The inferred candidate types of the subject column given the two numerical columns are “dbo:WrittenWork” and “dbo:PeriodicalLiterature” (the domain types of the candidate properties).
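The domain-based inference in this example can be sketched as follows; the `DOMAIN` mapping is a tiny hand-made stand-in for DBpedia’s ontology, not a real extract.

```python
# Stand-in for rdfs/dbo domain declarations in the DBpedia ontology.
DOMAIN = {
    "dbo:oclc": "dbo:WrittenWork",
    "dbo:finalPublicationYear": "dbo:PeriodicalLiterature",
}

def infer_types(candidate_properties):
    """Map numerical-column candidate properties to their domain classes,
    which become candidate types for the subject column."""
    return {DOMAIN[p] for p in candidate_properties if p in DOMAIN}
```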
Let
The confidence scores of candidate types
To avoid adding too much noise to the final aggregation, we omit the types that have confidence scores less than The parameter
This step is to generate candidate properties
Subject column – named-entity column
Let
Cell
We assume that there is a relation between candidate entities of
Figure 5 illustrates candidate property generation between the subject and entity columns. The list of

Illustration of candidate property generation between the subject column and an entity column.
Let
We perform value-based matching to calculate the similarities between entity attribute values and table cell values. Textual values: we use the normalized Damerau–Levenshtein distance as the similarity. Numerical values: we adapt the relative change as the numerical similarity, as in the following equation.
We aggregate the confidence scores of all rows of the two columns based on properties and then normalize these scores to
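The two value similarities can be sketched as below. For text, `difflib`’s ratio is used here as a convenient stand-in for the normalized Damerau–Levenshtein similarity the paper describes; the relative-change formula for numbers is likewise an illustrative assumption.

```python
from difflib import SequenceMatcher

def text_similarity(a, b):
    # Stand-in for normalized Damerau-Levenshtein similarity.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def numeric_similarity(x, y):
    # Relative-change based similarity: identical values score 1.0,
    # and the score decreases with the relative difference.
    if x == y:
        return 1.0
    return max(0.0, 1.0 - abs(x - y) / max(abs(x), abs(y)))
```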

Illustration of candidate property generation between the subject column and a literal column.
Figure 6 illustrates candidate property generation between the subject and literal columns. The list of
This step re-calculates the candidate entity scores based on the signals from the previous steps. Given a table cell
The candidate entity confidence scores of the table cell of
Step 6, 7: Type and property matching
We aggregate the highest probabilities of candidate entities in Step 5 for each cell
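The type aggregation idea can be sketched as a confidence-weighted vote over the types of each column’s annotated entities. This is a simplification of MTab4D’s aggregation, and the data layout is an assumption for illustration.

```python
from collections import defaultdict

def column_types(cell_annotations, entity_types):
    """cell_annotations: {row_index: (entity, confidence)} for one column;
    entity_types: {entity: [types]}. Returns types ranked by accumulated confidence."""
    scores = defaultdict(float)
    for entity, conf in cell_annotations.values():
        for t in entity_types.get(entity, []):
            scores[t] += conf          # each annotated cell votes with its confidence
    return sorted(scores, key=scores.get, reverse=True)
```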
Evaluation
This section first reports the details of the benchmark datasets in Section 4.1, evaluation metrics in Section 4.3, and experimental settings in Section 4.4. The overall results are reported in Section 4.5.
Datasets
We use two datasets: the original SemTab 2019 dataset (four rounds) and the adapted SemTab 2019 dataset targeting the October 2016 version of DBpedia [14]. Table 2 reports the number of tables in each dataset, and the number of target cells (CEA), columns (CTA), and column pairs (CPA) of the original and adapted versions of the SemTab 2019 dataset.
Statistics of matching targets on the original and adapted SemTab 2019 datasets
The SemTab 2019 challenge has four rounds; each round came with a different set of tables and matching targets for each annotation task [7]. In detail, the Round 1 data is a subset of the T2Dv2 dataset, a standard dataset in tabular data annotation. Round 2 is the biggest and most complex, since it combines Wikipedia tables with tables automatically generated from DBpedia. The Round 3 and 4 datasets are also automatically generated from DBpedia, but easily matched cells are removed in Round 4. To generate the tabular data, a list of classes and properties is first gathered; then, for each class, the generator selects groups of properties and uses them to create “realistic” tables using SPARQL queries. Finally, noise is added to the surface text of the table cells, or “easy” match cells are removed.
Adapted SemTab 2019 dataset
This section presents an adapted SemTab 2019 dataset targeting the October 2016 version of DBpedia for reproducibility [14]. It is challenging to compare annotation systems that do not use the same experimental setting (DBpedia version). Knowledge graphs change over time, so the schema and instances of one DBpedia version differ considerably from another. As a result, an annotation system could yield a different performance when benchmarked on different DBpedia versions.
To enable reproducibility, we perform adaptations on the original version of the SemTab 2019 dataset to the October 2016 version of DBpedia as in Section 4.1.3. Section 4.1.4 reports the open resources derived from the adapted dataset.
Ground truth
We process the original SemTab 2019 dataset as follows:
We make the matching targets and ground truth answers consistent by removing matching targets that are not available in the ground truth, and vice versa. We remove invalid entities, types, and properties that are not available in the October 2016 version of DBpedia. We add missing redirects and equivalent entities, types, and properties. We remove prefixes to avoid redirect issues. For example, the expected prefix of the entity “Lake Alan Henry” is “http://dbpedia.org/
Public resources
We also prepare open resources so that future studies can reproduce our results.
Schema: We prepare CSV files of the DBpedia class hierarchy, properties, and equivalents. Data: We also publish an entity JSON dump (all information about entities) of the October 2016 version of DBpedia. The information of each entity can be accessed quickly using our open API, without processing all DBpedia entities. Entity Search: We provide a public entity search API based on entity labels and aliases (multilingual) of the October 2016 version of DBpedia. The search results are expected to stay the same when using our API, while other online entity search services (e.g., DBpedia entity search or Wikipedia search) could yield different answers over time. Other resources: We also provide other public APIs for tabular data annotation, numerical attribute labeling, and annotation evaluation on the original and adapted SemTab 2019 datasets [14].
Analysis of the original SemTab 2019 dataset
Index Inconsistencies (IIndex) describes the number of invalid table cell indexes of CEA targets. Encoding Inconsistencies (IEncoding) describes the number of encoding errors of DBpedia URIs. Many samples are inconsistent between encoded and decoded URI representations. For example, the entity URI dbr:Angélica_Rivera could be encoded as “dbr:Ang%C3%A9lica_Rivera” and decoded as “dbr:Angélica_Rivera”. The CEA ground truth contains a mixture of encoded and decoded URIs. Percent-encoded URIs are discouraged.9 DBpedia URI encoding:
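This encoding inconsistency can be reproduced with the standard library: the same DBpedia resource has a percent-encoded and a decoded URI form, so comparing ground-truth URIs requires normalizing to one form (the `normalize_uri` helper is illustrative, not part of MTab4D).

```python
from urllib.parse import unquote

def normalize_uri(uri):
    # Decode percent-encoding so that the two URI forms compare equal.
    return unquote(uri)

encoded = "Ang%C3%A9lica_Rivera"
decoded = unquote(encoded)   # the decoded form of the same resource name
```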
The Round 2 dataset contains the most inconsistencies, including all four types (7.8% inconsistencies), because it combines a subset of Wikipedia tables (with a different KG target: the October 2015 version of DBpedia) and automatically generated tables (from the October 2016 version of DBpedia). The Round 1 dataset has the second-most inconsistencies (7%), since it is a subset of the T2D dataset (targeting the 2014 version of DBpedia). The Round 3 dataset has 3% inconsistencies, and the Round 4 dataset is the cleanest of the four rounds (2% inconsistencies).
Analysis of the CEA task of SemTab 2019 with the DBpedia October 2016
The Round 2 dataset has 2% index errors, where the matching target is not available in the input table. Round 1 has no IHierarchy errors, while Rounds 2, 3, and 4 have IHierarchy error rates of 5%, 6%, and 1%, respectively.
Analysis on CTA task of SemTab 2019 with the October 2016 version of DBpedia
Table 5 depicts statistics on missing equivalent properties (IEquivalent) in the October 2016 version of DBpedia. Round 1 has 4% of ground truth with missing equivalent properties. Rounds 2, 3, and 4 have approximately 28% missing equivalent properties.
Analysis on CPA task of SemTab 2019 with the October 2016 version of DBpedia
There are four different metrics used to evaluate tabular data annotation:
F1-score is a harmonic mean of precision and recall. It is used as the primary score to measure the performance of entity annotations (
Regarding the type annotation
Let the list of target columns be
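As a sketch of the precision/recall/F1 computation for the CEA task (a simplified stand-in; the official SemTab evaluator may apply additional normalization, which is not reproduced here):

```python
def cea_scores(annotations, ground_truth):
    """annotations, ground_truth: {target: entity}.
    Targets left unannotated count against recall only."""
    correct = sum(1 for t, e in annotations.items() if ground_truth.get(t) == e)
    precision = correct / len(annotations) if annotations else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1
```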
Experimental settings
MTab4D is built on the October 2016 version of DBpedia in three versions (a, b, f), depending on the entity search module used: MTab4Db uses the keyword search, MTab4Df uses the fuzzy search, and MTab4Da uses the aggregation search.
We compare MTab4D with other systems using the results reported on the original version of the SemTab 2019 dataset. Unlike MTab, the system we participated with, MTab4D focuses on reproducibility: its entity search is built from the DBpedia dump data.
We also conduct experiments with MTab4D on the adapted version of SemTab 2019, where we remove the inconsistencies between the ground truth and the October 2016 version of DBpedia.
Experimental results
In this section, we first report the results of MTab4D with other SemTab 2019 participants on the original data version in Section 4.5.1. The full results are reported on the challenge websites.10 Results:
Table 6 depicts the CEA results in F1 score and Precision of MTab4D compared to the other systems on the original version of SemTab 2019. Because of the high data inconsistencies in Rounds 1 and 2, MTab4D could not provide results comparable to the original MTab system. However, MTab4Db, with keyword search, achieved slightly higher performance than MTab in Rounds 3 and 4. In Rounds 1 and 2, MTab4Df, using fuzzy search, achieves higher performance than the keyword search and aggregation search variants. This can be explained by the higher noise level of table cells in those datasets, such as incorrect encoding parsing, entity label variance, or abbreviations. In Rounds 3 and 4, MTab4Db, using keyword search, achieves higher performance since the table cells are more similar to DBpedia entity labels.
CEA results in F1 score and Precision for the four rounds of the original version of SemTab 2019
Results are taken from SemTab 2019 [9].
Table 7 depicts the CTA results in AH score and AP score of MTab4D compared to the other systems on the original version of SemTab 2019. Because of the CTA inconsistencies of the ground truth, MTab4D results are not comparable with our system MTab in SemTab 2019. MTab4Df achieves slightly higher performance than MTab4D using other entity search modules.
CTA results in F1 score and Precision for Round 1, and in AH and AP scores, for the original version of SemTab 2019
Results are taken from SemTab 2019 [9].
Table 8 depicts the CPA results in F1 score and Precision of MTab4D compared to the other systems on the original version of SemTab 2019. The MTab4D results are slightly lower than those of our MTab system in SemTab 2019 because the CPA ground truth lacks equivalent properties. The results of the different search modules are similar; we therefore conclude that the choice of entity search module has no effect on the CPA results of MTab4D.
CPA results in F1 score and Precision for the four rounds of the original version of SemTab 2019
Results are taken from SemTab 2019 [9].
This section compares the MTab4D performance on the original and adapted versions of the SemTab 2019 datasets.
Table 9 depicts the MTab4D results in F1 score for the CEA task on the original and adapted versions of the SemTab 2019 dataset. MTab4D, built on the October 2016 version of DBpedia, consistently achieves better performance on the adapted version than on the original one. The Round 1 and 2 results improve more on the adapted version than those of Rounds 3 and 4 because Rounds 1 and 2 have more inconsistencies in the original dataset.
CEA results in F1 score of MTab4D for the original and adapted version of SemTab 2019 dataset
Table 10 depicts the MTab4D results in AH score for the CTA task on the original and adapted versions of the SemTab 2019 dataset. The performance of MTab4D consistently improves on the adapted version.
CTA results in AH score of MTab4D for the original and adapted version of SemTab 2019 dataset
Table 11 depicts the MTab4D results in F1 score for the CPA task on the original and adapted versions of the SemTab 2019 dataset. The performance of MTab4D significantly improves on the adapted version, which adds the equivalent properties to the ground truth data. Due to the incompleteness of DBpedia, there are many indirect equivalent properties in DBpedia. For example, dbo:deathCause and dbo:causeOfDeath share the same equivalent property wikidata:P509 (cause of death). Knowledge graph completion is not the main focus of this work, but we can expect property annotation to improve as the completeness of DBpedia improves.
CPA results in F1 score of MTab4D for the original and adapted version of SemTab 2019 dataset
This section analyzes the error cases of the CEA, CTA, and CPA tasks of MTab4D. Specifically, we pursue the following analysis objectives:
For each question, we analyze the MTab4D results on the adapted datasets (Section 4.5.2) using MTab4Db (MTab4D with the keyword search of Section 3.3.1), since this setting achieved the best performance on this dataset. Error details for each table, and errors of the other MTab4D versions (i.e., MTab4Df and MTab4Da), are available in the MTab4D repository.11 MTab4D error log files:
Statistics of entity annotation errors on the adapted SemTab 2019 dataset are depicted in Table 12. The error rate of entity annotation ranges from 0.75% to 14.32%. MTab4D has higher error rates on the noisy data of Round 1 (14.32%) and Round 2 (8.17%), and lower error rates on the synthetic data of Round 3 (0.75%) and Round 4 (1.27%).
Most entity annotation errors (from 89.21% to 95.88%) occur because the correct answer is not returned by the entity search modules in Step 2. Due to the high ambiguity of table cells, 4.12%–10.79% of CEA errors remain even when a correct answer is present after candidate entity generation.
Because the MTab4D entity search modules return the top 100 relevant entities by default (for efficiency reasons), a correct answer might be ranked lower than the top 100. To understand MTab4D’s performance with a larger search limit, we re-run the experiments with the search limit set to 1,000. Table 13 depicts MTab4D’s performance with search limits of 1,000 and 100 on the adapted SemTab 2019 dataset. Despite the increased search limit, the differences in the final results are not significant. Building a better search engine for tabular data remains challenging for future studies because table data contains ambiguous text, abbreviations, and misspellings.
Statistics of entity annotation errors on the adapted SemTab 2019 dataset. We show the number of CEA matching targets as #Targets, the number of errors of MTab4Db as #Errors, the percentage of CEA errors when there is no correct entity available in the candidate entity list of the search modules (Section 3.3) as Search, and other error cases as Others.
Statistics of type annotation errors on the adapted SemTab 2019 dataset are depicted in Table 14. In this analysis, a type annotation is an error when there is no overlap between the annotated types and the ground truth types (the concatenation of perfect and OK types). The error rate of type annotation ranges from 1.57% to 16.75%. MTab4D has higher error rates on the noisy data of Round 1 (9.17%) and Round 2 (16.75%), and lower error rates on the synthetic data of Round 3 (1.75%) and Round 4 (1.57%).
A large portion of the type errors (80%–99.35%) co-occur with CEA annotation errors. Since the MTab4D type annotation module aggregates confidence signals from the CEA annotation results, errors in the CEA task also affect the performance of the CTA task.
Statistics of type annotation errors on the adapted SemTab 2019 dataset. We show the number of CTA matching targets as #Targets and the number of errors of MTab4Db as #Errors. CEA errors depicts the probability of CEA errors occurring in tables that have CTA errors.
Statistics of property annotation errors on the adapted SemTab 2019 dataset are depicted in Table 15. The error rate of property annotation ranges from 2.66% to 2.97%. We make the same observation as for the CTA task: the CPA results are strongly affected by CEA performance, as a large portion of property annotation errors (80%–99.35%) co-occur with CEA errors.
Statistics of property annotation errors on the adapted SemTab 2019 dataset. We show the number of CPA matching targets as #Targets and the number of errors of MTab4Db as #Errors. CEA errors depicts the probability of CEA errors occurring in tables that have CPA errors.
Statistics of MTab4D errors on different table sizes are reported in Table 16. Overall, MTab4D performance increases with table size. Regarding the CEA task, MTab4D produces many errors on small tables with fewer than ten cells (31.78% of the annotations are incorrect), while the system performs very well on large tables. In the CTA task, MTab4D also makes many errors on small tables (30.66% of the annotations are errors); again, the system works well on large tables. For the CPA task, there are no matching targets for tables with fewer than ten or more than 10,000 cells. MTab4D provides the best CPA results for tables with 100 to 1,000 cells, with only 5.72% incorrect answers.
Statistics of MTab4D errors on different table sizes
This section describes the implementation of the components described in Section 3: the MTab4D APIs and the MTab4D graphical interface.
MTab4D APIs
We provide the five following APIs.
Entity Search: This API is used to search for relevant entities in the October 2016 version of DBpedia. There are three search modules (Section 3.3): keyword search, fuzzy search, and aggregation search.
Entity Information Retrieval: This API is used to retrieve entity information from the October 2016 version of DBpedia. The response includes the DBpedia title, mappings to Wikidata and Wikipedia, label, aliases, types, entity popularity (PageRank score), entity triples, and literal triples.
Table Annotation: The user sends a table to this API and gets the annotation results, including structural and semantic annotations. The user can provide the annotation targets for the CEA, CTA, or CPA tasks as input, or MTab4D can automatically predict the targets based on cell and column datatypes as in Section 3.2.5.
Numerical Labeling: The user can perform numerical labeling on numerical columns and get a ranked list of relevant properties, as in EmbNum+ [16].
SemTab 2019 Evaluation: The user can submit the annotation results of the CEA, CTA, and CPA tasks to calculate the evaluation metrics on the original and adapted datasets of SemTab 2019.
MTab4D graphical interface
We provide two interfaces: entity search and table annotation.
Entity search interface
The user can enter a query in the entity search interface and then search with the three MTab4D entity search modules (i.e., keyword search, fuzzy search, and aggregation search). Figure 7 illustrates an example of the fuzzy search with the query “Senaticweb”. It takes only 0.06 seconds to retrieve the relevant “Semantic Web” entity.

MTab4D entity search interface.
In the table annotation interface, the user can copy and paste table content expressed in any language from tabular data files (Excel, CSV, TSV) or tables on the Web. The user can then click the “Annotate” button to get the annotation results.

MTab4D table annotation interface.
Figure 8 illustrates an example of table annotation on the “v15_1” table in Round 4 of SemTab 2019. MTab4D takes 0.78 seconds to annotate the input table, as shown in the left figure. The figure on the right shows the annotation results. The table header is in the first row, and the core attribute is in the first column. Entity annotations are in red and located below the table cell value. The type annotation is in green and located in the “Type” column. Finally, the relations between the core attribute and other columns are in blue and located in the property column.
In this section, we review the other systems that participated in SemTab 2019 and discuss related work on the tabular data annotation task.
SemTab 2019 systems
Comparison of candidate entity generation methods of SemTab 2019 participants
URI heuristic: checking whether there is an available entity whose label is similar to the table cell, using a SPARQL ASK query.
This section describes the six other systems that participated in all rounds of the SemTab 2019 challenge.
The participants generate candidate entities by looking up table cell values with online services or by searching a local Elasticsearch index of DBpedia or Wikidata. Table 17 reports the lookup services used in the participant systems, since the systems lack specifications of their information retrieval techniques, hyper-parameters, and database index sources. The candidate types and candidate properties are then estimated from the candidate entities. Finally, the systems perform entity disambiguation to produce the CEA results. The CTA and CPA annotations are generated from the CEA annotations using majority voting.
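As a minimal sketch, the majority-voting step these systems use for CTA can be expressed as follows: the type of a column is the most frequent type among the entity annotations of its cells. The type labels below are illustrative.

```python
from collections import Counter

def vote_column_type(cell_entity_types):
    """Majority voting for column type annotation (CTA).

    cell_entity_types: one list of KG types per annotated cell in the column.
    Returns the most frequent type, or None if there are no annotations.
    """
    counts = Counter(t for types in cell_entity_types for t in types)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

column = [
    ["dbo:Film", "dbo:Work"],  # types of the entity matched to cell 1
    ["dbo:Film"],              # cell 2
    ["dbo:Film", "dbo:Work"],  # cell 3
]
print(vote_column_type(column))  # → dbo:Film
```

In practice the participant systems also weight votes by entity-matching confidence and respect the type hierarchy; plain counting is the simplest variant.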
CSV2KG (IDLAB) first searches DBpedia Lookup and DBpedia Spotlight to generate candidate entities [25]. The candidate type and property annotations are estimated using majority voting over the candidate entities. Then, the entity annotations are refined using the information of the candidate properties. Finally, the type annotations are estimated from the entity annotations.
The Tabular ISI approach first generates candidate entities with the Wikidata API and Elasticsearch over the entity labels of Wikidata and DBpedia. Second, the authors use a heuristic TF-IDF approach and a machine learning (neural network ranking) model to select the best candidate for the entity annotation task [24]. The type annotations are estimated from the entity annotations by searching the hierarchy for common classes. The property annotations are estimated by finding relations between the candidate entities of the primary and secondary columns, or by matching values between the primary and secondary columns.
MantisTable performs column analysis, including predicting named-entity columns, literal columns, and the subject column, and then maps columns to concepts in DBpedia [3]. The relationships between the subject column and the other columns are estimated based on the predicate context and predicate frequency of column values and candidate predicates. Then, entity linking is performed using the results of the previous steps to disambiguate cell values. The property annotations are estimated by taking the most frequent candidate properties from the entity linking phase. For type annotation, the authors calculate a hierarchical path score over the types of the annotated entities and select the type with the maximum path score.
DAGOBAH performs entity linking with lookups on Wikidata and DBpedia and voting mechanisms [1]. The authors use Wikidata entity embeddings to estimate the candidate entity types, assuming that entities in the same column should be close in the embedding space, as they share semantic meanings.
LOD4ALL uses a combination of direct search (SPARQL ASK on dbr:“query”), keyword search (e.g., abbreviations of human names), and Elasticsearch to find candidate entities [12]. The candidate entities are used to estimate type annotations with majority voting. Then, the system determines the entity annotations under the type constraints. Finally, the property annotations are estimated by column-based majority voting over the entity annotations of each table row.
ADOG focuses on entity annotation with Elasticsearch over an integrated ontology (a DBpedia sub-graph) stored in a NoSQL database named ArangoDB [18]. The system scores candidate entities using Levenshtein distance, and the type and property annotations are estimated from the entity annotations.
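For illustration, Levenshtein-based candidate ranking of the kind ADOG uses can be sketched as follows (a simplified, case-sensitive variant; the candidate labels are illustrative, not taken from ADOG):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def rank_candidates(cell_value, candidate_labels):
    """Order candidate entity labels by edit distance to the cell value."""
    return sorted(candidate_labels, key=lambda label: levenshtein(cell_value, label))

print(rank_candidates("Senaticweb", ["Semantic Web", "Seattle", "Senate"]))
# → ['Semantic Web', 'Senate', 'Seattle']
```

Real systems typically normalize case and combine such string similarity with other signals (e.g., entity popularity) rather than using edit distance alone.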
In summary, some participants adopt the online lookup services of DBpedia, Wikidata, and Wikipedia. As a result, their experimental results are hard to reproduce because the source indexes change over time. Other participants built offline lookup services but did not specify their information retrieval techniques, hyper-parameter settings, or index sources, so their results are also hard to reproduce. To address this problem, we built a database index from the dump data of the October 2016 version of DBpedia and released public APIs for our entity search modules, enabling reproducibility in future studies.
Moreover, tabular data contains many numerical attributes, which allows us to exploit semantic labeling results for numerical columns. In MTab4D, we aggregate signals from the results of semantic labeling for numerical attributes (columns) using EmbNum+ [16] (a deep metric for distribution similarity calculation). Additionally, we use novel signals from the relations between column pairs to enhance overall matching performance.
The tabular data annotation tasks can be categorized into structural annotation and semantic annotation.
Structural annotation includes table type prediction [17], datatype prediction, table header annotation, subject column prediction, and holistic matching across tables [10]. In SemTab 2019, most tables are of horizontal relational type; headers are located in the first row, and the subject column is the first column of the table.
There are many previous studies on table semantic annotation, including schema-level matching, e.g., tables to classes [22], columns to properties [2,16,19,22] or classes [27], and data-level matching, e.g., rows [5,22] or cells [11,27] to entities. SemTab 2019 covers schema annotation (the CTA task), data annotation (the CEA task), and a novel column relation annotation (the CPA task).
DBpedia version
Due to different environment settings, such as the DBpedia version, it is hard to compare annotations directly. Table 18 reports the DBpedia versions used in table annotation studies. Quercini et al. [20] used a snippet of DBpedia in 2013. T2K [22] conducts experiments on the T2D dataset built on the 2014 version of DBpedia. Efthymiou et al. [5] introduce Wikipedia tables and an adapted version of the Limaye gold standard [11] built on the October 2015 version of DBpedia. More recent work (MantisTable [4]) builds its annotation system on the 2017 version of DBpedia. In this work, we follow SemTab 2019 and build our system on the October 2016 version of DBpedia.
Studies using DBpedia as the target knowledge graph
This paper presents MTab4D, a table annotation system that combines multiple matching signals from different table elements to address schema heterogeneity, data ambiguity, and noisiness. This paper also provides insightful analysis and extra resources on benchmarking semantic annotation with knowledge graphs. Additionally, we introduce the MTab4D APIs and graphical interfaces for reproducibility. Experimental results on the original and adapted datasets of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2019) show that our system achieves an impressive performance on the three matching tasks.
Future work
MTab4D could be improved along many dimensions, such as effectiveness, efficiency, and generality. Regarding efficiency, MTab4D could be parallelized, since the lookup steps and the probability estimations in Steps 2, 3, and 4 are independent. Regarding effectiveness, MTab4D performance could be improved by relaxing our assumptions:
The closed-world assumption (Assumption 1) might not hold in practice; improving the completeness and correctness of knowledge graphs might therefore improve MTab4D performance.
MTab4D assumes the input table is of horizontal relational type (Assumption 2). To make MTab4D work for other table types, e.g., vertical relational, we need further preprocessing steps that identify the table type and transform the table data into horizontal relational form.
Many tables could share a schema; e.g., tables on the Web could be divided across many web pages. Therefore, we can expect improved matching performance from stitching tables on the same web page (or domain) [10,21].
Lessons learned
This section discusses the lessons learned from the SemTab 2019 challenge.
Acknowledgements
The research was partially supported by the Cross-ministerial Strategic Innovation Promotion Program (SIP) Second Phase, “Big-data and AI-enabled Cyberspace Technologies” by the New Energy and Industrial Technology Development Organization (NEDO).
We would like to thank the SemTab 2019 challenge organizers for organizing the successful challenge. We also thank IBM Research and SIRIUS for their sponsorship of the challenge.
