
Abstract

Machine learning (ML) methods have demonstrated strong predictive capabilities when trained on large datasets. However, in domains where data is scarce or sensitive, ML models often exhibit sub-optimal performance. Our hypothesis is that semantically enriching the available training dataset can enhance the predictive power of ML models, particularly in data-scarce scenarios. To investigate this hypothesis, we propose novel neuro-symbolic approaches that augment tabular data with knowledge graph (KG) information, providing additional context and structure to improve model performance. Concretely, we introduce and examine several integration techniques of KG information through embeddings and explore how different KG embedding algorithms affect model performance, with a specific focus on accuracy and F2 scores. Our evaluation involves four distinct ML algorithms and four KG embedding techniques. We apply our approach to binary classification tasks on tabular data, including heart disease and chronic kidney disease. Our experimental results show improvements in performance particularly when tabular data is augmented with distance features computed in the embedding space. Notably, we achieve gains in F2 scores, such as an increase in XGBoost performance from 75.19% to 90.85% for heart disease prediction. These findings demonstrate the potential of KG-based augmentation to enhance ML performance.

Keywords

neuro-symbolic AI, knowledge graph embeddings, machine learning, data augmentation

1. Introduction

Machine learning (ML) has revolutionised various domains by providing powerful tools for pattern recognition, predictive analytics and data-driven decision-making. Techniques such as deep learning have achieved remarkable success in fields ranging from computer vision (Deng et al., 2009; Wortsman et al., 2022) to natural language processing (Kenton & Toutanova, 2019; Ramesh et al., 2022). These advancements have been largely driven by the availability of large datasets and the computational power to process them.

However, ML methods often face significant challenges related to data quality and availability. Data sparsity, imbalance and sensitivity can severely hinder the performance of ML models (Poulinakis et al., 2023). In the medical domain, one important task is predicting patient outcomes, for instance, determining the presence or absence of a disease based on clinical observations. This task often suffers from an insufficient amount of labelled data due to privacy concerns (Jarrett et al., 2019). Although advances have been made, models trained solely on tabular data fail to fully capture the domain’s complexity and semantics, limiting their ability to generalise effectively (Ruiz et al., 2024).

To overcome these limitations, neuro-symbolic (NeSy) artificial intelligence (AI) has emerged as a promising approach to integrate domain knowledge into ML models. NeSy AI combines the strengths of symbolic AI – known for logical reasoning and explainability – with sub-symbolic methods such as deep learning (Garcez & Lamb, 2023; Hitzler et al., 2022; Sarker et al., 2021). In particular, structured semantic knowledge such as knowledge graphs (KGs) has emerged as a key element in bridging the gap, providing a structured way to represent relationships between entities and capture domain-specific semantics (Bhatt et al., 2020; Gaur et al., 2018; Herron et al., 2023; Yin et al., 2019). KGs have been widely used in tasks such as KG completion (Lin et al., 2015) and link prediction (Wang et al., 2021). However, their potential to enhance ML predictions on tabular data by incorporating semantic knowledge through embeddings remains underexplored.

We propose integrating KGs into ML pipelines to enhance tabular data with structured, domain-specific information. Drawing upon techniques from the Semantic Web community, our approach begins by utilising ontologies to formalise domain semantics. We then construct KGs based on these ontologies, enriching the datasets with structured knowledge specific to the medical domain. Subsequently, we employ KG embeddings to transform the KGs into numerical vector representations suitable for ML algorithms. By embedding relationships and domain knowledge from KGs into these vectors, our methodology enhances the ML pipeline by augmenting the datasets with semantic knowledge, aiming to improve predictive performance – especially in data-scarce domains. This study specifically explores binary classification tasks in two medical prediction settings (heart disease and chronic kidney disease), where domain-specific structure is crucial for robust prediction. Our research is guided by the following research questions:

– RQ1: How can KGs be optimally infused into an ML pipeline to enhance performance in terms of accuracy and F2 score?
– RQ2: How does the choice of knowledge graph embedding algorithms affect the performance of machine learning models when used to augment tabular data?
– RQ3: How do different ML algorithms perform when KG-based information is integrated into the input data?

To address these research questions, we took an exploratory approach, systematically investigating each aspect step by step. For RQ1, we derived five sub-hypotheses to examine how KGs can be optimally integrated into ML pipelines to enhance performance metrics such as accuracy and F2 score (with the reason for selecting these metrics explained in Section 6.2). We tested these hypotheses using eight different approaches, each incorporating KGs and embeddings in various ways. For RQ2 and RQ3, we empirically evaluated the impact of different KG embedding algorithms and ML models across two medical domains – heart disease and chronic kidney disease prediction.
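For readers unfamiliar with the second metric, the following is a minimal sketch of the standard F-beta definition with beta = 2, which weights recall more heavily than precision. It is a reminder of the general formula only; the function name and the confusion-matrix counts are illustrative assumptions and are not taken from the experiments in this article.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score from confusion-matrix counts; beta > 1 favours recall."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Return 0.0 when there are no positives at all, to avoid dividing by zero.
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical counts, chosen only for illustration.
tp, fp, fn = 40, 10, 5
f1 = f_beta(tp, fp, fn, beta=1.0)  # balances precision and recall
f2 = f_beta(tp, fp, fn)            # penalises false negatives more strongly
print(f"F1 = {f1:.4f}, F2 = {f2:.4f}")
```

In this toy case recall exceeds precision, so F2 comes out higher than F1; when false negatives dominate, the ordering reverses.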

Building on our previous work (Llugiqi et al., 2024), we extend and formalise our methodology for integrating KG embeddings into ML pipelines. We employ two additional embedding techniques alongside those used previously to transform the KGs into numerical vector representations suitable for ML algorithms. We developed and tested different approaches based on five sub-hypotheses derived from our first research question, providing a comprehensive evaluation of their impact on model performance in heart and kidney disease prediction. Our study demonstrates the effectiveness of incorporating ontological knowledge into the ML training process, highlighting the potential for improved predictive performance in data-scarce domains and its applicability across various fields where ontologies can be developed or expanded.

The remainder of this article is organised as follows: In Section 2, we define the key concepts that we use in our work. This is followed by an overview of related work in Section 3. In Section 4, we present an overview of our proposed approach, with more detailed explanations about our approaches provided in Section 5. Our experimental analysis is discussed in Section 6, where we outline the goals and setup of our experiments. In Section 7, we present and analyse the outcomes of our experiments. Finally, we summarise our findings and outline directions for future work in Section 8.
2. Problem Description and Background Information

In this section, we outline the problem we aim to address, followed by introducing the key concepts that we use throughout the article, beginning with ontologies, KGs and knowledge graph embeddings.

2.1. Problem Description

In this study, we address the challenge of predicting heart disease and chronic kidney disease using patient medical records in tabular data format. Each dataset can be represented as a table $T \in \mathbb{R}^{n \times m}$, where $n$ is the number of patient instances and $m$ is the number of features or attributes. These features capture patient demographics, clinical measurements and diagnostic information essential for disease prediction. For heart disease prediction, we consider features such as age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar and resting electrocardiogram results. The kidney disease dataset similarly includes essential attributes, including age, blood pressure, specific gravity, albumin, blood glucose, blood urea, serum creatinine, haemoglobin and red and white blood cell counts.

We focus on binary classification to predict the presence or absence of these diseases. Formally, given the dataset $T$, the goal is to learn a function $f : \mathbb{R}^{m} \to \{0, 1\}$ that maps a patient’s feature vector to a binary outcome indicating disease presence (1) or absence (0). As illustrated in Figure 1, the tabular data $T$ serves as input to machine learning models, which then output predictions regarding disease presence.

Figure 1. Baseline for ML prediction on tabular data.

For example, given a patient’s data (e.g., age 62, female, asymptomatic, resting blood pressure 140, cholesterol 268, no fasting blood sugar, max heart rate 160, downsloping slope and thalassemia), our model aims to determine the likelihood of heart disease. Similarly, a record for kidney disease might involve attributes such as age 68, blood pressure 70, specific gravity 1.01 and blood urea 54. The objective is to accurately predict disease presence.
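To make the baseline setting concrete, the following minimal sketch learns a function from feature vectors to a binary label with a plain k-nearest-neighbours vote. It is an illustration under stated assumptions rather than the pipeline evaluated in this article: the three features (age, resting blood pressure, cholesterol), all values and all labels are invented, and no feature scaling is applied.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Predict a binary label for x by majority vote among its k nearest rows."""
    # Sort training rows by Euclidean distance to the query vector.
    dists = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical rows: [age, resting blood pressure, cholesterol].
train_X = [
    [62, 140, 268], [45, 120, 200], [70, 150, 290],
    [38, 118, 180], [66, 145, 280], [41, 125, 190],
]
train_y = [1, 0, 1, 0, 1, 0]  # 1 = disease present, 0 = absent (invented)

print(knn_predict(train_X, train_y, [64, 142, 270]))
print(knn_predict(train_X, train_y, [40, 119, 185]))
```

In practice the features would be scaled first, since unscaled cholesterol and blood pressure values dominate the distance.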

Due to the sensitive nature of medical data, datasets in this domain are often limited or partially incomplete, impacting model performance. This scarcity of data, combined with varying data quality, presents a challenge to achieving optimal prediction accuracy, necessitating robust preprocessing and, potentially, data augmentation strategies to improve model generalisability and reliability.

2.2. Background Information

Given the problem definition described in the previous subsection, our approach aims to augment these datasets by integrating semantic information to enhance predictive capabilities. To achieve this, we leverage ontologies to capture the domain knowledge, and then we use KGs to enrich the datasets with ontologies. We then need KG embeddings to transform the KGs into a vector space suitable for machine learning models. In the following we discuss each of these concepts in detail.

Ontology:

Originally a philosophical term, ontology refers to the study of existence and the nature of being. In computer science, Gruber (1993) redefined ontology as ‘explicit specifications of conceptualisations’, where a conceptualisation represents a simplified, abstract view of a domain to capture essential aspects. An ontology establishes a standardised vocabulary for knowledge sharing within a specific domain. Formally, an ontology represented as $O = (C, R, H^{C})$ encompasses a collection of concepts $C$, a set of relations $R$ and a hierarchical structure of concepts $H^{C}$. Each relation $r \in R$ indicates an association between pairs of concepts, such that $R \subseteq C \times C$. The concept hierarchy $H^{C}$ is a subset of $C \times C$, illustrating the relationships among concepts.
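The triple $O = (C, R, H^{C})$ can be pictured as a small data structure. The sketch below is one possible reading, not the representation used in this work: it assumes that $H^{C}$ stores (sub-concept, super-concept) pairs, and the class name, method names and example concepts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    concepts: set = field(default_factory=set)      # C
    relations: dict = field(default_factory=dict)   # R: name -> {(domain, range)}
    hierarchy: set = field(default_factory=set)     # H^C: (sub, super) pairs

    def add_subclass(self, sub, sup):
        self.concepts.update({sub, sup})
        self.hierarchy.add((sub, sup))

    def add_relation(self, name, domain, range_):
        self.concepts.update({domain, range_})
        self.relations.setdefault(name, set()).add((domain, range_))

    def ancestors(self, concept):
        """All super-concepts reachable through H^C (transitive closure)."""
        found, stack = set(), [concept]
        while stack:
            current = stack.pop()
            for sub, sup in self.hierarchy:
                if sub == current and sup not in found:
                    found.add(sup)
                    stack.append(sup)
        return found


# Hypothetical medical concepts, for illustration only.
onto = Ontology()
onto.add_subclass("ChestPain", "Symptom")
onto.add_subclass("Symptom", "ClinicalFinding")
onto.add_relation("hasSymptom", "Patient", "Symptom")
print(sorted(onto.ancestors("ChestPain")))
```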

Knowledge Graphs:

KGs expand on ontologies by capturing not only the structured relationships between concepts but also the specific instances and values within a domain. Originally popularised by Google in 2012 (Singhal, 2012) to enhance search understanding, KGs have since become integral in a range of applications, providing a structured, machine-readable format to represent knowledge. We define the KG as $KG = (E, R', L, Tr)$ where:

– $E$ represents the set of entities in the KG. Each entity $e \in E$ can represent a real-world concept, object or idea, such as ‘Person’ or ‘City’.
– $R'$ represents the set of instantiated relations between entities within the KG, such as ‘hasAge’ or ‘worksAt’.
– $L$ represents the set of literals, which are attributes associated with entities, such as numerical values or textual descriptions (e.g., ‘30’ or ‘Alice’).
– $Tr$ denotes a set of triples, where each triple $tr = (e_{1}, r, e_{2}) \in Tr$, with $r \in R'$, represents a fact or statement in the KG.

KG embeddings:

While KGs provide a structured representation of entities and their relationships, they can become highly complex as the number of entities and relations grows. To enable efficient computation, learning and reasoning over KGs, knowledge graph embeddings (KGEs) are commonly used (Bordes et al., 2013; Lin et al., 2015; Wang et al., 2014). KGEs transform entities and relations from a discrete symbolic space into a continuous vector space, capturing the structure and semantics of the KG in a form that is compatible with ML algorithms. KGE algorithms can be broadly categorised into three main types based on their methodology and objectives: translational distance models, semantic matching models and random walk-based models. In the following, we briefly describe the embedding algorithms used in our experiments: Node2Vec (Grover & Leskovec, 2016) and RDF2Vec (Ristoski & Paulheim, 2016) as random walk-based models that leverage the graph structure, DistMult (Yang et al., 2014) as a semantic matching model and TransH (Wang et al., 2014) as a translational distance model.

– Node2Vec uses a flexible random walk strategy that combines depth-first and breadth-first sampling, allowing it to capture various structural features of graphs, whether they are labelled or unlabelled, directed or undirected. An adjustable bias parameter steers the walks towards either targeted exploration of local neighbourhoods or a broader global search.
– RDF2Vec is designed specifically for RDF (Resource Description Framework) graphs within the Semantic Web. It generates embeddings for entities and relations by leveraging random walks to create sequences from the graph. These sequences are then transformed into embeddings using Word2Vec, making RDF2Vec particularly effective at capturing the semantic and relational attributes present in RDF data. While both RDF2Vec and Node2Vec utilise random walks, RDF2Vec focuses more on semantic relationships within the context of the Semantic Web, whereas Node2Vec emphasises structural characteristics applicable to a wider range of graph types.
– DistMult is a semantic matching model that uses a bilinear scoring function to evaluate the interactions between entities and relations in a KG. In this model, each relation is represented as a diagonal matrix, simplifying the bilinear form to a weighted element-wise multiplication of entity embeddings. While this approach effectively captures pairwise relationships, it inherently assumes that all relations are symmetric, which may restrict its expressiveness for datasets containing asymmetric relations.
– TransH is a translational model that represents entities as vectors and relations as hyperplanes in the embedding space. Each relation is associated with a specific hyperplane and a translation vector on that hyperplane. Entities are projected onto the hyperplane of a relation before the translation operation is applied. This method allows entities to have different representations in the context of different relations, thereby improving the model’s ability to represent complex and diverse types of relationships in a KG.
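The difference between the semantic matching and translational models described above can be illustrated with their scoring functions. The sketch below follows the commonly cited formulations of DistMult and TransH; it is not the implementation used in our experiments, it omits training entirely, and all vectors are made-up values.

```python
import math


def distmult_score(h, r_diag, t):
    """Bilinear score with a diagonal relation matrix: sum_i h_i * r_i * t_i."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r_diag, t))


def _project(v, w):
    """Project v onto the hyperplane whose unit normal vector is w."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    return [vi - dot * wi for vi, wi in zip(v, w)]


def transh_score(h, t, w, d):
    """Negative squared distance between the translated head and the tail,
    both projected onto the relation hyperplane (normal w, translation d)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    w = [wi / norm for wi in w]  # enforce a unit-length normal
    h_p, t_p = _project(h, w), _project(t, w)
    return -sum((hp + di - tp) ** 2 for hp, di, tp in zip(h_p, d, t_p))


# Illustrative 3-dimensional embeddings (invented values).
h, t = [1.0, 0.0, 2.0], [1.0, 1.0, 5.0]
r_diag = [0.5, 1.0, -1.0]
w, d = [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]

# DistMult gives the same score when head and tail are swapped.
print(distmult_score(h, r_diag, t), distmult_score(t, r_diag, h))
# TransH distinguishes the two directions.
print(transh_score(h, t, w, d), transh_score(t, h, w, d))
```

Swapping head and tail leaves the DistMult score unchanged, which is the symmetry limitation noted above, whereas the TransH score changes with direction.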

3. Related Work

We review related work on (i) the categorisation of neuro-symbolic approaches, positioning our work within these categories, (ii) the use of ML models in disease prediction and (iii) the enhancement of ML predictions with semantic knowledge. We conclude by discussing the novelty of our approach.

Categorisation of Neuro-Symbolic Approaches

In recent years, the field of neuro-symbolic AI has gained significant attention due to its potential to combine the strengths of both symbolic and sub-symbolic AI (Garcez & Lamb, 2023; Hitzler et al., 2022; Sarker et al., 2021). Symbolic AI excels at logical reasoning and explainability, while sub-symbolic approaches, such as deep learning, have proven effective in pattern recognition and data-driven decision-making. Combining these approaches, neuro-symbolic AI seeks to leverage the best of both worlds: the learning capability of sub-symbolic methods and the structured, interpretable reasoning of symbolic methods. Several efforts have focussed on categorising neuro-symbolic approaches. Kautz (2022) classifies neuro-symbolic systems into six types based on the interaction between neural networks and symbolic reasoning. Type 1 employs standard deep learning with symbolic inputs and outputs, while Type 2 combines neural networks with symbolic solvers, as seen in systems such as AlphaGo. Type 3 uses neural networks for tasks such as object detection, while symbolic systems handle complementary tasks such as query answering. In Type 4, symbolic knowledge is embedded into neural network training, whereas Type 5 incorporates symbolic rules as constraints in the loss function. Finally, Type 6 aims for fully integrated systems, merging symbolic reasoning with neural architectures, although fully mature combinatorial reasoning within such systems remains a challenge. Our approach belongs to Type 4 of Kautz’s classification, where symbolic knowledge is incorporated into the training process.

Similarly, Sheth and colleagues (Kursuncu et al., 2019; Sheth et al., 2019) identify three levels of knowledge infusion in neural models: shallow, semi-deep and deep. Shallow infusion introduces syntactic and symbolic knowledge at the input level, semi-deep infusion introduces external knowledge into intermediate layers via attention mechanisms or constraints, and deep infusion embeds structured, multi-layered knowledge into the network itself, aligning abstraction layers with learning stages. Our work adopts the shallow infusion approach by enriching input data with syntactic and symbolic knowledge, enhancing the model’s performance.

Dash et al. (2022) categorise methods for integrating domain-specific knowledge into deep neural networks into three main approaches: enhancing input data, modifying the loss function and adjusting the network architecture. Our research aligns with the input transformation category, where domain-specific knowledge is integrated by enriching the input data provided to the ML models.

Van Harmelen and ten Teije (2019) introduced a conceptual framework known as ‘boxology’, which outlines various patterns for integrating machine learning with semantic web technologies. Breit et al. (2023) expanded this framework by identifying 44 distinct patterns used in hybrid learning and reasoning techniques, based on a review of around 500 papers from 2010 to 2020. Our approach falls under the T patterns, specifically T4, where input transformations using symbolic knowledge are applied to improve model performance.

In summary, our approach falls under the shallow infusion category as described by Sheth et al. (2019), where syntactic and symbolic knowledge is introduced at the input level. It aligns with Type 4 in Kautz’s classification (Kautz, 2022), as symbolic knowledge is embedded into the training process. Furthermore, it belongs to the input transformation approach discussed by Dash et al. (2022), where domain-specific knowledge enhances the input data provided to machine learning models. Finally, our work corresponds to the T4 pattern in the ‘boxology’ framework, focussing on input transformations to improve model performance.

Machine Learning Models in Disease Prediction

The application of ML in healthcare has attracted significant research interest due to its potential. Kraišniković et al. (2025) proposed an approach leveraging fine-tuned BERT models to analyse German pathology reports. Their work highlights how domain-specific adaptations can enhance the interpretability and utility of ML models in medical diagnostics by effectively capturing contextual representations. Additionally, ML algorithms have been successfully employed in predicting diseases such as heart disease (Katarya & Meena, 2021; Rani et al., 2021; Shah et al., 2020; Yadav et al., 2023) and kidney disease (Chittora et al., 2021; Rady & Anwar, 2019; Vijayarani et al., 2015; Yildirim, 2017), using various techniques such as data preprocessing, feature selection and hyperparameter tuning to enhance prediction accuracy.

Several studies have also explored combining ML methods to further improve performance. For instance, Mohan et al. (2019) combined random forest and linear methods to enhance heart disease prediction, while Ali et al. (2019) introduced a framework for heart failure prediction using dual support vector machine (SVM) models – one for feature selection and the other for the prediction task.

Although these models demonstrate good predictive performance, they often rely on extensive preprocessing (Hassler et al., 2019), feature selection and hyperparameter tuning to achieve optimal results. Moreover, the effectiveness of ML models can be limited by insufficient or sub-optimal quality data. In this context, healthcare ontologies (Chute & Çelik, 2021; El-Sappagh et al., 2018; Ivanović & Budimac, 2014; Jovic et al., 2007; Pisanelli, 2004) offer a structured, semantically rich layer of information that can enhance the contextual understanding of ML models, which is further explored in this work.

Enhancing ML Predictions with Semantic Knowledge

Recent research has increasingly focussed on integrating semantic knowledge, such as KGs and ontologies, into ML models to enhance their performance. KGs have been widely applied in various domains, notably improving feature extraction and entity representation in natural language processing tasks. For instance, Moussallem et al. (2019) demonstrated how augmenting neural machine translation systems with KGs improved the translation quality by enhancing the semantic understanding of terminological expressions. Similarly, KG-based input enhancement has been shown to improve recommendation systems and community detection, enhancing both accuracy and explainability (Bhatt et al., 2020).

Moreover, KG-augmented neural networks have demonstrated improved performance in text classification and natural language inference tasks. Annervaz et al. (2018) showed that integrating structured knowledge from KGs not only improved model accuracy but also allowed models to perform well with less labelled data, addressing the common issue of data sparsity. Ziegler et al. (2017) adopted a similar approach by incorporating semantic knowledge through graph embeddings for credit card fraud detection, demonstrating how the injection of background knowledge – such as public holidays from DBpedia – into neural models could enhance classification outcomes.

In addition to NLP and fraud detection, Szilagyi and Wira (2018) applied semantic knowledge in smart building management by integrating taxonomies, schemas and logic rules with ML models. This hybrid system optimised building management by combining data-driven insights with rule-based reasoning, showing the potential of semantic knowledge in enhancing decision-making processes. Huang et al. (2023) introduced an Abductive Learning with KG approach that automatically mines logic rules from KGs and integrates them into ML models using a knowledge-forgetting mechanism to filter irrelevant information, thereby improving model performance even with limited labelled data.

In healthcare, Gazzotti et al. (2019) demonstrated how augmenting sparse electronic medical records (EMRs) with ontological resources improved the predictive capabilities of ML algorithms, specifically in hospitalisation prediction. Resources such as DBpedia, Wikidata and more domain-specific ontologies provide structured medical knowledge, enabling a richer representation of patient data. Similarly, Ruiz et al. (2024) introduced the PLATO method, which uses a KG to regularise a multilayer perceptron for tabular datasets, showing that semantic knowledge can help ML models handle high-dimensional and low-sample-size data more effectively.

These studies highlight the growing importance of integrating semantic knowledge into ML models to address challenges such as data quality, sparsity and explainability across different domains. Table 1 provides an overview of these studies, outlining the types of semantic knowledge used, the domains or tasks covered, the ML models applied, the incorporation of KGEs and the integration methods employed. These approaches generally fall into two main categories: (i) direct integration of structured knowledge through explicit rules or ontological features and (ii) representation learning via KGEs, where entities and relations are embedded into a continuous vector space, allowing downstream ML models to leverage the semantic structure.

Table 1. Summary of Related Work on Integrating Semantic Knowledge Into ML Models.

Paper	Domain/task	KG/ontology	ML algorithm	KGE	Method of including the KG
Huang et al. (2023)	Classification on tabular and image data	DBpedia, Wikidata, YAGO3, ConceptNet	Neural networks		Logic rules mining from KGs during learning
Gazzotti et al. (2019)	Healthcare / Predicting hospitalisation	DBpedia, Wikidata, domain-specific¹	SVM, RF, LogReg		Enriching EMR with features from ontologies
Szilagyi and Wira (2018)	Smart building management	Custom	Neural networks		Integrating data from sensors with knowledge
Ruiz et al. (2024)	High-dimensional tabular learning	Custom auxiliary KG	Multilayer Perceptron (MLP)		KG to regularise an MLP for tabular datasets
Moussallem et al. (2019)	Neural machine translation	DBpedia	RNN and Transformer	✓	1) Entity Linking + KGE 2) Semantic enrichment of KGE via entity labels
Annervaz et al. (2018)	Text classification and natural language inference	Freebase, WordNet	LSTM	✓	KG embeddings injected into the model for enriched representation learning
Ziegler et al. (2017)	Credit card fraud detection	DBpedia	Deep neural network	✓	Augmentation of dataset with semantic vector representation of countries and public holidays information
Our paper	Healthcare / Disease prediction	SNOMED, custom KG	KNN, NN, SVM, XGBoost	✓	Augmenting tabular data with KG embeddings

¹ Anatomical Therapeutic Chemical Classification, National Drug File - Reference Terminology, International Primary Care Classification

Although this prior research has demonstrated that embedding-based methods can improve ML performance, most existing work relies on large, general-purpose KGs, such as Wikidata or DBpedia. In contrast, our recent study (Llugiqi et al., 2024) introduced four approaches for augmenting tabular data with KGEs using two embedding algorithms, focussing on smaller, domain-specific ontologies. These initial methods illustrated the potential of embedding-driven enrichment, exploring how semantic context could be systematically incorporated into tabular datasets.

Building on that foundation, this article proposes four additional approaches that calculate various metrics in the embedding space to enrich tabular data with additional semantic context. Furthermore, we employ two more KG embedding algorithms, extending the methodology and formalising our approaches. We also perform a more thorough evaluation of the proposed techniques, applying them to the prediction of heart disease and chronic kidney disease.

Compared to other studies shown in Table 1, our method goes beyond simply embedding entities and relations: we exploit the embedding space itself to derive meaningful metrics that further enrich tabular features. Moreover, instead of relying on extensive, generic KGs, our approach leverages small, existing domain-specific ontologies (or selected subsets of existing large ontologies), which we populate with relevant tabular data to form task-specific KGs. Such domain-focussed strategies remain underexplored in medical prediction systems. By emphasising smaller ontologies and extracting deeper semantic insights from the embedding space, our work aims to advance semantic knowledge integration for ML in medical and other specialised domains.

4. Knowledge Graph Embedding-Based Augmentation for Tabular Data

To improve the performance of ML models, we leverage KGs to enrich tabular datasets with semantic information. This section outlines the two core steps of our approach. First, in Section 4.1, we explain the construction of KGs using instances from the tabular data, as shown in the top part of Figure 2. This step focuses on building KGs that capture deeper relationships within the data. Next, in Section 4.2, we focus on integrating KG embeddings into the ML pipeline. This includes various augmentation strategies designed to enhance model performance by incorporating structural and relational information from the KG into the training data, as depicted in the bottom part of Figure 2.

Figure 2. Overview of the proposed approach including (i) KG construction (top) and (ii) knowledge injection into data (bottom) (adapted from Llugiqi et al. (2024) following the boxology notation (van Bekkum et al., 2021)).

4.1. Knowledge Graph Construction

For our approach to enrich ML input data with supplementary knowledge, constructing KGs is essential. They serve as structured representations of domain knowledge, capturing the semantics of the data and allowing for the integration of ontological information into datasets. This enrichment allows ML models to leverage contextual and relational information, enhancing their predictive capabilities. The upper part of Figure 2 illustrates the methodology used for building these KGs, which represent data that was initially captured in tabular form. The following steps provide a formal description of this construction process.

Step 1: Ontology Definition

The first step in constructing the KG is defining an ontology, which is used to capture domain semantics and provide a structured framework for enriching the datasets. There are different ways to develop an ontology, represented as $O = (C, R, H^{C})$ . We considered (i) creating a new ontology from scratch, (ii) extending and reusing existing ontologies to include additional domain-specific information, or (iii) extracting relevant components from a more extensive ontology (see Section 6.2).

Step 2: Mapping Definition

The process of mapping dataset features to the concepts in the ontology is crucial for using instances from tabular data to populate the ontology and, consequently, construct a KG. A key aspect of this mapping process is the mapping function $\psi : F \to C$, where $F = \{f_{1}, f_{2}, \dots, f_{m}\}$ represents the features within the tabular dataset $T$, a matrix of dimensions $n \times m$ ($n$ instances and $m$ features, as in Section 2.1). This function entails the manual mapping of each feature $f_{i}$ to a corresponding concept $c \in C$ in the ontology $O$.

Step 3: Knowledge Graph Population

The knowledge graph is constructed by utilising the ontology $O$ along with the instances from the tabular data $T$ and applying the mapping function $ψ : F \to C$ . This process is automated through a Python script. We define the KG as $K G = (E, R^{'}, L, T r)$ where:

– $E$ signifies the set of entities, with each entity $e_{i} \in E$ corresponding to an instance in the tabular data derived from a row $m_{i}$ in $T$ ,
– $R^{'}$ represents the set of instantiated relations within the KG, which includes relations from $R$ through the mapping $ψ$ , and captures direct relationships between entities in $E$ or between an entity and a literal value,
– $L$ represents the set of literals, which are attribute values associated with entities, such as numerical data (e.g., 30) from $T$ ,
– $T r$ consists of the triples generated for each feature value in an instance row $m_{i}$ , following the mapping $ψ$ . For instance, if the feature $f_{glucoseLevel}$ of an instance $e_{i}$ has a glucose level value of $95$ , the associated triple would be $(e_{i}, r_{hasGlucoseLevel}, 95)$ , indicating the relationship $r_{hasGlucoseLevel}$ between entity $e_{i}$ and the literal value $95$ .

This preprocessing phase ensures that features from the tabular data $T$ are semantically represented within the KG using the defined ontology $O$ .
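
Steps 2 and 3 can be illustrated with a minimal sketch. It is not the authors' script: the mapping `psi`, the toy rows and the `ex:` prefix are hypothetical, and the mapping is simplified so that each feature points directly to the relation used in its triples.

```python
# Hypothetical names throughout: `psi`, `rows` and the `ex:` prefix are
# illustrative assumptions, not taken from the authors' implementation.

# psi: manually defined mapping from tabular feature names to KG relations.
psi = {
    "glucoseLevel": "ex:hasGlucoseLevel",
    "age": "ex:hasAge",
}

# T: a toy table with m = 2 rows and n = 2 features.
rows = [
    {"glucoseLevel": 95, "age": 54},
    {"glucoseLevel": 130, "age": 61},
]


def populate_kg(rows, psi, prefix="ex:instance_"):
    """Return (entities, literals, triples) generated from the table rows."""
    entities, literals, triples = [], set(), []
    for i, row in enumerate(rows):
        entity = f"{prefix}{i}"        # one entity e_i per row m_i
        entities.append(entity)
        for feature, value in row.items():
            if feature not in psi:     # features without a mapping are skipped
                continue
            literals.add(value)
            triples.append((entity, psi[feature], value))
    return entities, literals, triples


entities, literals, triples = populate_kg(rows, psi)
print(triples[0])  # ('ex:instance_0', 'ex:hasGlucoseLevel', 95)
```

Each row contributes one entity and one triple per mapped feature, so the toy table yields two entities and four triples.
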
4.2. Integrating KG Embeddings Into ML Pipeline

In Section 4.1, we outlined the construction of enriched data structures that capture deeper semantics beyond the raw data. This section will now focus on transforming these enriched structures into a vectorised format suitable for ML, and on the optimal strategies for augmenting the input data, as illustrated in the lower part of Figure 2.

Step 4: Knowledge Graph Embedding Generation

Once the KG has been populated, the next step is to prepare it for ML model training. This requires transforming the KG into a vector space representation suitable for ML models, using KGE algorithms. Viewing the knowledge graph compactly as $K G = (E, R)$ , i.e., its entities and relations, the goal of the embedding algorithm is to map entities $E$ and relations $R$ into a continuous vector space. Formally, this can be represented as a function $ϕ : E \cup R \to \mathbb{R}^{d}$ , where $ϕ$ is the embedding function that maps each entity and relation in the KG to a $d$ -dimensional real-valued vector. This transformation aims to represent the KG in a way that preserves its semantic information while remaining computationally efficient. Steps 5 and 6 describe how these embeddings are used, either directly or to compute derived features, to augment the dataset and improve ML performance.

Step 5 & 6: Tabular Data Enrichment and ML Model Training

After computing KGEs, our objective is to explore how these embeddings can be integrated to enhance the performance of ML models. We experimented with different approaches for augmenting the training set using KGEs. First, we established a baseline that trains ML models using only tabular data $T$ , following the traditional approach shown in Figure 1, where no KG information is added. We then experimented with different ways of enhancing the dataset with KGEs and training the ML models, which are described in detail in the following section.

5. Proposed Approaches for Tabular Data Enrichment and ML Model Training

In this section, we outline the eight distinct approaches we explored for integrating KG embeddings into the training dataset, each designed to evaluate the impact of enriched semantic information on model performance.¹

5.1. Embeddings As ML Model Inputs (EmbedOnly)

Our initial objective was to investigate whether training a model on the vector representations generated from these KGs, using various embedding algorithms, could reveal underlying patterns and relationships within the data. We therefore define our first sub-hypothesis as follows.

H1.1: Using the embeddings alone, without any additional tabular data, could provide meaningful insights and capture latent relationships that enhance the model’s predictive capability.

To test this, we first examined the EmbedOnly approach, which uses only the embeddings in order to assess their standalone effectiveness, as shown in Figure 3.

Figure 3.

Embedding vectors, highlighted in yellow, serve as inputs to the ML model.

Each instance $p_{i}$ in the tabular data $T$ is represented by an entity in the $K G$ ; together, these entities form the subset $P \subseteq E$ . The embedding function $ϕ : P \cup R \to \mathbb{R}^{d}$ maps each instance entity $p_{i} \in P$ to a $d$ -dimensional vector. Consequently, during both the training and testing phases, only the embeddings $\{ϕ (p) \mid p \in P\} \subset \mathbb{R}^{d}$ derived from the instance entities are used, as outlined in Algorithm 1. This ensures that the model is trained and evaluated only on the vector representations, capturing the semantic relationships within the KG relevant to the instances.
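
As a minimal sketch of EmbedOnly, assume the trained embeddings are available as a lookup table from KG nodes to vectors. The dictionary `phi`, the node names and the random vectors below are hypothetical stand-ins for a trained KGE model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding table: phi maps every KG node to a d-dimensional
# vector. Random values stand in for a trained KGE model.
d = 4
nodes = ["patient_0", "patient_1", "patient_2", "ex:HeartDisease", "ex:Cholesterol"]
phi = {node: rng.normal(size=d) for node in nodes}

# P: the instance entities, in the same order as the rows of T.
P = ["patient_0", "patient_1", "patient_2"]

# EmbedOnly: the model input is just the stacked instance embeddings.
X_embed_only = np.vstack([phi[p] for p in P])
print(X_embed_only.shape)  # (3, 4)
```

Only the rows for entities in $P$ enter the feature matrix; embeddings of ontology classes and other nodes are not used as inputs.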

5.2. Combining Embeddings with Tabular Data Features (EmbedAugTab)

Building upon the EmbedOnly approach, we define our second sub-hypothesis as follows.

H1.2: Combining the KG-derived embeddings with traditional tabular data might enhance model performance by introducing additional relational information from the KG structure.

This led us to design approaches that integrate both embeddings and tabular features, aiming to see if the KG information could complement and enrich the existing dataset. Thus, we investigated EmbedAugTab and other subsequent approaches that leverage embeddings for data augmentation based on this intuition.

The EmbedAugTab approach involves training ML algorithms on datasets that integrate the original tabular data with additional columns derived from embeddings, as illustrated in Figure 4 and presented in Algorithm 2. For each instance $p$ in the tabular dataset $T$ , we augment $T$ by appending the embedding vector $ϕ (p)$ corresponding to the instance $p \in P$ . The embedding vector $ϕ (p)$ is generated using the embedding function $ϕ : P \cup R \to \mathbb{R}^{d}$ . This process yields an augmented tabular matrix $T^{'}$ with dimensions $m \times (n + d)$ , where each row contains the original features from $T$ concatenated with the $d$ -dimensional embedding vector of the corresponding instance. The resulting augmented matrix $T^{'}$ is then used to train ML models, leveraging both the original tabular features and the vector representations of the instances. In the healthcare domain, each instance $p$ represents a patient, and the embedding vectors are added for each patient in order to improve the models’ ability to predict the presence or absence of specific diseases, such as heart disease or chronic kidney disease.

Figure 4.

Tabular dataset enrichment with embedding vectors, highlighted in yellow, used as inputs for the ML model.
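
The concatenation behind EmbedAugTab can be sketched as follows. The matrices `T` and `V` are random placeholders for the tabular data and the row-aligned instance embeddings; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 5, 3, 4              # toy sizes: instances, features, embedding dim

T = rng.normal(size=(m, n))    # stand-in for the original tabular matrix
V = rng.normal(size=(m, d))    # stand-in for phi(p_i), row-aligned with T

# EmbedAugTab: append the d embedding columns to the n original columns.
T_aug = np.hstack([T, V])
print(T_aug.shape)  # (5, 7)
```

The result has $n + d$ columns, with the original features kept unchanged in the first $n$ positions.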

5.3. Tabular Dataset Enrichment with Distance Measures From Knowledge Graphs (DistAugTab)

Utilising embedding vectors directly to augment the tabular data may introduce noise. Thus, our sub-hypothesis is defined as follows.

H1.3: Extracting specific structural information from the embedding space, such as distance matrices or cluster characteristics, might enhance model performance by providing more interpretable features for distance-based models.

This led us to introduce the DistAugTab and ClustAugTab approaches, which aim to selectively extract meaningful information from the embeddings to improve the learning process.

In the DistAugTab approach, we enhance the tabular dataset $T$ by incorporating additional features derived from embedding-based distance calculations, as illustrated in Figure 5 and presented in Algorithm 3. For each instance $p_{i}$ in the dataset $T$ , we compute its embedding vector ${\vec{v}}_{i}$ using the embedding function $ϕ$ . To further enrich the representation of each instance, we introduce $| C |$ additional columns, where $C$ here denotes the set of target classes rather than the ontology concepts of Section 4.1.

Figure 5.

Tabular dataset enrichment with distance measures from the KG (highlighted in yellow), used as inputs for the ML model.

The new columns are calculated by determining the Euclidean distance between the embedding vector ${\vec{v}}_{i}$ of instance $p_{i}$ and the centroid ${\vec{c}}_{C_{j}}$ of each target class $C_{j} \in C$ . The centroid ${\vec{c}}_{C_{j}}$ is calculated as the mean of the embedding vectors ${\vec{v}}_{i}$ for all instances $p_{i}$ belonging to the target class $C_{j}$ . These distance-based features are added to the augmented dataset $T^{'}$ , resulting in an expanded dataset with dimensions $m \times (n + | C |)$ , where $m$ is the number of instances and $n$ is the original number of features.

By including these distance features, we aim to capture how closely each instance’s embedding aligns with the class centroids, thereby potentially improving the model’s ability to differentiate between target classes. For example, in the healthcare domain, the target classes could represent the presence or absence of a disease, $C = \{disease, noDisease\}$ , and the distance features are intended to help refine the model’s predictions based on proximity to the centroids of the disease and noDisease classes.
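
The distance features can be sketched in a few lines of NumPy. The text does not state which rows are used to compute the centroids; the sketch assumes they come from the training rows only, so that test labels do not leak into the features. All data and the split below are hypothetical.

```python
import numpy as np


def class_centroids(V_train, y_train):
    """Mean embedding per target class, computed on training rows only."""
    classes = np.unique(y_train)
    centroids = np.vstack([V_train[y_train == c].mean(axis=0) for c in classes])
    return classes, centroids


def centroid_distances(V, centroids):
    """Euclidean distance from every embedding to every class centroid."""
    # Broadcasting: (m, 1, d) - (1, |C|, d) -> (m, |C|, d)
    diff = V[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diff, axis=2)


rng = np.random.default_rng(0)
m, n, d = 6, 3, 4
T = rng.normal(size=(m, n))          # stand-in tabular features
V = rng.normal(size=(m, d))          # stand-in instance embeddings
y = np.array([0, 0, 0, 1, 1, 1])     # 0 = noDisease, 1 = disease
train = np.array([0, 1, 3, 4])       # hypothetical training rows

classes, centroids = class_centroids(V[train], y[train])
D = centroid_distances(V, centroids)  # shape (m, |C|)
T_aug = np.hstack([T, D])
print(T_aug.shape)  # (6, 5)
```

With two target classes, each instance gains two columns, giving $m \times (n + |C|)$ overall.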

5.4. Embedding and Distance Features Augmented Tabular Data (EmbedDistAugTab)

This approach augments the tabular dataset by incorporating both embedding vectors and distance-based features, as depicted in Figure 6 and presented in Algorithm 4. For each instance $p_{i}$ , the augmented dataset $T^{'}$ is expanded by adding $d + | C |$ new columns, where $d$ represents the embedding dimension and $| C |$ denotes the number of target classes. This results in an enhanced dataset with dimensions $m \times (n + d + | C |)$ , combining the original features, embedding vectors and distances to class centroids.

Figure 6.

Tabular dataset enrichment with distance measures from the KG and vector embeddings, highlighted in yellow, used as inputs for the ML model.

5.5. Tabular Dataset Enrichment with Embedding Clusters’ Membership (ClusterAugTab)

In this approach, referred to as ClusterAugTab (and reported as ClustAugTab in the results tables), we augment the tabular dataset by first computing embeddings for the training data, $E_{train} = \{ϕ (p_{i}) \mid p_{i} \in T_{train}\}$ , where $ϕ : E \cup R \to \mathbb{R}^{d}$ , and then clustering these embeddings into $k$ clusters using the K-means algorithm, as shown in Figure 7 and presented in Algorithm 5. We write $k$ for the number of clusters to keep it distinct from the number of features $n$ . Each instance $p_{i} \in T$ is assigned a cluster membership based on its embedding, which is added as an additional feature to the dataset. The augmented dataset $T^{'}$ now has dimensions $m \times (n + 1)$ , where the original $n$ features are extended by one column representing the cluster membership derived from the embeddings. This enhanced dataset is then used to train the ML model, with the added cluster-level information facilitating the grouping of similar instances. By capturing these underlying patterns in the embeddings, the model may achieve improved predictive performance.

Figure 7.

Tabular dataset enrichment with embedding clusters’ membership (highlighted in yellow), used as ML model inputs.
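
A minimal sketch of the cluster-membership feature, assuming scikit-learn's `KMeans` as the clustering implementation (the text names only the K-means algorithm). The data, the split and the number of clusters are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
m, n, d, k = 8, 3, 4, 2
T = rng.normal(size=(m, n))      # stand-in tabular features
V = rng.normal(size=(m, d))      # stand-in instance embeddings
train = np.arange(6)             # hypothetical training rows

# Fit K-means on the training embeddings only, then assign every instance
# (training and test) to its nearest cluster centre.
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(V[train])
cluster_id = kmeans.predict(V)

# One extra column holding the cluster membership.
T_aug = np.hstack([T, cluster_id.reshape(-1, 1)])
print(T_aug.shape)  # (8, 4)
```

The cluster identifier is a nominal value; whether it is passed to the model as an integer or encoded differently is a design choice that this sketch leaves open.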

5.6. Tabular Dataset Enrichment with Embeddings and Embedding Clusters’ Membership (EmbedClusterAugTab)

In this approach (reported as EmbedClustAugTab in the results tables), the tabular dataset is augmented by integrating both embedding vectors and cluster memberships, as shown in Figure 8 and detailed in Algorithm 6. For each instance $p_{i}$ , the augmented dataset $T^{'}$ is expanded by appending both the $d$ -dimensional embedding vector and the corresponding cluster membership, where $d$ represents the embedding dimension. The resulting dataset has dimensions $m \times (n + d + 1)$ , combining the original features, the learned embeddings and the cluster assignments derived from the embeddings. This enriched representation is intended to let the model leverage both latent structure and group similarity for improved predictive performance.

Figure 8.

Tabular dataset enrichment with embedding clusters’ membership and vector embeddings, highlighted in yellow, used as ML model inputs.

5.7. Tabular Dataset Enrichment with Feature Interaction (InteraAugTab)

To further optimise the integration of KG information, we hypothesised that interactions between embeddings and existing features could reveal complex patterns. We define the sub-hypothesis as follows.

H1.4: Some classes may only be distinguishable through the combined effects of KG embeddings and tabular data.

By computing these interaction terms, we aimed to enrich the feature space, enabling the model to capture dependencies arising from the integration of KG-derived and tabular features. This idea, implemented in the InteraAugTab approach, offers a multi-dimensional perspective that aims to improve accuracy and F2 score.

The InteraAugTab approach augments the tabular dataset with interaction terms between the original features and the embedding components, as illustrated in Figure 9 and presented in Algorithm 7. For each instance $p_{i}$ , the embedding vector $v_{i}$ is computed using the embedding function $ϕ$ . Interaction terms are then generated by multiplying each feature of $p_{i}$ with each component of the embedding vector $v_{i}$ , yielding $n \times d$ pairwise products per instance. The augmented dataset $T^{'}$ thus contains the original features and the interaction terms. This results in an enhanced dataset with dimensions $m \times (n + (n \times d))$ , where $n$ is the number of original features and $d$ is the embedding dimension. The interaction terms enable the model to capture complex relationships between the original features and the latent information in the embeddings, potentially leading to improved predictive performance.

Figure 9.

Tabular dataset enrichment with feature interaction (highlighted in yellow), used as ML model inputs.
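
The interaction terms can be sketched as a per-instance outer product between the feature vector and the embedding vector, which matches the stated $n \times d$ extra columns. The matrices below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 5, 3, 4
T = rng.normal(size=(m, n))    # stand-in tabular features
V = rng.normal(size=(m, d))    # stand-in instance embeddings

# Pairwise products per instance: inter[i, j, l] = T[i, j] * V[i, l],
# flattened to n * d columns.
inter = np.einsum("ij,il->ijl", T, V).reshape(m, n * d)

# InteraAugTab: original features plus interaction terms.
T_aug = np.hstack([T, inter])
print(T_aug.shape)  # (5, 15)
```

Column `j * d + l` of `inter` holds the product of feature `j` and embedding component `l`. Because the number of added columns grows with $n \times d$ , the feature space expands quickly for larger embedding dimensions.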

5.8. Tabular Dataset Enrichment with Embedding and Feature Interaction (EmbedInteraAugTab)

In this approach, referred to as EmbedInteraAugTab, we augment the tabular dataset by incorporating both the embedding vectors and the interaction terms between the original features and the embedding vectors, as shown in Figure 10 and presented in Algorithm 8. As in the InteraAugTab approach, the embeddings are computed first and the interaction terms are then derived from them. The augmented dataset $T^{'}$ thus contains the original features, the embedding vectors and the interaction terms, resulting in dimensions $m \times (n + d + (n \times d))$ , where $n$ is the number of original features and $d$ is the embedding dimension.

Figure 10.

Tabular dataset enrichment with feature interaction and vector embeddings, highlighted in yellow, used as ML model inputs.

To address the risk of high dimensionality, which can adversely affect the performance of certain models, we implemented a dimensionality reduction step using the PCA algorithm (Abdi & Williams, 2010). This reduction was specifically applied to approaches integrating embeddings, namely EmbedOnlyRed, EmbedAugTabRed and EmbedDistAugTabRed. We define our sub-hypothesis as follows:

H1.5: Reducing the dimensionality of the embedding-augmented datasets will improve model performance by eliminating redundant or noisy features, thereby retaining only the most informative ones.
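
A minimal sketch of the reduction step, assuming scikit-learn's `PCA`, applied to the whole augmented matrix and fitted on the training rows only. The number of retained components is not given in this section, so the value 5 below is an arbitrary placeholder, as are the data and the split.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
m, n, d = 20, 3, 8

# Stand-in for an embedding-augmented matrix with n + d columns.
T_aug = np.hstack([rng.normal(size=(m, n)), rng.normal(size=(m, d))])
train, test = np.arange(15), np.arange(15, 20)   # hypothetical split

# Fit the projection on training rows, then apply it to both splits.
pca = PCA(n_components=5).fit(T_aug[train])      # 5 is an arbitrary choice
X_train_red = pca.transform(T_aug[train])
X_test_red = pca.transform(T_aug[test])
print(X_train_red.shape, X_test_red.shape)  # (15, 5) (5, 5)
```

Fitting on the training rows alone keeps the test data out of the learned projection.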

6. Experimental Analysis

This section presents the experimental goals that guide our investigation (Section 6.1) and the experimental setup and materials used to achieve them (Section 6.2).

6.1. Experimental Goals

The goal of our experimental evaluation is to investigate the use of KGs through KGEs to enhance the predictive performance of ML methods. We leverage the semantic structure of the ontologies to represent the instances with richer semantics and then, through our proposed approaches, use these representations to augment the tabular dataset for better ML performance. The specific goals of our experiments are as follows:

Optimal Integration of KGs into ML Pipelines (RQ1): We examine effective methods for incorporating KGs into ML pipelines to improve model performance, with a particular emphasis on accuracy and F2 score. This entails analysing the integration strategies that can enhance the predictive power of ML models.

Influence of KG Embedding Techniques (RQ2): We seek to understand how different KG embedding algorithms affect performance outcomes in ML models when utilised to enrich tabular data. This exploration focuses on identifying which embedding techniques yield the best enhancements in model accuracy and F2 score.

Comparative Analysis of ML Algorithms with KG-Enhanced Data (RQ3): We assess the relative performance of various ML algorithms when supplemented with KG-derived information. This analysis will highlight how distinct algorithms exploit KG semantics to boost the predictive performance.

6.2. Experiment Setup

Datasets. In our experiments, we used two publicly available datasets from Kaggle: the Heart Disease² and Chronic Kidney Disease³ datasets. Both datasets are used for binary classification tasks, where the goal is to predict the presence (disease) or absence (no disease) of the disease.

– The heart disease dataset consists of 303 instances, with 14 features capturing various patient health indicators relevant to diagnosing heart disease, such as heart rate and cholesterol.
– The chronic kidney disease dataset contains 400 instances and 25 features, capturing various health metrics related to chronic kidney disease, such as blood pressure and albumin levels.

Both datasets contain a mix of categorical and numerical attributes, making them suitable for testing the integration of KGE with tabular data. Additional details about the datasets’ features can be found in Appendix 8.

Ontologies. For heart disease, we used three different ontologies:

– The Small ontology, denoted as $O = (C, R, H^{C})$ , is a handcrafted model derived from Trepan Reloaded (Confalonieri et al., 2021) that encapsulates the features found in the Heart Disease dataset.
– The Extended ontology, represented as $O = (C^{'}, R^{'}, H^{C})$ , is an extension of an existing ontology⁴ $O = (C, R, H^{C})$ which incorporates additional features from the dataset.
– The Snomed ontology is derived as a sub-ontology from the SNOMED-CT ontology⁵. This ontology was constructed using the methodology proposed by Chen et al. (2019), which focuses on extracting relevant ontological structures from SNOMED-CT based on a predefined set of seed concepts required in the output. Initially, we selected the relevant concepts in the SNOMED-CT browser⁶ that align with the dataset’s features. These concepts served as seed concepts in the extraction process, ensuring the resulting ontology included them.

For chronic kidney disease, we only used the third approach, extracting a sub-ontology from SNOMED-CT, due to the lack of ontologies specific to this domain. An overview of the ontologies used for both domains, including the count of classes and properties, is presented in Table 2.

Table 2.
Details of the Ontologies for Heart and Kidney Disease Domain.

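| Domain | Ontology | Classes | Object prop. | Data prop. |
| --- | --- | --- | --- | --- |
| Heart | Small | 29 | 6 | 10 |
| Heart | Extended | 1664 | 6 | 10 |
| Heart | Snomed | 80 | 24 | 10 |
| Kidney | Snomed | 113 | 27 | 21 |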

KG embedding methods. We used four embedding methods: Node2Vec, RDF2Vec, DistMult and TransH. The first two methods were selected as random-walk-based models in the embedding landscape, while DistMult and TransH were chosen based on the findings in the Sem@K paper (Hubert et al., 2023), which identified them as the best-performing models of the semantic matching and geometric model families, respectively. An overview of these models is provided in Section 2.

Table 3 lists the parameters used for the embedding methods, tailored to the specific characteristics of the KGs. The embedding dimensions ([64, 128, 100]) were selected to provide a range of vector sizes that are large enough to capture meaningful patterns but small enough to maintain computational efficiency. For Node2Vec and RDF2Vec, the walk length and the number of walks per node were adapted to the size and complexity of each ontology: for smaller ontologies we used shorter walks and fewer iterations, while larger or more complex ontologies required slightly longer walks. We averaged performance across the three embedding dimensions to provide a more robust evaluation of each method and also computed the standard deviation to capture variability across these vector sizes.
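
To make the roles of walk length and walks per node concrete, the following sketch generates uniform random walks over a toy adjacency list. It is a simplification: Node2Vec additionally biases the transitions with return and in-out parameters, and RDF2Vec walks also include the predicates along the path. The graph and function name are hypothetical.

```python
import random


def random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Generate uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency.get(walk[-1], [])
                if not neighbours:     # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


# Toy undirected graph: one patient linked to two ontology classes.
adjacency = {
    "patient_0": ["ex:Patient", "ex:HighCholesterol"],
    "ex:Patient": ["patient_0"],
    "ex:HighCholesterol": ["patient_0"],
}

walks = random_walks(adjacency, walk_length=5, walks_per_node=2)
print(len(walks))  # 6
```

The resulting walks play the role of sentences for a skip-gram style model, where the window parameter controls how many neighbouring nodes in a walk count as context.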

Table 3.
Parameters for Different KGE Methods for Different KGs.

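| Domain | KG | Dimensions | Node2Vec walk length | Node2Vec walks | Node2Vec window | RDF2Vec depth | RDF2Vec walks/node | RDF2Vec window | TransH & DistMult params |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Heart | Small | [64,128,100] | 40 | 200 | 5 | 4 | 100 | 5 | default |
| Heart | Extended | [64,128,100] | 60 | 200 | 10 | 6 | 150 | 10 | default |
| Heart | Snomed | [64,128,100] | 50 | 200 | 7 | 5 | 100 | 7 | default |
| Kidney | Snomed | [64,128,100] | 50 | 200 | 7 | 10 | 100 | 7 | default |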

ML models. In our experiments, we used four models: K-nearest neighbours (KNN), support vector machine (SVM), extreme gradient boosting (XGB) and a simple feedforward neural network (NN). KNN and SVM were chosen because they are distance-based, aligning with our hypothesis that KGEs, which are also distance-based, would enhance their performance. XGB and NN, in contrast, were included to test the effect of KGEs on more complex, non-distance-based models.

To ensure robust evaluation, we used stratified 5-fold cross-validation, maintaining the same class distribution in each fold. For reproducibility, a fixed random seed was applied throughout the experiments. We initially experimented with a wide range of hyperparameters and, to reduce computational cost, we narrowed the range to focus on the best-performing configurations, as shown in Table 4. Results were averaged to ensure consistency across different configurations.
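
The evaluation protocol can be sketched as follows for KNN, assuming scikit-learn (the section does not name the library; scikit-learn spells the parameter `n_neighbors`). The synthetic data, the seed value and the use of shuffling are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, fbeta_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for an augmented table.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Stratified 5-fold CV with a fixed seed, evaluated for each grid value.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for k in [20, 25, 30, 35, 40]:
    acc, f2 = [], []
    for train_idx, test_idx in cv.split(X, y):
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        acc.append(accuracy_score(y[test_idx], pred))
        f2.append(fbeta_score(y[test_idx], pred, beta=2))
    # Mean accuracy and mean F2 over the five folds.
    results[k] = (float(np.mean(acc)), float(np.mean(f2)))

for k, (mean_acc, mean_f2) in results.items():
    print(k, round(mean_acc, 3), round(mean_f2, 3))
```

Stratification keeps the class ratio of each fold close to that of the full dataset, which matters for small datasets such as the ones used here.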

Table 4.
Parameter Grid for ML Methods.

| Method | Parameter | (Grid) Values |
| --- | --- | --- |
| KNN | n_neighbours | [20, 25, 30, 35, 40] |
| SVM | C; kernel; probability | [0.9, 1.0, 1.1, 1.2]; rbf; True |
| XGB | learning_rate | [0.08, 0.09, 0.1, 0.11] |
| NN | layers; activation; loss; optimiser | [32, 16, 1]; [relu, relu, sigmoid]; binary crossentropy; adam |

Evaluation metrics. In our experiments, we computed both accuracy and F2 score to assess model performance. We selected the F2 score as a key metric due to its relevance in disease prediction tasks, where maximising true positive cases is critical for effectively identifying patients with the disease.
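
The following sketch shows why the F2 score favours recall. It uses the standard F-beta definition in terms of confusion counts; the two classifiers and their counts are hypothetical.

```python
def fbeta(tp, fp, fn, beta=2.0):
    """F-beta score from confusion counts.

    With beta = 2, a false negative weighs four times as much as a
    false positive in the denominator.
    """
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Two hypothetical classifiers with the same number of errors (30 each):
print(round(fbeta(tp=80, fp=10, fn=20), 4))  # 0.8163 (more missed patients)
print(round(fbeta(tp=80, fp=20, fn=10), 4))  # 0.8696 (more false alarms)
```

Both classifiers make the same number of mistakes, but the one that misses fewer diseased patients obtains the higher F2 score.
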
7. Results

This section presents the experimental results obtained with the setup described in Section 6.2, starting with heart disease prediction, followed by kidney disease prediction. For each research question, we report the concluding results.

7.1. Heart Disease Prediction

Table 5 shows the average accuracy and F2 scores, along with the standard deviation across different vector sizes of the embeddings, for four different ML models (KNN, NN, SVM, XGB). The results include the models’ baseline performance on tabular data alone, compared with their performance when the data is augmented using embeddings generated by four KG embedding algorithms (Node2Vec, RDF2Vec, DistMult, TransH). Additional results, including average recall with standard deviation across vector sizes, evaluated using different KGs, models, approaches and embedding methods, are provided in Table 11 in Appendix 8. In the following, the results are analysed based on the research questions.

Table 5.
Average Accuracy and F2 Scores (With Standard Deviation Across Different Vector Sizes), Across Different KGs, for Various Models, Approaches and Embedding Methods in Heart Disease Prediction.

| Methods | KNN Acc. | KNN F2 | NN Acc. | NN F2 | SVM Acc. | SVM F2 | XGBoost Acc. | XGBoost F2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 81.02 | 71.33 | 81.77 | 77.44 | 79.75 | 77.18 | 79.32 | 75.19 |
| Node2Vec | | | | | | | | |
| EmbedOnly | 49.36 $\pm$ 2.55 | 47.10 $\pm$ 3.97 | 49.03 $\pm$ 3.26 | 50.71 $\pm$ 3.27 | 48.31 $\pm$ 4.33 | 48.79 $\pm$ 2.42 | 48.61 $\pm$ 1.52 | 47.50 $\pm$ 3.72 |
| EmbedOnlyRed | 49.36 $\pm$ 2.55 | 47.10 $\pm$ 3.97 | 49.00 $\pm$ 3.59 | 50.48 $\pm$ 2.29 | 48.19 $\pm$ 4.09 | 48.52 $\pm$ 2.21 | 48.93 $\pm$ 2.38 | 50.90 $\pm$ 4.82 |
| EmbedAugTab | 81.27 $\pm$ 0.21 | 71.43 $\pm$ 0.48 | 78.15 $\pm$ 1.27 | 76.08 $\pm$ 1.58 | 80.59 $\pm$ 0.27 | 78.07 $\pm$ 0.54 | 64.94 $\pm$ 1.29 | 64.56 $\pm$ 4.35 |
| EmbedAugTabRed | 80.60 $\pm$ 0.13 | 70.63 $\pm$ 0.27 | 81.02 $\pm$ 0.83 | 76.93 $\pm$ 1.13 | 79.34 $\pm$ 0.11 | 77.53 $\pm$ 0.18 | 79.72 $\pm$ 0.49 | 75.24 $\pm$ 0.81 |
| DistAugTab | 81.17 $\pm$ 0.12 | 71.54 $\pm$ 0.18 | 82.17 $\pm$ 0.50 | 78.78 $\pm$ 1.09 | 81.81 $\pm$ 0.09 | 78.36 $\pm$ 0.02 | 92.51 $\pm$ 2.14 | 90.85 $\pm$ 3.96 |
| EmbedDistAugTab | 81.43 $\pm$ 0.28 | 71.70 $\pm$ 0.55 | 77.71 $\pm$ 1.53 | 76.10 $\pm$ 1.83 | 81.67 $\pm$ 0.32 | 78.57 $\pm$ 0.25 | 91.82 $\pm$ 3.11 | 89.27 $\pm$ 5.73 |
| EmbedDistAugTabRed | 80.66 $\pm$ 0.20 | 70.76 $\pm$ 0.33 | 80.06 $\pm$ 0.55 | 75.74 $\pm$ 1.33 | 79.46 $\pm$ 0.20 | 77.71 $\pm$ 0.30 | 79.32 $\pm$ 0.35 | 75.15 $\pm$ 0.59 |
| EmbedClustAugTab | 81.17 $\pm$ 0.72 | 70.97 $\pm$ 1.49 | 72.43 $\pm$ 0.89 | 72.96 $\pm$ 3.61 | 76.10 $\pm$ 0.56 | 75.01 $\pm$ 2.00 | 55.32 $\pm$ 2.25 | 57.39 $\pm$ 4.94 |
| EmbedInteraAugTab | 79.07 $\pm$ 0.82 | 65.56 $\pm$ 1.82 | 75.95 $\pm$ 1.28 | 75.70 $\pm$ 1.90 | 80.12 $\pm$ 0.28 | 76.95 $\pm$ 1.15 | 69.22 $\pm$ 1.89 | 68.40 $\pm$ 3.69 |
| ClustAugTab | 81.21 $\pm$ 0.64 | 71.16 $\pm$ 1.43 | 78.01 $\pm$ 0.10 | 76.03 $\pm$ 1.14 | 77.58 $\pm$ 0.41 | 76.02 $\pm$ 0.69 | 62.93 $\pm$ 0.73 | 65.05 $\pm$ 3.99 |
| InteraAugTab | 79.06 $\pm$ 0.68 | 65.55 $\pm$ 1.49 | 78.81 $\pm$ 1.34 | 76.77 $\pm$ 0.41 | 80.11 $\pm$ 0.70 | 77.00 $\pm$ 1.10 | 74.18 $\pm$ 1.71 | 72.90 $\pm$ 2.57 |
| RDF2Vec | | | | | | | | |
| EmbedOnly | 52.04 $\pm$ 0.69 | 31.42 $\pm$ 4.68 | 53.04 $\pm$ 1.28 | 22.06 $\pm$ 9.23 | 51.27 $\pm$ 0.92 | 40.14 $\pm$ 3.48 | 50.65 $\pm$ 1.72 | 43.41 $\pm$ 2.18 |
| EmbedOnlyRed | 52.04 $\pm$ 0.69 | 31.42 $\pm$ 4.68 | 53.34 $\pm$ 1.10 | 21.84 $\pm$ 9.30 | 51.27 $\pm$ 0.92 | 40.16 $\pm$ 3.44 | 51.00 $\pm$ 1.21 | 43.18 $\pm$ 1.71 |
| EmbedAugTab | 81.02 $\pm$ 0.00 | 71.33 $\pm$ 0.00 | 82.07 $\pm$ 0.29 | 78.64 $\pm$ 0.42 | 79.75 $\pm$ 0.00 | 77.18 $\pm$ 0.00 | 78.56 $\pm$ 0.54 | 75.30 $\pm$ 1.34 |
| EmbedAugTabRed | 79.95 $\pm$ 0.00 | 69.59 $\pm$ 0.00 | 80.32 $\pm$ 0.58 | 76.06 $\pm$ 0.90 | 79.32 $\pm$ 0.00 | 77.63 $\pm$ 0.00 | 78.77 $\pm$ 0.19 | 75.25 $\pm$ 0.16 |
| DistAugTab | 81.02 $\pm$ 0.00 | 71.33 $\pm$ 0.00 | 81.96 $\pm$ 0.58 | 78.57 $\pm$ 0.98 | 79.75 $\pm$ 0.00 | 77.18 $\pm$ 0.00 | 84.38 $\pm$ 1.59 | 81.62 $\pm$ 2.24 |
| EmbedDistAugTab | 81.02 $\pm$ 0.00 | 71.33 $\pm$ 0.00 | 81.85 $\pm$ 0.10 | 78.46 $\pm$ 0.58 | 79.75 $\pm$ 0.00 | 77.18 $\pm$ 0.00 | 80.60 $\pm$ 0.82 | 77.20 $\pm$ 0.82 |
| EmbedDistAugTabRed | 79.95 $\pm$ 0.00 | 69.59 $\pm$ 0.00 | 80.49 $\pm$ 0.77 | 76.45 $\pm$ 0.90 | 79.32 $\pm$ 0.00 | 77.63 $\pm$ 0.00 | 78.77 $\pm$ 0.19 | 75.25 $\pm$ 0.16 |
| EmbedClustAugTab | 81.18 $\pm$ 0.11 | 71.12 $\pm$ 0.17 | 81.44 $\pm$ 0.63 | 77.48 $\pm$ 0.67 | 80.16 $\pm$ 0.08 | 77.36 $\pm$ 0.28 | 78.64 $\pm$ 0.40 | 75.34 $\pm$ 0.80 |
| EmbedInteraAugTab | 81.02 $\pm$ 0.00 | 71.33 $\pm$ 0.00 | 81.81 $\pm$ 0.28 | 78.43 $\pm$ 0.79 | 79.76 $\pm$ 0.02 | 77.18 $\pm$ 0.01 | 79.33 $\pm$ 1.04 | 76.03 $\pm$ 0.75 |
| ClustAugTab | 81.18 $\pm$ 0.11 | 71.12 $\pm$ 0.17 | 81.81 $\pm$ 0.70 | 78.38 $\pm$ 1.11 | 80.16 $\pm$ 0.08 | 77.36 $\pm$ 0.28 | 79.10 $\pm$ 0.35 | 75.22 $\pm$ 0.42 |
| InteraAugTab | 81.02 $\pm$ 0.00 | 71.33 $\pm$ 0.00 | 82.25 $\pm$ 0.39 | 78.59 $\pm$ 0.51 | 79.75 $\pm$ 0.00 | 77.18 $\pm$ 0.00 | 79.23 $\pm$ 0.45 | 76.15 $\pm$ 0.72 |
| DistMult | | | | | | | | |
| EmbedOnly | 48.04 $\pm$ 0.68 | 62.88 $\pm$ 7.55 | 46.43 $\pm$ 2.21 | 64.43 $\pm$ 5.32 | 47.14 $\pm$ 1.42 | 68.57 $\pm$ 2.13 | 47.78 $\pm$ 2.26 | 54.97 $\pm$ 4.72 |
| EmbedOnlyRed | 48.04 $\pm$ 0.68 | 62.88 $\pm$ 7.55 | 49.35 $\pm$ 3.91 | 69.01 $\pm$ 3.24 | 47.46 $\pm$ 0.92 | 64.98 $\pm$ 4.09 | 47.35 $\pm$ 1.69 | 59.07 $\pm$ 4.39 |
| EmbedAugTab | 81.07 $\pm$ 0.10 | 71.40 $\pm$ 0.19 | 80.35 $\pm$ 0.99 | 78.30 $\pm$ 0.71 | 80.03 $\pm$ 0.31 | 77.68 $\pm$ 0.40 | 49.60 $\pm$ 3.09 | 55.53 $\pm$ 1.33 |
| EmbedAugTabRed | 80.16 $\pm$ 0.15 | 70.03 $\pm$ 0.12 | 80.79 $\pm$ 0.36 | 76.45 $\pm$ 0.49 | 79.33 $\pm$ 0.08 | 77.71 $\pm$ 0.11 | 78.27 $\pm$ 0.23 | 74.25 $\pm$ 0.15 |
| DistAugTab | 80.88 $\pm$ 0.08 | 71.04 $\pm$ 0.14 | 82.18 $\pm$ 0.77 | 78.90 $\pm$ 0.81 | 80.27 $\pm$ 0.02 | 77.39 $\pm$ 0.09 | 53.42 $\pm$ 4.61 | 54.84 $\pm$ 2.15 |
| EmbedDistAugTab | 80.94 $\pm$ 0.15 | 71.14 $\pm$ 0.21 | 80.57 $\pm$ 1.28 | 78.55 $\pm$ 0.73 | 80.11 $\pm$ 0.25 | 77.56 $\pm$ 0.29 | 50.49 $\pm$ 1.92 | 61.31 $\pm$ 1.66 |
| EmbedDistAugTabRed | 80.16 $\pm$ 0.15 | 70.02 $\pm$ 0.19 | 81.30 $\pm$ 0.11 | 77.69 $\pm$ 0.77 | 79.34 $\pm$ 0.08 | 77.71 $\pm$ 0.11 | 78.16 $\pm$ 0.25 | 73.92 $\pm$ 0.24 |
| EmbedClustAugTab | 81.39 $\pm$ 0.23 | 71.32 $\pm$ 0.11 | 72.17 $\pm$ 2.71 | 72.09 $\pm$ 2.44 | 75.31 $\pm$ 1.20 | 73.72 $\pm$ 1.25 | 50.12 $\pm$ 3.03 | 55.90 $\pm$ 1.53 |
| EmbedInteraAugTab | 80.80 $\pm$ 0.21 | 69.92 $\pm$ 0.65 | 70.67 $\pm$ 1.42 | 70.85 $\pm$ 0.59 | 80.20 $\pm$ 0.27 | 77.46 $\pm$ 0.21 | 47.54 $\pm$ 1.95 | 52.40 $\pm$ 5.09 |
| ClustAugTab | 81.43 $\pm$ 0.16 | 71.45 $\pm$ 0.06 | 76.78 $\pm$ 0.69 | 75.76 $\pm$ 1.84 | 76.17 $\pm$ 0.76 | 74.26 $\pm$ 1.77 | 59.35 $\pm$ 2.53 | 62.11 $\pm$ 8.32 |
| InteraAugTab | 80.84 $\pm$ 0.13 | 69.93 $\pm$ 0.66 | 76.42 $\pm$ 1.61 | 74.98 $\pm$ 2.28 | 80.18 $\pm$ 0.35 | 77.49 $\pm$ 0.44 | 47.29 $\pm$ 0.74 | 48.89 $\pm$ 5.33 |
| TransH | | | | | | | | |
| EmbedOnly | 48.22 $\pm$ 1.36 | 53.12 $\pm$ 5.31 | 47.88 $\pm$ 2.07 | 59.80 $\pm$ 8.61 | 48.55 $\pm$ 1.95 | 59.32 $\pm$ 2.70 | 47.57 $\pm$ 0.21 | 50.25 $\pm$ 11.95 |
| EmbedOnlyRed | 48.22 $\pm$ 1.36 | 53.12 $\pm$ 5.31 | 48.17 $\pm$ 1.24 | 57.54 $\pm$ 14.81 | 48.65 $\pm$ 0.36 | 53.07 $\pm$ 5.43 | 51.85 $\pm$ 2.31 | 54.12 $\pm$ 8.21 |
| EmbedAugTab | 81.00 $\pm$ 0.06 | 71.24 $\pm$ 0.14 | 80.68 $\pm$ 0.31 | 77.51 $\pm$ 1.17 | 79.89 $\pm$ 0.22 | 77.32 $\pm$ 0.39 | 48.59 $\pm$ 0.85 | 50.51 $\pm$ 14.01 |
| EmbedAugTabRed | 80.00 $\pm$ 0.05 | 69.74 $\pm$ 0.13 | 79.95 $\pm$ 0.66 | 75.62 $\pm$ 0.92 | 79.33 $\pm$ 0.08 | 77.69 $\pm$ 0.03 | 78.16 $\pm$ 0.38 | 74.17 $\pm$ 0.20 |
| DistAugTab | 80.98 $\pm$ 0.02 | 71.27 $\pm$ 0.03 | 81.99 $\pm$ 0.28 | 78.55 $\pm$ 0.85 | 80.10 $\pm$ 0.15 | 77.30 $\pm$ 0.05 | 75.34 $\pm$ 6.56 | 73.77 $\pm$ 4.42 |
| EmbedDistAugTab | 80.95 $\pm$ 0.06 | 71.19 $\pm$ 0.10 | 80.89 $\pm$ 1.80 | 77.97 $\pm$ 0.67 | 80.11 $\pm$ 0.04 | 77.50 $\pm$ 0.33 | 55.48 $\pm$ 4.80 | 57.76 $\pm$ 4.51 |
| EmbedDistAugTabRed | 80.08 $\pm$ 0.00 | 69.95 $\pm$ 0.06 | 80.72 $\pm$ 0.44 | 76.61 $\pm$ 0.28 | 79.38 $\pm$ 0.09 | 77.78 $\pm$ 0.05 | 78.20 $\pm$ 0.27 | 74.17 $\pm$ 0.30 |
| EmbedClustAugTab | 80.76 $\pm$ 0.43 | 70.42 $\pm$ 1.30 | 76.68 $\pm$ 2.10 | 74.95 $\pm$ 2.07 | 78.32 $\pm$ 0.91 | 74.42 $\pm$ 2.54 | 48.52 $\pm$ 1.59 | 50.38 $\pm$ 13.27 |
| EmbedInteraAugTab | 80.39 $\pm$ 0.22 | 69.04 $\pm$ 0.57 | 74.31 $\pm$ 4.18 | 72.08 $\pm$ 2.19 | 80.07 $\pm$ 0.39 | 77.14 $\pm$ 0.16 | 49.50 $\pm$ 1.57 | 48.30 $\pm$ 6.44 |
| ClustAugTab | 80.82 $\pm$ 0.38 | 70.55 $\pm$ 1.22 | 80.24 $\pm$ 0.82 | 75.94 $\pm$ 2.34 | 78.52 $\pm$ 0.93 | 74.52 $\pm$ 2.34 | 60.35 $\pm$ 1.96 | 56.58 $\pm$ 7.13 |
| InteraAugTab | 80.43 $\pm$ 0.25 | 69.10 $\pm$ 0.53 | 78.55 $\pm$ 2.05 | 75.16 $\pm$ 0.36 | 79.94 $\pm$ 0.35 | 76.94 $\pm$ 0.40 | 49.40 $\pm$ 0.77 | 41.28 $\pm$ 6.69 |

ClustAugTab	81.18 $\pm$ 0.11	71.12 $\pm$ 0.17	81.81 $\pm$ 0.70	78.38 $\pm$ 1.11	80.16 $\pm$ 0.08	77.36 $\pm$ 0.28	79.10 $\pm$ 0.35	75.22 $\pm$ 0.42
InteraAugTab	81.02 $\pm$ 0.00	71.33 $\pm$ 0.00	82.25 $\pm$ 0.39	78.59 $\pm$ 0.51	79.75 $\pm$ 0.00	77.18 $\pm$ 0.00	79.23 $\pm$ 0.45	76.15 $\pm$ 0.72
DistMult
EmbedOnly	48.04 $\pm$ 0.68	62.88 $\pm$ 7.55	46.43 $\pm$ 2.21	64.43 $\pm$ 5.32	47.14 $\pm$ 1.42	68.57 $\pm$ 2.13	47.78 $\pm$ 2.26	54.97 $\pm$ 4.72
EmbedOnlyRed	48.04 $\pm$ 0.68	62.88 $\pm$ 7.55	49.35 $\pm$ 3.91	69.01 $\pm$ 3.24	47.46 $\pm$ 0.92	64.98 $\pm$ 4.09	47.35 $\pm$ 1.69	59.07 $\pm$ 4.39
EmbedAugTab	81.07 $\pm$ 0.10	71.40 $\pm$ 0.19	80.35 $\pm$ 0.99	78.30 $\pm$ 0.71	80.03 $\pm$ 0.31	77.68 $\pm$ 0.40	49.60 $\pm$ 3.09	55.53 $\pm$ 1.33
EmbedAugTabRed	80.16 $\pm$ 0.15	70.03 $\pm$ 0.12	80.79 $\pm$ 0.36	76.45 $\pm$ 0.49	79.33 $\pm$ 0.08	77.71 $\pm$ 0.11	78.27 $\pm$ 0.23	74.25 $\pm$ 0.15
DistAugTab	80.88 $\pm$ 0.08	71.04 $\pm$ 0.14	82.18 $\pm$ 0.77	78.90 $\pm$ 0.81	80.27 $\pm$ 0.02	77.39 $\pm$ 0.09	53.42 $\pm$ 4.61	54.84 $\pm$ 2.15
EmbedDistAugTab	80.94 $\pm$ 0.15	71.14 $\pm$ 0.21	80.57 $\pm$ 1.28	78.55 $\pm$ 0.73	80.11 $\pm$ 0.25	77.56 $\pm$ 0.29	50.49 $\pm$ 1.92	61.31 $\pm$ 1.66
EmbedDistAugTabRed	80.16 $\pm$ 0.15	70.02 $\pm$ 0.19	81.30 $\pm$ 0.11	77.69 $\pm$ 0.77	79.34 $\pm$ 0.08	77.71 $\pm$ 0.11	78.16 $\pm$ 0.25	73.92 $\pm$ 0.24
EmbedClustAugTab	81.39 $\pm$ 0.23	71.32 $\pm$ 0.11	72.17 $\pm$ 2.71	72.09 $\pm$ 2.44	75.31 $\pm$ 1.20	73.72 $\pm$ 1.25	50.12 $\pm$ 3.03	55.90 $\pm$ 1.53
EmbedInteraAugTab	80.80 $\pm$ 0.21	69.92 $\pm$ 0.65	70.67 $\pm$ 1.42	70.85 $\pm$ 0.59	80.20 $\pm$ 0.27	77.46 $\pm$ 0.21	47.54 $\pm$ 1.95	52.40 $\pm$ 5.09
ClustAugTab	81.43 $\pm$ 0.16	71.45 $\pm$ 0.06	76.78 $\pm$ 0.69	75.76 $\pm$ 1.84	76.17 $\pm$ 0.76	74.26 $\pm$ 1.77	59.35 $\pm$ 2.53	62.11 $\pm$ 8.32
InteraAugTab	80.84 $\pm$ 0.13	69.93 $\pm$ 0.66	76.42 $\pm$ 1.61	74.98 $\pm$ 2.28	80.18 $\pm$ 0.35	77.49 $\pm$ 0.44	47.29 $\pm$ 0.74	48.89 $\pm$ 5.33
TransH
EmbedOnly	48.22 $\pm$ 1.36	53.12 $\pm$ 5.31	47.88 $\pm$ 2.07	59.80 $\pm$ 8.61	48.55 $\pm$ 1.95	59.32 $\pm$ 2.70	47.57 $\pm$ 0.21	50.25 $\pm$ 11.95
EmbedOnlyRed	48.22 $\pm$ 1.36	53.12 $\pm$ 5.31	48.17 $\pm$ 1.24	57.54 $\pm$ 14.81	48.65 $\pm$ 0.36	53.07 $\pm$ 5.43	51.85 $\pm$ 2.31	54.12 $\pm$ 8.21
EmbedAugTab	81.00 $\pm$ 0.06	71.24 $\pm$ 0.14	80.68 $\pm$ 0.31	77.51 $\pm$ 1.17	79.89 $\pm$ 0.22	77.32 $\pm$ 0.39	48.59 $\pm$ 0.85	50.51 $\pm$ 14.01
EmbedAugTabRed	80.00 $\pm$ 0.05	69.74 $\pm$ 0.13	79.95 $\pm$ 0.66	75.62 $\pm$ 0.92	79.33 $\pm$ 0.08	77.69 $\pm$ 0.03	78.16 $\pm$ 0.38	74.17 $\pm$ 0.20
DistAugTab	80.98 $\pm$ 0.02	71.27 $\pm$ 0.03	81.99 $\pm$ 0.28	78.55 $\pm$ 0.85	80.10 $\pm$ 0.15	77.30 $\pm$ 0.05	75.34 $\pm$ 6.56	73.77 $\pm$ 4.42
EmbedDistAugTab	80.95 $\pm$ 0.06	71.19 $\pm$ 0.10	80.89 $\pm$ 1.80	77.97 $\pm$ 0.67	80.11 $\pm$ 0.04	77.50 $\pm$ 0.33	55.48 $\pm$ 4.80	57.76 $\pm$ 4.51
EmbedDistAugTabRed	80.08 $\pm$ 0.00	69.95 $\pm$ 0.06	80.72 $\pm$ 0.44	76.61 $\pm$ 0.28	79.38 $\pm$ 0.09	77.78 $\pm$ 0.05	78.20 $\pm$ 0.27	74.17 $\pm$ 0.30
EmbedClustAugTab	80.76 $\pm$ 0.43	70.42 $\pm$ 1.30	76.68 $\pm$ 2.10	74.95 $\pm$ 2.07	78.32 $\pm$ 0.91	74.42 $\pm$ 2.54	48.52 $\pm$ 1.59	50.38 $\pm$ 13.27
EmbedInteraAugTab	80.39 $\pm$ 0.22	69.04 $\pm$ 0.57	74.31 $\pm$ 4.18	72.08 $\pm$ 2.19	80.07 $\pm$ 0.39	77.14 $\pm$ 0.16	49.50 $\pm$ 1.57	48.30 $\pm$ 6.44
ClustAugTab	80.82 $\pm$ 0.38	70.55 $\pm$ 1.22	80.24 $\pm$ 0.82	75.94 $\pm$ 2.34	78.52 $\pm$ 0.93	74.52 $\pm$ 2.34	60.35 $\pm$ 1.96	56.58 $\pm$ 7.13
InteraAugTab	80.43 $\pm$ 0.25	69.10 $\pm$ 0.53	78.55 $\pm$ 2.05	75.16 $\pm$ 0.36	79.94 $\pm$ 0.35	76.94 $\pm$ 0.40	49.40 $\pm$ 0.77	41.28 $\pm$ 6.69

Table 6.

Averages of F2 Scores Across Embedding Algorithms for Different ML Models and Approaches in Heart Disease Prediction.

Approach	KNN	NN	SVM	XGBoost
Baseline	71.33	77.44	77.18	75.19
EmbedOnly	48.63	48.84	54.21	49.03
EmbedOnlyRed	48.63	49.31	51.68	51.82
EmbedAugTab	71.35	77.22	77.56	61.47
EmbedAugTabRed	70.00	75.86	77.64	74.73
DistAugTab	71.30	78.29	77.56	75.27
EmbedDistAugTab	71.34	77.36	77.70	71.39
EmbedDistAugTabRed	70.08	76.21	77.71	74.62
EmbedClustAugTab	70.96	73.96	75.13	59.75
EmbedInteraAugTab	68.96	73.86	77.18	61.28
ClustAugTab	71.07	76.12	75.54	64.74
InteraAugTab	68.98	75.97	77.15	59.81
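
As a worked check of how an entry in Table 6 relates to Table 5, the following sketch averages the XGBoost F2 scores of the DistAugTab approach over the four embedding algorithms. The values are copied from Table 5; the variable names are illustrative, and an unweighted mean is assumed.

```python
from statistics import mean

# XGBoost F2 scores for DistAugTab, copied from Table 5
# (one value per embedding algorithm).
f2_by_embedding = {
    "Node2Vec": 90.85,
    "RDF2Vec": 81.62,
    "DistMult": 54.84,
    "TransH": 73.77,
}

# Assumption: Table 6 reports the unweighted mean over the four algorithms.
average_f2 = round(mean(f2_by_embedding.values()), 2)
print(average_f2)
```

The result matches the DistAugTab entry for XGBoost in Table 6 (75.27). It also shows that the average stays close to the baseline of 75.19 even though the Node2Vec result alone reaches 90.85, so the averaged table can hide large per-algorithm differences.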

Table 7.

Average Accuracy and F2 Scores (With Standard Deviation Across Different Vector Sizes), For Various Models, Approaches and Embedding Methods in Kidney Disease Prediction.

	KNN		NN		SVM		XGBoost
Methods	Acc.	F2	Acc.	F2	Acc.	F2	Acc.	F2
Baseline	97.00	98.43	99.92	99.96	100.00	100.00	99.75	99.46
Node2Vec
EmbedOnly	62.12 $\pm$ 0.90	20.45 $\pm$ 9.74	61.83 $\pm$ 1.76	25.97 $\pm$ 15.45	62.56 $\pm$ 1.33	22.70 $\pm$ 12.15	61.60 $\pm$ 1.07	26.67 $\pm$ 11.16
EmbedOnlyRed	62.12 $\pm$ 0.90	20.45 $\pm$ 9.74	61.58 $\pm$ 1.13	24.62 $\pm$ 14.89	62.92 $\pm$ 1.38	23.26 $\pm$ 12.30	61.42 $\pm$ 1.18	25.43 $\pm$ 12.01
EmbedAugTab	97.19 $\pm$ 0.11	98.53 $\pm$ 0.06	99.83 $\pm$ 0.29	99.78 $\pm$ 0.39	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
EmbedAugTabRed	99.10 $\pm$ 0.07	98.55 $\pm$ 0.08	99.75 $\pm$ 0.00	99.73 $\pm$ 0.23	99.83 $\pm$ 0.14	99.64 $\pm$ 0.31	99.28 $\pm$ 0.38	99.08 $\pm$ 0.44
DistAugTab	97.06 $\pm$ 0.00	98.46 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
EmbedDistAugTab	97.21 $\pm$ 0.10	98.54 $\pm$ 0.05	99.92 $\pm$ 0.14	99.82 $\pm$ 0.31	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
EmbedDistAugTabRed	99.10 $\pm$ 0.07	98.55 $\pm$ 0.08	99.75 $\pm$ 0.25	99.73 $\pm$ 0.27	99.83 $\pm$ 0.14	99.64 $\pm$ 0.31	99.47 $\pm$ 0.05	99.18 $\pm$ 0.26
EmbedClustAugTab	97.38 $\pm$ 0.25	98.56 $\pm$ 0.24	99.92 $\pm$ 0.14	99.82 $\pm$ 0.31	99.97 $\pm$ 0.05	99.94 $\pm$ 0.10	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
EmbedInteraAugTab	96.54 $\pm$ 0.22	98.03 $\pm$ 0.32	98.83 $\pm$ 0.52	98.44 $\pm$ 1.21	99.06 $\pm$ 0.70	97.96 $\pm$ 1.53	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
ClustAugTab	97.27 $\pm$ 0.16	98.54 $\pm$ 0.13	99.75 $\pm$ 0.00	99.87 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
InteraAugTab	96.46 $\pm$ 0.22	97.99 $\pm$ 0.32	99.00 $\pm$ 0.25	98.39 $\pm$ 0.49	99.14 $\pm$ 0.63	98.14 $\pm$ 1.37	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
RDF2Vec
EmbedOnly	59.45 $\pm$ 0.87	8.59 $\pm$ 3.09	62.50 $\pm$ 1.10	4.10 $\pm$ 1.30	56.64 $\pm$ 2.21	13.90 $\pm$ 0.58	53.19 $\pm$ 0.96	26.59 $\pm$ 3.53
EmbedOnlyRed	59.45 $\pm$ 0.87	8.59 $\pm$ 3.09	56.25 $\pm$ 0.00	7.75 $\pm$ 0.00	56.58 $\pm$ 2.29	13.89 $\pm$ 0.58	53.83 $\pm$ 1.50	27.52 $\pm$ 0.64
EmbedAugTab	97.00 $\pm$ 0.00	98.43 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
EmbedAugTabRed	99.06 $\pm$ 0.00	98.70 $\pm$ 0.00	99.75 $\pm$ 0.00	99.87 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.50 $\pm$ 0.00	99.33 $\pm$ 0.00
DistAugTab	97.00 $\pm$ 0.00	98.43 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
EmbedDistAugTab	97.00 $\pm$ 0.00	98.43 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
EmbedDistAugTabRed	99.06 $\pm$ 0.00	98.70 $\pm$ 0.00	99.83 $\pm$ 0.29	99.78 $\pm$ 0.39	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.50 $\pm$ 0.00	99.33 $\pm$ 0.00
EmbedClustAugTab	97.17 $\pm$ 0.16	98.52 $\pm$ 0.08	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
EmbedInteraAugTab	97.00 $\pm$ 0.00	98.43 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
ClustAugTab	97.17 $\pm$ 0.16	98.52 $\pm$ 0.08	99.92 $\pm$ 0.14	99.96 $\pm$ 0.08	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
InteraAugTab	97.00 $\pm$ 0.00	98.43 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
DistMult
EmbedOnly	66.09 $\pm$ 1.99	46.62 $\pm$ 9.74	59.17 $\pm$ 1.91	44.73 $\pm$ 28.08	64.17 $\pm$ 0.00	16.95 $\pm$ 0.00	55.83 $\pm$ 6.17	26.58 $\pm$ 35.97
EmbedOnlyRed	66.09 $\pm$ 1.99	46.62 $\pm$ 9.74	51.88 $\pm$ 9.72	74.79 $\pm$ 3.58	58.75 $\pm$ 0.00	11.63 $\pm$ 0.00	63.75 $\pm$ 0.00	4.13 $\pm$ 0.00
EmbedAugTab	97.02 $\pm$ 0.04	98.44 $\pm$ 0.02	99.92 $\pm$ 0.14	99.82 $\pm$ 0.31	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
EmbedAugTabRed	99.08 $\pm$ 0.04	98.68 $\pm$ 0.07	99.83 $\pm$ 0.14	99.91 $\pm$ 0.08	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00	99.42 $\pm$ 0.14	99.29 $\pm$ 0.08
DistAugTab	97.00 $\pm$ 0.00	98.43 $\pm$ 0.00	99.75 $\pm$ 0.25	99.87 $\pm$ 0.13	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
EmbedDistAugTab	97.04 $\pm$ 0.07	98.45 $\pm$ 0.04	99.92 $\pm$ 0.14	99.96 $\pm$ 0.08	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
EmbedDistAugTabRed	99.08 $\pm$ 0.04	98.71 $\pm$ 0.02	99.92 $\pm$ 0.14	99.96 $\pm$ 0.08	99.89 $\pm$ 0.13	99.76 $\pm$ 0.27	99.50 $\pm$ 0.00	99.33 $\pm$ 0.00
EmbedClustAugTab	97.44 $\pm$ 0.27	98.59 $\pm$ 0.26	99.92 $\pm$ 0.14	99.96 $\pm$ 0.08	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
EmbedInteraAugTab	97.27 $\pm$ 0.24	98.27 $\pm$ 0.27	98.25 $\pm$ 0.75	96.12 $\pm$ 1.74	99.72 $\pm$ 0.35	99.40 $\pm$ 0.75	95.84 $\pm$ 3.69	89.86 $\pm$ 9.52
ClustAugTab	97.38 $\pm$ 0.17	98.59 $\pm$ 0.14	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
InteraAugTab	97.29 $\pm$ 0.16	98.35 $\pm$ 0.19	98.00 $\pm$ 0.66	95.53 $\pm$ 1.64	99.86 $\pm$ 0.24	99.70 $\pm$ 0.52	98.33 $\pm$ 2.45	96.29 $\pm$ 5.49
TransH
EmbedOnly	41.82 $\pm$ 3.92	67.08 $\pm$ 6.86	45.81 $\pm$ 4.30	58.45 $\pm$ 7.49	56.64 $\pm$ 13.86	33.89 $\pm$ 25.46	44.73 $\pm$ 3.76	63.87 $\pm$ 7.00
EmbedOnlyRed	41.82 $\pm$ 3.92	67.08 $\pm$ 6.86	39.79 $\pm$ 4.16	62.73 $\pm$ 9.94	67.50 $\pm$ 0.00	85.23 $\pm$ 0.00	45.03 $\pm$ 5.18	62.13 $\pm$ 12.14
EmbedAugTab	97.00 $\pm$ 0.00	98.43 $\pm$ 0.00	99.83 $\pm$ 0.14	99.91 $\pm$ 0.08	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
EmbedAugTabRed	99.08 $\pm$ 0.04	98.71 $\pm$ 0.02	99.83 $\pm$ 0.29	99.91 $\pm$ 0.15	99.92 $\pm$ 0.14	99.82 $\pm$ 0.31	99.50 $\pm$ 0.00	99.33 $\pm$ 0.00
DistAugTab	97.06 $\pm$ 0.00	98.46 $\pm$ 0.00	99.92 $\pm$ 0.14	99.96 $\pm$ 0.08	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
EmbedDistAugTab	97.06 $\pm$ 0.00	98.46 $\pm$ 0.00	99.83 $\pm$ 0.14	99.91 $\pm$ 0.08	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	98.78 $\pm$ 1.68	98.86 $\pm$ 1.04
EmbedDistAugTabRed	99.12 $\pm$ 0.00	98.73 $\pm$ 0.00	99.83 $\pm$ 0.29	99.91 $\pm$ 0.15	99.86 $\pm$ 0.13	99.70 $\pm$ 0.27	99.50 $\pm$ 0.00	99.33 $\pm$ 0.00
EmbedClustAugTab	96.92 $\pm$ 0.18	98.32 $\pm$ 0.18	99.50 $\pm$ 0.50	99.74 $\pm$ 0.26	99.92 $\pm$ 0.14	99.96 $\pm$ 0.08	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
EmbedInteraAugTab	96.94 $\pm$ 0.06	98.36 $\pm$ 0.03	99.25 $\pm$ 0.43	98.64 $\pm$ 1.19	99.83 $\pm$ 0.29	99.64 $\pm$ 0.62	86.85 $\pm$ 4.36	66.82 $\pm$ 11.84
ClustAugTab	96.90 $\pm$ 0.20	98.31 $\pm$ 0.18	99.83 $\pm$ 0.14	99.91 $\pm$ 0.08	99.92 $\pm$ 0.14	99.96 $\pm$ 0.08	99.75 $\pm$ 0.00	99.46 $\pm$ 0.00
InteraAugTab	97.00 $\pm$ 0.00	98.40 $\pm$ 0.06	99.00 $\pm$ 0.66	98.11 $\pm$ 1.36	100.00 $\pm$ 0.00	100.00 $\pm$ 0.00	87.93 $\pm$ 4.93	69.47 $\pm$ 13.95

Table 8.

Averages of F2 Scores Across Embedding Algorithms for Different ML Models and Approaches in Kidney Disease Prediction.

Approach	KNN	NN	SVM	XGBoost
Baseline	98.43	99.96	100.00	99.46
EmbedOnly	35.69	33.31	21.86	35.93
EmbedOnlyRed	35.69	42.47	33.50	29.80
EmbedAugTab	98.46	99.88	100.00	99.46
EmbedAugTabRed	98.66	99.86	99.73	99.26
DistAugTab	98.45	99.96	100.00	99.46
EmbedDistAugTab	98.47	99.92	100.00	99.31
EmbedDistAugTabRed	98.67	99.84	99.78	99.29
EmbedClustAugTab	98.50	99.88	99.97	99.46
EmbedInteraAugTab	98.27	98.30	99.25	88.90
ClustAugTab	98.49	99.93	99.99	99.46
InteraAugTab	98.29	98.01	99.46	91.17

Investigating the impact of various methods for data augmentation through KGE

The different methods of augmenting tabular data with KG embeddings yield mixed results across models. Approaches such as EmbedAugTab, DistAugTab and EmbedDistAugTab often provide the largest performance improvements, especially for models such as XGBoost and NN. For example, DistAugTab with Node2Vec embeddings raised the F2 score of XGBoost from 75.19 (baseline) to 90.85, highlighting the ability of XGBoost to use the additional distance features derived from KG embeddings effectively.

Conversely, SVM and KNN tend to struggle with complex augmentation methods, showing lower gains and even losses in some cases, as they are less suited to high-dimensional data and the resulting feature complexity.
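
Since the comparisons in this section are expressed as F2 scores, a brief reminder of the metric may help. The sketch below assumes the standard F-beta definition with beta = 2, which weights recall more heavily than precision. The function name and the confusion-matrix counts are illustrative and are not taken from the experiments.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score from confusion-matrix counts (standard definition)."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    # Return 0.0 when there are no positives at all, to avoid division by zero.
    return (1 + b2) * tp / denom if denom else 0.0


# Illustrative counts only: 80 true positives, 20 false positives,
# 10 false negatives.
score = f_beta(tp=80, fp=20, fn=10)
print(f"F2 = {score:.4f}")
```

With these counts, precision is 0.80 and recall is about 0.89. The F2 score lies closer to the recall value, which is the behaviour wanted when missed positive cases are costlier than false alarms.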

In the following, we consider the effectiveness of the different approaches based on the sub-hypotheses H1.1 to H1.5.

H1.1 Analysis: Our initial hypothesis (H1.1) proposed that using KG-derived embeddings alone (EmbedOnly) could provide meaningful insights by capturing latent relationships within the data. However, the results across all models and embedding algorithms contradict this hypothesis. The EmbedOnly approach consistently underperformed the baseline for each ML model, regardless of the KG embedding method used. For example, with Node2Vec, the F2 score of SVM dropped from 77.18 (baseline) to 48.79, and similar trends were observed for other models and embeddings. Even when dimensionality reduction (EmbedOnlyRed) was applied to the embeddings, the results remained poor. This suggests that the standalone embeddings lack the richness of information provided by the original tabular features, which include more direct indicators of patient characteristics and clinical factors. Additionally, using embeddings alone may introduce complexity without clear connections to the target variable, making it difficult for the models to extract useful patterns.

H1.2 Analysis: Our second hypothesis (H1.2) proposed that combining the KG embeddings with traditional tabular data (EmbedAugTab) would enhance predictive performance by adding relational information from the KG structure. This approach showed mixed results. In some cases, it led to modest improvements, such as for NN with RDF2Vec embeddings (F2 score improved from 77.44 to 78.64) or for KNN with Node2Vec and DistMult. For SVM, the reduced variant EmbedAugTabRed yielded F2 gains over the baseline with every embedding algorithm, suggesting that the additional KG information could help refine decision boundaries for SVM’s kernel-based approach. However, XGBoost often underperformed when embeddings were added (EmbedAugTab), with scores generally below the baseline. This could be due to XGBoost’s preference for a simpler feature space where tabular data alone provides more direct information, making the additional, less structured KG-derived features more of a hindrance than a help.

H1.3 Analysis: To address potential noise from directly using embeddings, H1.3 suggested that extracting specific structural features, such as distances from class centroids (DistAugTab) or clustering characteristics (ClustAugTab), would yield better results. The performance of DistAugTab, especially with Node2Vec embeddings, supports this hypothesis, showing improvements over the baseline across the NN, SVM and XGBoost models. For instance, augmenting the data with DistAugTab features computed from Node2Vec embeddings raised the F2 score of NN from 77.44% to 78.78% and that of XGBoost from 75.19% to 90.85%, indicating that distance-based features may help capture nuanced relationships between instances and classes that are relevant for classification. NN and SVM also performed well with DistAugTab when DistMult and TransH were used for embedding generation, likely because these models can benefit from the distance measures, which make it easier to distinguish between similar instances.

The ClustAugTab approach also showed improvements, particularly with KNN. For example, using RDF2Vec with ClustAugTab led to better clustering of instances, resulting in improved accuracy for KNN and SVM from 81.02% to 81.18% and from 79.75% to 80.16%, respectively. KNN benefited from this approach as it relies on distance metrics to identify neighbours, and meaningful clusters align with its decision-making process. Similarly, SVM showed better results when RDF2Vec was used for embedding generation and ClustAugTab for data augmentation, possibly because the cluster memberships served as a valuable feature that helped define clearer support vectors for class separation.
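
To make the distance-based features discussed under H1.3 concrete, the following minimal sketch appends the Euclidean distance from an instance's embedding to each class centroid as extra tabular columns. It assumes that centroids are the per-class means of the training embeddings. The function names, the toy 2-dimensional embeddings and the feature values are hypothetical and do not reproduce the exact pipeline used in the experiments.

```python
from math import dist
from statistics import fmean


def class_centroids(embeddings, labels):
    """Mean embedding per class, computed from training instances only."""
    centroids = {}
    for label in sorted(set(labels)):
        members = [e for e, y in zip(embeddings, labels) if y == label]
        # zip(*members) iterates over embedding dimensions.
        centroids[label] = [fmean(dim) for dim in zip(*members)]
    return centroids


def add_distance_features(tabular_row, embedding, centroids):
    """Append one Euclidean distance per class centroid to a tabular row."""
    return list(tabular_row) + [dist(embedding, c) for c in centroids.values()]


# Toy data (hypothetical): two training instances per class.
train_emb = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
train_y = [0, 0, 1, 1]
centroids = class_centroids(train_emb, train_y)

# One instance with two tabular features and a 2-dimensional embedding.
augmented = add_distance_features([63.0, 1.0], [0.0, 1.0], centroids)
print(augmented)
```

In this sketch the number of added columns equals the number of classes, regardless of the embedding dimension, so the feature space stays compact compared with appending the full embedding vector. Computing the centroids on the training split only is a design choice made here to keep test labels out of the features.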

H1.4 Analysis: Sub-hypothesis H1.4 suggests that certain classes in the data may be more effectively distinguished when interactions between KG embeddings and traditional tabular data are considered. This hypothesis assumes that there are complex dependencies between the relational information captured by the embeddings and the raw features of the tabular data. The results of the InteraAugTab and EmbedInteraAugTab approaches provide some support for this hypothesis. For example, with interaction terms as additional features (InteraAugTab), SVM shows a minor improvement in accuracy from 79.75% (baseline) to 80.11%, 80.18% and 79.94% when the Node2Vec, DistMult and TransH algorithms, respectively, are used to generate the embeddings. Comparing the embedding algorithms, the table indicates that RDF2Vec generated the most suitable embeddings for the InteraAugTab approach.
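
The following sketch illustrates what interaction terms between embeddings and tabular features could look like. It assumes that interactions are pairwise products between each tabular feature and each embedding dimension, which is one common construction and may differ from the one used for InteraAugTab. All names and values are hypothetical.

```python
from itertools import product


def add_interaction_features(tabular_row, embedding):
    """Append pairwise products (tabular feature x embedding dimension)."""
    interactions = [t * e for t, e in product(tabular_row, embedding)]
    return list(tabular_row) + interactions


# Toy example (hypothetical values): 2 tabular features,
# 3 embedding dimensions.
row = [2.0, 0.5]
emb = [1.0, -1.0, 4.0]
augmented = add_interaction_features(row, emb)
print(augmented)
```

Under this assumption the number of added columns is the product of the two dimensionalities (here 2 x 3 = 6), so the feature space grows quickly with the embedding size.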

H1.5 Analysis: Sub-hypothesis (H1.5) suggests that reducing dimensionality would help models by removing irrelevant or noisy features, thus focussing learning on the most informative aspects of the data. The results show mixed outcomes: while dimensionality reduction sometimes improved performance by simplifying the input space, it often failed to match the effectiveness of methods that used the full set of features without PCA.

For instance, while EmbedAugTabRed and EmbedDistAugTabRed helped reduce overfitting for XGBoost with DistMult embeddings by eliminating redundant features, they did not outperform the EmbedAugTab and EmbedDistAugTab methods for KNN. This suggests that, while PCA can be useful for some models, it might remove valuable information that more sophisticated models can use, highlighting a trade-off between feature simplification and richness.

Investigating the impact of embedding algorithm

The choice of KG embedding algorithm has a significant impact on model performance across the various approaches. Each embedding method captures different aspects of the KG structure, influencing how well the derived embeddings integrate with the original tabular data and the model’s ability to leverage this information.

From the results shown in Table 5 and the F2 score differences illustrated in Figure 11⁷, it is evident that Node2Vec and RDF2Vec generally lead to more consistent performance improvements compared to DistMult and TransH, particularly when combined with approaches such as EmbedAugTab and DistAugTab. For example, Node2Vec embeddings with the EmbedDistAugTab approach provided the most notable gains across models, including SVM and XGBoost. This improvement suggests that Node2Vec’s random walk-based approach is effective in preserving local neighbourhood information and graph structure, which seems to translate well into the feature space used by the models. The relational patterns it captures may align better with the tabular features, providing additional context that aids in classification.

Figure 11.

F2 score differences relative to baseline across models and embedding methods, showing gain/loss for each approach for heart disease prediction

RDF2Vec also showed good performance, particularly with EmbedAugTab and ClustAugTab approaches. Its ability to leverage RDF graph structures and preserve semantic relationships appears to be beneficial, especially for models such as NN and SVM.

In contrast, DistMult and TransH showed more variable results. While these methods performed well in specific scenarios (such as DistAugTab with DistMult or TransH, particularly for SVM and XGBoost), they were less consistent across different approaches. For example, while DistMult’s tensor factorisation approach allows it to capture specific types of relational patterns, this does not always translate into performance gains when used for approaches such as ClustAugTab, InteraAugTab or EmbedAugTab.

Moreover, the figures show that XGBoost’s performance is particularly sensitive to the choice of embedding algorithm. While XGBoost generally excelled with the DistAugTab and EmbedDistAugTab approaches using Node2Vec, it underperformed with simpler methods such as EmbedAugTab when combined with TransH or DistMult. This suggests that XGBoost requires embeddings that add clear, structured relational information rather than purely dense vector representations. Thus, the ability of Node2Vec and RDF2Vec to provide richer, more interpretable representations likely aligns better with XGBoost’s learning mechanism.

In conclusion, the choice of the embedding algorithm plays a crucial role in determining the success of different data augmentation approaches. Node2Vec and RDF2Vec consistently provide more valuable representations for enhancing model performance across a range of methods, likely due to their strength in capturing both local and global graph structures. DistMult and TransH, while potentially effective in capturing specific relational patterns, exhibit more variability and require carefully chosen augmentation methods to translate their structural information into improved model performance. These findings emphasise that selecting the right embedding algorithm is critical, as it can significantly influence how well the additional relational data is integrated into the learning process.

Investigating the impact of KGs choice

Figure 12 shows the average accuracy and F2 score across all evaluated approaches implemented with each ontology. It shows that the choice of ontology (Small, Extended, or Snomed) slightly affects model performance. Using the Snomed ontology generally provides the highest accuracy and F2 scores, likely due to its clinically structured information curated by medical experts, highlighting its ability to enrich predictions. The Small KG yields the poorest results among the three ontologies, arguably because it was handcrafted by non-medical experts, which limits its depth and relevance to complex medical relationships.

Figure 12.

Comparison of accuracy (left) and F2 (right) scores for different ML models across various KGs.

Investigating the performance of different ML models across various approaches and embedding algorithms

Across the evaluated models, XGBoost and NN showed the most significant improvements when incorporating various KG augmentation methods. Specifically, XGBoost’s performance saw the largest gains using the DistAugTab and EmbedDistAugTab approaches. For example, with Node2Vec embeddings and the EmbedDistAugTab approach, XGBoost’s F2 score increased from a baseline of 75.19% to 89.27%. This can be attributed to XGBoost’s ability to effectively handle high-dimensional feature spaces, allowing it to extract valuable patterns from the distance-based features derived from the embeddings. However, XGBoost underperformed markedly under several augmentation approaches, especially when embeddings were generated with TransH and DistMult. This reduced performance may be due to the relational complexity in TransH and DistMult embeddings, which introduces interdependent features that XGBoost struggles to interpret independently.

On the other hand, KNN showed only slight performance gains when augmented with embeddings but maintained stable results across different approaches and embedding algorithms.

Looking at the average F2 scores in Table 6, taken over the embedding algorithms, we observe that SVM performed slightly better than the baseline for five approaches (EmbedAugTab, EmbedAugTabRed, DistAugTab, EmbedDistAugTab and EmbedDistAugTabRed). This makes SVM, in general, a model that benefits from additional KG data, especially when distances to the target classes are computed or when the embedding vectors are added directly to the tabular data.

7.2. Kidney Disease Prediction

Table 7 shows the average accuracy and F2 scores, along with the standard deviation across different vector sizes of the embeddings, for different models, approaches and embedding methods in kidney disease prediction. Additional results, including average recall with standard deviation across vector sizes, are provided in Table 12 in Appendix 8. In the following paragraphs, we discuss the results based on the research questions, also considering the sub-hypotheses H1.1 to H1.5 from Section 5.

Investigating the impact of various methods for data augmentation through KGE

From Table 7, we observe that adding distance-based features to tabular data improves ML model performance, especially for KNN and NN. Although the baseline is already high, enhancements such as distance-to-class features, cluster membership features and embedding vectors still boost performance. For example, KNN accuracy increases from 97% to 99.08% and 99.12% when the data is augmented with embedding vectors (EmbedAugTabRed) and with embeddings plus Euclidean distances to classes (EmbedDistAugTabRed), respectively, using TransH to generate the embeddings. In the following, we discuss the hypotheses H1.1 to H1.5 based on the results.

H1.1 Analysis: As with heart disease prediction, using only embeddings (EmbedOnly) consistently underperforms compared to the baseline, regardless of the ML model or embedding method, contradicting the hypothesis. This suggests that embeddings alone provide less insight than tabular data for capturing relationships. Performance is particularly poor with RDF2Vec embeddings, likely because RDF2Vec focuses on structural patterns rather than the detailed, feature-specific information captured in tabular data.

H1.2 Analysis: In line with this hypothesis, the table shows that augmenting tabular data with embedding vectors (EmbedAugTab) generally results in similar or slightly improved accuracy and F2 scores compared to the baseline. For instance, for KNN, adding Node2Vec embeddings increases accuracy from 97% to 97.19% and the F2 score from 98.43% to 98.53%. Likewise, for NN, augmenting with RDF2Vec embeddings raises both accuracy and F2 scores from 99.92% and 99.96% to 100%.

H1.3 Analysis: This hypothesis suggests that adding structural features, such as distances to class centroids (DistAugTab) or clustering membership (ClustAugTab), optionally together with the embedding vectors (e.g., EmbedDistAugTab), should enhance performance. Structural features help capture relationships in the data by adding context about group distances, which is especially useful for proximity-based models such as KNN, SVM and NN. The results support this, showing that performance generally remains consistent with or slightly better than the baseline. For example, for NN using Node2Vec embeddings, the DistAugTab approach increases accuracy and F2 score from 99.92% and 99.96% to 100%, as Node2Vec effectively captures neighbourhood structures that align with these models’ reliance on similarity. However, with TransH embeddings, which emphasise hierarchical relationships, ClustAugTab and EmbedClustAugTab slightly decrease accuracy and F2 scores for KNN, NN and SVM, as these embeddings may introduce noise rather than meaningful distance-based information.

H1.4 Analysis: Similarly to the heart disease prediction results, Table 7 shows some results supporting the hypothesis that complex interactions between embeddings and raw tabular features can further improve model performance. SVM generally maintained its 100% performance across different embedding algorithms. RDF2Vec proved to be the most compatible embedding algorithm for our approaches, as it did not lead to performance drops with any approach. KNN, on the other hand, showed performance improvements in the two interaction-based approaches when the embeddings were generated with DistMult, with accuracy improving from 97.00% with only tabular data to 97.29% when interaction terms were included via the InteraAugTab approach.

H1.5 Analysis: For this hypothesis, which suggests that dimensionality reduction can help eliminate noisy features, the results for kidney disease prediction show mixed outcomes. For KNN, applying the PCA algorithm consistently increases both accuracy and F2 scores compared to using the full feature set, particularly for the approaches EmbedDistAugTabRed and EmbedAugTabRed. This improvement is expected given KNN’s struggles with high-dimensional data; it relies heavily on distance calculations, and PCA effectively enhances the quality of these calculations by reducing noise and emphasising the most informative dimensions. For instance, when using embeddings generated with Node2Vec, dimensionality reduction in EmbedAugTab and EmbedDistAugTab boosts the accuracy from 97.19% and 97.21%, respectively, to 99.10% in both cases.

Conversely, the performance of NN, SVM and XGBoost generally worsened after the dimensionality reduction step. This could be attributed to the already high baseline accuracy (ranging from 99.75% to 100%); further reducing the dimensionality might eliminate features that, while not highly significant, still contribute to the model’s performance. Additionally, these models can inherently manage high-dimensional data and may not benefit as much from PCA as KNN does. As a result, the reduced feature set may lack the nuanced information that these more complex models require for optimal performance.

Investigating the impact of embedding algorithm

Figure 13 shows the F2 score differences relative to the baseline across models and embedding methods, highlighting gains and losses for each approach in kidney disease prediction. The approaches EmbedOnly, EmbedOnlyRed, EmbedInteraAugTab and InteraAugTab were excluded from the analysis because their low performance, particularly with embeddings generated by DistMult and TransH, would skew the comparison.
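
The gain/loss values in such a comparison are simple differences from the baseline. As a worked example, the sketch below applies this arithmetic to the averaged KNN F2 scores from Table 8 (Figure 13 itself reports the differences per embedding method). The dictionary layout and the selection of approaches are illustrative.

```python
# Averaged KNN F2 scores copied from Table 8.
baseline_f2 = 98.43
knn_f2 = {
    "EmbedAugTab": 98.46,
    "EmbedAugTabRed": 98.66,
    "DistAugTab": 98.45,
    "EmbedDistAugTabRed": 98.67,
    "ClustAugTab": 98.49,
}

# Positive values are gains over the baseline, negative values are losses.
delta = {name: round(score - baseline_f2, 2) for name, score in knn_f2.items()}
for name, d in delta.items():
    print(f"{name:20s} {d:+.2f}")
```

For KNN, the two PCA-reduced approaches show the largest average gains (+0.23 and +0.24), in line with the H1.5 discussion above.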

Figure 13.

F2 score differences relative to baseline across models and embedding methods, showing gain/loss for each approach for kidney disease prediction.
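A gain or loss of the kind plotted in Figure 13 is the F2 score of an augmented model minus the F2 score of the tabular-only baseline. The following is a minimal sketch using scikit-learn's `fbeta_score`; the function name `f2_gain` and the toy label vectors are illustrative assumptions, not values from our experiments.

```python
from sklearn.metrics import fbeta_score


def f2_gain(y_true, y_pred_baseline, y_pred_augmented):
    """Return the F2 difference (in percentage points) relative to the baseline."""
    # beta=2 weights recall more heavily than precision
    f2_base = fbeta_score(y_true, y_pred_baseline, beta=2)
    f2_aug = fbeta_score(y_true, y_pred_augmented, beta=2)
    return 100.0 * (f2_aug - f2_base)


# Toy labels (assumption): the augmented model recovers one more positive case
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_baseline = [1, 1, 0, 0, 0, 0, 0, 1]
y_pred_augmented = [1, 1, 1, 0, 0, 0, 0, 1]

gain = f2_gain(y_true, y_pred_baseline, y_pred_augmented)
print(f"F2 gain over baseline: {gain:.2f} percentage points")
```

A positive value corresponds to a gain in the figure and a negative value to a loss.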

Figure 14.

Different ways of infusing the KG as input into the ML pipeline.

Figure 13 shows that RDF2Vec is the best suited of the four embedding algorithms for our approaches. It generates effective embeddings, particularly for KNN, where the F2 score increased for some of the approaches and stayed the same for the others. Moreover, for the SVM model it maintained a perfect F2 score of 100%, whereas performance dropped with the other embedding algorithms. This could indicate that RDF2Vec captures relevant features that enhance the SVM’s ability to establish clear decision boundaries, ultimately resulting in higher predictive performance.

In contrast, Node2Vec generated effective embeddings for the KNN model, where it slightly improved performance. However, its utility decreased for SVM, XGBoost and NN, often leading to slight performance drops. This suggests that while Node2Vec captures local structural information well, it may introduce noise or irrelevant features for models such as SVM and XGBoost.

Using DistMult to generate the embeddings achieved performance gains with KNN but resulted in decreased performance for SVM. This inconsistency suggests that while DistMult may enhance KNN’s ability to capture relationships, it may introduce noise into SVM’s decision-making process.

Similarly, TransH performed best with KNN, especially in the EmbedAugTabRed and EmbedDistAugTabRed approaches, yet showed weaker results for other models, particularly XGBoost. This discrepancy highlights that TransH may capture specific relational aspects beneficial for KNN but lacks the broader applicability needed for more complex models such as XGBoost.

Overall, our findings suggest that the RDF2Vec algorithm is the optimal choice for generating embeddings to augment data across various models, due to its ability to enhance relevant feature representation. Node2Vec, in turn, is particularly advantageous for KNN, emphasising the need to carefully select embedding algorithms based on the specific model and approach used to ensure the most effective performance enhancement.

Investigating the performance of different ML models across various approaches

Table 8 presents the average F2 scores for various ML models in kidney disease prediction, showing the impact of different approaches for combining tabular data with embeddings, averaged across multiple embedding methods. KNN showed the most notable improvements when augmented with KG embeddings across different approaches, likely due to its weaker baseline performance compared to the rest and the suitability of distance-based metrics and dimensionality reduction for this model. Excluding cases where only embeddings were used for training, NN generally maintained its performance with only slight drops in some approaches, suggesting that NN’s ability to learn complex patterns is somewhat robust to variations in feature augmentation. SVM, which achieved a perfect F2 score (100%) with only tabular data, retained this performance in EmbedAugTab, DistAugTab and EmbedDistAugTab. Similarly, XGBoost preserved its performance with the four embedding-augmented configurations, though it experienced slight declines in the remaining cases.

8. Conclusions and Future Work

In this article, we proposed several innovative approaches to augment tabular data with semantic information by leveraging ontologies to capture domain semantics as shown in Figure 14. We utilised these ontologies to construct KGs, thereby enriching the datasets with structured ontological information. To make the KGs suitable for ML models, we employed KGEs to transform the graphs into a vector space representation. This process enhances the data used to train ML models by integrating domain-specific semantics, allowing the models to leverage contextual and relational information. Based on our experiment setup, we conducted experiments for heart and kidney disease prediction.

For RQ1, our experiments demonstrated that incorporating KG embeddings, particularly by augmenting tabular data with distance-based features to target classes, improves model performance in most cases. This enhancement is particularly evident in challenging domains such as chronic kidney disease, where accuracy and F2 scores improved despite limited room for improvement, underscoring the value of KG information for refining ML predictions, especially in data-sparse environments.
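The idea of distance-based features to target classes can be sketched as follows: for each data row, the distance between its embedding and the embedding of each target class is appended to the tabular features. This is a minimal NumPy sketch under stated assumptions: the Euclidean metric, the function name `add_distance_features` and the toy arrays are hypothetical and chosen only for illustration; how the row and class embeddings are obtained is outside its scope.

```python
import numpy as np


def add_distance_features(X_tab, row_emb, class_emb):
    """Append each row's Euclidean distance to every class embedding.

    X_tab:     (n_rows, n_features) tabular features
    row_emb:   (n_rows, dim) one embedding per data row
    class_emb: (n_classes, dim) one embedding per target class
    """
    # Broadcast to (n_rows, n_classes, dim), then take the norm over dim
    diffs = row_emb[:, None, :] - class_emb[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return np.hstack([X_tab, dists])


# Toy example (assumption): two rows, two classes, two-dimensional embeddings
X_tab = np.array([[63.0, 1.0], [45.0, 0.0]])
row_emb = np.array([[0.0, 0.0], [3.0, 4.0]])
class_emb = np.array([[0.0, 0.0], [3.0, 4.0]])

X_aug = add_distance_features(X_tab, row_emb, class_emb)
print(X_aug)
```

The augmented matrix keeps the original columns and adds one column per target class, so any of the ML models can be trained on it without further changes.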

For RQ2, our findings indicate that RDF2Vec is the most effective embedding algorithm across models for both heart and kidney disease prediction, given its ability to capture relevant feature representations without performance drops. Node2Vec proved particularly beneficial for KNN in kidney disease prediction, while in heart disease prediction, Node2Vec enhanced XGBoost the most. However, XGBoost exhibited instability across approaches and embedding algorithms in both cases, suggesting the need for careful pairing of embedding methods and models.

For RQ3, on the one hand, SVM showed on average the largest F2 score improvement across multiple approaches for heart disease prediction. On the other hand, for kidney disease prediction, KNN showed the largest performance gains when enhanced with KG embeddings across various approaches, likely due to its weaker baseline performance and the suitability of distance-based metrics and dimensionality reduction, which complement KNN’s neighbour-based approach.

Future work will explore the effectiveness of KGs across diverse domains, particularly those with limited data, by augmenting sparse datasets to address the data dependency issues in ML models. Additionally, we plan to assess the scalability of our methods based on data size and structure and experiment with more complex ML models to further optimise the integration of KG embeddings. Furthermore, we aim to explore alternative embedding models and investigate methods for mapping literals into the embedding space to evaluate their impact on model performance.

Footnotes

Acknowledgements

This work was supported by the FFG SENSE (894802) and FAIR-AI (904624) projects, as well as by the Austrian Science Fund (FWF) BILAI 10.55776/COE12 and HOnEst (V 745-N) projects. For open access purposes, the author has applied a CC BY public copyright license to any author accepted manuscript version arising from this submission.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

ORCID iDs

Majlinda Llugiqi

Fajar J Ekaputra

Marta Sabou

Notes

Appendix A: Additional Experimental Analysis

Appendix B: Additional Results

Tables 11 and 12 present the average recall (with standard deviation across different vector sizes) for various models, approaches and embedding methods in heart disease and kidney disease prediction, respectively.

Domain	KG	Dimensions	Node2Vec walk length	Node2Vec walks	Node2Vec window	RDF2Vec depth	RDF2Vec walks/node	RDF2Vec window	TransH & DistMult params
Heart	Small	[64,128,100]	40	200	5	4	100	5	default
Heart	Extended	[64,128,100]	60	200	10	6	150	10	default
Heart	Snomed	[64,128,100]	50	200	7	5	100	7	default
Kidney	Snomed	[64,128,100]	50	200	7	10	100	7	default

Semantic-Based Data Augmentation for Machine Learning Prediction Enhancement
