Abstract
With the rising popularity of user-generated genealogical family trees, new genealogical information systems have been developed. State-of-the-art natural question answering algorithms use deep neural network (DNN) architectures based on self-attention networks. However, some of these models use sequence-based inputs and are not suited to graph-based structures, while graph-based DNN models rely on a level of knowledge-graph comprehensiveness that is nonexistent in the genealogical domain. Moreover, these supervised DNN models require training datasets that are absent in the genealogical domain. This study proposes an end-to-end approach for question answering using genealogical family trees by: (1) representing genealogical data as knowledge graphs, (2) converting them to texts, (3) combining them with unstructured texts, and (4) training a transformer-based question answering model. To evaluate the need for a dedicated approach, the fine-tuned model (Uncle-BERT), trained on the auto-generated genealogical dataset, was compared with state-of-the-art question answering models. The findings indicate significant differences between answering genealogical questions and open-domain questions. Moreover, the proposed methodology reduces complexity while increasing accuracy and may have practical implications for genealogical research and real-world projects, making genealogical data accessible to experts as well as the general public.
Introduction
The popularity of “personal heritage”, user-generated genealogical family tree creation, has increased in recent years, driven by new digital services such as online family tree sharing sites, family tree creation software, and even self-service DNA analysis by companies like Ancestry and My Heritage. These genealogical information systems allow users worldwide to create, upload, and share their family trees in a semi-structured graph format named GEDCOM (GEnealogical Data COMmunication).
As humans, we are accustomed to asking questions and receiving answers from others. However, standard search engines and information retrieval (IR) systems require users to find answers from a list of documents. For example, for the question “How many children does Kate Kaufman have?”, the system will retrieve a list of documents containing the words “children” and “Kate Kaufman”. Unlike search engines and IR systems, natural question answering algorithms aim to provide precise answers to specified questions [48]. Thus, if a user is searching a genealogical database for the family tree of Kate Kaufman, the system should return a precise answer (e.g., the number of her children) rather than a list of documents.
DNN models for open-domain natural question answering achieved high accuracy in multiple studies [15,103,117,119–121,128]. Training DNN models for question answering requires a golden standard dataset constructed from questions, answers, and corresponding texts from which these answers can be extracted. An extensive golden standard dataset for the natural question answering task widely used for training such models is Stanford Question Answering Dataset (SQuAD) [91,92]. However, in the field of genealogy, there are no standard training datasets of questions and answers similar to SQuAD.
Generating a genealogical training dataset for question answering DNN is challenging, since genealogical data constitutes a semi-structured heterogeneous graph. It contains a mix of a structured graph and unstructured texts with multiple nodes and edge types, where nodes may include structured data on a specific person node (e.g., person’s birthplace), structured data on a specific family node (e.g., marriage date), relations between nodes, and unstructured text sequences (e.g., bio notes of a person). Such a mix of structured heterogeneous graph data and unstructured text sequences is not the type of input that state-of-the-art models, like BERT [22] and other sequence-based DNN models, are designed to work with.
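To make the structure concrete, the semi-structured heterogeneous graph described above can be sketched as a small in-memory data model. This is an illustrative sketch only: the class and field names below are assumptions, not part of the GEDCOM standard or the paper's implementation.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory model of a GEDCOM-style heterogeneous graph:
# person and family nodes, typed links, plus unstructured note text.
@dataclass
class Person:
    pid: str
    attrs: dict = field(default_factory=dict)   # structured data, e.g. {"NAME": ...}
    notes: list = field(default_factory=list)   # unstructured bio text
    famc: list = field(default_factory=list)    # families where this person is a child
    fams: list = field(default_factory=list)    # families where this person is a spouse/parent

@dataclass
class Family:
    fid: str
    attrs: dict = field(default_factory=dict)   # structured data, e.g. a marriage date
    parents: list = field(default_factory=list) # person ids
    children: list = field(default_factory=list)

# The SP and family ids follow the paper's example; the other ids are illustrative.
sp = Person("@I137@", attrs={"NAME": "Emily Williams"}, famc=["@F1@"], fams=["@F4@"])
f1 = Family("@F1@", parents=["@I1@", "@I2@"], children=["@I137@", "@I7@", "@I8@"])
```

A real pipeline would populate such objects from parsed GEDCOM records; here only the person–family–person linkage pattern matters.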
Therefore, the main objective of the proposed study is to design and empirically validate an end-to-end pipeline and a novel methodology for question-answering DNN using graph-based genealogical family trees combined with unstructured texts.
The research questions addressed in this study are:
What is the effect of the training corpus domain (i.e., open-domain vs. genealogical data) and the consanguinity scope on the accuracy of neural network models in the genealogical question answering task?
How to traverse a genealogical data graph while preserving the meaning of the genealogical relationships and family roles?
What is the effect of the question type on the DNN models’ accuracy in the genealogical question answering task?
The main contributions of the study are:
A new automated method for question answering dataset generation derived from family tree data, based on the knowledge graph representation of genealogical data and its automatic conversion into free text; a new graph traversal method for genealogical data; and a fine-tuned question answering DNN model for the genealogical domain, Uncle-BERT, based on BERT.
Related work
This section covers related work in the fields relevant to this research: genealogical family trees, neural network architecture, and question answering using neural networks.
Genealogical family trees
Genealogical family trees have become popular in recent years. Both non-profit organizations and commercial companies allow users worldwide to upload and update their family trees online. For example, commercial enterprises like Ancestry and My Heritage collect over 100 million family trees.
The de facto standard in the field of genealogical family trees is the GEDCOM format [37,57]. The standard was developed by The Church of Jesus Christ of Latter-day Saints in 1984, and the latest released version (5.5.1), drafted in 1999 and fully released in 2019, still dominates the market [43]. Other standards have been suggested as replacements, but none were extensively adopted by the industry. GEDCOM is an open format with a simple lineage-linked structure, in which each record relates to either an individual or a family, and relevant information, such as names, events, places, relationships, and dates, appears in a hierarchical structure [37]. There are several open online GEDCOM databases, including GenealogyForum [36] and WikiTree.

Fig. 1. Relation degrees in genealogy.
In the GEDCOM format, every person (individual) in the family tree is represented as a node that may contain known attributes, such as first name, last name, birth date and place, death date and place, burial date and place, notes, occupation, and other information. Two individuals are not linked to one another directly; each individual is linked to a family node as a “spouse” (i.e., a parent) or a “child” in the family. Figure 2 shows a sub-graph corresponding to a Source Person (SP) whose data is presented in the GEDCOM file in Fig. 3. Each individual and family is assigned a unique ID – a number bracketed by @ symbols and a class name (INDI – individual, FAM – family). The source person is noted as SP (@I137@ INDI – Emily Williams in the GEDCOM file), families as F, and other persons as P. In this example, P3, P4, P5, and P6 are the grandparents of SP, and P1 and P2 are the parents of SP.

Fig. 2. Family tree structure.
Wyoming, USA, and was buried three days later in the same place. He was baptized on 9 AUG 1877, although a note states that it may have been on 12 AUG 1877, and he was endowed with his wife. For practical reasons, the GEDCOM file example in Fig. 3 contains only a small portion of the data presented in Fig. 2.

Fig. 3. (Part of the) GEDCOM family tree file.
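Each line of a GEDCOM file follows the lineage-linked pattern `LEVEL [@XREF@] TAG [VALUE]`. The following is a minimal parsing sketch under that assumption; real files also need character-set handling and CONT/CONC continuation lines, which are omitted here.

```python
def parse_gedcom_lines(text):
    """Parse lineage-linked GEDCOM lines into (level, xref_id, tag, value)
    tuples. A minimal sketch, not a full GEDCOM 5.5.1 parser."""
    records = []
    for raw in text.strip().splitlines():
        parts = raw.strip().split(" ", 2)
        level = int(parts[0])
        if len(parts) > 1 and parts[1].startswith("@"):
            # record header, e.g. "0 @I137@ INDI"
            xref, rest = parts[1], parts[2] if len(parts) > 2 else ""
        else:
            xref, rest = None, " ".join(parts[1:])
        tag, _, value = rest.partition(" ")
        records.append((level, xref, tag, value))
    return records

# Sample lines modeled on the paper's example (the NAME value is illustrative).
sample = """0 @I137@ INDI
1 NAME Emily /Williams/
1 FAMC @F1@"""
parsed = parse_gedcom_lines(sample)
```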
A DNN is a computational mathematical model that consists of several “neurons” arranged in layers. Each neuron performs a computational operation and transmits the computed information (calculation result) to the neurons in the next layer. The information is passed on and transformed from layer to layer until it becomes the output in the network’s last layer. The conventional learning method is backpropagation, which treats learning as an optimization problem [123]. After each training cycle, the network prediction (output) is compared with the actual expected result, and a “loss” (i.e., the gap) is calculated to estimate the changes needed in the network operations (the weights of the neurons’ transformations). Changes in the network weights are usually performed using gradient descent methods [7].
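The loss-and-update loop described above can be illustrated with a minimal one-weight example: plain gradient descent on a squared loss. This is purely didactic and is not the paper's training setup.

```python
# Fit y = w * x to data by gradient descent on the mean squared loss.
def train(xs, ys, lr=0.01, epochs=200):
    w = 0.0
    for _ in range(epochs):
        # "loss" gradient: d/dw of mean((w*x - y)^2)
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # adjust the weight against the gradient (the backpropagation step
        # generalizes this update to every weight in every layer)
        w -= lr * grad
    return w

w = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # true weight is 2
```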
In recent years, DNNs have become the state-of-the-art method for text analysis in the cultural heritage space [111], and natural language question answering systems based on DNNs have become the state-of-the-art method for solving the question answering task [62]. The underlying task of question answering is Machine Reading Comprehension (MRC), which allows machines to read and comprehend a specified context passage for answering a question, similarly to language proficiency exams. Question answering, on the other hand, aims to answer a question without a specific context. These QA systems store a database containing a sizeable unstructured corpus and generate the context in real time from text passages relevant to the input question [139]. Due to the magnitude of comparisons needed between the query and each text passage in the corpus, and due to the number of calculations (a large number of multiplications of vectors and matrices) when a DNN model predicts the answer span for every given text passage, DNNs are not applied on the entire database of texts, but only on a limited number of passages. Hence, when a user asks a question, the system first searches for a limited set of passages relevant to the question. A common approach for finding relevant passages is reverse indexing [11,54,55,72,105].
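A reverse (inverted) index of this kind can be sketched in a few lines: map each term to the passages containing it, then take the union of the postings for the question's terms. The toy passages and the naive tokenization below are illustrative assumptions.

```python
from collections import defaultdict

passages = [
    "Kate Kaufman had three children.",
    "John Williams was born in Wyoming.",
]

# Build the inverted index: term -> set of passage ids.
index = defaultdict(set)
for i, passage in enumerate(passages):
    for term in passage.lower().rstrip(".").split():
        index[term].add(i)

def candidates(question):
    """Return the ids of passages sharing at least one term with the question."""
    terms = question.lower().rstrip("?").split()
    hits = [index[t] for t in terms if t in index]
    return set.union(*hits) if hits else set()

result = candidates("How many children does Kate Kaufman have?")
```

Only the retrieved candidates (here, passage 0) would then be passed to the expensive DNN reader.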

Fig. 4. Typical open-domain question answering pipeline.
Over the years, different deep learning layers have been developed with various abilities. Until recently, the typical architecture for natural language question answering was based on Recurrent Neural Networks (RNNs), such as Long Short-Term Memory (LSTM) [49] and Gated Recurrent Unit (GRU) layers [17]. RNN layers allow the network to “remember” previously calculated data and thus learn answers regarding an entire sequence. These layers are used to construct different models, including the sequence-to-sequence model [113] that uses an encoder-decoder architecture [17] that fits the question answering task. This model maps a sequence input to a sequence output, e.g., a document (sequence of words) and a question (sequence of words) to an answer (sequence of words), or classifies words (whether a word is the start or the end of the answer). RNN architectures often process sequences in both direct and reverse order (bidirectional RNN) [97] and may also include an attention mechanism [116], which “decides” (i.e., ranks) which parts of the sequence are more important than others during the transformation of a sequence from one layer to another.
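The attention mechanism can be reduced to a few lines: a query scores every key, and the softmax of the scores weights the values. This is a didactic single-query, scaled dot-product sketch, not a full multi-head implementation.

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector over a list of
    key vectors and their value vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # softmax over the scores (numerically stabilized)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # output is the weight-averaged value vector
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

# The query matches the first key more closely, so the first value dominates.
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0], [20.0]])
```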
Another typical architecture is based on Convolutional Neural Networks (CNNs). Unlike RNNs, the CNN architecture does not have a memory state that accumulates information from the sequence data. The CNN architecture uses pre-trained static embeddings, where each CNN channel aggregates information from the vectorial representation. Channels of different sizes enable it to deal with n-gram-like information in a sentence [58].
The question answering task can also be modeled as a graph task (e.g., traversal, subgraph extraction). The data can be represented as a knowledge graph (KGQA), where each node is an entity, and each edge is a relation between two entities. When answering the question, the algorithm finds the entities relevant to the question and traverses over the relations, or uses the nodes’ attributes, to find the answer node or attribute [13,24,135]. To work with graphs, Graph Neural Network (GNN) [95] models have been developed that operate directly on the graph structure. A GNN can be used for resolving answers directly from a knowledge graph by predicting an answer node from question nodes (i.e., entities) [29,39,73,81,102,135]. The GNN model is similar to an RNN in the sense that it uses neighboring nodes and relations (instead of the previous and next tokens in an RNN) to classify (i.e., label) each node. However, these models cannot directly work with unstructured or semi-structured data, or they rely on the ability to complete and update the knowledge graph from free texts using knowledge graph completion tasks, such as relation extraction [8,83,129] or link prediction [32,53].
An improved approach, considered to be the state-of-the-art in many NLP tasks including question answering, is the Transformer architecture [116], which uses the attention mechanism with feed-forward layers (not RNNs); this kind of attention is also called a Self-Attention Network (SAN). Well-known examples of SANs are the Bidirectional Encoder Representations from Transformers (BERT) [22] and GPT-2 [90] models. Several BERT-based models were developed in recent years [126], achieving state-of-the-art performance (accuracy) in different question answering tasks. These include RoBERTa – a BERT model with tuned hyperparameters and training data size [71]; DistilBERT – a smaller, faster, and lighter version of BERT [94]; and ELECTRA – a BERT-like model with a different training approach [18]. Although standard BERT-based models receive a textual sequence as input, all the above architectures can also be combined. For example, a Graph Convolutional Network (GCN) [115] can be utilized for text classification by modeling the text as a graph and using the filtering capabilities of a CNN [132].
There are several question-answering DNN pipelines based on knowledge graphs that support semi-structured data (a mix of a structured graph and unstructured texts) [29,41,135,138]. As shown in Fig. 5, a current state-of-the-art pipeline of this type, Deciphering Entity Links from Free Text (DELFT) [135], uses the knowledge graph to extract related entities and sentences, filters possible textual sentences using BERT, and then traverses a filtered subgraph using a GNN. The pipeline starts with identifying the entities in the question. Then, related entities (“candidates”) from the knowledge graph and relevant sentences (“evidence relations”) from unstructured texts are extracted and filtered using BERT. A new subgraph is generated using the question entities, the filtered evidence relations, and the candidate entities. Using this subgraph, a GNN model learns to rank the most relevant node. Thus, the model obtains a “trail” from the question nodes to a possible candidate node (i.e., answer). The pipeline applies two DNN models: a BERT model to rank the evidence relations and a GNN model to traverse the graph (i.e., predict the answer node).

Fig. 5. Typical knowledge graph question answering pipeline.
However, these methods, which use unstructured texts to create or complete the knowledge graph, rely heavily on well-defined semantics and fail to handle questions with entities completely outside the knowledge graph or questions that cannot be modeled within it. For example, a Differentiable Neural Computer (DNC) [39] can be used to answer traversal questions (“Who is John’s great-great-grandfather?”), but not content-related questions whose answer is written in the person’s bio notes (e.g., “When did John’s great-great-grandfather move to Florida?”). As part of the evaluation experiments in this study, the performance of the above-mentioned DELFT pipeline, adapted to the genealogical domain, was compared to that of the proposed pipeline.
In summary, the generic question answering pipelines described above cannot be applied as-is in the genealogical domain without compromising accuracy, for the following reasons: (1) the raw data is structured as graphs, and each graph contains more information than a DNN model can handle in a single inference process (each node is equivalent to a document); (2) a user may ask about different nodes and different scopes of relations (i.e., different genealogical relation degrees); and (3) there is a high number of nodes containing a relatively small volume of structured data and a relatively large volume of unstructured textual data. In addition, the vast number of different training approaches, hyperparameter tuning options, and architectures indicates the complexity of the models and their sensitivity to a specific domain and sub-task.
The question answering approach proposed in this study simplifies the task pipeline by converting the genealogical knowledge graph into text, which is then combined with unstructured genealogical texts and processed by BERT’s contextual embeddings. Converting the genealogical graph into text passages can be performed using knowledge-graph-to-text templates and methodologies [21,26,56,77,124], or knowledge-graph-to-text machine learning and DNN models [5,34,64,67,69,79,80,100,107]. Template-based knowledge-graph-to-text methods use hardcoded or extracted linguistic rules or templates to convert a subgraph into a sentence. Machine learning and DNN models can be trained to produce a text from knowledge-graph nodes. The input for a knowledge-graph-to-text model is a list of triples, each consisting of two nodes and their relation, and the output is a text passage containing the input nodes and their relations as syntactic natural language sentences. To this end, DNN models are often trained using commonsense knowledge graphs of facts, such as ConceptNet [108], BabelNet [84], DBpedia [3], and Freebase [86], where nodes are entities and edges represent the semantic relationships between them. Some models use the fact that knowledge graphs are language-agnostic to generate texts in multiple languages (e.g., [80]).
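A template-based knowledge-graph-to-text step of the kind cited above can be sketched as follows. The relation names, sentence templates, and the second triple's object are illustrative assumptions, not the paper's actual templates.

```python
# Map each relation type to a sentence template; render triples through it.
TEMPLATES = {
    "born_in": "{subj} was born in {obj}.",
    "child_of": "{subj} is the child of {obj}.",
    "married_to": "{subj} married {obj}.",
}

def triples_to_text(triples):
    """Render (subject, relation, object) triples as a text passage."""
    return " ".join(TEMPLATES[rel].format(subj=s, obj=o) for s, rel, o in triples)

passage = triples_to_text([
    ("Emily Williams", "born_in", "Wyoming"),
    ("Emily Williams", "child_of", "John Williams"),
])
```

A production system would add many templates per relation (and paraphrasing) so the downstream model sees varied surface forms of the same fact.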
Training a DNN question answering model requires a set of text passages and corresponding pairs of questions and answers. Multiple approaches exist for generating questions (and answers): a knowledge-graph-to-question template-based methodology (similar to the context generation) [68,99,137,141], a rule-based approach for WH questions (e.g., Where, Who, What, When, Why) [81], knowledge graph-based question generation [16,51], and DNN-based models for generating additional types of questions [25,50,118,136]. The rule-based method parses sentences into part-of-speech trees using the Stanford Parser [60], applies a tree query language and tree manipulation [66], and uses a set of rules to simplify and transform the sentences into questions. To guarantee question quality, questions are ranked by a logistic regression model for question acceptability [45]. The DNN question generation models are trained on SQuAD [91,92] or on facts from a knowledge graph to predict the question and its correct answer from the context (i.e., the opposite task from question answering) using a bidirectional [97] LSTM [49] encoder-decoder [17] model with attention [116].
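In the same spirit, rule-based WH-question generation can be sketched as a mapping from fact types to question templates plus the expected answer span. The templates here are illustrative assumptions, not the paper's rules.

```python
# Each fact type maps to (question template, answer template).
Q_TEMPLATES = {
    "born_in": ("Where was {subj} born?", "{obj}"),
    "child_of": ("Who is {subj}'s parent?", "{obj}"),
}

def generate_qa(triples):
    """Generate (question, answer) pairs from (subject, relation, object) triples."""
    qas = []
    for s, rel, o in triples:
        if rel in Q_TEMPLATES:
            q_tpl, a_tpl = Q_TEMPLATES[rel]
            qas.append((q_tpl.format(subj=s), a_tpl.format(obj=o)))
    return qas

qas = generate_qa([("Emily Williams", "born_in", "Wyoming")])
```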
This study adopted the format of the SQuAD dataset, which is a well-known benchmark for machine learning models on question answering tasks with a formal leaderboard.
SQuAD 2.0 is a JSON-formatted dataset, presented in Fig. 6, where each topic (a Wikipedia article) has a list of paragraphs, and each paragraph contains a context passage and a set of question-answer pairs (some marked as unanswerable).

Fig. 6. SQuAD 2.0 JSON format example.
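A minimal record in the SQuAD 2.0 structure can be built as follows. The field names follow the published SQuAD 2.0 schema; the topic, context, and answer values are illustrative.

```python
import json

context = "Emily Williams was born in Wyoming."
record = {
    "version": "v2.0",
    "data": [{
        "title": "Emily Williams",
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": "q1",
                "question": "Where was Emily Williams born?",
                "is_impossible": False,  # SQuAD 2.0 also allows unanswerable questions
                "answers": [{
                    "text": "Wyoming",
                    # answer_start is a character offset into the context
                    "answer_start": context.index("Wyoming"),
                }],
            }],
        }],
    }],
}
encoded = json.dumps(record)
```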
While using DNNs for the open-domain question answering task has become the state-of-the-art approach, automated question answering for genealogical data is still an underexplored field of research. This paper presents a new methodology for a DNN-based question answering pipeline for semi-structured heterogeneous genealogical knowledge graphs. First, a training corpus that captures both the structured and unstructured information in genealogical graphs is generated. Then, the generated corpus is used to train a DNN-based question answering model.
Gen-SQuAD generation and graph traversal
The first phase in the proposed methodology is to generate a training dataset by encoding the graph data as text sequences using a graph traversal algorithm. This dataset should contain questions with answers and free text passages from which the model can retrieve these answers.
Generating a training dataset from genealogical data is a three-step process. The result of the process is Gen-SQuAD, a SQuAD 2.0 format dataset tailored to the genealogical domain. As shown in Fig. 7, the process includes the following steps: (1) decomposing the GEDCOM graphs into CIDOC-CRM-based knowledge sub-graphs, (2) converting each sub-graph into text passages, and (3) generating question-answer pairs from the sub-graphs and the texts.

Fig. 7. Gen-SQuAD generation.
While there are some DNN models that can accept large inputs [9,59], due to computational resource limitations, many DNN models accept only limited-size inputs, usually ranging from 128 to 512 tokens (i.e., words) [33]. However, family trees tend to hold a lot of information, from names, places, and dates to free-text notes, life stories, and even manifests. Therefore, using the proposed methodology, it is not practical to build a model that will read an entire family tree as an input (sequence), and it is necessary to split the family tree into sub-trees (sub-graphs). Several generic graph traversal algorithms may be suitable for traversing a graph and extracting sub-graphs, such as Breadth-First Search (BFS) and Depth-First Search (DFS). BFS’s scoping resembles a genealogical exploration process that first treats relations between individuals at the same depth level (relation degree) in the family tree, moving from the selected node’s level to the outer levels. However, the definition of relation degrees in genealogy (i.e., consanguinity) differs from the pure graph-theoretic definition implemented in BFS [12]. For example, parents are considered first-degree relations in genealogy (based on the ontology), while mathematically they are second-degree relations, since there is a family node between the parent and the child (i.e., the parent and the child are not connected directly); siblings are considered second-degree relations in both genealogy and graph theory. Combined BFS-DFS algorithms such as Random Walks [40] do not take domain knowledge into account and sample nodes randomly. In the genealogical research field, several traversal algorithms have been suggested for user interface optimization [57]. However, these algorithms aim to improve interfaces and user experience and are not suitable for complete data extraction (graph-to-text) tasks.
This paper presents a new traversal algorithm, Gen-BFS, which is essentially the BFS algorithm adapted to the genealogical domain. The Gen-BFS algorithm is formally defined as follows.
Each node is either a Person or a Family. Each Person node has two link (edge) types: famchild (FAMC in the GEDCOM standard) and famparent (FAMS in the GEDCOM standard); each Family node has the opposite edge types: childfam and parentfam. Here, {famchild} is the collection of all families in which a person is considered a child (biological and adoptive families), {famparent} is the collection of all families in which a person is a parent (spouse) (i.e., all the person’s marriages), {childfam} is the collection of all persons considered children in a family, and {parentfam} is the collection of all persons considered parents in a family. For example, the SP in Fig. 2 is linked to two nodes: the link to F1 has type famchild, and the link to F4 has type famparent. Family F1 in Fig. 2 has two link types: the links to SP, P7, and P8 have type childfam, and the links to P1 and P2 have type parentfam.
Figure 8 illustrates the Gen-BFS traversal applied to the family tree presented in Fig. 2. As shown in Fig. 8, Gen-BFS is aware of the genealogical meaning of the nodes and reduces the tree traversal’s logical depth. It ignores families in terms of relation degree, considers SP’s spouses to be of the same degree as SP, SP’s parents and children as first degree, and siblings and grandparents as second degree. In particular, lines 1–20 in Algorithm 1 represent a BFS-style traversal over the graph. In lines 5–8, the algorithm introduces domain knowledge and adds nodes to its queue according to the node type. The code in lines 9–17 ensures that the traversal will stop at the desired depth level. If the current node is a Person (line 12) and the current depth (CD) is about to exceed the required depth (D), then the while loop will end (line 14). Otherwise, the Persons and Families at the current depth (kn) will be added to the node queue (NQ) and may (depending on the stop mechanism) be added to the depth queue (DQ). In line 21, the depth queue (DQ) holds all the Family nodes and most of the Person nodes (except for spouses of the last depth level’s Person nodes) within the desired depth level. For example, traversing with a required depth of two returns SP and SP’s spouses (zero degree), SP’s parents and children (first degree), and SP’s siblings and grandparents (second degree).
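The degree semantics just described can be sketched as a BFS in which family nodes add no depth: moving to a spouse keeps the current degree, while moving to a parent or a child adds one degree, so siblings and grandparents naturally land at degree two. This is a simplified illustration of the Gen-BFS idea, not the paper's Algorithm 1 verbatim.

```python
from collections import deque

def gen_bfs(persons, families, sp, max_degree):
    """Return {person_id: genealogical degree} for all persons within
    max_degree of the source person sp."""
    seen = {sp: 0}
    queue = deque([sp])
    while queue:
        pid = queue.popleft()
        d = seen[pid]
        moves = []
        for fid in persons[pid]["fams"]:  # families where pid is a parent
            moves += [(q, d) for q in families[fid]["parents"] if q != pid]  # spouses: +0
            moves += [(q, d + 1) for q in families[fid]["children"]]         # children: +1
        for fid in persons[pid]["famc"]:  # families where pid is a child
            moves += [(q, d + 1) for q in families[fid]["parents"]]          # parents: +1
        for q, nd in moves:
            if nd <= max_degree and (q not in seen or nd < seen[q]):
                seen[q] = nd
                queue.append(q)
    return seen

# Tiny illustrative graph: SP is a child of family F1; P7 is a sibling.
persons = {
    "SP": {"famc": ["F1"], "fams": []},
    "P1": {"famc": [], "fams": ["F1"]},
    "P2": {"famc": [], "fams": ["F1"]},
    "P7": {"famc": ["F1"], "fams": []},
}
families = {"F1": {"parents": ["P1", "P2"], "children": ["SP", "P7"]}}
degrees = gen_bfs(persons, families, "SP", 2)
```

Note how the sibling P7 is reached via a parent (one hop up, one hop down) and therefore ends up at degree two, matching the consanguinity definition rather than the raw graph distance.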

Gen-BFS algorithm.
An algorithm step is noted as S. The degree of relation is noted as D. Relations are color-coded as follows: Zero-degree relation (self) – turquoise, First-degree relations – black, and Second-degree relations – brown.

Gen-BFS algorithm.
Once extracted, each genealogical sub-graph was represented as a knowledge graph. This study adopted an event-based approach to data modeling presented in past literature [2,31,114]. As in [114], a formal representation of the GEDCOM heterogeneous graph (excluding the unstructured texts) as a knowledge graph was implemented using CIDOC-CRM, but in a more specific manner (e.g., using concrete events and properties, such as a person’s birth event).

Fig. 9. GEDCOM individual’s knowledge graph in the CIDOC-CRM-based format.
Figure 9 is an example of a representation of the GEDCOM sub-graph as a knowledge graph. As illustrated in the figure, the SP node is an instance of the class Person and has a relation (property) to a birth event.
Next, a textual passage is generated from each sub-graph, representing the sub-graph’s structured facts in natural language.
Using a knowledge-graph-to-text DNN model [69] and a knowledge-graph-to-text template methodology [77], multiple variations of sentences conveying the same facts (comprised of the same nodes and edges in the graph) were composed based on different templates, combined with sentence paraphrasing using a DNN-based model [64]. Most of the text passages were generated using the DNN model; however, the template-based method added variations that the DNN model did not capture. Table 1 presents examples of such sentences created for the sub-graph in Fig. 9.
Another critical challenge resolved by this approach is the multi-hop question answering problem, where the model needs to combine information from several sentences to answer the question. Although there are multi-hop question answering models presented in the literature [30,75], their accuracy is significantly lower than that of single-hop question answering. To illustrate the problem, consider a user asking about the SP’s (John’s) grandfather: “Where was John’s grandfather born?” or “Where was Tim Cohen born?”, where Tim Cohen refers to John’s grandfather. To answer both questions without multi-hop reasoning for resolving multiple references to the same person, the graph-to-text template-based rules include patterns that encapsulate both the relationship type to the SP (John’s grandfather) and the relative’s name (Tim Cohen), thus allowing the model to learn that Tim Cohen is John’s grandfather. There are three types of references to a person that allow the DNN model to resolve single- or multi-hop questions: (1) direct referencing of a person by his/her first and last name (e.g., John Williams), (2) partial referencing of a person by his/her first or last name (e.g., John), and (3) multi-hop encapsulation, i.e., referencing a person by their relation to the SP (e.g., Alexander’s son).
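The third referencing type, multi-hop encapsulation, can be illustrated with a trivial template that names both the relative's relationship to the SP and the relative's own name, so a single-hop model can resolve either phrasing. The template and the birthplace fact are illustrative assumptions.

```python
def encapsulated_sentence(sp_name, relation, relative_name, fact):
    """Render a fact so that both the relative's name and their relation
    to the SP appear in the same sentence."""
    return f"{relative_name}, {sp_name}'s {relation}, {fact}."

sentence = encapsulated_sentence("John", "grandfather", "Tim Cohen",
                                 "was born in Wyoming")
```

A model trained on such sentences can answer both “Where was John's grandfather born?” and “Where was Tim Cohen born?” from the same passage without chaining separate facts.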
As a result of the above processing, multiple text passages were created for each SP’s sub-graph. Since each sentence is standalone and contains one fact, sentences were randomly ordered within each text passage. Thus, even if the passage is longer than the neural model’s computing capability, the model will likely encounter all types of sentences during its training process. These text passages were further encoded as vectors (i.e., embeddings) to train a DNN model that learns contextual embeddings to predict the answer (i.e., start and end positions in the text passage) for a given question.
Table 1. Genealogical-knowledge-graph-to-text context template example
Using the generated text passages (contexts), pairs of questions and answers were created. The answers were generated first, and then the corresponding questions were built for them as follows. Knowledge graph nodes and properties (relationships), as well as named entities and other characteristic keywords extracted from free text passages, were used as answers. To achieve extensive coverage, multiple approaches were used for question generation. First, a rule-based approach was applied for question generation from knowledge graphs [141], and a statistical question generation technique [45] was utilized for WH question generation from the unstructured texts in GEDCOM.
Most of the questions (73%) were created using these methods. To identify the types of questions typical of the genealogical domain and define rule-based templates for their automatic generation, this study examined the genealogical analysis tasks that users tend to perform on genealogical graphs [10]. These tasks include: (1) identifying the SP’s ancestors (e.g., parents, grandparents) or descendants (e.g., children, grandchildren), (2) identifying the SP’s extended family (second-degree relations), (3) identifying family events, such as marriages, (4) identifying influential individuals (e.g., by occupation, military rank, academic achievements, number of children), and (5) finding information about dates and places, such as the date of birth and place of marriage [4,10]. These analysis tasks were adopted to define characteristic templates for natural language questions that a user may ask about the SP or its relatives. Some of these questions can be answered directly from the structured knowledge graph (e.g., “When was Tim’s father born?”), while others can only be answered using the unstructured texts attached to the nodes (e.g., “Did Tim’s father have cancer?”).
A DNN-based model for generating additional types of questions [25] was used to complement the rule-based method. The neural question generation model predicted questions from all the unstructured texts in the GEDCOM data and produced 24% of the questions in the dataset (excluding duplicate questions already created using the WH-based and rule-based approaches).
Table 2. Knowledge-graph-to-text question template examples
Finally, additional rules were manually compiled using templates [1,28] to create questions missed by the previous methods, mainly quantitative and yes-no questions (as illustrated in Table 2). These questions constituted 3% of all the questions in the datasets. All answer indexes were tested automatically to ensure that the answer text exists in the context passage. A random sample of 120 questions was tested manually by the researchers as a quality control process, and the observed accuracy was virtually 100%. However, it is still possible that the DNN generated some errors. Nevertheless, even in this case, the study’s conclusions would not change, as such errors would have a similar effect (same embeddings) on all the tested models.
Fine-tuning a DNN model is the process of adapting a model that was trained on generic data to a specific task and domain [22]. An initial DNN model is usually designed and trained to perform generic tasks on large domain-agnostic texts, like Wikipedia. In the case of open-domain question answering, the BERT baseline model was pre-trained on English Wikipedia and the Books Corpus [140] using the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives [22]. The MLM methodology is a self-supervised dataset generation method: for each input sentence, one or more tokens (words) are masked, and the model’s task is to generate the most likely substitute for each masked token. In this fill-in-the-blank task, the model uses the context words surrounding a mask token to predict what the masked word should be. The NSP methodology is also a self-supervised dataset generation method: the model gets a pair of sentences and predicts whether the second sentence follows the first one in the dataset. MLM and NSP are effective ways to train language models without annotations as a basis for various supervised NLP tasks. Combining the MLM and NSP training methods allows modeling languages with both word-level and sentence-level relational understanding. The pre-trained BERT-based question answering model was designed with 12 layers, 768 hidden nodes, 12 attention heads, and 110 million parameters. Using such a pre-trained model, DNN layers can be added to fit a specific task [22].
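The MLM dataset-generation step can be sketched as follows: a fraction of the tokens is replaced by a [MASK] symbol, and the originals are kept as prediction targets. This is a didactic sketch; BERT's actual recipe also sometimes keeps or randomly replaces the selected tokens rather than always masking them.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=1):
    """Replace ~mask_rate of the tokens with [MASK]; return the masked
    sequence and a {position: original token} label map."""
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok          # the model's prediction target
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, labels

masked, labels = mask_tokens("emily williams was born in wyoming".split())
```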
As shown in Fig. 10, a new BERT-based model, Uncle-BERT, was fine-tuned for genealogical question answering as follows: (1) adding a pair of output dense layers (vectors) with dimensions of the hidden states in the model (
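The pair of added output layers can be sketched in pure Python: each learned output vector is dotted with every token's hidden state to produce start and end logits, and the highest-scoring valid span (start ≤ end) is selected. The two-dimensional hidden states and weights below are toy values for illustration only; the actual model uses 768-dimensional hidden states and applies a softmax over the logits.

```python
def predict_span(hidden_states, start_w, end_w):
    """Score every (start, end) pair with start/end logits computed as
    dot products against the two output vectors; return the best span."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    start_logits = [dot(h, start_w) for h in hidden_states]
    end_logits = [dot(h, end_w) for h in hidden_states]

    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, len(end_logits)):  # enforce start <= end
            if s + end_logits[j] > best_score:
                best, best_score = (i, j), s + end_logits[j]
    return best

# toy 2-dimensional "hidden states" for a 4-token context
hidden = [[0.1, 0.0], [0.9, 0.2], [0.8, 0.9], [0.0, 0.1]]
span = predict_span(hidden, start_w=[1.0, 0.0], end_w=[0.0, 1.0])
```

The returned indices delimit the predicted answer span within the context tokens.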
Figure 11 summarizes the developed genealogical question answering pipeline. To simplify the task, the proposed architecture asks the user to first select the family tree from the corpus (future research can eliminate this step by embedding the family trees [38] and ranking them based on similarity to the question [93]). As demonstrated in the figure, the family tree corpus (comprised of GEDCOM files) is processed into question answering datasets for different scopes. The process starts when a user selects a specific person from a family tree. Then the user indicates a scope (a genealogical relation degree, as described in Fig. 1) to ask about (e.g., the

The DNN model fine-tuning process.

Genealogical question answering pipeline (the proposed architecture).
This section describes the experimental dataset and training conducted to validate the proposed methodology for the genealogical domain.
Datasets
In this research, 3,140 family trees containing 1,847,224 different individuals from the corpus of the Douglas E. Goldman Jewish Genealogy Center in Anu Museum16
From the filtered GEDCOM files belonging to the above corpus, and after removing some files with parsing or encoding errors, three datasets were generated: Gen-SQuAD0 using zero relation degree (SP and its spouses) with 6,283,082 questions, Gen-SQuAD1 using first-degree relations with 28,778,947 questions, and Gen-SQuAD2 using second-degree relations with 75,281,088 questions. Although all generated datasets contain millions of examples, only 131,072 randomly selected questions were used from each dataset when training the Uncle-BERT models. These were enough for the models to converge. Therefore, the size of the dataset did not impact the training results.
Each dataset was split into a training set (60%), a test set (20%), and an evaluation set (20%). To better evaluate the success of the different question answering models, the 131,072 questions in each dataset were classified into twelve types. Examples of questions and their classification types are shown in Table 3. Each question may refer to the
Question types
For fine-tuning Uncle-BERT,19 A link to the code:

Uncle-BERT model input example.
Figure 12 presents the model's input, where the [CLS] tag, which stands for the classifier token, marks the beginning of the input and is followed by the first part of the input, the question. The [SEP] tag, which stands for separator, separates the first part of the input (i.e., the question) from the second part, the context. A final [CLS] indicates the end of the input.
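Assembling this input is straightforward; the sketch below follows the tagging described above, in which a [CLS] token also terminates the input (note that the reference BERT implementation instead ends the input with a second [SEP] token).

```python
def build_input(question, context):
    """Assemble the model input in the format described above:
    [CLS] question [SEP] context [CLS]."""
    return f"[CLS] {question} [SEP] {context} [CLS]"

example = build_input("When was Matt Adler's father born?",
                      "Matt's father (Noah Adler) was born in 1950.")
```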
To evaluate the effect of the depth of the consanguinity scope on the model’s accuracy, an Uncle-BERT model was trained for each of the three datasets: Uncle-BERT0 using Gen-SQuAD0, Uncle-BERT1 using Gen-SQuAD1, and Uncle-BERT2 using Gen-SQuAD2. All models were trained with the same hyperparameters, that are shown in Table 4.
Uncle-BERT training hyperparameters
Max question tokens is the maximum number of tokens to process from the question input; if the question input was longer than Max question tokens, it was trimmed. Max sequence tokens is the maximum number of tokens to process from the combined context and question inputs.
If the cumulative context and question length was longer than the Max sequence tokens hyperparameter value, the context was split into shorter sub-texts using a sliding window technique; the Doc stride represents the sliding window overlap size. For example, suppose the max sequence tokens hyperparameter is 25 and the doc stride hyperparameter is 6, and consider the following training example: "[CLS] When was Matt Adler's father born? [SEP] Matt's father (Noah Adler) was born in 1950 in London, England. Matt's father (Noah Adler) was a male. Matt's brother (Joanne Adler) was a male. Matt Adler was born in 1975 in London, England. Matt's mother (Carol) was born in 1950. [CLS]". The question contains 7 tokens, leaving 18 tokens for the context. Therefore, the context will be split into three training examples: 1) "[CLS] When was Matt Adler's father born? [SEP] Matt's father (Noah Adler) was born in 1950 in London, England. Matt's father (Noah Adler) was a male [CLS]" (i.e., tokens 1 to 18), 2) "[CLS] When was Matt Adler's father born? [SEP] father (Noah Adler) was a male. Matt's brother (Joanne Adler) was a male. Matt Adler was born in [CLS]" (i.e., tokens 12 to 30), 3) "[CLS] When was Matt Adler's father born? [SEP] a male. Matt Adler was born in 1975 in London, England. Matt's mother (Carol) was born in 1950. [CLS]" (i.e., tokens 24 to 42). The model will be trained with the same question on the three new examples; if the answer span does not exist in an example, it is considered unanswerable.
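The sliding-window split can be sketched generically as follows. This is a simplified, 0-indexed illustration in which consecutive windows overlap by exactly the doc stride; the paper's 1-indexed example may differ slightly at window boundaries.

```python
def sliding_windows(context_tokens, max_context_len, doc_stride):
    """Split a long context into overlapping windows of at most
    max_context_len tokens; consecutive windows overlap by doc_stride."""
    step = max_context_len - doc_stride  # advance per window
    windows, start = [], 0
    while True:
        windows.append(context_tokens[start:start + max_context_len])
        if start + max_context_len >= len(context_tokens):
            break  # last window reached the end of the context
        start += step
    return windows

# 42 context tokens, 18-token windows, overlap of 6 (as in the example above)
tokens = [f"t{i}" for i in range(42)]
wins = sliding_windows(tokens, max_context_len=18, doc_stride=6)
```

Each window would then be paired with the (unchanged) question to form a separate training example.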
Max answer tokens is the maximum number of tokens that a generated answer can contain. Train size is the number of examples used from the dataset during the training cycle.
As is customary with the SQuAD benchmark, an F1 score was calculated to evaluate Uncle-BERT models:
Precision equals the fraction of correct tokens (i.e., tokens that appear in both the predicted and the expected answer) out of the tokens in the retrieved (predicted) answer, and recall equals the fraction of correct tokens out of the tokens in the expected answer. This metric allows measuring both exact and partial answers.
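The token-level F1 described above can be computed as follows. This is a simplified sketch; the official SQuAD evaluation script additionally strips punctuation and articles before comparing tokens.

```python
from collections import Counter

def token_f1(predicted, expected):
    """SQuAD-style token-level F1 between a predicted and an expected answer."""
    pred_tokens = predicted.lower().split()
    gold_tokens = expected.lower().split()
    # multiset intersection counts each shared token at most min(count) times
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_common = sum(common.values())
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, a predicted answer "1950 in London" against the expected answer "1950" yields precision 1/3 and recall 1, i.e., F1 = 0.5, rewarding the partial match.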
To evaluate the accuracy of the proposed fine-tuned models, the Gen-SQuAD2 dataset was used to represent a real-world use-case in which a user is investigating her genealogical roots with the genealogical scope of two relation degrees (generations)20 Similar to Anu Museum user interface -
Figures 13 and 14 show the training loss and F1 scores of each of the three models. As expected, the more complex the context and questions, the lower the F1 score: while the model achieved an F1 score of 99.84 on narrow person contexts and questions (Gen-SQuAD0), it achieved an F1 score of only 80.28 on second-degree genealogical relations (Gen-SQuAD2).
Furthermore, as can be observed in Table 5, compared to the Uncle-BERT2 model (trained with broader contexts of second-degree genealogical relations), the Uncle-BERT0, which was trained using information about the

The three Uncle-BERT models’ training loss.21

The three Uncle-BERT models’ training F1 score.22
Uncle-BERT models F1 score on Gen-SQuAD2
As can be observed in Table 6, the baseline BERT model trained on the open-domain SQuAD 2.0 achieved an F1 score of 83 on the open-domain SQuAD 2.0 dataset [91]. However, on the genealogical domain dataset (Gen-SQuAD2), it achieved a significantly lower F1 score (60.12) than Uncle-BERT2 (81.45). The fact that Uncle-BERT2 achieves a higher F1 score is not surprising, since the model was trained on genealogical data, as opposed to the baseline BERT model trained on open-domain question data. However, when comparing Uncle-BERT2 to Uncle-DELFT2, it is clear that the performance improvement is due to the proposed methodology and not just to richer or domain-specific training data. Moreover, the DELFT method is much more complex than BERT, yet it achieved a lower score even when trained on the same domain-specific data. The fact that the vast majority of entities (found in both the “user” question and the expected answer) exist only in the unstructured data makes it hard for the GNN to find the correct answer (i.e., to complete the graph). This finding emphasizes the uniqueness of the genealogical question answering task compared to open-domain question answering and the need for the end-to-end pipeline and methodology for training and using DNNs for this task, as presented in this paper. Since Uncle-BERT2 achieved a higher accuracy score than the more complex Uncle-DELFT2 model, we conclude that the proposed method reduces complexity while increasing accuracy.
F1 scores of Uncle-BERT2 and other state-of-the-art models on Gen-SQuAD2
As shown in Table 6, although some questions appear in both the Gen-SQuAD2 and SQuAD 2.0 datasets, there is still a significant difference between open-domain questions and genealogical questions. Except for Uncle-DELFT2 on date questions, all the state-of-the-art models failed to answer natural genealogical questions compared to Uncle-BERT2 (and in many cases, even compared to Uncle-BERT1). However, Uncle-DELFT2 was successful on date questions. This may imply that objective date questions are harder to extract from unstructured texts and that the graph structure contributes to resolving such questions. Moreover, BERT’s success on SP’s date questions (compared to Uncle-BERT2) may suggest that these questions are more generic and share more common features across domains than unique features in the genealogical domain. Furthermore, the current state-of-the-art knowledge graph pipeline (i.e., DELFT) achieved performance similar to simpler BERT-based models, indicating that while it is beneficial for open-domain questions, it is not as effective in the genealogical domain. This result, combined with the additional complexity of DELFT, makes it less satisfactory in this domain (except for date questions, as mentioned above).
Interestingly, the “basic” BERT model outperforms all the newer BERT-based models (except for Uncle-BERT2). Furthermore, the fact that Uncle-BERT1 achieved a higher F1 score on place type questions may indicate that place type questions are more sensitive to “noise” or broad context. For example, place names may have different variations for the same entity (high “noise”); e.g., NY, NYC, New York, and New York City all refer to the same entity. This variety already makes the model’s task difficult, so adding broader contextual information introduces further “noise” (e.g., other entities, more person names, and dates) that may reduce the model’s accuracy. Another possible reason for Uncle-BERT2’s lower accuracy on place type questions may be that Uncle-BERT2 was trained with both one-hop-away and two-hop-away contexts, while Uncle-BERT1 was trained only with one-hop-away contexts. The fact that the difference between the models’ F1 scores is smaller on second-degree place objective questions (1.39) than on first-degree (4.72) and zero-degree (10.01) place objective questions may reinforce this indication.

However, it is important to note that in many cases, this factor will not affect the F1 score, since the F1 score does not use the position of the answer (start and end index), only the selected tokens compared to the answer tokens. Since most children and parents live in the same place, either the parent’s place (e.g., birthplace) or the child’s place can be selected by the model without affecting the F1 score. Table 7 presents some examples of answer predictions for place objective questions by Uncle-BERT1 and Uncle-BERT2. These results suggest that higher accuracy can be achieved by classifying question types and using a different model for different question types and relation depths.
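The model-selection idea suggested above can be illustrated with a hypothetical keyword-based router. Both the classification keywords and the model-per-type table below are illustrative assumptions motivated by the reported results (Uncle-DELFT2 on date questions, Uncle-BERT1 on place questions), not part of the paper's pipeline.

```python
def classify_question(question):
    """Hypothetical keyword-based question-type classifier (illustrative,
    not the paper's method): distinguish date and place questions."""
    q = question.lower()
    if q.startswith("when"):
        return "date"
    if q.startswith("where"):
        return "place"
    return "other"

# Hypothetical routing table, loosely motivated by the per-type F1 results
MODEL_BY_TYPE = {
    "date": "Uncle-DELFT2",   # graph-based model was strongest on dates
    "place": "Uncle-BERT1",   # narrower context was less noisy for places
    "other": "Uncle-BERT2",   # best overall model for the remaining types
}

def select_model(question):
    """Route a question to the model assumed to handle its type best."""
    return MODEL_BY_TYPE[classify_question(question)]
```

A production system would replace the keyword rules with a trained question-type classifier.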
Uncle-BERT1’s and Uncle-BERT2’s prediction examples
This study proposed and implemented a multi-phase, end-to-end methodology for answering natural-language questions in the genealogical domain using transformer-based DNN models.
The presented methodology was evaluated on a large corpus of 3,140 family trees comprising 1,847,224 different persons. The evaluation results show that the fine-tuned Uncle-BERT2 model, trained on the genealogical dataset with second-degree relationships, outperformed all the open-domain state-of-the-art models. This finding indicates that the genealogical domain is distinctive and requires a dedicated training dataset and a fine-tuned DNN model. The proposed knowledge-graph-to-text approach was also found to be superior to direct knowledge-graph-based models, such as DELFT, even after domain adaptation, in terms of both accuracy and complexity. This study also examined the effect of question type on the accuracy of the question answering model: date-related questions differ in that they can be answered with greater accuracy directly from the knowledge graph and may have more generic features than other question types, while place-related questions are more sensitive to noise than other question types. In addition, the evaluation results of the three Uncle-BERT models showed that the consanguinity scope of the graph traversal used to generate a training corpus influences the accuracy of the models.
In summary, this paper’s contributions are: (1) a genealogical knowledge graph representation of GEDCOM standard; (2) a dedicated graph traversal algorithm adapted to interpret the meaning of the relationships in the genealogical data (Gen-BFS); (3) an automatically generated SQuAD-style genealogical training dataset (Gen-SQuAD); (4) an end-to-end question answering pipeline for the genealogical domain; and (5) a fine-tuned question-answering BERT-based model for the genealogical domain (Uncle-BERT).
Although the proposed end-to-end methodology was implemented and validated for the question answering task, it can be applied to other NLP downstream tasks in the genealogical domain, such as entity extraction, text classification, and summarization. Researchers can utilize the study’s results to reduce the time, cost, and complexity and to improve accuracy in the genealogical domain NLP research.
Possible directions for future research may include: (1) investigating the tradeoff between rich context passage generation and increasing the Gen-BFS scope, (2) integration with DNC or GNNs for dynamic scoping, (3) finding a method for classifying question types, (4) investigating the contribution of each question type to the accuracy of the model, and developing a model selection or multi-model method for each question type, (5) investigating larger contexts (relation degrees) using models that can handle larger input (e.g., Longformer [59] or Reformer [9]), (6) extending the Gen-BFS algorithm to handle missing family relations by adding a knowledge graph completion step while traversing the graph, (7) investigating the influence of the order of verbalized sentences and especially the order of person reference types, (8) investigating an architecture that ranks family trees (by embedding the entire graph [38]) based on similarity to the question [93] and eliminates the need for the user to select a family tree, (9) investigating the impact of spelling mistakes and out-of-vocabulary words on the quality of the results, and (10) training other transformer models on genealogical data to further optimize question answering DNN models for the genealogical domain.
