Sage Journals: Discover world-class research

Abstract

Integrating Schema.org markup into web pages has resulted in the generation of billions of RDF triples. However, around 75% of web pages still lack this critical markup. Large language models (LLMs) present a promising solution by automatically generating the missing Schema.org markup. Despite this potential, there is currently no benchmark to evaluate the markup quality produced by LLMs. This article introduces LLM4Schema.org, an innovative approach for assessing the performance of LLMs in generating Schema.org markup. Unlike traditional methods, LLM4Schema.org does not require a predefined ground truth. Instead, it compares the quality of LLM-generated markup against human-generated markup. Our findings reveal that 40%–50% of the markup produced by GPT-3.5 and GPT-4 is invalid, non-factual, or non-compliant with the Schema.org ontology. These errors underscore the limitations of LLMs in adhering strictly to structured ontologies like Schema.org without additional filtering and validation mechanisms. We demonstrate that specialized LLM-powered agents can effectively identify and eliminate these errors. After applying such filtering for both human and LLM-generated markup, GPT-4 shows notable improvements in quality and outperforms humans. LLM4Schema.org highlights both the potential and the challenges of leveraging LLMs for semantic annotations, emphasizing the critical role of careful curation and validation to achieve reliable results.

Keywords

Schema.org large language model knowledge graph construction

1. Introduction

‘‘The price of this book is 30 $€$ , this product is in stock, that recipe will take 30 minutes; this job is full-time.’’ Search engines use Schema.org markups embedded in web pages to understand their content better and enrich search results with precise information. These markups can be authored directly by humans when writing their web pages or indirectly generated by software such as Web Content Management Systems configured to produce Schema.org markup.¹ Schema.org markup adheres to the Schema.org ontology, which consists of 806 types and over 1,400 properties to describe entities such as people, organizations, creative works, events, etc. The nature of entities and information differs significantly when comparing Schema.org markup to a public knowledge graph like Wikidata. For example, Wikidata has $\sim$ 43K books registered, whereas Schema.org markup encompasses over 3,5M books (Dang et al., 2023). While Wikidata tends to describe well-known books, people, and organizations, Schema.org describes any book, any shop, or any individual, providing a broader scope. Additionally, Wikidata includes encyclopedic information (e.g., a book’s literary genre), while Schema.org includes specific information (e.g., the price, the vendor, etc.). Thus, Schema.org markup can offer valuable information absent in public knowledge graphs, contributing to more comprehensive and diverse knowledge. Ensuring the availability of such knowledge is crucial to maintaining knowledge graphs that reflect the wealth of information available on the web. Currently, semantic annotations are present in 41% of the world’s web pages, and 25% specifically using Schema.org markup (Brinkmann et al., 2023; Dang et al., 2023). This leaves 75% of web pages without Schema.org markup and, thus, lacking structured data about their content.

In the era of large language models (LLMs), one potential solution is to use these models to generate Schema.org markup from text (Meyer et al., 2023). However, using LLMs to generate Schema.org markups raises questions about the reliability of the generated markups, especially without ground truth to evaluate them. Building a ground truth for all 806 types of Schema.org, considering different web page sizes and languages, requires tremendous effort. A natural idea might be to consider web pages with human-generated Schema.org markups as ground truth. After all, billions of such pages cover various languages, domains, and lengths. Unfortunately, there is no guarantee that existing human-generated markup are (i) correct, that is, fully grounded in the text and compliant with Schema.org recommendations, and (ii) complete, that is, containing Schema.org markup for all information present from the text. Human-generated Schema.org markup can contain errors, such as incorrect facts or missing information, due to various factors. One common reason is the reliance on external knowledge, which is not covered by the text. Additionally, webmasters may lack professional expertise in structured data or web development, resulting in annotating mistakes or the omission of crucial details. Human error and varying levels of familiarity with Schema.org standards further contribute to these errors.

In this article, we propose a novel approach to evaluating the performance of LLMs in generating Schema.org markup. Unlike traditional methods, our approach does not rely on a predefined ground truth. Rather than constructing a ground truth from human-generated Schema.org markup, we aim to establish fair competition between LLMs and humans. Specifically, Do LLMs generate more comprehensive Schema.org markup than humans, given the text of a web page? To answer this question, we need two key elements: (i) First, we must remove any incorrect statements from Schema.org markup generated by humans and LLMs for a given text. As long as two markups contain errors, they cannot be fairly compared since the one with more errors might still appear to “win.” (ii) Second, once we have two correct markups, we need a scoring function to determine which is more comprehensive. The more comprehensive markup is considered the winner.

We propose LLM4Schema.org, a pipeline that takes a web page with human-generated Schema.org markup as input and outputs a scoring function to determine the winner (human or LLM). The article makes the following scientific contributions:

–
Schema.org markup curation agents: We developed three agents to curate Schema.org markup: (1)
The validity agent: Ensures syntactic correctness of the markup. This agent uses SHACL to verify that the markup adheres to the required structure and syntax.
(2)
The factuality agent: Checks that every markup is grounded in the text of the web page. This agent is powered by an LLM, and we created a dedicated benchmark based on examples from the Schema.org documentation to evaluate its precision and recall.
(3)
The compliance agent: Ensures that the content of the markup aligns with the expected types and values defined by the Schema.org documentation. This agent is also powered by an LLM, with its performance (precision/recall) evaluated on a dedicated benchmark using Schema.org examples.

–
MeMR (merged markup ratio): To compare the performance of humans and LLMs in the task of annotating web pages with Schema.org markup, we introduce MeMR, a scoring function that evaluates the coverage of two correct markups. MeMR returns a pair of numbers between 0 and 1; a higher score indicates a more comprehensive markup.
–
Experimental validation: To validate our approach, we created a diversified corpus of 180 web pages containing Schema.org markup by sampling WebDataCommons (Brinkmann et al., 2023; Dang et al., 2023). We then run the LLM4Schema.org pipeline on these 180 web pages. Our findings are as follows: (i) Approximately 40%–50% of the markup produced from the text by GPT-3.5 or GPT-4, using state-of-the-art prompts are either non-factual or non-compliant. Without curation, LLM-generated markup is thus not reliable. (ii) After curation, GPT3’s curated markups remain inferior to human-curated markups. However, GPT-4’s curated markups are more comprehensive than human-curated markups.

This article is organized as follows: Section 2 presents the background and motivations. Section 3 details the methodology of LLM4Schema.org for comparing humans and LLMs Schema.org markups. Section 4 presents our experimental study and discusses limitations. Section 5 explains the positioning of this work compared to related works. Section 6 concludes and outlines future work.
2. Background and Motivations

Currently, 41% of the world’s web pages include semantic annotations, and 25% specifically utilizing Schema.org markup (Brinkmann et al., 2023; Dang et al., 2023). This represents billions of web pages in various languages, with different sizes, each providing both textual content and Schema.org markups.

2.1. Web Pages With Schema.org Markup in JSON-LD

Schema.org is a lightweight ontology that includes 806 types, 1,476 properties, and 14 datatypes (Guha et al., 2016). It enables the description of various entities such as a person, a place, a product, an event, and so on. Schema.org annotations can be embedded in web pages using different formats, including JSON-LD (JavaScript Object Notation for Linked Data) and microdata. Microdata enables inline annotation within HTML attributes (e.g., itemscope, itemtype, and itemprop), closely tying annotations to HTML elements for higher semantic clarity. However, it complicates intricate structures and can bloat HTML, making it harder to manage in large or dynamic applications. JSON-LD is a lightweight Linked Data format based on JSON, designed for Web-scale interoperability.² It uses a separate script block to define structured data with types and properties, avoiding direct HTML modification. This separation prevents bloat, simplifies updates, and improves readability. JSON-LD is mainly used to annotate web pages with the types and properties defined in the Schema.org ontology. According to the latest WebDataCommons statistics (Brinkmann et al., 2023), over 1.1 billion URLs now contain triples in JSON-LD, compared to 822 million in RDFa. In this article, we focus on the JSON-LD format.

Figure 1 presents a simple example of an HTML web page describing an Apple Pie recipe. The page includes Schema.org markup in JSON-LD format between the tags <script>, </script>. This markup instantiates the Recipe Schema.org type to describe the content of the <body> section of the web page. The Schema.org markup is formatted using JSON-LD. JSON-LD is an RDF serialization largely adopted for annotating web pages with the types and properties defined in the Schema.org ontology. This markup is equivalent to the RDF representation shown in Figure 2.

Figure 1.

The Apple HTML Web Page Mixing Text in the Body and JSON-LD Describing Apple Pie with Its Ingredients. HTML = HyperText Markup Language; JSON-LD = JavaScript Object Notation for Linked Data.

Figure 2.

RDF Triples Produced by the Deserialization of the Apple Pie JSON-LD Markup of Figure 1. RDF = Resource Description Framework; JSON-LD = JavaScript Object Notation for Linked Data.

The JSON-LD markup describes a recipe entity along with its ingredients. For simplicity, the Apple Pie web page includes one entity of one type (Recipe). Typically, a single web page may consist of many entities, and each entity can belong to multiple Schema.org types.

Definition 1 Markup Entity

Given a web page embedding an RDF graph $G$ serialized as JSON-LD, a markup entity associated with a subject $s$ is defined as the set of all triples in $G$ that include at least one triple asserting a Schema.org type for $s$ , as well as all other triples reachable from $s$ .

In our example in Figure 2, there is one markup entity denoted by the blank node _:b0 that contains all triples presented in Figure 2.

2.2. Using LLMs to Generate Markup From Text

Given the text of a web page, an LLM can generate a Schema.org markup with a straightforward prompt such as:

Using this prompt, GPT-3.5 produces the Schema.org markup shown in Figure 3. Comparing the LLM-generated markup in Figures 3 and 4 with the human-generated markup in Figure 2, we observe that the LLM-generated version includes recipeInstructions, which are not present in the human version. In this example, the LLM provides a more comprehensive markup than humans. However, we also observe that the LLM generates nutrition facts that do not appear in the text. We consider these incorrect since they are not grounded in the provided text.

Figure 3.

GPT-3.5 Generated Schema.org Markup from the Apple Pie Text of Figure 1. Compared to the JSON-LD Markup of Figure 1, GPT-3.5 Produced the RecipeInstructions that are Grounded in the Text and Nutrition Facts that are not Grounded in the Text. GPT-3.5 = Generative Pre-trained Transformer 3.5; JSON-LD = JavaScript Object Notation for Linked Data.

Figure 4.

LLM4Schema.org Overall Pipeline. Given a Web Page and a Type $T$ , It Extracts a Human-markup $M^{h}$ and Generates LLM-Markup $M^{l}$ . It Removes Incorrect Facts from Both Markups Using Validity, Factuality, and Compliance Agents. Finally, It Compares Which Markup is the Most Comprehensive Using the MeMR Score Function. LLM = Large Language Model; MeMR = Merged Markup Ratio.

This example highlights two crucial points for a web page that includes both text and Schema.org markup:

The human-generated markup cannot be considered ground truth because there is no guarantee that it is: (i) Correct, meaning all RDF facts are grounded in the text and comply with the Schema.org ontology and (ii) Complete, meaning all information in the text is represented as RDF facts.

It is only possible to fairly compare two markups of a web page if they both contain correct facts.

In summary, the challenge lies in assessing the quality of LLM-generated markup when the corresponding human-generated markup cannot be assumed to be ground truth.

3. LLM4Schema.org Approach

As we cannot use human-generated markup as ground truth, LLM4Schema.org determines if the LLM-generated markup is more or less comprehensive than the human-generated markup. To enable fair comparison, we need two crucial elements:

Ensuring markup quality: We must eliminate incorrect facts from human-generated and LLM-generated markup. In LLM4Schema.org, we define incorrect facts as those not grounded in the text or compliant with the Schema.org ontology.

Scoring function: Once we have two correct markups, we need a way to assess the contribution of LLM and humans. In LLM4Schema.org, we define a scoring function determining which markup provides a bigger proportion of merged markups.

In the following, we define the LLM4Schema.org pipeline for a fair comparison between LLM-generated and human-generated markup.

3.1. LLM4Schema.org Overview

The pipeline begins with a web page with Schema.org markup in JSON-LD format sampled from the WebDataCommons project (Brinkmann et al., 2023). The method for extracting a representative sample from this corpus is explained in Section 4.3. We choose a type $T$ from this web page belonging to the page’s markup. For the Apple Pie example in Figure 1, there is only one type $T =$ Recipe. The objective is to compare markup entities of type $T$ for humans and LLMs:

For humans:
We select markup entities of type $T$ noted $M^{h}$ . An illustrative example is presented in Figure 5(a) for our motivating example of Apple Pie (cf. Figure 1).
For LLMs:
We first extract the text of the web page (the <body> text in Figure 1) and ask an LLM to generate the markup $M^{l}$ for the given type $T$ . The prompt we used is detailed in Section 3.2. Figure 5(b) presents an illustrative example of the output.

Figure 5.
Illustrative Schema.org Markups for the Apple Pie Recipe Text Figure 1. (a) Human Markup Before Curation. At Line 12, the “The Eiffel Tower” is not a Compliant Value for an Ingredient and (b) Large Language Model (LLM) Markup Before Curation. At Lines 5 and 10, Markup is not Factual: “Main Dish” and “10-inch” Are not Grounded in the Text. At Lines 6 and 13, the Markup is Invalid: “Cookoo” is not a Property Name, and “Dataset” not Allowed in Recipe Instruction.

For both humans and LLMs, markups presented in Figure 5(a) and (b), some errors have been intentionally introduced to illustrate subsequent steps. As we do not assume that either LLM-generated or human-generated markup is inherently reliable, we apply the following evaluation steps:

(1)
Validity agent: Takes a markup and the schema.org ontology as input, verifying whether the markup is syntactically correct. It returns only valid markup. Details are provided in Section 3.3.
(2)
Factuality agent: Takes a markup and a text as input, checking whether each markup statement is grounded in the text. It retains only factual statements. Details are provided in Section 3.4.
(3)
Compliance agent: Takes a markup and the schema.org documentation as input, ensuring each markup value aligns with the property expectations defined in the Schema.org documentation. It returns a markup containing only compliant statements. Details provided in Section 3.5.

We consider markups that pass these agents to be correct and then compare them using a scoring function called $M e M R$ detailed in Section 3.6. $M e M R$ merges the human and LLM markup and evaluates each member’s contribution to the final result. For example, a score of 0.5/0.5 indicates that humans and LLMs contributed equally, that is, each providing half of the final markup.
3.2. Schema.org Markup Generation With LLMs

Given a web page and a set of Schema.org types, we use LLMs to generate Schema.org markup by employing prompt engineering techniques (Brown et al., 2020). We adapted the prompt of Text2KgBench (Mihindukulasooriya et al., 2023) to fit our context. Our prompting strategy provides the language model with (i) the target Schema.org type, (ii) a list of relevant properties, (iii) an example of Schema.org compliant markup taken from the official documentation, and (iv) the input text to annotate. The model is instructed to output only the corresponding JSON-LD markup. This structured prompt guides the model in aligning the generated output with the expected schema. We use the following prompt (the template on the left, and an example of the instantiated template on the right):

–
TYPES represents the types of the entity. In our example, there is only one Recipe.
–
PROPS represents the properties that can be filled for TYPES, including supertypes. For Recipe, the list of properties includes cookTime, cookingMethod, and so on, in addition to the properties of its supertypes HowTo, CreativeWork, and Thing.
INDEX
represents the index of the example; there can be several examples for one type.
–
ECONTENT and EMARKUP represent one example from the Schema.org documentation for a type. For our example, this corresponds to the example at the bottom of https://schema.org/Recipe. ECONTENT is the text of the example, and EMARKUP is the JSON-LD markup for the text of the example.
–
CONTENT is the text of the page to annotate, that is, the text of the web page shown in Figure 1.

To address the limitation of an LLM’s context window, such as the 16,385 token context window and 4,096 token output limit of GPT-3.5-turbo, we follow the standard practice of chunking with overlap, widely used in LangChain.³ Many techniques exist to manage long documents (Dong et al., 2023) with different trade-off. Chunking with overlap has already been explored by Chalkidis et al. (2022). We divide the text $T X T$ into chunks while maintaining a 10% overlap between adjacent chunks to preserve context. The goal is to minimize the number of chunks $C$ and maximize the number of tokens in each chunk to ensure context preservation. The number of chunks is estimated using the following formula:
$\begin{aligned} c h u n k_t o k e n s & = c o n t e x t_w i n d o w - o u t p u t_t o k e n s - | T o k e n s (p r o m p t) | \\ C & = \frac{| T o k e n s (T X T) |}{0.90 \times c h u n k_t o k e n s} account for 10% overlap \end{aligned}$

Subsequently, we merge the JSON-LD output from each chunk to create the final markup for the text. For the sake of simplicity, a mock example of the results of LLM markup generation is illustrated in Figure 5(b).
3.3. Schema.org Markup Validity Agent

The validity agent ensures syntactic correctness of the markup. For instance, consider the LLM-generated markup shown in Figure 5(b), specifically at line 6: the property “cookoo” is not a valid property of schema:Recipe, that is, it does not belong to the set of properties of schema:Recipe. The validity agent checks for the following rules:

(1)
The type must be defined in Schema.org ontology. For example, Recipe is a valid type, but Recette is not.
(2)
The property must be defined in the type’s ontology or inherited from parent types. For example, recipeIngredient is a valid property for Recipe, but cookoo is not. Additionally, name is a valid property for Recipe because it is inherited from Thing.
(3)
The value must match the expected property type. For instance, the value of recipeInstruction can be either CreativeWork, Text, or ListItem. In our example, HowToStep entity (child type of CreativeWork), the string "Preheat oven ..." is valid, but the Dataset entity is not.

To enforce these rules, we generated SHACL⁴ shapes for the entire Schema.org ontology,⁵ following these steps for each shape: (1) use OWL, RDFS terms for inference; (2) propagate all properties from the parent type to the child types; and (3) close the shapes to enable reasoning under the closed-world assumption. The generated SHACL shapes and code used to generate them are available on the project website.⁶ ^,⁷ Figures 6 and 7 show an excerpt of the resulting SHACL shape for the Recipe type. We use RDFLib pySHACL⁸ to validate the markup against these shapes, removing any invalid property-value pairs from the markup.

Figure 6.
Example of SHACL Shape for Recipe Type (Excerpt).

Figure 7.
Computing Scores.
3.4. Schema.org Markup Factuality Agent

The factuality agent takes a text and a markup as inputs, verifying whether the properties and values mentioned in the markup are grounded in the web page text. It is LLM-based. While it may seem surprising to verify the output of an LLM with another LLM, the precision of LLMs varies depending on the task. The initial task in this article is to generate all markup starting from a text, which is quite complex. The factuality agent’s task is more straightforward; it merely verifies that each markup fragment is grounded in the text. We utilize the following prompt:

If the factuality agent rejects a markup property, this signifies the detection of a hallucination in LLM-generated markup. In our context, there are two possible types of hallucination as defined by Ji et al. (2023):

Intrinsic hallucination: The property-value-type triple contradicts the web page content. For example, in the markup of Figure 5(b), the how-to step of line 10 indicates a 10-inch pie, while it is a 9-inch pie in the text.

Extrinsic hallucination: The web page content cannot support or contradict the property-value-type triple. For example, in the markup of Figure 5(b), the pair {"text": "Main Dish"} (line 5) is not supported by the web page content.

For each property in the markup, we instantiate the prompt template as follows:

Because the prompt includes the entire web page text, it may encounter the token limit in LLMs, where the text’s size may exceed the maximum token count supported by LLMs. To address this issue, we employ a technique of text chunking with overlaps, similar to our method for markup generation (Section 3.2). We also aim to minimize the number of chunks $C$ , that is, maximize the number of tokens in each chunk to preserve context. The number of chunks is estimated using the modified formula from Section 3.2:

\begin{aligned} c h u n k_t o k e n s & = m a x_t o k e n s - | {max}_{i = 1}^{P} T o k e n s (p_{i}) | & P is a set of prompts \\ C & = \frac{| T o k e n s (T) |}{0.90 \times c h u n k_t o k e n s} & account for 10% overlap \end{aligned}

Note that the number of output tokens is omitted from the equation because the model only replies with ‘‘yes” or ‘‘no,” making the number of output tokens negligible compared to other terms. The factuality agent then validates each chunk, producing a Boolean vector for each chunk. The final factuality score is computed as the element-wise logical OR of these vectors. Finally, we remove the invalid property-value-type triples from the markup.

3.5. Schema.org Markup Compliance Agent

The compliance agent takes a markup as input and ensures that each property’s value aligns with the ontology expectations outlined in the Schema.org documentation. For example, the expected value for recipeIngredient is defined as “A single ingredient used in the recipe, e.g., sugar, flour, or garlic.”.⁹ When analyzing the human-generated markup depicted in Figure 5(a) at line 12, it becomes evident that ‘‘The Eiffel Tower” does not constitute a complaint recipe ingredient. The compliance agent is LLM-based; we rely on language comprehension of LLMs to verify that property values are compliant with Schema.org expectations described in the documentation. We use the following prompt:

[PROP] is a placeholder for a property of the input markup, and [DEFINITION] is replaced by the expectation for this property in the Schema.org documentation. For each property in the markup, we send a query to the LLM, for instance:

This prompt is instantiated for each property-value pair in the markup. By analyzing the LLM responses, we identify and remove any non-compliant property-value pairs from the markup.

3.6. MeMR: Merged Markup Ratio

We designed the MeMR, a scoring function to estimate the contribution of human and LLM-generated markups. This function calculates the percentage of markup contributed by humans/LLM in the final merged markup. For example, applying the MeMR metric to the curated markup presented in Figure 5(a) and (b), the MeMR yields scores of $(0.8, 0.33)$ , where 0.8 is the ratio contributed by humans and 0.33 by LLMs. In this case, humans win the match.

Algorithm 1 details the computation of MeMR. For each property $p$ in either human markup $H$ or LLM markup $L$ , we count the number of occurrences of the property and store these counts in $C_{h} [p]$ and $C_{l} [p]$ , respectively. Then, for each property $p$ in the union of $C_{h} [p]$ and $C_{l} [p]$ , we determine the maximum count and store it in $C_{m} [p]$ . We choose properties with larger counts to represent the merged markup. The values of $C_{h}, C_{l}, C_{m}$ for our motivating example are represented in Tables 1 to 7. The final scores $(s c o r e_{h}, s c o r e_{l})$ are computed using the generalized Jaccard similarity coefficient $J_{M}$ defined as follows:

J_{M} (C_{a}, C_{b}) = \frac{\sum_{p} \min (C_{a} [p], C_{b} [p])}{\sum_{p} \max (C_{a} [p], C_{b} [p])}

Table 1.
Number of Examples in the Ground-Truth Dataset for Factuality Agent.

Positive Negative Total

Factual (intrinsic) 785 498 1,283

Factual (extrinsic) 785 630 1,415

	Positive	Negative	Total
Factual (intrinsic)	785	498	1,283
Factual (extrinsic)	785	630	1,415

Table 2.

Precision, Recall, and F1 of the Factuality Agent.

	Prec.	Recall	F1
Factual (intrinsic)	0.866	0.938	0.901
Factual (extrinsic)	0.945	0.945	0.945

Table 3.

Statistics of the Ground Truth for Compliance Agent.

	Positive	Negative	Total
Compliance	785	147	932

Table 4.

Precision, Recall, and F1 of the Compliance Agent.

	Prec.	Recall	F1
Compliance	0.9	0.914	0.907

Table 5.

Example of C-Sets From WebDataCommons 2022.

C-set	# instances	# properties	URLs
Address, alternateName, description, image, logo, name, sameAs, telephone, url, isa: <NewsMediaOrganization>	36	10	https://sports442.com, http://www.bet88news.net
areaServed, contactPoint, description, image, knowsAbout, legalName, logo, name, telephone, url, isa:<Organization>	1	6	https://mridt.com/
aggregateRating, description, isSimilarTo, model, name, offers, isa:<Product>	1	7	https://www.destination-fougeres.bzh/loisirs/cinema-le-club/, https://www.ardennes.com/activite-sportive/balade-a-cheval/
Address, contactPoint, description, name, openingHours, url, isa:<EmergencyService>	3	7	https://www.annarbortow.com/, https://www.annarbortow.com/about.html, https://www.annarbortow.com/services.html
name, sameAs, isa:<HowToSection>	6	3	https://breadboozebacon.com/tag/pasta/#id

Table 6.

Statistics of the 180 Web Pages.

		PV	DP	DT	DS
Number of instances	Low	4/27/193	155	26	2/23/78
	Medium	4/41/598	131	16	0.2/13/45
	High	3/85/924	180	19	2/22/84
Number of properties	Low	1/75/1,600	109	25	2.5/20/214
	Medium	8/28/168	97	20	1.5/19/73
	High	15/68/1,000	147	20	2/17/53
	Overall	1/53/1,600	364	78	0.2/19/214

PV $=$ #property-value pairs (min, avg, max); DP $=$ #distinct properties; DT $=$ #distinct types; DS $=$ document size (min/avg/max in KB).

Table 7.

Results Throughout the Evaluation Pipeline, Where Input Is the Number of Triples in the Input. Valid, Fact, and Comp Are the Number of Triples Resulting from the Step. The Rejection Rate Is the Percentage of Triples Rejected by the Pipeline. MeMR: h for Human, l for LLMs.

	Input	Valid.	Fact.	Comp.	Rejection rate	MeMR
Human	5,690	4,875	3,315	2,719	52.2%
GPT-3.5	4,190	3,369	2,489	2,055	50.9%	( $h = 0.687$ , $l = 0.585$ )
GPT-4	5,260	4,613	3,573	3,113	40.8%	( $h = 0.568$ , $l = 0.707$ )

MeMR = merged markup ratio; GPT-3.5 = Generative Pre-trained Transformer 3.5; LLMs = large language models.

This allows us to evaluate the contribution of humans and LLMs to the merged markup. For our example, the resulting scores are $s c o r e_{h} = \frac{12}{15} = 0.8$ and $s c o r e_{l} = \frac{5}{15} \approx 0.33$ . This indicates that the human-generated markup is closer to the merged markup $I$ than the LLM-generated markup, that is, humans provide more comprehensive markup. The MeMR function remains coarse-grained and purely quantitative. Given two markup $M_{1}, M_{2}$ , the MeMR score might indicate $M_{1} <_{M e M R} M_{2}$ , however, from the qualitative perspective $M_{1}$ can be considered better than $M_{2}$ . For instance, a markup with fewer properties might include properties that are considered more important. In our experiments, we assess the qualitative perspective of the MeMR function using human evaluations. Details are presented in Section 4.5.

4. Experiments

The experimental study aims to address the following questions:

How reliable is the factuality agent? As the factuality agent relies on LLM prompts, we must evaluate its accuracy.

How reliable is the compliance agent? As the compliance agent relies on LLM prompts, we must evaluate its accuracy.

How can we obtain a representative sample of the Schema.org corpus?

Are LLM-generated markups more comprehensive than human-generated markups?

Is the MeMR reliable for comparing markups? The MeMR function is purely quantitative. Between two markups, do humans choose the same one as MeMR?

All experimental results and the code for reproducibility are available on the project website.¹⁰

4.1. How Reliable Is the Factuality Agent?

As the factuality agent is based on an LLM, we must assess its accuracy in detecting the presence or absence of property-value-type triples in the input text.

4.1.1. Ground Truth Dataset for Factuality Agent

We built a ground truth based on the many examples provided in the Schema.org documentation.¹¹ Each example associates a JSON-LD Schema.org markup $M$ with a concise text snippet $τ$ . Figure 8 illustrates two such examples denoted $(M_{1}, τ_{1})$ and $(M_{2}, τ_{2})$ . As we rely on the official documentation of Schema.org, we assume that each property-value-type triple $(p r o p e r t y, v a l u e, t y p e)$ in M is present in $τ$ . For example, the pair $(P_{1} = `` d e s c r i p t i o n, " V_{1} = `` w h i t e t - s h i r t \dots, " and T_{1} = P r o d u c t)$ is found in $τ_{1}$ , and this is true for all properties of $M_{1}$ . By reusing examples from the Schema.org documentation, we can generate 785 positive tests. To generate negative examples, we follow two principles:

Intrinsic principle:
For a pair $((P_{i}, V_{i}, T_{i}), τ)$ , we modify the numeric values from $V_{i}$ to $V_{j}$ , where $i \neq j$ to avoid matching these values in $τ$ . $V_{j}$ are selected from other properties of the same markup that are semantically distant from $P_{i}$ . We measured semantic distance between properties using Spacy embeddings.¹² Consequently, we generated $((P_{i}, V_{j}, T_{i}), τ)$ such that $(P_{i}, V_{j}, T_{i})$ is very unlikely to have a match in $τ$ . We generated 498 negative examples from the 785 positive examples using this principle.
Extrinsic principle:
For two pairs $((P_{1}, V_{1}, T_{1}) \in M_{1}, τ_{1})$ and $(M_{2} = (P_{2}, V_{2}, T_{2}) \in M_{2}, τ_{2})$ , where $M_{1}$ and $M_{2}$ are disjoint, we create new pairs $((P_{1}, V_{1}, T_{1}), τ_{2})$ and $((P_{2}, V_{2}, T_{2}), τ_{1})$ , as presented in Figure 8. When $M_{1}$ and $M_{2}$ are disjoint; there is no match for every property of $M_{1}$ in $τ_{2}$ and for every property of $M_{2}$ in $τ_{1}$ . Thanks to this principle, we generated 630 negative examples from the 785 positive examples.¹³

Figure 8.
Example of Ground Truth Generation for Extrinsic Hallucination.

The factuality ground truth consists of 785 positive examples, 498 negative intrinsic examples, and 630 negative extrinsic examples, as shown in Table 1. The ground truth dataset is available on our GitHub repository.¹⁴
4.1.2. Experimental Results for the Factuality Agent

We evaluate the precision, recall, and F1 score for the factuality agent using its ground truth dataset described in the previous section. As the Factuality agent performs a simpler task than markup generation but requires one LLM call per property-value-type triple in the markup, we opted for a locally hosted Mixtral model instead of using OpenAI models, primarily for cost-efficiency. We used a quantized version¹⁵ of Mixtral-8x7B-Instruct (Jiang et al., 2024). As mentioned in Section 3.4, the factuality agent verifies whether each markup fragment is grounded in the input text.

When prompted, we set the temperature to $0.0$ to ensure a deterministic response from the model. The agent produces a numerical value between 0 and 1, where 1 indicates full factual adherence. We classify property-value-type triples as VALID if their score is $\geq$ threshold, and INVALID otherwise. Naturally, the choice of threshold significantly impacts evaluation outcomes. We opted for a threshold of $0.5$ , aligning with the interpretation of SelfCheckGPT-Prompt (Manakul et al., 2023), where property-value-type triples may exhibit some factual inconsistencies but are overall factually correct. We refer the reader to Appendix G for more details on the implementation of the factuality agent.

Table 2 presents the precision, recall, and F1 score of the factuality agent. The factuality agent obtains high F1 scores in intrinsic and extrinsic test cases. Some examples of false positives and negatives are presented in Appendix A.

4.2. How Reliable Is the Compliance Agent?

Given that the compliance agent is based on an LLM, it is crucial to assess its accuracy in determining whether a property value complies with the property expectations specified in the Schema.org documentation. Similarly to the factuality agent, this evaluation relies on the many examples provided in the Schema.org documentation.

4.2.1. Ground Truth Dataset for Compliance Agent

From the examples in the Schema.org documentation, we consider pairs $(P, V)$ , where $P$ is the description of the property in natural language, and $V$ is a value available in Schema.org examples for the property $P$ . According to the Schema.org documentation, we consider that a value $V$ is compliant with the description of the property $P$ .

Figure 9 illustrates two compliance pairs: one for the recipeIngredient property and the other for recipeYield property. In these examples, $V_{1}$ is compliant with $P_{1}$ and $V_{2}$ is compliant with $P_{2}$ .

Figure 9.

Example of Ground Truth Generation for the Compliance Agent.

In addition to the positive examples, we generate negative examples by swapping values between textual properties that are semantically distant. For example, $V_{1}$ is considered not compliant with $P_{2}$ and $V_{2}$ is considered not compliant with $P_{1}$ . To ensure that $(P_{1}, V_{1}$ ) is distant from $(P_{2}, V_{2})$ during negative test generation, we measure the semantic distance between their embeddings using Spacy¹⁶ and retain pairs with the largest distances as negative examples.

As described in Table 3, we generated 932 tests with positive and negative examples. The ground truth dataset is available on our GitHub repository.¹⁷

4.2.2. Experimental Results of the Compliance Agent

We evaluated the compliance agent’s precision, recall, and F1 score using its ground truth datasets. As for the factuality agent, we used a quantized version of Mixtral-8x7B-Instruct. The outputs of the agent are numbers between 0 and 1, where 1 indicates full compliance of property-value pairs. We assign the label VALID to the property value pairs with a score $\geq$ threshold, and INVALID otherwise. Naturally, the results are sensitive to the choice of the threshold. As before, we chose the threshold of $0.5$ following the interpretation of SelfCheckGPT-Prompt. We refer the reader to Appendix G for more details on the implementation of the compliance agent.

Table 4 presents the precision, recall, and F1 score of the compliance agent. High precision and recall indicate that the compliance agent performs well on positive and negative tests (see Appendix B). A high F1 score also means we can safely integrate the compliance agent into our pipeline to evaluate the quality of the generated markup.

4.3. Sampling Over WebDataCommons

For practical reasons, we consider a representative sampled subset of the corpus of 877 million web pages visited in 2022 that feature Schema.org markup in JSON-LD format. This corpus is extracted from the CommonCrawl of October 2022 dataset,¹⁸ which contains 3.15 billion web pages. Of these, 1.5 billion pages contain structured data,¹⁹ and 877M include Schema.org markup in JSON-LD format.

The corpus is available in RDF as quad files using the format $s u b j e c t, p r e d i c a t e, o b j e c t, U R L$ , where $U R L$ is the $U R L$ of the visited web page, $p r e d i c a t e$ is a Schema.org predicate, and the subject can be either a URL or a Blank node.

Sampling over these 877 million pages is challenging because some Schema.org types/properties are much more frequent and better described than others (Dang et al., 2023). Sampling only frequent Schema.org types or well-described entities may bias results in favor of LLMs.

To create representative samples, we rely on ideas explored by Schema.org observatory (Dang et al., 2023). Schema.org observatory has computed characteristic sets for WebDataCommons released in October 2021. Characteristic sets (C-sets) are properties shared by entities across web pages, revealing how humans combine properties to describe web entities. We extend this work by adding a web page source (URL) for each C-set, allowing us to extract the textual content and human-generated JSON-LD markup for each C-set. Formally, a C-set for a subject $s$ in a quad $s, p, o, u$ is defined as follows:

S_{C} (s) = {p | \exists o : (s, p, o, u) \in D}

where

(s, p, o, u)

represents the quads in the dataset

D

. For example, applying this definition using the markup in Figure 5(a) yields a C-set of length 5:

{

isa:Recipe, name, recipeCategory, recipeCuisine, recipeIngredient

}

. More examples of C-set creation from Schema.org markup are available by Dang et al. (2023). For the corpus of 877 million pages, we computed C-sets. Within each C-set, we removed invalid properties and types (outside Schema.org, typographic errors), resulting in a total of 1.2 million C-sets. Table 5 shows some examples of C-sets from WebDataCommons 2023. The entire C-sets dataset is available on Zenodo.²⁰

To select representatives C-sets, we study the C-sets using two features: (1) The number of instances per C-set measures how frequently a given combination of properties occurs. (2) The number of properties per C-set indicates how many properties are combined in a given combination.

We observe a very weak monotonic relationship between the number of instances and the number of properties in a C-set (Spearman $ρ = - 0.18$ , $p - v a l u e = 0.0$ ). This aligns with the observations made in the Schema.org observatory (Dang et al., 2023), where longer combinations of properties tend to have fewer instances, but this is not the case for all types. In other words, longer C-sets do not necessarily share the same instances as C-sets with higher instance counts and vice versa; hence, both features must be sampled independently.

Next, we divided the distribution of the number of instances and the number of properties into three quantiles: low, medium, and high (Figure 10). This ensures a representative sample of C-sets with different lengths and cardinalities. We grouped the C-sets by quantiles for each feature and sampled 30 pages from each, resulting in 180 pages.

Figure 10.

Distribution of C-sets in WebDataCommons 2023: (a) # Instances Per C-set and (b) # Properties Per C-set.

For each web page, we extracted the textual content from fully rendered web pages using HTML2Text,²¹ and the human JSON-LD markup using extruct.²² It is important to note that: (i) a web page can have multiple JSON-LD markup and (ii) the type in the C-set is not necessarily the same as the type in the JSON-LD markup. For example, in the markup in Figure 5(b), the C-set is ${$ isa:HowToStep, text $}$ but the markup is of type Recipe. When extracting the JSON-LD markup, we only retained those containing the C-set and considered the root type, that is, the type on the first level in the markup, as input for the evaluating pipeline. This step provides more context to the LLMs while retaining representativeness with respect to the C-sets. In another word, instead of evaluating the markup ${$ "@type": "HowToStep", "text": "Preheat oven to 375F (190C)." $}$ , we evaluate the entire markup of type Recipe in Figure 5(b).

Table 6 presents the statistics about our 180 web pages. We can see some diversity in the length of web pages and the number of triples per page.

4.4. Humans Versus LLMs

We compare human-generated markup to GPT-generated markup (GPT-3.5-Turbo-16k-0613 and GPT-4-0125-preview)²³ for 180 web pages, using the pipeline in Figure 4. According to the MeMR, human-generated markup contributes more to the merged markups than GPT-3.5-generated markup, but less than GPT-4-generated markup. This is due to GPT-4 being better at following instructions (Achiam et al., 2023), generating more triples with a lower rejection rate than GPT-3.5.

Table 7 presents the total number of RDF triples present in the output of each agent for the whole 180 web pages. For example, for humans, on the 5,690 input triples, the validation agent only retained 4,875 triples. When examining the proportion of rejected triples, 52.2% of human-generated triples are rejected. While most triples are valid, they are not factual or compliant. This mainly indicated that the text of web pages should be improved to ground information available in the markup. This may concern information only available as images with no alternative text, for example. As for LLMs, 50.9% of GPT-3.5 generated markup, and 40.8% of GPT-4 generated markup were incorrect. This finding indicates that LLMs should not be used out-of-the-box to generate Schema.org markup and require curation to ensure the quality of the generated markup.

Table 8 presents the results throughout the evaluation pipeline per feature (number of instances and number of properties) and per quantile (low, medium, and high). We observe the same pattern in both features: the MeMR is higher for Humans in the high quantiles and GPT-4 in the low quantiles. Although LLMs cannot outperform humans on web pages with the highest number of instances/properties, they can help improve web pages in the low and medium quantiles. This finding suggests that LLMs can help generate the first draft of Schema.org markup, which humans can further curate to improve its quality.

Table 8.
Results Throughout the Evaluation Pipeline, Per Feature (Number of Instances and Properties), and Per Quantile (Low, Medium, and High).

(a) Number of instances per C-set

Human GPT-3.5 GPT-4

Low Med High Low Med High Low Med High

Input 910 861 1039 1293 572 401 1,286 619 671

Valid. 748 731 812 781 452 373 1,064 592 619

Fact. 463 488 602 589 245 284 819 469 513

Comp. 402 346 515 549 222 256 683 441 448

RR 55% 59% 50% 57% 61% 36% 46% 28% 33%

MeMR ( $h = 0.603$ , $l = 0.669$ ) ( $h = 0.689$ , $l = 0.577$ ) ( $h = 0.802$ , $l = 0.506$ ) ( $h = 0.491$ , $l = 0.795$ ) ( $h = 0.576$ , $l = 0.673$ ) ( $h = 0.653$ , $l = 0.638$ )

(b) Number of properties per C-set

Human GPT-3.5 GPT-4

Low Medium High Low Medium High Low Medium High

Input 535 740 1,605 812 567 545 849 897 1,605

Valid. 369 687 1,528 793 474 496 778 713 1,528

Fact. 273 457 1032 645 339 387 629 506 1,032

Comp. 216 373 867 381 281 366 525 444 867

RR 59% 49% 45% 53% 50% 32% 38% 50% 45%

MeMR ( $h = 0.585$ , $l = 0.643$ ) ( $h = 0.687$ , $l = 0.618$ ) ( $h = 0.772$ , $l = 0.480$ ) ( $h = 0.434$ , $l = 0.797$ ) ( $h = 0.569$ , $l = 0.758$ ) ( $h = 0.704$ , $l = 0.567$ )

(a) Number of instances per C-set
Input	910	861	1039	1293	572	401	1,286	619	671
Valid.	748	731	812	781	452	373	1,064	592	619
Fact.	463	488	602	589	245	284	819	469	513
Comp.	402	346	515	549	222	256	683	441	448
RR	55%	59%	50%	57%	61%	36%	46%	28%	33%
MeMR				( $h = 0.603$ , $l = 0.669$ )	( $h = 0.689$ , $l = 0.577$ )	( $h = 0.802$ , $l = 0.506$ )	( $h = 0.491$ , $l = 0.795$ )	( $h = 0.576$ , $l = 0.673$ )	( $h = 0.653$ , $l = 0.638$ )
(b) Number of properties per C-set
	Human	GPT-3.5	GPT-4
	Low	Medium	High	Low	Medium	High	Low	Medium	High
Input	535	740	1,605	812	567	545	849	897	1,605
Valid.	369	687	1,528	793	474	496	778	713	1,528
Fact.	273	457	1032	645	339	387	629	506	1,032
Comp.	216	373	867	381	281	366	525	444	867
RR	59%	49%	45%	53%	50%	32%	38%	50%	45%
MeMR				( $h = 0.585$ , $l = 0.643$ )	( $h = 0.687$ , $l = 0.618$ )	( $h = 0.772$ , $l = 0.480$ )	( $h = 0.434$ , $l = 0.797$ )	( $h = 0.569$ , $l = 0.758$ )	( $h = 0.704$ , $l = 0.567$ )

The input, valid, factuality (Fact.), and compliance (Comp.) rows show the number of triples after each step. The MeMR for Humans (h) and LLMs (l), respectively. The RR is the percentage of triples rejected by the pipeline. MeMR = merged markup ratio; LLMs = large language models; RR = rejection rate.

The in-depth analysis of the errors made by humans and LLMs is presented in Appendix C.

4.5. Evaluating the Accuracy of the MeMR

Judging the quality of the generated markup is challenging, as there might be multiple valid ways to represent the same information. The MeMR measures the quantity of information that contributes to the merging of human and LLM-generated markups, but it does not capture the quality of the information. To evaluate the accuracy of our MeMR metric, we conducted a human evaluation by measuring the perceived quality by humans. This is done in three steps: (1) we randomly selected a subset of web pages from the dataset, (2) we asked human evaluators to compare the human-generated markup with the LLM-generated markup for each web page, and (3) we measured the MeMR-Human agreement.

We first selected 10% of random web pages, resulting in 18 pages to validate the scoring function. Seven participants (master’s students familiar with Schema.org) were presented with two curated markups of the same web page and were asked to choose between A: human-generated, B: GPT-generated, or Tie.²⁴ The participants were not informed of the markup’s origin. Table 9 shows the participants’ responses.

Table 9.
Human Assessment Versus MeMR Assessment. Each Row Shows the Vote Count for Each Document.

ID GPT-3.5 GPT-4

Human MeMR Human MeMR

0aab6 $\dots$ Markup B Markup B Markup B Markup B

1bce9 $\dots$ Markup A Markup B Tie Markup B

21a44 $\dots$ Markup B Markup B Tie Tie

2e921 $\dots$ Markup B Markup B Markup A Markup B

34575 $\dots$ Markup A Markup A Markup B Markup B

43abb $\dots$ Markup A Markup A Tie Markup A

5acb1 $\dots$ Markup A Markup A Markup B Markup B

5c362 $\dots$ Markup A Markup A Markup B Markup B

690a9 $\dots$ Markup A Markup A Markup A Markup A

6b394 $\dots$ Markup A Markup A Markup A Markup A

759d6 $\dots$ Markup B Markup B Markup B Markup B

75fd9 $\dots$ Markup B Markup A Markup B Markup B

7bf0c $\dots$ Markup A Markup A Markup A Markup A

7d7a3 $\dots$ Markup A Tie Markup B Markup B

90193 $\dots$ Tie Markup A Markup B Markup B

d6f28 $\dots$ Tie Markup B Markup B Markup B

daffe $\dots$ Markup A Markup A Tie Markup A

e5449 $\dots$ Tie Markup A Tie Markup A

Cohen’s Kappa 0.750 0.895

ID	GPT-3.5	GPT-4
	Human	MeMR	Human	MeMR
0aab6 $\dots$	Markup B	Markup B	Markup B	Markup B
1bce9 $\dots$	Markup A	Markup B	Tie	Markup B
21a44 $\dots$	Markup B	Markup B	Tie	Tie
2e921 $\dots$	Markup B	Markup B	Markup A	Markup B
34575 $\dots$	Markup A	Markup A	Markup B	Markup B
43abb $\dots$	Markup A	Markup A	Tie	Markup A
5acb1 $\dots$	Markup A	Markup A	Markup B	Markup B
5c362 $\dots$	Markup A	Markup A	Markup B	Markup B
690a9 $\dots$	Markup A	Markup A	Markup A	Markup A
6b394 $\dots$	Markup A	Markup A	Markup A	Markup A
759d6 $\dots$	Markup B	Markup B	Markup B	Markup B
75fd9 $\dots$	Markup B	Markup A	Markup B	Markup B
7bf0c $\dots$	Markup A	Markup A	Markup A	Markup A
7d7a3 $\dots$	Markup A	Tie	Markup B	Markup B
90193 $\dots$	Tie	Markup A	Markup B	Markup B
d6f28 $\dots$	Tie	Markup B	Markup B	Markup B
daffe $\dots$	Markup A	Markup A	Tie	Markup A
e5449 $\dots$	Tie	Markup A	Tie	Markup A
Cohen’s Kappa	0.750	0.895

MeMR = merged markup ratio; GPT-3.5 = Generative Pre-trained Transformer 3.5.

From the responses, we counted the votes for ‘‘Markup A” and ‘‘Markup B” and added one vote to both ‘‘Markup A” and ‘‘Markup B” whenever a participant votes ‘‘Tie.” To assess whether the MeMR score is consistent with human preferences, we measured the inter-rater reliability using Cohen’s kappa coefficient following Groth et al. (2018). The high kappa statistic indicates a substantial agreement between the human judgments and the MeMR score.

We repeated the experimentation with 36 random pages, including the original 18 pages, focusing only on the markup generated by GTP-4, as it showed better performance than GPT-3 in the initial validation.

In this extended experiment, 23 participants were asked again to choose between A: human-generated, B: GPT-4 generated, or Tie.²⁵ The details of this experiment are presented in Appendix E. The new experiment confirms the previous results: the kappa statistic remains nearly the same, and indicates a substantial agreement between human judgments and the MeMR score.

4.6. Discussions and Limitations

This section discusses our findings, addresses limitations, and highlights areas for future exploration.

Lack of ground truth: One of the primary issues for this work is the absence of ground truth for evaluating the ability of LLMs to generate Schema.org markup from web pages. The lack of ground truth for assessing LLMs-generated knowledge has been highlighted as a significant issue by Allen et al. (2023). Our approach does not attempt to establish a ground truth. After the execution of the pipeline, agents may fail to filter out incorrect facts or inadvertently remove correct facts. However, since the same agents process both human-generated markup and LLM-generated markups, the results remain comparable. This pipeline cannot guarantee that Schema.org markup generated from a web page is complete, that is, that all possible Schema.org markups are effectively present at the end of the pipeline. Thus, we cannot claim that the markup produced by the pipeline is correct and complete, as would be expected with a traditional ground truth.

Evaluation approach: In the absence of a classical precision/recall with a ground truth, we devised an alternative method. We evaluated whether LLMs are more capable than humans in populating schema.org classes with values extracted from text. We gamified the evaluation process, and the scoring function MeMR plays an important role. However, this coarse-grained score may fail to capture certain nuances. In our proposal, all properties of a class are weighted equally, for example, a software license is treated as equally important as its download link. Experts might argue that some properties are more critical than others, and the scoring function should reflect this distinction. Despite these limitations, the simple scoring function we defined is sufficient to discern trends over a large corpus of documents. While MeMR may not reflect expert scoring, it enables meaningful comparisons between human-generated and LLM-generated markup at scale.

Potential for cheating: A noteworthy concern is the potential for LLMs to ‘‘cheat” by relying on prior exposure to human-generated Schema.org markup during training. For instance, GPT-3 and GTP-4 are trained on data from CommonCrawl, which includes web pages from 2016 to 2019 (Brown et al., 2020). While it is stated that

The CommonCrawl data was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext.

This data consists only of plain text; there is no way to confirm that Schema.org markup was excluded. If LLMs have been trained on such markup, this may explain their performance. However, our findings show that $\sim$ 40% of the generated markup is incorrect and must be filtered out. While the training process of LLMs may shed light on their performance, it does not address the primary question of this article: Can we trust LLMs to generate Schema.org markup from text?

Prompt engineering: The prompt design significantly impacts markup generation. In Section 3.2, we detailed the prompt we ultimately employed, which was chosen for its superior performance on a sample of pages (i.e., highest score). However, we tested multiple prompts, with and without examples, and with varying levels of detail about properties. This highlights another potential use of the pipeline: as a tool to evaluate the effectiveness of different prompts. The different prompts we tested are available in the website companion.²⁶

Impact of the type: For fair comparison, LLM4Schema.org requires a web page containing JSON-LD and a type $T$ . The quality of instances of type $T$ in human-generated markup is then compared with whose generated by LLMs. $T$ is selected from types available on the web page. An alternative approach is to prompt LLMs to generate all possible Schema.org types from the text of the page and compare them with those provided by humans. This can reveal cases where certain types are identified by the LLM but missed by humans, or vice versa. Asking LLMs to generate the different types from the text of the page could be considered an additional interesting task. This step could be performed before running LLM4Schema.org.

Impact of chunking techniques: We employed a chunking technique with overlapping to handle long web pages. While this approach may influence the performance of LLMs, more sophisticated chunking techniques exist (Chalkidis et al., 2022), as discussed in Section 3.2. To assess the impacts of our chunking technique on the results, we analyzed how the scores of LLMs and humans evolve as the number of chunks of web pages increases. The results are presented in Appendix H. Interestingly, we observed no significant decrease in the LLMs’ performance relative to humans as document size grows. This suggests that the chosen chunking technique is suitable for this task.

Generalization of LLM4Schema.org: LLM4Schema.org focuses on the Schema.org ontology; however, our approach to comparing human and LLM performance is only loosely coupled to Schema.org. The validity agent requires a SHACL file to check predicates, types, and the type hierarchy-resources that can easily be generated for other ontologies. The factuality agent verifies whether a property-value-type triple is mentioned in the text. As such, changes in the property or type do not affect the agent’s functionality. The compliance agent checks whether the value of a property conforms to guidelines expressed in natural language. These guidelines and associated properties can be modified without altering the agent’s core behavior. Finally, the initial prompt used for markup generation includes few-shot examples to guide the model. As shown by Brown et al. (2020) and further supported by Mihindukulasooriya et al. (2023), only a small number of examples are needed to improve generation quality.

5. Related Work

By generating Schema.org markup from text, LLM4Schema.org is related to LLM-augmented knowledge graph construction (LLM-KGC) (Kumar et al., 2020; Pan et al., 2024). End-to-end knowledge graph (KG) construction approaches such as PiVE (Han et al., 2023) or AutoKG (Zhu et al., 2023) handle all stages of knowledge graph construction, including: (1) entity discovery, (2) coreference resolution, and (3) relation extraction. In LLM4Schema.org, the objective is to generate RDF facts based on the Schema.org ontology from a given web page.

Text2KGBench (Mihindukulasooriya et al., 2023) is a benchmark designed to evalute the performance of LLMs in extracting RDF facts from text while adhering to predefined ontologies. It features two datasets, Wikidata-TekGen and DBpedia-WebNLG. The prompt template comprises a sentence and a context describing ontology concepts, relations, and examples. Generated RDF facts are compared to a ground truth, measuring precision, recall, ontology conformance, and subject/relation/object hallucinations. In LLM4Schema.org, the input is a web page extracted from CommonCrawl, with the context restricted to Schema.org concepts. Compared to a single sentence in English, processing a web page is challenging; it can be long, and the language is not predetermined. Regarding the ontology, Text2KGBench focuses on small-sized ontologies by design (up to 20 types and 68 relations). In contrast, LLM4Schema.org targets Schema.org, which includes 806 types and 1,476 properties.²⁷ Most importantly, Text2KGBench relies on a ground truth where LLM4Schema.org does not. Proposing an alternative to ground truth is the primary contribution of this article.

KGValidator (Boylan et al., 2024) presents a novel framework that leverages LLMs to validate and evaluate the completion of KGs. Unlike traditional methods, KGValidator does not require a gold standard to validate LLM-generated RDF facts. Instead, each candidate’s RDF fact is validated against different trustworthy sources using: (1) retrieval-augmented generation (RAG) with trusted sources, (2) web search, and (3) reference KGs. Compared to KGValidator, LLM4Schema.org does not verify the veracity of facts; its objective is to reflect the content of the page without performing any fact-checking. We consider generating Schema.org markup and fact-checking to be two distinct tasks.

Specifically regarding Schema.org, there are few studies in the LLM-KGC domains that focus on generating Schema.org markup from unstructured data (Abbasi et al., 2022; Gonzalez-Garcia et al., 2024; Meyer et al., 2023). For instance, Abbasi et al. (2022) use earlier pre-trained language models to extract Schema.org from 12 web pages in HTML format. They use long-short-term memory (LSTM) networks to classify HTML blocs using eight Schema.org classes/properties and generate the markup using a predefined template. Gonzalez-Garcia et al. (2024) aim to complete the Wikidata KG with triples procured in the web for tourism domain. They perform entity linking to recognize Wikidata entities within the Schema.org markup, use LLMs to transform Schema.org triples into Wikidata triples and evaluates the system using test and validation sets. Bengtson (2024) focuses on refining existing Schema.org markups using LLMs. This work explores the potential of using ChatGPT-4 to refine and enhance Schema.org metadata for the NMSU Library website, aiming to improve search engine optimization. While ChatGPT provides constructive suggestions, it also makes errors, as we observed LLM4Schema.org. Compared to prior works on Schema.org, this article is the first to propose a systematic approach for evaluting the quality of LLM-generated markup.

6. Conclusion

If LLMs can generate Schema.org markup from the text of a web page, assessing its quality poses significant challenges: (i) The Schema.org ontology is large, covering many different domains from products to drugs, (ii) there is considerable diversity in web pages, including variations in languages and page size, and (iii) most importantly, no ground truth is available for comparison.

Fortunately, billions of web pages already contain human-generated Schema.org markup. However, this cannot be considered a ground truth as there is no guarantee of its correctness and completeness.

In LLM4Schema.org, we address the problem of assessing the quality of LLM-generated Schema.org markup by fairly comparing it to the existing human-generated markup. This fair comparison relies on three agents that filter incorrect markup and a score function MeMR, capable of comparing two correct markup in terms of coverage.

Thanks to LLM4Schema.org, we can sample web pages from the CommonCrawl and evaluate whether LLMs produce better Schema.org markup than humans. Our findings indicate that LLMs should not be used out-of-the-box for generating Schema.org markups, as they often produce invalid, unfactual, or non-compliant markups. For instance, with our best-performing LLM, GPT-4, over 40% of the generated markup is incorrect. However, after filtering the incorrect markup by LLM4Schema.org agents, filtered markup generated with GPT-4 can surpass human performance, achieving a MeMR score of 0.70 compared to 0.568 for humans. In contrast, GPT-3.5 does not outperform humans even after filtering with a MeMR score of 0.585 compared to 0.687 for humans. Additionally, we observed that GPT-3.5 and GPT-4 can enhance web pages by improving poorly filled types or utilizing less popular types, providing added value in specific contexts.

This article opens several interesting perspectives. First, our study was limited to a sample of 180 web pages containing JSON-LD with high diversity. Future work could expand this scope by sampling pages with other annotation formats, such as microdata, and increasing the sample size to include greater diversity in languages and domains. For instance, it would be valuable to investigate whether the performance of LLMs is influenced by the language or domain, for example, whether generated markup is better in English than in Vietnamese or Arabic or whether generating product markup is more straightforward than generating event markup (see Appendix F).

Second, additional LLMs, such as Gemini or LLama, could be evaluated, exploring variations in the number of parameters and quantization techniques.

Third, the score function could be refined to incorporate weights for specific properties or to leverage a more sophisticated scoring model trained on human comparisons.

Finally, the output of LLM4Schema.org’s agents could be used as labeled data to fine-tune LLMs, further improving the quality of generated markup.

Footnotes

Acknowledgements

This work is supported by the French ANR project MeKaNo (Search the Web with Things) (ANR-22-CE23-0021), and the French Labex CominLabs projects MiKroloG (The Microdata Knowledge Graph) and WanderLoG (Exploring Large Knowledge Graphs With Sampling).

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

ORCID iDs

Minh-Hoang Dang

Thi Hoang Thi Pham

Pascal Molli

Hala Skaf-Molli

Alban Gaignard

Notes

Appendix A. Example of Errors for Factuality Agent

In Section 4.2, we assessed the reliability of the factuality. In this section, we provide an in-depth analysis of some errors made by the factuality agent. We re-run the agent on some positive and negative cases without the binary output constraint to probe the LLMs reasoning (Huang et al., 2023). Wrong predictions occur when the LLMs fail to understand the subtleties in the positive examples or negative examples.

Table 10 showcases some notable errors detected by the factuality agent. It includes a fragment of the text, a triple, and a fragment of the probe provided by the factuality agent and the error type. If the error type is false negative (FN) it means that the factuality agent failed to validate a positive example, and false positive (FP) means that the factuality agent failed to reject a negative example.

The first error in Table 10 concerns the triple (CourseMode, part-time, CourseInstance). As a positive example, this triple means that it exists in the text an instance of CourseInstance, with a property named courseMode with the value part-time. This triple is extracted from Schema.org examples²⁸ and is grounded in the associated text. The factuality agent is powered by Mixtral-8x7b-instruct considers that the triple is not grounded in the text. The probe results show that the Factuality agent cannot infer that ‘‘Evenings only and weekends” means ‘‘Part-Time.” Similarly, the triple (isAccessibleForFree,False, TouristAttraction) is grounded in the text. However, the factuality agent considers that the triple is not grounded in the text, although the probe explains that the entrance requires a ticket.

Appendix B. Example of Errors for Compliance Agent

In Section 4.2, we assessed the reliability of the compliance agent. In this section, we provide an in-depth analysis of the errors made by the compliance agent. Table 11 shows the compliance agent errors. It includes the definition of the expected values for a given property, a pair (property, value) that can be a positive or negative example, the probe result of the compliance agent for explaining its decision, and finally, the error type. If the error type is FN it means that the compliance agent failed to validate a positive example, and FP means that the compliance agent failed to reject a negative example.

The first error of Table 11 concerns the pair (serviceType, Small business) that is not compliant with the definition of serviceType. The compliance agent, powered by Mixtral-8x7b-instruct considers Small business match the definition, although its reasoning explains that small business does not match the definition. The reasoning is correct, but the final decision is wrong. In such a case, the prompt may be improved to consider the reasoning steps better.

Appendix C. Examples of Errors Throughout the Pipeline

In Section 4.4, we compared the performance of humans and LLMs in generating Schema.org markups. In this section, we provide an in-depth analysis of the errors made by humans and LLMs throughout the pipeline.

Appendix D. Human Assessment

Judging markup quality is challenging, as it requires a deep understanding of the web page content and the Schema.org ontology. The example in Table 17 refers to a Book review²⁹ in a magazine. There is a clear preference for the GPT-generated markup as human evaluators voted 3 Tie and 3 B while MeMR chose B. This is due to the lack of ‘‘essential” information in the human-generated markup, for example, a short description publisher. However, the extra details are not always well-received by human annotators. This is the case for the example in Table 16 which received 3 Tie, 2 B and 1 A, while MeMR chose A. Both versions included extraneous information, for example, subjectOf, potentialAction, CEO being a male, and so on.

Appendix E. Additional Human Assessment for Evaluating the Accuracy of the MeMR Between Human and GPT-4

In Section 4.5, we proposed an evaluation of the accuracy of MeMR based on human assessment of 18 pages. In this section, we present an additional evaluation. We added 18 more pages, resulting in a total of 36 pages, and conducted the evaluation using GPT-4 only, as it demonstrated better performance than GPT-3 in our initial validation. Furthermore, we increased the number of human evaluators from seven to 23. The original seven participants evaluated the 18 new pages, and 16 additional participants assessed all 36 pages.

As before, participants were asked to choose between ‘‘Markup A,” ‘‘Markup B,” or ‘‘Tie.” Table 19 presents participants’ responses. From the responses, we counted the votes for ‘‘Markup A” and ‘‘Markup B” and added one vote to both ‘‘Markup A” and ‘‘Markup B” whenever a participant votes ‘‘Tie.” The high Cohen’s kappa score again indicates a strong agreement between the human judgment and MeMR score.

Appendix F. Language as a Sampling Feature

In Section 6, we proposed ‘‘language” as a potential feature for sampling. In this section, we further discuss the benefits of using language as a sampling feature.

Our generator models, namely GPT-3.5 (Brown et al., 2020) and GPT-4 (Achiam et al., 2023), are trained on a corpus of web pages from CommonCrawl. As such, the models perform better in high-resource languages like English than in other languages. Table 18(b) shows that English is the ‘‘dominant” language on the Web at 63.5%. This skew towards English and subsequent degradation in performance when the prompt is in lower-resource languages is a well-known issue (Shen et al., 2024; Zhang et al., 2023). In our context, the web page content might be in lower-resource languages, but the instruction phrases are always in English. As such, we do not know the quality of the generated markups when the web pages are in different languages. Table 18(a) shows that our sample’s language distribution pattern is consistent with that of CommonCrawl, that is, English remains the dominant language at 45.8%.

Future iterations of this work should sample the C-set (see Section 4.3) based on the language in three quantiles: low-resource, medium-resource, and high-resource languages.

Appendix G. Factuality and Compliance Agents Implementation Choice

In Sections 4.1 and 4.2, we assessed the reliability of the factuality and compliance agents. In this section, we explain the choices behind implementing the agents.

Previous works by Allen et al. (2024), Manakul et al. (2023), Mehta et al. (2024), Mündler et al. (2023), and Wei et al. (2024) detect hallucinations by pooling multiple stochastically generated responses from the same input. More specifically, the pooling method ranges from majority voting (Allen et al., 2024; Manakul et al., 2023; Mehta et al., 2024), textual entailment (Mündler et al., 2023) to ensemble learning (Wei et al., 2024). We chose the majority voting method as they are implemented by the top scorers of SHROOM competition (Mickus et al., 2024) in the task of hallucination detection. In the context of factuality and compliance agents, the majority voting method serves as a means to mitigate inconsistencies thus reduce hallucination. However, this process is costly and time-consuming as our prompt, in our case, is much longer than that of the previous works. Fortunately, SelfCheckGPT (Manakul et al., 2023) also demonstrated that the performance is consistent with human evaluation even with zero samples. Nonetheless, the approach reliability (i.e., precision, recall, and F1 score) when the number of samples decreases is unknown.

In order to build a cost-time-efficient yet reliable agent, we based our agents on SelfCheckGPT-Prompt (Manakul et al., 2023) as follows: (1) we modified the prompt to match the downstream tasks better and (2) the agents cast a single deterministic vote ( $t e m p e r a t u r e = 0$ ) for each property-value-type triple. We ran both SelfCheckGPT-Prompt with four stochastic samples at $t e m p e r a t u r e = 1$ for each triple and our agents against the same validation set (see Sections 4.1.1 and 4.2.1) and compared the results in Table 20.

Overall, SelfCheckGPT obtains a better F1-score than our agents, but the gain is negligible, ranging from $- 0.013$ in the worst case to $0.006$ in the best case. This finding suggests that the trade-off between performance and cost-time efficiency is marginal enough for our baseline agents to remain reliable.

Appendix H. Effect of Chunking on Throughout the Pipeline

In Section 3.2, we described how the pipeline handles long input, that is, the number of tokens in the prompt exceeds those allowed in the LLMs context window. First, we divide the prompt into chunks with 10% overlap to preserve context. Then, we run the generation step or the factuality agent on each chunk. Finally, we aggregate the result into a single output.

In Figure 11, we plot the average MeMR score of humans and LLMs while the number of chunks grows. As we can see, small web pages only require one chunk, while the longest web page requires 17 chunks.

We performed a Student’s t-test with the null hypothesis that the MeMR for shorter web pages (fit in $1$ chunk) is comparable to longer web pages (fit in $> 1$ chunks). A high p-value suggests that we cannot reject the null hypothesis, that is, the chunking does not affect MeMR score.

References

Abbasi

B. U. D.

Fatima

Mukhtar

Khan

Alhumam

Ahmad

H. F.

(2022). Autonomous schema markups based on intelligent computing for search engine optimization. PeerJ Computer Science, 8, e1163.

Achiam

Adler

Agarwal

Ahmad

Akkaya

Aleman

F. L.

Almeida

Altenschmidt

Altman

Anadkat

Avila

Babuschkin

Balaji

Balcom

Baltescu

Bao

Bavarian

Belgum

Bello

Berdine

,... Zoph

(2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Allen

B. P.

Polat

Groth

(2024). Shroom-indelab at semeval-2024 task 6: Zero-and few-shot LLM-based classification for hallucination detection. arXiv preprint arXiv:2404.03732.

Allen

B. P.

Stork

Groth

(2023). Knowledge engineering using large language models. arXiv preprint arXiv:2310.00637.

Bengtson

(2024). Testing the feasibility of Schema.org metadata refinement through the use of a large language model. Journal of Library Metadata, 24(4), 275–290.

Boylan

Mangla

Thorn

Ghalandari

D. G.

Ghaffari

Hokamp

(2024). Kgvalidator: A framework for automatic validation of knowledge graph construction. arXiv preprint arXiv:2404.15923.

Brinkmann

Primpeli

Bizer

(2023). The web data commons Schema.org data set series. In Companion proceedings of the ACM web conference 2023, WWW 2023 (pp. 136–139). ACM.

Brown

T. B.

Mann

Ryder

Subbiah

Kaplan

J. D.

Dhariwal

Neelakantan

Shyam

Sastry

Askell

Agarwal

Herbert-Voss

Krueger

Henighan

Child

Ramesh

Ziegler

Winter

Hesse

,... Amodei

(2020). Language models are few-shot learners. In 33: Annual conference on neural information processing systems 2020, NeurIPS 2020, December 6–12, 2020, virtual.

Chalkidis

Dai

Fergadiotis

Malakasiotis

Elliott

(2022). An exploration of hierarchical attention transformers for efficient long document classification. CoRR abs/2210.05529. https://doi.org/10.48550/arXiv.2210.05529.

10.

Dang

M. H.

Gaignard

Skaf-Molli

Molli

(2023). Schema.org: How is it used? In 22nd International semantic web conference (ISWC 2023), Posters and Demos track, CEUR Workshop Proceedings (Vol. 3632). CEUR-WS.org.

11.

Dong

Tang

Zhao

W. X.

(2023). A survey on long text modeling with transformers. CoRR abs/2302.14502. https://doi.org/10.48550/arXiv.2302.14502.

12.

Gonzalez-Garcia

González-Carreño

Rivas Machota

A. M.

Padilla Fernández-Vega

(2024). Enhancing knowledge graphs with microdata and llms: The case of Schema.org and Wikidata in touristic information. The Electronic Library, 42(3), 443–454.

13.

Google . (2025) General structured data guidelines. Retrieved June 12, 2025, from https://developers.google.com/search/docs/appearance/structured-data/sd-policies#content.

14.

Groth

Lauruhn

Scerri

Daniel Jr

(2018). Open information extraction on scientific text: An evaluation. arXiv preprint arXiv:1802.05574.

15.

Guha

R. V.

Brickley

Macbeth

(2016). Schema.org: Evolution of structured data on the web. Communications of the ACM, 59(2), 44–51. https://doi.org/10.1145/2844544

16.

Han

Collier

Buntine

W. L.

Shareghi

(2023). Pive: Prompting with iterative verification improving graph-based generative capability of LLMs. CoRR abs/2305.12392.

17.

Huang

Song

Wang

Chen

(2023). Look before you leap: An exploratory study of uncertainty measurement for large language models. arXiv preprint arXiv:2307.10236.

18.

Lee

Frieske

Ishii

Bang

Y. J.

Madotto

Fung

(2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), 1–38.

19.

Jiang

A. Q.

Sablayrolles

Roux

Mensch

Savary

Bamford

Chaplot

D. S.

Casas

D. dl.

Hanna

E. B.

Bressand

Lengyel

Bour

Lample

Lavaud

L. R.

Saulnier

Lachaux

M. A.

Stock

Subramanian

Yang

Antoniak

,... Sayed

W. E.

(2024). Mixtral of experts. arXiv preprint arXiv:2401.04088.

20.

Kumar

Pandey

Gadia

Mishra

(2020). Building knowledge graph using pre-trained language model for learning entity-aware relationships. In 2020 IEEE international conference on computing, power and communication technologies (GUCON) (pp. 310–315). https://doi.org/10.1109/GUCON48875.2020.9231227.

21.

Manakul

Liusie

Gales

M. J.

(2023). Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.

22.

Mehta

Hoblitzell

O’Keefe

Jang

Varma

(2024). Metacheckgpt—A multi-task hallucination detection using llm uncertainty and meta-models. arXiv preprint arXiv:2404.06948.

23.

Meyer

L. P.

Stadler

Frey

Radtke

Junghanns

Meissner

Dziwis

Bulert

Martin

(2023). LLM-assisted knowledge graph engineering: Experiments with chatgpt. In Working conference on artificial intelligence development for a resilient and sustainable tomorrow (pp. 103–115). Springer Fachmedien Wiesbaden Wiesbaden.

24.

Mickus

Zosa

Vázquez

Vahtola

Tiedemann

Segonne

Raganato

Apidianaki

(2024). Semeval-2024 shared task 6: Shroom, a shared-task on hallucinations and related observable overgeneration mistakes. arXiv preprint arXiv:2403.07726.

25.

Mihindukulasooriya

Tiwari

Enguix

C. F.

Lata

(2023). Text2kgbench: A benchmark for ontology-driven knowledge graph generation from text. In 22nd International semantic web conference, proceedings, Part II, Lecture Notes in Computer Science (Vol. 14266, pp. 247–265). Springer.

26.

Mündler

Jenko

Vechev

(2023). Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation. arXiv preprint arXiv:2305.15852.

27.

Pan

Luo

Wang

Chen

Wang

(2024). Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowledge and Data Engineering, 36(7), 3580–3599. https://doi.org/10.1109/TKDE.2024.3352100

28.

Schemaorg . (2025) Getting started with Schema.org. Retrieved June 12, 2025, from https://schema.org/docs/gs.html#schemaorg_expected.

29.

Shen

Tan

Chen

Zhang

Zheng

Koehn

Khashabi

(2024). The language barrier: Dissecting safety challenges of LLMs in multilingual contexts. arXiv preprint arXiv:2401.13136.

30.

Wei

Chen

Fang

Gao

(2024). Opdai at semeval-2024 task 6: Small LLMs can accelerate hallucination detection with weakly supervised data. arXiv preprint arXiv:2402.12913.

31.

Zhang

Hauer

Shi

Kondrak

(2023). Don’t trust chatgpt when your question is not in English: A study of multilingual abilities and types of LLMs. In Proceedings of the 2023 conference on empirical methods in natural language processing (pp. 7915–7927).

32.

Zhu

Wang

Chen

Qiao

Yao

Deng

Chen

Zhang

(2023). LLMs for knowledge graph construction and reasoning: Recent capabilities and future opportunities. CoRR abs/2305.13168.

LLM4Schema.org: Generating Schema.org Markups With Large Language Models

Abstract

Keywords

1. Introduction

2.1. Web Pages With Schema.org Markup in JSON-LD

2.2. Using LLMs to Generate Markup From Text

3.1. LLM4Schema.org Overview

3.5. Schema.org Markup Compliance Agent

3.6. MeMR: Merged Markup Ratio

Table 1. Number of Examples in the Ground-Truth Dataset for Factuality Agent. Positive Negative Total Factual (intrinsic) 785 498 1,283 Factual (extrinsic) 785 630 1,415

4.1. How Reliable Is the Factuality Agent?

4.1.1. Ground Truth Dataset for Factuality Agent

4.2. How Reliable Is the Compliance Agent?

4.2.1. Ground Truth Dataset for Compliance Agent

4.3. Sampling Over WebDataCommons

5. Related Work

6. Conclusion

Footnotes

Acknowledgements

Funding

Declaration of Conflicting Interests

ORCID iDs

Notes

Appendix A. Example of Errors for Factuality Agent

Appendix B. Example of Errors for Compliance Agent

Appendix C. Examples of Errors Throughout the Pipeline

Appendix D. Human Assessment

Appendix E. Additional Human Assessment for Evaluating the Accuracy of the MeMR Between Human and GPT-4

Appendix F. Language as a Sampling Feature

Appendix G. Factuality and Compliance Agents Implementation Choice

Appendix H. Effect of Chunking on Throughout the Pipeline

References

Table 1.
Number of Examples in the Ground-Truth Dataset for Factuality Agent.

Positive Negative Total

Factual (intrinsic) 785 498 1,283

Factual (extrinsic) 785 630 1,415