Spanish Triple-to-Text Benchmark on Low-Resource Large Language Models

Abstract

The verbalization of structured data is a beneficial process for several applications. In the context of knowledge graphs (KGs), transforming Resource Description Framework (RDF) triples into natural language facilitates tasks such as KG documentation or alternative exploration methods for different user needs. While significant progress has been made on the English verbalization of KGs, Spanish remains an under-represented language for this task due to the lack of suitable resources. This hinders the development and evaluation of models capable of generating high-quality Spanish verbalizations. To tackle this problem, we create a Spanish adaptation of the WebNLG dataset, a benchmark consisting of over 45,000 verbalizations paired with DBpedia triple sets. To our knowledge, this is the first formal attempt to provide such a dataset in Spanish, which not only serves for data verbalization but can also potentially support the automated generation of RDF triples from text. We leverage this dataset to conduct a comprehensive evaluation of resource-efficient models for the Spanish triple-to-text task employing two different learning approaches: context learning (zero-shot, one-shot, and few-shot settings) and supervised learning through partial fine-tuning. Our results highlight the challenges of generating fluent and accurate Spanish text and demonstrate that partial fine-tuning of the evaluated models significantly improves performance.

Keywords

data-to-text; triple-to-text; data verbalization; triple verbalization; Spanish; WebNLG

1. Introduction

Knowledge graphs (KGs) contain interconnected entities and relationships, usually presented as Resource Description Framework (RDF) (World Wide Web Consortium, 2014) triples with a subject-predicate-object structure. While KGs are useful for machines, humans naturally understand and interpret plain text more intuitively than structured data, as it aligns with our ability to process language, context and nuances related to it. In the context of natural language processing (NLP), the data-to-text (D2T) task focuses on converting structured data into natural language text. Its goal is to make complex data more understandable and accessible by generating human-readable summaries or descriptions from raw structured data.

Considerable effort has been devoted to automating both the generation and verbalization of triples, with growing interest in the use of language models for these tasks (Lin et al., 2024). Most of these efforts focus only on English (Osuji et al., 2024), leaving a wide gap between English and other languages, for which resources are notably scarce. Osuji et al. (2024) report that, of the 90 works they studied in their review of D2T literature, 85% focused exclusively on English and only two, both proposing multilingual approaches, feature Spanish; neither of them tackles the triples-to-text task. This highlights the need to develop resources for other languages, such as Spanish. Over 450 million people around the world speak Spanish natively (Cervantes, 2024), which represents around 6% of the world population, without counting non-native speakers. Even though Spanish is one of the most spoken languages in the world, resources for the automatic generation and verbalization of triples in this language are almost non-existent.

In this work, we present our contribution to the task of KG verbalization in Spanish, specifically the verbalization of triples, focusing on the triples-to-text challenge. Spanish is a language characterized by its rich inflectional morphology and considerable flexibility in word order, which, in comparison to English, includes more extensive verbal and nominal inflection and multiple acceptable positions for subjects and objects around the verb (Aguado-Orea et al., 2019; Moreno-Sandoval & Goñi-Menoyo, 2002). As a result, a single structured input may correspond to a broader set of equally natural verbalization options in Spanish. For example, for the triple [subject: Maria, relationship: need, object: help], English commonly allows expressions such as "Maria needs help.", "Maria requires help.", or "Maria is in need of help." In Spanish, the same content can be rendered as "María necesita ayuda.", "María requiere ayuda.", "María precisa ayuda.", "María tiene necesidad de ayuda.", or "María está necesitada de ayuda.", among others. To tackle this task, we create a semi-supervised Spanish adaptation of the WebNLG dataset (Gardent et al., 2017), which contains the Spanish translation of the triples-verbalization pairs included in the English WebNLG. For this, we followed an automatic machine translation process whose output was then verified and partially manually revised through a detection process of potentially problematic cases. We also present a study of the performance of a selection of resource-efficient large language models (LLMs) on the task of verbalizing Spanish triples as text. With this study, we aim to answer three research questions:

RQ1: How effectively can resource-efficient LLMs verbalize triples into Spanish across different complexity levels?

RQ2: How does task contextualization through examples impact model performance in Spanish triple verbalization?

RQ3: What are the comparative advantages and limitations of context learning versus partial fine-tuning for Spanish triple verbalization?

This article presents our adaptation of WebNLG to Spanish and the evaluation of low-resource LLM performance for the verbalization of triples in Spanish. It is structured as follows: We introduce the background related to structured D2T verbalization in Section 2, followed by the introduction to the proposed Spanish WebNLG and the methodology followed for its development in Section 3. Then, in Section 4.1, we explain the process followed to fine-tune and evaluate a selection of LLMs on the task of Spanish triples-to-text generation using our proposed dataset, followed by Section 4.2, with the presentation of the results obtained in the previous process, and Section 4.3, where we present the discussion of the findings. Finally, in Section 5, we present the conclusions extracted from the previous contributions and future work.

All the code and resources used to develop this work, together with the results obtained, are available on GitHub¹ and Zenodo.²

2. Background

Structured data verbalization, also known as data-to-text generation, is the task of generating natural language passages from structured data. Structured data can be stored in different forms, the most common being graphs, tables, and meaning representations (MRs) such as key-value structures or abstract meaning representation (AMR) (Banarescu et al., 2013), a semantic representation framework that captures the meaning of a sentence as a rooted, labeled graph of concepts and their relationships (Figure 1).

Figure 1.

Data-to-text (D2T) generation overview.

2.1. Datasets and Multilingual Coverage for RDF Triple Verbalization

Considerable work has gone into producing data resources to develop and evaluate data verbalization systems. In their review of the D2T literature, Osuji et al. (2024) identify 63 distinct D2T datasets, mainly for table and AMR data formats. For our specific task, the verbalization of RDF-like data, specifically triples, they identify four datasets: WebNLG (Gardent et al., 2017), DART (Nan et al., 2021), T-REx (Elsahar et al., 2018), and WITA (Fu et al., 2020). The WebNLG dataset (Gardent et al., 2017) stems from the WebNLG challenge (Colin et al., 2016), a shared task to generate text descriptions from structured data found in DBpedia (Mendes et al., 2012). It consists of pairs of DBpedia triple sets and texts. This dataset was originally presented as an English resource, although over time it has been adapted to other languages and, at present,³ it also features Russian, Maltese, Irish, Breton, and Welsh (Cripwell et al., 2023) verbalization translations and a partial translation of the triples (only the entities) for Russian, making it the only multilingual dataset of the four. Beyond the dataset itself, prior work has briefly addressed Spanish verbalization in the WebNLG setting. Mille et al. (2019) extend the existing rule-based FORGe generator to cover a subset of DBpedia properties in Spanish within a template- and grammar-based framework, but they do not provide a Spanish WebNLG dataset or a parallel benchmark for training and evaluating neural triple-to-text models. This, together with its variety of categories and manual supervision, is the reason we chose WebNLG for adaptation to Spanish. Unlike previous adaptations, we fully translate both the triples and all the available English verbalizations for each entry in all the data splits (train, validation, and test), which results in a fully parallel dataset between English and Spanish. This enables not only the exploration of Spanish triple verbalization, but also cross-lingual approaches.

Similar to WebNLG, DART (Nan et al., 2021) consists of structured data in the form of RDF triples and their corresponding textual descriptions. DART is built from multiple sources, including the WikiTableQuestions (Pasupat & Liang, 2015), WikiSQL (Zhong et al., 2017), WebNLG (Gardent et al., 2017), and E2E (Novikova et al., 2017) datasets, covering diverse domains such as Wikipedia, databases, and e-commerce. Along the same lines, T-REx (Elsahar et al., 2018) is built from Wikipedia text and Wikidata triples. It contains millions of automatically aligned sentences with structured knowledge, making it useful for pre-training and fine-tuning language models to better understand factual knowledge and entity relationships. Lastly, WITA (Fu et al., 2020) is constructed from Wikipedia tables and their associated text, using a distant supervision approach to match table records with textual descriptions. Unlike fully supervised datasets, WITA contains partial alignments, reflecting real-world challenges in structured-to-text generation. It helps train models to generalize from imperfect data, making it useful for applications requiring robust natural language generation from structured inputs.

Regarding the language aspect of the task, as stated in Section 1, Osuji et al. (2024) report that only 15% of the works they reviewed were not English-exclusive and that only one of them (Moussallem et al., 2018) focused completely on a non-English language, specifically Brazilian Portuguese, while the other 12 take multilingual approaches. Of these 12 articles, only two (Fan & Gardent, 2020; Xu et al., 2021) include Spanish among their languages, and both are based on AMR-formatted data. These numbers, together with the low representation of non-English languages in well-established datasets, reinforce our belief that more work is needed on the multilingual aspect of the task across a wider spectrum of data formats.

2.2. Modeling Approaches for RDF Triple-to-Text Generation

As illustrated in Figure 1, the approaches proposed in the D2T research can be separated into two main blocks: rules or template-based approaches and neural approaches. The first block of approaches, template- or rule-based D2T, relies on predefined templates and handcrafted rules to convert structured data into fixed or semi-fixed textual outputs (Goldberg et al., 1994; Reiter & Dale, 1997). These systems operate by mapping data inputs to specific linguistic constructs, ensuring consistent and contextually appropriate outputs (Gatt & Krahmer, 2018). This approach offers simplicity and control over the generated text, making it particularly useful in domains where precision and reliability are essential (Gkatzia, 2016), such as weather (Boyd, 1998; Ramos-Soto et al., 2015) and triple-like health reporting data (Hallett et al., 2006). However, the rigidity of templates can limit linguistic variability and may require extensive manual effort to create and maintain, especially when scaling across diverse domains or languages. Some approaches propose the dynamization of these templates by adding control expressions and/or attribute mechanisms to better control the possible inconsistencies during the slot-filling process (Mcroy et al., 2003). Despite these challenges, template-based methods remain a viable solution for applications where the data structure is well-understood, and the desired output follows a predictable pattern (Wiseman et al., 2017).

More recently (Osuji et al., 2024), neural approaches, which leverage neural architectures to automate the verbalization process, have taken a more prominent role, moving away from traditional rule-based and modular approaches that require handcrafted features and domain-specific templates. Lin et al. (2024) propose a taxonomy of neural approaches based on two axes: neural end-to-end and modular D2T. End-to-end neural models have become the dominant approach for structured data verbalization (Liu et al., 2019; Puduppully et al., 2019; van der Lee et al., 2018; Wiseman et al., 2018; Yang et al., 2022), and more specifically triple verbalization (Chen et al., 2020; Lorandi & Belz, 2024), due to their ability to learn complex mappings between structured data and natural language text without requiring explicit intermediate steps. These models are generally based on sequence-to-sequence architectures, where an encoder processes input triples into a latent representation, and a decoder generates textual descriptions (Agarwal et al., 2021; Duong et al., 2023). Other architectures have also been explored, such as advanced encoding techniques based on graph neural networks (Lu et al., 2023). Transformer-based architectures, including BERT (Devlin et al., 2019), BART (Lewis et al., 2020), T5 (Raffel et al., 2020), and GPT-2 (Radford et al., 2019), leverage large-scale pretraining and fine-tuning to boost contextual understanding and generalization (Ma et al., 2022). Despite their advantages, neural end-to-end models often struggle with weak controllability and factual inconsistency, sometimes generating hallucinations in their output not present in the input data (Maynez et al., 2020) or presenting semantic errors (Kasner & Dušek, 2024). To mitigate these issues, modular neural approaches reintroduce intermediate processing steps to enhance interpretability and control (Moryossef et al., 2019). A common framework for triple-to-text tasks is the two-stage approach, which separates content planning from text generation. The content planning phase determines what information should be included and its structure, with models such as neural content planning (NCP) (Zhao et al., 2020) or a two-step content planning based on an encoder-then-order approach (Su et al., 2021). The text generation phase then transforms the structured content plan into fluent and coherent text while maintaining faithfulness to the input data. Other approaches introduce templates into their neural modular approaches to define initial verbalizations of single triples that are later used for sentence fusion and scoring based on neural models (Kasner & Dušek, 2020).

2.3. Evaluation of RDF Triple Verbalization

Evaluation of triple-to-text and, more broadly, D2T systems typically relies on human evaluation or automatic reference-based metrics adapted from other natural language tasks such as machine translation or summarization, most prominently BLEU (Papineni et al., 2002), alongside alternatives such as METEOR (Lavie & Agarwal, 2007), ROUGE (Lin, 2004), chrF (Popović, 2017) and more recent embedding-based metrics like BERTScore (Zhang et al., 2020), among others (Osuji et al., 2024; Puduppully et al., 2019; Zhao et al., 2020). These measures estimate the quality of a system output by comparing it to one or more reference texts, usually in terms of n-gram or token overlap. However, some studies have highlighted important limitations of some reference-based metrics, specifically BLEU, as a proxy for human judgments (Callison-Burch et al., 2006; Post, 2018; Reiter, 2018). They report that BLEU-based system rankings can diverge from human evaluations, that BLEU is often under-specified in practice with inconsistent tokenization and parameter choices, and that its correlation with real-world utility and user satisfaction is far from guaranteed, particularly for natural language generation tasks. These observations motivate the use of complementary metrics, which we also adopt in our experiments.

In general, the domain of neural D2T generation is advancing toward finding an ideal equilibrium among fluency, factual accuracy, and controllability. Although end-to-end methods produce coherent text with little human involvement, modular methods offer enhanced control and dependability, which makes them better suited for applications that demand accuracy. However, a major obstacle persists due to the scarcity of multilingual resources and methodologies, which restricts the accessibility and efficiency of D2T systems in various linguistic and cultural settings. Tackling these deficiencies is necessary to create more inclusive and universally applicable solutions.

3. Spanish WebNLG

WebNLG (Gardent et al., 2017) is a structured D2T generation benchmark that was initially introduced as an English D2T dataset for a challenge (Colin et al., 2016). In the last few years, it has also been adapted to other languages, specifically Russian, Maltese, Irish, Breton, and Welsh (Cripwell et al., 2023). Of its three major releases, we chose to work with the most recent one, WebNLG v3.0. The dataset is split into training, validation, and test sets, ensuring a standardized evaluation process. It is provided in XML format, where the root element, <benchmark>, contains multiple <entry> elements grouped under <entries>. Each <entry> has attributes such as category, ID, shape, shape type, and triple set size. The structure of an entry consists of three key components: <originaltripleset>, which holds raw RDF triples extracted from DBpedia and wrapped in <otriple>; <modifiedtripleset>, containing the processed original triples which were revised by annotators and wrapped in <mtriple>; and <lex>, which includes human-generated natural language text (lexicalizations) with quality annotations. The dataset accommodates varying complexity, with each entry containing between one and seven triples. The training set comprises around 13.2k entries with around 35.4k triple-lexicalization pairs, while the validation and test sets contain 1.6k and 1.7k entries, corresponding to around 4.4k and 5.1k triple-lexicalization pairs, respectively.
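To make the entry layout described above concrete, the following minimal sketch parses a miniature, hand-written entry with Python's standard XML library. The attribute names (`eid`, `shape_type`, `size`, `comment`, `lid`) and the sample triple and sentence are assumptions chosen for illustration; they are not copied from the released files.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature file following the structure described in the text.
# Attribute names and content are illustrative assumptions.
SAMPLE = """
<benchmark>
  <entries>
    <entry category="Airport" eid="Id1" shape="(X (X))" shape_type="NA" size="1">
      <originaltripleset>
        <otriple>Aarhus_Airport | cityServed | Aarhus</otriple>
      </originaltripleset>
      <modifiedtripleset>
        <mtriple>Aarhus_Airport | cityServed | Aarhus</mtriple>
      </modifiedtripleset>
      <lex comment="good" lid="Id1">Aarhus Airport serves the city of Aarhus.</lex>
    </entry>
  </entries>
</benchmark>
"""


def read_entries(xml_text):
    """Return one dict per <entry> with its category, size, triples and texts."""
    root = ET.fromstring(xml_text.strip())
    entries = []
    for entry in root.iter("entry"):
        entries.append({
            "category": entry.get("category"),
            "size": int(entry.get("size")),
            # Modified triples are the annotator-revised ones.
            "triples": [m.text for m in entry.iter("mtriple")],
            # Each <lex> element holds one lexicalization.
            "texts": [lex.text for lex in entry.iter("lex")],
        })
    return entries


entries = read_entries(SAMPLE)
```

Each returned dictionary pairs one triple set with all of its lexicalizations, which is the unit counted as an "entry" in the statistics above.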

WebNLG English v3.0 serves as the foundation for our Spanish adaptation due to its key attributes:

A well-structured benchmark featuring a diverse set of RDF triples from DBpedia, covering a wide range of topics. The topics are varied yet presented at a sufficiently high level, making them accessible to non-native English speakers with a relatively good command of the language and general knowledge.

High-quality triples and corresponding verbalizations, carefully verified by annotators and reviewers to maintain accuracy and linguistic clarity.

A range of complexity levels, with datasets containing between one and seven triples per set, allowing for a thorough evaluation of model performance across different levels of difficulty.

Demonstrated multilingual adaptability, as evidenced by successful extensions into other languages.

An active research community and well-established evaluation frameworks that support continuous improvement and benchmarking of generated text.

In Section 3.1, we introduce the methodology followed to adapt the dataset to Spanish.

3.1. Methodology Overview

Currently, WebNLG (Cripwell et al., 2023) supports a range of languages, but the multilingualism is generally exclusive to the verbalizations, meaning that the original KG or triples must strictly be in English, limiting its applicability for multilingual scenarios. To overcome this constraint, we introduce the translation of both triples and verbalizations within WebNLG. This enhancement enables seamless processing in English, Spanish, or a combination of both, expanding the potential for multilingual natural language generation and improving adaptability across different linguistic contexts.

During the adaptation of the dataset, we maintained the original structure of the WebNLG English dataset while translating both the triples and their corresponding verbalizations into Spanish. This ensures consistency in format and allows for direct comparisons between English and translated versions. An overview of this adaptation methodology is shown in Figure 2, which outlines the procedure used to align the translated triples and verbalizations. Specifically, it illustrates the process divided into two steps: the automatic translation phase, explained in Section 3.1.1, and the manual revision and triple composition, explained in Section 3.1.2. In Section 3.2, we introduce the final structure of the dataset and some of its characteristics and limitations.

Figure 2.

Methodology followed for the adaptation of WebNLG to Spanish. The process followed for the adaptation of triples is shown in blue and the process followed for the adaptation of verbalizations in red.

We acknowledge that the methodology adopted here does not aim to cover the full range of possible Spanish verbalization variants. Owing to the inflectional system and syntactic flexibility of Spanish, a single structured input may correspond to many equally natural wording variants (Aguado-Orea et al., 2019; Moreno-Sandoval & Goñi-Menoyo, 2002), and generating all such variants would not be feasible. Moreover, Spanish is an official language in over 20 countries across Europe, the Americas, and Africa (Cervantes, 2024), which entails substantial national and regional variation in lexical and stylistic preferences. Rather than pursuing exhaustive coverage, our goal is to provide a fully aligned Spanish version of WebNLG with broad, practical coverage for benchmarking. We therefore translate all verbalizations available in the original English WebNLG entries to approximate a wide range of Spanish expressions. This initial release primarily reflects a standard Peninsular variety of Spanish, aligned with the authors' linguistic expertise, to support reliable manual revision in this first version. We recognize that Latin American lexical and stylistic variants are under-represented and acknowledge this as a limitation of the present dataset and a clear target for future, community-driven and dialect-aware expansion.

3.1.1. Automatic Translation

To adapt WebNLG into Spanish, we first processed the structured data and textual elements and generated automatic machine translations using DeepL⁴. DeepL is a neural machine translation service known for its high-quality machine translation, especially for European languages like Spanish (Kamaluddin et al., 2024).

For triples, which are initially presented in the format subject — relationship — object, we separated each component to handle them individually. This decomposition allowed us to translate subjects, relationships, and objects independently, avoiding potential errors from translating entire structured statements at once. We machine-translate all entities and relationships, which may introduce errors due to the lack of context during translation. These potential errors are later addressed through manual revision of all the triple instances. After extraction, given that the same entities and relationships can be present across different triples, we compiled separate sets for entities (subjects and objects) and relationships. Working on these unique sets makes it possible to ensure consistency across the whole dataset.
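The decomposition step can be sketched as follows. This is a minimal illustration, assuming that each triple is a single string whose three components are separated by a `|` character; the function names and sample triples are hypothetical.

```python
def split_triple(triple, sep="|"):
    """Split one triple string into (subject, relationship, object)."""
    # maxsplit=2 keeps any further separator characters inside the object.
    subject, relationship, obj = (part.strip() for part in triple.split(sep, 2))
    return subject, relationship, obj


def collect_unique(triples):
    """Build the sets of unique entities and unique relationships."""
    entities, relationships = set(), set()
    for triple in triples:
        subject, relationship, obj = split_triple(triple)
        entities.update((subject, obj))  # subjects and objects share one set
        relationships.add(relationship)
    return entities, relationships


# Illustrative input: the same entity appears in two triples.
sample = [
    "Aarhus_Airport | cityServed | Aarhus",
    "Aarhus | country | Denmark",
]
entities, relationships = collect_unique(sample)
```

Because each unique entity and relationship is translated once and reused, every triple that mentions it receives the same Spanish form.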

For entities, we combined KG information with machine translation. We queried Wikidata (Vrandečić & Krötzsch, 2014) to obtain Spanish labels and/or aliases when available, and only fell back to machine translation when such information was missing. To ensure the best possible translation quality, we applied the following priority order: (i) if a Spanish Wikidata label existed, we selected it as the translation; (ii) otherwise, if an alias was available, we used it; and (iii) if neither was found, we relied on machine translation. In total, this yielded 3,615 unique entities, of which 2,116 were translated using Wikidata labels/aliases and 1,499 via machine translation.
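The priority order above can be expressed as a small selection function. The sketch below uses plain dictionaries as stand-ins for the Wikidata lookups and a caller-supplied function as a stand-in for the machine translation service; all names are hypothetical and no real API is called.

```python
def choose_translation(entity, labels, aliases, machine_translate):
    """Pick a Spanish form for an entity: label, then alias, then MT.

    labels: dict mapping entity -> Spanish label (stand-in for Wikidata).
    aliases: dict mapping entity -> list of Spanish aliases (stand-in).
    machine_translate: callable used only as the last resort (stand-in).
    Returns the chosen text together with the source that produced it.
    """
    label = labels.get(entity)
    if label:
        return label, "label"
    alias_list = aliases.get(entity) or []
    if alias_list:
        return alias_list[0], "alias"
    return machine_translate(entity), "machine_translation"


# Toy data for illustration only.
labels = {"New York City": "Nueva York"}
aliases = {"United States": ["EE. UU."]}
fake_mt = lambda text: f"[MT] {text}"

result = [choose_translation(e, labels, aliases, fake_mt)
          for e in ("New York City", "United States", "Aarhus Airport")]
```

Recording the source alongside the translation makes it straightforward to report how many entities came from each route, as in the counts given above.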

For relationships (properties), we relied exclusively on machine translation. This decision is motivated by the structure of WebNLG v3.0, where the triple sets used for training correspond to manually modified triples rather than the raw properties. As a result, there is no simple one-to-one mapping from these modified properties back to canonical properties, which makes property labels from the KG less reliable as a source of truth. Instead, we treated the 412 unique relationships as dataset-specific predicates and translated all of them via machine translation, followed by manual revision (see Section 3.1.2).

For verbalizations (or lexicalizations), which are natural language expressions of the triples, we extracted them as plain text and directly applied machine translation. Since these are full sentences rather than structured elements, they were translated without decomposition, ensuring fluency in the resulting Spanish text. After machine translation, we automatically detected dates in formats such as dd-mm-yyyy, mm/dd/yyyy, and so on, and reformatted them into fully textual expressions in Spanish (day, month, and year written out). This avoids misunderstandings related to date formats, such as confusion between day and month order, and matches the way both humans and language models typically realise dates in naturally written text.
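As an illustration of the date normalization step, the sketch below rewrites numeric dates as Spanish textual expressions with a regular expression. It is a simplified assumption of how such a step could work: it handles only numeric day/month/year patterns, spells out the month name while leaving day and year as numerals, and requires the caller to state whether the day or the month comes first.

```python
import re

MONTHS_ES = [
    "enero", "febrero", "marzo", "abril", "mayo", "junio",
    "julio", "agosto", "septiembre", "octubre", "noviembre", "diciembre",
]

# Two 1-2 digit fields and a 4-digit year, separated by "-" or "/".
NUMERIC_DATE = re.compile(r"\b(\d{1,2})[-/](\d{1,2})[-/](\d{4})\b")


def spell_out_dates(text, day_first=True):
    """Replace numeric dates with 'D de MES de AAAA' expressions."""
    def replace(match):
        first, second, year = (int(g) for g in match.groups())
        day, month = (first, second) if day_first else (second, first)
        # Leave the text untouched if the numbers cannot be a valid date.
        if not (1 <= month <= 12 and 1 <= day <= 31):
            return match.group(0)
        return f"{day} de {MONTHS_ES[month - 1]} de {year}"
    return NUMERIC_DATE.sub(replace, text)


normalized = spell_out_dates("Fecha: 14-03-1990.")
```

The `day_first` flag makes the day/month ambiguity explicit: the same string `03/04/1990` yields a different date depending on the convention, which is exactly the confusion that the textual format removes.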

Once the translations of entities and relationships had been manually revised (see Section 3.1.2), we performed an automatic consistency check over all verbalizations. In this pass, we verified that (i) the entities mentioned in the Spanish verbalizations matched the final, revised Spanish entities associated with the corresponding triples, and (ii) all dates still conformed to the normalized textual format described above. Whenever mismatches were found (for instance, if a verbalization still contained a previous version of an entity translation), the verbalization was automatically updated so that all entity mentions and dates were consistent with the final triple translations.

3.1.2. Detection of Problematic Cases and Manual Oversight

The second step of the adaptation was to manually revise the automatic translations. This step was carried out in two stages: first on entities and relationships, and then on a selected subset of verbalizations.

Regarding the triples, with a total of 3,615 unique entities and 412 unique relationships, since the numbers were manageable, we conducted a manual revision of their translations. We reviewed the selected translation for each entity and relationship and corrected any inaccurate or incomplete translations identified.

Overall, 301 entities (around 8.3% of the total) and 12 relationships (around 2.9%) required modification. Among these 301 corrected entities, 245 corresponded to Wikidata-based translations and 56 to machine-translated ones. Typical errors involved incomplete or partially translated names, such as geographical expressions like New Jersey, New York being translated only as Nueva Jersey, thus omitting relevant contextual information. These cases were manually corrected to ensure accuracy, completeness, and consistency across all triples.

Regarding the verbalizations, the dataset contains 45,031 lexicalizations, which made a complete manual revision of all of them infeasible for this first version of Spanish WebNLG. Instead, we relied on automatic detection of potentially problematic cases, followed by targeted manual revision. Our intention for future work is to enrich the dataset and to apply crowdsourcing to refine these translations at scale.

To identify potentially problematic verbalizations, we computed the cosine similarity between the embeddings of the English and Spanish verbalizations (one-to-one) using three different embedding models. We chose cosine similarity with multilingual embedding models over other metrics because it effectively captures semantic similarities across different languages, enabling accurate cross-linguistic comparison. We used more than one model because the dataset covers a wide variety of topics and we wanted the similarity estimate to be as representative as possible. The models were selected based on the SentenceTransformers (Reimers & Gurevych, 2019) documentation on multilingual models⁵ (Reimers & Gurevych, 2020). Specifically, we used the SentenceTransformers models paraphrase-multilingual-MiniLM-L12-v2,⁶ paraphrase-multilingual-mpnet-base-v2,⁷ and distiluse-base-multilingual-cased-v2⁸ (Reimers & Gurevych, 2019). For each verbalization, we took the maximum similarity score across all models.
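The scoring rule can be sketched independently of any particular embedding model. In the minimal example below, the embeddings are toy vectors supplied by hand; in practice they would be produced by the multilingual sentence-embedding models named above. The function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_similarity(english_embeddings, spanish_embeddings):
    """Best cosine score over models for one English-Spanish pair.

    Both arguments hold one embedding per model, in the same model order.
    """
    return max(
        cosine_similarity(en, es)
        for en, es in zip(english_embeddings, spanish_embeddings)
    )


# Toy embeddings from three imaginary models for one sentence pair.
english = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
spanish = [[0.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
score = max_similarity(english, spanish)
```

Taking the maximum means a pair is only flagged when every model finds it dissimilar, which reduces the chance that a weakness of a single model sends a correct translation to manual revision.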

We deliberately use cosine similarity not as a full evaluation metric, but as a risk-based heuristic to decide which MT verbalizations should be prioritized for human revision. Concretely, instead of randomly sampling from the 45,031 verbalizations, we rank English–Spanish pairs by their cross-lingual similarity and focus manual effort on those that look atypical in the embedding space. This is conceptually similar to outlier or noise detection in other NLP settings, where multilingual sentence embeddings and distance thresholds are used to flag misaligned or low-quality sentence pairs in large parallel corpora before downstream use (Kurfalı & Östling, 2019).

We focused our manual revision on verbalizations whose maximum similarity score was 0.9 or lower. To choose this value, we inspected the empirical distribution of similarity scores and observed a strongly skewed shape, with the bulk of the scores above 0.9, a sparse tail of lower values, and a clear "elbow" or "knee" around 0.9, where the density of points drops sharply (Figure 3). Selecting a cutoff at such an elbow is a standard heuristic in data analysis and outlier detection (Thorndike, 1953), where it marks the transition between a dense region of typical cases and a sparse tail of atypical ones (Satopaa et al., 2011). At the same time, we needed a threshold that was operationally feasible: the selected band had to be sufficiently narrow that a complete manual revision of all retrieved cases was realistic, yet wide enough to cover a meaningful fraction of atypical verbalizations. In our data, a cutoff at 0.9 corresponds to approximately the lowest 3% of the similarity distribution, which we interpret as a high-risk band for potential translation problems. This allows us to concentrate manual effort on the most atypical English–Spanish pairs while keeping the annotation effort realistic. With this criterion, we obtained a selection of 1,239 verbalizations (around 3% of the total) for manual inspection, providing a practical compromise between expected error coverage and annotation cost.
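The selection criterion amounts to a simple threshold filter. The sketch below, with hypothetical names and toy scores, returns the indices of the pairs at or below the cutoff together with the share of the data they represent.

```python
def flag_for_revision(scores, threshold=0.9):
    """Return indices with score <= threshold and their share of all scores."""
    flagged = [i for i, score in enumerate(scores) if score <= threshold]
    share = len(flagged) / len(scores) if scores else 0.0
    return flagged, share


# Toy maximum-similarity scores for six verbalization pairs.
scores = [0.97, 0.90, 0.85, 0.99, 0.93, 0.62]
flagged, share = flag_for_revision(scores)
```

Applied to the figures reported above, 1,239 flagged items out of 45,031 correspond to a share of roughly 2.75%, in line with the "around 3%" band described in the text.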

Figure 3.

Histogram of verbalizations similarity results.

It is important to note that this procedure does not guarantee that all problematic translations fall below the threshold, nor that all items below 0.9 are incorrect. Rather, it provides a principled way to concentrate limited human effort on those verbalizations that are statistically more likely to be problematic, instead of relying on uninformed random sampling. Although we cannot manually revise all 45,031 verbalizations, every verbalization is nevertheless subject to an automatic consistency check: entity mentions are aligned with the final, manually revised entity translations, and dates are enforced to follow the normalized textual format described above. In this way, we at least aim to guarantee global consistency of entities and dates across the dataset, even when the full wording of a verbalization has not been manually inspected. A full, crowdsourced revision of all verbalizations remains part of our future work.

The manual revision was carried out by a native Spanish speaker with advanced, formally certified proficiency in English. To ensure consistent decisions, the annotator followed a simple set of internal guidelines: a verbalization was marked as erroneous and corrected if (i) it did not fully preserve the information and context expressed in the original English verbalization, (ii) it contained clear grammatical errors in Spanish (such as agreement, conjugation, or syntactic well-formedness issues), or (iii) it exhibited inconsistent or clearly inappropriate lexical choices with respect to the intended meaning (such as mistranslations or infelicitous word choice according to the triple sets). Since all manual revisions were performed by a single annotator, inter-annotator agreement could not be measured. Instead, we relied on the internal guidelines just described to ensure consistency. Under these criteria, 406 verbalizations were identified as requiring correction, corresponding to 31.5% of the manually inspected subset and <1% of all generated verbalizations.

3.2. Spanish WebNLG: Statistics, Characteristics and Limitations

When creating the Spanish adaptation of WebNLG, we decided to maintain the same structure and content as the English version, with the goal of having a parallel translation that enables us not only to compare results across both languages but also to potentially create multilingual adaptations. Table 1 reports the statistics about entries and verbalizations in the Spanish adaptation. Specifically, we have a total of 16,657 entries, which are composed of 45,031 triple set-verbalization pairs. Following the original grouping of entries, these are divided into three splits (train, validation, and test) in a proportion of approximately 80%, 10%, and 10%, respectively, which potentially enables the training and evaluation of neural models.

Table 1.
Entry and Verbalization Statistics for the WebNLG Datasets.

		Train		Validation		Test	
		Entries	Verbalizations	Entries	Verbalizations	Entries	Verbalizations
Language	English	13,211	35,426	1,667	4,455	1,779	5,150
	Russian	5,573	14,630	790	2,065	1,101	2,780
	Breton	13,211	35,426	1,399	1,399	1,778	Not reported
	Welsh	13,211	35,426	1,665	1,665	1,778	1,778
	Irish	13,211	35,426	1,665	1,665	1,778	1,778
	Maltese	13,211	35,426	1,665	1,665	1,778	1,778
	Spanish	13,211	35,426	1,667	4,455	1,779	5,150
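As a quick arithmetic check of the Spanish row in Table 1, the totals and split proportions can be recomputed from the per-split counts. The snippet below is only an illustrative sketch; the dictionary and variable names are ours.

```python
# Per-split counts for Spanish WebNLG, copied from Table 1.
entries = {"train": 13211, "validation": 1667, "test": 1779}
verbalizations = {"train": 35426, "validation": 4455, "test": 5150}

total_entries = sum(entries.values())
total_verbalizations = sum(verbalizations.values())

# Share of entries falling in each split.
shares = {split: count / total_entries for split, count in entries.items()}

print(total_entries, total_verbalizations)
for split, share in shares.items():
    print(f"{split}: {share:.1%}")
```

The resulting shares are close to, but not exactly, the nominal 80%/10%/10% proportion, since the splits follow the original grouping of entries.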

Table 2 shows the distribution of entries per split by category and triple-set size for Spanish WebNLG, which mirrors the original English version. The dataset contains a diverse range of 19 categories, including places (Airport, City, Monument, etc.), people (Artist, Astronaut, Politician, etc.), entities (Company, Film, MusicalWork, etc.), and abstract concepts, among others. The number of triples per entry varies significantly, with most entries containing between one and five triples, though some categories, such as Astronaut and University, also include larger entries with up to seven triples. In the train split, categories like Food, MeanOfTransportation, and Politician have a large number of entries, whereas others such as Monument and Company have fewer examples. The validation split follows a similar distribution pattern, but with a significantly lower number of examples per category, ensuring a representative yet smaller validation dataset. The test set, however, includes categories absent in training, such as Film and MusicalWork, which help test the models' generalization ability. Overall, the dataset balances a variety of domains while varying the complexity of entries by the number of triples, which gives us a wider scope of evaluation.

Table 2.

Category and Triple Size Statistics About English and Spanish WebNLG.

Split		Train							Validation							Test
Num of triples		1	2	3	4	5	6	7	1	2	3	4	5	6	7	1	2	3	4	5	6	7
Triple Category	Airport	301	193	187	206	198	0	0	37	24	24	25	25	0	0	13	22	19	15	14	10	2
	Artist	276	226	246	240	234	0	0	34	28	31	30	30	0	0	22	26	24	20	13	2	2
	Astronaut	71	46	64	82	86	90	90	8	5	8	11	11	11	12	8	18	15	14	11	8	8
	Athlete	285	188	216	147	67	0	0	35	24	27	19	9	0	0	8	12	14	9	7	0	0
	Building	236	171	203	206	156	0	0	30	44	0	25	20	0	0	0	7	8	7	8	8	8
	CelestialBody	169	131	129	118	87	0	0	21	17	16	14	11	0	0	8	9	11	9	7	4	1
	City	243	185	235	218	229	0	0	31	23	30	27	28	0	0	19	16	17	15	7	6	3
	ComicsCharacter	98	77	64	35	11	0	0	13	9	8	5	2	0	0	0	7	8	7	8	0	0
	Company	83	76	76	57	34	16	9	12	11	11	8	5	2	1	15	15	13	11	6	4	2
	Film	0	0	0	0	0	0	0	0	0	0	0	0	0	0	76	41	51	44	31	13	8
	Food	271	277	313	315	230	0	0	34	35	39	40	30	0	0	0	7	8	7	8	8	8
	MeanOfTransportation	298	211	228	242	153	0	0	38	27	28	31	19	0	0	12	15	15	13	2	1	0
	Monument	37	32	41	48	45	35	25	5	4	5	6	5	4	2	0	7	8	7	8	8	8
	MusicalWork	0	0	0	0	0	0	0	0	0	0	0	0	0	0	88	47	58	48	25	16	8
	Politician	299	248	258	243	146	0	0	38	31	32	31	19	0	0	0	6	8	7	8	0	0
	Scientist	0	0	0	0	0	0	0	0	0	0	0	0	0	0	77	52	45	46	27	8	4
	SportsTeam	251	170	169	149	43	0	0	32	22	22	18	5	0	0	0	6	8	7	7	8	8
	University	58	39	58	73	62	62	54	7	5	8	9	7	8	7	10	20	15	14	13	9	9
	WrittenWork	219	202	248	170	98	0	0	28	26	31	21	12	0	0	13	16	5	5	3	1	0
Total entries per triple size		3,195	2,472	2,735	2,549	1,879	203	178	403	335	320	320	238	25	22	369	349	350	305	213	114	79

Spanish WebNLG also differs from the other non-English languages supported by WebNLG in how the data was developed. For those languages, the previously existing translations were produced with machine translation followed by a crowdsourcing step, and they cover verbalizations only. In contrast, our Spanish WebNLG dataset includes translations for both triples and verbalizations, which adds more variability and consistency to the dataset.

To obtain the development and test data for each of the low-resource languages (Breton, Irish, Maltese, and Welsh), professional translators manually translated the English text from the WebNLG 2020 (Castro Ferreira et al., 2020) development and test sets, given both the English text and the input RDF graph. Only the first reference of each test example in the original English dataset was considered for translation, except in the case of Breton, which contains two translated references for some test items. For Russian, the WebNLG+ 2020 (Castro Ferreira et al., 2020) data was used. Although the Russian dataset includes partial triple translation—where only the DBpedia translation of entities was extracted, leaving relationships untranslated—it only contains data for nine of the 19 categories of entries: Airport, Astronaut, Building, CelestialBody, ComicsCharacter, Food, Monument, SportsTeam, and University. In all cases, the training split data is also available, with the verbalizations being generated exclusively through machine translation and thus considered "noisy" data.

In contrast, for Spanish, we translated all triples and verbalizations available in the English version of WebNLG. The translation process began with automatic generation, using Wikidata for triples, and was followed by a subsequent manual revision of all triples. Additionally, we detected and manually revised potentially problematic cases for verbalizations across all data splits, further ensuring quality and variability in the dataset. This comprehensive approach ensures the full availability of all translations in the Spanish dataset, unlike the partial triple translation for Russian and the limited translation for the other languages.

Referring again to Table 1, which reports the entry and verbalization sizes for each split across languages, we can observe that, in terms of size, English and Spanish have notably more available verbalizations for the validation and test splits. This is because the developers report that only the first English verbalization for each entry was considered for translation into low-resource languages.⁹ In our case, we believe that a wider availability of verbalizations can provide a more comprehensive representation of the various forms a text can take to express the triples.

Nevertheless, while Spanish WebNLG is designed to be a broadly useful benchmark, some aspects of its current release delimit its scope. First, the diversity of Spanish realizations is naturally constrained by the paraphrastic space present in the English WebNLG v3.0 dataset: we obtain our range of expressions by translating all existing English triple sets and verbalizations, which we assume are inherently coherent with each other due to their gold standard nature. This means that, although the corpus offers multiple references per triple set and a wider variety than single-reference resources, it should not be interpreted as an exhaustive sample of all possible Spanish verbalizations. Second, although all triples are manually checked and all verbalizations pass through automatic consistency filters, only a similarity-based subset of verbalizations is manually corrected by an annotator, which may leave some residual noise or annotator preference. Finally, this first release primarily reflects a standard Peninsular variety of Spanish, with more limited coverage of other regional varieties. We view a broader, crowdsourced revision and a dialect-aware expansion of the corpus as natural next steps to further strengthen the resource.

4. Evaluation of Triple-to-Text Generation in Spanish Using the Spanish WebNLG

The goal of this study is to conduct a preliminary evaluation of resource-efficient LLMs for the task of Spanish triple verbalization using the Spanish WebNLG dataset presented in Section 3. Specifically, we aim to explore how context learning and fine-tuning can enhance the performance of resource-efficient LLMs in generating natural language text from Spanish triples. We aim to answer three key research questions: (RQ1) "How effectively can resource-efficient LLMs verbalize Spanish triples across different complexity levels?" (where the complexity level is defined by the number of triples in a triple set), (RQ2) "How does task contextualization through examples impact model performance in Spanish triple verbalization?" and (RQ3) "What are the comparative advantages and limitations of prompt learning versus partial fine-tuning for Spanish triple verbalization?".

This section is structured as follows: In Section 4.1, we present the Learning Approaches and Models employed in our study. This includes an exploration of Context learning (Section 4.1.1) and its role in model performance, the methodology behind Supervised learning through Fine-tuning (Section 4.1.2), the criteria for Models’ selection (Section 4.1.3), and a description of the Evaluation metrics used to assess our models (Section 4.1.4). Section 4.2 provides a comprehensive overview of the Evaluation Results and Analysis. We first examine the Context Learning performance (Section 4.2.1) and compare it with the results obtained from Fine-tuning performance (Section 4.2.2). Additionally, we conduct a Cross-lingual Analysis (Section 4.2.3) to evaluate the adaptability of models across different languages. Finally, we perform an Error Analysis (Section 4.2.4) to identify common failure cases and areas for improvement. To conclude, in Section 4.3, we present a Discussion of our findings, reflecting on the key takeaways and outlining potential directions for future research.

With this structure, we aim to provide a clear and comprehensive examination of the learning approaches, their effectiveness, and the insights gained from our evaluation.

4.1. Learning Approaches and Models

To answer RQ1, "How effectively can resource-efficient LLMs verbalize Spanish triples across different complexity levels?", we assess the performance of different LLMs on the Spanish triple-to-text task, employing two approaches: Context-Based Learning by prompting and Data-based Learning by partial fine-tuning, explained in Sections 4.1.1 and 4.1.2, respectively. With the evaluation of these methods, we aim to answer the remaining two research questions: (RQ2), "How does task contextualization through examples impact model performance in Spanish triple verbalization?", and (RQ3), "What are the comparative advantages and limitations of prompt learning versus partial fine-tuning for Spanish triple verbalization?". For a fair comparison, both methods are analyzed using the same environment equipped with an RTX 3060 Laptop GPU (6 GB VRAM) and 16 GB RAM, with the same models (model selection criteria detailed in Section 4.1.3) and evaluated using the performance metrics detailed in Section 4.1.4. The evaluation presented in Section 4.2 has been conducted on the test split of the Spanish WebNLG. Each test has been conducted twice to assess the variability in the results, where variability refers to the performance differences in the generated outputs across runs. This helps capture the range of possible outcomes (or "hallucinations") that each evaluated model may produce.

Although our goal is to study the performance of the models on the Spanish task, we also run all the experiments described above (the context-based learning and data-based learning of the same models) on the English WebNLG set. The English results serve as a point of comparison that helps us determine whether the difficulties the models encounter are rooted exclusively in the task itself or also in the language.

4.1.1. Context Learning Through Prompts

The context learning approach, sometimes referred to as prompt-based learning, enables models to generate responses based on provided prompts without modifying their internal parameters. This approach leverages contextual cues from the input text to guide the model’s behavior, allowing it to adapt dynamically to different tasks. It operates under three main settings: zero-shot (0S), one-shot (1S), and few-shot (FS).

Zero-shot learning (0S): The model generates responses without any prior examples, relying solely on its pre-trained knowledge.

One-shot learning (1S): The model is given a single example to help it understand the task before generating responses.

Few-shot learning (FS): The model is provided with a small number of examples to better grasp patterns and context, improving response accuracy.

In our case, we will evaluate the context learning approach using 0S, 1S, and FS (with two examples) settings. We use the same base prompt in all the tests and only add examples, in the case of 1S and FS tests. Our base prompt introduces the structured data format and instructs the model to generate fluent, grammatically correct Spanish text from given triples:

En español, los datos estructurados se representan comúnmente como tripletas o triples, con el formato [sujeto, predicado, objeto]. A partir de estas tripletas, genera un texto de un solo párrafo formado por oraciones completas, gramaticalmente correctas y naturales. Genera el texto únicamente a partir de las siguientes tripletas:

Depending on the modality of the test, that is, if examples are provided, the triples are added to this base prompt with the format [sujeto: subject, predicado: predicate, objeto: object] as a list, followed by their verbalization.

As we stated previously, we also run all the experiments on the English WebNLG set using the same configuration of the experiments as the Spanish evaluation. In this case, we translated the original base prompt to English:

In English, structured data is commonly represented as triples, with the format [subject, predicate, object]. Based on these triples, generate a single-paragraph text composed of complete, grammatically correct, and natural sentences. Generate the text solely from the following triples:

Again, depending on the modality of the test, the triples are added to this base prompt with the format [subject: subject, predicate: predicate, object: object] as a list, followed by their verbalization.

The full prompts used to evaluate the models are available in Appendix A.
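The way the base prompt, the optional examples, and the target triples are combined can be sketched as below. This is a hedged illustration only: the helper names `format_triples` and `build_prompt`, the line-break layout, and the example triples and verbalizations are assumptions of ours, and the exact prompts used in the experiments are those given in Appendix A.

```python
BASE_PROMPT = (
    "En español, los datos estructurados se representan comúnmente como "
    "tripletas o triples, con el formato [sujeto, predicado, objeto]. "
    "A partir de estas tripletas, genera un texto de un solo párrafo formado "
    "por oraciones completas, gramaticalmente correctas y naturales. "
    "Genera el texto únicamente a partir de las siguientes tripletas:"
)


def format_triples(triples):
    """Render each (subject, predicate, object) tuple as one labelled list line."""
    return "\n".join(
        f"[sujeto: {s}, predicado: {p}, objeto: {o}]" for s, p, o in triples
    )


def build_prompt(triples, examples=()):
    """Zero-shot when `examples` is empty; one- or few-shot otherwise.

    Each example is a (triples, verbalization) pair placed before the target.
    """
    parts = [BASE_PROMPT]
    for example_triples, example_text in examples:
        parts.append(format_triples(example_triples))
        parts.append(example_text)
    parts.append(format_triples(triples))
    return "\n".join(parts)


# Hypothetical triples and verbalizations, for illustration only.
target = [("Madrid", "país", "España")]
demos = [
    ([("París", "país", "Francia")], "París se encuentra en Francia."),
    ([("Roma", "país", "Italia")], "Roma se encuentra en Italia."),
]
zero_shot = build_prompt(target)
few_shot = build_prompt(target, examples=demos)
print(few_shot)
```

Passing a single pair in `examples` yields the 1S setting, and two pairs yield the FS setting used in our tests.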

4.1.2. Data-Based Learning Through Fine-Tuning

In contrast to the previous approach, the supervised learning method we used involves a partial fine-tuning approach, which adapts the model’s weights using low-rank adaptation (LoRA) (Hu et al., 2021). This method is particularly efficient for fine-tuning large models, as it introduces small, trainable low-rank matrices into the model’s architecture rather than updating all of the model’s parameters. By doing so, LoRA significantly reduces the computational cost and memory requirements, making it feasible to fine-tune large-scale models on limited hardware resources such as ours.

The fine-tuning process in our case lasts approximately 4–5 hours per model on the same hardware, which is considerably faster and more resource-efficient compared to full fine-tuning methods. LoRA achieves this efficiency by freezing the pre-trained model weights and injecting trainable low-rank decomposition matrices into specific layers (in our case, the attention layers of the LLMs). This approach preserves the generalization capabilities of the pre-trained model and enables a faster adaptation to downstream tasks, such as triple verbalization, in comparison to the fine-tuning of the whole model.
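To give a sense of why the low-rank decomposition is cheap, the sketch below counts trainable parameters for a single weight matrix. In LoRA, a frozen weight of shape d_out × d_in is complemented by two trainable factors of shapes d_out × r and r × d_in. The layer size and rank used here are hypothetical values chosen for illustration, not the configuration of our experiments.

```python
def full_params(d_out, d_in):
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank (assumptions, not our actual settings).
d_out = d_in = 2048
rank = 8

full = full_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
ratio = lora / full
print(full, lora, f"{ratio:.2%}")
```

Under these illustrative values, the trainable factors amount to less than 1% of the parameters of the matrix they adapt, which is what makes fine-tuning feasible on limited VRAM.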

During the fine-tuning phase, we employ the same base prompt used in the zero-shot setting of the context learning approach to ensure a fair comparison and evaluate the differences in performance under identical conditions. This consistency allows us to isolate the impact of LoRA-based fine-tuning and demonstrate its effectiveness in improving model performance without extensive computational overhead.

In this approach, we also evaluate the models with the English WebNLG. That means that we fine-tune each model twice, once with each of the following prompts:

Spanish base prompt: “En español, los datos estructurados se representan comúnmente como tripletas o triples, con el formato [sujeto, predicado, objeto]. A partir de estas tripletas, genera un texto de un solo párrafo formado por oraciones completas, gramaticalmente correctas y naturales. Genera el texto únicamente a partir de las siguientes tripletas:”

English base prompt: “In English, structured data is commonly represented as triples, with the format [subject, predicate, object]. Based on these triples, generate a single-paragraph text composed of complete, grammatically correct, and natural sentences. Generate the text solely from the following triples:”

The models are fine-tuned using the train split of WebNLG (both in English and Spanish) as training examples and the validation split for reference evaluation. The code and the results are available on GitHub¹⁰ and Zenodo.¹¹

4.1.3. Model Selection

To select the LLMs to evaluate, we defined four main criteria:

The models must be (non-exclusively) trained on Spanish: Models must be able to handle data in Spanish to ensure high-quality text generation in our task, as language-specific training enhances fluency, coherence, and grammatical accuracy.

The models can have up to 2 billion parameters: Over time, LLMs have tended to grow larger and more resource-intensive. This means that they often become inaccessible to regular users who lack the necessary hardware to run them efficiently, as high-end GPUs and large amounts of memory are typically required. At the same time, when dealing with large amounts of data, in our task sometimes involving thousands or millions of triple sets, the resources and/or time needed to process them grow substantially. To address this, we focus on resource-efficient models that can run efficiently (in our case, this means that it takes a few seconds to compute each answer) on a machine with an RTX 3060 Laptop GPU (6 GB VRAM) and 16 GB RAM. This generally limits us to models with up to around 2 billion parameters, ensuring practical usability without compromising too much on performance.

The models must be trained for instruction-following tasks: Instruct LLMs (instruction-tuned LLMs) are language models fine-tuned to follow explicit natural language instructions. Unlike generic LLMs, which predict text based on training data patterns, these models are optimized to understand and execute user commands. They process structured prompts containing instructions, ranging from open-ended queries to specific tasks like summarization, translation, or coding. In our case, to ensure direct and well-structured outputs, the models must be trained for instruction-following tasks, eliminating the need for additional post-processing. Our goal is to generate responses that are already aligned with the given prompt, minimizing manual adjustments or corrections that could affect the evaluation results of model performance.

The models must be available in HuggingFace (Wolf et al., 2020).

Given these criteria, the models selected for testing come from two of the large open-source families of LLMs, Llama 3 (Grattafiori et al., 2024) and Qwen (Team, 2024; Yang et al., 2024), and from Salamandra (Gonzalez-Agirre et al., 2025), a family of LLMs created by the Barcelona Supercomputing Center¹² in Spain which, although it was trained with data from a wide range of European and programming languages, was mainly trained with English and Spanish data. Specifically, based on our criteria, we selected the following models¹³:

Qwen2.5-0.5B-Instruct¹⁴ and Qwen2.5-1.5B-Instruct¹⁵

Llama-3.2-1B-Instruct ¹⁶

Salamandra-2b-instruct ¹⁷

4.1.4. Evaluation Metrics

We evaluate the quality of the generated text using lexical similarity, semantic similarity, and efficiency metrics (Table 3), which are further explained throughout this section.

Lexical similarity metrics

Table 3.
Evaluation Metrics Categorized by Type.

Category	Metric	Range	Best Value
Lexical Similarity (Information preservation)	BLEU	0–1	Higher is better
	METEOR	0–1	Higher is better
Lexical Similarity (Linguistic fluency)	CHRF++	0–1	Higher is better
Semantic Similarity	Cosine Similarity	0–1	Higher is better
	BERTScore (P, R, F1)	0–1	Higher is better
Efficiency	Time (seconds)	0– $\infty$	Lower is better

BLEU = Bilingual Evaluation Understudy; METEOR = Metric for Evaluation of Translation with Explicit ORdering; CHRF++ = Character n-gram F-score, extended version.

Lexical similarity metrics measure the surface-level overlap between generated and reference texts, focusing on the exact words or n-grams that appear in both texts. The metrics selected for this purpose can be divided between information preservation metrics and language fluency metrics:

– Information preservation metrics: These metrics assess how well the generated text retains the content of the reference text by evaluating n-gram overlap and word alignment. These metrics include:

* BLEU (Bilingual Evaluation Understudy) (Papineni et al., 2002): BLEU measures the n-gram overlap between generated and reference texts. It is calculated as a weighted geometric mean of n-gram precisions, with a brevity penalty that penalizes overly short outputs. In our evaluation, we set n = 4. BLEU scores range from 0 to 1, with higher scores indicating closer matches. Because the metric relies on exact word matches and does not account for synonyms or paraphrasing, it is less effective for evaluating texts with flexible word choices. As discussed in Section 2.3, it also has known limitations in natural language generation, including weak correlations with human judgements in some settings (Mathur et al., 2020; Reiter, 2018). We therefore report BLEU alongside additional lexical and semantic similarity metrics.

* METEOR (Metric for Evaluation of Translation with Explicit ORdering) (Lavie & Agarwal, 2007): METEOR improves upon BLEU by incorporating precision, recall, stemming, synonym matching, and word order penalties. Unlike BLEU, METEOR aligns words between the generated and reference texts and computes a harmonic mean of precision and recall. METEOR scores range from 0 to 1, with higher values indicating better alignment between the predicted and reference sentences. This metric is useful for evaluating languages with rich morphology, such as Spanish, where word forms can vary significantly.

– Language fluency metrics: These metrics assess the readability and naturalness of generated text by analyzing character-level and subword-level coherence. These metrics include:

* CHRF++ (Character n-gram F-score, extended version) (Popović, 2017): CHRF++ calculates an F-score based on the overlap of character n-grams between the reference and generated texts, complemented by short word n-grams. In our evaluation, we set n = 6. The CHRF++ score ranges from 0 to 1, with higher values indicating better similarity. Since this metric operates mainly at the character level, it is more robust to minor spelling variations and inflectional changes compared to BLEU.
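The overlap computations behind these lexical metrics can be sketched in a few lines. The code below is a simplified, single-reference illustration and not the implementation used in our evaluation: `sentence_bleu` follows the standard BLEU formulation (clipped n-gram precisions for orders 1 to `max_n`, geometric mean, brevity penalty) without smoothing, and `char_ngram_fscore` uses a single character n-gram order with beta = 2 (an assumed value), whereas CHRF++ averages over several orders and adds word n-grams. All function names are ours.

```python
import math
from collections import Counter


def ngrams(items, n):
    """Count the n-grams of a sequence (tokens or characters)."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: candidate counts are capped by reference counts."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum(min(count, ref[gram]) for gram, count in cand.items()) / total


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * geo_mean


def char_ngram_fscore(candidate, reference, n=6, beta=2.0):
    """F-beta over character n-grams of one order (simplified chrF-style score)."""
    cand, ref = ngrams(list(candidate), n), ngrams(list(reference), n)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


reference = "a b c d f".split()
candidate = "a b c d e".split()
bleu = sentence_bleu(candidate, reference)
print(round(bleu, 4))
```

In the toy pair, one differing final token already lowers every n-gram precision, which illustrates how sensitive BLEU is to exact word matches.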

Semantic similarity metrics

In contrast to previous metrics, semantic similarity metrics focus on meaning rather than exact matches. These metrics compare the underlying meaning or context of the sentences, making them more robust to synonyms, paraphrasing, or other linguistic variations that don’t affect the underlying content. These metrics include, among others:

– Cosine Similarity: This metric computes the cosine similarity between sentence embeddings, which are high-dimensional vector representations of the sentences. Sentence embeddings capture semantic meaning beyond exact word overlap. The value ranges from 0 to 1, where 1 indicates identical meaning. For our experiments, we selected the sentence model paraphrase-multilingual-MiniLM-L12-v2¹⁸ (Reimers & Gurevych, 2019), given that it is currently the most downloaded sentence transformers model with Spanish support.¹⁹ We deliberately opt for a multilingual encoder rather than a Spanish-only sentence model to keep the embedding space consistent across our Spanish and English evaluations: both languages are embedded into a shared semantic space, enabling direct cross-lingual comparison. Since our goal is not to benchmark embedding models, we leave a systematic study of other encoders such as monolingual Spanish encoders for future work.

– BERTScore (Zhang et al., 2020): BERTScore computes contextual token-level similarities using embeddings from a pretrained BERT model. BERTScore considers precision, recall, and the F1-score for evaluating the match between the predicted and reference sentences at the semantic level taking values between 0 and 1, where the higher the value the better. For BERTScore, we use the HuggingFace (Wolf et al., 2020) "evaluate" wrapping of the metric implementation (Zhang et al., 2020), which in our case corresponds to the bert-base-multilingual-cased model (Devlin et al., 2019) with the default layer configuration.
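Both semantic metrics reduce to cosine similarities between vectors, which the following sketch illustrates with toy two-dimensional vectors. It is a simplified illustration, not the evaluation code: real embeddings come from the encoders named above, the function names are ours, and `greedy_match_scores` reproduces only the greedy matching idea of BERTScore (each token is matched to its most similar counterpart), without options such as importance weighting or baseline rescaling.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_scores(candidate_vecs, reference_vecs):
    """BERTScore-style precision, recall, and F1 from token embeddings."""
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy "token embeddings" (hypothetical values, for illustration only).
candidate_tokens = [[1.0, 0.0], [0.0, 1.0]]
reference_tokens = [[1.0, 0.0], [1.0, 1.0]]
p, r, f1 = greedy_match_scores(candidate_tokens, reference_tokens)
print(round(p, 4), round(r, 4), round(f1, 4))
```

Precision averages over candidate tokens and recall over reference tokens, so the two values differ whenever one side contains content the other lacks.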

Efficiency metric:

Efficiency metrics measure the computational cost of generating text. In our case, the efficiency metric chosen is the following:

– Time: This metric measures the duration required to generate a text output in seconds, with lower values indicating more efficient text generation.
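One simple way to obtain an average generation time per triple set is sketched below. This is an assumption about how such a measurement can be made, not a description of our timing code: `average_generation_time` is a hypothetical helper, and `dummy_generate` is a stand-in for an actual model call.

```python
import time


def average_generation_time(generate, triple_sets):
    """Return (mean seconds per triple set, generated outputs)."""
    start = time.perf_counter()
    outputs = [generate(triple_set) for triple_set in triple_sets]
    elapsed = time.perf_counter() - start
    return elapsed / len(triple_sets), outputs


def dummy_generate(triple_set):
    """Hypothetical stand-in for a model call: joins the triple elements."""
    return " ".join(" ".join(triple) for triple in triple_set)


# Hypothetical triple sets, for illustration only.
sets = [
    [("Madrid", "país", "España")],
    [("Roma", "país", "Italia"), ("Italia", "capital", "Roma")],
]
avg_seconds, texts = average_generation_time(dummy_generate, sets)
print(f"{avg_seconds:.6f} s per triple set")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.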

4.2. Evaluation Results and Analysis

This section assesses the performance of the models in both context learning and fine-tuning scenarios, with a primary focus on the Spanish evaluation. The analysis is based on Table 4, which presents the average results of the Spanish evaluation across all tests (for the full results of the Spanish evaluation, see Appendix B). First, we analyze the models' ability to perform the task using context learning in Section 4.2.1. Next, in Section 4.2.2, we present the results of fine-tuning on the Spanish dataset, comparing its performance to context learning and examining its efficiency and effectiveness. We then conduct a cross-lingual analysis in Section 4.2.3 to explore how the models' performance differs between English and Spanish (for the full results of the English evaluation, see Appendix C). Finally, in Section 4.2.4, we perform an error analysis to understand the types of errors made by the models in Spanish.

Table 4.
Spanish Evaluation Results.

Eval. modality	Model	BLEU	METEOR	CHRF++	Cosine Similarity	BERTScore Precision	BERTScore Recall	BERTScore F1	Average time per triple set (s)
Zero	Qwen2.5-0.5B-Instruct	0.097 $\pm$ 0.7%	0.376 $\pm$ 0.1%	0.422 $\pm$ 0.6%	0.697 $\pm$ 0.3%	0.699 $\pm$ 0.3%	0.790 $\pm$ 0.1%	0.739 $\pm$ 0.2%	1.911 $\pm$ 0.5%
	Llama-3.2-1B-Instruct	0.116 $\pm$ 1.2%	0.370 $\pm$ 0.5%	0.430 $\pm$ 0.4%	0.674 $\pm$ 0.2%	0.728 $\pm$ 0.1%	0.774 $\pm$ 0.1%	0.748 $\pm$ 0.1%	1.955 $\pm$ 0.6%
	Qwen2.5-1.5B-Instruct	0.141 $\pm$ 0.8%	0.473 $\pm$ 0.2%	0.474 $\pm$ 0.5%	0.804 $\pm$ 0.5%	0.729 $\pm$ 0.2%	0.824 $\pm$ 0.0%	0.771 $\pm$ 0.1%	3.129 $\pm$ 1.4%
	Salamandra-2B-Instruct	0.122 $\pm$ 0.9%	0.377 $\pm$ 0.1%	0.444 $\pm$ 0.2%	0.698 $\pm$ 0.2%	0.730 $\pm$ 0.1%	0.773 $\pm$ 0.1%	0.749 $\pm$ 0.1%	1.488 $\pm$ 0.5%
One	Qwen2.5-0.5B-Instruct	0.197 $\pm$ 1.0%	0.465 $\pm$ 0.8%	0.510 $\pm$ 0.3%	0.809 $\pm$ 0.1%	0.803 $\pm$ 0.0%	0.817 $\pm$ 0.1%	0.808 $\pm$ 0.0%	0.787 $\pm$ 0.8%
	Llama-3.2-1B-Instruct	0.189 $\pm$ 1.4%	0.464 $\pm$ 0.5%	0.524 $\pm$ 0.6%	0.830 $\pm$ 0.6%	0.809 $\pm$ 0.1%	0.821 $\pm$ 0.2%	0.813 $\pm$ 0.1%	1.174 $\pm$ 2.1%
	Qwen2.5-1.5B-Instruct	0.256 $\pm$ 0.5%	0.559 $\pm$ 0.2%	0.599 $\pm$ 0.1%	0.894 $\pm$ 0.0%	0.839 $\pm$ 0.1%	0.858 $\pm$ 0.0%	0.847 $\pm$ 0.0%	1.454 $\pm$ 0.7%
	Salamandra-2B-Instruct	0.132 $\pm$ 0.8%	0.393 $\pm$ 0.2%	0.461 $\pm$ 0.0%	0.716 $\pm$ 0.6%	0.744 $\pm$ 0.0%	0.781 $\pm$ 0.0%	0.760 $\pm$ 0.0%	1.406 $\pm$ 0.8%
Few	Qwen2.5-0.5B-Instruct	0.226 $\pm$ 0.1%	0.504 $\pm$ 0.0%	0.544 $\pm$ 0.0%	0.844 $\pm$ 0.2%	0.820 $\pm$ 0.0%	0.831 $\pm$ 0.0%	0.823 $\pm$ 0.0%	0.794 $\pm$ 0.9%
	Llama-3.2-1B-Instruct	0.228 $\pm$ 0.4%	0.508 $\pm$ 0.1%	0.561 $\pm$ 0.2%	0.858 $\pm$ 0.2%	0.821 $\pm$ 0.1%	0.834 $\pm$ 0.0%	0.826 $\pm$ 0.1%	1.210 $\pm$ 0.6%
	Qwen2.5-1.5B-Instruct	0.269 $\pm$ 0.1%	0.563 $\pm$ 0.2%	0.604 $\pm$ 0.0%	0.898 $\pm$ 0.0%	0.843 $\pm$ 0.0%	0.860 $\pm$ 0.0%	0.850 $\pm$ 0.0%	1.426 $\pm$ 0.4%
	Salamandra-2B-Instruct	0.178 $\pm$ 0.6%	0.451 $\pm$ 0.3%	0.513 $\pm$ 0.1%	0.805 $\pm$ 0.2%	0.789 $\pm$ 0.0%	0.811 $\pm$ 0.1%	0.798 $\pm$ 0.1%	1.184 $\pm$ 1.7%
Fine-tuned	Qwen2.5-0.5B-Instruct	0.310 $\pm$ 0.4%	0.601 $\pm$ 0.4%	0.631 $\pm$ 0.3%	0.898 $\pm$ 0.1%	0.856 $\pm$ 0.1%	0.860 $\pm$ 0.1%	0.857 $\pm$ 0.1%	1.126 $\pm$ 0.8%
	Llama-3.2-1B-Instruct	0.141 $\pm$ 0.0%	0.475 $\pm$ 0.0%	0.435 $\pm$ 0.1%	0.848 $\pm$ 0.0%	0.665 $\pm$ 0.0%	0.782 $\pm$ 0.0%	0.713 $\pm$ 0.0%	3.565 $\pm$ 0.1%
	Qwen2.5-1.5B-Instruct	0.348 $\pm$ 0.6%	0.640 $\pm$ 0.0%	0.660 $\pm$ 0.2%	0.916 $\pm$ 0.0%	0.870 $\pm$ 0.0%	0.873 $\pm$ 0.0%	0.870 $\pm$ 0.0%	1.310 $\pm$ 0.1%
	Salamandra-2B-Instruct	0.279 $\pm$ 0.9%	0.562 $\pm$ 0.4%	0.606 $\pm$ 0.1%	0.896 $\pm$ 0.0%	0.853 $\pm$ 0.1%	0.851 $\pm$ 0.1%	0.851 $\pm$ 0.1%	0.909 $\pm$ 0.0%

BLEU = Bilingual Evaluation Understudy; METEOR = Metric for Evaluation of Translation with Explicit ORdering; CHRF++ = Character n-gram F-score, extended version.

4.2.1. Context Learning Performance

The results of our experiments on Spanish triple verbalization using a context learning approach (Table 4 and Figure 4) reveal several insights. First, performance improves markedly from 0S to 1S learning and only slightly from 1S to FS learning. This suggests that even a single example substantially improves the model’s understanding of the task, likely due to better contextualization. The improvement also extends to generation time: the 1S and FS settings generally require less time per triple set than 0S. This efficiency gain can be attributed to more precise outputs with less irrelevant information, as 0S often produced additional content outside the scope of the triples, which in some cases roughly doubled the generation time. For example, when given the triple [sujeto: MotorSport_Vision, predicado: Ciudad, objecto: Fawkham]²⁰ in the 0S setting, Qwen2.5-1.5B-Instruct generated “Aunque no tengo información adicional sobre la ciudad Fawkham, puedo generar un parágrafo basado en las tripletas proporcionadas: La empresa MotorSport_Vision es conocida por su apasionante interés en documentar eventos deportivos de alta categoría, incluyendo carreras de coches y motocicletas. Sus reportajes son famosos por capturar momentos inolvidables durante competiciones globales, desde la Gran Premi d’Alanya hasta la Copa del Mundo de Motocross. Una de sus últimas iniciativas destacadas fue la producción de una serie exclusiva que se centró en la historia y evolución de la carrera de rallyes en Fawkham, uno de los circuitos más emblemáticos de Europa. Este proyecto ha sido especialmente fascinante para el equipo de MotorSport_Vision, ya que ofrece una perspectiva única sobre cómo han cambiado las reglas y los patrones de la industria de la velocidad en la región.”,²¹ in contrast to the FS setting, in which it generated “La empresa MotorSport Vision tiene su sede en Fawkham.”²²
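The size of the two jumps can be checked directly from Table 4. The following is a minimal sketch that averages BLEU and generation time over the four models for each context learning setting; the dictionary layout and the helper name `setting_means` are illustrative choices, and the numbers are copied from the Spanish rows of Table 4.

```python
from statistics import mean

# (BLEU, seconds per triple set) copied from the Spanish rows of Table 4.
table4 = {
    "0S": {
        "Qwen2.5-0.5B-Instruct": (0.097, 1.911),
        "Llama-3.2-1B-Instruct": (0.116, 1.955),
        "Qwen2.5-1.5B-Instruct": (0.141, 3.129),
        "Salamandra-2B-Instruct": (0.122, 1.488),
    },
    "1S": {
        "Qwen2.5-0.5B-Instruct": (0.197, 0.787),
        "Llama-3.2-1B-Instruct": (0.189, 1.174),
        "Qwen2.5-1.5B-Instruct": (0.256, 1.454),
        "Salamandra-2B-Instruct": (0.132, 1.406),
    },
    "FS": {
        "Qwen2.5-0.5B-Instruct": (0.226, 0.794),
        "Llama-3.2-1B-Instruct": (0.228, 1.210),
        "Qwen2.5-1.5B-Instruct": (0.269, 1.426),
        "Salamandra-2B-Instruct": (0.178, 1.184),
    },
}


def setting_means(setting):
    """Return (mean BLEU, mean seconds) over all models for one setting."""
    bleu, seconds = zip(*table4[setting].values())
    return mean(bleu), mean(seconds)


for setting in ("0S", "1S", "FS"):
    bleu, seconds = setting_means(setting)
    print(f"{setting}: mean BLEU = {bleu:.3f}, mean time = {seconds:.3f} s")
```

Averaging over models hides per-model differences, so this is only a coarse summary of the trend, not a replacement for the per-model rows.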

Figure 4.

Average Spanish context learning results by triple set size. (a) BLEU, (b) METEOR, (c) CHRF++, (d) Cosine Similarity, (e) BERTScore Precision, (f) BERTScore Recall, and (g) BERTScore F1. BLEU = Bilingual Evaluation Understudy; METEOR = Metric for Evaluation of Translation with Explicit ORdering; CHRF++ = Character n-gram F-score, extended version.

In terms of evaluation metrics, BLEU scores are notably low, with a maximum of 0.27 (on a scale from 0 to 1). As discussed in Sections 2.3 and 4.1.4, this behavior is consistent with known limitations of BLEU, which relies on strict lexical n-gram overlap and tends to under-reward valid paraphrases and alternative surface realizations. In contrast, other lexical metrics such as METEOR and CHRF++ yield more favorable results, ranging between 0.47 and 0.60 for the best-performing model (Qwen2.5-1.5B-Instruct), as they are more tolerant of morphological variation and partial lexical matches. These scores suggest that, despite lexical differences from the references, the generated outputs capture more of the intended content than BLEU alone would indicate. Similarity-based metrics provide further evidence of the model’s effectiveness. For the same model, cosine similarity increases from 0.80 in 0S to nearly 0.90 in the 1S and FS settings, indicating that the generated content is contextually closer to the reference text when examples are provided. BERTScore shows a similar pattern: recall remains fairly stable (0.82–0.86), while precision rises from 0.73 in 0S to around 0.84 in 1S and FS, and F1 improves from 0.77 to 0.85. These results suggest that, while 0S outputs may include unnecessary information, the 1S and FS settings produce more focused and contextually aligned responses.
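The asymmetry between BERTScore precision and recall follows from how the two are defined. As a reminder, and assuming the standard greedy-matching formulation of BERTScore (the embedding model and any baseline rescaling are not restated here), for reference tokens $x$ and generated tokens $\hat{x}$ with normalized contextual embeddings $\mathbf{x}_i$ and $\hat{\mathbf{x}}_j$:

```latex
R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j ,
\qquad
P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j ,
\qquad
F_{\mathrm{BERT}} = 2 \, \frac{P_{\mathrm{BERT}} \, R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}
```

Recall averages over reference tokens, so it stays high as long as the reference content appears somewhere in the output. Precision averages over generated tokens, so every off-topic token in a verbose 0S output pulls it down. Note that scores averaged over a test set need not satisfy the F1 identity exactly, because the identity holds per sentence pair.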

At the same time, the plots in Figure 4, which break the results down by triple set size, show that performance generally drops as the number of triples increases, and that 1S and FS (blue and red in the figure) generally perform better than 0S (green). A closer look, however, reveals the opposite trend for 0S on BLEU, CHRF++, cosine similarity, and BERTScore Precision: the more triples, the better the performance. This can probably be explained by the 0S behavior described above, where models often produced additional content outside the scope of the triples. The more triples a set has, the longer its verbalization has to be. If the model tends to produce longer and richer answers when no examples are given, then the longer the ground truth, the more likely it is that the generated text overlaps with the reference. Moreover, despite our efforts to explain what triples are and to clarify our goal, it is important to recognize that we are introducing structured data to models that were generally not trained to handle such information. These models may lack domain-specific knowledge, so the prompt may not be descriptive enough without examples. Consequently, this lack of context could affect the quality and accuracy of the verbalization, particularly in cases where structured data is involved.

Among the models evaluated, Qwen2.5-1.5B-Instruct outperforms the others on virtually all quality metrics in every setting. Although it is less time-efficient, its strong performance in the 1S and FS settings highlights its robustness for Spanish triple verbalization. Surprisingly, Salamandra-2B-Instruct, which we expected to be among the better-performing models due to its Spanish-centric training (Spanish being the second most prominent language in its training data) and its development by a Spanish institution, underperforms relative to the other models, particularly in the 1S and FS settings. This suggests that factors beyond language-specific training, such as task-specific fine-tuning, may play a more critical role in its performance.

Taken together, these findings highlight the importance of contextualization in helping models correctly interpret structured inputs such as triples, as well as the limitations of relying on a single evaluation metric. As discussed earlier, lexical-overlap measures alone may fail to reflect improvements in meaning preservation when multiple valid verbalized forms are possible, a limitation widely noted in natural language generation research (Mathur et al., 2020; Reiter, 2018). By contrast, considering both lexical and similarity-based metrics offers a clearer view of model behavior, showing that although lexical overlap may remain limited, semantic and contextual alignment improves noticeably in 1S and FS settings.

4.2.2. Fine-Tuning Performance

As explained in Section 4.1.2, we fine-tune the models on our dataset with LoRA, using the same LoRA configuration for every model. LoRA is applied to the attention layers because they are the most critical for learning contextual relationships, and restricting the update to them significantly reduces the number of trainable parameters, making fine-tuning more efficient. Table 5 shows the parameter configuration of each model. Overall, we fine-tune only between roughly 0.17% and 0.44% of the parameters. Notably, Salamandra-2B-Instruct has the smallest proportion of trainable parameters despite having the most total parameters. Here, we have to take into account that the number of trainable parameters depends on how many attention layers the model has and on their size, rather than on the total parameter count. A possible explanation is that Salamandra-2B-Instruct has fewer or smaller attention layers relative to its total size, meaning that LoRA modifies a smaller portion of the model. Other architectures, like Qwen or Llama, might distribute their parameters differently, with a larger fraction dedicated to attention layers, leading to more trainable parameters under the same LoRA configuration.
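To make the dependence on attention-layer shape concrete, the sketch below counts the parameters that LoRA adds. For a weight matrix of shape d_out x d_in, a rank-r adapter adds two matrices, A (r x d_in) and B (d_out x r), that is, r * (d_in + d_out) trainable parameters. The rank, layer counts, and projection shapes used here are hypothetical and are not the configuration from Section 4.1.2 or the dimensions of any evaluated model.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_total(layers: int, projections: list[tuple[int, int]], rank: int) -> int:
    """Total LoRA parameters when the same projections are adapted in every layer."""
    per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in projections)
    return layers * per_layer


# Hypothetical model A: fewer layers, wider attention projections.
model_a = lora_total(layers=16, projections=[(2048, 2048), (2048, 512)], rank=8)
# Hypothetical model B: more layers, narrower attention projections.
model_b = lora_total(layers=24, projections=[(1024, 1024), (1024, 256)], rank=8)

print(f"model A: {model_a:,} trainable LoRA parameters")
print(f"model B: {model_b:,} trainable LoRA parameters")
```

The count depends only on the rank, the number of adapted layers, and the shapes of the adapted projections. Parameters elsewhere in the model (embeddings, feed-forward blocks) raise the total but not the trainable count, which is one way a larger model can end up with a smaller trainable share.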

Table 5.
Models’ Parameters Fine-Tune Configuration.

Model	Total parameters	Trainable parameters	Non-trainable parameters
Qwen2.5-0.5B-Instruct	496,232,320	2,199,552	494,032,768
Llama-3.2-1B-Instruct	1,238,632,448	2,818,048	1,235,814,400
Qwen2.5-1.5B-Instruct	1,548,330,496	4,616,192	1,543,714,304
Salamandra-2B-Instruct	2,253,490,176	3,729,408	2,249,760,768
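The share of trainable parameters per model follows directly from Table 5. A minimal sketch of the arithmetic, with the counts copied from the table (the variable and function names are illustrative):

```python
# (total parameters, trainable parameters) copied from Table 5.
table5 = {
    "Qwen2.5-0.5B-Instruct": (496_232_320, 2_199_552),
    "Llama-3.2-1B-Instruct": (1_238_632_448, 2_818_048),
    "Qwen2.5-1.5B-Instruct": (1_548_330_496, 4_616_192),
    "Salamandra-2B-Instruct": (2_253_490_176, 3_729_408),
}


def trainable_share(total: int, trainable: int) -> float:
    """Percentage of parameters updated during fine-tuning."""
    return 100.0 * trainable / total


shares = {model: trainable_share(*counts) for model, counts in table5.items()}
for model, share in shares.items():
    print(f"{model}: {share:.2f}% trainable")
```

Sorting `shares` shows that the largest model has the smallest trainable share and the smallest model the largest.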

Table 4 shows that moving from FS learning to fine-tuning brings notable improvements across all quality metrics for Qwen2.5-0.5B-Instruct, Qwen2.5-1.5B-Instruct, and Salamandra-2B-Instruct. Time efficiency is the exception: Qwen2.5-0.5B-Instruct becomes slower (from 0.794 s to 1.126 s per triple set), while Qwen2.5-1.5B-Instruct and Salamandra-2B-Instruct become faster. This suggests that fine-tuning enhances the models’ ability to generate more accurate and contextually appropriate text, albeit at a potential cost in generation speed for some models.

As illustrated in Figure 5, fine-tuned models exhibit greater stability in performance as the size of the triple set increases, compared to the variability seen in 0S, 1S, and FS settings. However, Llama-3.2-1B-Instruct stands out as an exception, showing a significant drop in performance relative to its FS results. This indicates that Llama-3.2-1B-Instruct may not have adapted effectively during fine-tuning, potentially due to limitations in its architecture or training dynamics. Additionally, while Qwen2.5-1.5B-Instruct and Salamandra-2B-Instruct improve in computation time, Qwen2.5-0.5B-Instruct and Llama-3.2-1B-Instruct experience degradation, with the latter taking more than double the time to generate text compared to its context learning performance.

Figure 5.

Average Spanish fine-tuning results by triple set size. (a) Qwen2.5-0.5B-Instruct, (b) Llama-3.2-1B-Instruct, (c) Qwen2.5-1.5B-Instruct, and (d) Salamandra-2B-Instruct.

Overall, these results demonstrate that even a small-scale fine-tuning process, such as the one employed in this study, can significantly enhance the performance of models for Spanish D2T generation. Fine-tuning improves the stability and quality of the generated text, but its benefits depend on how well each model adapts. While some models, like Qwen2.5-1.5B-Instruct, achieve strong results in both quality and efficiency after fine-tuning, others, like Llama-3.2-1B-Instruct, may require further optimization to achieve comparable results. These findings underscore the value of fine-tuning as a practical approach for improving multilingual models, particularly for underrepresented languages.

4.2.3. Cross-Lingual Analysis

As stated in the evaluation methodology, we also ran the same tests on the English WebNLG as on the Spanish WebNLG. Table 6 shows the results obtained for all models in the 0S, 1S, FS, and fine-tuned settings.

Table 6.
English Evaluation Results.

Eval. modality	Model	BLEU	METEOR	CHRF++	Cosine Similarity	BERTScore Precision	BERTScore Recall	BERTScore F1	Average time per triple set (s)
Zero	Qwen2.5-0.5B-Instruct	0.160 $\pm$ 0.1%	0.528 $\pm$ 0.0%	0.528 $\pm$ 0.0%	0.841 $\pm$ 0.1%	0.896 $\pm$ 0.0%	0.924 $\pm$ 0.0%	0.909 $\pm$ 0.0%	1.466 $\pm$ 5.6%
	Llama-3.2-1B-Instruct	0.118 $\pm$ 1.6%	0.468 $\pm$ 0.7%	0.483 $\pm$ 0.3%	0.698 $\pm$ 0.4%	0.857 $\pm$ 0.1%	0.908 $\pm$ 0.1%	0.881 $\pm$ 0.1%	1.515 $\pm$ 2.4%
	Qwen2.5-1.5B-Instruct	0.245 $\pm$ 0.8%	0.635 $\pm$ 0.1%	0.605 $\pm$ 0.2%	0.898 $\pm$ 0.0%	0.911 $\pm$ 0.0%	0.942 $\pm$ 0.0%	0.925 $\pm$ 0.0%	1.868 $\pm$ 0.2%
	Salamandra-2B-Instruct	0.127 $\pm$ 2.1%	0.460 $\pm$ 0.3%	0.476 $\pm$ 0.2%	0.770 $\pm$ 0.2%	0.872 $\pm$ 0.0%	0.904 $\pm$ 0.0%	0.887 $\pm$ 0.0%	1.885 $\pm$ 2.9%
One	Qwen2.5-0.5B-Instruct	0.213 $\pm$ 0.3%	0.586 $\pm$ 0.2%	0.584 $\pm$ 0.0%	0.870 $\pm$ 0.1%	0.914 $\pm$ 0.0%	0.930 $\pm$ 0.0%	0.921 $\pm$ 0.0%	1.177 $\pm$ 4.1%
	Llama-3.2-1B-Instruct	0.260 $\pm$ 0.2%	0.584 $\pm$ 0.2%	0.584 $\pm$ 0.2%	0.873 $\pm$ 0.2%	0.929 $\pm$ 0.0%	0.924 $\pm$ 0.0%	0.925 $\pm$ 0.0%	0.611 $\pm$ 3.0%
	Qwen2.5-1.5B-Instruct	0.365 $\pm$ 0.6%	0.708 $\pm$ 0.1%	0.699 $\pm$ 0.0%	0.939 $\pm$ 0.0%	0.945 $\pm$ 0.0%	0.949 $\pm$ 0.0%	0.946 $\pm$ 0.0%	1.172 $\pm$ 1.1%
	Salamandra-2B-Instruct	0.161 $\pm$ 0.5%	0.513 $\pm$ 0.4%	0.531 $\pm$ 0.4%	0.803 $\pm$ 0.3%	0.891 $\pm$ 0.0%	0.912 $\pm$ 0.1%	0.900 $\pm$ 0.0%	1.406 $\pm$ 2.0%
Few	Qwen2.5-0.5B-Instruct	0.256 $\pm$ 0.5%	0.623 $\pm$ 0.1%	0.622 $\pm$ 0.2%	0.892 $\pm$ 0.2%	0.922 $\pm$ 0.0%	0.935 $\pm$ 0.0%	0.928 $\pm$ 0.0%	1.109 $\pm$ 5.7%
	Llama-3.2-1B-Instruct	0.293 $\pm$ 0.3%	0.611 $\pm$ 0.1%	0.611 $\pm$ 0.3%	0.894 $\pm$ 0.1%	0.938 $\pm$ 0.0%	0.927 $\pm$ 0.0%	0.932 $\pm$ 0.0%	0.582 $\pm$ 2.0%
	Qwen2.5-1.5B-Instruct	0.358 $\pm$ 0.3%	0.700 $\pm$ 0.0%	0.696 $\pm$ 0.0%	0.940 $\pm$ 0.0%	0.945 $\pm$ 0.0%	0.948 $\pm$ 0.0%	0.946 $\pm$ 0.0%	1.175 $\pm$ 0.8%
	Salamandra-2B-Instruct	0.185 $\pm$ 0.2%	0.522 $\pm$ 0.6%	0.550 $\pm$ 0.5%	0.824 $\pm$ 0.5%	0.900 $\pm$ 0.1%	0.915 $\pm$ 0.1%	0.906 $\pm$ 0.1%	1.248 $\pm$ 2.2%
Fine-tuned	Qwen2.5-0.5B-Instruct	0.373 $\pm$ 0.9%	0.707 $\pm$ 0.0%	0.709 $\pm$ 0.1%	0.941 $\pm$ 0.0%	0.946 $\pm$ 0.0%	0.945 $\pm$ 0.0%	0.945 $\pm$ 0.0%	0.909 $\pm$ 1.4%
	Llama-3.2-1B-Instruct	0.417 $\pm$ 0.1%	0.743 $\pm$ 0.1%	0.738 $\pm$ 0.0%	0.951 $\pm$ 0.0%	0.950 $\pm$ 0.0%	0.952 $\pm$ 0.0%	0.950 $\pm$ 0.0%	0.679 $\pm$ 0.8%
	Qwen2.5-1.5B-Instruct	0.379 $\pm$ 0.6%	0.722 $\pm$ 0.3%	0.721 $\pm$ 0.2%	0.950 $\pm$ 0.1%	0.947 $\pm$ 0.0%	0.949 $\pm$ 0.0%	0.947 $\pm$ 0.0%	1.100 $\pm$ 0.4%
	Salamandra-2B-Instruct	0.301 $\pm$ 0.3%	0.655 $\pm$ 0.2%	0.671 $\pm$ 0.2%	0.928 $\pm$ 0.2%	0.933 $\pm$ 0.0%	0.936 $\pm$ 0.0%	0.934 $\pm$ 0.0%	0.919 $\pm$ 0.0%

BLEU = Bilingual Evaluation Understudy; METEOR = Metric for Evaluation of Translation with Explicit ORdering; CHRF++ = Character n-gram F-score, extended version.

When comparing the results for English and Spanish, it is evident that the models generally perform better with English input than with Spanish. This is expected, given that English constitutes a significant portion of the training data for most multilingual models, making them inherently more proficient in English. However, the performance gap also highlights the challenges of adapting these models to languages like Spanish, which, despite being widely spoken, may not receive the same level of representation in training datasets.

Across all context learning settings (0S, 1S, and FS), Qwen2.5-1.5B-Instruct consistently emerges as the best performer. This suggests that Qwen2.5-1.5B-Instruct may possess some inherent capability to generate text from triples, regardless of the input language. Its robustness in both English and Spanish settings underscores its versatility and effectiveness for structured text generation tasks, even in low-resource or 0S scenarios.

Interestingly, the Llama-3.2-1B-Instruct model presents a contrasting case. In the Spanish fine-tuning experiments, Llama-3.2-1B-Instruct struggled to adapt effectively to the triple verbalization task. In the English fine-tuning results, however, it achieves the best results across all metrics, including time efficiency, and it is also the fastest model in the English 1S and FS settings. This discrepancy suggests that the issue may lie not in the task itself but rather in the model’s ability to learn and generalize the task specifically for Spanish. This could be attributed to differences in linguistic characteristics, training data distribution, or the model’s architectural limitations when handling Spanish compared to English.
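The cross-lingual gap can be quantified from the two results tables. The sketch below computes the English-minus-Spanish BLEU difference per model for the FS and fine-tuned settings; the values are copied from Table 4 (Spanish) and Table 6 (English), and the structure and the helper name `language_gaps` are illustrative.

```python
# (Spanish BLEU from Table 4, English BLEU from Table 6) per model and setting.
bleu = {
    "few-shot": {
        "Qwen2.5-0.5B-Instruct": (0.226, 0.256),
        "Llama-3.2-1B-Instruct": (0.228, 0.293),
        "Qwen2.5-1.5B-Instruct": (0.269, 0.358),
        "Salamandra-2B-Instruct": (0.178, 0.185),
    },
    "fine-tuned": {
        "Qwen2.5-0.5B-Instruct": (0.310, 0.373),
        "Llama-3.2-1B-Instruct": (0.141, 0.417),
        "Qwen2.5-1.5B-Instruct": (0.348, 0.379),
        "Salamandra-2B-Instruct": (0.279, 0.301),
    },
}


def language_gaps(setting):
    """English minus Spanish BLEU per model for one setting."""
    return {model: round(en - es, 3) for model, (es, en) in bleu[setting].items()}


for setting in bleu:
    for model, gap in language_gaps(setting).items():
        print(f"{setting:10s} {model:24s} EN-ES BLEU gap = {gap:+.3f}")
```

BLEU values are not strictly comparable across languages, since tokenization and reference style differ, so these gaps are indicative only. Even so, the fine-tuned gap for Llama-3.2-1B-Instruct is several times larger than that of any other model.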

4.2.4. Error Analysis

To interpret some of the errors observed in our evaluation, it is useful to recall that English and Spanish encode grammatical information differently. English generally relies more on word order and has comparatively limited overt verbal inflection, whereas Spanish marks person and number more systematically on verbs and shows gender and number agreement within the noun phrase (Moreno-Sandoval & Goñi-Menoyo, 2002). For example, in English the verb eat shows relatively limited inflection across subjects ("I eat", "you eat", "she eats", and "we eat"), whereas in Spanish the verb exhibits richer inflectional variation ("yo como", "tú comes", "ella come", and "nosotros comemos"). Spanish also allows null (silent) subjects, since subject features are often recoverable from the verb (Ordóñez & Treviño, 1999). In addition, Spanish permits greater flexibility in the placement of subjects and objects than English, which further expands the set of grammatical realizations available for a given triple (Aguado-Orea et al., 2019). Taken together, these properties increase the likelihood that a faithful Spanish output may differ in surface form from a single reference. This is why we interpret lexical overlap metrics alongside more meaning-sensitive measures, in line with prior work showing that automatic metrics capture complementary aspects of NLG quality and are best used in combination (Mathur et al., 2020; Reiter, 2018).

Table 7 offers a good example of why a variety of evaluation metrics matters.²³ In the first row, Qwen2.5-1.5B-Instruct predicted “La empresa MotorSport Vision tiene su sede en Fawkham.” (“MotorSport Vision is based in Fawkham.”), which is a correct verbalization, since in this case “based” can be expressed in Spanish as “se ubica”, “está”, “se encuentra”, “tiene su base en”, and so on. Looking at the lexical results only, BLEU is 0.14, a very low score, while METEOR and CHRF++ reach 0.60 and 0.63, respectively, which is considerably higher than BLEU but still moderate. In contrast, the semantic metrics (cosine similarity and the three BERTScore values) are all at or above 0.91, which means that, semantically, the answer is very close to the ground truth.

Table 7.
Prediction Samples and Their Performance (see Footnote 23).

Model	Test modality	Triples	Ground truth	Prediction	BLEU	METEOR	CHRF++	Cosine Similarity	BERTScore Precision	BERTScore Recall	BERTScore F1
Qwen2.5-1.5B-Instruct	Few-shot	Triple 1: [Subject: MotorSport_Vision, Predicate: Ciudad, Object: Fawkham]	“MotorSport Vision se encuentra en la ciudad de Fawkham.”, “MotorSport Vision se encuentra en la ciudad de Fawkham, Reino Unido.”, and “MotorSport Vision se ubica en Fawkham.”	“La empresa MotorSport Vision tiene su sede en Fawkham.”	0.14	0.60	0.63	0.97	0.91	0.94	0.92
Salamandra-2B-Instruct	Zero-shot	Triple 1: [Subject: MotorSport_Vision, Predicate: Ciudad, Object: Fawkham]	“MotorSport Vision se encuentra en la ciudad de Fawkham.”, “MotorSport Vision se encuentra en la ciudad de Fawkham, Reino Unido.”, and “MotorSport Vision se ubica en Fawkham.”	“La ciudad Fawkham atrae cada año a miles de visitantes que disfrutan del hermoso paisaje natural y sus monumentos históricos. Los sitios más populares incluyen el Parque Nacional Mount Abundance, conocido por su gran variedad de flora y fauna; además, cuenta con una amplia gama de actividades recreativas para todas las edades. Para aquellos interesados en la historia local, no pueden perderse el Museo Histórico de Fawkham, donde podrán descubrir detalles interesantes sobre la vida cotidiana durante siglos pasados. Por lo tanto, si estás buscando lugares fascinantes cerca de Londres pero quieres evitar grandes multitudes, definitivamente deberías visitar Fawkham; te garantizo que no te arrepentirás!”	0.01	0.1	0.18	0.48	0.57	0.73	0.64

The second row of Table 7 shows the opposite case. For the same triple, Salamandra-2B-Instruct generated a very long text introducing the city of Fawkham, which is not the intended output. As in the first case, the lexical metrics score poorly and the similarity metrics score better, although worse than before. BLEU, METEOR, and CHRF++ are all below 0.2, as expected given the content of the prediction. The semantic scores are higher, ranging between 0.48 and 0.73, which is also expected: although the text is not the intended answer, it does talk about Fawkham, which is a relevant part of the content of the triple.

These examples make it clear that it is necessary to rely on a range of metrics to better understand the results obtained. As noted earlier, lexical-overlap measures capture only part of text quality in data-to-text generation, particularly when a single reference is used. We, therefore, treat BLEU as one indicator among others and place additional emphasis on METEOR and CHRF++, which better accommodate stemming, character-level overlap, and inflectional variation.
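The difference between word-level and character-level overlap can be seen on the first example of Table 7. The sketch below is a deliberately simplified stand-in, not the BLEU or CHRF++ implementation used in the evaluation: it computes a plain F1 over word bigrams and over character trigrams against a single reference. The helper names `ngrams` and `overlap_f1` and the choice of n are assumptions made for illustration.

```python
from collections import Counter


def ngrams(units, n):
    """Multiset of n-grams over a sequence of words or characters."""
    return Counter(tuple(units[i:i + n]) for i in range(len(units) - n + 1))


def overlap_f1(hyp_units, ref_units, n):
    """F1 of clipped n-gram overlap between a hypothesis and one reference."""
    hyp, ref = ngrams(hyp_units, n), ngrams(ref_units, n)
    matched = sum((hyp & ref).values())  # clipped counts of shared n-grams
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "MotorSport Vision se ubica en Fawkham."
prediction = "La empresa MotorSport Vision tiene su sede en Fawkham."

# Word bigrams only match where two consecutive words are identical.
word_f1 = overlap_f1(prediction.split(), reference.split(), n=2)
# Character trigrams also reward shared substrings inside longer spans.
char_f1 = overlap_f1(list(prediction), list(reference), n=3)

print(f"word bigram F1:    {word_f1:.2f}")
print(f"char trigram F1:   {char_f1:.2f}")
```

Only two word bigrams are shared between the prediction and this reference, whereas the character trigrams of "MotorSport Vision" and "en Fawkham." all match, so the character-level score is clearly higher for the same correct sentence.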

The second example in Table 7 also illustrates the problem of hallucination when the model fails to grasp the task it is given. In Section 4.2.1, we explained that, in 0S instances where the models did not understand the task, they tended to generate long, elaborate texts that were loosely related to some of the content of the triples but did not reflect their meaning. This is another such case, and it reinforces the idea that some form of learning is needed to run specific tasks on general-purpose LLMs, whether context learning, fine-tuning, transfer learning, or even training from scratch in some cases.

4.3. Discussion

The results of our study demonstrate the importance of contextualizing the task, fine-tuning the model, or both. Performance improves greatly for both Spanish and English when moving from the 0S to the 1S setting. Going one step further, small-scale fine-tuning, even with limited computational resources, significantly enhances model performance for Spanish D2T generation, as evidenced by improvements in both task-specific metrics and stability across varying triple set sizes (Figure 5). These findings underscore the practical value of fine-tuning for adapting multilingual models to underrepresented languages, where pre-trained models often lag behind their English counterparts due to disparities in training data representation.

Model-specific results reveal a few insights for practitioners. Qwen2.5-1.5B-Instruct emerges as the top performer across most metrics, benefiting substantially from fine-tuning and exhibiting robust cross-lingual capabilities. Similarly, Qwen2.5-0.5B-Instruct, while less efficient, delivers competitive results, positioning it as a viable option for low-resource scenarios where computational constraints call for smaller models.

On the other hand, Llama-3.2-1B-Instruct struggles with Spanish fine-tuning despite adequate English performance. This pattern suggests that, in our setting, pre-training language representation and model capacity may be more influential than task familiarity alone when adapting to Spanish triple-to-text generation. Finally, Salamandra-2B-Instruct, though computationally efficient, generally underperforms the other models in both Spanish and English settings.

These observations align with our multilingual and error analyses, which reinforce the necessity of language-specific adaptations. While models like Qwen2.5-1.5B-Instruct demonstrate promising cross-lingual transfer (despite a performance gap between English and Spanish), others like Llama-3.2-1B-Instruct may benefit from targeted training strategies or task-specific adaptation. The stark contrast in Llama-3.2-1B-Instruct’s performance across languages (excelling in English but faltering in Spanish) suggests that task competence alone is insufficient; successful adaptation hinges on a model’s ability to internalize language-specific structural and semantic patterns.

This study also highlights three key implications for Spanish D2T that could also be extrapolated to multilingual NLP:

Fine-tuning efficiency: Even minimal fine-tuning can help mitigate linguistic under-representation in pre-trained models, offering a cost-effective pathway to improve performance for languages like Spanish.

Model selection criteria: Performance, computational cost, and language adaptability must be balanced. For Spanish-centric applications, Qwen2.5-1.5B-Instruct is optimal for low or medium-resource settings, while Qwen2.5-0.5B-Instruct provides a pragmatic compromise for more resource-constrained environments.

Performance metrics selection: We observed that, for our task, some metrics commonly used for the English counterpart of the task were not fully representative of the results we obtained. Metric selection has to take into account not only the task being evaluated but also the language of the data, given that each language has its own grammatical and lexical characteristics that may require specific accommodations.

5. Conclusions and Future Work

This article presented a study on the performance of resource-efficient models in the Spanish triple-to-text task. First, we created a Spanish dataset for triple-to-text, Spanish WebNLG. This dataset was developed via a semi-supervised process of automatic translation and formatting, followed by a manual revision of triples and potentially problematic verbalizations. Since both the triples and the verbalizations are available in Spanish, the dataset can also potentially be used in the opposite direction, to generate triples from plain text. Second, we conducted a study addressing the three main research questions presented in Section 1. Regarding RQ1, “How effectively can resource-efficient LLMs verbalize Spanish triples across different complexity levels?”, the evaluation results show that resource-efficient LLMs are capable of verbalizing triple sets of different sizes in Spanish, as the models achieve competitive results across most lexical and semantic metrics. For RQ2, “How does task contextualization through examples impact model performance in Spanish triple verbalization?”, the study showed that contextualizing the task with examples significantly improved the models’ performance, with a notable improvement from the 0S to the 1S scenario in both evaluation metrics and processing time. Lastly, for RQ3, “What are the comparative advantages and limitations of prompt learning versus partial fine-tuning for Spanish triple verbalization?”, the study found that partial fine-tuning, specifically using LoRA, generally led to improved performance in both metrics and time efficiency. Even though fine-tuning requires an initial investment of time and resources, it also yields a time efficiency improvement that, although modest, makes the overall performance better than with prompt learning.

During the multilingual and error analysis, we also observed that Spanish can offer a broader range of valid surface realizations for the same underlying content, supported by richer verbal inflection, gender/number agreement, and more flexible word order. Because of these differences, the same methods and metrics cannot always be applied in the same way to English and Spanish. Evaluation must consider linguistic nuances, such as morphology, syntax, and idiomatic expressions, to ensure fair and accurate assessments.
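
To make the point about surface variation concrete, the following minimal sketch computes clipped n-gram precision, the overlap statistic that underlies BLEU, for one prediction and its reference set taken from the qualitative example table in this article. It is a simplified illustration and not the evaluation pipeline used in the study: the tokenizer, the function names (`tokenize`, `ngram_counts`, `clipped_precision`), and the omission of the brevity penalty and smoothing are assumptions made for brevity.

```python
import re
from collections import Counter

# Reference verbalizations and model prediction copied from the example table.
REFERENCES = [
    "MotorSport Vision se encuentra en la ciudad de Fawkham.",
    "MotorSport Vision se encuentra en la ciudad de Fawkham, Reino Unido.",
    "MotorSport Vision se ubica en Fawkham.",
]
PREDICTION = "La empresa MotorSport Vision tiene su sede en Fawkham."


def tokenize(text):
    """Lowercase and split on non-word characters (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(prediction, references, n):
    """Share of predicted n-grams found in any reference, with clipped counts."""
    pred = ngram_counts(tokenize(prediction), n)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(tokenize(ref), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in pred.items())
    total = sum(pred.values())
    return matched / total if total else 0.0


for order in (1, 2):
    score = clipped_precision(PREDICTION, REFERENCES, order)
    print(f"{order}-gram precision: {score:.3f}")
```

Under this simplified tokenization, the prediction shares five of its nine unigrams but only two of its eight bigrams with the references, although the example table reports a cosine similarity of 0.97 for the same pair. This is the kind of gap between lexical and semantic metrics that motivates language-aware evaluation.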

As future work, we plan to explore more concrete modeling strategies for Spanish triple-to-text verbalization, including alternative parameter-efficient fine-tuning methods, improved prompting and constrained decoding, and the end-to-end training of neural architectures on the Spanish WebNLG dataset from a resource-efficiency perspective. On the data side, we aim to conduct a deeper, crowdsourced revision of the corpus that goes beyond low-similarity cases, in order to systematically assess the quality of MT-derived verbalizations, validate cosine-similarity thresholds, and check coherence between triples and texts. We also consider it important to extend the benchmark to better reflect the diversity of Spanish by incorporating other Spanish varieties, while continuing to expand the resource to other languages to broaden its linguistic coverage. In parallel, we plan to adapt the dataset towards ontology verbalization for ontology documentation and to investigate how linguistic factors such as morphological richness and syntactic flexibility interact with common automatic metrics and model behavior. Finally, we intend to explore other structured data formats and datasets beyond WebNLG-style triples to further support non-English, resource-efficient NLG in a wider range of scenarios.

Acknowledgments

This work is supported by Grant MALTA PID2024-159504OB-I00, funded by MICIU/AEI/10.13039/501100011033 and by "ERDF/EU". We also want to thank the WebNLG creators and especially the CNRS/LORIA team for their guidance during the planning of the Spanish WebNLG development.

Funding

This work is supported by Grant MALTA PID2024-159504OB-I00, funded by MICIU/AEI/10.13039/501100011033 and by "ERDF/EU".

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

ORCID iDs

Virginia Ramón-Ferrer

Carlos Badenes-Olmedo

Oscar Corcho

Appendix A. Context Learning Prompts

Table 11. Fine-Tuning Spanish Test 2 Full Evaluation Results.

		Metrics
Triple size	Model	BLEU	METEOR	CHRF++	Cosine Similarity	BERTScore Precision	BERTScore Recall	BERTScore F1	Average time per triple set in seconds
1	Llama-3.2-1B-Instruct	24.27	0.516	47.68	0.846	0.710	0.823	0.754	2.33
	Qwen2.5-0.5B-Instruct	34.91	0.623	65.43	0.894	0.882	0.889	0.884	0.47
	Qwen2.5-1.5B-Instruct	40.23	0.668	68.89	0.909	0.894	0.903	0.897	0.56
	Salamandra-2B-Instruct	32.98	0.605	64.79	0.896	0.877	0.885	0.880	0.45
2	Llama-3.2-1B-Instruct	12.81	0.450	39.79	0.833	0.651	0.782	0.704	3.26
	Qwen2.5-0.5B-Instruct	33.14	0.639	65.49	0.910	0.872	0.876	0.873	0.77
	Qwen2.5-1.5B-Instruct	35.63	0.663	67.55	0.921	0.879	0.883	0.879	0.95
	Salamandra-2B-Instruct	28.97	0.592	62.39	0.904	0.862	0.863	0.861	0.68
3	Llama-3.2-1B-Instruct	10.99	0.456	40.43	0.840	0.646	0.774	0.700	3.75
	Qwen2.5-0.5B-Instruct	31.87	0.626	63.98	0.907	0.857	0.862	0.858	1.06
	Qwen2.5-1.5B-Instruct	36.11	0.656	67.12	0.921	0.870	0.873	0.870	1.24
	Salamandra-2B-Instruct	27.75	0.572	60.89	0.900	0.851	0.849	0.849	0.88
4	Llama-3.2-1B-Instruct	10.25	0.462	41.59	0.846	0.647	0.763	0.697	4.04
	Qwen2.5-0.5B-Instruct	27.78	0.572	60.78	0.891	0.841	0.845	0.842	1.36
	Qwen2.5-1.5B-Instruct	31.42	0.614	63.81	0.911	0.857	0.858	0.857	1.58
	Salamandra-2B-Instruct	24.65	0.532	58.05	0.889	0.839	0.833	0.835	1.10
5	Llama-3.2-1B-Instruct	10.92	0.467	44.57	0.861	0.654	0.763	0.701	4.34
	Qwen2.5-0.5B-Instruct	27.07	0.560	60.25	0.893	0.837	0.837	0.836	1.64
	Qwen2.5-1.5B-Instruct	30.56	0.607	63.57	0.918	0.854	0.853	0.853	1.94
	Salamandra-2B-Instruct	24.11	0.511	56.84	0.889	0.835	0.825	0.829	1.29
6	Llama-3.2-1B-Instruct	12.35	0.503	48.83	0.879	0.676	0.767	0.716	4.49
	Qwen2.5-0.5B-Instruct	27.70	0.560	60.48	0.890	0.834	0.833	0.833	2.00
	Qwen2.5-1.5B-Instruct	30.78	0.591	62.96	0.919	0.850	0.848	0.848	2.28
	Salamandra-2B-Instruct	24.29	0.502	56.76	0.890	0.833	0.819	0.825	1.50
7	Llama-3.2-1B-Instruct	12.96	0.510	50.89	0.895	0.681	0.764	0.718	4.53
	Qwen2.5-0.5B-Instruct	29.75	0.561	61.65	0.894	0.825	0.824	0.824	2.31
	Qwen2.5-1.5B-Instruct	33.87	0.587	64.44	0.923	0.846	0.839	0.842	2.62
	Salamandra-2B-Instruct	22.90	0.473	55.23	0.886	0.828	0.809	0.817	1.58

BLEU = Bilingual Evaluation Understudy; METEOR = Metric for Evaluation of Translation with Explicit ORdering; CHRF++ = Character n-gram F-score, extended version.

Table 15. Fine-Tuning English Test 2 Full Evaluation Results.

		Metrics
Triple size	Model	BLEU	METEOR	CHRF++	Cosine Similarity	BERTScore Precision	BERTScore Recall	BERTScore F1	Average time per triple set in seconds
1	Llama-3.2-1B-Instruct	56.63	0.815	81.41	0.970	0.965	0.966	0.965	0.272
	Qwen2.5-0.5B-Instruct	48.26	0.776	77.54	0.956	0.958	0.959	0.958	0.404
	Qwen2.5-1.5B-Instruct	49.09	0.796	79.31	0.960	0.958	0.961	0.959	0.474
	Salamandra-2B-Instruct	39.73	0.735	74.34	0.940	0.945	0.950	0.947	0.431
2	Llama-3.2-1B-Instruct	43.46	0.779	76.00	0.964	0.957	0.959	0.957	0.458
	Qwen2.5-0.5B-Instruct	39.46	0.743	73.28	0.952	0.949	0.951	0.949	0.653
	Qwen2.5-1.5B-Instruct	40.05	0.764	74.55	0.961	0.951	0.954	0.952	0.769
	Salamandra-2B-Instruct	33.04	0.703	70.28	0.938	0.938	0.943	0.940	0.680
3	Llama-3.2-1B-Instruct	38.75	0.741	72.45	0.950	0.947	0.951	0.948	0.670
	Qwen2.5-0.5B-Instruct	35.72	0.720	70.32	0.944	0.945	0.946	0.945	0.884
	Qwen2.5-1.5B-Instruct	37.41	0.731	72.15	0.955	0.946	0.949	0.947	1.051
	Salamandra-2B-Instruct	27.66	0.649	65.95	0.933	0.932	0.935	0.933	0.879
4	Llama-3.2-1B-Instruct	35.77	0.711	70.47	0.938	0.942	0.946	0.943	0.859
	Qwen2.5-0.5B-Instruct	31.51	0.669	67.67	0.931	0.939	0.939	0.939	1.133
	Qwen2.5-1.5B-Instruct	33.03	0.685	68.79	0.944	0.942	0.943	0.942	1.329
	Salamandra-2B-Instruct	25.37	0.621	63.74	0.920	0.926	0.928	0.926	1.105
5	Llama-3.2-1B-Instruct	34.24	0.681	69.07	0.934	0.938	0.941	0.939	1.020
	Qwen2.5-0.5B-Instruct	30.17	0.644	65.65	0.928	0.937	0.935	0.935	1.324
	Qwen2.5-1.5B-Instruct	31.21	0.652	66.78	0.935	0.939	0.938	0.938	1.580
	Salamandra-2B-Instruct	24.01	0.570	61.09	0.906	0.924	0.924	0.923	1.309
6	Llama-3.2-1B-Instruct	32.45	0.661	67.91	0.933	0.936	0.938	0.936	1.207
	Qwen2.5-0.5B-Instruct	28.26	0.614	64.08	0.924	0.935	0.932	0.933	1.603
	Qwen2.5-1.5B-Instruct	29.88	0.651	66.06	0.933	0.936	0.937	0.936	1.946
	Salamandra-2B-Instruct	23.18	0.565	59.81	0.900	0.923	0.920	0.921	1.538
7	Llama-3.2-1B-Instruct	35.19	0.672	69.70	0.929	0.939	0.937	0.937	1.346
	Qwen2.5-0.5B-Instruct	31.38	0.623	66.06	0.922	0.937	0.931	0.934	1.827
	Qwen2.5-1.5B-Instruct	32.71	0.635	67.61	0.931	0.939	0.936	0.937	2.187
	Salamandra-2B-Instruct	24.87	0.543	60.05	0.892	0.925	0.918	0.921	1.759

BLEU = Bilingual Evaluation Understudy; METEOR = Metric for Evaluation of Translation with Explicit ORdering; CHRF++ = Character n-gram F-score, extended version.

	Parameters
Model	Total	Trainable	Non-trainable
Qwen2.5-0.5B-Instruct	496,232,320	2,199,552	494,032,768
Llama-3.2-1B-Instruct	1,238,632,448	2,818,048	1,235,814,400
Qwen2.5-1.5B-Instruct	1,548,330,496	4,616,192	1,543,714,304
Salamandra-2B-Instruct	2,253,490,176	3,729,408	2,249,760,768
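
The parameter counts above can be converted into trainable shares with a few lines of arithmetic. The sketch below is illustrative only: the dictionary `PARAMS` simply transcribes the table, and the helper name `trainable_share` is hypothetical.

```python
# Parameter counts transcribed from the table above:
# model -> (total, trainable, non-trainable).
PARAMS = {
    "Qwen2.5-0.5B-Instruct": (496_232_320, 2_199_552, 494_032_768),
    "Llama-3.2-1B-Instruct": (1_238_632_448, 2_818_048, 1_235_814_400),
    "Qwen2.5-1.5B-Instruct": (1_548_330_496, 4_616_192, 1_543_714_304),
    "Salamandra-2B-Instruct": (2_253_490_176, 3_729_408, 2_249_760_768),
}


def trainable_share(total, trainable):
    """Return the percentage of parameters updated during fine-tuning."""
    return 100.0 * trainable / total


for model, (total, trainable, frozen) in PARAMS.items():
    # Sanity check: trainable and frozen parameters must add up to the total.
    assert trainable + frozen == total
    print(f"{model}: {trainable_share(total, trainable):.3f}% trainable")
```

For every model in the table, well under 1% of the parameters are trainable, which is what keeps this form of partial fine-tuning within a resource-efficient budget.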

					Metrics
Model	Test modality	Triples	Ground truth	Prediction	BLEU	METEOR	CHRF++	Cosine Similarity	BERTScore Precision	BERTScore Recall	BERTScore F1
Qwen2.5-1.5B-Instruct	Few-shot	Triple 1: [Subject: MotorSport_Vision, Predicate: Ciudad, Object: Fawkham]	“MotorSport Vision se encuentra en la ciudad de Fawkham.”, “MotorSport Vision se encuentra en la ciudad de Fawkham, Reino Unido.”, and “MotorSport Vision se ubica en Fawkham.”	“La empresa MotorSport Vision tiene su sede en Fawkham.”	0.14	0.60	0.63	0.97	0.91	0.94	0.92
Salamandra-2B-Instruct	Zero-shot	Triple 1: [Subject: MotorSport_Vision, Predicate: Ciudad, Object: Fawkham]	“MotorSport Vision se encuentra en la ciudad de Fawkham.”, “MotorSport Vision se encuentra en la ciudad de Fawkham, Reino Unido.”, and “MotorSport Vision se ubica en Fawkham.”	“La ciudad Fawkham atrae cada año a miles de visitantes que disfrutan del hermoso paisaje natural y sus monumentos históricos. Los sitios más populares incluyen el Parque Nacional Mount Abundance, conocido por su gran variedad de flora y fauna; además, cuenta con una amplia gama de actividades recreativas para todas las edades. Para aquellos interesados en la historia local, no pueden perderse el Museo Histórico de Fawkham, donde podrán descubrir detalles interesantes sobre la vida cotidiana durante siglos pasados. Por lo tanto, si estás buscando lugares fascinantes cerca de Londres pero quieres evitar grandes multitudes, definitivamente deberías visitar Fawkham; te garantizo que no te arrepentirás!”	0.01	0.1	0.18	0.48	0.57	0.73	0.64
