Abstract
Argument mining is a subfield of argumentation that aims to automatically extract argumentative structures and their relations from natural language texts. This article investigates how a single large language model can be leveraged to perform one or several argument mining tasks. Our contributions are two-fold. First, we construct a multi-task dataset by surveying and converting 19 well-known argument mining datasets from the literature into a unified format. Second, we explore various training strategies using Meta AI’s Llama-3.1-8B-Instruct model: (1) fine-tuning on individual tasks, (2) fine-tuning jointly on multiple tasks, and (3) merging models fine-tuned separately on individual tasks. Our experiments show that task-specific fine-tuning significantly improves individual performance across all tasks. Moreover, multi-task fine-tuning maintains strong performance without degradation, suggesting effective transfer learning across related tasks. Finally, we demonstrate that model merging offers a viable compromise: it yields competitive performance while mitigating the computational costs associated with full multi-task fine-tuning.
Introduction
Argumentation theory aims to model, analyse, and automate argumentative reasoning by relying on formal representations of arguments and their relationships. In his seminal paper, Dung 1 introduced the concept of abstract argumentation frameworks (AAFs), in which arguments are treated as abstract entities connected by an attack relation.
Building on Dung’s framework, researchers have extensively explored extensions of this abstract model with additional expressive components (such as support relations4,5 or sets of attacking arguments6,7) and the development of new semantics (e.g. to assess the strength of arguments in a more gradual manner beyond binary acceptability8,9).
These research efforts enhance the expressivity of the original framework and also contribute to more diverse applicability in real-world domains such as legal reasoning and political analysis. 10
However, a key limitation of approaches based on Dung’s AAFs lies in their reliance on the assumption that both arguments and their relations are explicitly extracted/specified. While manual formalization may be feasible at small scale in expert-driven systems, the automatic extraction of arguments and their interrelations from large-scale, unstructured data remains an open research challenge. Furthermore, even when arguments are extracted, it has been demonstrated that humans do not always reach a consensus on the directionality of attacks between arguments. 11
In recent years, argument mining (AM), the task of automatically identifying and extracting argumentative structures from natural language texts, has received growing attention in the natural language processing (NLP) community. 12 In more detail, this task involves detecting argumentative components (such as claims and premises) and the relations between them. Given the pervasive role of argumentation in both written and spoken communication, the development of computational models capable of parsing and evaluating arguments has promising applications in downstream tasks such as decision-making, persuasion, fact-checking, and misinformation detection.13,14
Early approaches to AM relied on models such as support vector machines (SVMs) and recurrent neural networks (RNNs). 15 However, the development of deep-learning and transformer-based architectures allowed AM systems to gradually improve their performance. For instance, Mayer et al. 16 applied transformer-based models to mine arguments in healthcare-related texts, demonstrating that domain-adapted language models can capture argumentation patterns in specialized corpora. Moreover, with the recent development of large language models (LLMs), recent work has demonstrated the potential of prompting or fine-tuning LLMs specifically for argument mining.17–20 For instance, Cabessa et al. 17 showed that appropriately fine-tuned LLMs can outperform traditional architectures in extracting argumentative structure from natural language texts, and Stahl et al. 20 developed ArgInstruct, an instruction-tuned model tailored to argumentative tasks.
However, we argue that these previous approaches mainly focused on individual AM tasks (such as argument component or argument relation classification), limiting their applicability to more end-to-end or open-ended argumentative tasks. Furthermore, while instruction tuning methods, such as the one used in ArgInstruct, 20 better align LLMs with argumentative tasks, they rely on synthetic LLM-generated instructions rather than real annotated data.
Thus, this work aims to investigate how a single LLM can be leveraged to perform one or several AM tasks. In doing this, we explore two central research questions: can a single LLM be fine-tuned to perform individual AM tasks effectively, and can a single LLM handle several AM tasks at once without a loss of performance?
Our contributions aim at answering these two questions. Namely, we first surveyed and collected 19 real-world AM datasets from the literature and converted them into a standardized format suitable for training LLMs. Next, we fine-tuned and evaluated Meta AI’s Llama-3.1-8B-Instruct LLM on eight identified AM tasks. Lastly, to explore the capabilities of LLMs to perform multiple AM tasks, we explored (1) a model merging approach to combine the previously fine-tuned models and (2) fine-tuning jointly on multiple AM tasks.
Our experiments show that task-specific fine-tuning significantly improves individual performance across all tasks. Moreover, multi-task fine-tuning maintains strong performance without degradation, suggesting effective transfer learning across related tasks. Finally, we demonstrate that model merging offers a viable compromise: it yields competitive performance while mitigating the computational costs associated with full multi-task fine-tuning.
This report is structured as follows. In Section 2, we introduce the background and related work on LLMs and argument mining. In Section 3, we describe the refinement of the existing datasets and the creation of our multi-task dataset. In Section 4, we explain the training regimen for the several LLMs in the context of AM and their evaluations. Finally, we conclude in Section 5.
In this section, we begin with a brief overview of prior work on LLMs and their connection to AM. We then narrow our focus to recent studies that specifically apply LLMs to AM tasks. Finally, we present the various AM datasets available in the literature, which serve as the foundation for our experiments in the following sections.
Large language models
A language model is a computational system designed to generate or understand human language by estimating the probability of word sequences. This allows the model to predict, complete, or generate coherent text based on a given context. Early language models were purely statistical, relying on methods such as n-gram models, which estimate the probability of a word from the few words that precede it.
Nowadays, the term large language model (LLM) refers to neural language models, typically based on the transformer architecture, with billions of parameters and pre-trained on massive text corpora, which allows them to perform a wide range of tasks directly from natural language instructions.
As LLMs became more ubiquitous, controlling their performance on specific tasks became essential. On the one hand, research on prompt engineering has shown that techniques such as Chain-of-Thought (CoT) prompting, 33 which guides an LLM to generate intermediate reasoning steps, improve accuracy on reasoning-intensive tasks. This was followed by more structured techniques like Graph-of-Thought (GoT) 34 and Thread-of-Thought (ToT), 35 which aim to improve interpretability and robustness. On the other hand, supervised fine-tuning (SFT) and instruction tuning have also emerged as powerful methods to improve LLM performance. Chung et al. 36 have shown that scaling instruction tuning improves generalization and task transfer, while parameter-efficient fine-tuning (PEFT) techniques such as LoRA 37 (and its many variants) achieve performance comparable to full fine-tuning at greatly reduced computational cost.
AM and LLMs
AM is a field of computational argumentation that aims to automatically identify and structure argumentative discourse in natural language texts. The main AM tasks include the detection of argumentative components (e.g. claims or premises) and the classification of relations between them. 12 Note that in this work, we also study argument quality assessment, one of the major related AM tasks, which consists in assessing the quality of claims and their revisions. 38 Other variants are post quality 39 and overall quality assessment. 40
Early work in AM relied on structured prediction methods and neural architectures, such as structured SVMs and RNNs for component identification and relation prediction 15 and a relation-based approach to capture argumentative structure. 41 We also note that relation classification can be improved via the use of argumentation schemes and formal logic. 42 More recent approaches leverage transformer-based architectures, which demonstrate strong performance across a range of argument mining tasks, with applications in domains such as healthcare, 16 political debate analysis, 43 and fallacy detection.44,45
End-to-end frameworks, which perform multiple AM tasks on natural language texts, have also been explored. For example, Lenz et al. 46 introduce a system that automatically transforms natural language text into an argument graph while Morio et al. 47 and Schulz et al. 48 propose models trained via multi-task learning to perform AM in low-resource settings. To help reduce the need for a large amount of data, cross-corpora and training in low-resource settings have also been studied, with methods using generalizability and transfer learning.39,49,50
Generative approaches have also been explored in argumentation, for instance by framing argumentation as a text-to-text generation task 51 or by performing structured extraction of complex argumentative structures, such as argument quadruplet extraction. 52 We refer the interested reader to the seminal paper by Chen et al. 53 on the potential of LLMs in computational argumentation. However, in this work, we restrict the scope of our investigation and purposely do not put our focus on generative tasks.
The use of LLMs in AM has shown progress across core tasks (such as argument component identification, relation classification, fallacy detection and quality assessment). Initial explorations have demonstrated the potential of LLMs in capturing argumentative structures, while recent evaluations have assessed their performance across precise sub-tasks.19,54,55
However, LLMs showed several limitations, such as difficulties in reliably detecting argumentative fallacies in natural settings 56 or logical inconsistencies in generated fallacy annotations. 57 As a solution, fine-tuning and in-context learning have been explored to adapt LLMs to AM. Cabessa et al.17,58 showed that fine-tuned models improve on the tasks of argument component classification and relation classification. An instruction-tuned variant, ArgInstruct by Stahl et al., 20 further enhances LLM capabilities by aligning them more closely with the reasoning requirements of argumentative tasks. However, they only use synthetic LLM-generated instructions to perform their instruction tuning. We argue that this may limit the generalizability of the models’ argumentative capabilities in real-world settings. In our approach, we make use of real datasets from the literature, which we introduce in the next section.
Survey of datasets in AM
In the following, we present the 19 datasets collected in the literature, which we will convert and use for fine-tuning LLMs. They are presented in alphabetical order, with illustrative examples provided for only a subset for the sake of brevity.
AbstRCT
Proposed by Mayer et al., 16 this dataset targets AM in healthcare. It includes 500 medical texts annotated for argumentative components (major claim, claim and premise) and argument relations (attack or support). The dataset highlights challenges like domain-specific vocabulary, evidence scarcity and interpretability.
AQM
Guo et al. 52 presented AQM as a dataset for their Argument Quadruplet Extraction (AQE) task, which involves identifying four elements in a statement: the topic, stance, opinion and rationale. The data consist of 34,369 sentences from 801 articles annotated for three argument components: claim, evidence and stance.
ArgSum
Li et al. 59 introduced ArgSum as a comprehensive multi-task dataset designed for end-to-end argument summarization and evaluation. It contains user-generated discussions with associated stances, generated summaries, and human or model-generated quality judgments. The dataset supports carrying out tasks such as argument component extraction, stance detection, summarization and summary evaluation. Its structure supports joint training across those tasks, enabling multi-task learning.
ComArg
Boltužić and Šnajder 60 developed this dataset by extracting data from online debate forums and social platforms. It includes 2,436 comments annotated for stance and argument recognition, focusing on short, user-generated responses to controversial questions. The dataset supports tasks such as argument relation classification and stance detection.
CoCoLoFa
Yeh et al. 57 introduced CoCoLoFa as a dataset containing 7,706 news comments from 648 news articles annotated for eight common logical fallacies (each comment is annotated with one fallacy among the eight selected fallacies), which were verified with the help of an LLM and human annotators. It covers various fallacy types and is situated in real-world opinionated discussions, such as reader comments on a news article.
Dagstuhl-15512 ArgQuality
Wachsmuth et al. 40 proposed this dataset to assess argument quality in natural language across several dimensions such as clarity, cogency, sufficiency and effectiveness. The dataset consists of 320 arguments rated by human annotators over the different quality dimensions. This dataset helps quantify what makes an argument ‘good’ and ‘persuasive’.
FEVER
Thorne et al. 61 compiled a dataset consisting of 185,445 claims for fact verification by pairing claims with evidence from Wikipedia. Though primarily designed for factual verification, the dataset’s structure overlaps with tasks in AM. Each claim is annotated as ‘supported’, ‘refuted’ or ‘not enough information’ based on the retrieved evidence.
IAM
Cheng et al. 62 presented IAM, a large-scale dataset designed for multiple tasks of AM, including argument component identification, relation classification and stance detection. It includes 69,666 sentences spanning multiple domains and having extensive annotations. IAM supports both individual and multi-task learning. It is one of the most comprehensive datasets currently available for training and benchmarking AM models.
IBM claim-polarity
Bar-Haim et al. 63 proposed this dataset of 2,394 claims associated with 55 topics annotated for stance classification. The claims are labelled for polarity (support, oppose and neutral) and contextual dependency.
IBM type
Aharoni et al. 64 introduced this dataset with claim and evidence annotations across 33 controversial topics, including 2,883 arguments across 586 documents from Wikipedia. The focus is on distinguishing argument components and associating claims with relevant evidence.
IBM claim
Levy et al. 65 constructed this dataset, consisting of 2,500 claims, by extracting data from web sources associated with 50 distinct topics. It focuses on claim detection and aims to bridge the gap between large, noisy web data and structured argumentative search.
IBM evidence
Shnarch et al. 66 presented this dataset of 5,785 sentences annotated for evidence (either supporting or contesting the topic) across 83 topics.
IBM argument
Shnarch et al. 67 proposed this dataset to address the challenge of domain adaptation in AM. It enables models to generalize to unfamiliar domains while retaining interpretability. The dataset includes 700 sentences annotated as to whether they contain an argument for the given topic. The sentence annotations span 20 topics, providing insight into how argument patterns generalize across multiple domains.
MAFALDA
Helwe et al. 45 released MAFALDA as a benchmark dataset for logical fallacy detection and classification, covering over 30 fallacy types. It contains a mix of real and synthetic texts, annotated by experts and LLM-assisted crowd-workers. It includes 200 texts in which sentences have been annotated with one or more fallacies. We give a single entry of this dataset in Example 1.
MAFALDA entry identifying two fallacies (false dilemma and hasty generalization) from a post claiming that because a bar in Thurles wasn’t attacked over an ad showing Jesus with a pint, Christians are not as sensitive as Muslims.
Microtext part 1
Peldszus and Stede 68 released a structured dataset composed of 112 microtexts, each containing a fully annotated argument with its components (claim and premises) and their relations (attack and support). We give a single entry of this dataset in Example 2.
Microtext part 1 entry about the topic of waste separation. Five elementary discourse units (edu) are identified and associated with argumentative discourse units (adu). The relations between the argumentative discourse units are specified in the accompanying argument graph.
Representation of the micro-level argument graph on the topic of waste separation from Microtext part 1.
Microtext part 2
Skeppstedt et al. 69 extended Microtext part 1 with 171 texts by crowd-sourcing new argumentative texts under controlled prompts. The goal was to enlarge the original dataset while maintaining its clarity and consistency in argument structure.
Nixon-Kennedy debates
Menini et al. 43 curated and annotated the 1960 US Nixon-Kennedy presidential debates with a focus on argumentation strategies in political speech. The dataset includes 1,907 argument pairs covering five topics annotated for argumentative relations and rhetorical patterns, enabling analysis of persuasive techniques and discourse dynamics.
Node
Cabrio and Villata 70 introduced the Node dataset, comprising 260 arguments extracted from online and encyclopedic sources. Each argument is annotated with acceptability judgments based on its coherence and logical structure.
Persuasive essays
Stab and Gurevych 71 provided a dataset containing persuasive essays annotated with argument components (major claim, claim and premise) and their relations (support and attack). It includes 402 essays written by students and annotated by three annotators (two non-expert and one expert annotator), offering a consistent structure and real-world argumentative writing.
The 19 datasets identified in Section 2.3 cannot be directly used for training as they possess widely different specificities and formats. In this section, we explain our approach to unify (Section 3.1) and exploit these datasets for eight AM tasks we consider (Section 3.2).
Unifying existing AM datasets
To facilitate the fine-tuning and testing of LLMs (as well as reproducibility), we first unify each of the datasets under a handcrafted, common format.
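The exact schema is tailored to our pipeline; purely as an illustration, a unified instance can be thought of as a record pairing a task-specific input with its expected output (the field names below are hypothetical and do not necessarily match the released format):

```python
# Hypothetical illustration of a unified instance; field names are illustrative only.
unified_instance = {
    "task": "stance_detection",
    "source_dataset": "IAM Stance",
    "input": {
        "topic": "School uniforms",
        "sentence": "Uniforms reduce peer pressure among students.",
    },
    "output": "For",
}
```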
Beyond structural unification, we also harmonize the annotation labels across datasets belonging to the same task. For most tasks, this process only involves minor renamings (e.g. ‘Anecdotal Evidence’ to
Three tasks required more substantial adjustments based on the definitions of the original labels and the specific phenomena each task aims to detect:
To preserve the original datasets, these label modifications are not applied during the
We now introduce the argument mining tasks considered in the article.
AM tasks considered
All the studied tasks (in light red boxes) and associated datasets (in white boxes) are represented in Figure 2. We selected those AM tasks by surveying the literature on AM and restricting ourselves to tasks which can be converted into classification tasks. Note that some datasets (e.g. AQM) are reused for different tasks in different ways.

Graph of the different argumentation tasks and the associated datasets.
In the rest of this section, we will describe and formalize each of those tasks.
Argument component classification (ACC)
Argumentative discourse units represent the smallest components within a text that contribute to its argumentative structure. ACC is the task of classifying an argument component as either a ‘premise’ or a ‘claim’. This classification task does not address the distinction between argumentative and non-argumentative material. We formalized this task as follows:
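f_ACC : C → {claim, premise}, where C denotes the set of argumentative components of a text (a simplified sketch consistent with the description above).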
We fine-tuned and evaluated this task on datasets including Microtext parts 1 and 2, Persuasive Essays, and AbstRCT. Example 4 shows an example of input/output for the ACC task from Microtext part 1.
Consider this example from Microtext part 1.
The sentence
Claim detection (CD)
A claim is a statement that asserts something to be true or false. In AM, a claim serves as a central component that forms the basis of reasoning and debate. The CD task aims to identify and extract claims relevant to a given debate’s topic from texts. We formalized this task as follows:
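f_CD : (t, s) → {claim, non-claim}, where t is the debate topic and s a candidate sentence (a simplified sketch; label names are illustrative).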
We fine-tuned and evaluated the CD task on datasets including IAM Claim, IBM Claim, and IBM Argument. Example 5 shows an example of input/output for the CD task from IAM Claim.
Consider this example from IAM Claim.
Despite being an opinionated sentence,
Evidence detection (ED)
Evidence refers to any information or data that either supports or challenges a claim. In AM, the task of evidence detection focuses on the identification and extraction of relevant pieces of text that help validate or refute claims. We formalized ED as follows:
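f_ED : (t, s) → {evidence, non-evidence}, where t is the topic under discussion and s a candidate sentence (a simplified sketch; label names are illustrative).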
We fine-tuned and evaluated the evidence detection task on the ArgSum Evidence, IAM Evidence, and IBM Evidence datasets. Example 6 shows an example of input/output for the ED task from ArgSum Evidence.
Consider this example from ArgSum Evidence.
Here,
Argument relation (AR) classification
The objective of AR classification is to determine whether a given pair of arguments is connected through an argumentative relationship. Given a pair of arguments (a source argument and a target argument), the task is to classify the relationship from the source argument to the target argument as either ‘attack’, ‘support’ or ‘no relation’. Formally, this task is described as follows:
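f_AR : (a_src, a_tgt) → {attack, support, no relation}, where a_src and a_tgt denote the source and target arguments (a simplified sketch consistent with the description above).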
We fine-tuned and evaluated the argument relation classification task on datasets including Microtext parts 1 and 2, Persuasive Essays, AbstRCT, Nixon-Kennedy Debates, Node, IBM Claim-polarity and ComArg. Example 7 shows an example of input/output for the AR task from Node.
Consider this example from Node.
Evidence type (ET) classification
Evidence types (ETs) refer to the different categories of evidence that can either support or challenge a claim. Common types of evidence include anecdotal, expert opinion, explanation and study. Formally, this task is described as follows:
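f_ET : e → T, where e is a piece of evidence and T the set of evidence types, e.g. T = {anecdotal, expert opinion, explanation, study, ...} (a simplified sketch; the exact label set depends on the dataset).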
This task was fine-tuned and evaluated on datasets including ArgSum Evidence Type, IBM Type and AQM. Example 8 shows an example of input/output for the ET task from AQM.
Consider this example from AQM.
Here, the evidence
Stance detection (SD)
A stance reflects a point of view on a debated subject, expressed as either support or opposition. The task of SD is to determine whether an argument supports or opposes a specific topic. Formally, this task is described as follows:
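f_SD : (t, a) → {for, against}, where t is the topic and a the argument under consideration (a simplified sketch consistent with the description above).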
We fine-tuned and evaluated the SD task using datasets such as IBM Claim-Polarity, ComArg, IAM Stance, FEVER and AQM. Example 9 shows an example of input/output for the SD task from IAM Stance.
Consider this example from IAM Stance.
The sentences
Fallacies detection (FD)
A fallacy is an argument where the premises do not entail the conclusion. The goal of fallacy detection is to identify whether a given argument contains a fallacy or not. In the latter case, the output is a label indicating that no fallacy is present.
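In other words, FD can be sketched as a mapping f_FD : x → F ∪ {no fallacy}, where x is the input text and F the set of fallacy types covered by the dataset (label names are illustrative).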
We fine-tuned and evaluated the fallacy detection task on two datasets: CoCoLoFa and MAFALDA. Example 10 shows an example of input/output for the FD task from CoCoLoFa. Since MAFALDA includes instances with multiple ground-truth fallacies, we define two sub-tasks, denoted as
This task is considered more challenging than the other AM tasks due to the larger variety of fallacy categories (as opposed to the binary or ternary label sets of most other tasks).
Consider this example from CoCoLoFa.
The sentence
Argument quality (AQ) assessment
Argument quality refers to how good an argument is; it indicates the degree to which an argument is considered strong and effective. This quality is evaluated along 15 different quality dimensions, such as clarity, cogency, sufficiency and effectiveness.
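Schematically, AQ assessment can be sketched as a mapping f_AQ : (a, d) → R_d, where a is an argument, d one of the 15 quality dimensions, and R_d the ordinal rating scale associated with dimension d.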
We fine-tuned and evaluated this task using the Dagstuhl-15512 ArgQuality dataset. Example 11 shows an example of input/output for the AQ task from Dagstuhl-15512 ArgQuality Corpus.
Consider this example from Dagstuhl-15512 ArgQuality Corpus.
Here,
We created a task-specific dataset for each of the eight tasks identified in Section 3.2 by extracting the corresponding input/output pairs from several datasets (as shown in Figure 2) using the following methodology.
First, we divided each of the 19 datasets into three splits: training (
Then, we performed a sampling on each of those three splits to obtain the data needed to fine-tune, validate, and test the model on our eight tasks. More precisely, the sampling was performed separately on each split of every dataset associated with a given task. Given the disparities in dataset size and class distribution, the sampling procedure was designed to ensure an equal number of instances per class within each task, mitigating class imbalance. Additionally, the sampling preserved the original proportion of examples contributed by each dataset to each split.
More formally, let us consider a task
To retain the original dataset contributions, we compute for each dataset
This sampling method was used to generate
Table 1 shows an example of sampling the IAM Claim (69,666 elements), IBM Claim (2,500 elements) and IBM Argument (700 elements) datasets used for the Claim Detection (CD) task. We can notice that the train, validation, and test sets are balanced across classes while keeping the same proportion of elements from each dataset.
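As a rough sketch of this procedure (simplified, with hypothetical data structures; the actual implementation may differ), the per-split sampling could look as follows:

```python
import random
from collections import defaultdict

def sample_task_split(datasets, per_label_total, seed=42):
    """Balanced sampling for one task and one split.

    `datasets` maps a dataset name to its list of instances for this split;
    each instance is a dict with at least a 'label' field. The result keeps
    (roughly) the same number of instances per label, while each dataset
    contributes in proportion to its original size.
    """
    rng = random.Random(seed)
    total_size = sum(len(items) for items in datasets.values())
    labels = {inst["label"] for items in datasets.values() for inst in items}

    sampled = []
    for name, items in datasets.items():
        share = len(items) / total_size  # this dataset's share of the task data
        by_label = defaultdict(list)
        for inst in items:
            by_label[inst["label"]].append(inst)
        for label in labels:
            quota = min(int(per_label_total * share), len(by_label[label]))
            sampled.extend(rng.sample(by_label[label], quota))
    rng.shuffle(sampled)
    return sampled
```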
Sampling example for the task of claim detection. The validation and test split have the same number of instances.
In this section, we describe the experiments designed to answer our two research questions and report our results.
Experiment setup
For all of our experiments, we further fine-tuned Llama3.1-8B-Instruct, an already fine-tuned version of the base Meta AI’s Llama 3.1-8B for instruction following. Llama 3.1-8B-Instruct was selected for all experiments because it strikes a strong balance between performance, efficiency, and accessibility. The open-source model is large enough to capture complex reasoning and language patterns, making it suitable for tasks that require nuanced understanding, while still being lightweight enough to run reliably with reasonable computational resources. Its instruction-tuned design ensures consistent and helpful responses across diverse prompts, which was essential for maintaining reproducibility and comparability across experiments. Using a single, well-established model throughout also guarantees methodological consistency and avoids confounding effects from model variation.
Moreover, a uniform prompt format for each task was used for both fine-tuning and inference; this format consists of a task description and an explicit specification of the expected output format. We used the special token
You are an expert in argumentation. Your task is to determine whether the given [SENTENCE] is For or Against. Utilise the [TOPIC] as context to support your decision.
Your answer must be in the following format, with only For or Against in the answer section:
[TOPIC]: <topic>
[SENTENCE]: <sentence>
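For illustration, such a prompt for the SD task can be assembled programmatically along the following lines (a simplified sketch; the actual chat template and special tokens of Llama 3.1 are omitted here):

```python
SD_INSTRUCTION = (
    "You are an expert in argumentation. Your task is to determine whether the "
    "given [SENTENCE] is For or Against. Utilise the [TOPIC] as context to "
    "support your decision.\n"
    "Your answer must be in the following format, with only For or Against in "
    "the answer section:"
)

def build_sd_prompt(topic: str, sentence: str) -> str:
    """Fill the task description with the instance-specific fields."""
    return f"{SD_INSTRUCTION}\n[TOPIC]: {topic}\n[SENTENCE]: {sentence}"

print(build_sd_prompt("School uniforms", "Uniforms reduce peer pressure among students."))
```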
All experiments were conducted on an Ubuntu machine equipped with an AMD EPYC 7443 24-core CPU, an NVIDIA A40 48 GB GPU, and 65 GB of RAM. All fine-tuning performed in this work spanned two epochs with a batch size of 32. We employed Low-Rank Adaptation (LoRA) 37 with a rank of 16 to mitigate computational costs and memory requirements while ensuring sufficient parameters to accommodate diverse tasks.
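For reference, the LoRA setup described above could be expressed with the Hugging Face peft library roughly as follows (only the rank of 16 is taken from the text; alpha, dropout and target modules are illustrative choices):

```python
from peft import LoraConfig

# Rank-16 LoRA adapters for causal language modelling; hyperparameters other
# than the rank are illustrative and not reported in the text.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```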
As the baselines for our evaluation, we employed the Llama3.1-8B-Instruct model in both zero-shot and few-shot settings, along with a DeBERTa model. 72 For the few-shot setting, we provided one example per label for each task. This resulted in two examples for ACC, CD, ED and SD, three examples for AR, five for ET, 20 for FD and three examples for each of the quality dimensions for AQ.
In this section, we explain how we exploit model merging to create a multi-task AM LLM. The three-step pipeline for this model (illustrated in the accompanying figure) is as follows. We first created a collection of eight models, each fine-tuned for a specific AM task, and released them in two formats: Safetensors and GGUF (GGML Universal File). The Safetensors versions enable accurate reproduction of our evaluation results and support potential retraining, while the GGUF format (a compact binary representation that stores both tensors and metadata in a single file) is optimized for efficient local inference on platforms such as Ollama and LM Studio. Both formats are available on Hugging Face (our trained LLM collection is accessible at https://huggingface.co/collections/brunoyun/amelia-collection-68518343bf75869b53d0d8bd). We then categorized the different argumentation tasks into three levels of difficulty: hard, medium, and easy, based on the performance of the various fine-tuned models across the tasks (see Table 2 in the subsequent section). Subsequently, we merged the eight fine-tuned models using the mergekit library. 73
Although there are numerous merging approaches,74–77 we evaluated different configurations for two of them: DARE 77 and DELLA. 74 The former performs a random pruning of the task vectors, followed by a rescaling step to match the performance of the original model. The latter enhances DARE with magnitude-based adaptive pruning, which assigns higher retention probabilities to parameters with larger magnitudes, followed by DARE-like rescaling. This method is designed to preserve significant modifications while minimizing interference between task vectors.
Representation of the merging pipeline. All fine-tuned models are available in our Hugging Face collection.
Performance of the fine-tuned models merged in 16-bit and the GGUF models in 8-bit. The cells with red backgrounds indicate the hard tasks, orange ones indicate the medium tasks and green ones indicate the easy tasks. ACC = argument component classification; CD = claim detection; ED = evidence detection; AR = argument relation classification; ET = evidence type classification; SD = stance detection; FD = fallacies detection; AQ = argument quality assessment.

Parameters for the diverse merging configurations.
The cells with red backgrounds indicate the hard tasks, orange ones indicate the medium tasks and green ones indicate the easy tasks. Within each configuration, tasks with the same difficulty were assigned the same hyperparameter value.
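As an illustration of what such a configuration may look like, a DELLA merge can be described to mergekit with a YAML file of roughly the following shape (model names, weights and densities below are placeholders rather than the values used in our experiments):

```python
import yaml  # requires PyYAML

# Placeholder DELLA merge configuration for mergekit; the model identifiers,
# weights and densities are illustrative only.
config = {
    "merge_method": "della",
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",
    "models": [
        {"model": "path/to/acc-finetuned-model",
         "parameters": {"weight": 0.10, "density": 0.5}},
        {"model": "path/to/ar-finetuned-model",
         "parameters": {"weight": 0.20, "density": 0.7}},
        # ... one entry per task-specific fine-tuned model ...
    ],
    "dtype": "bfloat16",
}

with open("della_merge.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)

# The merge itself can then be produced with mergekit's command-line tool:
#   mergekit-yaml della_merge.yaml ./merged-model
```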
Performance (F1 score) of the merge configurations.
The cells with red backgrounds indicate the hard tasks, orange ones indicate the medium tasks and green ones indicate the easy tasks.
ACC = argument component classification; CD = claim detection; ED = evidence detection; AR = argument relation classification; ET= evidence type classification; SD = stance detection; FD = fallacies detection; AQ = argument quality assessment.
We make the following observations:
The models fine-tuned on individual tasks consistently outperform the base model in both zero-shot and few-shot settings, as well as the DeBERTa baseline. Notably, the fine-tuned model for the ACC task achieves an F1 score of
DARE IV and DELLA II achieve high mean F1 scores (61.75% and 63.64%, respectively). In particular, DELLA II offers the best overall trade-off, with superior performance across multiple categories, including challenging tasks such as AR and AQ. This suggests that the introduction of non-uniform hyper-parameter values (especially different weights
DARE I and II, where all tasks share identical hyperparameters, tend to underperform in terms of mean F1 (59.40% and 59.18%). While these models perform consistently, they fail to exploit task-specific adjustments, particularly struggling on harder tasks such as AR, AQ and ET. This supports the intuition that hard tasks require stronger density and adjusted weight to avoid being overshadowed during merging.
DELLA variants outperform most DARE configurations in mean F1 scores. DELLA I and DELLA III perform comparably, but DELLA II benefits the most from balancing relatively high
Overall, the DELLA merging method, especially DELLA
We also fine-tuned the Llama 3.1-8B-Instruct model on our multi-task dataset (composed of the training split of all eight AM tasks). This fine-tuned model used

Representation of the multi-task fine-tuning.
This multi-task fine-tuned model achieves the best overall performance, outperforming all other trained models on most tasks. It obtains state-of-the-art results on ACC (
While the multi-task fine-tuned model achieves superior overall performance compared to the merged model, the latter also offers notable advantages. Model merging provides a flexible and lightweight alternative that does not require joint training on all tasks, making it especially useful when task-specific data is scarce, computational resources are limited, or continual updates are needed. In addition, merging preserves the strengths of task-specialized models without the risk of overfitting to a combined dataset, and allows new tasks to be incorporated incrementally without re-training the entire system. Thus, although multi-task fine-tuning yields the best aggregate results, model merging remains a valuable approach in scenarios where efficiency, modularity, and adaptability are prioritized.
In this work, we introduced AMELIA, a family of (multi-task) end-to-end language models for AM. Our contributions are three-fold. First, we consolidated and unified 19 widely used AM datasets into a common format, thereby providing a large, standardized resource that enables reproducibility and facilitates the application of LLMs to diverse argumentative tasks. Second, we conducted an extensive evaluation of fine-tuning strategies using Meta AI’s Llama-3.1-8B-Instruct model, demonstrating that task-specific fine-tuning substantially improves performance across all tasks, while multi-task fine-tuning preserves strong results without degradation, thus confirming the potential of transfer learning across closely related tasks. Third, we explored model merging techniques as a resource-efficient alternative, showing that methods such as DELLA can yield competitive results while maintaining modularity and adaptability.
Our experiments highlight several important insights. Multi-task fine-tuning consistently outperforms both zero-shot and few-shot baselines as well as traditional architectures, establishing a new state-of-the-art on multiple tasks. At the same time, model merging offers a practical compromise when computational or data constraints prevent joint training, making it a promising strategy for scalable deployment. Together, these findings underline the flexibility of LLMs for AM and provide evidence that both fine-tuning and merging can be leveraged to address different application scenarios.
Looking ahead, several promising research directions emerge. First, while our study focused primarily on classification-based tasks, extending this framework to generative AM tasks – such as automatic argument graph construction or debate summarization – would substantially broaden its scope. Pursuing this avenue would require the development of more sophisticated evaluation metrics beyond conventional NLP measures (e.g. ROUGE, 78 METEOR, 79 or BERTScore 80) and the integration of human or LLM-based evaluations via online platforms to better capture argumentative dimensions that may escape automatic scoring. Second, incorporating explainability mechanisms and formal argumentation semantics into LLM-based systems would help bridge the gap between high predictive performance and interpretability. This step is critical for deployment in sensitive domains such as law, healthcare, and policy-making, where transparency and accountability are essential. Third, to further enhance performance, we plan to explore hyperparameter optimization and prompt engineering. In particular, we will experiment with varying the
Our work demonstrates that LLMs, when carefully adapted through fine-tuning and merging strategies, offer a powerful foundation for advancing AM. By releasing the AMELIA models and datasets publicly, we aim to provide the community with both methodological insights and practical tools to accelerate research in computational argumentation.
Acknowledgements
This work was carried out as part of the AMELIA project, funded by the Computer Science Department of Université Claude Bernard Lyon 1.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
