Abstract
This study aims to combat health misinformation by enhancing the retrieval of credible health information using effective fusion-based techniques. It focuses on clustering-based subset selection to improve data fusion performance. Five clustering methods — two K-means variants, Agglomerative Hierarchical (AH) clustering, BIRCH, and Chameleon — are evaluated for selecting optimal subsets of information retrieval (IR) systems. Experiments are conducted on two datasets from the TREC Health Misinformation Track. The selected subsets are used in data fusion to boost retrieval quality and credibility. AH and BIRCH outperform the other methods in identifying effective subsets of IR systems. AH-based fusion of up to 20 systems yields over a 60% gain in MAP and over a 30% increase in NDCG_UCC, a credibility-focused metric, compared to the best single system. Clustering-based fusion strategies significantly enhance the retrieval of trustworthy health content, helping to reduce misinformation. These findings support incorporating advanced data fusion into health information retrieval systems to improve access to reliable information. The source code of this research is publicly available at https://github.com/Gary752752/DataFusion.
Introduction
Advancements in computing and networking technologies have enabled the generation and dissemination of information at unprecedented speeds. While this progress offers significant benefits to individuals and society, it also brings challenges, most notably the rapid spread of misinformation. Health-related misinformation is particularly harmful, as it can profoundly and negatively affect individuals, public health, and society in various ways. 1 For instance, exposure to misleading or inaccurate content can fuel vaccine hesitancy, encourage unsafe self-medication practices, and delay timely medical interventions.2,3 At a broader level, widespread misinformation erodes public trust in healthcare systems, exacerbates health disparities, and can undermine collective responses to crises such as pandemics.
In this context, the quality of information retrieval systems plays a critical role. Effective retrieval not only ensures that users are directed toward reliable, evidence-based health information but also supports health literacy outcomes by enabling individuals to make informed decisions about their well-being. Conversely, poor retrieval performance that elevates low-quality or misleading documents can amplify the negative consequences of misinformation. Thus, advancing retrieval methods that prioritize accuracy, credibility, and trustworthiness is essential for mitigating the societal risks associated with health misinformation. 4
Combating health misinformation requires sustained efforts across multiple sectors, including government policies, 5 educational interventions,6,7 technologies for credible information retrieval, 8 technical and especially AI ethics,9,10 and individual awareness and responsibility.
In this paper, we focus on a specific technical challenge: enhancing information search engines to effectively deliver reliable and accurate health-related information. We investigate fusion-based approaches to address this issue.11,12
Search engines that can provide credible information can be a vital component in a wide range of health-related applications, including:
• Clinical Decision Support Systems: Integrate authoritative evidence directly into a physician’s workflow.
• Consumer Health Apps: Deliver safe, reliable self-care information sourced from vetted medical resources.
• Drug Interaction Checkers: Provide authoritative data on potential drug–drug interactions.
• Telemedicine Platforms: Rely heavily on accurate, timely, and personalized medical information to support both healthcare providers and patients.
• General Health Chatbots: Reduce the risk of AI hallucinations by grounding responses in verified medical knowledge.
In the information retrieval research community, TREC (Text REtrieval Conference) is a major ongoing series of workshops that focus on various information retrieval research areas, or tracks. 1 Between 2020 and 2022, TREC hosted a Health Misinformation Track. 13
Each year, the organizers compiled a collection of health-related web documents and created a set of queries. These resources were distributed to participating research groups, who then used their retrieval systems and new technologies to produce ranked lists of documents for each query. The retrieval results were submitted to the organizers, who engaged domain experts to evaluate the relevance, correctness, and reliability of the retrieved documents. Based on these expert judgments, all submitted results were assessed. The queries, retrieval results, and relevance judgments are publicly available on TREC’s website, providing a valuable resource for conducting empirical research on fusion-based retrieval methods.
This study investigates clustering-based subset selection for fusion, aiming to identify a subgroup from the available candidates to achieve optimal performance. In standard retrieval tasks, relevance is the primary criterion for evaluating performance. However, the Health Misinformation Track introduces three equally critical factors: usefulness, correctness, and credibility. This shift in evaluation criteria may significantly influence the design of subset selection algorithms and data fusion methods, requiring approaches that effectively address these additional dimensions.
We experimented with five clustering methods, namely two variations of K-means, Agglomerative Hierarchical clustering (AH), Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), and Chameleon clustering, on two datasets from TREC, and found that all of them are effective. The fused results are also better than the best constituent retrieval system by a clear margin. Among the five clustering methods, the Agglomerative Hierarchical method performs best on average.
The fusion approach has been used in various information retrieval tasks. Credible information retrieval, however, is distinctive in that three aspects, namely usefulness, correctness, and credibility, must be considered simultaneously in retrieval evaluation. This paper makes two major contributions:
• To the best of our knowledge, this is the first study to address the task of credible information retrieval to combat health misinformation with a clustering-based fusion approach. We demonstrate the effectiveness of the proposed methods.
• A group of methods is evaluated for subset selection from a large pool. Our empirical results show that Agglomerative Hierarchical clustering and BIRCH outperform other baseline approaches, including Chameleon-based methods and the Top-J subset selection method previously proposed for other retrieval tasks.
The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3 details the five clustering-based data fusion methods considered. Section 4 describes the experimental setup and presents the results of the proposed methods alongside the baseline methods. It also includes a discussion of the study’s limitations and mitigation measures. Section 5 provides additional analytical insights into the clustering methods. Finally, Section 6 concludes the paper.
Related work
This paper explores the application of data fusion in health information retrieval, particularly within collections containing substantial misinformation. To provide context, we first review prior research on misinformation detection and health information retrieval. Subsequently, we examine various data fusion methods and their applications in health information retrieval.
Misinformation detection
Misinformation detection has garnered significant attention in recent years due to its societal and political implications. Researchers have explored various approaches to identify and mitigate the spread of false information, focusing on computational models,14–16 social network analysis,17–20 and human-in-the-loop systems.21–23
Early works in misinformation detection concentrated on linguistic features. Specifically, Rubin et al. 24 analysed writing styles, lexical patterns, and rhetorical structures to identify deceptive content. Ott et al. 25 explored the use of n-grams, syntax, and psycholinguistic features to detect deception in online reviews, highlighting the potential of linguistic markers in identifying misinformation. Similarly, Zhou et al. 26 investigated deception in text-based computer-mediated communication, identifying significant linguistic cues used by deceivers and emphasizing differences between truthful and deceptive messages. Such studies laid the groundwork for automated systems by emphasizing textual analysis. However, linguistic models often struggled with context, particularly in cases involving satire or humour, leading researchers to explore more advanced methods.
The emergence of machine learning revolutionized the field. 27 These methods trained models on labelled datasets of misinformation, using features like source credibility, sentiment, and consistency across articles. Neural networks, such as feed-forward and convolutional neural networks, 28 graph neural networks, 29 and transformers like BERT 30 have further enhanced detection capabilities by capturing nuanced relationships in text.
Social network analysis has been another critical avenue. Vosoughi et al. 31 examined the spread of true and false news on Twitter, revealing that misinformation spreads faster and reaches broader audiences than factual information. Friggeri et al. 32 explored the dynamics of rumour propagation on Facebook, finding that even after rumours are debunked, they continue to spread, largely due to users ignoring fact-checking interventions and interacting primarily with like-minded communities. Such studies highlight the importance of examining user behaviours, bot activity, and network structures to complement textual analysis.
Fact-checking systems have also played a pivotal role. Platforms like PolitiFact and Snopes provide labelled datasets for machine learning models, and automated fact-checkers such as ClaimBuster aim to scale the verification process. 33 However, reliance on static datasets limits adaptability to new forms of misinformation.
Despite progress, challenges remain. The adversarial nature of misinformation and cultural-linguistic diversity demand more robust, multilingual, and real-time detection systems. Future research is expected to focus on integrating explainable AI and collaborative strategies between academia, industry, and policymakers.
Health information retrieval
Health Information Retrieval (HIR), also known as Medical Document Retrieval (MDR), focuses on developing systems and methodologies to retrieve accurate and relevant health-related information from diverse and often complex sources. Previous work in this field has explored various techniques, including keyword-based search, natural language processing, and machine learning models, to improve retrieval precision and user experience. For example, Luo et al. 34 introduced MedSearch, a search engine designed for patients to retrieve medical information by considering layman-friendly language and ranking results based on their readability. Recent work has emphasized semantic search using ontologies such as SNOMED-CT or UMLS to better understand medical terminology and user intent, as seen in the work of Koopman et al., 35 who developed a system to enhance document retrieval in the clinical domain. In addition, Medical Subject Headings (MeSH), 2 36 the International Classification of Diseases (ICD), 37 and independently constructed resources 38 have also been used. Despite these advancements, challenges remain in addressing the linguistic complexity of medical language, the dynamic nature of health information, and the need to balance precision with the accessibility of results for different audiences.
In recent years, transformer-based IR models have gained popularity across many information retrieval tasks, and this trend is evident for health information retrieval tasks. For instance, in the 2021 Health Misinformation track, all participating teams employed transformer-based IR models, including BERT variants (such as RoBERTa and Bio Sentence-BERT) and T5 models (MonoT5, DuoT5, and T5-Large).
Recent research in Health Information Retrieval has also increasingly focused on assessing and enhancing the credibility of retrieved medical information, acknowledging the critical role of trustworthy data in healthcare decision-making.39–43
Health Information Retrieval has been the subject of several major information retrieval evaluation events, such as TREC and CLEF. 3 A variety of retrieval tasks have been addressed in these venues: Genomics, Medical Records, Clinical Decision Support, Precision Medicine, Health Misinformation, and Clinical Trials in TREC; eHealth in CLEF; and the Medical Case-based Retrieval track in ImageCLEF, to name but a few.
Data fusion
Data fusion in information retrieval (IR) refers to the process of combining results or outputs from multiple retrieval systems or methods to improve overall performance. The primary goal is to leverage the strengths of individual systems while mitigating their weaknesses. 11
Data fusion methods can be broadly categorized into supervised and unsupervised approaches. CombSum, 44 CombMNZ, 44 and the Reciprocal Rank method 45 are typical unsupervised methods, while linear combination 46 represents a common supervised method. Unsupervised methods are easy to implement, whereas supervised methods are better suited for scenarios where unsupervised methods may not perform well.
Data fusion methods have been applied to various tasks in information retrieval.47–51 They are also widely used in health information retrieval tasks.52–54 For instance, in the 2020 TREC Health Misinformation Track, the CiTIUS group 55 submitted two runs, CiTIUSCrdRelAdh and CiTIUSSimRelAdh, both of which used Borda Count to combine two types of rankings: usefulness and reliability (credibility and correctness). Another example is the h2oloo group. 56 Based on the BM25 baseline run, they applied query expansion and two types of machine learning techniques for re-ranking the results. All eight submitted runs were various combinations of these methods, utilizing equal or simple unequal weighting schemes.
Typically, the number of constituent retrieval systems involved serves as a good indicator of the complexity of a fusion-based system. When final performance is equal, it is preferable to involve fewer constituent retrieval systems. Juarez-Gonzalez et al. 57 investigated how to select a subset from a large group of retrieval systems to achieve better fusion performance, although their study was not focused on medical retrieval tasks. For this purpose, they defined a DCG-like metric (Discounted Cumulative Gain, a commonly used metric in information retrieval evaluation). Four datasets from CLEF (Cross-Language Evaluation Forum) were used for their empirical investigation. The method is referred to as Top(J) later in this paper. Xu et al. 12 proposed a clustering-based method, Chameleon Hierarchical clustering (CH), to select a subset of retrieval systems from the available ones. Two medical datasets from TREC (the Precision Medicine Track in 2017 and 2018) were used to evaluate the effectiveness of their proposed method.
In this work, we investigate how to achieve the best possible results using data fusion technology for the misinformation retrieval task. Specifically, we focus on the subset selection problem for effective fusion: given a group of N retrieval systems, how can we select n (n < N) of them to achieve the best fusion performance? The subset selection problem is the same as that addressed in 57 and 12, but we apply it to a different task. Additionally, the specific task undertaken in this study requires special measures along several dimensions. Our research is also more comprehensive, incorporating more clustering methods and data fusion methods.
Clustering-based subset selection
Selecting an optimal subset of information retrieval models (or systems) to maximize fusion effectiveness is a significant challenge. For example, given 50 retrieval models, selecting 10 for improved fusion performance involves an astronomically large number of possible subsets: 50!/(10! × 40!) = 10,272,278,170. Exhaustively evaluating all these combinations is computationally impractical. Instead of relying on a brute-force approach, it is therefore more practical to develop and apply heuristic methods that can efficiently identify promising subsets.
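The size of this search space is easy to verify with Python's standard library:

```python
import math

# Number of unordered 10-system subsets drawn from a pool of 50 systems.
print(math.comb(50, 10))  # 10272278170, i.e., about 1.03e10
```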
Previous research 58 has shown that the performance of individual constituent systems or results is not the only factor affecting fusion performance. The diversity among the constituent retrieval systems or results also plays a crucial role. To incorporate diversity into the selection process, clustering-based methods offer an effective approach.
These methods involve two main steps:
• Clustering: All constituent systems or results are grouped into clusters based on their similarity. Systems or results within the same cluster are expected to be highly similar, while those in different clusters are significantly different.
• Selection: A subset of retrieval systems is chosen for fusion. By selecting top-performing systems from different clusters, the method ensures that both performance (by selecting the best systems within each cluster) and diversity (by including systems from different clusters) are simultaneously accounted for.
To perform clustering of retrieval systems, we assume that the characteristics of a retrieval system are fully reflected by the results it retrieves. The similarity or dissimilarity between two retrieval systems can be analysed by comparing the ranked lists of results they produce for the same query.
In this work, scoring is used to define the dissimilarity between two ranked lists. Specifically, the Euclidean distance is employed, as defined below:

$Dist(L_1, L_2) = \sqrt{\sum_{i=1}^{|D|} \left( s_1(d_i) - s_2(d_i) \right)^2}$

Here $L_1$ and $L_2$ represent the ranked lists of results retrieved by two systems for the same document collection $D$ and query $q$; $|D|$ is the total number of documents in $D$; $s_1(d_i)$ is the score assigned to document $d_i$ in $L_1$, and $s_2(d_i)$ is the score assigned to $d_i$ in $L_2$. For documents in $D$ that do not appear in $L_1$ (or $L_2$), a default score (e.g., zero) is assigned. The computed distance, $Dist(L_1, L_2)$, serves as an effective measure of dissimilarity between the two ranked lists. While a scoring-based distance is utilized here, alternative methods, such as ranking-based measures, can also be employed for the same purpose.
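As an illustration, this distance can be computed from two score mappings as follows; this is a minimal sketch, and the toy collection and score values are placeholders rather than data from the experiments:

```python
import math

def dist(scores1: dict, scores2: dict, D: set) -> float:
    """Euclidean distance between two ranked lists represented as
    document -> score mappings; unretrieved documents default to 0."""
    return math.sqrt(
        sum((scores1.get(d, 0.0) - scores2.get(d, 0.0)) ** 2 for d in D)
    )

# Toy example with a three-document collection.
D = {"d1", "d2", "d3"}
L1 = {"d1": 0.9, "d2": 0.4}   # d3 not retrieved by system 1
L2 = {"d1": 0.7, "d3": 0.5}   # d2 not retrieved by system 2
print(round(dist(L1, L2, D), 4))  # 0.6708
```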
Another important consideration is the choice of clustering method, as many options are available. In this study, we evaluate five clustering methods: two variations of K-means (denoted as K1 and K2) and three hierarchical methods: Agglomerative Hierarchical clustering (AH), the Chameleon Hierarchical clustering (CH), and Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH, also referred to as BI later in this paper).
In TREC, the results submitted by a participant using a specific retrieval model (system) are referred to as a run (corresponding to a collection of documents and a set of queries). The algorithms for K1, K2, AH, BI, and CH are described in Algorithms 1–5, respectively. Please refer to 59 for more details about the Chameleon Hierarchical method (CH).
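To make the two-step pipeline concrete, the sketch below implements the AH variant of clustering-based selection with scikit-learn (version 1.2 or later for the metric parameter); the distance matrix, per-run MAP scores, and run counts are synthetic stand-ins, and the library's agglomerative clustering is used here in place of Algorithm 3:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def select_runs(dist_matrix, run_scores, n_clusters):
    """Cluster runs by pairwise distance, then keep the best run
    (by a chosen metric, e.g., MAP) from each cluster."""
    labels = AgglomerativeClustering(
        n_clusters=n_clusters,
        metric="precomputed",   # we supply pairwise distances, not features
        linkage="average",      # "ward" is not valid with precomputed distances
    ).fit_predict(dist_matrix)
    selected = []
    for c in range(n_clusters):
        members = [i for i, lab in enumerate(labels) if lab == c]
        selected.append(max(members, key=lambda i: run_scores[i]))
    return selected

# Synthetic example: 6 runs with random score vectors and MAP values.
rng = np.random.default_rng(0)
vectors = rng.random((6, 4))
dists = np.linalg.norm(vectors[:, None] - vectors[None, :], axis=-1)
maps = [0.31, 0.28, 0.35, 0.22, 0.40, 0.33]
print(select_runs(dists, maps, n_clusters=3))  # indices of the 3 selected runs
```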
Experiments and results
In this section, we describe the experimental setup and present the results obtained to validate the proposed methods. Specifically, we used the submissions, queries, and relevance judgments from the ad-hoc task of the Health Misinformation Track in TREC 2020 and TREC 2021.
The datasets for these events are based on the CommonCrawl News dataset, 4 which comprises news articles collected from websites worldwide. For TREC 2020, the dataset includes news articles crawled between January 1, 2020, and April 30, 2020. In contrast, the dataset for TREC 2021 is derived from the “no-clean” version of the C4 dataset, 5 originally created by Google for training the T5 model. This collection consists of plain text extracted from the April 2019 snapshot of CommonCrawl, encompassing over one billion English web pages.
For each event, the organizers provided a set of queries (see Figure 1 for an example of the queries used in TREC 2021). The datasets and queries were distributed to the participants, who then ran their information retrieval systems on the provided datasets using the queries. The participants submitted their retrieval results to the organizers for evaluation.
Figure 1. Example of a topic for the TREC 2021 Health Misinformation Track.
The organizers arranged for manual judgment of the retrieved documents by human experts. Each document was assessed for its relevance to the query, as well as its correctness and credibility. Based on these expert judgments, the organizers conducted retrieval evaluations for all submissions, assessing their performance against the defined criteria.
Evaluation measures & settings
The retrieval results were evaluated using four measures: MAP (Mean Average Precision), P@10 (Precision at the top 10 documents), CAM (Convex Aggregation Measure), and NDCG_UCC (Normalized Discounted Cumulative Gain based on the binary relevance of Usefulness, Correctness, and Credibility).
Both MAP and P@10 are classical measures commonly used for evaluating the effectiveness of retrieval results. In contrast, CAM and NDCG_UCC are specifically designed for the modern context of combating misinformation. 60
NDCG_UCC is defined as the NDCG score calculated by considering usefulness, correctness, and credibility collectively. A document is considered relevant only if it is simultaneously useful, correct, and credible; otherwise, it is deemed non-relevant. NDCG_UCC values are computed based on this binary relevance judgment. For brevity, NDCG_UCC is referred to as NDCG in the remainder of this article.
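The binary relevance underlying NDCG_UCC can be stated directly in code; a trivial sketch with illustrative argument names:

```python
def ucc_relevance(useful: bool, correct: bool, credible: bool) -> int:
    """A document is relevant (1) only if it is simultaneously useful,
    correct, and credible; otherwise it is non-relevant (0)."""
    return int(useful and correct and credible)

print(ucc_relevance(True, True, True))   # 1
print(ucc_relevance(True, True, False))  # 0
```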
CAM is defined as

$CAM = \lambda_1 M_{use} + \lambda_2 M_{cor} + \lambda_3 M_{cre}$, with $\lambda_1 + \lambda_2 + \lambda_3 = 1$,

where $M_{use}$, $M_{cor}$, and $M_{cre}$ measure retrieval effectiveness with respect to usefulness, correctness, and credibility, respectively. In this study, we adopt the TREC instantiation by using NDCG for each individual aspect. Specifically, $M_{use}$ is computed as standard NDCG with respect to usefulness labels, $M_{cor}$ as standard NDCG with respect to correctness labels, and $M_{cre}$ as standard NDCG with respect to credibility labels.
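Under this instantiation, CAM is a weighted average of the three aspect-wise NDCG scores. A one-function sketch follows; the equal default weights are an assumption, not a value prescribed by the track:

```python
def cam(ndcg_use, ndcg_cor, ndcg_cre, lambdas=(1/3, 1/3, 1/3)):
    """Convex aggregation of aspect-wise NDCG scores; the weights are
    non-negative and sum to 1 (equal weights assumed by default)."""
    l1, l2, l3 = lambdas
    return l1 * ndcg_use + l2 * ndcg_cor + l3 * ndcg_cre

print(cam(0.60, 0.50, 0.40))  # 0.5
```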
Table 1. Statistics of the two groups of runs in the experiment.
For the 2020 and 2021 events, 51 and 71 runs were submitted, respectively. From these, 37 runs from 2020 and 61 runs from 2021 were retained for this study. Using a given clustering algorithm, we generated 2 to 12 clusters for the 2020 group and 2 to 20 clusters for the 2021 group. The upper limits (12 and 20) were set because they are approximately one-third of the total number of runs: 37 for the 2020 group and 61 for the 2021 group.
From each cluster, the best-performing run (based on a specific metric such as MAP or CAM) was selected to carry out the fusion operation.
We tested three commonly used fusion methods: CombSum, 44 CombMNZ, 44 and linear combination. 46
In all cases, we applied the score normalization method proposed in 45, which converts the rank of each document into a score, defined as:

$s(d) = \dfrac{1}{rank(d) + 60}$

where $rank(d)$ is the position of document $d$ in the ranked list.
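The sketch below shows how such normalized scores feed CombSum and CombMNZ; the reciprocal-rank constant of 60 follows the common formulation of that method and is an assumption here, as are the toy runs:

```python
from collections import defaultdict

K = 60  # reciprocal-rank constant; assumed, following the common formulation

def normalize(ranked_list):
    """Map each document in a ranked list to a reciprocal-rank score."""
    return {doc: 1.0 / (rank + K) for rank, doc in enumerate(ranked_list, start=1)}

def comb_sum(ranked_lists):
    """CombSum: sum of normalized scores across all constituent lists."""
    fused = defaultdict(float)
    for rl in ranked_lists:
        for doc, s in normalize(rl).items():
            fused[doc] += s
    return sorted(fused, key=fused.get, reverse=True)

def comb_mnz(ranked_lists):
    """CombMNZ: CombSum score times the number of lists retrieving the doc."""
    total, hits = defaultdict(float), defaultdict(int)
    for rl in ranked_lists:
        for doc, s in normalize(rl).items():
            total[doc] += s
            hits[doc] += 1
    return sorted(total, key=lambda d: total[d] * hits[d], reverse=True)

runs = [["d1", "d2", "d3"], ["d1", "d4", "d2"]]
print(comb_sum(runs))  # ['d1', 'd2', 'd4', 'd3']
print(comb_mnz(runs))  # ['d1', 'd2', 'd4', 'd3']
```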
For the linear combination method, training is required for each constituent retrieval system to determine appropriate weights. To achieve this, we employed a two-fold cross-validation strategy, commonly used in machine learning. In this approach, all the queries were divided into two equal parts: one part was used for training, and the other for testing; this process was then repeated by swapping the roles of the two parts.
We tested two different approaches for assigning weights using linear regression:
• LC(U): the weights are trained using usefulness labels only.
• LC(UCC): the weights are trained using usefulness, correctness, and credibility labels jointly.
These two approaches effectively represent different linear models. The first approach is better suited to traditional evaluation metrics such as MAP and P@10, while the second approach aligns more closely with metrics like CAM and, in particular, NDCG, which considers multiple aspects of document quality.
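A minimal sketch of the weight-training step: one weight per constituent system is fitted by least squares on the training fold, with the target labels being usefulness alone for LC(U) or the conjunction of the three aspects for LC(UCC). The matrix layout and synthetic data are assumptions about the general recipe, not the exact implementation:

```python
import numpy as np

def train_weights(scores, labels):
    """Least-squares fit of one weight per constituent system.
    scores: (n_query_doc_pairs, n_systems) normalized scores.
    labels: per-pair targets (usefulness for LC(U); useful AND
            correct AND credible for LC(UCC))."""
    w, *_ = np.linalg.lstsq(scores, labels, rcond=None)
    return w

def fuse(scores, w):
    """Linear combination: weighted sum of system scores per document."""
    return scores @ w

# Synthetic training fold: 8 query-document pairs scored by 3 systems.
rng = np.random.default_rng(1)
X = rng.random((8, 3))
y = (X @ np.array([0.5, 0.3, 0.2]) > 0.5).astype(float)
w = train_weights(X, y)
print(w, fuse(X, w))
```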
Given that both clustering algorithms K1 and K2 involve randomness, we repeated the experiments 50 times and averaged the results to ensure greater reliability.
In 12, a clustering-based subset selection method was proposed as follows: first, all candidates were grouped using Chameleon hierarchical clustering, where a ranking-based approach was employed to determine the dissimilarity between ranked lists of documents. We adopt the same Chameleon hierarchical clustering approach in this study.
After clustering, the second stage involves selecting one candidate from each cluster. While 12 utilized a local search method for this step, we replace it with a simpler approach, selecting the best-performing candidate from each cluster to ensure consistency across all clustering methods in this study. This modification ensures a fair comparison among the methods.
For AH, two parameters need to be set. For the 2020 dataset, we let T = 0.5 and B = 30; for the 2021 dataset, we let T = 0.5 and B = 18.
Experimental results
Table 2. Average performance of data fusion methods CombSum and CombMNZ over 2–12 retrieval systems (2020).
Note. Top (J), Top (MAP), Top (CAM) and Top (NDCG) denote the methods of selecting a number of top performers based on the J, MAP, CAM, and NDCG metrics, respectively. C (X) denotes using the clustering method C (can be K1, K2, AH, BI, or CH) to generate a number of clusters first, and then taking the top performer from each cluster based on the X metric (MAP, CAM, or NDCG). Numbers in bold indicate the best performer among a group of clustering methods under the same condition.
Table 3. Average performance of data fusion methods CombSum and CombMNZ over 2–20 retrieval systems (2021).
Note. Top (J), Top (MAP), Top (CAM) and Top (NDCG) denote the methods of selecting a number of top performers based on the J, MAP, CAM, and NDCG metrics, respectively. C (X) denotes using the clustering method C (can be K1, K2, AH, BI, or CH) to generate a number of clusters first, and then taking the top performer from each cluster based on the X metric (MAP, CAM, or NDCG). Numbers in bold indicate the best performer among a group of clustering methods under the same condition.
Table 4. Average performance of data fusion methods LC(U) and LC(UCC) over 2–12 retrieval systems (2020).
Note. Top (J), Top (MAP), Top (CAM) and Top (NDCG) denote the methods of selecting a number of top performers based on the J, MAP, CAM, and NDCG metrics, respectively. C (X) denotes using the clustering method C (can be K1, K2, AH, BI, or CH) to generate a number of clusters first, and then taking the top performer from each cluster based on the X metric (MAP, CAM, or NDCG). Numbers in bold indicate the best performer among a group of clustering methods under the same condition.
Table 5. Average performance of data fusion methods LC(U) and LC(UCC) over 2–20 retrieval systems (2021).
Note. Top (J), Top (MAP), Top (CAM) and Top (NDCG) denote the methods of selecting a number of top performers based on the J, MAP, CAM, and NDCG metrics, respectively. C (X) denotes using the clustering method C (can be K1, K2, AH, BI, or CH) to generate a number of clusters first, and then taking the top performer from each cluster based on the X metric (MAP, CAM, or NDCG). Numbers in bold indicate the best performer among a group of clustering methods under the same condition.
In LC(U), weights are trained based solely on the usefulness score. In contrast, LC(UCC) considers three aspects—usefulness, correctness, and credibility—when training the weights.
From Tables 2 and 3, we can see that CombSum outperforms CombMNZ in most cases, and CombSum is consistently better than CombMNZ when all clustering methods are considered together. By comparing Tables 2 and 4, as well as Tables 3 and 5, we observe that the linear combination method performs better than both CombSum and CombMNZ.
Among the five clustering methods (K1, K2, AH, BI, and CH), AH and BI perform better than the others in more cases, K2 and CH fall in the middle, and K1 never performs best in any case. More specifically, AH is the best in 25 cases, followed by BI (20 cases), K2 (6 cases), CH (4 cases), and K1 (0 cases). Besides, selecting top performers without considering the diversity of the selected components, as in Top (J), Top (MAP), Top (CAM), and Top (NDCG), can be a good strategy in some cases, especially when P@10 is used for retrieval evaluation. Intuitively, selecting top components by a given metric, such as Top (MAP), should be beneficial when the fusion results are evaluated by the same metric. We therefore consider Top (MAP), Top (CAM), and Top (NDCG) collectively and refer to them as the Top (X) method, where X is the metric used both for selecting top components and for retrieval evaluation. Top (X) performs best in 10 cases. It also appears that LC (U) is more suitable for traditional metrics such as MAP and P@10, while LC (UCC) is better suited for NDCG and CAM. All of these observations are confirmed in this experiment.
We compare several pairs head-to-head. The win/loss ratio between AH and CH is 56/24, while for BI and CH, it is 42/38. These results suggest that AH is likely superior to CH, whereas BI and CH exhibit comparable performance.
Further, comparing these clustering methods with Top (X), the win/loss ratio between AH and Top (X) is 60/20, while for BI and Top (X), it is 47/33, and for CH and Top (X), it is 51.3/28.5. These results indicate that all three methods are likely more effective than Top(X).
Additionally, when comparing Top (J) and Top (MAP), the win/loss ratio is 4/12, suggesting that Top (J) is not as effective as Top (MAP). This outcome is unsurprising, given that MAP serves as the performance evaluation metric in all 16 cases.
After reviewing the overall performance of the clustering-based methods, we further examine some specific aspects. First, we consider their performance with different numbers of constituent retrieval systems. The performance of the best run is also shown for comparison. Figures 2 and 3 present the results under various scenarios.
Figure 2. Performance of a group of clustering-based fusion methods with different numbers of constituent retrieval systems (2020 dataset). (a) CombSum/MAP, (b) CombSum/NDCG, (c) LC (U)/MAP, (d) LC (UCC)/NDCG.
Figure 3. Performance of a group of clustering-based fusion methods with different numbers of constituent retrieval systems (2021 dataset). (a) CombSum/MAP, (b) CombSum/NDCG, (c) LC (U)/MAP, (d) LC (UCC)/NDCG.

We observe that the performance of the linear combination method improves consistently as the number of constituent retrieval systems increases. In contrast, this trend is less evident for CombSum. This is expected, as CombSum is a centroid-based method in which all constituent systems contribute equally to the final results. Consequently, any system — especially those with exceptionally high or low performance, significant divergence from others, or strong similarity to a subset of systems — can significantly impact fusion performance. In contrast, the linear combination approach assigns learned weights, allowing it to better adapt to different scenarios.
Fusion performance generally surpasses that of the best individual system, with a few exceptions for BI (CombSum, 2020, measured by NDCG). This is an encouraging outcome, though the extent of improvement varies across different cases. For the 2021 dataset, the highest improvement rate reaches 62.73% ((0.6475-0.3979)/0.3979) when fusing 20 retrieval systems using AH (MAP)-LC(U). This involves generating 20 clusters via AH, selecting the top performer from each based on MAP, combining them using LC with weights trained on Usefulness scores, and evaluating the final results using MAP. Additionally, substantial improvements of approximately 30% are observed in NDCG when fusing 15-20 retrieval systems using BI(NDCG)-LC(UCC) or AH(NDCG)-LC(UCC) on the same dataset.
When comparing LC and CombSum, LC consistently outperforms CombSum by 2% to 10%, demonstrating the advantage of supervised learning over unsupervised methods. However, the degree of improvement varies depending on the specific scenario.
Discussions
As demonstrated, the fusion-based credible retrieval techniques in this study involve two major steps: selecting a small number of candidates from a large collection, and fusing the selected systems using a data fusion algorithm. Both steps have a significant impact on final fusion performance. In this study, we focused on the first step and investigated the effectiveness of five clustering methods for the selection process.
One notable feature of both information retrieval systems and data fusion methods is the inherent uncertainty in their performance. To illustrate this, we consider two examples: four retrieval systems selected by BI in each of the two datasets. In the 2020 dataset, these are h2oloo.m10 (Run 1), h2oloo.m8 (Run 2), adhoc_run3 (Run 3), and adhoc_run13 (Run 4). In the 2021 dataset, they are WatSMC-Correct (Run 1), vera_mdt5_0.95 (Run 2), WatSMC-CALQAHC1 (Run 3), and citius.r1 (Run 4). Figures 4 and 5 show their query-by-query performance measured by NDCG, along with the performance of the fusion result obtained by LC (UCC).
Figure 4. Performance of the LC(UCC) fusion method with four constituent retrieval systems selected by BI (2020 dataset, NDCG).
Figure 5. Performance of the LC(UCC) fusion method with four constituent retrieval systems selected by BI (2021 dataset, NDCG).

For all runs and the fused results, performance varies substantially from one query to another. On average, the fused result outperforms all constituent runs by approximately 10% to 20% in 2020 and 20% to 40% in 2021, and it achieves the best score on most queries. However, it does not achieve the best performance for every query.
To help readers better understand NDCG, we provide an example illustrating what a 30% increase in NDCG means. Note that in NDCG (more precisely, NDCG_UCC), a document is considered relevant only if it is simultaneously useful, correct, and credible.
We use binary relevance (1 = relevant, 0 = not) and assume there are 3 relevant documents in the corpus. Consider the top 5 retrieved results. NDCG is a ranking-based metric in which each relevant document contributes a score based on its position: a gain of $1/\log_2(rank + 1)$, divided by the ideal DCG of 2.131 obtained when the three relevant documents occupy ranks 1 to 3. For ranks 1 through 5, the normalized contributions of a relevant document are therefore 0.469, 0.296, 0.235, 0.202, and 0.182, respectively.
If the relevant documents appear at ranks 2, 4, and 5, the total NDCG score is 0.296 + 0.202 + 0.182 = 0.680. If instead they appear at ranks 1, 3, and 5, the total score is 0.469 + 0.235 + 0.182 = 0.886. The relative improvement is (0.886 − 0.680)/0.680 ≈ 30%.
In this example, elevating relevant documents from rank 2 to rank 1 and from rank 4 to rank 3, while maintaining the document at rank five in its position, is expected to substantially enhance user utility. □
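The arithmetic in this example is easy to verify with a few lines of code, using the standard 1/log2(rank + 1) discount and an ideal ranking that places all three relevant documents at the top:

```python
import math

def ndcg_at_5(relevant_ranks, n_relevant=3):
    """Binary-relevance NDCG@5 with the standard 1/log2(rank+1) discount."""
    gain = lambda r: 1.0 / math.log2(r + 1)
    ideal = sum(gain(r) for r in range(1, n_relevant + 1))  # ranks 1..3
    return sum(gain(r) for r in relevant_ranks) / ideal

before = ndcg_at_5([2, 4, 5])  # ~0.680
after = ndcg_at_5([1, 3, 5])   # ~0.886
print(round(before, 3), round(after, 3), f"{(after - before) / before:.1%}")  # ~30.3%
```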
One limitation of this study—as with most studies in this area—is that we do not yet fully understand the collective impact of all factors contributing to fusion performance. 11 Different datasets have distinct characteristics that may require specialized handling, and elements such as corpus, query set, and relevance judgments can all have significant effects. Furthermore, there are numerous clustering and data fusion methods, and even within a single method, many parameters must be carefully tuned. The choice of evaluation metrics also adds to the complexity of the situation.
To address these challenges, we employed five clustering algorithms, three data fusion methods, and two groups of performance evaluation metrics, covering both traditional and credibility-based measures. This combination helps improve the reliability of our results. Nonetheless, further extensive research is needed to advance this field.
Clustering analysis
As noted earlier, the quality of the clustering itself is a key factor affecting fusion performance. In this section, we evaluate the clustering methods directly, using three internal clustering quality metrics: the Davies–Bouldin Index (DBI), the Calinski–Harabasz Index (CHI), and the Silhouette Coefficient (SC).
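All three indices are available in scikit-learn; the following minimal sketch applies them to synthetic per-run score vectors, since the actual run representations from the experiments are not reproduced here:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

# Synthetic feature matrix: one score vector per run (20 runs, 10 features).
rng = np.random.default_rng(2)
X = rng.random((20, 10))
labels = AgglomerativeClustering(n_clusters=4).fit_predict(X)

print("DBI:", davies_bouldin_score(X, labels))     # lower is better
print("CHI:", calinski_harabasz_score(X, labels))  # higher is better
print("SC: ", silhouette_score(X, labels))         # higher is better
```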
Table 6. Evaluation of four clustering methods over 2–12 constituent retrieval results (2020, ranked by MAP).
DBI: Davies–Bouldin index, lower values mean better results; CHI: Calinski–Harabasz index, higher values mean better results; SC: Silhouette Coefficient, higher values mean better results. Best performance for a given metric is shown in bold.
Table 7. Evaluation of four clustering methods over 2–12 constituent retrieval results (2020, ranked by CAM).
DBI: Davies–Bouldin index, lower values mean better results; CHI: Calinski–Harabasz index, higher values mean better results; SC: Silhouette Coefficient, higher values mean better results. Best performance for a given metric is shown in bold.
Table 8. Evaluation of four clustering methods over 2–20 constituent retrieval results (2021, ranked by MAP).
DBI: Davies–Bouldin index, lower values mean better results; CHI: Calinski–Harabasz index, higher values mean better results; SC: Silhouette Coefficient, higher values mean better results. Best performance for a given metric is shown in bold.
Table 9. Evaluation of four clustering methods over 2–20 constituent retrieval results (2021, ranked by NDCG).
DBI: Davies–Bouldin index, lower values mean better results; CHI: Calinski–Harabasz index, higher values mean better results; SC: Silhouette Coefficient, higher values mean better results. Best performance for a given metric is shown in bold.
Overall, AH emerged as the best-performing clustering method in most cases, closely followed by BI. In contrast, CH and K2 performed best in only a few instances.
However, high-quality clustering does not always guarantee the best fusion performance. Fusion effectiveness is influenced by two key factors: the performance of individual retrieval systems and their diversity. Since only the top-performing system from each cluster is selected for fusion, this choice introduces an element of uncertainty. As a result, while AH and BI generally lead to the best fusion performance, other clustering methods produce the best results in some cases.
Conclusion
This paper explored fusion-based methods to address the challenge of combating health misinformation. Specifically, we investigated how to select an optimal subset of constituent information retrieval systems to maximize performance. Using the TREC Health Misinformation datasets from 2020 and 2021, our experiments demonstrated the effectiveness of the proposed approaches. When fusing up to 12 or 20 retrieval systems, agglomerative hierarchical clustering and BIRCH were the top two clustering methods. In particular, agglomerative hierarchical clustering consistently outperformed the best individual system by 10% to 60% across both traditional and credibility-enhanced metrics. Additionally, it achieved a significant improvement over existing subset selection methods.
A key strength of our approach is its ability to balance both the performance of individual retrieval systems and the diversity of their results. The use of Euclidean distance to measure dissimilarity between result lists, combined with an agglomerative hierarchical clustering strategy, proved highly effective in forming diverse and well-structured clusters. This diversification enhances fusion performance, reinforcing the potential of data fusion as a powerful and reliable strategy for health misinformation retrieval.
For future work, we aim to further investigate the relationship between the performance of constituent systems and the diversity of their results. A deeper understanding of this relationship could lead to more efficient and effective subset selection methods. Additionally, exploring large language model-based approaches presents a promising direction. Currently, generating training datasets relies on costly relevance judgments from human experts. Developing automated performance estimation methods could reduce this burden, enhancing both the scalability and practicality of the proposed framework.
Author contributions
YH, SW, HL, and XG collaboratively designed the study. YH implemented the program and did the empirical study. SW prepared the initial manuscript draft. SW and CN revised the manuscript.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
